Data Types:: Basic Statistics
Random Variable
A random variable describes the probabilities for an uncertain future
numerical outcome of a random process.
It is variable because it can take one of several possibilities.
It is random because there is some chance associated with each
possible value.
SLIDE-16
Poker cards example:
Suppose you randomly pick a card from a deck of cards. What
is the probability that the card is:
- Bigger than 10?
- Equal to or bigger than 10?
- Smaller than 3?
- Greater than 4 and less than 8?
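The four questions above can be answered by counting favourable cards. The sketch below assumes a standard 52-card deck and a rank ordering that the slide does not state (Ace = 1, Jack = 11, Queen = 12, King = 13); with Ace counted high the answers would differ.

```python
from fractions import Fraction

# Assumption (not stated on the slide): Ace = 1, Jack = 11, Queen = 12, King = 13.
RANKS = range(1, 14)
SUITS = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in RANKS for suit in SUITS]


def probability(event):
    """Probability that a randomly drawn card's rank satisfies `event`."""
    favourable = sum(1 for rank, _ in deck if event(rank))
    return Fraction(favourable, len(deck))


p_bigger_than_10 = probability(lambda r: r > 10)
p_at_least_10 = probability(lambda r: r >= 10)
p_smaller_than_3 = probability(lambda r: r < 3)
p_between_4_and_8 = probability(lambda r: 4 < r < 8)

print(p_bigger_than_10, p_at_least_10, p_smaller_than_3, p_between_4_and_8)
```

Using `Fraction` keeps the results exact, so each answer is printed as a reduced fraction rather than a rounded decimal.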
SLIDE-17
Expected Value
For a probability distribution, the mean of the distribution is known
as the expected value
The expected value intuitively refers to what one would find if they
repeated the experiment an infinite number of times and took the
average of all of the outcomes
Mathematically, it is calculated as the weighted average of each
possible value
The formula for calculating the expected value for a discrete random
variable X, denoted by μ, is:
μ = E[X] = ∑ x p(x)
The variance of a discrete random variable X, denoted by σ², is:
σ² = E[(X − µ)²] = ∑ (x − µ)² p(x)
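As an illustration of the two formulas, here is a minimal sketch that computes the expected value and variance of a fair six-sided die. The die and the helper names `expected_value` and `variance` are illustrative assumptions, not part of the original slides.

```python
from fractions import Fraction

# Probability mass function of a fair die (illustrative example).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def expected_value(pmf):
    """Weighted average of the possible values: sum of x * p(x)."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Probability-weighted squared deviation from the mean: sum of (x - mu)^2 * p(x)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


mu = expected_value(pmf)
sigma_squared = variance(pmf)
print(mu, sigma_squared)
```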
SLIDE-22
Graphical techniques:
1) Bar plot: plots each data point as a bar whose height equals its value
SLIDE-23
Histogram: Represents the frequency distribution of the data, i.e. how
many observations take a value within each interval (bin).
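The counting behind a histogram can be sketched in a few lines. The data values, the bin width of 10 and the helper name `histogram_counts` below are made up for illustration.

```python
from collections import Counter


def histogram_counts(values, bin_width, start):
    """Count how many values fall in each half-open bin [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)
        left = start + index * bin_width
        counts[left] += 1
    return dict(sorted(counts.items()))


# Hypothetical exam scores, used only to show the binning.
scores = [52, 55, 61, 64, 68, 70, 71, 75, 79, 83, 88, 95]
counts = histogram_counts(scores, bin_width=10, start=50)

# Text rendering: one row per bin, one '#' per observation.
for left, count in counts.items():
    print(f"[{left}, {left + 10}): {'#' * count}")
```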
SLIDE-24
Third Business Moment: Skewness
Fourth Business Moment: Kurtosis
Skewness
• A measure of asymmetry in the distribution
• Mathematically it is given by: E[((x − µ)/σ)³]
• Negative skewness implies that the mass of the distribution is
concentrated on the right
Kurtosis
• A measure of the “peakedness” of the distribution
• Mathematically it is given by: E[((x − µ)/σ)⁴] − 3
• For symmetric distributions, negative kurtosis implies a wider peak
and thinner tails
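Both measures are standardized moments, so one helper covers them. This is a sketch that applies the formulas above directly to a data set, treating it as the whole population (dividing by n); statistical packages often apply small-sample corrections, so their output can differ slightly. The data sets are invented for illustration.

```python
import math


def standardized_moment(data, k):
    """Average of ((x - mu) / sigma) ** k over the data set."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(((x - mu) / sigma) ** k for x in data) / n


def skewness(data):
    return standardized_moment(data, 3)


def excess_kurtosis(data):
    # Subtracting 3 makes the normal distribution the zero reference point.
    return standardized_moment(data, 4) - 3


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]   # long tail on the right
left_tailed = [1, 10, 10, 10, 10]  # long tail on the left, mass on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), skewness(left_tailed))
```

The left-tailed data set gives a negative skewness, matching the statement that negative skewness puts the mass of the distribution on the right.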
SLIDE-25
Boxplot:
• Interquartile range (IQR): The middle half of a data set falls within
the interquartile range, i.e. between the first quartile (Q1) and the
third quartile (Q3).
• Box Plot: This graph shows the distribution of data by dividing the
data into four groups with the same number of data points in each
group. The box contains the middle 50% of the data points and
each of the two whiskers contains 25% of the data points. It displays
two common measures of the variability or spread in a data set: the
range and the IQR.
• Range: It is represented on a box plot by the distance between the
smallest value and the largest value, including any outliers. If you
ignore outliers, the range is illustrated by the distance between the
opposite ends of the whiskers
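The quantities a box plot displays can be computed directly. The sketch below uses made-up data and assumes the common 1.5 × IQR rule for flagging outliers, which the slides do not specify; quartile conventions also vary between tools, and `method="inclusive"` is just one of them.

```python
import statistics

# Hypothetical data with one obvious outlier.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles split the sorted data into four equally sized groups.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Range including outliers: smallest to largest value.
full_range = max(data) - min(data)

# Assumption: points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Range ignoring outliers: distance between the ends of the whiskers.
whisker_range = max(inside) - min(inside)

print(q1, median, q3, iqr, full_range, outliers, whisker_range)
```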
SLIDE-26
Normal Distribution
The normal random variable takes values from -∞ to +∞
The probability associated with any single value of a continuous
random variable is always zero
Area under the entire curve is always equal to 1.
SLIDE-27
• Characterized by a bell-shaped curve
Properties:
• 68.27% of the values lie within ±1σ from the mean
• 95.45% of the values lie within ±2σ from the mean
• 99.73% of the values lie within ±3σ from the mean
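These percentages can be checked numerically, to within rounding, from the standard normal cumulative distribution function. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within ±{k}σ: {100 * p:.2f}%")
```

Because any normal variable can be standardized, the same percentages hold for every µ and σ.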
SLIDE-28
X~N(µ,σ)
Characterized by mean, µ, and standard deviation, σ
SLIDE-29
Z scores, Standard Normal Distribution:
• For every value (x) of the random variable X, we can calculate a Z
score: Z = (x − µ)/σ
• Interpretation − How many standard deviations away is the value
from the mean?
SLIDE-30
Calculating Probability from Z distribution
Suppose GMAT scores can be reasonably modelled using a normal
distribution
− µ = 711, σ = 29
SLIDE-31
• What is P(697 ≤ X ≤ 740)?
• Step 1: Use P(x1 ≤ X ≤ x2) = P(X ≤ x2) − P(X ≤ x1)
• Step 2: Calculate P(X ≤ x2) and P(X ≤ x1) as before
P(X ≤ 740) = P(Z ≤ 1) = 0.84; P(X ≤ 697) = P(Z ≤ −0.48) ≈ 0.31
• Step 3: Subtract: P(697 ≤ X ≤ 740) ≈ 0.84 − 0.31 = 0.53
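The same calculation can be done without a Z table. This sketch uses the µ and σ given above, computes both Z scores, and evaluates the normal cumulative distribution function; the result differs from the table-based figures only by rounding.

```python
from statistics import NormalDist

mu, sigma = 711, 29
x1, x2 = 697, 740

# Z scores: how many standard deviations each bound is from the mean.
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

gmat = NormalDist(mu=mu, sigma=sigma)

# P(x1 <= X <= x2) = P(X <= x2) - P(X <= x1)
p_upper = gmat.cdf(x2)
p_lower = gmat.cdf(x1)
p_between = p_upper - p_lower

print(f"z1 = {z1:.2f}, z2 = {z2:.2f}")
print(f"P(X <= {x2}) = {p_upper:.4f}, P(X <= {x1}) = {p_lower:.4f}")
print(f"P({x1} <= X <= {x2}) = {p_between:.4f}")
```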
SLIDE-32
Normal Quantile plot (Q-Q plot):
SLIDE-33
Sampling variation
- Sample mean varies from one sample to another.
- Sample mean can be (and most likely is) different from the
population mean.
- Sample mean is a random variable.
SLIDE-34
Central Limit Theorem
The distribution of the sample mean
- will be normal when the distribution of data in the population is
normal
- will be approximately normal even if the distribution of data in the
population is not normal, provided the sample size is fairly large
Mean(X̄) = µ (the same as the population mean of the raw data)
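A small simulation illustrates the theorem. The sketch below draws repeated samples from an exponential population, which is strongly skewed; the choice of population, the sample size of 50 and the 2,000 repetitions are arbitrary illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

POPULATION_MEAN = 1.0   # an exponential distribution with rate 1 has mean 1 and sd 1
SAMPLE_SIZE = 50
REPETITIONS = 2000


def sample_mean(n):
    """Mean of one random sample of size n from the skewed population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(REPETITIONS)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POPULATION_MEAN})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {1 / math.sqrt(SAMPLE_SIZE):.3f})")
```

The sample means centre on the population mean and their spread is close to σ/√n, even though the individual observations are far from normal.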
SLIDE-35
Sample Size Calculation
A sample size of 30 is often considered large enough, but it may or
may not be adequate
More precise conditions
- n > 10 (K3)², where K3 is the sample skewness, and
- n > 10 |K4|, where K4 is the sample kurtosis
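The two conditions translate into a simple check. In this sketch the function name and the example skewness and kurtosis values are hypothetical, and taking the absolute value of the kurtosis is an assumption made so that negative kurtosis is handled sensibly.

```python
def sample_size_is_adequate(n, skewness, kurtosis):
    """Return True when n exceeds both 10 * K3^2 and 10 * |K4|."""
    return n > 10 * skewness ** 2 and n > 10 * abs(kurtosis)


# Hypothetical sample statistics: skewness K3 = 2, kurtosis K4 = 6.
adequate_140 = sample_size_is_adequate(140, skewness=2.0, kurtosis=6.0)
adequate_30 = sample_size_is_adequate(30, skewness=2.0, kurtosis=6.0)

print(adequate_140, adequate_30)
```

With these values the thresholds are 40 and 60, so a sample of 140 passes while the rule-of-thumb sample of 30 does not.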
SLIDE-36
Confidence Interval
• What is the probability of tomorrow’s temperature being exactly 42
degrees?
• The probability is 0
• Can it be between −50 °C and 100 °C?
SLIDE-37
Case Study: Confidence Interval
• A University with 100,000 alumni is thinking of offering a new
affinity credit card to its alumni.
• Profitability of the card depends on the average balance
maintained by the card holders.
• A market research campaign is launched, in which about 140
alumni accept the card in a pilot launch.
• Average balance maintained by these is $1990 and the standard
deviation is $2833. Assume that the population standard
deviation is $2500 from previous launches.
• What can we say about the average balance that will be held after
a full-fledged market launch?
SLIDE-38
Interval estimates of parameters
• Based on sample data
− The point estimate for mean balance = $1990
− Can we trust this estimate?
• What do you think will happen if we took another random sample
of 140 alumni?
• Because of this uncertainty, we prefer to provide the estimate as
an interval (range) and associate a level of confidence with it
• Interval Estimate = Point Estimate ± Margin of Error
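For the case study, the margin of error with a known population standard deviation is the critical z value times σ/√n. The sketch below uses the case-study figures (n = 140, sample mean $1990, σ = $2500) and a 95% confidence level.

```python
import math
from statistics import NormalDist

n = 140
x_bar = 1990.0    # point estimate: sample mean balance
sigma = 2500.0    # population standard deviation assumed from previous launches
confidence = 0.95

# Two-sided critical value: leaves (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = sigma / math.sqrt(n)
margin_of_error = z * standard_error

lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

print(f"z = {z:.2f}, standard error = {standard_error:.2f}")
print(f"95% interval: [${lower:.0f}, ${upper:.0f}]")
```

Rounded to whole dollars, this gives the interval [$1576, $2404] discussed on the interpretation slide.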
SLIDE-39
SLIDE-42
Confidence Interval Interpretation
Consider the 95% confidence interval for the mean balance:
[$1576, $2404]
Does this mean that:
- The mean balance of the population lies in the range?
- The mean balance is in this range 95% of the time?
- 95% of the alumni have a balance in this range?
Interpretation 1: For a random sample, an interval built this way
has a 95% chance of containing the mean of the population
Interpretation 2: The mean of the population will be inside the
interval for 95% of the random samples
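The second interpretation can be illustrated by simulation: draw many random samples, build an interval from each, and count how often the interval contains the true mean. This sketch assumes a normal population whose true mean is $1990 with σ = $2500; both the normality and the true mean are assumptions made only so that coverage can be measured.

```python
import math
import random
from statistics import NormalDist, fmean

MU, SIGMA, N = 1990.0, 2500.0, 140   # assumed "true" population values
Z = NormalDist().inv_cdf(0.975)      # two-sided 95% critical value
TRIALS = 2000

rng = random.Random(42)  # fixed seed so the run is repeatable


def interval_covers_mu():
    """Draw one sample, build its 95% interval, report whether it contains MU."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = fmean(sample)
    margin = Z * SIGMA / math.sqrt(N)
    return x_bar - margin <= MU <= x_bar + margin


coverage = sum(interval_covers_mu() for _ in range(TRIALS)) / TRIALS
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The population mean stays fixed; it is the interval that changes from sample to sample, and roughly 95% of the intervals capture the mean.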
SLIDE-43
What if we don’t know Sigma?
• Suppose that the alumni of this university are very
different and hence population standard deviation from
previous launches cannot be used
We replace σ with our best guess (point estimate) s, which
is the standard deviation of the sample.
Calculate: t = (x̄ − µ)/(s/√n), which follows a t-distribution
with n − 1 degrees of freedom.
As n → ∞, t_n → N(0, 1),
i.e., as the degrees of freedom increase, the t-distribution
approaches the standard normal distribution.
SLIDE-45
Confidence Interval for mean with unknown Sigma
x̄ ± z_{1−α} σ/√n, where z_{1−α} satisfies P(−z_{1−α} ≤ Z ≤ z_{1−α}) = 1 − α
Instead of the above equation, we can use the t-distribution
equation below:
x̄ ± t_{1−α, n−1} s/√n, where t_{1−α, n−1} satisfies P(−t_{1−α, n−1} ≤ T_{n−1} ≤ t_{1−α, n−1}) = 1 − α
SLIDE-46
Calculating t-value
• Construct a 95% confidence interval for the mean card balance
and interpret it.
n = 140; s = $2833; x̄ = $1990
Standard error: s/√n = 2833/sqrt(140) = 239.43
Calculate t_{0.95, 139} = 1.98
Then the 95% confidence interval for the balance is
1990 ± 1.98 × 239.43, i.e. [$1516, $2464]
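The arithmetic for the t-based interval can be reproduced as follows. The critical value 1.98 is taken from the slide rather than computed, because Python's standard library has no t-distribution; a statistics package could supply a more precise value.

```python
import math

n = 140
x_bar = 1990.0      # sample mean balance
s = 2833.0          # sample standard deviation (sigma is treated as unknown)
t_critical = 1.98   # two-sided 95% value for 139 degrees of freedom, as quoted on the slide

standard_error = s / math.sqrt(n)
margin_of_error = t_critical * standard_error

lower = x_bar - margin_of_error
upper = x_bar + margin_of_error

print(f"standard error = {standard_error:.2f}")
print(f"95% interval: [${lower:.0f}, ${upper:.0f}]")
```

The interval is wider than the one based on σ = $2500, both because s is larger than σ here and because the t critical value is slightly larger than the corresponding z value.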