Statistics Refresher
Statistics Refresher
1. Business:
○ Statistics helps in forecasting sales, analyzing trends, and improving decision-
making.
○ Example: A retail store uses sales data to predict demand during holiday
seasons.
2. Healthcare:
○ Analyzing patient outcomes and testing the efficacy of drugs rely on statistical
methods.
○ Example: A clinical trial uses statistical analysis to determine the effectiveness of
a new vaccine.
3. Social Sciences:
○ Statistics aids in studying behavior patterns and interpreting survey results.
○ Example: Analyzing voting patterns in elections to understand demographic
preferences.
Data Collection
1. Primary Data:
○ Collected directly for a specific purpose, often through surveys, interviews, or
experiments.
○ Example: A company surveys customers to gauge satisfaction with a new
product.
2. Secondary Data:
○ Obtained from existing sources like reports, databases, or government records.
○ Example: Using census data to analyze population growth trends for urban
planning.
1. Nominal Data
● Definition: Data with categories in a specific order, but the intervals between them are
not uniform.
● Example: Customer Rating (Poor, Average, Excellent).
● "Excellent" is higher than "Average," but the difference between "Poor" and "Average"
may not equal the difference between "Average" and "Excellent."
3. Interval Data
● Definition: Data with equal intervals between values but no true zero.
● Example: Temperature (°C) (15, 25, 30).
● The difference between 15°C and 25°C is the same as between 25°C and 30°C.
However, 0°C does not mean the absence of temperature (no true zero).
4. Ratio Data
● Definition: Data with equal intervals and a true zero, allowing for meaningful
comparisons like ratios.
● Example: Weight (kg) (60, 70, 85).
● 0 kg means no weight, and comparisons like "Person 2 is 1.2 times heavier than Person
3" are valid.
Discrete Data
● Definition: Discrete data consists of countable, distinct values, often integers. These
values represent quantities that cannot be subdivided meaningfully.
● Example from the Table:
○ Number of Pets: {0, 1, 2, 3}.
○ These are exact numbers representing how many pets a person owns. You can’t
have 1.5 pets.
● Key Features:
○ Finite or countable.
○ Typically whole numbers.
○ Example in Real Life: Number of books on a shelf, number of cars in a parking
lot.
Continuous Data
● Definition: Continuous data can take any value within a given range and often includes
decimals. It represents measurements that can be infinitely precise.
● Examples from the Table:
○ Height (cm): {160.5, 165.8, 172.3, 158.0}.
■ These are measurements, and any value is possible within the height
range (e.g., 160.51 cm or 160.52 cm).
○ Weight (kg): {70.2, 65.5, 68.9, 60.0}.
■ These are also measurements, and decimal precision is meaningful (e.g.,
68.91 kg or 68.92 kg).
● Key Features:
○ Infinite possible values within a range.
○ Measured rather than counted.
○ Example in Real Life: Distance, temperature, or speed.
A concise table summary for the distribution of Nominal, Ordinal, Interval, and Ratio data
under Discrete and Continuous data types:
Measure Of Central Tendency
1. Mean
2. Median
3. Mode
4. Weighted Mean
● Definition: The mean where some values contribute more due to their assigned
weights.
● Formula: Weighted Mean=∑(Value×Weight)/∑(Weight)
● Example (Score with Weight):
Weighted Mean = [(85×2)+(90×3)+(80×1)+(95×2)+(88×2)] / (2+3+1+2+2) = 88.33
● Excel Formula: =SUMPRODUCT(B2:B6, C2:C6)/SUM(C2:C6)
● Use Case: Often used in academics (weighted average of grades), project evaluations,
and financial analysis.
5. Trimmed Mean
● Definition: The mean after removing a certain percentage of the smallest and largest
values.
● Example (Salary, 10% Trim):
Exclude 10% from both ends: 40000, 60000
Remaining Salaries: 48000, 50000, 55000
Trimmed Mean = (48000+50000+55000) / (3) = 51000
● Excel Formula: =TRIMMEAN(E2:E6, 0.1)
● Use Case: Used in sports scoring (removing outliers like highest and lowest judges'
scores) or in robust data analysis.
6. Weighted Median
Measures of Dispersion
Measures of dispersion indicate how spread out the data is. They include Range, Variance,
Standard Deviation, Percentiles, and Quartiles. Understanding outliers is also crucial in data
analysis.
1. Range
● Definition: The difference between the maximum and minimum values in a dataset.
● Manual Formula: Range = Maximum - Minimum
● Example: Data: {10, 15, 20, 25, 30}
Range = 30−10=20
● Excel Formula: =MAX(A1:A5) - MIN(A1:A5)
● Use Case:
○ Quick measure of variability in datasets like daily temperatures or stock prices.
2. Variance
● Definition: Measures the average squared deviation of each data point from the mean.
● Manual Formula: Variance=∑(Value−Mean)^2 / Number of Values
● Example: Data: {10, 15, 20, 25, 30}
Mean = 20
Variance =¿
● Excel Formula:
○ For a population: =VAR.P(A1:A5)
○ For a sample: =VAR.S(A1:A5)
● Use Case:
○ Variance is used in finance to measure investment risk.
3. Standard Deviation
● Definition: The square root of variance, representing the average distance of data
points from the mean.
● Manual Formula: Standard Deviation=Variance
● Example: Data: {10, 15, 20, 25, 30}
Variance = 50
Standard Deviation =squareroot ( variance) = 7.07
● Excel Formula:
○ For a population: =STDEV.P(A1:A5)
○ For a sample: =STDEV.S(A1:A5)
● Use Case:
○ Used in quality control and risk analysis to assess consistency.
4. Percentiles
5. Quartiles
Excel Formula
=QUARTILE.EXC(A1:A5, 1) (Q1)
=QUARTILE.EXC(A1:A5, 2) (Median or Q2)
=QUARTILE.EXC(A1:A5, 3) (Q3)
● Use Case:
○ Identifying outliers by using the Interquartile Range (IQR).
6. Outliers
● Definition: Data points that are significantly different from others in the dataset.
● Technique for Identifying Outliers:
○ Compute IQR: IQR=Q3−Q1
○ Lower Bound: Q1−1.5xIQR
○ Upper Bound: Q3+1.5xIQR
○ Example: Data: {10, 15, 20, 25, 30,100}
Q1 = 13.75 , Q3 = 47.5 , IQR = 33.75
Lower Bound = 13.75 − 1.5×33.75 = −36.875
○ Upper Bound = 47.5 + 1.5×33.75 = 98.125
Outlier = 100
○ Any data point below the Lower Bound or above the Upper Bound is
considered an outlier.
● Use Case:
○ Detecting outliers in sales, student test scores, or scientific experiments.
7. Coefficient of Variation
Definition:
● The Coefficient of Variation (CV) is a measure of relative variability, expressed as a
percentage of the mean. It helps in comparing the degree of variation between datasets
with different units or scales.
Excel Formula:
● Mean: =AVERAGE(A1:A5)
● Standard Deviation: STDEV(A1:A5)
● CV: =STDEV(A1:A5)/AVERAGE(A1:A5)*100
Significance:
8. Covariance
Definition:
Formula:
Covariance = [∑( X−X mean)(Y −Y mean)]/n
Manual Calculation:
● Datasets:
X = {10, 20, 30, 40, 50}
Y = {15, 25, 35, 45, 55}
● Find Deviations:
X− mean(X) = {-20, -10, 0, 10, 20}
Y− mean(Y) = {-20, -10, 0, 10, 20}
● Multiply Deviations:
[X− mean(X)][Y− mean(Y)] = 400,100,0,100,400
● Calculate Covariance:
Covariance= (400+100+0+100+400) / 5 = 200
Excel Formula:
Significance:
● Covariance only indicates direction, not the strength or magnitude of the relationship.
9. Correlation
Definition:
Correlation measures the strength and direction of the linear relationship between two
variables. It is standardized, ranging from -1 to 1:
Pearson Correlation
Formula:
r = Covariance(X, Y) / SD(X) × SD(Y)
Manual Calculation:
● Datasets:
X = {10, 20, 30, 40, 50}
Y = {15, 25, 35, 45, 55}
● Covariance(X, Y) = 200
● SD(X)=14.14, SD(Y) = 14.14
● Correlation = r = 200 / (14.14×14.14) = (200 / 200) = 1
Simple Linear Regression is a statistical technique used to predict the value of a dependent
variable (Y) based on an independent variable (X). The relationship between X and Y is
modeled as a straight line:
Y= mX + c
Where:
Example
Make Predictions
Y= mX + c
Example:
● If c = 45 , m = 6 , X=6 , Y = 6(6) + 45 = 81
1. Slope(m):
○ If positive, Y increases as X increases.
○ If negative, Y decreases as X increases.
2. Intercept(c):
○ The starting value of Y when X= 0
3. Predictions:
○ You can estimate Y for given values of X