0% found this document useful (0 votes)
5 views

Statistics Refresher

Uploaded by

Abhishek Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

Statistics Refresher

Uploaded by

Abhishek Kumar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 11

Importance of Statistics

1. Business:
○ Statistics helps in forecasting sales, analyzing trends, and improving decision-
making.
○ Example: A retail store uses sales data to predict demand during holiday
seasons.
2. Healthcare:
○ Analyzing patient outcomes and testing the efficacy of drugs rely on statistical
methods.
○ Example: A clinical trial uses statistical analysis to determine the effectiveness of
a new vaccine.
3. Social Sciences:
○ Statistics aids in studying behavior patterns and interpreting survey results.
○ Example: Analyzing voting patterns in elections to understand demographic
preferences.

Data Collection

1. Primary Data:
○ Collected directly for a specific purpose, often through surveys, interviews, or
experiments.
○ Example: A company surveys customers to gauge satisfaction with a new
product.
2. Secondary Data:
○ Obtained from existing sources like reports, databases, or government records.
○ Example: Using census data to analyze population growth trends for urban
planning.

Data Classification with Examples

1. Nominal Data

● Definition: Data that represents categories without inherent order.


● Example: Car Color (Red, Blue, Green).
● There is no ranking or logical order to the colors.
2. Ordinal Data

● Definition: Data with categories in a specific order, but the intervals between them are
not uniform.
● Example: Customer Rating (Poor, Average, Excellent).
● "Excellent" is higher than "Average," but the difference between "Poor" and "Average"
may not equal the difference between "Average" and "Excellent."

3. Interval Data

● Definition: Data with equal intervals between values but no true zero.
● Example: Temperature (°C) (15, 25, 30).
● The difference between 15°C and 25°C is the same as between 25°C and 30°C.
However, 0°C does not mean the absence of temperature (no true zero).

4. Ratio Data

● Definition: Data with equal intervals and a true zero, allowing for meaningful
comparisons like ratios.
● Example: Weight (kg) (60, 70, 85).
● 0 kg means no weight, and comparisons like "Person 2 is 1.2 times heavier than Person
3" are valid.

More Types Of Data

Discrete Data

● Definition: Discrete data consists of countable, distinct values, often integers. These
values represent quantities that cannot be subdivided meaningfully.
● Example from the Table:
○ Number of Pets: {0, 1, 2, 3}.
○ These are exact numbers representing how many pets a person owns. You can’t
have 1.5 pets.
● Key Features:
○ Finite or countable.
○ Typically whole numbers.
○ Example in Real Life: Number of books on a shelf, number of cars in a parking
lot.

Continuous Data

● Definition: Continuous data can take any value within a given range and often includes
decimals. It represents measurements that can be infinitely precise.
● Examples from the Table:
○ Height (cm): {160.5, 165.8, 172.3, 158.0}.
■ These are measurements, and any value is possible within the height
range (e.g., 160.51 cm or 160.52 cm).
○ Weight (kg): {70.2, 65.5, 68.9, 60.0}.
■ These are also measurements, and decimal precision is meaningful (e.g.,
68.91 kg or 68.92 kg).
● Key Features:
○ Infinite possible values within a range.
○ Measured rather than counted.
○ Example in Real Life: Distance, temperature, or speed.

Table of Statistical Measures

A concise table summary for the distribution of Nominal, Ordinal, Interval, and Ratio data
under Discrete and Continuous data types:
Measure Of Central Tendency

1. Mean

● Definition: The average of the values in a dataset.


● Formula: Mean = (Sum of all values)/(Number of values)
● Example (Age): Mean Age=(25+30+35+40+455=35) / 5
● Excel Formula: =AVERAGE(B2:B6)
● Use Case: Commonly used to calculate average performance, salaries, or ages in a
group.

2. Median

● Definition: The middle value when the data is sorted.


● Example (Salary):
Sorted Salaries: 40000, 48000, 50000, 55000, 60000
Median = 50000 (middle value in the sorted list).
● Excel Formula: =MEDIAN(E2:E6)
● Use Case: Useful in skewed data where extreme values (outliers) could distort the
mean. For example, finding the typical household income in a neighborhood.

3. Mode

● Definition: The most frequently occurring value in the dataset.


● Example (Weight):
Mode = 2 (appears 3 times in the Weight column).
● Excel Formula: =MODE(C2:C6)
● Use Case: Used in product preference studies (e.g., finding the most popular product
color or size).

4. Weighted Mean

● Definition: The mean where some values contribute more due to their assigned
weights.
● Formula: Weighted Mean=∑(Value×Weight)/∑(Weight)
● Example (Score with Weight):
Weighted Mean = [(85×2)+(90×3)+(80×1)+(95×2)+(88×2)] / (2+3+1+2+2) = 88.33
● Excel Formula: =SUMPRODUCT(B2:B6, C2:C6)/SUM(C2:C6)
● Use Case: Often used in academics (weighted average of grades), project evaluations,
and financial analysis.

5. Trimmed Mean

● Definition: The mean after removing a certain percentage of the smallest and largest
values.
● Example (Salary, 10% Trim):
Exclude 10% from both ends: 40000, 60000
Remaining Salaries: 48000, 50000, 55000
Trimmed Mean = (48000+50000+55000) / (3) = 51000
● Excel Formula: =TRIMMEAN(E2:E6, 0.1)
● Use Case: Used in sports scoring (removing outliers like highest and lowest judges'
scores) or in robust data analysis.

6. Weighted Median

● Definition: The median when weights are applied to the values.


● Example:
○ Values sorted by Score in ascending order first : {80 (1), 85 (2), 88 (2), 90 (3), 95
(2)}
○ Weighted Median = 90 (middle weighted value).
● Excel Formula: Weighted Median doesn’t have a direct Excel function but can be
computed manually
● Use Case: Used in income analysis to find the typical income, accounting for family size
or population density.

Measures of Dispersion

Measures of dispersion indicate how spread out the data is. They include Range, Variance,
Standard Deviation, Percentiles, and Quartiles. Understanding outliers is also crucial in data
analysis.

1. Range

● Definition: The difference between the maximum and minimum values in a dataset.
● Manual Formula: Range = Maximum - Minimum
● Example: Data: {10, 15, 20, 25, 30}
Range = 30−10=20
● Excel Formula: =MAX(A1:A5) - MIN(A1:A5)
● Use Case:
○ Quick measure of variability in datasets like daily temperatures or stock prices.

2. Variance

● Definition: Measures the average squared deviation of each data point from the mean.
● Manual Formula: Variance=∑(Value−Mean)^2 / Number of Values
● Example: Data: {10, 15, 20, 25, 30}
Mean = 20
Variance =¿
● Excel Formula:
○ For a population: =VAR.P(A1:A5)
○ For a sample: =VAR.S(A1:A5)
● Use Case:
○ Variance is used in finance to measure investment risk.

3. Standard Deviation

● Definition: The square root of variance, representing the average distance of data
points from the mean.
● Manual Formula: Standard Deviation=Variance
● Example: Data: {10, 15, 20, 25, 30}
Variance = 50
Standard Deviation =squareroot ( variance) = 7.07
● Excel Formula:
○ For a population: =STDEV.P(A1:A5)
○ For a sample: =STDEV.S(A1:A5)
● Use Case:
○ Used in quality control and risk analysis to assess consistency.

4. Percentiles

● Definition: The value below which a given percentage of data falls.


● Manual Formula:
Arrange data in ascending order and identify the value corresponding to the desired
percentage position. => Position = [(percentile/100)] x (n+1), n = no of data points
● If P is an integer, the 90th percentile is the value at that position.
● If P is a decimal, interpolate between the nearest two values.
❖ Pk=Value at Floor Position+(Fractional Part of Position)×(Difference Between Next
and Floor Values)
● Example: Data: {10, 15, 20, 25, 30}
25th Percentile (P,25) = 12.5
● Excel Formula: =PERCENTILE.EXC(A1:A5, 0.25)
● Use Case:
○ Percentiles are commonly used in test scoring and income distribution.

5. Quartiles

● Definition: Divide data into four equal parts.


○ Q1 (25th Percentile): First quartile.
○ Q2 (50th Percentile): Median.
○ Q3 (75th Percentile): Third quartile.
● Manual Formula: Quartiles are calculated like percentiles at 25%, 50%, and 75%.

Excel Formula
=QUARTILE.EXC(A1:A5, 1) (Q1)
=QUARTILE.EXC(A1:A5, 2) (Median or Q2)
=QUARTILE.EXC(A1:A5, 3) (Q3)

● Use Case:
○ Identifying outliers by using the Interquartile Range (IQR).

6. Outliers

● Definition: Data points that are significantly different from others in the dataset.
● Technique for Identifying Outliers:
○ Compute IQR: IQR=Q3−Q1
○ Lower Bound: Q1−1.5xIQR
○ Upper Bound: Q3+1.5xIQR
○ Example: Data: {10, 15, 20, 25, 30,100}
Q1 = 13.75 , Q3 = 47.5 , IQR = 33.75
Lower Bound = 13.75 − 1.5×33.75 = −36.875
○ Upper Bound = 47.5 + 1.5×33.75 = 98.125
Outlier = 100
○ Any data point below the Lower Bound or above the Upper Bound is
considered an outlier.
● Use Case:
○ Detecting outliers in sales, student test scores, or scientific experiments.

7. Coefficient of Variation

Definition:
● The Coefficient of Variation (CV) is a measure of relative variability, expressed as a
percentage of the mean. It helps in comparing the degree of variation between datasets
with different units or scales.

CV = (Standard deviation / Mean) x 100

Excel Formula:

● Mean: =AVERAGE(A1:A5)
● Standard Deviation: STDEV(A1:A5)
● CV: =STDEV(A1:A5)/AVERAGE(A1:A5)*100

Significance:

● A higher CV indicates more variability relative to the mean.


● Used in fields like finance (risk analysis), biology (experimental variability), and
manufacturing (quality control).

8. Covariance

Definition:

Covariance measures the directional relationship between two variables:

● Positive covariance: Variables increase or decrease together.


● Negative covariance: One variable increases while the other decreases.

Formula:
Covariance = [∑( X−X mean)(Y −Y mean)]/n

Manual Calculation:

● Datasets:
X = {10, 20, 30, 40, 50}
Y = {15, 25, 35, 45, 55}

● Calculate the Means:


mean(X) = 30 , mean(Y) = 35

● Find Deviations:
X− mean(X) = {-20, -10, 0, 10, 20}
Y− mean(Y) = {-20, -10, 0, 10, 20}
● Multiply Deviations:
[X− mean(X)][Y− mean(Y)] = 400,100,0,100,400

● Calculate Covariance:
Covariance= (400+100+0+100+400) / 5 = 200

Excel Formula:

● Use: =COVAR(range1, range2)

Significance:

● Covariance only indicates direction, not the strength or magnitude of the relationship.

9. Correlation

Definition:

Correlation measures the strength and direction of the linear relationship between two
variables. It is standardized, ranging from -1 to 1:

● 1: Perfect positive correlation.


● -1: Perfect negative correlation.

Pearson Correlation

Formula:
r = Covariance(X, Y) / SD(X) × SD(Y)

Manual Calculation:

● Datasets:
X = {10, 20, 30, 40, 50}
Y = {15, 25, 35, 45, 55}

From the previous covariance example:

● Covariance(X, Y) = 200
● SD(X)=14.14, SD(Y) = 14.14
● Correlation = r = 200 / (14.14×14.14) = (200 / 200) = 1

This indicates a perfect positive correlation.


Excel Formula:

● Use: = CORREL(range1, range2)

Comparison of Covariance and Correlation

10. Simple Linear Regression in Excel

Simple Linear Regression is a statistical technique used to predict the value of a dependent
variable (Y) based on an independent variable (X). The relationship between X and Y is
modeled as a straight line:

Y= mX + c
Where:

● c : Intercept (value of Y when X=0).


● m : Slope (change in Y for a one-unit change in X).

Example

● Independent Variable (X): Study Hours.


● Dependent Variable (Y): Exam Scores.
Manual calculation of Slope and intercept
● slope(m) = ∑(X− X mean)( y− y mean )/∑ ¿
● intercept(c) = Y mean−slope( X mean) = y – mx

Relationship of slope, covariance and correlation


m=r [SD(Y )/SD (X )]=Covariance( X ,Y )/ variance(X )

Use Excel Functions to Calculate Regression Coefficients


You can calculate the slope (m) and intercept (c) using these formulas:
● Slope (m): =SLOPE(known_y's, known_x's)

● Intercept (c): = INTERCEPT(known_y's, known_x's)

Make Predictions

Use the regression equation to predict Y for new values of X:

Y= mX + c

Example:

● If c = 45 , m = 6 , X=6 , Y = 6(6) + 45 = 81

Alternatively, use Excel’s function:

● =FORECAST.LINEAR(new_x, known_y's, known_x's)

Insights from Regression

1. Slope(m):
○ If positive, Y increases as X increases.
○ If negative, Y decreases as X increases.
2. Intercept(c):
○ The starting value of Y when X= 0
3. Predictions:
○ You can estimate Y for given values of X

You might also like