Complete Data Analysts RoadMap
Crack Top Analyst Roles with This 15-Day STATISTICS Series!
Curated by
SAURABH G
Founder at DataNiti
6+ Years of Experience | Senior Data Engineer
LinkedIn: www.linkedin.com/in/saurabhgghatnekar
BHAVESH ARORA
Senior Data Analyst at Delight Learning Services
M.Tech – IIT Jodhpur | 3+ Years of Experience
LinkedIn: www.linkedin.com/in/bhavesh-arora-11b0a319b
Connect with us: https://topmate.io/bhavesh_arora/
✅ 1. Mean (Average)
The sum of all values divided by the number of values.
Python Code:
import numpy as np
data = [10, 20, 30, 40, 50]
print("Mean:", np.mean(data))
✅ 2. Median
The middle value when data is sorted.
● If odd: middle value
● If even: average of two middle values
Python Code:
print("Median:", np.median(data))
✅ 3. Mode
The most frequently occurring value in a dataset.
Python Code:
from scipy import stats
values = [10, 20, 20, 30, 40]  # 20 appears twice, so the mode is 20
print("Mode:", stats.mode(values, keepdims=False).mode)
✅ 4. Range
The difference between the highest and lowest value.
Formula: Range = Maximum value − Minimum value
Python Code:
print("Range:", max(data) - min(data))
✅ 1. Variance (σ² or s²)
The average squared deviation from the mean; it measures how spread out the data is.
Formulas:
● Population Variance (σ²): σ² = Σ(xᵢ − μ)² / N
● Sample Variance (s²): s² = Σ(xᵢ − x̄)² / (n − 1)
Python Code:
import numpy as np
data = [10, 20, 30, 40, 50]
print("Population Variance:", np.var(data))       # ddof=0 (default) → divide by N
print("Sample Variance:", np.var(data, ddof=1))   # ddof=1 → divide by n − 1
✅ 2. Standard Deviation (σ or s)
The square root of variance. It’s in the same unit as the data.
Formula: σ = √σ² (population), s = √s² (sample)
Python Code:
print("Standard Deviation:", np.std(data, ddof=1))
✅ 3. Interquartile Range (IQR)
The spread of the middle 50% of the data, between the 25th percentile (Q1) and the 75th percentile (Q3).
Formula: IQR = Q3 − Q1
Python Code:
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
print("IQR:", Q3 - Q1)
✅ 1. Types of Probability
Type           Description                            Example
Theoretical    Based on reasoning or known outcomes   Coin flip: P(Heads) = 0.5
Experimental   Based on actual experiments/data       3 out of 10 students are left-handed → P(LH) = 0.3
Subjective     Based on intuition or experience       “I feel there’s a 70% chance of rain”
Python Code:
# Experimental probability: favorable outcomes / total trials
favorable = 3
total = 10
print("Probability:", favorable / total)
✅ 3. Types of Errors
Type Meaning
Type I Error Rejecting H₀ when it's actually true (False Positive)
Type II Error Failing to reject H₀ when it's false (False Negative)
print("t-statistic:", t_stat)
print("p-value:", p_val)
if p_val < 0.05:
print("Reject H₀")
else:
print("Fail to reject H₀")
✅ 6. Common Tests
Test Use Case
Z-test Known population std dev, n > 30
T-test Unknown std dev, small samples
Chi-square test Categorical variables
ANOVA Comparing >2 groups
Proportion test Proportion-based hypotheses
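Most of these tests are demonstrated with code later in this series; the proportion test is the exception, so here is a minimal sketch, assuming the statsmodels package is installed (the counts below are hypothetical):
Python Code:
from statsmodels.stats.proportion import proportions_ztest
# Hypothetical: 45 conversions out of 100 visitors; H₀: true proportion = 0.5
stat, p_val = proportions_ztest(count=45, nobs=100, value=0.5)
print("z-statistic:", stat, "p-value:", p_val)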
✅ 2. One-Sample T-Test
Checks if the sample mean is significantly different from a known value
(usually the population mean).
from scipy.stats import ttest_1samp
import numpy as np
data = np.array([22, 21, 23, 20, 24])
t_stat, p_val = ttest_1samp(data, popmean=21)
print("T-Statistic:", t_stat)
print("P-Value:", p_val)
✅ 5. Assumptions of T-Test
● Data is continuous and normally distributed
● Observations are independent
● Equal variances (in some cases)
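A quick way to sanity-check the normality assumption above is the Shapiro-Wilk test; a minimal sketch (the sample below is hypothetical):
Python Code:
from scipy.stats import shapiro
# p > 0.05 suggests normality is plausible; p < 0.05 suggests it is violated
sample = [22, 21, 23, 20, 24, 22, 25]
stat, p = shapiro(sample)
print("Shapiro-Wilk p-value:", p)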
Confused between the Z-Test, T-Test, and ANOVA? You're not alone. These statistical tests help determine whether group differences are real or due to chance, but each has its own use case and assumptions.
✅ 1. Z-Test
Used when:
● Population standard deviation is known
● Sample size is large (n ≥ 30)
● Data is normally distributed
Use case: Testing population mean with large sample
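A minimal one-sample z-test sketch, assuming statsmodels is available (the data below is simulated for illustration; note that statsmodels' ztest estimates the standard deviation from the sample, which is acceptable for large n):
Python Code:
import numpy as np
from statsmodels.stats.weightstats import ztest
# Simulated large sample; H₀: population mean = 50
np.random.seed(0)
sample = np.random.normal(loc=52, scale=5, size=40)
z_stat, p_val = ztest(sample, value=50)
print("z-statistic:", z_stat, "p-value:", p_val)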
✅ 2. T-Test
Used when:
● Population standard deviation is unknown
● Sample size is small (n < 30)
● Data is approximately normal
🔹 Covariance vs Correlation
Python Code:
import numpy as np
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Covariance: direction of the linear relationship (scale-dependent)
cov_matrix = np.cov(x, y)
print("Covariance Matrix:\n", cov_matrix)
# Correlation: direction and strength, standardized to [-1, 1]
correlation = np.corrcoef(x, y)
print("Correlation Matrix:\n", correlation)
✅ 3. Key Differences Table
Feature Covariance Correlation
Range No fixed range [-1, 1]
Unit-dependent Yes No
Interpretation Direction only Direction + Strength
Standardized? No Yes
Use Cases PCA, Portfolio Analysis Feature selection, EDA
🔸 Q1: Can two variables have high covariance but low correlation?
✅ Yes. Covariance is scale-dependent, so variables with large variances can show a large covariance even when the standardized correlation is weak.
🔹 Day 11: Chi-Square Test – For Categorical Data Only
The Chi-Square Test (χ²) is a powerful statistical method used to
determine if there’s a significant association between categorical
variables. It’s your go-to test when analyzing survey results,
demographics, marketing funnels, and more!
Python Code:
from scipy.stats import chi2_contingency
# Sample contingency table
data = [[30, 10],
        [20, 40]]
chi2, p, dof, expected = chi2_contingency(data)
print("Chi² value:", chi2)
print("p-value:", p)
📌 Applications
● Is gender related to product preference?
● Are age groups independent of newsletter signup?
● Do regions and voting patterns relate?
The Central Limit Theorem (CLT) is one of the most fundamental ideas in statistics and data science. It explains why the normal distribution appears so often, even when the data itself isn't normal!
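A small simulation makes this concrete: even when we sample from a skewed exponential population, the distribution of sample means looks normal. This sketch uses illustrative numbers only:
Python Code:
import numpy as np
np.random.seed(42)
# Clearly non-normal (right-skewed) population
population = np.random.exponential(scale=2.0, size=100_000)
# Means of 5,000 samples of size n = 30
sample_means = [np.mean(np.random.choice(population, size=30)) for _ in range(5_000)]
print("Population mean:", population.mean())
print("Mean of sample means:", np.mean(sample_means))   # ≈ population mean
print("Std of sample means:", np.std(sample_means))     # ≈ population std / √30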
✅ 4. Key Terms
Term Meaning
Population Entire group
Sample Subset of the population
Sampling Distribution Distribution of sample statistics
✅ 5. CLT Assumptions
● Samples are random and independent
● Sample size n ≥ 30 is usually sufficient
● Population must have finite variance
📌 Applications of CLT
● A/B testing
● Quality control in manufacturing
● Estimating population mean from sample
● Predictive analytics in ML workflows
💬 INTERVIEW QUESTIONS (Medium-High)
🔸 Q1: Why is the Central Limit Theorem important for hypothesis
testing?
✅ Because it allows us to use the normal distribution even when
data isn't normal!
🔸 Q4: What if the population is already normal?
✅ Then the sampling distribution of the mean is also normal for any sample size.
✅ 1. What is a t-Test?
A t-Test determines whether the difference in means is statistically
significant when the sample size is small and the population standard
deviation is unknown.
✅ 2. Types of t-Tests
Test Type                         Use Case
One Sample t-Test                 Compare one group mean to a known value
Two Sample (Independent) t-Test   Compare two independent group means (e.g., A/B testing)
Paired Sample t-Test              Compare two related groups (e.g., before vs. after)
✅ 3. Assumptions of t-Test
● Data is normally distributed (CLT helps here!)
● Observations are independent
● Variances are equal (for two-sample t-test)
● Data is measured on interval/ratio scale
✅ 4. Python Examples
📌 One-Sample t-Test
from scipy import stats
import numpy as np
sample = [23, 25, 21, 22, 24, 26]
t_stat, p_val = stats.ttest_1samp(sample, popmean=20)
print("t-statistic:", t_stat, "p-value:", p_val)
✅ 5. Interpreting p-Value
p-value Interpretation
p < 0.05 Reject the null hypothesis (significant)
p ≥ 0.05 Fail to reject the null (not significant)
📌 Real-World Applications
● A/B Testing (website layout A vs. layout B)
● Medical Studies (treatment vs. control)
● Before/After Improvements (new teaching method vs. old)
🔸 Q3: What’s the null hypothesis in a t-test?
✅ That the means are equal (no significant difference).
🔸 Q4: Why use t-distribution instead of normal?
✅ t-distribution accounts for extra uncertainty in small samples.
✅ 1. What is ANOVA?
ANOVA helps us determine whether the mean differences across
multiple groups are statistically significant.
It avoids the inflated Type I error risk that comes from performing multiple t-tests.
✅ 2. Types of ANOVA
Type                Use Case
One-Way ANOVA       1 categorical IV (grouping) → 1 continuous DV (outcome)
Two-Way ANOVA       2 categorical IVs → 1 continuous DV
Repeated Measures   Same subjects measured multiple times (like a paired t-test)
✅ 3. Assumptions of ANOVA
● Groups are independent
● Normal distribution of the outcome variable
● Equal variances across groups (homogeneity of variances)
● Outcome variable is continuous
✅ 4. Python Example: One-Way ANOVA
from scipy.stats import f_oneway
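# The group data below is hypothetical (e.g., test scores under three
# teaching methods); the numbers are illustrative only.
group1 = [85, 88, 90, 85, 87]
group2 = [78, 82, 80, 79, 81]
group3 = [90, 92, 94, 91, 93]
# Optional: check homogeneity of variances (Levene's test) before ANOVA
from scipy.stats import levene
print("Levene p-value:", levene(group1, group2, group3).pvalue)
f_stat, p_val = f_oneway(group1, group2, group3)
print("F-statistic:", f_stat)
print("p-value:", p_val)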
✅ 5. Interpreting Results
● F-statistic: Ratio of between-group variance to within-group
variance
● p-value:
o p < 0.05 → At least one group mean differs significantly
o p ≥ 0.05 → No significant difference detected between group means
📈 Real-World Applications
● Comparing test scores across 3 different teaching methods
● Measuring user engagement across 3 website layouts
● Comparing sales performance in different regions
💬 INTERVIEW QUESTIONS (Medium to High)
🔸 Q1: Why not just use multiple t-tests instead of ANOVA?
✅ Multiple t-tests increase the chance of Type I error.
🔸 Q2: What does a high F-statistic mean?
✅ Greater variation between group means compared to within groups.
✅ 3. Assumptions of Chi-Square
● Data should be frequencies (counts), not percentages
● Categories must be mutually exclusive
● Expected frequency in each cell should be ≥ 5 for validity
● Observations are independent
✅ 5. Interpreting Results
● Chi² value: Larger = more difference between observed and
expected
● p-value:
o p < 0.05 → Variables are likely associated
o p ≥ 0.05 → Variables are likely independent
📈 Real-World Applications
● Is gender associated with product preference?
● Does education level impact voting behavior?
● Do different ads get clicked by different age groups?
4. How do you interpret the Area Under the ROC Curve (AUC-ROC)?
AUC-ROC measures a classifier's ability to distinguish between classes:
● AUC = 1: Perfect classification.
● AUC = 0.5: No discriminative ability (equivalent to random
guessing).
Higher AUC indicates better model performance.
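A minimal sketch of computing AUC-ROC, assuming scikit-learn is installed (the labels and predicted probabilities below are hypothetical):
Python Code:
from sklearn.metrics import roc_auc_score
# Hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3]
print("AUC-ROC:", roc_auc_score(y_true, y_scores))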
5. What is the Central Limit Theorem and why is it important?
The Central Limit Theorem states that the sampling distribution of the
sample mean approaches a normal distribution as the sample size
increases, regardless of the population's distribution. This is crucial for
making inferences about population parameters.