Introduction to Statistical Modeling in big data
Key Concepts
1. Random Variables and Probability Distributions
A random variable represents a data attribute whose values result from some
probabilistic process.
The probability distribution defines the likelihood of different outcomes.
Common distributions:
o Bernoulli (binary outcomes)
o Gaussian/Normal (continuous, bell-shaped)
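To make the two distributions above concrete, here is a minimal sketch that draws samples from each using Python's standard `random` module. The parameter values (p = 0.3, mean 5.0, standard deviation 2.0), the sample size, and the helper names are illustrative assumptions, not values from these notes.

```python
import random

def sample_bernoulli(p, rng):
    """Return 1 with probability p, otherwise 0 (a binary outcome)."""
    return 1 if rng.random() < p else 0

def sample_gaussian(mu, sigma, rng):
    """Draw one value from a normal distribution with mean mu and std sigma."""
    return rng.gauss(mu, sigma)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000              # illustrative sample size

bernoulli_draws = [sample_bernoulli(0.3, rng) for _ in range(n)]
gaussian_draws = [sample_gaussian(5.0, 2.0, rng) for _ in range(n)]

# With many samples, the sample means should land near p and mu.
bernoulli_mean = sum(bernoulli_draws) / n
gaussian_mean = sum(gaussian_draws) / n
print(bernoulli_mean, gaussian_mean)
```

The Bernoulli draws take only the values 0 and 1, while the Gaussian draws are continuous and cluster around the mean.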
Statistical classification models predict the class label of an instance based on estimated
probabilities.
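For an instance with features X = (x_1, x_2, ..., x_n), the goal is to compute the probability of
each class C given X, using Bayes' theorem:
P(C∣X) = P(X∣C) P(C) / P(X)
Naive Bayes assumes the features are conditionally independent given the class, so the likelihood
factorizes into a product over features:
P(X∣C) = ∏_{i=1}^{n} P(x_i∣C)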
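The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, assuming the priors and per-feature likelihoods are already known. The function names, the class names "A" and "B", and all numbers are hypothetical and chosen only to show the arithmetic.

```python
from math import prod

def unnormalized_posterior(prior, feature_likelihoods):
    """Return P(C) * prod_i P(x_i | C), which is proportional to P(C | X)."""
    return prior * prod(feature_likelihoods)

def normalize(scores):
    """Divide each score by the sum over classes, which plays the role of P(X)."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all class scores are zero, so P(X) would be 0")
    return {label: s / total for label, s in scores.items()}

# Illustrative numbers only (not taken from any dataset).
scores = {
    "A": unnormalized_posterior(0.6, [0.5, 0.2]),   # 0.6 * 0.1
    "B": unnormalized_posterior(0.4, [0.25, 0.5]),  # 0.4 * 0.125
}
posteriors = normalize(scores)
best_class = max(posteriors, key=posteriors.get)
print(posteriors, best_class)
```

Because P(X) is the same for every class, comparing the unnormalized scores is enough to pick the most probable class. Normalizing is only needed when the actual posterior probabilities are wanted.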
Dataset:
Email ID | Contains "buy" | Contains "free" | Contains "click" | Class (Spam/Not Spam)
1        | Yes            | No              | Yes              | Spam
2        | No             | Yes             | No               | Not Spam
3        | Yes            | Yes             | Yes              | Spam
4        | No             | No              | Yes              | Not Spam
P(Spam)=2/4=0.5
P(Not Spam)=2/4=0.5
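The priors above are just class frequencies. As a small sketch, they can be estimated by counting the labels of the four emails in the dataset (the variable names here are illustrative):

```python
from collections import Counter

# Class labels of emails 1 to 4, in the order they appear in the dataset.
labels = ["Spam", "Not Spam", "Spam", "Not Spam"]

counts = Counter(labels)
# Prior of each class = number of emails with that label / total emails.
priors = {label: count / len(labels) for label, count in counts.items()}
print(priors)  # {'Spam': 0.5, 'Not Spam': 0.5}
```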
Suppose a new email contains "buy" = Yes, "free" = No, "click" = Yes. We want to predict if it's
spam.
Compute:
P(Spam∣X) ∝ P(Spam) × P(buy=Yes∣Spam) × P(free=No∣Spam) × P(click=Yes∣Spam)
= 0.5 × 1.0 × 0.5 × 1.0 = 0.25
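Similarly,
P(Not Spam∣X) ∝ P(Not Spam) × P(buy=Yes∣Not Spam) × P(free=No∣Not Spam) × P(click=Yes∣Not Spam)
= 0.5 × 0 × 0.5 × 0.5 = 0
Since 0.25 > 0, the new email is classified as Spam. The Not Spam score is exactly zero because
no Not Spam email in the dataset contains "buy", and a single zero factor wipes out the product.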
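The hand calculation above can be checked with a short script. This is a minimal sketch that estimates every probability by plain counting over the four emails, with no smoothing. The function name `score` and the tuple layout are assumptions made for illustration.

```python
# Rows: (buy, free, click, label), copied from the dataset above.
emails = [
    ("Yes", "No",  "Yes", "Spam"),
    ("No",  "Yes", "No",  "Not Spam"),
    ("Yes", "Yes", "Yes", "Spam"),
    ("No",  "No",  "Yes", "Not Spam"),
]

def score(new_email, label):
    """Return P(label) * prod_i P(x_i | label), estimated by counting."""
    rows = [row for row in emails if row[3] == label]
    result = len(rows) / len(emails)  # prior P(label)
    for i, value in enumerate(new_email):
        matches = sum(1 for row in rows if row[i] == value)
        result *= matches / len(rows)  # conditional P(x_i = value | label)
    return result

new_email = ("Yes", "No", "Yes")  # buy=Yes, free=No, click=Yes
scores = {label: score(new_email, label) for label in ("Spam", "Not Spam")}
prediction = max(scores, key=scores.get)
print(scores, prediction)  # {'Spam': 0.25, 'Not Spam': 0.0} Spam
```

The script reproduces the scores 0.25 and 0 and picks the class with the larger score.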
Summary
Statistical modeling provides a probabilistic framework for data classification and
prediction.
Naive Bayes is a foundational statistical model that is widely used due to its simplicity
and surprisingly good performance.
Estimation of prior and conditional probabilities is key.
Model evaluation on data not used for training is necessary to check how accurate the model is.