Probability Notes
1. Definitions
d. independent events- If the outcome of one event does not affect the outcome of
another
e. Central tendency
i. Mean: The "balancing point" of a dataset - sum all the values and divide by
the number of values. Like the center of gravity of a seesaw.
ii. Median: The middle value of a dataset when ordered from least to greatest.
Think of it as the "fair share" value.
iii. Mode: The most frequent value in a dataset. It's the "popular kid" of the
data set.
iv. Variance: How "spread out" a dataset is from its mean. Imagine a dance
floor - variance tells you how far dancers are typically from the center
(mean).
v. Standard deviation: The square root of the variance. It's like the "average
distance" dancers are from the center of the dance floor.
iii. Mixture - behaves conditionally; for example, for values < 5 it is discrete and for values > 5 it is
continuous
i. The probability mass function is non-negative and its values sum to 1: p_X(x) ≥ 0 for all x, and ∑_x p_X(x) = 1.
ii. The probability density function is non-negative for all the possible values,
i.e. f(x)≥ 0, for all x.
i. CDF (cumulative distribution function) - gives the probability that the random variable X is less than
or equal to x; it is usually denoted F(x).
j. Moment - The moments are the expected values of powers of X, e.g., E(X), E(X²), E(X³), etc. The first
moment E(X) is called the mean; the second central moment E[(X − μ)²] is called the variance.
ii. The fourth (standardized) moment describes how heavy the tails of the distribution are - kurtosis.
3. Every trial is an independent trial, which means the outcome of one trial
does not affect the outcome of another trial.
iii. Geometric Distribution- Models the number of Bernoulli trials needed to get
the first success. A geometric distribution is a special case of the negative
binomial distribution where the number of successes required (r) is equal to
1.
2. You know the mean number of events occurring within a given interval
of time or space. This number is called λ (lambda), and it is assumed to
be constant. λ is an average over the chosen interval: for example, if the
distribution is defined per year but the sample spans 10 years, λ is the
per-year average.
m. Continuous Distributions
2. The CDF of the distribution looks something like this
ii. Gamma Distribution - Describes the time until a specified number of events
occur, with a constant rate of occurrence.
1. In summary, the shape parameter ( α) controls the form and skewness
of the Gamma distribution, while the scale parameter ( β) adjusts the
spread and scale of the distribution.
2. The shape parameter (α) can be understood as related to the number of events or stages that need to
occur for the process to be completed.
a. For example, if α=3 in a coffee shop, it could mean that there are
three key stages in serving a customer: taking the order, preparing
the order, and delivering the order.
3. Where to use
a. it can model products that are more likely to fail either early in their
life (due to manufacturing defects) or later (due to wear and tear).
b. It's used to analyze failure rates and predict remaining useful life,
accommodating a wide range of failure behaviors.
4. Suppose you're assessing the viability of a wind farm. The wind speed
data over several years shows variability. By using the Weibull
distribution, you can model the wind speeds effectively, considering the
varying rates of occurrence of different wind speeds. The shape
parameter will tell you about the distribution of wind speeds (more high-
speed winds vs. more moderate winds), and the scale parameter will
give an idea of the 'typical' wind speed.
2. Furthermore, it can be used to approximate other probability
distributions.
4. The normal distribution, on the other hand, places no restriction on the range:
it can extend from −∞ to +∞ and we still get a smooth curve.
2. The data points for our log-normal distribution are given by the X
variable. When we log-transform that X variable (Y = ln(X)) we get a Y
variable which is normally distributed. Similarly, when we take the
exponential of a normally distributed variable we get a log-normal distribution
(see the sketch below).
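A small numpy sketch of this back-and-forth between normal and log-normal (numpy assumed available; the parameters are illustrative):

```python
# Exponentiating normal samples gives log-normal data; taking the log of that
# data recovers a normally distributed variable.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.0, scale=1.0, size=100_000)  # Y ~ Normal(0, 1)
x = np.exp(y)                                     # X = exp(Y) is log-normal
y_back = np.log(x)                                # log-transform recovers the normal variable

print(y_back.mean(), y_back.std())                # ~0 and ~1, as expected for Normal(0, 1)
```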
i. Given two random variables that are defined on the same probability space,
[1] the joint probability distribution is the corresponding probability
distribution on all possible pairs of outputs
ii. The joint distribution can just as well be considered for any given number of
random variables.
iii. It has
For B = Red: (2/3)(2/3) = 4/9 and (1/3)(2/3) = 2/9; total: 4/9 + 2/9 = 2/3
q. Conditional PMF
ii. Conditional Mean and variance- Calculate the average and spread of a
variable given that we know the value of another variable. They explore
how the behavior of one variable changes depending on the specific value
of another variable.
iii. Covariance - a measure of how much two random variables vary together.
Covariance measures the linear relationship between the random variables; if the
relationship between the random variables is nonlinear, the covariance might not
be sensitive to it, so it does not capture the full dependence between the two variables.
r. Central Limit Theorem- States that, under certain conditions, the sum of a large
number of random variables is approximately normal. In probability theory, the
central limit theorem (CLT) states that, under appropriate conditions, the
distribution of a normalized version of the sample mean converges to a
standard normal distribution. This holds even if the original variables
themselves are not normally distributed. There are several versions of the CLT,
each applying in the context of different conditions.
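A rough simulation sketch of the CLT (numpy assumed available; the sample sizes and counts are arbitrary choices): means of skewed exponential samples, once normalized, look approximately standard normal.

```python
# Means of samples from a skewed (exponential) distribution become approximately
# normal as the sample size grows.
import numpy as np

rng = np.random.default_rng(1)
n, num_samples = 50, 10_000

# 10,000 sample means, each computed from 50 exponential draws (mean 1, variance 1)
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

# Normalized means should be close to standard normal: mean ~ 0, std ~ 1
z = (sample_means - 1.0) / (1.0 / np.sqrt(n))
print(z.mean(), z.std())
```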
s. Sampling from a distribution- Process of selecting a random sample of
individuals from a statistical population.
v. Importance Sampling
1. A family of algorithms that generate samples by constructing a Markov
chain that has the desired distribution as its stationary distribution.
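A minimal random-walk Metropolis sketch, one member of this family, targeting a standard normal; the step size and chain length are arbitrary illustrative choices, not a prescription from these notes.

```python
# Random-walk Metropolis targeting an (unnormalized) standard normal density.
import math
import random

random.seed(0)

def target(x: float) -> float:
    # Unnormalized standard normal density
    return math.exp(-0.5 * x * x)

x = 0.0
samples = []
for _ in range(50_000):
    proposal = x + random.gauss(0.0, 1.0)             # symmetric random-walk proposal
    if random.random() < min(1.0, target(proposal) / target(x)):
        x = proposal                                   # accept the move
    samples.append(x)                                  # keep the current state either way

kept = samples[5_000:]                                 # discard burn-in
print(sum(kept) / len(kept))                           # ~0, the mean of the target
```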
t. Chi-Squared test
i. The chi-square test estimates how likely the observed data would be if the
null hypothesis were assumed to be true.
u. T distribution
ii. The heaviness of the tails is controlled by the degrees-of-freedom parameter ν:
for ν = 1 it is the Cauchy distribution, and as ν → ∞ it approaches the standard
normal distribution.
4. With fewer degrees of freedom (smaller samples), the t-distribution has
fatter tails than a normal distribution. This reflects the greater
uncertainty in estimating the mean from a smaller sample.
2. When working with small samples (typically less than 30), the sample
mean might not perfectly reflect the population mean.
iii. The degrees of freedom are determined by how many observations you are using
to estimate the variability (for a single sample, df = n − 1).
w. Estimation
1. Unbiased
2. Biased
x. CR inequality - The Chernoff bound, also known as the Chernoff inequality, is a
powerful tool in probability theory that provides upper bounds on the tail
probabilities of certain random variables. It's particularly useful in analyzing the
probability of rare events, assessing the accuracy of probability approximations,
and establishing convergence rates in large-scale systems.
ii. It focuses on the "tails" of the distribution, where extreme events occur.
i. Steps-
2. Then assume a model that could have generated these points, for
example a normal distribution
6. So now let’s say we have 3 points and we need to calculate the MLE for
a Gaussian. We evaluate the Gaussian density for all three points and
multiply them to get the joint probability. For example, for the
numbers 9, 9.5 and 11, it would look like this:
7. We just have to figure out the values of μ and σ that result in the
maximum value of the above expression.
9. The above expression for the total probability is actually quite a pain to
differentiate, so it is almost always simplified by taking the natural
logarithm of the expression. This is absolutely fine because the natural
logarithm is a monotonically increasing function.
10. To find the mean we differentiate with respect to the mean, and to
find the standard deviation we differentiate with respect to the standard
deviation (a numeric sketch follows below).
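A numeric sketch of this MLE calculation for the three points 9, 9.5 and 11 (plain Python; the closed-form answers come from setting the derivatives of the log-likelihood to zero):

```python
# Gaussian MLE for the three points used above. The closed-form solutions are the
# sample mean and the mean of squared deviations (divide by n, not n - 1).
import math

data = [9.0, 9.5, 11.0]
n = len(data)

mu_hat = sum(data) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n

def log_likelihood(mu: float, sigma2: float) -> float:
    # Sum of log Gaussian densities, i.e. the log of the product of the three densities
    return sum(-0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
               for x in data)

print(mu_hat, sigma2_hat)                  # ~9.833 and ~0.722
print(log_likelihood(mu_hat, sigma2_hat))  # maximum of the log-likelihood
```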
i. Maximum likelihood estimation (MLE) as you saw had a nice intuition but
mathematically is a bit tedious to solve. So we use a different technique for
estimating parameters called the Method of Moments (MoM)
ab. Hypothesis testing - the process of checking the validity of a claim about a
population using evidence found in sample data.
ii. Null Hypothesis - In a null hypothesis, we claim that there is no relationship
or difference between groups with respect to the value of the population
parameter. We begin the hypothesis test by assuming the null hypothesis
to be true, and later we retain/reject the null hypothesis
based on the evidence found in the sample data. If equality appears, it will be in the
null hypothesis (=, ≤, or ≥). Example:
iv. Steps
2. Decide the Significance Level (α) at which we get to reject the null
hypothesis. Usually, α is set as 0.05. This means that there is a 5%
chance we will reject the Null Hypothesis even when it’s true.
3. Calculate the test statistic and p value- A test statistic is nothing but the
standardized difference between the value of the parameter estimated
from the sample (such as sample mean) and the value of the null
hypothesis (such as hypothesized population mean). It is a measure of
how far the sample mean is from the hypothesized population mean.
4. Take a decision
v. Critical value- The value of the statistic in the sampling distribution for which
the probability is α is called the critical value. The areas beyond the critical
values are known as the critical region/rejection region and critical values
are the values that indicate the edge of the critical region. Critical regions
describe the entire area of values that rejects the null hypothesis. If the test
statistic falls in the critical region, the null hypothesis will be rejected.
vi. P-value: the conditional probability of getting a test statistic at least as extreme
as the one observed, given that the null hypothesis is true.
1. One Tailed Test- When the critical/rejection region is on one side of the
distribution it is known as a One-Tailed Test. In this case, the null
hypothesis will be rejected if the test statistic is on one side of the
distribution, either left or right.
2. Two Tailed Test - If the test is two-tailed, α must be divided by 2 and the
critical /rejection regions will be at both ends of the distribution curve.
Hence, in this case, the null hypothesis will be rejected when the test
value is on either of two rejection regions on either side of the
distribution.
x. Errors
1. Type 1 Error- FP
2. Type 2 error- FN
2. Formula
P(A) = ∑_{k=0}^{n} P(C_k)·P(A | C_k)
g. Random variable formulas. For any random variable X where P is its respective
probability, we define its mean as
i. Mean(μ) = ∑ x·P(x)
ii. For any random variable X, if it assumes the values x₁, x₂, …, xₙ, where the
probability corresponding to each value is P(x₁), P(x₂), …, P(xₙ),
then the expected value of the variable is E(X) = ∑ x·P(x)
1. E(X²) = ∑ x²·P(x)
2. E(X) = ∑ x·P(x)
ii. P(a ≤ X ≤ b) = ∫_a^b f(x) dx
i. CDF: F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
i. if two random variables have the same MGFs, then their distributions are
the same.
ii. m(t) = E(e^{tX}); differentiating n times with respect to t (using the chain rule
on e^{tX}) gives m^{(n)}(t) = E(Xⁿ·e^{tX}), so for example m^{(n)}(0) = E(Xⁿ). This substitution
of t = 0 must be done only at the end.
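A small sympy sketch of this differentiate-then-substitute idea, using the exponential distribution's MGF λ/(λ − t) as an illustrative example (sympy assumed available):

```python
# Moments from an MGF: differentiate n times with respect to t, then set t = 0.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
mgf = lam / (lam - t)                          # MGF of Exponential(rate = lambda), valid for t < lambda

first_moment = sp.diff(mgf, t, 1).subs(t, 0)   # E[X]  = 1/lambda
second_moment = sp.diff(mgf, t, 2).subs(t, 0)  # E[X^2] = 2/lambda^2

print(sp.simplify(first_moment), sp.simplify(second_moment))
```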
1. M_X(t) exists for X
2. M_X(t) has a finite value for all t in [−a, a], where a is any
positive real number.
k. Properties of MGF
l. Bernoulli Distribution - the simplest distribution ever
1. Binomial Distribution
1. To evaluate the binomial distribution we use the combinatorial term C(n, k) = n!/(k!(n − k)!),
which is what the first term of P(X = k) = C(n, k)·p^k·(1 − p)^{n−k} denotes.
2. Geometric Distribution
b. P(X = k) = (1 − p)^{k−1}·p,  k = 1, 2, ...
c. M_X(t) = p·e^t / (1 − (1 − p)·e^t)
d. Mean: μ = 1/p
3. Negative Binomial Distribution
a. k is the number of failures before the r-th success (r is the number of successes required)
b. P(X = k) = C(k + r − 1, k)·p^r·(1 − p)^k,  k = 0, 1, 2, ...
c. MGF: M_X(t) = (p / (1 − (1 − p)·e^t))^r
d. Mean: μ = r(1 − p)/p failures (equivalently, r/p total trials on average)
4. Uniform Distribution
a. As we have seen before, a and b are the minimum and maximum values, so they give
the range of values of x for the distribution
e. Mean: (a + b)/2
f. Variance: (b − a)²/12
5. Poisson Distribution
a. When λ is low, the distribution is much longer on the right side of its peak
than its left (i.e., it is strongly right-skewed). As λ increases, the distribution looks more
and more similar to a normal distribution.
c. Mean- λ
d. Variance- λ
e. When λ is a non-integer, the mode is the closest integer smaller than λ.
When λ is an integer, there are two modes: λ and λ − 1.
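A quick numpy check of these Poisson properties for an arbitrary non-integer λ (numpy assumed available):

```python
# Check that mean ~ variance ~ lambda, and that the mode is floor(lambda).
import numpy as np

rng = np.random.default_rng(2)
lam = 4.7
draws = rng.poisson(lam, size=200_000)

print(draws.mean(), draws.var())    # both ~ 4.7
print(np.bincount(draws).argmax())  # most frequent value ~ floor(4.7) = 4
```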
6. Gamma Distribution
a. Continuous poisson distribution
iii. λ is called the rate parameter; β = 1/λ is called the scale
parameter
iv. Wherever the random variable x appears in the probability density,
it is divided by β (which is the inverse of λ).
f. Variance: α/λ²
7. Weibull
a. k > 0 is the shape parameter and λ > 0 is the scale parameter of the
distribution.
i. A value of k< 1 indicates that the failure rate decreases over time
ii. A value of k = 1 indicates that the failure rate is constant over time.
iii. A value of k>1 indicates that the failure rate increases with time.
b. PDF: f(x) = (k/λ)·(x/λ)^{k−1}·e^{−(x/λ)^k} for x ≥ 0; 0 otherwise
c. CDF: F(x) = 1 − e^{−(x/λ)^k}
d. Moments: E(Xⁿ) = λⁿ·Γ(1 + n/k) (the Weibull MGF has no simple closed form)
e. Mean: μ = λ·Γ(1 + 1/k)
f. Variance: σ² = λ²·(Γ(1 + 2/k) − (Γ(1 + 1/k))²)
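A rough numpy/math check of the Weibull mean and variance formulas above against sample statistics; k and λ are arbitrary, and numpy's scale-1 Weibull sampler is multiplied by λ:

```python
# Compare the closed-form Weibull mean/variance with simulated sample statistics.
import math
import numpy as np

k, lam = 1.5, 2.0
rng = np.random.default_rng(3)
samples = lam * rng.weibull(k, size=500_000)   # numpy samples scale-1 Weibull, so scale by lambda

mean_formula = lam * math.gamma(1 + 1 / k)
var_formula = lam ** 2 * (math.gamma(1 + 2 / k) - math.gamma(1 + 1 / k) ** 2)

print(samples.mean(), mean_formula)  # should be close
print(samples.var(), var_formula)    # should be close
```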
8. Normal
b. μ is the mean
5. Variance: σ^2
6. Properties
d. The normal distribution curve must have only one peak. (i.e., Unimodal)
e. The curve approaches the x-axis, but it never touches, and it extends
farther away from the mean.
9. Lognormal Distribution
b. The estimations using MLE are listed below and that is how we will
calculate the value of these parameters
2. More stats
1. The mode represents the global maximum of the distribution and can
therefore be derived by taking the derivative of the log-normal
probability density function, setting it to 0, and solving
1. The mean (also known as the expected value) of the log-normal
distribution is the probability-weighted average over all possible values
10. MLE for Gaussian function
1. These formulas are near identical. We can see that we can use the same
approach as with the normal distribution and just transform our data with a
logarithm first.
i. Left Tailed Test p-value = P[Test statistics <= observed value of the test
statistic] Example- H₀: μ ≥ 5 and H₁: μ < 5 . This is a left-tailed test
since the rejection region would consist of values less than 5.
1. Right-tailed test: p-value = P[Test statistic >= observed value of the
test statistic]. Example: H₀: μ ≤ 39, H₁: μ > 39. This is a right-tailed test since the
rejection region would consist of values greater than 39.
b. Two tailed test p-value = 2 * P[Test statistics >= |observed value of the test
statistic|].
c. test statistic = (x̄ − μ) / (σ / √n)
i. x̄ = sample mean
ii. μ = population mean
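A worked sketch of this test statistic for a one-sample z-test with made-up numbers (scipy assumed available), including a two-tailed p-value as described earlier:

```python
# One-sample z-test (population sigma known), with a two-tailed p-value.
import math
from scipy import stats

x_bar, mu_0, sigma, n = 52.1, 50.0, 6.0, 40

z = (x_bar - mu_0) / (sigma / math.sqrt(n))      # standardized difference
p_two_tailed = 2 * (1 - stats.norm.cdf(abs(z)))  # 2 * P(Z >= |observed z|)

print(z, p_two_tailed)                           # reject H0 if p <= alpha (e.g. 0.05)
```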
13. F test
a. F = s₁² / s₂², where s₁² and s₂² are the variances of the two samples
b. Degrees of freedom are n₁ − 1 and n₂ − 1, where n₁ and n₂ are the sample sizes
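A minimal sketch of the F statistic with made-up samples (scipy assumed available for the p-value):

```python
# F statistic for comparing two sample variances; ddof=1 gives the usual sample variances.
import numpy as np
from scipy import stats

a = np.array([4.1, 5.0, 6.2, 5.5, 4.8, 5.9])
b = np.array([3.9, 4.2, 4.0, 4.5, 4.1, 4.3, 4.4])

f = a.var(ddof=1) / b.var(ddof=1)
df1, df2 = len(a) - 1, len(b) - 1

# one-sided p-value: probability of an F value at least this large under H0
p = 1 - stats.f.cdf(f, df1, df2)
print(f, df1, df2, p)
```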
14. T test
a. t = (x̄ − μ₀) / (s / √n), where μ₀ is the population mean, x̄ is the sample mean,
and s is the sample’s standard deviation
c. Comparing 2 independent samples: t = (x̄₁ − x̄₂) / √(s²_p·(1/n₁ + 1/n₂)), where s²_p is the
pooled variance
d. Degrees of freedom: n₁ + n₂ − 2
e. Comparing 2 related (paired) samples: t = d̄ / (s_d / √n), where d̄ is the mean of the differences
between the pairs and s_d is the standard deviation of those differences
f. Pooled variance
ii. Instead of using just s^2_A or s^2_B as your estimate of the common
variance, pooling combines information from both groups to provide a
more precise and reliable estimate
iv. Imagine you have data from two groups, group A and group B. You
suspect they come from different populations with potentially different
means, but you believe they share a common variance. However, you
only have estimates of the individual variances (s^2_A and s^2_B) from
your samples.
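A sketch of the pooled-variance two-sample t-test written out by hand and cross-checked against scipy's ttest_ind; the group data is made up:

```python
# Pooled-variance (equal-variance) two-sample t-test, by hand and via scipy.
import math
import numpy as np
from scipy import stats

a = np.array([20.1, 22.3, 19.8, 21.5, 20.9])
b = np.array([18.7, 19.5, 20.0, 18.9, 19.2, 19.8])

n1, n2 = len(a), len(b)
s2_pooled = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / math.sqrt(s2_pooled * (1 / n1 + 1 / n2))

result = stats.ttest_ind(a, b, equal_var=True)        # pooled-variance t-test
print(t_manual, result.statistic, result.pvalue)      # the two t values should match
```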
a. p(x, y) = P (X = x, Y = y)
b. Marginal PMF: p_X(x) = ∑_y p(x, y)
b. Marginal PDF: f_X(x) = ∫_{−∞}^{∞} f(x, y) dy
c. Marginal PDF: f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx
d. Conditional mean for a given value of y: E(X | Y = y) = ∫_{−∞}^{∞} x·f(x | y) dx
17. Other formula
iv. Correlation of 1 or -1: Perfect linear relationship (all points lie on a
straight line).
c. g(y) = f(x(y))·|dx/dy|; if the transformation is not one-to-one,
g(y) = f(x₁)·|dx₁/dy| + f(x₂)·|dx₂/dy| + ...
d. Process to do this
v. Find the domain of the differentiated function for variable y, given the
domain of variable x from the initial pdf
vi. Now substitute to the formula we mentioned above and we get the PDF
a. Take the table of PMF which is for every value of x what is the probability
1. If the transformation to y is one-to-one, each x gives a unique value of y; if it is not
one-to-one, the values of y are not all unique. In the one-to-one case, use the transform
to match each y with its x and substitute the corresponding value of the PMF, as shown
above.
1. If it is not one-to-one, draw the transformed PMF only for the unique values of y.
Where values collide, add up the PMF values that map to the same y, as shown above.
20. Transformation of 2-dimensional variables
a. For given random variables x and y, consider u = u(x, y) and v = v(x, y), which are continuously
differentiable functions
e. You can choose v (if not given) as x, y, or a combination such as x + y, just so that the
Jacobian ≠ 0
h. Find the domain of u and v given the domain of x and y (you can use
your representation of x and y in terms of u and v) to get the domain
c. CI = x̄ ± t*·(s/√n)
i. t* is the critical t-value based on the desired confidence level and
degrees of freedom (df = n − 1)
d. CI = p̂ ± z*·√(p̂(1 − p̂)/n)
ii. z* is the critical z-value based on the desired confidence level (e.g.,
1.96 for a 95% confidence level)
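A small sketch computing both confidence intervals above with made-up data (scipy assumed available for the critical t-value):

```python
# t-interval for a mean and z-interval for a proportion, at 95% confidence.
import math
import numpy as np
from scipy import stats

# t-interval for a mean (df = n - 1)
sample = np.array([9.8, 10.2, 10.5, 9.9, 10.1, 10.4, 9.7])
n = len(sample)
t_star = stats.t.ppf(0.975, df=n - 1)
half_width = t_star * sample.std(ddof=1) / math.sqrt(n)
mean_ci = (sample.mean() - half_width, sample.mean() + half_width)

# z-interval for a proportion
successes, trials = 42, 100
p_hat = successes / trials
z_star = 1.96
margin = z_star * math.sqrt(p_hat * (1 - p_hat) / trials)
prop_ci = (p_hat - margin, p_hat + margin)

print(mean_ci, prop_ci)
```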
22. CR inequality
a. P(X ≥ a) ≤ e^{−ta}·M(t) for every t > 0, where a is a real number, X is the random variable,
and M(t) is its MGF
b. Chebyshev’s: P(|X − E(X)| ≥ kσ) ≤ 1/k²
c. Boole’s: P(⋂_{i=1}^{n} A_i) ≥ 1 − ∑_{i=1}^{n} P(A_iᶜ)
d. …
a. Arrange data into a contingency table with rows and columns representing
the different categories of the variables. Each cell in the table contains the
observed frequency (count) of data points falling into that combination of
categories.
i. χ² = Σ (O − E)² / E
ii. O = observed frequency
iii. E = expected frequency
d. Set a significance level (alpha, usually 0.05 or 0.01) to determine the
threshold for rejecting the null hypothesis.
e. Calculate the degrees of freedom (df) based on the number of rows and
columns in the contingency table: df = (rows - 1) * (columns - 1).
g. Decision
i. If the p-value is less than or equal to the significance level, reject the
null hypothesis and conclude there's a significant association between
the variables.
ii. If the p-value is greater than the significance level, fail to reject the null
hypothesis and conclude there's not enough evidence to support an
association.
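A minimal sketch of this procedure using scipy's chi2_contingency on a made-up 2×2 contingency table:

```python
# Chi-square test of independence on a made-up table of observed counts.
import numpy as np
from scipy import stats

observed = np.array([[30, 10],    # rows/columns are the categories of the two variables
                     [20, 40]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(chi2, p_value, dof)          # dof = (rows - 1) * (columns - 1) = 1
print(expected)                    # expected counts under the null hypothesis
# reject H0 (no association) if p_value <= alpha (e.g. 0.05)
```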