Lecture 2 ML_Maths
Why worry about the math?
However, to get really useful results, you need good mathematical intuitions about certain general machine learning principles, as well as the inner workings of the individual algorithms.
Notation
x, y, z, u, v    vector (bold, lower case)
A, B, X          matrix (bold, upper case)
y = f( x )       function (map): assigns a unique value in the range of y to each value in the domain of x
dy / dx          derivative of y with respect to the single variable x
y = f( x ), x a vector    function of multiple variables, i.e. a vector of variables; a function in n-space
∂y / ∂xi         partial derivative of y with respect to element i of vector x
The concept of probability
Intuition:
In some process, several outcomes are possible. When the process is repeated a large number of times, each outcome occurs with a characteristic relative frequency, or probability. If a particular outcome happens more often than another outcome, we say it is more probable.
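To make the relative-frequency view concrete, here is a minimal Python sketch (the fair-die setup and the 100,000-roll count are illustrative choices, not taken from the slides) showing each face's relative frequency settling near 1/6:

```python
import random
from collections import Counter

# Roll a fair six-sided die many times and compare each face's observed
# relative frequency with the theoretical probability 1/6.
random.seed(0)
n_rolls = 100_000
counts = Counter(random.randint(1, 6) for _ in range(n_rolls))

for face in range(1, 7):
    rel_freq = counts[face] / n_rolls
    print(f"face {face}: relative frequency {rel_freq:.4f} (theoretical {1/6:.4f})")
```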
Axioms of probability
1. Non-negativity: for any event E ∈ F, p( E ) ≥ 0
2. Normalization: p( Ω ) = 1, where Ω is the sample space
3. Additivity: for mutually exclusive events E1, E2, …, p( E1 ∪ E2 ∪ … ) = p( E1 ) + p( E2 ) + …
Types of probability spaces
Example of discrete probability space
Example of continuous probability space
[Figure: probability density p(o) over the continuous outcome "height"]
Probability distributions
– Discrete example: sum of two fair dice
– Continuous example: waiting time between eruptions of Old Faithful (minutes)
Probability Distribution functions
Example: a single fair die, where p(x) = 1/6 for each outcome x = 1, 2, …, 6.

Probability Mass Function (pmf)
x    p(x)
1    p(x=1) = 1/6
2    p(x=2) = 1/6
3    p(x=3) = 1/6
4    p(x=4) = 1/6
5    p(x=5) = 1/6
6    p(x=6) = 1/6
Total: 1.0

For any pmf, the probabilities sum to one: Σ over all x of p(x) = 1.
Cumulative distribution function (CDF)
[Figure: step-function plot of F(x) for a fair die, rising by 1/6 at each of x = 1, …, 6 up to F(6) = 1.0]

x    F(x) = P(X ≤ x)
1    P(X ≤ 1) = 1/6
2    P(X ≤ 2) = 2/6
3    P(X ≤ 3) = 3/6
4    P(X ≤ 4) = 4/6
5    P(X ≤ 5) = 5/6
6    P(X ≤ 6) = 6/6
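A small Python sketch that rebuilds the pmf and CDF tables above for a fair die, using exact fractions:

```python
from fractions import Fraction

# pmf of a single fair die: p(x) = 1/6 for x = 1, ..., 6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# CDF: F(x) = P(X <= x) = sum of p(k) for all k <= x
cdf = {}
running = Fraction(0)
for x in sorted(pmf):
    running += pmf[x]
    cdf[x] = running

for x in sorted(pmf):
    print(f"x = {x}: pmf = {pmf[x]}, CDF = {cdf[x]}")  # e.g. x = 3: pmf 1/6, CDF 1/2
```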
Examples
(a)
x     f(x)
9     0.25
10    0.25
11    0.25
12    0.25
Sum:  1.0
Yes, this is a probability function!
Answer (b)
x    f(x)
1    (3−1)/2 = 1.0
2    (3−2)/2 = 0.5
3    (3−3)/2 = 0
4    (3−4)/2 = −0.5
Though this sums to 1, you can't have a negative probability; therefore, it's not a probability function.
Answer (c)
x    f(x)
0    1/25
1    3/25
2    7/25
3    13/25
Sum: 24/25
Doesn't sum to 1. Thus, it's not a probability function.
Random variables
Multivariate probability distributions
Scenario
– Several random processes occur (it doesn't matter whether in parallel or in sequence)
– We want to know the probability of each possible combination of outcomes
This can be described as a joint probability of several random variables.
Multivariate probability distributions
Marginal probability
– Probability distribution of a single variable in a
joint distribution
– Example: two random variables X and Y:
p( X = x ) = Σb p( X = x, Y = b ), summing over all values b of Y
Conditional probability
– Probability distribution of one variable given
that another variable takes a certain value
– Example: two random variables X and Y:
p( X = x | Y = y ) = p( X = x, Y = y ) / p( Y = y )
Example of marginal probability
Example of conditional probability
conditional probability: p( Y = European | X = minivan ) =
0.1481 / ( 0.0741 + 0.1111 + 0.1481 ) ≈ 0.4443
[Figure: 3D bar chart of the joint probability of X = model type (sport, SUV, minivan, sedan) and Y = manufacturer (American, Asian, European), with bars ranging from roughly 0.05 to 0.2]
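A minimal sketch of the marginal and conditional formulas in Python, using only the minivan column of joint probabilities quoted above (the rest of the joint table is not reproduced here):

```python
# Joint probabilities p(X = minivan, Y = y) for each manufacturer y, as given above.
joint_minivan = {"American": 0.0741, "Asian": 0.1111, "European": 0.1481}

# Marginal: p(X = minivan) = sum over all manufacturers y of p(X = minivan, Y = y)
p_minivan = sum(joint_minivan.values())

# Conditional: p(Y = European | X = minivan) = p(minivan, European) / p(minivan)
p_european_given_minivan = joint_minivan["European"] / p_minivan

print(f"p(X = minivan) = {p_minivan:.4f}")                                # 0.3333
print(f"p(Y = European | X = minivan) = {p_european_given_minivan:.4f}")  # 0.4443
```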
Continuous multivariate distribution
Expected value
Q1. Suppose an individual thinks that if they quit their job and work for themselves, there is a 60% chance they could earn $20,000 in their first year, a 30% chance they could earn $60,000, and a 10% chance they would earn $0. Calculate the expected value of their income in the first year of entrepreneurship.
Solution: Expected value = 0.6 × $20,000 + 0.3 × $60,000 + 0.1 × $0 = $30,000
The mean is typically used when we want to calculate the average value of a given sample.
Variance of a discrete random variable
The variance of a random variable X is defined by
Var( X ) = E[ ( X − μ )² ] = Σi ( xi − μ )² · p( xi ), where μ = E[ X ].
It is the average value of the squared deviation of X = xi from the mean μ, taking into account the probability of the various xi.
– Most common measure of "spread" of a distribution
– σ is the standard deviation: σ = √Var( X )
– Compare to the formula for the variance of an actual sample
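A short Python check of these definitions, applied to the fair-die pmf from the earlier slides:

```python
import math

# pmf of a fair die
pmf = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in pmf.items())                # E[X]
var = sum((x - mu) ** 2 * p for x, p in pmf.items())   # E[(X - mu)^2]
sigma = math.sqrt(var)                                 # standard deviation

print(f"E[X] = {mu:.4f}, Var(X) = {var:.4f}, sigma = {sigma:.4f}")
# E[X] = 3.5000, Var(X) = 2.9167, sigma = 1.7078
```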
Common forms of expected value (3)
Covariance: Cov( X, Y ) = E[ ( X − μX )( Y − μY ) ]
[Figure: scatter plot illustrating high (positive) covariance]
Compare to the formula for the covariance of actual samples
Correlation
[Figure: scatter plots of example correlation values — linear dependence without noise (top row) and various nonlinear dependencies (bottom row)]
Types of Relationship
[Figure: four scatter plots (1)–(4) of Y against X showing different types of relationship]
Types of Relationship…
[Figure: additional scatter plots of Y against X showing further types of relationship]
Types of Relationship…
[Figure: scatter plot of Y against X showing no relationship]
Example
Correlation( X, Y ) = [ (1 / (N − 1)) · Σi=1..N ( xi − μx )( yi − μy ) ] / ( σx · σy )
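A minimal Python sketch of this sample correlation formula; the x and y values below are illustrative only, not data from the slides:

```python
import math

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient, following the formula above."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # Sample covariance: (1 / (N - 1)) * sum of (x_i - mu_x)(y_i - mu_y)
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / (n - 1)
    # Sample standard deviations, also with the N - 1 denominator
    sd_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

print(correlation([2, 4, 6, 8], [1, 3, 2, 5]))  # ≈ 0.83
```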
Practice question
Q1. Calculate and analyze the correlation coefficient between the number
of study hours and the number of sleeping hours of different students using
the following data.
Complement rule
p( not A ) = 1 - p( A )
Product rule
p( A, B ) = p( A | B ) p( B )
(same expression given previously to define conditional probability)
Example of product rule
Rule of total probability
p( A ) = p( A, B ) + p( A, not B )
(same expression given previously to define marginal probability)
Independence
p( A | B ) = p( A ) or p( A, B ) = p( A ) p( B )
[Figure: Venn diagram of events A and B, splitting A into the regions ( A, not B ) and ( A, B )]
Examples of independence / dependence
Independence:
– Outcomes on multiple rolls of a die
– Outcomes on multiple flips of a coin
– Height of two unrelated individuals
– Probability of getting a king on successive draws from
a deck, if card from each draw is replaced
Dependence:
– Height of two related individuals
– Duration of successive eruptions of Old Faithful
– Probability of getting a king on successive draws from a deck, if the card from each draw is not replaced (see the sketch below)
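The sketch referenced above: the two card-drawing cases computed with exact fractions (4 kings in a 52-card deck):

```python
from fractions import Fraction

kings, deck = 4, 52

# With replacement: the two draws are independent.
p_with = Fraction(kings, deck) * Fraction(kings, deck)

# Without replacement: the second draw depends on the first.
p_without = Fraction(kings, deck) * Fraction(kings - 1, deck - 1)

print(p_with)     # 1/169
print(p_without)  # 1/221
```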
Example of independence vs. dependence
Bayes rule
Bayes rule
p( B | A ) = p( A | B ) p( B ) / p( A )
Example of Bayes rule
Example of Bayes rule, cont’d.
Probabilities: when to add, when to multiply
Practice questions
Here A, B and C hit the target independently, with P(A) = 1/6, P(B) = 1/4 and P(C) = 1/3.
(i) The probability that exactly one of them hits the target (i.e. only one hits the target while the other two do not):
P(A)P(B’)P(C’) + P(A’)P(B)P(C’) + P(A’)P(B’)P(C)
= 1/6 × 3/4 × 2/3 + 5/6 × 1/4 × 2/3 + 5/6 × 3/4 × 1/3 = 31/72
(ii) The probability that at least one of them hits the target:
1 − P(A’)P(B’)P(C’) = 1 − [ (5/6) × (3/4) × (2/3) ] = 1 − 30/72 = 42/72 = 7/12
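A quick check of (i) and (ii) with exact fractions, assuming (as read off from the solution above) that A, B and C hit the target independently with probabilities 1/6, 1/4 and 1/3:

```python
from fractions import Fraction

pA, pB, pC = Fraction(1, 6), Fraction(1, 4), Fraction(1, 3)   # hit probabilities
qA, qB, qC = 1 - pA, 1 - pB, 1 - pC                           # miss probabilities

exactly_one = pA * qB * qC + qA * pB * qC + qA * qB * pC
at_least_one = 1 - qA * qB * qC

print(exactly_one)   # 31/72
print(at_least_one)  # 7/12 (= 42/72)
```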
Practice questions
Q2. A covid test is 99% accurate (both ways). Only 0.3% of the population
is covid+. What is the probability that a random person is covid+ given
that the person tests+?
Using the rule of total probability:
P(person tests +) = P(person tests + | person is covid+) P(person is covid+) + P(person tests + | person is not covid+) P(person is not covid+)
Solution:
P(person is covid+ | person tests +) = (0.3% × 99%) / (0.3% × 99% + 99.7% × 1%) ≈ 22.95%
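The same calculation in Python, combining the rule of total probability with Bayes rule:

```python
p_covid = 0.003             # prior: 0.3% of the population is covid+
p_pos_given_covid = 0.99    # test is 99% accurate on covid+ people
p_pos_given_healthy = 0.01  # false-positive rate on covid- people

# Total probability of testing positive
p_pos = p_pos_given_covid * p_covid + p_pos_given_healthy * (1 - p_covid)

# Bayes rule: P(covid+ | tests+) = P(tests+ | covid+) P(covid+) / P(tests+)
p_covid_given_pos = p_pos_given_covid * p_covid / p_pos
print(f"{p_covid_given_pos:.4f}")  # 0.2295, i.e. about 22.95%
```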
Q3. Two dice are rolled. Consider the events A = {sum of two dice equals
3}, B = {sum of two dice equals 7 }, and C = {at least one of the dice shows a
1}.
(a) What is P (A | C)?
(b) What is P (B | C)?
(c) Are A and C independent? What about B and C?
Solution:
Note that the sample space is S = {(i, j) | i, j = 1, 2, 3, 4, 5, 6} with each
outcome equally likely.
Then, A = {(1, 2),(2, 1)}
B = {(1, 6),(2, 5),(3, 4),(4, 3),(5, 2),(6, 1)}
C = {(1, 1),(1, 2),(1, 3),(1, 4),(1, 5),(1, 6),(2, 1),(3, 1),(4, 1),(5, 1),(6, 1)}
Hence, P(A | C) = P(A ∩ C) / P(C) = (2/36) / (11/36) = 2/11.
P(B | C) = P(B ∩ C) / P(C) = (2/36) / (11/36) = 2/11.
Note that P(A) = 2/36 ≠ P (A | C), so they are not independent. Similarly, P (B)
= 6/36 ≠ P(B | C), so they are not independent.
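A brute-force check of Q3 by enumerating the 36 equally likely outcomes:

```python
from itertools import product
from fractions import Fraction

outcomes = list(product(range(1, 7), repeat=2))  # all 36 ordered pairs (i, j)

A = {o for o in outcomes if sum(o) == 3}   # sum of the two dice equals 3
B = {o for o in outcomes if sum(o) == 7}   # sum of the two dice equals 7
C = {o for o in outcomes if 1 in o}        # at least one die shows a 1

n = len(outcomes)
print(Fraction(len(A & C), len(C)))              # P(A | C) = 2/11
print(Fraction(len(B & C), len(C)))              # P(B | C) = 2/11
print(Fraction(len(A), n), Fraction(len(B), n))  # P(A) = 2/36, P(B) = 6/36 -- both differ, so not independent
```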
Linear algebra applications
Why vectors and matrices?
Vectors
Vector arithmetic
– Operations such as vector addition and scalar multiplication: the result is a vector (see the sketch below)
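The sketch referenced above: basic vector arithmetic with NumPy (the particular vectors u and v are illustrative):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)         # element-wise sum        -> [5. 7. 9.]
print(u - v)         # element-wise difference -> [-3. -3. -3.]
print(2 * u)         # scalar multiplication   -> [2. 4. 6.]
print(np.dot(u, v))  # dot (inner) product     -> 32.0 (a scalar, not a vector)
```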
Matrices
Matrix arithmetic
Matrix-matrix multiplication
– vector-matrix multiplication is just a special case
Multiplication is associative: A ( B C ) = ( A B ) C
Multiplication is not commutative: A B ≠ B A (generally)
Transposition rule: ( A B )ᵀ = Bᵀ Aᵀ
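A quick NumPy check of these rules on small random matrices (the 3×3 size and random entries are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
C = rng.normal(size=(3, 3))

print(np.allclose(A @ (B @ C), (A @ B) @ C))   # associativity: True
print(np.allclose(A @ B, B @ A))               # commutativity: False in general
print(np.allclose((A @ B).T, B.T @ A.T))       # transposition rule: True
```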
Vector projection
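A minimal sketch of vector projection using the standard formula proj_b(a) = ((a·b) / (b·b)) b; the vectors a and b below are illustrative:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

# Projection of a onto b: the component of a that points along b
proj = (np.dot(a, b) / np.dot(b, b)) * b
print(proj)  # [3. 0.]
```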
Vector and Matrix Operation
Consider a vector [x, y]ᵀ, representing the point (x, y) in the plane.
E.g. scaling the x-coordinate by 2 maps the point (2, 3) to (4, 3):
[ 2  0 ]   [ 2 ]   [ 4 ]
[ 0  1 ] ∗ [ 3 ] = [ 3 ]
Vector and Matrix Operation…
Consider a vector [x, y]ᵀ.
E.g. uniform scaling by 2 maps the point (2, 3) to (4, 6):
[ 2  0 ]   [ 2 ]   [ 4 ]
[ 0  2 ] ∗ [ 3 ] = [ 6 ]
Matrix Transformation
• Translation: [x′, y′]ᵀ = [x, y]ᵀ + [tx, ty]ᵀ
• Rotation: [x′, y′]ᵀ = [ cos(θ)  −sin(θ) ; sin(θ)  cos(θ) ] ∗ [x, y]ᵀ
• Scaling: [x′, y′]ᵀ = [ Sx  0 ; 0  Sy ] ∗ [x, y]ᵀ
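Applying the three transformations above to the point (2, 3) with NumPy; the translation offset and the rotation angle are illustrative choices:

```python
import numpy as np

p = np.array([2.0, 3.0])
theta = np.pi / 2  # rotate by 90 degrees

translation = p + np.array([1.0, -1.0])               # shift by (tx, ty) = (1, -1)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]]) @ p
scaling = np.array([[2.0, 0.0],
                    [0.0, 2.0]]) @ p                   # Sx = Sy = 2

print(translation)  # [ 3.  2.]
print(rotation)     # [-3.  2.] (up to floating-point rounding)
print(scaling)      # [4. 6.]
```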
Transformation of an Image
• An image of a fern-like fractal (Barnsley's fern) that exhibits affine self-similarity.
• Each of the leaves of the fern is related to each other leaf by an affine transformation.
• For instance, the red leaf can be transformed into both the dark blue leaf and any of the light blue leaves by a combination of reflection, rotation, scaling, and translation.
https://en.wikipedia.org/wiki/Affine_transformation
Optimization theory topics
Maximum likelihood
Expectation maximization
Gradient descent