Introduction to Machine Learning (CS 771A, IIT Kanpur): Course Notes and Exercises
Suggested Citation: P. Kar. Introduction to Machine Learning (CS 771A, IIT Kanpur), Course Notes and
Exercises, 2019.
Purushottam Kar
IIT Kanpur
[email protected]
This monograph may be used freely for the purpose of research and self-study.
If you are an instructor/professor/lecturer at an educational institution and
wish to use these notes to offer a course of your own, it would be nice if you
could drop a mail to the author at the email address [email protected]
mentioning the same.
Contents

Acknowledgements

Appendices

A Calculus Refresher
A.1 Extrema
A.2 Derivatives
A.3 Second Derivative
A.4 Stationary Points
A.5 Useful Rules for Calculating Derivatives
A.6 Multivariate Functions
A.7 Visualizing Multivariate Derivatives
A.8 Useful Rules for Calculating Multivariate Derivatives
A.9 Subdifferential Calculus
A.10 Exercises

References
Introduction to Machine Learning
(CS 771A, IIT Kanpur)
Purushottam Kar∗
IIT Kanpur; [email protected]
ABSTRACT
Machine Learning is the art and science of designing algorithms that can learn patterns and concepts from data to modify their own behavior without being explicitly programmed to do so. This monograph is intended to accompany a course on an introduction to the design of machine learning algorithms with a modern outlook. Some of the topics covered herein are Preliminaries (multivariate calculus, linear algebra, probability theory), Supervised Learning (local/proximity-based methods, learning by function approximation, learning by probabilistic modeling), Unsupervised Learning (discriminative models, generative models), practical aspects of machine learning, and additional topics.

Although the monograph will strive to be self-contained and revisit basic tools such as calculus, probability, and linear algebra, the reader is advised to not rely completely on these refresher discussions but also refer to a standard textbook on these topics.
∗ The contents of this monograph were developed as a part of successive offerings of this course at IIT Kanpur.
Note that we have $\beta_i \xi_i$ occurring with a negative sign in the above because the constraints are $-\xi_i \le 0$ and not $\xi_i \le 0$.
4. Step 3 (Create the dual problem). This is easy to do as well. Just keep in mind that the dual variables are constrained to be non-negative. We use the shorthand $x \ge 0$ to say that all coordinates of the vector $x$ must be non-negative, i.e., $x_i \ge 0$ for all $i$.

$$\max_{\alpha \ge 0,\, \beta \ge 0}\ \min_{\substack{w \in \mathbb{R}^d \\ b \in \mathbb{R} \\ \xi \in \mathbb{R}^n}} \left\{ \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(1 - \xi_i - y^i (w^\top x^i + b)\right) - \sum_{i=1}^n \beta_i \xi_i \right\}$$
5. Step 4 (Apply first-order optimality with respect to all primal variables). Recall that we do this because the dual problem places no more constraints on the primal variables and the Lagrangian is a differentiable function of the primal variables, so the derivatives of the Lagrangian with respect to all the primal variables must vanish.
(a) Optimality w.r.t. $w$. Setting $\frac{\partial L}{\partial w} = 0$ gives us $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$.

(b) Optimality w.r.t. $b$. Setting $\frac{\partial L}{\partial b} = 0$ gives us $\sum_{i=1}^n \alpha_i y^i = 0$.

(c) Optimality w.r.t. $\xi_i$. Setting $\frac{\partial L}{\partial \xi_i} = 0$ gives us $\alpha_i + \beta_i = C$.
The above identities are necessarily true at the optimum, so we take them as constraints in the dual problem. Note that we already have non-negativity constraints on the dual variables, i.e., $\alpha, \beta \ge 0$.
$$\max_{\alpha, \beta \in \mathbb{R}^n}\ \min_{\substack{w \in \mathbb{R}^d \\ b \in \mathbb{R} \\ \xi \in \mathbb{R}^n}} \left\{ \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(1 - \xi_i - y^i (w^\top x^i + b)\right) - \sum_{i=1}^n \beta_i \xi_i \right\}$$

subject to the constraints listed above, i.e., $\alpha, \beta \ge 0$, $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$, $\sum_{i=1}^n \alpha_i y^i = 0$, and $\alpha_i + \beta_i = C$.
Applying $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$ tells us that $\sum_{i=1}^n \alpha_i y^i \cdot w^\top x^i = \|w\|_2^2$, which further simplifies the objective to
$$\sum_{i=1}^n \alpha_i - \frac{1}{2}\|w\|_2^2$$
Substituting the expression for $w$ into $\|w\|_2^2$ once more gives us the objective function
$$\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$
Thus, we obtain the dual problem with the primal variables completely eliminated from the objective.
$$\max_{\alpha, \beta \in \mathbb{R}^n}\ \min_{w \in \mathbb{R}^d}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$
s.t. $\alpha_i \ge 0$, for $i = 1, \ldots, n$
$\beta_i \ge 0$, for $i = 1, \ldots, n$
$w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$
$\sum_{i=1}^n \alpha_i y^i = 0$
$\alpha_i + \beta_i = C$
Note that we have removed the variables $b, \xi$ from the optimization problem since they no longer appear anywhere, either in the constraints or the objective. However, we still have a constraint, namely $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$, but since $w$ appears nowhere else, this constraint merely defines $w$ and can be dropped as well. Similarly, the variables $\beta_i$ appear only in the constraints $\beta_i \ge 0$ and $\alpha_i + \beta_i = C$; they are not required themselves and were just an indirect way of the dual problem telling us that we must have $\alpha_i \le C$.
$$\max_{\alpha \in \mathbb{R}^n}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$
s.t. $\alpha_i \in [0, C]$, for $i = 1, \ldots, n$
$\sum_{i=1}^n \alpha_i y^i = 0$
The above is indeed the final form of the dual of the CSVM problem that is
used in professional solvers such as liblinear and sklearn.
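As a quick illustration (our own addition, not part of the original derivation), here is a minimal Python sketch, assuming scikit-learn and NumPy are available, that lets us inspect the dual variables such a solver computes. The synthetic data and all variable names here are hypothetical; SVC with a linear kernel solves exactly this dual, and its dual_coef_ attribute stores $\alpha_i y^i$ for the support vectors.

    # A minimal sketch (assuming scikit-learn and NumPy are installed).
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))                   # made-up data
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    alpha_y = clf.dual_coef_.ravel()               # alpha_i * y^i for support vectors
    alphas = np.abs(alpha_y)                       # recover alpha_i since |y^i| = 1
    print("0 <= alpha_i <= C:", np.all((alphas >= 0) & (alphas <= 1.0 + 1e-9)))
    print("sum_i alpha_i y^i = 0:", np.isclose(alpha_y.sum(), 0.0))

    # The primal solution is recovered as w = sum_i alpha_i y^i x^i
    w = alpha_y @ clf.support_vectors_
    print("w matches solver:", np.allclose(w, clf.coef_.ravel()))

The two printed checks correspond precisely to the two constraints of the final dual above, and the last line verifies the constraint $w = \sum_i \alpha_i y^i \cdot x^i$ that we eliminated.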
We note that the above derivation is by no means an indication of what must happen while deriving the dual problems for other optimization problems. Specifically, we warn the reader about the following:
The author is thankful to the students of successive offerings of the course for their inputs and for pointing out various errata in the lecture material. This monograph was typeset using the beautiful style of the Foundations and Trends® series.
Appendices
A Calculus Refresher
In this chapter we will take a look at basic tools from calculus that will be required to design and execute machine learning algorithms. Before we proceed, we caution the reader that the treatment in this chapter will not be mathematically rigorous; frequently, we will appeal to concepts and results based on informal arguments and demonstrations, rather than proper proofs. This is done in order to provide the reader with a working knowledge of the topic without getting into excessive formalism. We direct the reader to texts in mathematics, of which several excellent ones are available, for a more rigorous treatment of this subject.
A.1 Extrema
For example, the function $f(x) = \sin(x)$ has global maxima at all values of $x$ of the form $2k\pi + \frac{\pi}{2}$ and global minima at all values of $x$ that are of the form $2k\pi - \frac{\pi}{2}$, where $k$ is an integer.
However, apart from global extrema, which achieve the largest or the smallest value of a function among all possible input points, we can also have local extrema, i.e., local minima and local maxima. These are points which achieve the best value of the function (min for local minima and max for local maxima) only in a certain (possibly small) region surrounding the point.
A practical example to understand the distinction between local and global extrema is that of population: the city of Kanpur has a large population (3.2 million), which is the highest among cities within the state of Uttar Pradesh. Thus, Kanpur is at least a local maximum. However, it is not a global maximum since if we go outside the state of Uttar Pradesh, we find cities like Mumbai with a population of 12.4 million. However, even Mumbai is only a local maximum (among cities within India) since the global maximum (the largest population among all cities on Earth) is achieved at Chongqing, China, which has a population of 30.1 million (source: Wikipedia).
It should be clear from the above definitions that all global extrema are necessarily local extrema. For example, Chongqing clearly has the largest population within China itself and is thus a local maximum. However, not all local extrema need be global extrema.
A.2 Derivatives
Derivatives are an integral part of calculus (pun intended) and are the most direct way of finding how function values change (increase/decrease) if we move from one point to another. Given a univariate function, i.e., a function $f : \mathbb{R} \to \mathbb{R}$ that takes a single real number as input and outputs a real number (we will take care of multivariate functions later), the derivative of $f$ at a point $x_0$ tells us two things. Firstly, if the sign of the derivative is positive, i.e., $f'(x_0) > 0$, then the function value will increase if we move a little bit to the right on the number line (i.e., go from $x_0$ to $x_0 + \Delta x$ for some $\Delta x > 0$) and it will decrease if we move a little bit to the left on the number line. Similarly, if $f'(x_0) < 0$, then moving right decreases the function value whereas moving left increases it.
Secondly, the magnitude of the derivative, i.e., $|f'(x_0)|$, tells us by how much the function value would change, via the first-order Taylor approximation $f(x_0 + \Delta x) \approx f(x_0) + f'(x_0) \cdot \Delta x$ for small $\Delta x$. How small is small enough for the above result to hold may depend on both the function $f$ as well as the point $x_0$ where we are applying the result.
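To make this concrete, here is a quick numerical illustration (our own addition, not from the original text) of the first-order approximation, using $f(x) = \sin(x)$ as an assumed example:

    # For small delta, f(x0 + delta) is approximated by f(x0) + f'(x0) * delta.
    import math

    f = math.sin                       # f(x) = sin(x), so f'(x) = cos(x)
    x0, delta = 1.0, 1e-3
    actual = f(x0 + delta)
    approx = f(x0) + math.cos(x0) * delta
    print(actual, approx)              # agree to roughly delta^2 accuracy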
Just as the derivative of a function tells us how the function value changes (i.e., goes up/down) and by how much, the second derivative tells us how the derivative changes (i.e., goes up/down) and by how much. Intuitively, the second derivative can be thought of as similar to acceleration if we consider the derivative as similar to velocity and the function value as similar to displacement. If at a point $x_0$ we have $f''(x_0) > 0$, then this means that the derivative will go up if we move to the right and go down if we move to the left (similarly if $f''(x_0) < 0$ at a point).
Taylor's theorem does extend to second-order derivatives as well: $f(x_0 + \Delta x) \approx f(x_0) + f'(x_0) \cdot \Delta x + \frac{1}{2} f''(x_0) \cdot (\Delta x)^2$.
If $f'(x_0) = 0$ and $f''(x_0) > 0$, then $x_0$ is a local minimum. This result follows directly from the second-order Taylor's theorem we studied above. Since $f'(x_0) = 0$, we have
$$f(x_0 + \Delta x) \approx f(x_0) + \frac{1}{2} f''(x_0) \cdot (\Delta x)^2 \ge f(x_0)$$
This means that irrespective of whether $\Delta x < 0$ or $\Delta x > 0$ (i.e., irrespective of whether we move left or right), the function value always increases. Recall that this is the very definition of a local minimum. Similarly, we can intuitively see that if $f'(x_0) = 0$ and $f''(x_0) < 0$ then $x_0$ is definitely a local maximum.
If we have $f'(x_0) = 0$ and $f''(x_0) = 0$ at a point, then the second derivative test is actually silent and fails to tell us anything informative. The reader is warned that the first and second derivatives both vanishing does not mean that the point is a saddle point. For example, consider the case of the function $f(x) = (x-2)^4$. Clearly $x_0 = 2$ is a local (and global) minimum. However, it is also true that $f'(2) = 0 = f''(2)$. In such inconclusive cases, higher-order derivatives, e.g., $f^{(3)}(x) = f'''(x)$, $f^{(4)}(x)$, have to be used to figure out the status of our stationary point.
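The following sketch (our own illustration, assuming the SymPy library is available) works through exactly this inconclusive case symbolically:

    # Higher-order derivatives resolve an inconclusive second derivative test.
    import sympy as sp

    x = sp.symbols("x")
    f = (x - 2)**4
    # first and second derivatives both vanish at x0 = 2: inconclusive
    print(sp.diff(f, x).subs(x, 2), sp.diff(f, x, 2).subs(x, 2))    # 0 0
    # third derivative also vanishes; fourth is positive
    print(sp.diff(f, x, 3).subs(x, 2), sp.diff(f, x, 4).subs(x, 2)) # 0 24
    # The first nonzero higher derivative has even order and is positive,
    # which indicates a local minimum at x0 = 2.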
Several rules exist that can help us calculate the derivative of complex-looking
functions with relative ease. These are given below followed by some examples
applying them to problems.
1. (Constant Rule) If $h(x) = c$ where $c$ is not a function of $x$, then $h'(x) = 0$.

4. (Product Rule) If $h(x) = f(x) \cdot g(x)$, then $h'(x) = f'(x) \cdot g(x) + g'(x) \cdot f(x)$.

5. (Quotient Rule) If $h(x) = \frac{f(x)}{g(x)}$, then $h'(x) = \frac{f'(x) \cdot g(x) - g'(x) \cdot f(x)}{g^2(x)}$.

For example, if $f(x) = \sin(x)$ then $f'(x) = \cos(x)$, and if $f(x) = \cos(x)$ then $f'(x) = -\sin(x)$. The most common use of the chain rule is finding $f'(x)$ when $f$ is a function of some variable, say $t$, but $t$ itself is a function of $x$, i.e., $t = g(x)$.
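To make the chain rule concrete, here is a small numerical check (our own illustration, not from the text), using the assumed example $f(x) = \sin(x^2)$, so that $t = g(x) = x^2$ and the chain rule predicts $f'(x) = \cos(x^2) \cdot 2x$:

    import math

    def num_deriv(f, x, h=1e-6):
        # central-difference approximation of f'(x)
        return (f(x + h) - f(x - h)) / (2 * h)

    x0 = 1.3
    chain = math.cos(x0**2) * 2 * x0                      # chain rule prediction
    print(num_deriv(lambda x: math.sin(x**2), x0), chain) # nearly equal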
The gradient also has the very useful property of being the direction of steepest ascent. This means that among all the directions in which we could move, if we move along the direction of the gradient, then the function value would experience the maximum amount of increase. However, for machine learning applications, a related property holds more importance: among all the directions in which we could move, if we move along the direction opposite to that of the gradient, i.e., we move along $-\nabla f(x^0)$, then the function value would experience the maximum amount of decrease. This means that the direction opposite to the gradient offers the steepest descent.
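This steepest-descent property is exactly what gradient descent exploits. Here is a minimal sketch (our own addition; the target point and step size are made up), assuming NumPy is available:

    # Repeatedly step along -grad f, the direction of steepest descent.
    import numpy as np

    def grad_f(x):
        # gradient of f(x) = ||x - c||^2 / 2 for a fixed target c
        return x - np.array([1.0, -2.0])

    x = np.zeros(2)
    eta = 0.1                       # step size
    for _ in range(200):
        x = x - eta * grad_f(x)     # move opposite to the gradient
    print(x)                        # converges to the minimizer (1, -2)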
Second derivatives play a similar role of documenting how the first derivative changes as we move a little bit from point to point. However, since we have $d$ partial derivatives here and $d$ possible axis directions along which to move, the second derivative for multivariate functions is actually a $d \times d$ matrix, called the Hessian and denoted as $\nabla^2 f(x^0)$.
Clairaut's theorem tells us that if the function $f$ is "nice" (basically, the second-order partial derivatives are all continuous), then $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$, i.e., the Hessian matrix is symmetric. The $(i,j)$-th entry of this Hessian matrix, styled as $\frac{\partial^2 f}{\partial x_i \partial x_j}$, records how much the $i$-th partial derivative changes if we move a little bit along the $j$-th axis, i.e., if $\Delta x = (0, 0, \ldots, 0, \delta, 0, \ldots, 0)$ with $\delta$ at the $j$-th coordinate, then
$$\frac{\partial f}{\partial x_i}(x^0 + \Delta x) \approx \frac{\partial f}{\partial x_i}(x^0) + \frac{\partial^2 f}{\partial x_i \partial x_j}(x^0) \cdot \delta, \quad \text{if } \Delta x \text{ is "small".}$$
Just as in the univariate case, the Hessian can be incorporated into Taylor's theorem to obtain a finer approximation of the change in function value. Denoting $H = \nabla^2 f(x^0)$ for the sake of notational simplicity,
$$f(x^0 + \Delta x) \approx f(x^0) + \nabla f(x^0)^\top \Delta x + \frac{1}{2} (\Delta x)^\top H (\Delta x)$$
Just as in the univariate case, here also we define stationary points as those
where the gradient of the function vanishes i.e. ∇f (x0 ) = 0. As before, sta-
tionary points can either be local minima/maxima or else saddle points and
the second derivative test is used to decide which is the case. However, the
multivariate second derivative test looks a bit different.
If the Hessian of the function is positive semi-definite (PSD) at a stationary point $x^0$, i.e., $\nabla f(x^0) = 0$ and $H = \nabla^2 f(x^0) \succeq 0$, then $x^0$ is definitely a local minimum. Recall that a square symmetric matrix $A \in \mathbb{R}^{d \times d}$ is called positive semi-definite if for all vectors $v \in \mathbb{R}^d$, we have $v^\top A v \ge 0$. As before, this result follows directly from the multivariate second-order Taylor's theorem we studied above. Since $\nabla f(x^0) = 0$, we have
$$f(x^0 + \Delta x) \approx f(x^0) + \frac{1}{2} (\Delta x)^\top H (\Delta x) \ge f(x^0)$$
This means that no matter in which direction we move from $x^0$, the function value always increases. This is the very definition of a local minimum. Similarly, we can intuitively see that if the Hessian of the function is negative semi-definite (NSD) at a stationary point $x^0$, i.e., $\nabla f(x^0) = 0$ and $\nabla^2 f(x^0) \preceq 0$, then $x^0$ is a local maximum. Recall that a square symmetric matrix $A \in \mathbb{R}^{d \times d}$ is called negative semi-definite if for all vectors $v \in \mathbb{R}^d$, we have $v^\top A v \le 0$.
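In practice, one convenient way to check these conditions is via eigenvalues: a symmetric matrix is PSD iff all its eigenvalues are non-negative, and NSD iff all are non-positive. A minimal sketch of this test (our own illustration, with made-up example matrices), assuming NumPy:

    import numpy as np

    def classify_stationary_point(H, tol=1e-9):
        eig = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
        if np.all(eig >= -tol):
            return "local minimum (Hessian PSD)"
        if np.all(eig <= tol):
            return "local maximum (Hessian NSD)"
        return "saddle point (neither PSD nor NSD)"

    print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, 1.0]])))
    print(classify_stationary_point(np.array([[2.0, 0.0], [0.0, -1.0]])))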
We now take a toy example to help the reader visualize how multivariate derivatives operate. We will take $d = 2$ to allow us to explicitly show gradients and function values on a 2D grid. The function we will study will not be continuous but discrete, but it will nevertheless allow us to revise the essential aspects of the topics we studied above.

Consider the function $f : [0, 8] \times [0, 6] \to \mathbb{R}$ shown in the accompanying figure. The function is discrete: darker shades indicate a higher function value (which is also written inside the boxes) and lighter shades indicate a smaller function value. Since discrete functions are non-differentiable, we will use approximations to calculate the gradient of this function at all the points. Note that the inputs to this function are two integers $(x, y)$ where $0 \le x \le 8$ and $0 \le y \le 6$.
Given this, we may estimate the gradient of the function at a point $(x_0, y_0)$ using the formula $\nabla f(x_0, y_0) = \left( \frac{\Delta f}{\Delta x}, \frac{\Delta f}{\Delta y} \right)$ where
$$\frac{\Delta f}{\Delta x} = \frac{f(x_0 + 1, y_0) - f(x_0 - 1, y_0)}{2}, \qquad \frac{\Delta f}{\Delta y} = \frac{f(x_0, y_0 + 1) - f(x_0, y_0 - 1)}{2}$$
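This central-difference estimate is easy to code up. Here is a sketch (our own addition; the grid values below are made up, since the figure itself is not reproduced here), assuming NumPy:

    import numpy as np

    F = np.array([[0, 1, 2, 1, 0],
                  [1, 2, 4, 2, 1],
                  [0, 1, 2, 1, 0]], dtype=float)   # F[y, x]: toy grid values

    def grid_grad(F, x0, y0):
        dfdx = (F[y0, x0 + 1] - F[y0, x0 - 1]) / 2
        dfdy = (F[y0 + 1, x0] - F[y0 - 1, x0]) / 2
        return dfdx, dfdy

    print(grid_grad(F, 2, 1))   # (0.0, 0.0): the centre is a stationary point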
The values of the gradients calculated using the above formula are shown in the accompanying figure. Notice that we have five locations where the gradient vanishes: (4,5), (1,3), (4,3), (7,3) and (4,1). These are stationary points. It may be more instructive to see the gradients represented as arrows, as the figure does. Notice that gradients converge toward the local maxima (4,5) and (4,1) from all directions (this is expected since such a point has a greater function value than all its neighbors). Similarly, gradients diverge away from the local minima (1,3) and (7,3) in all directions (this is expected as well since such a point has a smaller function value than all its neighbors). However, the point (4,3), being a saddle point, has gradients converging to it in the $x$ direction but diverging away from it in the $y$ direction. In order to verify which of our stationary points are local maxima/minima and which are saddle points, we need to estimate the Hessian of this function.
$$f_{yx} = \frac{\frac{\Delta f}{\Delta y}(x_0 + 1, y_0) - \frac{\Delta f}{\Delta y}(x_0 - 1, y_0)}{2}$$
Deriving these formulae for approximating mixed partial derivatives is relatively simple but we do not do so here. Also, the expression for $\frac{\Delta^2 f}{\Delta x \Delta y}$, which seems needlessly complicated due to the average involved, was made so in order to make sure that we obtain a symmetric matrix as the approximation to the Hessian (since Clairaut's theorem does not apply to our toy example, symmetry is not automatically guaranteed). However, any dissatisfaction with the formulae aside, we can verify that the Hessian is indeed PSD at the local minima, NSD at the local maxima, and neither NSD nor PSD at the saddle point. This verifies our earlier second derivative test rules.
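Continuing our earlier sketch (again our own illustration, on the same made-up toy grid rather than the figure's actual values), the symmetrised finite-difference Hessian can be estimated and classified as follows:

    import numpy as np

    F = np.array([[0, 1, 2, 1, 0],
                  [1, 2, 4, 2, 1],
                  [0, 1, 2, 1, 0]], dtype=float)   # same toy grid as before

    def grid_hessian(F, x0, y0):
        fxx = F[y0, x0 + 1] - 2 * F[y0, x0] + F[y0, x0 - 1]
        fyy = F[y0 + 1, x0] - 2 * F[y0, x0] + F[y0 - 1, x0]

        def dfdy(x):
            return (F[y0 + 1, x] - F[y0 - 1, x]) / 2

        fxy = (dfdy(x0 + 1) - dfdy(x0 - 1)) / 2    # symmetrised mixed partial
        return np.array([[fxx, fxy], [fxy, fyy]])

    H = grid_hessian(F, 2, 1)
    print(np.linalg.eigvalsh(H))   # all negative: (2, 1) is a local maximum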
The above cases may tempt the reader to wonder what happens if we have a
function f : Rm → Rn which takes in an m-dimensional vector as input and
gives an n-dimensional vector as output. Indeed, the derivative in this case
must be an n × m matrix. Note that all the above cases fit this more general
rule. However, we will study this in detail later when we study Jacobians.
2. (Sum Rule) If h(x) = f (x) + g(x) then ∇h(x) = ∇f (x) + ∇g(x). This
rule can be derived by applying univariate sum rule repeatedly to each
dimension j ∈ [d].
Example A.4. Let $\sigma(x) = \frac{1}{1 + \exp(-a^\top x)}$ where $a \in \mathbb{R}^d$ is a constant vector that does not depend on $x$. Then we can write $\sigma = t^{-1}$ where $t(s) = 1 + \exp(s)$ and $s(x) = -a^\top x$. Thus, applying the chain rule tells us that $\nabla \sigma(x) = \sigma'(t) \cdot t'(s) \cdot \nabla s(x)$. By applying the rules above we have $\sigma'(t) = -\frac{1}{t^2}$ (polynomial rule), $t'(s) = \exp(s)$ (constant rule and exponential rule), and $\nabla s(x) = -a$ (dot product rule). This gives us
$$\nabla \sigma(x) = \frac{\exp(-a^\top x)}{(1 + \exp(-a^\top x))^2} \cdot a = \sigma(x)(1 - \sigma(x)) \cdot a.$$
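A quick numerical sanity check of this gradient formula (our own addition; the vectors used are arbitrary), assuming NumPy:

    import numpy as np

    a = np.array([0.5, -1.0, 2.0])
    sigma = lambda x: 1.0 / (1.0 + np.exp(-a @ x))

    x0 = np.array([0.3, 0.1, -0.2])
    analytic = sigma(x0) * (1 - sigma(x0)) * a      # sigma(x)(1 - sigma(x)) a

    h = 1e-6
    numeric = np.array([(sigma(x0 + h * e) - sigma(x0 - h * e)) / (2 * h)
                        for e in np.eye(3)])        # central differences
    print(np.allclose(analytic, numeric))           # True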
Example A.5. Let $f(x) = (a^\top x - b)^2$ where $a \in \mathbb{R}^d$ is a constant vector and $b \in \mathbb{R}$ is a constant scalar. Using the gradient chain rule, we get $\nabla f(x) = 2(a^\top x - b) \cdot a$. The $j$-th coordinate of the gradient is thus $2(a^\top x - b) \cdot a_j$, and so the $j$-th row of the Hessian is the vector $a^\top$ (transposed since it is a row vector) scaled by $2a_j$. Pondering on this for a moment tells us that $\nabla^2 f(x) = 2 \cdot aa^\top \in \mathbb{R}^{d \times d}$.

Example A.6. Let $A \in \mathbb{R}^{n \times d}$ be a constant matrix and $b \in \mathbb{R}^n$ be a constant vector, and define $f(x) = \|Ax - b\|_2^2$. If we let $a_i \in \mathbb{R}^d$ denote the vector formed out of the $i$-th row of the matrix $A$, then we can rewrite the function as $f(x) = \sum_{i=1}^n (x^\top a_i - b_i)^2$. It is useful to clarify here that although the vector $a_i$ was formed out of a row of a matrix, the vector itself is a column vector as per our convention for vectors. Using the sum rule for gradients along with the previous example gives us $\nabla f(x) = 2\sum_{i=1}^n (x^\top a_i - b_i) \cdot a_i$. Denoting the $j$-th coordinate of the gradient by $g_j$, the sum rule gives us $\nabla g_j = 2\sum_{i=1}^n (a_i)_j \cdot a_i$, and similarly as in the above example, we can deduce that $\nabla^2 f(x) = 2\sum_{i=1}^n a_i a_i^\top = 2 A^\top A$.
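Both formulas are easy to verify numerically. A sketch (our own illustration, with randomly generated $A$, $b$), assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(1)
    A, b = rng.normal(size=(5, 3)), rng.normal(size=5)
    x0 = rng.normal(size=3)

    f = lambda x: np.sum((A @ x - b) ** 2)
    g = lambda x: 2 * A.T @ (A @ x - b)       # claimed gradient

    h = 1e-6
    num_grad = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
                         for e in np.eye(3)])
    print(np.allclose(g(x0), num_grad))       # True: gradient is 2 A^T (Ax - b)

    num_H = np.array([(g(x0 + h * e) - g(x0 - h * e)) / (2 * h)
                      for e in np.eye(3)])
    print(np.allclose(num_H, 2 * A.T @ A))    # True: Hessian is 2 A^T A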
then $f$ may have multiple subgradients (in general an infinite number of subgradients) at $x^0$. The set of all subgradients at $x^0$ is called the subdifferential of $f$ at $x^0$ and is denoted as follows:
$$\partial f(x^0) = \left\{ g \in \mathbb{R}^d : f(x) \ge g^\top (x - x^0) + f(x^0) \text{ for all } x \right\}$$
4. (Max Rule) If $h(x) = \max\{f(x), g(x)\}$, then the following cases apply: if $f(x) > g(x)$ then $\partial h(x) = \partial f(x)$; if $f(x) < g(x)$ then $\partial h(x) = \partial g(x)$; and if $f(x) = g(x)$ then $\partial h(x)$ contains all vectors of the form $c \cdot u + (1 - c) \cdot v$ where $u \in \partial f(x)$, $v \in \partial g(x)$ and $c \in [0, 1]$.

Note that the max rule has no counterpart in regular calculus since functions of the form $h(x) = \max\{f(x), g(x)\}$ are usually non-differentiable.
Example A.11. Let $\ell(w) = [1 - y \cdot w^\top x]_+$ denote the hinge loss function applied to a data point $(x, y)$ along with the model $w$. Applying the max rule along with the chain rule gives us
$$\partial \ell(w) = \begin{cases} \{-y \cdot x\} & \text{if } y \cdot w^\top x < 1 \\ \{0\} & \text{if } y \cdot w^\top x > 1 \\ \{c \cdot y \cdot x : c \in [-1, 0]\} & \text{if } y \cdot w^\top x = 1 \end{cases}$$
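In code, one typically just picks a single valid subgradient from each case. A sketch of this (our own addition; the choice $c = -1$ at the kink is arbitrary, as any $c \in [-1, 0]$ would do), assuming NumPy:

    import numpy as np

    def hinge_subgradient(w, x, y):
        margin = y * (w @ x)
        if margin < 1:
            return -y * x             # loss is differentiable here
        if margin > 1:
            return np.zeros_like(x)   # loss is flat here
        return -1.0 * y * x           # kink: c = -1 is one valid choice

    w = np.array([0.5, -0.5])
    print(hinge_subgradient(w, np.array([1.0, 1.0]), 1))   # [-1. -1.]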
A.10 Exercises
Exercise A.1. Let $f(x) = x^4 - 4x^2 + 4$. Find all stationary points of this function. Which of them are local maxima and minima?
Exercise A.2. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be defined as $g(x, y) = f(x) + f(y) + 8$ where $f$ is defined in the exercise above. Find all stationary points of this function. Which of them are local maxima and minima? Which of these are saddle points?
Exercise A.3. Given a natural number $n \in \mathbb{N}$, e.g., 2, 8, 97, and a real number $x_0 \in \mathbb{R}$, design a function $f : \mathbb{R} \to \mathbb{R}$ so that $f^{(k)}(x_0) = 0$ for all $k = 1, 2, \ldots, n$. Here $f^{(k)}(x_0)$ denotes the $k$-th order derivative of $f$ at $x_0$, e.g., $f^{(1)}(x_0) = f'(x_0)$, $f^{(3)}(x_0) = f'''(x_0)$, etc.
Exercise A.4. Let $a, b \in \mathbb{R}^d$ be constant vectors and let $f(x) = a^\top x \, x^\top b$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Hint: write $f(x) = g(x) \cdot h(x)$ where $g(x) = a^\top x$ and $h(x) = b^\top x$, and apply the product rule.
Exercise A.5. Let $b \in \mathbb{R}^d$ be a constant vector and $A \in \mathbb{R}^{d \times d}$ be a constant symmetric matrix. Let $f(x) = b^\top A x$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Hint: write $f(x) = c^\top x$ where $c = A^\top b$.
Exercise A.6. Let $A, B, C \in \mathbb{R}^{d \times d}$ be three symmetric and constant matrices and $p, q \in \mathbb{R}^d$ be two constant vectors. Let $f(x) = (Ax + p)^\top C (Bx + q)$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Exercise A.7. Suppose we have $n$ constant vectors $a_1, \ldots, a_n \in \mathbb{R}^d$. Let $f(x) = \sum_{i=1}^n \ln\left(1 + \exp(-x^\top a_i)\right)$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Exercise A.8. Let $a \in \mathbb{R}^d$ be a constant vector and let $f(x) = a^\top x \cdot \|x\|_2^2$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Hint: the expressions may be more tedious with this one. Be patient and apply the product rule carefully to first calculate the gradient. Then move on to the Hessian by applying the dimensionality rule.
Exercise A.9. Show that for any convex function $f$ (whether differentiable or not), its subdifferential at any point $x^0$, i.e., $\partial f(x^0)$, is always a convex set.
Exercise A.10. For a vector $x \in \mathbb{R}^d$, its $L_1$ norm is defined as $\|x\|_1 \triangleq \sum_{j=1}^d |x_j|$. Let $f(x) \triangleq \|x\|_1 + [1 - x^\top a]_+$ where $a \in \mathbb{R}^d$ is a constant vector. Find the subdifferential $\partial f(x)$.
Exercise A.11. Let $f(x) = \max\left\{(x^\top a - b)^2, c\right\}$ where $a \in \mathbb{R}^d$ is a constant vector and $b, c \in \mathbb{R}$ are constant scalars. Find the subdifferential $\partial f(x)$.
Exercise A.12. Let $x, a \in \mathbb{R}^d$, $b \in \mathbb{R}$, where $a$ is a constant vector that does not depend on $x$ and $b$ is a constant real number that does not depend on $x$. Let $f(x) = a^\top x - b$. Find the subdifferential $\partial f(x)$.
B Convex Analysis Refresher
Convex sets and functions remain the favorites of practitioners working on machine learning algorithms since these objects have several beautiful properties that make it simple to design efficient algorithms. Of course, recent years have seen several strides in non-convex optimization as well, due to areas such as deep learning, robust learning, and sparse learning gaining prominence.
Only if a set contains all its line segments is it called convex. The reader would have noticed that convex sets bulge outwards in all directions. The presence of any inward bulges typically makes a set non-convex.
Proof. We first deal with the case of intersection. The intersection of two sets (not necessarily convex) is defined to be the set of all points that are contained in both the sets, i.e., $C_1 \cap C_2 \triangleq \left\{ x \in \mathbb{R}^d : x \in C_1 \text{ and } x \in C_2 \right\}$. Consider two points $x, y \in C_1 \cap C_2$. Since $x, y \in C_1$, we know that $z = \frac{x+y}{2} \in C_1$ since $C_1$ is convex. However, by the same argument, we get that $z = \frac{x+y}{2} \in C_2$ as well. Since $z \in C_1$ and $z \in C_2$, we conclude that $z \in C_1 \cap C_2$. This proves that the intersection of any two convex sets must necessarily be convex. The first figure above illustrates the intersection region of two convex sets.
The union of two sets (not necessarily convex) is defined to be the set of all points that are contained in either of the sets (including points that are present in both sets). More specifically, we define $C_1 \cup C_2 \triangleq \left\{ x \in \mathbb{R}^d : x \in C_1 \text{ or } x \in C_2 \right\}$. The second figure above shows that the union of two convex sets may be non-convex. However, the union of two convex sets may be convex in some very special cases, for example, if one set is contained in the other, i.e., $C_1 \subseteq C_2$, which is illustrated in the third figure.
Example B.2. Consider the set of all points which are at a Euclidean distance of at most 1 from the origin, i.e., the unit ball $B_2(1) \triangleq \left\{ x \in \mathbb{R}^d : \|x\|_2 \le 1 \right\}$. To show that this set is convex, we take $x, y \in B_2(1)$ and consider $z = \frac{x+y}{2}$. Now, instead of showing $\|z\|_2 \le 1$ (which would establish convexity), we will instead show $\|z\|_2^2 \le 1$, which is equivalent but easier to analyze. We have
$$\|z\|_2^2 = \left\| \frac{x+y}{2} \right\|_2^2 = \frac{\|x\|_2^2 + \|y\|_2^2 + 2 x^\top y}{4}.$$
Now, recall that the Cauchy-Schwarz inequality tells us that for any two vectors $a, b$, we have $a^\top b \le \|a\|_2 \|b\|_2$. Thus, we get
$$\|z\|_2^2 \le \frac{\|x\|_2^2 + \|y\|_2^2 + 2\|x\|_2 \|y\|_2}{4} \le \frac{1 + 1 + 2}{4} = 1.$$
The figures depict a convex function that lies below all its chords and a non-convex function which does not do so. It is also important to note that a non-convex function may lie below some of its chords (as the figure on the bottom shows); this does not make the function convex! Only if a function lies below all its chords is it called convex.
This definition holds true only for differentiable functions but is usually easier
to apply when checking whether a function is convex or not.
Definition B.5 (Tangent). Given a differentiable function $f : \mathbb{R}^d \to \mathbb{R}$, the tangent of the function at a point $x^0 \in \mathbb{R}^d$ is the hyperplane $y = \nabla f(x^0)^\top (x - x^0) + f(x^0)$, i.e., of the form $y = w^\top x + b$ where $w = \nabla f(x^0)$ and $b = f(x^0) - \nabla f(x^0)^\top x^0$.
Example B.3. Let us look at the example of the Euclidean norm $f(x) = \|x\|_2 = \sqrt{x^\top x}$. This function is non-differentiable at the origin, i.e., at $x = 0$, so we have to use the chord definition of convexity. Given two points $x, y \in \mathbb{R}^d$, we have
$$f\left(\frac{x+y}{2}\right) = \left\|\frac{x+y}{2}\right\|_2 = \frac{1}{2}\|x+y\|_2$$
by using the homogeneity property of the Euclidean norm (if we halve a vector, its length gets halved too). However, recall that the triangle inequality tells us that for any two vectors $p, q$, we have $\|p+q\|_2 \le \|p\|_2 + \|q\|_2$. This gives us
$$f\left(\frac{x+y}{2}\right) \le \frac{\|x\|_2 + \|y\|_2}{2} = \frac{f(x) + f(y)}{2}.$$
We can take convex functions and manipulate them to obtain new convex
functions. Here we explore some such operations that are useful in machine
learning applications.
3. The sum of two convex functions is always convex (see Theorem B.2).
Proof. We will use the chord definition of convexity here since there is no surety that $f$ and $g$ are differentiable. Consider two points $x, y \in \mathbb{R}^d$. We have
$$h\left(\frac{x+y}{2}\right) = f\left(\frac{x+y}{2}\right) + g\left(\frac{x+y}{2}\right) \le \frac{f(x) + f(y)}{2} + \frac{g(x) + g(y)}{2} = \frac{(f(x) + g(x)) + (f(y) + g(y))}{2} = \frac{h(x) + h(y)}{2},$$
where in the second step, we used the fact that $f$ and $g$ are both convex. This proves that $h$ is convex by the chord definition of convexity.
Proof. We will use the chord definition of convexity here since there is no surety that $f$ and $g$ are differentiable. Consider two points $x, y \in \mathbb{R}^d$. We have
$$h\left(\frac{x+y}{2}\right) = f\left(g\left(\frac{x+y}{2}\right)\right)$$
Now, since $g$ is convex, we have
$$g\left(\frac{x+y}{2}\right) \le \frac{g(x) + g(y)}{2}$$
Let us denote the left hand side of the above inequality by $p$ and the right hand side by $q$ for the sake of notational simplicity. Thus, the above inequality tells us that $p \le q$. However, since $f$ is non-decreasing, we get $f(p) \le f(q)$, i.e.,
$$f\left(g\left(\frac{x+y}{2}\right)\right) \le f\left(\frac{g(x) + g(y)}{2}\right)$$
Let us denote $u \triangleq g(x)$ and $v \triangleq g(y)$ for the sake of notational simplicity. Since $f$ is convex, we have
$$f\left(\frac{u+v}{2}\right) \le \frac{f(u) + f(v)}{2}$$
This is the same as saying
$$f\left(\frac{g(x) + g(y)}{2}\right) \le \frac{f(g(x)) + f(g(y))}{2} = \frac{h(x) + h(y)}{2}$$
Thus, with the chain of inequalities established above, we have shown that
$$h\left(\frac{x+y}{2}\right) \le \frac{h(x) + h(y)}{2}$$
which proves that $h$ is a convex function.
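The chord condition used throughout these proofs can also be probed numerically. Here is a crude sketch (our own illustration; random sampling can only find violations, never prove convexity), assuming NumPy:

    import numpy as np

    def seems_midpoint_convex(h, dim, trials=1000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            x, y = rng.normal(size=dim), rng.normal(size=dim)
            if h((x + y) / 2) > (h(x) + h(y)) / 2 + 1e-12:
                return False          # found a violated chord: not convex
        return True                   # no violation found (not a proof!)

    print(seems_midpoint_convex(lambda x: np.linalg.norm(x) ** 2, 3))  # True
    print(seems_midpoint_convex(lambda x: -np.linalg.norm(x), 3))      # False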
Example B.4. The functions $f(x) = \ln(x)$ and $g(x) = \sqrt{x}$ are concave. Since both of these are doubly differentiable functions, we may use the Hessian definition to decide their concavity. Recall that a function is concave if and only if its negation is convex. Let $p(x) = -\ln(x)$. Then $p''(x) = \frac{1}{x^2} \ge 0$ for all $x > 0$. This confirms that $p(x)$ is convex and that $\ln(x)$ is concave. Similarly, define $q(x) = -\sqrt{x}$. Then $q''(x) = \frac{1}{4x\sqrt{x}} \ge 0$ for all $x > 0$, which confirms that $q(x)$ is convex and that $\sqrt{x}$ is concave.
Example B.5. Let us show that the squared Euclidean norm, i.e., the function $h(x) = \|x\|_2^2$, is convex. We have already shown above that the function $g(x) = \|x\|_2$ is convex. We can write $h(x) = f(g(x))$ where $f(t) = t^2$. Now, $f''(t) = 2 > 0$, i.e., $f$ is convex by applying the Hessian rule for convexity. Also, $\|x\|_2 \ge 0$ for all $x \in \mathbb{R}^d$ and the function $f$ is indeed an increasing function on the non-negative half of the real line. Thus, Theorem B.3 tells us that $h(x)$ is convex.
Example B.6. Let us show that the hinge loss $\ell_{\mathrm{hinge}}(t) = \max\{1 - t, 0\}$ is a convex function. Note that the hinge loss function is treated as a univariate function here, i.e., $\ell_{\mathrm{hinge}} : \mathbb{R} \to \mathbb{R}$. Exercise B.6 shows us that affine functions are convex. Thus $f(t) = 1 - t$ and $g(t) = 0$ are both convex functions. Thus, by applying Theorem B.5, we conclude that the hinge loss function is convex.
Example B.7. We will now show that the objective function used in the C-SVM formulation is a convex function of the model vector $w$. For the sake of simplicity, we will show this result without the bias parameter $b$, although we stress that the result holds even if the bias parameter is present (recall that the bias can always be hidden inside the model by adding a fake dimension to the data). Let $\left\{(x^i, y^i)\right\}_{i=1}^n$ be $n$ data points with $x^i \in \mathbb{R}^d$ and $y^i \in \{-1, 1\}$. The objective can be written as $\frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^n \ell_{\mathrm{hinge}}(y^i \cdot w^\top x^i)$; the first term is convex by Example B.5, each term in the sum is the convex hinge loss (Example B.6) composed with a function that is affine in $w$, and sums of convex functions are convex, so the objective is convex.
B.4 Exercises
Exercise B.1. Consider the set of all points that are at less than or equal to unit Mahalanobis distance from the origin, i.e., $B_A(1) \triangleq \left\{ x \in \mathbb{R}^d : d_A(x, 0) \le 1 \right\}$. Show that $B_A(1)$ is a convex set.
Exercise B.2. Consider the hyperplane given by the equation $w^\top x + b = 0$, i.e., $H \triangleq \left\{ x \in \mathbb{R}^d : w^\top x + b = 0 \right\}$, where $w$ is the normal vector to the hyperplane and $b$ is the bias term. Show that $H$ is a convex set.
Exercise B.3. If I take a convex set and shift it, does it remain convex? Let $C \subset \mathbb{R}^d$ be a convex set and let $v \in \mathbb{R}^d$ be any vector (whether "small" or "large"). Define $C + v \triangleq \{x : x = z + v \text{ for some } z \in C\}$. Show that the set $C + v$ will always be convex, no matter what $v$ or convex set $C$ we choose.
C Probability Refresher

Example C.1. Dr. Strange is trying to analyze all possible outcomes of the
Infinity War using the Time Stone but there are way too many of them so
he analyzes only 75% of the n possible outcomes. The Avengers then sit down
to discuss the outcomes. Yet again, since there are so many of them, they
discuss only 25% of the n outcomes (much fewer outcomes are discussed by the
Avengers than were analyzed by Dr Strange since by now Thanos has snatched
away the Time Stone). They may discuss some outcomes that Dr. Strange
has already analyzed as well as some outcomes that he has not analyzed. It is
known that of the outcomes that were analyzed by Dr. Strange, only 30% got
discussed. Given this, can we find out what fraction of the discussed outcomes
were previously analyzed by Dr. Strange?
This problem may not seem like a probability problem to begin with. However, if we recall the interpretation of probability in terms of proportions, then it is easy to see this as a probability problem and also apply powerful tools from probability theory. When we say that Dr. Strange analyzes only 75% of the outcomes, this is just another way of saying that if we pick one of the $n$ outcomes uniformly at random (i.e., each outcome gets picked with equal probability $\frac{1}{n}$), then there is a $\frac{3}{4}$ probability that it would be an outcome that was analyzed by Dr. Strange.
With this realization in mind, we set up our probability space properly.
Our sample space is [n] consisting of the n possible outcomes. Each outcome
is equally likely to be picked i.e. each outcome gets picked with probability n1 .
We now define two indicator variables A and D. A = 1 if the chosen outcome
was analyzed by Dr. Strange and A = 0 otherwise. D = 1 if the chosen outcome
was discussed by the Avengers and D = 0 otherwise.
The problem statement tells us that $\mathbb{P}[A = 1] = \frac{3}{4}$ (since 75% of outcomes were analyzed) and $\mathbb{P}[D = 1] = \frac{1}{4}$ (since 25% of outcomes were discussed). The problem statement also tells us that of the analyzed outcomes, only 30% were discussed, which means that the number of outcomes that were both discussed and analyzed, divided by the number of outcomes that were analyzed, is $\frac{3}{10}$, i.e., $\frac{\mathbb{P}[D = 1, A = 1]}{\mathbb{P}[A = 1]} = \frac{3}{10}$. This is just another way of saying that $\mathbb{P}[D = 1 \mid A = 1] = \frac{3}{10}$.
The problem asks us to find the fraction of discussed outcomes that were
analyzed i.e. we want the number of outcomes that were both analyzed and
discussed, divided by the number of discussed outcomes. This is nothing but
$\mathbb{P}[A = 1 \mid D = 1]$. Applying Bayes' theorem now tells us
$$\mathbb{P}[A = 1 \mid D = 1] = \frac{\mathbb{P}[D = 1 \mid A = 1] \cdot \mathbb{P}[A = 1]}{\mathbb{P}[D = 1]} = \frac{\frac{3}{10} \cdot \frac{3}{4}}{\frac{1}{4}} = \frac{9}{10}$$
which means that 90% of the discussed outcomes were previously analyzed by
Dr. Strange. This is not surprising given that Dr. Strange analyzed so many
more outcomes than the Avengers were able to discuss.
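The same Bayes computation is a one-liner in code (our own addition, using the numbers from the example above):

    p_A = 3 / 4                  # P[A = 1]: outcome analyzed by Dr. Strange
    p_D = 1 / 4                  # P[D = 1]: outcome discussed by the Avengers
    p_D_given_A = 3 / 10         # P[D = 1 | A = 1]

    p_A_given_D = p_D_given_A * p_A / p_D
    print(p_A_given_D)           # 0.9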
Example C.2 (Problem 6.4 from Deisenroth et al. (2019)). There are two bags.
The first bag contains four mangoes and two apples; the second bag contains
four mangoes and four apples. We also have a biased coin, which shows “heads”
with probability 0.6 and “tails” with probability 0.4. If the coin shows “heads”,
we pick a fruit at random from bag 1; otherwise we pick a fruit at random
from bag 2. Your friend flips the coin (you cannot see the result), picks a fruit
uniformly at random from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2? (See note 1 below.)
To solve the above problem, let us define some random variables. Let $B \in \{1, 2\}$ denote the bag from which the fruit is picked and let $F \in \{M, A\}$ denote which fruit is selected (see note 2 below). The problem statement tells us the following: $\mathbb{P}[B = 1] = 0.6$ and $\mathbb{P}[B = 2] = 0.4$, since the outcome of the coin flip completely decides which bag we choose. Now, suppose we knew that the fruit was being sampled from bag 1; then, interpreting probabilities as proportions (since fruits are chosen uniformly at random from a bag), we have $\mathbb{P}[F = M \mid B = 1] = \frac{4}{4+2} = \frac{2}{3}$ and $\mathbb{P}[F = A \mid B = 1] = 1 - \mathbb{P}[F = M \mid B = 1] = \frac{1}{3}$. Similarly, we have $\mathbb{P}[F = M \mid B = 2] = \frac{4}{4+4} = \frac{1}{2} = \mathbb{P}[F = A \mid B = 2]$.
We are told that the fruit that was picked was indeed a mango and we are interested in knowing the chances that it was picked from bag 2. Thus, we are interested in $\mathbb{P}[B = 2 \mid F = M]$. Applying Bayes' theorem gives us
$$\mathbb{P}[B = 2 \mid F = M] = \frac{\mathbb{P}[F = M \mid B = 2] \cdot \mathbb{P}[B = 2]}{\mathbb{P}[F = M]}$$
We directly have values for the two terms in the numerator. The denominator, however, will have to be calculated by deriving the marginal probability $\mathbb{P}[F = M]$ from the joint probability distribution $\mathbb{P}_{F,B}$ using the sum rule (law of total probability) and the product rule:
$$\mathbb{P}[F = M] = \mathbb{P}[F = M, B = 1] + \mathbb{P}[F = M, B = 2] = \mathbb{P}[F = M \mid B = 1] \cdot \mathbb{P}[B = 1] + \mathbb{P}[F = M \mid B = 2] \cdot \mathbb{P}[B = 2]$$
Note 1: In this statement, the word "uniformly" was added (it was not present in Deisenroth et al. (2019)).
Note 2: The support of $F$ is kept as fruit names for the sake of easy identification. We can easily make this support numeric, say, by mapping $M = 1$ and $A = 0$.
$$= \frac{2}{3} \cdot \frac{3}{5} + \frac{1}{2} \cdot \frac{2}{5} = \frac{3}{5},$$
where in the first step we used the sum rule and in the second step we used the product rule. Putting things together gives us
$$\mathbb{P}[B = 2 \mid F = M] = \frac{\frac{1}{2} \cdot \frac{2}{5}}{\frac{3}{5}} = \frac{1}{3}.$$
Thus, there is a $\frac{1}{3}$ probability that the mango we got was picked from bag 2. The complement rule for conditional probability tells us that $\mathbb{P}[B = 1 \mid F = M] = 1 - \mathbb{P}[B = 2 \mid F = M] = \frac{2}{3}$, i.e., there is a much larger, $\frac{2}{3}$ probability that the mango we got was picked from bag 1. This is to be expected since not only is bag 1 more likely to be picked than bag 2, bag 1 also has a much larger proportion of mangoes than bag 2, which means that if we got a mango, it is more likely that it came from bag 1.
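As a sanity check (our own addition), the answer can also be estimated by Monte Carlo simulation of the coin flip and the fruit draw, conditioning on having drawn a mango:

    import random

    random.seed(0)
    mango_from_2 = mango_total = 0
    for _ in range(200_000):
        bag = 1 if random.random() < 0.6 else 2
        fruits = ["M"] * 4 + ["A"] * 2 if bag == 1 else ["M"] * 4 + ["A"] * 4
        if random.choice(fruits) == "M":
            mango_total += 1
            mango_from_2 += bag == 2
    print(mango_from_2 / mango_total)   # close to 1/3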
Example C.3. Let A, B denote two events such that P [A | B] = 0.36 = P [A].
Can we find P [A | ¬B] i.e. the probability that event A will take place given
that event B has not taken place?
Let us abuse notation to let A, B also denote the indicator variables for
the events i.e. A = 1 if A takes place and A = 0 otherwise and similarly for
B. Note that the problem statement has essentially told us that A and B are
independent events. Since $\mathbb{P}[A \mid B] = \mathbb{P}[A]$, by abusing notation we get
$$\frac{\mathbb{P}[A = 1, B = 1]}{\mathbb{P}[B = 1]} = \mathbb{P}[A = 1],$$
i.e., $\mathbb{P}[A = 1, B = 1] = \mathbb{P}[A = 1] \cdot \mathbb{P}[B = 1]$. Now, we are interested in the probability $\mathbb{P}[A = 1 \mid B = 0] = \frac{\mathbb{P}[A = 1, B = 0]}{\mathbb{P}[B = 0]}$. Since we have not been given $\mathbb{P}[B = 0]$ directly, we try to massage the numerator to see if we can get hold of something.
$$\mathbb{P}[A = 1, B = 0] = \mathbb{P}[A = 1] - \mathbb{P}[A = 1, B = 1] = \mathbb{P}[A = 1] - \mathbb{P}[A = 1] \cdot \mathbb{P}[B = 1] = \mathbb{P}[A = 1](1 - \mathbb{P}[B = 1]) = \mathbb{P}[A = 1] \, \mathbb{P}[B = 0],$$
where in the first step we used the sum rule (law of total probability), in the
second step, we exploited the independence of the two events and in the last
step, we used the complement rule. This gives us
$$\mathbb{P}[A = 1 \mid B = 0] = \frac{\mathbb{P}[A = 1, B = 0]}{\mathbb{P}[B = 0]} = \frac{\mathbb{P}[A = 1] \, \mathbb{P}[B = 0]}{\mathbb{P}[B = 0]} = \mathbb{P}[A = 1] = 0.36,$$
since the problem statement already tells us that P [A = 1] = 0.36.
Example C.4. Timmy is trying to kill some free time by typing random letters on his keyboard. He types 7 random capital letters (A-Z) on his keyboard. Each letter is chosen uniformly at random from the 26 letters and each choice is completely independent of the other choices. Can we find the probability that Timmy will end up typing the word COVFEFE?
Let $L_i$, $i \in [7]$, denote the random variable that tells us which letter was chosen at the $i$-th location in the word. As before, we will let the support of the random variables $L_i$ be the letters of the English alphabet rather than numbers, for the sake of easy identification. We can readily map the letters of the alphabet to $[26]$ to have numerical supports instead. We are interested in
$$\mathbb{P}[L_1 = C, L_2 = O, L_3 = V, L_4 = F, L_5 = E, L_6 = F, L_7 = E]$$
However, since the choices were made independently, applying the product rule for independent events tells us that the above is equal to
$$\mathbb{P}[L_1 = C] \cdot \mathbb{P}[L_2 = O] \cdot \mathbb{P}[L_3 = V] \cdot \mathbb{P}[L_4 = F] \cdot \mathbb{P}[L_5 = E] \cdot \mathbb{P}[L_6 = F] \cdot \mathbb{P}[L_7 = E]$$
Since each letter is chosen uniformly at random from the 26 letters of the alphabet, the above probability is simply $\left(\frac{1}{26}\right)^7$.
C.1 Exercises
Exercise C.1. Let Z denote a random variable which takes value 1 with prob-
ability p and value 2 with probability 1 − p. For what value of p ∈ [0, 1] does
this random variable have the highest variance?
Exercise C.2. Summers are here and people are buying soft drinks, say from
three brands – ThumsUp, Pepsi and Coke. Suppose the total sales of all the
brands put together is exactly 14 million units every day. Now, the market
share of the individual brands changes from day to day. However, it is also
known that ThumsUp sells an average of 8 million units per day and Coke sells 4 million units per day on average. How many units does Pepsi sell per day on average?
Exercise C.3. Suppose A and B are two events such that P [A] = 0.125
whereas B is an almost sure event i.e. P [B] = 1. What is P [A | B]? What
is P [B | A]?
Exercise C.4. The probability of life on a planet given that there is water on
it is 80%. If there is no water on a planet then life cannot exist on that planet.
The probability of finding water on a planet is 50%. Planet A has life on it
whereas Planet B does not have life on it. What is the probability that Planet
A has water? What is the probability that Planet B has water on it?
Exercise C.5. My sister and I both try to sell comic books to our classmates
at school. Since our target audience is limited, it is known that on any given
day, the two of us together sell at most 10 comic books. It is known that I sell 5 comic books on average every day. It is also known that my sister sells 3 comic books on average every day. How many comic books do the two of us sell together on average every day?
Exercise C.6. Martin has submitted his thesis proposal. The probability that
professor X will approve the proposal is 70%. The probability that professor
Y will approve the proposal is 50%. The probability that professor Z will approve the proposal is 40%. The approvals of the three professors are entirely independent of one another. Given that Martin has to get approval from at least two of the three professors to pass his proposal, what is the probability that his proposal will pass?
References

Deisenroth, M. P., A. A. Faisal, and C. S. Ong. (2019). Mathematics for Machine Learning. Cambridge University Press.