Introduction to Machine Learning

(CS 771A, IIT Kanpur)


Course Notes and Exercises

Suggested Citation: P. Kar. Introduction to Machine Learning (CS 771A, IIT Kanpur), Course Notes and
Exercises, 2019.

Purushottam Kar
IIT Kanpur
[email protected]

This monograph may be used freely for the purpose of research and self-study.
If you are an instructor/professor/lecturer at an educational institution and
wish to use these notes to offer a course of your own, it would be nice if you
could drop a mail to the author at the email address [email protected]
mentioning the same.
Contents

1 Support Vector Machines
    1.1 Derivation of the CSVM Dual

Acknowledgements

Appendices

A Calculus Refresher
    A.1 Extrema
    A.2 Derivatives
    A.3 Second Derivative
    A.4 Stationary Points
    A.5 Useful Rules for Calculating Derivatives
    A.6 Multivariate Functions
    A.7 Visualizing Multivariate Derivatives
    A.8 Useful Rules for Calculating Multivariate Derivatives
    A.9 Subdifferential Calculus
    A.10 Exercises

B Convex Analysis Refresher
    B.1 Convex Set
    B.2 Convex Functions
    B.3 Operations with Convex Functions
    B.4 Exercises

C Probability Theory Refresher
    C.1 Exercises

References
Introduction to Machine Learning
(CS 771A, IIT Kanpur)
Purushottam Kar¹*
¹ IIT Kanpur; [email protected]

ABSTRACT
Machine Learning is the art and science of designing algorithms
that can learn patterns and concepts from data to modify their
own behavior without being explicitly programmed to do so. This
monograph is intended to accompany a course on an introduction
to the design of machine learning algorithms with a modern outlook.
Some of the topics covered herein are Preliminaries (multivariate
calculus, linear algebra, probability theory), Supervised Learning
(local/proximity-based methods, learning by function approximation,
learning by probabilistic modeling), Unsupervised Learning
(discriminative models, generative models), practical aspects of
machine learning, and additional topics.
Although the monograph will strive to be self-contained and revisit
basic tools such as calculus, probability, and linear algebra, the
reader is advised to not completely rely on these refresher
discussions but also refer to a standard textbook on these topics.

* The contents of this monograph were developed as a part of successive offerings of various machine learning related courses at IIT Kanpur.


1 Support Vector Machines

1.1 Derivation of the CSVM Dual

Let us recall the CSVM primal problem

$$\begin{aligned} \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \quad & \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i \\ \text{s.t.} \quad & y^i(w^\top x^i + b) \ge 1 - \xi_i, \quad \text{for } i = 1,\ldots,n \\ & \xi_i \ge 0, \quad \text{for } i = 1,\ldots,n \end{aligned}$$

We follow the usual steps of deriving the dual problem below

1. Step 1 (Convert the problem into conventional form). The problem is
already a minimization problem so there is nothing to be done there. However,
we do need to convert the constraints into $\le 0$ type constraints

$$\begin{aligned} \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \quad & \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i \\ \text{s.t.} \quad & 1 - \xi_i - y^i(w^\top x^i + b) \le 0, \quad \text{for } i = 1,\ldots,n \\ & -\xi_i \le 0, \quad \text{for } i = 1,\ldots,n \end{aligned}$$

2. Step 2 (Introducing dual variables/Lagrange multipliers). Since there are
two sets of $n$ constraints above, let us introduce $\alpha_1, \ldots, \alpha_n$ for the
constraints of the kind $1 - \xi_i - y^i(w^\top x^i + b) \le 0$ and $\beta_1, \ldots, \beta_n$ for the
constraints of the kind $-\xi_i \le 0$.

3. Step 3 (Create the Lagrangian). This is easy to do

$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(1 - \xi_i - y^i(w^\top x^i + b)\right) - \sum_{i=1}^n \beta_i \xi_i$$

Note that the term $-\beta_i \xi_i$ occurs with a negative sign in the above
because the constraints are $-\xi_i \le 0$ and not $\xi_i \le 0$.


4. Step 4 (Create the dual problem). This is easy to do as well. Just keep
in mind that there are constraints on the dual variables: they must
be non-negative. We use the shorthand $x \ge 0$ to say that all coordinates
of the vector $x$ must be non-negative, i.e. $x_i \ge 0$ for all $i$.

$$\max_{\alpha \ge 0,\, \beta \ge 0}\ \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \left\{ \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(1 - \xi_i - y^i(w^\top x^i + b)\right) - \sum_{i=1}^n \beta_i \xi_i \right\}$$

5. Step 5 (Apply first order optimality with respect to all primal variables).
Recall that we do this since in the dual problem, there are no more
constraints on the primal variables and the Lagrangian is a differentiable
function of the primal variables, so the derivatives of the Lagrangian
must vanish with respect to all the primal variables.

(a) Optimality w.r.t. $w$. Setting $\frac{\partial \mathcal{L}}{\partial w} = 0$ gives us $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$.

(b) Optimality w.r.t. $b$. Setting $\frac{\partial \mathcal{L}}{\partial b} = 0$ gives us $\sum_{i=1}^n \alpha_i y^i = 0$.

(c) Optimality w.r.t. $\xi_i$. Setting $\frac{\partial \mathcal{L}}{\partial \xi_i} = 0$ gives us $\alpha_i + \beta_i = C$.

The above identities are necessarily true at the optimum, so we take them
as constraints in the dual problem. Note that we already have non-negativity
constraints on the dual variables, i.e. $\alpha, \beta \ge 0$.
$$\max_{\alpha, \beta \in \mathbb{R}^n}\ \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n}\ \frac{1}{2}\|w\|_2^2 + C \cdot \sum_{i=1}^n \xi_i + \sum_{i=1}^n \alpha_i \left(1 - \xi_i - y^i(w^\top x^i + b)\right) - \sum_{i=1}^n \beta_i \xi_i$$
$$\begin{aligned} \text{s.t.} \quad & \alpha_i \ge 0, \quad \text{for } i = 1,\ldots,n \\ & \beta_i \ge 0, \quad \text{for } i = 1,\ldots,n \\ & w = \sum_{i=1}^n \alpha_i y^i \cdot x^i \\ & \sum_{i=1}^n \alpha_i y^i = 0 \\ & \alpha_i + \beta_i = C \end{aligned}$$

6. Step 6 (Simplify the objective function by possibly eliminating primal
variables). We can rearrange terms in the objective function as

$$\sum_{i=1}^n \alpha_i + \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^n \alpha_i y^i \cdot w^\top x^i - b \sum_{i=1}^n \alpha_i y^i + \sum_{i=1}^n \xi_i (C - \alpha_i - \beta_i)$$

Applying $\sum_{i=1}^n \alpha_i y^i = 0$ and $\alpha_i + \beta_i = C$ simplifies the objective to

$$\sum_{i=1}^n \alpha_i + \frac{1}{2}\|w\|_2^2 - \sum_{i=1}^n \alpha_i y^i \cdot w^\top x^i$$

Applying $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$ tells us that $\sum_{i=1}^n \alpha_i y^i \cdot w^\top x^i = \|w\|_2^2$, which
further simplifies the objective to

$$\sum_{i=1}^n \alpha_i - \frac{1}{2}\|w\|_2^2$$

Applying $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$ once more completely eliminates $w$ from the
objective function

$$\sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$

Thus, we obtain the dual problem with primal variables completely eliminated
from the objective.

$$\max_{\alpha, \beta \in \mathbb{R}^n}\ \min_{w \in \mathbb{R}^d}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$
$$\begin{aligned} \text{s.t.} \quad & \alpha_i \ge 0, \quad \text{for } i = 1,\ldots,n \\ & \beta_i \ge 0, \quad \text{for } i = 1,\ldots,n \\ & w = \sum_{i=1}^n \alpha_i y^i \cdot x^i \\ & \sum_{i=1}^n \alpha_i y^i = 0 \\ & \alpha_i + \beta_i = C \end{aligned}$$

Note that we have removed the variables $b, \xi$ from the optimization problem
since they no longer appear anywhere, either in the constraints or the
objective. However, we still have a constraint, namely $w = \sum_{i=1}^n \alpha_i y^i \cdot x^i$,
which contains a primal variable. However, since we have already incorporated
that constraint while simplifying the objective and the objective
is no longer linked to the primal variable $w$ either directly or indirectly,
we may safely remove this constraint – we will use this constraint again
after solving the dual problem to reconstruct the model vector $w$ using
the dual solution $\alpha$. Thus, we have completely removed primal variables
from the objective and the constraints so we may as well remove them
from consideration altogether

$$\max_{\alpha, \beta \in \mathbb{R}^n}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$
$$\begin{aligned} \text{s.t.} \quad & \alpha_i \ge 0, \quad \text{for } i = 1,\ldots,n \\ & \beta_i \ge 0, \quad \text{for } i = 1,\ldots,n \\ & \sum_{i=1}^n \alpha_i y^i = 0 \\ & \alpha_i + \beta_i = C \end{aligned}$$

7. Step 7 (Simplify by possibly eliminating some dual variables). It turns
out that we can even eliminate the dual variables $\beta_i$. Note that the $\beta_i$
variables do not appear in the objective function at all. Instead, the
variable $\beta_i$ appears in just two constraints, namely $\beta_i \ge 0$ and $\alpha_i + \beta_i = C$.
The second constraint gives us $\beta_i = C - \alpha_i$. Putting this into the
first constraint gives us $\alpha_i \le C$. This means that the $\beta_i$ variables are
not required themselves and they were just an indirect way of the dual
problem telling us that we must have $\alpha_i \le C$.

$$\max_{\alpha \in \mathbb{R}^n}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y^i y^j \left\langle x^i, x^j \right\rangle$$
$$\begin{aligned} \text{s.t.} \quad & \alpha_i \in [0, C], \quad \text{for } i = 1,\ldots,n \\ & \sum_{i=1}^n \alpha_i y^i = 0 \end{aligned}$$

The above is indeed the final form of the dual of the CSVM problem that is
used in professional solvers such as liblinear and sklearn.
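To make the connection concrete, here is a minimal sketch (not part of the original notes; the toy data and parameter choices are arbitrary) showing how scikit-learn's SVC, which solves this very dual via libsvm, exposes the dual solution, and how the primal model vector is reconstructed from it using $w = \sum_i \alpha_i y^i x^i$:

    # Sketch: solve the CSVM dual with scikit-learn's SVC and reconstruct
    # w = sum_i alpha_i y^i x^i from the dual solution. Toy data only.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # dual_coef_ stores alpha_i * y^i for the support vectors (alpha_i = 0
    # for every other point), so w = dual_coef_ @ support_vectors_.
    w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
    print(np.allclose(w, clf.coef_.ravel()))              # True
    # The box constraint alpha_i in [0, C] appears as |alpha_i y^i| <= C.
    print(np.all(np.abs(clf.dual_coef_) <= 1.0 + 1e-9))   # True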
We note that the above derivation is by no means an indication of what
must happen while deriving the dual problems for other optimization problems.
Specifically, we warn the reader about the following

1. A constraint of the form αi ≥ 0 will always appear in a dual problem


if αi is a dual variable corresponding to a ≤ type constraint (unless of
course the αi variable gets eliminated altogether like we eliminated the
βi variable above). This is because being non-negative is an inherent
property of dual variables that correspond to ≤ type constraints.

2. However, constraints on dual variables such as αi ≤ C need not always


appear in the dual problem. Such constraints do appear if the primal
is the CSVM formulation which effectively uses the hinge loss. However,
such a constraint on αi need not appear (or some other funny-looking
constraint may appear instead) if we use some other primal formulation,
e.g. replace the hinge loss with logistic or squared loss. The reader would
have noticed how the constraint αi ≤ C appeared in the above derivation
because some other dual variable got eliminated, which is something that
need not happen with all optimization problems.

3. In more complicated optimization problems, it may not be possible to


eliminate dual variables like we eliminated βi above. However, if we are
lucky, we may be able to eliminate some dual variables and obtain a
simpler dual problem, which would then be faster to solve since there are
fewer dual variables to worry about.

4. In still more complicated optimization problems, it may not be possible


to even eliminate all the primal variables. In such cases, we still must
apply first order optimality with respect to all primal variables to obtain
constraints but we may find ourselves unable to exploit those constraints
to remove all the primal variables. If this happens, we will be left with
the dual problem looking like a max-min problem which we must solve as
is. Such max-min problems are called saddle point optimization problems.
So we would have reduced our dual problem to a saddle point problem.
However, in nice cases where we are able to completely remove the primal
variables, our dual problem gets reduced to a maximization problem.
Acknowledgements

The author is thankful to the students of successive offerings of the course for
their inputs and pointing out various errata in the lecture material. This
monograph was typeset using the beautiful style of the Foundations and Trends®
series published by now publishers.

Appendices
A Calculus Refresher

In this chapter we will take a look at basic tools from calculus that would
be required to design and execute machine learning algorithms. Before we
proceed, we caution the reader that the treatment in this chapter will not be
mathematically rigorous and frequently, we will appeal to concepts and results
based on informal arguments and demonstration, rather than proper proofs.
This was done in order to provide the reader with a working knowledge of the
topic without getting into excessive formalism. We direct the reader to texts in
mathematics, of which several excellent ones are available, for a more rigorous
treatment of this subject.

A.1 Extrema

The vast majority of machine learning algorithms learn models by trying to
obtain the best possible performance on training data. What changes from
algorithm to algorithm is how "performance" is defined and what constitutes
"best". Frequently, performance can be defined in terms of an objective function
$f$ that takes in a model (say, a linear model $w$) and outputs a real number
$f(w) \in \mathbb{R}$ called the objective value. Depending on the algorithm designer, a
large objective value may be better or a small score may be better (e.g. if $f$
encodes margin then we want a large objective value; on the other hand, if $f$
encodes the classification error then we want a small objective value).
Given a function $f : \mathbb{R}^d \to \mathbb{R}$, a point $x^* \in \mathbb{R}^d$ is said to be a global maximum of this
function if $f(x^*) \ge f(x)$ for all other $x \in \mathbb{R}^d$. We similarly define a global minimum of this
function as a point $\tilde{x}$ such that $f(\tilde{x}) \le f(x)$ for all other $x \in \mathbb{R}^d$. Note that a function
may have multiple global maxima and global minima. For example, the function
$f(x) = \sin(x)$ has global maxima at all values of $x$ of the form $2k\pi + \frac{\pi}{2}$
and global minima at all values of $x$ of the form $2k\pi - \frac{\pi}{2}$.
However, apart from global extrema, which achieve the largest or the smallest
value of a function among all possible input points, we can also have
local extrema, i.e. local minima and local maxima. These are points which
achieve the best value of the function (min for local minima and max for local
maxima) only in a certain (possibly small) region surrounding the point.
A practical example to understand the distinction between local and global
extrema can be that of population: the city of Kanpur has a large population
(3.2 million) which is the highest among cities within the state of Uttar
Pradesh. Thus, Kanpur is at least a local maximum. However, it is not a global
maximum since if we go outside the state of Uttar Pradesh, we find cities like
Mumbai with a population of 12.4 million. However, even Mumbai is only a local
maximum (among cities within India) since the global maximum (the largest
population among all cities on Earth) is achieved at Chongqing, China, which
has a population of 30.1 million (source: Wikipedia).
It should be clear from the above definitions that all global extrema are necessarily
local extrema. For example, Chongqing clearly has the largest population
within China itself and is thus a local maximum. However, not all local extrema
need be global extrema.

A.2 Derivatives

Derivatives are an integral part of calculus (pun intended) and are the most
direct way of finding how function values change (increase/decrease) if we
move from one point to another. Given a univariate function, i.e. a function
$f : \mathbb{R} \to \mathbb{R}$ that takes a single real number as input and outputs a real number
(we will take care of multivariate functions later), the derivative of $f$ at a
point $x_0$ tells us two things. Firstly, if the sign of the derivative is positive, i.e.
$f'(x_0) > 0$, then the function value will increase if we move a little bit to the
right on the number line (i.e. go from $x_0$ to $x_0 + \Delta x$ for some $\Delta x > 0$) and it
will decrease if we move a little bit to the left on the number line. Similarly,
if $f'(x_0) < 0$, then moving right decreases the function value whereas moving
left increases the function value.
Secondly, the magnitude of the derivative, i.e. $|f'(x_0)|$, tells us by how much the
function value would increase or decrease if we move a little bit left or right
from the point $x_0$. For example, consider the function $f(x) = x^2$. Its
derivative is $f'(x) = 2x$. This tells us that if $x_0 < 0$ (where the derivative is
negative), the function value would decrease if we moved right and increase if
we moved left. Similarly, if $x_0 > 0$, the derivative is positive and thus the
function value would increase if we moved to the right and decrease if we moved
to the left. However, since the magnitude of the derivative is $2|x|$, which
increases as we go away from the origin, it can be seen that the increase in
function value, for the same change in the value of $x$, is much steeper if $x_0$
is far from the origin.


It is important to note that the above observations (e.g. the function value goes
up if $f'(x_0) > 0$ and we move to the right) hold true only if the movement $\Delta x$
is "small". For example, $f(x) = x^2$ has a negative derivative at $x_0 = -2$ and
so the function value should decrease if we moved right a little bit. However,
if we move right too much (say we move to $x_0 = 3$) then the above promise
does not hold since $f(3) = 9 > 4 = f(-2)$. In fact, a corollary of Taylor's
theorem (first order) states

$$f(x_0 + \Delta x) \approx f(x_0) + f'(x_0) \cdot \Delta x, \quad \text{if } \Delta x \text{ is "small"}.$$

How small is small enough for the above result to hold may depend on both
the function $f$ as well as the point $x_0$ where we are applying the result.
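A quick numeric illustration of this point, as a small Python sketch (the function and step sizes below are our own arbitrary choices):

    # First-order Taylor: f(x0 + dx) ~ f(x0) + f'(x0) * dx, with the
    # approximation error shrinking as dx does. Here f(x) = x^2, f'(x) = 2x.
    f = lambda x: x ** 2
    df = lambda x: 2 * x

    x0 = -2.0
    for dx in [1.0, 0.1, 0.01]:
        approx = f(x0) + df(x0) * dx
        print(dx, abs(f(x0 + dx) - approx))   # error shrinks like dx**2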

A.3 Second Derivative

Just as the derivative of a function tells us how the function value changes
(i.e. goes up/down) and by how much, the second derivative tells us how the
derivative changes (i.e. goes up/down) and by how much. Intuitively, the
second derivative can be thought of as similar to acceleration if we consider
the derivative as similar to velocity and the function value as similar to
displacement. If at a point $x_0$ we have $f''(x_0) > 0$, then this means that the
derivative will go up if we move to the right and go down if we move to the
left (similarly if $f''(x_0) < 0$ at a point).
Taylor's theorem does extend to second order derivatives as well

$$f'(x_0 + \Delta x) \approx f'(x_0) + f''(x_0) \cdot \Delta x, \quad \text{if } \Delta x \text{ is "small"}.$$

Integrating both sides and applying the fundamental theorem of calculus gives
Taylor's theorem (second order)

$$f(x_0 + \Delta x) \approx f(x_0) + f'(x_0) \cdot \Delta x + \frac{1}{2} f''(x_0) \cdot (\Delta x)^2, \quad \text{if } \Delta x \text{ is "small"}.$$
Although the above derivation is not strictly rigorous, the result is nevertheless
true. Thus, knowing the second derivative can help us get a better approximation
of the change in function value if we move a bit. The second derivative is
most commonly used in machine learning in designing very efficient optimization
algorithms (known as Newton methods, which we will study later). In fact,
there exist 3rd and higher order derivatives as well (the third derivative telling
us how the second derivative changes from point to point, etc.) but since
they are not used all that much, we will not study them here.

A.4 Stationary Points

The stationary points of a function $f : \mathbb{R} \to \mathbb{R}$ are defined as the points where
the derivative of the function vanishes, i.e. $f'(x) = 0$. The stationary points of a
function correspond to either local/global maxima or minima or else saddle points.
Given a stationary point, the second derivative test is used to distinguish extrema
from saddle points.
If the second derivative of the function is positive at a stationary point
$x_0$, i.e. $f'(x_0) = 0$ and $f''(x_0) > 0$, then $x_0$ is definitely a local minimum.
This result follows directly from the second order Taylor's theorem we studied
above. Since $f'(x_0) = 0$, we have

$$f(x_0 + \Delta x) \approx f(x_0) + \frac{1}{2} f''(x_0) \cdot (\Delta x)^2 \ge f(x_0)$$

This means that irrespective of whether $\Delta x < 0$ or $\Delta x > 0$ (i.e. irrespective of
whether we move left or right), the function value always increases. Recall that
this is the very definition of a local minimum. Similarly, we can intuitively see
that if $f'(x_0) = 0$ and $f''(x_0) < 0$ then $x_0$ is definitely a local maximum.
If we have $f'(x_0) = 0$ and $f''(x_0) = 0$ at a point,
then the second derivative test is actually silent and
fails to tell us anything informative. The reader is
warned that the first and second derivatives both
vanishing does not mean that the point is a saddle
point. For example, consider the case of the function
$f(x) = (x - 2)^4$. Clearly $x_0 = 2$ is a local
(and global) minimum. However, it is also true that
$f'(2) = 0 = f''(2)$. In such inconclusive cases, higher order derivatives, e.g.
$f^{(3)}(x) = f'''(x)$, $f^{(4)}(x)$, have to be used to figure out the status of our
stationary point.
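The second derivative test is mechanical enough to code up. Below is a small sketch (the function $f(x) = x^3 - 3x$ is our own illustrative choice, not from the notes) that classifies its stationary points $x = \pm 1$:

    # f(x) = x^3 - 3x has f'(x) = 3x^2 - 3 = 0 at x = -1 and x = +1,
    # and f''(x) = 6x decides which kind of extremum each one is.
    def f2(x):
        return 6 * x   # second derivative

    for x0 in [-1.0, 1.0]:
        if f2(x0) > 0:
            print(x0, "local minimum")
        elif f2(x0) < 0:
            print(x0, "local maximum")
        else:
            print(x0, "second derivative test inconclusive")
    # prints: -1.0 local maximum, then 1.0 local minimum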

A.5 Useful Rules for Calculating Derivatives

Several rules exist that can help us calculate the derivative of complex-looking
functions with relative ease. These are given below followed by some examples
applying them to problems.
1. (Constant Rule) If $h(x) = c$ where $c$ is not a function of $x$ then $h'(x) = 0$

2. (Sum Rule) If $h(x) = f(x) + g(x)$ then $h'(x) = f'(x) + g'(x)$

3. (Scaling Rule) If $h(x) = c \cdot f(x)$ and if $c$ is not a function of $x$ then $h'(x) = c \cdot f'(x)$

4. (Product Rule) If $h(x) = f(x) \cdot g(x)$ then $h'(x) = f'(x) \cdot g(x) + g'(x) \cdot f(x)$

5. (Quotient Rule) If $h(x) = \frac{f(x)}{g(x)}$ then $h'(x) = \frac{f'(x) \cdot g(x) - g'(x) \cdot f(x)}{g^2(x)}$

6. (Chain Rule) If $h(x) = f(g(x)) \triangleq (f \circ g)(x)$, then $h'(x) = f'(g(x)) \cdot g'(x)$


Apart from this, some handy rules exist for polynomial functions, e.g. if
$f(x) = x^c$ where $c$ is not a function of $x$, then $f'(x) = c \cdot x^{c-1}$; the logarithmic
function, i.e. if $f(x) = \ln(x)$ then $f'(x) = \frac{1}{x}$; the exponential function, i.e. if
$f(x) = \exp(x)$ then $f'(x) = \exp(x)$; and trigonometric functions, i.e. if $f(x) = \sin(x)$
then $f'(x) = \cos(x)$ and if $f(x) = \cos(x)$ then $f'(x) = -\sin(x)$. The
most common use of the chain rule is finding $f'(x)$ when $f$ is a function of
some variable, say $t$, but $t$ itself is a function of $x$, i.e. $t = g(x)$.

Example A.1. Let $\ell(x) = (a \cdot x - b)^2$ where $a, b \in \mathbb{R}$ are constants that do
not depend on $x$. Then we can write $\ell(t) = t^2$ where $t(x) = a \cdot x - b$. Thus,
applying the chain rule tells us that $\ell'(x) = \ell'(t) \cdot t'(x)$. By applying the rules
above we have $\ell'(t) = 2t$ (polynomial rule) and $t'(x) = a$ (constant rule and
scaling rule). This gives us $\ell'(x) = 2a \cdot (a \cdot x - b)$.
Example A.2. Let $\sigma(x) = \frac{1}{1+\exp(-B \cdot x)}$ be the sigmoid function where $B \in \mathbb{R}$ is
a constant that does not depend on $x$. Then we can write $\sigma(t) = t^{-1}$ where
$t(s) = 1 + \exp(s)$ and $s(x) = -B \cdot x$. Thus, applying the chain rule tells us
that $\sigma'(x) = \sigma'(t) \cdot t'(s) \cdot s'(x)$. By applying the rules above we have $\sigma'(t) = -\frac{1}{t^2}$
(polynomial rule), $t'(s) = \exp(s)$ (constant rule and exponential rule), and $s'(x) = -B$
(scaling rule). This gives us $\sigma'(x) = B \cdot \frac{\exp(-B \cdot x)}{(1+\exp(-B \cdot x))^2} = B \cdot \sigma(x)(1 - \sigma(x))$.
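The identity derived above is easy to verify numerically. A small Python sketch (with $B = 3$ and the evaluation point chosen arbitrarily):

    # Central-difference check of sigma'(x) = B * sigma(x) * (1 - sigma(x)).
    import math

    B = 3.0
    sigma = lambda x: 1.0 / (1.0 + math.exp(-B * x))

    x0, h = 0.7, 1e-6
    numeric = (sigma(x0 + h) - sigma(x0 - h)) / (2 * h)
    analytic = B * sigma(x0) * (1 - sigma(x0))
    print(abs(numeric - analytic) < 1e-6)   # True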

A.6 Multivariate Functions

In the previous sections we looked at functions of one variable, i.e. univariate
functions $f : \mathbb{R} \to \mathbb{R}$. We will now extend our intuitions about derivatives
to multivariate functions, i.e. functions of multiple variables, of the form
$f : \mathbb{R}^d \to \mathbb{R}$, which take a $d$-dimensional vector as input and output a real
number.

A.6.1 First Derivatives


As before, the first derivative tells us how much the function value changes
and in what direction, if we move a bit from our current location. Since in $d$
dimensions there are $d$ directions along which we can move, "moving" means
going from $x^0 \in \mathbb{R}^d$ to a point $x^0 + \Delta x$ where $\Delta x \in \mathbb{R}^d$ (but $\Delta x$ is "small", i.e.
$\|\Delta x\|_2$ is small). To capture how the function value may change as a result of
such movement, the gradient of the function captures how much the function
changes if we move just along one of the axes.
More specifically, the gradient of a multivariate function $f$ at a point
$x^0 \in \mathbb{R}^d$ is a vector $\nabla f(x^0) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d}\right)$ where for any $j \in [d]$,
$\frac{\partial f}{\partial x_j}$ indicates whether the function value increases or decreases and by how
much, if we keep all coordinates of $x^0$ fixed except the $j$th coordinate, which we
increase by a small amount. That is, if $\Delta x = (0, 0, \ldots, 0, \delta, 0, \ldots, 0)$ with $\delta$ in the
$j$th coordinate, then our friend Taylor's theorem tells us that

$$f(x^0 + \Delta x) \approx f(x^0) + \delta \cdot \frac{\partial f}{\partial x_j}$$
We can use the gradient to find out how much the function value would change
if we moved a little bit in a general direction by summing up the individual
contributions from all the axes. Suppose we move along $\Delta x$ where now all
coordinates of $\Delta x$ may be non-zero (but small); then the multivariate Taylor's
theorem (first order) tells us

$$f(x^0 + \Delta x) \approx f(x^0) + \nabla f(x^0)^\top \Delta x = f(x^0) + \sum_{j=1}^d \Delta x_j \cdot \frac{\partial f}{\partial x_j}$$

The gradient also has the very useful property of being the direction of steepest
ascent. This means that among all the directions in which we could move,
if we move along the direction of the gradient, then the function value would
experience the maximum amount of increase. However, for machine learning
applications, a related property holds more importance: among all the directions
in which we could move, if we move along the direction opposite to that
of the gradient, i.e. we move along $-\nabla f(x^0)$, then the function value would
experience the maximum amount of decrease – this means that the direction
opposite to the gradient offers the steepest descent.
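A small numeric sketch of the steepest descent property (the function and point below are our own arbitrary choices): among many random unit directions, none decreases $f$ more than the normalized negative gradient.

    # For f(x) = ||x||_2^2 the gradient at x0 is 2 * x0. We compare the
    # decrease along -grad against 1000 random unit directions.
    import numpy as np

    f = lambda x: np.sum(x ** 2)
    x0 = np.array([1.0, 2.0])
    grad = 2 * x0

    rng = np.random.default_rng(1)
    step = 1e-3
    best_random = min(
        f(x0 + step * d / np.linalg.norm(d))
        for d in rng.normal(size=(1000, 2))
    )
    along_neg_grad = f(x0 - step * grad / np.linalg.norm(grad))
    print(along_neg_grad <= best_random)   # True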

A.6.2 Second Derivatives

Second derivatives play a similar role of documenting how the first derivative
changes as we move a little bit from point to point. However, since we have $d$
partial derivatives here and $d$ possible axis directions along which to move, the
second derivative for multivariate functions is actually a $d \times d$ matrix, called
the Hessian and denoted as $\nabla^2 f(x^0)$.

$$\nabla^2 f(x^0) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_d \partial x_1} & \frac{\partial^2 f}{\partial x_d \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_d^2} \end{bmatrix}$$

Clairaut's theorem tells us that if the function $f$ is "nice" (basically, the second
order partial derivatives are all continuous), then $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$, i.e. the
Hessian matrix is symmetric. The $(i,j)$th entry of this Hessian matrix – styled
as $\frac{\partial^2 f}{\partial x_i \partial x_j}$ – records how much the $i$th partial derivative changes if we move a
little bit along the $j$th axis, i.e. if $\Delta x = (0, 0, \ldots, 0, \delta, 0, \ldots, 0)$ with $\delta$ in the $j$th
coordinate, then

$$\frac{\partial f}{\partial x_i}(x^0 + \Delta x) \approx \frac{\partial f}{\partial x_i}(x^0) + \frac{\partial^2 f}{\partial x_i \partial x_j}(x^0) \cdot \delta, \quad \text{if } \Delta x \text{ is "small"}.$$

Just as in the univariate case, the Hessian can be incorporated into Taylor's
theorem to obtain a finer approximation of the change in function value.
Denoting $H = \nabla^2 f(x^0)$ for the sake of notational simplicity, the multivariate
Taylor's theorem (second order) reads

$$f(x^0 + \Delta x) \approx f(x^0) + \nabla f(x^0)^\top \Delta x + \frac{1}{2}(\Delta x)^\top H (\Delta x) = f(x^0) + \sum_{j=1}^d \Delta x_j \cdot \frac{\partial f}{\partial x_j} + \frac{1}{2}\sum_{i=1}^d \sum_{j=1}^d \Delta x_i \Delta x_j \frac{\partial^2 f}{\partial x_i \partial x_j}(x^0)$$

A.6.3 Stationary Points

Just as in the univariate case, here also we define stationary points as those
where the gradient of the function vanishes, i.e. $\nabla f(x^0) = 0$. As before, stationary
points can either be local minima/maxima or else saddle points, and
the second derivative test is used to decide which is the case. However, the
multivariate second derivative test looks a bit different.
If the Hessian of the function is positive semi-definite (PSD) at a stationary
point $x^0$, i.e. $\nabla f(x^0) = 0$ and $H = \nabla^2 f(x^0) \succeq 0$, then $x^0$ is definitely a local
minimum. Recall that a square symmetric matrix $A \in \mathbb{R}^{d \times d}$ is called positive
semi-definite if for all vectors $v \in \mathbb{R}^d$, we have $v^\top A v \ge 0$. As before, this
result follows directly from the multivariate second order Taylor's theorem we
studied above. Since $\nabla f(x^0) = 0$, we have

$$f(x^0 + \Delta x) \approx f(x^0) + \frac{1}{2}(\Delta x)^\top H (\Delta x) \ge f(x^0)$$

This means that no matter in which direction we move from $x^0$, the function
value always increases. This is the very definition of a local minimum. Similarly,
we can intuitively see that if the Hessian of the function is negative semi-definite
(NSD) at a stationary point $x^0$, i.e. $\nabla f(x^0) = 0$ and $\nabla^2 f(x^0) \preceq 0$,
then $x^0$ is a local maximum. Recall that a square symmetric matrix $A \in \mathbb{R}^{d \times d}$
is called negative semi-definite if for all vectors $v \in \mathbb{R}^d$, we have $v^\top A v \le 0$.
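In code, the multivariate second derivative test amounts to checking the eigenvalues of the Hessian (a symmetric matrix is PSD iff all its eigenvalues are non-negative). A sketch, with an arbitrarily chosen quadratic:

    # For f(x) = x^T A x (A symmetric) the only stationary point is x = 0
    # and the Hessian there is 2A; its eigenvalues decide the test.
    import numpy as np

    A = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
    H = 2 * A
    eigs = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
    if np.all(eigs >= 0):
        print("PSD: x = 0 is a local minimum")
    elif np.all(eigs <= 0):
        print("NSD: x = 0 is a local maximum")
    else:
        print("indefinite: x = 0 is a saddle point")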

A.7 Visualizing Multivariate Derivatives

We now take a toy example to help the reader visualize how multivariate
derivatives operate. We will take $d = 2$ to allow us to explicitly show gradients
and function values on a 2D grid. The function we will study will not be
continuous but discrete, but it will nevertheless allow us to revise the essential
aspects of the topics we studied above.
Consider the function $f : [0, 8] \times [0, 6] \to \mathbb{R}$ shown in the accompanying
figure. The function is discrete – darker shades indicate a higher function value
(which is also written inside the boxes) and lighter shades indicate a smaller
function value. Since discrete functions are non-differentiable, we will use
approximations to calculate the gradient of this function at all the points.
Note that the input to this function is a pair of integers $(x, y)$ where $0 \le x \le 8$
and $0 \le y \le 6$.
Given this, we may estimate the gradient of the function at a point $(x_0, y_0)$
using the formula $\nabla f(x_0, y_0) = \left(\frac{\Delta f}{\Delta x}, \frac{\Delta f}{\Delta y}\right)$ where

$$\frac{\Delta f}{\Delta x} = \frac{f(x_0 + 1, y_0) - f(x_0 - 1, y_0)}{2}, \qquad \frac{\Delta f}{\Delta y} = \frac{f(x_0, y_0 + 1) - f(x_0, y_0 - 1)}{2}$$
The values of the gradients calculated using the above formula are shown in the
figure. Notice that we have five locations where the gradient vanishes – (4,5),
(1,3), (4,3), (7,3) and (4,1): these are stationary points. It may be more
instructive to see the gradients represented as arrows, which the figure also does.
Notice that gradients converge toward the local maxima (4,5) and (4,1) from all
directions (this is expected since each such point has a greater function value
than all its neighbors). Similarly, gradients diverge away from the local minima
(1,3) and (7,3) in all directions (this is expected as well since each such point
has a smaller function value than all its neighbors). However, the point (4,3),
being a saddle point, has gradients converging to it in the x direction but
diverging away from it in the y direction. In order to verify which of our
stationary points are local maxima/minima and which are saddle points, we need
to estimate the Hessian of this function.

To do so, we use the following formulae for the approximate Hessian:

$$\nabla^2 f(x_0, y_0) = \begin{bmatrix} \frac{\Delta^2 f}{\Delta x^2} & \frac{\Delta^2 f}{\Delta x \Delta y} \\ \frac{\Delta^2 f}{\Delta x \Delta y} & \frac{\Delta^2 f}{\Delta y^2} \end{bmatrix},$$

where we calculate each of the second and mixed partial derivative terms as
follows:

$$\frac{\Delta^2 f}{\Delta x^2} = \frac{f(x_0 + 1, y_0) + f(x_0 - 1, y_0) - 2f(x_0, y_0)}{1^2}$$
$$\frac{\Delta^2 f}{\Delta y^2} = \frac{f(x_0, y_0 + 1) + f(x_0, y_0 - 1) - 2f(x_0, y_0)}{1^2}$$
$$\frac{\Delta^2 f}{\Delta x \Delta y} = \frac{f_{xy} + f_{yx}}{2}$$
$$f_{xy} = \frac{\frac{\Delta f}{\Delta x}(x_0, y_0 + 1) - \frac{\Delta f}{\Delta x}(x_0, y_0 - 1)}{2}, \qquad f_{yx} = \frac{\frac{\Delta f}{\Delta y}(x_0 + 1, y_0) - \frac{\Delta f}{\Delta y}(x_0 - 1, y_0)}{2}$$
Deriving these formulae for approximating mixed partial derivatives is relatively
simple but we do not do so here. Also, the expression for $\frac{\Delta^2 f}{\Delta x \Delta y}$, which
seems needlessly complicated due to the average involved, was made so in order
to make sure that we obtain a symmetric matrix as the approximation to
the Hessian (since Clairaut's theorem does not apply to our toy example,
symmetry is not automatically ensured). However, any dissatisfaction with the
formulae aside, we can verify that the Hessian is indeed PSD at the local minima,
NSD at the local maxima, and neither NSD nor PSD at the saddle point. This
verifies our earlier second derivative test rules.
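The finite-difference formulas above translate directly into code. A sketch on a small synthetic grid (the grid values below are our own, not the figure's):

    # Approximate gradient and Hessian of a discrete function on a grid,
    # using the central-difference formulas from this section.
    import numpy as np

    F = np.array([[1., 2., 1.],
                  [2., 5., 2.],
                  [1., 2., 1.]])   # toy f(x, y) with a peak at the center

    def grad(x, y):
        return np.array([(F[x + 1, y] - F[x - 1, y]) / 2,
                         (F[x, y + 1] - F[x, y - 1]) / 2])

    def hessian(x, y):
        dxx = F[x + 1, y] + F[x - 1, y] - 2 * F[x, y]
        dyy = F[x, y + 1] + F[x, y - 1] - 2 * F[x, y]
        # expanding (f_xy + f_yx) / 2 gives this four-point formula
        dxy = (F[x + 1, y + 1] - F[x + 1, y - 1]
               - F[x - 1, y + 1] + F[x - 1, y - 1]) / 4
        return np.array([[dxx, dxy], [dxy, dyy]])

    print(grad(1, 1))                          # [0. 0.]: a stationary point
    print(np.linalg.eigvalsh(hessian(1, 1)))   # all negative: local maximum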

A.8 Useful Rules for Calculating Multivariate Derivatives

The quantities of interest that we wish to calculate in the multivariate setting
include derivatives of various orders, i.e. the gradient (first order derivative) and
the Hessian (second order derivative). It is notable that the rules that we studied
in the context of univariate functions (Constant Rule, Sum Rule, Scaling Rule,
Product Rule, Quotient Rule, Chain Rule) continue to apply in the multivariate
setting as well. However, we need to be careful while applying them, otherwise
we may make mistakes and get confusing answers.

A.8.1 Dimensionality rule


A handy rule to remember while taking derivatives of multivariate functions is
the dimensionality rule, which shows us how to determine the dimensionality
of the derivative using the dimensionality of the input and the output of the
function. We will have to wait till we study vector valued functions and Jacobians
before stating this rule in all its generality. For now, we will simply study
special cases of this rule that are needed to calculate gradients and Hessians.
1. If $f : \mathbb{R}^d \to \mathbb{R}$ and $x \in \mathbb{R}^d$, then $\frac{df}{dx}$ (also denoted as the gradient $\nabla f(x)$)
must also be a vector of $d$ dimensions such that $\nabla f(x) = \left[\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_d}\right]^\top$
(recall that our vectors are columns by convention unless stated otherwise).

2. If $f : \mathbb{R}^d \to \mathbb{R}^d$ is a (vector-valued) function that takes in a $d$-dimensional
vector as input and gives another $d$-dimensional vector as output, then
$\frac{df}{dx}$ must be a matrix of dimensionality $d \times d$. If we denote $f_i$ as the $i$th
dimension of the output, then the $(i,j)$th entry of the derivative is given
by $\left[\frac{df}{dx}\right]_{(i,j)} = \frac{\partial f_i}{\partial x_j}$, i.e. the $(i,j)$th entry captures how the $i$th dimension of
the output changes as the $j$th dimension of the input is changed.

The above cases may tempt the reader to wonder what happens if we have a
function f : Rm → Rn which takes in an m-dimensional vector as input and
gives an n-dimensional vector as output. Indeed, the derivative in this case
must be an n × m matrix. Note that all the above cases fit this more general
rule. However, we will study this in detail later when we study Jacobians.

A.8.2 Useful Rules for Calculating Gradients


Although carefully and correctly applying all the rules of univariate deriva-
tives, as well as the dimensionality rules stated above, will always give us the
right answer, doing so from scratch every time may be time consuming. Thus,
we present here a few handy shortcuts for calculating gradients. We stress that
every single one of these rules can be derived by simply applying the afore-
mentioned rules carefully. In the following, x ∈ Rd is a vector. Also, all vectors
are column vectors unless stated otherwise.

1. (Dot Product Rule) If $f(x) = a^\top x$ where $a \in \mathbb{R}^d$ is a vector that does
not depend on $x$, then $\nabla f(x) = a$. This rule can be derived by applying the
univariate scaling rule repeatedly to each dimension $j \in [d]$.

2. (Sum Rule) If $h(x) = f(x) + g(x)$ then $\nabla h(x) = \nabla f(x) + \nabla g(x)$. This
rule can be derived by applying the univariate sum rule repeatedly to each
dimension $j \in [d]$.

3. (Quadratic Rule) If $f(x) = x^\top A x$ where $A \in \mathbb{R}^{d \times d}$ is a symmetric matrix
that is not a function of $x$, then $\nabla f(x) = 2Ax$. If $A$ is not symmetric,
then $\nabla f(x) = Ax + A^\top x$. This rule can be derived by applying the
univariate product rule repeatedly to each dimension $j \in [d]$.

4. (Chain Rule) If $g : \mathbb{R}^d \to \mathbb{R}$ and $f : \mathbb{R} \to \mathbb{R}$, then if we define $h(x) = f(g(x))$,
we have $\nabla h(x) = f'(g(x)) \cdot \nabla g(x)$. This rule can be derived by
applying the univariate chain rule repeatedly to each dimension $j \in [d]$.

We now illustrate the use of these rules using some examples.

Example A.3. Let $f(x) = \|x\|_2$. We can rewrite this as $f = \sqrt{t}$, where $t(x) = \|x\|_2^2 = x^\top x = x^\top I_d x$
where $I_d$ is the $d \times d$ identity matrix. Thus, using the
chain rule we have $\nabla f(x) = f'(t) \cdot \nabla t(x)$. Using the polynomial rule we have
$f'(t) = \frac{1}{2\sqrt{t}}$, whereas using the quadratic rule, we get $\nabla t(x) = 2x$. Thus we
have $\nabla f(x) = \frac{x}{\|x\|_2}$. Note that in this case, the gradient is always a unit vector.

Example A.4. Let $\sigma(x) = \frac{1}{1+\exp(-a^\top x)}$ where $a \in \mathbb{R}^d$ is a constant vector that
does not depend on $x$. Then we can write $\sigma(t) = t^{-1}$ where $t(s) = 1 + \exp(s)$
and $s(x) = -a^\top x$. Thus, applying the chain rule tells us that $\nabla\sigma(x) = \sigma'(t) \cdot t'(s) \cdot \nabla s(x)$.
By applying the rules above we have $\sigma'(t) = -\frac{1}{t^2}$ (polynomial
rule), $t'(s) = \exp(s)$ (constant rule and exponential rule), and $\nabla s(x) = -a$ (dot
product rule). This gives us $\nabla\sigma(x) = \frac{\exp(-a^\top x)}{(1+\exp(-a^\top x))^2} \cdot a = \sigma(x)(1 - \sigma(x)) \cdot a$.

Example A.5. Let $f(x) = (a^\top x - b)^2$ where $a \in \mathbb{R}^d$ is a constant vector and
$b \in \mathbb{R}$ is a constant scalar. Using the gradient chain rule, we get $\nabla f(x) = 2(a^\top x - b) \cdot a$.
Example A.6. Let $A \in \mathbb{R}^{n \times d}$ be a constant matrix and $b \in \mathbb{R}^n$ be a constant
vector, and define $f(x) = \|Ax - b\|_2^2$. If we let $a^i \in \mathbb{R}^d$ denote the vector
formed out of the $i$th row of the matrix $A$, then we can see that we can
rewrite the function as $f(x) = \sum_{i=1}^n (x^\top a^i - b_i)^2$. Using the sum rule for
gradients gives us $\nabla f(x) = 2\sum_{i=1}^n (x^\top a^i - b_i) \cdot a^i$. Note that this is simply
the sum of the vectors $a^i$ multiplied by the scalars $c_i \triangleq 2(x^\top a^i - b_i)$. If we let
$c \triangleq [c_1, \ldots, c_n]^\top = 2(Ax - b) \in \mathbb{R}^n$, then we can rewrite the gradient very neatly
as $\nabla f(x) = A^\top c = 2A^\top(Ax - b)$. Remember, the dimensionality rule tells us
that the gradient must be a $d$-dimensional vector, so the gradient cannot be
something like $Ac$, which anyway does not make sense.
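As a sanity check of the formula just derived, we can compare it against central differences. A small Python sketch (the shapes below are chosen arbitrarily):

    # Check grad f(x) = 2 A^T (A x - b) for f(x) = ||A x - b||_2^2.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))
    b = rng.normal(size=5)
    x = rng.normal(size=3)

    f = lambda x: np.sum((A @ x - b) ** 2)
    analytic = 2 * A.T @ (A @ x - b)

    h = 1e-6
    numeric = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h)
                        for e in np.eye(3)])
    print(np.allclose(numeric, analytic, atol=1e-5))   # True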

A.8.3 Useful Rules for Calculating Hessians


The Hessian of a function $f : \mathbb{R}^d \to \mathbb{R}$ is defined as the derivative $\nabla^2 f(x) = \frac{d^2 f}{dx\,dx^\top}$.
The $dx\,dx^\top$ expression is merely a stylized way of distinguishing the two
applications of derivatives with respect to $x$. We will find it more convenient
to write the Hessian as $\frac{d}{dx}(\nabla f)$ or else as $\nabla^2 f(x) = \frac{dg}{dx}$ where $g(x) = \nabla f(x)$.
Note that $g : \mathbb{R}^d \to \mathbb{R}^d$ is a vector valued function that maps every point
$x^0 \in \mathbb{R}^d$ to the gradient of $f$ at $x^0$, i.e. $\nabla f(x^0) \in \mathbb{R}^d$.
The dimensionality rules tell us that since $\nabla^2 f(x) = \frac{dg}{dx}$, it must be a $d \times d$
matrix (also symmetric if the function is nice). As before, we present here a
few handy shortcuts for calculating Hessians. We remind the reader that every
single one of these rules can be derived by simply applying the aforementioned
rules of univariate calculus and the dimensionality rules carefully.
Specifically, all we need to do is think of the $j$th coordinate of the $g$ function
(recall that the output of $g$ is a $d$-dimensional vector) as a real-valued function
$g_j : \mathbb{R}^d \to \mathbb{R}$, take the gradient of this function using our usual rules,
and then set the $j$th row of the Hessian matrix as $(\nabla g_j)^\top$.

1. (Linear Rule) If $f(x) = a^\top x$ where $a$ is a constant vector, then $g(x) = \nabla f(x) = a$
and thus $\nabla g_j = 0$ for all $j$ (using the constant rule), so
$\nabla^2 f(x) = 0 \in \mathbb{R}^{d \times d}$. Thus, linear (or even affine) functions such as
$f(x) = a^\top x + b$ where $b \in \mathbb{R}$ is a constant have zero Hessians. Note that
the dimensionality rule tells us that although the Hessian is zero, it is a
zero matrix, not a zero vector or the scalar zero.

2. (Quadratic Rule) If $f(x) = x^\top A x$ where $A \in \mathbb{R}^{d \times d}$ is a constant symmetric
matrix that is not a function of $x$, then $g(x) = \nabla f(x) = 2Ax$. If we
let $a^j$ denote the vector formed out of the $j$th row of the matrix $A$, then
$g_j(x) = 2x^\top a^j$. Thus, we have $\nabla g_j = 2a^j$ using the dot product rule for
gradients. Applying the dimensionality rule then tells us that $\nabla^2 f(x) = \frac{dg}{dx} = 2A \in \mathbb{R}^{d \times d}$.
If $A$ is not symmetric, then $\nabla^2 f(x) = A + A^\top$.

We now illustrate the use of these rules using some examples.

Example A.7. Let $f(x) = \frac{1}{2}\|x\|_2^2$. We can rewrite this as $f = \frac{1}{2}x^\top x = \frac{1}{2}x^\top I x$
where $I$ is the $d \times d$ identity matrix. The scaling rule and the Hessian quadratic
rule tell us that $\nabla^2 f(x) = I$ since the identity matrix is symmetric.

Example A.8. Let $f(x) = (a^\top x - b)^2$ where $a \in \mathbb{R}^d$ is a constant vector and
$b \in \mathbb{R}$ is a constant scalar. From Example A.5, we get $g = \nabla f(x) = 2(a^\top x - b) \cdot a$.
Thus, we get $g_j = 2(a^\top x - b)a_j$ where $a_j$ is the $j$th coordinate of the vector
$a$. Using the scaling rule and the gradient dot product rule we get $\nabla g_j = 2a_j \cdot a$.
Thus, the $j$th row of the Hessian is the vector $a^\top$ (transposed since it is a row
vector) scaled by $2a_j$. Pondering on this for a moment tells us that this
means that $\nabla^2 f(x) = 2 \cdot aa^\top \in \mathbb{R}^{d \times d}$.

Example A.9. Let $A \in \mathbb{R}^{n \times d}$ be a constant matrix and $b \in \mathbb{R}^n$ be a constant
vector, and define $f(x) = \|Ax - b\|_2^2$. From Example A.6, we get $g = \nabla f(x) = 2\sum_{i=1}^n (x^\top a^i - b_i) \cdot a^i = 2A^\top(Ax - b)$,
where $a^i \in \mathbb{R}^d$ denotes the vector formed
out of the $i$th row of the matrix $A$. It is useful to clarify here that although
the vector $a^i$ was formed out of a row of a matrix, the vector itself is a column
vector as per our convention for vectors. Using the sum rule for gradients gives
us $\nabla g_j = 2\sum_{i=1}^n a^i_j \cdot a^i$. Similarly as in the above example, we can deduce that
this means that $\nabla^2 f(x) = 2\sum_{i=1}^n a^i (a^i)^\top = 2 \cdot A^\top A \in \mathbb{R}^{d \times d}$.
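A brief companion check in code (the same illustrative shapes as in the earlier sketch): the Hessian $2 A^\top A$ is constant, symmetric, and PSD – and by the Hessian convexity criterion of Chapter B, this also tells us that $f$ is convex.

    # The Hessian of f(x) = ||A x - b||_2^2 is the constant matrix 2 A^T A.
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 3))
    H = 2 * A.T @ A
    print(H.shape)                              # (3, 3): d x d, as required
    print(np.allclose(H, H.T))                  # True: symmetric
    print(np.all(np.linalg.eigvalsh(H) >= 0))   # True: PSD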

A.9 Subdifferential Calculus

The notions of derivatives of various orders discussed above assume at the very
least that the function is differentiable. If that is not the case, then gradients (and
by extension Hessians) cannot be defined. However, one can still define some
nice extensions of the notion of gradient if the function is nevertheless convex
(refer to Chapter B for notions of convexity for non-differentiable functions).
Recall that the tangent definition of convexity (see § B.2.3) demands that
the function always lie above all its tangent hyperplanes, i.e. $f(x) \ge \nabla f(x^0)^\top(x - x^0) + f(x^0)$
for all $x$. The key to extending the notion of gradient to
non-differentiable convex functions is to take this definition of convexity and
turn it on its head.
Note that a hyperplane $t(x)$ that touches the surface of a function $f$ (even
if $f$ is non-differentiable) at a point $x^0$ must necessarily be of the form $t(x) = g^\top(x - x^0) + f(x^0)$
(since we must have $t(x^0) = f(x^0)$). The trick is to simply
use this to define any vector $g$ such that $f(x) \ge g^\top(x - x^0) + f(x^0)$ for all
$x$ as a subgradient of $f$ at the point $x^0$. Note that such a definition has some
interesting properties.
Firstly, note that if we apply this definition to differentiable convex functions,
then we conclude that the gradient $\nabla f(x^0)$ is a subgradient of $f$ at $x^0$.
In fact, it is the only subgradient of $f$ at $x^0$ if $f$ is differentiable at $x^0$. We
emphasize this because if $f$ is not differentiable at $x^0$, then $f$ may have multiple
subgradients (in general an infinite number of subgradients) at $x^0$. The set of
all subgradients at $x^0$ is called the subdifferential of $f$ at $x^0$ and denoted as
follows

$$\partial f(x^0) = \left\{ g \in \mathbb{R}^d : f(x) \ge g^\top(x - x^0) + f(x^0) \text{ for all } x \right\}$$

Note that the above discussion indicates that if $f$ is differentiable at $x^0$ then
$\partial f(x^0) = \{\nabla f(x^0)\}$, i.e. the set has only one element.


A.9.1 Rules of Subgradient Calculus


Several rules exist that can help us calculate the subdifferential of complex-
looking non-differentiable functions with relative ease. These (except for the
max rule) do share parallels with the rules we saw for regular calculus but
with some key differences. These are given below followed by some examples
applying them to problems. In the following, a ∈ Rd is a constant vector and
b ∈ R is a constant scalar.

1. (Scaling Rule) If $h(x) = c \cdot f(x)$ and if $c \ge 0$ is not a function of $x$ then
$\partial h(x) = c \cdot \partial f(x) \triangleq \{c \cdot u : u \in \partial f(x)\}$

2. (Sum Rule) If $h(x) = f(x) + g(x)$ then $\partial h(x) = \partial f(x) + \partial g(x) \triangleq \{u + v : u \in \partial f(x), v \in \partial g(x)\}$.
Note that here, we are defining the sum
of two sets as the Minkowski sum.

3. (Chain Rule) If $h(x) = f(a^\top x + b)$, then $\partial h(x) = \partial f(a^\top x + b) \cdot a \triangleq \{c \cdot a : c \in \partial f(a^\top x + b)\}$

4. (Max Rule) If $h(x) = \max\{f(x), g(x)\}$, then the following cases apply:

(a) If $f(x) > g(x)$, then $\partial h(x) = \partial f(x)$

(b) If $f(x) < g(x)$, then $\partial h(x) = \partial g(x)$

(c) If $f(x) = g(x)$, then
$\partial h(x) = \{\lambda \cdot u + (1 - \lambda) \cdot v : u \in \partial f(x), v \in \partial g(x), \lambda \in [0, 1]\}$.

Note that the max rule has no counterpart in regular calculus since functions
of the form $h(x) = \max\{f(x), g(x)\}$ are usually non-differentiable.

A.9.2 Stationary Points for Non-differentiable Functions


The notion of stationary points does extend to non-differentiable convex functions
as well. A point $x^0$ is called a stationary point for a function $f$ if
$0 \in \partial f(x^0)$, i.e. the zero vector is a part of its subdifferential.
It is noteworthy that even for non-differentiable convex functions, global
minima must be stationary points in this sense and vice versa. This is easy to
see – suppose that we do have $0 \in \partial f(x^0)$; then the definition of subgradients
for convex functions dictates that we must have $f(x) \ge 0^\top(x - x^0) + f(x^0)$ for
all $x$, which is the same as saying $f(x) \ge f(x^0)$ for all $x$, which is precisely the
definition of a global minimum. Thus, $x^0$ is a global minimum iff ($\equiv$ if and
only if) $0 \in \partial f(x^0)$.

Example A.10. Let $\ell(x) = [1 - x]_+ = \max\{1 - x, 0\}$ denote the
hinge loss function. To calculate its subdifferential,
we note that we can write this function as $\ell(x) = \max\{f(x), g(x)\}$
where $f(x) = 1 - x$ and $g(x) = 0$. Note that $f, g$ are both differentiable functions
with $f'(x) = -1$ and $g'(x) = 0$ for all $x$.
Applying the max rule, at the point where
$f(x) = g(x)$, i.e. at $x = 1$, we get $\partial\ell(x) = \{\lambda \cdot (-1) + (1 - \lambda) \cdot 0 : \lambda \in [0, 1]\}$.
Thus, we have

$$\partial\ell(x) = \begin{cases} \{-1\} & \text{if } x < 1 \\ \{0\} & \text{if } x > 1 \\ [-1, 0] & \text{if } x = 1 \end{cases}$$

Example A.11. Let $\ell(w) = [1 - y \cdot w^\top x]_+$ denote the hinge loss function
applied to a data point $(x, y)$ along with the model $w$. Applying the chain rule
gives us

$$\partial\ell(w) = \begin{cases} \{-y \cdot x\} & \text{if } y \cdot w^\top x < 1 \\ \{0\} & \text{if } y \cdot w^\top x > 1 \\ \{c \cdot y \cdot x : c \in [-1, 0]\} & \text{if } y \cdot w^\top x = 1 \end{cases}$$
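A small sketch in code (our own helper, not from the notes) that returns one valid subgradient from each case of Example A.11, picking $c = -1$ at the kink:

    # One valid subgradient of the hinge loss l(w) = [1 - y * w^T x]_+ .
    import numpy as np

    def hinge_subgradient(w, x, y):
        margin = y * np.dot(w, x)
        if margin < 1:
            return -y * x             # differentiable region: active loss
        if margin > 1:
            return np.zeros_like(w)   # differentiable region: flat loss
        return -y * x                 # kink: any c in [-1, 0] works; use c = -1

    w = np.array([0.5, -0.2])
    x = np.array([1.0, 2.0])
    print(hinge_subgradient(w, x, y=1))   # [-1. -2.] since the margin is 0.1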

A.10 Exercises

Exercise A.1. Let $f(x) = x^4 - 4x^2 + 4$. Find all stationary points of this
function. Which of them are local maxima and minima?
Exercise A.2. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be defined as $g(x, y) = f(x) + f(y) + 8$ where
$f$ is defined in the exercise above. Find all stationary points of this function.
Which of them are local maxima and minima? Which of these are saddle
points?
Exercise A.3. Given a natural number $n \in \mathbb{N}$, e.g. 2, 8, 97, and a real number
$x_0 \in \mathbb{R}$, design a function $f : \mathbb{R} \to \mathbb{R}$ so that $f^{(k)}(x_0) = 0$ for all
$k = 1, 2, \ldots, n$. Here $f^{(k)}(x_0)$ denotes the $k$th order derivative of $f$ at $x_0$, e.g.
$f^{(1)}(x_0) = f'(x_0)$, $f^{(3)}(x_0) = f'''(x_0)$, etc.
Exercise A.4. Let $a, b \in \mathbb{R}^d$ be constant vectors and let $f(x) = a^\top x x^\top b$.
Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Hint: write $f(x) = g(x) \cdot h(x)$ where $g(x) = a^\top x$ and $h(x) = b^\top x$ and apply
the product rule.

Exercise A.5. Let $b \in \mathbb{R}^d$ be a constant vector and $A \in \mathbb{R}^{d \times d}$ be a constant
symmetric matrix. Let $f(x) = b^\top A x$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Hint: write $f(x) = c^\top x$ where $c = A^\top b$.

Exercise A.6. Let $A, B, C \in \mathbb{R}^{d \times d}$ be three symmetric and constant matrices
and $p, q \in \mathbb{R}^d$ be two constant vectors. Let $f(x) = (Ax + p)^\top C(Bx + q)$.
Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Exercise A.7. Suppose we have $n$ constant vectors $a^1, \ldots, a^n \in \mathbb{R}^d$. Let
$f(x) = \sum_{i=1}^n \ln\left(1 + \exp(-x^\top a^i)\right)$. Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.

Exercise A.8. Let $a \in \mathbb{R}^d$ be a constant vector and let $f(x) = a^\top x \cdot \|x\|_2^2$.
Calculate $\nabla f(x)$ and $\nabla^2 f(x)$.
Hint: the expressions may be more tedious with this one. Be patient and apply
the product rule carefully to first calculate the gradient. Then move on to the
Hessian by applying the dimensionality rule.
Exercise A.9. Show that for any convex function $f$ (whether differentiable or
not), its subdifferential at any point $x^0$, i.e. $\partial f(x^0)$, is always a convex set.

Exercise A.10. For a vector $x \in \mathbb{R}^d$, its L1 norm is defined as $\|x\|_1 \triangleq \sum_{j=1}^d |x_j|$.
Let $f(x) \triangleq \|x\|_1 + [1 - x^\top a]_+$ where $a \in \mathbb{R}^d$ is a constant vector. Find the
subdifferential $\partial f(x)$.

Exercise A.11. Let $f(x) = \max\left\{(x^\top a - b)^2, c\right\}$ where $a \in \mathbb{R}^d$ is a constant
vector and $b, c \in \mathbb{R}$ are constant scalars. Find the subdifferential $\partial f(x)$.

Exercise A.12. Let $x, a \in \mathbb{R}^d$, $b \in \mathbb{R}$ where $a$ is a constant vector that does
not depend on $x$ and $b$ is a constant real number that does not depend on $x$.
Let $f(x) = \left|a^\top x - b\right|$. Find the subdifferential $\partial f(x)$.

Exercise A.13. Now suppose we have $n$ constant vectors $a^1, \ldots, a^n \in \mathbb{R}^d$
and $n$ real constants $b_1, \ldots, b_n \in \mathbb{R}$. Let $f(x) = \sum_{i=1}^n \left|x^\top a^i - b_i\right|$. Find the
subdifferential $\partial f(x)$.
B Convex Analysis Refresher

Convex sets and functions remain the favorites of practitioners working on ma-
chine learning algorithms since these objects have several beautiful properties
that make it simple to design efficient algorithms. Of course, the recent years
have seen several strides in non-convex optimization as well due to areas such
as deep learning, robust learning, sparse learning gaining prominence.

B.1 Convex Set

Given a set of points (or a region) $C \subset \mathbb{R}^d$, we call
this set or region a convex set if the set contains
all line segments that join two points inside that
set. More formally, for a set $C$ to be convex, no
matter which two points we take in the set $x, y \in C$,
for every $\lambda \in [0, 1]$, we must have $z \in C$ where
$z = \lambda \cdot x + (1 - \lambda) \cdot y$. It is noteworthy that
the vectors $z$ defined this way completely capture
all points on the line segment joining $x$ and $y$.
Indeed, $\lambda = 1$ gives us $z = x$, $\lambda = 0$ gives
us $z = y$, and $\lambda = 0.5$ gives us the midpoint of the
line segment. It is noteworthy however, that we
must have $\lambda \in [0, 1]$. If $\lambda$ starts taking negative
values or values greater than 1, then we would
start getting points outside the line segment.
For well behaved sets, in order to confirm convexity, it is sufficient to verify
that $z = \frac{x+y}{2} \in C$, i.e. we need not take the trouble of verifying for all
$\lambda \in [0, 1]$ and simply verifying mid-point convexity is enough to verify convexity
(we do still need to verify this for all $x, y \in C$ though). It is also important to
note that non-convex sets, such as the one depicted in the figure, may contain
some of the line segments that join points within them – this does not make
the set convex! Only if a set contains all its line segments is it called convex.
The reader would have noticed that convex sets bulge outwards in all directions.
The presence of any inward bulges typically makes a set non-convex.

Theorem B.1. Given two convex sets $C_1, C_2 \subset \mathbb{R}^d$, the intersection of these
two sets, i.e. $C_1 \cap C_2$, is always convex. However, the union of these two sets, i.e.
$C_1 \cup C_2$, need not be convex.

Proof. We first deal with the case of intersection. The intersection of two sets
(not necessarily convex) is defined to be the set of all points that are contained
in both the sets, i.e. $C_1 \cap C_2 \triangleq \{x \in \mathbb{R}^d : x \in C_1 \text{ and } x \in C_2\}$. Consider two
points $x, y \in C_1 \cap C_2$. Since $x, y \in C_1$, we know that $z = \frac{x+y}{2} \in C_1$ since $C_1$ is
convex. However, by the same argument, we get that $z = \frac{x+y}{2} \in C_2$ as well.
Since $z \in C_1$ and $z \in C_2$, we conclude that $z \in C_1 \cap C_2$. This proves that the
intersection of any two convex sets must necessarily be convex. The first figure
above illustrates the intersection region of two convex sets.
The union of two sets (not necessarily convex) is defined to be the set of all
points that are contained in either of the sets (including points that are present
in both sets). More specifically, we define $C_1 \cup C_2 \triangleq \{x \in \mathbb{R}^d : x \in C_1 \text{ or } x \in C_2\}$.
The second figure above shows that the union of two convex sets may be non-convex.
However, the union of two convex sets may be convex in some very
special cases, for example, if one set is contained in the other, i.e. $C_1 \subseteq C_2$,
which is illustrated in the third figure.

Example B.1. Are rectangles convex? Let
$R \triangleq \{(x, y) \in \mathbb{R}^2 : x \in [1, 3] \text{ and } y \in [1, 2]\}$
be a rectangle with side lengths 1 and 2.
We could show $R$ to be convex directly
as we do in the example below. However,
there exists a neater way. Consider the
bands $B_x \triangleq \{(x, y) \in \mathbb{R}^2 : x \in [1, 3]\}$ and
$B_y \triangleq \{(x, y) \in \mathbb{R}^2 : y \in [1, 2]\}$. It is easy to
see that $R = B_x \cap B_y$. Thus, if we show
that the bands are convex, we could then use Theorem B.1 to show that $R$ is
convex too! Showing that $B_x$ is convex is pretty easy: if two points have their
$x$ coordinate in the range $[1, 3]$, then the average of those two points clearly
satisfies this as well. This establishes that $B_x$ is convex. Similarly $B_y$ is convex,
which tells us that $R$ is convex.

Example B.2. Consider the set of all points which are at a Euclidean distance of
at most 1 from the origin, i.e. the unit ball $B_2(1) \triangleq \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$. To
show that this set is convex, we take $x, y \in B_2(1)$ and consider $z = \frac{x+y}{2}$. Now,
instead of showing $\|z\|_2 \le 1$ (which will establish convexity), we will instead
show $\|z\|_2^2 \le 1$, which is equivalent but easier to analyze. We have
$\|z\|_2^2 = \left\|\frac{x+y}{2}\right\|_2^2 = \frac{\|x\|_2^2 + \|y\|_2^2 + 2x^\top y}{4}$. Now, recall that the Cauchy-Schwarz inequality
tells us that for any two vectors $a, b$ we have $a^\top b \le \|a\|_2 \|b\|_2$. Thus, we
have $\|z\|_2^2 \le \frac{\|x\|_2^2 + \|y\|_2^2 + 2\|x\|_2\|y\|_2}{4}$. Since $x, y \in B_2(1)$, we have $\|x\|_2, \|y\|_2 \le 1$,
which gives us $\|z\|_2^2 \le 1$, which establishes that the unit ball $B_2(1)$ is a convex set.
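A quick numeric companion to this proof, as a Python sketch (the dimension and sample count below are our own arbitrary choices): midpoints of random pairs of points drawn from the unit ball never leave it.

    # Sample points in the unit ball and verify mid-point convexity.
    import numpy as np

    rng = np.random.default_rng(0)

    def ball_point(d=3):
        v = rng.normal(size=d)
        return v / np.linalg.norm(v) * rng.uniform() ** (1 / d)

    ok = all(
        np.linalg.norm((ball_point() + ball_point()) / 2) <= 1
        for _ in range(10_000)
    )
    print(ok)   # True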

B.2 Convex Functions

We now move on to convex functions. These functions play an important role


in several optimization based machine learning algorithms such as SVMs and
logistic regression. There exist several definitions of convex functions, some
that apply only to differentiable functions, and others that apply even to non-
differentiable functions. We look at these below.

B.2.1 Epigraph Convexity


This is the most fundamental definition of convexity and applies to all functions,
whether they are differentiable or not. This definition is also quite neat
in that it simply uses the definition of convex sets to define convex functions.

Definition B.1 (Epigraph). Given a function $f : \mathbb{R}^d \to \mathbb{R}$, its epigraph is defined
as the set of points that lie on or above the graph of the function, i.e.
$\text{epi}(f) \triangleq \{(x, y) \in \mathbb{R}^{d+1} : y \ge f(x)\}$. Note that the epigraph is a
$(d+1)$-dimensional set and not a $d$-dimensional set.

Definition B.2 (Epigraph Convexity). A function $f : \mathbb{R}^d \to \mathbb{R}$ is defined to be
convex if its epigraph $\text{epi}(f) \subset \mathbb{R}^{d+1}$ is a convex set. The accompanying
figures show a non-convex function whose epigraph is a non-convex set
(notice the inward bulge) and a convex function whose epigraph is a convex set.

B.2.2 Chord Convexity


The above definition, although fundamental, is not used very often since there
exist simpler definitions. One of these definitions exploits the fact that convexity
of the epigraph set need only be verified at the lower boundary of the set,
i.e. at the surface of the function graph. Applying the mid-point definition of
convex sets then gives us this new definition of convex functions.

Definition B.3 (Chord). Given a function $f : \mathbb{R}^d \to \mathbb{R}$ and any two points $x, y \in \mathbb{R}^d$,
the line segment joining the two points $(x, f(x))$ and $(y, f(y))$ is called a
chord of this function.

Definition B.4 (Chord Convexity). A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if and only if the
function graph lies below all its chords. Using the mid-point definition, this is
equivalent to saying that a function is convex if and only if for any two points
$x, y \in \mathbb{R}^d$ we have $f\left(\frac{x+y}{2}\right) \le \frac{f(x) + f(y)}{2}$.

The figures depict a convex function that lies below all its chords and a
non-convex function which does not do so. It is also important to note that
a non-convex function may lie below some of its chords (as the figure on the
bottom shows) – this does not make the function convex! Only if a function
lies below all its chords is it called convex.
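To make the mid-point test concrete, here is a small numerical illustration (our own sketch, assuming NumPy; a random search like this can only ever disprove convexity, never prove it). It finds violated chords for sin but none for x².

    import numpy as np

    def find_chord_violation(f, lo, hi, trials=10_000, seed=0):
        # Searches for a pair (x, y) with f((x+y)/2) > (f(x)+f(y))/2,
        # i.e. a chord that the graph pokes above.
        rng = np.random.default_rng(seed)
        for _ in range(trials):
            x, y = rng.uniform(lo, hi, size=2)
            if f((x + y) / 2) > (f(x) + f(y)) / 2 + 1e-12:
                return (x, y)  # witness that f is not convex
        return None

    print(find_chord_violation(lambda t: t**2, -5, 5))  # None: no violation found
    print(find_chord_violation(np.sin, -5, 5))          # some pair (x, y): sin is not convex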

B.2.3 Tangent Convexity

This definition holds true only for differentiable functions but is usually easier
to apply when checking whether a function is convex or not.
Definition B.5 (Tangent). Given a differentiable function f : R^d → R, the tangent of
the function at a point x0 ∈ R^d is the hyperplane y = ∇f(x0)ᵀ(x − x0) + f(x0), i.e.
the graph of the affine function wᵀx + b where w = ∇f(x0)
and b = f(x0) − ∇f(x0)ᵀx0.

Definition B.6 (Tangent Convexity). A differentiable function f : R^d → R is convex
if and only if the function graph lies above all its tangents, i.e. for all x0, x ∈ R^d, we
have f(x) ≥ ∇f(x0)ᵀ(x − x0) + f(x0).
Note that the point (x0, f(x0)) always lies on the tangent hyperplane at x0.

The figures above depict a convex function that lies above all its tangents and
a non-convex function which fails to lie above at least one of its tangents. It
is important to note that non-convex functions may lie above some of their
tangents (as the figure on the bottom shows) – this does not make the function
convex! Only if a function lies above all its tangents is it called convex.
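The tangent test is just as easy to probe numerically. The sketch below (ours, assuming NumPy) checks on a grid whether the graph stays above the tangents drawn at several anchor points; again, such a check can only refute convexity, not certify it.

    import numpy as np

    def stays_above_tangents(f, grad, anchors, xs, tol=1e-12):
        # Tangent convexity test: f(x) >= f(x0) + f'(x0) * (x - x0)
        # for every anchor x0 and every test point x.
        return all(np.all(f(xs) >= f(x0) + grad(x0) * (xs - x0) - tol)
                   for x0 in anchors)

    xs = np.linspace(-5, 5, 1001)
    anchors = np.linspace(-4, 4, 17)
    print(stays_above_tangents(lambda t: t**2, lambda t: 2 * t, anchors, xs))  # True
    print(stays_above_tangents(np.sin, np.cos, anchors, xs))                   # False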
It is also useful to clarify that the epigraph and chord definitions of convexity
continue to apply here as well. It is just that the tangent definition is
easier to use in several cases. A rough analogy is that of deciding the income
of individuals – although we can find out the total income of any citizen of
India, it may be tedious to do so. However, it is much easier to find the income
of a person if that person files income tax returns (truthfully, of course).

B.2.4 Hessian Convexity


For doubly differentiable functions, we have an even simpler definition of convexity.
A doubly differentiable function f : R^d → R is convex if and only if its
Hessian is positive semi-definite at all points, i.e. ∇²f(x0) ⪰ 0 for all x0 ∈ R^d.
Recall that this implies that for all v ∈ R^d, we have vᵀ∇²f(x0)v ≥ 0.
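As a quick illustration (ours, not from the original notes), consider the quadratic f(x) = ½xᵀAx + bᵀx with a symmetric matrix A. Its Hessian is ∇²f(x) = A at every point, so f is convex exactly when A ⪰ 0, which is easy to check numerically via eigenvalues:

    import numpy as np

    def is_psd(H, tol=1e-10):
        # A symmetric matrix is PSD iff all of its eigenvalues are >= 0.
        return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

    M = np.random.default_rng(0).normal(size=(4, 4))
    A_gram = M.T @ M   # Gram matrices are always PSD
    A_sym  = M + M.T   # symmetric, but with no sign guarantee

    print(is_psd(A_gram))  # True: x -> 0.5 x^T A_gram x + b^T x is convex
    print(is_psd(A_sym))   # almost surely False for random M: not convex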

B.2.5 Concave Functions


Concave functions are defined as those whose negative is a convex function,
i.e. f : R^d → R is defined to be concave if the function −f is convex. Convex
functions typically look like upright cups (think of the function f(x) = x²,
which is convex and looks like a right-side-up cup). Concave functions, on the
other hand, look like inverted cups, for example f(x) = −x². To check whether
a function is concave or not, we need to simply check (using the epigraph,
chord, tangent, or Hessian methods) whether the negative of that function is
convex or not.

Example B.3. Let us look at the example of the Euclidean norm f(x) =
‖x‖₂ = √(xᵀx). This function is non-differentiable at the origin, i.e. at x = 0, so
we have to use the chord definition of convexity. Given two points x, y ∈ R^d, we
have f((x + y)/2) = ‖(x + y)/2‖₂ = ½‖x + y‖₂ by using the homogeneity property of the
Euclidean norm (if we halve a vector, its length gets halved too). However,
recall that the triangle inequality tells us that for any two vectors p, q, we
have ‖p + q‖₂ ≤ ‖p‖₂ + ‖q‖₂. This gives us

f((x + y)/2) ≤ (‖x‖₂ + ‖y‖₂)/2 = (f(x) + f(y))/2,

which proves the convexity of the norm.

B.3 Operations with Convex Functions

We can take convex functions and manipulate them to obtain new convex
functions. Here we explore some such operations that are useful in machine
learning applications.

1. Affine functions are always convex (see Exercise B.6).

2. Scaling a convex function by a positive scale factor always yields a convex
   function (see Exercise B.7).

3. The sum of two convex functions is always convex (see Theorem B.2).

4. If g : R^d → R is a (multivariate) convex function and f : R → R is a
   (univariate) convex and non-decreasing function, i.e. a ≤ b ⇒ f(a) ≤ f(b),
   then the function h ≜ f ◦ g, i.e. h(x) = f(g(x)), is also convex (see
   Theorem B.3).

5. If f : R → R is a (univariate) convex function then for any a ∈ R^d, b ∈ R,
   the (multivariate) function g : R^d → R defined as g(x) = f(aᵀx + b) is
   always convex (see Theorem B.4).

6. If f, g : R^d → R are two (multivariate) convex functions then the function
   h ≜ max{f, g} is also convex (see Theorem B.5).

Theorem B.2 (Sum of Convex Functions). Given two convex functions f, g :
R^d → R, the function h ≜ f + g is always convex.

Proof. We will use the chord definition of convexity here since there is no
surety that f and g are differentiable. Consider two points x, y ∈ R^d. We have

h((x + y)/2) = f((x + y)/2) + g((x + y)/2)
             ≤ (f(x) + f(y))/2 + (g(x) + g(y))/2
             = ((f(x) + g(x)) + (f(y) + g(y)))/2 = (h(x) + h(y))/2,

where in the second step, we used the fact that f and g are both convex. This
proves that h is convex by the chord definition of convexity.

Theorem B.3 (Composition of Convex Functions). Suppose g : R^d → R is a
(multivariate) convex function and f : R → R is a (univariate) convex and
non-decreasing function, i.e. a ≤ b ⇒ f(a) ≤ f(b). Then the function h ≜ f ◦ g,
i.e. h(x) = f(g(x)), is always convex.

Proof. We will use the chord definition of convexity here since there is no
surety that f and g are differentiable. Consider two points x, y ∈ R^d. We have

h((x + y)/2) = f(g((x + y)/2)).

Now, since g is convex, we have

g((x + y)/2) ≤ (g(x) + g(y))/2.

Let us denote the left hand side of the above inequality by p and the right
hand side by q for the sake of notational simplicity. Thus, the above inequality tells
us that p ≤ q. However, since f is non-decreasing, we get f(p) ≤ f(q), i.e.

f(g((x + y)/2)) ≤ f((g(x) + g(y))/2).

Let us denote u ≜ g(x) and v ≜ g(y) for the sake of notational simplicity. Since f
is convex, we have

f((u + v)/2) ≤ (f(u) + f(v))/2.

This is the same as saying

f((g(x) + g(y))/2) ≤ (f(g(x)) + f(g(y)))/2 = (h(x) + h(y))/2.

Thus, with the chain of inequalities established above, we have shown that

h((x + y)/2) ≤ (h(x) + h(y))/2,

which proves that h is a convex function.

Theorem B.4 (Convex Wrappers over Affine Functions). If f : R → R is a
(univariate) convex function then for any a ∈ R^d, b ∈ R, the (multivariate)
function g : R^d → R defined as g(x) = f(aᵀx + b) is always convex.

Proof. We will yet again use the chord definition of convexity. Consider two
points x, y ∈ R^d. We have

g((x + y)/2) = f(aᵀ(x + y)/2 + b)
             = f(((aᵀx + b) + (aᵀy + b))/2)
             ≤ (f(aᵀx + b) + f(aᵀy + b))/2 = (g(x) + g(y))/2,

where in the second step we used the linearity of the dot product, i.e. cᵀ(a +
b) = cᵀa + cᵀb, and in the third step we used convexity of f. This shows that
the function g is convex.

Theorem B.5 (Maximum of Convex Functions). If f, g : R^d → R are two (multivariate)
convex functions then the function h ≜ max{f, g} is also convex.

Proof. To prove this result, we will need
the following simple monotonicity property
of the max function: let a, b, c, d ∈ R be four
real numbers such that a ≤ c and b ≤ d.
Then we must have max{a, b} ≤ max{c, d}.
This can be shown in a stepwise manner
(max{a, b} ≤ max{c, b} ≤ max{c, d}). Now,
consider two points x, y ∈ R^d. We have

h((x + y)/2) = max{ f((x + y)/2), g((x + y)/2) }
             ≤ max{ (f(x) + f(y))/2, (g(x) + g(y))/2 }
             ≤ max{ (max{f(x), g(x)} + max{f(y), g(y)})/2, (g(x) + g(y))/2 }
             ≤ max{ (max{f(x), g(x)} + max{f(y), g(y)})/2, (max{f(x), g(x)} + max{f(y), g(y)})/2 }
             = (max{f(x), g(x)} + max{f(y), g(y)})/2 = (h(x) + h(y))/2,

where in the second step, we used the fact that f, g are convex functions
and the monotonicity property of the max function. The third and the fourth
steps also use the monotonicity property. The fifth step uses the fact that
max{a, a} = a. This proves that h is a convex function.


Example B.4. The functions f(x) = ln(x) and g(x) = √x are concave. Since
both of these are doubly differentiable functions, we may use the Hessian
definition to decide their concavity. Recall that a function is concave if and
only if its negation is convex. Let p(x) = −ln(x). Then p''(x) = 1/x² ≥ 0 for all
x > 0. This confirms that p(x) is convex and that ln(x) is concave. Similarly,
define q(x) = −√x. Then q''(x) = 1/(4x√x) ≥ 0 for all x > 0, which confirms that
q(x) is convex and that √x is concave.

Example B.5. Let us show that the squared Euclidean norm, i.e. the function
h(x) = ‖x‖₂², is convex. We have already shown above that the function g(x) =
‖x‖₂ is convex. We can write h(x) = f(g(x)) where f(t) = t². Now, f''(t) =
2 > 0, i.e. f is convex by applying the Hessian rule for convexity. Also, ‖x‖₂ ≥ 0
for all x ∈ R^d and the function f is indeed non-decreasing on the non-negative
half of the real line, which is where the values of g live, so the argument of
Theorem B.3 goes through. Thus, Theorem B.3 tells us that h(x) is convex.

Example B.6. Let us show that the hinge loss ℓ_hinge(t) = max{1 − t, 0} is a
convex function. Note that the hinge loss function is treated as a univariate
function here, i.e. ℓ_hinge : R → R. Exercise B.6 shows us that affine functions
are convex. Thus f(t) = 1 − t and g(t) = 0 are both convex functions. Thus,
by applying Theorem B.5, we conclude that the hinge loss function is convex.
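The same construction can be written out in code. The sketch below (ours, assuming NumPy) builds the hinge loss exactly as in the argument above, as the max of an affine piece and the zero function, and then spot-checks the mid-point inequality on random pairs.

    import numpy as np

    # Two affine (hence convex) pieces combined with max (Theorem B.5).
    hinge = lambda t: np.maximum(1 - t, 0.0)

    rng = np.random.default_rng(0)
    t1, t2 = rng.uniform(-5, 5, size=(2, 10_000))
    assert np.all(hinge((t1 + t2) / 2) <= (hinge(t1) + hinge(t2)) / 2 + 1e-12)
    print("hinge loss passed the mid-point convexity check")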

Example B.7. We will now show that the objective function used in the C-SVM
formulation is a convex function of the model vector w. For the sake of
simplicity, we will show this result without the bias parameter b, although we
stress that the result holds even if the bias parameter is present (recall that the
bias can always be hidden inside the model by adding a fake dimension to
the data). Let {(x^i, y^i)}_{i=1}^n be n data points with x^i ∈ R^d and y^i ∈ {−1, 1}.
Denote z^i ≜ y^i · x^i for the sake of notational simplicity. The C-SVM objective
function is reproduced below:

f_C-SVM(w) = (1/2)‖w‖₂² + Σ_{i=1}^n ℓ_hinge(y^i · wᵀx^i) = (1/2)‖w‖₂² + Σ_{i=1}^n ℓ_hinge(wᵀz^i)

Note that the feature vectors x^i, i = 1, . . . , n and the labels y^i, i = 1, . . . , n
(and hence the vectors z^i) are treated as constants since we cannot change
our training data. The only variable here is the model vector w, which we
learn using the training data. We have already shown that ‖w‖₂² is a convex
function of w; Exercise B.7 shows that (1/2)‖w‖₂² is convex too. We showed
above that ℓ_hinge is a convex function and thus Theorem B.4 shows that
h_i(w) = ℓ_hinge(wᵀz^i) is a convex function of w for every i. Theorem B.2
shows that the sum of convex functions is convex, which shows that f_C-SVM(w)
is a convex function.
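Since the whole argument is about the objective as a function of w, it is easy to spot-check numerically. The sketch below (ours; random synthetic data, NumPy assumed; a sanity check rather than a proof) evaluates the bias-free C-SVM objective and verifies the mid-point inequality for random pairs of models.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 50, 5
    X = rng.normal(size=(n, d))            # feature vectors x^i (synthetic)
    y = rng.choice([-1.0, 1.0], size=n)    # labels y^i
    Z = y[:, None] * X                     # z^i = y^i * x^i

    def f_csvm(w):
        # 0.5 ||w||^2 + sum_i hinge(w^T z^i), exactly as in Example B.7
        return 0.5 * w @ w + np.sum(np.maximum(1 - Z @ w, 0.0))

    for _ in range(1000):
        w1, w2 = rng.normal(size=(2, d))
        assert f_csvm((w1 + w2) / 2) <= (f_csvm(w1) + f_csvm(w2)) / 2 + 1e-9
    print("C-SVM objective passed the mid-point convexity check")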

B.4 Exercises

Exercise B.1. Let A be a positive semi-definite matrix and let us define a
Mahalanobis distance using A as dA(x, y) = √((x − y)ᵀA(x − y)). Consider
the unit ball according to this distance, i.e. the set of all points that are at
most unit Mahalanobis distance from the origin, i.e. BA(1) ≜
{x ∈ R^d : dA(x, 0) ≤ 1}. Show that BA(1) is a convex set.

Exercise B.2. Consider the hyperplane given by the equation wᵀx + b = 0, i.e.
H ≜ {x ∈ R^d : wᵀx + b = 0}, where w is the normal vector to the hyperplane
and b is the bias term. Show that H is a convex set.
Exercise B.3. If I take a convex set and shift
it, does it remain convex? Let C ⊂ R^d be a
convex set and let v ∈ R^d be any vector
(whether “small” or “large”). Define C + v ≜
{x : x = z + v for some z ∈ C}. Show that the
set C + v will always be convex, no matter what
v or convex set C we choose.

Exercise B.4. If I take a convex set and scale it,
does it remain convex? Let C ⊂ R^d be a convex set
and let me scale dimension j using a scaling factor
s_j > 0, i.e. for a vector x, the scaled vector is x̃
where x̃_j = s_j · x_j. We can represent this operation
using a diagonal matrix S ∈ R^{d×d} where S_jj = s_j
and S_ij = 0 if i ≠ j, i.e. x̃ = Sx. In the figure to
the left, the x axis has been scaled up (expanded) by 33%, i.e. s_1 = 1.333, and
the y axis has been scaled down (shrunk) by 33%, i.e. s_2 = 0.667. Thus, in
this example S = diag(1.333, 0.667). Define SC ≜ {x : x = Sz for some z ∈ C}.
Show that the set SC will always be convex, no matter what positive scaling
factors or convex set C we choose. Does this result hold even if (some of) the
scaling factors are negative? What if some of the scaling factors are zero?
Exercise B.5. Above, we saw two operations (shifting a.k.a translation and
scaling) that keep convex sets convex. However, can these operations turn a
non-convex set into a convex set i.e. can there exist a non-convex set C such
that C + v is convex or else SC is convex when all scaling factors are non-zero?
What if some (or all) scaling factors are zero?
Exercise B.6. Show that affine functions are always convex, i.e. for any a ∈
R^d, b ∈ R, the function f(x) = aᵀx + b is a convex function. Next, show that
affine functions are always concave as well. In fact, affine functions are the
only functions that are both convex as well as concave.
Exercise B.7. Show that convex functions, when scaled by a positive constant,
remain convex, i.e. for a convex function f : R^d → R and any c > 0, the function
g = c · f is also convex. Next, show that if c < 0 then g is concave. What
happens if c = 0? Does g become convex or concave?
Exercise B.8. The logistic loss function is very popular in machine learning
and is defined as ℓ_logistic(t) = ln(1 + exp(−t)). Show that, given data points
{(x^i, y^i)}_{i=1}^n (which are to be treated as constants), the function f(w) ≜
Σ_{i=1}^n ℓ_logistic(y^i · wᵀx^i) is a convex function of w.
C Probability Theory Refresher

Example C.1. Dr. Strange is trying to analyze all possible outcomes of the
Infinity War using the Time Stone but there are way too many of them so
he analyzes only 75% of the n possible outcomes. The Avengers then sit down
to discuss the outcomes. Yet again, since there are so many of them, they
discuss only 25% of the n outcomes (much fewer outcomes are discussed by the
Avengers than were analyzed by Dr Strange since by now Thanos has snatched
away the Time Stone). They may discuss some outcomes that Dr. Strange
has already analyzed as well as some outcomes that he has not analyzed. It is
known that of the outcomes that were analyzed by Dr. Strange, only 30% got
discussed. Given this, can we find out what fraction of the discussed outcomes
were previously analyzed by Dr. Strange?
This problem may not seem like a probability problem to begin with. However,
if we recall the interpretation of probability in terms of proportions, then
it is easy to see this as a probability problem and also apply powerful tools
from probability theory. When we say that Dr. Strange analyzes only 75% of
the outcomes, this is just another way of saying that if we pick one of the
n outcomes uniformly at random (i.e. each outcome gets picked with equal
probability 1/n), then there is a 3/4 probability that it would be an outcome that
was analyzed by Dr. Strange.

With this realization in mind, we set up our probability space properly.
Our sample space is [n], consisting of the n possible outcomes. Each outcome
is equally likely to be picked, i.e. each outcome gets picked with probability 1/n.
We now define two indicator variables A and D. A = 1 if the chosen outcome
was analyzed by Dr. Strange and A = 0 otherwise. D = 1 if the chosen outcome
was discussed by the Avengers and D = 0 otherwise.
The problem statement tells us that P[A = 1] = 3/4 (since 75% of outcomes
were analyzed) and P[D = 1] = 1/4 (since 25% of outcomes were discussed). The
problem statement also tells us that of the analyzed outcomes, only 30% were
discussed, which means that the number of outcomes that were both discussed
and analyzed, divided by the number of outcomes that were analyzed, is 3/10, i.e.
P[D = 1, A = 1]/P[A = 1] = 3/10. This is just another way of saying that P[D = 1 | A = 1] = 3/10.
The problem asks us to find the fraction of discussed outcomes that were
analyzed, i.e. we want the number of outcomes that were both analyzed and
discussed, divided by the number of discussed outcomes. This is nothing but
P[A = 1 | D = 1]. Applying the Bayes theorem now tells us

P[A = 1 | D = 1] = P[D = 1 | A = 1] · P[A = 1] / P[D = 1] = (3/10 · 3/4) / (1/4) = 9/10,

which means that 90% of the discussed outcomes were previously analyzed by
Dr. Strange. This is not surprising given that Dr. Strange analyzed so many
more outcomes than the Avengers were able to discuss.
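The arithmetic is simple enough to mirror in a few lines of code. The snippet below (our own addition, using Python's standard fractions module) reproduces the Bayes computation exactly.

    from fractions import Fraction

    p_A = Fraction(3, 4)           # P[A = 1]: outcome analyzed by Dr. Strange
    p_D = Fraction(1, 4)           # P[D = 1]: outcome discussed by the Avengers
    p_D_given_A = Fraction(3, 10)  # P[D = 1 | A = 1]

    # Bayes theorem: P[A = 1 | D = 1] = P[D = 1 | A = 1] P[A = 1] / P[D = 1]
    p_A_given_D = p_D_given_A * p_A / p_D
    print(p_A_given_D)  # 9/10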

Example C.2 (Problem 6.4 from Deisenroth et al. (2019)). There are two bags.
The first bag contains four mangoes and two apples; the second bag contains
four mangoes and four apples. We also have a biased coin, which shows “heads”
with probability 0.6 and “tails” with probability 0.4. If the coin shows “heads”,
we pick a fruit at random from bag 1; otherwise we pick a fruit at random
from bag 2. Your friend flips the coin (you cannot see the result), picks a fruit
uniformly at random from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2? (In this
statement, the word uniformly was added; it was not present in Deisenroth
et al. (2019). It is there to emphasize the nature of the random choice of fruit
from a bag.)

To solve the above problem, let us define some random variables. Let
B ∈ {1, 2} denote the bag from which the fruit is picked and let F ∈ {M, A}
denote which fruit is selected. (Note that we used a non-numeric support
{M, A} for the random variable F merely for the sake of easy identification. We
can easily make this support numeric, say, by mapping M = 1 and A = 0.) The
problem statement tells us the following: P[B = 1] = 0.6 and P[B = 2] = 0.4
since the outcome of the coin flip completely decides which bag we choose.
Now, suppose we knew that the fruit was being sampled from bag 1; then,
interpreting probabilities as proportions (since fruits are chosen uniformly at
random from a bag), we have P[F = M | B = 1] = 4/(4+2) = 2/3 and
P[F = A | B = 1] = 1 − P[F = M | B = 1] = 1/3. Similarly, we have
P[F = M | B = 2] = 4/(4+4) = 1/2 = P[F = A | B = 2].

We are told that the fruit that was picked was indeed a mango and are
interested in knowing the chances that it was picked from bag 2. Thus, we are
interested in P[B = 2 | F = M]. Applying the Bayes theorem gives us

P[B = 2 | F = M] = P[F = M | B = 2] · P[B = 2] / P[F = M]

We directly have values for the two terms in the numerator. The denominator,
however, will have to be calculated by deriving the marginal probability
P[F = M] from the joint probability distribution P_{F,B} using the sum rule (law
of total probability) and the product rule:

P[F = M] = P[F = M, B = 1] + P[F = M, B = 2]
         = P[F = M | B = 1] · P[B = 1] + P[F = M | B = 2] · P[B = 2]
         = (2/3) · (3/5) + (1/2) · (2/5) = 3/5,

where in the first step we used the sum rule and in the second step we used
the product rule. Putting things together gives us

P[B = 2 | F = M] = (1/2 · 2/5) / (3/5) = 1/3.

Thus, there is a 1/3 probability that the mango we got was picked from bag 2.
The complement rule for conditional probability tells us that P[B = 1 | F = M] =
1 − P[B = 2 | F = M] = 2/3, i.e. there is a much larger, 2/3 probability that the
mango we got was picked from bag 1. This is to be expected since not only is
bag 1 more likely to be picked up than bag 2, bag 1 also has a much larger
proportion of mangoes than bag 2, which means that if we got a mango, it is
more likely that it came from bag 1.
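A small Monte Carlo simulation (our own sketch, not part of the original notes) offers an independent check: simulate the coin flip and the fruit draw many times and look at the fraction of mango draws that came from bag 2. The estimate should hover around 1/3.

    import random

    random.seed(0)
    bags = {1: ["M"] * 4 + ["A"] * 2,   # bag 1: four mangoes, two apples
            2: ["M"] * 4 + ["A"] * 4}   # bag 2: four mangoes, four apples

    mango_draws = from_bag2 = 0
    for _ in range(1_000_000):
        bag = 1 if random.random() < 0.6 else 2   # biased coin picks the bag
        fruit = random.choice(bags[bag])          # uniform draw from that bag
        if fruit == "M":
            mango_draws += 1
            from_bag2 += (bag == 2)

    print(from_bag2 / mango_draws)  # approximately 1/3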
Example C.3. Let A, B denote two events such that P [A | B] = 0.36 = P [A].
Can we find P [A | ¬B] i.e. the probability that event A will take place given
that event B has not taken place?
Let us abuse notation to let A, B also denote the indicator variables for
the events, i.e. A = 1 if A takes place and A = 0 otherwise, and similarly for
B. Note that the problem statement has essentially told us that A and B are
independent events. Since P[A | B] = P[A], by abusing notation we get

P[A = 1, B = 1] / P[B = 1] = P[A = 1],

i.e. P[A = 1, B = 1] = P[A = 1] · P[B = 1]. Now, we are interested in the probability
P[A = 1 | B = 0] = P[A = 1, B = 0]/P[B = 0]. Since we have not been given P[B = 0]
directly, we try to massage the numerator to see if we can get hold of something.

P [A = 1, B = 0] = P [A = 1] − P [A = 1, B = 1]
= P [A = 1] − P [A = 1] · P [B = 1]
= P [A = 1] (1 − P [B = 1])
= P [A = 1] P [B = 0] ,

where in the first step we used the sum rule (law of total probability), in the
second step, we exploited the independence of the two events and in the last
step, we used the complement rule. This gives us
P[A = 1 | B = 0] = P[A = 1, B = 0] / P[B = 0] = P[A = 1] · P[B = 0] / P[B = 0] = P[A = 1] = 0.36,
since the problem statement already tells us that P [A = 1] = 0.36.
Example C.4. Timmy is trying to kill some free time by typing random
letters on his keyboard. He types 7 random capital letters (A - Z) on
his keyboard. Each letter is chosen uniformly at random from the 26 letters and
each choice is completely independent of the other choices. Can we find the
probability that Timmy will end up typing the word COVFEFE?

Let L_i, i ∈ [7] denote the random variable that tells us which letter was
chosen at the i-th location in the word. As before, we will let the support of
the random variables L_i be the letters of the English alphabet rather than
numbers, for the sake of easy identification. We can readily map the letters of the
alphabet to [26] to have numerical supports instead. We are interested in

P[L1 = C, L2 = O, L3 = V, L4 = F, L5 = E, L6 = F, L7 = E]

However, since the choices were made independently, applying the product
rule for independent random variables tells us that the above is equal to

P[L1 = C] · P[L2 = O] · P[L3 = V] · P[L4 = F] · P[L5 = E] · P[L6 = F] · P[L7 = E]

Since each letter is chosen uniformly at random from the 26 letters of the
alphabet, the above probability is simply (1/26)^7.
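For completeness, the number itself is tiny; it can be computed exactly with Python's standard fractions module (our own addition):

    from fractions import Fraction

    # Each of the 7 positions independently matches with probability 1/26.
    p = Fraction(1, 26) ** 7
    print(p, float(p))  # 1/8031810176, roughly 1.2e-10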

C.1 Exercises

Exercise C.1. Let Z denote a random variable which takes value 1 with prob-
ability p and value 2 with probability 1 − p. For what value of p ∈ [0, 1] does
this random variable have the highest variance?

Exercise C.2. Summers are here and people are buying soft drinks, say from
three brands – ThumsUp, Pepsi and Coke. Suppose the total sales of all the
brands put together is exactly 14 million units every day. Now, the market
share of the individual brands changes from day to day. However, it is also
known that ThumsUp sells an average of 8 million units per day and Coke
sells an average of 4 million units per day. How many units does Pepsi sell per
day on average?

Exercise C.3. Suppose A and B are two events such that P [A] = 0.125
whereas B is an almost sure event i.e. P [B] = 1. What is P [A | B]? What
is P [B | A]?

Exercise C.4. The probability of life on a planet given that there is water on
it is 80%. If there is no water on a planet then life cannot exist on that planet.
The probability of finding water on a planet is 50%. Planet A has life on it
whereas Planet B does not have life on it. What is the probability that Planet
A has water? What is the probability that Planet B has water on it?

Exercise C.5. My sister and I both try to sell comic books to our classmates
at school. Since our target audience is limited, it is known that on any given
day, the two of us together sell at most 10 comic books. It is known that I
sell 5 comic books on average every day. It is also known that my sister sells
3 comic books on average every day. How many comic books do the two of us
sell together on average every day?

Exercise C.6. Martin has submitted his thesis proposal. The probability that
professor X will approve the proposal is 70%. The probability that professor
Y will approve the proposal is 50%. The probability that professor Z will
approve the proposal is 40%. The approvals of the three professors are entirely
independent of one another. If Martin has to get approval from at least
two of the three professors to pass his proposal, what is the probability that
his proposal will pass?
References

Deisenroth, M. P., A. A. Faisal, and C. S. Ong (2019), ‘Mathematics for Machine Learning’.
https://mml-book.com/.
