Math Cheat Sheet
Below is a quick refresher on some math tools from 340 that we’ll assume knowledge of for the
PSets.
1 Basic Probability
1.1 Discrete random variables
A random variable is a variable whose value is uncertain (e.g. the roll of a die). If X is a random
variable that always takes non-negative, integer values (we'll refer to this as a discrete random
variable), then we can write the expected value of X as:
Definition of expected value, form 1: E[X] = \sum_{i=0}^{\infty} Pr[X = i] · i.
The above definition is probably already familiar to most of you. Another way to compute the
expected value (which sometimes results in simpler calculations) is:
Definition of expected value, form 2: E[X] = \sum_{i=0}^{\infty} Pr[X > i].
To see why the two forms agree, write:

\sum_{i=0}^{\infty} Pr[X > i] = \sum_{i=0}^{\infty} \sum_{j=i+1}^{\infty} Pr[X = j] = \sum_{j=0}^{\infty} \sum_{i=0}^{j-1} Pr[X = j] = \sum_{j=0}^{\infty} Pr[X = j] · j = E[X].

We obtain the second equality just by flipping the order of sums: the term Pr[X = j] is summed
once for every i < j. The third equality is obtained by just observing that there are exactly j non-
negative integers less than j, and the last equality is just form 1.
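If it helps, here is a minimal Python sketch comparing the two forms on a concrete distribution (a fair six-sided die, chosen arbitrarily):

    # Sketch: check that the two forms of E[X] agree, here for a fair six-sided die.
    from fractions import Fraction

    pmf = {i: Fraction(1, 6) for i in range(1, 7)}   # Pr[X = i] for i = 1..6

    # Form 1: E[X] = sum_i Pr[X = i] * i
    form1 = sum(p * i for i, p in pmf.items())

    # Form 2: E[X] = sum_{i >= 0} Pr[X > i]  (terms with i >= 6 are 0 and can be dropped)
    form2 = sum(sum(p for j, p in pmf.items() if j > i) for i in range(0, max(pmf)))

    print(form1, form2)   # both are 7/2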
1.2 Continuous random variables

Now suppose X is a continuous random variable taking non-negative real values, described by its
CDF F(x) = Pr[X ≤ x] and its PDF f(x).

• The PDF is just a formal way of discussing the probability that X = x. Because the random
variable is continuous, the probability that X = x is actually zero for all x (what is the
probability that you spend exactly 3.4284203 seconds reading this sentence?). So we think
of dx as being infinitesimally small (the same dx from your calculus classes), and think of
Pr[X = x] as f(x)dx.
So how do we take the expectation of a continuous random variable? We just need to map the
definitions above into the new language.
Definition of expected value, continuous random variables, form 1: E[X] = \int_0^{\infty} x f(x) dx.
You should parse this exactly the same way as form 1 for discrete random variables, except we've
replaced the sum with an integral, and Pr[X = i] is now "f(x)dx ≈ Pr[X = x]." The equivalent
definition for form 2 is also often easier to use in calculations:
Definition of expected value, continuous random variables, form 2: E[X] = \int_0^{\infty} (1 - F(x)) dx.
If F (x) = Pr[X ≤ x], then 1 − F (x) = Pr[X > x], so this is the same as form 2 for discrete
random variables, except we've replaced the sum with an integral. For form 2, it is crucial that the
lower limit of the integral is 0, even when the random variable only takes values (say) > 1. We'll
see this in the examples below.
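One way to see why the two continuous forms agree is the same sum-flipping trick, with the sums replaced by integrals:

E[X] = \int_0^{\infty} x f(x) dx = \int_0^{\infty} \left( \int_0^x dt \right) f(x) dx = \int_0^{\infty} \int_t^{\infty} f(x) dx dt = \int_0^{\infty} Pr[X > t] dt = \int_0^{\infty} (1 - F(t)) dt.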
1.3 Examples
Consider the uniform distribution on the set {4, 5} (4 w.p. 1/2, 5 w.p. 1/2). Then the expected value
as computed by form 1 is:
\sum_{i=0}^{\infty} Pr[X = i] · i = 4 · (1/2) + 5 · (1/2) = 4.5.
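Form 2 gives the same answer; note that the sum really does start at i = 0, even though X is never smaller than 4:

\sum_{i=0}^{\infty} Pr[X > i] = Pr[X > 0] + Pr[X > 1] + Pr[X > 2] + Pr[X > 3] + Pr[X > 4] = 1 + 1 + 1 + 1 + 1/2 = 4.5.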
Now consider the uniform distribution on the interval [4, 5] (equally likely to be any real number
in [4, 5]). Then the PDF associated with this distribution is f(x) = 1 for x ∈ [4, 5], and f(x) = 0 for
x ∉ [4, 5]. And we can compute the expected value by form 1 as:
\int_0^{\infty} x f(x) dx = \int_4^5 x dx = (x^2/2)|_4^5 = 25/2 - 8 = 4.5.
We can also compute it using form 2 as:
\int_0^{\infty} (1 - F(x)) dx = \int_0^4 1 dx + \int_4^5 (5 - x) dx + \int_5^{\infty} 0 dx = 4 + (5x - x^2/2)|_4^5 + 0 = 4 + 1/2 + 0 = 4.5.
Note that it is crucial that we started the integral at 0 and not 4 for form 2, otherwise we would
have incorrectly computed the expectation as .5 instead of 4.5. This isn’t crucial for form 1, since
all the terms in [0, 4] drop out anyway as f (x) = 0.
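If you ever want to sanity check a computation like this, a crude numerical integration is enough. Here is a minimal Python sketch for the Uniform[4, 5] example (the grid over [0, 10] and the step size 1e-4 are arbitrary choices); it also shows how starting form 2 at 4 instead of 0 gives the wrong answer:

    # Sketch: numerically check both forms of E[X] for X ~ Uniform[4, 5].
    step = 1e-4
    grid = [i * step for i in range(int(10 / step))]   # covers [0, 10); integrands are 0 past x = 5

    def f(x):   # PDF
        return 1.0 if 4 <= x <= 5 else 0.0

    def F(x):   # CDF, F(x) = Pr[X <= x]
        return min(max(x - 4.0, 0.0), 1.0)

    form1 = sum(x * f(x) * step for x in grid)                  # ~ 4.5
    form2 = sum((1 - F(x)) * step for x in grid)                # ~ 4.5
    truncated = sum((1 - F(x)) * step for x in grid if x >= 4)  # starting at 4 instead of 0: ~ 0.5

    print(form1, form2, truncated)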
Another useful tool is linearity of expectation: for any two random variables X_1 and X_2, we have
E[X_1 + X_2] = E[X_1] + E[X_2]. Here is one derivation for independent, non-negative, integer-valued
X_1 and X_2 (the identity in fact holds even without independence):

E[X_1 + X_2] = \sum_{i=0}^{\infty} Pr[X_1 + X_2 = i] · i
             = \sum_{i=0}^{\infty} \sum_{j=0}^{i} Pr[X_1 = j] · Pr[X_2 = i - j] · i
             = \sum_{j=0}^{\infty} \sum_{i=j}^{\infty} Pr[X_1 = j] · Pr[X_2 = i - j] · i
             = \sum_{j=0}^{\infty} \sum_{\ell=0}^{\infty} Pr[X_1 = j] · Pr[X_2 = \ell] · (\ell + j)     (changing variables with \ell = i - j)
             = \sum_{j=0}^{\infty} Pr[X_1 = j] · ( j + \sum_{\ell=0}^{\infty} Pr[X_2 = \ell] · \ell )
             = \sum_{j=0}^{\infty} Pr[X_1 = j] · (j + E[X_2])
             = E[X_1] + E[X_2].     (because \sum_{j=0}^{\infty} Pr[X_1 = j] = 1)
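Here is a minimal Python sketch that checks the identity exactly for two independent fair dice (any two small pmfs would do):

    # Sketch: verify E[X1 + X2] = E[X1] + E[X2] exactly for two independent fair dice.
    from fractions import Fraction
    from itertools import product

    pmf1 = {i: Fraction(1, 6) for i in range(1, 7)}
    pmf2 = {i: Fraction(1, 6) for i in range(1, 7)}

    def expectation(pmf):
        return sum(p * v for v, p in pmf.items())

    # Build the pmf of X1 + X2 by summing over all pairs (j, l), as in the derivation above.
    pmf_sum = {}
    for (j, pj), (l, pl) in product(pmf1.items(), pmf2.items()):
        pmf_sum[j + l] = pmf_sum.get(j + l, 0) + pj * pl

    print(expectation(pmf_sum), expectation(pmf1) + expectation(pmf2))   # both are 7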
2 Optimization

2.1 Single-variable, unconstrained optimization

Say we want to find the unconstrained global maximum of a differentiable function f(·). Any value
achieving the maximum must be a critical point (a point x where f'(x) = 0); not all critical points
are local maxima, but all local maxima are critical points. One also needs to confirm that f(·) indeed
achieves its global maximum, e.g. by examining its limits towards ±∞.

Say we want to find the global maximum of f(x) = 4x - x^2. The derivative is 4 - 2x, so there
is a unique critical point at x = 2. So if there is a global maximum, it must be x = 2. We can verify
that lim_{x→±∞} f(x) = -∞, so x = 2 must be the global maximum.[1]
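If you want to double-check a computation like this, here is a minimal sketch using sympy (assuming sympy is available; doing it by hand as above is of course fine):

    # Sketch: find the unconstrained maximum of f(x) = 4x - x^2 symbolically.
    import sympy as sp

    x = sp.symbols('x', real=True)
    f = 4 * x - x**2

    critical_points = sp.solve(sp.diff(f, x), x)
    print(critical_points)                                 # [2]

    # Confirm f really achieves a global maximum: it tends to -oo in both directions.
    print(sp.limit(f, x, sp.oo), sp.limit(f, x, -sp.oo))   # -oo -oo

    print(f.subs(x, critical_points[0]))                   # 4, the maximum value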
2.2 Single-variable, constrained optimization

Now say we want to find the maximum of a differentiable f(·) constrained to an interval [a, b]. Any
of the following approaches works:

• Find all critical points, compute f(a), f(b), and f(x) for all critical points x, and output the largest.
• Confirm that f'(a) > 0 (that is, f is increasing at a) and f'(b) < 0. This proves that neither
a nor b can be the global maximum. Then compute f(x) for all critical points x and output
the largest.
• In either of the above, rather than directly comparing f(x) to f(y), one can instead prove that
f'(z) ≥ 0 on the entire interval [x, y] to conclude that f(y) ≥ f(x).
• Prove that x is a global unconstrained maximum of f(·), and observe that x ∈ [a, b].
There are many other approaches. The point is that at the end of the day, you must directly or
indirectly compare all critical points and all endpoints. You don’t have to directly compute f (·)
at all of these values (the bullets above provide some shortcuts), but you must at least indirectly
compare them. For this class, it is OK to just describe your approach without writing down the
entire calculations (as in the following examples).
Say we want to find the constrained maximum of f(x) = x^2 on the interval [3, 8]. f has no
critical points in this range, so the maximum must be at either x = 3 or x = 8. f'(x) = 2x > 0 on this
entire interval, so the maximum must be at x = 8.
Say we want to find the constrained maximum of f(x) = 3x^2 - x^3 on the interval [-2, 3].
f'(x) = 6x - 3x^2, and therefore f has critical points at 0 and 2. So we need to (at least indirectly)
consider -2, 0, 2, 3. We see that f'(x) ≤ 0 on [-2, 0], so we can immediately rule out 0. We also
see that f'(x) ≤ 0 on [2, 3], so we can immediately rule out 3, and we only need to compare -2
and 2. We can also immediately see that f(-x) > f(x) for all x > 0, so f(-2) > f(2), and therefore
x = -2 is the global constrained maximum.
Say we want to find the constrained maximum of f(x) = 4x - x^2 on the interval [-8, 5]. We
already proved above that x = 2 is the global unconstrained maximum. Therefore x = 2 is also
the global constrained maximum on [-8, 5].
Warning! An incorrect approach. It might be tempting to try the following approach: First, find
all local maxima of f(·). Call this set X. Then, check to see which elements of X lie in [a, b].
Call them Y. Then, output the argmax of f(x) over all x ∈ Y. This approach does not work,
and in fact we already saw a counterexample. Say we want to find the constrained maximum of
f(x) = 3x^2 - x^3 on the interval [-2, 3]. Then f'(x) = 6x - 3x^2, and f has critical points at 0 and
2. We can verify that x = 0 is a local minimum and x = 2 is a local maximum. So x = 2 is the
unique local maximum, and it also lies in [-2, 3]. But we saw that it's incorrect to conclude that
therefore x = 2 is the constrained global maximum (the endpoint x = -2 does better).
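Here is a minimal sympy sketch of the first (compare everything) approach from the bullet list above, applied to this counterexample; it correctly returns x = -2, whereas only looking at local maxima would return x = 2:

    # Sketch: constrained maximization of f(x) = 3x^2 - x^3 on [-2, 3] by comparing
    # the endpoints with every critical point that lies inside the interval.
    import sympy as sp

    x = sp.symbols('x', real=True)
    f = 3 * x**2 - x**3
    a, b = -2, 3

    critical_points = [c for c in sp.solve(sp.diff(f, x), x) if a <= c <= b]   # [0, 2]
    candidates = [a, b] + critical_points

    best = max(candidates, key=lambda c: f.subs(x, c))
    print(best, f.subs(x, best))   # -2 20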
[1] We can also verify that x = 2 is a local maximum by computing f''(2) = -2 < 0, but this isn't necessary.
2.3 Multi-variable, unconstrained optimization
Say now we want to find the unconstrained global maximum of a differentiable multi-variate func-
tion f (·, ·, . . . , ·). Again, any value that is the unconstrained maximum must be a critical point,
where a critical point has ∂f(\vec{x})/∂x_i = 0 for all i. Again, not all critical points are local optima/maxima,
but all local maxima are definitely critical points. One also needs to confirm that f (·) indeed
achieves its global maximum by examining limits towards ∞. Doing this formally can sometimes
be tedious, but in this class we'll only see cases where this is straightforward.[2] Sometimes, it
might also be helpful to think of some variables as being fixed, and solve successive single-variable
optimization problems. Here are some examples that you might reasonably need to solve:
Say you want to maximize f(x_1, x_2) = x_1 - x_1^2 - x_2^2. We can immediately see that for any
x_1, f(x_1, x_2) is maximized at x_2 = 0 (this is what we mean by thinking of x_1 as fixed and solving
a single-variable optimization problem for x_2). Once we've set x_2 = 0, we now just want to
maximize x_1 - x_1^2, which is achieved at x_1 = 1/2. So the unconstrained maximizer is (1/2, 0).
Say you want to maximize f(x_1, x_2) = x_1 x_2 - x_1^2 - x_2^2. We can again think of x_1 as fixed and
see that ∂f/∂x_2 = x_1 - 2x_2, and so for fixed x_1, the unique maximizer is at x_2 = x_1/2. We can then
just optimize x_1 · (x_1/2) - x_1^2 - (x_1/2)^2 = (-3/4) · x_1^2, which is clearly maximized at x_1 = 0. So
the unique global maximizer is (0, 0).

Say you want to maximize f(\vec{x}) = \sum_i f_i(x_i). That is, the function you're trying to maximize
is just the sum of single-variable functions (one for each coordinate of \vec{x}). Then we can simply
maximize each f_i(·) separately, and let x_i^* = \arg\max_{x_i} f_i(x_i). Observe that \vec{x}^* must be the
maximizer of f(\vec{x}). Most (possibly all) of the instances you will need to solve in the PSets will be
of this format.
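For the separable case, here is a minimal sketch that maximizes each coordinate separately (the particular f_i's are made up for illustration and are not from any PSet):

    # Sketch: maximize a separable f(x1, x2) = f1(x1) + f2(x2) coordinate by coordinate.
    # The f_i's below are made-up single-variable examples (each is concave, so its
    # unique critical point is its maximizer).
    import sympy as sp

    x1, x2 = sp.symbols('x1 x2', real=True)
    f1 = 4 * x1 - x1**2
    f2 = x2 - 3 * x2**2

    best_x1 = sp.solve(sp.diff(f1, x1), x1)[0]   # 2
    best_x2 = sp.solve(sp.diff(f2, x2), x2)[0]   # 1/6

    print((best_x1, best_x2), f1.subs(x1, best_x1) + f2.subs(x2, best_x2))   # (2, 1/6) 49/12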
[2] Sometimes you'll need to be clever, but ideally very few (if any) proofs will require very tedious calculations.