

princeton univ. F’14 cos 521: Advanced Algorithm Design
Lecture 1: Course Intro and Hashing
Lecturer: Sanjeev Arora Scribe: Sanjeev

Algorithms are integral to computer science, and every computer scientist (even as an
undergrad) has designed several algorithms. So has many a physicist, electrical engineer,
mathematician, etc. This course is meant to be your one-stop shop to learn how to design
a variety of algorithms. The operative word is “variety.” In other words, you will avoid the
blinders that one often sees in domain experts. A Bayesian needs to see priors on the data
before he can begin designing algorithms; an optimization expert needs to cast all problems
as convex optimization; a systems designer has never seen any problem that cannot be
solved by hashing. (OK, mostly kidding, but there is some truth in these stereotypes.)
These and more domain-specific ideas make an appearance in our course, but we will learn
not to be wedded to any single approach.
The primary skill you will learn in this course is how to analyse algorithms: prove their
correctness, bound their running time, and establish any other relevant properties. Learning
to analyse a variety of algorithms (designed by others) will let you design better algorithms
later in life. I will try to fill the course with beautiful algorithms. Be prepared for frequent
rose-smelling stops, in other words.

1 Difference between grad and undergrad algorithms


Undergrad algorithms is largely about algorithms discovered before 1990; grad algorithms
is a lot about algorithms discovered since 1990. OK, I picked 1990 as an arbitrary cutoff.
Maybe it is 1985, or 1995. What happened in 1990 that caused this change, you may
ask? Nothing. It was no single event but just a gradual shift in the emphasis and goals of
computer science as it became a more mature field.
In the first few decades of computer science, algorithms research was driven by the goal
of designing the basic components of a computer: operating systems, compilers, networks, etc.
Other motivations were classical problems in discrete mathematics, operations research,
and graph theory. The algorithmic ideas that came out of these quests form the core of the
undergraduate course: data structures, graph traversal, string matching, parsing, network
flows, etc. Starting around 1990, theoretical computer science broadened its horizons and
started looking at new problems: algorithms for bioinformatics, algorithms and mechanism
design for e-commerce, algorithms to understand big data or big networks. This changed
algorithms research, and the change is ongoing. One big change is that it is often unclear
what the algorithmic problem even is; identifying it is part of the challenge. Thus good
modeling is important. This in turn is shaped by understanding what is possible (given our
understanding of computational complexity) and what is reasonable given the limitations
of the type of inputs we are given.


Some examples of this change:

The changing graph. In undergrad algorithms the graph is given and arbitrary (worst-
case). In grad algorithms we are willing to look at where the graph came from (social
network, computer vision etc.) since those properties may be germane to designing a good
algorithm. (This is not a radical idea of course but we will see that formulating good graph
models is not easy. This is why you see a lot of heuristic work in practice, without any
mathematical proofs of correctness.)

Changing data structures: In undergrad algorithms the data structures were simple
and often designed to hold data generated by other algorithms. A stack allows you to hold
vertices during depth-first search traversal of a graph, or instances of a recursive call to a
procedure. A heap is useful for sorting and searching.
But in the newer applications, data often comes from sources we don’t control. Thus it
may be noisy, or inexact, or both. It may be high dimensional. Thus something like heaps
will not work, and we need more advanced data structures.
We will encounter the “curse of dimensionality,” which constrains algorithm design for
high-dimensional data.

Changing notion of input/output: Algorithms in your undergrad course have a simple
input/output model. But increasingly we see a more nuanced interpretation of what the
input is: data streams (useful in analytics involving routers and webservers), online sequences
of requests, social network graphs, etc. And there is a corresponding subtlety in settling
on what an appropriate output is, since we have to balance output quality with algorithmic
efficiency. In fact, design of a suitable algorithm often goes hand in hand with understanding
what kind of output is reasonable to hope for.

Type of analysis: In undergrad algorithms, the algorithms were typically exact and worked
on all (i.e., worst-case) inputs. In grad algorithms we are willing to relax these requirements.

2 Hashing: Preliminaries
Now we briefly study hashing, both because it is such a basic data structure, and because
it is a good setting to develop some fluency in probability calculations.
Hashing can be thought of as a way to rename an address space. For instance, a router
at the internet backbone may wish to have a searchable database of destination IP addresses
of packets that are whizzing by. An IP address is 128 bits, so the number of possible IP
addresses is 2^128, which is too large to let us have a table indexed by IP addresses. Hashing
allows us to rename each IP address by fewer bits. Furthermore, this renaming is done
probabilistically, and the renaming scheme is decided in advance, before we have seen the
actual addresses. In other words, the scheme is oblivious to the actual addresses.
Formally, we want to store a subset S of a large universe U (where |U| = 2^128 in the
above example), with |S| = m relatively small. For each x ∈ U, we want to
support 3 operations:

• insert(x). Insert x into S.

• delete(x). Delete x from S.

• query(x). Check whether x ∈ S.

Figure 1: Hash table. x is placed in T[h(x)].

A hash table can support all 3 of these operations. We design a hash function

    h : U → {0, 1, . . . , n − 1}    (1)

such that x ∈ U is placed in T[h(x)], where T is a table of size n.


Since |U| ≫ n, multiple elements can be mapped to the same location in T, and we
deal with these collisions by constructing a linked list at each location in the table.
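The three operations with chaining can be sketched in a few lines of Python (a sketch, not code from the notes; the fully random hash function is simulated here by memoizing `random.randrange`, and the class name is my own):

```python
import random

class ChainedHashTable:
    """Hash table with separate chaining: T[i] holds the linked list
    (here a Python list) of all stored keys x with h(x) = i."""

    def __init__(self, n):
        self.n = n
        self.table = [[] for _ in range(n)]
        self._h = {}                      # memoized "random" hash values

    def h(self, x):
        # Simulate a fully random hash function: pick h(x) uniformly
        # the first time x is seen, then reuse the same value forever.
        if x not in self._h:
            self._h[x] = random.randrange(self.n)
        return self._h[x]

    def insert(self, x):
        bucket = self.table[self.h(x)]
        if x not in bucket:
            bucket.append(x)

    def delete(self, x):
        bucket = self.table[self.h(x)]
        if x in bucket:
            bucket.remove(x)

    def query(self, x):
        return x in self.table[self.h(x)]
```

Note that the hash values are fixed once chosen, so the scheme is oblivious to which keys are later inserted or deleted.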
One natural question to ask is: how long is the linked list at each location?
This can be analysed under two kinds of assumptions:

1. Assume the input is random.

2. Assume the input is arbitrary, but the hash function is random.

Assumption 1 may not be valid for many applications.
Hashing is a concrete method towards Assumption 2. We designate a set of hash func-
tions H, and when it is time to hash S, we choose a random function h ∈ H and hope
that on average we will achieve good performance for S. This is a frequent benefit of a
randomized approach: no single hash function works well for every input, but the average
hash function may be good enough.

3 Hash Functions
Say we have a family of hash functions H, where each h ∈ H maps U → [n]. (We use
[n] to denote the set {0, 1, . . . , n − 1}.) What do we mean if we say these functions are
random?
For any x1, x2, . . . , xm ∈ S (xi ≠ xj when i ≠ j), and any a1, a2, . . . , am ∈ [n], ideally a
random H should satisfy:

• Pr_{h∈H}[h(x1) = a1] = 1/n.

• Pr_{h∈H}[h(x1) = a1 ∧ h(x2) = a2] = 1/n². (Pairwise independence.)

• Pr_{h∈H}[h(x1) = a1 ∧ h(x2) = a2 ∧ · · · ∧ h(xk) = ak] = 1/n^k. (k-wise independence.)

• Pr_{h∈H}[h(x1) = a1 ∧ h(x2) = a2 ∧ · · · ∧ h(xm) = am] = 1/n^m. (Full independence; note
that |S| = m.)

Generally speaking, we encounter a tradeoff: the more random H is, the greater the
number of random bits needed to generate a function h from this class, and the higher the
cost of computing h.
For example, if H is the fully random family, there are n^m possible functions h, since each
of the m elements of S has n possible locations it can hash to. So we need log |H| = m log n
bits to represent a hash function from this family. Since m is usually very large, this is not
practical.
But the advantage of a random hash function is that it ensures very few collisions with
high probability. Let Lx be the length of the linked list containing x; this is just the number
of elements with the same hash value as x. For each y ∈ S with y ≠ x, let Iy be the indicator
random variable

    Iy = 1 if h(y) = h(x), and 0 otherwise.    (2)

Then Lx = 1 + Σ_{y∈S, y≠x} Iy, and

    E[Lx] = 1 + Σ_{y∈S, y≠x} E[Iy] = 1 + (m − 1)/n.    (3)

Usually we choose n > m, so this expected length is less than 2. Later we will analyse
this in more detail, asking how likely Lx is to exceed, say, 100.
The expectation calculation above doesn’t need full independence; pairwise indepen-
dence would actually suffice. This motivates the next idea.
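The calculation E[Lx] = 1 + (m − 1)/n is easy to check empirically under the fully random hash assumption (a sketch; the function name and parameters are my own):

```python
import random

def average_chain_length(m, n, trials=2000, seed=1):
    """Empirical E[L_x]: hash m keys into n buckets with a fresh fully
    random hash each trial, and average the length of the list
    containing key 0 (which includes key 0 itself)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        h = [rng.randrange(n) for _ in range(m)]   # h[x] for keys 0..m-1
        total += sum(1 for x in range(m) if h[x] == h[0])
    return total / trials
```

For m = 50 and n = 100 the prediction is 1 + 49/100 = 1.49, and the empirical average lands close to that.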

4 2-Universal Hash Families


Definition 1 (Carter–Wegman 1979) A family H of hash functions is 2-universal if for
any x ≠ y ∈ U,

    Pr_{h∈H}[h(x) = h(y)] ≤ 1/n.    (4)

Note that this property is even weaker than pairwise independence.


We can design a 2-universal hash family in the following way. Choose a prime p ∈
{|U|, . . . , 2|U|}, and let

    f_{a,b}(x) = ax + b mod p    (a, b ∈ [p], a ≠ 0),    (5)

and let

    h_{a,b}(x) = f_{a,b}(x) mod n.    (6)
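As a sketch (not part of the notes), the construction above is a few lines of Python; `p` must be a prime at least |U|:

```python
import random

def make_carter_wegman(p, n, seed=None):
    """Sample h_{a,b}(x) = ((a*x + b) mod p) mod n from the 2-universal
    family above, for a random a in 1..p-1 and b in 0..p-1."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)    # a != 0
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % n
```

For small p one can even enumerate all p(p − 1) functions and verify the collision bound of Definition 1 directly.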

Lemma 1
For any x1 ≠ x2 and any s ≠ t, the system

    ax1 + b = s mod p    (7)
    ax2 + b = t mod p    (8)

has exactly one solution (a, b).

Since [p] constitutes a finite field, we have a = (x1 − x2)^{−1}(s − t) and b = s − ax1.
Since there are p(p − 1) different hash functions in H,

    Pr_{h∈H}[f_{a,b}(x1) = s ∧ f_{a,b}(x2) = t] = 1/(p(p − 1)).    (9)

Claim: H = {h_{a,b} : a, b ∈ [p], a ≠ 0} is 2-universal.

Proof: For any x1 ≠ x2,

    Pr[h_{a,b}(x1) = h_{a,b}(x2)]    (10)
      = Σ_{s,t∈[p], s≠t} δ(s = t mod n) · Pr[f_{a,b}(x1) = s ∧ f_{a,b}(x2) = t]    (11)
      = (1/(p(p − 1))) · Σ_{s,t∈[p], s≠t} δ(s = t mod n)    (12)
      ≤ (1/(p(p − 1))) · p(p − 1)/n    (13)
      = 1/n,    (14)

where δ(·) is the indicator of the condition in parentheses. Line (13) follows because for each
s ∈ [p] there are at most (p − 1)/n values of t such that s ≠ t and s = t mod n. ∎
Can we design a collision-free hash table then? Say we have m elements, and the hash
table is of size n. Since for any x1 ≠ x2 we have Pr_h[h(x1) = h(x2)] ≤ 1/n, the expected
number of total collisions is

    E[Σ_{x1≠x2} 1_{h(x1)=h(x2)}] = Σ_{x1≠x2} Pr[h(x1) = h(x2)] ≤ m(m − 1)/(2n).    (15)

If we pick n ≥ m², then

    E[number of collisions] ≤ 1/2,    (16)

and so, by Markov’s inequality,

    Pr_{h∈H}[∃ a collision] ≤ 1/2.    (17)

So if the hash table is large enough (n ≥ m²), we can easily find a collision-free
hash function. But in reality, such a large table is often unrealistic. We may use a two-layer
hash table to avoid this problem.

Figure 2: Two-layer hash tables: location i holding s_i elements gets a second-level table of s_i² locations.

Specifically, let s_i denote the number of elements hashed to location i. If we construct a
second-layer table of size s_i² for location i, then by the calculation above we can easily find a
collision-free hash function to store those s_i elements. Thus the total size of the second-layer
hash tables is Σ_{i=0}^{n−1} s_i².
Note that Σ_i s_i(s_i − 1) is just twice the number of collisions calculated in Equation (15),
so

    E[Σ_i s_i²] = E[Σ_i s_i(s_i − 1)] + E[Σ_i s_i] = m(m − 1)/n + m ≤ 2m    (for n ≥ m).    (18)
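The whole two-layer scheme can be sketched in Python, reusing the hash family above (the function names and the fixed prime `p = 1_000_003` are my choices; each second-level hash is resampled until it is collision-free, which by the analysis above takes O(1) attempts in expectation):

```python
import random

def build_two_level(keys, p=1_000_003, seed=0):
    """Two-layer hash table sketch: a first-level table of size
    n = len(keys); each location i holding s_i keys gets its own
    collision-free second-level table of size s_i**2.
    Assumes keys are non-negative integers below the prime p."""
    rng = random.Random(seed)
    n = len(keys)

    def sample_hash(size):
        a = rng.randrange(1, p)
        b = rng.randrange(p)
        return lambda x, a=a, b=b: ((a * x + b) % p) % size

    h1 = sample_hash(n)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)

    second = []                       # (h2, table) per first-level location
    for bucket in buckets:
        if not bucket:
            second.append((None, []))
            continue
        size = len(bucket) ** 2
        while True:                   # expected O(1) resamples per bucket
            h2 = sample_hash(size)
            table = [None] * size
            ok = True
            for k in bucket:
                j = h2(k)
                if table[j] is not None:
                    ok = False        # collision: resample h2
                    break
                table[j] = k
            if ok:
                break
        second.append((h2, table))

    def query(x):
        h2, table = second[h1(x)]
        return bool(table) and table[h2(x)] == x

    return query
```

Queries take two hash evaluations and one comparison, with total space O(m) in expectation by Equation (18).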

5 Load Balance
Now we think a bit about how large the linked lists (i.e., the number of collisions) can get. Let
us think for simplicity about hashing n keys into a hash table of size n. This is the famous
balls-and-bins calculation, also called the load balance problem. We have n balls and n bins,
and we randomly put the balls into bins. Then for a given i,

    Pr[bin i gets more than k elements] ≤ (n choose k) · (1/n)^k ≤ 1/k!.    (19)

By Stirling’s formula,

    k! ∼ √(2πk) (k/e)^k.    (20)

If we choose k = O(log n / log log n), we can make 1/k! ≤ 1/n². Then

    Pr[∃ a bin with ≥ k balls] ≤ n · (1/n²) = 1/n.    (21)
So with probability at least 1 − 1/n (this can easily be improved to 1 − 1/n^c for any
constant c),

    max load ≤ O(log n / log log n).    (22)

Aside: The above load balancing is not bad: no more than O(log n / log log n) balls in any bin with
high probability. Can we modify the method of throwing balls into bins to improve the load
balancing? We use an idea that you use at the supermarket checkout: instead of going to
a random checkout counter, you try to go to the counter with the shortest queue. In the
load balancing case this is computationally too expensive: one has to check all n queues.
A much simpler version is the following: when a ball comes in, pick 2 random bins, and
place the ball in the one that has fewer balls. It turns out this modified rule ensures that the
maximum load drops to O(log log n), which is a huge improvement. This is called the power of
two choices.
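A quick simulation contrasts one random choice with two (a sketch; exact loads vary by seed, but the gap between Θ(log n / log log n) and O(log log n) shows up clearly):

```python
import random

def max_load(n, choices, seed=0):
    """Throw n balls into n bins; each ball samples `choices` uniformly
    random bins and goes into the currently least-loaded of them.
    Returns the maximum final load."""
    rng = random.Random(seed)
    bins = [0] * n
    for _ in range(n):
        picks = [rng.randrange(n) for _ in range(choices)]
        target = min(picks, key=lambda j: bins[j])
        bins[target] += 1
    return max(bins)
```

With n = 100000, one choice typically yields a max load around 8–11, while two choices yield around 4.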
princeton univ. F’13 cos 521: Advanced Algorithm Design
Lecture 2: Karger’s Min Cut Algorithm
Lecturer: Sanjeev Arora Scribe:Sanjeev

Today’s topic is simple but gorgeous: Karger’s min cut algorithm and its extension. It
is a simple randomized algorithm for finding the minimum cut in a graph: a subset of
vertices S such that the set of edges leaving S, denoted E(S, S̄), has minimum size among
all subsets. You may have seen an algorithm for this problem in your undergrad class that
uses maximum flow. Karger’s algorithm is elementary and a great introduction to
randomized algorithms.
The algorithm is this: pick a random edge, and merge its endpoints into a single “su-
pernode.” Repeat until the graph has only two supernodes, which is output as our guess for
the min cut. (As you continue, the supernodes may develop parallel edges; these are allowed.
Self-loops are ignored.)
Note that if you pick a random edge, it is more likely to come from parts of the graph
that contain more edges in the first place. Thus this algorithm looks like a great heuristic
to try on all kinds of real-life graphs, where one wants to cluster the nodes into “tightly-
knit” portions. For example, social networks may cluster into communities; graphs capturing
similarity of pixels may cluster to give different portions of the image (sky, grass, road, etc.).
Thus instead of continuing Karger’s algorithm until you have two supernodes left, you could
stop it when there are k supernodes and try to understand whether these correspond to a
reasonable clustering.
Today we will first see that the above version of the algorithm yields the optimum min
cut with probability at least 2/n². Thus we can repeat it, say, 20n² times, and output the
smallest cut seen in any iteration. The probability that the optimum cut is not seen in any
repetition is at most (1 − 2/n²)^{20n²} < 0.01.
Unfortunately, this simple version has running time about n⁴, which is not great.
So then we see a better version with a simple tweak that brings the running time down
closer to n². The idea is roughly that repetition ensures fault tolerance. The real-life
advice of making two backups of your hard drive is related to this: the probability that both
fail is much smaller than the probability that one does. In the case of Karger’s algorithm, the overall
probability of success is too low. But if you run it part of the way, until the graph has n/√2
supernodes, the chance that the min cut hasn’t been destroyed is at least 1/2. So you make
two independent runs that go down to n/√2 supernodes, and recursively solve both of these.
Thus the expected number of instances that will yield the correct min cut is 2 × 1/2 = 1.
(Unwrapping the recursion, you see that each instance of size n/√2 will generate two instances
of size n/2, and so on.) Simple induction shows that this 2-wise repetition is enough to bring
the probability of success above 1/log n.
As you might suspect, this is not the end of the story, but improvements beyond this
get more hairy. If anybody is interested I can give more pointers.
Also, this algorithm forms the basis of algorithms for other tasks. Again, talk to
me for pointers.
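A sketch of the basic (non-recursive-speedup) algorithm in Python, using union-find to track supernodes; a random contraction is implemented by rejection: sample a uniformly random edge and skip it if it has become a self-loop (these are my implementation choices, not code from the lecture):

```python
import random

def karger_min_cut(edges, n, seed=0):
    """One run of Karger's contraction on an n-vertex multigraph given
    as a list of edges (u, v): repeatedly contract a random edge until
    two supernodes remain, then count the edges crossing between them."""
    rng = random.Random(seed)
    parent = list(range(n))              # union-find over supernodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    supernodes = n
    while supernodes > 2:
        u, v = edges[rng.randrange(len(edges))]
        ru, rv = find(u), find(v)
        if ru != rv:                     # self-loops are ignored
            parent[ru] = rv
            supernodes -= 1
    return sum(1 for u, v in edges if find(u) != find(v))

def min_cut(edges, n, trials=None, seed=0):
    """Repeat ~20 n^2 times (per the analysis above) and keep the best."""
    if trials is None:
        trials = 20 * n * n
    return min(karger_min_cut(edges, n, seed=seed + t) for t in range(trials))
```

On two triangles joined by a single bridge edge, the repeated runs find the bridge as the min cut.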

Chapter 3

Large deviation bounds and applications

Today’s topic is deviation bounds: what is the probability that a random variable deviates
from its mean by a lot? Recall that a random variable X is a mapping from a probability
space to R. The expectation or mean is denoted E[X] or sometimes as µ.
In many settings we have a set of n random variables X1 , X2 , X3 , . . . , Xn defined on
the same probability space. To give an example, the probability space could be that of all
possible outcomes of n tosses of a fair coin, and Xi is the random variable that is 1 if the
ith toss is a head, and is 0 otherwise, which means E[Xi ] = 1/2.
The first observation we make is linearity of expectation:

    E[Σ_i Xi] = Σ_i E[Xi].

It is important to realize that linearity holds regardless of whether or not the random
variables are independent.
Can we say something about E[X1 X2]? In general, nothing much, but if X1, X2 are
independent (formally, this means that for all a, b, Pr[X1 = a, X2 = b] = Pr[X1 = a] Pr[X2 = b]),
then E[X1 X2] = E[X1] E[X2].
Note that if the Xi’s are pairwise independent (i.e., each pair is mutually independent),
then var[Σ_i Xi] = Σ_i var[Xi].

3.1 Three progressively stronger tail bounds


Now we give three methods that give progressively stronger bounds.

3.1.1 Markov’s Inequality (aka averaging)

The first of a number of inequalities presented today, Markov’s inequality says that any
non-negative random variable X satisfies

    Pr[X ≥ k E[X]] ≤ 1/k.

Note that this is just another way to write the trivial observation that E[X] ≥ a · Pr[X ≥ a].
Can we give any meaningful upper bound on Pr[X < c · E[X]] where c < 1, in other
words, the probability that X is a lot less than its expectation? In general we cannot.
However, if we know an upper bound on X then we can. For example, if X ∈ [0, 1] and
E[X] = µ, then for any c < 1 we have (simple exercise)

    Pr[X ≤ cµ] ≤ (1 − µ)/(1 − cµ).

Sometimes this is also called an averaging argument.
Example 1 Suppose you took a lot of exams, each scored from 1 to 100. If your average
score was 90 then in at least half the exams you scored at least 80.
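Markov's inequality is easy to sanity-check numerically (a sketch; the exponential distribution below is just an arbitrary non-negative example, and the function name is mine):

```python
import random

def markov_gap(samples, k):
    """Compare the empirical Pr[X >= k * E[X]] with Markov's bound 1/k
    for a list of non-negative sample values."""
    mean = sum(samples) / len(samples)
    frac = sum(1 for x in samples if x >= k * mean) / len(samples)
    return frac, 1 / k
```

For heavy-tailed non-negative data the empirical tail is often far below 1/k, which is why Markov is called a weak (but assumption-free) bound.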

3.1.2 Chebyshev’s Inequality

The variance of a random variable X is one measure (there are others too) of how “spread
out” it is around its mean. It is defined as var[X] = E[(X − µ)²] = E[X²] − µ².
A more powerful inequality, Chebyshev’s inequality, says

    Pr[|X − µ| ≥ kσ] ≤ 1/k²,

where µ and σ² are the mean and variance of X. Recall that σ² = E[(X − µ)²] = E[X²] − µ².
Actually, Chebyshev’s inequality is just a special case of Markov’s inequality: by definition,

    E[|X − µ|²] = σ²,

and so,

    Pr[|X − µ|² ≥ k²σ²] ≤ 1/k².
Here is a simple fact that’s used a lot: if Y1, Y2, . . . , Yt are iid (which is jargon for inde-
pendent and identically distributed), then the variance of their average (1/t) Σ_i Yi is exactly 1/t
times the variance of one of them. Using Chebyshev’s inequality, this already implies that
the average of iid variables converges sort-of strongly to the mean.

Example: Load balancing


Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to
n processors. Let X = the number of balls assigned to the first bin. Then E[X] = m/n. What
is the chance that X > 2m/n? Markov’s inequality says this is less than 1/2.
To use Chebyshev we need to compute the variance of X. For this let Yi be the indicator
random variable that is 1 iff the ith ball falls in the first bin. Then X = Σ_i Yi. Hence

    E[X²] = E[Σ_i Yi² + 2 Σ_{i<j} Yi Yj] = Σ_i E[Yi²] + 2 Σ_{i<j} E[Yi Yj].

Now for independent random variables E[Yi Yj] = E[Yi] E[Yj], so E[X²] = m/n + m(m − 1)/n².
Hence the variance is very close to m/n, and thus Chebyshev implies that

    Pr[X > 2m/n] < n/m.

When m > 3n, say, this is stronger than Markov.

3.1.3 Large deviation bounds


When we toss a coin many times, the expected number of heads is half the number of tosses.
How tightly is this distribution concentrated? Should we be very surprised if after 1000
tosses we have 625 heads?
The Central Limit Theorem says that the sum of n independent random variables (with
bounded mean and variance) converges to the famous Gaussian distribution (popularly
known as the Bell Curve). This is very useful in algorithm design: we maneuver to design
algorithms so that the analysis boils down to estimating the sum of independent (or
somewhat independent) random variables.
To do a back-of-the-envelope calculation: if all n coin tosses are fair (heads has probability
1/2), then the Gaussian approximation implies that the probability of seeing N heads where
|N − n/2| > a√n/2 is at most e^{−a²/2}. The chance of seeing at least 625 heads in 1000
tosses of an unbiased coin is less than 5.3 × 10⁻⁷. These are pretty strong bounds!
This kind of back-of-the-envelope calculations using the Gaussian approximation will
get most of the credit in homeworks.
In general, for finite n the sum of n random variables need not be an exact Gaussian;
this is particularly true if the variables are not identically distributed and well-behaved like
the random coin tosses above. That’s where Chernoff bounds come in. (By the way, these
bounds are also known by other names in different fields, since they have been independently
discovered several times.)
First we give an inequality that works for general variables that are real-valued in [−1, 1].
This is not correct as stated, but it is good enough for your use in this course.
Theorem 2 (Inexact! Only a qualitative version)
If X1, X2, . . . , Xn are independent random variables with each Xi ∈ [−1, 1], let µi = E[Xi]
and σi² = var[Xi]. Then X = Σ_i Xi satisfies

    Pr[|X − µ| > kσ] ≤ 2 exp(−k²/4),

where µ = Σ_i µi and σ² = Σ_i σi². Also, k ≤ σ/2 (say).

Instead of proving the above, we prove a simpler theorem for binary-valued variables
which showcases the basic idea.

Theorem 3
Let X1, X2, . . . , Xn be independent 0/1-valued random variables and let pi = E[Xi], where
0 < pi < 1. Then the sum X = Σ_{i=1}^n Xi, which has mean µ = Σ_{i=1}^n pi, satisfies

    Pr[X ≥ (1 + δ)µ] ≤ (c_δ)^µ,

where c_δ is shorthand for e^δ / (1 + δ)^{(1+δ)}.

Remark: There is an analogous inequality that bounds the probability of deviation below
the mean, in which δ becomes negative, the ≥ in the probability becomes ≤, and the c_δ
is very similar.
Proof: Surprisingly, this inequality is also proved using Markov’s inequality, albeit
applied to a different random variable.
We introduce a positive constant t (which we will specify later) and consider the random
variable exp(tX): when X is a, this variable is exp(ta). The advantage of this variable is
that

    E[exp(tX)] = E[exp(t Σ_i Xi)] = E[Π_i exp(tXi)] = Π_i E[exp(tXi)],    (3.1)

where the last equality holds because the Xi r.v.s are independent, which implies that the
exp(tXi)’s are also independent. Now,

    E[exp(tXi)] = (1 − pi) + pi e^t,

and therefore

    Π_i E[exp(tXi)] = Π_i [1 + pi(e^t − 1)] ≤ Π_i exp(pi(e^t − 1))
                    = exp(Σ_i pi(e^t − 1)) = exp(µ(e^t − 1)),    (3.2)

as 1 + x ≤ e^x. Finally, apply Markov’s inequality to the random variable exp(tX), viz.

    Pr[X ≥ (1 + δ)µ] = Pr[exp(tX) ≥ exp(t(1 + δ)µ)]
                     ≤ E[exp(tX)] / exp(t(1 + δ)µ) = exp((e^t − 1)µ) / exp(t(1 + δ)µ),

using lines (3.1) and (3.2) and the fact that t is positive. Since t is a dummy variable, we can
choose any positive value we like for it. The right-hand side is minimized if t = ln(1 + δ)—just
differentiate—and this leads to the theorem statement. ∎
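The bound in Theorem 3 can be compared against an empirical binomial tail (a sketch; parameters and function names are arbitrary choices of mine):

```python
import math
import random

def chernoff_bound(mu, delta):
    """Upper-tail bound from Theorem 3: (e^delta / (1+delta)^(1+delta))^mu."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def empirical_tail(n, p, delta, trials=5000, seed=0):
    """Empirical Pr[Binomial(n, p) >= (1 + delta) * n * p]."""
    rng = random.Random(seed)
    mu = n * p
    hits = 0
    for _ in range(trials):
        x = sum(1 for _ in range(n) if rng.random() < p)
        hits += x >= (1 + delta) * mu
    return hits / trials
```

For n = 200, p = 1/2, δ = 0.2 the bound is about 0.15 while the true tail is a couple of orders of magnitude smaller; Chernoff bounds are not tight, but they decay exponentially in µ.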
The following is the more general inequality for variables that do not lie in [−1, 1]. It is
proved similarly to the Chernoff bound.

Theorem 4 (Hoeffding)
Suppose X1, X2, . . . , Xn are independent r.v.’s, with ai ≤ Xi ≤ bi. If X = Σ_i Xi and
µ = E[X], then

    Pr[X − µ > t] ≤ exp(−t² / Σ_i (bi − ai)²).

3.2 Application 1: Sampling/Polling

Opinion polls and statistical sampling rely on tail bounds. Suppose there are n arbitrary
numbers in [0, 1]. If we pick t of them randomly (with replacement!), then the sample mean
is within ±ε of the true mean with probability at least 1 − δ if t > Ω((1/ε²) log(1/δ)). (Verify
this calculation!)
In general, Chernoff bounds imply that taking k independent estimates and taking
their mean ensures that the value is highly concentrated about the true mean; large deviations
happen with exponentially small probability.
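The sampling estimator itself is one line (a sketch; the sample size below is chosen generously relative to the ε in the bound):

```python
import random

def poll(values, t, seed=0):
    """Estimate the mean of `values` from t samples drawn with
    replacement (as in the polling application above)."""
    rng = random.Random(seed)
    return sum(rng.choice(values) for _ in range(t)) / t
```

With 2000 samples from 1000 values in [0, 1], the estimate lands well within ±0.05 of the true mean.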

3.3 Balls and Bins revisited: Load balancing


Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to
n processors. Then the expected number of balls in each bin is m/n. When m = n this
expectation is 1, but we saw in Lecture 1 that the most overloaded bin has Ω(log n / log log n)
balls. However, if m = cn log n, then the expected number of balls in each bin is c log n.
Then Chernoff bounds imply that the chance of seeing fewer than 0.5c log n or more than
1.5c log n balls in a given bin is at most exp(−γ c log n) for some constant γ > 0 (which depends
on the 0.5, 1.5, etc.), which can be made less than, say, 1/n² by choosing c to be a large constant.
Moral: if an office boss is trying to allocate work fairly, he/she should first create more
work and then do a random assignment.

3.4 What about the median?

Given n numbers in [0, 1], can we approximate the median via sampling? This will be part
of your homework.
Exercise: Show that it is impossible to estimate the value of the median within, say, a 1.1
factor with o(n) samples.
But what is possible is to produce a number that is an approximate median: it is greater
than at least n/2 − n/t of the numbers and less than at least n/2 − n/t of the numbers. The
idea is to take a random sample of a certain size and take the median of that sample. (Hint:
use balls and bins.)
One can use the approximate median algorithm to describe a version of quicksort with
very predictable performance. Say we are given n numbers in an array. Recall that (random)
quicksort is the sorting algorithm where you randomly pick one of the n numbers as a pivot,
then partition the numbers into those that are bigger and those that are smaller than the pivot
(which takes O(n) time). Then you recursively sort the two subsets.
This procedure works in expected O(n log n) time, as you may have learnt in an undergrad
course. But its performance is uneven because the pivot may not divide the instance into
two roughly equal pieces. For instance, the chance that the running time exceeds 10n log n
is quite high.
A better way to run quicksort is to first do a quick estimation of the median and then
do a pivot. This algorithm runs in very close to n log n time, which is optimal.
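A sketch of this idea in Python: the pivot is the median of a small random sample rather than of the whole array (the sample size and base-case cutoff are arbitrary choices of mine, not values from the notes):

```python
import random

def quicksort_sampled_pivot(arr, sample_size=25, seed=0):
    """Quicksort variant sketched above: each pivot is the median of a
    small random sample, so splits are predictably balanced."""
    rng = random.Random(seed)

    def sort(a):
        if len(a) <= 16:
            return sorted(a)            # base case: small subarrays
        sample = [a[rng.randrange(len(a))] for _ in range(sample_size)]
        pivot = sorted(sample)[sample_size // 2]
        less = [x for x in a if x < pivot]
        equal = [x for x in a if x == pivot]
        greater = [x for x in a if x > pivot]
        return sort(less) + equal + sort(greater)

    return sort(list(arr))
```

Grouping the elements equal to the pivot guarantees progress even on arrays with many duplicates.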
Chapter 4

Hashing with real numbers and their big-data applications

Using only memory equivalent to 5 lines of printed text, you can estimate with a
typical accuracy of 5 per cent and in a single pass the total vocabulary of Shakespeare.
This wonderfully simple algorithm has applications in data mining, estimating
characteristics of huge data flows in routers, etc. It can be implemented
by a novice, can be fully parallelized with optimal speed-up and only need minimal
hardware requirements. There’s even a bit of math in the middle!
    — Opening lines of a paper by Durand and Flajolet, 2003.

As we saw in Lecture 1, hashing can be thought of as a way to rename an address space.
For instance, a router at the internet backbone may wish to have a searchable database of
destination IP addresses of packets that are whizzing by. An IP address is 128 bits, so the
number of possible IP addresses is 2^128, which is too large to let us have a table indexed
by IP addresses. Hashing allows us to rename each IP address by fewer bits. In Lecture 1
this hash was a number in a finite field (integers modulo a prime p). In recent years, big-data
algorithms have used hashing in interesting ways where the hash is viewed as a real
number. For instance, we may hash IP addresses to real numbers in the unit interval [0, 1].
Example 2 (Dart-throwing method of estimating areas) Suppose someone gives you a piece
of paper of irregular shape and you wish to determine its area. You can do so by pinning
it on a piece of graph paper. Say it lies completely inside the unit square. Then throw a
dart n times at the unit square and observe the fraction of times it falls on the irregularly
shaped paper. This fraction is an estimator for the area of the paper.
Of course, the digital analog of throwing a dart n times at the unit square is to take a
random hash function from {1, . . . , n} to [0, 1] × [0, 1].
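The dart-throwing estimator is easy to sketch in code. The following is a minimal Python sketch; the `inside` predicate and the quarter-disk example region are illustrative assumptions, not from the notes.

```python
import random

def estimate_area(inside, n=100_000):
    """Throw n random darts at the unit square and return the fraction
    that lands inside the region; this is an unbiased estimator of the
    region's area."""
    hits = sum(1 for _ in range(n) if inside(random.random(), random.random()))
    return hits / n

# Example region (an assumption for illustration): a quarter disk of
# radius 1, whose true area is pi/4, approximately 0.785.
area = estimate_area(lambda x, y: x * x + y * y <= 1.0)
```

By a Chernoff bound, the estimate concentrates around the true area, with error shrinking like 1/sqrt(n).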

Strictly speaking, one cannot hash to a real number since computers lack infinite preci-
sion. Instead, one hashes to rational numbers in [0, 1]. For instance, hash IP addresses to
the set [p] as before, and then think of the number "i mod p" as the rational number i/p. This
works fine so long as our method doesn't use too many bits of precision in the real-valued
hash.
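As a sketch of this idea, one can compose a random affine map into the integers mod p with a division by p. The particular hash family and the choice of the Mersenne prime 2^61 − 1 below are illustrative assumptions, not the exact construction from Lecture 1.

```python
import random

# A large Mersenne prime (an arbitrary illustrative choice); keys wider
# than 61 bits would first be folded down, e.g. by XOR-ing halves.
P = (1 << 61) - 1

def make_unit_hash():
    """Return a random hash function mapping integer keys to rationals
    in [0, 1): first hash to Z_p with a random affine map a*x + b mod p,
    then divide by p."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) / P
```

Note that the final division produces a 53-bit float, which is exactly the kind of limited-precision "real" hash the paragraph above warns about.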
