Midterm 2006
• There are 7 questions in this exam (11 pages including this cover sheet).
• Questions are not equally difficult.
• If you need more room to work out your answer to a question, use the back of the page
and clearly mark on the front of the page if we are to look at what’s on the back.
• This exam is open book and open notes. Computers, PDAs, and cell phones are not allowed.
• You have 1 hour and 20 minutes. Good luck!
Name:
Andrew ID:
Question                                           Points
1  Conditional Independence, MLE/MAP, Probability      12
2  Decision Tree                                       12
3  Neural Network and Regression                       18
4  Bias-Variance Decomposition                         12
5  Support Vector Machine                              12
6  Generative vs. Discriminative Classifier            20
7  Learning Theory                                     14
   Total                                              100
1 Conditional Independence, MLE/MAP, Probability (12 pts)
1. (4 pts) Show that Pr(X, Y |Z) = Pr(X|Z) Pr(Y |Z) if Pr(X|Y, Z) = Pr(X|Z).
2. (4 pts) If a data point y follows the Poisson distribution with rate parameter θ, then the
probability of a single observation y is

    p(y|θ) = θ^y e^{−θ} / y!,   for y = 0, 1, 2, . . . .

You are given data points y1, . . . , yn drawn independently from a Poisson distribution with
parameter θ. Write down the log-likelihood of the data as a function of θ.
3. (4 pts) Suppose that in answering a question in a multiple choice test, an examinee either
knows the answer, with probability p, or he guesses with probability 1 − p. Assume that the
probability of answering a question correctly is 1 for an examinee who knows the answer and
1/m for the examinee who guesses, where m is the number of multiple choice alternatives.
What is the probability that an examinee knew the answer to a question, given that he has
correctly answered it?
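The following is a minimal numerical sketch of the Poisson log-likelihood asked for in question 1.2, assuming a small made-up sample; the y values and θ candidates are hypothetical, chosen only for illustration.

    import math

    def poisson_log_likelihood(ys, theta):
        # log L(theta) = sum_i [ y_i * log(theta) - theta - log(y_i!) ],
        # i.e. the log of the product of p(y_i | theta) over the sample.
        return sum(y * math.log(theta) - theta - math.lgamma(y + 1) for y in ys)

    # Hypothetical data points, purely for illustration.
    ys = [2, 0, 3, 1, 1]
    for theta in (0.5, 1.4, 3.0):
        print(theta, round(poisson_log_likelihood(ys, theta), 4))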
2 Decision Tree (12 pts)
The following data set will be used to learn a decision tree for predicting whether students are
lazy (L) or diligent (D) based on their weight (Normal or Underweight), their eye color (Amber or
Violet) and the number of eyes they have (2 or 3 or 4).
The following numbers may be helpful as you answer this problem without using a calculator:
log2 0.1 = −3.32, log2 0.2 = −2.32, log2 0.3 = −1.73, log2 0.4 = −1.32, log2 0.5 = −1.
*You don’t need to show the derivation for your answers in this problem.
2. (3 pts) What attribute would the ID3 algorithm choose to use for the root of the tree (no
pruning)?
3. (4 pts) Draw the full decision tree learned for this data (no pruning).
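Below is a minimal sketch of the entropy and information-gain computation that ID3 uses to choose the root attribute, assuming a small hypothetical set of labels and attribute values rather than the exam's data table.

    import math
    from collections import Counter

    def entropy(labels):
        # H(Y) = -sum_v p(v) * log2 p(v)
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(attr_values, labels):
        # IG(Y; A) = H(Y) - sum_a P(A = a) * H(Y | A = a)
        n = len(labels)
        remainder = 0.0
        for a in set(attr_values):
            subset = [y for x, y in zip(attr_values, labels) if x == a]
            remainder += len(subset) / n * entropy(subset)
        return entropy(labels) - remainder

    # Hypothetical toy sample: lazy/diligent labels and one binary attribute.
    labels = ['L', 'L', 'D', 'D', 'L']
    weight = ['N', 'N', 'U', 'U', 'N']
    print(information_gain(weight, labels))  # ID3 picks the attribute with the largest gain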
3 Neural Network and Regression (18 pts)
Consider a two-layer neural network to learn a function f : X → Y where X = ⟨X1, X2⟩ consists of
two attributes. The weights, w1, . . . , w6, can be arbitrary. There are two possible choices for the
function implemented by each unit in this network:
• S: signed sigmoid function S(a) = sign[σ(a) − 0.5] = sign[1/(1 + exp(−a)) − 0.5]
• L: linear function
1. (4 pts) Assign proper activation functions (S or L) to each unit in the following graph so this
neural network simulates a linear regression: Y = β1 X1 + β2 X2 .
2. (4 pts) Assign proper activation functions (S or L) for each unit in the following graph so this
neural network simulates a binary logistic regression classifier: Y = arg max_y P(Y = y|X),
where P(Y = 1|X) = exp(β1 X1 + β2 X2) / (1 + exp(β1 X1 + β2 X2)) and
P(Y = −1|X) = 1 / (1 + exp(β1 X1 + β2 X2)).
4. (4 pts) Assign proper activation functions (S or L) for each unit in the following graph so this
neural network simulates a boosting classifier that combines two logistic regression classifiers,
f1 : X → Y1 and f2 : X → Y2, to produce its final prediction: Y = sign[α1 Y1 + α2 Y2]. Use
the same definitions as in problem 3.2 for f1 and f2.
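A minimal numerical sketch of the two unit types defined in this problem: it checks that the signed sigmoid S(a) = sign[σ(a) − 0.5] behaves like the sign of its input, and treats L as the identity (an assumption, since the problem only labels L as linear). The coefficients and inputs are made up for illustration.

    import math

    def sigmoid(a):
        return 1.0 / (1.0 + math.exp(-a))

    def S(a):
        # signed sigmoid unit from the problem statement
        return 1 if sigmoid(a) - 0.5 >= 0 else -1

    def L(a):
        # linear unit, assumed here to be the identity
        return a

    # Made-up coefficients and inputs, purely for illustration.
    b1, b2 = 0.7, -1.3
    for x1, x2 in [(1.0, 0.5), (-2.0, 0.1), (0.3, 0.4)]:
        a = b1 * x1 + b2 * x2
        assert S(a) == (1 if a >= 0 else -1)  # S agrees with the sign of a
        print((x1, x2), L(a), S(a))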
4 Bias-Variance Decomposition (12 pts)
1. (6 pts) Suppose you have regression data generated by a polynomial of degree 3. Characterize
the bias and variance of the estimates of the following models, fit to this data, with respect to
the true model by circling the appropriate entries.
Bias Variance
Linear regression low/high low/high
Polynomial regression with degree 3 low/high low/high
Polynomial regression with degree 10 low/high low/high
2. Let Y = f(X) + ε, where ε has mean zero and variance σ_ε^2. In k-nearest neighbor (kNN)
regression, the prediction of Y at a point x0 is given by the average of the Y values at the k
neighbors closest to x0.
(a) (2 pts) Denote the ℓ-th nearest neighbor to x0 by x(ℓ) and its corresponding Y value by
y(ℓ). Write the prediction f̂(x0) of the kNN regression at x0 in terms of y(ℓ), 1 ≤ ℓ ≤ k.
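A minimal sketch of the kNN regression prediction described above, assuming one-dimensional inputs with Euclidean distance and made-up training pairs.

    def knn_predict(x0, xs, ys, k):
        # f_hat(x0) = average of the Y values of the k training points closest to x0
        order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))
        return sum(ys[i] for i in order[:k]) / k

    # Hypothetical 1-D training data, purely for illustration.
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [0.1, 0.9, 4.2, 9.1, 15.8]
    print(knn_predict(2.4, xs, ys, k=3))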
5 Support Vector Machine (12 pts)
Consider a supervised learning problem in which the training examples are points in 2-dimensional
space. The positive examples are (1, 1) and (−1, −1). The negative examples are (1, −1) and
(−1, 1).
1. (1 pt) Are the positive examples linearly separable from the negative examples in the original
space?
2. (4 pts) Consider the feature transformation φ(x) = [1, x1, x2, x1 x2], where x1 and x2 are,
respectively, the first and second coordinates of a generic example x. The prediction function
is y(x) = wᵀφ(x) in this feature space. Give the coefficients, w, of a maximum-margin
decision surface separating the positive examples from the negative examples. (You should
be able to do this by inspection, without any significant computation.)
3. (3 pts) Add one training example to the graph so that the five examples in total can no longer
be linearly separated in the feature space φ(x) defined in problem 5.2.
4. (4 pts) What kernel K(x, x′) does this feature transformation φ correspond to?
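A minimal sketch of the feature map φ from problem 5.2, in plain Python: it evaluates φ on the four training points and computes inner products in feature space, which by definition is the kernel problem 5.4 asks about (the closed form is left to the reader).

    def phi(x):
        # feature map from problem 5.2: x = (x1, x2) -> [1, x1, x2, x1*x2]
        x1, x2 = x
        return [1.0, x1, x2, x1 * x2]

    def k(x, xp):
        # the kernel induced by phi is, by definition, the inner product in feature space
        return sum(a * b for a, b in zip(phi(x), phi(xp)))

    positives = [(1, 1), (-1, -1)]
    negatives = [(1, -1), (-1, 1)]
    for x in positives + negatives:
        print(x, phi(x), k(x, (1, 1)))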
6 Generative vs. Discriminative Classifier (20 pts)
Consider the binary classification problem where class label Y ∈ {0, 1} and each training example
X has 2 binary attributes X1 , X2 ∈ {0, 1}.
In this problem, we will always assume X1 and X2 are conditionally independent given Y, that
the class priors are P(Y = 0) = P(Y = 1) = 0.5, and that the conditional probabilities are as
follows:
P(X1 |Y )   X1 = 0   X1 = 1
Y = 0         0.7      0.3
Y = 1         0.2      0.8

P(X2 |Y )   X2 = 0   X2 = 1
Y = 0         0.9      0.1
Y = 1         0.5      0.5
The expected error rate is the probability that a classifier provides an incorrect prediction for an
observation: if Y is the true label and Ŷ(X1, X2) is the predicted class label, then the expected
error rate is

    P_D( Y = 1 − Ŷ(X1, X2) ) = Σ_{X1=0}^{1} Σ_{X2=0}^{1} P_D( X1, X2, Y = 1 − Ŷ(X1, X2) ).
Note that we use the subscript D to emphasize that the probabilities are computed under the true
distribution of the data.
*You don’t need to show all the derivations for your answers in this problem.
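The following is a minimal sketch of the quantities defined above, assuming the stated class prior and conditional probability tables: it builds the joint distribution P_D(X1, X2, Y) under the conditional-independence assumption and evaluates the expected-error-rate sum for a predictor passed in as a function. The predictor shown is a made-up placeholder, not the naïve Bayes classifier the questions ask about.

    prior = {0: 0.5, 1: 0.5}
    p_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P(X1 | Y), keyed by Y then X1
    p_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # P(X2 | Y), keyed by Y then X2

    def joint(x1, x2, y):
        # X1 and X2 are conditionally independent given Y, as assumed in the problem
        return prior[y] * p_x1[y][x1] * p_x2[y][x2]

    def expected_error(predict):
        # sum over the 4 configurations of the probability that Y disagrees with Y_hat
        return sum(joint(x1, x2, 1 - predict(x1, x2))
                   for x1 in (0, 1) for x2 in (0, 1))

    # Placeholder predictor, purely for illustration (always predicts 0).
    print(expected_error(lambda x1, x2: 0))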
1. (4 pts) Write down the naïve Bayes prediction for each of the 4 possible configurations of X1, X2.
The following table may help you complete this problem.
2. (4 pts) Compute the expected error rate of this naïve Bayes classifier, which predicts Y given
both of the attributes {X1, X2}. Assume that the classifier is learned with infinite training
data.
3. (4 pts) Which of the following two has a smaller expected error rate?
4. (4 pts) Now, suppose that we create a new attribute X3, which is a deterministic copy of X2.
What is the expected error rate of the naïve Bayes classifier which predicts Y given all the attributes
(X1, X2, X3)? Assume that the classifier is learned with infinite training data.
5. (4 pts) Explain what is happening with naïve Bayes in problem 6.4. Does logistic regression
suffer from the same problem? Why?
7 Learning Theory (14 pts)
You read in the paper that the famous bird migration website, Netflocks, is offering a $1M prize
for accurately recommending movies about penguins. Furthermore, it is providing a training data
set containing 100,000,000 labeled training examples. Each training example consists of a set
of 100 real-valued features describing a movie, along with a boolean label indicating whether to
recommend this movie to a person.
You determine that the $1M can be yours if you can train a linear Support Vector Machine
with a true accuracy of 98%. Of course you understand that PAC learning theory provides only
probabilistic bounds, so you decide to enter only if you can prove you have at least a 0.9 probability
of achieving an accuracy of 98%.
1. (8 pts) Can you use PAC learning theory to decide whether you can meet your performance
objective? If yes, give an expression for the number of training examples sufficient to meet
your performance objective. If not, explain why not, then provide the minimum set of
additional assumptions needed so that PAC learning theory can be applied, and give an
expression for the number of training examples sufficient under your assumptions. (You may
leave your expression as an unevaluated arithmetic expression, but it should contain only
constants - no variables.)
2. (3 pts) Consider the PAC-style statement “we can achieve true accuracy of at least 98% with
probability 0.9.” What is the meaning of “with probability 0.9”? Answer this by describing
a randomized experiment which you could perform repeatedly to test whether the statement
is true.
3. (3 pts) Your friend already has a private dataset of 100,000,000 labeled movies, so she will end
up with twice as much training data as you. You train using the Netflocks data to produce
a classifier h1 . She uses the same learning algorithm, but trains with twice as much data to
produce her output hypothesis, h2 . You are interested in how well the training errors of h1
and h2 predict their true errors. Consider the ratio
4,   2,   √2,   1,   −1,   1/√2,   1/2,   1/4
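For reference on question 7.1, one commonly cited PAC sample-complexity bound (the VC-dimension form for a consistent learner, as given in standard textbook treatments) with the exam's numbers plugged in; whether this bound applies here, and which extra assumptions it needs, is exactly what the question asks you to decide. The value VC(H) = 101 assumes linear separators with a bias term over 100 real-valued features.

    % m examples suffice, for a consistent learner over a hypothesis class H, when
    %   m \ge \frac{1}{\epsilon}\Big( 4\log_2\frac{2}{\delta} + 8\,\mathrm{VC}(H)\,\log_2\frac{13}{\epsilon} \Big)
    % With \epsilon = 0.02 (98% true accuracy), \delta = 0.1, and \mathrm{VC}(H) = 101:
    m \ge \frac{1}{0.02}\Big( 4\log_2\frac{2}{0.1} + 8 \cdot 101 \cdot \log_2\frac{13}{0.02} \Big)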
* Any resemblance to real persons, animals, or organizations, living or dead, is purely coincidental.