
10-701 Midterm Exam, Fall 2007

This document appears to be an exam for a machine learning course. It provides instructions for a 17-page exam covering topics like loss functions, kernel regression, model selection, support vector machines, decision trees, and ensemble methods. The exam includes short answer questions worth up to 20 points, as well as longer questions on specific topics worth up to 30 points. Calculators are not allowed and students can only use materials provided in the course, like lecture slides and readings. The exam has an 80 minute time limit.

Uploaded by

Juan Vera

10-701 Midterm Exam, Fall 2007

1. Personal info:

• Name:
• Andrew account:
• E-mail address:

2. There should be 17 numbered pages in this exam (including this cover sheet).

3. You can use any material you brought: any book, class notes, your print outs of class
materials that are on the class website, including my annotated slides and relevant
readings, and Andrew Moore’s tutorials. You cannot use materials brought by other
students. Calculators are not necessary. Laptops, PDAs, phones and Internet access
are not allowed.

4. If you need more room to work out your answer to a question, use the back of the page
and clearly mark on the front of the page if we are to look at what’s on the back.

5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself
time to answer all of the easy ones, and avoid getting bogged down in the more difficult
ones before you have answered the easier ones.

6. Note there are extra-credit sub-questions. The grade curve will be made without
considering students’ extra credit points. The extra credit will then be used to try to
bump your grade up without affecting anyone else’s grade.

7. You have 80 minutes.

8. Good luck!

Question   Topic                                  Max. score           Score

1          Short questions                        20 + 0.1010 extra
2          Loss Functions                         12
3          Kernel Regression                      12
4          Model Selection                        14
5          Support Vector Machine                 12
6          Decision Trees and Ensemble Methods    30

1 [20 Points] Short Questions
The following short questions should be answered with at most two sentences, and/or a
picture. For yes/no questions, make sure to provide a short justification.

1. [2 points] Does a 2-class Gaussian Naive Bayes classifier with parameters $\mu_{1k}$, $\sigma_{1k}$, $\mu_{2k}$,
$\sigma_{2k}$ for attributes k = 1, ..., m have exactly the same representational power as logistic
regression (i.e., a linear decision boundary), given no assumptions about the variance
values $\sigma_{ik}^2$?

★ SOLUTION: No. A Gaussian Naive Bayes classifier has more expressive power than
a linear classifier if we don’t restrict the variance parameters to be class-independent. In
other words, a Gaussian Naive Bayes classifier is a linear classifier if we assume
$\sigma_{1k} = \sigma_{2k}$, ∀k.
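To see where the extra expressive power comes from, the log-odds can be written out term by term. This is only a sketch under the usual Naive Bayes factorization; the class priors $\pi_1$ and $\pi_2$ are symbols introduced here for illustration and do not appear in the question.

```latex
\ln\frac{P(Y=1\mid x)}{P(Y=2\mid x)}
  = \ln\frac{\pi_1}{\pi_2}
  + \sum_{k=1}^{m}\left[
      \ln\frac{\sigma_{2k}}{\sigma_{1k}}
      - \frac{(x_k-\mu_{1k})^2}{2\sigma_{1k}^2}
      + \frac{(x_k-\mu_{2k})^2}{2\sigma_{2k}^2}
    \right]
```

The coefficient of $x_k^2$ in this expression is $\frac{1}{2\sigma_{2k}^2} - \frac{1}{2\sigma_{1k}^2}$, which vanishes exactly when $\sigma_{1k} = \sigma_{2k}$. Only then is the decision boundary linear in x.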

2. [2 points] For linearly separable data, can a small slack penalty (“C”) hurt the training
accuracy when using a linear SVM (no kernel)? If so, explain how. If not, why not?

★ SOLUTION: Yes. If the optimal values of the α’s (say, in the dual formulation) are
greater than C, we may end up with a sub-optimal decision boundary with respect to
the training examples. Alternatively, a small C can allow large slacks, thus the resulting
classifier will have a small value of $\|w\|^2$ but can have non-zero training error.

3. [3 points] Consider running AdaBoost with Multinomial Naive Bayes as the weak
learner for two classes and k binary features. After t iterations of AdaBoost, how
many parameters do you need to remember? In other words, how many numbers do
you need to keep around to predict the label of a new example? Assume that the
weak-learner training error is non-zero at iteration t. Don’t forget to mention where
the parameters come from.

★ SOLUTION: Recall that we predict the label of an example using AdaBoost by
$y = \mathrm{sign}\left(\sum_{t'} \alpha_{t'} h_{t'}(x)\right)$. Hence for each iteration of AdaBoost we need to remember the
parameters of $h_{t'}(x)$ and $\alpha_{t'}$. Since our weak classifier is Multinomial Naive Bayes, we need
2k + 1 parameters for it: 2k for $P(X_i \mid Y = 0)$ and $P(X_i \mid Y = 1)$, i = 1, ..., k, and
one for the prior P(Y = 1). Hence for each iteration we have 2k + 2 parameters including $\alpha_{t'}$, and
after t iterations we have t(2k + 2) parameters.
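The count above can be restated as a few lines of arithmetic. The function name below is hypothetical and the snippet is only a bookkeeping sketch of the solution's reasoning, not part of the exam.

```python
def adaboost_nb_param_count(t: int, k: int) -> int:
    """Numbers stored after t boosting rounds with k binary features."""
    # Per weak learner: P(X_i | Y=0) and P(X_i | Y=1) for each of the
    # k features, plus the class prior P(Y=1).
    per_learner = 2 * k + 1
    # Each round also stores the vote weight alpha of its weak learner.
    per_round = per_learner + 1
    return t * per_round


print(adaboost_nb_param_count(t=10, k=3))
```

For t = 10 rounds and k = 3 features this evaluates 10 · (2 · 3 + 2).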

4. [2 points] In boosting, would you stop the iteration if the following happens? Justify
your answer with at most two sentences for each question.

• The error rate of the combined classifier on the original training data is 0.

★ SOLUTION: No. Boosting is robust to overfitting. Test error may decrease
even after the training error is 0.

• The error rate of the current weak classifier on the weighted training data is 0.

★ SOLUTION: Yes. In this case, $\alpha_t = +\infty$, and the weights of all the examples
are 0. On the other hand, the current weak classifier is perfect on the weighted training
data, so it is also perfect on the original data set. There is no need to combine this
classifier with other classifiers.
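The infinite weight can be seen numerically. The sketch below assumes the standard AdaBoost vote weight, $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, which the exam does not restate; the function name is hypothetical.

```python
import math


def adaboost_alpha(weighted_error: float) -> float:
    """Vote weight 0.5 * ln((1 - eps) / eps) for 0 <= eps < 1.

    Assumes the standard AdaBoost formula; returns +inf when eps == 0.
    """
    if weighted_error == 0.0:
        return math.inf
    return 0.5 * math.log((1.0 - weighted_error) / weighted_error)


# The weight grows without bound as the weighted error approaches 0.
for eps in (0.4, 0.1, 1e-6, 0.0):
    print(eps, adaboost_alpha(eps))
```

A weak classifier with zero weighted error therefore dominates every other vote, which is why continuing to boost adds nothing.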

5. [4 points] Given n linearly independent feature vectors in n dimensions, show that
for any assignment to the binary labels you can always construct a linear classifier
with weight vector w which separates the points. Assume that the classifier has the
form sign(w · x). Note that a square matrix composed of linearly independent rows is
invertible.

★ SOLUTION: Let us define the vector of class labels $y \in \{-1, +1\}^n$ and the matrix X such that
each of the n rows is one of the n-dimensional feature vectors. Then we want to find a w
such that sign(Xw) = y. We know that if Xw = y then sign(Xw) = y. Because X is
composed of linearly independent rows, we can invert X to obtain $w = X^{-1} y$. Therefore
we can construct a linear classifier that separates all n points. Interestingly, if we add an
additional constant term to the features, we can separate n + 1 points in n dimensions,
provided the augmented (n + 1)-dimensional feature vectors are linearly independent.
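The construction can be checked by brute force on a small case. The sketch below uses a hypothetical 3 × 3 matrix X with linearly independent rows and a small hand-written linear solver (so that it needs only the standard library); it solves Xw = y for every one of the 2³ labelings and confirms that sign(Xw) reproduces y.

```python
from itertools import product


def solve_linear_system(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [[float(v) for v in A[r]] + [float(b[r])] for r in range(n)]
    for col in range(n):
        # Use the largest available pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rows are not linearly independent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[r][n] / M[r][r] for r in range(n)]


def sign(z):
    return 1 if z > 0 else -1


def all_labelings_separable(X):
    """True if w = X^{-1} y classifies every +/-1 labeling y correctly."""
    n = len(X)
    for y in product([-1, 1], repeat=n):
        w = solve_linear_system(X, y)
        preds = [sign(sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        if preds != list(y):
            return False
    return True


# Hypothetical data: three linearly independent points in three dimensions.
X = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0],
     [2.0, 0.0, 1.0]]
print(all_labelings_separable(X))
```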

6. [3 points] Construct a one-dimensional classification dataset for which the leave-one-
out cross validation error of the One Nearest Neighbor algorithm is always 1. Stated
another way, the One Nearest Neighbor algorithm never correctly predicts the held out
point.

★ SOLUTION: For this question you simply need an alternating configuration of the
points {+, −, +, −, . . .} along the real line. In leave-one-out cross validation we compute
the predicted class for each point given all the remaining points. Because the neighbors
of every point are in the opposite class, the leave-one-out cross validation predictions will
always be wrong.

7. [2 points] Would we expect that running AdaBoost using the ID3 decision tree learning
algorithm (without pruning) as the weak learning algorithm would have a better true
error rate than running ID3 alone (i.e., without boosting (also without pruning))?
Explain.

★ SOLUTION: No. Unless two differently labeled examples have the same feature
vectors, ID3 will find a consistent classifier every time. In particular, after the first iteration
of AdaBoost, $\epsilon_1 = 0$, so the first decision tree learned gets an infinite weight $\alpha_1 = \infty$, and
the example weights $D_{t+1}(i)$ would either all become 0, all become ∞, or would remain
uniform (depending on the implementation). In any case, we either halt, overflow, or make
no progress, none of which helps the true error rate.

8. [1 point] Suppose there is a coin with unknown bias p. Does there exist some value of p
for which we would expect the maximum a-posteriori estimate of p, using a Beta(4, 2)
prior, to require more coin flips before it is close to the true value of p, compared to
the number of flips required of the maximum likelihood estimate of p? Explain.
(The Beta(4, 2) distribution is given in the figure below.)
[Figure 1: Beta(4, 2) distribution ($\beta_H = 4$, $\beta_T = 2$); density p(θ) plotted for θ ∈ [0, 1].]

★ SOLUTION: Yes. Consider p = 0. Because the prior value there is low, it takes
more samples to overcome our prior beliefs and converge to the correct solution, compared
to maximum likelihood (which only needs a single observation to find p exactly).

9. [1 point] Suppose there is a coin with unknown bias p. Does there exist some value
of p for which we would expect the maximum a-posteriori estimate of p, using a
Uniform([0, 1]) prior, to require more coin flips before it is close to the true value
of p, compared to the number of flips required of the maximum likelihood estimate of
p? Explain.

★ SOLUTION: No. In this case, the prior’s PDF is a constant $f_p(q) = 1$, so the
MAP solution $\arg\max_{q \in [0,1]} f_p(q) f_x(\mathrm{Data}; q)$ is always identical to the maximum likelihood
solution $\arg\max_{q \in [0,1]} f_x(\mathrm{Data}; q)$.
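Both coin questions can be illustrated with the closed-form posterior mode. The sketch below assumes p is the probability of heads, that the prior is Beta($\beta_H$, $\beta_T$), and that the MAP estimate is the mode of the Beta posterior, $(h + \beta_H - 1)/(n + \beta_H + \beta_T - 2)$ for h heads in n flips; the function names are hypothetical.

```python
def mle(heads: int, flips: int) -> float:
    """Maximum likelihood estimate of the heads probability."""
    return heads / flips


def map_estimate(heads: int, flips: int, beta_h: float, beta_t: float) -> float:
    """Mode of the Beta(heads + beta_h, tails + beta_t) posterior."""
    return (heads + beta_h - 1) / (flips + beta_h + beta_t - 2)


# True bias p = 0: every flip comes up tails, so heads = 0.
for n in (1, 10, 100):
    print(n,
          mle(0, n),                 # exact after a single flip
          map_estimate(0, n, 4, 2),  # Beta(4, 2) prior: 3 / (n + 4)
          map_estimate(0, n, 1, 1))  # uniform prior: identical to the MLE
```

With the Beta(4, 2) prior the estimate at p = 0 decays only like 3/(n + 4), while the uniform prior, Beta(1, 1), reproduces the maximum likelihood estimate for every n.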

10. [0.1010 extra credit] Can a linear classifier separate the positive from the negative
examples in the dataset below? Justify.

[Figure: scatter plot whose points carry the labels “Colbert”, “U2”, “for”, “Loosing my religion”, “president”, “The Beatles”, “Nirvana”, “There is a season...”, “Grunge”, and “Turn! Turn! Turn!”.]

★ SOLUTION: No. This is an instantiation of XOR.

2 [12 points] Loss Function
Generally speaking, a classifier can be written as H(x) = sign(F(x)), where $H(x) : \mathbb{R}^d \to \{-1, 1\}$
and $F(x) : \mathbb{R}^d \to \mathbb{R}$. To obtain the parameters in F(x), we need to minimize the
loss function averaged over the training set: $\sum_i L(y^i F(x^i))$. Here L is a function of yF(x).
For example, for linear classifiers, $F(x) = w_0 + \sum_{j=1}^{d} w_j x_j$, and $yF(x) = y\left(w_0 + \sum_{j=1}^{d} w_j x_j\right)$.

1. [4 points] Which loss functions below are appropriate to use in classification? For the
ones that are not appropriate, explain why not. In general, what conditions does L
have to satisfy in order to be an appropriate loss function? The x axis is yF (x), and
the y axis is L(yF (x)).

[Figure: five candidate loss functions in panels (a), (b), (c), (d), and (e), each plotting L(yF(x)) against yF(x).]

★ SOLUTION: (a) and (b) are appropriate to use in classification. In (c), there is very
little penalty for extremely misclassified examples, which correspond to very negative yF (x).
In (d) and (e), correctly classified examples are penalized, whereas misclassified examples
are not. In general, L should approximate the 0-1 loss, and it should be a non-increasing
function of yF (x).

2. [4 points] Of the above loss functions appropriate to use in classification, which one is
the most robust to outliers? Justify your answer.

★ SOLUTION: (b) is more robust to outliers. For outliers, yF(x) is often very negative.
In (a), outliers are heavily penalized. So the resulting classifier is largely affected by
the outliers. On the other hand, in (b), the loss of outliers is bounded. So the resulting
classifier is less affected by the outliers, and thus more robust.

3. [4 points] Let $F(x) = w_0 + \sum_{j=1}^{d} w_j x_j$ and $L(yF(x)) = \frac{1}{1 + \exp(yF(x))}$. Suppose you use
gradient descent to obtain the optimal parameters $w_0$ and $w_j$. Give the update rules
for these parameters.

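★ SOLUTION: To obtain the parameters in F(x), we need to minimize

$$\sum_i L(y^i F(x^i)) = \sum_i \frac{1}{1 + \exp(y^i F(x^i))} = \sum_i \frac{1}{1 + \exp\left(y^i \left(w_0 + \sum_{j=1}^{d} w_j x_j^i\right)\right)}.$$

$$\frac{\partial}{\partial w_0} \sum_i L(y^i F(x^i)) = \sum_i \frac{\partial}{\partial w_0} \left( \frac{1}{1 + \exp\left(y^i \left(w_0 + \sum_{j=1}^{d} w_j x_j^i\right)\right)} \right) = -\sum_i \frac{y^i \exp(y^i F(x^i))}{\left(1 + \exp(y^i F(x^i))\right)^2}$$

∀k = 1, . . . , d,

$$\frac{\partial}{\partial w_k} \sum_i L(y^i F(x^i)) = \sum_i \frac{\partial}{\partial w_k} \left( \frac{1}{1 + \exp\left(y^i \left(w_0 + \sum_{j=1}^{d} w_j x_j^i\right)\right)} \right) = -\sum_i \frac{y^i x_k^i \exp(y^i F(x^i))}{\left(1 + \exp(y^i F(x^i))\right)^2}$$

Therefore, the update rules are as follows.

$$w_0^{(t+1)} = w_0^{(t)} - \eta \frac{\partial}{\partial w_0} \sum_i L(y^i F(x^i)) = w_0^{(t)} + \eta \sum_i \frac{y^i \exp(y^i F(x^i))}{\left(1 + \exp(y^i F(x^i))\right)^2}$$

∀k = 1, . . . , d,

$$w_k^{(t+1)} = w_k^{(t)} - \eta \frac{\partial}{\partial w_k} \sum_i L(y^i F(x^i)) = w_k^{(t)} + \eta \sum_i \frac{y^i x_k^i \exp(y^i F(x^i))}{\left(1 + \exp(y^i F(x^i))\right)^2}$$

COMMON MISTAKE 1: Instead of minimizing $\sum_i L(y^i F(x^i))$, some people only
minimize L(yF(x)).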
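The update rules can be exercised on a toy problem. The sketch below is not from the exam: the data set, step size, and number of steps are hypothetical, and the code simply applies the gradient of the summed loss 1/(1 + exp(yF(x))) derived in the solution.

```python
import math


def F(w0, w, x):
    """Linear score F(x) = w0 + sum_j w_j x_j."""
    return w0 + sum(wj * xj for wj, xj in zip(w, x))


def total_loss(w0, w, data):
    """Sum over the training set of L(yF(x)) = 1 / (1 + exp(yF(x)))."""
    return sum(1.0 / (1.0 + math.exp(y * F(w0, w, x))) for x, y in data)


def gradient(w0, w, data):
    """Analytic gradient of total_loss with respect to w0 and each w_k."""
    g0, g = 0.0, [0.0] * len(w)
    for x, y in data:
        e = math.exp(y * F(w0, w, x))
        # Derivative of this example's loss with respect to F(x).
        common = -y * e / (1.0 + e) ** 2
        g0 += common
        g = [gk + common * xk for gk, xk in zip(g, x)]
    return g0, g


def gradient_descent(data, dim, eta=0.1, steps=200):
    """Repeatedly apply w <- w - eta * gradient, starting from zero."""
    w0, w = 0.0, [0.0] * dim
    for _ in range(steps):
        g0, g = gradient(w0, w, data)
        w0 -= eta * g0
        w = [wk - eta * gk for wk, gk in zip(w, g)]
    return w0, w


# Hypothetical toy data: (features, label) pairs with labels in {-1, +1}.
data = [([2.0, 1.0], 1), ([1.5, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w0, w = gradient_descent(data, dim=2)
print(total_loss(0.0, [0.0, 0.0], data), "->", total_loss(w0, w, data))
```

Note that the gradient sums over all training examples, which is exactly the point of the common mistake noted above.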

3 [12 points] Kernel Regression, k-NN
1. [4 points] Sketch the fit Y given X for the dataset given below using kernel regression
with a box kernel

$$K(x_i, x_j) = I(-h \le x_i - x_j < h) = \begin{cases} 1 & \text{if } -h \le x_i - x_j < h \\ 0 & \text{otherwise} \end{cases}$$

for h = 0.5, 2.

★ SOLUTION:

• h = 0.5
[Figure: “Kernel Regression with h = 0.5”, y plotted against x for x ∈ [0, 6].]

• h = 2
[Figure: “Kernel Regression with h = 2”, y plotted against x for x ∈ [0, 6].]

Just for fun, what happens with 1-NN and 2-NN:

• h = 0.5
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

• h = 2
[Figure: “Kernel Regression with h = 2”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

COMMON MISTAKE 1: Some people tried to classify the given points using
k-NN instead of doing regression.

COMMON MISTAKE 2: Many people sketched what seemed to be kernel regression
or local linear regression with a Gaussian kernel.
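For readers who want to reproduce curves of this kind, here is a minimal sketch of box-kernel regression next to k-NN regression. The training data are hypothetical (the exam's data set is in a figure), the first kernel argument is assumed to be the query point, and the prediction is taken to be the kernel-weighted average of the training targets.

```python
import math


def box_kernel_regression(x_query, xs, ys, h):
    """Average of y_j over training points with -h <= x_query - x_j < h.

    Returns NaN when no training point falls inside the window.
    """
    in_window = [y for x, y in zip(xs, ys) if -h <= x_query - x < h]
    return sum(in_window) / len(in_window) if in_window else math.nan


def knn_regression(x_query, xs, ys, k):
    """Average of y over the k nearest training points (ties by index)."""
    nearest = sorted(range(len(xs)), key=lambda j: abs(xs[j] - x_query))[:k]
    return sum(ys[j] for j in nearest) / k


# Hypothetical training data with one point per interval of width 2h = 1.
xs = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
ys = [1.0, 3.0, 2.0, 4.0, 1.0, 2.0]
for q in (0.8, 2.2, 3.1):
    print(q,
          box_kernel_regression(q, xs, ys, 0.5),
          box_kernel_regression(q, xs, ys, 2.0),
          knn_regression(q, xs, ys, 1),
          knn_regression(q, xs, ys, 2))
```

On this data the h = 0.5 box kernel and 1-NN agree at the three query points shown, while 2-NN and the h = 2 kernel average over several targets.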

2. [4 points] Sketch or describe a dataset where kernel regression with the box kernel
above with h = 0.5 gives the same regression values as 1-NN but not as 2-NN in the
domain x ∈ [0, 6] below.

★ SOLUTION: Part 1 of the problem with h = 0.5 is actually an example where the
regression values of kernel regression match the 1-NN regression values (in the given domain).
The basic idea is to create a dataset where each interval of width 1 (2h) contains only 1
“training” point and not all given y values are the same (otherwise 1-NN and 2-NN would give
the same regression values).

COMMON MISTAKE 1: Some people tried to come up with examples for k-NN
classification instead of regression.

Here are some other example solutions. Note that you only had to give the points and here
the lines are added just for you to see what happens.

• Example 1
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

• Example 2
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

COMMON MISTAKE 2: Here kernel regression is not defined for some range
(e.g. x ∈ [0, 1.5]).
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

3. [4 points] Sketch or describe a dataset where kernel regression with the box kernel
above with h = 0.5 gives the same regression values as 2-NN but not as 1-NN in the
domain x ∈ (0, 6) below.

★ SOLUTION: As in Part 2, the basic idea is to create a dataset where each interval
of width 1 (2h) contains 2 “training” points.
Here are some example solutions.

• Example 1
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

• Example 2
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

• Example 3: This example works only if 1-NN is given a rule that breaks ties between two
neighbors at the same distance from the test point, rather than averaging over both their values.
[Figure: “Kernel Regression with h = 0.5”, comparing Box-Kernel Regression, 1-NN, and 2-NN; y plotted against x for x ∈ [0, 6].]

4 [14 Points] Model Selection
A central theme in machine learning is model selection. In this problem you will have the
opportunity to demonstrate your understanding of various model selection techniques and
their consequences. To make things more concrete we will consider the dataset D given in
Equation 1 consisting of n independent identically distributed observations. The features of
D consist of pairs $(x_1^i, x_2^i) \in \mathbb{R}^2$ and the observations $y^i \in \mathbb{R}$ are continuous valued.

$$D = \{((x_1^1, x_2^1), y^1), ((x_1^2, x_2^2), y^2), \ldots, ((x_1^n, x_2^n), y^n)\} \tag{1}$$

Consider the abstract model given in Equation 2. The function $f_{\theta_1,\theta_2}$ is a mapping from the
features in $\mathbb{R}^2$ to an observation in $\mathbb{R}$ which depends on two parameters $\theta_1$ and $\theta_2$. The
$\epsilon^i$ correspond to the noise. Here we will assume that the $\epsilon^i \sim N(0, \sigma^2)$ are independent
Gaussians with zero mean and variance $\sigma^2$.

$$y^i = f_{\theta_1,\theta_2}(x_1^i, x_2^i) + \epsilon^i \tag{2}$$

1. [4 Points] Show that the log likelihood of the data given the parameters is equal to
Equation 3.

$$l(D; \theta_1, \theta_2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2 - n \log\left(\sqrt{2\pi}\,\sigma\right) \tag{3}$$

Recall the probability density function of the $N(\mu, \sigma^2)$ Gaussian distribution is given
by Equation 4.

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \tag{4}$$

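★ SOLUTION: The first thing one should think about is the probability of a single
data point under this model. We can of course write this as:

$$y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i) = \epsilon^i$$

Because we know the distribution of $\epsilon^i$ we know that:

$$y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i) \sim N(0, \sigma^2)$$

We can therefore write the likelihood of the data as:

$$L(D; \theta_1, \theta_2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2}{2\sigma^2}\right)$$

To compute the log likelihood we take the log of the likelihood from above to obtain:

$$\begin{aligned}
l(D; \theta_1, \theta_2) &= \log\left(L(D; \theta_1, \theta_2)\right) \\
&= \log\left(\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2}{2\sigma^2}\right)\right) \\
&= \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2}{2\sigma^2}\right)\right) \\
&= \sum_{i=1}^{n} \left(-\log\left(\sqrt{2\pi}\,\sigma\right) - \frac{\left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2}{2\sigma^2}\right) \\
&= -n \log\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2
\end{aligned}$$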
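The identity can also be checked numerically. In the sketch below the residuals $r_i = y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)$ and the value of σ are hypothetical; one function evaluates Equation 3 directly and the other sums the log of the Gaussian density from Equation 4.

```python
import math


def log_likelihood(residuals, sigma):
    """Equation 3, written in terms of the residuals r_i = y^i - f(x^i)."""
    n = len(residuals)
    return (-sum(r * r for r in residuals) / (2 * sigma ** 2)
            - n * math.log(math.sqrt(2 * math.pi) * sigma))


def log_likelihood_from_pdf(residuals, sigma):
    """Sum of log N(r_i; 0, sigma^2), computed from Equation 4."""
    def pdf(r):
        return math.exp(-r * r / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
    return sum(math.log(pdf(r)) for r in residuals)


residuals = [0.3, -1.2, 0.7, 0.05]   # hypothetical values
print(log_likelihood(residuals, 1.5), log_likelihood_from_pdf(residuals, 1.5))
```

The two printed values agree up to floating-point round-off.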

2. [1 Point] If we disregard the parts that do not depend on $f_{\theta_1,\theta_2}$ and Y, the negative
of the log-likelihood given in Equation 3 is equivalent to what commonly used loss
function?

★ SOLUTION: Up to the constant factor $1/(2\sigma^2)$, the negative of the log likelihood is the
square loss function $\sum_{i=1}^{n} \left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2$. This is
the loss function used in least squares regression.

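3. [2 Points] Many common techniques used to find the maximum likelihood estimates of
$\theta_1$ and $\theta_2$ rely on our ability to compute the gradient of the log-likelihood. Compute
the gradient of the log likelihood with respect to $\theta_1$ and $\theta_2$. Express your answer in
terms of:

$$y^i, \quad f_{\theta_1,\theta_2}(x_1^i, x_2^i), \quad \frac{\partial}{\partial\theta_1} f_{\theta_1,\theta_2}(x_1^i, x_2^i), \quad \frac{\partial}{\partial\theta_2} f_{\theta_1,\theta_2}(x_1^i, x_2^i)$$

★ SOLUTION: The important technique we use here is the chain rule. To save space
I will take the gradient with respect to $\theta_j$.

$$\begin{aligned}
\frac{\partial}{\partial\theta_j} l(D; \theta_1, \theta_2) &= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \frac{\partial}{\partial\theta_j} \left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right)^2 \\
&= -\frac{1}{2\sigma^2} \sum_{i=1}^{n} 2\left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right) \left(-\frac{\partial}{\partial\theta_j} f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right) \\
&= \frac{1}{\sigma^2} \sum_{i=1}^{n} \left(y^i - f_{\theta_1,\theta_2}(x_1^i, x_2^i)\right) \frac{\partial}{\partial\theta_j} f_{\theta_1,\theta_2}(x_1^i, x_2^i)
\end{aligned}$$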

COMMON MISTAKE 1: Many people forgot about the negative sign inside $(\cdot)^2$. This
is very important, as forgetting it would result in an algorithm that finds the minimal likelihood.

4. [2 Points] Given the learning rate η, what update rule would you use in gradient descent
to maximize the likelihood?

★ SOLUTION: The tricky part about this problem is deciding whether to add or
subtract the gradient. The easiest way to think about this is to consider a simple line. If the
slope is positive then adding the slope to the x value will result in a larger y value. Because
we are trying to maximize the likelihood we will add the gradient:

$$\theta_j^{(t+1)} = \theta_j^{(t)} + \eta \frac{\partial}{\partial\theta_j} l(D; \theta_1, \theta_2) \tag{5}$$

COMMON MISTAKE 1: Many people had the sign backwards, which is what we
normally use when we are trying to minimize some loss function.
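A small run makes the sign convention concrete. The exam leaves $f_{\theta_1,\theta_2}$ abstract, so the sketch below assumes a hypothetical linear model $f_{\theta_1,\theta_2}(x_1, x_2) = \theta_1 x_1 + \theta_2 x_2$, noise-free synthetic data, and arbitrary values for η and the number of steps. It adds η times the gradient of the log likelihood at every step.

```python
import random


def f(theta, x):
    # Hypothetical model chosen for illustration: theta1*x1 + theta2*x2.
    return theta[0] * x[0] + theta[1] * x[1]


def grad_f(theta, x):
    # Partial derivatives of f with respect to theta1 and theta2.
    return [x[0], x[1]]


def grad_log_likelihood(theta, data, sigma):
    """(1 / sigma^2) * sum_i (y^i - f(x^i)) * df/dtheta_j for j = 1, 2."""
    g = [0.0, 0.0]
    for x, y in data:
        r = y - f(theta, x)
        g = [gj + r * dj / sigma ** 2 for gj, dj in zip(g, grad_f(theta, x))]
    return g


def gradient_ascent(data, sigma=1.0, eta=0.005, steps=2000):
    theta = [0.0, 0.0]
    for _ in range(steps):
        g = grad_log_likelihood(theta, data, sigma)
        # Add the gradient: we are maximizing the likelihood.
        theta = [tj + eta * gj for tj, gj in zip(theta, g)]
    return theta


rng = random.Random(0)
true_theta = [2.0, -1.0]
xs = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(50)]
data = [(x, f(true_theta, x)) for x in xs]   # noise-free targets
theta_hat = gradient_ascent(data)
print(theta_hat)
```

Flipping the plus sign in the update to a minus turns this into descent on the log likelihood, which moves away from the maximum likelihood estimate.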

5. [3 Points] Suppose you are given some function h such that h(θ1 , θ2 ) ∈ R is large when
fθ1 ,θ2 is complicated and small when fθ1 ,θ2 is simple. Use the function h along with
the negative log-likelihood to write down an expression for the regularized loss with
parameter λ.

★ SOLUTION: For this problem we simply write the regularized loss as the negative
log likelihood plus the regularization term:

$$\mathrm{loss}_{\mathrm{reg}}(D; \theta_1, \theta_2) = -l(D; \theta_1, \theta_2) + \lambda h(\theta_1, \theta_2) \tag{6}$$

Based on this equation we see that to minimize the loss we want to maximize the likelihood
and minimize the model complexity.

COMMON MISTAKE 1: Again, the signs are important here. Conceptually we
want to minimize model complexity and maximize model fit (or likelihood). Since we always
try to minimize the regularized loss we know that we want a negative sign on the likelihood
and positive sign on the regularization term $h(\theta_1, \theta_2)$.

6. [2 Points] For small and large values of λ, describe the bias-variance trade-off with
respect to the regularized loss provided in the previous part.

★ SOLUTION: As we increase the value of λ we place a greater penalty on model
complexity, resulting in simpler models, more model bias, and less model variance.
Alternatively, if we decrease the value of λ then we permit more complex models, resulting in less
model bias and greater model variance.
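The effect of λ can be seen on a concrete instance. The exam leaves both f and h abstract, so the sketch below makes two labeled assumptions: a hypothetical linear model $f_{\theta_1,\theta_2}(x_1, x_2) = \theta_1 x_1 + \theta_2 x_2$ and the complexity measure $h(\theta_1, \theta_2) = \theta_1^2 + \theta_2^2$. The data are made up, and additive constants in the negative log likelihood are dropped.

```python
def fit_regularized(data, lam, sigma=1.0):
    """Minimize (1/(2 sigma^2)) * sum (y - theta.x)^2 + lam * (theta1^2 + theta2^2).

    Setting the gradient to zero gives
        (X^T X + 2 sigma^2 lam I) theta = X^T y,
    which is solved here for two parameters with Cramer's rule.
    """
    a11 = sum(x[0] * x[0] for x, _ in data) + 2 * sigma ** 2 * lam
    a12 = sum(x[0] * x[1] for x, _ in data)
    a22 = sum(x[1] * x[1] for x, _ in data) + 2 * sigma ** 2 * lam
    b1 = sum(x[0] * y for x, y in data)
    b2 = sum(x[1] * y for x, y in data)
    det = a11 * a22 - a12 * a12
    return [(b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]


# Hypothetical data: ((x1, x2), y) pairs.
data = [((1.0, 0.5), 2.1), ((0.2, 1.0), -0.4),
        ((-1.0, 0.3), -2.2), ((0.5, -1.0), 1.9)]
for lam in (0.0, 1.0, 10.0, 100.0):
    theta = fit_regularized(data, lam)
    print(lam, theta, sum(t * t for t in theta))
```

As λ grows, the fitted parameters shrink toward zero: the model becomes simpler and less sensitive to the particular sample (lower variance) at the cost of fitting the data less closely (higher bias).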

5 [12 points] Support Vector Machine
1. [2 points] Suppose we are using a linear SVM (i.e., no kernel), with some large C value,
and are given the following data set.

[Figure: data set plotted in the (X1, X2) plane.]

Draw the decision boundary of the linear SVM. Give a brief explanation.

★ SOLUTION: Because of the large C value, the decision boundary will classify all of
the examples correctly. Furthermore, among separators that classify the examples correctly,
it will have the largest margin (distance to closest point).

2. [3 points] In the following image, circle the points such that removing that example
from the training set and retraining SVM, we would get a different decision boundary
than training on the full sample.

[Figure: data set plotted in the (X1, X2) plane.]

You do not need to provide a formal proof, but give a one or two sentence explanation.

★ SOLUTION: These examples are the support vectors; all of the other examples
are such that their corresponding constraints are not tight in the optimization problem, so
removing them will not create a solution with smaller objective function value (norm of
w). These three examples are positioned such that removing any one of them introduces
slack in the constraints, allowing for a solution with a smaller objective function value and
with a different third support vector; in this case, because each of these new (replacement)
support vectors is not close to the old separator, the decision boundary shifts to make its
distance to that example equal to the others.

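3. [3 points] Suppose instead of SVM, we use regularized logistic regression to learn the
classifier. That is,

$$(w, b) = \arg\min_{w \in \mathbb{R}^2,\, b \in \mathbb{R}} \frac{\|w\|^2}{2} - \sum_i \left( 1[y^{(i)} = 0] \ln \frac{1}{1 + e^{(w \cdot x^{(i)} + b)}} + 1[y^{(i)} = 1] \ln \frac{e^{(w \cdot x^{(i)} + b)}}{1 + e^{(w \cdot x^{(i)} + b)}} \right).$$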

In the following image, circle the points such that removing that example from the
training set and running regularized logistic regression, we would get a different decision
boundary than training with regularized logistic regression on the full sample.

[Figure: data set plotted in the (X1, X2) plane.]

You do not need to provide a formal proof, but give a one or two sentence explanation.

★ SOLUTION: Because of the regularization, the weights will not diverge to infinity,
and thus the probabilities at the solution are not at 0 and 1. Because of this, every example
contributes to the loss function, and thus has an influence on the solution.

4. [4 points] Suppose we have a kernel K(·, ·), such that there is an implicit high-dimensional
feature map $\phi : \mathbb{R}^d \to \mathbb{R}^D$ that satisfies $\forall x, z \in \mathbb{R}^d$, $K(x, z) = \phi(x) \cdot \phi(z)$, where
$\phi(x) \cdot \phi(z) = \sum_{i=1}^{D} \phi(x)_i \phi(z)_i$ is the dot product in the D-dimensional space.
Show how to calculate the Euclidean distance in the D-dimensional space

$$\|\phi(x) - \phi(z)\| = \sqrt{\sum_{i=1}^{D} \left(\phi(x)_i - \phi(z)_i\right)^2}$$

without explicitly calculating the values in the D-dimensional vectors. For this question,
you should provide a formal proof.

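★ SOLUTION:

$$\begin{aligned}
\|\phi(x) - \phi(z)\| &= \sqrt{\sum_{i=1}^{D} \left(\phi(x)_i - \phi(z)_i\right)^2} \\
&= \sqrt{\sum_{i=1}^{D} \left(\phi(x)_i^2 + \phi(z)_i^2 - 2\phi(x)_i \phi(z)_i\right)} \\
&= \sqrt{\left(\sum_{i=1}^{D} \phi(x)_i^2\right) + \left(\sum_{i=1}^{D} \phi(z)_i^2\right) - \left(\sum_{i=1}^{D} 2\phi(x)_i \phi(z)_i\right)} \\
&= \sqrt{\phi(x) \cdot \phi(x) + \phi(z) \cdot \phi(z) - 2\phi(x) \cdot \phi(z)} \\
&= \sqrt{K(x, x) + K(z, z) - 2K(x, z)}.
\end{aligned}$$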
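The result can be sanity-checked on a kernel whose feature map is known in closed form. The example below is an assumption made for illustration, not part of the exam: the quadratic kernel $K(x, z) = (x \cdot z)^2$ on 2-D inputs, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.

```python
import math


def K(x, z):
    """Quadratic kernel K(x, z) = (x . z)^2 on 2-D inputs."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map of K: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]


def kernel_distance(x, z):
    """Distance in feature space using kernel evaluations only."""
    return math.sqrt(K(x, x) + K(z, z) - 2 * K(x, z))


def explicit_distance(x, z):
    """Distance in feature space computed from the explicit vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi(x), phi(z))))


x, z = (1.0, 2.0), (3.0, -1.0)
print(kernel_distance(x, z), explicit_distance(x, z))
```

Both routes give the same distance up to round-off, but the first never constructs a feature vector.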

6 [30 points] Decision Tree and Ensemble Methods
An ensemble classifier $H_T(x)$ is a collection of T weak classifiers $h_t(x)$, each with some weight
$\alpha_t$, t = 1, . . . , T. Given a data point $x \in \mathbb{R}^d$, $H_T(x)$ predicts its label based on the weighted
majority vote of the ensemble. In the binary case where the class label is either 1 or −1,
$H_T(x) = \mathrm{sgn}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$, where $h_t(x) : \mathbb{R}^d \to \{-1, 1\}$, and sgn(z) = 1 if z > 0 and
sgn(z) = −1 if z ≤ 0. Boosting is an example of ensemble classifiers where the weights are
calculated based on the training error of the weak classifier on the weighted training set.

1. [10 points] For the following data set,

[Figure: two-class data set plotted in the plane, with both axes ranging over [−2, 2].]

• Describe a binary decision tree with the minimum depth and consistent with the
data;

★ SOLUTION: The decision tree is as follows.

[Figure: decision tree.]
• Describe an ensemble classifier H2 (x) with 2 weak classifiers that is consistent
with the data. The weak classifiers should be simple decision stumps. Specify the
weak classifiers and their weights.

★ SOLUTION: $h_1(x) = \mathrm{sgn}(x)$, $\alpha_1 = 1$, $h_2(x) = \mathrm{sgn}(y)$, $\alpha_2 = 1$.
2. [10 points] For the following XOR data set,

[Figure: XOR data set plotted in the plane, with both axes ranging over [−2, 2].]

• Describe a binary decision tree with the minimum depth and consistent with the
data;

★ SOLUTION: The decision tree is as follows.

[Figure: decision tree.]
• Let the ensemble classifier consist of the four binary classifiers shown below (the
arrow means that the corresponding classifier classifies every data point in that
direction as +). Prove that there are no weights $\alpha_1, \ldots, \alpha_4$ that make the ensemble
classifier consistent with the data.

[Figure: the XOR data set with the four binary classifiers h1, h2, h3, and h4 drawn as arrows; both axes range over [−2, 2].]

★ SOLUTION: Based on the binary classifiers, we have the following inequalities.

α1 − α2 − α3 + α4 > 0
−α1 + α2 − α3 + α4 ≤ 0
−α1 + α2 + α3 − α4 > 0
α1 − α2 + α3 − α4 ≤ 0

The left-hand sides of the first and third inequalities are negatives of each other, so adding
the two inequalities gives 0 > 0, a contradiction. They therefore cannot be satisfied at the same time, and
there are no weights $\alpha_1, \ldots, \alpha_4$ that make the ensemble classifier consistent with the data.
3. [10 points] Suppose that for each data point, the feature vector $x \in \{0, 1\}^m$, i.e., x
consists of m binary valued features, the class label $y \in \{-1, 1\}$, and the true classifier
is a majority vote over the features, i.e. $y = \mathrm{sgn}\left(\sum_{i=1}^{m} (2x_i - 1)\right)$, where $x_i$ is the i-th
component of the feature vector.

• Describe a binary decision tree with the minimum depth and consistent with the
data. How many leaves does it have?

★ SOLUTION: The decision tree looks as follows.

[Figure: decision tree.]

Any answers between $2^{m/2}$ and $2^m$ will get full points.

• Describe an ensemble classifier with the minimum number of weak classifiers.
Specify the weak classifiers and their weights.

★ SOLUTION: The ensemble classifier has m weak classifiers. $h_i(x) = 2x_i - 1$,
$\alpha_i = 1$, i = 1, . . . , m.
