Final 2019
Exam policy: This exam allows one one-page, two-sided cheat sheet; no
other materials are permitted.
Time: 120 minutes. Be sure to write your name and Penn student ID
(the 8 larger digits on your ID card) on the answer form and fill in the
associated bubbles in pencil.
If you think a question is ambiguous, mark what you think is the best answer.
As always, we will consider written regrade requests if your interpretation of
a question differed from what we intended. We will only grade the answer
forms.
For the “TRUE or FALSE” questions, note that “TRUE” is (a) and “FALSE”
is (b). For the multiple choice questions, select exactly one answer.
1. [2 points] For very large training data sets, which of the following will
usually have the lowest training time?
2. [2 points] For very large training data sets, which of the following will
usually produce the smallest models (requiring the fewest parameters)?
6. [1 points] True or False? The distribution P (A) = 1/2, P (B) =
1/2 has higher entropy than the distribution P (A) = 1/3, P (B) =
1/3, P (C) = 1/3
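For reference, Shannon entropy is H(P) = − Σ p log2 p, summed over the outcomes; a minimal Python sketch of the two values being compared:

import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))       # 1.0 bit for the uniform distribution over 2 outcomes
print(entropy([1/3, 1/3, 1/3]))  # about 1.585 bits for the uniform distribution over 3 outcomes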
For the next 3 questions, assume you have classification data with
classes Y being +1 or -1 and features xj also being +1 or -1 for j ∈
1, ..., p.
In an attempt to turbocharge your classifier, you duplicate each fea-
ture, so now each example has 2p features, with xp+j = xj for j ∈
1, ..., p. The following questions compare the original feature set with
the doubled one. You may assume that in the case of ties, class +1
is always chosen. Assume that there are equal numbers of training
examples in each class.
(a) The test accuracy will usually be higher with the original features.
(b) The test accuracy will usually be higher with the doubled fea-
tures.
(c) The test accuracy will be the same with either feature set.
(a) The test accuracy will, in general, be higher with the original
features.
(b) The test accuracy will, in general, be higher with the doubled
features.
(c) The test accuracy will always be the same with either feature set.
14. [2 points] In neural networks, what is the benefit of the ReLU activation
function over a Sigmoid activation function?
15. [2 points] Consider a convolutional net where the p-dimensional input data
is laid out in a one-dimensional fashion (e.g. for text or speech), and where
there are k filters (kernels), each of size m × 1. Assume no padding, and a
stride of size s. Which of the following best approximates the number of
outputs of this layer?
(a) ksp/m
(b) kspm
(c) ksm/p
(d) spm
(e) kmp/s
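For reference, the usual output-size arithmetic for a 1-D convolution with no padding is floor((p − m)/s) + 1 positions per filter, times k filters; a small sketch with made-up sizes:

def conv1d_output_count(p, m, s, k):
    # k filters of size m x 1, stride s, no padding, input length p
    positions_per_filter = (p - m) // s + 1
    return k * positions_per_filter

# Hypothetical sizes, only to exercise the formula; for p much larger than m this is roughly k*p/s.
print(conv1d_output_count(p=1000, m=5, s=2, k=16))  # 16 * 498 = 7968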
16. [2 points] Suppose you have inputs as x = −2, y = 5, and z = −4. You
have a neuron q and neuron f with functions:
q = x + y
f = q ∗ z
What is the gradient of f with respect to x, y, and z?
(a) (−3, 4, 4)
(b) (4, 4, 3)
(c) (−4, −4, 3)
(d) (3, −4, −4)
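A minimal chain-rule sketch for this two-node graph (∂f/∂x = ∂f/∂q · ∂q/∂x, and similarly for y and z):

x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y   # 3
f = q * z   # -12

# Backward pass (chain rule)
df_dz = q            # d(q*z)/dz
df_dq = z            # d(q*z)/dq
df_dx = df_dq * 1.0  # dq/dx = 1
df_dy = df_dq * 1.0  # dq/dy = 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0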
17. [2 points] In which of the following neural net architectures do some of the
weights get reused more than once in each single forward pass?
18. [2 points] In AdaBoost, we choose αt as the weight of the t-th weak learner, where
αt = (1/2) ln((1 − εt)/εt),
and εt = Px∼Dt [ht(x) ≠ y] is the weighted fraction of examples misclassified by
the t-th weak learner. Here our weak learners are depth-1 decision trees that
yield vertical or horizontal half-plane decision boundaries. If we conduct two
iterations of boosting on the following dataset, which is larger, α1 or α2?
[Figure: a two-dimensional training set of points labeled + and −; the plot is not legibly reproduced here.]
(a) α1 > α2
(b) α2 > α1
(c) α1 = α2
(d) Not enough information
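For reference, αt = (1/2) ln((1 − εt)/εt) decreases as the weighted error εt grows; a small sketch with made-up error values:

import math

def adaboost_alpha(eps):
    # Weight of a weak learner with weighted error eps (0 < eps < 1)
    return 0.5 * math.log((1 - eps) / eps)

# Hypothetical errors, only to show that a smaller weighted error yields a larger alpha.
for eps in (0.1, 0.25, 0.4):
    print(eps, round(adaboost_alpha(eps), 3))  # 1.099, 0.549, 0.203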
19. [2 points] Suppose you have a classification problem where you want to pe-
nalize misclassifications more the farther they are from the decision bound-
ary. How many of the following loss functions would be appropriate?
• 0 − 1 loss
• hinge loss
• logistic loss
• exponential loss
(a) none
(b) 1
(c) 2
(d) 3
(e) all 4
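A small sketch comparing how each of the losses listed above behaves as a misclassified point moves farther from the boundary (i.e. as the margin m = y · f(x) becomes more negative):

import math

def zero_one(m):    return 0.0 if m > 0 else 1.0
def hinge(m):       return max(0.0, 1.0 - m)
def logistic(m):    return math.log(1.0 + math.exp(-m))
def exponential(m): return math.exp(-m)

# Increasingly bad misclassifications: only some of these losses keep growing with distance.
for m in (-0.5, -2.0, -5.0):
    print(m, zero_one(m), hinge(m), round(logistic(m), 2), round(exponential(m), 2))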
(a) 0 − 1 loss
(b) hinge loss
(c) exponential loss
(d) (a) and (b)
(e) (a), (b) and (c)
22. [1 points] True or False? Removal of a support vector will always change
the SVM decision boundary.
23. [1 points] True or False? Radial Basis Functions (RBFs) use a Gaussian
kernel to transform a p-dimensional feature space (x) to a (k-dimensional)
transformed feature space where k ≤ p.
24. [1 points] True or False? Principal Component Regression (PCR) uses the
right singular vectors of the feature matrix X to transform a p-dimensional
feature space (x) to a (k-dimensional) transformed feature space where k ≤
p.
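A minimal numpy sketch of the transformation described here, projecting centered data onto its top-k right singular vectors (the toy data and k below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # toy data: 100 examples, p = 5 features
X = X - X.mean(axis=0)          # center the features

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt are right singular vectors of X

k = 2                           # keep k <= p components
Z = X @ Vt[:k].T                # 100 x k transformed feature matrix
print(Z.shape)                  # (100, 2)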
26. [1 points] True or False? Least Mean Squares (LMS) is an online approx-
imation to linear regression and perceptrons are online approximations to
SVMs.
27. [2 points] After performing SVD on a dataset with 5 features, you retrieve
eigenvalues 6, 5, 4, 3, 2. How many components should we include to explain
at least 75% of the variance of the dataset?
(a) 1
(b) 2
(c) 3
(d) 4
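A small sketch of the cumulative variance-explained computation, treating the listed values directly as per-component variances:

import numpy as np

eigenvalues = np.array([6.0, 5.0, 4.0, 3.0, 2.0])
ratio = np.cumsum(eigenvalues) / eigenvalues.sum()
print(ratio)  # [0.3  0.55 0.75 0.9  1.0]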
28. [1 points] True or False? After performing SVD on a dataset, you notice
the eigenvalues returned are all approximately equal. You expect the variance
explained to be approximately linear in the number of components used for
PCA.
30. [1 points] True or False? Changing one feature of a dataset with multiple
features from centimeters to inches will not affect the outcome of PCA.
For the next 4 questions, refer to the following points in the plane. P and Q
are the initial cluster centers; the remaining labeled points are the data.
Data points: A(0, 0), B(−1, 2), C(−2, 1), D(−1, −2), E(3, 3), F(1, 1)
Initial cluster centers: P(−3, 0), Q(2, 2)
31. [2 points] In the first step of K-means with the standard Euclidean distance
metric, which points will be assigned to the cluster centered at P?
(a) C, D
(b) A, C, D
(c) B, C, D
(d) A, B, C
(e) A, B, C, D
32. [2 points] Continue running K-means with the standard Euclidean distance
metric. What does the cluster center P get updated to? (Do not include P
as a point).
(a) (−2/3, 1/2)
(b) (−4/3, 1/3)
(c) (−4/3, 0)
(d) (−1, 1/3)
(e) (−2/3, 1/3)
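A small sketch of one assignment step and one mean update for the points and centers listed above, using squared Euclidean distance:

points  = {"A": (0, 0), "B": (-1, 2), "C": (-2, 1), "D": (-1, -2),
           "E": (3, 3), "F": (1, 1)}
centers = {"P": (-3, 0), "Q": (2, 2)}

def sq_euclid(u, v):
    return (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2

# Assignment step: each point goes to its nearest center.
assign = {name: min(centers, key=lambda c: sq_euclid(p, centers[c]))
          for name, p in points.items()}
print(assign)

# Update step: P moves to the mean of its assigned points (P itself is not a data point).
cluster_P = [points[n] for n, c in assign.items() if c == "P"]
print(sum(x for x, _ in cluster_P) / len(cluster_P),
      sum(y for _, y in cluster_P) / len(cluster_P))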
While K-means used Euclidean distance in class, we can extend it to other
distance functions, where the assignment and update phases still iteratively
minimize the total (non-Euclidean) distance. Here, consider the Manhattan
distance d′(A, B) = |xA − xB| + |yA − yB|.
Again starting from the original locations for P and Q listed above, perform
the assignment step and the cluster-center update step using the Manhattan
distance as the distance function:
33. [2 points] Starting from the same initial configuration, select all points that
get assigned to the cluster with center at P, under this new distance function
d′(A, B).
(a) C, D
(b) A, C, D
(c) B, C, D
(d) A, B, C
(e) A, B, C, D
34. [2 points] What does cluster center P now get updated to, under this new
distance function d′(A, B)? (Do not include P as a point).
(a) (−2/3, 1/2)
(b) (−4/3, 1/3)
(c) (−4/3, 0)
(d) (−1, −1/3)
(e) (−2/3, 1/3)
The exam as given did not have the correct solution option; it is now (d).
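The same sketch with the Manhattan distance d′ swapped in for the assignment step; as the corrected option (d) assumes, the center update here is still the coordinate-wise mean of the assigned points:

points  = {"A": (0, 0), "B": (-1, 2), "C": (-2, 1), "D": (-1, -2),
           "E": (3, 3), "F": (1, 1)}
centers = {"P": (-3, 0), "Q": (2, 2)}

def manhattan(u, v):
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

# Assignment step under d'(A, B): nearest center by Manhattan distance.
assign = {name: min(centers, key=lambda c: manhattan(p, centers[c]))
          for name, p in points.items()}
print(assign)

# Update step: coordinate-wise mean of the points now assigned to P (P itself excluded).
cluster_P = [points[n] for n, c in assign.items() if c == "P"]
print(sum(x for x, _ in cluster_P) / len(cluster_P),
      sum(y for _, y in cluster_P) / len(cluster_P))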
36. [1 points] True or False? An AUC (Area under the ROC curve) of 0.4 on
test data suggests overfitting.
37. [1 points] True or False? LIME (Local Interpretable Model-Agnostic Ex-
planations) fits a linear model to observations close to a point of interest and
determines which features in that linear model are most influential in mak-
ing the prediction. (The use of "observations" here is ambiguous; LIME fits
points that are created as perturbations of the original point.)
39. [1 points] True or False? When running linear regressions, it is a good idea
to look at the largest (in absolute value) regression weights to see which
features are most influential in determining the predictions.
40. [2 points] For Naive Bayes, what happens to our document posteriors as we
increase our pseudo-count parameter?
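A generic sketch of additive (pseudo-count) smoothing with hypothetical counts; as the pseudo-count grows, the smoothed estimates are pulled toward the uniform distribution:

def smoothed_prob(count, total, num_values, alpha):
    # Additive smoothing: (count + alpha) / (total + alpha * num_values)
    return (count + alpha) / (total + alpha * num_values)

# Hypothetical counts: an event seen 40 times in 60 trials, with 3 possible outcomes.
for alpha in (0, 1, 10, 1000):
    print(alpha, round(smoothed_prob(40, 60, 3, alpha), 4))  # 0.6667, 0.6508, 0.5556, 0.3399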
41. [2 points] We are trying to use Naive Bayes to classify a Facebook meme as
either funny or sad. Suppose out of 100 training memes, we see that 60 of
these memes are funny while 40 are sad. Also, assume that our dictionary
consists of only three words, and the counts of the words for funny and sad
memes are listed below.
If we see a meme post with one occurrence of the word friends and one
occurrence of the word cry (with no other words), which class has higher
posterior probability?
(a) Funny
(b) Sad
There are 100 memes, 60 funny and 40 sad (as stated in the question text).
Of the 60 funny memes, 40 contain "yum", 40 contain "friends", and 20
contain "cry". (The memes have more than one word in them, so a meme
like "froyo with my friends, yum!" has both "friends" and "yum" in it.)
When we compute p(friends | funny), that is short for the probability that a
meme with class label "funny" contains the word "friends", which is 40/60,
not 40/(40 + 40 + 20).
42. [1 points] True or False? Because LDA has “hidden variables” representing
the mixture of topics within each document and the topic that each word in
each document comes from, it is often solved using the EM algorithm.
43. [1 points] True or False? EM algorithms are attractive because for problems
such as estimating Gaussian Mixture Models, they are guaranteed to find a
global optimum in likelihood.
44. [1 points] True or False? Power methods for estimating eigenvectors are
attractive because they are guaranteed to find a global optimum in recon-
struction error when used in PCA.
45. [2 points] Which of the following statements about Hidden Markov Models
(HMMs) is not true?
46. [1 points] True or False? The EM algorithm does a kind of “gradient de-
scent” in likelihood, since both steps are guaranteed to decrease the negative
log-likelihood.
The following 5 questions are related to this graph:
[Figure: a Bayesian network over the nodes A, B, C, D, E, F, and G; the edge structure is not reproduced here.]
47. [1 points] True or False? The joint probability of this graph can be repre-
sented as:
P (A | B)P (B)P (C)P (D | A, E)P (E | B, C)P (F | D)P (G | C, E, D)
48. [1 points] True or False? The class of joint probability distributions that
can be represented by the resulting Bayesian network:
P (A | B)P (B)P (C)P (D | A, E)P (E | B, C)P (F | A, B, C, D, E)P (G | A, B, C, E, D)
is smaller than the original network shown above.
52. [1 points] True or False? Although speech-to-text and text-to-speech are
usually modeled using LSTMs or other RNNs, one could also use CNNs
with one-dimensional filters.
53. [1 points] True or False? CNNs work well on high dimensional problems
like medical diagnosis from health records (which contain varied features
like age, weight, temperature, lab results, disease history, etc.)
54. [1 points] True or False? Vanilla RNNs, unlike HMMs, do not forget things
exponentially quickly.
55. [1 points] True or False? Q-learning is guaranteed to converge (for discrete
states and actions) so long as all (state, action) pairs are visited infinitely
often. It is, but only given specific constraints on the learning rate.
56. [1 points] True or False? In Q-learning, Q(s, a) represents the expected
discounted reward of taking action a in state s and subsequently following
an optimal policy.
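For reference, a minimal sketch of the tabular Q-learning update behind this definition (the learning rate, discount factor, and transition below are placeholder values):

from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)], initialized to 0
alpha, gamma = 0.1, 0.9     # placeholder learning rate and discount factor

def q_update(s, a, r, s_next, actions):
    # One Q-learning step after observing (state, action, reward, next state)
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical transition, only to exercise the update rule.
q_update(s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
print(Q[(0, "right")])  # 0.1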
57. [1 points] True or False? Epsilon-greedy Reinforcement Learning methods
"exploit" by using an optimal policy a (small) fraction of the time, given by ε,
and "explore" a large fraction (1 − ε) of the time.
58. [1 points] True or False? Current RL methods for game play, such as
AlphaZero, unlike earlier methods that trained a "new" Q-function by playing
against an older one, now just play the “new” network against itself.
59. [1 points] True or False? Value Iteration iteratively updates V using Bell-
man’s equation, and is guaranteed to converge to the unique optimum repre-
sented by the solution to Bellman’s equation (if all states are visited infinite
numbers of times).
60. [1 points] True or False? Autoencoders always take an input and pass it
through an “encoder” which produces a lower dimensional representation
which is then passed through a “decoder” to reconstruct the input as accu-
rately as possible.
61. [1 points] True or False? When picking which additional points, x, to label
for linear regression, it is desirable to pick points that are “as spread out as
possible” (i.e. as far away as possible from the existing points)
62. [1 points] True or False? When picking which additional points, x, to label
for SVMs, it is desirable to pick points that are ”as spread out as possible”
(i.e. far away from the existing points)
63. [1 points] True or False? The most widely used experimental design methods
pick new points to label such that they maximize a norm ||X^T X||_p for some
p. (Too complex; everyone was given credit. These methods minimize the norm
of the inverse of that matrix, which is often, but not always, the same as
maximizing the norm of that matrix.)
64. [1 points] True or False? When doing active learning for SVMs, labeling
the x’s for which one is “most uncertain” will tend to select points that are
closer to the separating hyperplane.
65. [1 points] True or False? The “Query by Committee” active learning method
makes more sense to use with linear regression than with random forests.
66. [1 points] If a standard CNN model has been trained to distinguish images
of men from women in a setting where the training data has 75% women,
then predictions on a test set of images drawn from the same distribution
are more likely to have
67. [2 points] When training a machine learning model on a data set which is
not representative of the population of interest (e.g. when using Twitter
users to represent the general population) it is best to:
68. [1 points] True or False? Least Mean Squares does stochastic gradient
descent in a negative log-likelihood.