Machine Learning Exam Prep
Instructions:
• Fill in your name and Andrew ID above. Be sure to write neatly, or you may not
receive credit for your exam.
• Clearly mark your answers in the allocated space on the front of each page. If
needed, use the back of a page for scratch space, but you will not get credit for anything
written on the back of a page. If you have made a mistake, cross out the invalid parts
of your solution, and circle the ones which should be graded.
• No electronic devices may be used during the exam.
• Please write all answers in pen.
• You have N/A to complete the exam. Good luck!
For "Select One" questions, please fill in the appropriate bubble completely:
Select One: Who taught this course?
● Henry Chai
⃝ Marie Curie
⃝ Noam Chomsky
If you need to change your answer, you may cross out the previous answer and bubble in
the new answer:
Select One: Who taught this course?
● Henry Chai (crossed out)
⃝ Marie Curie
● Noam Chomsky
For “Select all that apply” questions, please fill in all appropriate squares completely:
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
□ I don’t know
Again, if you need to change your answer, you may cross out the previous answer(s) and
bubble in the new answer(s):
Select all that apply: Which are scientists?
■ Stephen Hawking
■ Albert Einstein
■ Isaac Newton
■ I don't know (crossed out)
For questions where you must fill in a blank, please make sure your final answer is fully
included in the given space. You may cross out answers or parts of answers, but the final
answer must still be within the given space.
Fill in the blank: What is the course number?
10-S7601 (crossed out)   10-601
[Figure: HMM graphical model with hidden states Y1 → Y2 → Y3 and emissions X1, X2, X3]
(b) Write out the factorized joint distribution of P(X, Y) using the independencies/conditional
independencies assumed by the HMM graph, using terms Y1, Y2, Y3 and
X1, X2, X3.
P(X, Y) =
P(X, Y) = P(Y1) P(Y2 | Y1) P(Y3 | Y2) ∏_{t=1}^{3} P(Xt | Yt)
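As a concrete illustration of this factorization, here is a minimal sketch that evaluates P(X, Y) for a given state/observation sequence; the example parameters are borrowed from problem 2 below (states and symbols indexed 0-2), so treat them as illustrative numbers rather than part of this question.

```python
import numpy as np

pi = np.array([1.0, 0.0, 0.0])                # P(Y1)
trans = np.array([[0.5, 0.25, 0.25],          # P(Y_t | Y_{t-1})
                  [0.0, 0.5,  0.5 ],
                  [0.0, 0.0,  1.0 ]])
emit = np.array([[0.5, 0.5, 0.0],             # P(X_t | Y_t)
                 [0.5, 0.0, 0.5],
                 [0.0, 0.5, 0.5]])

def hmm_joint(y, x):
    """P(X, Y) = P(Y1) * prod_t P(Y_t | Y_{t-1}) * prod_t P(X_t | Y_t)."""
    p = pi[y[0]]
    for t in range(1, len(y)):
        p *= trans[y[t - 1], y[t]]
    for t in range(len(y)):
        p *= emit[y[t], x[t]]
    return p

print(hmm_joint(y=[0, 1, 2], x=[0, 2, 1]))    # e.g. Y = (S1, S2, S3), X = (A, C, B)
```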
(c) True or False: In general, we should not include unobserved variables in a graphical
model because we cannot learn anything useful about them without observations.
True False
False. Unobserved (latent) variables can still be useful: their distributions can be learned from the observed variables, as with the hidden states of an HMM or the components of a mixture model.
2. Consider an HMM with states Yt ∈ {S1, S2, S3}, observations Xt ∈ {A, B, C}, and parameters
π = [1, 0, 0],
transition matrix
B = [[1/2, 1/4, 1/4],
     [ 0,  1/2, 1/2],
     [ 0,   0,   1 ]],
and emission matrix
A = [[1/2, 1/2,  0 ],
     [1/2,  0,  1/2],
     [ 0,  1/2, 1/2]].
(a) What is P (Y5 = S3 )?
P(Y5 = S3) = 1 − P(Y5 = S1) − P(Y5 = S2) = 1 − 1/16 − 4 × (1/32) = 13/16
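A quick numeric check of this answer, propagating the initial distribution through four transitions (π and B as given above):

```python
import numpy as np

pi = np.array([1.0, 0.0, 0.0])
B = np.array([[0.5, 0.25, 0.25],
              [0.0, 0.5,  0.5 ],
              [0.0, 0.0,  1.0 ]])

p_y5 = pi @ np.linalg.matrix_power(B, 4)   # distribution of Y5
print(p_y5)                                # [0.0625 0.125  0.8125]; P(Y5 = S3) = 13/16
```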
The HMM has k states (s1 , ..., sk ). sk is the terminal state. All states have the same
emission probabilities (shown in the figure). The HMM always starts at s1 as shown, and
can either move to the state with the next higher index or stay in the current state. Transition
probabilities for all states except sk are also the same as shown. More formally:
1. P (Yi = St | Yi−1 = St−1 ) = 0.4
2. P (Yi = St | Yi−1 = St ) = 0.6
3. P (Yi = St | Yi−1 = Sj ) = 0 for all j ∈ [k] \ {t, t − 1}
Once a run reaches sk it outputs a symbol based on the sk state emission probability
and terminates.
1. Assume we observed the output AABAABBA from the HMM. Select all answers
below that COULD be correct.
□ A: k > 8
□ B: k < 8
□ C: k > 6
□ D: k < 6
□ E: k = 7
BCDE. k cannot be more than 8: the run starts at s1 and advances at most one state per
emission, so reaching sk with k > 8 would require more than 8 symbols in the output.
2. Now assume that k = 4. Let P('AABA') be the probability of observing AABA
from a full run of the HMM. For the following equations, fill in the box with >, <, =
or ? (? implies it is impossible to tell).
2 Bayesian Networks
1. Consider the following Bayesian network.
(a) Determine whether the following conditional independencies are true.
[Figure: Bayesian network with edges X1 → X3, X2 → X3, X2 → X4, X3 → X5]
X1 ⊥ X2 | X3?
Circle one: Yes No
No. X3 is a common child (collider) of X1 and X2, so conditioning on it makes them dependent.
X1 ⊥ X4?
Circle one: Yes No
Yes. The only path X1 → X3 ← X2 → X4 is blocked at the unobserved collider X3.
X5 ⊥ X2 | X3?
Circle one: Yes No
Yes. Conditioning on X3 blocks the chain X2 → X3 → X5.
(b) Write out the joint probability in a form that utilizes as many independence/conditional
independence assumptions contained in the graph as possible. Answer:
P(X1, X2, X3, X4, X5) =
P(X1, X2, X3, X4, X5) = P(X1)P(X2)P(X3 | X1, X2)P(X4 | X2)P(X5 | X3)
2. Consider the Bayesian network shown below for the following questions (a)-(f). Assume
all variables are boolean-valued.
[Figure: Bayesian network with edges A → C, B → C, B → D, C → E]
(a) (Short answer) Write down the factorization of the joint probability P (A, B, C, D, E)
for the above graphical model, as a product of the five distributions associated with
the five variables.
P(A, B, C, D, E) = P(A)P(B)P(C | A, B)P(D | B)P(E | C)
(b) True or False: Is C conditionally independent of D given B (i.e. is (C ⊥ D) | B)?
True. Given B, the only path between C and D (C ← B → D) is blocked.
(c) True or False: Is A conditionally independent of D given C (i.e. is (A ⊥ D) | C)?
False. Conditioning on the collider C activates the path A → C ← B → D, so A and D become dependent.
(d) True or False: Is A independent of B (i.e. is A ⊥ B)? True: the only path A → C ← B is blocked at the unobserved collider C.
(e) Write an expression for P (C = 1|A = 1, B = 0, D = 1, E = 0) in terms of the
parameters of Conditional Probability Distributions associated with this graphical
model.
P(C = 1 | A = 1, B = 0, D = 1, E = 0) =
P(A = 1, B = 0, C = 1, D = 1, E = 0) / Σ_{c=0}^{1} P(A = 1, B = 0, C = c, D = 1, E = 0)
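A small sketch of evaluating that expression by enumerating over C. The exam does not give CPT numbers, so the values below are made-up placeholders; only the factorization and the normalization over c come from the problem.

```python
p_A1 = 0.6                                                     # P(A = 1), hypothetical
p_B1 = 0.3                                                     # P(B = 1), hypothetical
p_C1 = {(1, 1): 0.9, (1, 0): 0.7, (0, 1): 0.5, (0, 0): 0.1}    # P(C = 1 | A, B), hypothetical
p_D1 = {1: 0.8, 0: 0.2}                                        # P(D = 1 | B), hypothetical
p_E1 = {1: 0.75, 0: 0.25}                                      # P(E = 1 | C), hypothetical

def bern(p, x):
    """Probability of binary outcome x under Bernoulli parameter p."""
    return p if x == 1 else 1 - p

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B) P(E|C)."""
    return (bern(p_A1, a) * bern(p_B1, b) * bern(p_C1[(a, b)], c)
            * bern(p_D1[b], d) * bern(p_E1[c], e))

num = joint(1, 0, 1, 1, 0)
den = joint(1, 0, 0, 1, 0) + joint(1, 0, 1, 1, 0)
print(num / den)    # P(C = 1 | A = 1, B = 0, D = 1, E = 0)
```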
3 Reinforcement Learning
3.1 Markov Decision Process
Environment Setup (may contain spoilers for Shrek 1)
Lord Farquaad is hoping to evict all fairytale creatures from his kingdom of Duloc, and
has one final ogre to evict: Shrek. Unfortunately all his previous attempts to catch the
crafty ogre have fallen short, and he turns to you, with your knowledge of Markov Decision
Processes (MDPs), to help him catch Shrek once and for all.
Consider the following MDP environment where the agent is Lord Farquaad:
1. What are |S| and |A| (size of state space and size of action space)?
p(s′ |s, a) assumes that s′ is determined only by s and a (and not any other previous
states or actions).
3. What are the following transition probabilities?
p((1, 1, N )|(1, 1, N ), M ) =
p((1, 1, N )|(1, 1, E), L) =
p((2, 1, S)|(1, 1, S), M ) =
p((2, 1, E)|(1, 1, S), M ) =
p((1, 1, N )|(1, 1, N ), M ) = 1
p((1, 1, N )|(1, 1, E), L) = 1
p((2, 1, S)|(1, 1, S), M ) = 1
p((2, 1, E)|(1, 1, S), M ) = 0
4. Given a start position of (1, 1, E) and a discount factor of γ = 0.5, what is the expected
discounted future reward from a = R? For a = L? (Fix γ = 0.5 for following problems).
For a = R we get R_R = 5 × (1/2)^16 (it takes 17 moves for Farquaad to get to Shrek,
starting with R, M, M, M, L, ...).
For a = L, this is a bad move, and we need another move to get back to our original
orientation, from which we can go with our optimal policy. So the reward here is:
R_L = (1/2)^2 × R_R = 5 × (1/2)^18
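A quick numeric check of these two quantities (a sketch; the 17-move count comes from the problem's grid, which is not reproduced here):

```python
gamma = 0.5

R_R = 5 * gamma ** 16   # start with R: the +5 reward arrives after 16 discount factors
R_L = 5 * gamma ** 18   # start with L: two wasted turning moves before the optimal path
print(R_R, R_L)
```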
5. What is the optimal action from each state, given that orientation is fixed at E? (if
there are multiple options, choose any)
R R M R
R R L R
M R L R
M M L -
(Some states have multiple optimal actions; this table shows one possible choice.)
6. Farquaad’s chief strategist (Vector from Despicable Me) suggests that having γ = 0.9
will result in a different set of optimal policies. Is he right? Why or why not?
Vector is wrong. While the reward quantity will be different (it is now 5 × (9/10)^16), the set
of optimal policies does not change. (One can only assume that Lord Farquaad and
Vector would be in cahoots: both are extremely nefarious!)
7. Vector then suggests the following setup: R(s, a) = 0 when moving into the swamp, and
R(s, a) = −1 otherwise. Will this result in a different set of optimal policies? Why or
why not?
It will not. While the reward quantity will be different, the set of optimal policies
does not change. (Farquaad will still try to minimize the number of steps he takes in
order to reach Shrek)
8. Vector now suggests the following setup: R(s, a) = 5 when moving into the swamp, and
R(s, a) = 0 otherwise, but with γ = 1. Could this result in a different optimal policy?
Why or why not?
This will change the policy, but not in Lord Farquaad’s favor. He will no longer be
incentivized to reach Shrek quickly (since γ = 1). The optimal reward from each state
is the same (5) and therefore each action from each state is also optimal. Vector really
should have taken 10-301/601...
9. Surprise! Elsa from Frozen suddenly shows up. Vector hypnotizes her and forces her to
use her powers to turn the ground into ice. The environment is now stochastic: since
the ground is now slippery, when choosing the action M , with a 0.2 chance, Farquaad
will slip and move two squares instead of one. What is the expected future-discounted
rewards from s = (2, 4, S)?
3. In the image below is a representation of the game that you are about to play. There
are 5 states: A, B, C, D, and the goal state. The goal state, when reached, gives 100
points as reward (that is, you can assume R(D, right) = 140). In addition to the goal’s
points, you also get points by moving to different states. The number of points you get
is shown next to the arrows. You start at state B. To figure out the best policy, you
use asynchronous value iteration with a decay (γ) of 0.9. You should initialize the value
of each state to 0.
(i) When you first start playing the game, what action would you take (up, down, left,
right) at state B?
Up
(ii) What is the total reward at state B at this time?
C
(iv) What is the total reward at state B at this time?
182.1 (30 from the immediate action, and 43 × 0.9 + (100 + 40) × 0.9² = 152.1 from
the future reward (the value at state C))
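A one-line arithmetic check of that figure (a sketch; the 30 and 43 come from the game's arrows, which are shown only in the missing figure):

```python
gamma = 0.9
print(30 + 43 * gamma + (100 + 40) * gamma ** 2)   # 182.1
```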
In value iteration, what is the relationship between Vk+1(s) and Σ_{s'} P(s'|s, a)[R(s, a, s') +
γVk(s')], for any a ∈ A? Indicate the most restrictive relationship that applies. For
example, if x < y always holds, use < instead of ≤. Selecting ? means it's not possible
to assign any true relationship. Assume R(s, a, s') ≥ 0 for all s, s' ∈ S and a ∈ A.
Vk+1(s) □ Σ_{s'} P(s'|s, a)[R(s, a, s') + γVk(s')]
⃝ =
⃝ <
⃝ >
⃝ ≤
⃝ ≥
⃝ ?
E (≥). Vk+1(s) = max_a Σ_{s'} P(s'|s, a)[R(s, a, s') + γVk(s')], which is at least the bracketed sum for any particular a, with equality when a is a maximizing action.
3.3 Q-Learning
1. For the following true/false, circle one answer and provide a one-sentence explanation:
(i) One advantage that Q-learning has over Value and Policy iteration is that it can
account for non-deterministic policies.
Circle one: True False
False. All three methods can account for non-deterministic policies
(ii) You can apply Value or Policy iteration to any problem that Q-learning can be
applied to.
Circle one: True False
False. Unlike the others, Q-learning doesn’t need to know the transition proba-
bilities (p(s’ | s, a)), or the reward function (r(s,a)) to train. This is its biggest
advantage.
(iii) Q-learning is guaranteed to converge to the true value Q* for a greedy policy.
Circle one: True False
False. Q-learning converges only if every state-action pair is visited infinitely often. Thus,
purely exploiting (greedy) policies will not necessarily converge to Q*, but may instead
converge to a suboptimal estimate.
2. For the following parts of this problem, recall that the update rule for Q-learning is:
w ← w − α [ q(s, a; w) − (r + γ max_{a'} q(s', a'; w)) ] ∇_w q(s, a; w)
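As a concrete reference, here is a minimal sketch of this update for a linear function approximator q(s, a; w) = wᵀφ(s, a); the feature map φ and the action set are hypothetical stand-ins, not part of the exam problem.

```python
def q(w, phi, s, a):
    """Linear action-value estimate q(s, a; w) = w . phi(s, a)."""
    return w @ phi(s, a)

def q_learning_update(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One step of the update above: w <- w - alpha * (q - target) * grad_w q."""
    target = r + gamma * max(q(w, phi, s_next, a2) for a2 in actions)
    td_error = q(w, phi, s, a) - target
    return w - alpha * td_error * phi(s, a)   # grad_w q is phi(s, a) for a linear q
```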
(i) From the update rule, let's look at the specific term X = (r + γ max_{a'} q(s', a'; w)).
Describe in English the role of X in the weight update.
X is an estimate of the true total return Q*(s, a), i.e. the target that q(s, a; w) is pushed
toward. (This may get multiple acceptable answers, so grade accordingly.)
⃝ Q(A, South)
⃝ Q(B, North)
⃝ Q(B, South)
⃝ None of the above
A
5. In general, for Q-Learning (standard/tabular Q-learning, not approximate Q-learning)
to converge to the optimal Q-values, which of the following are true?
True or False: It is necessary that every state-action pair is visited infinitely often.
⃝ True
⃝ False
True or False: It is necessary that the discount γ is less than 0.5.
⃝ True
⃝ False
True or False: It is necessary that actions get chosen according to arg maxa Q(s, a).
⃝ True
⃝ False
(1) True: in order to ensure convergence in general for Q-learning, this has to be true.
In practice, we generally care about the policy, which converges well before the values
do, so it is not necessary to run it infinitely often. (2) False: the discount factor only
needs to lie in [0, 1); there is nothing special about 0.5. (3) False: always choosing
arg maxa Q(s, a) would actually do rather poorly, because it purely exploits the Q-values
learned thus far and never explores other states to try to find a better policy.
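To connect (1) and (3): in practice the exploration requirement is usually met with an ε-greedy behavior policy rather than pure arg max. A minimal sketch (the Q-table and action set here are hypothetical):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon take a random action; otherwise exploit current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```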
6. Consider training a robot to navigate the following grid-based MDP environment.
• The only action from states A and E is Exit, which leads deterministically to the
terminal state
The reward function is as follows:
• R(A, Exit, T ) = 10
• R(E, Exit, T ) = 1
• The reward for any other tuple (s, a, s′ ) equals -1
Assume the discount factor is 1. When taking action Left, with probability 0.8, the robot
will successfully move one space to the left, and with probability 0.2, the robot will move
one space in the opposite direction. When taking action Right, with probability 0.8, the
robot will successfully move one space to the right, and with probability 0.2, the robot
will move one space in the opposite direction. Run synchronous value iteration on this
environment for two iterations. Begin by initializing the value of all states to zero.
Write the value of each state after the first (k = 1) and the second (k = 2) iterations.
Write your values as a comma-separated list of 6 numerical expressions in the alpha-
betical order of the states, specifically V (A), V (B), V (C), V (D), V (E), V (T ). Each of
the six entries may be a number or an expression that evaluates to a number. Do not
include any max operations in your response.
V1 (A), V1 (B), V1 (C), V1 (D), V1 (E), V1 (T ) (Values for 6 states):
What is the resulting policy after this second iteration? Write your answer as a comma-
separated list of three actions representing the policy for states, B, C, and D, in that
order. Actions may be Left or Right.
π(B), π(C), π(D) based on V2 :
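The blanks above can be filled by running the two backups directly. A minimal sketch under the stated dynamics (0.8 intended move, 0.2 opposite, step reward -1, exit rewards 10 and 1, γ = 1):

```python
# Corridor states A-B-C-D-E plus terminal T; A and E only have the Exit action.
states = ["A", "B", "C", "D", "E", "T"]
left_of = {"B": "A", "C": "B", "D": "C"}
right_of = {"B": "C", "C": "D", "D": "E"}
gamma = 1.0

def backup(V):
    new_V = {"T": 0.0, "A": 10.0 + gamma * V["T"], "E": 1.0 + gamma * V["T"]}
    for s in ("B", "C", "D"):
        q_left = 0.8 * (-1 + gamma * V[left_of[s]]) + 0.2 * (-1 + gamma * V[right_of[s]])
        q_right = 0.8 * (-1 + gamma * V[right_of[s]]) + 0.2 * (-1 + gamma * V[left_of[s]])
        new_V[s] = max(q_left, q_right)
    return new_V

V1 = backup({s: 0.0 for s in states})   # k = 1
V2 = backup(V1)                         # k = 2
print(V1)   # A=10, B=-1, C=-1, D=-1, E=1, T=0
print(V2)   # A=10, B=6.8, C=-2, D=-0.4, E=1, T=0
```

The greedy policy asked for in the second blank can then be read off by comparing q_left and q_right at B, C, and D using V2.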
4 PCA
(ii) Now consider the following two plots, where we have drawn only the principal
components. Draw the data ellipse or place data points that could yield the given
principal components for each plot. Note that for the right hand plot, the principal
components are of equal magnitude.
[Figure: two blank plots showing only the given principal components]
Solution:
[Figure: data ellipses/points consistent with the given principal components]
(ii) X̂X̂ᵀ = ZᵀZ.
(iii) The goal of PCA is to interpret the underlying structure of the data in terms of
the principal components that are best at predicting the output variable.
(iv) The output of PCA is a new representation of the data that is always of lower
dimensionality than the original feature representation.
5 K-Means
1. For True or False questions, circle your answer and justify it; for QA questions, write
down your answer.
(i) For a particular dataset and a particular k, k-means always produces the same result
if the initialized centers are the same. Assume there is no tie when assigning the
clusters.
⃝ True
⃝ False
Justify your answer:
True. With the same initial centers, every iteration computes exactly the same distances
and assignments, so the result is the same.
(ii) k-means can always converge to the global optimum.
⃝ True
⃝ False
Justify your answer:
False. k-means only converges to a local optimum of its objective, which depends on the
initialization; it is also quite sensitive to outliers, since each cluster center is the mean of
the data points assigned to that cluster.
(iv) k in k-nearest neighbors and k-means have the same meaning.
⃝ True
⃝ False
Justify your answer:
False. In k-NN, k is the number of data points we look at when classifying
a data point. In k-means, k is the number of clusters.
(v) What’s the biggest difference between k-nearest neighbors and k-means?
2. In k-means, random initialization could possibly lead to a local optimum with very bad
performance. To alleviate this issue, instead of initializing all of the centers completely
randomly, we decide to use a smarter initialization method. This leads us to k-means++.
The only difference between k-means and k-means++ is the initialization strategy, and
all of the other parts are the same. The basic idea of k-means++ is that instead of simply
choosing the centers to be random points, we sample the initial centers iteratively, each
time putting higher probability on points that are far from any existing center. Formally,
the algorithm proceeds as follows.
Given: data set x^(i), i = 1, . . . , N
Initialize:
μ^(1) ∼ Uniform({x^(i)}_{i=1}^{N})
For j = 2, . . . , k
  Compute the probability of selecting each point:
    p_i = min_{j'<j} ‖μ^(j') − x^(i)‖₂² / ( Σ_{i'=1}^{N} min_{j'<j} ‖μ^(j') − x^(i')‖₂² )
  Sample μ^(j) from {x^(i)}_{i=1}^{N} with probabilities {p_i}
Note: n is the number of data points, k is the number of clusters. For cluster 1’s center,
you just randomly choose one data point. For the following centers, every time you
initialize a new center, you will first compute the distance between a data point and
the center closest to this data point. After computing the distances for all data points,
perform a normalization and you will get the probability. Use this probability to sample
for a new center.
Now assume we have 5 data points (n = 5): (0, 0), (1, 2), (2, 3), (3, 1), (4, 1). The
number of clusters is 3 (k = 3). The center of cluster 1 is randomly chosen as (0, 0).
These data points are shown in the figure below.
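Part (i)'s probabilities can be reproduced with a short sketch, using the five points and the first center (0, 0) given above and squared Euclidean distances as in the formula:

```python
import numpy as np

X = np.array([[0, 0], [1, 2], [2, 3], [3, 1], [4, 1]], dtype=float)
centers = [np.array([0.0, 0.0])]                    # center of cluster 1

# Squared distance from each point to its nearest existing center, then normalize.
d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
p = d2 / d2.sum()
print(np.round(p, 3))   # [0.    0.111 0.289 0.222 0.378]
```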
(i) What is the probability of every data point being chosen as the center for cluster
2? (The answer should contain 5 probabilities, each for every data point)
(0, 0): 0
(1, 2): 0.111
(2, 3): 0.289
(3, 1): 0.222
(4, 1): 0.378
(ii) Which data point is most likely chosen as the center for cluster 2?
(4, 1), which has the largest probability (0.378).
(iii) Assume the center for cluster 2 is chosen to be the most likely one as you computed
in the previous question. Now what is the probability of every data point being
chosen as the center for cluster 3? (The answer should contain 5 probabilities, each
for every data point)
(0, 0): 0
(1, 2): 0.357
(2, 3): 0.571
(3, 1): 0.071
(4, 1): 0
(iv) Which data point is most likely chosen as the center for cluster 3?
(2, 3), which has the largest probability (0.571).
(v) Assume the center for cluster 3 is also chosen to be the most likely one as you
computed in the previous question. Now we finish the initialization for all 3 centers.
List the data points that are classified into cluster 1, 2, 3 respectively.
cluster 1: (0, 0)
cluster 2: (3, 1), (4, 1)
cluster 3: (1, 2), (2, 3)
(vi) Based on the above clustering result, what’s the new center for every cluster?
(vii) According to the results of (ii) and (iv), explain how k-means++ alleviates the
local optimum issue caused by initialization.
k-means++ tends to initialize new cluster centers with the data points that are far
away from the existing centers, to make sure all of the initial cluster centers stay
away from each other.
3. Consider a dataset with seven points {x1 , . . . , x7 }. Given below are the distances between
all pairs of points.
x1 x2 x3 x4 x5 x6 x7
x1 0 5 3 1 6 2 3
x2 5 0 4 6 1 7 8
x3 3 4 0 4 3 5 6
x4 1 6 4 0 7 1 2
x5 6 1 3 7 0 8 9
x6 2 7 5 1 8 0 1
x7 3 8 6 2 9 1 0
Assume that k = 2, and the cluster centers are initialized to x3 and x6 . Which of the
following shows the two clusters formed at the end of the first iteration of k-means?
Circle the correct option.
⃝ {x1 , x2 , x3 , x4 }, {x5 , x6 , x7 }
⃝ {x2 , x3 , x5 }, {x1 , x4 , x6 , x7 }
⃝ {x1 , x2 , x3 , x5 }, {x4 , x6 , x7 }
⃝ {x2 , x3 , x4 , x7 }, {x1 , x5 , x6 }
Solution: (b) {x2, x3, x5}, {x1, x4, x6, x7}.
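One way to sanity-check this is to assign each point to the nearer of the two initial centers directly from the distance table (a small sketch; x3 and x6 are row/column indices 2 and 5):

```python
import numpy as np

D = np.array([
    [0, 5, 3, 1, 6, 2, 3],
    [5, 0, 4, 6, 1, 7, 8],
    [3, 4, 0, 4, 3, 5, 6],
    [1, 6, 4, 0, 7, 1, 2],
    [6, 1, 3, 7, 0, 8, 9],
    [2, 7, 5, 1, 8, 0, 1],
    [3, 8, 6, 2, 9, 1, 0],
])
centers = [2, 5]                                         # x3 and x6
assign = np.argmin(D[:, centers], axis=1)                # nearest initial center per point
print([f"x{i+1}" for i in range(7) if assign[i] == 0])   # ['x2', 'x3', 'x5']
print([f"x{i+1}" for i in range(7) if assign[i] == 1])   # ['x1', 'x4', 'x6', 'x7']
```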
6 Ensemble Methods
6.1 AdaBoost
1. In the AdaBoost algorithm, if the final hypothesis makes no mistakes on the training
data, which of the following is correct?
Select all that apply:
□ Additional rounds of training can help reduce the errors made on unseen data.
False
True, follows from the update equation.
3. Last semester, someone used AdaBoost to train on some data and recorded all the
weights throughout the iterations, but some entries in the table are not recognizable. Clever
as you are, you decide to employ your knowledge of AdaBoost to determine some of
the missing information.
Below, you can see part of the table that was used in the problem set. There are columns
for the Round # and for the weights of the six training points (A, B, C, D, E, and F)
at the start of each round. Some of the entries, marked with "?", are impossible for
you to read.
In the following problems, you may assume that non-consecutive rows are independent
of each other, and that a classifier with error less than 1/2 was chosen at each step.
(a) The weak classifier chosen in Round 1 correctly classified training points A, B,
C, and E but misclassified training points D and F. What should the updated
weights have been in the following round, Round 2? Please complete the form
below.
1/8, 1/8, 1/8, 1/4, 1/8, 1/4
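A short sketch of that reweighting step, assuming the Round 1 weights were uniform (1/6 each), which is consistent with the answer above:

```python
import numpy as np

w = np.full(6, 1 / 6)                                          # weights for A, B, C, D, E, F
correct = np.array([True, True, True, False, True, False])     # D and F misclassified

eps = w[~correct].sum()                                        # weighted error = 1/3
alpha = 0.5 * np.log((1 - eps) / eps)                          # classifier weight
w = w * np.exp(np.where(correct, -alpha, alpha))               # shrink correct, grow incorrect
w /= w.sum()                                                   # renormalize
print(w)    # [0.125 0.125 0.125 0.25  0.125 0.25 ]
```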
(b) During Round 219, which of the training points (A, B, C, D, E, F) must have
been misclassified, in order to produce the updated weights shown at the start of
Round 220? List all the points that were misclassified. If none were misclassified,
write ‘None’. If it can’t be decided, write ‘Not Sure’ instead.
Not sure
(c) You observe that the weights in round 3017 or 8888 (or both) cannot possibly be
right. Which one is incorrect? Why? Please explain in one or two short sentences.
Round 3017 is incorrect.
4. What condition must a weak learner satisfy in order for boosting to work?
Short answer:
The weak learner must perform better than chance, i.e. achieve weighted training error strictly less than 1/2.
5. After an iteration of training, AdaBoost more heavily weights which data points to
train the next weak learner? (Provide an intuitive answer with no math symbols.)
Short answer:
The data points that are incorrectly classified by weak learners trained in previous
iterations are more heavily weighted.
6. Extra credit Do you think that a deep neural network is nothing but a case of
boosting? Why or why not? Impress us.
Answer:
Both viewpoints can be argued. One may view passing a linear combination through
a nonlinear function as a weak learner (e.g., logistic regression), and that the deep
neural network corrects for errors made by these weak learners in deeper layers. Then
again, every layer of the deep neural network is optimized in a global fashion (i.e.,
all weights are updated simultaneously) to improve performance, which could possibly
capture dependencies which boosting could not.
Almost all coherent answers should be accepted, with full points to those who strongly
argue their position with ML ideas.
highest weight in inference? If there are multiple trees, mention them all
DT1, DT2, DT3, DT4, DT5. Random forests take an unweighted sum of the individual
tree predictions, so every tree is weighted equally at inference.
(c) To reduce the error of each individual decision tree, Neural uses all the features
to train each tree. How would this impact the generalisation error of the random
forest?
The generalisation error would decrease as each tree has lower generalisa-
tion error
The generalisation error would increase as each tree has insufficient training
data
The generalisation error would increase as the trees are highly correlated
Solution: The generalisation error would increase as the trees are highly correlated.
7 Recommender Systems
1. Applied to the Netflix Prize problem, which of the following methods does NOT always
require side information about the users and the movies?
Select all that apply:
□ Neighborhood methods
□ Content filtering
□ Latent factor methods
□ Collaborative filtering
□ None of the above
ACD
2. Select all that apply:
□ Using matrix factorization, we can embed both users and items in the same
space
□ Using matrix factorization, we can embed either solely users or solely items in
the same space, as we cannot combine different types of data
□ In a rating matrix of users by books that we are trying to fill up, the best-
known solution is to fill the empty values with 0s and apply PCA, allowing the
dimensionality reduction to make up for this lack of data
□ Alternating minimization allows us to minimize over two variables
□ Alternating minimization avoids the issue of getting stuck in local minima
□ If the data is multidimensional, then overfitting is extremely rare
□ Nearest neighbor methods in recommender systems are restricted to using
Euclidean distance for their distance metric
□ None of the above
AD
Filling empty values with 0s is not ideal since we would be assuming data values that are
not necessarily true. Thus, we cannot simply apply PCA when there are missing values.
Alternating minimization can still get stuck at a local minimum.
Both Euclidean distance and cosine similarity are valid metrics.
3. Your friend Duncan wants to build a recommender system for his new website Dunc-
Tube, where users can like and dislike videos that are posted there. In order to build
his system using collaborative filtering, he decides to use Non-Negative Matrix Factor-
ization. What is an issue with Duncan’s approach, and what could he change about
the website or the algorithm in order to fix it?
Since Duncan’s website incorporates negative responses directly, NNMF can’t be used
to model these sorts of responses (since NNMF enforces that both the original and
the factored matrices are all non-negative). To fix this, Duncan would either have to
remove the dislike option from his website, OR use a different matrix factorization
algorithm like SVD.
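As an illustration of the suggested fix, here is a small sketch that factorizes a signed like/dislike matrix with SVD instead of NNMF; the matrix, the rank, and the choice to treat unrated entries as 0 are illustrative assumptions, not part of the problem.

```python
import numpy as np

# +1 = like, -1 = dislike, 0 = unrated (hypothetical 4 users x 5 videos on DuncTube).
R = np.array([[ 1, -1,  0,  1,  0],
              [ 0,  1, -1,  0,  1],
              [ 1,  0,  0, -1,  1],
              [-1,  1,  1,  0,  0]], dtype=float)

# SVD handles negative entries, unlike NNMF; keep the top-k factors as embeddings.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]             # 4 x k user embeddings
item_factors = Vt[:k, :].T                  # 5 x k video embeddings
R_hat = user_factors @ item_factors.T       # reconstructed/predicted preferences
```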
4. You and your friends want to build a movie recommendation system based on collabo-
rative filtering. There are three websites (A, B, and C) that you decide to extract user
ratings from. On website A, the rating scale is from 1 to 5. On website B, the rating
scale is from 1 to 10. On website C, the rating scale is from 1 to 100. Assume you will
have enough information to identify users and movies on one website with users and
movies on another website. Would you be able to build a recommendation system?
If so, briefly explain how you would do it.
Yes, we would be able to do it. First, normalize the rating scores to a common range
(e.g., re-scale each dataset's ratings to a 0-1 range). After that, combine the users' ratings
from the three websites by matching movies and users. With the combined ratings, we
could apply matrix factorization (or a neighborhood method) to predict the missing
ratings for users.
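A minimal sketch of the first step, putting the three sites' scales on a common 0-1 range before building the combined user-by-movie matrix (scale bounds taken from the problem statement):

```python
def rescale(rating, lo, hi):
    """Map a rating from [lo, hi] to [0, 1]."""
    return (rating - lo) / (hi - lo)

print(rescale(4, 1, 5), rescale(7, 1, 10), rescale(60, 1, 100))   # site A, B, C examples
```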
5. What is the difference between collaborative filtering and content filtering?
Content filtering assumes access to side information about items (and/or users), while
collaborative filtering does not.