
AIN SHAMS UNIVERSITY

FACULTY OF ENGINEERING
Department of Computer and Systems Engineering
4th Year, Electrical Engineering

2nd Semester, 2015/2016 Course Code: CSE 465 Time Allowed: 3 Hours

Selected Topics in Systems Engineering

Exam consists of FIVE questions in FOUR pages Total Marks: 90

Question 1: Decision Trees & KNN (18 Marks)

A. For the data in the given table and using information entropy, a decision tree is to be built. (12 Marks)

   f1  f2  f3  Class
   A   M   Y   1
   A   M   Z   0
   A   N   Y   0
   A   N   Z   1
   B   M   Y   0
   B   M   Z   0
   B   N   Y   1
   B   N   Z   1

1. What is the information entropy of the data?
Solution: Since the data is divided evenly between the two classes (four samples each), the entropy at the top node is 1.

2. What is the first feature to use in splitting the data? Why?
Solution: In the first step,
- weighted entropy for attribute f1 is: (4/8)·1 + (4/8)·1 = 1,
- weighted entropy for attribute f3 is: (4/8)·1 + (4/8)·1 = 1,
- weighted entropy for attribute f2 is: (4/8)·0.8113 + (4/8)·0.8113 = 0.8113.
You therefore need to start by splitting on attribute f2, as it gives an information gain of 1 − 0.8113 = 0.1887. Splitting on either of the other attributes gives zero information gain.

3. What is the next feature to split with?
Solution: For the second step, the remaining attributes are f1 and f3. The two children of the first split can be split by f1-f3, f1-f1, f3-f1 or f3-f3; every case gives the same information gain. The weighted entropy for the second step is 0.5, so the information gain is 0.8113 − 0.5 = 0.3113.
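These numbers can be reproduced with a short script (a minimal sketch in Python; the feature names and the 0/1 class encoding are taken from the table above):

```python
import math
from collections import Counter

# Data from the table: (f1, f2, f3, class)
data = [("A","M","Y",1), ("A","M","Z",0), ("A","N","Y",0), ("A","N","Z",1),
        ("B","M","Y",0), ("B","M","Z",0), ("B","N","Y",1), ("B","N","Z",1)]

def entropy(labels):
    # H = -sum p_i log2 p_i over the class labels
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature_index):
    labels = [r[-1] for r in rows]
    total = entropy(labels)
    # weighted entropy of the children after splitting on this feature
    weighted = 0.0
    for value in set(r[feature_index] for r in rows):
        subset = [r[-1] for r in rows if r[feature_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return total - weighted

print("H(root) =", entropy([r[-1] for r in data]))        # 1.0
for i, name in enumerate(["f1", "f2", "f3"]):
    print(f"Gain({name}) = {info_gain(data, i):.4f}")     # f2 wins with ~0.1887
```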
B. Suppose we have collected the following one dimensional samples from two classes: D1 = {−1, −2, 0, 4} and
D2 = {2, 3}. We use a kNN classifier with k = 2, and whenever there is any ambiguity (e.g. one closest neighbor
is from class 1 and another from class 2), we always prefer class 1. Draw the decision regions and decision
boundaries for this case. (6 Marks)
Solution:
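The regions can also be located numerically. A minimal sketch, assuming the tie-breaking rule stated in the question (any tie among the k = 2 neighbours is resolved in favour of class 1):

```python
import numpy as np

D1 = np.array([-1, -2, 0, 4])   # class 1
D2 = np.array([2, 3])           # class 2
X = np.concatenate([D1, D2])
y = np.array([1] * len(D1) + [2] * len(D2))

def knn_predict(x, k=2):
    # indices of the k nearest training points to the query x
    idx = np.argsort(np.abs(X - x))[:k]
    votes = y[idx]
    # prefer class 1 whenever there is any ambiguity
    return 1 if np.sum(votes == 1) >= np.sum(votes == 2) else 2

# Scan the line and report where the predicted label changes (decision boundaries).
# With this tie-breaking rule the label changes near x = 1.5 and x = 3.
grid = np.arange(-5, 7, 0.01)
labels = np.array([knn_predict(x) for x in grid])
changes = grid[1:][labels[1:] != labels[:-1]]
print("approximate decision boundaries at x =", np.round(changes, 2))
```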

Question 2: Neural Networks (18 Marks)

A. Suppose you want to train a two-input perceptron on the following classification problem:

{(2, 6)T , 0}, {(1, 3)T , 1}, {(3, 9)T , 1}

Prove, mathematically, that the perceptron cannot learn this task, using inequalities expressed in terms of the
weights w0 , w1 , and w2 . (6 Marks)
Solution: The proof is similar to the proof that perceptrons cannot do XOR. Assume some set of weights exists
that performs the task correctly. Then:



1. w0 + 2 · w1 + 6 · w2 ≤ 0 pattern 1
2. w0 + w1 + 3 · w2 > 0 pattern 2
3. w0 + 3 · w1 + 9 · w2 > 0 pattern 3
4. 2 · w0 + 4 · w1 + 12 · w2 > 0 (2) + (3)
5. 2 · w0 + 4 · w1 + 12 · w2 ≤ 0 2 × (1)

Lines 4 and 5 are mutually contradictory.
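As a numerical cross-check, the same contradiction can be exposed with a linear-programming feasibility test (a minimal sketch using scipy; the strict inequalities are replaced by a margin of 1, which loses no generality because the constraints are scale-invariant):

```python
import numpy as np
from scipy.optimize import linprog

# Variables: w = (w0, w1, w2). Constraints in the form A_ub @ w <= b_ub:
#   w0 + 2*w1 + 6*w2 <= 0      (pattern 1, target 0)
#   w0 +   w1 + 3*w2 >= 1      (pattern 2, target 1)
#   w0 + 3*w1 + 9*w2 >= 1      (pattern 3, target 1)
A_ub = np.array([[ 1,  2,  6],
                 [-1, -1, -3],
                 [-1, -3, -9]])
b_ub = np.array([0, -1, -1])

res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
print("feasible:", res.success)   # expected: False -- no such weights exist
```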

B. A 1-1-1 feedforward network is trained using the backpropagation (BP) algorithm. The output node has a linear activation function, fo(no) = 0.5 + 0.01·no, where no is the net input to the output node. The hidden node function is fh(nh) = 1/(1 + e^{−nh}), where nh is the net input to the hidden node. [Remember that a node bias can be replaced with a weight connected to a constant input with value +1] (12 Marks)

1. Identify the learnable weights of the network.


Solution: Assuming that the input node is labeled as 1 with activation x, the hidden node as 2 with activation h, and the output node as 3 with activation o, the learnable parameters are:
ω1 (the weight from node 1 to node 2), ω2 (the weight from node 2 to node 3), ω3 (the bias of node 2), ω4 (the bias of node 3), and α, the slope parameter of the activation of node 3.
2. The training data consists of a single pattern, p = {(x, t) = (1, 0)}, where x is the input to the input node and t is the target value. Using zero initial values for all parameters, calculate the following values: nh, no, fh and fo.
Solution: nh = 0, no = 0, fh = 0.5 = fo
3. Write down the training rules for all weights of the network using the BP algorithm.
Solution:

[Network diagram: input x → hidden node (activation fh) via weight w21 → output node (activation fo) via weight w32; the biases b2 and b3 are weights from a constant input of +1 to the hidden and output nodes, respectively.]
Derive the BP learning rules for the parameters in (a). Generally, for any weight the update is ∆wi = −µ ∂E/∂wi. For the bias parameters, ∆bi = −µ ∂E/∂bi. For the slope β, ∆β = −µ ∂E/∂β, where E = Σp (t − fo)².
(1):

∂E/∂w32 = (∂E/∂fo) · (∂fo/∂no) · (∂no/∂w32)
        = −2(t − fo) · fo′ · fh,   where fo′ = 0.01 is the constant slope of the linear output activation.
Absorbing the constant factors into the learning rate µ gives
∆w32 = µ(t − fo) · fh = µ δo · fh

Similarly, ∆b3 = µ(t − fo) · 1 = µ(t − fo)


(2)

∂E/∂w21 = (∂E/∂fo) · (∂fo/∂no) · (∂no/∂fh) · (∂fh/∂nh) · (∂nh/∂w21)
        = −2(t − fo) · fo′ · w32 · fh′(nh) · x,   where fh′(nh) = fh(1 − fh).
With the constants again absorbed into µ,
∆w21 = µ(t − fo) · w32 · fh(1 − fh) · x = µ δh · x

Similarly, ∆b2 = µ(t − fo) · w32 · fh(1 − fh)


4. Calculate the values of δo and δh using the data in part (2).
Solution: δo = −0.5, δh = 0
5. Calculate the values of all weights after applying the training rules.
Solution: ∆w32 = µ δo · fh = −0.25µ, ∆b3 = µ δo = −0.5µ, ∆w21 = ∆b2 = 0
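A short numerical sketch of this single BP step (assuming, as in the solution above, zero initial parameters and constant factors absorbed into the learning rate µ):

```python
import math

x, t = 1.0, 0.0             # single training pattern
w21 = b2 = w32 = b3 = 0.0   # zero initial parameters
mu = 0.1                    # learning rate (any value; results are multiples of mu)

# Forward pass
n_h = w21 * x + b2
f_h = 1.0 / (1.0 + math.exp(-n_h))   # logistic hidden unit -> 0.5
n_o = w32 * f_h + b3
f_o = 0.5 + 0.01 * n_o               # linear output unit  -> 0.5

# Backward pass (deltas as defined in the solution)
delta_o = t - f_o                         # -0.5
delta_h = delta_o * w32 * f_h * (1 - f_h) # 0

# Updates
dw32 = mu * delta_o * f_h   # -0.25 * mu
db3  = mu * delta_o         # -0.5  * mu
dw21 = mu * delta_h * x     # 0
db2  = mu * delta_h         # 0
print(n_h, n_o, f_h, f_o, delta_o, delta_h, dw32, db3, dw21, db2)
```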

Question 3: Clustering (18 Marks)



The one-dimensional data points, {10, 20, 40, 80, 85, 121, 160, 168, 195}, are to be clustered as described below. For each part of the problem, assume that the Euclidean distance between the data points will be used as a dissimilarity measure.

1. Use hierarchical agglomerative clustering with single linkage to cluster the data. Draw a dendrogram to
illustrate your clustering and include a vertical axis with numerical labels indicating the height of each
parental node in the dendrogram.
Solution:

2. Repeat part (1) using hierarchical agglomerative clustering with complete linkage. Solution:

3. If two clusters are desired, what data points would be clustered together according to the single linkage
method used in part (1)?
Solution: (10, 20, 40) and (80,...,195)
4. Repeat part (3) using the complete linkage method used in part (2)?
Solution:
(10,...,121) and (160, 168, 195)
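The dendrograms and the two-cluster cuts can be reproduced with scipy (a minimal sketch; scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([10, 20, 40, 80, 85, 121, 160, 168, 195], dtype=float).reshape(-1, 1)

for method in ("single", "complete"):
    Z = linkage(points, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
    clusters = {k: points[labels == k].ravel().tolist() for k in set(labels)}
    print(method, clusters)
# Expected: single linkage   -> {10, 20, 40} vs {80, ..., 195}
#           complete linkage -> {10, ..., 121} vs {160, 168, 195}
```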
5. Use the K-means algorithm with K = 3 to cluster the data set. Suppose that the points 160, 168, and 195
were selected as the initial cluster means. Work from these initial values to determine the final clustering
for the data.
Solution:
1. (10, 20, 40, 80, 85, 121, 160) (168) (195)
means: 73.71 168 195
2. (10, 20, 40, 80, 85) (121, 160, 168) (195)
means: 47 149.67 195
3. same as step 2.
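These iterations can be checked with a few lines of numpy (a minimal sketch of Lloyd's algorithm started from the means given in the question):

```python
import numpy as np

x = np.array([10, 20, 40, 80, 85, 121, 160, 168, 195], dtype=float)
means = np.array([160.0, 168.0, 195.0])   # initial cluster means from the question

for step in range(10):
    # assign each point to the nearest current mean
    assign = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
    new_means = np.array([x[assign == k].mean() for k in range(3)])
    print(step + 1, [x[assign == k].tolist() for k in range(3)], np.round(new_means, 2))
    if np.allclose(new_means, means):
        break
    means = new_means
```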
6. If a different set of three starting means were used, would that result in a different set of final clusters?
Explain.
Solution:
Yes; K-means converges to a local optimum that depends on the initial means. For example, a different initialization can converge to the final clustering (10, 20, 40), (80, 85, 121), (160, 168, 195), with means 23.3333, 95.3333 and 174.3333, which differs from the result of part (5).

Question 4: SVM & PCA (18 Marks)


A. Consider a dataset with 3 points in 1-D with their class labels: {x = 0, '+'}, {x = −1, '−'}, {x = +1, '−'}. (9 Marks)
1. Are the classes {+, −} linearly separable?
Solution:
Clearly, the classes are not linearly separable: the '+' point at x = 0 lies between the two '−' points, so no single threshold on x can separate them.

2. Consider mapping each point to 3-D using new feature vectors Φ(x) = [1, 2x, x²]. Are the classes now linearly separable? If so, find a separating hyperplane.
Solution:
The points are mapped to (1, 0, 0)^T, (1, −2, 1)^T and (1, 2, 1)^T respectively. The points are now separable in 3-dimensional space. A separating hyperplane is given by the weight vector (0, 0, 1) in the new space, since the third coordinate is 0 for the '+' point and 1 for both '−' points.

3. Define a class variable yi ∈ {+1, −1} which denotes the class of xi and let w = (w1, w2, w3)^T. The max-margin SVM classifier solves the following problem

   min_{w,b} (1/2)||w||²   subject to   yi(w^T Φ(xi) + b) ≥ 1,  i = 1, 2, 3

Using the method of Lagrange multipliers, show that the solution is w = (0, 0, −2)^T, b = 1 and calculate the margin 1/||w||.
Solution:
For optimization problems with inequality constraints such as the above, we should apply the KKT conditions, which are a generalization of Lagrange multipliers. However, this problem can be solved more easily by noting that we have three vectors in the 3-dimensional space and all of them are support vectors, so all 3 constraints hold with equality. Therefore we can apply the method of Lagrange multipliers to

   min_{w,b} (1/2)||w||²   subject to   yi(w^T Φ(xi) + b) = 1,  i = 1, 2, 3.

Solving the resulting equations gives w = (0, 0, −2)^T and b = 1, and the margin is 1/||w|| = 1/2.
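A quick numerical check of this solution (a minimal sketch; it only verifies the constraints and the margin rather than re-deriving the Lagrange solution):

```python
import numpy as np

def phi(x):
    return np.array([1.0, 2.0 * x, x * x])   # feature map from part 2

X = np.array([0.0, -1.0, 1.0])
y = np.array([+1, -1, -1])
w = np.array([0.0, 0.0, -2.0])
b = 1.0

margins = y * (np.array([phi(x) for x in X]) @ w + b)
print("constraint values y_i (w^T phi(x_i) + b):", margins)   # all equal to 1
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))            # 0.5
```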


4. Show that the solution remains the same if the constraints are changed to

yi (wT Φ(xi ) + b) ≥ ρ, i = 1, 2, 3 ρ ≥ 1.

B. The data points x1 = (−1, −1)t and x2 = (2, 2)t belong to class C1 , and x3 = (1, 1)t and x4 = (−2, −2)t belong
to class C2 . (9 Marks)

1. Calculate the covariance matrix of these data points and the corresponding eigenvectors and eigenvalues.
Solution:
mx = (1/4) Σ_{i=1}^{4} xi = 0,   Cx = (1/4) Σ_{i=1}^{4} xi xi^T = [2.5 2.5; 2.5 2.5]

Eigenvalues: |λI − Cx| = 0  →  det([λ−2.5  −2.5; −2.5  λ−2.5]) = 0  →  λ1 = 5, λ2 = 0

Eigenvectors: Cx ei = λi ei  →  [2.5 2.5; 2.5 2.5][e11; e12] = 5[e11; e12]  →  e11 = e12  →  e1 = (1/√2)[1; 1].
The second eigenvector is normal to e1, i.e., e2 = (1/√2)[−1; 1].
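These values can be verified with numpy (a minimal sketch; note the 1/N covariance convention used in the solution, and that eigenvectors are determined only up to sign):

```python
import numpy as np

X = np.array([[-1, -1], [2, 2], [1, 1], [-2, -2]], dtype=float)
mean = X.mean(axis=0)                      # (0, 0)
C = (X - mean).T @ (X - mean) / len(X)     # [[2.5, 2.5], [2.5, 2.5]]
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns eigenvalues in ascending order
print(C)
print(eigvals)    # [0, 5]
print(eigvecs)    # columns ~ (1/sqrt(2))*[-1, 1] and (1/sqrt(2))*[1, 1], up to sign
```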
2. If you project the data on the 1st principal eigenvector and discard the projections on the remaining ones,
what would be the mean square error between the original and the new data.
Solution:
Since all data points lie on e1 , the projected data are exactly at the same locations of the original data and
thus the error will be zero.
3. Using the projected data in part (2), would you be able to separate the two classes? Justify your answer.
Solution:
No. The data will be inseparable as the projection did not change the location of the points.

Question 5: Linear Regression (18 Marks)

A. Given a set of data points, {(xi, yi), i = 1, · · · , N}, xi, yi ∈ R, derive the regression parameters (w0, w1) of the model ŷi = w0 + w1 xi, such that the mean square error E = (1/N) Σ_{i=1}^{N} (yi − ŷi)² is minimized. (6 Marks)
Solution:
We estimate w0 and w1 by minimizing the mean squared error

   E = (1/N) Σ_{j=1}^{N} (yj − w0 − w1 xj)².

Differentiating E with respect to w0 and w1 and setting the derivatives to zero gives the normal equations:

   N w0 + (Σ xj) w1 = Σ yj
   (Σ xj) w0 + (Σ xj²) w1 = Σ xj yj

Solving these equations, we get

   ŵ1 = (Σ xj yj − N x̄ȳ) / (Σ xj² − N x̄²) = Sxy / Sxx,
   ŵ0 = ȳ − ŵ1 x̄,

where x̄ and ȳ are the sample means, Sxy = Σ xj yj − N x̄ȳ and Sxx = Σ xj² − N x̄².
Alternatively, in vector form,

   (w0, w1)^T = (X^T X)^{−1} X^T Y,

where the j-th row of X is (1, xj) and Y = (y1, ..., yN)^T.

B. For the data points, {(1, 1)t , (2, 2)t , (3, 1)t }, calculate (w0 , w1 ) and E. Without recalculating the parameters,
can you guess the values of (w0 , w1 ) and E if the point (2, 1)t is added? (6 Marks)
   
Solution:

   (w0, w1)^T = (X^T X)^{−1} X^T Y = (4/3, 0)^T,   E = (1/3)(2·(1/3)² + (2/3)²) = 2/9.

It is clear that the best fit is a horizontal line, and when the point (2, 1)^t is added it will stay a horizontal line y = w0 with w1 = 0. The total residual of the three points at y = 1 must balance that of the point at y = 2: writing a = w0 − 1 and b = 2 − w0, we have a + b = 1 and 3a = b, so a = 1/4, w0 = 5/4, w1 = 0, and E = (1/4)(3·(1/4)² + (3/4)²) = 3/16.
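Both fits can be checked against the closed-form solution from part A (a minimal sketch using numpy):

```python
import numpy as np

def fit_and_error(points):
    pts = np.asarray(points, dtype=float)
    X = np.column_stack([np.ones(len(pts)), pts[:, 0]])   # rows (1, x_i)
    y = pts[:, 1]
    w = np.linalg.solve(X.T @ X, X.T @ y)                 # (X^T X)^{-1} X^T y
    E = np.mean((y - X @ w) ** 2)
    return w, E

print(fit_and_error([(1, 1), (2, 2), (3, 1)]))            # w ~ (4/3, 0), E ~ 2/9
print(fit_and_error([(1, 1), (2, 2), (3, 1), (2, 1)]))    # w ~ (5/4, 0), E ~ 3/16
```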

C. The linear regressor can be used as a two-way (+1, −1) classifier by applying the function

   P(yi = +1 | xi) = 1 / (1 + e^{−(w0 + w1 xi)}),

which calculates the probability of the input xi being in class (+1). What is the probability of the other class? Show that the decision boundary between the two classes is w0 + w1 xi = 0. (6 Marks)
Solution:
   P(yi = −1 | xi) = 1 − 1/(1 + e^{−(w0 + w1 xi)}) = e^{−(w0 + w1 xi)} / (1 + e^{−(w0 + w1 xi)})

For the input xi to be classified in class (+1),

   P(yi = +1 | xi) > P(yi = −1 | xi)
   1 > e^{−(w0 + w1 xi)}

Taking the log of both sides,

   0 > −(w0 + w1 xi),

i.e. w0 + w1 xi > 0. Thus the separating line is w0 + w1 xi = 0.
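A small numerical illustration of this boundary (a minimal sketch; the weights w0 = −2 and w1 = 1 are arbitrary example values, giving a boundary at x = 2):

```python
import numpy as np

w0, w1 = -2.0, 1.0    # hypothetical example weights; boundary at x = -w0/w1 = 2

def p_plus(x):
    return 1.0 / (1.0 + np.exp(-(w0 + w1 * x)))

for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    p = p_plus(x)
    label = +1 if p > 0.5 else -1   # p > 0.5 exactly when w0 + w1*x > 0
    print(f"x={x}: P(+1)={p:.3f}, P(-1)={1 - p:.3f}, class={label}")
```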



SVM  Solution to the dual problem is:
𝑇
Classifier is: f(𝑋𝑖 ) = sign(𝑊 𝑋𝑖 + b) w=∑𝑛𝑖=1 ∝𝑖 𝑦𝑖 𝑥𝑖
Functional margin of 𝑋𝑖 is : 𝑦𝑖 (𝑊 𝑇 𝑋𝑖 +b) b=𝑦𝑘 (1-ξ𝑘 ) - 𝑤 𝑇 𝑥𝑘 where k=𝑎𝑟𝑔𝑚𝑎𝑥𝑘 ′ 𝑎𝑘 ′
𝑊 𝑇 𝑋𝑖 + b W is not needed explicitly for classification!
r=y | 𝑊 𝑇 𝑋𝑖 + b≥1 if 𝑦𝑖 =1 & 𝑊 𝑇 𝑋𝑖 + b≤1 if 𝑦𝑖 =-1
‖𝑊‖ f(X) = ∑ 𝑎𝑖 𝑦𝑖 𝑥𝑖 𝑇 𝑥 +b
𝜌 = ‖𝑋𝑎 − 𝑋𝑏 ‖2 = 2/‖𝑊‖2 Linear: K(𝑋𝑖 , 𝑋𝑗 )= 𝑋𝑖 𝑇 𝑋𝑗
A better formulation (min ||W|| = max 1/||W||) :
Polynomial of power p: K(𝑋𝑖 , 𝑋𝑗 )= (1 + 𝑋𝑖 𝑇 𝑋𝑗 )𝑃
Find W and b such that
Gaussian (radial-basis function network):
∅(W) = 1/2 𝑊 𝑇 W is minimized; 2
And for all {(𝑋𝑖 ,𝑦𝑖 )}: 𝑦𝑖 (𝑊 𝑇 𝑋𝑖 + b) ≥1 −‖𝑋𝑖 − 𝑋𝑗 ‖
K(𝑋𝑖 , 𝑋𝑗 )= exp( )
The decision boundary that classify the examples correctly 2𝜎2
holds 𝑦𝑖 (𝑊 𝑇 𝑋𝑖 + b) ≥1, ∀𝑖 Sigmoid: K(𝑋𝑖 , 𝑋𝑗 )= tanh(𝛽0 𝑋𝑖 𝑇 𝑋𝑗 +𝛽1 )
If 𝑋0 is a solution to the constrained optimization problem K(𝑋𝑖 , 𝑋𝑗 )= ∅(𝑋𝑖 ) 𝑇 ∅(𝑋𝑗 )
𝑚𝑖𝑛𝑥 f(x) subject to 𝑔𝑖 (x) ≤0 ∀𝑖 ∈ 1...n Where ∅(𝑥) = [1 𝑥1 2 √2𝑥1 𝑥2 𝑥2 2 √2𝑥1 √2𝑥2 ]
There must exist ∝𝑖 ≥ 0such that 𝑋0 satisfies Radial basis function (infinite dimensional space)
2
𝜕 −‖𝑋𝑖 − 𝑋𝑗 ‖
(𝑓(𝑥) + ∑𝑛𝑖=1 ∝𝑖 𝑔𝑖 (x))|𝑥=𝑥0 =0 K(𝑋𝑖 , 𝑋𝑗 )= exp( )
𝜕𝑥 2𝜎2
𝑔𝑖 (x) = 0 ∀i ∈ 1...n
The function f(x)+ ∑𝑛 𝑖=1 ∝𝑖 𝑔𝑖 (𝑥) is called lagrangian DT
The solution is the point of gradient 0
1 Calculating Entropy
Minimize ‖𝑊‖2 H(X) = − ∑𝑖𝜖 𝑋 pi log pi
2
Subject to 1- 𝑦𝑖 (𝑊 𝑇 𝑋𝑖 + b) ≤0 , ∀𝑖 Calculating Information Gain
|𝑆𝑓|
The Lagrangian is: Gain(S,F)= H(S) - ∑𝑓𝜀 𝑣𝑎𝑙𝑢𝑒𝑠(𝐹) H(𝑆𝑓)
|𝑆|
1
L = 𝑊 𝑊+∑𝑛𝑇 𝑇
𝑖=1 ∝𝑖 (1 − 𝑦𝑖 (𝑊 𝑋𝑖 + b) ) Identify the feature with the greatest Information Gain and
2
−1 𝑛 repeat this process recursively.
= ∑𝑖=1 ∑𝑛𝑗=1 ∝𝑖 ∝𝑗 𝑦𝑖 𝑦𝑗 𝑥𝑖 𝑇 𝑥𝑗 + ∑𝑛𝑖=1 ∝𝑖 (rearrangingterms)
2
1 𝑛 NN
Maximize ∑𝑛 𝑛
𝑖=1 ∝𝑖 − 2 ∑𝑖=1 ∑𝑗=1 ∝𝑖 ∝𝑗 𝑦𝑖 𝑦𝑗 𝑥𝑖 𝑥𝑗
𝑇

Subject to ∑𝑛𝑖=1 ∝𝑖 𝑦𝑖 = 0, ∝𝑖 ≥0 ∀𝑖
Gradient Descent:
𝜕𝐽
We can obtain w by For i = 0 to n : 𝑤𝑖 ← 𝑤𝑖 + ∆𝑤𝑖 Where ∆𝑤𝑖 = −𝜂
𝜕𝑤𝑖
w=∑𝑛 𝑖=1 ∝𝑖 𝑦𝑖 𝑥𝑖 True of Batch Gradient Descent: After each epoch, compute
b can be obtained from a positive support vector average loss over the training set:
(𝑥 ∝𝑝𝑠𝑣 )knowing that 𝑊 𝑇 𝑥𝑝𝑠𝑣 +b=1 Find w that minimizes “loss” function where M is the number of
training examples.
sign( 𝑊 𝑇 𝑍+b)=sign(∑∀𝑖∈𝑆𝑉 ∝𝑖 𝑦𝑖 𝑥𝑖 𝑇 𝑍 + 𝑏) L = “confidence” of incorrect prediction.
𝑀 𝑀 𝑀
The old formulation 1 1
Find W and b such that J(w) = ∑ 𝐿(𝑤 ⋅ 𝑥 𝑘 , 𝑡 𝑘 ) = ∑ ∑ max(0, −(𝑤 ⋅ 𝑥 𝑘 )𝑡 𝑘 )
𝑀 𝑀
∅(W) = 1/2 𝑊 𝑇 W is minimized is minimized and for all {(𝑋𝑖 ,𝑦𝑖 )} 𝐾=1 𝐾=1 𝐾=1

𝑦𝑖 (𝑊 𝑇 𝑋𝑖 + b) ≥1 𝑁−1
The new formulation incorporating slack variables: 1
∇𝜃 𝑅 = ∑ 2(𝑡𝑖 − 𝑔(𝜃 𝑇 𝑥𝑖 ))(−1)𝑔′ (𝜃 𝑇 𝑥𝑖 )𝑥𝑖 = 0
Find W and b such that 2𝑁
𝑖=0
∅(W) = 1/2 𝑊 𝑇 W +C∑ ξ𝑖 is minimized and for all {(𝑋𝑖 ,𝑦𝑖 )}
𝑦𝑖 (𝑊 𝑇 𝑋𝑖 + b) ≥1-ξ𝑖 and ξ𝑖 ≥0 for all i Stochastic Gradient Descent:
Parameter C can be viewed as a way to control overfitting
The dual problem for soft margin classification: 𝐽𝑘 (𝑤) = max(0, −(𝑤 ⋅ 𝑥 𝑘 )𝑡 𝑘 )
Find ∝𝑖 ... ∝𝑁 such that 𝜕𝐽𝑘 0 𝑖𝑓 (𝑤 ⋅ 𝑥 𝑘 )𝑡 𝑘 > 0, 𝑡ℎ𝑎𝑡 𝑖𝑠 𝑥 𝑘 𝑤𝑎𝑠 𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑒𝑑 𝑐𝑜𝑟𝑟𝑒𝑐𝑡𝑙𝑦
1 ={ 𝑘 𝑘
𝜕𝑤𝑖 −𝑥𝑖 𝑡 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
Q( ∝ )= ∑ − 2 ∑𝑛𝑖=1 ∑𝑛𝑗=1 ∝𝑖 ∝𝑗 𝑦𝑖 𝑦𝑗 𝑥𝑖 𝑇 𝑥𝑗 is maximized
∝𝑖
and ∑𝑛
𝑖=1 ∝𝑖 𝑦𝑖 = 0 & 0≤ ∝𝑖 ≤ C for all ∝𝑖 After processing training example k, if perceptron misclassified the
𝜕𝐽
example: ∆𝑤𝑖 = −𝜂 = 𝜂𝑥𝑖𝑘 𝑡 𝑘
𝜕𝑤𝑖

KNN(K-nearest-neighbour):
For each node j in hidden layer, ℎ𝑗 =
Steps:
1. For input instance X loop on the labeled training data stored and calculate 𝑆(∑𝑖 ∈ 𝑖𝑛𝑝𝑢𝑡 𝑙𝑎𝑦𝑒𝑟 𝑤𝑗𝑖 𝑥𝑖 + 𝑤𝑗0 )
distance from X using a distance function.
2. select K nearest instances and for each instance vote for its class. For each node k in output layer,𝑜𝑘 =
3. X is classified to belong to the class with 𝑆(∑𝑗 ∈ ℎ𝑖𝑑𝑑𝑒𝑛 𝑙𝑎𝑦𝑒𝑟 𝑤𝑘𝑗 ℎ𝑗 + 𝑤𝑘0 )
highest votes. 𝑥𝑖 : Activation of input node i.
ℎ𝑗 : Activation of hidden node j.
Distance Functions: 1)Euclidean Distance: √∑𝑘𝑖=1(𝑥𝑖 − 𝑦𝑖 )2
𝑜𝑘 : Activation of output node k.
1
𝑤𝑗𝑖 : Weight from node i to node j.
2)Manhattan: ∑𝑘
𝑖=1 |𝑥𝑖 − 𝑦𝑖 | 3)Minkowski: (∑𝑘𝑖=1(|𝑥𝑖 − 𝑦𝑖 |)𝑞 )𝑞

3/4
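For illustration, the kernels listed in the SVM section above can be written directly in numpy (a minimal sketch; sigma, p, beta0 and beta1 are free parameters, and the last lines check the quadratic-kernel identity K(Xi, Xj) = Φ(Xi)^T Φ(Xj)):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def poly_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

# Check K(xi, xj) = phi(xi) . phi(xj) for the quadratic kernel in 2-D
def phi_quadratic(x):
    x1, x2 = x
    return np.array([1, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly_kernel(xi, xj, p=2), phi_quadratic(xi) @ phi_quadratic(xj))  # equal
```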
PCA Algorithm
Mean of the data: x̄ = (1/N) Σ_{n=1}^{N} xn. Sample covariance matrix: S = (1/N) Σ_{n=1}^{N} (xn − x̄)(xn − x̄)^T.
Do the eigenvalue decomposition of the D×D matrix S.
Take the top K eigenvectors (corresponding to the top K eigenvalues); call these u1, ..., uK (s.t. λ1 ≥ λ2 ≥ ... ≥ λK−1 ≥ λK).
U = [u1 u2 ... uK] is the projection matrix of size D×K. The projection of each example xn is computed as zn = U^T xn; zn is a K×1 vector (also called the embedding of xn).

PCA
Projection of data point xn along u1: u1^T xn. Projection of the mean x̄ along u1: u1^T x̄, where x̄ = (1/N) Σ_{n=1}^{N} xn.
Variance of the projected data (along projection direction u1):
   (1/N) Σ_{n=1}^{N} {u1^T xn − u1^T x̄}² = u1^T S u1,
where S is the data covariance matrix S = (1/N) Σ_{n=1}^{N} (xn − x̄)(xn − x̄)^T.
Objective function: u1^T S u1 + λ1 (1 − u1^T u1).
Taking the derivative w.r.t. u1 and setting it to zero gives S u1 = λ1 u1; this is the eigenvalue equation, so u1 must be an eigenvector of S (and λ1 the corresponding eigenvalue).
But there are multiple eigenvectors of S; which one is u1? Consider u1^T S u1 = u1^T λ1 u1 = λ1 (using u1^T u1 = 1). We know that the projected data variance u1^T S u1 = λ1 is maximal, thus λ1 should be the largest eigenvalue and u1 is the first (top) eigenvector of S, i.e. the first principal component (the direction of highest variance in the data). Subsequent PCs are given by the subsequent eigenvectors of S.

PCA Approximate reconstruction
Given the principal components u1, ..., uK, the PCA approximation of an example xn is
   x̂n = Σ_{i=1}^{K} (xn^T ui) ui = Σ_{i=1}^{K} zni ui,
where zn = [zn1, ..., znK] is the low-dimensional projection of xn.
To compress a dataset X = [x1, ..., xN], all we need is the set of K << D principal components and the projections Z = [z1, ..., zN] of each example.

PCA for Very High Dimensional Data
In many cases N < D. Recall that PCA requires the eigen-decomposition of the D×D covariance matrix S = (1/N) X X^T (assuming centered data, and X being D×N).
If vi is an eigenvector of the N×N matrix X^T X and λi the corresponding eigenvalue of S, then ui = X vi / (N λi)^{1/2} is the corresponding eigenvector of S (the one we want).

Rest of NN
Error (or "loss") E is the sum-squared error over all output units:
   E(w) = (1/2) Σ_{p=1}^{no. of patterns} Σ_{k=1}^{no. of outputs} (tk − ok)²
Weight decay (with momentum):
   ∆wji^t = η δj xji + α ∆wji^{t−1} − λ wji^{t−1},
where λ is a parameter between 0 and 1 and α is the momentum.

Clustering
K-means algorithm:
- Iterate:
  -- Assign each example xn to its closest cluster center: Ck = {n : k = arg min_k ||xn − μk||²} (Ck is the set of examples closest to μk).
  -- Recompute the new cluster centers μk (the mean/centroid of the set Ck): μk = (1/|Ck|) Σ_{n∈Ck} xn.
- Repeat while not converged.
K-means objective function: J(μ, r) = Σ_{n=1}^{N} Σ_{k=1}^{K} rnk ||xn − μk||².
Min-link or single-link: results in chaining (clusters can get very large): d(R, S) = min_{xR∈R, xS∈S} d(xR, xS).
Max-link or complete-link: results in small, round-shaped clusters: d(R, S) = max_{xR∈R, xS∈S} d(xR, xS).
Average-link: a compromise between single and complete linkage: d(R, S) = (1/(|R||S|)) Σ_{xR∈R, xS∈S} d(xR, xS).

Linear regression
Regression: t(x) = f(x) + ε, where ε is some noise. Linear model: y(x) = w0 + w1 x.
Loss: l(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]².
Gradient descent: w ← w − λ ∂l/∂w.
Batch updates: w ← w + 2λ Σ_{n=1}^{N} (t^(n) − y(x^(n))) x^(n).
Stochastic/online updates: w ← w + 2λ (t^(n) − y(x^(n))) x^(n).
Analytical solution: define t = [t^(1), t^(2), ..., t^(N)]^T and X = [1, x^(1); 1, x^(2); ...; 1, x^(N)]; then w = (X^T X)^{−1} X^T t.
Multi-dimensional inputs: y(x) = w0 + Σ_{j=1}^{d} wj xj = w^T x.
M-th order polynomial function of a one-dimensional feature x: y(x, w) = w0 + Σ_{j=1}^{M} wj x^j.
Regularized Least Squares: l(w) = Σ_{n=1}^{N} [t^(n) − (w0 + w1 x^(n))]² + α w^T w.
Gradient descent: w ← w + 2λ [Σ_{n=1}^{N} (t^(n) − y(x^(n))) x^(n) − α w].
Analytical solution: w = (X^T X + αI)^{−1} X^T t.
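The N < D trick above can be verified numerically (a minimal sketch; the data matrix is random and purely illustrative):

```python
import numpy as np

# Verify: eigenvectors of S = (1/N) X X^T obtained from the small N x N matrix X^T X.
rng = np.random.default_rng(0)
N, D = 5, 50
X = rng.normal(size=(D, N))
X -= X.mean(axis=1, keepdims=True)   # center the data (columns are examples)

S = X @ X.T / N                      # D x D covariance (large)
G = X.T @ X                          # N x N Gram matrix (small)

evals, V = np.linalg.eigh(G)         # G v_i = (N * lambda_i) v_i
i = np.argmax(evals)                 # top eigenpair
lam = evals[i] / N                   # corresponding eigenvalue of S
u = X @ V[:, i] / np.sqrt(N * lam)   # u_i = X v_i / sqrt(N * lambda_i)

print(np.allclose(S @ u, lam * u), np.isclose(np.linalg.norm(u), 1.0))   # True True
```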
