Ain Shams University
Faculty of Engineering
Department of Computer and Systems Engineering
4th Year, Electrical Engineering
2nd Semester, 2015/2016    Course Code: CSE 465    Time Allowed: 3 Hours
A. For the data in the given table and using the information entropy, a decision tree is to be built. (12 Marks)
1. What is the information entropy of the data?
Solution: Since the data is divided evenly between the two classes, the
weighted entropy at the top node is 1.
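The table itself is not reproduced here, but the entropy computation can be sketched as follows (a minimal Python sketch, assuming, as stated, an even split between the two classes):

    import math

    def entropy(class_counts):
        """Information entropy (in bits) of a discrete label distribution."""
        total = sum(class_counts)
        return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

    # Even split between two classes, as in the question -> entropy = 1 bit
    print(entropy([5, 5]))   # 1.0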
A. Suppose you want to train a two-input perceptron on the following classification problem:
Prove, mathematically, that the perceptron cannot learn this task, using inequalities expressed in terms of the
weights w0 , w1 , and w2 . (6 Marks)
Solution: The proof is similar to the proof that perceptrons cannot do XOR. Assume some set of weights (w_0, w_1, w_2) exists
that performs the task correctly. Then the constraint inequalities on w_0 + w_1 x_1 + w_2 x_2 (one per training pattern) are
mutually contradictory, exactly as in the XOR case, so no such weights can exist.
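Since the exam's data table is not reproduced here, the following sketch uses the classic XOR labels as a stand-in; a simple perceptron never reaches zero training errors on such data, no matter how long it runs:

    # Perceptron on XOR-style data: the error count never reaches 0.
    data = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]  # XOR-like labels
    w0, w1, w2 = 0.0, 0.0, 0.0
    eta = 0.1

    for epoch in range(1000):
        errors = 0
        for (x1, x2), t in data:
            y = 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1
            if y != t:
                errors += 1
                w0 += eta * t          # bias treated as a weight on a constant +1 input
                w1 += eta * t * x1
                w2 += eta * t * x2
    print(errors)  # stays > 0: no linear separator (w0, w1, w2) exists for this task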
B. A 1-1-1 feedforward network is trained using the backpropagation (BP) algorithm. The output node has a
linear activation function, f_o(n_o) = 0.5 + 0.01·n_o, where n_o is the net input to the output node. The hidden
node function is f_h(n_h) = 1/(1 + e^{−n_h}), where n_h is the net input to the hidden node. [Remember that a node bias
can be replaced with a weight connected to a constant input with value +1] (12 Marks)
   ∂E/∂w_32 = (∂E/∂f_o) · (∂f_o/∂n_o) · (∂n_o/∂w_32) = −2(t − f_o) · f_o' · f_h      (with f_o' = 0.01)

   Δw_32 = μ(t − f_o) · f_h = μ δ_o · f_h
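A small numerical sketch of one BP step for this 1-1-1 network (the input, target, learning rate, and initial weights below are arbitrary placeholders, not values from the exam):

    import math

    # 1-1-1 network: input -> hidden (sigmoid) -> output (linear f_o(n) = 0.5 + 0.01*n)
    x, t, mu = 1.0, 1.0, 0.5          # hypothetical input, target, learning rate
    w21, b2 = 0.4, 0.1                # input -> hidden weight and hidden bias
    w32, b3 = 0.3, 0.0                # hidden -> output weight and output bias

    # Forward pass
    n_h = w21 * x + b2
    f_h = 1.0 / (1.0 + math.exp(-n_h))       # hidden activation
    n_o = w32 * f_h + b3
    f_o = 0.5 + 0.01 * n_o                   # linear output activation, f_o' = 0.01

    # Backward pass for w32, following the chain rule above with E = (t - f_o)^2
    # (the constants 2 and 0.01 are absorbed into mu in the exam's Δw_32 expression)
    dE_dw32 = -2.0 * (t - f_o) * 0.01 * f_h
    w32 -= mu * dE_dw32
    print(f_o, dE_dw32, w32)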
1. Use hierarchical agglomerative clustering with single linkage to cluster the data. Draw a dendrogram to
illustrate your clustering and include a vertical axis with numerical labels indicating the height of each
parental node in the dendrogram.
Solution:
2. Repeat part (1) using hierarchical agglomerative clustering with complete linkage.
Solution:
3. If two clusters are desired, what data points would be clustered together according to the single linkage
method used in part (1)?
Solution: (10, 20, 40) and (80,...,195)
4. Repeat part (3) using the complete linkage method from part (2).
Solution:
(10,...,121) and (160, 168, 195)
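A quick way to check parts (1)-(4) is a short SciPy sketch (the nine data points are taken from the solutions above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # 1-D data points, as listed in the solutions
    data = np.array([10, 20, 40, 80, 85, 121, 160, 168, 195], dtype=float).reshape(-1, 1)

    for method in ("single", "complete"):
        Z = linkage(data, method=method)                 # agglomerative clustering
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
        print(method, labels)
    # single  : (10, 20, 40) vs (80, ..., 195)
    # complete: (10, ..., 121) vs (160, 168, 195)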
5. Use the K-means algorithm with K = 3 to cluster the data set. Suppose that the points 160, 168, and 195
were selected as the initial cluster means. Work from these initial values to determine the final clustering
for the data.
Solution:
1. (10, 20, 40, 80, 85, 121, 160), (168), (195); means: 73.71, 168, 195
2. (10, 20, 40, 80, 85), (121, 160, 168), (195); means: 47, 149.67, 195
3. Same as step 2, so the algorithm has converged and this is the final clustering.
6. If a different set of three starting means were used, would that result in a different set of final clusters?
Explain.
Solution:
Yes. A different initialization can converge to a different final clustering, e.g.,
(10, 20, 40), (80, 85, 121), (160, 168, 195), with means 23.3333, 95.3333, 174.3333.
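A minimal 1-D K-means sketch (plain Python) that can be used to check parts (5) and (6); the data points and the first initialization are taken from the solutions above, while the second initialization is just an illustrative choice:

    def kmeans_1d(points, means, iters=100):
        """Plain K-means on scalars; returns (clusters, means) after convergence."""
        for _ in range(iters):
            # Assignment step: each point goes to the nearest current mean
            clusters = [[] for _ in means]
            for p in points:
                k = min(range(len(means)), key=lambda j: abs(p - means[j]))
                clusters[k].append(p)
            # Update step: recompute each mean (keep it if its cluster is empty)
            new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
            if new_means == means:
                break
            means = new_means
        return clusters, means

    data = [10, 20, 40, 80, 85, 121, 160, 168, 195]
    print(kmeans_1d(data, [160, 168, 195]))        # part (5) initialization
    print(kmeans_1d(data, [23.0, 95.0, 174.0]))    # a different initialization, as in part (6)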
3. Define a class variable y_i ∈ {+1, −1} which denotes the class of x_i and let w = (w_1, w_2, w_3)^T. The max-
   margin SVM classifier solves the following problem:

      min_{w,b} (1/2)||w||^2   subject to   y_i(w^T Φ(x_i) + b) ≥ 1,   i = 1, 2, 3

   Using the method of Lagrange multipliers, show that the solution is w = (0, 0, −2)^T, b = 1, and calculate the
   margin 1/||w||_2.
Solution:
For optimization problems with inequality constraints such as the above, we should apply the KKT conditions,
which generalize the method of Lagrange multipliers. However, this problem can be solved more easily by noting
that we have three vectors in 3-dimensional space and all of them are support vectors. Hence all three
constraints hold with equality, and we can apply the method of Lagrange multipliers to

      min_{w,b} (1/2)||w||^2   subject to   y_i(w^T Φ(x_i) + b) = 1,   i = 1, 2, 3.
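Given the stated solution w = (0, 0, −2)^T, the requested margin follows directly:

      ||w||_2 = √(0^2 + 0^2 + (−2)^2) = 2,   so the margin is 1/||w||_2 = 1/2.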
B. The data points x1 = (−1, −1)t and x2 = (2, 2)t belong to class C1 , and x3 = (1, 1)t and x4 = (−2, −2)t belong
to class C2 . (9 Marks)
1. Calculate the covariance matrix of these data points and the corresponding eigenvectors and eigenvalues.
Solution:

   m_x = (1/4) ∑_{i=1}^4 x_i = 0,   C_x = (1/4) ∑_{i=1}^4 x_i x_i^T = [2.5  2.5; 2.5  2.5]

   Eigenvalues: |λI − C_x| = 0  →  det[λ−2.5  −2.5; −2.5  λ−2.5] = (λ−2.5)^2 − 2.5^2 = 0  →  λ_1 = 5, λ_2 = 0

   Eigenvectors: C_x e_i = λ_i e_i  →  [2.5  2.5; 2.5  2.5](e_11, e_12)^T = 5(e_11, e_12)^T  →  e_11 = e_12
   →  e_1 = (1/√2)(1, 1)^T

The second eigenvector will be normal to e_1, i.e., e_2 = (1/√2)(−1, 1)^T.
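A quick numerical check of these values (a minimal NumPy sketch; the covariance is taken as (1/4)∑ x_i x_i^T about the zero mean, as in the solution):

    import numpy as np

    # The four data points (rows are samples)
    X = np.array([[-1.0, -1.0],
                  [ 2.0,  2.0],
                  [ 1.0,  1.0],
                  [-2.0, -2.0]])

    mean = X.mean(axis=0)                    # -> [0, 0]
    C = (X - mean).T @ (X - mean) / len(X)   # -> [[2.5, 2.5], [2.5, 2.5]]

    # Eigen-decomposition of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    print(mean)        # [0. 0.]
    print(eigvals)     # [0. 5.]  (eigh returns eigenvalues in ascending order)
    print(eigvecs)     # columns are eigenvectors, e.g. (1, 1)/sqrt(2) for eigenvalue 5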
2. If you project the data on the 1st principal eigenvector and discard the projections on the remaining ones,
what would be the mean square error between the original and the new data?
Solution:
Since all data points lie on e_1, the projected data are exactly at the same locations as the original data, and
thus the error will be zero.
3. Using the projected data in part (2), would you be able to separate the two classes? Justify your answer.
Solution:
No. The data will be inseparable as the projection did not change the location of the points.
A. Given a set of data points, {(x_i, y_i), i = 1, · · · , N}, x_i, y_i ∈ R, derive the regression parameters (w_0, w_1) of
the model ŷ_i = w_0 + w_1 x_i, such that the mean square error, E = (1/N) ∑_{i=1}^N (y_i − ŷ_i)^2, is minimized. (6 Marks)
Solution:
We estimate w_0 and w_1 by minimizing the mean squared error:

   E = (1/N) ∑_{j=1}^N (y_j − w_0 − w_1 x_j)^2.

Setting ∂E/∂w_0 = 0 and ∂E/∂w_1 = 0 gives the normal equations

   w_0 + \bar{x} w_1 = \bar{y}
   \bar{x} w_0 + \overline{x^2} w_1 = \overline{xy}
Solving these equations, we get

   \hat{w}_1 = (\overline{xy} − \bar{x}\bar{y}) / (\overline{x^2} − \bar{x}^2) = S_{xy} / S_{xx},

where S_{xy} = N(\overline{xy} − \bar{x}\bar{y}) and S_{xx} = N(\overline{x^2} − \bar{x}^2), and

   \hat{w}_0 = \bar{y} − \bar{x} S_{xy} / S_{xx}.
Alternatively, in vector form,

   (w_0, w_1)^T = (X^T X)^{−1} X^T Y
B. For the data points, {(1, 1)t , (2, 2)t , (3, 1)t }, calculate (w0 , w1 ) and E. Without recalculating the parameters,
can you guess the values of (w0 , w1 ) and E if the point (2, 1)t is added? (6 Marks)
Solution:

   (w_0, w_1)^T = (X^T X)^{−1} X^T Y = (4/3, 0)^T,    E = (1/3)·(2·(1/3)^2 + (2/3)^2) = 2/9

It is clear that the best fit is a horizontal line. When the point (2, 1)^t is added, the fit remains a horizontal
line: if a is the (absolute) error of the three points at y = 1 and b is the error of the point at y = 2, then
a + b = 1 and 3a = b, so a = 1/4. Hence w_0 = 5/4, w_1 = 0, and E = (1/4)·(3·(1/4)^2 + (3/4)^2) = 3/16.
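A minimal NumPy check of both fits (nothing here beyond the closed form (X^T X)^{−1} X^T Y used above):

    import numpy as np

    def fit_line(points):
        """Least-squares fit of y = w0 + w1*x; returns (w0, w1, mse)."""
        x = np.array([p[0] for p in points], dtype=float)
        y = np.array([p[1] for p in points], dtype=float)
        X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
        w = np.linalg.solve(X.T @ X, X.T @ y)          # (X^T X)^{-1} X^T y
        mse = np.mean((y - X @ w) ** 2)
        return w[0], w[1], mse

    print(fit_line([(1, 1), (2, 2), (3, 1)]))          # ~ (1.3333, 0.0, 0.2222) = (4/3, 0, 2/9)
    print(fit_line([(1, 1), (2, 2), (3, 1), (2, 1)]))  # ~ (1.25, 0.0, 0.1875)  = (5/4, 0, 3/16)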
C. The linear regressor can be used as a two-way (+1, −1) classifier by applying the function

      P(y_i = +1 | x_i) = 1 / (1 + e^{−(w_0 + w_1 x_i)})

   which calculates the probability of the input x_i being in class (+1). What is the probability of the other
   class? Show that the decision boundary between the two classes is w_0 + w_1 x_i = 0. (6 Marks)
Solution:

   P(y_i = −1 | x_i) = 1 − 1 / (1 + e^{−(w_0 + w_1 x_i)}) = e^{−(w_0 + w_1 x_i)} / (1 + e^{−(w_0 + w_1 x_i)})

For the input x_i to be classified in class (+1) we need P(y_i = +1 | x_i) > P(y_i = −1 | x_i), i.e.,
e^{−(w_0 + w_1 x_i)} < 1, which holds exactly when w_0 + w_1 x_i > 0. The two class probabilities are equal when
e^{−(w_0 + w_1 x_i)} = 1, so the decision boundary between the two classes is w_0 + w_1 x_i = 0.
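A tiny illustration of this boundary (sketch only; the weights below are arbitrary placeholders, not values from the exam):

    import math

    def p_positive(x, w0, w1):
        """P(y = +1 | x) for the logistic model above."""
        return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

    w0, w1 = -2.0, 1.0          # hypothetical weights; boundary at x = -w0/w1 = 2
    for x in [0.0, 2.0, 4.0]:
        p = p_positive(x, w0, w1)
        label = +1 if w0 + w1 * x > 0 else -1
        print(x, round(p, 3), label)   # p crosses 0.5 exactly at the boundary x = 2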
Subject to ∑_{i=1}^n α_i y_i = 0, α_i ≥ 0 ∀i.
w = ∑_{i=1}^n α_i y_i x_i. b can be obtained from a positive support vector x_psv, knowing that w^T x_psv + b = 1.
A new point z is classified by sign(w^T z + b) = sign(∑_{i∈SV} α_i y_i x_i^T z + b).

The old formulation:
Find w and b such that Φ(w) = (1/2) w^T w is minimized, and for all {(x_i, y_i)}: y_i(w^T x_i + b) ≥ 1.

The new formulation incorporating slack variables:
Find w and b such that Φ(w) = (1/2) w^T w + C ∑ ξ_i is minimized, and for all {(x_i, y_i)}:
y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.
Parameter C can be viewed as a way to control overfitting.

The dual problem for soft margin classification:
Find α_1, ..., α_N such that

   Q(α) = ∑_i α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j x_i^T x_j

is maximized, and ∑_{i=1}^n α_i y_i = 0 and 0 ≤ α_i ≤ C for all α_i.

Gradient Descent:
We can obtain w by: for i = 0 to n: w_i ← w_i + Δw_i, where Δw_i = −η ∂J/∂w_i.

Batch Gradient Descent: after each epoch, compute the average loss over the training set. Find w that
minimizes the "loss" function, where M is the number of training examples and L is the "confidence" of an
incorrect prediction:

   J(w) = (1/M) ∑_{k=1}^M L(w · x^k, t^k) = (1/M) ∑_{k=1}^M max(0, −(w · x^k) t^k)

   ∇_θ R = (1/2N) ∑_{i=0}^{N−1} 2 (t_i − g(θ^T x_i)) (−1) g'(θ^T x_i) x_i = 0

Stochastic Gradient Descent:

   J_k(w) = max(0, −(w · x^k) t^k)

   ∂J_k/∂w_i = 0 if (w · x^k) t^k > 0, that is, x^k was classified correctly; −x_i^k t^k otherwise.

After processing training example k, if the perceptron misclassified the example: Δw_i = −η ∂J_k/∂w_i = η x_i^k t^k.

KNN (K-nearest-neighbour):
Steps:
1. For an input instance X, loop over the stored labeled training data and calculate the distance from X using
   a distance function.
2. Select the K nearest instances and let each instance vote for its class.
3. X is classified to belong to the class with the highest number of votes.

Distance Functions: 1) Euclidean: √(∑_{i=1}^k (x_i − y_i)^2)  2) Manhattan: ∑_{i=1}^k |x_i − y_i|
3) Minkowski: (∑_{i=1}^k |x_i − y_i|^q)^{1/q}

Neural network forward pass:
For each node j in the hidden layer: h_j = S(∑_{i ∈ input layer} w_ji x_i + w_j0).
For each node k in the output layer: o_k = S(∑_{j ∈ hidden layer} w_kj h_j + w_k0).
x_i: activation of input node i. h_j: activation of hidden node j. o_k: activation of output node k.
w_ji: weight from node i to node j.
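As an illustration of the KNN steps above, a minimal sketch in plain Python (the toy points and the value of k are made up for the example):

    from collections import Counter
    import math

    def knn_classify(x, training_data, k=3):
        """training_data: list of (feature_vector, label). Returns majority label of the k nearest."""
        # 1. Compute the Euclidean distance from x to every stored training instance.
        dists = [(math.dist(x, xi), yi) for xi, yi in training_data]
        # 2. Take the k nearest instances and let each vote for its class.
        nearest = sorted(dists, key=lambda d: d[0])[:k]
        # 3. Return the class with the highest number of votes.
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    # Hypothetical 1-D toy data, in the spirit of the clustering examples in this exam.
    train = [((10,), 'A'), ((20,), 'A'), ((40,), 'A'), ((160,), 'B'), ((168,), 'B'), ((195,), 'B')]
    print(knn_classify((30,), train, k=3))   # 'A'
    print(knn_classify((170,), train, k=3))  # 'B'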
PCA Algorithm:
Mean of the data: x̄ = (1/N) ∑_{n=1}^N x_n. Sample covariance matrix: S = (1/N) ∑_{n=1}^N (x_n − x̄)(x_n − x̄)^T.
Do the eigenvalue decomposition of the D × D matrix S.
Take the top K eigenvectors (corresponding to the top K eigenvalues); call these u_1, ..., u_K
(s.t. λ_1 ≥ λ_2 ≥ ... ≥ λ_{K−1} ≥ λ_K).
U = [u_1 u_2 ... u_K] is the projection matrix of size D × K. The projection of each example x_n is computed as
z_n = U^T x_n; z_n is a K × 1 vector (also called the embedding of x_n).

PCA:
Projection of a data point x_n along u_1: u_1^T x_n.
Projection of the mean x̄ along u_1: u_1^T x̄, where x̄ = (1/N) ∑_{n=1}^N x_n.
Variance of the projected data (along projection direction u_1):

   (1/N) ∑_{n=1}^N {u_1^T x_n − u_1^T x̄}^2 = u_1^T S u_1,

where S is the data covariance matrix defined as S = (1/N) ∑_{n=1}^N (x_n − x̄)(x_n − x̄)^T.
Objective function: u_1^T S u_1 + λ_1(1 − u_1^T u_1). Taking the derivative w.r.t. u_1 and setting it to zero
gives S u_1 = λ_1 u_1. This is the eigenvalue equation, so u_1 must be an eigenvector of S (and λ_1 the
corresponding eigenvalue).
But there are multiple eigenvectors of S; which one is u_1? Consider u_1^T S u_1 = u_1^T λ_1 u_1 = λ_1 (using
u_1^T u_1 = 1). We know that the projected data variance u_1^T S u_1 = λ_1 is maximized, so λ_1 should be the
largest eigenvalue and u_1 the first (top) eigenvector of S (with eigenvalue λ_1) => the first principal
component (the direction of highest variance in the data).
Subsequent PCs are given by the subsequent eigenvectors of S.
S = (1/N) X X^T (assuming centered data, and X being D × N). The relationship is u_i = (1/√(N λ_i)) X v_i, where
{λ_i, v_i} is an eigenvalue-eigenvector pair of the N × N matrix X^T X, and u_i is the corresponding eigenvector
of S = (1/N) X X^T (the one we want).

Rest of NN:
Error (or "loss") E is the sum-squared error over all output units:

   E(w) = (1/2) ∑_{p=1}^{no. of patterns} ∑_{k=1}^{no. of outputs} (t_k − o_k)^2

Weight Decay: Δw_ji^t = η δ_j x_ji + α Δw_ji^{t−1} − λ w_ji^{t−1}, where λ is a parameter between 0 and 1 and α
is the momentum.

Clustering:
K-means algorithm:
- Iterate:
  -- Assign each example x_n to its closest cluster center: C_k = {n : k = arg min_k ||x_n − μ_k||^2}
     (C_k is the set of examples closest to μ_k).
  -- Recompute the new cluster centers μ_k (mean/centroid of the set C_k): μ_k = (1/|C_k|) ∑_{n∈C_k} x_n.
- Repeat while not converged.
K-means objective function: J(μ, r) = ∑_{n=1}^N ∑_{k=1}^K r_nk ||x_n − μ_k||^2.
Min-link or single-link: results in chaining (clusters can get very large): d(R, S) = min_{x_R∈R, x_S∈S} d(x_R, x_S).
Max-link or complete-link: results in small, round-shaped clusters: d(R, S) = max_{x_R∈R, x_S∈S} d(x_R, x_S).
Average-link: a compromise between single and complete linkage: d(R, S) = (1/(|R||S|)) ∑_{x_R∈R, x_S∈S} d(x_R, x_S).

Regularized Least Squares:

   l(w) = ∑_{n=1}^N [t^(n) − (w_0 + w_1 x^(n))]^2 + α w^T w

Gradient descent: w ← w + 2λ [∑_{n=1}^N (t^(n) − y(x^(n))) x^(n) − α w].
Analytical solution: w = (X^T X + α I)^{−1} X^T t.
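A minimal NumPy sketch of the regularized least squares analytical solution above (the toy data and α value are invented for illustration):

    import numpy as np

    # Toy 1-D regression data (invented for this illustration)
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    t = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

    alpha = 0.1                                   # regularization strength
    X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x] for w = (w0, w1)

    # Analytical solution: w = (X^T X + alpha*I)^{-1} X^T t
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ t)
    print(w)   # regularized (w0, w1); as alpha -> 0 this approaches ordinary least squares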