Ain Shams University
Faculty of Engineering
Department of Computer and Systems Engineering
4th Year, Electrical Engineering
2nd Semester, 2015/2016    Course Code: CSE 465    Time Allowed: 3 Hours
A. For the data in the given table and using the information entropy, a decision tree is to be built. (12 Marks)
1. What is the information entropy of the data?
Solution: Since the data is divided evenly between the two classes, the
weighted entropy at the top node is 1.
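The table itself is not reproduced here, but the entropy computation can be sketched as follows (a minimal Python sketch, assuming, as stated, an even split between the two classes):

    import math

    def entropy(class_counts):
        """Information entropy (in bits) of a discrete label distribution."""
        total = sum(class_counts)
        return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

    # Even split between two classes, as in the question -> entropy = 1 bit
    print(entropy([5, 5]))   # 1.0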
A. Suppose you want to train a two-input perceptron on the following classification problem:
Prove, mathematically, that the perceptron cannot learn this task, using inequalities expressed in terms of the
weights w0 , w1 , and w2 . (6 Marks)
Solution: The proof is similar to the proof that perceptrons cannot do XOR. Assume some set of weights (w_0, w_1, w_2) exists
that performs the task correctly. Then the constraint inequalities on w_0 + w_1 x_1 + w_2 x_2 (one per training pattern) are
mutually contradictory, exactly as in the XOR case, so no such weights can exist.
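Since the exam's data table is not reproduced here, the following sketch uses the classic XOR labels as a stand-in; a simple perceptron never reaches zero training errors on such data, no matter how long it runs:

    # Perceptron on XOR-style data: the error count never reaches 0.
    data = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]  # XOR-like labels
    w0, w1, w2 = 0.0, 0.0, 0.0
    eta = 0.1

    for epoch in range(1000):
        errors = 0
        for (x1, x2), t in data:
            y = 1 if w0 + w1 * x1 + w2 * x2 > 0 else -1
            if y != t:
                errors += 1
                w0 += eta * t          # bias treated as a weight on a constant +1 input
                w1 += eta * t * x1
                w2 += eta * t * x2
    print(errors)  # stays > 0: no linear separator (w0, w1, w2) exists for this task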
B. A 1-1-1 feedforward network is trained using the backpropagation (BP) algorithm. The output node has a
linear activation function, f_o(n_o) = 0.5 + 0.01·n_o, where n_o is the net input to the output node. The hidden
node function is f_h(n_h) = 1/(1 + e^{−n_h}), where n_h is the net input to the hidden node. [Remember that a node bias
can be replaced with a weight connected to a constant input with value +1] (12 Marks)
   ∂E/∂w_32 = (∂E/∂f_o) · (∂f_o/∂n_o) · (∂n_o/∂w_32) = −2(t − f_o) · f_o' · f_h      (with f_o' = 0.01)

   Δw_32 = μ(t − f_o) · f_h = μ δ_o · f_h
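A small numerical sketch of one BP step for this 1-1-1 network (the input, target, learning rate, and initial weights below are arbitrary placeholders, not values from the exam):

    import math

    # 1-1-1 network: input -> hidden (sigmoid) -> output (linear f_o(n) = 0.5 + 0.01*n)
    x, t, mu = 1.0, 1.0, 0.5          # hypothetical input, target, learning rate
    w21, b2 = 0.4, 0.1                # input -> hidden weight and hidden bias
    w32, b3 = 0.3, 0.0                # hidden -> output weight and output bias

    # Forward pass
    n_h = w21 * x + b2
    f_h = 1.0 / (1.0 + math.exp(-n_h))       # hidden activation
    n_o = w32 * f_h + b3
    f_o = 0.5 + 0.01 * n_o                   # linear output activation, f_o' = 0.01

    # Backward pass for w32, following the chain rule above with E = (t - f_o)^2
    # (the constants 2 and 0.01 are absorbed into mu in the exam's Δw_32 expression)
    dE_dw32 = -2.0 * (t - f_o) * 0.01 * f_h
    w32 -= mu * dE_dw32
    print(f_o, dE_dw32, w32)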
1. Use hierarchical agglomerative clustering with single linkage to cluster the data. Draw a dendrogram to
illustrate your clustering and include a vertical axis with numerical labels indicating the height of each
parental node in the dendrogram.
Solution:
2. Repeat part (1) using hierarchical agglomerative clustering with complete linkage.
Solution:
3. If two clusters are desired, what data points would be clustered together according to the single linkage
method used in part (1)?
Solution: (10, 20, 40) and (80,...,195)
4. Repeat part (3) using the complete linkage method from part (2).
Solution:
(10,...,121) and (160, 168, 195)
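A quick way to check parts (1)-(4) is a short SciPy sketch (the nine data points are taken from the solutions above):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # 1-D data points, as listed in the solutions
    data = np.array([10, 20, 40, 80, 85, 121, 160, 168, 195], dtype=float).reshape(-1, 1)

    for method in ("single", "complete"):
        Z = linkage(data, method=method)                 # agglomerative clustering
        labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
        print(method, labels)
    # single  : (10, 20, 40) vs (80, ..., 195)
    # complete: (10, ..., 121) vs (160, 168, 195)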
5. Use the K-means algorithm with K = 3 to cluster the data set. Suppose that the points 160, 168, and 195
were selected as the initial cluster means. Work from these initial values to determine the final clustering
for the data.
Solution:
1. (10, 20, 40, 80, 85, 121, 160), (168), (195); means: 73.71, 168, 195
2. (10, 20, 40, 80, 85), (121, 160, 168), (195); means: 47, 149.67, 195
3. Same as step 2, so the algorithm has converged and this is the final clustering.
6. If a different set of three starting means were used, would that result in a different set of final clusters?
Explain.
Solution:
Yes. A different initialization can converge to a different final clustering, e.g.,
(10, 20, 40), (80, 85, 121), (160, 168, 195), with means 23.3333, 95.3333, 174.3333.
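A minimal 1-D K-means sketch (plain Python) that can be used to check parts (5) and (6); the data points and the first initialization are taken from the solutions above, while the second initialization is just an illustrative choice:

    def kmeans_1d(points, means, iters=100):
        """Plain K-means on scalars; returns (clusters, means) after convergence."""
        for _ in range(iters):
            # Assignment step: each point goes to the nearest current mean
            clusters = [[] for _ in means]
            for p in points:
                k = min(range(len(means)), key=lambda j: abs(p - means[j]))
                clusters[k].append(p)
            # Update step: recompute each mean (keep it if its cluster is empty)
            new_means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
            if new_means == means:
                break
            means = new_means
        return clusters, means

    data = [10, 20, 40, 80, 85, 121, 160, 168, 195]
    print(kmeans_1d(data, [160, 168, 195]))        # part (5) initialization
    print(kmeans_1d(data, [23.0, 95.0, 174.0]))    # a different initialization, as in part (6)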
3. Define a class variable y_i ∈ {+1, −1} which denotes the class of x_i and let w = (w_1, w_2, w_3)^T. The max-
   margin SVM classifier solves the following problem:

      min_{w,b} (1/2)||w||^2   subject to   y_i(w^T Φ(x_i) + b) ≥ 1,   i = 1, 2, 3

   Using the method of Lagrange multipliers, show that the solution is w = (0, 0, −2)^T, b = 1, and calculate the
   margin 1/||w||_2.
Solution:
For optimization problems with inequality constraints such as the above, we should apply the KKT conditions,
which generalize the method of Lagrange multipliers. However, this problem can be solved more easily by noting
that we have three vectors in 3-dimensional space and all of them are support vectors. Hence all three
constraints hold with equality, and we can apply the method of Lagrange multipliers to

      min_{w,b} (1/2)||w||^2   subject to   y_i(w^T Φ(x_i) + b) = 1,   i = 1, 2, 3.
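Given the stated solution w = (0, 0, −2)^T, the requested margin follows directly:

      ||w||_2 = √(0^2 + 0^2 + (−2)^2) = 2,   so the margin is 1/||w||_2 = 1/2.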
B. The data points x1 = (−1, −1)t and x2 = (2, 2)t belong to class C1 , and x3 = (1, 1)t and x4 = (−2, −2)t belong
to class C2 . (9 Marks)
1. Calculate the covariance matrix of these data points and the corresponding eigenvectors and eigenvalues.
Solution:

   m_x = (1/4) ∑_{i=1}^4 x_i = 0,   C_x = (1/4) ∑_{i=1}^4 x_i x_i^T = [2.5  2.5; 2.5  2.5]

   Eigenvalues: |λI − C_x| = 0  →  det[λ−2.5  −2.5; −2.5  λ−2.5] = (λ−2.5)^2 − 2.5^2 = 0  →  λ_1 = 5, λ_2 = 0

   Eigenvectors: C_x e_i = λ_i e_i  →  [2.5  2.5; 2.5  2.5](e_11, e_12)^T = 5(e_11, e_12)^T  →  e_11 = e_12
   →  e_1 = (1/√2)(1, 1)^T

The second eigenvector will be normal to e_1, i.e., e_2 = (1/√2)(−1, 1)^T.
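A quick numerical check of these values (a minimal NumPy sketch; the covariance is taken as (1/4)∑ x_i x_i^T about the zero mean, as in the solution):

    import numpy as np

    # The four data points (rows are samples)
    X = np.array([[-1.0, -1.0],
                  [ 2.0,  2.0],
                  [ 1.0,  1.0],
                  [-2.0, -2.0]])

    mean = X.mean(axis=0)                    # -> [0, 0]
    C = (X - mean).T @ (X - mean) / len(X)   # -> [[2.5, 2.5], [2.5, 2.5]]

    # Eigen-decomposition of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    print(mean)        # [0. 0.]
    print(eigvals)     # [0. 5.]  (eigh returns eigenvalues in ascending order)
    print(eigvecs)     # columns are eigenvectors, e.g. (1, 1)/sqrt(2) for eigenvalue 5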
2. If you project the data on the 1st principal eigenvector and discard the projections on the remaining ones,
what would be the mean square error between the original and the new data?
Solution:
Since all data points lie on e_1, the projected data are exactly at the same locations as the original data, and
thus the error will be zero.
3. Using the projected data in part (2), would you be able to separate the two classes? Justify your answer.
Solution:
No. The data will be inseparable as the projection did not change the location of the points.
A. Given a set of data points, {(x_i, y_i), i = 1, · · · , N}, x_i, y_i ∈ R, derive the regression parameters (w_0, w_1) of
the model ŷ_i = w_0 + w_1 x_i, such that the mean square error, E = (1/N) ∑_{i=1}^N (y_i − ŷ_i)^2, is minimized. (6 Marks)
Solution:
We estimate w_0 and w_1 by minimizing the mean squared error:

   E = (1/N) ∑_{j=1}^N (y_j − w_0 − w_1 x_j)^2.

Setting ∂E/∂w_0 = 0 and ∂E/∂w_1 = 0 gives the normal equations

   w_0 + \bar{x} w_1 = \bar{y}
   \bar{x} w_0 + \overline{x^2} w_1 = \overline{xy}
Solving these equations, we get

   \hat{w}_1 = (\overline{xy} − \bar{x}\bar{y}) / (\overline{x^2} − \bar{x}^2) = S_{xy} / S_{xx},

where S_{xy} = N(\overline{xy} − \bar{x}\bar{y}) and S_{xx} = N(\overline{x^2} − \bar{x}^2), and

   \hat{w}_0 = \bar{y} − \bar{x} S_{xy} / S_{xx}.
Alternatively, in vector form,

   (w_0, w_1)^T = (X^T X)^{−1} X^T Y
B. For the data points, {(1, 1)t , (2, 2)t , (3, 1)t }, calculate (w0 , w1 ) and E. Without recalculating the parameters,
can you guess the values of (w0 , w1 ) and E if the point (2, 1)t is added? (6 Marks)
Solution:

   (w_0, w_1)^T = (X^T X)^{−1} X^T Y = (4/3, 0)^T,    E = (1/3)·(2·(1/3)^2 + (2/3)^2) = 2/9

It is clear that the best fit is a horizontal line. When the point (2, 1)^t is added, the fit remains a horizontal
line: if a is the (absolute) error of the three points at y = 1 and b is the error of the point at y = 2, then
a + b = 1 and 3a = b, so a = 1/4. Hence w_0 = 5/4, w_1 = 0, and E = (1/4)·(3·(1/4)^2 + (3/4)^2) = 3/16.
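A minimal NumPy check of both fits (nothing here beyond the closed form (X^T X)^{−1} X^T Y used above):

    import numpy as np

    def fit_line(points):
        """Least-squares fit of y = w0 + w1*x; returns (w0, w1, mse)."""
        x = np.array([p[0] for p in points], dtype=float)
        y = np.array([p[1] for p in points], dtype=float)
        X = np.column_stack([np.ones_like(x), x])      # design matrix [1, x]
        w = np.linalg.solve(X.T @ X, X.T @ y)          # (X^T X)^{-1} X^T y
        mse = np.mean((y - X @ w) ** 2)
        return w[0], w[1], mse

    print(fit_line([(1, 1), (2, 2), (3, 1)]))          # ~ (1.3333, 0.0, 0.2222) = (4/3, 0, 2/9)
    print(fit_line([(1, 1), (2, 2), (3, 1), (2, 1)]))  # ~ (1.25, 0.0, 0.1875)  = (5/4, 0, 3/16)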
C. The linear regressor can be used as a two-way (+1, −1) classifier by applying the function

      P(y_i = +1 | x_i) = 1 / (1 + e^{−(w_0 + w_1 x_i)})

   which calculates the probability of the input x_i being in class (+1). What is the probability of the other
   class? Show that the decision boundary between the two classes is w_0 + w_1 x_i = 0. (6 Marks)
Solution:

   P(y_i = −1 | x_i) = 1 − 1 / (1 + e^{−(w_0 + w_1 x_i)}) = e^{−(w_0 + w_1 x_i)} / (1 + e^{−(w_0 + w_1 x_i)})

For the input x_i to be classified in class (+1) we need P(y_i = +1 | x_i) > P(y_i = −1 | x_i), i.e.,
e^{−(w_0 + w_1 x_i)} < 1, which holds exactly when w_0 + w_1 x_i > 0. The two class probabilities are equal when
e^{−(w_0 + w_1 x_i)} = 1, so the decision boundary between the two classes is w_0 + w_1 x_i = 0.
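A tiny illustration of this boundary (sketch only; the weights below are arbitrary placeholders, not values from the exam):

    import math

    def p_positive(x, w0, w1):
        """P(y = +1 | x) for the logistic model above."""
        return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

    w0, w1 = -2.0, 1.0          # hypothetical weights; boundary at x = -w0/w1 = 2
    for x in [0.0, 2.0, 4.0]:
        p = p_positive(x, w0, w1)
        label = +1 if w0 + w1 * x > 0 else -1
        print(x, round(p, 3), label)   # p crosses 0.5 exactly at the boundary x = 2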
Subject to ∑_{i=1}^n α_i y_i = 0, α_i ≥ 0 ∀i.
w = ∑_{i=1}^n α_i y_i x_i. b can be obtained from a positive support vector x_psv, knowing that w^T x_psv + b = 1.
A new point z is classified by sign(w^T z + b) = sign(∑_{i∈SV} α_i y_i x_i^T z + b).

The old formulation:
Find w and b such that Φ(w) = (1/2) w^T w is minimized, and for all {(x_i, y_i)}: y_i(w^T x_i + b) ≥ 1.

The new formulation incorporating slack variables:
Find w and b such that Φ(w) = (1/2) w^T w + C ∑ ξ_i is minimized, and for all {(x_i, y_i)}:
y_i(w^T x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for all i.
Parameter C can be viewed as a way to control overfitting.

The dual problem for soft margin classification:
Find α_1, ..., α_N such that

   Q(α) = ∑_i α_i − (1/2) ∑_{i=1}^n ∑_{j=1}^n α_i α_j y_i y_j x_i^T x_j

is maximized, and ∑_{i=1}^n α_i y_i = 0 and 0 ≤ α_i ≤ C for all α_i.

Gradient Descent:
We can obtain w by: for i = 0 to n: w_i ← w_i + Δw_i, where Δw_i = −η ∂J/∂w_i.

Batch Gradient Descent: after each epoch, compute the average loss over the training set. Find w that
minimizes the "loss" function, where M is the number of training examples and L is the "confidence" of an
incorrect prediction:

   J(w) = (1/M) ∑_{k=1}^M L(w · x^k, t^k) = (1/M) ∑_{k=1}^M max(0, −(w · x^k) t^k)

   ∇_θ R = (1/2N) ∑_{i=0}^{N−1} 2 (t_i − g(θ^T x_i)) (−1) g'(θ^T x_i) x_i = 0

Stochastic Gradient Descent:

   J_k(w) = max(0, −(w · x^k) t^k)

   ∂J_k/∂w_i = 0 if (w · x^k) t^k > 0, that is, x^k was classified correctly; −x_i^k t^k otherwise.

After processing training example k, if the perceptron misclassified the example: Δw_i = −η ∂J_k/∂w_i = η x_i^k t^k.

KNN (K-nearest-neighbour):
Steps:
1. For an input instance X, loop over the stored labeled training data and calculate the distance from X using
   a distance function.
2. Select the K nearest instances and let each instance vote for its class.
3. X is classified to belong to the class with the highest number of votes.

Distance Functions: 1) Euclidean: √(∑_{i=1}^k (x_i − y_i)^2)  2) Manhattan: ∑_{i=1}^k |x_i − y_i|
3) Minkowski: (∑_{i=1}^k |x_i − y_i|^q)^{1/q}

Neural network forward pass:
For each node j in the hidden layer: h_j = S(∑_{i ∈ input layer} w_ji x_i + w_j0).
For each node k in the output layer: o_k = S(∑_{j ∈ hidden layer} w_kj h_j + w_k0).
x_i: activation of input node i. h_j: activation of hidden node j. o_k: activation of output node k.
w_ji: weight from node i to node j.
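As an illustration of the KNN steps above, a minimal sketch in plain Python (the toy points and the value of k are made up for the example):

    from collections import Counter
    import math

    def knn_classify(x, training_data, k=3):
        """training_data: list of (feature_vector, label). Returns majority label of the k nearest."""
        # 1. Compute the Euclidean distance from x to every stored training instance.
        dists = [(math.dist(x, xi), yi) for xi, yi in training_data]
        # 2. Take the k nearest instances and let each vote for its class.
        nearest = sorted(dists, key=lambda d: d[0])[:k]
        # 3. Return the class with the highest number of votes.
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    # Hypothetical 1-D toy data, in the spirit of the clustering examples in this exam.
    train = [((10,), 'A'), ((20,), 'A'), ((40,), 'A'), ((160,), 'B'), ((168,), 'B'), ((195,), 'B')]
    print(knn_classify((30,), train, k=3))   # 'A'
    print(knn_classify((170,), train, k=3))  # 'B'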
PCA Algorithm:
Mean of the data: x̄ = (1/N) ∑_{n=1}^N x_n. Sample covariance matrix: S = (1/N) ∑_{n=1}^N (x_n − x̄)(x_n − x̄)^T.
Do the eigenvalue decomposition of the D × D matrix S.
Take the top K eigenvectors (corresponding to the top K eigenvalues); call these u_1, ..., u_K
(s.t. λ_1 ≥ λ_2 ≥ ... ≥ λ_{K−1} ≥ λ_K).
U = [u_1 u_2 ... u_K] is the projection matrix of size D × K. The projection of each example x_n is computed as
z_n = U^T x_n; z_n is a K × 1 vector (also called the embedding of x_n).

PCA:
Projection of a data point x_n along u_1: u_1^T x_n.
Projection of the mean x̄ along u_1: u_1^T x̄, where x̄ = (1/N) ∑_{n=1}^N x_n.
Variance of the projected data (along projection direction u_1):

   (1/N) ∑_{n=1}^N {u_1^T x_n − u_1^T x̄}^2 = u_1^T S u_1,

where S is the data covariance matrix defined as S = (1/N) ∑_{n=1}^N (x_n − x̄)(x_n − x̄)^T.
Objective function: u_1^T S u_1 + λ_1(1 − u_1^T u_1). Taking the derivative w.r.t. u_1 and setting it to zero
gives S u_1 = λ_1 u_1. This is the eigenvalue equation, so u_1 must be an eigenvector of S (and λ_1 the
corresponding eigenvalue).
But there are multiple eigenvectors of S; which one is u_1? Consider u_1^T S u_1 = u_1^T λ_1 u_1 = λ_1 (using
u_1^T u_1 = 1). We know that the projected data variance u_1^T S u_1 = λ_1 is maximized, so λ_1 should be the
largest eigenvalue and u_1 the first (top) eigenvector of S (with eigenvalue λ_1) => the first principal
component (the direction of highest variance in the data).
Subsequent PCs are given by the subsequent eigenvectors of S.
S = (1/N) X X^T (assuming centered data, and X being D × N). The relationship is u_i = (1/√(N λ_i)) X v_i, where
{λ_i, v_i} is an eigenvalue-eigenvector pair of the N × N matrix X^T X, and u_i is the corresponding eigenvector
of S = (1/N) X X^T (the one we want).

Rest of NN:
Error (or "loss") E is the sum-squared error over all output units:

   E(w) = (1/2) ∑_{p=1}^{no. of patterns} ∑_{k=1}^{no. of outputs} (t_k − o_k)^2

Weight Decay: Δw_ji^t = η δ_j x_ji + α Δw_ji^{t−1} − λ w_ji^{t−1}, where λ is a parameter between 0 and 1 and α
is the momentum.

Clustering:
K-means algorithm:
- Iterate:
  -- Assign each example x_n to its closest cluster center: C_k = {n : k = arg min_k ||x_n − μ_k||^2}
     (C_k is the set of examples closest to μ_k).
  -- Recompute the new cluster centers μ_k (mean/centroid of the set C_k): μ_k = (1/|C_k|) ∑_{n∈C_k} x_n.
- Repeat while not converged.
K-means objective function: J(μ, r) = ∑_{n=1}^N ∑_{k=1}^K r_nk ||x_n − μ_k||^2.
Min-link or single-link: results in chaining (clusters can get very large): d(R, S) = min_{x_R∈R, x_S∈S} d(x_R, x_S).
Max-link or complete-link: results in small, round-shaped clusters: d(R, S) = max_{x_R∈R, x_S∈S} d(x_R, x_S).
Average-link: a compromise between single and complete linkage: d(R, S) = (1/(|R||S|)) ∑_{x_R∈R, x_S∈S} d(x_R, x_S).

Regularized Least Squares:

   l(w) = ∑_{n=1}^N [t^(n) − (w_0 + w_1 x^(n))]^2 + α w^T w

Gradient descent: w ← w + 2λ [∑_{n=1}^N (t^(n) − y(x^(n))) x^(n) − α w].
Analytical solution: w = (X^T X + α I)^{−1} X^T t.
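A minimal NumPy sketch of the regularized least squares analytical solution above (the toy data and α value are invented for illustration):

    import numpy as np

    # Toy 1-D regression data (invented for this illustration)
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    t = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

    alpha = 0.1                                   # regularization strength
    X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x] for w = (w0, w1)

    # Analytical solution: w = (X^T X + alpha*I)^{-1} X^T t
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ t)
    print(w)   # regularized (w0, w1); as alpha -> 0 this approaches ordinary least squares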