Sol 4
y_t = W x_t + V s_t
s_{t+1} = y_t
from some initial state s_0, where t denotes the t-th call of the RNN, i.e., x_t is the t-th input.
Figure 1
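A minimal Python sketch of this recurrence, assuming NumPy arrays; the dimensions, the random weights, and the helper name rnn_forward are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Illustrative dimensions and randomly initialized weights.
rng = np.random.default_rng(0)
d_in, d_state = 3, 2
W = rng.normal(size=(d_state, d_in))     # maps the input x_t to the output
V = rng.normal(size=(d_state, d_state))  # maps the state s_t to the output

def rnn_forward(xs, s0):
    """Run the RNN on a sequence of inputs xs, starting from state s0."""
    s = s0
    ys = []
    for x in xs:
        y = W @ x + V @ s  # y_t = W x_t + V s_t
        s = y              # s_{t+1} = y_t
        ys.append(y)
    return ys

xs = [rng.normal(size=d_in) for _ in range(3)]
ys = rnn_forward(xs, s0=np.zeros(d_state))
```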
(a) What is the recurrent state in the RNN from Figure 1? Name one example of data that is more naturally modeled with RNNs than with feedforward neural networks.
(b) Because the state of an RNN changes across successive calls, the loss functions that we use for feedforward neural networks do not yield consistent results. For a given dataset X, propose a loss function (based on the mean squared error) for RNNs and justify your choice.
(c) For a dataset X := (x_t, y_t)_{t=1}^{k} (for some k ∈ N), show how information is propagated by drawing a feedforward neural network that corresponds to the RNN from Figure 1 for k = 3. Recall that a feedforward neural network does not contain nodes with a persistent state. (Hint: unfold the RNN.)
Solution 1:
(a) The recurrent state is denoted s. In this case it coincides with the output. Recurrent models are used to model data with temporal structure, e.g. time series, speech, or sound.
(b) We have data X = {(x_t, y_t)}_{t=1}^{T}, where we assume that the data is ordered temporally. Thus, we define the loss function to be
L(W, V, s_0) = Σ_{t=1}^{T} (y_t − f(x_t, s_{t−1}; W, V))^2,
where f(x_t, s_{t−1}; W, V) = W x_t + V s_{t−1} is the network output at step t and s_{t−1} is the previous recurrent state. The initial state s_0 needs to be specified, and the loss depends on it as well.
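A minimal sketch of evaluating this loss, assuming the RNN from Figure 1 and NumPy arrays; the helper name rnn_loss and the sequence containers are illustrative assumptions.

```python
import numpy as np

def rnn_loss(W, V, s0, xs, ys):
    """Sum of squared errors over the whole sequence, as proposed above.

    The RNN is unrolled from the initial state s0, so the loss depends on
    W, V and s0 jointly.
    """
    s = s0
    loss = 0.0
    for x_t, y_t in zip(xs, ys):
        pred = W @ x_t + V @ s             # f(x_t, s_{t-1}; W, V)
        loss += np.sum((y_t - pred) ** 2)  # squared error for step t
        s = pred                           # state passed to the next step
    return loss
```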
(c) Unfolding the RNN for k = 3 gives a feedforward network with one copy of the update per time step, the output of each copy feeding the state input of the next; see Figure 2.
Figure 2
Problem 2 (Expressiveness of Neural Networks):
In this question we will consider neural networks with sigmoid activation functions of the form
ϕ(z) = 1 / (1 + exp(−z)).
If we denote by v_j^l the value of neuron j at layer l, it is computed as
v_j^l = ϕ( w_0 + Σ_{i ∈ Layer_{l−1}} w_{j,i} v_i^{l−1} ).
In the following questions you will have to design neural networks that compute functions of two Boolean inputs
X1 and X2 . Given that the outputs of the sigmoid units are real numbers Y ∈ (0, 1), we will treat the final
output as Boolean by considering it as 1 if greater than 0.5 and 0 otherwise.
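A small Python helper that evaluates such a unit and applies the 0.5 threshold described above; the function names sigmoid and unit are illustrative assumptions.

```python
import math

def sigmoid(z):
    # The activation phi(z) = 1 / (1 + exp(-z)) from above.
    return 1.0 / (1.0 + math.exp(-z))

def unit(w0, w1, w2, x1, x2):
    # One sigmoid unit on two Boolean inputs, thresholded at 0.5.
    out = sigmoid(w0 + w1 * x1 + w2 * x2)
    return out, int(out > 0.5)

print(unit(-0.5, 1, 1, 1, 0))  # example call
```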
(a) Give 3 weights w0 , w1 , w2 for a single unit with two inputs X1 and X2 that implements the logical OR
function Y = X1 ∨ X2 .
(b) Can you implement the logical AND function Y = X1 ∧ X2 using a single unit? If so, give weights that
achieve this. If not, explain the problem.
(c) It is impossible to implement the XOR function Y = X1 ⊕ X2 using a single unit. However, you can do it using a multi-layer neural network. Use the smallest number of units you can to implement the XOR function.
Draw your network and show all the weights.
(d) Create a neural network with only one hidden layer (of any number of units) that implements
(A ∨ ¬B) ⊕ (¬C ∨ ¬D).
Draw your network and show all the weights.
Solution 2:
(a) We choose the weights w_0 = −0.5, w_1 = 1 and w_2 = 1, and check that the output is the desired OR function. The unit computes
A ∨ B = round(ϕ(w_0 + w_1 A + w_2 B)) = round(ϕ(−0.5 + A + B))
A  B  A∨B  Network output  Rounded
1  1   1     ϕ(1.5)  ≈ 0.82    1
0  1   1     ϕ(0.5)  ≈ 0.62    1
1  0   1     ϕ(0.5)  ≈ 0.62    1
0  0   0     ϕ(−0.5) ≈ 0.38    0
(b) Yes: take w_0 = −1.5, w_1 = 1 and w_2 = 1, and check that the output is the desired AND function. The unit computes
A ∧ B = round(ϕ(w_0 + w_1 A + w_2 B)) = round(ϕ(−1.5 + A + B))
A  B  A∧B  Network output  Rounded
1  1   1     ϕ(0.5)  ≈ 0.62    1
0  1   0     ϕ(−0.5) ≈ 0.38    0
1  0   0     ϕ(−0.5) ≈ 0.38    0
0  0   0     ϕ(−1.5) ≈ 0.18    0
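The two truth tables above can be reproduced numerically from the weights in (a) and (b); a short Python check, with w_1 = w_2 = 1 in both cases.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Bias weights from parts (a) and (b).
for name, w0 in [("OR", -0.5), ("AND", -1.5)]:
    print(name)
    for a, b in [(1, 1), (0, 1), (1, 0), (0, 0)]:
        out = sigmoid(w0 + a + b)
        print(a, b, round(out, 2), int(out > 0.5))
```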
(c) We use a network with one hidden layer containing two units. We find the weights by fixing the weights of the first layer and then choosing the weights of the output unit so that the required inequalities are satisfied.
Figure 3
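Since the weights drawn in Figure 3 are not reproduced in the text, the Python sketch below shows one possible assignment with two hidden units; the weights are scaled by a factor of 10 so that the sigmoids saturate and behave almost like hard thresholds. The specific values are an illustrative assumption, not necessarily those of the figure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """Two hidden units plus one output unit implementing XOR."""
    h1 = sigmoid(10 * (-0.5 + x1 + x2))   # ~ x1 OR x2
    h2 = sigmoid(10 * ( 1.5 - x1 - x2))   # ~ NOT (x1 AND x2)
    out = sigmoid(10 * (-1.5 + h1 + h2))  # ~ h1 AND h2
    return int(out > 0.5)

assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```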
(d) We first rewrite the expression in disjunctive normal form:
(A ∨ ¬B) ⊕ (¬C ∨ ¬D) ⇐⇒ (A ∧ C ∧ D) ∨ (¬A ∧ B ∧ ¬C) ∨ (¬A ∧ B ∧ ¬D) ∨ (¬B ∧ C ∧ D). (1)
Our expression thus decomposes into 4 conjunctions combined by logical ORs. We therefore use 4 hidden units, each modeling one of the AND terms, and an output unit that takes the OR of all of them. Each AND unit can be implemented using the same idea as in the previous parts. In full,
output(A, B, C, D) = H(h_1(A, C, D), h_2(A, B, C), h_3(A, B, D), h_4(B, C, D)), (2)
where h_1, ..., h_4 implement the four conjunctions of Eq. (1) and H implements the OR.
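One possible set of weights for this construction can be checked in Python; the values below are an illustrative assumption rather than those of the drawn network. Each hidden unit implements one conjunction of Eq. (1) with weight +1 for a positive literal, −1 for a negated one, and bias 0.5 minus the number of positive literals; all weights are scaled by 10 so the sigmoids act as hard thresholds.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(bias, weights, inputs, scale=10.0):
    """A sigmoid unit with scaled weights, acting like a hard threshold."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(scale * z)

def net(a, b, c, d):
    # Hidden layer: one unit per AND term of Eq. (1).
    h1 = step(-2.5, [ 1,  0,  1,  1], [a, b, c, d])  #  A and  C and  D
    h2 = step(-0.5, [-1,  1, -1,  0], [a, b, c, d])  # ~A and  B and ~C
    h3 = step(-0.5, [-1,  1,  0, -1], [a, b, c, d])  # ~A and  B and ~D
    h4 = step(-1.5, [ 0, -1,  1,  1], [a, b, c, d])  # ~B and  C and  D
    # Output layer: OR of the four hidden units.
    out = step(-0.5, [1, 1, 1, 1], [h1, h2, h3, h4])
    return int(out > 0.5)

# Check against the target formula (A or not B) xor (not C or not D).
for a, b, c, d in product((0, 1), repeat=4):
    assert net(a, b, c, d) == int((a or not b) != (not c or not d))
```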
Figure 4
(a) Consider a single training example x = [x_1, x_2, x_3] with target output (label) y. Write down the sequence of calculations required to compute the squared error cost (called forward propagation).
(b) A way to reduce the number of parameters to avoid overfitting is to tie certain weights together, so that they share a parameter. Suppose we decide to tie the weights w_1 and w_4, so that w_1 = w_4 = w_tied. What is the derivative of the error E with respect to w_tied, i.e. ∇_{w_tied} E?
(c) For a data set D = {(x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})} consisting of n labeled examples, write the pseudocode of the stochastic gradient descent algorithm with learning rate η_t for optimizing the weight w_tied (assume all the other parameters are fixed).
Solution 3:
Past exam question; a detailed solution is not provided.
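For part (c), a generic sketch of the requested stochastic gradient descent loop is given below. It assumes a hypothetical function grad_wtied(x, y, w_tied) standing in for the derivative asked for in part (b); since the network of Figure 4 is not reproduced here, this is only an illustration, not the official solution.

```python
import random

def sgd_tied_weight(data, w_tied, grad_wtied, eta, num_epochs=10):
    """Generic SGD loop: only w_tied is updated, all other parameters fixed.

    data       : list of (x, y) examples
    grad_wtied : placeholder for the derivative dE/dw_tied from part (b)
    eta        : callable giving the step-dependent learning rate eta_t
    """
    examples = list(data)
    step = 0
    for _ in range(num_epochs):
        random.shuffle(examples)  # visit the examples in random order
        for x, y in examples:
            w_tied -= eta(step) * grad_wtied(x, y, w_tied)  # w <- w - eta_t * dE/dw_tied
            step += 1
    return w_tied
```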