
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Connecting Machine Learning with Shallow Neural Networks

Neural Networks and Deep Learning, Springer, 2018

Chapter 2, Section 2.1
Neural Networks and Machine Learning

• Neural networks are optimization-based learning models.

• Many classical machine learning models use continuous optimization:

– SVMs, Linear Regression, and Logistic Regression

– Singular Value Decomposition

– (Incomplete) Matrix factorization for Recommender Systems

• All these models can be represented as special cases of shallow neural networks!
The Continuum Between Machine Learning and Deep Learning

[Figure: accuracy versus amount of data; conventional machine learning plateaus early, while deep learning continues to improve as the amount of data grows.]

• Classical machine learning models reach their learning capacity early because they are simple neural networks.

• When we have more data, we can add more computational units to improve performance.
The Deep Learning Advantage

• Exploring the neural models for traditional machine learning is useful because it exposes the cases in which deep learning has an advantage.

– Add capacity with more nodes for more data.

– Controlling the structure of the architecture provides a way to incorporate domain-specific insights (e.g., recurrent networks and convolutional networks).

• In some cases, making minor changes to the architecture leads to interesting models:

– Adding a sigmoid/softmax layer in the output of a neural model for (linear) matrix factorization can result in logistic/multinomial matrix factorization (e.g., word2vec).
Recap: Perceptron versus Linear Support Vector Machine

[Figure: two single-layer networks with linear activation, weights W, input nodes X, and a single output node; (a) the perceptron, trained with the perceptron criterion (a smooth surrogate), and (b) the linear SVM, trained with hinge loss.]

Perceptron loss = max{0, −y(W · X)}        SVM loss = max{0, 1 − y(W · X)}

• The perceptron criterion is a minor variation of hinge loss, with the identical update W ⇐ W + αyX in both cases.

• We update only for misclassified instances in the perceptron, but also for "marginally correct" instances in the SVM (see the sketch below).
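A minimal numpy sketch (function and variable names are my own, not from the slides) contrasting the two update rules; the toy point is correctly classified but lies inside the margin, so only the SVM updates:

```python
import numpy as np

def perceptron_update(W, X, y, lr=0.1):
    """Update only when the instance is misclassified (y * W.X <= 0)."""
    if y * np.dot(W, X) <= 0:
        W = W + lr * y * X
    return W

def svm_update(W, X, y, lr=0.1):
    """Update also for 'marginally correct' instances (y * W.X < 1)."""
    if y * np.dot(W, X) < 1:
        W = W + lr * y * X
    return W

# A point that is correctly classified but inside the margin:
W = np.array([0.2, 0.1])
X = np.array([1.0, 1.0])
y = +1                              # y * W.X = 0.3: correct, but margin < 1
print(perceptron_update(W, X, y))   # unchanged: [0.2 0.1]
print(svm_update(W, X, y))          # updated:   [0.3 0.2]
```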
Perceptron Criterion versus Hinge Loss

[Figure: loss value plotted against W · X for a positive-class instance, comparing the perceptron criterion with the hinge loss.]

• Loss for a positive-class training instance at varying values of W · X.
What About the Kernel SVM?

[Figure: an RBF network; the input layer (x1, x2, x3) feeds a hidden layer with RBF activations and a hidden-layer bias neuron, whose outputs feed a single output node y.]

• RBF network for unsupervised feature engineering (a small sketch follows).

– Unsupervised feature engineering is good for noisy data.

– Supervised feature engineering (with deep learning) is good for learning rich structure.
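A small sketch of the idea under simplifying assumptions of mine: the prototypes are sampled at random from the data (an unsupervised step; k-means would be the more common choice), and a linear least-squares output node is then trained on the RBF features. All names are illustrative.

```python
import numpy as np

def rbf_features(X, prototypes, gamma=1.0):
    """Hidden-layer activations: one RBF unit per (unsupervised) prototype."""
    # Squared distances between every point and every prototype
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Toy data: prototypes picked without looking at labels (the unsupervised step)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
prototypes = X[rng.choice(len(X), size=10, replace=False)]

H = rbf_features(X, prototypes)        # 100 x 10 engineered features
# A linear output node (here trained by least squares) sits on top of H
y = np.sign(X[:, 0] * X[:, 1])         # XOR-like labels, not linearly separable in X
W = np.linalg.lstsq(H, y, rcond=None)[0]
print((np.sign(H @ W) == y).mean())    # usually well above chance for this pattern
```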
Much of Machine Learning is a Shallow Neural Model

• By minor changes to the architecture of the perceptron we can get:

– Linear regression, Fisher discriminant, and Widrow-Hoff learning ⇒ Linear activation in output node

– Logistic regression ⇒ Sigmoid activation in output node

• Multinomial logistic regression ⇒ Softmax activation in final layer

• Singular value decomposition ⇒ Linear autoencoder

• Incomplete matrix factorization for Recommender Systems ⇒ Autoencoder-like architecture with a single hidden layer (also used in word2vec)
Why do We Care about Connections?

• The connections tell us about the cases in which it makes sense to use conventional machine learning:

– If you have limited, noisy data, you want to use conventional machine learning.

– If you have a lot of data with rich structure, you want to use neural networks.

– Structure is often learned by using deep neural architectures.

• Architectures like convolutional neural networks can use domain-specific insights.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Neural Models for Linear Regression, Classification, and the Fisher Discriminant
[Connections with Widrow-Hoff Learning]

Neural Networks and Deep Learning, Springer, 2018

Chapter 2, Section 2.2
Widrow-Hoff Rule: The Neural Avatar of Linear Regression

• The perceptron (1958) was historically followed by Widrow-Hoff learning (1960).

• It is identical to linear regression when applied to numerical targets.

– It was originally proposed by Widrow and Hoff for binary targets (not natural for regression).

• The Widrow-Hoff method, when applied to mean-centered features and a mean-centered binary class encoding, learns the Fisher discriminant.

• We first discuss linear regression for numerical targets and then visit the case of binary classes.
Linear Regression: An Introduction

• In linear regression, we have training pairs (X_i, yi) for i ∈ {1 . . . n}, so that X_i contains d-dimensional features and yi contains a numerical target.

• We use a linear parameterized function to predict ŷi = W · X_i.

• The goal is to learn W so that the sum of squared differences between the observed yi and the predicted ŷi is minimized over the entire training data.

• A solution exists in closed form, but it requires the inversion of a potentially large matrix.

• Gradient descent is typically used anyway.


Linear Regression with Numerical Targets: Neural Model

[Figure: a single-layer network; the input nodes X feed an output node with linear activation through weights W, and the squared loss (y − W · X)^2 is computed at the output.]

• The predicted output is ŷi = W · X_i and the loss is Li = (yi − ŷi)^2.

• The gradient-descent update is W ⇐ W − α ∂Li/∂W = W + α(yi − ŷi)X_i (a small sketch follows).
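A minimal sketch of this stochastic update on synthetic data; the learning rate, epoch count, and names are arbitrary choices of mine:

```python
import numpy as np

def squared_loss_step(W, X_i, y_i, lr=0.01):
    """One stochastic gradient step for the squared loss (y - W.X)^2."""
    y_hat = np.dot(W, X_i)
    return W + lr * (y_i - y_hat) * X_i

# Toy regression problem: y = X @ w_true + small noise
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)

W = np.zeros(3)
for epoch in range(20):
    for X_i, y_i in zip(X, y):
        W = squared_loss_step(W, X_i, y_i)
print(W)   # close to [2.0, -1.0, 0.5]
```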
Widrow-Hoff: Linear Regression with Binary Targets

• For yi ∈ {−1, +1}, we use the same loss (yi − ŷi)^2 and the update W ⇐ W + α(yi − ŷi)X_i, where (yi − ŷi) is the "delta."

– When applied to binary targets, it is referred to as the delta rule.

– The perceptron uses the same update with ŷi = sign{W · X_i}, whereas Widrow-Hoff uses ŷi = W · X_i.

• Potential drawback: retrogressive treatment of well-separated points, caused by the pretension that binary targets are real-valued.

– If yi = +1 and W · X_i = 10^6, the point will be heavily penalized for strongly correct classification!

– This does not happen in the perceptron.


Comparison of Widrow-Hoff with Perceptron and SVM

• Convert the binary loss functions and updates to a form more easily comparable to the perceptron, using yi^2 = 1:

• Loss of (X_i, yi) is (yi − W · X_i)^2 = (1 − yi[W · X_i])^2

  Update: W ⇐ W + αyi(1 − yi[W · X_i])X_i

  Perceptron:            Loss = max{−yi(W · X_i), 0}
                         Update: W ⇐ W + αyi I(−yi[W · X_i] > 0) X_i
  L1-Loss SVM:           Loss = max{1 − yi(W · X_i), 0}
                         Update: W ⇐ W + αyi I(1 − yi[W · X_i] > 0) X_i
  Widrow-Hoff:           Loss = (1 − yi(W · X_i))^2
                         Update: W ⇐ W + αyi(1 − yi[W · X_i]) X_i
  Hinton's L2-Loss SVM:  Loss = max{1 − yi(W · X_i), 0}^2
                         Update: W ⇐ W + αyi max{(1 − yi[W · X_i]), 0} X_i
Some Interesting Historical Facts

• Hinton proposed the SVM L2-loss three years before Cortes and Vapnik's paper on SVMs.

– G. Hinton. Connectionist learning procedures. Artificial Intelligence, 40(1–3), pp. 185–234, 1989.

– Hinton's L2-loss was proposed to address some of the weaknesses of loss functions like linear regression on binary targets.

– When used with L2-regularization, it behaves identically to an L2-SVM, but the connection with the SVM was overlooked.

• The Widrow-Hoff rule is also referred to as ADALINE, LMS (least mean-square method), the delta rule, and least-squares classification.
Connections with Fisher Discriminant

• Consider a binary classification problem with training instances (X_i, yi) and yi ∈ {−1, +1}.

– Mean-center each feature vector as X_i − μ.

– Mean-center the binary class by subtracting (Σ_{i=1}^n yi)/n from each yi.

• Use the delta rule W ⇐ W + α(yi − ŷi)X_i for learning, where (yi − ŷi) is the "delta."

• The learned vector is the Fisher discriminant!

– Proof in Christopher Bishop's book on machine learning (a small numerical check follows).
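A small numerical check of this claim (my own construction, not from the slides): the closed-form least-squares solution on the mean-centered problem, which is what the delta rule converges to, is compared against the classical Fisher direction S_W^{-1}(μ+ − μ−); the two directions agree up to scaling.

```python
import numpy as np

rng = np.random.default_rng(2)
X_pos = rng.normal(loc=[2.0, 1.0], size=(200, 2))
X_neg = rng.normal(loc=[-1.0, 0.5], size=(300, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(200), -np.ones(300)])

# Mean-center the features and the binary targets
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Closed-form least-squares solution (what the delta rule converges to)
W_ls = np.linalg.lstsq(Xc, yc, rcond=None)[0]

# Classical Fisher direction: S_W^{-1} (mu_+ - mu_-)
mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
S_W = np.cov(X_pos, rowvar=False) * (len(X_pos) - 1) + \
      np.cov(X_neg, rowvar=False) * (len(X_neg) - 1)
W_fisher = np.linalg.solve(S_W, mu_p - mu_n)

# The two directions coincide up to a positive scaling (cosine similarity ~ 1)
print(np.dot(W_ls, W_fisher) /
      (np.linalg.norm(W_ls) * np.linalg.norm(W_fisher)))
```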


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Neural Models for Logistic Regression

Neural Networks and Deep Learning, Springer, 2018


Chapter 2, Section 2.2
Logistic Regression: A Probabilistic Model

• Consider the training pair (X_i, yi) with d-dimensional feature variables in X_i and class variable yi ∈ {−1, +1}.

• In logistic regression, the sigmoid function is applied to W · X_i, which predicts the probability that yi is +1:

  ŷi = P(yi = 1) = 1/(1 + exp(−W · X_i))

• We want to maximize ŷi for positive-class instances and 1 − ŷi for negative-class instances.

– Same as minimizing −log(ŷi) for positive-class instances and −log(1 − ŷi) for negative instances.

– Same as minimizing the loss Li = −log(|yi/2 − 0.5 + ŷi|).

– Alternative form of the loss: Li = log(1 + exp[−yi(W · X_i)])


Maximum-Likelihood Objective Functions

• Why did we use the negative logarithms?

• Logistic regression is an example of a maximum-likelihood objective function.

• Our goal is to maximize the product of the probabilities of correct classification over all training instances.

– Same as minimizing the sum of the negative log probabilities.

– Loss functions are always additive over training instances.

– So we are really minimizing Σ_i −log(|yi/2 − 0.5 + ŷi|), which can be shown to equal Σ_i log(1 + exp[−yi(W · X_i)]).
Logistic Regression: Neural Model

[Figure: a single-layer network with sigmoid activation at the output node; ŷ is the predicted probability of the class +1, y is the observed value (+1 or −1), and the log-likelihood loss is −log(|y/2 − 0.5 + ŷ|).]

• The predicted output is ŷi = 1/(1 + exp(−W · X_i)) and the loss is Li = −log(|yi/2 − 0.5 + ŷi|) = log(1 + exp[−yi(W · X_i)]).

– The gradient-descent update is W ⇐ W − α ∂Li/∂W, which works out to (see the sketch below):

  W ⇐ W + α yi X_i / (1 + exp[yi(W · X_i)])
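A minimal sketch of the resulting stochastic gradient updates on a toy separable data set; the learning rate, epoch count, and names are arbitrary choices of mine.

```python
import numpy as np

def logistic_sgd_step(W, X_i, y_i, lr=0.1):
    """One step of the update W <- W + lr * y*X / (1 + exp(y * W.X))."""
    mistake_prob = 1.0 / (1.0 + np.exp(y_i * np.dot(W, X_i)))  # P(mistake on this point)
    return W + lr * mistake_prob * y_i * X_i

def logistic_loss(W, X_i, y_i):
    """L_i = log(1 + exp(-y_i * W.X_i))."""
    return np.log1p(np.exp(-y_i * np.dot(W, X_i)))

# Toy separable data
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2)) + np.array([1.0, 1.0])
y = np.where(X @ np.array([1.0, -1.0]) > 0, 1, -1)

W = np.zeros(2)
for epoch in range(30):
    for X_i, y_i in zip(X, y):
        W = logistic_sgd_step(W, X_i, y_i)
print(W, np.mean([logistic_loss(W, a, b) for a, b in zip(X, y)]))
```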
Interpreting the Logistic Update

• An important multiplicative factor in the update increment is 1/(1 + exp[yi(W · X_i)]).

• This factor is 1 − ŷi for positive instances and ŷi for negative instances ⇒ the probability of a mistake!

• Interpret as: W ⇐ W + α [Probability of mistake on (X_i, yi)] (yi X_i)
Comparing Updates of Different Models

• The unregularized updates of the perceptron, SVM, Widrow-Hoff, and logistic regression can all be written in the following form:

  W ⇐ W + α yi δ(X_i, yi) X_i

• The quantity δ(X_i, yi) is a mistake function, which is:

– The raw mistake value (1 − yi(W · X_i)) for Widrow-Hoff.

– A mistake indicator of whether (0 − yi(W · X_i)) > 0 for the perceptron.

– A margin/mistake indicator of whether (1 − yi(W · X_i)) > 0 for the SVM.

– The probability of a mistake on (X_i, yi) for logistic regression (see the sketch below).
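A compact sketch of this unified view; the model names, dispatch function, and toy point are my own, and the mistake functions follow the list above.

```python
import numpy as np

def delta(model, W, X, y):
    """Mistake function delta(X, y) in the unified update W <- W + lr * y * delta * X."""
    margin = y * np.dot(W, X)
    if model == "perceptron":          # indicator of a mistake
        return float(margin < 0)
    if model == "svm":                 # indicator of a margin violation
        return float(margin < 1)
    if model == "widrow_hoff":         # raw (signed) mistake value
        return 1.0 - margin
    if model == "logistic":            # probability of a mistake
        return 1.0 / (1.0 + np.exp(margin))
    raise ValueError(model)

def unified_step(model, W, X, y, lr=0.1):
    return W + lr * y * delta(model, W, X, y) * X

W = np.array([0.5, -0.5])
X = np.array([1.0, 2.0])
y = +1                                  # margin = -0.5: a misclassified point
for m in ["perceptron", "svm", "widrow_hoff", "logistic"]:
    print(m, unified_step(m, W, X, y))
```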


Comparing Loss Functions of Different Models

[Figure: penalty plotted against the prediction W · X for a positive-class instance, comparing the perceptron surrogate, Widrow-Hoff/Fisher, SVM hinge, and logistic losses; the decision boundary at W · X = 0 separates incorrect from correct predictions.]

• The loss functions are similar (note the Widrow-Hoff retrogression for strongly correct predictions).


Other Comments on Logistic Regression

• Many classical neural models use repeated computational units with logistic and tanh activation functions in hidden layers.

• One can view these methods as feature engineering models that stack multiple logistic regression models.

• The stacking of multiple models creates inherently more powerful models than their individual components.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

The Softmax Activation Function and Multinomial Logistic Regression

Neural Networks and Deep Learning, Springer, 2018

Chapter 2, Section 2.3
Binary Classes versus Multiple Classes

• All the models discussed so far address only the binary-class setting, in which the class label is drawn from {−1, +1}.

• Many natural applications contain multiple classes without a natural ordering among them:

– Predicting the category of an image (e.g., truck, carrot).

– Language models: predict the next word in a sentence.

• Models like logistic regression are naturally designed to predict two classes.
Generalizing Logistic Regression

• Logistic regression produces probabilities of the two outcomes of a binary class.

• Multinomial logistic regression produces probabilities of multiple outcomes.

– In order to produce probabilities of multiple classes, we need an activation function with a vector output of probabilities.

– The softmax activation function is a vector-based generalization of the sigmoid activation used in logistic regression.

• Multinomial logistic regression is also referred to as the softmax classifier.
The Softmax Activation Function

• The softmax activation function is a natural vector-centric generalization of the scalar-to-scalar sigmoid activation ⇒ a vector-to-vector function.

• Logistic sigmoid activation: Φ(v) = 1/(1 + exp(−v)).

• Softmax activation: Φ(v1 . . . vk) = [exp(v1) . . . exp(vk)] / Σ_{i=1}^k exp(vi)

– The k outputs (probabilities) sum to 1.

• The binary case of using sigmoid(v) is identical to using a 2-element softmax activation with arguments (v, 0) (see the sketch below).

– Multinomial logistic regression with a 2-element softmax is equivalent to binary logistic regression.
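A short sketch of the activation and of the sigmoid/softmax equivalence; the max-shift inside the softmax is a standard numerical-stability trick, not something the slides discuss.

```python
import numpy as np

def softmax(v):
    """Softmax activation: exponentiate and normalize (shift by max for stability)."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

v = 1.3
# sigmoid(v) equals the first component of a 2-way softmax over (v, 0)
print(sigmoid(v), softmax(np.array([v, 0.0]))[0])   # both ~0.7858
print(softmax(np.array([2.0, 1.0, 0.1])).sum())     # probabilities sum to 1
```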
Loss Functions for Softmax

• Recall that we use the negative logarithm of the probability of the observed class in binary logistic regression.

– Natural generalization to multiple classes.

– Cross-entropy loss: the negative logarithm of the probability of the correct class.

– The probability distribution among the incorrect classes has no effect.

• Softmax activation is used almost exclusively in the output layer and is (almost) always paired with cross-entropy loss.
Cross-Entropy Loss of Softmax

• Like the binary logistic case, the loss L is a negative log probability.

  Softmax probability vector: [ŷ1, ŷ2, . . . , ŷk] = [exp(v1) . . . exp(vk)] / Σ_{i=1}^k exp(vi)

• The loss is −log(ŷc), where c ∈ {1 . . . k} is the correct class of that training instance.

• The cross-entropy loss equals −vc + log[Σ_{j=1}^k exp(vj)].
Loss Derivative of Softmax

• Since softmax is almost always paired with the cross-entropy loss L, we can directly estimate ∂L/∂vr for each pre-activation value v1 . . . vk.

• Differentiate the loss value −vc + log[Σ_{j=1}^k exp(vj)].

• Like the sigmoid derivative, the result is best expressed in terms of the post-activation values ŷ1 . . . ŷk.

• The loss derivative of the softmax is as follows (a numerical check is sketched below):

  ∂L/∂vr = ŷr − 1   if r is the correct class
  ∂L/∂vr = ŷr       if r is not the correct class
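A small sketch (function names are mine) that computes the cross-entropy loss from the pre-activations and checks the stated derivative against a finite-difference estimate.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def cross_entropy_and_grad(v, c):
    """Loss -log(y_hat_c) and its gradient with respect to the pre-activations v."""
    y_hat = softmax(v)
    loss = -np.log(y_hat[c])
    grad = y_hat.copy()
    grad[c] -= 1.0          # dL/dv_r = y_hat_r - 1 for the correct class, y_hat_r otherwise
    return loss, grad

v = np.array([2.0, 1.0, 0.1])
loss, grad = cross_entropy_and_grad(v, c=0)
print(loss, grad)

# Finite-difference check of the derivative formula
eps = 1e-6
num = [(cross_entropy_and_grad(v + eps * np.eye(3)[r], 0)[0] - loss) / eps for r in range(3)]
print(num)                  # matches grad up to ~1e-6
```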
Multinomial Logistic Regression

[Figure: multinomial logistic regression as a single-layer network; the input X feeds k linear units with scores vi = W_i · X, a softmax layer converts the scores to probabilities ŷi = exp(vi)/Σ_j exp(vj), and the loss is −log(ŷc) for the true class c (class 2 in the figure).]

• The ith training instance is (X_i, c(i)), where c(i) ∈ {1 . . . k} is the class index ⇒ learn k parameter vectors W_1 . . . W_k.

– Define the real-valued score vr = W_r · X_i for the rth class.

– Convert the scores to probabilities ŷ1 . . . ŷk with softmax activation on v1 . . . vk ⇒ hard or soft prediction.
Computing the Derivative of the Loss

• The cross-entropy loss for the ith training instance is Li = −log(ŷc(i)).

• For gradient descent, we need to compute ∂Li/∂W_r.

• Using the chain rule of differential calculus, and noting that ∂vr/∂W_r = X_i, we get:

  ∂Li/∂W_r = Σ_j (∂Li/∂vj)(∂vj/∂W_r) = (∂Li/∂vr)(∂vr/∂W_r) + zero terms

  ∂Li/∂W_r = −X_i (1 − ŷr)   if r = c(i)
  ∂Li/∂W_r = X_i ŷr          if r ≠ c(i)
Gradient Descent Update

• Each separator W_r is updated using the gradient:

  W_r ⇐ W_r − α ∂Li/∂W_r

• Substituting the gradient from the previous slide, we obtain (see the sketch below):

  W_r ⇐ W_r + α X_i (1 − ŷr)   if r = c(i)
  W_r ⇐ W_r − α X_i ŷr         if r ≠ c(i)
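A minimal sketch of these updates on a toy 3-class problem; the data, learning rate, and names are arbitrary choices of mine.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def multinomial_lr_step(W, X_i, c_i, lr=0.1):
    """W is a k x d matrix of separators; c_i is the index of the true class."""
    y_hat = softmax(W @ X_i)                 # scores v_r = W_r . X_i -> probabilities
    grad_v = y_hat.copy()
    grad_v[c_i] -= 1.0                       # dL/dv_r
    return W - lr * np.outer(grad_v, X_i)    # W_r <- W_r - lr * (dL/dv_r) * X_i

# Toy 3-class problem in 2 dimensions
rng = np.random.default_rng(4)
means = np.array([[2.0, 0.0], [-2.0, 1.0], [0.0, -2.0]])
X = np.vstack([rng.normal(loc=m, size=(100, 2)) for m in means])
c = np.repeat([0, 1, 2], 100)

W = np.zeros((3, 2))
for epoch in range(20):
    for X_i, c_i in zip(X, c):
        W = multinomial_lr_step(W, X_i, c_i)
print((np.argmax(X @ W.T, axis=1) == c).mean())   # training accuracy, well above chance
```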
Summary

• The book also contains details of the multiclass Perceptron and the Weston-Watkins SVM.

• Multinomial logistic regression is a direct generalization of logistic regression.

• If we apply the softmax classifier with two classes, we obtain W_1 = −W_2, which is the same separator as obtained in logistic regression.

• Cross-entropy loss and softmax are almost always paired in the output layer (for all types of architectures).

– Many of the calculus derivations in the previous slides are repeatedly used in different settings.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

The Autoencoder for Unsupervised Representation Learning

Neural Networks and Deep Learning, Springer, 2018

Chapter 2, Section 2.5
Unsupervised Learning

• The models we have discussed so far use training pairs of the form (X, y), in which the feature variables X and the target y are clearly separated.

– The target variable y provides the supervision for the learning process.

• What happens when we do not have a target variable?

– We want to capture a model of the training data without the guidance of a target.

– This is an unsupervised learning problem.


Example

• Consider a 2-dimensional data set in which all points are distributed on the circumference of an origin-centered circle.

• All points in the first and third quadrants belong to class +1, and the remaining points belong to class −1.

– The class variable provides focus to the learning process of the supervised model.

– An unsupervised model needs to recognize the circular manifold without being told up front.

– The unsupervised model can represent the data in only 1 dimension (the angular position).

• The best way of modeling is data-set dependent ⇒ lack of supervision causes problems.
Unsupervised Models and Compression

• Unsupervised models are closely related to compression because compression captures a model of the regularities in the data.

– Generative models represent the data in terms of a compressed parameter set.

– Clustering models represent the data in terms of cluster statistics.

– Matrix factorization represents the data in terms of low-rank approximations (compressed matrices).

• An autoencoder also provides a compressed representation of the data.
Defining the Input and Output of an Autoencoder

[Figure: an autoencoder with input layer x1 . . . x5, a constricted hidden layer whose outputs provide the reduced representation, and an output layer x'1 . . . x'5 that reconstructs the input.]

• All neural networks work with input-output pairs.

– In a supervised problem, the output is the label.

• In the autoencoder, the output values are the same as the inputs: a replicator neural network.

– The loss function penalizes a training instance depending on how far the reconstruction is from the input (e.g., squared loss).
Encoder and Decoder

[Figure: an autoencoder split into an encoder F(·) and a decoder G(·), each a multilayer network, with constricted layers in the middle producing the code F(X); the original data X is mapped to the reconstruction X' = (G ∘ F)(X).]

• Reconstructing the data might seem like a trivial matter of simply copying the data forward from one layer to another.

– This is not possible when the number of units in the middle is constricted.

– The autoencoder is divided into an encoder and a decoder.


Basic Structure of Autoencoder

• It is common (but not necessary) for an M-layer autoencoder to have a symmetric architecture between the input and output.

– The number of units in the kth layer is the same as that in the (M − k + 1)th layer.

• The value of M is often odd, as a result of which the (M + 1)/2 th layer is often the most constricted layer.

– We are counting the (non-computational) input layer as the first layer.

– The minimum number of layers in an autoencoder would be three, corresponding to the input layer, the constricted hidden layer, and the output layer.
Undercomplete Autoencoders and Dimensionality Reduction

• The number of units in each middle layer is typically fewer than that in the input (or output).

– These units hold a reduced representation of the data, and the final layer can no longer reconstruct the data exactly.

• This type of reconstruction is inherently lossy.

• The activations of the hidden layers provide an alternative to linear and nonlinear dimensionality reduction techniques.
Overcomplete Autoencoders and Representation Learning

• What happens if the number of units in the hidden layer is equal to or larger than the input/output layers?

– There are infinitely many hidden representations with zero error.

– The middle layers often do not learn the identity function.

– We can enforce specific properties on the redundant representations by adding constraints/regularization to the hidden layer.

  ∗ Training with stochastic gradient descent is itself a form of regularization.

  ∗ One can learn sparse features by adding sparsity constraints to the hidden layer.
Applications

• Dimensionality reduction ⇒ use the activations of the constricted hidden layer.

• Sparse feature learning ⇒ use the activations of a constrained/regularized hidden layer.

• Outlier detection: find data points with larger reconstruction error.

– Related to denoising applications.

• Generative models with probabilistic hidden layers (variational autoencoders).

• Representation learning ⇒ pretraining.


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Singular Value Decomposition with Autoencoders

Neural Networks and Deep Learning, Springer, 2018

Chapter 2, Section 2.5
Singular Value Decomposition

• Truncated SVD is the approximate decomposition of an n × d matrix D into D ≈ QΣP^T, where Q, Σ, and P are n × k, k × k, and d × k matrices, respectively.

– The columns of each of P and Q are orthonormal, and the diagonal matrix Σ is nonnegative.

– Minimize the squared sum of the residual entries in D − QΣP^T.

– The value of k is typically much smaller than min{n, d}.

– Setting k to min{n, d} results in a zero-error decomposition.
Relaxed and Unnormalized Definition of SVD

• Two-way decomposition: find an n × k matrix U and a d × k matrix V so that ||D − U V^T||^2 is minimized.

– Property: at least one optimal pair U and V will have mutually orthogonal columns (but non-orthogonal alternatives will exist).

– The orthogonal solution can be converted into the 3-way factorization of SVD.

– Exercise: given U and V with orthogonal columns, find Q, Σ, and P.

• In the event that U and V have non-orthogonal columns at optimality, these columns will span the same subspace as the orthogonal solution at optimality.
Dimensionality Reduction and Matrix Factorization

• Singular value decomposition is a dimensionality reduction method (like any matrix factorization technique):

  D ≈ U V^T

• The n rows of D contain the n training points.

• The n rows of U provide the reduced representations of the training points.

• The k columns of V contain the orthogonal basis vectors (see the sketch below).
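A short numpy illustration of this view (the toy matrix is my own); the reduced representations and the basis are read off a truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(5)
# A 100 x 20 data matrix that is (nearly) rank 5
D = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 20)) + 0.01 * rng.normal(size=(100, 20))

k = 5
Q, s, Pt = np.linalg.svd(D, full_matrices=False)
U = Q[:, :k] * s[:k]                     # n x k: reduced representations of the rows of D
V = Pt[:k].T                             # d x k: orthogonal basis vectors in its columns
print(np.linalg.norm(D - U @ V.T))       # small residual: D is approximately U V^T
print(np.allclose(V.T @ V, np.eye(k)))   # True: the basis vectors are orthonormal
```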


The Autoencoder Architecture for SVD

[Figure: a single-hidden-layer autoencoder; the encoder weights are W^T, the decoder weights are V^T, the input and output layers have x1 . . . x5 and x'1 . . . x'5, and the output of the hidden layer provides the reduced representation.]

• The rows of the matrix D are input to the encoder.

• The activations of the hidden layer are the rows of U, and the weights of the decoder contain V.

• The reconstructed data contain the rows of U V^T.


Why is this SVD?

• If we use the mean-squared error as the loss function, we are optimizing ||D − U V^T||^2 over the entire training data.

– This is the same objective function as SVD!

• It is possible for gradient descent to arrive at an optimal solution in which the columns of each of U and V might not be mutually orthogonal.

• Nevertheless, the subspace spanned by the columns of each of U and V will always be the same as that found by the optimal solution of SVD.
Some Interesting Facts

• The optimal encoder weight matrix W will be the pseudo-inverse of the decoder weight matrix V if the training data spans the full dimensionality:

  W = (V^T V)^{−1} V^T

– If the encoder and decoder weights are tied as W = V^T, the columns of the weight matrix V will become mutually orthogonal.

– This is easily shown by substituting W = V^T above and post-multiplying with V to obtain V^T V = I.

– This is exactly SVD!

• Tying the encoder and decoder weights does not lead to orthogonality for other architectures, but it is a common practice anyway (a small numerical check follows).
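A small numerical check of the pseudo-inverse relation (my own construction, using a decoder with orthonormal columns taken from the SVD of a toy matrix):

```python
import numpy as np

rng = np.random.default_rng(7)
D = rng.normal(size=(100, 6))
k = 2

# Take an optimal decoder V with orthonormal columns (top-k right singular vectors)
_, _, Pt = np.linalg.svd(D, full_matrices=False)
V = Pt[:k].T                                   # d x k, with V^T V = I

# Optimal encoder is the pseudo-inverse of the decoder
W = np.linalg.inv(V.T @ V) @ V.T               # (V^T V)^{-1} V^T  ( = V^T here )
print(np.allclose(W, V.T))                     # tied weights coincide when V^T V = I

U = D @ W.T                                    # encode the rows of D
print(np.linalg.norm(D - U @ V.T))             # equals the rank-k SVD residual
s = np.linalg.svd(D, compute_uv=False)
print(np.sqrt((s[k:] ** 2).sum()))             # same value
```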
Deep Autoencoders

[Figure: a three-dimensional point set lying on a curved manifold (points A, B, and C marked) and its low-dimensional reduction produced by a nonlinear deep autoencoder.]

• Better reductions are obtained by using increased depth and nonlinearity.

• It is crucial to use nonlinear activations with deep autoencoders.


Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Row-Index to Row-Value Autoencoders: Incomplete Matrix Factorization for Recommender Systems

Neural Networks and Deep Learning, Springer, 2018

Chapter 2, Section 2.6
Recommender Systems

• Recap of SVD: factorize D ≈ U V^T so that the sum of squares of the residuals ||D − U V^T||^2 is minimized.

– It is helpful to watch the previous lecture on SVD.

• In recommender systems (RS), we have an n × d ratings matrix D with n users and d items.

– Most of the entries in the matrix are unobserved.

– We want to minimize ||D − U V^T||^2 only over the observed entries.

– We can reconstruct the entire ratings matrix using U V^T ⇒ the most popular method in traditional machine learning.
Difficulties with Autoencoder

• If some of the inputs are missing, then using an autoencoder architecture will implicitly assume default values for some inputs (like zero).

– This is a solution used in some recent methods like AutoRec.

– It does not exactly simulate the classical MF used in recommender systems because it implicitly makes assumptions about the unobserved entries.

• None of the proposed architectures for recommender systems in the deep learning literature exactly map to the classical factorization method of recommender systems.
Row-Index-to-Row-Value Autoencoder

• Autoencoders map row values to row values.

– Here we discuss an autoencoder architecture that maps the one-hot encoded row index to the row values.

– This is not the standard definition of an autoencoder.

– It can handle incomplete values but cannot handle out-of-sample data.

– It is also useful for representation learning (e.g., node representation of a graph adjacency matrix).

• The row-index-to-row-value architecture is not recognized as a separate class of architectures for MF (but it is used often enough to deserve recognition as a class of MF methods).
Row-Index-to-Row-Value Autoencoder for RS

[Figure: a row-index-to-row-value autoencoder for ratings; the one-hot encoded user index (Alice, Bob, Sayani, John) is the input, the encoder and decoder weight matrices are U and V^T, and the outputs are that user's ratings of the items (E.T., Nixon, Shrek, Gandhi, Nero), with unobserved ratings missing.]

• The encoder and decoder weight matrices are U and V^T.

– The input is the one-hot encoded row index (only in-sample).

– The number of nodes in the hidden layer is the factorization rank.

– The outputs contain the ratings for that row index.


How to Handle Incompletely Specified Entries?

[Figure: two instantiations of the same network, one per user; Sayani's copy has output nodes only for her observed ratings (E.T., Shrek), while Bob's copy has output nodes only for his observed ratings (E.T., Nixon, Gandhi, Nero).]

• Each user has his/her own neural architecture with the unobserved outputs missing.

• The weights across the different user architectures are shared.


Equivalence to Classical Matrix Factorization for RS

• Since the two weight matrices are U and V^T, the one-hot input encoding will pull out the relevant row of U V^T.

• Since the outputs only contain the observed values, we are optimizing the sum of squared errors over the observed values.

• The objective functions in the two cases are equivalent!


Training Equivalence

• For k hidden nodes, there are k paths between each user and each item identifier.

• Backpropagation updates the weights along all k paths from each observed item rating to the user identifier.

– Backpropagation is covered in a later lecture.

• These k updates can be shown to be identical to classical matrix factorization updates with stochastic gradient descent.

• Backpropagation on the neural architecture is identical to stochastic gradient descent on classical MF (see the sketch below).
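A minimal sketch of the equivalent classical MF updates over the observed entries only; the toy ratings matrix, learning rate, and rank are arbitrary choices of mine.

```python
import numpy as np

# Toy ratings matrix with missing entries marked as np.nan
D = np.array([
    [4.0, np.nan, np.nan, 5.0],
    [np.nan, 3.0, 2.0, np.nan],
    [5.0, 4.0, np.nan, np.nan],
])
observed = [(i, j, D[i, j]) for i in range(D.shape[0])
            for j in range(D.shape[1]) if not np.isnan(D[i, j])]

n_users, n_items, k = D.shape[0], D.shape[1], 2
rng = np.random.default_rng(8)
U = 0.1 * rng.normal(size=(n_users, k))   # encoder weights / user factors
V = 0.1 * rng.normal(size=(n_items, k))   # decoder weights / item factors

lr = 0.05
for epoch in range(200):
    for i, j, r in observed:
        u_old = U[i].copy()
        err = r - u_old @ V[j]            # error on one observed output node
        U[i] += lr * err * V[j]           # updates along the k paths from item j to user i
        V[j] += lr * err * u_old
print(np.round(U @ V.T, 2))               # observed entries are approximately reproduced
```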
Advantage of Neural View over Classical MF View

• The neural view provides natural ways to add power to the architecture with nonlinearity and depth.

– Much like a child playing with a LEGO toy.

– You are shielded from the ugly details of training by an inherent modularity in neural architectures.

– The name of this magical modularity is backpropagation.

• If you have binary data, you can add logistic outputs for logistic matrix factorization.

• Word2vec belongs to this class of architectures (but the direct relationship to nonlinear matrix factorization is not widely recognized).
Importance of Row-Index-to-Row-Value Autoencoders

• Several MF methods in machine learning can be expressed as row-index-to-row-value autoencoders (but this is not widely recognized; RS matrix factorization is a notable example).

• Several row-index-to-row-value architectures in the NN literature are also not fully recognized as matrix factorization methods.

– The full relationship of word2vec to matrix factorization is often not recognized.

– An indirect relationship to linear PPMI matrix factorization was shown by Levy and Goldberg.

– In a later lecture, we show that word2vec is directly a form of nonlinear matrix factorization because of its row-index-to-row-value architecture and nonlinear activation.
Charu C. Aggarwal
IBM T J Watson Research Center
Yorktown Heights, NY

Word2vec: The Skipgram Model

Neural Networks and Deep Learning, Springer, 2018


Chapter 2, Section 2.7
Word2Vec: An Overview

• Word2vec computes embeddings of words using sequential proximity in sentences.

– If Paris is closely related to France, then Paris and France must occur together in small windows of sentences.

  ∗ Their embeddings should also be somewhat similar.

– The continuous bag-of-words model predicts the central word from its context window.

– The skipgram model predicts the context window from the central word.
Words and Context

• A window of size t on either side is predicted using a word.

• This model tries to predict the context wi−t wi−t+1 . . . wi−1 wi+1 . . . wi+t−1 wi+t around the word wi, given the ith word in the sentence, denoted by wi.

• The total number of words in the context window is m = 2t.

• One can also create a d × d word-context matrix C with frequencies cij (see the sketch below).

• We want to find an embedding of each word.
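A small sketch of building such a word-context count matrix from raw sentences; the toy sentences and function name are mine.

```python
from collections import Counter

def word_context_counts(sentences, t=2):
    """Count c_ij: how often word j appears within a window of size t around word i."""
    counts = Counter()
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            window = words[max(0, i - t): i] + words[i + 1: i + 1 + t]
            for c in window:
                counts[(w, c)] += 1
    return counts

sentences = ["paris is the capital of france",
             "berlin is the capital of germany"]
counts = word_context_counts(sentences, t=2)
print(counts[("capital", "of")], counts[("paris", "is")])   # 2 1
```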


Where have We Seen this Setup Before?

• This is similar to recommender systems with implicit feedback.

• Instead of user-item matrices, we have square word-context matrices.

– The frequencies correspond to the number of times a contextual word (column id) appears for a target word (row id).

– This is analogous to the number of units bought by a user (row id) of an item (column id).

– An unrecognized fact is that skipgram word2vec uses an almost identical model to current recommender systems.

• It is helpful to watch the previous lecture on recommender systems with row-index-to-value autoencoders.
Word2Vec: Skipgram Model
[Figure: the skipgram architecture; a d-dimensional one-hot input x1 . . . xd passes through the d × p encoder matrix U = [u_jq] to a p-dimensional hidden layer h1 . . . hp, and the shared p × d decoder matrix V = [v_qj] produces m identical sets of d softmax outputs (y11 . . . y1d through ym1 . . . ymd).]

• The input is the one-hot encoded word identifier, and the output contains m identical softmax probability sets.
Word2Vec: Skipgram Model
[Figure: the same skipgram network with the m outputs collapsed into a single set of d softmax outputs; the m d-dimensional output vectors in each context window are mini-batched during stochastic gradient descent, and the shown outputs y_jk correspond to the jth of the m outputs.]

• Since the m outputs are identical, we can collapse the m outputs into a single output.

• Mini-batch the words in a context window to achieve the same effect.

• The gradient-descent steps for each instance are proportional to d ⇒ expensive.
Word2Vec: Skipgram Model with Negative Sampling

[Figure: the same collapsed skipgram network, now with sigmoid output units; only the positive output and a small sample of the negative outputs are retained for each training instance.]

• Change the softmax layer into a sigmoid layer.

• Of the d outputs, keep the positive output and sample k out of the remaining d − 1 (with log loss); a sketch of one such update follows.

• Where have we seen missing outputs before?
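A minimal sketch of one skipgram-with-negative-sampling update under simplifying assumptions of mine: negatives are sampled uniformly here (word2vec samples from a smoothed unigram distribution), and all names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(U, V, target, context, neg_samples, lr=0.025):
    """One SGNS update: sigmoid(u.v) is pushed toward 1 for the observed pair, toward 0 for negatives."""
    u = U[target].copy()
    du = np.zeros_like(u)
    for j, label in [(context, 1.0)] + [(j, 0.0) for j in neg_samples]:
        g = label - sigmoid(u @ V[j])    # derivative of the log loss w.r.t. the score u.V[j]
        du += g * V[j]
        V[j] += lr * g * u               # update the context (output-side) embedding
    U[target] += lr * du                 # update the word (input-side) embedding
    return U, V

vocab, dim = 10, 4
rng = np.random.default_rng(9)
U = 0.1 * rng.normal(size=(vocab, dim))   # input (word) embeddings
V = 0.1 * rng.normal(size=(vocab, dim))   # output (context) embeddings

# Word 3 observed with context word 7; five negatives sampled uniformly for illustration
U, V = sgns_step(U, V, target=3, context=7, neg_samples=rng.choice(vocab, size=5))
```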


Can You See the Similarity?
[Figure: the SGNS skipgram network (top) next to the row-index-to-row-value autoencoder for recommender systems (bottom); in both cases the input is a one-hot encoded identifier, and the vast majority of the zero outputs are missing (negative sampling).]

• Main difference: a sigmoid output layer with log loss.


Word2Vec is Nonlinear Matrix Factorization

• Levy and Goldberg showed an indirect relationship between word2vec SGNS and PPMI matrix factorization.

• We provide a much more direct result in the book.

– Word2vec is (weighted) logistic matrix factorization.

– This is not surprising because of the similarity with the recommender architecture.

– Logistic matrix factorization is already used in recommender systems!

– Neither the word2vec authors nor the community have pointed out this direct connection.
Other Extensions

• We can apply a row-index-to-value autoencoder to any type of matrix to learn embeddings of either the rows or the columns.

• Applying it to a graph adjacency matrix leads to node embeddings.

– This idea has been used by DeepWalk and node2vec after (indirectly) enhancing the matrix entries with random-walk methods.

– Details of graph embedding methods are in the book.