Building Autoencoders in Keras
Note: all code examples have been updated to the Keras 2.0 API on March 14, 2017. You will
need Keras version 2.0.0 or higher to run them.
1) Autoencoders are data-specific, which means that they will only be able to compress data similar
to what they have been trained on. This is different from, say, the MPEG-2 Audio Layer III (MP3)
compression algorithm, which only holds assumptions about "sound" in general, but not about
specific types of sounds. An autoencoder trained on pictures of faces would do a rather poor job of
compressing pictures of trees, because the features it would learn would be face-specific.
2) Autoencoders are lossy, which means that the decompressed outputs will be degraded
compared to the original inputs (similar to MP3 or JPEG compression). This differs from lossless
arithmetic compression.
3) Autoencoders are learned automatically from data examples, which is a useful property: it
means that it is easy to train specialized instances of the algorithm that will perform well on a
specific type of input. It doesn't require any new engineering, just appropriate training data.
To build an autoencoder, you need three things: an encoding function, a decoding function, and a
distance function that measures the information loss between the compressed representation of
your data and the decompressed representation (i.e. a "loss" function). The encoder and decoder
will be chosen to be parametric functions (typically neural networks), and to be differentiable with
respect to the distance function, so the parameters of the encoding/decoding functions can be
optimized to minimize the reconstruction loss using stochastic gradient descent. It's simple! And
you don't even need to understand any of these words to start using autoencoders in practice.
Today two interesting practical applications of autoencoders are data denoising (which we feature
later in this post), and dimensionality reduction for data visualization. With appropriate
dimensionality and sparsity constraints, autoencoders can learn data projections that are more
interesting than PCA or other basic techniques.
For 2D visualization specifically, t-SNE (pronounced "tee-snee") is probably the best algorithm
around, but it typically requires relatively low-dimensional data. So a good strategy for visualizing
similarity relationships in high-dimensional data is to start by using an autoencoder to compress
your data into a low-dimensional space (e.g. 32-dimensional), then use t-SNE to map the
compressed data to a 2D plane. Note that a nice parametric implementation of t-SNE in Keras was
developed by Kyle McDonald and is available on GitHub. Otherwise, scikit-learn also has a simple
and practical implementation.
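As a sketch of that strategy (a hedged example, assuming a trained `encoder` model that maps inputs to 32-dimensional codes; `x_data` stands for your high-dimensional dataset and is hypothetical):

from sklearn.manifold import TSNE

codes = encoder.predict(x_data)  # compress to 32 dimensions first
codes_2d = TSNE(n_components=2).fit_transform(codes)  # then map to a 2D plane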
One reason why autoencoders have attracted so much research and attention is that they have
long been thought to be a potential avenue for solving the problem of unsupervised learning,
i.e. the learning of useful representations without the need for labels. Then again, autoencoders
are not a true unsupervised learning technique (which would imply a different learning process
altogether); they are a self-supervised technique, a specific instance of supervised learning where
the targets are generated from the input data. In order to get self-supervised models to learn
interesting features, you have to come up with an interesting synthetic target and loss function,
and that's where problems arise: merely learning to reconstruct your input in minute detail might
not be the right choice here. At this point there is significant evidence that focusing on the
reconstruction of a picture at the pixel level, for instance, is not conducive to learning interesting,
abstract features of the kind that label-supervised learning induces (where targets are fairly
abstract concepts "invented" by humans, such as "dog" or "car"). In fact, one may argue that the
best features in this regard are those that are the worst at exact input reconstruction while
achieving high performance on the main task that you are interested in (classification, localization,
etc.).
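For reference, here is a minimal sketch of the model being configured below: a single fully-connected layer as encoder and another as decoder, with encoding_dim = 32 as the size of the compressed representation.

from keras.layers import Input, Dense
from keras.models import Model

encoding_dim = 32  # size of the encoded representation

input_img = Input(shape=(784,))
# "encoded" is the compressed representation of the input
encoded = Dense(encoding_dim, activation='relu')(input_img)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(784, activation='sigmoid')(encoded)

# this model maps an input to its reconstruction
autoencoder = Model(input_img, decoded)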
First, we'll configure our model to use a per-pixel binary crossentropy loss, and the Adadelta
optimizer:
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
Let's prepare our input data. We're using MNIST digits, and we're discarding the labels (since we're
only interested in encoding/decoding the input images).
We will normalize all values between 0 and 1 and we will flatten the 28x28 images into vectors of
size 784.
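A sketch of that preparation:

from keras.datasets import mnist
import numpy as np

# load MNIST, discarding the labels
(x_train, _), (x_test, _) = mnist.load_data()
# normalize to [0, 1] and flatten the 28x28 images into vectors of size 784
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))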
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))
After 50 epochs, the autoencoder seems to reach a stable train/test loss value of about 0.11. We
can try to visualize the reconstructed inputs and the encoded representations. We will use
Matplotlib.
import matplotlib.pyplot as plt

decoded_imgs = autoencoder.predict(x_test)
n = 10
plt.figure(figsize=(20, 4))
for i in range(n):
    # originals on the top row, reconstructions on the bottom row
    for row, imgs in enumerate([x_test, decoded_imgs]):
        ax = plt.subplot(2, n, i + 1 + row * n)
        plt.imshow(imgs[i].reshape(28, 28))
        plt.gray()
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
plt.show()
Here's what we get. The top row is the original digits, and the bottom row is the reconstructed
digits. We are losing quite a bit of detail with this basic approach.
Adding a sparsity constraint on the encoded representations

In the previous example, the representations were only constrained by the size of the hidden layer.
To encourage sparser encodings, we can add an L1 activity regularizer to the Dense layer:

from keras import regularizers

encoding_dim = 32

input_img = Input(shape=(784,))
# add a Dense layer with an L1 activity regularizer
encoded = Dense(encoding_dim, activation='relu',
                activity_regularizer=regularizers.l1(10e-5))(input_img)
decoded = Dense(784, activation='sigmoid')(encoded)
Let's train this model for 100 epochs (with the added regularization, the model is less likely to
overfit and can be trained longer). The model ends with a train loss of 0.11 and a test loss of 0.10.
The difference between the two is mostly due to the regularization term being added to the loss
during training (worth about 0.01).
Deep autoencoder
We do not have to limit ourselves to a single layer as encoder or decoder; we could instead use a
stack of layers, such as:
input_img = Input(shape=(784,))
encoded = Dense(128, activation='relu')(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)

# a symmetric stack of layers as decoder
decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(784, activation='sigmoid')(decoded)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
autoencoder.fit(x_train, x_train,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))
After 100 epochs, it reaches a train and test loss of ~0.097, a bit better than our previous models.
Our reconstructed digits look a bit better too:
Convolutional autoencoder
Since our inputs are images, it makes sense to use convolutional neural networks (convnets) as
encoders and decoders. In practical settings, autoencoders applied to images are always
convolutional autoencoders, since they simply perform much better.
Let's implement one. The encoder will consist of a stack of Conv2D and MaxPooling2D layers (max
pooling being used for spatial down-sampling), while the decoder will consist of a stack of Conv2D
and UpSampling2D layers.
input_img = Input(shape=(28, 28, 1)) # adapt this if using `channels_first` image data format
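# a sketch of the rest of the model described above (assumes Conv2D, MaxPooling2D
# and UpSampling2D are imported from keras.layers, and Model from keras.models)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# at this point the representation is (4, 4, 8), i.e. 128-dimensional

x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)  # no padding here: 16x16 -> 14x14
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')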
To train it, we will use the original MNIST digits with shape (samples, 28, 28, 1), and we will just
normalize pixel values between 0 and 1.
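A sketch of that preparation (mnist and numpy as imported earlier):

(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = np.reshape(x_train, (len(x_train), 28, 28, 1))  # adapt if channels_first
x_test = np.reshape(x_test, (len(x_test), 28, 28, 1))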
Let's train this model for 50 epochs. For the sake of demonstrating how to visualize the results of a
model during training, we will be using the TensorFlow backend and the TensorBoard callback.
First, let's open up a terminal and start a TensorBoard server that will read logs stored at
/tmp/autoencoder.
tensorboard --logdir=/tmp/autoencoder
Then let's train our model. In the callbacks list we pass an instance of the TensorBoard callback.
After every epoch, this callback will write logs to /tmp/autoencoder, which can be read by our
TensorBoard server.
from keras.callbacks import TensorBoard

autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=128,
                shuffle=True,
                validation_data=(x_test, x_test),
                callbacks=[TensorBoard(log_dir='/tmp/autoencoder')])
This allows us to monitor training in the TensorBoard web interface (by navigating to
http://0.0.0.0:6006):
The model converges to a loss of 0.094, significantly better than our previous models (this is in
large part due to the higher entropic capacity of the encoded representation, 128 dimensions vs. 32
previously). Let's take a look at the reconstructed digits:
decoded_imgs = autoencoder.predict(x_test)

n = 10
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
We can also have a look at the 128-dimensional encoded representations. These representations
are 8x4x4, so we reshape them to 4x32 in order to be able to display them as grayscale images.
# assumes the encoder half of the model above, e.g.:
encoder = Model(input_img, encoded)
encoded_imgs = encoder.predict(x_test)

n = 10
plt.figure(figsize=(20, 8))
for i in range(n):
    ax = plt.subplot(1, n, i + 1)
    plt.imshow(encoded_imgs[i].reshape(4, 4 * 8).T)
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
Here's how we will generate synthetic noisy digits: we just apply a Gaussian noise matrix and clip
the images between 0 and 1.
noise_factor = 0.5
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)

x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)
n = 10
plt.figure(figsize=(20, 2))
for i in range(n):
    ax = plt.subplot(1, n, i + 1)
    plt.imshow(x_test_noisy[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
If you squint you can still recognize them, but barely. Can our autoencoder learn to recover the
original digits? Let's find out.
Compared to the previous convolutional autoencoder, in order to improve the quality of the
reconstruction, we'll use a slightly different model with more filters per layer:
input_img = Input(shape=(28, 28, 1)) # adapt this if using `channels_first` image data format
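# a sketch of the rest of the denoising model: the same building blocks as the
# previous convolutional autoencoder, with more (32) filters per layer
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# at this point the representation is (7, 7, 32)

x = Conv2D(32, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

# train to map noisy digits to their clean originals
autoencoder.fit(x_train_noisy, x_train,
                epochs=100,
                batch_size=128,
                shuffle=True,
                validation_data=(x_test_noisy, x_test))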
Now let's take a look at the results. On top are the noisy digits fed to the network; on the bottom,
the digits reconstructed by the network.
It seems to work pretty well. If you scale this process to a bigger convnet, you can start building
document denoising or audio denoising models. Kaggle has an interesting dataset to get you
started.
Sequence-to-sequence autoencoder
If your inputs are sequences, rather than vectors or 2D images, then you may want to use as encoder
and decoder a type of model that can capture temporal structure, such as an LSTM. To build an
LSTM-based autoencoder, first use an LSTM encoder to turn your input sequences into a single
vector that contains information about the entire sequence, then repeat this vector n times (where
n is the number of timesteps in the output sequence), and run an LSTM decoder to turn this
constant sequence into the target sequence.
We won't be demonstrating that one on any specific dataset. We will just put a code example here
for future reference for the reader!
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(latent_dim)(inputs)  # encode the whole sequence into one vector

decoded = RepeatVector(timesteps)(encoded)  # repeat it once per output timestep
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
Variational autoencoder (VAE)

First, an encoder network turns the input samples x into two parameters in a latent space, which
we will denote z_mean and z_log_sigma. Then, we randomly sample similar points z from the latent
normal distribution that is assumed to generate the data, via z = z_mean + exp(z_log_sigma) *
epsilon, where epsilon is a random normal tensor. Finally, a decoder network maps these latent
space points back to the original input data.
The parameters of the model are trained via two loss functions: a reconstruction loss forcing the
decoded samples to match the initial inputs (just like in our previous autoencoders), and the KL
divergence between the learned latent distribution and the prior distribution, acting as a
regularization term. You could actually get rid of this latter term entirely, although it does help in
learning well-formed latent spaces and reducing overfitting to the training data.
Because a VAE is a more complex example, we have made the code available on Github as a
standalone script. Here we will review step by step how the model is created.
First, here's our encoder network, mapping inputs to our latent distribution parameters:
x = Input(batch_shape=(batch_size, original_dim))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_sigma = Dense(latent_dim)(h)
We can use these parameters to sample new similar points from the latent space:
from keras import backend as K
from keras.layers import Lambda

def sampling(args):
    z_mean, z_log_sigma = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim),
                              mean=0., stddev=epsilon_std)
    return z_mean + K.exp(z_log_sigma) * epsilon

z = Lambda(sampling)([z_mean, z_log_sigma])
Finally, we can map these sampled latent points back to reconstructed inputs:
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_mean = Dense(original_dim, activation='sigmoid')
h_decoded = decoder_h(z)
x_decoded_mean = decoder_mean(h_decoded)

# end-to-end autoencoder
vae = Model(x, x_decoded_mean)

# generator, from latent space to reconstructed inputs (used later to sample digits)
decoder_input = Input(shape=(latent_dim,))
generator = Model(decoder_input, decoder_mean(decoder_h(decoder_input)))
We train the model using the end-to-end model, with a custom loss function: the sum of a
reconstruction term, and the KL divergence regularization term.
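A sketch of that loss (assuming keras.losses for the crossentropy term; z_mean and z_log_sigma are the encoder outputs defined above):

from keras import losses

def vae_loss(x, x_decoded_mean):
    # reconstruction term: binary crossentropy between inputs and reconstructions
    xent_loss = losses.binary_crossentropy(x, x_decoded_mean)
    # regularization term: KL divergence to the unit-Gaussian prior
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma), axis=-1)
    return xent_loss + kl_loss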
vae.compile(optimizer='rmsprop', loss=vae_loss)
vae.fit(x_train, x_train,
        shuffle=True,
        epochs=epochs,
        batch_size=batch_size,
        validation_data=(x_test, x_test))
Because our latent space is two-dimensional, there are a few cool visualizations that can be done
at this point. One is to look at the neighborhoods of different classes on the latent 2D plane:
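A sketch of that visualization (assuming the MNIST labels y_test were kept around when loading the data):

encoder = Model(x, z_mean)  # maps inputs to their latent mean
x_test_encoded = encoder.predict(x_test, batch_size=batch_size)

plt.figure(figsize=(6, 6))
plt.scatter(x_test_encoded[:, 0], x_test_encoded[:, 1], c=y_test)
plt.colorbar()
plt.show()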
Because the VAE is a generative model, we can also use it to generate new digits! Here we will scan
the latent plane, sampling latent points at regular intervals, and generating the corresponding digit
for each of these points. This gives us a visualization of the latent manifold that "generates" the
MNIST digits.
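The loop below assumes a grid setup along these lines (the sampling range here is an assumption, not fixed by the model):

n = 15  # a 15x15 grid of digits
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
# sample the latent plane at regular intervals
grid_x = np.linspace(-15, 15, n)
grid_y = np.linspace(-15, 15, n)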
for i, yi in enumerate(grid_x):
    for j, xi in enumerate(grid_y):
        z_sample = np.array([[xi, yi]]) * epsilon_std
        x_decoded = generator.predict(z_sample)
        digit = x_decoded[0].reshape(digit_size, digit_size)
        figure[i * digit_size: (i + 1) * digit_size,
               j * digit_size: (j + 1) * digit_size] = digit

plt.figure(figsize=(10, 10))
plt.imshow(figure)
plt.show()
That's it! If you have suggestions for more topics to be covered in this post (or in future posts), you
can contact me on Twitter at @fchollet.