
Recent Advances in Autoencoder-Based Representation Learning
Michael Tschannen
ETH Zurich
Olivier Bachem
Google AI, Brain Team
Mario Lucic
Google AI, Brain Team
Abstract
Learning useful representations with little or no supervision is a key challenge in
artificial intelligence. We provide an in-depth review of recent advances in repre-
sentation learning with a focus on autoencoder-based models. To organize these
results we make use of meta-priors believed useful for downstream tasks, such as
disentanglement and hierarchical organization of features. In particular, we un-
cover three main mechanisms to enforce such properties, namely (i) regularizing
the (approximate or aggregate) posterior distribution, (ii) factorizing the encod-
ing and decoding distribution, or (iii) introducing a structured prior distribution.
While there are some promising results, implicit or explicit supervision remains
a key enabler and all current methods use strong inductive biases and modeling
assumptions. Finally, we provide an analysis of autoencoder-based representation
learning through the lens of rate-distortion theory and identify a clear tradeoff be-
tween the amount of prior knowledge available about the downstream tasks, and
how useful the representation is for this task.
1 Introduction
The ability to learn useful representations of data with little or no supervision is a key challenge
towards applying artificial intelligence to the vast amounts of unlabelled data collected in the world.
While it is clear that the usefulness of a representation learned on data heavily depends on the end
task which it is to be used for, one could imagine that there exist properties of representations
which are useful for many real-world tasks simultaneously. In a seminal paper on representation
learning, Bengio et al. [1] proposed such a set of meta-priors. The meta-priors are derived from
general assumptions about the world such as the hierarchical organization or disentanglement of
explanatory factors, the possibility of semi-supervised learning, the concentration of data on low-
dimensional manifolds, clusterability, and temporal and spatial coherence.
Recently, a variety of (unsupervised) representation learning algorithms have been proposed based
on the idea of autoencoding where the goal is to learn a mapping from high-dimensional observa-
tions to a lower-dimensional representation space such that the original observations can be recon-
structed (approximately) from the lower-dimensional representation. While these approaches have
varying motivations and design choices, we argue that essentially all of the methods reviewed in this
paper implicitly or explicitly have at their core at least one of the meta-priors from Bengio et al. [1].
Given the unsupervised nature of the upstream representation learning task, the characteristics of
the meta-priors enforced in the representation learning step determine how useful the resulting rep-
resentation is for the real-world end task. Hence, it is critical to understand which meta-priors are
targeted by which models and which generic techniques are useful to enforce a given meta-prior. In
this paper, we provide a unified view which encompasses the majority of proposed models and relate
them to the meta-priors proposed by Bengio et al. [1]. We summarize the recent work focusing on
the meta-priors in Table 1.
Third workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada.
arXiv:1812.05069v1 [cs.LG] 12 Dec 2018
Table 1: Grouping of methods according to the meta-priors for representation learning from [1].
While many methods directly or indirectly address multiple meta-priors, we only considered the
most prominent target of each method. Note that meta-priors such as low dimensionality and mani-
fold structure are enforced by essentially all methods.
Disentanglement: β-VAE (6) [2], FactorVAE (8) [3], β-TCVAE (9) [4], InfoVAE (9) [5], DIP-VAE (11) [6], HSIC-VAE (12) [7], HFVAE (13) [8], VIB [9], Information dropout (15) [10], DC-IGN [11], FaderNetworks (18) [12], VFAE (17) [13]
Hierarchical representation¹: PixelVAE [14], LVAE [15], VLaAE [16], Semi-supervised VAE [17], PixelGAN-AE [18], VLAE [19], VQ-VAE [20]
Semi-supervised learning: Semi-supervised VAE [17], [21], PixelGAN-AE (14) [18], AAE (16) [22]
Clustering: PixelGAN-AE (14) [18], AAE (16) [22], JointVAE [23], SVAE [24]
Meta-priors of Bengio et al. [1]. Meta-priors capture very general premises about the world and
are therefore arguably useful for a broad set of downstream tasks. We briefly summarize the most
important meta-priors which are targeted by the reviewed approaches.
1. Disentanglement: Assuming that the data is generated from independent factors of variation,
for example object orientation and lighting conditions in images of objects, disentanglement
as a meta-prior encourages these factors to be captured by different independent variables in
the representation. It should result in a concise abstract representation of the data useful for
a variety of downstream tasks and promises improved sample efficiency.
2. Hierarchical organization of explanatory factors: The intuition behind this meta-prior is
that the world can be described as a hierarchy of increasingly abstract concepts. For example,
natural images can be abstractly described in terms of the objects they show at various levels
of granularity. Given the object, a more concrete description can be given by object attributes.
3. Semi-supervised learning: The idea is to share a representation between a supervised and an
unsupervised learning task, which often leads to synergies: while the number of labeled data
points is usually too small to learn a good predictor (and thereby a representation) on its own,
training jointly with an unsupervised objective helps learn a representation that generalizes,
and the supervised task in turn guides the representation learning process.
4. Clustering structure: Many real-world data sets have multi-category structure (such as im-
ages showing different object categories), with possibly category-dependent factors of varia-
tion. Such structure can be captured with a latent mixture model where each mixture compo-
nent corresponds to one category and its distribution models the factors of variation within
that category (see the sketch after this list). This naturally leads to a representation with clustering structure.
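As a concrete illustration of the clustering meta-prior, the following minimal sketch (not taken from any of the reviewed papers; the sizes and parameter values are hypothetical placeholders) builds a Gaussian-mixture prior p(z) with torch.distributions, where each mixture component can be read as one category. Such a prior can stand in for the standard normal prior in the VAE objective discussed in Section 2.

```python
# Minimal sketch: a Gaussian-mixture prior p(z) whose components play the role of categories.
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

num_components, latent_dim = 10, 2                      # hypothetical sizes
weights = torch.ones(num_components) / num_components   # uniform mixture weights
means = torch.randn(num_components, latent_dim)         # one mean per category
scales = 0.5 * torch.ones(num_components, latent_dim)   # per-component standard deviations

prior = MixtureSameFamily(
    Categorical(probs=weights),
    Independent(Normal(means, scales), 1),               # diagonal-Gaussian components
)

z = torch.randn(4, latent_dim)                            # a batch of latent codes
print(prior.log_prob(z))                                  # log p(z), usable in place of N(0, I)
```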
Very generic concepts such as smoothness as well as temporal and spatial coherence are not specific
to unsupervised learning and are used in most practical setups (for example weight decay to encour-
age smoothness of predictors, and convolutional layers to capture spatial coherence in image data).
We discuss the implicit supervision used by most approaches in Section 7.
Mechanisms for enforcing meta-priors. We identify the following three mechanisms to enforce
meta-priors:
(i) Regularization of the encoding distribution (Section 3).
(ii) Choice of the encoding and decoding distribution or model family (Section 4).
(iii) Choice of a flexible prior distribution of the representation (Section 5).
For example, regularization of the encoding distribution is often used to encourage disentangled
representations. Alternatively, factorizing the encoding and decoding distribution in a hierarchical
fashion allows us to impose a hierarchical structure to the representation. Finally, a more flexible
prior, say a mixture distribution, can be used to encourage clusterability.
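To make mechanism (i) concrete, the following minimal sketch (a generic template rather than any specific method; the scalar values are placeholders for quantities estimated on a minibatch) shows how a regularizer R on the (approximate or aggregate) posterior is added to the VAE objective, as done, e.g., with a total-correlation penalty in FactorVAE [3] or a divergence between the aggregate posterior and the prior in InfoVAE [5].

```python
# Minimal sketch: mechanism (i) augments the VAE objective with a weighted regularizer.
import torch

def regularized_objective(recon, kl, regularizer, lam):
    # recon ~ E_q[-log p_theta(x|z)], kl ~ KL(q_phi(z|x) || p(z)), regularizer ~ R(q); all scalars.
    return recon + kl + lam * regularizer

# Placeholder scalars standing in for minibatch estimates of the three terms.
print(regularized_objective(torch.tensor(60.3), torch.tensor(12.1), torch.tensor(3.4), lam=4.0))
```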
¹ While PixelGAN-AE [18], VLAE [19], and VQ-VAE [20] do not explicitly model a hierarchy of latents,
they learn abstract representations capturing global structure of images [18, 19] and speech signals [20], hence
internally representing the data in a hierarchical fashion.
[Figure 1(a): schematic of the Variational Autoencoder (VAE) framework with encoder q_φ(z|x), decoder p_θ(x|z), and prior p(z) over the latent code z. Figure 1(b): samples from a trained VAE.]
Figure 1: Figure (a) illustrates the Variational Autoencoder (VAE) framework specified by the en-
coder, decoder, and the prior distribution on the latent (representation/code) space. The encoder
maps the input to the representation space (inference), while the decoder reconstructs the original
input from the representation. The encoder is encouraged to satisfy some structure on the latent
space (e.g., it should be disentangled). Figure (b) shows samples from a trained autoencoder with
latent space of 2 dimensions on the MNIST data set. Each point on the left corresponds to the
representation of a digit (originally in 784 dimensions) and the reconstructed digits can be seen on
the right. One can observe that in this case the latent representation is clustered (various styles of
the same digit are close w.r.t. the L2-distance, and within each group the position corresponds to the
rotation of the digit).
Before starting our overview, in Section 2 we present the main concepts necessary to understand
variational autoencoders (VAEs) [25, 26], underlying most of the methods considered in this pa-
per, and several techniques used to estimate divergences between probability distributions. We then
present a detailed discussion of regularization-based methods in Section 3, review methods rely-
ing on structured encoding and decoding distributions in Section 4, and present methods using a
structured prior distribution in Section 5. We conclude the review with an overview of related
methods such as cross-domain representation learning [27–29] in Section 6. Finally, we provide
a critique of unsupervised representation learning through the rate-distortion framework of Alemi
et al. [30] and discuss the implications in Section 7.
2 Preliminaries
We assume familiarity with the key concepts in Bayesian data modeling. For a gentle introduction
to VAEs we refer the reader to [31]. VAEs [25, 26] aim to learn a parametric latent variable model
by maximizing the marginal log-likelihood of the training data $\{x^{(i)}\}_{i=1}^{N}$. By introducing an approximate
posterior $q_\phi(z|x)$, which is an approximation of the intractable true posterior $p_\theta(z|x)$, we
can rewrite the negative log-likelihood as

$$\mathbb{E}_{\hat{p}(x)}[-\log p_\theta(x)] = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) - \mathbb{E}_{\hat{p}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))\big]$$

where

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}\big[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)]\big] + \mathbb{E}_{\hat{p}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))\big], \tag{1}$$

[Figure 2(a): GAN schematic with a generator mapping noise to samples and a discriminator classifying real (c = 1) vs. generated (c = 0) samples from p_x and p_y. Figure 2(b): MMD schematic with a feature mapping φ and the distance MMD(p_x, p_y) between the embedded distributions p_x and p_y.]
Figure 2: Adversarial density ratio estimation vs MMD. Figure (a): GANs use adversarial density
ratio estimation to train a generative model, which can be seen as a two-player game: The discrimi-
nator tries to predict whether samples are real or generated, while the generator tries to deceive the
discriminator by mimicking the distribution of the real samples. Figure (b): The MMD corresponds
to the distance between mean feature embeddings.
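Figure 2(b) describes the MMD as the distance between mean feature embeddings. The following minimal sketch (an illustration with an RBF kernel and synthetic samples, not code from the paper) computes a simple biased empirical estimate of the squared MMD between two sample sets.

```python
# Minimal sketch: biased empirical estimate of MMD^2 with an RBF kernel.
import torch

def rbf_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of a and b
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    # MMD^2(p_x, p_y) ~ mean k(x, x') + mean k(y, y') - 2 mean k(x, y)
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

x, y = torch.randn(128, 2), torch.randn(128, 2) + 1.0  # samples from two toy distributions
print(mmd_squared(x, y))
```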
Here, $\mathbb{E}_{\hat{p}(x)}[f(x)] = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$ denotes the expectation of the function $f(x)$ w.r.t. the empirical data
distribution. The approach is illustrated in Figure 1. The first term in (1) measures the reconstruction
error and the second term quantifies how well $q_\phi(z|x)$ matches the prior $p(z)$. The structure of the
latent space heavily depends on this prior. As the KL divergence is non-negative, $-\mathcal{L}_{\mathrm{VAE}}$ lower-bounds
the marginal log-likelihood $\mathbb{E}_{\hat{p}(x)}[\log p_\theta(x)]$ and is accordingly called the evidence lower
bound (ELBO).
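For concreteness, the following minimal PyTorch-style sketch estimates (1) on a minibatch. It is an illustration rather than a reference implementation: the `encoder` and `decoder` networks, the data dimensions, and the choice of a Gaussian encoder with a Bernoulli decoder and standard normal prior are all assumptions.

```python
# Minimal sketch: a minibatch estimate of L_VAE in (1).
import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Independent, Normal, kl_divergence

latent_dim, data_dim = 8, 784                      # hypothetical sizes (e.g., MNIST-like data)
encoder = nn.Linear(data_dim, 2 * latent_dim)      # hypothetical encoder producing (mu, log sigma)
decoder = nn.Linear(latent_dim, data_dim)          # hypothetical decoder producing Bernoulli logits

def vae_loss(x):
    mu, log_sigma = encoder(x).chunk(2, dim=-1)
    q_z_given_x = Independent(Normal(mu, log_sigma.exp()), 1)                # q_phi(z|x)
    p_z = Independent(Normal(torch.zeros_like(mu), torch.ones_like(mu)), 1)  # p(z) = N(0, I)
    z = q_z_given_x.rsample()                       # sample via the reparametrization trick
    p_x_given_z = Independent(Bernoulli(logits=decoder(z)), 1)               # p_theta(x|z)
    recon = -p_x_given_z.log_prob(x).mean()         # Monte Carlo estimate of E_q[-log p_theta(x|z)]
    kl = kl_divergence(q_z_given_x, p_z).mean()     # KL(q_phi(z|x) || p(z)), closed form here
    return recon + kl                               # estimate of L_VAE (the negative ELBO)

x = torch.rand(16, data_dim).round()                # a fake binary minibatch
print(vae_loss(x))
```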
There are several design choices available: (1) the prior distribution on the latent space, $p(z)$, (2)
the family of approximate posterior distributions, $q_\phi(z|x)$, and (3) the decoder distribution, $p_\theta(x|z)$.
Ideally, the approximate posterior should be flexible enough to match the intractable true posterior
$p_\theta(z|x)$. As we will see later, there are many available options for these design choices, leading to
various trade-offs in terms of the learned representation.
In practice, the first term in (1) can be estimated from samples $z^{(i)} \sim q_\phi(z|x^{(i)})$ and gradients are
backpropagated through the sampling operation using the reparametrization trick [25, Section 2.3],
enabling minimization of (1) via minibatch stochastic gradient descent (SGD). Depending on the
choice of $q_\phi(z|x)$, the second term can either be computed in closed form or estimated from samples.
For the usual choice of $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)))$, where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are
deterministic functions parametrized as neural networks, and $p(z) = \mathcal{N}(0, I)$, the KL term
in (1) can be computed in closed form (more complicated choices of $p(z)$ rarely allow closed-form
computation). To this end, we will briefly discuss two ways in which one can measure distances
between distributions. We will focus on the intuition behind these techniques and provide pointers to
detailed expositions.
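The reparametrization trick and the closed-form KL term for this choice can be sketched as follows (an illustration only; random placeholder tensors stand in for μ_φ(x) and log σ_φ(x)² computed by an encoder network).

```python
# Minimal sketch: reparametrization trick and closed-form KL for a diagonal-Gaussian encoder.
import torch

mu = torch.randn(32, 10, requires_grad=True)        # placeholder for mu_phi(x) on a batch
log_var = torch.randn(32, 10, requires_grad=True)   # placeholder for log sigma_phi(x)^2

# Reparametrization: z = mu + sigma * eps with eps ~ N(0, I), so gradients flow to mu and log_var.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
kl = 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var, dim=1)
print(z.shape, kl.mean())
```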
Adversarial density-ratio estimation. Given a convex function $f$ for which $f(1) = 0$, the $f$-divergence
between $p_x$ and $p_y$ is defined as

$$D_f(p_x \,\|\, p_y) = \int f\!\left(\frac{p_x(x)}{p_y(x)}\right) p_y(x)\,dx.$$

For example, the choice $f(t) = t \log t$ corresponds to $D_f(p_x \,\|\, p_y) = D_{\mathrm{KL}}(p_x \,\|\, p_y)$. Given samples
from $p_x$ and $p_y$ we can estimate the $f$-divergence using the density-ratio trick [32, 33], popularized
recently through the generative adversarial network (GAN) framework [34]. The trick is to express
$p_x$ and $p_y$ as conditional distributions, conditioned on a label $c \in \{0, 1\}$, and reduce the task to
binary classification. In particular, let $p_x(x) = p(x|c = 1)$, $p_y(x) = p(x|c = 0)$, and consider a
discriminator $S_\eta$ trained to predict the probability that its input is a sample from $p_x$
rather than $p_y$, i.e., to predict $p(c = 1|x)$. The density ratio can be expressed as

$$\frac{p_x(x)}{p_y(x)} = \frac{p(x|c = 1)}{p(x|c = 0)} = \frac{p(c = 1|x)}{p(c = 0|x)} \approx \frac{S_\eta(x)}{1 - S_\eta(x)}, \tag{2}$$
where the second equality follows from Bayes' rule under the assumption that the marginal class
probabilities are equal. As such, given $N$ i.i.d. samples $\{x^{(i)}\}_{i=1}^{N}$ from $p_x$ and a trained classifier