
Recent Advances in Autoencoder-Based Representation Learning
Michael Tschannen
ETH Zurich
Olivier Bachem
Google AI, Brain Team
Mario Lucic
Google AI, Brain Team
Abstract
Learning useful representations with little or no supervision is a key challenge in
artificial intelligence. We provide an in-depth review of recent advances in repre-
sentation learning with a focus on autoencoder-based models. To organize these
results we make use of meta-priors believed useful for downstream tasks, such as
disentanglement and hierarchical organization of features. In particular, we un-
cover three main mechanisms to enforce such properties, namely (i) regularizing
the (approximate or aggregate) posterior distribution, (ii) factorizing the encod-
ing and decoding distribution, or (iii) introducing a structured prior distribution.
While there are some promising results, implicit or explicit supervision remains
a key enabler and all current methods use strong inductive biases and modeling
assumptions. Finally, we provide an analysis of autoencoder-based representation
learning through the lens of rate-distortion theory and identify a clear tradeoff be-
tween the amount of prior knowledge available about the downstream tasks, and
how useful the representation is for this task.
1 Introduction
The ability to learn useful representations of data with little or no supervision is a key challenge
towards applying artificial intelligence to the vast amounts of unlabelled data collected in the world.
While it is clear that the usefulness of a representation learned on data heavily depends on the end
task which it is to be used for, one could imagine that there exist properties of representations
which are useful for many real-world tasks simultaneously. In a seminal paper on representation
learning, Bengio et al. [1] proposed such a set of meta-priors. The meta-priors are derived from
general assumptions about the world such as the hierarchical organization or disentanglement of
explanatory factors, the possibility of semi-supervised learning, the concentration of data on low-
dimensional manifolds, clusterability, and temporal and spatial coherence.
Recently, a variety of (unsupervised) representation learning algorithms have been proposed based
on the idea of autoencoding where the goal is to learn a mapping from high-dimensional observa-
tions to a lower-dimensional representation space such that the original observations can be recon-
structed (approximately) from the lower-dimensional representation. While these approaches have
varying motivations and design choices, we argue that essentially all of the methods reviewed in this
paper implicitly or explicitly have at their core at least one of the meta-priors from Bengio et al. [1].
Given the unsupervised nature of the upstream representation learning task, the characteristics of
the meta-priors enforced in the representation learning step determine how useful the resulting rep-
resentation is for the real-world end task. Hence, it is critical to understand which meta-priors are
targeted by which models and which generic techniques are useful to enforce a given meta-prior. In
this paper, we provide a unified view which encompasses the majority of proposed models and relate
them to the meta-priors proposed by Bengio et al. [1]. We summarize the recent work focusing on
the meta-priors in Table 1.
Third workshop on Bayesian Deep Learning (NeurIPS 2018), Montréal, Canada.
arXiv:1812.05069v1 [cs.LG] 12 Dec 2018
Table 1: Grouping of methods according to the meta-priors for representation learning from [1].
While many methods directly or indirectly address multiple meta-priors, we only considered the
most prominent target of each method. Note that meta-priors such as low dimensionality and mani-
fold structure are enforced by essentially all methods.
Disentanglement: β-VAE (6) [2], FactorVAE (8) [3], β-TCVAE (9) [4], InfoVAE (9) [5], DIP-VAE (11) [6], HSIC-VAE (12) [7], HFVAE (13) [8], VIB [9], Information dropout (15) [10], DC-IGN [11], FaderNetworks (18) [12], VFAE (17) [13]
Hierarchical representation¹: PixelVAE [14], LVAE [15], VLaAE [16], Semi-supervised VAE [17], PixelGAN-AE [18], VLAE [19], VQ-VAE [20]
Semi-supervised learning: Semi-supervised VAE [17], [21], PixelGAN-AE (14) [18], AAE (16) [22]
Clustering: PixelGAN-AE (14) [18], AAE (16) [22], JointVAE [23], SVAE [24]
Meta-priors of Bengio et al. [1]. Meta-priors capture very general premises about the world and
are therefore arguably useful for a broad set of downstream tasks. We briefly summarize the most
important meta-priors which are targeted by the reviewed approaches.
1. Disentanglement: Assuming that the data is generated from independent factors of variation,
for example object orientation and lighting conditions in images of objects, disentanglement
as a meta-prior encourages these factors to be captured by different independent variables in
the representation. It should result in a concise abstract representation of the data useful for
a variety of downstream tasks and promises improved sample efficiency.
2. Hierarchical organization of explanatory factors: The intuition behind this meta-prior is
that the world can be described as a hierarchy of increasingly abstract concepts. For example,
natural images can be abstractly described in terms of the objects they show at various levels
of granularity. Given the object, a more concrete description can be given by object attributes.
3. Semi-supervised learning: The idea is to share a representation between a supervised and an
unsupervised learning task, which often leads to synergies: while the number of labeled data
points is usually too small to learn a good predictor (and thereby a representation) on its own,
training jointly with an unsupervised objective helps learn a representation that generalizes,
and the supervised task in turn guides the representation learning process.
4. Clustering structure: Many real-world data sets have multi-category structure (such as im-
ages showing different object categories), with possibly category-dependent factors of varia-
tion. Such structure can be captured with a latent mixture model where each mixture compo-
nent corresponds to one category and its distribution models the factors of variation within
that category (see the sketch after this list). This naturally leads to a representation with clustering structure.
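As a concrete illustration of the clustering meta-prior, the following minimal sketch (not taken from any of the reviewed papers; the sizes and parameter values are hypothetical placeholders) builds a Gaussian-mixture prior p(z) with torch.distributions, where each mixture component can be read as one category. Such a prior can stand in for the standard normal prior in the VAE objective discussed in Section 2.

```python
# Minimal sketch: a Gaussian-mixture prior p(z) whose components play the role of categories.
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

num_components, latent_dim = 10, 2                      # hypothetical sizes
weights = torch.ones(num_components) / num_components   # uniform mixture weights
means = torch.randn(num_components, latent_dim)         # one mean per category
scales = 0.5 * torch.ones(num_components, latent_dim)   # per-component standard deviations

prior = MixtureSameFamily(
    Categorical(probs=weights),
    Independent(Normal(means, scales), 1),               # diagonal-Gaussian components
)

z = torch.randn(4, latent_dim)                            # a batch of latent codes
print(prior.log_prob(z))                                  # log p(z), usable in place of N(0, I)
```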
Very generic concepts such as smoothness as well as temporal and spatial coherence are not specific
to unsupervised learning and are used in most practical setups (for example weight decay to encour-
age smoothness of predictors, and convolutional layers to capture spatial coherence in image data).
We discuss the implicit supervision used by most approaches in Section 7.
Mechanisms for enforcing meta-priors. We identify the following three mechanisms to enforce
meta-priors:
(i) Regularization of the encoding distribution (Section 3).
(ii) Choice of the encoding and decoding distribution or model family (Section 4).
(iii) Choice of a flexible prior distribution of the representation (Section 5).
For example, regularization of the encoding distribution is often used to encourage disentangled
representations. Alternatively, factorizing the encoding and decoding distribution in a hierarchical
fashion allows us to impose a hierarchical structure to the representation. Finally, a more flexible
prior, say a mixture distribution, can be used to encourage clusterability.
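To make mechanism (i) concrete, the following minimal sketch (a generic template rather than any specific method; the scalar values are placeholders for quantities estimated on a minibatch) shows how a regularizer R on the (approximate or aggregate) posterior is added to the VAE objective, as done, e.g., with a total-correlation penalty in FactorVAE [3] or a divergence between the aggregate posterior and the prior in InfoVAE [5].

```python
# Minimal sketch: mechanism (i) augments the VAE objective with a weighted regularizer.
import torch

def regularized_objective(recon, kl, regularizer, lam):
    # recon ~ E_q[-log p_theta(x|z)], kl ~ KL(q_phi(z|x) || p(z)), regularizer ~ R(q); all scalars.
    return recon + kl + lam * regularizer

# Placeholder scalars standing in for minibatch estimates of the three terms.
print(regularized_objective(torch.tensor(60.3), torch.tensor(12.1), torch.tensor(3.4), lam=4.0))
```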
¹ While PixelGAN-AE [18], VLAE [19], and VQ-VAE [20] do not explicitly model a hierarchy of latents,
they learn abstract representations capturing global structure of images [18, 19] and speech signals [20], hence
internally representing the data in a hierarchical fashion.
[Figure 1(a): schematic of the Variational Autoencoder (VAE) framework with encoder q_φ(z|x), decoder p_θ(x|z), and prior p(z) over the latent code z. Figure 1(b): samples from a trained VAE.]
Figure 1: Figure (a) illustrates the Variational Autoencoder (VAE) framework specified by the en-
coder, decoder, and the prior distribution on the latent (representation/code) space. The encoder
maps the input to the representation space (inference), while the decoder reconstructs the original
input from the representation. The encoder is encouraged to satisfy some structure on the latent
space (e.g., it should be disentangled). Figure (b) shows samples from a trained autoencoder with
latent space of 2 dimensions on the MNIST data set. Each point on the left corresponds to the
representation of a digit (originally in 784 dimensions) and the reconstructed digits can be seen on
the right. One can observe that in this case the latent representation is clustered (various styles of
the same digit are close w.r.t. the L2-distance, and within each group the position corresponds to the
rotation of the digit).
Before starting our overview, in Section 2 we present the main concepts necessary to understand
variational autoencoders (VAEs) [25, 26], underlying most of the methods considered in this pa-
per, and several techniques used to estimate divergences between probability distributions. We then
present a detailed discussion of regularization-based methods in Section 3, review methods rely-
ing on structured encoding and decoding distributions in Section 4, and present methods using a
structured prior distribution in Section 5. We conclude the review with an overview of related
methods such as cross-domain representation learning [27–29] in Section 6. Finally, we provide
a critique of unsupervised representation learning through the rate-distortion framework of Alemi
et al. [30] and discuss the implications in Section 7.
2 Preliminaries
We assume familiarity with the key concepts in Bayesian data modeling. For a gentle introduction
to VAEs we refer the reader to [31]. VAEs [25, 26] aim to learn a parametric latent variable model
by maximizing the marginal log-likelihood of the training data $\{x^{(i)}\}_{i=1}^{N}$. By introducing an approximate
posterior $q_\phi(z|x)$, which is an approximation of the intractable true posterior $p_\theta(z|x)$, we
can rewrite the negative log-likelihood as

$$\mathbb{E}_{\hat{p}(x)}[-\log p_\theta(x)] = \mathcal{L}_{\mathrm{VAE}}(\theta, \phi) - \mathbb{E}_{\hat{p}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))\big]$$

where

$$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \mathbb{E}_{\hat{p}(x)}\big[\mathbb{E}_{q_\phi(z|x)}[-\log p_\theta(x|z)]\big] + \mathbb{E}_{\hat{p}(x)}\big[D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z))\big], \tag{1}$$

[Figure 2(a): GAN schematic with a generator mapping noise to samples and a discriminator classifying real (c = 1) vs. generated (c = 0) samples from p_x and p_y. Figure 2(b): MMD schematic with a feature mapping φ and the distance MMD(p_x, p_y) between the embedded distributions p_x and p_y.]
Figure 2: Adversarial density ratio estimation vs MMD. Figure (a): GANs use adversarial density
ratio estimation to train a generative model, which can be seen as a two-player game: The discrimi-
nator tries to predict whether samples are real or generated, while the generator tries to deceive the
discriminator by mimicking the distribution of the real samples. Figure (b): The MMD corresponds
to the distance between mean feature embeddings.
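Figure 2(b) describes the MMD as the distance between mean feature embeddings. The following minimal sketch (an illustration with an RBF kernel and synthetic samples, not code from the paper) computes a simple biased empirical estimate of the squared MMD between two sample sets.

```python
# Minimal sketch: biased empirical estimate of MMD^2 with an RBF kernel.
import torch

def rbf_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of a and b
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    # MMD^2(p_x, p_y) ~ mean k(x, x') + mean k(y, y') - 2 mean k(x, y)
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

x, y = torch.randn(128, 2), torch.randn(128, 2) + 1.0  # samples from two toy distributions
print(mmd_squared(x, y))
```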
Here, $\mathbb{E}_{\hat{p}(x)}[f(x)] = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$ denotes the expectation of the function $f(x)$ w.r.t. the empirical data
distribution. The approach is illustrated in Figure 1. The first term in (1) measures the reconstruction
error and the second term quantifies how well $q_\phi(z|x)$ matches the prior $p(z)$. The structure of the
latent space heavily depends on this prior. As the KL divergence is non-negative, $-\mathcal{L}_{\mathrm{VAE}}$ lower-bounds
the marginal log-likelihood $\mathbb{E}_{\hat{p}(x)}[\log p_\theta(x)]$ and is accordingly called the evidence lower
bound (ELBO).
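For concreteness, the following minimal PyTorch-style sketch estimates (1) on a minibatch. It is an illustration rather than a reference implementation: the `encoder` and `decoder` networks, the data dimensions, and the choice of a Gaussian encoder with a Bernoulli decoder and standard normal prior are all assumptions.

```python
# Minimal sketch: a minibatch estimate of L_VAE in (1).
import torch
import torch.nn as nn
from torch.distributions import Bernoulli, Independent, Normal, kl_divergence

latent_dim, data_dim = 8, 784                      # hypothetical sizes (e.g., MNIST-like data)
encoder = nn.Linear(data_dim, 2 * latent_dim)      # hypothetical encoder producing (mu, log sigma)
decoder = nn.Linear(latent_dim, data_dim)          # hypothetical decoder producing Bernoulli logits

def vae_loss(x):
    mu, log_sigma = encoder(x).chunk(2, dim=-1)
    q_z_given_x = Independent(Normal(mu, log_sigma.exp()), 1)                # q_phi(z|x)
    p_z = Independent(Normal(torch.zeros_like(mu), torch.ones_like(mu)), 1)  # p(z) = N(0, I)
    z = q_z_given_x.rsample()                       # sample via the reparametrization trick
    p_x_given_z = Independent(Bernoulli(logits=decoder(z)), 1)               # p_theta(x|z)
    recon = -p_x_given_z.log_prob(x).mean()         # Monte Carlo estimate of E_q[-log p_theta(x|z)]
    kl = kl_divergence(q_z_given_x, p_z).mean()     # KL(q_phi(z|x) || p(z)), closed form here
    return recon + kl                               # estimate of L_VAE (the negative ELBO)

x = torch.rand(16, data_dim).round()                # a fake binary minibatch
print(vae_loss(x))
```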
There are several design choices available: (1) the prior distribution on the latent space, $p(z)$, (2)
the family of approximate posterior distributions, $q_\phi(z|x)$, and (3) the decoder distribution, $p_\theta(x|z)$.
Ideally, the approximate posterior should be flexible enough to match the intractable true posterior
$p_\theta(z|x)$. As we will see later, there are many available options for these design choices, leading to
various trade-offs in terms of the learned representation.
In practice, the first term in (1) can be estimated from samples $z^{(i)} \sim q_\phi(z|x^{(i)})$ and gradients are
backpropagated through the sampling operation using the reparametrization trick [25, Section 2.3],
enabling minimization of (1) via minibatch stochastic gradient descent (SGD). Depending on the
choice of $q_\phi(z|x)$, the second term can either be computed in closed form or estimated from samples.
For the usual choice of $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \operatorname{diag}(\sigma_\phi(x)))$, where $\mu_\phi(x)$ and $\sigma_\phi(x)$ are
deterministic functions parametrized as neural networks, and $p(z) = \mathcal{N}(0, I)$, the KL term
in (1) can be computed in closed form (more complicated choices of $p(z)$ rarely allow closed-form
computation). To this end, we will briefly discuss two ways in which one can measure distances
between distributions. We will focus on the intuition behind these techniques and provide pointers to
detailed expositions.
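The reparametrization trick and the closed-form KL term for this choice can be sketched as follows (an illustration only; random placeholder tensors stand in for μ_φ(x) and log σ_φ(x)² computed by an encoder network).

```python
# Minimal sketch: reparametrization trick and closed-form KL for a diagonal-Gaussian encoder.
import torch

mu = torch.randn(32, 10, requires_grad=True)        # placeholder for mu_phi(x) on a batch
log_var = torch.randn(32, 10, requires_grad=True)   # placeholder for log sigma_phi(x)^2

# Reparametrization: z = mu + sigma * eps with eps ~ N(0, I), so gradients flow to mu and log_var.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
kl = 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var, dim=1)
print(z.shape, kl.mean())
```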
Adversarial density-ratio estimation. Given a convex function $f$ for which $f(1) = 0$, the $f$-divergence
between $p_x$ and $p_y$ is defined as

$$D_f(p_x \,\|\, p_y) = \int f\!\left(\frac{p_x(x)}{p_y(x)}\right) p_y(x)\,dx.$$

For example, the choice $f(t) = t \log t$ corresponds to $D_f(p_x \,\|\, p_y) = D_{\mathrm{KL}}(p_x \,\|\, p_y)$. Given samples
from $p_x$ and $p_y$ we can estimate the $f$-divergence using the density-ratio trick [32, 33], popularized
recently through the generative adversarial network (GAN) framework [34]. The trick is to express
$p_x$ and $p_y$ as conditional distributions, conditioned on a label $c \in \{0, 1\}$, and reduce the task to
binary classification. In particular, let $p_x(x) = p(x|c = 1)$, $p_y(x) = p(x|c = 0)$, and consider a
discriminator $S_\eta$ trained to predict the probability that its input is a sample from $p_x$
rather than $p_y$, i.e., to predict $p(c = 1|x)$. The density ratio can be expressed as

$$\frac{p_x(x)}{p_y(x)} = \frac{p(x|c = 1)}{p(x|c = 0)} = \frac{p(c = 1|x)}{p(c = 0|x)} \approx \frac{S_\eta(x)}{1 - S_\eta(x)}, \tag{2}$$
where the second equality follows from Bayes' rule under the assumption that the marginal class
probabilities are equal. As such, given $N$ i.i.d. samples $\{x^{(i)}\}_{i=1}^{N}$ from $p_x$ and a trained classifier