Visual Text-to-Speech with Diffusion
Huadai Liu1∗, Rongjie Huang1∗, Xuan Lin2∗, Wenqiang Xu2 , Maozong Zheng2 , Hong Chen2 ,
Jinzheng He1 , Zhou Zhao1†
Zhejiang University1 , Ant Group2
{liuhuadai,rongjiehuang,jinzhenghe,zhaozhou}@zju.edu.cn
{daxuan.lx,yugong.xwq,zhengmaozong.zmz,wuyi.ch}@antgroup.com
text fusion to integrate visual and textual information, which provides fine-grained language-visual reasoning by attending to regions of the image; 2) leverage the transformer architecture to promote the scalability of the diffusion model. Regarding the data-shortage challenge, we pre-train the encoder and decoder in a self-supervised manner, showing that large-scale pre-training reduces the data requirements for training visual TTS models.

Experimental results demonstrate that ViT-TTS generates speech samples with accurate reverberation effects in target scenarios, achieving new state-of-the-art results in terms of perceptual quality. In addition, we investigate the scalability of ViT-TTS and its performance under low-resource conditions (1h/2h/5h). The main contributions of this work are summarized as follows:

• We propose the first visual text-to-speech model, ViT-TTS, with visual-text fusion, which enables the generation of high-perceived audio that matches the physical environment.

• We show that large-scale pre-training alleviates the data scarcity in training visual TTS models.

• We introduce a diffusion transformer that is scalable in terms of parameters and capacity to learn visual scene information.

• Experimental results on subjective and objective evaluations demonstrate state-of-the-art perceptual quality. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines.

2 Related Work

… in parallel. More recently, Grad-TTS (Popov et al., 2021), DiffSpeech (MoonInTheRiver, 2021), and ProDiff (Huang et al., 2022c) have employed diffusion generative models to generate high-quality audio, but they all rely on convolutional architectures such as WaveNet (Oord et al., 2016) and U-Net (Ronneberger et al., 2015) as the backbone. In contrast, some studies (Peebles and Xie, 2023; Bao et al., 2023) on image generation tasks have explored transformers (Vaswani et al., 2017) as an alternative to convolutional architectures, achieving results competitive with U-Net. In this paper, we present the first transformer-based diffusion model for this task as an alternative to the convolutional architecture. By harnessing the scalable properties of transformers, we enhance the model capacity to more effectively capture visual scene information and improve model performance.

2.2 Self-supervised Pre-training

There are two main criteria for optimizing speech pre-training: contrastive loss (Oord et al., 2018; Chung and Glass, 2020; Baevski et al., 2020) and masked prediction loss (Devlin et al., 2018). Contrastive loss is used to distinguish between positive and negative samples with respect to a reference sample, while masked prediction loss was originally proposed for natural language processing (Devlin et al., 2018; Lewis et al., 2019) and later applied to speech processing (Baevski et al., 2020; Hsu et al., 2021). Some recent work (Chung et al., 2021) has combined the two approaches, achieving good performance on downstream automatic speech recognition (ASR) tasks. In this work, we leverage the success of self-supervised learning to enhance both the encoder and the decoder and alleviate the data scarcity issue.
reverberation, thus simulating the reverberation of the target space or processing algorithm (Koo et al., 2021; Sarroff and Michaels, 2020). Recently, there has been research on visual acoustic matching (Chen et al., 2022), which involves generating audio recorded in the target environment based on an input source audio clip and an image of the target environment. However, our proposed visual TTS is distinct from the work mentioned above, as it aims to generate audio that captures the room acoustics of the target environment based on the written text and the target environment image.

3 Method

3.1 Overview

The overall architecture is presented in Figure 1. To alleviate the issue of data scarcity, we leverage unlabeled data to pre-train the visual-text encoder and the denoiser decoder with scalable transformers in a self-supervised manner. To capture visual scene information, we employ a visual-text fusion module to reason about how different image patches contribute to the text. As a neural vocoder, BigvGAN (Lee et al., 2022) converts the mel-spectrograms into audio that matches the target scene.

3.2 Enhanced Visual-Text Encoder

Self-supervised Pre-training The advent of the masked language model (Devlin et al., 2018; Clark et al., 2020) has marked a significant milestone in the field of natural language processing. To alleviate the data scarcity issue (Huang et al., 2022d; Liu et al., 2023; Huang et al., 2023c) and learn a robust contextual encoder, we adopt a BERT-like masking strategy in the pre-training stage. Specifically, we randomly mask 15% of each phoneme sequence and predict those masked tokens rather than reconstructing the entire input. The masked phoneme sequence is then fed into the text encoder to obtain hidden states. The final hidden states are fed into a linear projection layer over the vocabulary to obtain the predicted tokens. Finally, we calculate the cross-entropy loss between the predicted tokens and the target tokens.

The mask token used during the pre-training phase does not appear in the fine-tuning phase. To mitigate this mismatch between pre-training and fine-tuning, we randomize how the selected phonemes are corrupted: 1) with 80% probability we replace them with the mask token; 2) with 10% probability we keep the phoneme unchanged; and 3) with 10% probability we replace it with a random token from the dictionary.
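This is the BERT-style 80/10/10 corruption rule applied to phoneme IDs. A minimal sketch of how such a corruption step could be implemented is shown below; the vocabulary size follows the 73 phonemes stated later, while the mask ID and the helper's name are illustrative assumptions rather than the released implementation.

```python
import torch

def mask_phonemes(phonemes: torch.Tensor, vocab_size: int = 73,
                  mask_id: int = 1, mask_prob: float = 0.15):
    """BERT-style corruption of a batch of phoneme IDs with shape (B, T).

    Returns the corrupted sequence and the prediction targets,
    where -100 marks positions excluded from the loss.
    """
    targets = phonemes.clone()
    # choose 15% of positions as prediction targets
    selected = torch.rand_like(phonemes, dtype=torch.float) < mask_prob
    targets[~selected] = -100

    corrupted = phonemes.clone()
    roll = torch.rand_like(phonemes, dtype=torch.float)
    # 80% of selected positions -> mask token
    corrupted[selected & (roll < 0.8)] = mask_id
    # 10% -> random token from the dictionary, remaining 10% stay unchanged
    random_ids = torch.randint_like(phonemes, 0, vocab_size)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[use_random] = random_ids[use_random]
    return corrupted, targets

# training step (sketch): logits = encoder(corrupted) with shape (B, T, vocab_size)
# loss = torch.nn.functional.cross_entropy(
#     logits.transpose(1, 2), targets, ignore_index=-100)
```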
Visual-Text Fusion In the fine-tuning stage, we integrate the visual modality into the encoder to fuse visual and textual information. Before feeding into the visual-text encoder, we first extract image features of the panoramic images through ResNet18 (Oord et al., 2018) and obtain phoneme embeddings. Both the image features and the phoneme embeddings are fed into a variant of the transformer to obtain the hidden sequences.
Specifically, we first pass the phonemes through relative self-attention, which is defined as follows:

$$\alpha(i, j) = \mathrm{Softmax}\Big(\frac{(Q_i W^Q)(K_j W^K + R_{ij})^T}{\sqrt{d_k}}\Big) \qquad (1)$$

where $n$ is the length of the phoneme embedding, $R_{ij}$ is the relative position embedding of key and value, $d_k$ is the dimension of the key, and $Q$, $K$, $V$ are all the phoneme embeddings. We use relative self-attention to model how much phoneme $p_i$ attends to phoneme $p_j$. After that, we choose cross-attention instead of a simplistic concatenation approach, as it lets us reason about how different image patches contribute to the text after feature extraction. The equation is defined as follows:

$$\delta(V, P) = \mathrm{Softmax}\Big(\frac{P V^T}{\sqrt{d_v}}\Big) V \qquad (2)$$

where $P$ is the phoneme embedding, $V$ is the visual features, and $d_v$ is the dimension of the vision features. Finally, a feed-forward layer is applied to output the hidden sequence.
3.3 Enhanced Diffusion Transformer

Scalable Transformer As a rapidly growing category of generative models, DDPMs have demonstrated their exceptional ability to deliver top-notch results in both image (Zhang and Agrawala, 2023; Ho and Salimans, 2022) and audio synthesis (Huang et al., 2022c, 2023a; Lam et al., 2021). However, the dominant diffusion TTS models adopt a convolutional architecture such as WaveNet or U-Net as the de facto choice of backbone. This architectural choice limits the model's scalability and its ability to effectively incorporate panoramic visual images. Recent research (Peebles and Xie, 2023; Bao et al., 2023) in the image synthesis field has revealed that the inductive bias of convolutional structures is not a critical determinant of DDPMs' performance. Instead, transformers have emerged as a viable alternative.

For this reason, we propose a diffusion transformer that leverages the scalability of transformers to expand model capacity and incorporate room acoustic information. Moreover, we leverage the adaptive normalization layers used in GANs and initialize the full transformer block as the identity function to enhance the transformer architecture.

Unconditional Pre-training In this part, we investigate self-supervised learning from orders of magnitude more unlabeled mel-spectrogram data to alleviate data scarcity. Specifically, assuming the target mel-spectrogram is $x_0$, we first randomly select 0.065% of $x_0$ as starting indices and apply a mask that spans 10 steps, following wav2vec 2.0 (Baevski et al., 2020). Then, we obtain $x_t$ through a diffusion process, which is defined by a fixed Markov chain from data $x_0$ to the latent variable $x_t$:

$$q(x_1, \cdots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (3)$$

At each diffusion step $t \in [1, T]$, a tiny Gaussian noise is added to $x_{t-1}$ to obtain $x_t$, according to a small positive constant $\beta_t$:

$$q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) \qquad (4)$$

The $x_t$ obtained from the diffusion process is passed through the transformer to predict the Gaussian noise $\epsilon_\theta$. The loss is defined as the mean squared error in the $\epsilon$ space, and efficient training optimizes a random term of $t$ with stochastic gradient descent:

$$\mathcal{L}_\theta^{Grad} = \Big\| \epsilon_\theta\big(\alpha_t x_0 + \sqrt{1-\alpha_t^2}\,\epsilon\big) - \epsilon \Big\|_2^2, \quad \epsilon \sim \mathcal{N}(0, I) \qquad (5)$$

To this end, ViT-TTS takes advantage of the reconstruction loss to predict the self-supervised representations, which largely alleviates the challenge of data scarcity. A detailed formulation of the DDPM is attached in Appendix C.
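The forward process in Eqs. (3)-(4) collapses into a single closed-form sample, which is what makes the $\epsilon$-prediction objective in Eq. (5) cheap to train. A minimal sketch in the paper's notation, where $\alpha_t$ is the cumulative product of $\sqrt{1-\beta_i}$; the denoiser call signature is an assumption.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.06, T)                     # linear schedule used in the paper
alphas = torch.cumprod(torch.sqrt(1.0 - betas), dim=0)    # alpha_t = prod_i sqrt(1 - beta_i)

def diffuse(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Closed-form q(x_t | x_0): alpha_t * x0 + sqrt(1 - alpha_t^2) * eps."""
    a = alphas[t].view(-1, 1, 1)                           # broadcast over (B, n_mels, frames)
    return a * x0 + torch.sqrt(1.0 - a ** 2) * eps

def training_loss(denoiser, x0):
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = diffuse(x0, t, eps)
    eps_pred = denoiser(x_t, t)                            # transformer predicts the added noise
    return torch.mean((eps_pred - eps) ** 2)               # MSE in epsilon space, Eq. (5)
```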
Controllable Fine-tuning During the fine-tuning stage, we face the following challenges: (1) there is a data scarcity issue with the available panoramic images and target environmental audio for training; (2) a fast training method is equally crucial for optimizing the diffusion model, as it can save a significant amount of time and storage space. To address these challenges, we draw inspiration from Zhang and Agrawala (2023) and implement a swift fine-tuning technique. Specifically, we create two copies of the pre-trained diffusion model weights, namely a "trainable copy" and a "locked copy," to learn the input conditions. We fix all parameters of the pre-trained transformer, designated as Θ, and duplicate them into trainable parameters Θt. We train these trainable parameters and connect them with the "locked copy" via zero-convolution layers. These convolution layers are unique in that they have a kernel size of one by one and weights and biases initialized to zero, progressively growing from zeros to optimized parameters in a learned fashion.
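A minimal sketch of the locked-copy / trainable-copy wiring with zero-initialized 1x1 convolutions, in the spirit of ControlNet (Zhang and Agrawala, 2023). The module shapes and the way the condition is injected are illustrative assumptions rather than the paper's exact fine-tuning graph.

```python
import copy
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv1d:
    """1x1 convolution whose weights and bias start at zero."""
    conv = nn.Conv1d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControllableDenoiser(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block                       # frozen pre-trained weights (Theta)
        for p in self.locked.parameters():
            p.requires_grad = False
        self.trainable = copy.deepcopy(pretrained_block)     # trainable copy (Theta_t)
        self.zero_in = zero_conv(channels)                   # condition enters through zeros
        self.zero_out = zero_conv(channels)                  # correction leaves through zeros

    def forward(self, x, condition):
        # x, condition: (B, channels, frames)
        base = self.locked(x)
        correction = self.trainable(x + self.zero_in(condition))
        # at initialization both zero convs output 0, so the model equals the locked copy
        return base + self.zero_out(correction)
```

Because the zero convolutions start at exactly zero, the fine-tuned model initially reproduces the pre-trained unconditional denoiser and only gradually learns to use the visual condition.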
3.4 Architecture

As illustrated in Figure 1, our model comprises a visual-text encoder, a variance adaptor, and a spectrogram denoiser. The visual-text encoder converts phoneme embeddings and visual features into hidden sequences, while the variance adaptor predicts the duration of each hidden state to regulate the length of the hidden sequences so that it matches that of the speech frames. Furthermore, different variances like pitch and speaker embedding are incorporated into the hidden sequences following FastSpeech 2 (Ren et al., 2022). Finally, the spectrogram denoiser iteratively refines the length-regulated hidden states into mel-spectrograms. We put more details in Appendix B.

Visual-Text Encoder The visual-text encoder consists of relative-position transformer blocks based on the transformer architecture. Specifically, it comprises a pre-net for the phoneme embedding, a visual feature extractor for the image, and a transformer encoder that includes multi-head self-attention, multi-head cross-attention, and a feed-forward layer.

Variance Adaptor In the variance adaptor, the duration and pitch predictors share a similar model structure consisting of a 2-layer 1D convolutional network with ReLU activation, each layer followed by layer normalization and a dropout layer, and an extra linear layer to project the hidden states into the output sequence.
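A minimal sketch of that shared duration/pitch predictor structure (two Conv1d + ReLU + LayerNorm + Dropout blocks and a final linear projection); the kernel size, filter size, and dropout follow Table 6 in the appendix, while the padding choice is an assumption.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """2-layer 1D-conv predictor used for both duration and pitch."""

    def __init__(self, hidden: int = 256, filter_size: int = 256,
                 kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, filter_size, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(filter_size, filter_size, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(filter_size)
        self.norm2 = nn.LayerNorm(filter_size)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(filter_size, 1)    # one scalar (duration or pitch) per position

    def forward(self, x):                        # x: (B, T, hidden)
        y = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm1(y))
        y = torch.relu(self.conv2(y.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm2(y))
        return self.proj(y).squeeze(-1)          # (B, T)
```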
Spectrogram Denoiser The spectrogram denoiser takes $x_t$ as input and predicts the noise $\epsilon$ added in the diffusion process, conditioned on the step embedding $E_t$ and the encoder output. We adopt a variant of the transformer as our backbone and make some improvements over the standard transformer motivated by Peebles and Xie (2023), mainly including: (1) we explore replacing the standard layer-norm layers in transformer blocks with adaptive layer norm (adaLN), which regresses scale and shift parameters from the sum of the embedding vector of $t$ and the hidden sequence; (2) inspired by ResNets (Oord et al., 2018), we initialize the transformer block as the identity function and initialize the MLP to output the zero vector.
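A minimal sketch of an adaLN transformer block in the style of Peebles and Xie (2023): a small linear layer regresses per-channel shift, scale, and gate values from the conditioning vector, and its zero initialization makes the whole block start as the identity. The exact layer layout is an illustrative assumption rather than the paper's released block.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int = 384, heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # regress shift/scale/gate for both sub-layers from the conditioning vector
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)          # zero init -> block starts as the identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):
        # x: (B, T, dim); cond: (B, dim), e.g. step embedding plus pooled encoder output
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)
```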
3.5 Pre-training, Fine-tuning, and Inference Procedures

Pre-training The pre-training has two stages: 1) encoder stage: we pre-train the visual-text encoder via the masked LM loss $\mathcal{L}_{CE}$ (i.e., cross-entropy loss) to predict the masked tokens; 2) decoder stage: the masked $x_0$ is put into the denoiser to predict the Gaussian noise $\epsilon_\theta$, and the mean squared error (MSE) loss is then applied between the predicted and target Gaussian noise.

Fine-tuning We begin by loading the model weights from the pre-trained visual-text encoder and the unconditional diffusion decoder, after which we fine-tune both of them until the model converges. The final loss term consists of the following parts: (1) sample reconstruction loss $\mathcal{L}_\theta$: the MSE between the predicted and target Gaussian noise; (2) variance reconstruction losses $\mathcal{L}_{dur}$ and $\mathcal{L}_p$: the MSE between the predicted and target phoneme-level duration and pitch.

Inference During inference, the DDPM iteratively runs the reverse process to obtain the data sample $x_0$, and we then use a pre-trained BigvGAN-16kHz-80band as the vocoder to transform the generated mel-spectrograms into waveforms.

4 Experiment

4.1 Experimental Setup

Dataset We use the SoundSpaces-Speech dataset (Chen et al., 2023), which is constructed on the SoundSpaces platform from real-world 3D scans to obtain environmental audio. The dataset includes 28,853/1,441/1,489 samples for training/validation/testing, each consisting of clean text, reverberant audio, and panoramic camera-angle images. Following Chen et al. (2022), we remove out-of-view samples and divide the test set into test-unseen and test-seen, where the unseen set injects room acoustics depicted in novel images while the seen set only contains scenes seen in the training stage. We convert the text sequence into a phoneme sequence with an open-source grapheme-to-phoneme conversion tool (Sun et al., 2019)³.

Following common practice (Ren et al., 2019; MoonInTheRiver, 2021), we preprocess the speech and text data as follows: 1) extract the spectrogram with an FFT size of 1024, hop size of 256, and window size of 1024 samples; 2) convert it to a mel-spectrogram with 80 frequency bins; and 3) extract F0 (fundamental frequency) from the raw waveform using the Parselmouth tool⁴.

³https://github.com/Kyubyong/g2p
⁴https://github.com/YannickJadoul/Parselmouth
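A sketch of that preprocessing recipe with common open-source tools (librosa for the mel-spectrogram, Parselmouth for F0). The log compression and clipping constants are assumptions, since the authors' exact normalization is not specified here.

```python
import librosa
import numpy as np
import parselmouth

def extract_features(wav_path: str, sr: int = 16000):
    wav, _ = librosa.load(wav_path, sr=sr)
    # 1) + 2): spectrogram with FFT/hop/window of 1024/256/1024, mapped to 80 mel bins
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    # 3): F0 from the raw waveform via Parselmouth (Praat), one value per hop
    pitch = parselmouth.Sound(wav, sampling_frequency=sr).to_pitch(time_step=256 / sr)
    f0 = pitch.selected_array["frequency"]       # 0 where the frame is unvoiced
    return log_mel, f0
```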
| Method | Test-Seen MOS (↑) | Test-Seen RTE (↓) | Test-Seen MCD (↓) | Test-Unseen MOS (↑) | Test-Unseen RTE (↓) | Test-Unseen MCD (↓) | Params |
|---|---|---|---|---|---|---|---|
| GT | 4.34±0.07 | / | / | 4.24±0.07 | / | / | / |
| GT (voc.) | 4.18±0.05 | 0.006 | 1.46 | 4.19±0.07 | 0.008 | 1.50 | / |
| WaveNet | 3.85±0.09 | 0.091 | 4.61 | 3.78±0.12 | 0.110 | 4.69 | 42.3M |
| Transformer-S | 3.92±0.07 | 0.068 | 4.57 | 3.80±0.06 | 0.077 | 4.68 | 32.38M |
| Transformer-B | 3.98±0.06 | 0.061 | 4.53 | 3.90±0.07 | 0.066 | 4.62 | 41.36M |
| Transformer-L | 4.02±0.08 | 0.056 | 4.37 | 3.95±0.07 | 0.061 | 4.50 | 56.96M |
| Transformer-XL | 4.05±0.07 | 0.047 | 4.35 | 4.00±0.05 | 0.053 | 4.39 | 115.12M |

Table 1: Comparison between the diffusion WaveNet and diffusion transformers sweeping over model configs (S, B, L, XL). All models remove the pre-training stage, and other conditions not related to the backbone remain the same in training and inference.
Model Configurations The size of the phoneme vocabulary is 73. The dimension of the phoneme embeddings and the hidden size of the visual-text transformer block are both 256. We use a pre-trained ResNet18 as the image feature extractor. As for the pitch encoder, the size of the lookup table and of the encoded pitch embedding are set to 300 and 256, respectively. In the denoiser, the number of Transformer-B layers is 5, with a hidden size of 384 and 12 attention heads. We initialize each transformer block as the identity function and set $T$ to 100 and $\beta$ to constants increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.06$. We have attached more detailed information on the model configuration in Appendix B.
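For reference, the stated configuration can be gathered in one place as follows; this is only a summary of the numbers above, and the variable names are illustrative.

```python
import torch.nn as nn

phoneme_embedding = nn.Embedding(num_embeddings=73, embedding_dim=256)   # phoneme vocab of 73
pitch_embedding   = nn.Embedding(num_embeddings=300, embedding_dim=256)  # pitch lookup table

denoiser_cfg = dict(layers=5, hidden=384, heads=12)           # Transformer-B denoiser
diffusion_cfg = dict(T=100, beta_start=1e-4, beta_end=0.06)   # linear beta schedule
```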
Pre-training, Fine-tuning, and Inference During the pre-training stage, we pre-train the encoder for 120k steps and the decoder for 160k steps until convergence. The diffusion probabilistic models have been trained on 1 NVIDIA A100 GPU with a batch size of 48 sentences. In the inference stage, we uniformly use a pre-trained BigvGAN-16kHz-80band (Lee et al., 2022) as the vocoder to transform the generated mel-spectrograms into waveforms.

4.2 Scalable Diffusion Transformer

We compare and examine diffusion transformers sweeping over the model configs (S, B, L, XL), and conduct evaluations in terms of audio quality and number of parameters. Appendix A gives the details of the model configs. The results are shown in Table 1, and we make the following observations: (1) Increasing the depth and the number of layers of the transformer significantly enhances the performance of the diffusion model, resulting in an improvement in both objective and subjective metrics, which demonstrates that expanding the model size enables finer-grained room acoustic modeling. (2) Our proposed diffusion transformer outperforms the WaveNet backbone under similar parameter counts across both the test-unseen and test-seen sets, most significantly on the RT60 metric. We attribute this to the fact that, instead of directly concatenating the condition input as WaveNet does, we replace the standard layer-norm layers in transformer blocks with adaptive layer norm to regress dimension-wise scale and shift parameters from the sum of the embedding vectors of the diffusion step and the encoder output, which better incorporates the conditional information, as proven in GANs (Brock et al., 2018; Karras et al., 2019).

4.3 Model Performances

In this study, we conduct a comprehensive comparison of the generated audio quality with other systems, including 1) GT, the ground-truth audio; 2) GT (voc.), where we first convert the ground-truth audio into mel-spectrograms and then convert them back to audio using BigvGAN; 3) DiffSpeech (MoonInTheRiver, 2021), one of the most popular DDPMs based on WaveNet; 4) ProDiff (Huang et al., 2022c), a recent generator-based diffusion model proposed to reduce the sampling time; 5) Visual-DiffSpeech, which incorporates our visual-text fusion module into DiffSpeech; 6) Cascaded, the system composed of DiffSpeech and Visual Acoustic Matching (VAM) (Chen et al., 2022). The results, compiled and presented in Table 2, provide valuable insights into the effectiveness of our approach:
| Method | Test-Seen MOS (↑) | Test-Seen RTE (↓) | Test-Seen MCD (↓) | Test-Unseen MOS (↑) | Test-Unseen RTE (↓) | Test-Unseen MCD (↓) | Params |
|---|---|---|---|---|---|---|---|
| GT | 4.34±0.07 | / | / | 4.24±0.07 | / | / | / |
| GT (voc.) | 4.18±0.05 | 0.006 | 1.46 | 4.19±0.07 | 0.008 | 1.50 | / |
| DiffSpeech | 3.79±0.08 | 0.104 | 4.65 | 3.67±0.05 | 0.120 | 4.71 | 29.9M |
| ProDiff | 3.76±0.13 | 0.121 | 4.67 | 3.65±0.06 | 0.137 | 4.72 | 29.9M |
| Visual-DiffSpeech | 3.85±0.09 | 0.091 | 4.61 | 3.78±0.12 | 0.110 | 4.69 | 42.3M |
| Cascaded | 3.61±0.08 | 0.071 | 5.13 | 3.59±0.08 | 0.082 | 5.25 | 146.5M |
| ViT-TTS | 3.95±0.06 | 0.066 | 4.52 | 3.86±0.05 | 0.076 | 4.59 | 41.3M |

Table 2: Comparison with baselines on SoundSpaces-Speech for the Seen and Unseen scenarios. The diffusion step of all diffusion models is set to 100. We use the pre-trained model provided by VAM for the evaluation of Cascaded.
(1) As expected, the results on the test-unseen set are poorer than on the test-seen set, because the test-unseen set contains scenes that were not visible during training. However, our proposed model achieves the best performance compared with the baseline systems on both sets, indicating that it generates the best-perceived audio that matches the target environment from written text. (2) Our model surpasses the TTS diffusion models (i.e., DiffSpeech and ProDiff) across all metric scores, especially in terms of RTE values. This suggests that conventional diffusion models in TTS do poorly at modeling room acoustic information, as they mainly focus on audio content, pitch, energy, etc. Our proposed visual-text fusion module addresses this challenge by injecting visual properties into the model, resulting in more accurate prediction of the correct acoustics from images and high-perceived audio synthesis. (3) The comparison with Visual-DiffSpeech highlights the advantages of our choice of transformer and self-supervised pre-training. Although Visual-DiffSpeech adds the visual-text module, the choice of WaveNet and the lack of a self-supervised pre-training strategy make it perform worse in predicting the correct acoustics from images and synthesizing high-perceived audio. (4) The cascaded system composed of DiffSpeech and Visual Acoustic Matching models visual properties better than the other baselines. However, compared to our proposed model, it performs worse in both the test-unseen and test-seen environments. This suggests that our direct visual text-to-speech system eliminates the error propagation caused by the cascaded manner, resulting in high-perceived audio. In conclusion, our comprehensive evaluation results demonstrate the effectiveness of our proposed model in generating high-quality audio that matches the target environment.

4.4 Low Resource Evaluation

… data (1h/2h/5h) and leverage large-scale text-only and audio-only data to boost the performance of the visual TTS system, to investigate the effectiveness of our self-supervised learning methods. The results are compiled and presented in Table 3, and we have the following observations: 1) As the training data is reduced in the low-resource scenario, a distinct degradation in generated audio quality can be witnessed on both test sets (test-seen and test-unseen). 2) Leveraging orders of magnitude more text-only and audio-only data with self-supervised learning, ViT-TTS achieves RTE scores of 0.082 and 0.068 on test-unseen and test-seen respectively, showing a significant improvement even for unseen scenes. In this way, the dependence on a large amount of parallel audio-visual data can be reduced when constructing visual text-to-speech systems.

| Method | MOS (↑) | RTE (↓) | MCD (↓) |
|---|---|---|---|
| Finetune with 1 hour data | | | |
| Test-Seen | 3.72±0.05 | 0.092 | 5.04 |
| Test-Unseen | 3.67±0.06 | 0.101 | 5.11 |
| Finetune with 2 hours data | | | |
| Test-Seen | 3.75±0.06 | 0.089 | 4.85 |
| Test-Unseen | 3.70±0.07 | 0.097 | 4.89 |
| Finetune with 5 hours data | | | |
| Test-Seen | 3.83±0.05 | 0.068 | 4.65 |
| Test-Unseen | 3.73±0.09 | 0.082 | 4.72 |

Table 3: Low resource evaluation results.

… Mel-spectrograms produced by ViT-TTS are noticeably more similar to the target counterpart. 2) Moreover, in challenging scenarios with unseen scene images, cascaded systems suffer severely from missing noise and reverb details, which is largely alleviated in ViT-TTS.

4.6 Ablation Studies

We conduct ablation studies to demonstrate the effectiveness of several key techniques in our model on the Test-Unseen set, including encoder pre-training (EP), decoder pre-training (DP), the visual input, random images, and the concat fusion function. The results of both the subjective and objective evaluations are presented in Table 4, and we have the following observations: 1) Removing the self-supervised encoder and decoder pre-training strategy results in a decline in all indicators, which demonstrates the effectiveness and efficiency of the proposed pre-training strategy in reducing data variance and promoting model convergence. 2) Removing the input RGB-D image and all of the modules related to the image causes a distinct degradation in RTE values, which demonstrates that our model successfully learns acoustics from the visual scene. 3) Replacing cross-attention with the concat fusion function results in a decrease in performance across all metrics, highlighting the effectiveness of our visual-text fusion module.

Furthermore, we conducted a more detailed exploration of our model's processing and reasoning about different patches in the RGB-D images. To achieve this, we deliberately substituted the target image with random images, allowing us to determine whether the model can derive meaningful representations from visual inputs. Our findings show that after replacing the target image with a random image, the performance of our model significantly degrades, indicating that our model does capture the room acoustic information of the visual input.

| Method | MOS (↑) | RTE (↓) | MCD (↓) |
|---|---|---|---|
| GT (voc.) | 4.18±0.07 | 0.008 | 1.50 |
| ViT-TTS | 3.86±0.05 | 0.076 | 4.59 |
| w/o EP | 3.82±0.07 | 0.078 | 4.63 |
| w/o DP | 3.83±0.06 | 0.081 | 4.65 |
| w/o Visual | 3.78±0.07 | 0.102 | 4.68 |
| w/ RI | 3.73±0.08 | 0.103 | 4.75 |
| w/ Concat | 3.80±0.06 | 0.089 | 4.63 |

Table 4: Ablation study results. EP, DP, and RI denote encoder pre-training, decoder pre-training, and random images, respectively.

5 Conclusion

In this paper, we proposed ViT-TTS, the first visual text-to-speech synthesis model, which converts written text and a target environmental image into audio that matches the target environment. To mitigate the data scarcity for training visual TTS models and to capture visual acoustic information, we 1) introduced a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; and 2) leveraged a diffusion transformer that is scalable in terms of parameters and capacity to improve performance.

Experimental results demonstrated that ViT-TTS achieves new state-of-the-art results and performs comparably to rich-resource baselines even with limited data. To this end, ViT-TTS provides a solid foundation for future visual text-to-speech studies, and we envision that our approach will have far-reaching impacts on the fields of AR and VR.
6 Limitation and Potential Risks

As indicated in the experimental setup, we utilized ResNet-18 as our image feature extractor. While it is a classic extractor, newer extractors may perform better. In future work, we will explore the use of stronger extractors to enhance the quality of the generated audio.

Moreover, our pre-trained encoder and decoder are based on the SoundSpaces-Speech dataset, which, as described in the dataset section, is not sufficiently large. To address this limitation in future work, we will pre-train on a large-scale dataset to achieve better performance in low-resource scenarios.

ViT-TTS lowers the requirements for visual text-to-speech generation, which may enable fraud and scams through impersonating someone else's voice. Furthermore, it has the potential to facilitate the spread of false information and rumors.

References

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, and Karen Simonyan. 2020. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models.
Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. 2022c. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2595–2605.

Rongjie Huang, Zhou Zhao, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, and Jinzheng He. 2022d. Transpeech: Speech-to-speech translation with bilateral perturbation. arXiv preprint arXiv:2205.12523.

Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410.

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Proc. of NeurIPS.

Junghyun Koo, Seungryeol Paik, and Kyogu Lee. 2021. Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 81–85. IEEE.

Max WY Lam, Jun Wang, Rongjie Huang, Dan Su, and Dong Yu. 2021. Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514.

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. 2022. Bigvgan: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. 2019. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713.

Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, and Zhou Zhao. 2023. Wav2sql: Direct generalizable speech-to-sql parsing.

Wolfgang Mack, Shuwen Deng, and Emanuël Habets. 2020. Single-channel blind direct-to-reverberation ratio estimation using masking. In Interspeech.

Amine Mezghani and A. Lee Swindlehurst. 2018. Blind estimation of sparse broadband massive MIMO channels with ideal and one-bit ADCs. IEEE Transactions on Signal Processing, 66(11):2972–2983.

MoonInTheRiver. 2021. Diffsinger. https://github.com/MoonInTheRiver/DiffSinger.

Prateek Murgai, Mark Rau, and Jean-Marc Jot. 2017. Blind estimation of the reverberation fingerprint of unknown acoustic environments. Journal of The Audio Engineering Society.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers.

Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Rama Ratnam, Douglas L Jones, Bruce C Wheeler, William D O'Brien Jr, Charissa R Lansing, and Albert S Feng. 2003. Blind estimation of reverberation time. The Journal of the Acoustical Society of America, 114(5):2877–2892.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2022. Fastspeech 2: Fast and high-quality end-to-end text to speech.
Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer.

Andy Sarroff and Roth Michaels. 2020. Blind arbitrary reverb matching. In Proceedings of the 23rd International Conference on Digital Audio Effects (DAFx-2020), volume 2.

Manfred R Schroeder. 1965. New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(6):1187–1188.

Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.

Feifei Xiong, Stefan Goetze, Birger Kollmeier, and Bernd T Meyer. 2018. Joint estimation of reverberation time and early-to-late reverberation ratio from single-channel speech signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2):255–267.
A TRANSFORMER CONFIGURATION

The details of the transformer denoisers are shown in Table 5, where S, B, L, and XL denote the small, base, large, and extra-large configurations, respectively.

| Model | Layers | Hidden Size | Heads |
|---|---|---|---|
| Transformer-S | 4 | 256 | 8 |
| Transformer-B | 5 | 384 | 12 |
| Transformer-L | 6 | 512 | 16 |
| Transformer-XL | 8 | 768 | 16 |

Table 5: Diffusion Transformer Configs.

B ARCHITECTURE

We list the model hyper-parameters of ViT-TTS in Table 6.

| Module | Hyperparameter | ViT-TTS |
|---|---|---|
| Visual-Text Encoder | Phoneme Embedding | 256 |
| | Pre-net Layers | 3 |
| | Pre-net Hidden | 256 |
| | Visual Conv2d Kernel | (7, 7) |
| | Visual Conv2d Stride | (2, 2) |
| | Encoder Layers | 4 |
| | Encoder Hidden | 256 |
| | Encoder Conv1d Kernel | 9 |
| | Conv1D Filter Size | 1024 |
| | Attention Heads | 2 |
| | Dropout | 0.1 |
| Variance Predictor | Conv1D Kernel | 3 |
| | Conv1D Filter Size | 256 |
| | Dropout | 0.5 |
| Denoiser | Diffusion Embedding | 384 |
| | Transformer Layers | 5 |
| | Transformer Hidden | 384 |
| | Attention Heads | 12 |
| | Position Embedding | 384 |
| | Scale/Shift Size | 384 |
| | Total Number of Parameters | 41.36M |

… ation adds Gaussian noise:

$$q(x_1, \cdots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) \qquad (7)$$

We emphasize the property observed by Ho et al. (2020): the diffusion process can be computed in a closed form:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \alpha_t x_0, \sigma_t I) \qquad (8)$$

Applying Bayes' rule, we can obtain the forward-process posterior when conditioned on $x_0$:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} = \mathcal{N}\big(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\big) \qquad (9)$$

where $\tilde{\mu}_t(x_t, x_0) = \frac{\alpha_{t-1}\beta_t}{\sigma_t} x_0 + \frac{\sqrt{1-\beta_t}\,\sigma_{t-1}}{\sigma_t} x_t$ and $\tilde{\beta}_t = \frac{\sigma_{t-1}}{\sigma_t}\beta_t$.

D DIFFUSION ALGORITHM

See Algorithm 1 and 2.

Algorithm 1 Training procedure
1: Input: the denoiser $\epsilon_\theta$, diffusion step $T$, and variance condition $c$.
2: repeat
3: Sample $x_0 \sim q_{data}$, $\epsilon \sim \mathcal{N}(0, I)$
4: Take gradient descent steps on $\nabla_\theta \big\| \epsilon - \epsilon_\theta(\sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\,\epsilon, c, t) \big\|$.
5: until convergence
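Equations (8)-(9) are what a reverse (sampling) step uses: the noise prediction gives an estimate of $x_0$, and the next sample is drawn from the Gaussian posterior. A minimal sketch under the notation above ($\sigma_t = 1 - \alpha_t^2$ is inferred from Eq. (8) and Eq. (5)); the clamping of the $x_0$ estimate and the denoiser interface are assumptions, and this is not the paper's Algorithm 2.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.06, T)
alphas = torch.cumprod(torch.sqrt(1.0 - betas), dim=0)   # alpha_t
sigmas = 1.0 - alphas ** 2                                # sigma_t as used in Eq. (8)

@torch.no_grad()
def reverse_step(denoiser, x_t, t: int, cond):
    """One p(x_{t-1} | x_t) step using the posterior of Eq. (9)."""
    eps = denoiser(x_t, torch.full((x_t.size(0),), t), cond)
    # invert Eq. (8): x0 = (x_t - sqrt(1 - alpha_t^2) * eps) / alpha_t
    x0 = (x_t - torch.sqrt(1.0 - alphas[t] ** 2) * eps) / alphas[t]
    x0 = x0.clamp(-1.0, 1.0)                              # common practical stabilization
    if t == 0:
        return x0
    alpha_prev, sigma_prev = alphas[t - 1], sigmas[t - 1]
    mu = (alpha_prev * betas[t] / sigmas[t]) * x0 \
         + (torch.sqrt(1.0 - betas[t]) * sigma_prev / sigmas[t]) * x_t
    var = (sigma_prev / sigmas[t]) * betas[t]
    return mu + torch.sqrt(var) * torch.randn_like(x_t)
```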
Figure 3: Screenshots of subjective evaluations.