Visual Text-to-Speech with Diffusion
Huadai Liu1∗, Rongjie Huang1∗, Xuan Lin2∗, Wenqiang Xu2 , Maozong Zheng2 , Hong Chen2 ,
Jinzheng He1 , Zhou Zhao1†
Zhejiang University1 , Ant Group2
{liuhuadai,rongjiehuang,jinzhenghe,zhaozhou}@zju.edu.cn
{daxuan.lx,yugong.xwq,zhengmaozong.zmz,wuyi.ch}@antgroup.com
text fusion to integrate visual and textual information, which provides fine-grained language-visual reasoning by attending to regions of the image; 2) leverage the transformer architecture to promote the scalability of the diffusion model. Regarding the data-shortage challenge, we pre-train the encoder and decoder in a self-supervised manner, showing that large-scale pre-training reduces the data requirements for training visual TTS models.

Experimental results demonstrate that ViT-TTS generates speech samples with accurate reverberation effects in target scenarios, achieving new state-of-the-art results in terms of perceptual quality. In addition, we investigate the scalability of ViT-TTS and its performance under low-resource conditions (1h/2h/5h). The main contributions of this work are summarized as follows:

• We propose the first visual text-to-speech model, ViT-TTS, with visual-text fusion, which enables the generation of high-perceived audio that matches the physical environment.

• We show that large-scale pre-training alleviates the data scarcity in training visual TTS models.

• We introduce a diffusion transformer that is scalable in terms of parameters and capacity to learn visual scene information.

• Experimental results on subjective and objective evaluations demonstrate state-of-the-art perceptual quality. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to rich-resource baselines.

2 Related Work

… in parallel. More recently, Grad-TTS (Popov et al., 2021), DiffSpeech (MoonInTheRiver, 2021), and ProDiff (Huang et al., 2022c) have employed diffusion generative models to generate high-quality audio, but they all rely on convolutional architectures such as WaveNet (Oord et al., 2016) and U-Net (Ronneberger et al., 2015) as the backbone. In contrast, some studies (Peebles and Xie, 2023; Bao et al., 2023) on image generation tasks have explored transformers (Vaswani et al., 2017) as an alternative to convolutional architectures, achieving results competitive with U-Net. In this paper, we present the first transformer-based diffusion model for this task as an alternative to the convolutional architecture. By harnessing the scalable properties of transformers, we enhance the model capacity to more effectively capture visual scene information and improve model performance.

2.2 Self-supervised Pre-training

There are two main criteria for optimizing speech pre-training: contrastive loss (Oord et al., 2018; Chung and Glass, 2020; Baevski et al., 2020) and masked prediction loss (Devlin et al., 2018). Contrastive loss is used to distinguish between positive and negative samples with respect to a reference sample, while masked prediction loss was originally proposed for natural language processing (Devlin et al., 2018; Lewis et al., 2019) and later applied to speech processing (Baevski et al., 2020; Hsu et al., 2021). Some recent work (Chung et al., 2021) has combined the two approaches, achieving good performance on downstream automatic speech recognition (ASR) tasks. In this work, we leverage the success of self-supervised learning to enhance both the encoder and the decoder and alleviate the data scarcity issue.
reverberation, thus simulating the reverberation of the target space or processing algorithm (Koo et al., 2021; Sarroff and Michaels, 2020). Recently, there has been research on visual acoustic matching (Chen et al., 2022), which involves generating audio recorded in the target environment based on an input source audio clip and an image of the target environment. However, our proposed visual TTS is distinct from the work mentioned above, as it aims to generate audio that captures the room acoustics of the target environment based on the written text and the target environment image.

3 Method

3.1 Overview

The overall architecture is presented in Figure 1. To alleviate the issue of data scarcity, we leverage unlabeled data to pre-train the visual-text encoder and the denoiser decoder with scalable transformers in a self-supervised manner. To capture visual scene information, we employ a visual-text fusion module to reason about how different image patches contribute to the text. As a neural vocoder, BigvGAN (Lee et al., 2022) converts the mel-spectrograms into audio that matches the target scene.

3.2 Enhanced Visual-Text Encoder

Self-supervised Pre-training The advent of the masked language model (Devlin et al., 2018; Clark et al., 2020) has marked a significant milestone in the field of natural language processing. To alleviate the data scarcity issue (Huang et al., 2022d; Liu et al., 2023; Huang et al., 2023c) and learn a robust contextual encoder, we adopt a BERT-like masking strategy in the pre-training stage. Specifically, we randomly mask 15% of each phoneme sequence and predict those masked tokens rather than reconstructing the entire input. The masked phoneme sequence is then fed into the text encoder to obtain hidden states. The final hidden states are fed into a linear projection layer over the vocabulary to obtain the predicted tokens. Finally, we calculate the cross-entropy loss between the predicted tokens and the target tokens.

The mask token used during the pre-training phase does not appear in the fine-tuning phase. To mitigate this mismatch between pre-training and fine-tuning, we randomize how the selected phonemes are corrupted: 1) with 80% probability we replace them with the mask token; 2) with 10% probability we keep the phoneme unchanged; and 3) with 10% probability we replace it with a random token from the dictionary.
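This is the BERT-style 80/10/10 corruption rule applied to phoneme IDs. A minimal sketch of how such a corruption step could be implemented is shown below; the vocabulary size follows the 73 phonemes stated later, while the mask ID and the helper's name are illustrative assumptions rather than the released implementation.

```python
import torch

def mask_phonemes(phonemes: torch.Tensor, vocab_size: int = 73,
                  mask_id: int = 1, mask_prob: float = 0.15):
    """BERT-style corruption of a batch of phoneme IDs with shape (B, T).

    Returns the corrupted sequence and the prediction targets,
    where -100 marks positions excluded from the loss.
    """
    targets = phonemes.clone()
    # choose 15% of positions as prediction targets
    selected = torch.rand_like(phonemes, dtype=torch.float) < mask_prob
    targets[~selected] = -100

    corrupted = phonemes.clone()
    roll = torch.rand_like(phonemes, dtype=torch.float)
    # 80% of selected positions -> mask token
    corrupted[selected & (roll < 0.8)] = mask_id
    # 10% -> random token from the dictionary, remaining 10% stay unchanged
    random_ids = torch.randint_like(phonemes, 0, vocab_size)
    use_random = selected & (roll >= 0.8) & (roll < 0.9)
    corrupted[use_random] = random_ids[use_random]
    return corrupted, targets

# training step (sketch): logits = encoder(corrupted) with shape (B, T, vocab_size)
# loss = torch.nn.functional.cross_entropy(
#     logits.transpose(1, 2), targets, ignore_index=-100)
```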
Visual-Text Fusion In the fine-tuning stage, we integrate the visual modality into the encoder to fuse visual and textual information. Before feeding into the visual-text encoder, we first extract image features of the panoramic images through ResNet18 (Oord et al., 2018) and obtain phoneme embeddings. Both the image features and the phoneme embeddings are fed into a variant of the transformer to obtain the hidden sequences.
Specifically, we first pass the phonemes through relative self-attention, which is defined as follows:

$$\alpha(i, j) = \mathrm{Softmax}\Big(\frac{(Q_i W^Q)(K_j W^K + R_{ij})^T}{\sqrt{d_k}}\Big) \qquad (1)$$

where $n$ is the length of the phoneme embedding, $R_{ij}$ is the relative position embedding of key and value, $d_k$ is the dimension of the key, and $Q$, $K$, $V$ are all the phoneme embeddings. We use relative self-attention to model how much phoneme $p_i$ attends to phoneme $p_j$. After that, we choose cross-attention instead of a simplistic concatenation approach, as it lets us reason about how different image patches contribute to the text after feature extraction. The equation is defined as follows:

$$\delta(V, P) = \mathrm{Softmax}\Big(\frac{P V^T}{\sqrt{d_v}}\Big) V \qquad (2)$$

where $P$ is the phoneme embedding, $V$ is the visual features, and $d_v$ is the dimension of the vision features. Finally, a feed-forward layer is applied to output the hidden sequence.
3.3 Enhanced Diffusion Transformer

Scalable Transformer As a rapidly growing category of generative models, DDPMs have demonstrated their exceptional ability to deliver top-notch results in both image (Zhang and Agrawala, 2023; Ho and Salimans, 2022) and audio synthesis (Huang et al., 2022c, 2023a; Lam et al., 2021). However, the dominant diffusion TTS models adopt a convolutional architecture such as WaveNet or U-Net as the de facto choice of backbone. This architectural choice limits the model's scalability and its ability to effectively incorporate panoramic visual images. Recent research (Peebles and Xie, 2023; Bao et al., 2023) in the image synthesis field has revealed that the inductive bias of convolutional structures is not a critical determinant of DDPMs' performance. Instead, transformers have emerged as a viable alternative.

For this reason, we propose a diffusion transformer that leverages the scalability of transformers to expand model capacity and incorporate room acoustic information. Moreover, we leverage the adaptive normalization layers used in GANs and initialize the full transformer block as the identity function to enhance the transformer architecture.

Unconditional Pre-training In this part, we investigate self-supervised learning from orders of magnitude more unlabeled mel-spectrogram data to alleviate data scarcity. Specifically, assuming the target mel-spectrogram is $x_0$, we first randomly select 0.065% of $x_0$ as starting indices and apply a mask that spans 10 steps, following wav2vec 2.0 (Baevski et al., 2020). Then, we obtain $x_t$ through a diffusion process, which is defined by a fixed Markov chain from data $x_0$ to the latent variable $x_t$:

$$q(x_1, \cdots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \qquad (3)$$

At each diffusion step $t \in [1, T]$, a tiny Gaussian noise is added to $x_{t-1}$ to obtain $x_t$, according to a small positive constant $\beta_t$:

$$q(x_t \mid x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) \qquad (4)$$

The $x_t$ obtained from the diffusion process is passed through the transformer to predict the Gaussian noise $\epsilon_\theta$. The loss is defined as the mean squared error in the $\epsilon$ space, and efficient training optimizes a random term of $t$ with stochastic gradient descent:

$$\mathcal{L}_\theta^{Grad} = \Big\| \epsilon_\theta\big(\alpha_t x_0 + \sqrt{1-\alpha_t^2}\,\epsilon\big) - \epsilon \Big\|_2^2, \quad \epsilon \sim \mathcal{N}(0, I) \qquad (5)$$

To this end, ViT-TTS takes advantage of the reconstruction loss to predict the self-supervised representations, which largely alleviates the challenge of data scarcity. A detailed formulation of the DDPM is attached in Appendix C.
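The forward process in Eqs. (3)-(4) collapses into a single closed-form sample, which is what makes the $\epsilon$-prediction objective in Eq. (5) cheap to train. A minimal sketch in the paper's notation, where $\alpha_t$ is the cumulative product of $\sqrt{1-\beta_i}$; the denoiser call signature is an assumption.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.06, T)                     # linear schedule used in the paper
alphas = torch.cumprod(torch.sqrt(1.0 - betas), dim=0)    # alpha_t = prod_i sqrt(1 - beta_i)

def diffuse(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Closed-form q(x_t | x_0): alpha_t * x0 + sqrt(1 - alpha_t^2) * eps."""
    a = alphas[t].view(-1, 1, 1)                           # broadcast over (B, n_mels, frames)
    return a * x0 + torch.sqrt(1.0 - a ** 2) * eps

def training_loss(denoiser, x0):
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    eps = torch.randn_like(x0)
    x_t = diffuse(x0, t, eps)
    eps_pred = denoiser(x_t, t)                            # transformer predicts the added noise
    return torch.mean((eps_pred - eps) ** 2)               # MSE in epsilon space, Eq. (5)
```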
Controllable Fine-tuning During the fine-tuning stage, we face the following challenges: (1) there is a data scarcity issue with the available panoramic images and target environmental audio for training; (2) a fast training method is equally crucial for optimizing the diffusion model, as it can save a significant amount of time and storage space. To address these challenges, we draw inspiration from Zhang and Agrawala (2023) and implement a swift fine-tuning technique. Specifically, we create two copies of the pre-trained diffusion model weights, namely a "trainable copy" and a "locked copy," to learn the input conditions. We fix all parameters of the pre-trained transformer, designated as Θ, and duplicate them into trainable parameters Θt. We train these trainable parameters and connect them with the "locked copy" via zero-convolution layers. These convolution layers are unique in that they have a kernel size of one by one and weights and biases initialized to zero, progressively growing from zeros to optimized parameters in a learned fashion.
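A minimal sketch of the locked-copy / trainable-copy wiring with zero-initialized 1x1 convolutions, in the spirit of ControlNet (Zhang and Agrawala, 2023). The module shapes and the way the condition is injected are illustrative assumptions rather than the paper's exact fine-tuning graph.

```python
import copy
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv1d:
    """1x1 convolution whose weights and bias start at zero."""
    conv = nn.Conv1d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControllableDenoiser(nn.Module):
    def __init__(self, pretrained_block: nn.Module, channels: int):
        super().__init__()
        self.locked = pretrained_block                       # frozen pre-trained weights (Theta)
        for p in self.locked.parameters():
            p.requires_grad = False
        self.trainable = copy.deepcopy(pretrained_block)     # trainable copy (Theta_t)
        self.zero_in = zero_conv(channels)                   # condition enters through zeros
        self.zero_out = zero_conv(channels)                  # correction leaves through zeros

    def forward(self, x, condition):
        # x, condition: (B, channels, frames)
        base = self.locked(x)
        correction = self.trainable(x + self.zero_in(condition))
        # at initialization both zero convs output 0, so the model equals the locked copy
        return base + self.zero_out(correction)
```

Because the zero convolutions start at exactly zero, the fine-tuned model initially reproduces the pre-trained unconditional denoiser and only gradually learns to use the visual condition.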
3.4 Architecture

As illustrated in Figure 1, our model comprises a visual-text encoder, a variance adaptor, and a spectrogram denoiser. The visual-text encoder converts phoneme embeddings and visual features into hidden sequences, while the variance adaptor predicts the duration of each hidden state to regulate the length of the hidden sequences so that it matches that of the speech frames. Furthermore, different variances like pitch and speaker embedding are incorporated into the hidden sequences following FastSpeech 2 (Ren et al., 2022). Finally, the spectrogram denoiser iteratively refines the length-regulated hidden states into mel-spectrograms. We put more details in Appendix B.

Visual-Text Encoder The visual-text encoder consists of relative-position transformer blocks based on the transformer architecture. Specifically, it comprises a pre-net for the phoneme embedding, a visual feature extractor for the image, and a transformer encoder that includes multi-head self-attention, multi-head cross-attention, and a feed-forward layer.

Variance Adaptor In the variance adaptor, the duration and pitch predictors share a similar model structure consisting of a 2-layer 1D convolutional network with ReLU activation, each layer followed by layer normalization and a dropout layer, and an extra linear layer to project the hidden states into the output sequence.
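A minimal sketch of that shared duration/pitch predictor structure (two Conv1d + ReLU + LayerNorm + Dropout blocks and a final linear projection); the kernel size, filter size, and dropout follow Table 6 in the appendix, while the padding choice is an assumption.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """2-layer 1D-conv predictor used for both duration and pitch."""

    def __init__(self, hidden: int = 256, filter_size: int = 256,
                 kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, filter_size, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(filter_size, filter_size, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(filter_size)
        self.norm2 = nn.LayerNorm(filter_size)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(filter_size, 1)    # one scalar (duration or pitch) per position

    def forward(self, x):                        # x: (B, T, hidden)
        y = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm1(y))
        y = torch.relu(self.conv2(y.transpose(1, 2))).transpose(1, 2)
        y = self.dropout(self.norm2(y))
        return self.proj(y).squeeze(-1)          # (B, T)
```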
Spectrogram Denoiser The spectrogram denoiser takes $x_t$ as input and predicts the noise $\epsilon$ added in the diffusion process, conditioned on the step embedding $E_t$ and the encoder output. We adopt a variant of the transformer as our backbone and make some improvements over the standard transformer motivated by Peebles and Xie (2023), mainly including: (1) we explore replacing the standard layer-norm layers in transformer blocks with adaptive layer norm (adaLN), which regresses scale and shift parameters from the sum of the embedding vector of $t$ and the hidden sequence; (2) inspired by ResNets (Oord et al., 2018), we initialize the transformer block as the identity function and initialize the MLP to output the zero vector.
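A minimal sketch of an adaLN transformer block in the style of Peebles and Xie (2023): a small linear layer regresses per-channel shift, scale, and gate values from the conditioning vector, and its zero initialization makes the whole block start as the identity. The exact layer layout is an illustrative assumption rather than the paper's released block.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    def __init__(self, dim: int = 384, heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # regress shift/scale/gate for both sub-layers from the conditioning vector
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)          # zero init -> block starts as the identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, cond):
        # x: (B, T, dim); cond: (B, dim), e.g. step embedding plus pooled encoder output
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)
```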
3.5 Pre-training, Fine-tuning, and Inference Procedures

Pre-training The pre-training has two stages: 1) encoder stage: we pre-train the visual-text encoder via the masked LM loss $\mathcal{L}_{CE}$ (i.e., cross-entropy loss) to predict the masked tokens; 2) decoder stage: the masked $x_0$ is put into the denoiser to predict the Gaussian noise $\epsilon_\theta$, and the mean squared error (MSE) loss is then applied between the predicted and target Gaussian noise.

Fine-tuning We begin by loading the model weights from the pre-trained visual-text encoder and the unconditional diffusion decoder, after which we fine-tune both of them until the model converges. The final loss term consists of the following parts: (1) sample reconstruction loss $\mathcal{L}_\theta$: the MSE between the predicted and target Gaussian noise; (2) variance reconstruction losses $\mathcal{L}_{dur}$ and $\mathcal{L}_p$: the MSE between the predicted and target phoneme-level duration and pitch.

Inference During inference, the DDPM iteratively runs the reverse process to obtain the data sample $x_0$, and we then use a pre-trained BigvGAN-16kHz-80band as the vocoder to transform the generated mel-spectrograms into waveforms.

4 Experiment

4.1 Experimental Setup

Dataset We use the SoundSpaces-Speech dataset (Chen et al., 2023), which is constructed on the SoundSpaces platform from real-world 3D scans to obtain environmental audio. The dataset includes 28,853/1,441/1,489 samples for training/validation/testing, each consisting of clean text, reverberant audio, and panoramic camera-angle images. Following Chen et al. (2022), we remove out-of-view samples and divide the test set into test-unseen and test-seen, where the unseen set injects room acoustics depicted in novel images while the seen set only contains scenes seen in the training stage. We convert the text sequence into a phoneme sequence with an open-source grapheme-to-phoneme conversion tool (Sun et al., 2019)³.

Following common practice (Ren et al., 2019; MoonInTheRiver, 2021), we preprocess the speech and text data as follows: 1) extract the spectrogram with an FFT size of 1024, hop size of 256, and window size of 1024 samples; 2) convert it to a mel-spectrogram with 80 frequency bins; and 3) extract F0 (fundamental frequency) from the raw waveform using the Parselmouth tool⁴.

³https://github.com/Kyubyong/g2p
⁴https://github.com/YannickJadoul/Parselmouth
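A sketch of that preprocessing recipe with common open-source tools (librosa for the mel-spectrogram, Parselmouth for F0). The log compression and clipping constants are assumptions, since the authors' exact normalization is not specified here.

```python
import librosa
import numpy as np
import parselmouth

def extract_features(wav_path: str, sr: int = 16000):
    wav, _ = librosa.load(wav_path, sr=sr)
    # 1) + 2): spectrogram with FFT/hop/window of 1024/256/1024, mapped to 80 mel bins
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    # 3): F0 from the raw waveform via Parselmouth (Praat), one value per hop
    pitch = parselmouth.Sound(wav, sampling_frequency=sr).to_pitch(time_step=256 / sr)
    f0 = pitch.selected_array["frequency"]       # 0 where the frame is unvoiced
    return log_mel, f0
```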
| Method | Test-Seen MOS (↑) | Test-Seen RTE (↓) | Test-Seen MCD (↓) | Test-Unseen MOS (↑) | Test-Unseen RTE (↓) | Test-Unseen MCD (↓) | Params |
|---|---|---|---|---|---|---|---|
| GT | 4.34±0.07 | / | / | 4.24±0.07 | / | / | / |
| GT (voc.) | 4.18±0.05 | 0.006 | 1.46 | 4.19±0.07 | 0.008 | 1.50 | / |
| WaveNet | 3.85±0.09 | 0.091 | 4.61 | 3.78±0.12 | 0.110 | 4.69 | 42.3M |
| Transformer-S | 3.92±0.07 | 0.068 | 4.57 | 3.80±0.06 | 0.077 | 4.68 | 32.38M |
| Transformer-B | 3.98±0.06 | 0.061 | 4.53 | 3.90±0.07 | 0.066 | 4.62 | 41.36M |
| Transformer-L | 4.02±0.08 | 0.056 | 4.37 | 3.95±0.07 | 0.061 | 4.50 | 56.96M |
| Transformer-XL | 4.05±0.07 | 0.047 | 4.35 | 4.00±0.05 | 0.053 | 4.39 | 115.12M |

Table 1: Comparison between the diffusion WaveNet and diffusion transformers sweeping over model configs (S, B, L, XL). All models remove the pre-training stage, and other conditions not related to the backbone remain the same in training and inference.
Model Configurations The size of the phoneme vocabulary is 73. The dimension of the phoneme embeddings and the hidden size of the visual-text transformer block are both 256. We use a pre-trained ResNet18 as the image feature extractor. As for the pitch encoder, the size of the lookup table and of the encoded pitch embedding are set to 300 and 256, respectively. In the denoiser, the number of Transformer-B layers is 5, with a hidden size of 384 and 12 attention heads. We initialize each transformer block as the identity function and set $T$ to 100 and $\beta$ to constants increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.06$. We have attached more detailed information on the model configuration in Appendix B.
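For reference, the stated configuration can be gathered in one place as follows; this is only a summary of the numbers above, and the variable names are illustrative.

```python
import torch.nn as nn

phoneme_embedding = nn.Embedding(num_embeddings=73, embedding_dim=256)   # phoneme vocab of 73
pitch_embedding   = nn.Embedding(num_embeddings=300, embedding_dim=256)  # pitch lookup table

denoiser_cfg = dict(layers=5, hidden=384, heads=12)           # Transformer-B denoiser
diffusion_cfg = dict(T=100, beta_start=1e-4, beta_end=0.06)   # linear beta schedule
```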
Pre-training, Fine-tuning, and Inference During the pre-training stage, we pre-train the encoder for 120k steps and the decoder for 160k steps until convergence. The diffusion probabilistic models have been trained on 1 NVIDIA A100 GPU with a batch size of 48 sentences. In the inference stage, we uniformly use a pre-trained BigvGAN-16kHz-80band (Lee et al., 2022) as the vocoder to transform the generated mel-spectrograms into waveforms.

4.2 Scalable Diffusion Transformer

We compare and examine diffusion transformers sweeping over the model configs (S, B, L, XL), and conduct evaluations in terms of audio quality and number of parameters. Appendix A gives the details of the model configs. The results are shown in Table 1, and we make the following observations: (1) Increasing the depth and the number of layers of the transformer significantly enhances the performance of the diffusion model, resulting in an improvement in both objective and subjective metrics, which demonstrates that expanding the model size enables finer-grained room acoustic modeling. (2) Our proposed diffusion transformer outperforms the WaveNet backbone under similar parameter counts across both the test-unseen and test-seen sets, most significantly on the RT60 metric. We attribute this to the fact that, instead of directly concatenating the condition input as WaveNet does, we replace the standard layer-norm layers in transformer blocks with adaptive layer norm to regress dimension-wise scale and shift parameters from the sum of the embedding vectors of the diffusion step and the encoder output, which better incorporates the conditional information, as proven in GANs (Brock et al., 2018; Karras et al., 2019).

4.3 Model Performances

In this study, we conduct a comprehensive comparison of the generated audio quality with other systems, including 1) GT, the ground-truth audio; 2) GT (voc.), where we first convert the ground-truth audio into mel-spectrograms and then convert them back to audio using BigvGAN; 3) DiffSpeech (MoonInTheRiver, 2021), one of the most popular DDPMs based on WaveNet; 4) ProDiff (Huang et al., 2022c), a recent generator-based diffusion model proposed to reduce the sampling time; 5) Visual-DiffSpeech, which incorporates our visual-text fusion module into DiffSpeech; 6) Cascaded, the system composed of DiffSpeech and Visual Acoustic Matching (VAM) (Chen et al., 2022). The results, compiled and presented in Table 2, provide valuable insights into the effectiveness of our approach:
| Method | Test-Seen MOS (↑) | Test-Seen RTE (↓) | Test-Seen MCD (↓) | Test-Unseen MOS (↑) | Test-Unseen RTE (↓) | Test-Unseen MCD (↓) | Params |
|---|---|---|---|---|---|---|---|
| GT | 4.34±0.07 | / | / | 4.24±0.07 | / | / | / |
| GT (voc.) | 4.18±0.05 | 0.006 | 1.46 | 4.19±0.07 | 0.008 | 1.50 | / |
| DiffSpeech | 3.79±0.08 | 0.104 | 4.65 | 3.67±0.05 | 0.120 | 4.71 | 29.9M |
| ProDiff | 3.76±0.13 | 0.121 | 4.67 | 3.65±0.06 | 0.137 | 4.72 | 29.9M |
| Visual-DiffSpeech | 3.85±0.09 | 0.091 | 4.61 | 3.78±0.12 | 0.110 | 4.69 | 42.3M |
| Cascaded | 3.61±0.08 | 0.071 | 5.13 | 3.59±0.08 | 0.082 | 5.25 | 146.5M |
| ViT-TTS | 3.95±0.06 | 0.066 | 4.52 | 3.86±0.05 | 0.076 | 4.59 | 41.3M |

Table 2: Comparison with baselines on SoundSpaces-Speech for the Seen and Unseen scenarios. The diffusion step of all diffusion models is set to 100. We use the pre-trained model provided by VAM for the evaluation of Cascaded.
(1) As expected, the results on the test-unseen set are poorer than on the test-seen set, because the test-unseen set contains scenes that were not visible during training. However, our proposed model achieves the best performance compared with the baseline systems on both sets, indicating that it generates the best-perceived audio that matches the target environment from written text. (2) Our model surpasses the TTS diffusion models (i.e., DiffSpeech and ProDiff) across all metric scores, especially in terms of RTE values. This suggests that conventional diffusion models in TTS do poorly at modeling room acoustic information, as they mainly focus on audio content, pitch, energy, etc. Our proposed visual-text fusion module addresses this challenge by injecting visual properties into the model, resulting in more accurate prediction of the correct acoustics from images and high-perceived audio synthesis. (3) The comparison with Visual-DiffSpeech highlights the advantages of our choice of transformer and self-supervised pre-training. Although Visual-DiffSpeech adds the visual-text module, the choice of WaveNet and the lack of a self-supervised pre-training strategy make it perform worse in predicting the correct acoustics from images and synthesizing high-perceived audio. (4) The cascaded system composed of DiffSpeech and Visual Acoustic Matching models visual properties better than the other baselines. However, compared to our proposed model, it performs worse in both the test-unseen and test-seen environments. This suggests that our direct visual text-to-speech system eliminates the error propagation caused by the cascaded manner, resulting in high-perceived audio. In conclusion, our comprehensive evaluation results demonstrate the effectiveness of our proposed model in generating high-quality audio that matches the target environment.

4.4 Low Resource Evaluation

… data (1h/2h/5h) and leverage large-scale text-only and audio-only data to boost the performance of the visual TTS system, to investigate the effectiveness of our self-supervised learning methods. The results are compiled and presented in Table 3, and we have the following observations: 1) As the training data is reduced in the low-resource scenario, a distinct degradation in generated audio quality can be witnessed on both test sets (test-seen and test-unseen). 2) Leveraging orders of magnitude more text-only and audio-only data with self-supervised learning, ViT-TTS achieves RTE scores of 0.082 and 0.068 on test-unseen and test-seen respectively, showing a significant improvement even for unseen scenes. In this way, the dependence on a large amount of parallel audio-visual data can be reduced when constructing visual text-to-speech systems.

| Method | MOS (↑) | RTE (↓) | MCD (↓) |
|---|---|---|---|
| Finetune with 1 hour data | | | |
| Test-Seen | 3.72±0.05 | 0.092 | 5.04 |
| Test-Unseen | 3.67±0.06 | 0.101 | 5.11 |
| Finetune with 2 hours data | | | |
| Test-Seen | 3.75±0.06 | 0.089 | 4.85 |
| Test-Unseen | 3.70±0.07 | 0.097 | 4.89 |
| Finetune with 5 hours data | | | |
| Test-Seen | 3.83±0.05 | 0.068 | 4.65 |
| Test-Unseen | 3.73±0.09 | 0.082 | 4.72 |

Table 3: Low resource evaluation results.

… Mel-spectrograms produced by ViT-TTS are noticeably more similar to the target counterpart. 2) Moreover, in challenging scenarios with unseen scene images, cascaded systems suffer severely from missing noise and reverb details, which is largely alleviated in ViT-TTS.

4.6 Ablation Studies

We conduct ablation studies to demonstrate the effectiveness of several key techniques in our model on the Test-Unseen set, including encoder pre-training (EP), decoder pre-training (DP), the visual input, random images, and the concat fusion function. The results of both the subjective and objective evaluations are presented in Table 4, and we have the following observations: 1) Removing the self-supervised encoder and decoder pre-training strategy results in a decline in all indicators, which demonstrates the effectiveness and efficiency of the proposed pre-training strategy in reducing data variance and promoting model convergence. 2) Removing the input RGB-D image and all of the modules related to the image causes a distinct degradation in RTE values, which demonstrates that our model successfully learns acoustics from the visual scene. 3) Replacing cross-attention with the concat fusion function results in a decrease in performance across all metrics, highlighting the effectiveness of our visual-text fusion module.

Furthermore, we conducted a more detailed exploration of our model's processing and reasoning about different patches in the RGB-D images. To achieve this, we deliberately substituted the target image with random images, allowing us to determine whether the model can derive meaningful representations from visual inputs. Our findings show that after replacing the target image with a random image, the performance of our model significantly degrades, indicating that our model does capture the room acoustic information of the visual input.

| Method | MOS (↑) | RTE (↓) | MCD (↓) |
|---|---|---|---|
| GT (voc.) | 4.18±0.07 | 0.008 | 1.50 |
| ViT-TTS | 3.86±0.05 | 0.076 | 4.59 |
| w/o EP | 3.82±0.07 | 0.078 | 4.63 |
| w/o DP | 3.83±0.06 | 0.081 | 4.65 |
| w/o Visual | 3.78±0.07 | 0.102 | 4.68 |
| w/ RI | 3.73±0.08 | 0.103 | 4.75 |
| w/ Concat | 3.80±0.06 | 0.089 | 4.63 |

Table 4: Ablation study results. EP, DP, and RI denote encoder pre-training, decoder pre-training, and random images, respectively.

5 Conclusion

In this paper, we proposed ViT-TTS, the first visual text-to-speech synthesis model, which converts written text and a target environmental image into audio that matches the target environment. To mitigate the data scarcity for training visual TTS models and to capture visual acoustic information, we 1) introduced a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; and 2) leveraged a diffusion transformer that is scalable in terms of parameters and capacity to improve performance.

Experimental results demonstrated that ViT-TTS achieves new state-of-the-art results and performs comparably to rich-resource baselines even with limited data. To this end, ViT-TTS provides a solid foundation for future visual text-to-speech studies, and we envision that our approach will have far-reaching impacts on the fields of AR and VR.
6 Limitation and Potential Risks

As indicated in the experimental setup, we utilized ResNet-18 as our image feature extractor. While it is a classic extractor, newer extractors may perform better. In future work, we will explore the use of stronger extractors to enhance the quality of the generated audio.

Moreover, our pre-trained encoder and decoder are based on the SoundSpaces-Speech dataset, which, as described in the dataset section, is not sufficiently large. To address this limitation in future work, we will pre-train on a large-scale dataset to achieve better performance in low-resource scenarios.

ViT-TTS lowers the requirements for visual text-to-speech generation, which may enable fraud and scams through impersonating someone else's voice. Furthermore, it has the potential to facilitate the spread of false information and rumors.

References

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, and Karen Simonyan. 2020. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models.
Rongjie Huang, Zhou Zhao, Huadai Liu, Jinglin Liu, Chenye Cui, and Yi Ren. 2022c. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2595–2605.

Rongjie Huang, Zhou Zhao, Jinglin Liu, Huadai Liu, Yi Ren, Lichao Zhang, and Jinzheng He. 2022d. Transpeech: Speech-to-speech translation with bilateral perturbation. arXiv preprint arXiv:2205.12523.

Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410.

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Proc. of NeurIPS.

Junghyun Koo, Seungryeol Paik, and Kyogu Lee. 2021. Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 81–85. IEEE.

Max WY Lam, Jun Wang, Rongjie Huang, Dan Su, and Dong Yu. 2021. Bilateral denoising diffusion models. arXiv preprint arXiv:2108.11514.

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon. 2022. Bigvgan: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics.

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. 2019. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713.

Huadai Liu, Rongjie Huang, Jinzheng He, Gang Sun, Ran Shen, Xize Cheng, and Zhou Zhao. 2023. Wav2sql: Direct generalizable speech-to-sql parsing.

Wolfgang Mack, Shuwen Deng, and Emanuël Habets. 2020. Single-channel blind direct-to-reverberation ratio estimation using masking. In Interspeech.

Amine Mezghani and A. Lee Swindlehurst. 2018. Blind estimation of sparse broadband massive MIMO channels with ideal and one-bit ADCs. IEEE Transactions on Signal Processing, 66(11):2972–2983.

MoonInTheRiver. 2021. Diffsinger. https://github.com/MoonInTheRiver/DiffSinger.

Prateek Murgai, Mark Rau, and Jean-Marc Jot. 2017. Blind estimation of the reverberation fingerprint of unknown acoustic environments. Journal of The Audio Engineering Society.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers.

Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. 2021. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608. PMLR.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.

Rama Ratnam, Douglas L Jones, Bruce C Wheeler, William D O'Brien Jr, Charissa R Lansing, and Albert S Feng. 2003. Blind estimation of reverberation time. The Journal of the Acoustical Society of America, 114(5):2877–2892.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2022. Fastspeech 2: Fast and high-quality end-to-end text to speech.
Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32.

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer.

Andy Sarroff and Roth Michaels. 2020. Blind arbitrary reverb matching. In Proceedings of the 23rd International Conference on Digital Audio Effects (DAFx-2020), volume 2.

Manfred R Schroeder. 1965. New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(6):1187–1188.

Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-level ensemble distillation for grapheme-to-phoneme conversion. arXiv preprint arXiv:1904.03446.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.

Feifei Xiong, Stefan Goetze, Birger Kollmeier, and Bernd T Meyer. 2018. Joint estimation of reverberation time and early-to-late reverberation ratio from single-channel speech signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2):255–267.
A TRANSFORMER CONFIGURATION

The details of the transformer denoisers are shown in Table 5, where S, B, L, and XL denote the small, base, large, and extra-large configurations, respectively.

| Model | Layers | Hidden Size | Heads |
|---|---|---|---|
| Transformer-S | 4 | 256 | 8 |
| Transformer-B | 5 | 384 | 12 |
| Transformer-L | 6 | 512 | 16 |
| Transformer-XL | 8 | 768 | 16 |

Table 5: Diffusion Transformer Configs.

B ARCHITECTURE

We list the model hyper-parameters of ViT-TTS in Table 6.

| Module | Hyperparameter | ViT-TTS |
|---|---|---|
| Visual-Text Encoder | Phoneme Embedding | 256 |
| | Pre-net Layers | 3 |
| | Pre-net Hidden | 256 |
| | Visual Conv2d Kernel | (7, 7) |
| | Visual Conv2d Stride | (2, 2) |
| | Encoder Layers | 4 |
| | Encoder Hidden | 256 |
| | Encoder Conv1d Kernel | 9 |
| | Conv1D Filter Size | 1024 |
| | Attention Heads | 2 |
| | Dropout | 0.1 |
| Variance Predictor | Conv1D Kernel | 3 |
| | Conv1D Filter Size | 256 |
| | Dropout | 0.5 |
| Denoiser | Diffusion Embedding | 384 |
| | Transformer Layers | 5 |
| | Transformer Hidden | 384 |
| | Attention Heads | 12 |
| | Position Embedding | 384 |
| | Scale/Shift Size | 384 |
| | Total Number of Parameters | 41.36M |

… ation adds Gaussian noise:

$$q(x_1, \cdots, x_T \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \quad q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I) \qquad (7)$$

We emphasize the property observed by Ho et al. (2020): the diffusion process can be computed in a closed form:

$$q(x_t \mid x_0) = \mathcal{N}(x_t; \alpha_t x_0, \sigma_t I) \qquad (8)$$

Applying Bayes' rule, we can obtain the forward-process posterior when conditioned on $x_0$:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} = \mathcal{N}\big(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I\big) \qquad (9)$$

where $\tilde{\mu}_t(x_t, x_0) = \frac{\alpha_{t-1}\beta_t}{\sigma_t} x_0 + \frac{\sqrt{1-\beta_t}\,\sigma_{t-1}}{\sigma_t} x_t$ and $\tilde{\beta}_t = \frac{\sigma_{t-1}}{\sigma_t}\beta_t$.

D DIFFUSION ALGORITHM

See Algorithm 1 and 2.

Algorithm 1 Training procedure
1: Input: the denoiser $\epsilon_\theta$, diffusion step $T$, and variance condition $c$.
2: repeat
3: Sample $x_0 \sim q_{data}$, $\epsilon \sim \mathcal{N}(0, I)$
4: Take gradient descent steps on $\nabla_\theta \big\| \epsilon - \epsilon_\theta(\sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\,\epsilon, c, t) \big\|$.
5: until convergence
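Equations (8)-(9) are what a reverse (sampling) step uses: the noise prediction gives an estimate of $x_0$, and the next sample is drawn from the Gaussian posterior. A minimal sketch under the notation above ($\sigma_t = 1 - \alpha_t^2$ is inferred from Eq. (8) and Eq. (5)); the clamping of the $x_0$ estimate and the denoiser interface are assumptions, and this is not the paper's Algorithm 2.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.06, T)
alphas = torch.cumprod(torch.sqrt(1.0 - betas), dim=0)   # alpha_t
sigmas = 1.0 - alphas ** 2                                # sigma_t as used in Eq. (8)

@torch.no_grad()
def reverse_step(denoiser, x_t, t: int, cond):
    """One p(x_{t-1} | x_t) step using the posterior of Eq. (9)."""
    eps = denoiser(x_t, torch.full((x_t.size(0),), t), cond)
    # invert Eq. (8): x0 = (x_t - sqrt(1 - alpha_t^2) * eps) / alpha_t
    x0 = (x_t - torch.sqrt(1.0 - alphas[t] ** 2) * eps) / alphas[t]
    x0 = x0.clamp(-1.0, 1.0)                              # common practical stabilization
    if t == 0:
        return x0
    alpha_prev, sigma_prev = alphas[t - 1], sigmas[t - 1]
    mu = (alpha_prev * betas[t] / sigmas[t]) * x0 \
         + (torch.sqrt(1.0 - betas[t]) * sigma_prev / sigmas[t]) * x_t
    var = (sigma_prev / sigmas[t]) * betas[t]
    return mu + torch.sqrt(var) * torch.randn_like(x_t)
```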
Figure 3: Screenshots of subjective evaluations.