MODELING MULTI-SPEAKER LATENT SPACE TO IMPROVE NEURAL TTS:
QUICK ENROLLING NEW SPEAKER AND ENHANCING PREMIUM VOICE
Yan Deng, Lei He, Frank Soong
Microsoft, China
{yaden, helei, frankkps}@microsoft.com
ABSTRACT
Neural TTS has shown that it can generate high-quality synthesized speech. In this paper, we investigate a multi-speaker latent space to improve neural TTS, either by adapting the system to new speakers with only a few minutes of speech or by enhancing a premium voice with data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with speaker information embedded in both the spectral and speaker latent spaces. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model can achieve an MOS score of 4.16 in naturalness and 4.64 in speaker similarity, close to human recordings (4.74). For a well-trained premium voice, we can achieve an MOS score of 4.5 for out-of-domain texts, which is comparable to an MOS of 4.58 for professional recordings and significantly outperforms the single-speaker result of 4.28.
Index Terms— neural TTS, multi-speaker modeling, speaker adaptation, sequence-to-sequence modeling, auto-regressive generative model
1. INTRODUCTION
In the past few years, there has been significant research progress in neural TTS modeling with an end-to-end structure, e.g. Tacotron+WaveNet [1–3], Char2Wav [4], DeepVoice [5] and VoiceLoop [6]. These models are all trained directly from text-speech data pairs, via sequence-to-sequence mapping with an attention mechanism. This approach bypasses the need for a well-designed linguistic feature front-end and the acoustic features used in traditional HMM or DNN/LSTM based systems. It also makes it much easier to extend the system to different languages and speakers. Recent research has shown that neural TTS can generate natural speech with high fidelity that is close to natural recordings for in-domain test sentences [3].
Although neural TTS can generate state-of-the-art natural-sounding speech, it needs a much larger amount of training data to train a stable and high-quality model than the traditional HMM/DNN/LSTM approach.
Usually, a corpus of around 15 hours of speech may still not be enough to train a good end-to-end model. Moreover, a key challenge for end-to-end neural TTS is its generalization ability: degradation of naturalness when synthesizing out-of-domain sentences does happen, particularly for long sentences with rather complex contexts. For all these problems, adding more training data is a brute-force solution. But such heavy data requirements cannot be satisfied by a single-speaker corpus, for which the data is always limited. We can instead augment the training data by combining corpora from multiple speakers to train a multi-speaker model, which has proven to be a good way to relieve the dependency on data size for end-to-end neural TTS modeling [7]. Another benefit of multi-speaker modeling is creating customized voices for different speakers from small corpora via speaker adaptation, which has been widely used in traditional TTS systems [8–10]. But the naturalness and speaker similarity are still far from perfect, and the adapted voice also sounds a bit muffled due to vocoder effects. If we can create high-quality customized voices under low-resource settings, it will be very attractive for TTS-related applications such as role playing in fairy tales and speech-to-speech translation. However, more efficient modeling technology is needed for high-quality customized voice creation, and multi-speaker neural TTS is promising for this problem.
There has already been some research on multi-speaker neural TTS modeling [6, 7, 11–14]. All of these systems rely on speaker embeddings to train a multi-speaker model on a multi-speaker corpus. DeepVoice2 and DeepVoice3 focus on building a multi-speaker model that can generate speech in different voices with less data than a single-speaker model [7, 11]. But the naturalness MOS is only about 3.7 for speakers seen in the training set, which is a bit low compared with the results in [14]. The results are similar for VoiceLoop [6, 13], still not good in terms of naturalness. Other work explores few-shot speaker adaptation, which can generate speech for new speakers using only a few seconds of speech [12–14]. Among them, the best results are obtained in [14], where the average naturalness MOS is above 4.0 for both seen and unseen speakers. But the speaker similarity is still not good enough, especially for unseen speakers, which blocks real applications.
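As a point of reference, the speaker-embedding conditioning shared by these systems can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the architecture of this paper or of any cited system: all module names, dimensions, and the GRU-based decoder step are illustrative assumptions. The common recipe is that a speaker id indexes a trainable embedding table, and the resulting vector is concatenated with the attention context at every autoregressive decoder step.

```python
import torch
import torch.nn as nn

class SpeakerConditionedDecoderCell(nn.Module):
    """Toy decoder step conditioned on a learned speaker embedding."""

    def __init__(self, num_speakers: int, spk_dim: int = 64,
                 context_dim: int = 256, mel_dim: int = 80,
                 hidden_dim: int = 512):
        super().__init__()
        # One trainable vector per speaker; a new speaker could be enrolled
        # by adding and fine-tuning a row of this table on a small corpus.
        self.speaker_table = nn.Embedding(num_speakers, spk_dim)
        self.rnn = nn.GRUCell(context_dim + spk_dim + mel_dim, hidden_dim)
        self.mel_proj = nn.Linear(hidden_dim, mel_dim)

    def forward(self, context, prev_mel, speaker_id, hidden):
        spk = self.speaker_table(speaker_id)              # (B, spk_dim)
        rnn_in = torch.cat([context, spk, prev_mel], dim=-1)
        hidden = self.rnn(rnn_in, hidden)
        return self.mel_proj(hidden), hidden              # next mel frame

# Usage: one decoding step for a batch of two utterances, speakers 0 and 3.
cell = SpeakerConditionedDecoderCell(num_speakers=10)
context = torch.randn(2, 256)   # attention context from the text encoder
prev_mel = torch.zeros(2, 80)   # previous mel frame (go-frame here)
hidden = torch.zeros(2, 512)
mel, hidden = cell(context, prev_mel, torch.tensor([0, 3]), hidden)
```

One common design choice under this scheme is that few-shot adaptation updates only the new speaker's embedding row (and possibly a small subset of decoder weights), which is what keeps the enrollment data requirement low.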