MODELING MULTI-SPEAKER LATENT SPACE TO IMPROVE NEURAL TTS:
QUICK ENROLLING NEW SPEAKER AND ENHANCING PREMIUM VOICE
Yan Deng, Lei He, Frank Soong
Microsoft, China
{yaden, helei, frankkps}@microsoft.com
ABSTRACT
Neural TTS has shown that it can generate high-quality synthesized speech. In this paper, we investigate a multi-speaker latent space to improve neural TTS, either by adapting the system to new speakers with only a few minutes of speech or by enhancing a premium voice with data from other speakers for richer contextual coverage and better generalization. A multi-speaker neural TTS model is built with speaker information embedded in both the spectral and speaker latent spaces. The experimental results show that, with less than 5 minutes of training data from a new speaker, the new model can achieve an MOS score of 4.16 in naturalness and 4.64 in speaker similarity, close to human recordings (4.74). For a well-trained premium voice, we can achieve an MOS score of 4.5 for out-of-domain texts, which is comparable to an MOS of 4.58 for professional recordings and significantly outperforms the single-speaker result of 4.28.
Index Terms— neural TTS, multi-speaker modeling, speaker adaptation, sequence-to-sequence modeling, auto-regressive generative model
1. INTRODUCTION
In the past few years, there has been significant research progress in neural TTS modeling with an end-to-end structure, e.g. Tacotron+WaveNet [1–3], Char2Wav [4], DeepVoice [5] and VoiceLoop [6]. These models are all trained directly from text-speech data pairs, via sequence-to-sequence mapping with an attention mechanism. This approach bypasses the need for a well-designed linguistic feature front-end and the acoustic features used in traditional HMM or DNN/LSTM based systems. It also makes it much easier to extend the system to different languages and speakers. Recent research has shown that neural TTS can generate natural speech with high fidelity that is close to natural recordings for in-domain test sentences [3].
Although neural TTS can generate state-of-the-art natural-sounding speech, it needs a much larger amount of training data to train a stable and high-quality model than the traditional HMM/DNN/LSTM approach.
Usually, a corpus of around 15 hours of speech may still not be enough to train a good end-to-end model. Moreover, a key challenge for end-to-end neural TTS is its generalization ability: degradation of naturalness when synthesizing out-of-domain sentences does happen, particularly for long sentences with rather complex contexts. For all these problems, adding more training data is a brute-force solution. But such heavy data requirements cannot be satisfied by a single-speaker corpus, for which the data is always limited. We can instead augment the training data by combining corpora from multiple speakers to train a multi-speaker model, which has proven to be a good way to relieve the dependency on data size for end-to-end neural TTS modeling [7]. Another benefit of multi-speaker modeling is creating customized voices for different speakers from small corpora via speaker adaptation, which has been widely used in traditional TTS systems [8–10]. But the naturalness and speaker similarity are still far from perfect, and the adapted voice also sounds a bit muffled due to vocoder effects. If we can create high-quality customized voices under low-resource settings, it will be very attractive for TTS-related applications such as role playing in fairy tales and speech-to-speech translation. However, more efficient modeling technology is needed for high-quality customized voice creation, and multi-speaker neural TTS is promising for this problem.
There has already been some research on multi-speaker neural TTS modeling [6, 7, 11–14]. All of these systems rely on speaker embeddings to train a multi-speaker model on a multi-speaker corpus. DeepVoice2 and DeepVoice3 focus on building a multi-speaker model that can generate speech in different voices with less data than a single-speaker model [7, 11]. But the naturalness MOS is only about 3.7 for speakers seen in the training set, which is a bit low compared with the results in [14]. The results are similar for VoiceLoop [6, 13], still not good in terms of naturalness. Other work explores few-shot speaker adaptation, which can generate speech for new speakers using only a few seconds of speech [12–14]. Among them, the best results are obtained in [14], where the average naturalness MOS is above 4.0 for both seen and unseen speakers. But the speaker similarity is still not good enough, especially for unseen speakers, which blocks real applications.
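As a point of reference, the speaker-embedding conditioning shared by these systems can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch, not the architecture of this paper or of any cited system: all module names, dimensions, and the GRU-based decoder step are illustrative assumptions. The common recipe is that a speaker id indexes a trainable embedding table, and the resulting vector is concatenated with the attention context at every autoregressive decoder step.

```python
import torch
import torch.nn as nn

class SpeakerConditionedDecoderCell(nn.Module):
    """Toy decoder step conditioned on a learned speaker embedding."""

    def __init__(self, num_speakers: int, spk_dim: int = 64,
                 context_dim: int = 256, mel_dim: int = 80,
                 hidden_dim: int = 512):
        super().__init__()
        # One trainable vector per speaker; a new speaker could be enrolled
        # by adding and fine-tuning a row of this table on a small corpus.
        self.speaker_table = nn.Embedding(num_speakers, spk_dim)
        self.rnn = nn.GRUCell(context_dim + spk_dim + mel_dim, hidden_dim)
        self.mel_proj = nn.Linear(hidden_dim, mel_dim)

    def forward(self, context, prev_mel, speaker_id, hidden):
        spk = self.speaker_table(speaker_id)              # (B, spk_dim)
        rnn_in = torch.cat([context, spk, prev_mel], dim=-1)
        hidden = self.rnn(rnn_in, hidden)
        return self.mel_proj(hidden), hidden              # next mel frame

# Usage: one decoding step for a batch of two utterances, speakers 0 and 3.
cell = SpeakerConditionedDecoderCell(num_speakers=10)
context = torch.randn(2, 256)   # attention context from the text encoder
prev_mel = torch.zeros(2, 80)   # previous mel frame (go-frame here)
hidden = torch.zeros(2, 512)
mel, hidden = cell(context, prev_mel, torch.tensor([0, 3]), hidden)
```

One common design choice under this scheme is that few-shot adaptation updates only the new speaker's embedding row (and possibly a small subset of decoder weights), which is what keeps the enrollment data requirement low.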