PyPI download | deepvoice3_pytorch-0.0.1.tar.gz

Tags: pytorch, artificial intelligence, python, deep learning, machine learning

Uploaded 2022-02-13 11:02:18 · 21KB · GZ

26 files in total: 16 .py, 4 .txt, 2 PKG-INFO

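PyPI download | deepvoice3_pytorch-0.0.1.tar.gz: exploring a Python speech synthesis framework

PyPI (the Python Package Index) is the main package repository for Python developers, hosting packages that anyone can download and use. One of them is `deepvoice3_pytorch-0.0.1.tar.gz`, the source archive of a speech synthesis framework aimed at deep learning practitioners.

`deepvoice3_pytorch` is an open-source, PyTorch-based project that implements end-to-end text-to-speech (TTS) models. It gives researchers and engineers a convenient base for their own speech synthesis experiments. The version number 0.0.1 marks an early release: it likely contains the basic functionality but may not have been extensively tested or polished.

PyTorch is a deep learning framework developed by the Facebook AI Research team, known for its ease of use and flexibility. Its dynamic computation graphs make building and debugging models more intuitive, which suits research work. In `deepvoice3_pytorch`, PyTorch is used to build and train the neural networks that convert text into speech.

The framework touches on the following key topics:

1. **End-to-end models**: Traditional speech synthesis systems are usually built from several modules, such as text preprocessing, acoustic modeling, and waveform generation. End-to-end models try to generate speech directly from text input, which simplifies the pipeline and reduces the need for hand-crafted features.
2. **Multi-speaker synthesis**: According to the README, the package provides both single-speaker and multi-speaker versions of DeepVoice3, so one model can produce speech in different voices. This is useful in applications such as personalized voice assistants or film production.
3. **Convolutional neural networks (CNN) and recurrent neural networks (RNN)**: Both are common in speech synthesis. CNNs capture local patterns, while RNNs model dependencies along the time axis. The README describes the models in this package as convolutional sequence-to-sequence models.
4. **Attention**: When processing long sequences, an attention mechanism helps the model focus on the relevant part of the input, which improves output quality.
5. **WaveNet and Griffin-Lim**: Waveform generation techniques that turn a predicted spectrogram into audible sound.
6. **Data preprocessing**: Before training, the raw audio is converted into features the model can learn from. The README states that preprocessing extracts mel-spectrograms and linear spectrograms.
7. **Optimization**: Optimizers such as Adam or SGD adjust the model parameters to minimize the loss function. The README notes that Adam with step learning-rate decay works, and that Adam with Noam's learning-rate scheduler is more stable for deeper networks.
8. **Model evaluation**: Synthesized speech is usually rated with metrics such as MOS (Mean Opinion Score), which relies on subjective human judgment.

By studying and using `deepvoice3_pytorch`, developers can learn the basic principles of speech synthesis and see how PyTorch is applied in a deep learning project. The project also gives researchers an experimental platform for trying new speech synthesis techniques and algorithms.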
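To make the MOS evaluation mentioned above concrete, here is a minimal sketch of how a mean opinion score and an approximate confidence interval could be computed from listener ratings. The function name `mean_opinion_score`, the 1 to 5 scale check, and the sample ratings are assumptions made for illustration; nothing here comes from the package itself.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for listener ratings on a 1-5 scale.

    half_width is a normal-approximation confidence half-interval;
    z=1.96 corresponds to roughly 95% coverage.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(ratings)
    half_width = z * sd / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings, made up for illustration only.
ratings = [4, 5, 3, 4, 4, 5, 3, 4]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Real listening tests usually collect many ratings per utterance from several listeners, so the interval above is only a rough guide to how much the score might move with a different panel.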

deepvoice3_pytorch-0.0.1.tar.gz (26 files)

deepvoice3_pytorch-0.0.1

setup.cfg 38B

README.md 10KB

PKG-INFO 277B

deepvoice3_pytorch.egg-info

dependency_links.txt 1B

PKG-INFO 277B

SOURCES.txt 808B

top_level.txt 19B

requires.txt 149B

deepvoice3_pytorch

nyanko.py 16KB

deepvoice3.py 24KB

modules.py 8KB

frontend

__init__.py 2KB

text

cmudict.py 2KB

symbols.py 618B

numbers.py 2KB

cleaners.py 3KB

__init__.py 2KB

__init__.py 355B

__init__.py 820B

builder.py 10KB

__init__.py 5KB

version.py 22B

conv.py 2KB

LICENSE.md 1KB

MANIFEST.in 29B

setup.py 2KB

# Deepvoice3_pytorch

[![Build Status](https://travis-ci.org/r9y9/deepvoice3_pytorch.svg?branch=master)](https://travis-ci.org/r9y9/deepvoice3_pytorch)

PyTorch implementation of convolutional networks-based text-to-speech synthesis models:

1. [arXiv:1710.07654](https://arxiv.org/abs/1710.07654): Deep Voice 3: 2000-Speaker Neural Text-to-Speech.
2. [arXiv:1710.08969](https://arxiv.org/abs/1710.08969): Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.

Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.

## Highlights

- Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
- Multi-speaker and single speaker versions of DeepVoice3
- Audio samples and pre-trained models
- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets
- Language-dependent frontend text processor for English and Japanese

## Pretrained models

| URL | Model | Data | Hyper parameters | Git commit | Steps |
|-----|------------|----------|--------------------------------------------------|----------------------|--------|
| [link](https://www.dropbox.com/s/cs6d070ommy2lmh/20171213_deepvoice3_checkpoint_step000210000.pth?dl=0) | DeepVoice3 | LJSpeech | `builder=deepvoice3,preset=deepvoice3_ljspeech` | [4357976](https://github.com/r9y9/deepvoice3_pytorch/tree/43579764f35de6b8bac2b18b52a06e4e11b705b2) | 21k ~ |
| [link](https://www.dropbox.com/s/1y8bt6bnggbzzlp/20171129_nyanko_checkpoint_step000585000.pth?dl=0) | Nyanko | LJSpeech | `builder=nyanko,preset=nyanko_ljspeech` | [ba59dc7](https://github.com/r9y9/deepvoice3_pytorch/tree/ba59dc75374ca3189281f6028201c15066830116) | 58.5k |
| [link](https://www.dropbox.com/s/uzmtzgcedyu531k/20171222_deepvoice3_vctk108_checkpoint_step000300000.pth?dl=0) | Multi-speaker DeepVoice3 | VCTK | `builder=deepvoice3_vctk,preset=deepvoice3_vctk` | [0421749](https://github.com/r9y9/deepvoice3_pytorch/tree/0421749af908905d181f089f06956fddd0982d47) | 30k + 30k |

See "Synthesize from a checkpoint" section in the README for how to generate speech samples. Please make sure that you are on the specific git commit noted above.

## Notes on hyper parameters

- Default hyper parameters, used during preprocessing/training/synthesis stages, are tuned for English TTS using LJSpeech dataset. You will have to change some of the parameters if you want to try other datasets. See `hparams.py` for details.
- `builder` specifies which model you want to use. `deepvoice3`, `deepvoice3_multispeaker` [1] and `nyanko` [2] are supported.
- `presets` represents hyper parameters known to work well for particular dataset/model from my experiments. Before you try to find your best parameters, I would recommend you to try those presets by setting `preset=${name}`.
e.g., for LJSpeech, you can try either
```
python train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_deepvoice3 \
    --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" \
    --log-event-path=log/deepvoice3_preset
```
or
```
python train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko \
    --hparams="builder=nyanko,preset=nyanko_ljspeech" \
    --log-event-path=log/nyanko_preset
```
- Hyper parameters described in DeepVoice3 paper for single speaker didn't work for LJSpeech dataset, so I changed a few things: added dilated convolution, more channels, more layers, guided attention loss, etc. See code for details. The changes are also applied to the multi-speaker model.
- Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem enough.
- With guided attention (see https://arxiv.org/abs/1710.08969), alignments get monotonic more quickly and reliably if we use multiple attention layers. With guided attention, I can confirm five attention layers get monotonic, though I cannot get speech quality improvements.
- Binary divergence (described in https://arxiv.org/abs/1710.08969) seems to stabilize training, particularly for deep (> 10 layers) networks.
- Adam with step lr decay works. However, for deeper networks, I find Adam + Noam's lr scheduler is more stable.
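
Two of the notes above, guided attention and Noam's learning-rate scheduler, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in this repository: the function names are hypothetical, `g=0.2` follows the value suggested in arXiv:1710.08969, and the `init_lr` and `warmup_steps` defaults (and the scaling that makes the peak equal `init_lr`) are choices made for this example.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix W[n][t] = 1 - exp(-((n/N - t/T)^2) / (2 g^2)).

    Entries are close to 0 near the diagonal and close to 1 far from it.
    """
    return [
        [1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
         for t in range(n_frames)]
        for n in range(n_text)
    ]


def guided_attention_loss(attention, weights):
    """Mean of A[n][t] * W[n][t]; small when attention mass sits near the diagonal."""
    total = sum(a * w
                for row_a, row_w in zip(attention, weights)
                for a, w in zip(row_a, row_w))
    return total / (len(attention) * len(attention[0]))


def noam_learning_rate(step, init_lr=1e-3, warmup_steps=4000):
    """Linear warm-up to init_lr at warmup_steps, then decay proportional to step ** -0.5."""
    step = max(step, 1)
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5,
                                               step ** -0.5)


# Toy 4x4 alignments: a perfectly diagonal one and its mirror image.
W = guided_attention_weights(4, 4)
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
mirrored = [row[::-1] for row in diagonal]
print("diagonal loss:", guided_attention_loss(diagonal, W))
print("mirrored loss:", guided_attention_loss(mirrored, W))
print("lr at steps 1000, 4000, 16000:",
      [noam_learning_rate(s) for s in (1000, 4000, 16000)])
```

The guided attention term is added to the training loss so that alignments far from the diagonal are penalized, which is consistent with the note that alignments become monotonic more quickly. The Noam schedule ramps the learning rate up during warm-up and then decays it, which may explain why it behaves more gently for deeper networks.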

## Requirements

- Python 3
- PyTorch >= v0.3
- TensorFlow >= v1.3
- [tensorboard-pytorch](https://github.com/lanpa/tensorboard-pytorch) (master)
- [nnmnkwii](https://github.com/r9y9/nnmnkwii) >= v0.0.11
- [MeCab](http://taku910.github.io/mecab/) (Japanese only)

## Installation

Please install packages listed above first, and then

```
git clone https://github.com/r9y9/deepvoice3_pytorch
cd deepvoice3_pytorch
pip install -e ".[train]"
```

If you want Japanese text processing frontend, install additional dependencies by:

```
pip install -e ".[jp]"
```

## Getting started

### 0. Download dataset

- LJSpeech (en): https://keithito.com/LJ-Speech-Dataset/
- VCTK (en): http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html
- JSUT (jp): https://sites.google.com/site/shinnosuketakamichi/publication/jsut

### 1. Preprocessing

Preprocessing can be done by `preprocess.py`. Usage is:

```
python preprocess.py ${dataset_name} ${dataset_path} ${out_dir}
```

Supported `${dataset_name}`s for now are

- `ljspeech` (en, single speaker)
- `vctk` (en, multi-speaker)
- `jsut` (jp, single speaker)

Suppose you want to preprocess LJSpeech dataset and have it in `~/data/LJSpeech-1.0`, then you can preprocess data by:

```
python preprocess.py ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech
```

When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in `./data/ljspeech`.

### 2. Training

Basic usage of `train.py` is:

```
python train.py --data-root=${data-root} --hparams="parameters you want to override"
```

Suppose you want to build a DeepVoice3-style model using LJSpeech dataset with default hyper parameters, then you can train your model by:

```
python train.py --data-root=./data/ljspeech/ --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"
```

Model checkpoints (.pth) and alignments (.png) are saved in `./checkpoints` directory every 5000 steps by default.

If you are building a Japanese TTS model, then for example,

```
python train.py --data-root=./data/jsut --hparams="frontend=jp" --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"
```

`frontend=jp` tells the training script to use Japanese text processing frontend. Default is `en` and uses English text processing frontend.

Note that there are many hyper parameters and design choices. Some are configurable by `hparams.py` and some are hardcoded in the source (e.g., dilation factor for each convolution layer). If you find better hyper parameters, please let me know!

### 3. Monitor with Tensorboard

Logs are dumped in `./log` directory by default. You can monitor logs by tensorboard:

```
tensorboard --logdir=log
```

### 4. Synthesize from a checkpoint

Given a list of text, `synthesis.py` synthesizes audio signals from a trained model. Usage is:

```
python synthesis.py ${checkpoint_path} ${text_list.txt} ${output_dir}
```

Example text_list.txt:

```
Generative adversarial network or variational auto-encoder.
Once upon a time there was a dear little girl who was loved by every one who looked at her, but most of all by her grandmother, and there was nothing that she would not have given to the child.
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
```

## Advanced usage

### Multi-speaker model

Currently VCTK is the only supported dataset for building a multi-speaker model.
Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to [vctk_prepr
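
The `--hparams` strings passed to `train.py` in the training commands earlier in this README are comma-separated `name=value` pairs. As a rough illustration of how such an override string could be merged into a set of defaults, here is a small standard-library sketch. The helper names and the example defaults are hypothetical; this is not the parser used by `hparams.py`, and it does not handle quoted values, lists, or type conversion.

```python
def parse_hparam_overrides(spec):
    """Parse a string like "k1=v1,k2=v2" into a dict of strings.

    Hypothetical helper for illustration only.
    """
    overrides = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty strings and trailing commas
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected name=value, got {item!r}")
        overrides[key.strip()] = value.strip()
    return overrides


def apply_overrides(defaults, spec):
    """Return a copy of defaults updated with overrides; unknown names are rejected."""
    overrides = parse_hparam_overrides(spec)
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"unknown hyper parameters: {unknown}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Illustrative defaults only; the real defaults live in hparams.py.
defaults = {"builder": "deepvoice3", "preset": "", "frontend": "en"}
merged = apply_overrides(defaults, "builder=nyanko,preset=nyanko_ljspeech")
print(merged)
```

Rejecting unknown names is a design choice in this sketch: with many hyper parameters, a misspelled override that is silently ignored can waste a long training run.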
