Generating Novel and Realistic Speakers for Voice Conversion

Abstract

Voice conversion (VC) models modify timbre while preserving paralinguistic features, enabling applications like dubbing and identity protection. However, most VC systems require access to target utterances, limiting their use when target data is unavailable or when users desire conversion to entirely novel, unseen voices. To address this, we introduce SpeakerVAE, a lightweight method for generating novel speakers for VC. Our approach uses a deep hierarchical variational autoencoder to model the speaker timbre space. By sampling from the trained model, we generate novel speaker representations for voice synthesis in a VC pipeline. The proposed method is a flexible plug-in module compatible with various VC models and requires no co-training or fine-tuning of the base VC system. We evaluate our approach with two state-of-the-art VC models, FACodec and CosyVoice2. The results demonstrate that our method successfully generates novel, unseen speakers with quality comparable to that of the training speakers.

Index Terms—  Voice Conversion, VAE, Speaker Generation

1 Introduction

Voice conversion (VC) models offer the capability to transform the timbre of a source speaker’s voice to that of a target speaker [ju2024naturalspeech, du2024cosyvoice, chen2022controlvc]. This is highly valuable in various applications. For instance, a single voice actor could dub an entire film using multiple distinct voices and performance styles. Another significant use case involves individuals wishing to alter their voice to protect their identity. Consequently, users often require not only novel, high-quality voices that are distinctive and free from legal restrictions, but also a diverse range of these generated voices to select from.

However, current voice conversion models typically depend on access to pre-recorded target speaker utterances [cao2024neuralvc, baade2024neural, zhang2025vevo]. Generally, two main approaches exist: disentanglement-based methods and large language model (LLM) based systems. Disentanglement-based approaches [chen2022controlvc, yao2024promptvc, yang2024streamvc, wang2025spark] represent speech through separate embeddings for its distinct components, such as content, speaker identity, and pitch. During conversion, these methods require extracting speaker embeddings from target recordings to serve as input, while other components like content and prosody are retained from the source speaker. LLM-based systems [du2024cosyvoice, chen2024f5], on the other hand, utilize acoustic features like mel-spectrograms or speech tokens from the target voice as prompts. These prompts are often concatenated with other conditioning information, such as content representations, and then directly input into the speech-LLM to generate acoustic tokens for a subsequent vocoder. Both methods, however, assume that target utterances exist and thus lack the ability to generate new speakers.

Some recent models use interpolation to generate new speaker embeddings by combining existing ones [chattts2024]. However, these methods often do not ensure that the generated speakers are sufficiently novel or of high audio quality. While certain text-to-speech (TTS) models can generate novel speakers using label inputs, like sex and age, this technique has not yet been applied to voice conversion [tacospawn, shi2023voicelens]. Moreover, in many such studies, speaker generation is a side feature rather than the primary research focus. As a result, existing methods lack a systematic methodology for evaluating the fidelity and novelty of these generated speakers, specifically, whether the proposed generation methods truly produce speakers distinct from the training set and whether these new speakers yield high audio quality.

In this paper, we address these limitations by proposing a novel speaker generation method named SpeakerVAE to create new, natural-sounding speakers specifically for voice conversion applications. Our approach utilizes a deep hierarchical variational autoencoder [vahdat2020nvae] to model the speaker space learned by the speaker module of a pre-trained voice conversion system. By sampling from this learned VAE space, we can generate diverse and unique speaker representations.

The proposed system offers two key advantages:

Efficiency: It requires minimal training, focusing only on the VAE module, with no retraining needed for the existing voice conversion model.

Flexibility: The method is adaptable and can be applied to various voice conversion systems and different distributions of speaker embedding spaces, whether they are regulated speaker verification embeddings or unregulated style vectors learned jointly with a synthesis system.

Additionally, we propose a systematic approach to evaluate the naturalness and range of the generated speakers. This evaluation framework aims to ensure that the newly generated speakers not only exhibit high generation quality but also possess statistical features comparable to those observed in the training dataset. The subsequent sections will detail the architecture of our proposed SpeakerVAE, the generation process, and the comprehensive evaluation results.

2 Speaker Generation with NVAE

The key idea of our pseudo speaker generation approach is to learn a generative model from speaker embeddings extracted from utterances of a training dataset with many different real speakers. After learning this generative model, one can then sample a pseudo speaker embedding and use it as the target speaker embedding in voice conversion applications. Ideally, the sampled pseudo speaker embedding does not collide with any real speaker embeddings in the training set, but its distribution follows that of the training speaker embeddings, ensuring the naturalness and range of pseudo speakers.

There are many generative models to consider for this purpose. In this paper, we propose to use the Nouveau Variational Auto-Encoder (NVAE) [vahdat2020nvae], a deep hierarchical VAE. We choose NVAE due to its ability to model complex, high-dimensional distributions through a hierarchical structure. Speaker embeddings, which encapsulate a diverse range of human vocal characteristics, inherently form such a complex space. These characteristics include variations in pitch, timbre, and other nuanced acoustic features that contribute to unique speaker identities. NVAE’s hierarchical approach is particularly good at learning representations at different levels of abstraction from such data. Furthermore, NVAE is designed to avoid over-regularization and preserve fine-grained variability, which is critical for generating diverse yet natural-sounding voices.

The speaker embeddings modeled by NVAE do not need to follow a specific distribution. They can be extracted from speech utterances with speaker verification models such as ECAPA-TDNN [desplanques2020ecapa]. They can also be extracted by speech encoders of voice conversion systems [li2024gtr, li2023styletts]. We argue that the latter may contain richer information for rendering high-quality and natural speech utterances in voice conversion applications. In this work, we train NVAE with both the first type, speaker verification embeddings (used in CosyVoice2 [du2024cosyvoice]), and the second type, encoder-learned embeddings (used in FACodec [ju2024naturalspeech]), to verify its performance.

Fig. 1: SpeakerVAE overview. Left side shows the architecture and training process. Right side depicts the inference pipeline.

2.1 NVAE Model Architecture

The original NVAE architecture [vahdat2020nvae] is a deep hierarchical VAE designed for high-quality image generation, leveraging a multi-scale latent space and residual cells to model complex pixel correlations. It features a sequence of latent variables $z=(z_{1},\ldots,z_{L})$, where $L$ is the total number of hierarchical latent groups and each $z_{l}$ represents the latent variables at level $l$. The generative model $p_{\theta}(x,z)$ (with $\theta$ denoting the decoder parameters and $x$ being the input data) follows a top-down process:

$$p_{\theta}(x,z)=p_{\theta}(x|z_{L})\prod_{l=1}^{L}p_{\theta}(z_{l}|z_{<l}), \tag{1}$$

where $p_{\theta}(z_{1}|z_{<1})=p_{\theta}(z_{1})$ forms the base-case prior, $z_{<l}\equiv(z_{1},\ldots,z_{l-1})$ represents all coarser-level latents, $p_{\theta}(x|z_{L})$ is the observation model, and $p_{\theta}(z_{l}|z_{<l})$ is the conditional prior for $z_{l}$ given the latents from coarser levels $z_{<l}$.

The encoder $q_{\phi}(z|x)$ (where $\phi$ represents the encoder parameters) extracts features from $x$ with a bottom-up network, and each approximate posterior $q_{\phi}(z_{l}|x,z_{<l})$ conditions on $x$ and on the coarser-level latents $z_{<l}$, mirroring the factorization of the prior. NVAE stabilizes training via spectral regularization and residual cells with skip connections. The model optimizes the Evidence Lower Bound (ELBO):

$$\mathcal{L}(\theta,\phi;x)=\mathbb{E}_{q_{\phi}(z|x)}\left[\log p_{\theta}(x|z_{L})\right]-\sum_{l=1}^{L}\mathbb{E}_{q_{\phi}(z_{<l}|x)}\left[D_{\text{KL}}\left(q_{\phi}(z_{l}|x,z_{<l})\,\|\,p_{\theta}(z_{l}|z_{<l})\right)\right], \tag{2}$$

where the first term represents the reconstruction likelihood and the second term contains the KL divergence between the approximate posterior $q_{\phi}(z_{l}|x,z_{<l})$ and the conditional prior $p_{\theta}(z_{l}|z_{<l})$ at each level $l$.
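For intuition about the KL terms, consider the case where both the approximate posterior and the conditional prior at a level are diagonal Gaussians, as in the original NVAE. This is an assumption here, since the distribution family used in SpeakerVAE is not stated in this section. Writing the posterior as $\mathcal{N}(\mu_{q},\sigma_{q}^{2})$ and the prior as $\mathcal{N}(\mu_{p},\sigma_{p}^{2})$, with $i$ indexing latent dimensions, each KL term has the standard closed form:

$$D_{\text{KL}}\left(\mathcal{N}(\mu_{q},\sigma_{q}^{2})\,\|\,\mathcal{N}(\mu_{p},\sigma_{p}^{2})\right)=\sum_{i}\left[\log\frac{\sigma_{p,i}}{\sigma_{q,i}}+\frac{\sigma_{q,i}^{2}+(\mu_{q,i}-\mu_{p,i})^{2}}{2\sigma_{p,i}^{2}}-\frac{1}{2}\right].$$

The term vanishes when the posterior matches the prior exactly, which is the collapsed solution that regularizers such as free bits are meant to discourage.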

2.2 SpeakerVAE Adaptation

To apply NVAE to our 1D speaker embedding data, several adaptations to the original architecture, initially designed for 2D image data, were necessary. Specifically, we (1) replaced all 2D convolutional operations with 1D equivalents, (2) simplified the model by removing autoregressive normalizing flows while retaining the hierarchical latent structure, (3) implemented robust quantile-based normalization for speaker embeddings, and (4) introduced free-bits regularization [kingma2016improved] with KL coefficient warmup [higgins2017beta] to prevent posterior collapse. We also reconfigured the number of hierarchical levels and the latent dimensions per level to better suit the dimensionality and inherent complexity of speaker embeddings. The architecture of our SpeakerVAE model is shown in Fig. 1.
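The free-bits and KL warmup adaptations can be sketched as follows. This is a minimal illustration and not the authors' training code: the function names, the linear warmup schedule, the per-group application of the free-bits floor, and the values `free_bits=0.1` and `warmup_steps=1000` are all assumptions, since the paper does not report them.

```python
from typing import Sequence


def kl_warmup_coeff(step: int, warmup_steps: int = 1000) -> float:
    """Linearly ramp the KL weight from 0 to 1 over `warmup_steps` (assumed schedule)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def regularized_kl(kl_per_group: Sequence[float], step: int,
                   free_bits: float = 0.1, warmup_steps: int = 1000) -> float:
    """Combine per-group KL terms with a free-bits floor and a warmup coefficient."""
    # Free bits: every latent group contributes at least `free_bits` nats,
    # so the optimizer gains nothing by pushing a group's KL below the floor.
    floored = [max(kl, free_bits) for kl in kl_per_group]
    return kl_warmup_coeff(step, warmup_steps) * sum(floored)


def training_loss(recon_nll: float, kl_per_group: Sequence[float], step: int) -> float:
    """Negative ELBO with the regularized KL term (scalar inputs for illustration)."""
    return recon_nll + regularized_kl(kl_per_group, step)
```

Early in training the KL weight is small, so the model first learns to reconstruct; the floor then keeps individual latent groups from being switched off once the full KL penalty applies.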

During training, we first use the speaker extractor module of the voice conversion models to extract speaker embeddings from training utterances. Then we train the SpeakerVAE model with the ELBO loss. At inference, we 1) sample a novel speaker embedding from SpeakerVAE, with a temperature of 1.0, 2) keep the source utterance’s content/prosody representations unchanged, and 3) invoke the voice conversion model to synthesize speech with the source utterance’s content/prosody and the generated speaker embedding. Because all modules operate in the same latent domain, no additional alignment or fine-tuning is needed, and Sec. 4 shows that the resulting conversions retain intelligibility while achieving perceptually distinct, natural-sounding timbres.
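The sampling step of the inference procedure can be sketched as ancestral sampling through the hierarchy, following the top-down factorization of the prior. The sketch below is illustrative only: `make_toy_prior` and `toy_decoder` are hypothetical stand-ins for the learned prior networks and decoder, and applying the temperature as a scale on the prior standard deviation is an assumption based on common practice for hierarchical VAEs.

```python
import numpy as np


def sample_speaker_embedding(prior_fns, decode_fn, temperature=1.0, rng=None):
    """Ancestral sampling of z_1, ..., z_L from a hierarchical prior, then decoding."""
    rng = np.random.default_rng() if rng is None else rng
    latents = []
    for prior_fn in prior_fns:
        # Each prior sees the coarser latents sampled so far: p(z_l | z_<l).
        mean, log_std = prior_fn(latents)
        std = temperature * np.exp(log_std)  # temperature scales the prior std
        latents.append(mean + std * rng.standard_normal(mean.shape))
    return decode_fn(latents)


def make_toy_prior(dim):
    """Hypothetical stand-in for a learned conditional prior network."""
    def prior_fn(previous_latents):
        if previous_latents:
            context = float(np.mean([z.mean() for z in previous_latents]))
        else:
            context = 0.0
        return np.full(dim, context), np.zeros(dim)  # mean, log-std
    return prior_fn


def toy_decoder(latents):
    """Hypothetical decoder: concatenates the latents into one embedding vector."""
    return np.concatenate(latents)


priors = [make_toy_prior(4), make_toy_prior(4)]
embedding = sample_speaker_embedding(priors, toy_decoder, temperature=1.0,
                                     rng=np.random.default_rng(0))
```

In the actual pipeline, the decoded embedding would be passed to the VC model together with the source utterance’s content/prosody representations.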

2.3 Voice Conversion Models

2.3.1 FACodec

FACodec [ju2024naturalspeech] is a factorized neural speech codec used as one of our VC models. Unlike traditional residual-VQ codecs, FACodec explicitly decomposes the waveform into four disentangled sub-spaces (content, prosody, timbre, and acoustic details) and reconstructs speech from these representations with minimal quality loss, making it a well-suited architecture for SpeakerVAE. Its strong zero-shot VC capability further allows us to evaluate novel speaker embeddings without any data-specific fine-tuning.

The FACodec system utilizes a 1024-dimensional latent space to represent speaker timbre, learned from scratch on its training data. In the SpeakerVAE training stage, we infer speaker embeddings using the speaker encoder module of a pretrained FACodec model; these embeddings then serve as the training dataset for SpeakerVAE. We ensured that the corpus from which we inferred these speaker embeddings was part of the pretrained FACodec model’s training dataset to prevent out-of-domain issues that might cause unexpected behavior. During inference, we extract content tokens from the source speech, concatenate them with a generated speaker embedding, and feed the result to the same pretrained FACodec model to perform voice conversion.

We use a widely-accepted unofficial FACodec implementation and checkpoint from [facodecgithub].

2.3.2 CosyVoice2

CosyVoice2 [du2024cosyvoice] is a streaming-capable, zero-shot text-to-speech and voice conversion model that factorizes speech generation into three successive modules: a supervised semantic tokenizer, a unified text-speech language model, and a chunk-aware causal flow-matching decoder. Crucially for our work, a speaker embedding is used to provide timbre information during the language model stage. This design allows us to integrate our SpeakerVAE model into the CosyVoice2 system, using its output to replace the original speaker embedding input. Furthermore, CosyVoice2 demonstrates strong zero-shot voice conversion ability, a key feature for synthesizing entirely new speakers in this work. The training and inference process for SpeakerVAE+CosyVoice2 is the same as that described for the SpeakerVAE+FACodec model.

The CosyVoice2 system employs a 192-dimensional CAM++ speaker embedding [wang2023campp], which is pre-trained on a speaker verification task. For our experiments, we utilize the official implementation and checkpoints provided by the CosyVoice2 authors [cosyvoicegithub] for both the main CosyVoice2 model and its CAM++ submodule.

3 Experiment

3.1 Baseline

We use a Gaussian Mixture Model (GMM) [reynolds2009gaussian, tacospawn] as a baseline for modeling and generating speaker embeddings, with the implementation from [sklearn2011]. This model uses $k=12$ components with diagonal covariance matrices. To determine the number of components $k$, we experimented with integer values ranging from 3 to 150. For each value, GMM parameters (means, covariance matrices, and mixture weights) were estimated from the training speaker embeddings. To select $k$, we computed the mean squared error (MSE) between the original speaker embeddings and the mean of their most probable GMM component. This MSE metric was chosen over likelihood curves to directly assess how well the GMM components could represent the data points in the embedding space. While the MSE generally decreased with increasing $k$, we selected $k=16$, as this value was identified within an inflection region of the MSE curve, offering a good balance between model complexity and the ability to represent the embeddings. For generation, we then drew 1000 random samples from this fitted GMM. These generated samples were subsequently used as speaker embeddings for the FACodec and CosyVoice2 models to synthesize audio waveforms, in the same manner as SpeakerVAE-generated embeddings.
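The selection procedure described above can be sketched with scikit-learn’s `GaussianMixture`. This is a rough illustration under stated assumptions rather than the authors’ script: the synthetic clustered data stands in for real speaker embeddings, the candidate values of `k` are arbitrary, and `component_mse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def component_mse(embeddings, k, seed=0):
    """Fit a diagonal-covariance GMM and score it by MSE to the assigned component mean."""
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=seed)
    gmm.fit(embeddings)
    # predict() returns the most probable component for each embedding.
    assigned_means = gmm.means_[gmm.predict(embeddings)]
    mse = float(np.mean((embeddings - assigned_means) ** 2))
    return gmm, mse


rng = np.random.default_rng(0)
# Synthetic stand-in for speaker embeddings: three clusters in 8 dimensions.
centers = rng.normal(scale=5.0, size=(3, 8))
embeddings = np.concatenate([c + rng.normal(size=(200, 8)) for c in centers])

models, mse_by_k = {}, {}
for k in (1, 3, 6):  # arbitrary candidates for illustration
    models[k], mse_by_k[k] = component_mse(embeddings, k)

# After inspecting the MSE curve, draw pseudo speaker embeddings from the chosen model.
samples, component_ids = models[3].sample(10)
```

Because the MSE keeps decreasing as `k` grows, it cannot be minimized directly; the point of plotting it over `k` is to locate the region where additional components stop paying off.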

3.2 Dataset

We experiment on the LibriTTS train-clean-100 and train-clean-360 subsets [zen2019libritts]. Together, these subsets offer approximately 460 hours of studio-quality, 24 kHz read speech drawn from public-domain audiobooks, balanced by gender with 553 female and 598 male speakers. We extract one embedding per utterance as a training sample, resulting in 149,715 samples in total.

3.3 Training Setups

3.3.1 Normalization

To standardize the speaker embeddings while minimizing the influence of outliers, we employ a quantile-based normalization technique [merad2023robust]. This process is applied on a per-feature basis. First, we mitigate outliers by clipping values below the 0.001 quantile and above the 0.999 quantile for each feature. Then each feature is independently scaled and shifted to a target range of [-1, 1] using min-max normalization.
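A minimal NumPy sketch of this normalization is given below. The function names are hypothetical, and the `denormalize` step is an assumption: the section does not say how generated embeddings are mapped back to the original scale, but some inverse transform is presumably needed before they are passed to the VC model.

```python
import numpy as np


def fit_quantile_normalizer(embeddings, low_q=0.001, high_q=0.999):
    """Compute per-feature clipping bounds from the training embeddings."""
    lo = np.quantile(embeddings, low_q, axis=0)
    hi = np.quantile(embeddings, high_q, axis=0)
    return lo, hi


def normalize(embeddings, lo, hi, eps=1e-8):
    """Clip each feature to [lo, hi], then min-max scale it to [-1, 1]."""
    clipped = np.clip(embeddings, lo, hi)
    # After clipping, lo and hi are the per-feature min and max.
    return 2.0 * (clipped - lo) / np.maximum(hi - lo, eps) - 1.0


def denormalize(normalized, lo, hi):
    """Map values in [-1, 1] back to the original feature scale (assumed inverse)."""
    return (normalized + 1.0) / 2.0 * (hi - lo) + lo


rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 3))   # synthetic stand-in for speaker embeddings
x[0, 0] = 1e6                    # a single extreme outlier
lo, hi = fit_quantile_normalizer(x)
y = normalize(x, lo, hi)
```

The outlier is clipped to the 0.999 quantile before scaling, so it does not compress the rest of the feature’s range the way plain min-max normalization would.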

3.3.2 Model Configurations

The FACodec system utilizes a 1024-dimensional latent space to represent speaker timbre, learned from scratch using its training data. We use a widely-accepted unofficial FACodec implementation from [facodecgithub]. For the SpeakerVAE model applied to these FACodec embeddings, we configure a hierarchical structure with 2 levels, with each level containing 5 groups, and each group consisting of 20 latent dimensions. Both the encoder and decoder hidden layer sizes are set to 64.

The CosyVoice2 system employs a 192-dimensional CAM++ speaker embedding [wang2023campp], which is pre-trained on a speaker verification task. We use the official implementation of CosyVoice2 from [cosyvoicegithub]. When applying SpeakerVAE to these CosyVoice2 embeddings, we configure it with 2 levels, 3 groups per level, and 8 latent dimensions per group. As in the FACodec setup, the encoder and decoder hidden layer sizes are set to 64.

All SpeakerVAE models are trained on a single Nvidia RTX 4090 GPU for 1000 epochs, using a batch size of 1024.

3.4 Evaluation Metrics

3.4.1 Speaker Generation Quality

We evaluate speaker generation quality along four key aspects: diversity, coverage, fidelity, and stability. All metrics are computed using speaker embeddings extracted with the WavLM-base-plus-sv model [chen2022wavlm], a state-of-the-art speaker verification system. The cosine similarity between two embeddings is calculated as:

$$\text{cos}(\mathbf{a},\mathbf{b})=\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}. \tag{3}$$

All experiments use $m=1000$ utterances. We construct the following datasets for testing:

  • $GT$ (Ground Truth): $m$ randomly sampled utterances from the training dataset

  • $GT_{SameSpeaker}$: $m$ training utterances, one for each utterance in $GT$, each being a randomly selected different utterance from the same speaker

  • $S_{syn}$ (Synthesis with Same Speaker): $m$ resynthesized utterances, one for each $GT$ utterance, produced by a VC model using the speaker embedding extracted from the corresponding $GT_{SameSpeaker}$ utterance

  • $S_{recon}$ (Reconstruction of GT): $m$ reconstructed utterances of the $GT$ utterances

  • $G_{syn}$ (Synthesis with Generated Speaker): $GT$ utterances converted to different target pseudo speakers generated by our model

We define the following cosine similarity based metrics for evaluation:

Pairwise. Measures speaker diversity by calculating the average cosine similarity between all possible combinations of distinct utterances from the specified sets, excluding self-comparisons.

Corresponding. Assesses speaker preservation accuracy by calculating the average cosine similarity between matched utterance pairs sharing the same utterance ID.

Stability. Evaluates speaker generation consistency by measuring the similarity between speaker embeddings extracted from utterances that were converted to the same generated speaker but come from different source speakers.

Natural Consistency. Establishes a baseline for stability by measuring the similarity between different utterances from the same natural speaker, capturing inherent speaker variation.
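The Pairwise and Corresponding metrics can be sketched in NumPy as follows. The function names are hypothetical, the inputs are assumed to be arrays of speaker verification embeddings with one row per utterance, and returning a mean together with a standard deviation is an assumption about how the reported ± values are obtained.

```python
import numpy as np


def _unit(x):
    """Scale each row to unit L2 norm so that dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def corresponding(a, b):
    """Cosine similarity between row-aligned pairs (same utterance ID in both sets)."""
    sims = np.sum(_unit(a) * _unit(b), axis=1)
    return float(sims.mean()), float(sims.std())


def pairwise(a, b=None):
    """Cosine similarity over all pairs; self-comparisons are excluded within one set."""
    if b is None:
        sims = _unit(a) @ _unit(a).T
        # Keep only the strict upper triangle: each distinct pair once, no diagonal.
        sims = sims[np.triu_indices(len(a), k=1)]
    else:
        sims = (_unit(a) @ _unit(b).T).ravel()
    return float(sims.mean()), float(sims.std())


# Tiny example with two orthogonal "speakers".
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
within_mean, _ = pairwise(emb)            # similarity between the two distinct rows
matched_mean, _ = corresponding(emb, 2 * emb)  # scaling does not change cosine
```

Stability and Natural Consistency can reuse the same similarity computation, restricted to pairs of utterances that share a generated or natural speaker.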

3.4.2 Audio Quality

To quantify intelligibility we report word-error rate (WER) and character-error rate (CER). After synthesizing each utterance, we transcribe it with the open-source Whisper-base ASR model (74 M parameters, multilingual) released by OpenAI [cao2012whisper]. WER is computed as

$$\text{WER}=\frac{S+D+I}{N}, \tag{4}$$

where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, and $N$ is the number of words in the reference transcript. CER applies the same formula at the character level. We implement both metrics with the lightweight JiWER Python toolkit [jiwer2024], which provides out-of-the-box WER and CER functions as well as a configurable text-normalization pipeline. For every audio sample, we pass Whisper’s hypothesis and the ground-truth transcript to JiWER and average the per-utterance scores over the test sets.
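To make the formula concrete, the sketch below computes WER and CER with a plain Levenshtein distance. It is a hand-rolled illustration, not the JiWER API used in the paper, and it applies no text normalization; the helper names are hypothetical, and a non-empty reference is assumed.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N over whitespace-separated words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: the same formula applied to characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One deleted word out of six reference words.
example_wer = wer("the cat sat on the mat", "the cat sat on mat")
```

Because insertions are counted in the numerator but not in $N$, WER can exceed 1 when the hypothesis is much longer than the reference.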

To rate perceptual quality without human listening panels, we adopt UTMOSv2 [baba2024t05]. The model takes waveform input, extracts self-supervised features and multi-resolution mel spectrograms, and predicts a mean opinion score (MOS) through a small fully-connected head in a single feed-forward pass, so no manual rating is involved at evaluation time. We report these UTMOSv2 scores alongside intelligibility (WER/CER) as an objective proxy for human naturalness judgments in all our experiments.

4 Results

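Table 1: Speaker generation quality results. Rows marked "-" in the Generator column do not involve generated speakers and report one value per VC system.

| Metric | Generator | FACodec | CosyVoice2 |
|---|---|---|---|
| Pairwise($S_{syn}$, $S_{syn}$), Original Diversity | - | 0.67±0.18 | 0.65±0.19 |
| Pairwise($G_{syn}$, $G_{syn}$), Generated Diversity | SpeakerVAE | 0.74±0.31 | 0.71±0.15 |
| Pairwise($G_{syn}$, $G_{syn}$), Generated Diversity | GMM | 0.65±0.19 | 0.73±0.14 |
| Pairwise($G_{syn}$, $S_{syn}$), Original Coverage | SpeakerVAE | 0.70±0.15 | 0.67±0.16 |
| Pairwise($G_{syn}$, $S_{syn}$), Original Coverage | GMM | 0.64±0.19 | 0.69±0.16 |
| Pairwise($S_{syn}$, $GT$), Distribution Fidelity | - | 0.65±0.17 | 0.67±0.16 |
| Corresponding($S_{syn}$, $GT$), Speaker Fidelity | - | 0.92±0.04 | 0.94±0.03 |
| Corresponding($S_{recon}$, $GT$), Speaker Fidelity | - | 0.93±0.04 | 0.93±0.04 |
| Stability | SpeakerVAE | 0.85±0.10 | 0.90±0.06 |
| Stability | GMM | 0.90±0.06 | 0.86±0.09 |
| Natural Consistency | - | 0.91±0.05 | 0.94±0.03 |
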
Fig. 2: Audio quality metrics.
Fig. 3: UMAP visualization of the training and generated embeddings.

The audio quality metrics in Fig. 2 show that, while speech synthesis introduces artifacts that increase WER/CER for both $S_{syn}$ and $S_{recon}$ compared to $GT$, speech quality remains largely unaffected. For both the FACodec and CosyVoice2 systems, speaker generation using our method maintains intelligibility (WER/CER) as well as perceptual quality (MOS). The exception is the GMM baseline, which degrades FACodec’s intelligibility with substantially higher error rates.

The cosine similarity results in Table 1 compare speaker generation quality between SpeakerVAE and GMM across the FACodec and CosyVoice2 systems. Regarding diversity and coverage, Pairwise($S_{syn}$, $S_{syn}$) measures internal similarity among resynthesized known speakers (Original Diversity), while Pairwise($G_{syn}$, $G_{syn}$) measures internal similarity among model-generated speakers (Generated Diversity). Because both diversity metrics are based on pairwise cosine similarities, a higher score indicates that the speaker embeddings within a set are more similar to each other, which implies lower diversity. The Pairwise($G_{syn}$, $S_{syn}$) metric assesses how well generated speakers cover the original range. Ideally, these three metrics would have similar scores, indicating that generated speakers match the original distribution. For the FACodec system, SpeakerVAE’s generated speakers were slightly less diverse than the resynthesized original set (0.74 vs. 0.67) but achieved better coverage (0.70), whereas GMM’s generated diversity (0.65) was closer to that of the original set, with lower coverage (0.64). In the CosyVoice2 setup, both SpeakerVAE (0.71) and GMM (0.73) produced speakers that were slightly less diverse than the resynthesized original set (0.65). In both VC setups, the three diversity and coverage metrics show slight variations between the models but indicate reasonable performance. Thus, both SpeakerVAE and GMM demonstrate a good ability to model the original distribution.

The fidelity metrics show the similarity between resynthesized known speakers and their ground truth counterparts; higher is better. The high scores in the table show that both the FACodec and CosyVoice2 models can faithfully reconstruct the speaker timbre provided by the input speaker embeddings.

Stability measures the consistency of generated speaker identity across different input texts and source speakers. For FACodec, GMM exhibited higher stability than SpeakerVAE (0.90 vs. 0.85), while for CosyVoice2, SpeakerVAE was more stable than GMM (0.90 vs. 0.86). This variation might relate to how effectively each generation method’s learned embedding space aligns with the specific characteristics of each VC model. Natural Consistency is used as a reference for the stability measure, since both are calculated on utterances with the same speaker identity but different texts. The closeness of the stability and natural consistency scores shows that the synthesized speech maintains a level of consistency comparable to that of real speech.

5 Conclusion and Future Work

SpeakerVAE enables efficient novel speaker generation for voice conversion by modeling the timbre space with a hierarchical VAE. This lightweight, plug-and-play solution requires only VAE training and works across VC systems (FACodec, CosyVoice2). Our evaluation shows that it maintains audio quality (WER/CER/UTMOS comparable to original speakers) while generating speakers whose diversity and fidelity are comparable to those of the original speakers. Future work includes developing an attribute-controlled (age/gender/accent) speaker generation model using guided latent space sampling.