CosyVoice-Enhanced 🚀

Enhanced CosyVoice with Advanced API Services, WebUI, and Pretrained Voice Support

✨ Key Enhancements

This enhanced version of CosyVoice adds powerful production-ready features on top of the original CosyVoice project:

🎯 Pretrained Voice Support

8 Built-in Voices: Keira, 步非烟, 营销号-女声, 嘉然, 钟离, etc.
Instant Loading: No need for audio samples or prompts
Optimized Performance: RTF ~1.4-1.6 with smart caching
SFT Pathway: Proper routing for pretrained vs zero-shot voices
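
The routing in the last bullet can be pictured with the minimal sketch below. It is not the project's actual code: `list_pretrained_voices`, `select_pathway`, and the `"sft"` / `"zero_shot"` labels are hypothetical names, and the only detail taken from this README is that pretrained voices live as `*.pt` files under `voice/`.

```python
from pathlib import Path
from typing import Optional


def list_pretrained_voices(voice_dir: str = "voice") -> set[str]:
    """Return the stem of every *.pt file in voice_dir (empty set if the directory is missing)."""
    root = Path(voice_dir)
    if not root.is_dir():
        return set()
    return {path.stem for path in root.glob("*.pt")}


def select_pathway(
    voice_name: str,
    pretrained: set[str],
    prompt_audio: Optional[bytes] = None,
) -> str:
    """Pick a synthesis pathway (hypothetical labels, for illustration only)."""
    if voice_name in pretrained:
        # Known pretrained voice: no prompt audio is needed.
        return "sft"
    if prompt_audio:
        # Unknown voice, but the caller supplied a reference sample.
        return "zero_shot"
    raise ValueError(
        f"Voice {voice_name!r} is not pretrained and no prompt audio was provided"
    )
```

For example, `select_pathway("Keira", {"Keira", "钟离"})` takes the pretrained branch, while an unknown name with uploaded audio falls through to the zero-shot branch.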

🌐 Production API Service

FastAPI Backend: RESTful endpoints with OpenAPI documentation
Multiple Output Modes: Standard, SSE streaming, WAV streaming
Low-Latency Features: Smart caching avoids the 1-2s feature-extraction delay on repeat requests
GPU Optimization: FP16 support with 3-10x speed improvements
Performance Monitoring: Real-time RTF tracking and benchmarks

🎨 Modern WebUI

Gradio 3.x Compatible: Fixed compatibility issues
Voice Management: Easy selection between zero-shot and pretrained
Real-time Preview: Instant audio generation and playback
Responsive Design: Works on desktop and mobile

⚡ Smart Caching System

Feature Caching: Pre-extract and cache voice features
Pretrained Embedding Cache: Instant voice switching
Performance Boost: RTF improvements from 2.3→1.6
Memory Efficient: Intelligent cache management
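
As a rough illustration of the caching idea above, here is a small least-recently-used cache. It is a sketch under assumptions, not the contents of `cached_voice_manager.py`: the class name `VoiceFeatureCache`, its methods, and the eviction policy are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable


class VoiceFeatureCache:
    """LRU cache mapping a voice name to its extracted features (illustrative only)."""

    def __init__(self, max_size: int = 100):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, voice_name: str, compute: Callable[[str], Any]) -> Any:
        """Return cached features, calling compute(voice_name) only on a miss."""
        if voice_name in self._items:
            # Mark as most recently used.
            self._items.move_to_end(voice_name)
            self.hits += 1
            return self._items[voice_name]
        self.misses += 1
        features = compute(voice_name)
        self._items[voice_name] = features
        if len(self._items) > self.max_size:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return features
```

The first request for a voice pays the extraction cost; later requests for the same voice return immediately until the entry is evicted.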

🏗️ Architecture

CosyVoice-Enhanced/
├── api_service.py          # FastAPI production server
├── webui_service.py        # Modern Gradio WebUI
├── cached_voice_manager.py # Smart caching system
├── audio_utils.py          # Audio processing utilities
├── voice/                  # Pretrained voice files (*.pt)
├── cosyvoice/             # Core CosyVoice engine
└── pretrained_models/     # Model checkpoints

🚀 Quick Start

1. Installation

git clone https://github.com/yourusername/CosyVoice-Enhanced.git
cd CosyVoice-Enhanced

# Create conda environment
conda create -n cosyvoice-enhanced python=3.10
conda activate cosyvoice-enhanced

# Install dependencies
pip install -r requirements.txt

# For RTX 30/40 series GPUs
pip install -r requirements-rtx30-40.txt

2. Model Setup

# Download models via ModelScope (run in a Python session)
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')

3. Start Services

# Start API service (production)
python api_service.py --port 8080

# Start WebUI (development/demo)
python webui_service.py --port 7861

📡 API Usage

REST API Examples

# Standard TTS with pretrained voice
curl -X POST http://localhost:8080/v1/audio/speech \
  -F "voice_name=Keira" \
  -F "text=Hello, this is CosyVoice Enhanced!" \
  -o output.wav

# SSE Streaming (real-time)
curl -N -X POST http://localhost:8080/v1/audio/speech/sse \
  -F "voice_name=嘉然" \
  -F "text=实时流式语音合成测试"

# List available voices
curl http://localhost:8080/voices | jq .

# Performance monitoring
curl http://localhost:8080/performance | jq .

Python SDK

import requests

# TTS generation with a pretrained voice
response = requests.post(
    "http://localhost:8080/v1/audio/speech",
    data={
        "voice_name": "Keira",
        "text": "Enhanced CosyVoice is amazing!"
    },
    timeout=120,
)
# Fail loudly instead of saving an error body as audio
response.raise_for_status()

# Save the returned audio
with open("output.wav", "wb") as f:
    f.write(response.content)
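
The SSE endpoint can also be consumed from Python. The sketch below parses a generic Server-Sent Events stream. The event payload format of this service is not documented here, so the code yields the raw `data:` payloads and leaves decoding to the caller. `iter_sse_data` and `stream_speech` are hypothetical helper names, and the request fields are copied from the curl example above.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event from an iterable of text lines."""
    buffer = []
    for line in lines:
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            if value.startswith(" "):
                value = value[1:]  # strip the single optional leading space
            buffer.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    if buffer:
        # Lenient: flush a trailing event even if the stream ended without a blank line.
        yield "\n".join(buffer)


def stream_speech(url: str, voice_name: str, text: str) -> Iterator[str]:
    """POST to the SSE endpoint and yield raw event payloads (assumed usage)."""
    import requests  # imported lazily so the parser above has no dependencies

    with requests.post(
        url, data={"voice_name": voice_name, "text": text}, stream=True, timeout=120
    ) as response:
        response.raise_for_status()
        response.encoding = "utf-8"  # SSE streams are UTF-8
        yield from iter_sse_data(response.iter_lines(decode_unicode=True))
```

A call such as `stream_speech("http://localhost:8080/v1/audio/speech/sse", "Keira", "Hello")` would then yield one string per event as the server produces it.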

🎭 Available Voices

Pretrained Voices (Instant)

Keira: English female voice
步非烟: Chinese elegant female
营销号-女声: Chinese marketing-style female
嘉然: Chinese cute female
钟离: Chinese deep male
叶内法: Chinese sophisticated female

Zero-shot Voices

Upload your own audio samples for custom voices
Automatic feature extraction and caching
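
One way to cache features for uploaded samples is to key them on the content of the audio rather than on its file name, so that re-uploading the same sample reuses the earlier extraction. The sketch below is an assumption about how such a key could be built; `prompt_cache_key` is a hypothetical helper and is not taken from the project.

```python
import hashlib


def prompt_cache_key(audio_bytes: bytes, prompt_text: str = "") -> str:
    """Derive a stable cache key from the uploaded audio and its optional transcript."""
    digest = hashlib.sha256()
    digest.update(audio_bytes)
    digest.update(b"\x00")  # separator so audio and text cannot run together
    digest.update(prompt_text.encode("utf-8"))
    return digest.hexdigest()
```

Identical uploads map to the same key regardless of the file name, and any change to the audio or the transcript produces a different key.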

📊 Performance

Benchmark Results

Voice Type	RTF (First Call)	RTF (Cached)	Latency
Pretrained	1.8-2.9	1.4-1.6	~6-8s
Zero-shot	2.5-3.2	1.5-1.8	~7-10s
CPU Mode	8-12	6-10	~30-45s
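
For readers unfamiliar with the metric: RTF (real-time factor) is synthesis time divided by the duration of the generated audio, so values above 1 mean synthesis takes longer than playback. The sketch below shows the arithmetic. The function names are illustrative, and the sample rate is a parameter because it depends on the model in use.

```python
import time
from typing import Callable, Tuple


def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return synthesis_seconds / audio_seconds


def timed(synthesize: Callable[[], int]) -> Tuple[float, int]:
    """Run synthesize(), which returns the number of samples produced, and time it."""
    start = time.perf_counter()
    num_samples = synthesize()
    return time.perf_counter() - start, num_samples
```

As a worked case, 7.0 s spent producing 120,000 samples at an assumed 24,000 Hz (5.0 s of audio) gives an RTF of 7.0 / 5.0 = 1.4.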

GPU Optimization

FP16 Support: 2-3x speed improvement
CUDA Acceleration: Automatic GPU detection
Memory Efficient: Smart memory management
Batch Processing: Handles multiple requests

🔧 Configuration

Environment Variables

export COSYVOICE_MODEL_DIR="pretrained_models/CosyVoice2-0.5B"
export COSYVOICE_CACHE_SIZE="100"
export COSYVOICE_GPU_MEMORY_FRACTION="0.8"
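
How `api_service.py` actually reads these variables is not shown here. The sketch below is one plausible way to load and validate them, with defaults mirroring the example values above. `ServiceConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ServiceConfig:
    model_dir: str
    cache_size: int
    gpu_memory_fraction: float


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Build a config from environment variables, falling back to the README's example values."""
    cache_size = int(env.get("COSYVOICE_CACHE_SIZE", "100"))
    fraction = float(env.get("COSYVOICE_GPU_MEMORY_FRACTION", "0.8"))
    if cache_size < 1:
        raise ValueError("COSYVOICE_CACHE_SIZE must be at least 1")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("COSYVOICE_GPU_MEMORY_FRACTION must be in (0, 1]")
    return ServiceConfig(
        model_dir=env.get("COSYVOICE_MODEL_DIR", "pretrained_models/CosyVoice2-0.5B"),
        cache_size=cache_size,
        gpu_memory_fraction=fraction,
    )
```

Validating at startup turns a mistyped value into an immediate error rather than a failure on the first request.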

API Configuration

# api_service.py
api_service = CosyVoiceAPIService(
    model_dir="pretrained_models/CosyVoice2-0.5B",
    use_gpu=True,
    fp16=True,
    cache_size=100
)

🐳 Docker Deployment

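# CUDA 11.8 development image; Ubuntu 22.04 provides Python 3.10
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

# The CUDA base image ships without Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

COPY . /app
WORKDIR /app

RUN pip3 install -r requirements.txt
EXPOSE 8080

CMD ["python3", "api_service.py", "--host", "0.0.0.0", "--port", "8080"]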

docker build -t cosyvoice-enhanced .
docker run -d --gpus all -p 8080:8080 cosyvoice-enhanced

🤝 Contributing

Contributions are welcome! Please read our contributing guidelines and submit pull requests.

Development Setup

git clone https://github.com/yourusername/CosyVoice-Enhanced.git
cd CosyVoice-Enhanced
pip install -e .

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

This enhanced version maintains the same Apache 2.0 license as the original CosyVoice project to ensure compatibility and proper attribution.

🙏 Acknowledgments

Built on top of the CosyVoice project by the FunAudioLLM team.

Core Dependencies

CosyVoice: Base TTS engine
FastAPI: Modern API framework
Gradio: WebUI framework
PyTorch: Deep learning framework
ModelScope: Model hosting platform

📞 Support

Issues: GitHub Issues
Discussions: GitHub Discussions

🗺️ Roadmap

Real-time Streaming: WebSocket support for real-time TTS
Voice Cloning: Advanced zero-shot voice cloning
Multi-language: Enhanced multilingual support
Mobile SDK: iOS/Android SDK development
Cloud Deployment: Kubernetes helm charts
Voice Studio: Advanced voice management interface

⭐ Star this repo if you find it useful! ⭐
