Enhanced CosyVoice with Advanced API Services, WebUI, and Pretrained Voice Support
This enhanced version of CosyVoice adds production-ready features on top of the original project:
- 8 Built-in Voices: Keira, 步非烟, 营销号-女声, 嘉然, 钟离, etc.
- Instant Loading: No need for audio samples or prompts
- Optimized Performance: RTF ~1.4-1.6 with smart caching
- SFT Pathway: Proper routing for pretrained vs zero-shot voices
- FastAPI Backend: RESTful endpoints with OpenAPI documentation
- Multiple Output Modes: Standard, SSE streaming, WAV streaming
- Zero-Latency Features: Smart caching eliminates 1-2s delays
- GPU Optimization: FP16 support with 3-10x speed improvements
- Performance Monitoring: Real-time RTF tracking and benchmarks
- Gradio 3.x Compatible: Fixed compatibility issues
- Voice Management: Easy selection between zero-shot and pretrained
- Real-time Preview: Instant audio generation and playback
- Responsive Design: Works on desktop and mobile
- Feature Caching: Pre-extract and cache voice features
- Pretrained Embedding Cache: Instant voice switching
- Performance Boost: RTF improves from 2.3 to 1.6
- Memory Efficient: Intelligent cache management
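
The caching bullets above describe extracting voice features once and reusing them on later requests. A minimal sketch of that idea follows, assuming a least-recently-used eviction policy; the class name `LRUFeatureCache`, its methods, and the `extract` callback are hypothetical and are not taken from `cached_voice_manager.py`.

```python
from collections import OrderedDict
from typing import Any, Callable


class LRUFeatureCache:
    """Hypothetical LRU cache for per-voice features (illustration only)."""

    def __init__(self, max_entries: int = 100) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, voice_name: str, extract: Callable[[str], Any]) -> Any:
        """Return cached features, or run `extract` once and store the result."""
        if voice_name in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(voice_name)
            self.hits += 1
            return self._entries[voice_name]
        self.misses += 1
        features = extract(voice_name)
        self._entries[voice_name] = features
        if len(self._entries) > self.max_entries:
            # Drop the least recently used entry to bound memory.
            self._entries.popitem(last=False)
        return features

    def __contains__(self, voice_name: str) -> bool:
        return voice_name in self._entries

    def __len__(self) -> int:
        return len(self._entries)


# Usage: the lambda stands in for a real (and expensive) feature extractor.
cache = LRUFeatureCache(max_entries=2)
features = cache.get_or_compute("Keira", lambda name: {"voice": name})
```

Bounding the number of entries keeps memory use predictable; whether the project's cache uses this policy or this default size is an assumption.
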
CosyVoice-Enhanced/
├── api_service.py              # FastAPI production server
├── webui_service.py            # Modern Gradio WebUI
├── cached_voice_manager.py     # Smart caching system
├── audio_utils.py              # Audio processing utilities
├── voice/                      # Pretrained voice files (*.pt)
├── cosyvoice/                  # Core CosyVoice engine
└── pretrained_models/          # Model checkpoints
git clone https://github.com/yourusername/CosyVoice-Enhanced.git
cd CosyVoice-Enhanced
# Create conda environment
conda create -n cosyvoice-enhanced python=3.10
conda activate cosyvoice-enhanced
# Install dependencies
pip install -r requirements.txt
# For RTX 30/40 series GPUs
pip install -r requirements-rtx30-40.txt
# Download models via ModelScope
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
# Start API service (production)
python api_service.py --port 8080
# Start WebUI (development/demo)
python webui_service.py --port 7861
# Standard TTS with pretrained voice
curl -X POST http://localhost:8080/v1/audio/speech \
  -F "voice_name=Keira" \
  -F "text=Hello, this is CosyVoice Enhanced!" \
  -o output.wav
# SSE Streaming (real-time)
curl -N -X POST http://localhost:8080/v1/audio/speech/sse \
  -F "voice_name=嘉然" \
  -F "text=实时流式语音合成测试"
# List available voices
curl http://localhost:8080/voices | jq .
# Performance monitoring
curl http://localhost:8080/performance | jq .
import requests
# TTS Generation
response = requests.post(
    "http://localhost:8080/v1/audio/speech",
    data={
        "voice_name": "Keira",
        "text": "Enhanced CosyVoice is amazing!"
    }
)
response.raise_for_status()  # Fail early instead of saving an error body as audio
with open("output.wav", "wb") as f:
    f.write(response.content)
- Keira: English female voice
- 步非烟: Chinese elegant female
- 营销号-女声: Chinese marketing-style female
- 嘉然: Chinese cute female
- 钟离: Chinese deep male
- ๅถๅ ๆณ: Chinese sophisticated female
- Upload your own audio samples for custom voices
- Automatic feature extraction and caching
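
The two voice types listed above (pretrained voices selected by name, and zero-shot voices built from an uploaded sample) need different synthesis paths. The sketch below shows one way such routing could be decided. The names `SynthesisRoute` and `choose_route` are hypothetical, and the actual logic in `api_service.py` may differ.

```python
from dataclasses import dataclass
from typing import AbstractSet, Optional


@dataclass(frozen=True)
class SynthesisRoute:
    """Hypothetical routing decision (illustration only)."""
    mode: str                   # "sft" for pretrained voices, "zero_shot" otherwise
    voice_name: Optional[str]
    needs_prompt_audio: bool


def choose_route(
    voice_name: Optional[str],
    has_prompt_audio: bool,
    pretrained_voices: AbstractSet[str],
) -> SynthesisRoute:
    """Pick the SFT path for known pretrained voices, else zero-shot."""
    if voice_name and voice_name in pretrained_voices:
        # Pretrained voices need no audio sample or prompt.
        return SynthesisRoute("sft", voice_name, needs_prompt_audio=False)
    if has_prompt_audio:
        # Custom voices are derived from the uploaded sample.
        return SynthesisRoute("zero_shot", voice_name, needs_prompt_audio=True)
    raise ValueError(
        f"Unknown voice {voice_name!r} and no audio sample was provided"
    )


# Usage: a pretrained name takes the SFT path; an upload takes zero-shot.
pretrained = {"Keira"}
route = choose_route("Keira", has_prompt_audio=False, pretrained_voices=pretrained)
```

Rejecting a request that names an unknown voice and carries no sample gives the caller a clear error instead of a silent fallback; that behavior is a design assumption here, not something the project documents.
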
| Voice Type | RTF (First Call) | RTF (Cached) | Latency |
|---|---|---|---|
| Pretrained | 1.8-2.9 | 1.4-1.6 | ~6-8s |
| Zero-shot | 2.5-3.2 | 1.5-1.8 | ~7-10s |
| CPU Mode | 8-12 | 6-10 | ~30-45s |
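
For reading the table above: RTF (real-time factor) is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The helper below is a minimal sketch of that calculation; the function names and the `synthesize` callback are hypothetical and do not reflect the project's own monitoring code.

```python
import time
from typing import Callable, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[], Tuple[int, int]]) -> float:
    """Time one synthesis call.

    `synthesize` is a stand-in that returns (num_samples, sample_rate)
    for the audio it generated.
    """
    start = time.perf_counter()
    num_samples, sample_rate = synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, num_samples / sample_rate)


# Usage: 6 seconds of compute for 4 seconds of audio gives an RTF of 1.5.
rtf = real_time_factor(6.0, 4.0)
```
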
- FP16 Support: 2-3x speed improvement
- CUDA Acceleration: Automatic GPU detection
- Memory Efficient: Smart memory management
- Batch Processing: Handles multiple requests
export COSYVOICE_MODEL_DIR="pretrained_models/CosyVoice2-0.5B"
export COSYVOICE_CACHE_SIZE="100"
export COSYVOICE_GPU_MEMORY_FRACTION="0.8"
# api_service.py
api_service = CosyVoiceAPIService(
    model_dir="pretrained_models/CosyVoice2-0.5B",
    use_gpu=True,
    fp16=True,
    cache_size=100
)
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
EXPOSE 8080
CMD ["python", "api_service.py", "--host", "0.0.0.0", "--port", "8080"]
docker build -t cosyvoice-enhanced .
docker run -d --gpus all -p 8080:8080 cosyvoice-enhanced
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
git clone https://github.com/yourusername/CosyVoice-Enhanced.git
cd CosyVoice-Enhanced
pip install -e .
This project is licensed under the Apache License 2.0; see the LICENSE file for details.
This enhanced version maintains the same Apache 2.0 license as the original CosyVoice project to ensure compatibility and proper attribution.
Built on top of the amazing CosyVoice project by the FunAudioLLM team.
- CosyVoice: Base TTS engine
- FastAPI: Modern API framework
- Gradio: WebUI framework
- PyTorch: Deep learning framework
- ModelScope: Model hosting platform
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
- Real-time Streaming: WebSocket support for real-time TTS
- Voice Cloning: Advanced zero-shot voice cloning
- Multi-language: Enhanced multilingual support
- Mobile SDK: iOS/Android SDK development
- Cloud Deployment: Kubernetes helm charts
- Voice Studio: Advanced voice management interface
⭐ Star this repo if you find it useful! ⭐