Deployments
Use deployments for more control over how your models run.
Replicate makes it easy to run machine learning models. You can run the best open-source models with just one line of code, or deploy your own custom models. But sometimes you need more control. That’s where deployments come in.
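For example, with the official Python client a public model runs in a single call (the model and prompt below are just illustrative, and `REPLICATE_API_TOKEN` must be set in your environment):

```python
import replicate

# Run a public text-to-image model in one call; the client blocks
# until output is ready. The model and prompt are example values.
output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "an astronaut riding a horse"},
)
print(output)
```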
What are deployments?
Deployments give you production-grade control over your model’s infrastructure and provide private, dedicated API endpoints.
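As a rough sketch with the Python client, calling a deployment looks like this (the `acme/my-image-generator` deployment name and the prompt are hypothetical):

```python
import replicate

# Look up the deployment by owner/name, then send a prediction
# to its private, dedicated endpoint.
deployment = replicate.deployments.get("acme/my-image-generator")

prediction = deployment.predictions.create(
    input={"prompt": "a studio photo of a rainbow rose"}
)
prediction.wait()
print(prediction.output)
```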
Hardware flexibility
Choose from multiple GPU architectures including NVIDIA A100s, H100s, T4s, and more. Switch hardware types without changing your code to optimize for performance or cost.
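For instance, a deployment's hardware can be changed in place through the deployments API. The sketch below calls the HTTP API directly; the deployment name and the `gpu-a100-large` SKU are illustrative, and the SKUs available to your account may differ:

```python
import os
import requests

# Move an existing deployment onto different hardware without touching
# the code that calls it. Deployment name and SKU are illustrative.
resp = requests.patch(
    "https://api.replicate.com/v1/deployments/acme/my-image-generator",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={"hardware": "gpu-a100-large"},
)
resp.raise_for_status()
print(resp.json())
```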
Intelligent scaling
- Auto-scaling: Scale from zero to hundreds of instances based on traffic
- Always-on instances: Keep models warm to eliminate cold start delays
- Traffic-based scaling: Automatically add capacity during peak usage
- Scale-to-zero: Reduce costs by shutting down unused instances
Zero-downtime deployments
- Rolling updates: Deploy new model versions without interrupting service (see the sketch after this list)
- Canary deployments: Test new versions on a subset of traffic
- Instant rollbacks: Revert to previous versions if issues arise
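As a sketch against the HTTP API (the deployment name and version ids are placeholders), rolling a deployment forward to a new version, and back again, might look like:

```python
import os
import requests

API = "https://api.replicate.com/v1/deployments/acme/my-image-generator"
HEADERS = {"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}

# Roll the deployment forward to a new model version; existing instances
# keep serving traffic while the new version rolls out.
requests.patch(API, headers=HEADERS, json={"version": "NEW_VERSION_ID"}).raise_for_status()

# Rolling back is the same call with the previous version id.
requests.patch(API, headers=HEADERS, json={"version": "PREVIOUS_VERSION_ID"}).raise_for_status()
```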
Production monitoring
- Real-time metrics: Track latency, throughput, and error rates
- Instance health: Monitor whether instances are starting, idle, or processing
- Cost tracking: View detailed usage and spending analytics
- Request logs: Analyze predictions flowing through your model
Metrics dashboard
Each deployment provides a comprehensive metrics dashboard with the following features:
Data retention: View up to 24 hours of historical metrics data, giving you a full day’s worth of performance insights.
Aggregation windows: Metrics are aggregated into 15-minute intervals for optimal performance and quick loading, while still providing detailed trend analysis.
Available metrics:
- Latency: Track response times and identify performance bottlenecks
- Throughput: Monitor requests per second and capacity utilization
- Error rates: Identify and troubleshoot failed predictions
- Instance status: See how many instances are starting, idle, or actively processing requests
- Queue depth: Monitor pending predictions waiting for processing
- GPU memory usage: Monitor how much GPU memory your deployment is using across all instances
The metrics graphs automatically refresh and provide interactive controls for zooming and filtering data by time range.
GPU memory monitoring
GPU memory monitoring helps you optimize resource utilization and ensure your models are running efficiently. The GPU memory visualization shows:
Total memory available: The total GPU memory capacity allocated to your deployment across all instances.
Memory usage patterns: Both median and maximum GPU memory usage over configurable time periods (2 hours or 24 hours).
Multi-instance aggregation: Memory usage is aggregated across all running instances in your deployment, giving you a comprehensive view of resource utilization.
This monitoring helps you:
- Identify if your model is using GPU memory efficiently
- Determine if you need different hardware for better performance
- Spot memory usage patterns and potential optimizations
- Plan capacity for scaling your deployment
Access GPU memory monitoring by visiting your deployment page at replicate.com/deployments and selecting the specific deployment you want to monitor.
Enterprise security
- Private endpoints: Dedicated URLs that only you can access
- Audit logging: Track all model access and configuration changes
Deployments work with both open-source models and your own custom models.
Autoscaling
Deployments auto-scale according to demand. If you send a lot of traffic, they scale up to handle it, and when things are quiet they scale back down, so you only pay for what you need. You can also cap the maximum number of instances a deployment can use to bound your spend, or set a minimum to keep some instances warm and ready for predictions.
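As an illustrative sketch using the HTTP API (the names, version id, and hardware SKU are placeholders), these bounds can be set when you create a deployment and adjusted later with the same kind of PATCH call shown earlier:

```python
import os
import requests

# Create a deployment that scales to zero when idle and never runs
# more than five instances. All identifiers below are placeholders.
resp = requests.post(
    "https://api.replicate.com/v1/deployments",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={
        "name": "my-image-generator",
        "model": "acme/my-model",
        "version": "MODEL_VERSION_ID",
        "hardware": "gpu-a100-large",
        "min_instances": 0,  # scale to zero when there is no traffic
        "max_instances": 5,  # cap instances to bound spend
    },
)
resp.raise_for_status()
print(resp.json())
```

Setting the minimum to 1 or more keeps that many instances warm at all times, trading some idle cost for fewer cold starts.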