Deployments
Use deployments for more control over how your models run.
Replicate makes it easy to run machine learning models. You can run the best open-source models with just one line of code, or deploy your own custom models. But sometimes you need more control. That’s where deployments come in.
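For example, with the official Python client a public model runs in a single call (the model and prompt below are just illustrative, and `REPLICATE_API_TOKEN` must be set in your environment):

```python
import replicate

# Run a public text-to-image model in one call; the client blocks
# until output is ready. The model and prompt are example values.
output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={"prompt": "an astronaut riding a horse"},
)
print(output)
```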
What are deployments?
Deployments give you production-grade control over your model’s infrastructure and provide private, dedicated API endpoints.
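As a rough sketch with the Python client, calling a deployment looks like this (the `acme/my-image-generator` deployment name and the prompt are hypothetical):

```python
import replicate

# Look up the deployment by owner/name, then send a prediction
# to its private, dedicated endpoint.
deployment = replicate.deployments.get("acme/my-image-generator")

prediction = deployment.predictions.create(
    input={"prompt": "a studio photo of a rainbow rose"}
)
prediction.wait()
print(prediction.output)
```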
Hardware flexibility
Choose from multiple GPU architectures including NVIDIA A100s, H100s, T4s, and more. Switch hardware types without changing your code to optimize for performance or cost.
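For instance, a deployment's hardware can be changed in place through the deployments API. The sketch below calls the HTTP API directly; the deployment name and the `gpu-a100-large` SKU are illustrative, and the SKUs available to your account may differ:

```python
import os
import requests

# Move an existing deployment onto different hardware without touching
# the code that calls it. Deployment name and SKU are illustrative.
resp = requests.patch(
    "https://api.replicate.com/v1/deployments/acme/my-image-generator",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={"hardware": "gpu-a100-large"},
)
resp.raise_for_status()
print(resp.json())
```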
Intelligent scaling
- Auto-scaling: Scale from zero to hundreds of instances based on traffic
- Always-on instances: Keep models warm to eliminate cold start delays
- Traffic-based scaling: Automatically add capacity during peak usage
- Scale-to-zero: Reduce costs by shutting down unused instances
Zero-downtime deployments
- Rolling updates: Deploy new model versions without interrupting service (see the sketch after this list)
- Canary deployments: Test new versions on a subset of traffic
- Instant rollbacks: Revert to previous versions if issues arise
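As a sketch against the HTTP API (the deployment name and version ids are placeholders), rolling a deployment forward to a new version, and back again, might look like:

```python
import os
import requests

API = "https://api.replicate.com/v1/deployments/acme/my-image-generator"
HEADERS = {"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"}

# Roll the deployment forward to a new model version; existing instances
# keep serving traffic while the new version rolls out.
requests.patch(API, headers=HEADERS, json={"version": "NEW_VERSION_ID"}).raise_for_status()

# Rolling back is the same call with the previous version id.
requests.patch(API, headers=HEADERS, json={"version": "PREVIOUS_VERSION_ID"}).raise_for_status()
```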
Production monitoring
- Real-time metrics: Track latency, throughput, and error rates
- Instance health: Monitor whether instances are starting, idle, or processing
- Cost tracking: View detailed usage and spending analytics
- Request logs: Analyze predictions flowing through your model
Metrics dashboard
Each deployment provides a comprehensive metrics dashboard with the following features:
Data retention: View up to 24 hours of historical metrics data, giving you a full day’s worth of performance insights.
Aggregation windows: Metrics are aggregated into 15-minute intervals for optimal performance and quick loading, while still providing detailed trend analysis.
Available metrics:
- Latency: Track response times and identify performance bottlenecks
- Throughput: Monitor requests per second and capacity utilization
- Error rates: Identify and troubleshoot failed predictions
- Instance status: See how many instances are starting, idle, or actively processing requests
- Queue depth: Monitor pending predictions waiting for processing
- GPU memory usage: Monitor how much GPU memory your deployment is using across all instances
The metrics graphs automatically refresh and provide interactive controls for zooming and filtering data by time range.
GPU memory monitoring
GPU memory monitoring helps you optimize resource utilization and ensure your models are running efficiently. The GPU memory visualization shows:
Total memory available: The total GPU memory capacity allocated to your deployment across all instances.
Memory usage patterns: Both median and maximum GPU memory usage over configurable time periods (2 hours or 24 hours).
Multi-instance aggregation: Memory usage is aggregated across all running instances in your deployment, giving you a comprehensive view of resource utilization.
This monitoring helps you:
- Identify if your model is using GPU memory efficiently
- Determine if you need different hardware for better performance
- Spot memory usage patterns and potential optimizations
- Plan capacity for scaling your deployment
Access GPU memory monitoring by visiting your deployment page at replicate.com/deployments and selecting the specific deployment you want to monitor.
Enterprise security
- Private endpoints: Dedicated URLs that only you can access
- Audit logging: Track all model access and configuration changes
Deployments work with both open-source models and your own custom models.
Autoscaling
Deployments auto-scale according to demand. If you send a lot of traffic, they scale up to handle it, and when things are quiet they scale back down, so you only pay for what you need. You can also cap the maximum number of instances a deployment can use to bound your spend, or set a minimum to keep some instances warm and ready for predictions.
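As an illustrative sketch using the HTTP API (the names, version id, and hardware SKU are placeholders), these bounds can be set when you create a deployment and adjusted later with the same kind of PATCH call shown earlier:

```python
import os
import requests

# Create a deployment that scales to zero when idle and never runs
# more than five instances. All identifiers below are placeholders.
resp = requests.post(
    "https://api.replicate.com/v1/deployments",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    json={
        "name": "my-image-generator",
        "model": "acme/my-model",
        "version": "MODEL_VERSION_ID",
        "hardware": "gpu-a100-large",
        "min_instances": 0,  # scale to zero when there is no traffic
        "max_instances": 5,  # cap instances to bound spend
    },
)
resp.raise_for_status()
print(resp.json())
```

Setting the minimum to 1 or more keeps that many instances warm at all times, trading some idle cost for fewer cold starts.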