Production Deployment

This guide covers best practices and considerations for deploying applications with Prometheus metrics to production environments.

Pre-Deployment Checklist

1. Metric Design Review

Metric names follow Prometheus conventions
Label cardinality is manageable (< 1000 series per metric)
All metrics have clear help text
No sensitive data in metric names or labels
Appropriate metric types chosen

2. Performance Validation

Metrics collection tested under expected load
Memory usage is acceptable
Metrics endpoint responds within 1 second
No performance degradation in application
Load testing completed with metrics enabled

3. Security Configuration

Metrics endpoint not exposed to public internet
Authentication configured if needed
Firewall rules configured
HTTPS enabled for external access
No sensitive data exposed in metrics

4. Monitoring Setup

Prometheus configured to scrape application
Scrape interval set appropriately
Dashboards created in Grafana
Critical alerts configured
Alert routing configured
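
For the alert routing item, a minimal Alertmanager sketch might look like the following. The receiver names and webhook URLs are hypothetical placeholders, and the `severity` label is assumed to be set by your alert rules:

```yaml
# alertmanager.yml (receiver names and URLs are illustrative placeholders)
route:
  receiver: 'default'            # fallback receiver for anything not matched below
  group_by: ['alertname', 'job']
  routes:
    - matchers:
        - severity="critical"    # assumes alert rules set a severity label
      receiver: 'on-call'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alert-hook.internal:8080/default'
  - name: 'on-call'
    webhook_configs:
      - url: 'http://alert-hook.internal:8080/page'
```

Alerts labelled `severity: critical` go to the `on-call` receiver; everything else falls through to `default`.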

Deployment Strategies

Single Instance Deployment

For single-instance applications:

# prometheus.yml
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['app.example.com:9090']
    scrape_interval: 15s

Multi-Instance Deployment

For applications with multiple instances:

scrape_configs:
  - job_name: 'my-app-cluster'
    static_configs:
      - targets:
        - 'app1.example.com:9090'
        - 'app2.example.com:9090'
        - 'app3.example.com:9090'
        labels:
          environment: 'production'
          cluster: 'main'

Load Balanced Deployment

Behind a load balancer:

scrape_configs:
  - job_name: 'my-app'
    # Scrape individual instances, not load balancer
    static_configs:
      - targets:
        - 'app-instance1.internal:9090'
        - 'app-instance2.internal:9090'

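Important: Prometheus should scrape individual instances, not the load balancer endpoint. A load balancer sends each scrape to an arbitrary backend, so counters from different instances would be interleaved into a single series and rates would be meaningless.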
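
If the instance list changes often, one way to avoid hard-coding targets is DNS-based service discovery. This is a sketch that assumes a hypothetical internal DNS name, `my-app.internal`, which resolves to one A record per instance:

```yaml
scrape_configs:
  - job_name: 'my-app'
    dns_sd_configs:
      - names: ['my-app.internal']   # hypothetical name, one A record per instance
        type: A
        port: 9090                   # A records carry no port, so it is set here
```

Prometheus re-resolves the name periodically and scrapes each returned address as its own target.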

Cloud Deployment

AWS

Using EC2 service discovery:

scrape_configs:
  - job_name: 'my-app-aws'
    ec2_sd_configs:
      - region: us-east-1
        port: 9090
    relabel_configs:
      - source_labels: [__meta_ec2_tag_App]
        regex: my-app
        action: keep

Azure

Using Azure service discovery:

scrape_configs:
  - job_name: 'my-app-azure'
    azure_sd_configs:
      - subscription_id: '<subscription-id>'
        resource_group: 'my-resource-group'
        port: 9090

Docker/Kubernetes

Using Kubernetes service discovery:

scrape_configs:
  - job_name: 'my-app-k8s'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: my-app
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
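
For the relabel rules above to keep a pod, the pod needs both the `app` label and the `prometheus.io/scrape` annotation. A minimal pod manifest sketch, with a placeholder image name, might look like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app                       # matched by __meta_kubernetes_pod_label_app
  annotations:
    prometheus.io/scrape: "true"      # annotation values must be strings
spec:
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0.0   # placeholder image
      ports:
        - containerPort: 9090         # the pod role creates one target per declared port
```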

Configuration Management

Environment-Specific Settings

type
  TMetricsConfig = record
    Enabled: Boolean;
    Port: Integer;
    Path: string;
    ScrapeInterval: Integer;  // seconds
  end;

function LoadMetricsConfig: TMetricsConfig;
begin
  Result.Enabled := StrToBoolDef(GetEnvironmentVariable('METRICS_ENABLED'), True);
  Result.Port := StrToIntDef(GetEnvironmentVariable('METRICS_PORT'), 9090);
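  // SysUtils.GetEnvironmentVariable has no default-value parameter,
  // so fall back to '/metrics' manually when the variable is unset.
  Result.Path := GetEnvironmentVariable('METRICS_PATH');
  if Result.Path = '' then
    Result.Path := '/metrics';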
  Result.ScrapeInterval := StrToIntDef(GetEnvironmentVariable('SCRAPE_INTERVAL'), 15);
end;

Feature Toggles

procedure RegisterMetrics;
begin
  // Always register core metrics
  TCounter.Create('requests_total', 'Total requests').Register();
  TGauge.Create('memory_bytes', 'Memory usage').Register();

  // Optional detailed metrics
  if FeatureFlags.IsEnabled('detailed_metrics') then
  begin
    THistogram.Create('request_duration_seconds', 'Request duration').Register();
    THistogram.Create('response_size_bytes', 'Response size').Register();
  end;
end;

Security Best Practices

1. Network Security

Internal Network Only:

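// Check if request is from internal network (IPv4 only).
// Note: '172.16.' matches only 172.16.0.0/16, not the full
// 172.16.0.0/12 private range (172.16.x.x through 172.31.x.x).
function IsInternalIP(const AIP: string): Boolean;
begin
  Result := AIP.StartsWith('10.') or
            AIP.StartsWith('172.16.') or
            AIP.StartsWith('192.168.') or
            (AIP = '127.0.0.1');  // parentheses required: "or" binds tighter than "="
end;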

procedure HandleMetricsRequest(ARequest: TRequest; AResponse: TResponse);
begin
  if not IsInternalIP(ARequest.RemoteAddr) then
  begin
    AResponse.StatusCode := 403;
    AResponse.Content := 'Forbidden';
    Exit;
  end;

  // Serve metrics...
end;

2. Authentication

Basic Auth:

procedure HandleMetricsRequest(ARequest: TRequest; AResponse: TResponse);
begin
  if not ValidateBasicAuth(ARequest.Headers['Authorization']) then
  begin
    AResponse.StatusCode := 401;
    AResponse.SetHeader('WWW-Authenticate', 'Basic realm="Metrics"');
    Exit;
  end;

  // Serve metrics...
end;

Configure Prometheus:

scrape_configs:
  - job_name: 'secured-app'
    static_configs:
      - targets: ['app.example.com:9090']
    basic_auth:
      username: 'prometheus'
      password: 'secret'
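
The inline `password` above is fine for a demonstration, but it puts the secret in the configuration file. As a sketch of an alternative, `basic_auth` also accepts `password_file`, which reads the password from a separate file; the path shown here is a hypothetical example:

```yaml
scrape_configs:
  - job_name: 'secured-app'
    static_configs:
      - targets: ['app.example.com:9090']
    basic_auth:
      username: 'prometheus'
      # Hypothetical path; restrict file permissions to the Prometheus user
      password_file: /etc/prometheus/secrets/metrics-password
```

`password` and `password_file` are mutually exclusive, so use only one of them per job.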

3. HTTPS

Use HTTPS for external access:

scrape_configs:
  - job_name: 'secure-app'
    scheme: https
    tls_config:
      ca_file: /path/to/ca.crt
      cert_file: /path/to/client.crt
      key_file: /path/to/client.key
    static_configs:
      - targets: ['app.example.com:443']

4. Data Privacy

Never expose sensitive data:

// Bad - exposes sensitive data
TCounter.Create('user_actions', 'Actions', ['user_email', 'api_key']);

// Good - no sensitive data
TCounter.Create('user_actions', 'Actions', ['user_type', 'action_type']);

Performance Optimization

1. Metrics Endpoint Caching

For high-traffic scenarios:

type
  TMetricsCache = class
  private
    FCache: string;
    FLastUpdate: TDateTime;
    FCacheDuration: Integer;  // seconds
    FLock: TCriticalSection;
  public
    constructor Create(ACacheDuration: Integer = 5);
    function GetMetrics: string;
  end;

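function TMetricsCache.GetMetrics: string;
begin
  // FLock is a TCriticalSection (it must be created in the constructor),
  // so use its own Enter/Leave methods rather than TMonitor.
  FLock.Enter;
  try
    // Re-render only when the cached text is older than FCacheDuration seconds
    if (Now - FLastUpdate) > (FCacheDuration / SecsPerDay) then
    begin
      var LExposer := TTextExposer.Create;
      try
        FCache := LExposer.Render(
          TCollectorRegistry.DefaultRegistry.Collect()
        );
        FLastUpdate := Now;
      finally
        LExposer.Free;
      end;
    end;

    Result := FCache;
  finally
    FLock.Leave;
  end;
end;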

2. Selective Metric Registration

Register only necessary metrics:

// Development
if IsDebugMode then
  RegisterDebugMetrics();

// Production
if IsProduction then
  RegisterCoreMetrics();

3. Label Cardinality Control

Monitor and limit cardinality:

procedure CheckCardinality;
var
  LSamples: TArray<TMetricSamples>;
  LTotalSeries: Integer;
begin
  LSamples := TCollectorRegistry.DefaultRegistry.Collect();
  LTotalSeries := 0;

  for var LMetric in LSamples do
    LTotalSeries := LTotalSeries + Length(LMetric.Samples);

  if LTotalSeries > 10000 then
    LogWarning('High metric cardinality: %d series', [LTotalSeries]);
end;
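
Cardinality can also be guarded on the Prometheus side. The sketch below uses `sample_limit` to fail a scrape that returns too many samples and `metric_relabel_configs` to drop a label before ingestion. The `request_id` label is a hypothetical example of a high-cardinality label, and the limit mirrors the 10000 threshold used above:

```yaml
scrape_configs:
  - job_name: 'my-app'
    sample_limit: 10000          # the whole scrape fails if more samples are returned
    static_configs:
      - targets: ['app.example.com:9090']
    metric_relabel_configs:
      # Drop a hypothetical high-cardinality label before ingestion
      - regex: request_id
        action: labeldrop
```

When dropping a label, make sure the remaining labels still identify each series uniquely; otherwise the scrape will contain duplicate series.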

Monitoring the Metrics System

Self-Monitoring

Monitor the metrics system itself:

var
  GMetricsCollectionDuration: THistogram;
  GMetricsSeriesTotal: TGauge;
  GMetricsMemoryBytes: TGauge;

procedure RegisterSelfMetrics;
begin
  GMetricsCollectionDuration := THistogram.Create(
    'metrics_collection_duration_seconds',
    'Time spent collecting metrics'
  ).Register();

  GMetricsSeriesTotal := TGauge.Create(
    'metrics_series_total',
    'Total number of metric series'
  ).Register();

  GMetricsMemoryBytes := TGauge.Create(
    'metrics_memory_bytes',
    'Memory used by metrics system'
  ).Register();
end;

function CollectMetrics: string;
var
  LStopwatch: TStopwatch;
  LSamples: TArray<TMetricSamples>;
  LExposer: TTextExposer;
begin
  LStopwatch := TStopwatch.StartNew;
  try
    LSamples := TCollectorRegistry.DefaultRegistry.Collect();

    // Update self-metrics
    var LSeriesCount := 0;
    for var LMetric in LSamples do
      LSeriesCount := LSeriesCount + Length(LMetric.Samples);

    GMetricsSeriesTotal.SetTo(LSeriesCount);

    LExposer := TTextExposer.Create;
    try
      Result := LExposer.Render(LSamples);
    finally
      LExposer.Free;
    end;
  finally
    LStopwatch.Stop;
    GMetricsCollectionDuration.Observe(LStopwatch.Elapsed.TotalSeconds);
  end;
end;
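
Once the self-metrics are scraped, they can be summarised on the Prometheus side. The following recording rule is a sketch that assumes the histogram is exposed with the standard `_bucket` series; the rule and group names are illustrative:

```yaml
groups:
  - name: metrics-self-monitoring
    rules:
      # 95th percentile of metrics collection time over the last 5 minutes
      - record: job:metrics_collection_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(metrics_collection_duration_seconds_bucket[5m]))
          )
```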

Troubleshooting in Production

Logging

Add logging for debugging:

procedure HandleMetricsRequest(ARequest: TRequest; AResponse: TResponse);
begin
  LogDebug('Metrics request from: %s', [ARequest.RemoteAddr]);

  try
    var LMetrics := GetMetrics();
    AResponse.Content := LMetrics;
    LogDebug('Metrics response: %d bytes', [Length(LMetrics)]);
  except
    on E: Exception do
    begin
      LogError('Metrics error: %s', [E.Message]);
      raise;
    end;
  end;
end;

Health Checks

Implement a health check endpoint:

procedure HandleHealthCheck(ARequest: TRequest; AResponse: TResponse);
var
  LHealth: TJSONObject;
begin
  LHealth := TJSONObject.Create;
  try
    LHealth.AddPair('status', 'healthy');
    LHealth.AddPair('metrics_enabled', TJSONBool.Create(True));
    LHealth.AddPair('metrics_count',
      TJSONNumber.Create(GetMetricsCount()));

    AResponse.ContentType := 'application/json';
    AResponse.Content := LHealth.ToString;
  finally
    LHealth.Free;
  end;
end;

Debugging Metrics Issues

// Add debug endpoint (only in non-production)
{$IFNDEF RELEASE}
procedure HandleDebugMetrics(ARequest: TRequest; AResponse: TResponse);
var
  LInfo: TStringList;
  LSamples: TArray<TMetricSamples>;
begin
  LInfo := TStringList.Create;
  try
    LSamples := TCollectorRegistry.DefaultRegistry.Collect();

    LInfo.Add('Total metrics: ' + IntToStr(Length(LSamples)));

    for var LMetric in LSamples do
    begin
      LInfo.Add('');
      LInfo.Add('Metric: ' + LMetric.MetricName);
      LInfo.Add('  Type: ' + GetEnumName(TypeInfo(TMetricType),
        Ord(LMetric.MetricType)));
      LInfo.Add('  Samples: ' + IntToStr(Length(LMetric.Samples)));
    end;

    AResponse.Content := LInfo.Text;
  finally
    LInfo.Free;
  end;
end;
{$ENDIF}

Backup and Recovery

Metrics Configuration Backup

Store metrics configuration in version control:

// MetricsConfig.pas (a Delphi unit name must match its file name)
unit MetricsConfig;

interface

procedure RegisterApplicationMetrics;

implementation

// Add a "uses" clause here listing the Prometheus client units that
// declare TCounter, TGauge and THistogram.

procedure RegisterApplicationMetrics;
begin
  // HTTP Metrics
  TCounter.Create('http_requests_total',
    'Total HTTP requests', ['method', 'status', 'endpoint']).Register();

  THistogram.Create('http_request_duration_seconds',
    'HTTP request duration', ['method', 'endpoint']).Register();

  // Database Metrics
  TGauge.Create('db_connections_active',
    'Active database connections').Register();

  // Application Metrics
  TGauge.Create('app_memory_bytes',
    'Application memory usage').Register();
end;

end.

Scaling Considerations

Horizontal Scaling

Each instance exposes its own metrics:

scrape_configs:
  - job_name: 'app-cluster'
    static_configs:
      - targets: ['app1:9090', 'app2:9090', 'app3:9090']
    # Each target is stored as separate series (distinguished by the
    # instance label); aggregate across instances at query time
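
Aggregation across instances happens in queries or recording rules rather than at scrape time. As a sketch, a recording rule that sums the request rate over all instances of a job could look like this (the group and rule names are illustrative):

```yaml
groups:
  - name: my-app-aggregates
    rules:
      # Total request rate per job, summed over all instances
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```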

High Availability

Run two identical Prometheus replicas that scrape the same targets, distinguished only by an external label:

# prometheus-1.yml
global:
  external_labels:
    replica: prometheus-1

# prometheus-2.yml
global:
  external_labels:
    replica: prometheus-2
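
With two replicas evaluating the same rules, each alert is sent twice, and the differing `replica` label would stop Alertmanager from deduplicating them. A common approach, sketched here with a hypothetical Alertmanager address, is to drop that label from outgoing alerts in both configuration files:

```yaml
# In both prometheus-1.yml and prometheus-2.yml
alerting:
  alert_relabel_configs:
    # Remove the replica label so alerts from both replicas are identical
    - regex: replica
      action: labeldrop
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.internal:9093']   # hypothetical address
```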

Alerting

Critical Alerts

groups:
  - name: critical-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in production"

      - alert: ApplicationDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Application instance is down"
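
The metrics endpoint itself can be alerted on using the `scrape_duration_seconds` and `scrape_samples_scraped` series that Prometheus records for every target. The thresholds below are a sketch that mirrors the 1 second and 10000 series figures used earlier in this guide; adjust them to your environment:

```yaml
groups:
  - name: metrics-endpoint-alerts
    rules:
      - alert: MetricsScrapeSlow
        expr: scrape_duration_seconds{job="my-app"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Metrics endpoint takes longer than 1 second to scrape"

      - alert: MetricsSeriesCountHigh
        expr: scrape_samples_scraped{job="my-app"} > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Metrics endpoint exposes more than 10000 samples"
```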

Documentation

Operations Runbook

Provide the operations team with a runbook along these lines:

# Metrics Operations Guide

## Endpoints
- Metrics: http://app:9090/metrics
- Health: http://app:9090/health

## Common Issues
1. Metrics endpoint slow
   - Check cardinality: curl -s app:9090/metrics | grep -vc '^#'   (counts sample lines, excluding HELP/TYPE comments)
   - Review recent label changes

2. Missing metrics
   - Verify Prometheus can reach endpoint
   - Check Prometheus targets page
   - Verify metrics are registered

## Emergency Contacts
- On-call: ...
- Team lead: ...
