Module 12 of 13 · DevOps & Platform Engineering · Intermediate

Site Reliability Engineering

Duration: 120 min

Site Reliability Engineering (SRE) applies software engineering principles to operations. This module covers SLOs, SLIs, error budgets, incident management, and chaos engineering, the practices that keep systems reliable at scale.

SLOs, SLIs, and SLAs

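SLI (Service Level Indicator): A measurable metric of service performance, such as the percentage of successful requests
SLO (Service Level Objective): A target value for an SLI over a time window (e.g., 99.9% availability over 30 days)
SLA (Service Level Agreement): A contractual commitment to customers, with consequences if it is missed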
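
As a minimal sketch of how an SLI relates to an SLO, the snippet below computes an availability SLI from request counts and compares it with a target. The function names and the request counts are illustrative assumptions, not part of any monitoring tool.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the availability SLI as a percentage of successful requests."""
    if total_requests == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests * 100


def meets_slo(sli_percent: float, slo_target_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_target_percent


# Hypothetical counts for one 30-day window
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.2f}%  SLO met: {meets_slo(sli, 99.9)}")
```

An SLA would typically sit on top of this as a looser, contractual threshold; it is not modeled here.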

# Example SLOs and SLIs
services:
  - name: api-service
    slos:
      - name: availability
        target: 99.9%
        sli: uptime_percentage
        window: 30d
      
      - name: latency
        target: p99 < 200ms
        sli: request_latency_p99
        window: 30d
      
      - name: error_rate
        target: < 0.1%
        sli: error_rate
        window: 30d

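# Prometheus queries for SLIs
# sum() aggregates across label sets so each ratio is computed over all requests;
# without it, the division only matches series that share identical labels.
queries:
  uptime_percentage: |
    (1 - (sum(increase(http_requests_total{status=~"5.."}[30d])) /
           sum(increase(http_requests_total[30d])))) * 100

  request_latency_p99: |
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[30d])))

  error_rate: |
    (sum(increase(http_requests_total{status=~"5.."}[30d])) /
     sum(increase(http_requests_total[30d]))) * 100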
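
To make the latency SLI concrete, the sketch below computes a p99 from raw request durations using the nearest-rank method and checks it against the 200 ms target. This is a simplification under stated assumptions: the sample latencies are made up, and Prometheus's `histogram_quantile` estimates quantiles from bucket counts rather than from raw samples.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: 98 fast requests, 2 slow ones
latencies_ms = [50] * 98 + [180, 450]
p99 = percentile_nearest_rank(latencies_ms, 99)
print(f"p99 = {p99} ms, SLO (p99 < 200 ms) met: {p99 < 200}")
```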

Error Budgets

An error budget is the amount of downtime or errors a service can accumulate within a window without violating its SLO, that is, 1 minus the SLO target:

# Calculate error budget
slo_target = 0.999  # 99.9% availability
time_period_seconds = 30 * 24 * 60 * 60  # 30 days

# Maximum allowed downtime
max_downtime = time_period_seconds * (1 - slo_target)
max_downtime_minutes = max_downtime / 60

print(f"Error budget: {max_downtime_minutes:.2f} minutes per month")
# Output: Error budget: 43.20 minutes per month

# Track error budget consumption
current_downtime = 15  # minutes
remaining_budget = max_downtime_minutes - current_downtime
budget_percentage = (remaining_budget / max_downtime_minutes) * 100

print(f"Remaining budget: {remaining_budget:.2f} minutes ({budget_percentage:.1f}%)")
# Output: Remaining budget: 28.20 minutes (65.3%)
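
A related quantity is the burn rate: how fast the budget is being consumed relative to the pace that would exhaust it exactly at the end of the window. The sketch below assumes a request-based SLI and a constant error ratio; the function names and the 0.5% figure are illustrative.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget runs out exactly at the end of the window;
    values above 1.0 mean it runs out sooner.
    """
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / allowed


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days a full budget lasts at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical: 0.5% of requests are failing against a 99.9% SLO
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(f"Burn rate: {rate:.1f}x, budget lasts {days_until_exhausted(rate):.1f} days")
```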

Incident Management

Incident response process:

# Incident severity levels
# SEV-1: Critical - Complete service outage
# SEV-2: High - Significant degradation
# SEV-3: Medium - Minor impact
# SEV-4: Low - Cosmetic issues

# Declare incident (quote the channel name: an unquoted # starts a shell comment)
incident declare --severity SEV-1 --title "API service down" --channel "#incidents"

# Assign roles
incident assign --role incident-commander --user alice
incident assign --role communications --user bob
incident assign --role technical-lead --user charlie

# Create war room
incident war-room create --incident-id INC-12345

# Post updates
incident update --message "Identified database connection pool exhaustion"

# Resolve incident
incident resolve --incident-id INC-12345 --resolution "Increased connection pool size"
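
The same lifecycle can be modeled in code, for example to drive internal tooling or tests. This is a minimal sketch: the `Incident` class, its fields, and the rule that a commander must be assigned before resolution are all hypothetical and do not correspond to the `incident` CLI above or to any real tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Critical - Complete service outage"
    SEV2 = "High - Significant degradation"
    SEV3 = "Medium - Minor impact"
    SEV4 = "Low - Cosmetic issues"


@dataclass
class Incident:
    incident_id: str
    title: str
    severity: Severity
    roles: dict = field(default_factory=dict)
    updates: list = field(default_factory=list)
    resolution: Optional[str] = None

    def assign(self, role: str, user: str) -> None:
        self.roles[role] = user

    def update(self, message: str) -> None:
        self.updates.append(message)

    def resolve(self, resolution: str) -> None:
        # Assumed policy: an incident commander must exist before closing
        if "incident-commander" not in self.roles:
            raise ValueError("assign an incident commander before resolving")
        self.resolution = resolution

    @property
    def is_open(self) -> bool:
        return self.resolution is None


incident = Incident("INC-12345", "API service down", Severity.SEV1)
incident.assign("incident-commander", "alice")
incident.update("Identified database connection pool exhaustion")
incident.resolve("Increased connection pool size")
print(incident.is_open)
```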

Blameless Postmortems

# Postmortem: API Service Outage on 2024-01-15

## Summary
API service was unavailable for 45 minutes due to database connection pool exhaustion.

## Timeline
- 14:30 UTC: Monitoring alerts triggered for high error rate
- 14:32 UTC: Incident declared (SEV-1)
- 14:35 UTC: Root cause identified: connection pool exhausted
- 14:40 UTC: Temporary mitigation: increased connection pool size
- 14:45 UTC: Service recovered
- 15:15 UTC: Permanent fix deployed

## Root Cause
A new feature deployed at 14:00 UTC created connection leaks in the database driver.
The connection pool was exhausted within 30 minutes.

## Contributing Factors
1. Connection pool monitoring was not in place
2. Load testing did not simulate the new feature's connection behavior
3. Deployment happened during peak traffic hours

## Impact
- 45 minutes of service unavailability
- Affected 10,000+ users
- Error budget consumed: about 104% of the monthly budget (45 of 43.2 minutes at a 99.9% SLO)

## Action Items
1. Implement connection pool monitoring (Owner: Alice, Due: 2024-01-22)
2. Add connection leak detection to load tests (Owner: Bob, Due: 2024-01-22)
3. Implement deployment traffic controls (Owner: Charlie, Due: 2024-01-29)
4. Review database driver version (Owner: Alice, Due: 2024-01-20)

## Lessons Learned
- Connection pool exhaustion is a critical failure mode
- Load testing must simulate realistic connection patterns
- Deployments should be staggered during peak hours
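
Durations such as time to detect and time to recover can be derived directly from a postmortem timeline. The sketch below uses the timestamps from the example above. The metric names are illustrative, and counting from the 14:00 UTC deployment is an assumption; under that assumption the last figure matches the 45 minutes quoted in the summary.

```python
from datetime import datetime

# Timestamps from the example postmortem (all UTC, same day)
EVENTS = {
    "deployed": "14:00",
    "alerted": "14:30",
    "declared": "14:32",
    "recovered": "14:45",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


metrics = {
    "time_to_detect": minutes_between(EVENTS["deployed"], EVENTS["alerted"]),
    "time_to_declare": minutes_between(EVENTS["alerted"], EVENTS["declared"]),
    "alert_to_recovery": minutes_between(EVENTS["alerted"], EVENTS["recovered"]),
    "deploy_to_recovery": minutes_between(EVENTS["deployed"], EVENTS["recovered"]),
}
for name, minutes in metrics.items():
    print(f"{name}: {minutes} min")
```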

Runbooks

# Runbook: High API Latency

## Symptoms
- P99 latency > 500ms
- Error rate increasing
- CPU utilization high

## Diagnosis
1. Check current metrics

       kubectl top nodes
       kubectl top pods -n production

2. Check application logs

       kubectl logs -n production deployment/api-service --tail=100

3. Check database performance

       aws rds describe-db-instances --db-instance-identifier prod-db

## Mitigation (Immediate)
1. Scale up replicas

       kubectl scale deployment api-service -n production --replicas=10

2. Enable caching (changing the environment triggers a rolling restart)

       kubectl set env deployment/api-service -n production CACHE_ENABLED=true

3. Enable session affinity (if necessary). This pins each client IP to one pod, which can help per-pod caches; it does not reduce incoming traffic.

       kubectl patch service api-service -n production -p '{"spec":{"sessionAffinity":"ClientIP"}}'

## Resolution (Long-term)
1. Identify slow queries
2. Add database indexes
3. Optimize application code
4. Increase database resources
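
Runbook steps are often automated once they stabilize. The sketch below builds the scaling command from the mitigation step as an argument list and executes it only when `dry_run` is disabled. The helper names `scale_command` and `run_step` are hypothetical; the `kubectl scale` invocation mirrors the one in the runbook.

```python
import subprocess


def scale_command(deployment, namespace, replicas):
    """Build the kubectl argument list for scaling a deployment."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return ["kubectl", "scale", "deployment", deployment,
            "-n", namespace, f"--replicas={replicas}"]


def run_step(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print("$", " ".join(command))
    if dry_run:
        return None
    # check=True raises CalledProcessError if kubectl exits non-zero
    return subprocess.run(command, check=True, capture_output=True, text=True)


cmd = scale_command("api-service", "production", 10)
run_step(cmd)  # dry run: prints the command without touching the cluster
```

Defaulting to a dry run keeps an accidental invocation from changing production; passing the command as a list avoids shell quoting problems.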

Chaos Engineering

Chaos engineering proactively tests system resilience:

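```python
# Chaos experiment actions for Chaos Toolkit.
# The "python" provider calls plain module-level functions, so no decorator is needed.
import random
import subprocess

from chaoslib.exceptions import ActivityFailed


def terminate_random_pod(namespace: str = "production"):
    """Terminate a random pod to test resilience"""
    # List all pod names; an argument list (no shell=True) avoids shell injection
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise ActivityFailed(f"failed to list pods: {result.stderr.strip()}")

    pod_names = result.stdout.split()
    if not pod_names:
        raise ActivityFailed(f"no pods found in namespace {namespace}")

    # Pick one pod at random (always taking the first item would not be random)
    pod_name = random.choice(pod_names)

    # Delete pod
    deleted = subprocess.run(
        ["kubectl", "delete", "pod", pod_name, "-n", namespace],
        capture_output=True,
        text=True,
    )
    if deleted.returncode != 0:
        raise ActivityFailed(f"failed to delete {pod_name}: {deleted.stderr.strip()}")

    return {"deleted_pod": pod_name}


def inject_latency(service: str, latency_ms: int = 500):
    """Inject latency into service"""
    # Use Istio VirtualService to inject latency
    pass


def simulate_database_failure():
    """Simulate database connection failure"""
    # Block database traffic using network policies
    pass
```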
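
The `inject_latency` action above is left as a stub. One possible approach, sketched here under assumptions, is to generate an Istio VirtualService manifest with a fixed-delay fault and apply it separately (for example with `kubectl apply -f -`). The function name, the resource name suffix, and the default namespace are hypothetical; the code only builds the manifest and does not contact a cluster.

```python
import json


def latency_fault_virtual_service(service: str, latency_ms: int = 500,
                                  percent: float = 100.0,
                                  namespace: str = "production") -> dict:
    """Build an Istio VirtualService manifest that delays requests to `service`."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{service}-latency-fault", "namespace": namespace},
        "spec": {
            "hosts": [service],
            "http": [{
                # Delay the given percentage of requests by a fixed duration
                "fault": {
                    "delay": {
                        "fixedDelay": f"{latency_ms}ms",
                        "percentage": {"value": percent},
                    }
                },
                "route": [{"destination": {"host": service}}],
            }],
        },
    }


manifest = latency_fault_virtual_service("api-service", latency_ms=500)
print(json.dumps(manifest, indent=2))  # JSON is also valid input for kubectl apply
```

Deleting the generated VirtualService removes the fault, which makes it a natural candidate for the experiment's rollback step.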

Chaos experiment definition:

version: 1.0.0
title: API Service Resilience Test
description: Test API service resilience to pod failures

steady-state-hypothesis:
  title: API service is healthy
  probes:
    - type: probe
      name: api-responds
      tolerance: 200
      provider:
        type: http
        url: http://api-service:8080/health

method:
  - type: action
    name: terminate-random-pod
    provider:
      type: python
      module: chaos_experiments
      func: terminate_random_pod
      arguments:
        namespace: production
  
  - type: probe
    name: wait-for-recovery
    tolerance: 200
    provider:
      type: http
      url: http://api-service:8080/health
    pauses:
      before: 30  # wait 30 seconds for the pod to be replaced before probing

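rollbacks:
  - type: action
    name: restore-pods
    # Assumes the chaostoolkit-kubernetes extension (chaosk8s) is installed
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: api-service
        replicas: 3
        ns: production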
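
The experiment definition follows a fixed flow: verify the steady state, run the method, verify the steady state again, then roll back. The sketch below models that flow with plain callables against a simulated system. It is a simplified illustration, not Chaos Toolkit's implementation, and every name in it is hypothetical.

```python
def run_experiment(steady_state, method, rollbacks):
    """Verify steady state, inject failure, verify again, then roll back."""
    result = {"steady_before": steady_state(), "steady_after": None,
              "rolled_back": False}
    if not result["steady_before"]:
        return result  # system already unhealthy: do not inject failure
    try:
        for step in method:
            step()
        result["steady_after"] = steady_state()
    finally:
        # Rollbacks run even if a method step raises
        for rollback in rollbacks:
            rollback()
        result["rolled_back"] = True
    return result


# Simulated system standing in for the real service
system = {"healthy_pods": 3}


def api_responds():
    return system["healthy_pods"] >= 1


def terminate_pod():
    system["healthy_pods"] -= 1


def restore_pods():
    system["healthy_pods"] = 3


outcome = run_experiment(api_responds, [terminate_pod], [restore_pods])
print(outcome)
```

If the hypothesis fails after the method runs, the experiment has found a weakness; that is the useful result, not an error in the experiment.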

Observability for SRE

# SRE-focused metrics
import prometheus_client

# Request metrics
request_duration = prometheus_client.Histogram(
    'request_duration_seconds',
    'Request duration',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

request_errors = prometheus_client.Counter(
    'request_errors_total',
    'Total request errors',
    ['error_type']
)

# System metrics
uptime = prometheus_client.Gauge(
    'uptime_seconds',
    'Service uptime'
)

# Business metrics
transactions_processed = prometheus_client.Counter(
    'transactions_processed_total',
    'Total transactions processed'
)

# SLO tracking
slo_compliance = prometheus_client.Gauge(
    'slo_compliance_percentage',
    'SLO compliance percentage',
    ['service', 'slo_name']
)

# Usage
request_duration.observe(0.25)
request_errors.labels(error_type='timeout').inc()
slo_compliance.labels(service='api', slo_name='availability').set(99.95)
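
The bucket boundaries chosen for a histogram limit how precisely quantiles can be estimated later. The stdlib-only sketch below shows, in simplified form, how a quantile can be estimated from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. It is an approximation for illustration, not Prometheus's `histogram_quantile` implementation; the observations are made up, and it assumes no observation exceeds the largest bucket bound.

```python
import bisect

BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0]  # upper bounds in seconds, as above


def bucket_counts(observations, bounds=BUCKETS):
    """Return cumulative counts per bucket (observations <= upper bound)."""
    ordered = sorted(observations)
    return [bisect.bisect_right(ordered, bound) for bound in bounds]


def estimate_quantile(q, cumulative, bounds=BUCKETS):
    """Estimate quantile q (0 < q <= 1) by interpolating within one bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            # Assume observations are spread evenly inside the bucket
            return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
    return bounds[-1]


# Hypothetical durations: 50 at 20 ms, 40 at 80 ms, 10 at 250 ms
observations = [0.02] * 50 + [0.08] * 40 + [0.25] * 10
cumulative = bucket_counts(observations)
p99_estimate = estimate_quantile(0.99, cumulative)
print(cumulative, round(p99_estimate, 3))
```

Here the slowest requests took 0.25 s, but the estimate lands near 0.46 s because the 0.1 to 0.5 bucket is wide. Adding boundaries around the SLO threshold tightens the estimate where it matters.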

On-Call Practices

# On-call schedule
schedule:
  - week: 1
    primary: alice
    secondary: bob
    escalation: charlie
  
  - week: 2
    primary: bob
    secondary: charlie
    escalation: alice

# On-call responsibilities
responsibilities:
  - Monitor alerts and dashboards
  - Respond to incidents within 5 minutes
  - Follow runbooks for common issues
  - Escalate to technical lead if needed
  - Document all actions taken
  - Participate in postmortems

# On-call support
support:
  - "Slack channel: #on-call"
  - PagerDuty integration for alerts
  - Runbooks available in wiki
  - Escalation contacts documented
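
The two weeks shown in the schedule follow a simple rotation: each week, every engineer moves up one role. A minimal sketch of that rule follows. Extending the pattern beyond week 2 is an assumption, and the function name is illustrative.

```python
def on_call_for_week(week: int, engineers):
    """Rotate roles by one position each week (week numbers start at 1)."""
    if week < 1:
        raise ValueError("week must be 1 or greater")
    if len(engineers) < 3:
        raise ValueError("need at least three engineers to fill all roles")
    offset = (week - 1) % len(engineers)
    rotated = engineers[offset:] + engineers[:offset]
    return {
        "primary": rotated[0],
        "secondary": rotated[1],
        "escalation": rotated[2],
    }


team = ["alice", "bob", "charlie"]
for week in (1, 2, 3):
    print(week, on_call_for_week(week, team))
```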

❓ What is an SLO (Service Level Objective)?

❓ What is an error budget?

❓ What is the purpose of a blameless postmortem?

❓ What is chaos engineering?

❓ What is the primary goal of SRE?
