Site Reliability Engineering

Duration: 120 min

Site Reliability Engineering (SRE) applies software engineering principles to operations. This module covers SLOs, SLIs, error budgets, incident management, and chaos engineering—practices that enable reliable systems at scale.

SLOs, SLIs, and SLAs

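SLI (Service Level Indicator): A quantitative measure of service behavior, such as the request success ratio or p99 latency
SLO (Service Level Objective): A target value for an SLI over a time window (e.g., 99.9% availability over 30 days)
SLA (Service Level Agreement): A contractual commitment to customers, with consequences if it is missed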
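To make the distinction concrete, the following is a minimal sketch that computes an availability SLI from request counts and compares it with an SLO target. The function names and the sample counts are hypothetical, chosen only for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the SLI: the percentage of requests that succeeded."""
    if total_requests == 0:
        return 100.0  # assumption: a window with no traffic counts as available
    return good_requests / total_requests * 100


def meets_slo(sli_percent: float, slo_target_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_target_percent


# Hypothetical 30-day request counts
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.2f}%")
print(f"Meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```

The SLI is what you measure, the SLO is the threshold you compare it against, and an SLA would attach contractual consequences to a threshold of the same kind.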

# Example SLOs and SLIs
services:
  - name: api-service
    slos:
      - name: availability
        target: 99.9%
        sli: uptime_percentage
        window: 30d
      
      - name: latency
        target: p99 < 200ms
        sli: request_latency_p99
        window: 30d
      
      - name: error_rate
        target: < 0.1%
        sli: error_rate
        window: 30d

# Prometheus queries for SLIs
queries:
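  # sum() aggregates across label sets so the 5xx series divide by all requests
  uptime_percentage: |
    (1 - (sum(increase(http_requests_total{status=~"5.."}[30d])) /
          sum(increase(http_requests_total[30d])))) * 100

  # histogram_quantile needs the buckets aggregated by the "le" label
  request_latency_p99: |
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[30d])))

  error_rate: |
    (sum(increase(http_requests_total{status=~"5.."}[30d])) /
     sum(increase(http_requests_total[30d]))) * 100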
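As a rough illustration of what the error-rate and p99 queries measure, here is a sketch that computes the same two SLIs over an in-memory list of requests. The `Request` record and the sample data are hypothetical, and the percentile uses the nearest-rank method rather than the bucket interpolation Prometheus performs, so results will not match `histogram_quantile` exactly.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int       # HTTP status code
    latency_s: float  # request duration in seconds


def error_rate_percent(requests: list[Request]) -> float:
    """Percentage of requests that returned a 5xx status."""
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return errors / len(requests) * 100


def p99_latency_s(requests: list[Request]) -> float:
    """Nearest-rank 99th percentile of request latency."""
    ordered = sorted(r.latency_s for r in requests)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical sample: 1000 requests, one of them slow and failing
sample = [Request(200, 0.05)] * 999 + [Request(503, 1.2)]
print(f"error rate: {error_rate_percent(sample):.2f}%")
print(f"p99 latency: {p99_latency_s(sample) * 1000:.0f}ms")
```

With this sample the single slow request sits above the 99th percentile, so the p99 stays at the typical latency while the error rate lands exactly on the 0.1% threshold.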

Error Budgets

An error budget is the amount of unreliability an SLO allows over its window, that is, 1 minus the SLO target. For a 99.9% availability SLO over 30 days:

# Calculate error budget
slo_target = 0.999  # 99.9% availability
time_period_seconds = 30 * 24 * 60 * 60  # 30 days

# Maximum allowed downtime
max_downtime = time_period_seconds * (1 - slo_target)
max_downtime_minutes = max_downtime / 60

print(f"Error budget: {max_downtime_minutes:.2f} minutes per month")
# Output: Error budget: 43.20 minutes per month

# Track error budget consumption
current_downtime = 15  # minutes
remaining_budget = max_downtime_minutes - current_downtime
budget_percentage = (remaining_budget / max_downtime_minutes) * 100

print(f"Remaining budget: {remaining_budget:.2f} minutes ({budget_percentage:.1f}%)")
# Output: Remaining budget: 28.20 minutes (65.3%)
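The same idea applies when the budget is counted in failed requests rather than minutes of downtime. The sketch below is one way to express that; the function names and the request volume are assumptions for illustration.

```python
def request_error_budget(total_requests: int, slo_target: float) -> int:
    """Number of requests that may fail in the window without breaching the SLO."""
    return round(total_requests * (1 - slo_target))


def budget_remaining_percent(failed_requests: int, budget: int) -> float:
    """Share of the budget still unspent; negative means the budget is exceeded."""
    return (budget - failed_requests) / budget * 100


total = 10_000_000  # hypothetical request volume over 30 days
budget = request_error_budget(total, 0.999)
print(f"Error budget: {budget} failed requests")
print(f"Remaining: {budget_remaining_percent(2_500, budget):.1f}%")
```

A request-based budget tends to track user impact more closely than downtime minutes when traffic varies over the day, since an outage at peak consumes more of it than one at night.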

Incident Management

Incident response process:

# Incident severity levels
# SEV-1: Critical - Complete service outage
# SEV-2: High - Significant degradation
# SEV-3: Medium - Minor impact
# SEV-4: Low - Cosmetic issues

# Declare incident
incident declare --severity SEV-1 --title "API service down" --channel "#incidents"

# Assign roles
incident assign --role incident-commander --user alice
incident assign --role communications --user bob
incident assign --role technical-lead --user charlie

# Create war room
incident war-room create --incident-id INC-12345

# Post updates
incident update --incident-id INC-12345 --message "Identified database connection pool exhaustion"

# Resolve incident
incident resolve --incident-id INC-12345 --resolution "Increased connection pool size"
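The lifecycle these commands walk through (declare, assign roles, post updates, resolve) can be modeled as a small data structure. This is only a sketch: the `Incident` class and its methods are hypothetical and are not part of any particular incident tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = "Critical: complete service outage"
    SEV2 = "High: significant degradation"
    SEV3 = "Medium: minor impact"
    SEV4 = "Low: cosmetic issues"


@dataclass
class Incident:
    incident_id: str
    title: str
    severity: Severity
    roles: dict = field(default_factory=dict)    # role name -> user
    updates: list = field(default_factory=list)  # status messages, oldest first
    resolution: Optional[str] = None

    @property
    def is_open(self) -> bool:
        return self.resolution is None

    def assign(self, role: str, user: str) -> None:
        self.roles[role] = user

    def update(self, message: str) -> None:
        if not self.is_open:
            raise ValueError("cannot update a resolved incident")
        self.updates.append(message)

    def resolve(self, resolution: str) -> None:
        self.resolution = resolution


incident = Incident("INC-12345", "API service down", Severity.SEV1)
incident.assign("incident-commander", "alice")
incident.update("Identified database connection pool exhaustion")
incident.resolve("Increased connection pool size")
print(incident.is_open)
```

Keeping roles and updates on the incident record means the timeline for the postmortem can be reconstructed from the record itself.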

Blameless Postmortems

# Postmortem: API Service Outage on 2024-01-15

## Summary
API service was unavailable for 45 minutes due to database connection pool exhaustion.

## Timeline
- 14:30 UTC: Monitoring alerts triggered for high error rate
- 14:32 UTC: Incident declared (SEV-1)
- 14:35 UTC: Root cause identified: connection pool exhausted
- 14:40 UTC: Temporary mitigation: increased connection pool size
- 14:45 UTC: Service recovered
- 15:15 UTC: Permanent fix deployed

## Root Cause
A new feature deployed at 14:00 UTC created connection leaks in the database driver.
The connection pool was exhausted within 30 minutes.

## Contributing Factors
1. Connection pool monitoring was not in place
2. Load testing did not simulate the new feature's connection behavior
3. Deployment happened during peak traffic hours

## Impact
- 45 minutes of service unavailability
- Affected 10,000+ users
- Error budget consumed: about 104% of the monthly budget (45 minutes against 43.2 minutes allowed at 99.9%)

## Action Items
1. Implement connection pool monitoring (Owner: Alice, Due: 2024-01-22)
2. Add connection leak detection to load tests (Owner: Bob, Due: 2024-01-22)
3. Implement deployment traffic controls (Owner: Charlie, Due: 2024-01-29)
4. Review database driver version (Owner: Alice, Due: 2024-01-20)

## Lessons Learned
- Connection pool exhaustion is a critical failure mode
- Load testing must simulate realistic connection patterns
- Deployments during peak hours should be staggered (rolled out gradually)
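The budget impact of an outage like this one can be checked with a few lines of arithmetic. The sketch below assumes the 99.9% availability SLO and 30-day window used in the Error Budgets section apply to this service; the variable names are illustrative.

```python
slo_target = 0.999                                # assumed 99.9% availability SLO
budget_minutes = 30 * 24 * 60 * (1 - slo_target)  # 43.2 minutes per 30 days
outage_minutes = 45                               # from the Impact section

consumed_percent = outage_minutes / budget_minutes * 100
overrun_minutes = outage_minutes - budget_minutes

print(f"Budget consumed: {consumed_percent:.1f}%")
print(f"Over budget by: {overrun_minutes:.1f} minutes")
```

Under those assumptions, a single 45-minute outage uses slightly more than the whole monthly budget, which is the kind of result that usually prompts a pause on risky releases until the window rolls over.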

Runbooks

# Runbook: High API Latency

## Symptoms
- P99 latency > 500ms
- Error rate increasing
- CPU utilization high

## Diagnosis
1. Check current metrics
   ```bash
   kubectl top nodes
   kubectl top pods -n production
   ```
2. Check application logs
   ```bash
   kubectl logs -n production deployment/api-service --tail=100
   ```
3. Check database performance
   ```bash
   aws rds describe-db-instances --db-instance-identifier prod-db
   ```

## Mitigation (Immediate)
1. Scale up replicas
   ```bash
   kubectl scale deployment api-service -n production --replicas=10
   ```
2. Enable caching
   ```bash
   kubectl set env deployment/api-service -n production CACHE_ENABLED=true
   ```
3. Enable client IP session affinity (if necessary; this pins each client to one pod rather than reducing total traffic)
   ```bash
   kubectl patch service api-service -n production -p '{"spec":{"sessionAffinity":"ClientIP"}}'
   ```

## Resolution (Long-term)
1. Identify slow queries
2. Add database indexes
3. Optimize application code
4. Increase database resources

Chaos Engineering

Chaos engineering tests system resilience proactively by injecting controlled failures and checking that the system stays healthy:

```python
# Chaos experiment actions for Chaos Toolkit.
# The python provider calls plain module-level functions, so no decorator is needed.
import random
import subprocess

from chaoslib.exceptions import ActivityFailed


def terminate_random_pod(namespace: str = "production"):
    """Terminate a random pod to test resilience"""
    # List the names of all pods in the namespace
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True,
        text=True,
    )
    pod_names = result.stdout.split()
    if result.returncode != 0 or not pod_names:
        raise ActivityFailed(f"no pods found in namespace {namespace}")

    # Pick one pod at random
    pod_name = random.choice(pod_names)

    # Delete pod
    deleted = subprocess.run(
        ["kubectl", "delete", "pod", pod_name, "-n", namespace],
        capture_output=True,
        text=True,
    )
    if deleted.returncode != 0:
        raise ActivityFailed(f"failed to delete pod {pod_name}: {deleted.stderr}")

    return {"deleted_pod": pod_name}


def inject_latency(service: str, latency_ms: int = 500):
    """Inject latency into service"""
    # Placeholder: use an Istio VirtualService to inject latency
    pass


def simulate_database_failure():
    """Simulate database connection failure"""
    # Placeholder: block database traffic using network policies
    pass
```

Chaos experiment definition:

version: 1.0.0
title: API Service Resilience Test
description: Test API service resilience to pod failures

steady-state-hypothesis:
  title: API service is healthy
  probes:
    - type: probe
      name: api-responds
      tolerance: 200
      provider:
        type: http
        url: http://api-service:8080/health

method:
  - type: action
    name: terminate-random-pod
    provider:
      type: python
      module: chaos_experiments
      func: terminate_random_pod
      arguments:
        namespace: production
    # Give the deployment time to replace the pod before probing
    pauses:
      after: 30

  - type: probe
    name: wait-for-recovery
    provider:
      type: http
      url: http://api-service:8080/health

rollbacks:
  # Assumes the chaostoolkit-kubernetes extension is installed
  - type: action
    name: restore-pods
    provider:
      type: python
      module: chaosk8s.deployment.actions
      func: scale_deployment
      arguments:
        name: api-service
        replicas: 3
        ns: production
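The overall shape of an experiment like this (check the steady state, run the method, check again, roll back) can be sketched in plain Python. This is a simplified model for illustration, not Chaos Toolkit's actual implementation; `run_experiment` and the fake system below are hypothetical.

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], bool],
    method: list[Callable[[], None]],
    rollbacks: list[Callable[[], None]],
) -> str:
    """Simplified flow: verify, perturb, verify again, roll back."""
    if not steady_state():
        return "aborted: system was not healthy before the experiment"
    try:
        for step in method:
            step()
        outcome = "passed" if steady_state() else "deviated: steady state not restored"
    finally:
        # Rollbacks run even if a method step raises
        for rollback in rollbacks:
            rollback()
    return outcome


# A fake system standing in for the real cluster
state = {"healthy_pods": 3}
log = []


def api_is_healthy() -> bool:
    return state["healthy_pods"] >= 1


def terminate_pod() -> None:
    state["healthy_pods"] -= 1
    log.append("terminated")


def restore_pods() -> None:
    state["healthy_pods"] = 3
    log.append("restored")


result = run_experiment(api_is_healthy, [terminate_pod], [restore_pods])
print(result)
```

Checking the steady state before injecting anything matters: if the system is already unhealthy, the experiment cannot tell you whether the injected failure caused the problem.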

Observability for SRE

# SRE-focused metrics
import prometheus_client

# Request metrics
request_duration = prometheus_client.Histogram(
    'request_duration_seconds',
    'Request duration',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

request_errors = prometheus_client.Counter(
    'request_errors_total',
    'Total request errors',
    ['error_type']
)

# System metrics
uptime = prometheus_client.Gauge(
    'uptime_seconds',
    'Service uptime'
)

# Business metrics
transactions_processed = prometheus_client.Counter(
    'transactions_processed_total',
    'Total transactions processed'
)

# SLO tracking
slo_compliance = prometheus_client.Gauge(
    'slo_compliance_percentage',
    'SLO compliance percentage',
    ['service', 'slo_name']
)

# Usage
request_duration.observe(0.25)
request_errors.labels(error_type='timeout').inc()
slo_compliance.labels(service='api', slo_name='availability').set(99.95)
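The histogram above records each observation into cumulative buckets, which is what later lets a query estimate percentiles. The sketch below reproduces that bucketing in plain Python using the same bucket bounds; the `observe` helper here is a hypothetical stand-in, not the `prometheus_client` implementation.

```python
import math

# Same upper bounds as the histogram above, plus the implicit +Inf bucket
BUCKETS = [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, math.inf]


def observe(counts: dict, value: float) -> None:
    """Increment every bucket whose upper bound is >= value (cumulative)."""
    for upper in BUCKETS:
        if value <= upper:
            counts[upper] += 1


counts = {upper: 0 for upper in BUCKETS}
for latency in (0.25, 0.03, 0.7):
    observe(counts, latency)

for upper, count in counts.items():
    print(f"le={upper}: {count}")
```

Because the buckets are cumulative, the bounds determine the resolution of any percentile estimate: a p99 that falls between 0.5 and 1.0 seconds can only be interpolated within that range, so bounds should be chosen around the latency targets you care about.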

On-Call Practices

# On-call schedule
schedule:
  - week: 1
    primary: alice
    secondary: bob
    escalation: charlie
  
  - week: 2
    primary: bob
    secondary: charlie
    escalation: alice

# On-call responsibilities
responsibilities:
  - Monitor alerts and dashboards
  - Respond to incidents within 5 minutes
  - Follow runbooks for common issues
  - Escalate to technical lead if needed
  - Document all actions taken
  - Participate in postmortems

# On-call support
support:
  - 'Slack channel: #on-call'
  - PagerDuty integration for alerts
  - Runbooks available in wiki
  - Escalation contacts documented
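A schedule like this is easy to query programmatically. The sketch below assumes the two-week rotation repeats indefinitely and that pages escalate from primary to secondary to the escalation contact; both are assumptions, and the function names are hypothetical.

```python
SCHEDULE = [
    {"primary": "alice", "secondary": "bob", "escalation": "charlie"},  # week 1
    {"primary": "bob", "secondary": "charlie", "escalation": "alice"},  # week 2
]
ESCALATION_ORDER = ("primary", "secondary", "escalation")


def on_call_for_week(week: int) -> dict:
    """Return the shift for a 1-based week number, repeating the rotation."""
    return SCHEDULE[(week - 1) % len(SCHEDULE)]


def who_to_page(week: int, unacknowledged_pages: int) -> str:
    """Pick the next person to page, moving up one level per missed page."""
    shift = on_call_for_week(week)
    level = min(unacknowledged_pages, len(ESCALATION_ORDER) - 1)
    return shift[ESCALATION_ORDER[level]]


print(who_to_page(week=1, unacknowledged_pages=0))
print(who_to_page(week=2, unacknowledged_pages=1))
print(who_to_page(week=3, unacknowledged_pages=5))
```

Capping the escalation level at the last contact means a page is never dropped, even if every earlier level failed to acknowledge it.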

❓ What is an SLO (Service Level Objective)?

- A target for service performance (e.g., 99.9% availability)
- A measurable metric of service performance
- A contractual commitment with customers
- A monitoring tool

❓ What is an error budget?

- The cost of fixing errors
- The number of bugs allowed in code
- The acceptable amount of downtime or errors within an SLO
- A financial allocation for operations

❓ What is the purpose of a blameless postmortem?

- To identify who caused the incident
- To learn from incidents and prevent recurrence
- To punish team members
- To document system architecture

❓ What is chaos engineering?

- Proactively testing system resilience by injecting failures
- Randomly breaking production systems
- A type of security attack
- Disorganized incident response

❓ What is the primary goal of SRE?

- To eliminate all system failures
- To balance reliability with velocity using error budgets
- To replace developers with automation
- To maximize uptime at any cost