Site Reliability Engineering
Duration: 120 min
Site Reliability Engineering (SRE) applies software engineering principles to operations. This module covers SLOs, SLIs, error budgets, incident management, and chaos engineering—practices that enable reliable systems at scale.
SLOs, SLIs, and SLAs
SLI (Service Level Indicator): Measurable metric of service performance
SLO (Service Level Objective): Target for SLI (e.g., 99.9% availability)
SLA (Service Level Agreement): Contractual commitment with consequences
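To make the distinction concrete, the sketch below computes an availability SLI from request counts and compares it with an SLO target. The function names and the request counts are hypothetical, chosen only for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the availability SLI as a percentage of successful requests."""
    if total_requests == 0:
        # No traffic means nothing failed; treat the window as fully available.
        return 100.0
    return 100.0 * good_requests / total_requests


def meets_slo(sli_percent: float, slo_target_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli_percent >= slo_target_percent


# Hypothetical 30-day window: 1,000,000 requests, 800 of them failed.
sli = availability_sli(good_requests=999_200, total_requests=1_000_000)
print(f"SLI: {sli:.2f}%, meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```

The SLI is what was measured, the SLO is the threshold it is judged against, and an SLA would attach contractual consequences to missing that threshold.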
# Example SLOs and SLIs
services:
  - name: api-service
    slos:
      - name: availability
        target: 99.9%
        sli: uptime_percentage
        window: 30d
      - name: latency
        target: p99 < 200ms
        sli: request_latency_p99
        window: 30d
      - name: error_rate
        target: < 0.1%
        sli: error_rate
        window: 30d
# Prometheus queries for SLIs
# sum() aggregates across label sets so that 5xx requests are divided by
# all requests rather than matched only against their own series
queries:
  uptime_percentage: |
    (1 - (sum(increase(http_requests_total{status=~"5.."}[30d])) /
    sum(increase(http_requests_total[30d])))) * 100
  request_latency_p99: |
    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[30d])))
  error_rate: |
    (sum(increase(http_requests_total{status=~"5.."}[30d])) /
    sum(increase(http_requests_total[30d]))) * 100
Error Budgets
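An error budget is the amount of downtime or errors an SLO allows over its window. A 99.9% availability target, for example, leaves 0.1% of the window as budget: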
# Calculate error budget
slo_target = 0.999 # 99.9% availability
time_period_seconds = 30 * 24 * 60 * 60 # 30 days
# Maximum allowed downtime
max_downtime = time_period_seconds * (1 - slo_target)
max_downtime_minutes = max_downtime / 60
print(f"Error budget: {max_downtime_minutes:.2f} minutes per month")
# Output: Error budget: 43.20 minutes per month
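The same arithmetic generalizes to other targets. As a rough sketch (the helper name `error_budget_minutes` is introduced here for illustration only), each additional nine shrinks the budget by a factor of ten:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)


# Compare the 30-day budget for two, three, and four nines.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {error_budget_minutes(target):.2f} minutes per 30 days")
```

Over 30 days this gives 432 minutes at 99%, 43.2 minutes at 99.9%, and 4.32 minutes at 99.99%, which is why tighter targets leave far less room for risky changes.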
# Track error budget consumption
current_downtime = 15 # minutes
remaining_budget = max_downtime_minutes - current_downtime
budget_percentage = (remaining_budget / max_downtime_minutes) * 100
print(f"Remaining budget: {remaining_budget:.2f} minutes ({budget_percentage:.1f}%)")
# Output: Remaining budget: 28.20 minutes (65.3%)
Incident Management
Incident response process:
# Incident severity levels
# SEV-1: Critical - Complete service outage
# SEV-2: High - Significant degradation
# SEV-3: Medium - Minor impact
# SEV-4: Low - Cosmetic issues
# Declare incident
# Quote the channel name: an unquoted # starts a shell comment
incident declare --severity SEV-1 --title "API service down" --channel "#incidents"
# Assign roles
incident assign --role incident-commander --user alice
incident assign --role communications --user bob
incident assign --role technical-lead --user charlie
# Create war room
incident war-room create --incident-id INC-12345
# Post updates
incident update --message "Identified database connection pool exhaustion"
# Resolve incident
incident resolve --incident-id INC-12345 --resolution "Increased connection pool size"
Blameless Postmortems
# Postmortem: API Service Outage on 2024-01-15
## Summary
API service was unavailable for 45 minutes due to database connection pool exhaustion.
## Timeline
- 14:30 UTC: Monitoring alerts triggered for high error rate
- 14:32 UTC: Incident declared (SEV-1)
- 14:35 UTC: Root cause identified: connection pool exhausted
- 14:40 UTC: Temporary mitigation: increased connection pool size
- 14:45 UTC: Service recovered
- 15:15 UTC: Permanent fix deployed
## Root Cause
A new feature deployed at 14:00 UTC created connection leaks in the database driver.
The connection pool was exhausted within 30 minutes.
## Contributing Factors
1. Connection pool monitoring was not in place
2. Load testing did not simulate the new feature's connection behavior
3. Deployment happened during peak traffic hours
## Impact
- 45 minutes of service unavailability
- Affected 10,000+ users
- Error budget consumed: about 104% of the monthly budget (45 of 43.2 minutes at a 99.9% SLO)
## Action Items
1. Implement connection pool monitoring (Owner: Alice, Due: 2024-01-22)
2. Add connection leak detection to load tests (Owner: Bob, Due: 2024-01-22)
3. Implement deployment traffic controls (Owner: Charlie, Due: 2024-01-29)
4. Review database driver version (Owner: Alice, Due: 2024-01-20)
## Lessons Learned
- Connection pool exhaustion is a critical failure mode
- Load testing must simulate realistic connection patterns
- Deployments should be staggered during peak hours
Runbooks
# Runbook: High API Latency
## Symptoms
- P99 latency > 500ms
- Error rate increasing
- CPU utilization high
## Diagnosis
1. Check current metrics
```bash
kubectl top nodes
kubectl top pods -n production
```
2. Check application logs
```bash
kubectl logs -n production deployment/api-service --tail=100
```
3. Check database performance
```bash
aws rds describe-db-instances --db-instance-identifier prod-db
```
## Mitigation (Immediate)
1. Scale up replicas
```bash
kubectl scale deployment api-service -n production --replicas=10
```
2. Enable caching
```bash
kubectl set env deployment/api-service -n production CACHE_ENABLED=true
```
3. Enable client IP session affinity (if necessary; this keeps each client on one pod but does not by itself reduce traffic)
```bash
kubectl patch service api-service -n production -p '{"spec":{"sessionAffinity":"ClientIP"}}'
```
## Resolution (Long-term)
- Identify slow queries
- Add database indexes
- Optimize application code
- Increase database resources
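The P99 latency symptom listed in the runbook can also be checked programmatically. The following is a minimal sketch, assuming latency samples in milliseconds are already available as a list. The function names and sample values are hypothetical, and nearest-rank is only one of several ways to compute a percentile.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_symptom(samples_ms: list[float], threshold_ms: float = 500.0) -> bool:
    """True when P99 latency exceeds the runbook threshold (500 ms)."""
    return percentile_nearest_rank(samples_ms, 99) > threshold_ms


# Hypothetical samples: 97 fast requests and 3 slow ones.
samples = [120.0] * 97 + [900.0] * 3
print(latency_symptom(samples))
```

With only one slow request in a hundred the P99 would still be 120 ms, so the symptom fires only when slow requests make up more than 1% of traffic.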
Chaos Engineering
Chaos engineering proactively tests system resilience:
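```python
# Chaos experiment actions for Chaos Toolkit.
# The python provider calls plain module-level functions, so no decorator is needed.
import random
import subprocess

from chaoslib.exceptions import ActivityFailed


def terminate_random_pod(namespace: str = "production"):
    """Terminate a random pod to test resilience."""
    # List all pod names in the namespace (space-separated)
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace,
         "-o", "jsonpath={.items[*].metadata.name}"],
        capture_output=True,
        text=True,
    )
    pods = result.stdout.split()
    if result.returncode != 0 or not pods:
        raise ActivityFailed(f"could not list pods in {namespace}: {result.stderr}")
    # Pick one at random; items[0] would always select the same pod
    pod_name = random.choice(pods)
    # Delete the pod and fail the activity if kubectl reports an error
    deleted = subprocess.run(
        ["kubectl", "delete", "pod", pod_name, "-n", namespace],
        capture_output=True,
        text=True,
    )
    if deleted.returncode != 0:
        raise ActivityFailed(f"could not delete pod {pod_name}: {deleted.stderr}")
    return {"deleted_pod": pod_name}


def inject_latency(service: str, latency_ms: int = 500):
    """Inject latency into service (placeholder, not implemented)."""
    # Use Istio VirtualService to inject latency
    pass


def simulate_database_failure():
    """Simulate database connection failure (placeholder, not implemented)."""
    # Block database traffic using network policies
    pass
```
Chaos experiment definition: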
version: 1.0.0
title: API Service Resilience Test
description: Test API service resilience to pod failures
steady-state-hypothesis:
  title: API service is healthy
  probes:
    - type: probe
      name: api-responds
      tolerance: 200
      provider:
        type: http
        url: http://api-service:8080/health
method:
  - type: action
    name: terminate-random-pod
    provider:
      type: python
      module: chaos_experiments
      func: terminate_random_pod
      arguments:
        namespace: production
  - type: probe
    name: wait-for-recovery
    tolerance: 200
    provider:
      type: http
      url: http://api-service:8080/health
    pauses:
      after: 30
rollbacks:
  # Restore the replica count with kubectl through the process provider
  - type: action
    name: restore-pods
    provider:
      type: process
      path: kubectl
      arguments: "scale deployment api-service -n production --replicas=3"
Observability for SRE
# SRE-focused metrics
import prometheus_client
# Request metrics
request_duration = prometheus_client.Histogram(
    'request_duration_seconds',
    'Request duration',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)
request_errors = prometheus_client.Counter(
    'request_errors_total',
    'Total request errors',
    ['error_type']
)
# System metrics
uptime = prometheus_client.Gauge(
    'uptime_seconds',
    'Service uptime'
)
# Business metrics
transactions_processed = prometheus_client.Counter(
    'transactions_processed_total',
    'Total transactions processed'
)
# SLO tracking
slo_compliance = prometheus_client.Gauge(
    'slo_compliance_percentage',
    'SLO compliance percentage',
    ['service', 'slo_name']
)
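Bucket boundaries limit how precisely a quantile can be recovered from a histogram. The sketch below approximates the linear interpolation that Prometheus's `histogram_quantile` performs, using the `request_duration` bucket bounds defined above. The cumulative counts are made up, the function name is hypothetical, and edge cases such as the `+Inf` bucket are left out.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) sorted by bound.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("histogram is empty")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    # Not reached for valid cumulative counts; kept as a safe fallback.
    return buckets[-1][0]


# Hypothetical cumulative counts for the bucket bounds used above.
counts = [(0.01, 100), (0.05, 400), (0.1, 700), (0.5, 950), (1.0, 995), (5.0, 1000)]
print(f"p50 ~ {estimate_quantile(0.5, counts):.3f}s")
print(f"p99 ~ {estimate_quantile(0.99, counts):.3f}s")
```

Because the p99 here falls in the 0.5 s to 1.0 s bucket, the estimate is only an interpolation within that range. If a latency SLO sits near 200 ms, adding a bucket boundary close to that value would make the measurement more meaningful.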
# Usage
request_duration.observe(0.25)
request_errors.labels(error_type='timeout').inc()
slo_compliance.labels(service='api', slo_name='availability').set(99.95)
On-Call Practices
# On-call schedule
schedule:
  - week: 1
    primary: alice
    secondary: bob
    escalation: charlie
  - week: 2
    primary: bob
    secondary: charlie
    escalation: alice
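The two weeks shown follow a simple rotation in which everyone moves up one role each week. Here is a small sketch of that rule, assuming the pattern continues beyond week 2 (the schedule only lists two weeks) and using a hypothetical `on_call_for_week` helper:

```python
ENGINEERS = ["alice", "bob", "charlie"]
ROLES = ["primary", "secondary", "escalation"]


def on_call_for_week(week: int, engineers: list[str] = ENGINEERS) -> dict[str, str]:
    """Return the role assignments for a 1-based week number."""
    if week < 1:
        raise ValueError("week numbers start at 1")
    offset = (week - 1) % len(engineers)
    # Rotate the roster left by one position per week.
    rotated = engineers[offset:] + engineers[:offset]
    return dict(zip(ROLES, rotated))


for week in (1, 2, 3):
    print(week, on_call_for_week(week))
```

Generating the schedule from a rule rather than maintaining it by hand keeps the load evenly shared and makes it easy to see who is on call in any future week.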
# On-call responsibilities
responsibilities:
  - Monitor alerts and dashboards
  - Respond to incidents within 5 minutes
  - Follow runbooks for common issues
  - Escalate to technical lead if needed
  - Document all actions taken
  - Participate in postmortems
# On-call support
# (the Slack entry is quoted because an unquoted " #" starts a YAML comment)
support:
  - "Slack channel: #on-call"
  - PagerDuty integration for alerts
  - Runbooks available in wiki
  - Escalation contacts documented
❓ What is an SLO (Service Level Objective)?
❓ What is an error budget?
❓ What is the purpose of a blameless postmortem?
❓ What is chaos engineering?
❓ What is the primary goal of SRE?