Observability & Monitoring
Duration: 120 min
Observability is the ability to understand a system's internal state from its external outputs. This module covers CloudWatch, Prometheus, Grafana, and distributed tracing, the core tools for keeping systems reliable.
Observability Pillars
Observability consists of three pillars:
- Metrics: Quantitative measurements (CPU, memory, request latency)
- Logs: Detailed event records from applications and systems
- Traces: Request flow across distributed systems
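To make the distinction concrete, here is a minimal sketch in plain Python (standard library only) in which one simulated request produces all three signals. The `handle_request` function, the field names, and the in-memory `metrics`, `logs`, and `spans` lists are hypothetical stand-ins for a real metrics backend, log pipeline, and tracer.

```python
import time
import uuid

# In-memory stand-ins for real backends (assumption, for illustration only)
metrics = []  # numeric samples
logs = []     # event records
spans = []    # timed units of work belonging to a trace


def handle_request(path):
    """Handle one simulated request and emit a metric, a log, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    # ... real request handling would happen here ...

    duration_ms = (time.perf_counter() - start) * 1000

    # Metric: a number that can be aggregated across many requests
    metrics.append({"name": "request_latency_ms", "value": duration_ms})

    # Log: a detailed record of this one event
    logs.append({"level": "INFO", "message": f"handled {path}", "trace_id": trace_id})

    # Span: where this request spent its time, keyed by the trace ID
    spans.append({"trace_id": trace_id, "name": "handle_request", "duration_ms": duration_ms})

    return trace_id


trace_id = handle_request("/orders")
```

Sharing the trace ID between the log record and the span is what lets you move from a slow trace to the logs for that same request.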
AWS CloudWatch
CloudWatch is AWS's native monitoring and logging service:
# Put custom metric
aws cloudwatch put-metric-data \
--namespace MyApp \
--metric-name RequestLatency \
--value 150 \
--unit Milliseconds
# Get metric statistics
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-02T00:00:00Z \
--period 3600 \
--statistics Average,Maximum,Minimum
# Create alarm
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu \
--alarm-description "Alert when CPU exceeds 80%" \
--metric-name CPUUtilization \
--namespace AWS/EC2 \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
CloudWatch Logs
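Each event sent to CloudWatch Logs is a message plus a timestamp in milliseconds since the Unix epoch, and a batch must be in chronological order. The sketch below builds such a batch in plain Python. `build_log_events` is a hypothetical helper, and the commented-out boto3 call only shows where the batch would be sent, assuming the log group and stream already exist.

```python
import time


def build_log_events(messages, now_ms=None):
    """Turn (offset_ms, message) pairs into a chronologically sorted batch."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    events = [
        {"timestamp": now_ms + offset_ms, "message": message}
        for offset_ms, message in messages
    ]
    return sorted(events, key=lambda event: event["timestamp"])


events = build_log_events(
    [(20, "Request handled"), (0, "Application started"), (10, "Listening on :8080")],
    now_ms=1_700_000_000_000,
)
for event in events:
    print(event["timestamp"], event["message"])

# Sending the batch needs AWS credentials, so it is left commented out:
# import boto3
# boto3.client("logs").put_log_events(
#     logGroupName="/app/production",
#     logStreamName="app-instance-1",
#     logEvents=events,
# )
```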
# Create log group
aws logs create-log-group --log-group-name /app/production
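# Create the log stream (put-log-events fails if the stream does not exist)
aws logs create-log-stream \
--log-group-name /app/production \
--log-stream-name app-instance-1
# Put log events (timestamp is in milliseconds since the epoch)
aws logs put-log-events \
--log-group-name /app/production \
--log-stream-name app-instance-1 \
--log-events timestamp=$(date +%s000),message="Application started"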
# Create metric filter
aws logs put-metric-filter \
--log-group-name /app/production \
--filter-name ErrorCount \
--filter-pattern "ERROR" \
--metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1
# Query logs with CloudWatch Insights
aws logs start-query \
--log-group-name /app/production \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m)'
Application Instrumentation
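Instrumenting an application mostly comes down to timing a unit of work and handing the result to a metrics backend. A minimal sketch of that pattern using a context manager follows. `timed` and the `emit` callback are hypothetical names, and the callback here only appends to a list so the example runs without AWS credentials.

```python
import time
from contextlib import contextmanager

recorded = []  # stand-in sink; a real application would call its metrics client


def emit(metric_name, value, unit):
    """Hypothetical sink: collect samples instead of sending them anywhere."""
    recorded.append((metric_name, value, unit))


@contextmanager
def timed(metric_name, emit=emit):
    """Measure the wall-clock time of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed requests are measured too
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit(metric_name, elapsed_ms, "Milliseconds")


with timed("RequestLatency"):
    time.sleep(0.01)  # simulated work
```

Passing a function with the same three arguments as `put_metric` in this section in place of `emit` would send the measurement to CloudWatch instead.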
# Python application with CloudWatch metrics
import time
from datetime import datetime, timezone
import boto3
cloudwatch = boto3.client('cloudwatch')
def put_metric(metric_name, value, unit='None'):
    """Publish a single data point to the MyApp namespace."""
    cloudwatch.put_metric_data(
        Namespace='MyApp',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
                'Timestamp': datetime.now(timezone.utc)
            }
        ]
    )
# Track request latency
start_time = time.time()
# ... process request ...
latency = (time.time() - start_time) * 1000
put_metric('RequestLatency', latency, 'Milliseconds')
# Track business metrics
put_metric('OrdersProcessed', 42, 'Count')
put_metric('RevenueGenerated', 1500.00, 'None')
Prometheus
Prometheus is an open-source monitoring system:
# prometheus.yml configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'app-metrics'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
Deploy Prometheus:
# Download and run Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
./prometheus --config.file=prometheus.yml
# Access Prometheus UI at http://localhost:9090
Grafana
Grafana visualizes metrics from Prometheus and other sources:
# Run Grafana in Docker
docker run -d \
-p 3000:3000 \
-e GF_SECURITY_ADMIN_PASSWORD=admin \
grafana/grafana:latest
# Access at http://localhost:3000
# Default credentials: admin/admin
Create a Grafana dashboard:
{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~'5..'}[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
Distributed Tracing
Distributed tracing tracks requests across microservices:
# Python application with OpenTelemetry tracing
# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry
# releases; newer setups typically export over OTLP instead.
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
tracer = trace.get_tracer(__name__)
# Create spans
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", "12345")
    # Child spans are nested under process_order
    with tracer.start_as_current_span("validate_order"):
        # Validation logic
        pass
    with tracer.start_as_current_span("process_payment"):
        # Payment logic
        pass
Alerting Strategy
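A Prometheus alerting rule combines an expression with an optional `for` duration: the alert is pending when the expression first becomes true and fires only once it has stayed true for the whole duration. The simplified Python model below illustrates that state machine. `alert_states` and its arguments are hypothetical, and real Prometheus evaluates rules on its own schedule rather than over a list of samples.

```python
def alert_states(samples, threshold, for_seconds, interval_seconds):
    """Return 'inactive', 'pending' or 'firing' for each evaluation.

    samples: values of the alert expression, one per evaluation interval.
    """
    states = []
    active_for = None  # seconds the condition has held so far, None if not active
    for value in samples:
        if value > threshold:
            active_for = 0 if active_for is None else active_for + interval_seconds
            states.append("firing" if active_for >= for_seconds else "pending")
        else:
            active_for = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Error ratio sampled every 60 s, threshold 5%, for: 5m (300 s)
ratios = [0.01, 0.08, 0.09, 0.02, 0.07, 0.07, 0.08, 0.09, 0.10, 0.11]
for ratio, state in zip(ratios, alert_states(ratios, 0.05, 300, 60)):
    print(f"{ratio:.2f} {state}")
```

The short spike at the second and third samples never fires, because the ratio drops back below the threshold before five minutes have elapsed. This is how a `for` clause filters out brief blips.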
# Prometheus alerting rules
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        annotations:
          summary: "High request latency"
          description: "P95 latency is {{ $value }}s"
      - alert: PodCrashLooping
        # rate() is per second, so 0.1 is roughly 90 restarts in 15 minutes;
        # lower the threshold to suit your workloads
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
        for: 5m
        annotations:
          summary: "Pod is crash looping"
Logging Best Practices
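Structured logs are written as one JSON object per line so that tools can filter and aggregate on fields instead of matching free text. A small sketch of the consumer side, using made-up sample lines and a hypothetical `count_by_level` helper:

```python
import json
from collections import Counter

# Made-up sample output from a service that logs one JSON object per line
sample_lines = [
    '{"level": "INFO", "message": "Order processed", "order_id": "12345"}',
    '{"level": "ERROR", "message": "Payment declined", "order_id": "12346"}',
    '{"level": "INFO", "message": "Order processed", "order_id": "12347"}',
]


def count_by_level(lines):
    """Count log records per level, skipping lines that are not valid JSON."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        counts[record.get("level", "UNKNOWN")] += 1
    return counts


counts = count_by_level(sample_lines)
print(dict(counts))

# Field-level filtering: order IDs of every record logged at ERROR
failed_orders = [
    record["order_id"]
    for record in map(json.loads, sample_lines)
    if record["level"] == "ERROR"
]
print(failed_orders)
```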
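# Structured logging with JSON
import json
import logging
# Attributes present on every LogRecord; anything else was passed via `extra`
STANDARD_ATTRS = set(vars(logging.makeLogRecord({}))) | {'message', 'asctime'}
class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }
        # Include context fields supplied through `extra`
        for key, value in vars(record).items():
            if key not in STANDARD_ATTRS:
                log_data[key] = value
        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)
        return json.dumps(log_data, default=str)
# Configure logging
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger(__name__)
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # the default effective level (WARNING) would drop info() calls
# Log with context
logger.info("Order processed", extra={
    'order_id': '12345',
    'customer_id': '67890',
    'amount': 99.99
})
Terraform for Monitoring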
# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/my-app"
  retention_in_days = 7
  tags = {
    Name = "app-logs"
  }
}
# CloudWatch Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Alert when CPU exceeds 80%"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
# SNS Topic for alerts
resource "aws_sns_topic" "alerts" {
  name = "devops-alerts"
}
resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "devops@example.com"
}
❓ What are the three pillars of observability?
❓ What is AWS CloudWatch?
❓ What is Prometheus used for?
❓ What is Grafana's primary function?
❓ What is distributed tracing used for?