Module 9 of 13 · DevOps & Platform Engineering · Intermediate

Observability & Monitoring

Duration: 120 min

Observability is the ability to understand a system's internal behavior from its external outputs. This module covers CloudWatch, Prometheus, Grafana, and distributed tracing, all of which are essential for maintaining reliable systems.

Observability Pillars

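Observability consists of three pillars: metrics (numeric measurements over time), logs (timestamped event records), and traces (the path of a single request across services).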
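
To make the distinction concrete, the sketch below records one request as a metric (a counter), a log (a structured event), and a trace (a timed span). The in-memory sinks and the `handle_request` function are hypothetical stand-ins for illustration; a real system would send each signal to a backend such as the ones covered in this module.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks; a real system would ship these to a backend.
metrics = Counter()  # metric: aggregated numbers, cheap to store and alert on
logs = []            # log: discrete events carrying detail
spans = []           # trace: timed operations linked by a shared trace ID


def handle_request(path):
    """Record one request as a metric, a log line, and a span."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    # ... the real work of the request would happen here ...

    duration_ms = (time.perf_counter() - start) * 1000
    metrics[("requests_total", path)] += 1
    logs.append(json.dumps({
        "level": "INFO",
        "message": "request handled",
        "path": path,
        "trace_id": trace_id,
    }))
    spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_ms": duration_ms,
    })
    return trace_id


first_trace_id = handle_request("/orders")
handle_request("/orders")
```

The shared `trace_id` is what lets a log line be matched to the span that produced it, while the counter answers "how many" without storing every event.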

AWS CloudWatch

CloudWatch is AWS's native monitoring and logging service:

# Put custom metric
aws cloudwatch put-metric-data \
  --namespace MyApp \
  --metric-name RequestLatency \
  --value 150 \
  --unit Milliseconds

# Get metric statistics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --period 3600 \
  --statistics Average,Maximum,Minimum

# Create alarm
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --alarm-description "Alert when CPU exceeds 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
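
The alarm above combines a statistic (Average), a period (300 seconds), a threshold (80), and a number of evaluation periods (2). The following is a simplified sketch of that evaluation, not CloudWatch's actual implementation: it ignores missing-data policies and "M out of N" datapoint settings, and the function names are hypothetical.

```python
from statistics import mean


def period_averages(datapoints, period, start):
    """Group (timestamp, value) datapoints into fixed periods and average each.

    Timestamps and period are in seconds. Periods with no data are skipped,
    which is a simplification of CloudWatch's missing-data handling.
    """
    buckets = {}
    for ts, value in datapoints:
        index = int((ts - start) // period)
        buckets.setdefault(index, []).append(value)
    return [mean(buckets[i]) for i in sorted(buckets)]


def alarm_state(averages, threshold, evaluation_periods):
    """Return ALARM only if the most recent N period averages all exceed the threshold."""
    recent = averages[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(v > threshold for v in recent) else "OK"


# One CPU sample per minute for 15 minutes: calm, then a sustained spike.
samples = [(t * 60, 40.0) for t in range(5)] + [(t * 60, 90.0) for t in range(5, 15)]
averages = period_averages(samples, period=300, start=0)
state = alarm_state(averages, threshold=80, evaluation_periods=2)
```

Requiring two consecutive breaching periods means a single short CPU spike does not trigger a notification.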

CloudWatch Logs

# Create log group
aws logs create-log-group --log-group-name /app/production

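# Create the log stream (it must exist before events can be written to it)
aws logs create-log-stream \
  --log-group-name /app/production \
  --log-stream-name app-instance-1

# Put log events (timestamp is in milliseconds since the epoch)
aws logs put-log-events \
  --log-group-name /app/production \
  --log-stream-name app-instance-1 \
  --log-events timestamp=$(date +%s000),message="Application started"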

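# Create metric filter
# A bare term matches any event containing it; square brackets would
# instead declare a space-delimited field pattern
aws logs put-metric-filter \
  --log-group-name /app/production \
  --filter-name ErrorCount \
  --filter-pattern "ERROR" \
  --metric-transformations metricName=ErrorCount,metricNamespace=MyApp,metricValue=1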

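# Query logs with CloudWatch Logs Insights
# start-query returns a queryId; fetch results with:
#   aws logs get-query-results --query-id <queryId>
# Note: date -d is GNU date syntax; on macOS/BSD use date -v-1H +%s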
aws logs start-query \
  --log-group-name /app/production \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | stats count() by bin(5m)'
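
For readers new to the Insights query language, the query above filters messages matching `/ERROR/` and counts them in 5-minute bins. A rough local equivalent, assuming events are available as `(epoch_seconds, message)` pairs (a hypothetical input shape chosen for this sketch), might look like this:

```python
import re
from collections import Counter
from datetime import datetime, timezone

BIN_SECONDS = 300  # matches bin(5m) in the query


def count_errors_by_bin(events):
    """Count events whose message matches /ERROR/, grouped into 5-minute bins.

    `events` is an iterable of (epoch_seconds, message) pairs.
    Returns a dict mapping each bin's start time (ISO 8601, UTC) to a count.
    """
    counts = Counter()
    for ts, message in events:
        if re.search(r"ERROR", message):
            bin_start = ts - (ts % BIN_SECONDS)
            label = datetime.fromtimestamp(bin_start, tz=timezone.utc).isoformat()
            counts[label] += 1
    return dict(counts)


events = [
    (1704067200, "ERROR db timeout"),    # 2024-01-01T00:00:00Z
    (1704067260, "INFO request ok"),
    (1704067499, "ERROR db timeout"),
    (1704067500, "ERROR upstream 502"),  # 2024-01-01T00:05:00Z
]
result = count_errors_by_bin(events)
```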

Application Instrumentation

# Python application with CloudWatch metrics
import boto3
import time
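from datetime import datetime, timezone

cloudwatch = boto3.client('cloudwatch')

def put_metric(metric_name, value, unit='None'):
    """Publish a single data point to the MyApp namespace."""
    cloudwatch.put_metric_data(
        Namespace='MyApp',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
                'Timestamp': datetime.now(timezone.utc)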
            }
        ]
    )

# Track request latency
start_time = time.time()
# ... process request ...
latency = (time.time() - start_time) * 1000
put_metric('RequestLatency', latency, 'Milliseconds')

# Track business metrics
put_metric('OrdersProcessed', 42, 'Count')
put_metric('RevenueGenerated', 1500.00, 'None')
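
The manual start/stop timing above is easy to forget on error paths. One possible refinement is a context manager that always reports latency, even when the wrapped code raises. In this sketch `timed` and `fake_emit` are hypothetical names; `fake_emit` stands in for `put_metric` so the example runs without AWS credentials.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(metric_name, emit):
    """Measure the wrapped block and report its duration in milliseconds.

    `emit` is any callable with the signature emit(name, value, unit),
    so the put_metric function above can be passed in directly.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        # The finally clause reports latency even when the block raises.
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit(metric_name, elapsed_ms, "Milliseconds")


# Stand-in emitter that records calls in a list instead of calling AWS.
recorded = []


def fake_emit(name, value, unit):
    recorded.append((name, value, unit))


with timed("RequestLatency", fake_emit):
    time.sleep(0.01)  # placeholder for request handling
```

In application code the call would read `with timed('RequestLatency', put_metric): ...`.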

Prometheus

Prometheus is an open-source monitoring system:

# prometheus.yml configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'app-metrics'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
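
Prometheus pulls metrics by sending an HTTP GET to each target's metrics path and parsing a plain-text response. The sketch below renders counters in that text exposition format to show what the `app-metrics` target is expected to return. The `render_metrics` helper and the sample counters are hypothetical; a production service would normally use an official Prometheus client library rather than formatting by hand.

```python
def render_metrics(counters):
    """Render counters in the Prometheus text exposition format.

    `counters` maps (metric_name, ((label, value), ...)) to a number.
    """
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            # One TYPE line per metric family, before its samples.
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        selector = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


counters = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 1027,
    ("http_requests_total", (("method", "POST"), ("status", "500"))): 3,
}
body = render_metrics(counters)
```

Each label combination becomes its own time series, which is why queries can later filter on `status` or `method`.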

Deploy Prometheus:

# Download and run Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64
./prometheus --config.file=prometheus.yml

# Access Prometheus UI at http://localhost:9090

Grafana

Grafana visualizes metrics from Prometheus and other sources:

# Run Grafana in Docker
docker run -d \
  -p 3000:3000 \
  -e GF_SECURITY_ADMIN_PASSWORD=admin \
  grafana/grafana:latest

# Access at http://localhost:3000
# Default credentials: admin/admin

Create a Grafana dashboard (this payload shape matches the body accepted by the HTTP API endpoint POST /api/dashboards/db):

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Latency",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~'5..'}[5m]))"
          }
        ],
        "type": "timeseries"
      },
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}
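
The Request Latency panel relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw observations. The following sketch approximates that calculation using linear interpolation inside the matching bucket. It is a simplified illustration, not Prometheus's source code, and the bucket values are invented for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total), like the `le` series of
    a Prometheus histogram. Observations are assumed to be spread evenly
    inside each bucket, so the result is an estimate.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# 100 requests: 50 under 0.1s, 90 under 0.5s, 98 under 1s, 2 slower than 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
```

Because the estimate can never be finer than the bucket boundaries, bucket bounds should be chosen around the latencies that matter for alerting.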

Distributed Tracing

Distributed tracing tracks requests across microservices:

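# Python application with OpenTelemetry tracing
# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry
# Python releases; newer setups typically export OTLP, which Jaeger accepts.
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure Jaeger exporter
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Name the service so its spans are grouped in the Jaeger UI
# ("order-service" is an example name)
trace.set_tracer_provider(
    TracerProvider(resource=Resource.create({SERVICE_NAME: "order-service"}))
)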
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# Create spans
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order_id", "12345")
    
    with tracer.start_as_current_span("validate_order"):
        # Validation logic
        pass
    
    with tracer.start_as_current_span("process_payment"):
        # Payment logic
        pass
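
The spans above live in one process. For a trace to follow a request across services, the trace ID has to travel with each outgoing call, commonly in a `traceparent` HTTP header as defined by the W3C Trace Context format. OpenTelemetry instrumentation normally handles this automatically; the sketch below only illustrates the idea with hypothetical helper names.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace with random trace and span IDs."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(parent_header):
    """Continue the caller's trace: keep the trace ID and flags, mint a new span ID.

    Falls back to starting a new trace if the incoming header is malformed.
    """
    match = TRACEPARENT_RE.match(parent_header)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

Every service that forwards the header this way contributes spans under the same trace ID, which is what lets a tracing backend reassemble the full request path.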

Alerting Strategy

# Prometheus alerting rules
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 means 5%
        expr: sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        annotations:
          summary: "High request latency"
          description: "P95 latency is {{ $value }}s"

      - alert: PodCrashLooping
        # More than 3 container restarts within 15 minutes
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        annotations:
          summary: "Pod is crash looping"
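
Each rule's `for: 5m` clause means the expression must stay true for five minutes before the alert fires; until then the alert is pending. A small simulation of that behavior, using a hypothetical `alert_states` helper and invented sample values, could look like this:

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each (timestamp, value) sample.

    States follow Prometheus naming: inactive, pending, firing.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # any dip below the threshold resets the timer
            state = "inactive"
        states.append(state)
    return states


# Error ratio sampled once a minute; it dips briefly at the fifth sample.
values = [0.02, 0.08, 0.09, 0.07, 0.01, 0.08, 0.09, 0.10, 0.12, 0.11, 0.10]
samples = [(i * 60, v) for i, v in enumerate(values)]
states = alert_states(samples, threshold=0.05, for_seconds=300)
```

The brief dip resets the timer, so the alert only fires once the ratio has stayed above 5% for a full five minutes. This trades detection speed for fewer false pages.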

Logging Best Practices

# Structured logging with JSON
import json
import logging

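# Attributes present on every LogRecord; anything else was passed via `extra`
STANDARD_ATTRS = set(vars(logging.LogRecord('', 0, '', 0, '', (), None))) | {'message', 'asctime'}

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            'timestamp': self.formatTime(record),
            'level': record.levelname,
            'logger': record.name,
            'message': record.getMessage(),
            'module': record.module,
            'function': record.funcName,
            'line': record.lineno
        }

        # Include context fields supplied through extra={...}
        for key, value in vars(record).items():
            if key not in STANDARD_ATTRS:
                log_data[key] = value

        if record.exc_info:
            log_data['exception'] = self.formatException(record.exc_info)

        # default=str keeps non-JSON-serializable values from raising
        return json.dumps(log_data, default=str)

# Configure logging
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger = logging.getLogger(__name__)
logger.addHandler(handler)
# Without an explicit level, INFO messages are dropped (the default is WARNING)
logger.setLevel(logging.INFO)

# Log with context
logger.info("Order processed", extra={
    'order_id': '12345',
    'customer_id': '67890',
    'amount': 99.99
})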
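
Passing `extra` on every call works, but context such as a request ID is easy to omit. One common alternative is to store the ID in a context variable and let a logging filter attach it to every record. The names below (`request_id_var`, `RequestIdFilter`, `JsonLineFormatter`, `handle_request`) are hypothetical, and the output is captured in memory only so the result can be inspected.

```python
import contextvars
import io
import json
import logging

request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record passing through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


class JsonLineFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": record.request_id,
        })


stream = io.StringIO()  # stands in for stdout so the output can be inspected
capture_handler = logging.StreamHandler(stream)
capture_handler.addFilter(RequestIdFilter())
capture_handler.setFormatter(JsonLineFormatter())

log = logging.getLogger("orders.sketch")
log.setLevel(logging.INFO)
log.addHandler(capture_handler)
log.propagate = False  # keep the sketch's output out of the root logger


def handle_request(request_id):
    token = request_id_var.set(request_id)
    try:
        log.info("Order received")
        log.info("Order processed")
    finally:
        request_id_var.reset(token)  # restore the previous value


handle_request("req-1")
handle_request("req-2")
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Because context variables are isolated per thread and per asyncio task, concurrent requests do not overwrite each other's IDs.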

Terraform for Monitoring

# CloudWatch Log Group
resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/my-app"
  retention_in_days = 7

  tags = {
    Name = "app-logs"
  }
}

# CloudWatch Alarm
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu-alarm"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Alert when CPU exceeds 80%"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# SNS Topic for alerts
resource "aws_sns_topic" "alerts" {
  name = "devops-alerts"
}

resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "devops@example.com"
}

❓ What are the three pillars of observability?

❓ What is AWS CloudWatch?

❓ What is Prometheus used for?

❓ What is Grafana's primary function?

❓ What is distributed tracing used for?
