Monitoring and Observability
LEVEL 0
The Problem
Your app is running in production. Suddenly users report errors. You check:
docker compose ps
# All containers show "Up"
Everything looks fine. But users can’t complete purchases. What’s wrong?
You need visibility into what’s actually happening inside your containers.
LEVEL 1
The Concept — The Airplane Cockpit
Imagine flying a plane with no instruments.
You can’t see:
- Altitude
- Speed
- Fuel level
- Engine temperature
- Navigation
You’re flying blind. You’ll crash.
Monitoring is the instrument panel for your application.
You need to see:
- Traffic volume (requests per second)
- Error rates (% of failed requests)
- Latency (response times)
- Resource usage (CPU, memory)
- Health status
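Docker's built-in commands give a first, rough version of this panel, useful as a stopgap before the tooling below:
docker stats                  # live CPU, memory, and network usage per container
docker compose logs -f app    # stream the app's recent log output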
LEVEL 2
The Mechanics — The Three Pillars
1. Metrics (Quantitative data)
Numbers over time:
- CPU usage: 45%
- Memory usage: 512MB / 1GB
- Request rate: 1000 req/s
- Error rate: 0.5%
- P95 latency: 250ms
2. Logs (Event records)
What happened:
2024-01-15 10:23:45 INFO User 123 logged in
2024-01-15 10:23:46 ERROR Database connection failed
2024-01-15 10:23:47 WARN Retry attempt 1/3
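For reference, a minimal sketch of how an application could emit lines in that shape with Python's standard logging module (the logger name and messages are placeholders):
import logging

# Format roughly matching the lines above: timestamp, level, message
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
log = logging.getLogger("myapp")

log.info("User 123 logged in")
log.error("Database connection failed")
log.warning("Retry attempt 1/3")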
3. Traces (Request flow)
Follow a request through the system:
Request ID: abc123
→ nginx: 2ms
→ app: 45ms
    → database query: 30ms
    → cache lookup: 10ms
→ response: 5ms
Total: 52ms
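Real tracing libraries (OpenTelemetry, for example) collect these spans automatically. As a toy sketch of the idea, timing named spans by hand looks roughly like this (all names and sleep durations are invented):
import time
import uuid
from contextlib import contextmanager

request_id = uuid.uuid4().hex[:6]   # stand-in for a real request ID
spans = []

@contextmanager
def span(name):
    # Record how long one named step of the request takes, in milliseconds
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))

with span("app"):
    with span("database query"):
        time.sleep(0.03)
    with span("cache lookup"):
        time.sleep(0.01)

print(f"Request ID: {request_id}")
for name, ms in spans:
    print(f"→ {name}: {ms:.0f}ms")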
LEVEL 3
Prometheus + Grafana Stack
docker-compose.yml:
version: '3.9'

services:
  app:
    image: myapp
    ports:
      - "8000:8000"
    # App exposes a /metrics endpoint (see the Python snippet below)

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false

  # cAdvisor exports per-container CPU/memory metrics; prometheus.yml below scrapes it
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

volumes:
  prometheus-data:
  grafana-data:
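With this Compose file and the prometheus.yml shown next sitting side by side, the stack can be brought up and inspected from the browser (ports and the admin password follow the Compose file above):
docker compose up -d
# Prometheus scrape targets:  http://localhost:9090/targets
# Grafana: http://localhost:3000 (login admin / admin, then add a
# Prometheus data source pointing at http://prometheus:9090)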
prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:8000']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
App exposes metrics:
from flask import Flask
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# Metrics
requests_total = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')

@app.route('/metrics')
def metrics():
    # Expose all registered metrics in the Prometheus text format
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.route('/api/users')
@request_duration.time()   # record how long each request takes
def get_users():
    requests_total.labels(method='GET', endpoint='/api/users', status=200).inc()
    # ... handle request
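Once Prometheus is scraping these metrics, the request rate, error rate, and P95 latency from the instrument-panel list can be expressed as PromQL queries and graphed in Grafana (metric names match the ones defined above):
# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# Share of requests that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P95 latency from the duration histogram
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))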
LEVEL 4
Centralized Logging (ELK Stack)
docker-compose.yml:
services:
  elasticsearch:
    image: elasticsearch:8.11.0
    environment:
      - discovery.type=single-node
      # Dev-only: disable security so Logstash and Kibana can connect over plain HTTP
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  logstash:
    image: logstash:8.11.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    ports:
      # The GELF input must be reachable from the Docker host (see the logging driver note below)
      - "12201:12201/udp"
    depends_on:
      - elasticsearch

  kibana:
    image: kibana:8.11.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

  app:
    image: myapp
    logging:
      driver: gelf
      options:
        # The gelf driver runs in the Docker daemon on the host, so it cannot
        # resolve Compose service names; point it at the published host port.
        gelf-address: "udp://localhost:12201"
        tag: "myapp"

volumes:
  es-data:
logstash.conf:
input {
  gelf {
    port => 12201
  }
}

filter {
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
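The json filter above assumes the application writes each log message as a JSON object, so its fields become searchable in Kibana. A minimal sketch of that on the app side, using only the standard library (the field names here are an assumption, not something Logstash requires):
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    # Emit each record as one JSON object so the Logstash json filter
    # can turn "level", "message", etc. into structured fields.
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("myapp").error("Database connection failed")
# -> {"timestamp": "...", "level": "ERROR", "message": "Database connection failed"}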
LEVEL 5
Alerting
Prometheus alerting rules:
# alert.rules.yml
groups:
  - name: app_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} req/s"

      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container memory usage above 90%"

      - alert: ContainerDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container is down"
Alertmanager configuration:
# alertmanager.yml
route:
  receiver: 'team-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'team-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_KEY'
        severity: '{{ .GroupLabels.severity }}'
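Alertmanager itself is not part of the Compose file from LEVEL 3; a minimal service for it could look like the following sketch (image tag and config path follow the official prom/alertmanager image defaults):
alertmanager:
  image: prom/alertmanager:latest
  ports:
    - "9093:9093"
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  command:
    - '--config.file=/etc/alertmanager/alertmanager.yml'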