Implementing Health Checks
LEVEL 0
The Problem
You know health checks are important. But how do you actually implement them?
- What command should the health check run?
- How do you test if a web server is working?
- What about databases? Background workers? Message queues?
- How do you avoid false positives (marking healthy things unhealthy)?
- How do you avoid false negatives (marking unhealthy things healthy)?
LEVEL 1
The Concept — The Wellness Test
Imagine a doctor’s checkup.
A good checkup tests the things that matter:
- Blood pressure (circulation working?)
- Temperature (no infection?)
- Reflexes (nervous system working?)
- Quick questions (cognitive function working?)
A bad checkup would just ask “Are you alive?” Everyone who shows up is alive.
A good health check tests actual functionality. Not just “Is the process running?” but “Can this service do its job?”
LEVEL 2
The Mechanics — Health Check Syntax
In Dockerfile:
FROM nginx:alpine
HEALTHCHECK --interval=30s \
            --timeout=3s \
            --start-period=5s \
            --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost/ || exit 1
Parameters:
- --interval=30s: Run the check every 30 seconds
- --timeout=3s: A check that does not complete within 3 seconds counts as a failure
- --start-period=5s: Grace period after the container starts; failures during it do not count toward the retry limit
- --retries=3: Mark the container unhealthy after 3 consecutive failures
- CMD ...: The command to run. Exit 0 = healthy, non-zero = unhealthy
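To make the interaction between these parameters concrete, here is a rough simulation in Python. It is a sketch of the documented behaviour, not Docker's actual implementation: simulate_health and its arguments are hypothetical names, and it assumes the first probe runs one interval after the container starts.

```python
def simulate_health(results, interval=30, start_period=5, retries=3):
    """Return the container status after each probe.

    results is a list of booleans, one per probe (True means exit code 0).
    """
    status = "starting"
    failures = 0
    history = []
    for i, ok in enumerate(results, start=1):
        elapsed = i * interval  # assumed: probe i runs i intervals after start
        if ok:
            # Any success resets the failure streak and marks the container healthy.
            failures = 0
            status = "healthy"
        elif status == "starting" and elapsed <= start_period:
            # Failures inside the grace period are ignored.
            pass
        else:
            failures += 1
            if failures >= retries:
                status = "unhealthy"
        history.append(status)
    return history
```

With the values from the Dockerfile above, two early failures followed by a success leave the container healthy, and it only turns unhealthy after three failures in a row.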
In docker-compose.yml:
services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost/"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 10s
At runtime:
docker run -d \
  --health-cmd="curl -f http://localhost/ || exit 1" \
  --health-interval=30s \
  --health-retries=3 \
  my-app
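Once a container runs with a health check, its current status is reported under State.Health in the output of docker inspect. The following Python sketch reads it; parse_health and container_health are hypothetical helper names, and the second one assumes the docker CLI is on the PATH.

```python
import json
import subprocess


def parse_health(inspect_output: str) -> str:
    """Extract the health status from `docker inspect` JSON output."""
    containers = json.loads(inspect_output)
    state = containers[0].get("State", {})
    health = state.get("Health")
    # Containers without a health check have no "Health" section at all.
    return health["Status"] if health else "none"


def container_health(name: str) -> str:
    """Run `docker inspect` for one container and return its health status."""
    result = subprocess.run(
        ["docker", "inspect", name],
        capture_output=True,
        text=True,
        check=True,  # raise if the container does not exist
    )
    return parse_health(result.stdout)
```

The status is one of starting, healthy, or unhealthy, which is the same value shown in the STATUS column of docker ps.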
LEVEL 3
Health Check Commands for Different Services
Web servers (nginx, Apache):
HEALTHCHECK CMD curl -f http://localhost/ || exit 1
Or with wget:
HEALTHCHECK CMD wget --quiet --tries=1 --spider http://localhost/ || exit 1
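Some images ship with neither curl nor wget. If the image already contains a Python interpreter, a small probe built on the standard library can play the same role. This is a minimal sketch; probe is a hypothetical helper, and treating only 2xx responses as healthy is an assumption.

```python
import urllib.request


def probe(url: str, timeout: float = 3.0) -> int:
    """Return 0 if the URL answers with a 2xx status, 1 otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 0 if 200 <= response.status < 300 else 1
    except Exception:
        # HTTP errors (4xx/5xx), refused connections, and timeouts all land here.
        return 1
```

Saved as a hypothetical probe.py that ends with sys.exit(probe("http://localhost/")), it could be wired up as HEALTHCHECK CMD python /probe.py.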
APIs with health endpoints:
Many APIs expose a dedicated /health or /healthz endpoint:
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
  interval: 30s
In your API code:
@app.route('/health')
def health():
    # Check database connection
    try:
        db.execute("SELECT 1")
        return {"status": "healthy"}, 200
    except Exception:
        # Any database error means the service cannot do its job
        return {"status": "unhealthy", "reason": "database unavailable"}, 503
PostgreSQL:
healthcheck:
  test: ["CMD", "pg_isready", "-U", "postgres"]
  interval: 10s
  timeout: 5s
  retries: 5
MySQL:
healthcheck:
  test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
  interval: 10s
  timeout: 5s
  retries: 3
Redis:
healthcheck:
  test: ["CMD", "redis-cli", "ping"]
  interval: 10s
  timeout: 3s
  retries: 3
MongoDB:
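healthcheck:
  # mongosh replaces the legacy mongo shell, which was removed in MongoDB 6.0;
  # on older images use "mongo" instead
  test: ["CMD", "mongosh", "--quiet", "--eval", "db.adminCommand('ping')"]
  interval: 10s
  timeout: 5s
  retries: 3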
Background workers (no HTTP server):
Create a simple status file:
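# worker.py
import time

while True:
    try:
        process_job()  # placeholder for the worker's real job handler
        # Update the heartbeat file with whole seconds, so the shell
        # arithmetic in the health check can parse it
        with open('/tmp/worker-heartbeat', 'w') as f:
            f.write(str(int(time.time())))
    except Exception:
        # Leave the heartbeat untouched so it goes stale on repeated errors
        pass

In Dockerfile:
HEALTHCHECK CMD test -f /tmp/worker-heartbeat && \
  test $(( $(date +%s) - $(cat /tmp/worker-heartbeat) )) -lt 60 || exit 1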
This checks if the heartbeat file exists and was updated in the last 60 seconds.
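The same freshness test can also be written in Python, which avoids shell arithmetic altogether. This sketch is a variation rather than a translation: it looks at the file's modification time instead of its contents, and heartbeat_is_fresh is a hypothetical helper name.

```python
import os
import time


def heartbeat_is_fresh(path: str, max_age: float = 60.0) -> bool:
    """True if the file exists and was modified within the last max_age seconds."""
    try:
        age = time.time() - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable heartbeat file counts as unhealthy.
        return False
    return age < max_age
```

A health check script would exit with 0 when this returns True and with 1 otherwise.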
LEVEL 4
Designing Effective Health Checks
1. Test actual functionality, not just process existence
❌ Bad:
HEALTHCHECK CMD ps aux | grep nginx
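Docker already tracks whether the container's main process is running, and this pipeline can even succeed by matching the grep process itself.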
✅ Good:
HEALTHCHECK CMD curl -f http://localhost/
This tests if nginx can actually serve requests.
2. Keep it fast (< 1 second ideal)
❌ Bad:
HEALTHCHECK CMD sleep 5 && curl http://localhost/
Adds unnecessary delay.
✅ Good:
HEALTHCHECK --timeout=3s CMD curl -f http://localhost/
3. Make it independent
❌ Bad:
HEALTHCHECK CMD curl -f http://database:5432/
If the database is down, this marks the web server unhealthy even though the web server itself is fine.
✅ Good:
HEALTHCHECK CMD curl -f http://localhost/healthz
Your /healthz endpoint can internally check dependencies and return appropriate status.
4. Avoid side effects
❌ Bad:
HEALTHCHECK CMD curl -f http://localhost/process-next-job
This triggers job processing on every health check!
✅ Good:
HEALTHCHECK CMD curl -f http://localhost/health
Read-only status check.
5. Handle dependencies properly
# Health endpoint that checks dependencies
@app.route('/health')
def health():
    checks = {}
    healthy = True

    # Check database
    try:
        db.execute("SELECT 1")
        checks['database'] = 'ok'
    except Exception:
        checks['database'] = 'fail'
        healthy = False

    # Check redis
    try:
        redis.ping()
        checks['cache'] = 'ok'
    except Exception:
        checks['cache'] = 'fail'
        healthy = False

    # 503 tells the health check command (curl -f) that the service is unhealthy
    status_code = 200 if healthy else 503
    return checks, status_code
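As the number of dependencies grows, the repeated try/except blocks can be folded into one loop. The following is a framework-independent sketch under the assumption that each dependency check is a callable that raises on failure; run_checks is a hypothetical helper, not part of Flask or any other library.

```python
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], object]]) -> Tuple[Dict[str, str], int]:
    """Run every check; any exception marks that dependency as failed."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception:
            results[name] = "fail"
    # One failed dependency is enough to report the service as unhealthy.
    status_code = 200 if all(v == "ok" for v in results.values()) else 503
    return results, status_code
```

Inside the endpoint this could be called as run_checks({"database": lambda: db.execute("SELECT 1"), "cache": redis.ping}), and the returned pair used directly as the response.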
LEVEL 5
Health Checks in Docker Compose
Wait for dependencies to be healthy:
version: '3.9'
services:
  db:
    image: postgres:15
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
  app:
    image: myapp
    depends_on:
      db:
        condition: service_healthy  # Wait for db to be healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
Now app won’t start until db is healthy.
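Conceptually, waiting on service_healthy is a polling loop over the dependency's health status. The Python sketch below illustrates the idea only and is not how Compose is implemented; wait_until_healthy and get_status are hypothetical names, and giving up as soon as the status becomes unhealthy is an assumption.

```python
import time
from typing import Callable


def wait_until_healthy(get_status: Callable[[], str],
                       timeout: float = 60.0,
                       poll_interval: float = 1.0) -> bool:
    """Poll get_status() until it returns "healthy" or the wait is abandoned."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "healthy":
            return True
        # Stop early on a definite failure, or when time runs out.
        if status == "unhealthy" or time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Here get_status could be any function that returns starting, healthy, or unhealthy for the db container.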
Full example:
version: '3.9'
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost/"]
      interval: 30s
      timeout: 3s
      retries: 3
      start_period: 5s
    depends_on:
      app:
        condition: service_healthy
  app:
    build: ./app
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:5000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_healthy
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    volumes:
      - db-data:/var/lib/postgresql/data
  cache:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
volumes:
  db-data:
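The depends_on entries in this file form a dependency graph, and the resulting start order can be worked out with a topological sort. The sketch below uses Python's standard graphlib module (Python 3.9+); DEPENDS_ON is a hand-written copy of the relationships above and startup_order is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it depends on (copied from depends_on above).
DEPENDS_ON = {
    "nginx": {"app"},
    "app": {"db", "cache"},
    "db": set(),
    "cache": set(),
}


def startup_order(depends_on):
    """Return one valid start order in which dependencies come first."""
    return list(TopologicalSorter(depends_on).static_order())
```

For this file, db and cache come first in either order, followed by app and then nginx, with each step waiting until the previous services report healthy.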