Phase 6 of 7

Monitoring and Observability

Add dashboards, metrics, and alerting to make production behavior measurable.

Overview

Extend the stack with telemetry so behavior, latency, queue depth, and system health are visible and actionable.

What to build

Deliverables

Advance only when these outputs exist in your code or compose definitions.

  1. Add Prometheus metrics endpoints and service instrumentation
  2. Create Prometheus scrape configuration for core services
  3. Run Grafana with dashboard structure for latency, queues, DB, and workers
  4. Define Alertmanager routing and sample alert rules
  5. Optionally wire tracing with Jaeger
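For deliverable 1, it can help to see what a scrape target actually returns. The following is a minimal sketch, not the project's instrumentation: it serves one counter in the Prometheus text exposition format using only the Python standard library. The `Counter` helper, the metric name `app_requests_total`, the `route` label, and port 8000 are all hypothetical. A real service would more likely use an official client library, which also handles label escaping, histograms, and process metrics.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """Tiny thread-safe counter with a single label (hypothetical helper)."""

    def __init__(self, name, help_text, label):
        self.name, self.help_text, self.label = name, help_text, label
        self._values = {}
        self._lock = threading.Lock()

    def inc(self, label_value, amount=1):
        with self._lock:
            self._values[label_value] = self._values.get(label_value, 0) + amount

    def render(self):
        # Text exposition format: HELP and TYPE comments, then one line per sample.
        # Label values are not escaped here; keep them free of quotes and backslashes.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for value, count in sorted(self._values.items()):
                lines.append(f'{self.name}{{{self.label}="{value}"}} {count}')
        return "\n".join(lines) + "\n"


# Hypothetical metric; call REQUESTS.inc("/jobs") from request handlers.
REQUESTS = Counter("app_requests_total", "Total requests handled.", "route")


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = REQUESTS.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics (port 8000 is an assumption)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` in a background thread exposes `/metrics` on the chosen port, which is the path a Prometheus scrape job requests by default.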

Done when

Success criteria

These are acceptance indicators, not a checklist to start from.

  • Prometheus successfully scrapes key service metrics
  • Grafana reflects live request, latency, and worker data
  • Key alerts fire on health and threshold breaches
  • Alert notifications route to a configured endpoint
  • Historical metrics are retained for at least 30 days (Prometheus keeps 15 days by default, so raise `--storage.tsdb.retention.time`)
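The first criterion can be checked without the UI. This sketch, under the assumption that Prometheus is reachable at `http://localhost:9090`, reads the `/api/v1/targets` endpoint of the Prometheus HTTP API and lists every active target whose health is not `up`. The function names are illustrative only.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape_url, last_error) for each active target not reporting "up"."""
    targets = payload["data"]["activeTargets"]
    return [
        (t["labels"].get("job", "?"), t["scrapeUrl"], t.get("lastError", ""))
        for t in targets
        if t["health"] != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from Prometheus (base URL is an assumption)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)
```

With the stack running, `unhealthy_targets(fetch_targets())` should return an empty list. A non-empty result names the job and the scrape error to investigate.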

Verification

Testing and validation

Run these in order. Confirm each result before moving to the next step.

  1. `docker compose -f docker-compose.dev.yml up -d`

  2. `open http://localhost:3001`

  3. `open http://localhost:9090` and confirm target discovery

  4. Generate synthetic load and check dashboard movement

  5. `docker compose -f docker-compose.dev.yml stop api` and confirm alerting behavior

  6. Optionally open `http://localhost:16686` if Jaeger is wired
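Step 4 does not prescribe a load tool. One possible approach, sketched here with only the standard library, is a small concurrent request loop that reports failures and an approximate p95 latency to compare against the dashboard. The URL `http://localhost:8000/health` is a placeholder, not something this phase defines; substitute a real endpoint from your API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def timed_call(fn):
    """Run fn() and return (ok, elapsed_seconds); any exception counts as a failure."""
    start = time.perf_counter()
    try:
        fn()
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load(fn, total=200, concurrency=10):
    """Call fn `total` times across `concurrency` threads and summarize the results."""
    if total < 1:
        raise ValueError("total must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(fn), range(total)))
    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    # Approximate p95: the value at the 95% position of the sorted latencies.
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {"requests": total, "failures": failures, "p95_seconds": p95}


def hit(url="http://localhost:8000/health"):
    """Issue one GET request (the default URL is a placeholder assumption)."""
    with urlopen(url, timeout=5) as resp:
        resp.read()
```

Running `run_load(hit)` while watching Grafana should move the request and latency panels. Repeating it after step 5 should show the failure count rise, which gives the alert rules something to fire on.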