Sample Deliverable

Incident Triage Report (Public Example)

This is the report format clients receive during active coverage. The scenario below is anonymized but technically realistic.

This is the generated format for the $39/month Monthly Incident Guard pilot. The payment link is active.

1. Incident Snapshot

2. Ranked Hypotheses

  1. Upstream app workers saturated after a migration lock (highest likelihood).
  2. Connection pool exhausted because stale transactions were not released.
  3. Ingress timeout lower than app timeout, causing premature 504 at proxy.
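
Hypothesis 3 can be illustrated with a minimal, hypothetical nginx-style ingress fragment (the route and upstream names are placeholders, not from the incident): the proxy gives up at 30s while the app allows longer requests, so clients see a 504 from the proxy before the app responds.

```nginx
location /affected-route {
    proxy_pass          http://app_upstream;
    proxy_read_timeout  30s;   # lower than the app's own request timeout, so the proxy 504s first
}
```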

3. Verification Commands

Run in this exact order and capture UTC timestamps with each output:

# Proxy 502/504 concentration by minute
awk '$9 ~ /^50[24]$/ {print substr($4,2,17)}' access.log | sort | uniq -c | tail -n 20
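
To sanity-check the awk filter before running it against production logs, it can be exercised on a small fabricated sample (combined log format, status in field 9, timestamp in field 4; the IPs and paths below are made up):

```shell
# Write three fake combined-log lines: two 50x in the same minute, one 200
printf '%s\n' \
  '1.2.3.4 - - [10/Oct/2025:14:05:01 +0000] "GET / HTTP/1.1" 502 0' \
  '1.2.3.4 - - [10/Oct/2025:14:05:02 +0000] "GET / HTTP/1.1" 504 0' \
  '1.2.3.4 - - [10/Oct/2025:14:06:00 +0000] "GET / HTTP/1.1" 200 0' > /tmp/access.sample

# substr($4,2,17) strips the leading '[' and truncates to minute granularity
awk '$9 ~ /^50[24]$/ {print substr($4,2,17)}' /tmp/access.sample | sort | uniq -c
# → "2 10/Oct/2025:14:05" (the 200 line is excluded)
```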

# App worker and event loop pressure
curl -sS http://127.0.0.1:3000/health
curl -sS http://127.0.0.1:3000/metrics | grep -E 'event_loop|active_requests|db_pool'

# Database lock and wait inspection
psql "$DATABASE_URL" -c "select now(), query_start, wait_event_type, wait_event, state, query from pg_stat_activity where state <> 'idle' order by query_start asc limit 20;"
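
To capture the required UTC timestamps consistently, each command can be run through a small wrapper like the sketch below (the `run_check` helper is illustrative, not part of the standard report tooling):

```shell
run_check () {
  # Print a UTC timestamp header, then run the command and show its output
  echo "=== $(date -u +'%Y-%m-%dT%H:%M:%SZ') $*"
  "$@"
}

run_check echo "sample check"
```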

4. Safe Fix Sequence

  1. Freeze deploys and autoscaling changes for 30 minutes.
  2. Raise the proxy read timeout from 30s to 75s, only for the affected route.
  3. Restart one app replica at a time; confirm error-rate drop before next restart.
  4. Kill long-running DB sessions older than 120s tied to the failed migration.
  5. Re-run synthetic checks and compare p95 latency versus pre-incident baseline.
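
Step 4 can be sketched as a dry-run-by-default helper, assuming PostgreSQL and access to pg_stat_activity (the `%migration%` filter is an assumption standing in for whatever identifies the failed migration's sessions):

```shell
# Statement that would terminate sessions idle-in-query for more than 120s
KILL_SQL="select pg_terminate_backend(pid)
from pg_stat_activity
where state <> 'idle'
  and now() - query_start > interval '120 seconds'
  and query ilike '%migration%';"

# Default to printing the statement; set DRY_RUN=0 to actually execute it
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$KILL_SQL"
else
  psql "$DATABASE_URL" -c "$KILL_SQL"
fi
```

Keeping the kill statement behind a dry-run gate makes it easy to paste the exact SQL into the report before any sessions are terminated.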

5. Rollback Checkpoints

6. 7-Day Hardening Plan

Need this report format for your active outage? Start at the outage fast path.