Sample Deliverable
Incident Triage Report (Public Example)
This is the report format clients receive during active coverage. Scenario below is anonymized but technically realistic.
Generated format for the $39/month Monthly Incident Guard pilot. Payment link is active.
1. Incident Snapshot
- Window: 2026-02-18 03:22-04:11 UTC
- Primary symptom: 504 spikes on
POST /api/bounties - Impact: 31% request failure during deploy window
- Boundary signal: proxy timed out before upstream app response
2. Ranked Hypotheses
- Upstream app workers saturated after migration lock (highest likelihood).
- Connection pool exhausted because stale transactions were not released.
- Ingress timeout lower than app timeout, causing premature 504 at proxy.
3. Verification Commands
Run in this exact order and capture UTC timestamps with each output:
# Proxy 5xx concentration by minute
awk '$9 ~ /^50[24]$/ {print substr($4,2,17)}' access.log | sort | uniq -c | tail -n 20
# App worker and event loop pressure
curl -sS http://127.0.0.1:3000/health
curl -sS http://127.0.0.1:3000/metrics | grep -E 'event_loop|active_requests|db_pool'
# Database lock and wait inspection
psql "$DATABASE_URL" -c "select now(), wait_event_type, wait_event, state, query from pg_stat_activity where state <> 'idle' order by query_start asc limit 20;"
4. Safe Fix Sequence
- Freeze deploys and autoscaling changes for 30 minutes.
- Raise proxy read timeout from 30s to 75s only for affected route.
- Restart one app replica at a time; confirm error-rate drop before next restart.
- Kill long-running DB sessions older than 120s tied to failed migration.
- Re-run synthetic checks and compare p95 latency versus pre-incident baseline.
5. Rollback Checkpoints
- If 5xx rate does not drop below 3% in 10 minutes, revert to previous app image.
- If DB wait events remain above baseline for 15 minutes, rollback latest schema change.
- If error budget burn exceeds 5%/hour, disable non-critical background workers.
6. 7-Day Hardening Plan
- Add route-level timeout budget matrix (proxy/app/db) to release checklist.
- Ship health endpoint contract test in CI for dependency readiness checks.
- Add deploy canary gate: block full rollout when canary 5xx exceeds threshold.
- Publish weekly reliability report with risk rank, owner, and due date per fix.
Need this report format for your active outage? Start at the outage fast path.