Incident Runbook
Internal SRE/ops reference for multi-region SaaS: US (Hetzner Ashburn), EU (Hetzner Falkenstein), India (Utho Mumbai). Stack: Docker Compose, PostgreSQL 16, Caddy 2, OpenTelemetry → SigNoz.
Assume deploy root and compose file (adjust if your layout differs):
export FS_DEPLOY=/opt/featuresignals
export COMPOSE="docker compose -f $FS_DEPLOY/deploy/docker-compose.region.yml"
cd "$FS_DEPLOY"
1. Severity levels
| Level | Definition | Examples | Response target |
|---|---|---|---|
| P1 | Full outage or data loss risk | All regions or single region completely unreachable; auth broken for all; DB unavailable | Immediate page; war room |
| P2 | Major degradation | One product surface down (API or app); eval errors elevated; partial region impact | Within 15 min acknowledgement |
| P3 | Minor / limited | Elevated errors for subset; non-critical feature broken; perf slip below SLO | Next business hours queue |
| P4 | Cosmetic / internal | UI glitches; docs typo; non-customer-impacting bugs | Backlog |
Escalate severity when customer impact or blast radius grows.
2. First response checklist
SSH into the affected region’s host (Hetzner Cloud / Utho console → server IP).
ssh -i ~/.ssh/fs_ops deploy@<REGION_SERVER_IP>
cd /opt/featuresignals
Container health
docker compose -f deploy/docker-compose.region.yml ps -a
docker compose -f deploy/docker-compose.region.yml top
Recent logs (API, proxy, DB)
docker compose -f deploy/docker-compose.region.yml logs --tail=200 server
docker compose -f deploy/docker-compose.region.yml logs --tail=200 caddy
docker compose -f deploy/docker-compose.region.yml logs --tail=100 postgres
API health (from host, via container network)
docker compose -f deploy/docker-compose.region.yml exec -T server wget -qO- http://127.0.0.1:8080/health || true
curl -sfS https://"$DOMAIN_API"/health # after exporting DOMAIN_API from .env
Database connectivity
set -a && source .env && set +a
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
psql -U fs -d featuresignals -c "SELECT 1 AS ok, now() AS ts;"
SigNoz
- Open your SigNoz instance (e.g. cloud or self-hosted UI).
- Services → `featuresignals-api` (or `OTEL_SERVICE_NAME` from `.env`).
- Filter region = `us` / `eu` / `in` (`OTEL_SERVICE_REGION`).
- Check error rate, p99 latency, and trace waterfalls for 5xx spikes.
- Logs (if enabled) correlated by `trace_id` / `request_id` (grep sketch below).
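If a suspicious trace surfaces in SigNoz, you can cross-check it against raw container logs on the host. A minimal sketch, assuming the API writes trace IDs into its log lines (`<TRACE_ID>` is a placeholder):
# Correlate a SigNoz trace with container logs (trace_id in log lines is an assumption)
docker compose -f deploy/docker-compose.region.yml logs --since=30m --timestamps server | grep -F "<TRACE_ID>"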
Quick external checks
curl -sI "https://$DOMAIN_API/health"
curl -sI "https://$DOMAIN_APP/"
3. Region down
Symptoms: Health checks fail for one region only; DNS/geo still points users at that POP.
Confirm scope
# On the bad host
$COMPOSE ps
$COMPOSE logs --tail=100 caddy server postgres
Mitigate single region
- DNS / traffic steering: At your DNS or CDN (e.g. geo records or health-checked failover), point the failing region’s hostnames to a healthy region’s edge only if that region can legally and technically serve those users (latency, data residency, org `data_region`; see product policy). Otherwise keep DNS as-is and restore the region.
- Temporary redirect (example): Lower TTL on affected names (e.g. `api.eu.…`) ahead of changes; swap A/AAAA records to a standby IP or to the US edge if approved. Verify propagation with the check below.
- Dashboard multi-region URLs: `NEXT_PUBLIC_API_URL_EU` / `NEXT_PUBLIC_API_URL_IN` must match live API endpoints after any DNS change; redeploy the dashboard if those envs change at build time.
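After any steering change, confirm what resolvers and the new target actually serve. A sketch with placeholder hostname and IP:
# Confirm the record change has propagated (placeholder hostname)
dig +short A api.eu.example.com @1.1.1.1
# Hit the standby/US edge directly, bypassing DNS (placeholder IP)
curl -sfS --resolve api.eu.example.com:443:<STANDBY_IP> https://api.eu.example.com/health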
Bring region back
cd "$FS_DEPLOY"
$COMPOSE pull # if using pinned images
$COMPOSE up -d --force-recreate caddy server dashboard postgres
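Once containers are up, a quick smoke check before declaring the region healthy (assumes `DOMAIN_API` / `DOMAIN_APP` are set in the region’s `.env`):
set -a && source .env && set +a
for url in "https://$DOMAIN_API/health" "https://$DOMAIN_APP/"; do
  echo -n "$url -> "
  curl -so /dev/null -w "%{http_code} in %{time_total}s\n" "$url"
done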
Post-incident: Document RTO/RPO for the region; verify backups and runbook times.
4. Database issues
Connection pool exhausted / “too many clients”
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
psql -U fs -d featuresignals -c "SELECT count(*) FROM pg_stat_activity;"
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
psql -U fs -d featuresignals -c "SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity GROUP BY 1,2,3 ORDER BY 4 DESC;"
Mitigation: restart server containers to drop leaked pools (causes brief disconnects); scale API replicas only if architecture supports it; raise max_connections only with sizing review; fix stuck transactions.
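To find the stuck transactions mentioned above, one approach (terminate only after review; `pg_terminate_backend` drops that client’s connection):
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
  psql -U fs -d featuresignals -c "SELECT pid, now() - xact_start AS xact_age, usename, left(query,80) FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_age DESC;"
# After review only, per offending PID:
# SELECT pg_terminate_backend(<PID>);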
Slow queries
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
psql -U fs -d featuresignals -c "SELECT pid, now() - query_start AS dur, state, left(query,120) FROM pg_stat_activity WHERE state <> 'idle' ORDER BY dur DESC LIMIT 20;"
Enable `pg_stat_statements` if it is not already loaded; correlate with SigNoz DB client spans. A query sketch follows.
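A sketch of the top offenders once `pg_stat_statements` is active (it must be in `shared_preload_libraries`, which requires a Postgres restart, before it collects data):
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
  psql -U fs -d featuresignals -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
  psql -U fs -d featuresignals -c "SELECT calls, round(mean_exec_time::numeric,1) AS mean_ms, left(query,100) FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"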
Replication lag (if using streaming replica)
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
psql -U fs -d featuresignals -c "SELECT pg_is_in_recovery(), pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
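The query above runs on the replica; on the primary, lag can be read in bytes per standby (`pg_stat_replication` is populated only on the primary):
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
  psql -U fs -d featuresignals -c "SELECT application_name, state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes FROM pg_stat_replication;"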
Mitigation: check network and disk on the replica; pause heavy writes; fail over only via the approved procedure.
Disk full
df -h
docker system df
docker compose -f deploy/docker-compose.region.yml exec -T postgres df -h /var/lib/postgresql/data
Mitigation: prune old images/logs after confirming space; expand volume per provider; VACUUM / archive WAL per playbook — avoid deleting pgdata files manually.
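Typical reclaim steps, in rough order of safety (review the output first; none of these touch `pgdata`):
docker image prune -af            # unused images only
docker builder prune -af          # build cache
journalctl --vacuum-size=500M     # trim the systemd journal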
Restore from backup
- Stop writers: `$COMPOSE stop server dashboard` (or the full stack).
- Restore a provider-specific snapshot, or `pg_dump`/`pg_restore` into the `pgdata` volume per your backup tool’s docs (example sketch below).
- Run migrations if needed: `$COMPOSE run --rm migrate` (verify the image/command matches prod).
- Start the stack: `$COMPOSE up -d`; verify `/health` and SigNoz.
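A logical-restore sketch under stated assumptions: a custom-format `pg_dump` exists at `/backups/latest.dump` on the host (hypothetical path) and writers are already stopped:
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
  pg_restore -U fs -d featuresignals --clean --if-exists < /backups/latest.dump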
5. High API latency
- SigNoz traces: Service → latency graph → slow traces → span breakdown (HTTP handler vs DB).
- Database: Run slow-query SQL from §4; check CPU/iowait on DB host.
- Eval cache: In logs, look for cache-miss storms or LISTEN/NOTIFY issues; restart `server` if cache corruption is suspected (last resort).
- Pool wait: Check application metrics/logs for acquisition timeouts; align pool size with Postgres `max_connections`.
- Caddy / TLS: `$COMPOSE logs caddy` for upstream timeouts or TLS handshake delays.
docker stats --no-stream
$COMPOSE logs --tail=300 server | grep -Ei "error|slow|timeout|pool"
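For a quick external latency breakdown of a single request (separates DNS/TLS time from server time):
curl -so /dev/null -w "dns:%{time_namelookup}s tls:%{time_appconnect}s ttfb:%{time_starttransfer}s total:%{time_total}s\n" \
  "https://$DOMAIN_API/health"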
6. Certificate expiry (Caddy)
Caddy auto-renews via ACME. If renewal fails:
$COMPOSE logs --tail=200 caddy | grep -Ei "acme|certificate|tls|error"
$COMPOSE exec caddy caddy list-modules
Manual renew / reload
$COMPOSE exec caddy caddy reload --config /etc/caddy/Caddyfile
# If file was edited on host, ensure volume mount is correct then:
$COMPOSE restart caddy
DNS / HTTP-01: Verify `_acme-challenge` or TLS-ALPN reachability; ensure ports 80/443 are open and no stale firewall rules block them. Rate limits: stagger restarts across regions to avoid hitting ACME rate limits.
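To read the certificate actually being served (expiry and issuer) from outside:
echo | openssl s_client -connect "$DOMAIN_API:443" -servername "$DOMAIN_API" 2>/dev/null \
  | openssl x509 -noout -enddate -issuer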
7. Deployment rollback (Docker Compose)
Git-based (typical when compose builds images via `build:`)
cd "$FS_DEPLOY"
git fetch origin
git log --oneline -10
git checkout <GOOD_COMMIT_SHA>
docker compose -f deploy/docker-compose.region.yml build --parallel
docker compose -f deploy/docker-compose.region.yml rm -fsv website-build docs-build migrate 2>/dev/null || true
docker volume rm -f featuresignals_website-dist featuresignals_docs-dist 2>/dev/null || true
docker compose -f deploy/docker-compose.region.yml up -d
docker compose -f deploy/docker-compose.region.yml ps
Tagged images (if you pull `image:` tags from a registry)
# Edit .env or override on CLI to previous tag, then:
docker compose -f deploy/docker-compose.region.yml pull server dashboard
docker compose -f deploy/docker-compose.region.yml up -d server dashboard
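A CLI-override sketch, assuming the compose file interpolates a tag variable such as `IMAGE_TAG` (hypothetical name; check your compose file for the actual variable):
IMAGE_TAG=<PREVIOUS_TAG> docker compose -f deploy/docker-compose.region.yml pull server dashboard
IMAGE_TAG=<PREVIOUS_TAG> docker compose -f deploy/docker-compose.region.yml up -d server dashboard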
Verify health, SigNoz error rate, and dashboard/API smoke tests before closing incident.
8. Security incident
Contain
- Rotate compromised credentials first; block attacker IPs at firewall/CDN if applicable.
- Preserve logs: copy relevant `docker compose logs` output and SigNoz exports to secure storage (sketch below).
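A minimal evidence-capture sketch (timestamped copy with restricted permissions; adjust the destination to your secure storage):
ts=$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "/root/incident-$ts" && chmod 700 "/root/incident-$ts"
$COMPOSE logs --no-color --timestamps > "/root/incident-$ts/compose-logs.txt"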
Key rotation
| Secret | Action |
|---|---|
| JWT_SECRET | Generate new secret; update .env; restart server; all sessions invalidated — notify customers if needed. |
| API keys | In app DB: revoke/rotate per org in admin tooling or SQL (hashed keys only in DB); customers re-issue keys. |
| POSTGRES_PASSWORD | Change in Postgres + .env; update DATABASE_URL references; restart postgres (planned window) and server. |
| OTEL / third-party | Rotate OTEL_INGESTION_KEY in .env; restart server. |
openssl rand -base64 48 # JWT_SECRET candidate
$COMPOSE up -d --force-recreate server
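An in-place `.env` update sketch, assuming `JWT_SECRET=` sits on its own line (back up first; the base64 alphabet never contains the `|` delimiter or `&`, so the sed replacement is safe):
cp .env ".env.bak.$(date -u +%s)" && chmod 600 .env.bak.*
new_secret=$(openssl rand -base64 48)
sed -i "s|^JWT_SECRET=.*|JWT_SECRET=$new_secret|" .env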
Aftermath: Postmortem, customer notice per legal/comms, audit log review, dependency patch deploy.
9. Scaling
Vertical (Hetzner / Utho)
- Snapshot/backup instance.
- Resize CPU/RAM/disk in provider console (disk expansion procedure per vendor).
- Reboot if required; confirm `docker` starts, `$COMPOSE ps` is healthy, and the Postgres data mount is intact.
- Tune Postgres `shared_buffers`/`work_mem` and the Go `DATABASE_URL` pool sizes to match the new RAM (sketch below).
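A tuning sketch after a resize, assuming a 16 GB host (values are illustrative, not a recommendation; `shared_buffers` takes effect only after a restart, and each `-c` runs as its own transaction since `ALTER SYSTEM` cannot run inside one):
docker compose -f deploy/docker-compose.region.yml exec -T postgres \
  psql -U fs -d featuresignals -c "ALTER SYSTEM SET shared_buffers = '4GB';" -c "ALTER SYSTEM SET work_mem = '16MB';"
docker compose -f deploy/docker-compose.region.yml restart postgres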
Horizontal
- Each region’s compose file is a single-node stack. Adding a second API node requires shared state (same DB, cache invalidation via the existing LISTEN/NOTIFY) and a load balancer in front of multiple `server` containers or hosts; sticky sessions are not required since the REST API is stateless (see the Caddy sketch below).
- Do not run two writable Postgres primaries against the same data directory; use managed HA or Patroni if you need a multi-node DB.
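A Caddyfile sketch for fronting two API nodes; the upstream names `server1`/`server2` are hypothetical and must match your container or host network:
# Sketch for deploy/Caddyfile.region: load-balance across two API upstreams
# reverse_proxy server1:8080 server2:8080 {
#     lb_policy least_conn
#     health_uri /health
#     health_interval 10s
# }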
10. Communication template (status page)
Use until replaced by your status vendor’s workflow.
Title: [Investigating | Identified | Monitoring | Resolved] – Short customer-visible symptom
Status: [Major outage | Degraded performance | Partial service]
What happened:
We are investigating reports of <symptom> affecting <API / dashboard / evaluations / region>.
Impact:
Customers may experience <specific impact>. Data integrity is <not known / not impacted / under review>.
What we are doing:
Our team is actively working on this incident. We will update this page every <15> minutes or when status changes.
Workaround (if any):
<None / use region X / retry after Y>
Updated: <ISO 8601 UTC>
Reference: Deploy script deploy/deploy-region.sh; compose deploy/docker-compose.region.yml; Caddy deploy/Caddyfile.region.