# Disaster Recovery Runbook
## Recovery Objectives
| Metric | Target | Notes |
|---|---|---|
| RTO (Recovery Time Objective) | < 30 min single region, < 2 hours full rebuild | Time from incident detection to service restoration |
| RPO (Recovery Point Objective) | < 24 hours (daily backup) | Maximum data loss window |
| Status check interval | 5 minutes | Background recorder in `main.go` |
| Backup frequency | Daily (3:00 UTC) + weekly (Sundays) | `pg-backup.sh` via cron |
| Backup retention | 7 daily + 4 weekly + 3 remote copies | Rotation in `pg-backup.sh`, remote in `pg-backup-replicate.sh` |
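As a quick spot check against the RPO, you can compare the age of the newest daily backup with the 24-hour window (paths as in the table; this one-liner is a suggested convenience, not an existing script):

```bash
# Age of the newest daily backup vs. the 24h RPO (GNU stat assumed)
latest=$(ls -t /mnt/data/backups/daily/*.sql.gz | head -n 1)
age_hours=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
echo "newest backup: $latest (${age_hours}h old)"
[ "$age_hours" -lt 24 ] && echo "within RPO" || echo "RPO BREACH"
```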
## Scenario 1: Single Region API Down
**Symptoms:** Health check fails for one region; other regions remain operational.

**Detection:** The status recorder logs at ERROR level; `/v1/status/global` shows the region as down.
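From any machine, you can query the global status endpoint directly to see which region is marked down (assuming `jq` is available locally; the exact payload shape depends on the API):

```bash
# Pretty-print the global status payload and eyeball the per-region entries
curl -sf https://<domain_api>/v1/status/global | jq .
```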
### Steps
1. SSH to the affected VPS:

   ```bash
   ssh deploy@<VPS_HOST>
   ```
2. Check service status:

   ```bash
   cd /opt/featuresignals
   docker compose --project-directory . -f deploy/docker-compose.region.yml ps
   docker compose --project-directory . -f deploy/docker-compose.region.yml logs --tail=50 server
   ```
3. If the server container is crashing, check for a bad deploy:

   ```bash
   # View recent deploys
   tail -5 /mnt/data/deploy-history.log
   # Roll back to the previous known-good commit
   ROLLBACK_COMMIT=<previous_sha> bash deploy/deploy-region.sh
   ```
4. If the database is the issue:

   ```bash
   docker compose --project-directory . -f deploy/docker-compose.region.yml logs --tail=50 postgres
   docker compose --project-directory . -f deploy/docker-compose.region.yml restart postgres
   # Wait for the server to reconnect (automatic via pgxpool)
   ```
5. If the VPS is unreachable:
   - Check the cloud provider dashboard (Hetzner/Utho)
   - Reboot the VPS from the cloud dashboard
   - If the VPS is destroyed, proceed to Scenario 3
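If the cloud dashboard itself is unavailable, the `hcloud` CLI can reboot a Hetzner box directly, assuming it is installed and authenticated on your workstation (the server name below is illustrative):

```bash
# Find and reboot the affected Hetzner server
hcloud server list
hcloud server reboot <server_name>
```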
## Scenario 2: Database Corruption
**Symptoms:** Postgres errors in server logs; queries failing; data inconsistency.
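Before tearing anything down, a quick log scan can help confirm actual corruption rather than a transient fault (the grep patterns are illustrative, not exhaustive):

```bash
cd /opt/featuresignals
docker compose --project-directory . -f deploy/docker-compose.region.yml \
  logs --tail=500 postgres | grep -iE 'invalid page|could not read block|checksum|corrupt'
```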
### Steps
1. Stop the API server to prevent further writes:

   ```bash
   cd /opt/featuresignals
   DC="docker compose --project-directory . -f deploy/docker-compose.region.yml"
   $DC stop server dashboard
   ```
2. Locate the latest backup:

   ```bash
   ls -lt /mnt/data/backups/daily/
   # If local backups are corrupted, check remote copies:
   ls -lt /mnt/data/backups/remote/
   ```
3. Restore from backup:

   ```bash
   # Stop postgres
   $DC stop postgres
   # Remove corrupted data
   sudo rm -rf /mnt/data/pgdata/*
   # Start fresh postgres
   $DC up -d postgres
   sleep 10
   # Restore the backup
   gunzip -c /mnt/data/backups/daily/<latest>.sql.gz | \
     docker exec -i $($DC ps -q postgres) psql -U fs -d featuresignals
   # Re-run migrations (in case the backup predates the latest migrations)
   $DC up migrate
   ```
4. Restart all services:

   ```bash
   $DC up -d
   ```
5. Verify data integrity:

   ```bash
   docker exec $($DC ps -q postgres) psql -U fs -d featuresignals -c "
     SELECT 'organizations' AS table_name, count(*) FROM organizations
     UNION ALL SELECT 'users', count(*) FROM users
     UNION ALL SELECT 'projects', count(*) FROM projects
     UNION ALL SELECT 'flags', count(*) FROM flags;"
   ```
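Once the counts look sane, a quick end-to-end check confirms the server reconnected to the restored database (the local port is an assumption; hit the region's public domain instead if TLS terminates at a proxy):

```bash
# Confirm the API is serving again after the restore
$DC logs --tail=20 server
curl -sf http://localhost:8080/health   # port 8080 is an assumption
```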
## Scenario 3: Full Region Rebuild
**Symptoms:** VPS destroyed or irrecoverable.
### Steps
1. Provision a new VPS:
   - Hetzner (US/EU): use Terraform in `deploy/terraform/hetzner/`
   - Utho (IN): use the provisioning script in `deploy/terraform/utho/`
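   For the Hetzner path, provisioning is the standard Terraform flow (any required variables live in the module under `deploy/terraform/hetzner/`):

   ```bash
   # Provision a replacement VPS; review the plan before applying
   cd deploy/terraform/hetzner
   terraform init
   terraform plan
   terraform apply
   ```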
2. Initial server setup:

   ```bash
   # Run the setup script (Docker, firewall, deploy user)
   bash deploy/terraform/hetzner/setup.sh   # or utho/setup-utho.sh
   ```
3. Clone the repository:

   ```bash
   ssh deploy@<new_vps>
   git clone https://github.com/dinesh-g1/featuresignals.git /opt/featuresignals
   ```
4. Restore the database from a remote backup:

   ```bash
   # Copy a backup from another region
   scp deploy@<other_region_vps>:/mnt/data/backups/remote/<latest>.sql.gz /mnt/data/backups/
   # Then follow the Scenario 2 restore steps
   ```
5. Deploy via GitHub Actions:
   - Run the CD Regional workflow with dispatch, targeting only the rebuilt region
   - Or manually:

     ```bash
     cd /opt/featuresignals && bash deploy/deploy-region.sh
     ```
6. Update DNS (if the IP changed):
   - Update A records for the region's domains
   - Update GitHub secrets with the new VPS host IP
7. Verify:

   ```bash
   curl -sf https://<domain_api>/health
   curl -sf https://<domain_api>/v1/status
   ```
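Before trusting the health checks, confirm the DNS update from step 6 has propagated (`dig` ships with dnsutils/bind-utils):

```bash
# The A record should resolve to the new VPS IP
dig +short <domain_api>
```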
## Scenario 4: Global Outage (All Regions)
### Steps
1. Identify the root cause; the most likely culprit is a bad deploy pushed to all regions:

   ```bash
   # Check whether the same commit is deployed everywhere
   for host in $VPS_HOST_IN $VPS_HOST_US $VPS_HOST_EU; do
     ssh deploy@$host "cd /opt/featuresignals && git log -1 --format='%h %s'"
   done
   ```
2. Roll back all regions via GitHub Actions:
   - Dispatch CD Regional with `rollback_commit` set to the last known-good SHA
3. If GitHub Actions is unavailable, roll back manually:

   ```bash
   for host in $VPS_HOST_IN $VPS_HOST_US $VPS_HOST_EU; do
     ssh deploy@$host "cd /opt/featuresignals && ROLLBACK_COMMIT=<sha> bash deploy/deploy-region.sh"
   done
   ```
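After the rollback lands, confirm every region is healthy again (the domain variables below are illustrative; substitute the real per-region API domains):

```bash
# Hit each region's health endpoint after the rollback
for api in $DOMAIN_API_IN $DOMAIN_API_US $DOMAIN_API_EU; do
  curl -sf "https://$api/health" >/dev/null && echo "$api OK" || echo "$api STILL DOWN"
done
```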
## Backup Verification
Weekly automated verification runs via `deploy/pg-backup-verify.sh`:

- Restores the latest backup into a temporary container
- Runs sanity queries on core tables
- Logs results to `/var/log/fs-backup-verify.log`
Manual verification:

```bash
bash /opt/featuresignals/deploy/pg-backup-verify.sh
```
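For orientation, here is a minimal sketch of what the verification flow looks like (the authoritative logic lives in `deploy/pg-backup-verify.sh`; the image tag, credentials, and sanity query here are assumptions):

```bash
# Restore the newest backup into a throwaway container and sanity-check it
latest=$(ls -t /mnt/data/backups/daily/*.sql.gz | head -n 1)
docker run -d --name fs-verify -e POSTGRES_PASSWORD=verify postgres:16
sleep 10   # crude wait; a pg_isready loop is more robust
gunzip -c "$latest" | docker exec -i fs-verify psql -U postgres
docker exec fs-verify psql -U postgres -c "SELECT count(*) FROM flags;"
docker rm -f fs-verify
```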
## Monitoring & Alerting
| Signal | Source | Alert Level |
|---|---|---|
| API health check failure | `node-health.sh` | ERROR |
| Disk usage > 85% | `node-health.sh` | ERROR |
| Memory usage > 90% | `node-health.sh` | ERROR |
| Container not running | `node-health.sh` | ERROR |
| PG connections > 80% of max | `node-health.sh` | ERROR |
| Remote region unreachable | Status recorder | WARN |
| Backup verification failure | `pg-backup-verify.sh` | ERROR |
All ERROR-level logs flow to SigNoz via OTEL and should trigger alerts.
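As an illustration of how these checks typically emit, here is a sketch of the disk-usage rule from the table (the real `node-health.sh` may be structured differently):

```bash
# Alert when /mnt/data crosses the 85% threshold (GNU df assumed)
usage=$(df --output=pcent /mnt/data | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 85 ]; then
  logger -t fs-health -p user.err "ERROR disk usage at ${usage}% on /mnt/data"
fi
```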
## Cron Schedule (per VPS)
```cron
# Daily backup (3:00 UTC)
0 3 * * * /opt/featuresignals/deploy/pg-backup.sh >> /var/log/fs-backup.log 2>&1

# Daily backup replication (3:30 UTC)
30 3 * * * /opt/featuresignals/deploy/pg-backup-replicate.sh >> /var/log/fs-backup-replicate.log 2>&1

# Weekly backup verification (Sunday 6:00 UTC)
0 6 * * 0 /opt/featuresignals/deploy/pg-backup-verify.sh >> /var/log/fs-backup-verify.log 2>&1

# Weekly DB maintenance (Sunday 5:00 UTC)
0 5 * * 0 /opt/featuresignals/deploy/pg-maintenance.sh >> /var/log/fs-pg-maintenance.log 2>&1

# Weekly data cleanup (Sunday 4:00 UTC)
0 4 * * 0 /opt/featuresignals/deploy/cleanup-cron.sh >> /var/log/fs-cleanup.log 2>&1

# Per-minute health monitoring
* * * * * /opt/featuresignals/deploy/monitoring/node-health.sh 2>&1 | logger -t fs-health
```
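To confirm the schedule is actually installed on a VPS (depending on how the setup script installed it, the entries may live in the deploy user's or root's crontab):

```bash
# List installed cron entries for the featuresignals scripts
ssh deploy@<VPS_HOST> "crontab -l 2>/dev/null; sudo crontab -l 2>/dev/null" \
  | grep -E 'pg-backup|pg-maintenance|cleanup-cron|node-health'
```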
## Escalation
| Severity | Response Time | Who |
|---|---|---|
| P1 — All regions down | 15 min | On-call engineer |
| P2 — Single region down | 30 min | On-call engineer |
| P3 — Degraded performance | 4 hours | Engineering team |
| P4 — Non-critical (monitoring gap) | Next business day | Engineering team |