# Disaster Recovery Runbook
## Recovery Objectives
| Metric | Target | Notes |
|---|---|---|
| RTO (Recovery Time Objective) | < 30 min single region, < 2 hours full rebuild | Time from incident detection to service restoration |
| RPO (Recovery Point Objective) | < 24 hours (daily backup) | Maximum data loss window |
| Status check interval | 5 minutes | Background recorder in `main.go` |
| Backup frequency | Daily (3:00 UTC) + weekly (Sundays) | `pg-backup.sh` via cron |
| Backup retention | 7 daily + 4 weekly + 3 remote copies | Rotation in `pg-backup.sh`, remote in `pg-backup-replicate.sh` |
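As a quick spot check against the RPO, you can compare the age of the newest daily backup with the 24-hour window (paths as in the table; this one-liner is a suggested convenience, not an existing script):

```bash
# Age of the newest daily backup vs. the 24h RPO (GNU stat assumed)
latest=$(ls -t /mnt/data/backups/daily/*.sql.gz | head -n 1)
age_hours=$(( ($(date +%s) - $(stat -c %Y "$latest")) / 3600 ))
echo "newest backup: $latest (${age_hours}h old)"
[ "$age_hours" -lt 24 ] && echo "within RPO" || echo "RPO BREACH"
```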
## Scenario 1: Single Region API Down
**Symptoms:** Health check fails for one region; other regions remain operational.

**Detection:** The status recorder logs at ERROR level; `/v1/status/global` shows the region as down.
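From any machine, you can query the global status endpoint directly to see which region is marked down (assuming `jq` is available locally; the exact payload shape depends on the API):

```bash
# Pretty-print the global status payload and eyeball the per-region entries
curl -sf https://<domain_api>/v1/status/global | jq .
```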
### Steps
1. SSH to the affected VPS:

   ```bash
   ssh deploy@<VPS_HOST>
   ```
2. Check service status:

   ```bash
   cd /opt/featuresignals
   docker compose --project-directory . -f deploy/docker-compose.region.yml ps
   docker compose --project-directory . -f deploy/docker-compose.region.yml logs --tail=50 server
   ```
3. If the server container is crashing, check for a bad deploy:

   ```bash
   # View recent deploys
   tail -5 /mnt/data/deploy-history.log
   # Roll back to the previous known-good commit
   ROLLBACK_COMMIT=<previous_sha> bash deploy/deploy-region.sh
   ```
4. If the database is the issue:

   ```bash
   docker compose --project-directory . -f deploy/docker-compose.region.yml logs --tail=50 postgres
   docker compose --project-directory . -f deploy/docker-compose.region.yml restart postgres
   # Wait for the server to reconnect (automatic via pgxpool)
   ```
5. If the VPS is unreachable:
   - Check the cloud provider dashboard (Hetzner/Utho)
   - Reboot the VPS from the cloud dashboard
   - If the VPS is destroyed, proceed to Scenario 3
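If the cloud dashboard itself is unavailable, the `hcloud` CLI can reboot a Hetzner box directly, assuming it is installed and authenticated on your workstation (the server name below is illustrative):

```bash
# Find and reboot the affected Hetzner server
hcloud server list
hcloud server reboot <server_name>
```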
## Scenario 2: Database Corruption
**Symptoms:** Postgres errors in server logs; queries failing; data inconsistency.
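Before tearing anything down, a quick log scan can help confirm actual corruption rather than a transient fault (the grep patterns are illustrative, not exhaustive):

```bash
cd /opt/featuresignals
docker compose --project-directory . -f deploy/docker-compose.region.yml \
  logs --tail=500 postgres | grep -iE 'invalid page|could not read block|checksum|corrupt'
```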
### Steps
1. Stop the API server to prevent further writes:

   ```bash
   cd /opt/featuresignals
   DC="docker compose --project-directory . -f deploy/docker-compose.region.yml"
   $DC stop server dashboard
   ```
2. Locate the latest backup:

   ```bash
   ls -lt /mnt/data/backups/daily/
   # If local backups are corrupted, check remote copies:
   ls -lt /mnt/data/backups/remote/
   ```
3. Restore from backup:

   ```bash
   # Stop postgres
   $DC stop postgres
   # Remove corrupted data
   sudo rm -rf /mnt/data/pgdata/*
   # Start fresh postgres
   $DC up -d postgres
   sleep 10
   # Restore the backup
   gunzip -c /mnt/data/backups/daily/<latest>.sql.gz | \
     docker exec -i $($DC ps -q postgres) psql -U fs -d featuresignals
   # Re-run migrations (in case the backup predates the latest migrations)
   $DC up migrate
   ```
4. Restart all services:

   ```bash
   $DC up -d
   ```
5. Verify data integrity:

   ```bash
   docker exec $($DC ps -q postgres) psql -U fs -d featuresignals -c "
     SELECT 'organizations' AS table_name, count(*) FROM organizations
     UNION ALL SELECT 'users', count(*) FROM users
     UNION ALL SELECT 'projects', count(*) FROM projects
     UNION ALL SELECT 'flags', count(*) FROM flags;"
   ```
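Once the counts look sane, a quick end-to-end check confirms the server reconnected to the restored database (the local port is an assumption; hit the region's public domain instead if TLS terminates at a proxy):

```bash
# Confirm the API is serving again after the restore
$DC logs --tail=20 server
curl -sf http://localhost:8080/health   # port 8080 is an assumption
```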
## Scenario 3: Full Region Rebuild
**Symptoms:** VPS destroyed or irrecoverable.
### Steps
1. Provision a new VPS:
   - Hetzner (US/EU): use Terraform in `deploy/terraform/hetzner/`
   - Utho (IN): use the provisioning script in `deploy/terraform/utho/`
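   For the Hetzner path, provisioning is the standard Terraform flow (any required variables live in the module under `deploy/terraform/hetzner/`):

   ```bash
   # Provision a replacement VPS; review the plan before applying
   cd deploy/terraform/hetzner
   terraform init
   terraform plan
   terraform apply
   ```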
2. Initial server setup:

   ```bash
   # Run the setup script (Docker, firewall, deploy user)
   bash deploy/terraform/hetzner/setup.sh   # or utho/setup-utho.sh
   ```
3. Clone the repository:

   ```bash
   ssh deploy@<new_vps>
   git clone https://github.com/dinesh-g1/featuresignals.git /opt/featuresignals
   ```
4. Restore the database from a remote backup:

   ```bash
   # Copy a backup from another region
   scp deploy@<other_region_vps>:/mnt/data/backups/remote/<latest>.sql.gz /mnt/data/backups/
   # Then follow the Scenario 2 restore steps
   ```
5. Deploy via GitHub Actions:
   - Run the CD Regional workflow with dispatch, targeting only the rebuilt region
   - Or manually:

     ```bash
     cd /opt/featuresignals && bash deploy/deploy-region.sh
     ```
6. Update DNS (if the IP changed):
   - Update A records for the region's domains
   - Update GitHub secrets with the new VPS host IP
7. Verify:

   ```bash
   curl -sf https://<domain_api>/health
   curl -sf https://<domain_api>/v1/status
   ```
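Before trusting the health checks, confirm the DNS update from step 6 has propagated (`dig` ships with dnsutils/bind-utils):

```bash
# The A record should resolve to the new VPS IP
dig +short <domain_api>
```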
## Scenario 4: Global Outage (All Regions)
### Steps
1. Identify the root cause; the most likely culprit is a bad deploy pushed to all regions:

   ```bash
   # Check whether the same commit is deployed everywhere
   for host in $VPS_HOST_IN $VPS_HOST_US $VPS_HOST_EU; do
     ssh deploy@$host "cd /opt/featuresignals && git log -1 --format='%h %s'"
   done
   ```
2. Roll back all regions via GitHub Actions:
   - Dispatch CD Regional with `rollback_commit` set to the last known-good SHA
3. If GitHub Actions is unavailable, roll back manually:

   ```bash
   for host in $VPS_HOST_IN $VPS_HOST_US $VPS_HOST_EU; do
     ssh deploy@$host "cd /opt/featuresignals && ROLLBACK_COMMIT=<sha> bash deploy/deploy-region.sh"
   done
   ```
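After the rollback lands, confirm every region is healthy again (the domain variables below are illustrative; substitute the real per-region API domains):

```bash
# Hit each region's health endpoint after the rollback
for api in $DOMAIN_API_IN $DOMAIN_API_US $DOMAIN_API_EU; do
  curl -sf "https://$api/health" >/dev/null && echo "$api OK" || echo "$api STILL DOWN"
done
```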
## Backup Verification
Weekly automated verification runs via `deploy/pg-backup-verify.sh`:

- Restores the latest backup into a temporary container
- Runs sanity queries on core tables
- Logs results to `/var/log/fs-backup-verify.log`
Manual verification:

```bash
bash /opt/featuresignals/deploy/pg-backup-verify.sh
```
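For orientation, here is a minimal sketch of what the verification flow looks like (the authoritative logic lives in `deploy/pg-backup-verify.sh`; the image tag, credentials, and sanity query here are assumptions):

```bash
# Restore the newest backup into a throwaway container and sanity-check it
latest=$(ls -t /mnt/data/backups/daily/*.sql.gz | head -n 1)
docker run -d --name fs-verify -e POSTGRES_PASSWORD=verify postgres:16
sleep 10   # crude wait; a pg_isready loop is more robust
gunzip -c "$latest" | docker exec -i fs-verify psql -U postgres
docker exec fs-verify psql -U postgres -c "SELECT count(*) FROM flags;"
docker rm -f fs-verify
```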
## Monitoring & Alerting
| Signal | Source | Alert Level |
|---|---|---|
| API health check failure | `node-health.sh` | ERROR |
| Disk usage > 85% | `node-health.sh` | ERROR |
| Memory usage > 90% | `node-health.sh` | ERROR |
| Container not running | `node-health.sh` | ERROR |
| PG connections > 80% of max | `node-health.sh` | ERROR |
| Remote region unreachable | Status recorder | WARN |
| Backup verification failure | `pg-backup-verify.sh` | ERROR |
All ERROR-level logs flow to SigNoz via OTEL and should trigger alerts.
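As an illustration of how these checks typically emit, here is a sketch of the disk-usage rule from the table (the real `node-health.sh` may be structured differently):

```bash
# Alert when /mnt/data crosses the 85% threshold (GNU df assumed)
usage=$(df --output=pcent /mnt/data | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt 85 ]; then
  logger -t fs-health -p user.err "ERROR disk usage at ${usage}% on /mnt/data"
fi
```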
## Cron Schedule (per VPS)
```cron
# Daily backup (3:00 UTC)
0 3 * * * /opt/featuresignals/deploy/pg-backup.sh >> /var/log/fs-backup.log 2>&1

# Daily backup replication (3:30 UTC)
30 3 * * * /opt/featuresignals/deploy/pg-backup-replicate.sh >> /var/log/fs-backup-replicate.log 2>&1

# Weekly backup verification (Sunday 6:00 UTC)
0 6 * * 0 /opt/featuresignals/deploy/pg-backup-verify.sh >> /var/log/fs-backup-verify.log 2>&1

# Weekly DB maintenance (Sunday 5:00 UTC)
0 5 * * 0 /opt/featuresignals/deploy/pg-maintenance.sh >> /var/log/fs-pg-maintenance.log 2>&1

# Weekly data cleanup (Sunday 4:00 UTC)
0 4 * * 0 /opt/featuresignals/deploy/cleanup-cron.sh >> /var/log/fs-cleanup.log 2>&1

# Per-minute health monitoring
* * * * * /opt/featuresignals/deploy/monitoring/node-health.sh 2>&1 | logger -t fs-health
```
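To confirm the schedule is actually installed on a VPS (depending on how the setup script installed it, the entries may live in the deploy user's or root's crontab):

```bash
# List installed cron entries for the featuresignals scripts
ssh deploy@<VPS_HOST> "crontab -l 2>/dev/null; sudo crontab -l 2>/dev/null" \
  | grep -E 'pg-backup|pg-maintenance|cleanup-cron|node-health'
```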
## Escalation
| Severity | Response Time | Who |
|---|---|---|
| P1 — All regions down | 15 min | On-call engineer |
| P2 — Single region down | 30 min | On-call engineer |
| P3 — Degraded performance | 4 hours | Engineering team |
| P4 — Non-critical (monitoring gap) | Next business day | Engineering team |