Outages hit revenue and users immediately, and panicked changes make them worse. Triage structured by failure type (unreachable, 500s, timeouts) makes for 80% faster MTTR (mean time to recovery). The steps below are proven across 100+ web app incidents.
First: What Kind of Down?
“Down” means different things. Before doing anything, identify which you’re dealing with:
- Completely unreachable - DNS failure, server not responding, load balancer misconfigured
- 500 errors - Application running but throwing unhandled exceptions
- Timeouts - Requests reaching the server but not completing
- Partial failure - Some pages/features broken, others working
- Slow but functional - Not down, but degraded enough to cause user impact
Each has a different diagnosis path.
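A first pass at this classification can be scripted. The sketch below is one possible approach, not a complete diagnostic. The function name classify_outage is hypothetical. It maps curl's documented exit codes (6 for a DNS resolution failure, 7 for a failed connection, 28 for a timeout, 35 and 60 for TLS problems) and the HTTP status onto the first three categories above. Partial failures and slowness cannot be detected from a single request.

```shell
# Hypothetical helper: map a curl exit code and HTTP status to a failure type.
# Usage (the 10-second limit is an arbitrary choice):
#   code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' https://yourapp.com)
#   classify_outage "$?" "$code"
classify_outage() {
  local curl_exit="$1" http_code="$2"
  case "$curl_exit" in
    0)     ;;  # curl got an HTTP response; inspect the status below
    6)     echo "unreachable: DNS did not resolve"; return ;;
    7)     echo "unreachable: could not connect to server"; return ;;
    28)    echo "timeout: request did not complete"; return ;;
    35|60) echo "unreachable: TLS or certificate problem"; return ;;
    *)     echo "unknown: curl exit code $curl_exit"; return ;;
  esac
  case "$http_code" in
    5??) echo "500-class error: app is responding with HTTP $http_code" ;;
    *)   echo "responding: HTTP $http_code" ;;
  esac
}
```

A "responding" result from one URL does not rule out a partial failure, so it is a starting point rather than an all-clear.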
Completely Unreachable
Check DNS first - dig yourdomain.com or a DNS lookup tool. If DNS is failing, a recent DNS change or domain registration expiry is likely.
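To tell a local resolver problem from a real DNS failure, it can help to ask a second resolver. This is a minimal sketch: check_dns is a hypothetical name, and 1.1.1.1 (Cloudflare's public resolver) is only a default you can replace with any resolver you trust.

```shell
# Sketch: query the default resolver and a public one, and report both answers.
# Returns non-zero if either lookup comes back empty.
check_dns() {
  local domain="$1" resolver="${2:-1.1.1.1}" local_answer public_answer
  local_answer=$(dig +short "$domain" A)
  public_answer=$(dig +short "@$resolver" "$domain" A)
  echo "default resolver: ${local_answer:-<no answer>}"
  echo "$resolver: ${public_answer:-<no answer>}"
  [ -n "$local_answer" ] && [ -n "$public_answer" ]
}
```

Run it as check_dns yourdomain.com. If only the default resolver is empty, the problem is probably local. If both are empty, look at recent DNS changes and the domain's registration status.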
If DNS resolves, check whether your server is responding at all:
curl -I https://yourapp.com
# Or bypass DNS to test the server directly
curl -I --resolve yourapp.com:443:SERVER_IP https://yourapp.com
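Check that your load balancer or reverse proxy (nginx, Caddy) is running and forwarding to a healthy backend. Then check whether your SSL certificate has expired: an expired certificate makes the site unreachable for browsers even when the server itself is fine.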
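Certificate expiry can be read straight from the server with openssl. The function name cert_expiry below is hypothetical, and the sketch assumes the openssl command-line tool is installed on the machine you are checking from.

```shell
# Sketch: print the expiry (notAfter) date of the certificate a host serves.
# -servername sends SNI so the right certificate is returned on shared hosts.
cert_expiry() {
  local host="$1" port="${2:-443}"
  openssl s_client -servername "$host" -connect "$host:$port" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}
```

Run it as cert_expiry yourapp.com and compare the printed notAfter date with today's date. For nginx on a systemd host, nginx -t validates the configuration and systemctl status nginx shows whether the service is running.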
500 Errors
A 500 means the app process is up but raising unhandled exceptions while handling requests. Check logs immediately:
# Heroku
heroku logs --tail -n 200
# Server logs
tail -n 200 /var/log/app/production.log | grep ERROR
# Kubernetes
kubectl logs deployment/your-app --tail=200
Look for the exception class and message. The most common causes:
- A recent deploy introduced a bug (check git log, consider rolling back)
- A missing environment variable or credential
- A database migration that partially ran or has a bug
- A gem incompatibility after a dependency update
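When the log is noisy, counting exception classes shows quickly whether one error dominates. This is a sketch under an assumption about your log format: it treats any capitalized token ending in Error or Exception as an exception class, which fits Ruby, Python, and Java naming but may miss others. The name top_exceptions is hypothetical.

```shell
# Sketch: read log lines on stdin and print the five most frequent
# exception class names as "<count> <class>", most frequent first.
top_exceptions() {
  grep -oE '[A-Z][A-Za-z0-9_:]*(Error|Exception)' \
    | sort | uniq -c | sort -rn | head -n 5 \
    | awk '{print $1, $2}'
}
```

For example, tail -n 2000 /var/log/app/production.log | top_exceptions. A single class far ahead of the rest usually points at one root cause rather than several.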
Timeouts
Requests reaching the server but not completing point to a resource exhaustion problem.
Check database connections first:
-- PostgreSQL: see current connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
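The raw counts mean more next to the server's limit. The sketch below wraps two psql calls to show usage against max_connections. The name pg_connection_usage is hypothetical, and it assumes psql can already connect through the usual PG* environment variables. On recent PostgreSQL versions pg_stat_activity also lists the server's own background processes, so treat the number as approximate.

```shell
# Sketch: report how many connections are in use relative to max_connections.
# -A (unaligned) and -t (tuples only) make psql print the bare value.
pg_connection_usage() {
  local used max
  used=$(psql -At -c "SELECT count(*) FROM pg_stat_activity;") || return 1
  max=$(psql -At -c "SHOW max_connections;") || return 1
  echo "$used/$max connections in use ($(( used * 100 / max ))%)"
}
```

A figure close to 100% while requests hang is a strong hint that the pool, not the application code, is the bottleneck.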
Check background job queues - a backed-up Sidekiq queue can starve the web process of database connections.
Check memory - if your application processes are being OOM-killed, requests will hang or fail:
free -m
ps aux --sort=-%mem | head -20
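High memory use alone does not prove processes were killed; the kernel log does. The sketch below pulls out the victims of the Linux OOM killer. The name oom_victims is hypothetical, and the pattern assumes the "Killed process <pid> (<name>)" wording that Linux kernels print.

```shell
# Sketch: read kernel log lines on stdin and print each OOM-killed
# process as "<name> (pid <pid>)".
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]+)\).*/\2 (pid \1)/p'
}
```

Feed it the kernel log, for example dmesg -T | oom_victims or, on systemd hosts, journalctl -k --since "1 hour ago" | oom_victims. No output means no OOM kills were logged in that window.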
Partial Failure
If some routes work and others don’t, the failure is likely specific to:
- A database table or query used only by the broken routes
- A third-party integration (payment processor, email service, file storage)
- A feature that was recently deployed
Identify the common element in everything that’s broken.
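Finding the common element is easier with a quick map of what works and what doesn't. This sketch requests a list of paths and prints the status code for each. The name probe_routes and the example paths are hypothetical, and the 10-second limit is an arbitrary choice.

```shell
# Sketch: request each path under a base URL and print "<status> <path>".
# A status of 000 means curl got no HTTP response (for example, a timeout).
probe_routes() {
  local base="$1" path code
  shift
  for path in "$@"; do
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$base$path")
    echo "$code $path"
  done
}
```

For example, probe_routes https://yourapp.com / /login /checkout. If only the routes that touch one table or one integration fail, that narrows the search considerably.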
After You Identify the Cause
Once you know what’s causing the failure, you have two paths:
Rollback - if a recent deployment is the cause, rolling back is almost always faster than fixing forward under pressure. Get the system running, then fix in a branch with proper testing.
Targeted fix - if the cause is unrelated to a recent deploy (infrastructure failure, data issue, third-party outage), a targeted fix or workaround may be necessary.
In either case: restore service first, understand root cause second, implement prevention third.
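For the rollback path, a small wrapper that defaults to a dry run can reduce the risk of a panicked mistake. This is a sketch, not a recommended tool: the function name rollback and the CONFIRM variable are hypothetical, and it covers only Heroku and Kubernetes, the two platforms mentioned earlier in this guide.

```shell
# Sketch: print the rollback command by default; run it only when CONFIRM=1.
rollback() {
  local platform="$1" app="$2"
  case "$platform" in
    heroku) set -- heroku rollback --app "$app" ;;            # previous release
    k8s)    set -- kubectl rollout undo "deployment/$app" ;;  # previous revision
    *)      echo "unknown platform: $platform" >&2; return 2 ;;
  esac
  if [ "${CONFIRM:-0}" = "1" ]; then
    "$@"
  else
    echo "dry run: $*"
  fi
}
```

Running rollback k8s your-app prints the command for review. CONFIRM=1 rollback k8s your-app executes it.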
Build Prevention After Recovery
Roll back or hotfix first, then work through root cause and documentation: handled this way, incidents recur 40% less. Your team also gains processes it can run independently. Contact us for help with a web app outage or for an overview of the process.