Outages hit revenue and users immediately, and panicked changes tend to make them worse. Structuring triage by failure type - unreachable, 500 errors, timeouts - speeds up recovery dramatically; in our experience it can cut MTTR by as much as 80%. The steps below are drawn from more than 100 web app incidents.
First: What Kind of Down?
“Down” means different things. Before doing anything, identify which you’re dealing with:
- Completely unreachable - DNS failure, server not responding, load balancer misconfigured
- 500 errors - Application running but throwing unhandled exceptions
- Timeouts - Requests reaching the server but not completing
- Partial failure - Some pages/features broken, others working
- Slow but functional - Not down, but degraded enough to cause user impact
Each has a different diagnosis path.
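The first three categories can often be distinguished from a single request, before you open any dashboard. A rough sketch using curl's exit codes (yourapp.com is a placeholder; exit code meanings are from the curl man page):

```shell
URL="https://yourapp.com"   # placeholder - substitute your real domain

rc=0
out=$(curl -s -o /dev/null -w '%{http_code} %{time_total}s' --max-time 10 "$URL") || rc=$?

case $rc in
  6)  echo "DNS failure - completely unreachable" ;;
  7)  echo "connection refused - server or load balancer down" ;;
  28) echo "timeout - server reached but requests not completing" ;;
  0)  echo "responded: $out (a 5xx code means application errors)" ;;
  *)  echo "curl exited with code $rc - see the EXIT CODES section of 'man curl'" ;;
esac
```

Exit code 6 (couldn't resolve host), 7 (failed to connect), and 28 (timed out) map directly onto the first three failure types above.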
Completely Unreachable
Check DNS first: run dig yourdomain.com or use a DNS lookup tool. If DNS is failing, the likely cause is a recent DNS change or an expired domain registration.
If DNS resolves, check whether your server is responding at all:
curl -I https://yourapp.com
# Or bypass DNS to test the server directly
curl -I --resolve yourapp.com:443:SERVER_IP https://yourapp.com

Check your load balancer or reverse proxy (nginx, Caddy). Check whether your SSL certificate has expired.
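For the certificate question specifically, openssl's -checkend flag gives a yes/no answer. A minimal sketch - it uses a throwaway self-signed certificate so it runs anywhere; against a live site you would pipe in the served certificate instead (shown in the trailing comment):

```shell
# Throwaway self-signed cert, valid 30 days, standing in for your real cert
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 30 \
  -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# -checkend N exits 0 if the cert is still valid N seconds from now (604800 = 7 days)
if openssl x509 -noout -checkend 604800 -in /tmp/demo.crt; then
  echo "cert valid for at least 7 more days"
else
  echo "cert expires within 7 days - renew now"
fi

# Against the live site, feed in the certificate the server actually serves:
#   echo | openssl s_client -servername yourapp.com -connect yourapp.com:443 2>/dev/null \
#     | openssl x509 -noout -checkend 604800
```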
500 Errors
Application errors mean the app is running but crashing. Check logs immediately:
# Heroku
heroku logs --tail -n 200
# Server logs
tail -200 /var/log/app/production.log | grep ERROR
# Kubernetes
kubectl logs deployment/your-app --tail=200

Look for the exception class and message. The most common causes:
- A recent deploy introduced a bug (check git log, consider rolling back)
- A missing environment variable or credential
- A database migration that partially ran or has a bug
- A gem incompatibility after a dependency update
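When the log is noisy, tallying exception classes is faster than reading line by line. A sketch (the log path and line format are assumptions standing in for your real logger; adjust the awk field to match yours):

```shell
# Sample log standing in for your real production log
cat > /tmp/production.log <<'EOF'
I, [2024-05-01T10:00:00] INFO -- : GET /health 200
E, [2024-05-01T10:00:01] ERROR -- : ActiveRecord::ConnectionTimeoutError could not obtain a connection
E, [2024-05-01T10:00:02] ERROR -- : ActiveRecord::ConnectionTimeoutError could not obtain a connection
E, [2024-05-01T10:00:03] ERROR -- : NoMethodError undefined method for nil
EOF

# Tally exception classes, most frequent first
grep ERROR /tmp/production.log | awk '{print $6}' | sort | uniq -c | sort -rn
```

The dominant exception is almost always where to start: here it would print ActiveRecord::ConnectionTimeoutError (count 2) above NoMethodError (count 1).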
Timeouts
Requests that reach the server but never complete usually point to resource exhaustion: database connections, worker processes, or memory.
Check database connections first:
-- PostgreSQL: see current connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;

Check background job queues - a backed-up Sidekiq queue can starve the web process of database connections.
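If the counts show the pool saturated, the follow-up question is which queries are holding the connections. A sketch against the same pg_stat_activity view (column names as in PostgreSQL 9.6+):

```sql
-- PostgreSQL: longest-running non-idle queries
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST
LIMIT 10;
```

A query running for minutes, or many connections stuck in "idle in transaction", is usually the culprit.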
Check memory - if your application processes are being OOM-killed, requests will hang or fail:
free -m
ps aux --sort=-%mem | head -20

Partial Failure
If some routes work and others don’t, the failure is likely specific to:
- A database table or query used only by the broken routes
- A third-party integration (payment processor, email service, file storage)
- A feature that was recently deployed
Identify the common element in everything that’s broken.
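To find that common element quickly, it can help to probe each key route and record its status code. A sketch (the base URL and route list are placeholders for your own endpoints):

```shell
BASE="https://yourapp.com"   # placeholder base URL

for route in / /login /checkout /api/health; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$BASE$route" || true)
  echo "$code $route"
done
```

The 200s and 500s cluster quickly; whatever the failing routes share (a table, an integration, a recent feature) is your suspect.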
After You Identify the Cause
Once you know what’s causing the failure, you have two paths:
Rollback - if a recent deployment is the cause, rolling back is almost always faster than fixing forward under pressure. Get the system running, then fix in a branch with proper testing.
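Rolling back with git revert keeps history intact, which matters when later commits have already landed. A minimal sketch in a throwaway repository (the deploy step itself is a placeholder for your pipeline):

```shell
# Throwaway repo to demonstrate the revert
rm -rf /tmp/rollback-demo && mkdir /tmp/rollback-demo && cd /tmp/rollback-demo
git init -q . && git config user.email ops@example.com && git config user.name ops
echo "v1" > app.txt && git add app.txt && git commit -q -m "good release"
echo "v2 (bug)" > app.txt && git commit -qam "bad release"

# Revert the newest commit without rewriting history
git revert --no-edit HEAD

cat app.txt         # back to: v1
git log --oneline   # newest entry is: Revert "bad release"

# Then redeploy through your usual pipeline
```

On a shared branch, git revert is preferable to git reset --hard because it adds a new commit rather than rewriting published history.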
Targeted fix - if the cause is unrelated to a recent deploy (infrastructure failure, data issue, third-party outage), a targeted fix or workaround may be necessary.
In either case: restore service first, understand root cause second, implement prevention third.
Build Prevention After Recovery
Restore service with a rollback or hotfix first, then dig into the root cause and document it. Teams that follow each incident with documented prevention see incidents recur markedly less often - around 40% less in our experience - and build processes that make them more independent over time. If you need help with a web app outage, or want an overview of our process, get in touch.

