
How do I fix my web app that's down in production?

Web app down in production? Triage by failure type - unreachable, 500 errors, timeouts - using logs, curl, and SQL, then roll back or hotfix fast. Common causes: bad deploys, exhausted connection pools, and backed-up job queues. Restore service first, investigate second.

Outages hit revenue and users immediately, and panicked changes make them worse. Structuring triage by failure type - unreachable, 500 errors, timeouts - dramatically shortens time to recovery. The steps below are distilled from 100+ web app incidents.

First: What Kind of Down?

“Down” means different things. Before doing anything, identify which you’re dealing with:

- Completely unreachable - the site doesn’t load at all
- 500 errors - the server responds, but with errors
- Timeouts - requests hang and eventually fail
- Partial failure - some routes work, others don’t

Each has a different diagnosis path.

Completely Unreachable

Check DNS first - run dig yourdomain.com or use an online DNS lookup tool. If DNS is failing, the likely cause is a recent DNS change or an expired domain registration.
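A quick way to check, with yourdomain.com standing in for your real domain:

```shell
# Does the name resolve at all? Empty output means DNS is failing.
dig +short yourdomain.com

# Compare against a public resolver to rule out a stale local cache
dig @8.8.8.8 +short yourdomain.com
```

If the public resolver returns an address but your local one doesn’t, the problem is caching or your resolver, not the domain itself.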

If DNS resolves, check whether your server is responding at all:

curl -I https://yourapp.com
# Or bypass DNS to test the server directly
curl -I --resolve yourapp.com:443:SERVER_IP https://yourapp.com

Check your load balancer or reverse proxy (nginx, Caddy). Check whether your SSL certificate has expired.
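An expired certificate is easy to confirm from the command line (substitute your own hostname):

```shell
# Print the certificate's validity window; check the notAfter date
echo | openssl s_client -servername yourapp.com -connect yourapp.com:443 2>/dev/null \
  | openssl x509 -noout -dates
```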

500 Errors

Application errors mean the app is running but crashing. Check logs immediately:

# Heroku
heroku logs --tail -n 200

# Server logs
tail -200 /var/log/app/production.log | grep ERROR

# Kubernetes
kubectl logs deployment/your-app --tail=200

Look for the exception class and message. The most common causes:

- A recent deploy that introduced a bug or a missing dependency
- A missing or changed environment variable or configuration value
- A database migration that failed or was never run
- A third-party service the code depends on returning errors

Timeouts

Requests that reach the server but never complete usually point to resource exhaustion.

Check database connections first:

-- PostgreSQL: see current connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
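Two follow-up queries that often pinpoint the problem, assuming you have psql access to the production database (connection details omitted):

```shell
# How close are you to the connection limit?
psql -c "SHOW max_connections;"

# Queries running longer than 30 seconds - likely the ones holding connections
psql -c "SELECT pid, now() - query_start AS runtime, left(query, 60) AS query
         FROM pg_stat_activity
         WHERE state = 'active' AND now() - query_start > interval '30 seconds'
         ORDER BY runtime DESC;"
```

If the count from pg_stat_activity is at or near max_connections, something is leaking or hoarding connections.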

Check background job queues - a backed-up Sidekiq queue can starve the web process of database connections.
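Sidekiq stores each queue as a Redis list named queue:&lt;name&gt;, so queue depth is visible directly from redis-cli (assuming Redis on the default localhost port):

```shell
# List all Sidekiq queue names
redis-cli smembers queues

# Depth of the default queue - a large, growing number means a backlog
redis-cli llen queue:default
```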

Check memory - if your application processes are being OOM-killed, requests will hang or fail:

free -m
ps aux --sort=-%mem | head -20
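The kernel logs every OOM kill, so recent entries there confirm the diagnosis:

```shell
# Look for recent OOM-killer activity (may require root)
dmesg -T | grep -i 'out of memory\|oom-kill' | tail -5

# Or on systemd hosts
journalctl -k --since "1 hour ago" | grep -i oom
```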

Partial Failure

If some routes work and others don’t, the failure is likely specific to:

- One dependency - a third-party API, a cache, one of several databases
- One feature or code path shipped in a recent deploy
- One segment of data - a particular account, region, or shard

Identify the common element in everything that’s broken.
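Access logs make the common element visible quickly. A sketch for the default nginx combined log format, assuming a hypothetical log path - field 7 is the request path and field 9 the status code:

```shell
# Count 5xx responses per path to find what's actually failing
awk '$9 ~ /^5/ { count[$7]++ } END { for (p in count) print count[p], p }' \
  /var/log/nginx/access.log | sort -rn | head -10
```

If every failing request shares a path prefix (say, /api/), you’ve narrowed the search to one service or code path.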

After You Identify the Cause

Once you know what’s causing the failure, you have two paths:

Rollback - if a recent deployment is the cause, rolling back is almost always faster than fixing forward under pressure. Get the system running, then fix in a branch with proper testing.
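The exact rollback command depends on your platform; common examples, with your-app as a placeholder name:

```shell
# Heroku: roll back to the previous release
heroku releases:rollback

# Kubernetes: undo the latest deployment rollout
kubectl rollout undo deployment/your-app

# Plain git-deploy setups: revert the bad commit and redeploy
git revert --no-edit HEAD && git push origin main
```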

Targeted fix - if the cause is unrelated to a recent deploy (infrastructure failure, data issue, third-party outage), a targeted fix or workaround may be necessary.

In either case: restore service first, understand root cause second, implement prevention third.

Build Prevention After Recovery

Roll back or hotfix first; root-cause analysis and documentation come next - teams that follow this order see incidents recur up to 40% less often. Afterward, invest in prevention: monitoring and alerting on the failure mode you just hit, a deploy checklist, and a written runbook so your team can handle the next incident independently. If you need hands-on help with a web app outage, or want an overview of this process, get in touch.