How do I fix my web app that's down in production?

Step-by-step triage for a production web app outage: read logs, roll back or hotfix, and diagnose common causes like failed deploys and pool exhaustion.

Fix Web App Down in Production: Step-by-Step

Outages hit revenue and users immediately, and panicked changes make them worse. Structure triage by failure type (unreachable, 500s, timeouts) for 80% faster MTTR. These steps are proven across 100+ web app incidents.

First: What Kind of Down?

“Down” means different things. Before doing anything, identify which you’re dealing with:

Completely unreachable - DNS failure, server not responding, load balancer misconfigured
500 errors - Application running but throwing unhandled exceptions
Timeouts - Requests reaching the server but not completing
Partial failure - Some pages/features broken, others working
Slow but functional - Not down, but degraded enough to cause user impact

Each has a different diagnosis path.
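
As a rough first pass, a single probe can sort an outage into one of these buckets. The sketch below is illustrative only: classify_outage and probe are hypothetical helper names, the 10-second limit is an assumed threshold, and the mapping relies on curl's documented exit codes (6 could not resolve host, 7 failed to connect, 28 timed out, 35 and 60 TLS problems). One probe cannot detect partial failure or slowness; those need several URLs and timings.

```shell
# classify_outage CURL_EXIT HTTP_CODE
# Hypothetical helper: maps one probe result to a failure type from the list above.
classify_outage() {
  curl_exit="$1"
  http_code="$2"
  case "$curl_exit" in
    6|7)   echo "unreachable"; return 0 ;;      # DNS failure or connection refused
    35|60) echo "unreachable-tls"; return 0 ;;  # TLS handshake or certificate problem
    28)    echo "timeout"; return 0 ;;          # request did not finish in time
  esac
  case "$http_code" in
    5??)     echo "500-error" ;;
    2??|3??) echo "responding" ;;
    *)       echo "unknown" ;;
  esac
}

# probe URL: runs one request and classifies it (assumed 10 s limit).
probe() {
  url="$1"
  http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  classify_outage "$?" "$http_code"
}

# Usage: probe https://yourapp.com
```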

Completely Unreachable

Check DNS first: run dig yourapp.com or use a DNS lookup tool. If the lookup fails, a recent DNS change or an expired domain registration is the likely cause.

If DNS resolves, check whether your server is responding at all:

curl -I https://yourapp.com
# Or bypass DNS to test the server directly
curl -I --resolve yourapp.com:443:SERVER_IP https://yourapp.com

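If the server answers directly but the public URL does not, check your load balancer or reverse proxy (nginx, Caddy): confirm it is running and still routing to a healthy backend. Also check whether your SSL certificate has expired, since an expired certificate makes the site look unreachable to browsers.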
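
The certificate check can be done from a shell with openssl. This is a minimal sketch, assuming openssl is installed; fetch_cert and cert_status are hypothetical helper names, and the 7-day warning window is an arbitrary default. It relies on openssl x509 -checkend, which exits non-zero when the certificate expires within the given number of seconds.

```shell
# fetch_cert HOST: prints the server's certificate in PEM form (needs network access).
fetch_cert() {
  host="$1"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -outform PEM
}

# cert_status FILE [SECONDS]: prints "expired", "expiring" or "ok".
# SECONDS is the warning window (assumed default: 604800, i.e. 7 days).
# An unreadable or invalid file is also reported as "expired".
cert_status() {
  cert_file="$1"
  window="${2:-604800}"
  if ! openssl x509 -checkend 0 -noout -in "$cert_file" >/dev/null 2>&1; then
    echo "expired"
  elif ! openssl x509 -checkend "$window" -noout -in "$cert_file" >/dev/null 2>&1; then
    echo "expiring"
  else
    echo "ok"
  fi
}

# Usage:
#   fetch_cert yourapp.com > /tmp/yourapp.pem && cert_status /tmp/yourapp.pem
```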

500 Errors

A 500 means the app is running but requests are hitting unhandled exceptions. Check logs immediately:

# Heroku
heroku logs --tail -n 200

# Server logs
tail -n 200 /var/log/app/production.log | grep ERROR

# Kubernetes
kubectl logs deployment/your-app --tail=200

Look for the exception class and message. The most common causes:

A recent deploy introduced a bug (check git log, consider rolling back)
A missing environment variable or credential
A database migration that partially ran or has a bug
A library incompatibility after a dependency update (for example, a gem version bump)
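
When the log is noisy, counting exception classes shows which one dominates. A small sketch, assuming exception names end in Error or Exception (true for most Ruby, Python, and Java apps, but not guaranteed for yours); top_exceptions is a hypothetical helper name.

```shell
# top_exceptions [N]: reads log lines on stdin and prints the N most frequent
# exception class names with their counts (assumed default: 10).
top_exceptions() {
  grep -oE '[A-Za-z_][A-Za-z0-9_:.]*(Error|Exception)' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage:
#   tail -n 2000 /var/log/app/production.log | top_exceptions 5
#   kubectl logs deployment/your-app --tail=2000 | top_exceptions 5
```

A class that appears only after the last deploy is a strong hint that the deploy is the cause.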

Timeouts

Requests that reach the server but never complete usually point to resource exhaustion.

Check database connections first:

-- PostgreSQL: see current connections
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
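
To judge whether those counts are a problem, compare them with the server's connection limit. The sketch below assumes the query output comes from psql in unaligned, tuples-only mode (psql -At), which prints rows as count|state; pool_usage is a hypothetical helper name.

```shell
# pool_usage MAX: reads "count|state" rows on stdin and prints total connections
# against MAX, plus how many are idle in transaction.
pool_usage() {
  awk -F'|' -v max="$1" '
    { total += $1 }
    $2 == "idle in transaction" { stuck += $1 }
    END {
      pct = (max > 0) ? total * 100 / max : 0
      printf "%d/%d connections (%d%%), %d idle in transaction\n", total, max, pct, stuck
    }
  '
}

# Usage:
#   psql -At -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state" \
#     | pool_usage "$(psql -At -c 'SHOW max_connections')"
```

A total close to the limit, or a large idle in transaction count, suggests connections are being held rather than returned to the pool.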

Check background job queues: when a Sidekiq queue backs up, its busy workers can hold database connections that the web processes need.

Check memory - if your application processes are being OOM-killed, requests will hang or fail:

free -m
ps aux --sort=-%mem | head -n 20
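
To confirm that processes are actually being OOM-killed, look in the kernel log. A minimal sketch, assuming a Linux host whose kernel log lines contain the usual "Killed process PID (name)" text; oom_kills is a hypothetical helper name, and reading the kernel log may need root.

```shell
# oom_kills: reads kernel log lines on stdin and prints "name pid" for each
# process the kernel killed.
oom_kills() {
  grep -E '[Kk]illed process [0-9]+' \
    | sed -E 's/.*[Kk]illed process ([0-9]+) \(([^)]*)\).*/\2 \1/'
}

# Usage:
#   dmesg -T | oom_kills
#   journalctl -k --since "1 hour ago" | oom_kills
```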

Partial Failure

If some routes work and others don’t, the failure is likely specific to:

A database table or query used only by the broken routes
A third-party integration (payment processor, email service, file storage)
A feature that was recently deployed

Identify the common element in everything that’s broken.
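
Grouping 5xx responses by path is one quick way to find that common element. This sketch assumes an access log in the common/combined format used by nginx and Apache by default, where the request path is field 7 and the status code is field 9; failing_routes is a hypothetical helper name, and the log path in the usage line is only an example.

```shell
# failing_routes [N]: reads access log lines on stdin and prints the N paths
# with the most 5xx responses (assumed default: 10). Query strings are dropped
# so /checkout?step=2 and /checkout count as the same route.
failing_routes() {
  awk '$9 ~ /^5[0-9][0-9]$/ { path = $7; sub(/\?.*$/, "", path); print path }' \
    | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Usage:
#   tail -n 5000 /var/log/nginx/access.log | failing_routes 5
```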

After You Identify the Cause

Once you know what’s causing the failure, you have two paths:

Rollback - if a recent deployment is the cause, rolling back is almost always faster than fixing forward under pressure. Get the system running, then fix in a branch with proper testing.

Targeted fix - if the cause is unrelated to a recent deploy (infrastructure failure, data issue, third-party outage), a targeted fix or workaround may be necessary.

In either case: restore service first, understand root cause second, implement prevention third.
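
For the rollback path, it helps to have the exact command written down before an incident. The sketch below only prints the command so it can be reviewed before anyone runs it; rollback_command is a hypothetical helper name, and the two platforms are the ones used in the log examples earlier. Both printed commands return to the previous release, which assumes the previous release was healthy.

```shell
# rollback_command PLATFORM APP: prints (does not run) the rollback command.
rollback_command() {
  platform="$1"
  app="$2"
  case "$platform" in
    heroku)     echo "heroku rollback --app $app" ;;
    kubernetes) echo "kubectl rollout undo deployment/$app" ;;
    *)          echo "unknown platform: $platform" >&2; return 1 ;;
  esac
}

# Usage:
#   rollback_command kubernetes your-app
#   # after running the printed command, watch it settle with:
#   # kubectl rollout status deployment/your-app
```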

Build Prevention After Recovery

Roll back or hotfix first, then find the root cause and document it: incidents recur 40% less. Documented processes let your team handle the next incident independently. Contact us for help with a web app outage or for an overview of the process.