
What should I do when my app is down?

When your web app goes down in production, the first 30 minutes determine how quickly you recover. Panic and uncoordinated action make incidents worse. Here’s the sequence to follow.

1. Establish an Incident Commander

One person owns the incident. Everyone else takes direction from them. Without this, multiple engineers make conflicting changes simultaneously and you lose track of what was tried. The incident commander doesn’t have to be the most senior engineer - they just have to be the one person coordinating.

2. Assess Scope Before Touching Anything

Before making changes, understand what’s actually broken:

# Check recent deployments
git log --oneline -10

# Look for error spikes in logs (-E makes the alternation portable across grep versions)
grep -E "ERROR|FATAL" /var/log/app.log | tail -50

# Check database connectivity
rails db:version
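
To see when the errors started rather than just the latest ones, it can help to count them per minute. The following is a minimal sketch, not a definitive tool: errors_per_minute is a hypothetical helper, and it assumes every log line begins with a timestamp such as 2024-05-01T12:34:56, so that the first 16 characters identify the minute. Adjust the cut range to match your own log format.

```shell
# Count ERROR/FATAL lines per minute so the start of a spike stands out.
# Assumption: each line starts with a timestamp like 2024-05-01T12:34:56.
errors_per_minute() {
  local logfile="$1"
  grep -E "ERROR|FATAL" "$logfile" \
    | cut -c1-16 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'   # print "minute count" without uniq's padding
}

# Example (path taken from the commands above):
#   errors_per_minute /var/log/app.log | tail -20
```

If the first busy minute lines up with a deployment shown by git log, that deployment is the first thing to suspect.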

3. Check the Obvious Things First

Most production outages come down to one of a short list of root causes. Rule out the common ones, in order, before chasing anything exotic.

4. Contain First, Fix Second

If you can identify a recent change that caused the outage, roll it back immediately. Don’t try to fix it in place under pressure. Get the system running, then fix the root cause in a branch with proper testing.

If rollback isn’t possible, look for ways to restore partial service - disable the broken feature, route around the failing service, or serve a maintenance page while you work.
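
One low-risk way to roll back a change is to revert the offending commit instead of rewriting history. The sketch below wraps git revert in a hypothetical helper named rollback_commit. It assumes your deploys are built from the current branch, so you still need to run your usual deploy step afterwards, and it does not handle merge commits (those need git revert -m).

```shell
# Undo a suspect commit by adding a new commit that reverses it.
# History stays intact, so the bad change is easy to find again later.
rollback_commit() {
  local bad_sha="${1:-HEAD}"   # defaults to the most recent commit
  # --no-edit keeps git's default "Revert ..." message, so nobody has to
  # write a commit message in the middle of an incident.
  git revert --no-edit "$bad_sha"
}

# Example:
#   rollback_commit          # revert the latest commit
#   rollback_commit abc1234  # revert a specific commit (hypothetical SHA)
```

Because the revert is itself a commit, the original change remains in history and can be re-applied in a branch once the root cause is understood.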

5. Communicate While You Work

Someone should be updating your status page and notifying affected customers. Silence makes incidents worse - customers assume the worst when they hear nothing. A simple “we’re aware of the issue and working on it” buys goodwill.

6. Document Everything in Real Time

Keep a running log of what you tried and when. This serves two purposes: it prevents duplicate work (“did we already try restarting the workers?”), and it feeds directly into the post-incident review.
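
The running log does not need special tooling. As a rough sketch, a small shell function can append timestamped entries to a shared file. The function name ilog, the INCIDENT_LOG variable, and the default file name incident.log are all assumptions made for illustration.

```shell
# Append a UTC-timestamped entry to the incident log.
# INCIDENT_LOG is a hypothetical variable; the default path is an assumption.
ilog() {
  local logfile="${INCIDENT_LOG:-incident.log}"
  # Format: timestamp, who made the entry, what was done or observed.
  printf '%s  %s  %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "${USER:-unknown}" \
    "$*" >> "$logfile"
}

# Example:
#   ilog "Restarted background workers, no change in error rate"
#   ilog "Rolled back latest deploy"
```

UTC timestamps keep entries comparable when responders are in different time zones, and an append-only file is easy to paste into the post-incident review.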

When to Call for Outside Help

If your team has been working an incident for more than 30-60 minutes without clear progress, bring in outside expertise. Fresh eyes on unfamiliar failure modes often find root causes quickly. The cost of an hour of expert help is almost always less than another hour of downtime.

Contact us if your app is currently down - or read about our emergency support services.