When your web app goes down in production, the first 30 minutes determine how quickly you recover. Panic and uncoordinated action make incidents worse. Here’s the sequence to follow.
1. Establish an Incident Commander
One person owns the incident. Everyone else takes direction from them. Without this, multiple engineers make conflicting changes simultaneously and you lose track of what was tried. The incident commander doesn’t have to be the most senior engineer - they just have to be the one person coordinating.
2. Assess Scope Before Touching Anything
Before making changes, understand what’s actually broken:
- Is the entire app down, or specific pages/features?
- Are all users affected, or a subset (specific accounts, regions, plans)?
- When did it start? Check your monitoring for the exact timestamp.
- Did anything deploy or change around that time?
# Check recent deployments
git log --oneline -10
# Look for error spikes in logs
grep "ERROR\|FATAL" /var/log/app.log | tail -50
# Check database connectivity
rails db:version
3. Check the Obvious Things First
Most production outages have one of a short list of root causes. Check these in order:
- Recent deployment - did something ship in the last hour? Rolling back is often the fastest path to recovery.
- Database - connection errors, replication lag, disk space exhausted
- Memory/CPU - servers pegged at 100%, application processes killed by OOM
- External dependencies - a payment processor, email service, or CDN that’s down
- SSL/DNS - certificate expired, DNS misconfigured after an infrastructure change
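Several of these can be checked from a shell on the app server in under a minute. A minimal sketch, assuming a Linux host; the log path varies by distro:

```shell
# Quick resource triage on an app server (Linux assumed; paths are examples)

# Disk space - a full disk quietly breaks logging, uploads, and databases
df -h /

# Memory and swap - near-zero available memory suggests OOM risk
free -m

# Load average vs. core count - load far above the core count means CPU saturation
uptime
nproc

# Were any processes OOM-killed recently? (kernel log path varies by distro)
grep -i "killed process" /var/log/kern.log 2>/dev/null | tail -5
```

None of these commands change anything, so they're safe to run before the incident commander decides on an intervention.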
4. Contain First, Fix Second
If you can identify a recent change that caused the outage, roll it back immediately. Don’t try to fix it in place under pressure. Get the system running, then fix the root cause in a branch with proper testing.
If rollback isn’t possible, look for ways to restore partial service - disable the broken feature, route around the failing service, or serve a maintenance page while you work.
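One reason to prefer `git revert` over resetting or force-pushing: the rollback itself becomes a commit, so it stays visible for the post-incident review. A throwaway-repo sketch of the mechanics - in a real incident you'd revert the suspect deploy SHA and ship the reverting commit through your normal pipeline:

```shell
# Demonstrate rollback-by-revert in a throwaway repo (demo only)
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email ops@example.com
git config user.name ops

echo "v1" > config.yml
git add config.yml && git commit -qm "good deploy"
echo "v2-broken" > config.yml
git commit -qam "bad deploy"

# git revert creates a NEW commit that undoes the bad one - no history rewrite
git revert --no-edit HEAD

cat config.yml       # back to v1
git log --oneline    # newest entry: Revert "bad deploy"
```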
5. Communicate While You Work
Someone should be updating your status page and notifying affected customers. Silence makes incidents worse - customers assume the worst when they hear nothing. A simple “we’re aware of the issue and working on it” buys goodwill.
6. Document Everything in Real Time
Keep a running log of what you tried and when. This serves two purposes: it prevents duplicate work (“did we already try restarting the workers?”), and it feeds directly into the post-incident review.
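The log can be as simple as a timestamped append-only file. A minimal sketch - the `note` helper and the log filename are illustrative conventions, not a standard tool:

```shell
# Append one timestamped line per action to a shared incident log file.
# `note` and the filename are hypothetical conventions, not a real tool.
INCIDENT_LOG="incident-$(date +%Y%m%d).log"

note() {
  printf '%s  %s\n' "$(date -u '+%H:%M:%SZ')" "$*" >> "$INCIDENT_LOG"
}

note "restarted sidekiq workers - no change"
note "rolled back deploy abc123 - error rate dropping"

cat "$INCIDENT_LOG"
```

Anything works here - a pinned Slack thread serves the same purpose - as long as every entry has a timestamp and the whole team writes to one place.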
When to Call for Outside Help
If your team has been working an incident for more than 30-60 minutes without clear progress, bring in outside expertise. Fresh eyes on unfamiliar failure modes often find root causes quickly. The cost of an hour of expert help is almost always less than another hour of downtime.
Contact us if your app is currently down - or read about our emergency support services.

