How do I fix SaaS downtime?

SaaS downtime has a short list of common causes. Here's how to diagnose and fix the most frequent ones in Rails-based SaaS applications.


The fix depends entirely on which cause you’re dealing with, and misdiagnosing it wastes time you don’t have.

Step One: Determine Scope

Before touching anything, answer these questions:

Is every customer affected, or only some? (All-tenant vs. single-tenant failure)
Is the entire application down, or specific features?
Is it degraded performance or total failure?
When did it start, and what changed around that time?

All-tenant failures usually point to infrastructure, database, or a bad deployment. Single-tenant failures often point to data-specific issues, tenant configuration, or quota/rate limits.
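
The scope questions above can be turned into a rough first-pass classification. The following is a minimal sketch, not a standard tool: `classify_scope`, its arguments, and the 50% threshold are all assumptions made for illustration, and the error counts would come from whatever logging or APM system you already have.

```ruby
# Hypothetical triage helper: classify an incident from per-tenant error counts.
# `errors_by_tenant` maps tenant id => failed request count in the incident window.
# `active_tenants` is how many tenants sent any traffic in the same window.
# `widespread_ratio` is an arbitrary illustrative threshold, not a recommendation.
def classify_scope(errors_by_tenant, active_tenants:, widespread_ratio: 0.5)
  affected = errors_by_tenant.count { |_tenant, errors| errors.positive? }
  return :no_errors if affected.zero?
  return :single_tenant if affected == 1

  ratio = affected.to_f / active_tenants
  ratio >= widespread_ratio ? :all_tenant : :multi_tenant
end

classify_scope({ "acme" => 12, "globex" => 0, "initech" => 0 }, active_tenants: 3)
# => :single_tenant
```

A `:single_tenant` result suggests starting with that tenant's data and configuration; `:all_tenant` suggests starting with infrastructure, the database, or the most recent deploy.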

Common Causes and Fixes

Database Connection Exhaustion

Symptoms: Intermittent 500 errors, requests timing out, “too many connections” errors in logs.

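Fix: Identify what’s holding connections. Check your connection pool configuration against the number of application threads and background workers: each process needs a pool at least as large as its thread count, and the total across all processes has to stay below the database’s connection limit. Recent background job additions are a frequent culprit, particularly workers that don’t release ActiveRecord connections after completion.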

# Check current pool size in database.yml
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

# Force connection release in background jobs
def perform(*args)
  # ... job work that uses ActiveRecord ...
ensure
  ActiveRecord::Base.connection_pool.release_connection
end
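
Checking the pool against threads and workers is mostly arithmetic. The sketch below assumes each thread holds at most one connection and that every process's pool matches its thread count; the method name, the `reserved` headroom, and all numbers are illustrative assumptions rather than values from any particular setup.

```ruby
# Rough connection budget: do web threads plus worker threads fit under the
# database's connection limit? `reserved` leaves headroom for consoles,
# migrations, and monitoring (an assumed figure).
def connection_budget(web_processes:, threads_per_web:, worker_processes:,
                      worker_concurrency:, max_connections:, reserved: 5)
  needed = (web_processes * threads_per_web) + (worker_processes * worker_concurrency)
  available = max_connections - reserved
  { needed: needed, available: available, ok: needed <= available }
end

connection_budget(web_processes: 4, threads_per_web: 5,
                  worker_processes: 2, worker_concurrency: 10,
                  max_connections: 25)
# => { needed: 40, available: 20, ok: false }
```

If `ok` comes back false, the options are fewer threads or workers, a higher database limit, or a connection pooler in front of the database.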

Memory Exhaustion / OOM Kills

Symptoms: Puma or Sidekiq processes disappearing, requests randomly failing, server memory graphs showing a sawtooth pattern.

Fix: Identify the memory leak source. Common culprits: Sidekiq jobs that load large ActiveRecord collections without pagination, background jobs that cache objects at the class level, or gems with known memory issues. The short-term fix is to restart workers on a schedule; the durable fix is to find and eliminate the leak.
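
For the unpaginated-collection case, the usual remedy is to process records in batches so that only one batch is in memory at a time. In Rails, `find_each` and `in_batches` do this for ActiveRecord relations. The plain-Ruby sketch below shows the same idea with keyset pagination; `each_in_batches`, `fetch_page`, and the `ROWS` stand-in data are hypothetical names used only for illustration.

```ruby
# Yields records in fixed-size batches using keyset pagination, so only one
# batch is held in memory at a time. `fetch_page` is a hypothetical callable
# that returns up to `limit` records with id > after_id, ordered by id.
def each_in_batches(fetch_page, batch_size: 1000)
  last_id = 0
  loop do
    batch = fetch_page.call(after_id: last_id, limit: batch_size)
    break if batch.empty?

    yield batch
    last_id = batch.last.fetch(:id)
  end
end

# Stand-in data source; in a real app this would be a database query.
ROWS = (1..2500).map { |id| { id: id } }
fetch_page = ->(after_id:, limit:) { ROWS.select { |row| row[:id] > after_id }.first(limit) }

batch_sizes = []
each_in_batches(fetch_page) { |batch| batch_sizes << batch.size }
# batch_sizes => [1000, 1000, 500]
```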

Bad Deployment

Symptoms: Failures started immediately after a deploy, errors reference code that was just changed.

Fix: Roll back immediately. Don’t try to fix it forward under pressure unless the deploy included a migration that can’t be reversed. Get the system running, then fix in a branch.

# Roll back in Heroku
heroku releases:rollback

# Roll back in Kamal (pass the version of the release to restore)
kamal rollback <VERSION>

Failed or Partial Migration

Symptoms: Certain database operations failing with column-not-found or constraint errors, some features broken while others work.

Fix: Check migration status (in Rails, bin/rails db:migrate:status lists each migration as up or down). Determine whether the migration ran partially or not at all. If it ran partially, you may need to manually clean up the database state before re-running. Don’t run the migration again without understanding what it completed.
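
One way to work out what a migration completed is to compare the columns it was supposed to add against what the database reports now. The sketch below is a hypothetical checklist in plain Ruby; the method names and the table and column names are invented for illustration. In a Rails console, the actual column names could be read with `ActiveRecord::Base.connection.columns(table).map(&:name)`.

```ruby
# Hypothetical checklist: which steps of a multi-step migration actually landed?
# `expected` lists the columns the migration should have added, per table.
# `actual` lists the columns each table has right now.
def migration_progress(expected, actual)
  expected.flat_map do |table, columns|
    present = actual.fetch(table, [])
    columns.map { |column| { table: table, column: column, applied: present.include?(column) } }
  end
end

# Summarises the checklist as :not_run, :partial, or :complete.
def migration_state(progress)
  applied = progress.count { |step| step[:applied] }
  return :not_run if applied.zero?

  applied == progress.size ? :complete : :partial
end

expected = { "invoices" => %w[currency tax_cents], "plans" => %w[trial_days] }
actual   = { "invoices" => %w[id currency], "plans" => %w[id] }
migration_state(migration_progress(expected, actual))
# => :partial
```

A `:partial` result means the remaining steps need to be applied, or the completed ones undone, by hand before the migration is re-run.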

Third-Party Dependency Failure

Symptoms: Specific features broken (payment processing, email, file uploads), your own infrastructure looks healthy.

Fix: Check the status pages for your dependencies - Stripe, SendGrid, AWS S3, etc. If it’s an upstream failure, your options are limited to circuit-breaking the dependency and serving degraded functionality until it recovers.
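
Circuit-breaking means that after repeated failures you stop calling the dependency for a while and fail fast instead of letting requests pile up on timeouts. The class below is a minimal sketch, assuming a single-threaded caller; `CircuitBreaker`, its thresholds, and the injectable `clock` are illustrative choices, not the API of any particular gem. A production version would need thread safety and shared state across processes.

```ruby
# Minimal circuit breaker: opens after `failure_threshold` consecutive failures,
# rejects calls while open, and allows a trial call after `reset_after` seconds.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(failure_threshold: 3, reset_after: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @failure_threshold = failure_threshold
    @reset_after = reset_after
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def call
    raise OpenError, "circuit open" if open?

    result = yield
    @failures = 0        # a success closes the circuit again
    @opened_at = nil
    result
  rescue OpenError
    raise                # rejections are not counted as dependency failures
  rescue StandardError
    @failures += 1
    @opened_at = @clock.call if @failures >= @failure_threshold
    raise
  end

  def open?
    return false if @opened_at.nil?

    @clock.call - @opened_at < @reset_after
  end
end
```

A caller would wrap the upstream request in `breaker.call { ... }` and rescue `CircuitBreaker::OpenError` to serve the degraded path, for example queuing the email or showing a "payments temporarily unavailable" message.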

Multi-Tenancy Isolation Failure

Symptoms: Customers seeing each other’s data, data appearing in wrong tenant contexts, unusual authorization errors.

Fix: This is a data integrity emergency. Take the affected tenants offline immediately if there’s any risk of data exposure. Identify the query or scope that’s missing tenant isolation. Do not restore service until you’ve confirmed the isolation is enforced.
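
Confirming isolation usually means re-running the suspect queries for a known tenant and checking that every returned record belongs to it. The helper below is a hedged sketch of such a check; `assert_tenant_isolated!`, `TenantLeakError`, the `Invoice` struct, and the `tenant_id` attribute name are all hypothetical and would need to match your own schema.

```ruby
class TenantLeakError < StandardError; end

# Raises if any record belongs to a tenant other than the one in context.
# Records are assumed to respond to `tenant_id` (a hypothetical attribute name).
def assert_tenant_isolated!(records, current_tenant_id)
  leaked = records.reject { |record| record.tenant_id == current_tenant_id }
  return records if leaked.empty?

  raise TenantLeakError, "#{leaked.size} record(s) outside tenant #{current_tenant_id}"
end

# Stand-in record type for the example.
Invoice = Struct.new(:id, :tenant_id)

assert_tenant_isolated!([Invoice.new(1, 7), Invoice.new(2, 7)], 7) # returns the records
# assert_tenant_isolated!([Invoice.new(1, 7), Invoice.new(3, 9)], 7) would raise TenantLeakError
```

A check like this fits in a regression test for the fixed query, so the missing scope cannot quietly return.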

When to Get Outside Help

SaaS downtime is a revenue and trust emergency. If your team hasn’t identified the cause within 30 minutes, bringing in an engineer who has handled this failure class before is almost always worth it.