An uptime SLA breach occurs when your application’s availability falls below the percentage guaranteed in your service level agreement. A 99.9% uptime SLA allows roughly 8 hours 46 minutes of downtime per year - about 43.8 minutes per month - so anything beyond that is a breach. For 99.99%, the budget is about 52.6 minutes per year.
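The downtime budget for any SLA percentage is a one-line calculation. This is a hypothetical helper, not from any particular library; a "month" here is taken as one twelfth of a 365.25-day year.

```ruby
# Hypothetical helper: the downtime budget implied by an uptime SLA,
# in minutes, for a period of a given length in days.
def downtime_budget_minutes(sla_percent, period_days)
  period_days * 24 * 60 * (1 - sla_percent / 100.0)
end

downtime_budget_minutes(99.9, 365.25)       # ~526 minutes, about 8.8 hours
downtime_budget_minutes(99.9, 365.25 / 12)  # ~43.8 minutes
downtime_budget_minutes(99.99, 365.25)      # ~52.6 minutes
```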
A breach has two distinct problems: the technical problem that caused the downtime, and the contractual and customer trust problem that follows.
What Counts as Downtime
SLA definitions vary. Read yours carefully:
- Does it cover all downtime, or only unplanned outages?
- Does degraded performance (slow responses, partial failures) count?
- Is it measured at the infrastructure layer or the user experience layer?
- What monitoring source is authoritative - yours, the customer’s, or a third party?
These definitions matter when a breach is disputed. If your SLA doesn’t specify, you’re negotiating rather than measuring.
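Once your SLA pins down what counts as downtime, the measurement itself is simple arithmetic over outage windows. A sketch, with hypothetical names - which windows belong in the `outages` list is exactly what the contract questions above decide:

```ruby
# Measured uptime over a period, given a list of [start, end] outage windows.
# Which windows "count" depends on your SLA's definitions of downtime.
def uptime_percent(period_start, period_end, outages)
  total_seconds = period_end - period_start
  down_seconds  = outages.sum { |from, to| to - from }
  100.0 * (total_seconds - down_seconds) / total_seconds
end

period_start = Time.utc(2024, 3, 1)
period_end   = Time.utc(2024, 4, 1)
outages = [
  [Time.utc(2024, 3, 10, 2, 0), Time.utc(2024, 3, 10, 3, 0)] # one hour
]
uptime_percent(period_start, period_end, outages) # ~99.87, below a 99.9% monthly SLA
```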
Immediate Response to a Breach
Restore service first. The clock is running on your SLA the entire time the system is down. Every minute of additional downtime worsens the breach. Containment takes priority over root cause analysis.
Document the outage timeline precisely. When did degradation start? When was it detected? When was service restored? This timeline is what your SLA calculations are based on - and what you’ll reference in customer communications and credits.
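The three timestamps above are worth capturing in a structured form, since both the SLA math and the detection-lag question fall out of them. The field names here are illustrative, not from any particular incident tool:

```ruby
# Minimal outage timeline record; field names are illustrative.
Outage = Struct.new(:degradation_start, :detected_at, :restored_at) do
  # Total downtime counted against the SLA, in minutes.
  def downtime_minutes
    ((restored_at - degradation_start) / 60.0).round(1)
  end

  # How long the outage ran before anyone noticed.
  def detection_lag_minutes
    ((detected_at - degradation_start) / 60.0).round(1)
  end
end

outage = Outage.new(
  Time.utc(2024, 3, 10, 2, 0),   # degradation start
  Time.utc(2024, 3, 10, 2, 14),  # detected
  Time.utc(2024, 3, 10, 3, 0)    # restored
)
outage.downtime_minutes      # => 60.0
outage.detection_lag_minutes # => 14.0
```

A large detection lag is its own finding: it means the SLA clock was running before your monitoring knew anything was wrong.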
Notify affected customers proactively. Most enterprise SLAs require notification within a defined window. Check your contract. Getting ahead of the notification is better than customers discovering the breach before you tell them.
Root Cause and Prevention
After service is restored, the question is whether this breach was a one-time event or a symptom of structural reliability problems.
One-time events: A bad deployment, an infrastructure provider outage, a hardware failure. These can be mitigated with better deployment practices, redundancy, or failover, but they don’t necessarily indicate systemic unreliability.
Structural problems: Recurring database instability, memory leaks that accumulate over time, insufficient capacity for actual traffic patterns. These require architectural changes, not just incident response.
A post-incident review should determine which category you’re in and produce a credible plan for preventing recurrence - one you can share with affected customers.
SLA Credits and Customer Communication
Most SLAs require issuing service credits when the commitment is breached. Calculate the credit according to your contract terms and issue it proactively. Customers who receive credits without asking for them are more forgiving than customers who have to fight for them.
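Credit schedules are usually tiered by measured uptime. The tiers below are made up for illustration - the real thresholds and percentages come from your contract:

```ruby
# Hypothetical tiered credit schedule: [uptime threshold, % of monthly fee].
# The actual tiers come from your contract, not from this example.
CREDIT_TIERS = [
  [99.0, 25], # below 99.0% uptime -> 25% credit
  [99.5, 10], # below 99.5% uptime -> 10% credit
  [99.9,  5], # below 99.9% uptime ->  5% credit
].freeze

def service_credit(monthly_fee, measured_uptime)
  tier = CREDIT_TIERS.find { |threshold, _| measured_uptime < threshold }
  return 0 unless tier
  monthly_fee * tier[1] / 100.0
end

service_credit(2_000, 99.87) # => 100.0 (5% credit on a $2,000 monthly fee)
service_credit(2_000, 99.95) # => 0 (within SLA)
```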
Your communication to affected customers should include:
- What happened (non-technical summary)
- When it started and when it was resolved
- Root cause (honest, brief)
- What you’re doing to prevent recurrence
- The service credit they’re owed
Reducing Future SLA Risk
If you’re regularly close to your SLA limits, the fix is either improving reliability or adjusting your SLA to reflect what you can actually deliver. Overpromising on SLAs and then issuing frequent credits is worse for customer relationships than a more conservative commitment you consistently meet.
Common reliability improvements for Rails applications:
- Database read replicas and connection pooling tuned to actual load
- Horizontal scaling and load balancing for web servers
- Background job isolation so worker failures don’t affect the web tier
- Health checks and auto-recovery for application processes
- Proper circuit breakers for third-party dependencies
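To make the last item concrete, here is a minimal circuit-breaker sketch in plain Ruby. Production Rails apps would more likely use a maintained gem, but the mechanism looks like this: after a threshold of consecutive failures the breaker opens and fails fast, then permits a trial request once a reset window elapses.

```ruby
# Minimal circuit breaker sketch - illustrative, not production-hardened
# (no thread safety, no per-endpoint state).
class CircuitBreaker
  CircuitOpenError = Class.new(StandardError)

  def initialize(failure_threshold: 5, reset_after: 30)
    @failure_threshold = failure_threshold
    @reset_after = reset_after # seconds before a trial request is allowed
    @failures = 0
    @opened_at = nil
  end

  def call
    # While open and inside the reset window, fail fast without calling out.
    raise CircuitOpenError, "failing fast" if open? && !retry_window_elapsed?

    begin
      result = yield
      @failures = 0     # success closes the breaker
      @opened_at = nil
      result
    rescue StandardError
      @failures += 1
      @opened_at = Time.now if @failures >= @failure_threshold
      raise
    end
  end

  private

  def open?
    !@opened_at.nil?
  end

  def retry_window_elapsed?
    Time.now - @opened_at >= @reset_after
  end
end
```

Wrapping each outbound call in `breaker.call { ... }` means that when the dependency is down, requests fail immediately instead of tying up web workers waiting on timeouts - which is how a third-party outage turns into your SLA breach.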
Contact us if you’ve had an SLA breach and need help with root cause analysis and prevention, or learn about our emergency support services.

