
What is root cause analysis for software bugs?

Root cause analysis (RCA) is the process of identifying the underlying reason a software failure occurred - not just what broke, but the condition that made it possible to break in the first place.

Fixing symptoms without finding root causes means the same failure recurs. RCA is what separates a durable fix from a temporary patch.

Symptom vs. Root Cause

A symptom is what you observe: “the app returned 500 errors.” A root cause is the chain of conditions that produced it.

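Example: the app returns 500 errors (the symptom) because the database connection pool is exhausted (the proximate cause). The pool is exhausted because connections leak and nothing monitors pool utilization (the root cause).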

Fixing only the proximate cause (increasing pool size) delays the problem. Fixing the root cause (the connection leak, plus monitoring) eliminates it.
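To make the "delays the problem" point concrete, here is a minimal simulation. It assumes a hypothetical job that leaks a fixed number of connections per run; the function name and the pool sizes are illustrative, not taken from a real incident.

```python
from typing import Optional


def jobs_until_exhaustion(pool_size: int, leaked_per_job: int) -> Optional[int]:
    """Count how many jobs can run before the pool has no free connections left.

    Returns None when jobs leak nothing, because the pool then never runs dry.
    """
    if leaked_per_job <= 0:
        return None
    free = pool_size
    jobs = 0
    while free >= leaked_per_job:
        free -= leaked_per_job  # the leaked connections never come back
        jobs += 1
    return jobs


print(jobs_until_exhaustion(5, 1))   # original pool size
print(jobs_until_exhaustion(10, 1))  # doubled pool: exhaustion is later, not gone
print(jobs_until_exhaustion(10, 0))  # leak fixed: no exhaustion from this cause
```

Under these assumptions, doubling the pool only doubles the number of jobs that run before the failure returns, while removing the leak removes the failure mode.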

The Five Whys

One of the most practical RCA techniques for software bugs is the Five Whys - ask “why” repeatedly until you reach a cause you can actually change.

  1. Why did the API time out? - The database connection pool was exhausted.
  2. Why was the pool exhausted? - Background jobs were holding connections open.
  3. Why were they holding connections? - The jobs weren’t closing ActiveRecord connections after completion.
  4. Why wasn’t this caught? - No test covered background job resource cleanup.
  5. Why was there no such test? - The team didn’t have a standard for testing resource cleanup in workers.

The fix isn’t just “close connections in the worker.” It’s also “add a test for resource cleanup” and “add monitoring for pool utilization.”
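The three parts of that fix can be sketched in a language-neutral way. The example below uses a toy in-memory pool rather than ActiveRecord, so every name in it (ConnectionPool, leaky_job, fixed_job, utilization) is a hypothetical stand-in and not the API of any real library.

```python
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Toy pool that only counts checked-out connections (no real database)."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0
        self._lock = threading.Lock()

    def checkout(self) -> object:
        with self._lock:
            if self.in_use >= self.size:
                raise RuntimeError("connection pool exhausted")
            self.in_use += 1
        return object()  # stand-in for a real connection

    def checkin(self, conn: object) -> None:
        with self._lock:
            self.in_use -= 1

    def utilization(self) -> float:
        """Fraction of the pool currently in use; the value to monitor."""
        return self.in_use / self.size

    @contextmanager
    def connection(self):
        conn = self.checkout()
        try:
            yield conn
        finally:
            # Runs even if the job raises, so the connection is always returned.
            self.checkin(conn)


def leaky_job(pool: ConnectionPool) -> None:
    pool.checkout()  # bug: the connection is never checked back in


def fixed_job(pool: ConnectionPool) -> None:
    with pool.connection():
        pass  # the job's real work would go here


def test_job_releases_connection() -> None:
    """The resource-cleanup test that was missing in the example above."""
    pool = ConnectionPool(size=2)
    for _ in range(10):
        fixed_job(pool)
    assert pool.in_use == 0
```

The context manager addresses the third "why", the test addresses the fourth and fifth, and utilization() is the kind of value a pool-utilization alert would watch.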

How We Conduct RCA

When we respond to a production incident, our RCA process involves:

Timeline reconstruction - We establish exactly when the failure started, what changed in the preceding hours (deploys, config changes, traffic patterns), and what the system looked like before and after.
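As a rough sketch of this step, the snippet below merges change events from different sources and keeps those that fall in a lookback window before the failure. The Event type, the six-hour default, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str    # e.g. "deploy", "config", "traffic"
    detail: str


def changes_before(events, failure_start, lookback=timedelta(hours=6)):
    """Return events in [failure_start - lookback, failure_start], oldest first."""
    window_start = failure_start - lookback
    in_window = [e for e in events if window_start <= e.at <= failure_start]
    return sorted(in_window, key=lambda e: e.at)


# Hypothetical data, for illustration only.
failure_start = datetime(2024, 5, 1, 14, 2)
events = [
    Event(datetime(2024, 5, 1, 13, 40), "deploy", "worker release"),
    Event(datetime(2024, 4, 30, 9, 0), "config", "pool size changed"),
    Event(datetime(2024, 5, 1, 11, 15), "traffic", "batch import started"),
    Event(datetime(2024, 5, 1, 14, 30), "deploy", "rollback"),
]

for event in changes_before(events, failure_start):
    print(event.at.isoformat(), event.kind, event.detail)
```

With this sample data, only the batch import and the worker release fall inside the window; the older config change and the later rollback are excluded.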

Log and metric analysis - We examine application logs, database slow query logs, infrastructure metrics, and APM traces to identify the failure point and its preconditions.
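One simple form of this analysis is counting errors per minute to find when the failure began. The sketch below assumes a hypothetical log format of an ISO 8601 timestamp followed by an uppercase level; real logs will need a different pattern.

```python
import re
from collections import Counter

# Assumed line shape: "2024-05-01T14:02:31Z ERROR message text"
LINE = re.compile(
    r"^(?P<minute>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\S*\s+(?P<level>[A-Z]+)\s"
)


def errors_per_minute(lines):
    """Count ERROR lines, keyed by timestamp truncated to the minute."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group("level") == "ERROR":
            counts[match.group("minute")] += 1
    return counts


def first_minute_over(counts, threshold):
    """Return the earliest minute with at least `threshold` errors, or None."""
    for minute in sorted(counts):  # ISO timestamps sort chronologically
        if counts[minute] >= threshold:
            return minute
    return None


# Hypothetical log lines, for illustration only.
sample = [
    "2024-05-01T14:01:10Z INFO request ok",
    "2024-05-01T14:02:05Z ERROR could not obtain connection",
    "2024-05-01T14:02:31Z ERROR could not obtain connection",
    "2024-05-01T14:03:02Z ERROR could not obtain connection",
]
counts = errors_per_minute(sample)
print(first_minute_over(counts, threshold=2))
```

The resulting minute gives the timeline a starting point that can then be compared against deploys and config changes.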

Code archaeology - For failures tied to recent changes, we read the diff carefully. For longer-standing issues, we trace the execution path that leads to the failure.

Contributing factor identification - Most production failures have more than one contributing factor. We look for the full set: the immediate trigger, the underlying vulnerability, and the missing safeguards.

Preventive action list - Every RCA ends with specific, actionable items: tests to add, monitoring to configure, code to refactor, processes to change.
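One way to keep contributing factors and preventive actions together is a small record type. The structure below is a hypothetical sketch, not a prescribed format; its field names mirror the categories described above, and the sample values come from the Five Whys example.

```python
from dataclasses import dataclass, field


@dataclass
class RcaRecord:
    trigger: str          # the immediate trigger
    vulnerability: str    # the underlying vulnerability
    missing_safeguards: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True only when every category has at least one entry."""
        return bool(
            self.trigger
            and self.vulnerability
            and self.missing_safeguards
            and self.actions
        )


record = RcaRecord(
    trigger="Background jobs exhausted the database connection pool",
    vulnerability="Jobs did not close connections after completion",
    missing_safeguards=[
        "No test covering background job resource cleanup",
        "No monitoring of pool utilization",
    ],
    actions=[
        "Close connections in the worker",
        "Add a test for resource cleanup",
        "Add monitoring for pool utilization",
    ],
)
print(record.is_complete())
```

A check like is_complete() is a cheap guard against closing an RCA that explains the failure but lists no safeguards or follow-up actions.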

What You Receive

After a production incident we handle, you receive a written post-incident report covering:

This document serves as institutional memory - the next engineer who encounters a similar symptom will know what to look for.

Contact us to discuss an incident, or learn about our emergency response services.