How to Perform Root Cause Analysis: The Real-World Approach That Actually Works
Three months ago, I watched a perfectly capable operations manager blame "human error" for the third system failure in as many weeks. That's when I knew we needed to have a serious conversation about root cause analysis.
See, most people think root cause analysis is about finding someone to blame. It's not. It's about finding the real reason things go sideways so you can actually fix them instead of just putting band-aids on symptoms.
The Problem with Most Root Cause Analysis
Here's what drives me absolutely mental: companies spend thousands on fancy RCA training, then their teams still conclude every investigation with "we need better training" or "staff need to be more careful."
Mate, if your solution is always "try harder next time," you're not doing root cause analysis. You're doing blame assignment with extra steps.
I've been running workplace improvement programs for nearly two decades now, and I can tell you that 87% of the time when someone says "human error," there's a systems failure underneath. People don't randomly start making mistakes. Something changed, something broke, or something was never properly designed in the first place.
What Root Cause Analysis Actually Is
Root cause analysis is detective work. You're looking for the underlying cause that, if fixed, would prevent the problem from happening again.
Think of it like this: if you've got water pooling on your office floor, you don't just mop it up and tell people to "be more careful about spills." You look up. Is there a leak in the ceiling? A burst pipe? Condensation from the air conditioning?
The water on the floor is the symptom. The broken pipe is the root cause.
The Five Whys Method (But Done Properly)
Everyone knows about the Five Whys, but most people use it wrong. They ask "why" five times and think they're done. The trick is asking the right type of "why" questions.
Let's say your customer service response times have blown out. Here's how most people would approach it:
- Why are response times slow? Because staff are taking longer to answer calls.
- Why are they taking longer? Because calls are more complex.
- Why are calls more complex? Because customers are frustrated.
- Why are customers frustrated? Because they can't find information online.
- Why can't they find information? Because the website is outdated.
That's decent, but here's the problem - you've only looked at one pathway. Real root cause analysis branches out: at each "why", ask what else could be contributing, and follow those answers down as well.
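If it helps to see that branching written down, here's a minimal Python sketch that treats the Five Whys as a tree instead of a single chain. The `Why` class and the `leaf_causes` function are names made up for illustration, and the two extra branches are hypothetical, not findings from a real investigation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Why:
    """One answer to a "why" question, with its own follow-up answers."""
    answer: str
    causes: list[Why] = field(default_factory=list)


def leaf_causes(node: Why) -> list[str]:
    """Return the deepest answer on every branch: the candidate root causes."""
    if not node.causes:
        return [node.answer]
    found = []
    for cause in node.causes:
        found.extend(leaf_causes(cause))
    return found


# The single pathway from the example above, plus two hypothetical branches.
tree = Why("Response times are slow", [
    Why("Staff are taking longer to answer calls", [
        Why("Calls are more complex", [
            Why("Customers are frustrated", [
                Why("They can't find information online", [
                    Why("The website is outdated"),
                ]),
            ]),
        ]),
        Why("The call-logging system is slow (hypothetical)"),
    ]),
    Why("Fewer staff are rostered at peak times (hypothetical)"),
])

for cause in leaf_causes(tree):
    print("-", cause)
```

Each leaf is a candidate to check against evidence, not a conclusion. The point of the tree is simply that the outdated website stops being the only answer on the table.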
The Fishbone Approach
I prefer the fishbone diagram because it forces you to consider multiple categories of potential causes. Draw a horizontal line with your problem at the head (like a fish head), then draw diagonal lines off it for different categories:
- People
- Process
- Equipment
- Environment
- Materials
- Methods
For each category, brainstorm potential contributing factors. This stops you from zeroing in on the first plausible explanation and missing the real culprit.
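A fishbone doesn't need special software. As a rough sketch, assuming you just want to record factors under the six categories above and see which bones are still empty, something like this Python would do. The function names and the example factors are hypothetical.

```python
CATEGORIES = ["People", "Process", "Equipment", "Environment", "Materials", "Methods"]


def new_fishbone(problem):
    """Start a diagram: the problem at the head, one empty bone per category."""
    return {"problem": problem, "bones": {name: [] for name in CATEGORIES}}


def add_factor(fishbone, category, factor):
    """Record a potential contributing factor under one category."""
    if category not in fishbone["bones"]:
        raise ValueError(f"Unknown category: {category}")
    fishbone["bones"][category].append(factor)


def unexplored(fishbone):
    """List the categories nobody has brainstormed yet."""
    return [name for name, factors in fishbone["bones"].items() if not factors]


# Hypothetical factors for the slow response times example.
diagram = new_fishbone("Customer service response times have blown out")
add_factor(diagram, "Process", "No escalation path for complex calls")
add_factor(diagram, "Equipment", "Call-logging system is slow")
add_factor(diagram, "Materials", "Online help pages are outdated")

print("Still to brainstorm:", unexplored(diagram))
```

The `unexplored` check is the useful bit: it shows you the categories you haven't thought about yet, which is exactly where the first plausible explanation tends to hide the real one.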
Getting Past the Obvious
Here's where most investigations fall down. They stop at the first reasonable-sounding explanation.
Customer complaint about rude service? Must be staff attitude. Project delivered late? Team didn't manage time properly. Safety incident? Someone wasn't following procedures.
But if you dig deeper, you often find:
- The "rude" staff member was dealing with their tenth identical query because the FAQ section is useless
- The project was late because three key people were pulled onto "urgent" tasks with no communication
- The safety incident happened because the procedure assumes equipment that hasn't worked properly in months
The Data Detective Approach
Now, I'm not saying you need to become a data scientist, but you do need to look at patterns. One incident tells you very little. Ten incidents start to show you where to look.
When did the problems start? What changed around that time? Are certain teams, locations, or times of day more affected?
I once investigated a spike in customer complaints at a retail chain. The obvious answer was "staff need customer service training." But when we mapped the complaints by time and location, we discovered they all clustered around the new point-of-sale system rollout. The real issue? The system was so slow that staff were getting frustrated, and that frustration was bleeding through to customers.
Training wouldn't have fixed that. A system upgrade did.
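You don't need fancy tooling for that kind of mapping. Here's a minimal sketch in Python, assuming each complaint is logged with a date and a store. The data, the store names, and the threshold of three are all invented for illustration and are not the retail chain's real figures.

```python
from collections import Counter
from datetime import date


def count_by_store_and_week(complaints):
    """Count complaints per (store, ISO week number)."""
    counts = Counter()
    for day, store in complaints:
        week = day.isocalendar()[1]  # ISO week number of the year
        counts[(store, week)] += 1
    return counts


def hotspots(counts, threshold):
    """Return the (store, week) cells with at least `threshold` complaints."""
    return sorted(cell for cell, n in counts.items() if n >= threshold)


# Hypothetical complaint log: (date, store).
complaints = [
    (date(2024, 2, 12), "Store A"),
    (date(2024, 3, 5), "Store A"),
    (date(2024, 3, 6), "Store A"),
    (date(2024, 3, 8), "Store A"),
    (date(2024, 2, 20), "Store B"),
    (date(2024, 3, 19), "Store B"),
    (date(2024, 3, 20), "Store B"),
    (date(2024, 3, 21), "Store B"),
]

counts = count_by_store_and_week(complaints)
for store, week in hotspots(counts, threshold=3):
    print(f"{store}: week {week} had {counts[(store, week)]} complaints")
```

Once the clusters are on the page, the next question is the one from earlier: what changed in those stores in those weeks? If the answer lines up with something like a system rollout, you've got a lead worth chasing.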
Common Root Cause Categories
After years of doing this, I've noticed most root causes fall into a few categories:
Communication Failures: Information that should flow between teams, departments, or systems but doesn't. In my experience, this accounts for about 40% of workplace problems.
Process Design Flaws: Procedures that sound good on paper but don't work in the real world. Often created by people who don't actually do the work.
Resource Constraints: Not enough time, people, or tools to do the job properly. Management knows this but expects miracles anyway.
Technology Issues: Systems that don't talk to each other, software that crashes under normal load, or equipment that's held together with wishful thinking.
Training Gaps: And yes, sometimes it really is a knowledge issue. But usually it's specific technical knowledge, not "be more professional" training.
The Human Factor (Without the Blame)
Here's where I get a bit controversial: I think the obsession with removing human error is counterproductive. Humans make mistakes. That's not a bug, it's a feature. We're creative, adaptable, and can handle unexpected situations. But we're also distractible, forgetful, and inconsistent.
Good root cause analysis accepts this and points you towards systems that work with human nature, not against it.
Instead of asking "How can we stop people making mistakes?" ask "How can we make mistakes obvious and easy to catch?" or "How can we make the right choice the easy choice?"
Red Flags in Root Cause Analysis
Watch out for these warning signs that you're not digging deep enough:
- Conclusions that blame individuals rather than systems
- Solutions that involve telling people to "be more careful"
- Root causes that are actually symptoms in disguise
- Analysis that ignores recent changes or assumes everything was working fine before
- Recommendations that require people to work harder rather than differently
The Follow-Up That Everyone Skips
Here's the bit that separates serious organisations from the time-wasters: actually implementing changes and measuring whether they work.
I've seen beautiful root cause analysis reports that identified real problems and proposed sensible solutions. Then nothing happened. The report went in a drawer, and six months later the same problem popped up again.
Set a review date. Measure the metrics that matter. Be prepared to admit if your initial analysis was wrong and dig deeper.
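As a sketch of what that review might look like, here's a small Python example that compares a metric before and after the fix against the reduction you were aiming for. The function names, the incident counts, and the 50% target are all hypothetical.

```python
from statistics import mean


def percent_change(before, after):
    """Percentage change in the average of a metric, before vs after the fix."""
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("Baseline average is zero; percentage change is undefined")
    return (mean(after) - baseline) / baseline * 100


def review(before, after, target_reduction_pct):
    """Compare the measured change against the reduction you were aiming for."""
    change = percent_change(before, after)
    if change <= -target_reduction_pct:
        return "target met"
    return "target missed: revisit the analysis"


# Hypothetical weekly incident counts, four weeks either side of the change.
before_fix = [10, 12, 8, 10]
after_fix = [4, 5, 3, 4]

print(round(percent_change(before_fix, after_fix), 1))
print(review(before_fix, after_fix, target_reduction_pct=50))
```

A simple average over a few weeks is a blunt instrument, and seasonal swings or a change in workload can fool it. It still beats the usual alternative, which is not measuring at all.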
Making It Stick in Your Organisation
If you want root cause analysis to become part of your culture (and not just another management fad), you need to make it safe for people to identify real problems.
That means no blame, no "gotcha" moments, and definitely no punishment for raising uncomfortable truths about systems or processes.
I worked with one manufacturer where equipment breakdowns were always blamed on "operator error." Once they implemented proper problem-solving training and removed the blame culture, they discovered that most breakdowns were caused by inadequate maintenance schedules and poor spare parts management. Fixing those actual root causes reduced downtime by 60%.
The Reality Check
Look, root cause analysis isn't rocket science, but it does require patience and intellectual honesty. Most people want quick answers and simple solutions. Real RCA takes time and often reveals uncomfortable truths about how your organisation actually works.
But here's the thing - if you keep treating symptoms instead of causes, you'll be fighting the same fires forever. Your best people will burn out, your customers will get frustrated, and you'll waste money on solutions that don't solve anything.
The choice is yours. You can keep applying band-aids, or you can actually fix things.
Just remember: the goal isn't to find someone to blame. It's to find something to fix.