Posted by: buzina | January 18, 2010

Root cause vs. cause analysis


After a wonderful skiing holiday in Aspen last week I am back thinking about problem management. My post Incident and Problem Management revisited and the ITSkeptic on root cause re-triggered my thoughts on the relationship between problems and causes.

ITIL v3:

ITIL defines a ‘problem’ as the unknown cause of one or more incidents.


ISO 20000:

problem
unknown underlying cause of one or more incidents

So according to these definitions there is a one-to-one relationship between a cause and a problem; the two differ only in the word underlying. Both agree that there has to be at least one incident and that the cause has to be unknown. As soon as the cause is known, the problem becomes a known error. ITIL defines a known error as follows:

A Problem that has a documented Root Cause and a Workaround…

This means that a known error still is a problem. So we can skip the word unknown in the definitions above.

This is not enough to me. Since I am a big fan of proactive problem management (which somehow is neglected in the current ITIL version) I would like to create problems in terms of potential incidents. To support this, let us have a look at the last sentence in the ITIL known error definition:

… Known Errors may also be identified by Development or Suppliers.

As known errors are the subset of problems which have a documented root cause (and a workaround), this indicates that problems may exist without being the cause of a current incident: developers will, for example, provide information on defects in their software which may cause incidents in the future. This results in the following definition:

A problem is a cause of current or future incidents.

So problem and cause describe the same concept and are basically aliases. What does this mean for root cause analysis? It becomes recursive, since a root cause is the same as a root problem, and a root problem may itself have another root problem (and so on). Just like the ITSkeptic, I would like to do away with the misleading term root cause. The Skeptic is going to call the root cause the primary cause from now on. I will not do that; I will simply stop using the word root. From now on, I will talk about the cause analysis of a problem. The cause analysis finds the direct causes of a given problem and creates or references the list of problems causing it. Each of the new problems (causes!) should undergo further cause analysis. This results in an ordered network of problems / causes.

Why should we do that? Because we have to address each of them in the right way. Let us use an example. I used it in my previous post and have added some nodes to show that a single problem may be causing more than one other problem.

So when the incident “Service Outage” comes up, the first analysis finds that the network is unavailable. Investigating this turns up the poor guy who cut the cable and also reveals that the fault tolerance did not kick in. Research on the fault tolerance shows that the manual switch to the backup line was not executed, so the incident is quickly recovered by executing the switch. Incident resolved. Now we can start the problem management. During the incident analysis the following causes were created:

  • Network unavailable
    • Cable was cut
    • Fault tolerance not working
      • Manual workaround not used

Each of these causes is now a problem and should be addressed. The first step is to do the cause analysis, which results in the following problems:

  • Network unavailable
    • Cable was cut
      • Unqualified maintenance effort
      • Change process not adhered to
    • Fault tolerance not working
      • Design issue
      • Manual workaround not used
        • Procedure not documented
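As a minimal sketch of this network (not taken from any ITSM tool), the list above could be held as cause/effect pairs in Python. The function names `build_index`, `all_causes` and `unanalysed` are made up for illustration; the problem names come from the list above.

```python
from collections import defaultdict

# Each pair reads "cause -> effect"; names are taken from the list above.
CAUSES = [
    ("Network unavailable", "Service Outage"),
    ("Cable was cut", "Network unavailable"),
    ("Fault tolerance not working", "Network unavailable"),
    ("Unqualified maintenance effort", "Cable was cut"),
    ("Change process not adhered to", "Cable was cut"),
    ("Design issue", "Fault tolerance not working"),
    ("Manual workaround not used", "Fault tolerance not working"),
    ("Procedure not documented", "Manual workaround not used"),
]


def build_index(edges):
    """Map each effect to the list of its direct causes."""
    direct = defaultdict(list)
    for cause, effect in edges:
        direct[effect].append(cause)
    return direct


def all_causes(item, direct, seen=None):
    """Collect every direct and indirect cause of `item`, depth-first."""
    seen = set() if seen is None else seen
    result = []
    for cause in direct.get(item, []):
        if cause not in seen:  # a problem may contribute to several others
            seen.add(cause)
            result.append(cause)
            result.extend(all_causes(cause, direct, seen))
    return result


def unanalysed(item, direct):
    """Causes of `item` that have no recorded causes of their own yet."""
    return [c for c in all_causes(item, direct) if not direct.get(c)]


direct = build_index(CAUSES)
for problem in unanalysed("Service Outage", direct):
    print("still needs cause analysis:", problem)
```

Every entry returned by `unanalysed` is a candidate for the next round of cause analysis, which is exactly the repeated step described here.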

All of the new problems need to be investigated too and continuing this leads to the following complete cause analysis:

For each problem we have to decide what to do about it. The options are:

  • Live with it / ignore it for now
  • Resolve or mitigate it using a change or service request
  • Define a workaround for future reference (reducing the impact)
  • Reduce the risk of reoccurrence
  • Resolve it by resolving all contributing causes
  • or a combination of the above

Examples:

  • The cut cable should be resolved by a service request, getting the cable in place again.
  • The unqualified maintenance issue itself will be ignored; its risk of reoccurrence will be reduced only by addressing its contributing causes.
  • The lack of training will be addressed directly and resolved for these individuals.
  • The lack of training will be avoided in the future by improving the personnel training process.
  • The overall lack of governance will be addressed by a strategic management project improving oversight and control.
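One way to record such decisions, sketched here under my own naming, is a flag type whose members can be combined, since a problem may get more than one measure. `Measure`, `decisions` and `undecided` are hypothetical names, and the mapping of the examples to members is my reading of the list above.

```python
from enum import Flag, auto


class Measure(Flag):
    """The options listed above; members can be combined with `|`."""
    LIVE_WITH_IT = auto()         # live with it / ignore it for now
    RESOLVE_OR_MITIGATE = auto()  # change or service request
    WORKAROUND = auto()           # defined for future reference
    REDUCE_RISK = auto()          # reduce the risk of reoccurrence
    VIA_CAUSES = auto()           # resolve all contributing causes


# Decisions for some of the examples above (an interpretation, not a rule).
decisions = {
    "Cable was cut": Measure.RESOLVE_OR_MITIGATE,
    "Unqualified maintenance effort": Measure.LIVE_WITH_IT | Measure.VIA_CAUSES,
    "Lack of training": Measure.RESOLVE_OR_MITIGATE | Measure.REDUCE_RISK,
}


def undecided(problems, decisions):
    """Problems that have no recorded decision yet."""
    return [p for p in problems if p not in decisions]


print(undecided(["Cable was cut", "Design issue"], decisions))
```

Keeping "live with it" as an explicit value means that ignoring a problem is a recorded decision and not just a missing entry.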

Even if this single outage does not justify all of these measures to be taken, documenting these and keeping track of all the (potential) trouble causes improves the awareness of what is really going on.

If we look at the overall view of problems / causes, we can ask ourselves what the root cause really is. Some possible “roots” are:

  • The cut cable, since that caused the outage directly
  • Not using the manual cut-over switching, since that is the cause that incident management fixed
  • The undocumented manual switch procedure, since that caused the people not to know what to do
  • Not adhering to the change procedure, since that is where we would have caught and prevented this from happening
  • The overall lack of governance, since multiple problems stem from that
  • … almost all the others as well

That is why I have stopped talking about root or primary causes. Every cause is a problem and should be documented, and each requires a thoughtful decision about the measures to take.

To model this in data I would suggest the following:

To indicate that a given problem is the cause of an incident or another problem, you would create an entry in the table causes which references the causing problem in prb_id and the caused incident or problem in req_id (incident and problem share a common supertype table request in this model). Alternatively, you can use a more general relates table, which can contain different types of relationships between requests.
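A rough sketch of the causes variant, using Python's built-in sqlite3 module with an in-memory database. Only the table names request and causes and the columns prb_id and req_id come from the description above; the `kind` and `title` columns, the sample rows and the helper `direct_causes` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Common supertype of incidents and problems; kind and title are assumed.
    CREATE TABLE request (
        req_id INTEGER PRIMARY KEY,
        kind   TEXT NOT NULL CHECK (kind IN ('incident', 'problem')),
        title  TEXT NOT NULL
    );
    -- One row per statement "problem prb_id causes request req_id".
    CREATE TABLE causes (
        prb_id INTEGER NOT NULL REFERENCES request(req_id),
        req_id INTEGER NOT NULL REFERENCES request(req_id),
        PRIMARY KEY (prb_id, req_id)
    );
""")

conn.executemany("INSERT INTO request VALUES (?, ?, ?)", [
    (1, "incident", "Service Outage"),
    (2, "problem", "Network unavailable"),
    (3, "problem", "Cable was cut"),
    (4, "problem", "Fault tolerance not working"),
])
conn.executemany("INSERT INTO causes VALUES (?, ?)", [(2, 1), (3, 2), (4, 2)])


def direct_causes(conn, req_id):
    """Titles of the problems recorded as direct causes of `req_id`."""
    rows = conn.execute(
        """SELECT p.title FROM causes c
           JOIN request p ON p.req_id = c.prb_id
           WHERE c.req_id = ? ORDER BY p.req_id""",
        (req_id,),
    )
    return [title for (title,) in rows]


print(direct_causes(conn, 2))
```

Because req_id points at the shared request table, the same lookup works whether the caused item is an incident or another problem. The relates alternative differs mainly in carrying a relationship type on each row.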

In this model you would create an entry in the table relates which has the causing problem in req_id_From and the caused incident or problem in req_id_To and the Type of “causes”.
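The relates variant could be sketched like this, again with Python's sqlite3 module. The column names req_id_From, req_id_To and Type follow the description above; the `title` column, the sample data, the extra "related" row and the helper `all_causes` are assumptions. A recursive query then walks the whole chain of causes for a given request.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE request (req_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE relates (
        req_id_From INTEGER NOT NULL REFERENCES request(req_id),
        req_id_To   INTEGER NOT NULL REFERENCES request(req_id),
        Type        TEXT NOT NULL
    );
""")

conn.executemany("INSERT INTO request VALUES (?, ?)", [
    (1, "Service Outage"),
    (2, "Network unavailable"),
    (3, "Cable was cut"),
    (4, "Fault tolerance not working"),
    (5, "Manual workaround not used"),
])
conn.executemany("INSERT INTO relates VALUES (?, ?, ?)", [
    (2, 1, "causes"),
    (3, 2, "causes"),
    (4, 2, "causes"),
    (5, 4, "causes"),
    (5, 3, "related"),  # hypothetical non-causal link, ignored below
])


def all_causes(conn, req_id):
    """Titles of all direct and indirect causes of `req_id`."""
    rows = conn.execute(
        """WITH RECURSIVE chain(req_id) AS (
               SELECT req_id_From FROM relates
               WHERE req_id_To = ? AND Type = 'causes'
               UNION
               SELECT r.req_id_From FROM relates r
               JOIN chain c ON r.req_id_To = c.req_id
               WHERE r.Type = 'causes'
           )
           SELECT q.title FROM request q
           WHERE q.req_id IN (SELECT req_id FROM chain)
           ORDER BY q.req_id""",
        (req_id,),
    )
    return [title for (title,) in rows]


print(all_causes(conn, 1))
```

Using UNION instead of UNION ALL drops duplicate rows, so a problem that contributes to several others is listed once and the query still terminates if someone records a circular relationship by mistake.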


Responses

  1. Fabulous post. One quibble: why do you consider a Problem to be a subtype of Request? I’ve never seen it that way or heard it treated that way. Surely it is a distinct underlying entity, just as Cause is?

    • Hi Skep,

      Thanks. I treated Problem as a subtype of Request as it often is in ITSM Tools. In my (thought-) model, the request supertype joins all functionality regarding work assignment, workflow and tasking, so a supertype of request indicates that the object is worked on. This is why I put it in as supertype of Incident, Problem, ServiceRequest, Change and all the other workable types.

      But overall, you are right, the main purpose of the object Problem is not to be worked on, but to be found when searching for causes. So maybe Request is not a supertype, but a related request will probably be created to handle the Problem Management Workflow.

  2. I use the RCA toolkit found here:
    http://matureit.blogspot.com/p/itil.html
    It was very helpful for me when doing an actual Root Cause Analysis

