A discussion on the itskeptic blog started me thinking about how I believe Incident Management and Problem Management should interact and what belongs in which domain. You may read the original discussion here (it in turn links to a LinkedIn discussion, other posts on the itskeptic blog etc.).
The question was whether a problem record should be created in the early stages of the incident, after incident resolution when the incident does not refer to an existing problem / known error record, or at any other “freakin” time. Even though the initial discussion was focused on major incidents, the implications were greater. We then delved into the question of whether 2nd and 3rd level support is really needed for Incident Management. My first thought was to answer: of course we still need 2nd and 3rd level. Then I started thinking about the real objectives of Incident and Problem Management again:
Incident Management's goal is to reduce the impact on your customers (the business) and to restore the original service quality as quickly as possible.
The goal of problem management is to reduce the number and/or impact of incidents occurring.
Every now and then it is useful to rethink your process in terms of the original goal. For me this means that every time we need to analyse the root cause(s) of something, it is Problem Management's job to do so. But if you study the link between Incident and Problem Management in the ITIL® documentation, you find that a problem should be recorded after the incident has been solved (when the root cause is unknown). By now I have a different view. A problem should be recorded as soon as we expect a new unknown cause (aka a problem) to be found. Problem Management should then inspect the problem and see if it is a duplicate of an existing problem. Only after this inspection will it be a recognized problem that is classified and prioritized.
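The recording flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Problem`, `ProblemRegister`, `record_candidate` and `inspect` are hypothetical and not taken from ITIL or any tool, and the duplicate check is a deliberately naive comparison of the suspected cause text. A real Problem Management team would make that judgement by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    problem_id: int
    suspected_cause: str
    recognized: bool = False          # set only after Problem Management has inspected it
    priority: Optional[str] = None    # assigned together with recognition
    incident_ids: List[int] = field(default_factory=list)


class ProblemRegister:
    """Hypothetical register shared by Incident and Problem Management."""

    def __init__(self) -> None:
        self.problems: List[Problem] = []
        self._next_id = 1

    def record_candidate(self, incident_id: int, suspected_cause: str) -> Problem:
        # Incident Management records a candidate as soon as it expects
        # a new unknown cause, without waiting for incident resolution.
        candidate = Problem(self._next_id, suspected_cause, incident_ids=[incident_id])
        self._next_id += 1
        self.problems.append(candidate)
        return candidate

    def inspect(self, candidate: Problem, priority: str) -> Problem:
        # Problem Management checks for duplicates first.
        # Assumption: two records are duplicates if the cause text matches.
        for existing in self.problems:
            if (existing is not candidate and existing.recognized
                    and existing.suspected_cause.lower() == candidate.suspected_cause.lower()):
                existing.incident_ids.extend(candidate.incident_ids)
                self.problems.remove(candidate)
                return existing
        # No duplicate found: only now is it a recognized, prioritized problem.
        candidate.recognized = True
        candidate.priority = priority
        return candidate
```

The point of the two-step design is that recording is cheap and can happen during the incident, while classification and prioritization stay with Problem Management.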
On the other hand, the task of Incident Management is to restore the service. They will do this by searching for ways to execute a recovery plan. These plans may range from “instruct the user since the problem is between the ears” to “build a new system and restore the backups”, “revert to an earlier firewall configuration”, “add more disk space to the server”, “switch network to a backup line” and many more. Ideally these are tried and tested plans to get back to a working state [1].
I updated my combined Incident- and Problem Management flowchart to reflect this and I increased the detail level of the chart.
The chart shows three parts: the pure Incident Management, the pure Problem Management and the stuff in the middle connecting both. This middle part falls under the responsibility of Incident Management, but is very close to Problem Management.
So do we still need full 2nd and 3rd level support for Incident Management? In order to execute recovery plans, yes, we will need 2nd level, since they are the ones with administrative access to the production environment. Do we need 3rd level support? My answer would be “it depends”. If you have a common set of people (developers, engineers, vendor contacts), you will still need access to them to ask the vital question: “How the **** are we going to get this system up again?” Their answer should be a new recovery plan, bringing you back to a previous working state. If you want to ask the question “What the **** do we need to do so that this does not happen again immediately afterwards?”, you are in Problem Management. In case of major incidents you have to combine both efforts anyway. For all others you may want to leave the 3rd level completely to the Problem Management process. Why? Because it reduces complexity.
This has some implications for the Problem Management process.
You may need to have some OLAs on parts of the PM process to assure quick feedback to the Incident Management process. In most 24×7 environments that I have seen, Problem Management is a daytime job. With PM being responsible for all root cause analysis, you may need it 24×7.
Most problems are caused by organizational issues. The most common one is “process is not adhered to”. One of these problems can have many different effects or symptoms. It is also not uncommon for an incident to occur due to two or more problems occurring at the same time and interfering with each other. The root cause may be that the service / system was not designed to handle multiple types of failures at the same time. For Incident Management we would still need the direct cause of our service outage. To illustrate this, have a look at the following fault tree analysis:
For Incident Management we will need the marked part of the FTA in order to find a good restoration process. This restoration process should be documented as a workaround and executed to solve the incident. In Problem Management you would address all the nodes of the FTA and find countermeasures for all leaf nodes. Basically, every node in the FTA can be viewed as a problem.
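To make the split between the two views concrete, here is a small Python sketch of a fault tree. The tree contents are invented for illustration and are not the FTA from the figure. The names `FaultNode`, `direct_cause`, `marked_nodes` and `leaf_nodes` are hypothetical as well. The `direct_cause` flag stands in for the "marked part" that Incident Management needs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FaultNode:
    name: str
    children: List["FaultNode"] = field(default_factory=list)
    direct_cause: bool = False  # True for the marked part used by Incident Management


def marked_nodes(node: FaultNode) -> List[FaultNode]:
    """Incident Management view: the nodes needed to pick a restoration process."""
    found = [node] if node.direct_cause else []
    for child in node.children:
        found.extend(marked_nodes(child))
    return found


def leaf_nodes(node: FaultNode) -> List[FaultNode]:
    """Problem Management view: every leaf needs a countermeasure."""
    if not node.children:
        return [node]
    leaves: List[FaultNode] = []
    for child in node.children:
        leaves.extend(leaf_nodes(child))
    return leaves


# Invented example tree (assumption, not the original figure).
tree = FaultNode("Service outage", direct_cause=True, children=[
    FaultNode("Disk full on server", direct_cause=True, children=[
        FaultNode("Log rotation not configured", children=[
            FaultNode("Process is not adhered to"),
        ]),
        FaultNode("No capacity monitoring"),
    ]),
])

print([n.name for n in marked_nodes(tree)])
print([n.name for n in leaf_nodes(tree)])
```

In this invented tree the marked nodes lead to a workaround such as adding disk space, while the leaves ("Process is not adhered to", "No capacity monitoring") are where Problem Management would look for countermeasures.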
In a later post I will question the original ITIL definition of a problem – “The unknown cause of one or more incidents”.
1: Sometimes this process may lose valuable information that is required for root cause analysis. Modern technology, like virtualization and SAN disks, and good recovery scripts can reduce this risk by making a backup copy of the failed environment.