This is part 2 of a small blog post series I am starting to write on IT service management processes. #1 is on incident management, available here.
The next logical part is to make sure that fires do not occur. At least not so often.
Goal of Problem Management
A problem is defined as the cause of one or more current or future incidents. The goal of problem management is to reduce the number and/or impact of incidents occurring. This is achieved by identifying and managing problems, and by providing solutions and/or workarounds to eliminate incidents or to reduce their impact.
This is a broader definition than the one found in ITIL® v3, which says the cause has to be unknown (thus the problem is gone from problem management as soon as we know the cause). ITIL® v3 also omits the reference to future incidents, making a problem only out of existing incidents. As a last criticism, ITIL® v3 states that the primary objective is to prevent problems (and resulting incidents) from happening. In my humble opinion, problem management will not prevent problems from occurring, but it knows how to deal with them.
The Principles of Problem Management
- Every Problem is submitted to the problem management process
If you do not forward all identified problems to problem management you create two different classes of problems. This will either harm the capabilities of your problem management or your organization will create a secondary problem process which will solve the really important problems not to be entrusted to the problem management process.
- Problem Management requires Resources
If your organization does not provide resources to problem management, it will fail miserably. If the problem manager is the only resource, she can do nothing to cure root causes, since she has no access to the technical expertise of your shop.
- Problem Management manages all Workarounds
I have seen a recurring pattern of having a knowledge or solution database located in first level support for incident management in addition to having problem management. There can be only one. If you have such a situation, you will have incorrect solutions in your knowledge database and missing problem information in problem management, and both will harm your quality more than they will improve it. So replace your solution database with the known error database.
- Problem Management provides Solutions, Workarounds and Known Causes
Play an open game in problem management. Anyone can report a problem, anyone can see the problems. You communicate clearly that problem management will work on analyzing causes, finding workarounds and then finding solutions. Problem management will (usually) not fund the implementation of solutions, but it may provide the business case for the decision to be made. Remember: Do Nothing is a valid solution to a problem.
- Problems are related to anything
The initial discovery of a problem may be due to a failed hard disk while its root cause is an issue with our governance, because an employee made a bogus contract with a supplier that he benefits from more than the company (and he was not found out). Analyzing a problem is people's work and it cannot be automated. Why do you think so many people investigate a plane crash? They do have the black box, so why did they not implement a small display which shows the cause of the crash?
Activities of Problem Management
The problem management process includes the following activities.
- Identify / discover problems
This sounds pretty easy, but identifying a common (and still unknown) cause within a set of symptoms (incidents) is quite hard. In addition to formal data crunching on incidents you should also think about having a regular meeting discussing the top issues. If people have the time to think about what happened last week they will often identify otherwise hidden patterns. Unfortunately, time is a luxury, especially in service desks and operation centers.
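The "formal data crunching" part can start very small. The following is a minimal sketch, assuming incidents are available as simple records with a CI and a category; the field names, the sample data and the threshold of three are all my assumptions for illustration, not something a particular tool gives you. It only counts recurring combinations and thereby produces candidates for a human to look at.

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": "INC-1", "ci": "mail-server-01", "category": "email"},
    {"id": "INC-2", "ci": "mail-server-01", "category": "email"},
    {"id": "INC-3", "ci": "printer-17", "category": "printing"},
    {"id": "INC-4", "ci": "mail-server-01", "category": "email"},
    {"id": "INC-5", "ci": "file-server-02", "category": "storage"},
]


def candidate_problems(incidents, min_count=3):
    """Return ((ci, category), count) pairs seen in at least min_count incidents."""
    counts = Counter((i["ci"], i["category"]) for i in incidents)
    # most_common() sorts by descending count, so the top issues come first.
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


for (ci, category), n in candidate_problems(incidents):
    print(f"{n} incidents on {ci} in category {category}")
```

A list like this is only an agenda for the regular meeting mentioned above. It shows where incidents cluster, not why.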
- Log your problem
This includes setting up the references to the incidents that are related to the problem.
I know the problem activities are very similar to those of incident management. Please make sure the classification or category scheme(s) match the incident scheme. See here for some ideas on incident classification. Prioritization is part of classification. During classification you may discover that a problem is a duplicate. You should eliminate the newer problem and forward all references to incidents to the already existing problem. If applicable you should also adapt the description and/or classification of the existing problem to make it easier to discover.
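Handling a duplicate can be sketched in a few lines. This is a minimal illustration, assuming a problem record that carries a set of incident references; the class, its fields and the status value "closed_duplicate" are hypothetical names I made up for this example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Problem:
    # Hypothetical problem record; all field names are assumptions.
    problem_id: str
    description: str
    incident_ids: Set[str] = field(default_factory=set)
    status: str = "open"
    duplicate_of: Optional[str] = None


def merge_duplicate(existing: Problem, newer: Problem) -> Problem:
    """Forward the newer problem's incident references to the existing one."""
    existing.incident_ids |= newer.incident_ids
    newer.incident_ids = set()
    # Keep a pointer on the eliminated record so nobody loses the trail.
    newer.status = "closed_duplicate"
    newer.duplicate_of = existing.problem_id
    return existing


old = Problem("PRB-1", "Disk failures on file server", {"INC-1", "INC-2"})
new = Problem("PRB-7", "File server disk errors", {"INC-2", "INC-9"})
merge_duplicate(old, new)
print(sorted(old.incident_ids))
```

Using a set for the references means an incident that was linked to both problems ends up on the surviving problem only once.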
- Root cause analysis
Apply the common techniques for getting at the real cause of the problem. Keep an eye open for possible workarounds or symptom remedies. As soon as you find a feasible workaround, test it and document it for incident management. Make sure you are open to receive information and feedback from incident management, especially when a documented workaround is fixed. You may need to change it and this will provide valuable information for your analysis.
- Solution definition
As soon as you have a root cause for your problem, you can have a go at possible solution scenarios. Now you can start defining how you want to remedy your problem. Most causes allow many ways of avoiding the issue in the future, so it is up to you to choose. During this activity problem management may generate a business case based on the data collected to promote fixing the issue for good.
- Solution implementation
This depends strongly on the solution chosen. One solution may be to replace an employee, to set up a redundant server system or to implement a software patch. Others may require a change to the operations processes, to the service architecture or to the governance regulations steering your company. Use the means required to get the job done (personnel, training, change management, projects, etc.).
- Review & Closure
Make sure your solution really solves your problem. This should be done for all problems and not only for major ones (like ITIL® v3 tells you to).
In addition to all this, problem management has a strong communication part. Problem management should report to upper management, work with its peer processes and be part of regular Ops or Service Desk meetings.
Attributes of a Problem
In addition to the attributes of an incident, a problem should contain the following:
Reference to the cause
In an ideal company it is possible to reference any process, any function or any other possible cause of a problem uniquely. As the world is not ideal we stick with references to CIs and some additional description. The more you can do here, the better your analysis will be.
Since the workaround is one of the results of the problem management process, you should pay attention to the attributes that define it. I would recommend having a textual description along with a few settings governing access to the workaround.
- Workaround applicable (Yes/No) enables the visibility of your workaround.
- Workaround tested (Yes/No) enables you to document the effectiveness of a workaround.
- Future occurrence of Incidents (Low/Medium/High) will allow a sorting of high impact workarounds for incident management. Your High occurrence workarounds should be used as incident management templates. 1st level support can select those from a list of quick picks in their service desk application, speeding up incident logging.
- Implement individually (Yes/No) defines if the current workaround immediately solves related incidents or if individual work is needed.
A problem with a workaround should contain all data necessary for incident creation and closure, so if you define a closing category, your workaround should have one as well.
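The workaround settings above map naturally onto a small record. The sketch below assumes a hypothetical Workaround class; the field names, the closing_category field and the quick_picks helper are my own illustrative choices, not the data model of any service desk product.

```python
from dataclasses import dataclass
from typing import List

# Sort order for the "Future occurrence of Incidents" setting.
OCCURRENCE_RANK = {"High": 0, "Medium": 1, "Low": 2}


@dataclass
class Workaround:
    # Hypothetical record; field names are assumptions for illustration.
    problem_id: str
    description: str
    applicable: bool              # Workaround applicable (Yes/No)
    tested: bool                  # Workaround tested (Yes/No)
    future_occurrence: str        # "Low", "Medium" or "High"
    implement_individually: bool  # Implement individually (Yes/No)
    closing_category: str         # needed to close incidents from the template


def visible_workarounds(workarounds: List[Workaround]) -> List[Workaround]:
    """Applicable workarounds, most frequently needed first."""
    visible = [w for w in workarounds if w.applicable]
    return sorted(visible, key=lambda w: OCCURRENCE_RANK[w.future_occurrence])


def quick_picks(workarounds: List[Workaround]) -> List[Workaround]:
    """High occurrence workarounds offered to 1st level support as templates."""
    return [w for w in visible_workarounds(workarounds)
            if w.future_occurrence == "High"]


catalog = [
    Workaround("PRB-1", "Restart the print spooler", True, True, "Medium", True, "printing"),
    Workaround("PRB-2", "Clear the mail client cache", True, True, "High", True, "email"),
    Workaround("PRB-3", "Use the backup VPN gateway", False, False, "High", False, "network"),
]
print([w.problem_id for w in quick_picks(catalog)])
```

Whether an untested workaround should be visible to 1st level support at all is a policy decision. This sketch does not filter on the tested flag and leaves that choice to you.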
You may want to create a status which keeps the problem open but not in the current working queues of problem management. This can also be achieved by assigning these problems to a long running working queue. You will need this to have a list of all currently used workarounds which have no final solution yet. You should review these every once in a while to keep them up to date. New software versions may eliminate existing problems even if the new version was not rolled out to solve the problem in question.
Problem management is not about technology, but about getting your organization into an improving mode. You will find a lot of resistance against problem management in most organizations. Sometimes it is misused as major incident management, where large outages are forwarded; sometimes it is plainly ignored and underfunded. Your problem management needs to be very open and communicative to avoid being blamed for issues (you knew we had this problem and did nothing to fix it!). Make sure that all problems are communicated and that your resource requirements reach the right people.
Separate problem from incident management, but make sure that all workarounds used in incident management are under problem management control (this also reduces the duplicates). Remember that successful problem management may worsen the KPI measurements of incident management: the 1st level resolution rate will go down, the effort spent per incident will increase and the average duration will increase, because the well-known incidents do not occur anymore. This improves overall quality, even if it looks as if quality in incident management goes down.
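The KPI effect is easier to believe with numbers. The figures below are invented purely for illustration: 60 well-known incidents per month that 1st level closes in 10 minutes each, plus 40 other incidents at 120 minutes each, of which 10 are solved at 1st level. The sketch compares the month before and after problem management has eliminated the well-known ones.

```python
def kpis(incidents):
    """incidents: list of (resolved_at_first_level, effort_minutes) tuples."""
    total = len(incidents)
    first_level = sum(1 for resolved, _ in incidents if resolved)
    effort = sum(minutes for _, minutes in incidents)
    return {
        "count": total,
        "first_level_rate": first_level / total,
        "avg_effort": effort / total,
        "total_effort": effort,
    }


# Invented sample data, see the assumptions stated above.
well_known = [(True, 10)] * 60
others = [(True, 120)] * 10 + [(False, 120)] * 30

before = kpis(well_known + others)  # well-known incidents still occur
after = kpis(others)                # root causes of the well-known ones are fixed

for label, k in (("before", before), ("after", after)):
    print(f"{label}: {k['count']} incidents, "
          f"{k['first_level_rate']:.0%} solved at 1st level, "
          f"{k['avg_effort']:.0f} min per incident, "
          f"{k['total_effort']} min in total")
```

With these numbers the 1st level resolution rate falls from 70% to 25% and the effort per incident rises from 54 to 120 minutes, while the incident count drops from 100 to 40 and the total effort from 5400 to 4800 minutes. The per-incident KPIs look worse although the organization is better off.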
Most fire departments prefer preventing fires over extinguishing them.