I would like to start a short series of blog posts, each covering the main details of one IT service management process. Since many (many, many, …) people have already blogged about change & configuration management, I want to go back to one of the oldest processes: fighting the fires.
Goal of Incident Management
Incidents are disruptions of the service that you provide to your users (defined either directly by an SLA or indirectly by general acceptance). The goal of incident management is to reduce the impact on your customers (the business) and to restore the original service quality as quickly as possible.
These disruptions are often noticeable by users (service unavailable), but this is not a requirement for an incident. Some have argued that an infrastructure failure that does not directly create a noticeable effect is not an incident. I do not share that opinion, since an increased risk of a service failure also affects the promised quality of the service. If one disk of a mirrored OS disk pair is broken, this has no direct impact on the service provided by the application running on the server. But it changes the risks involved in operating the environment.
The Principles of Incident Management
- Every incident needs to be logged
If you do not log all incidents, you will never be able to explain what you are spending all your resources on. Worse, your reports will only show the incidents that you did not solve quickly, and you will look bad.
- Incident management requires well defined and agreed targets
Your incident process needs to be well accepted and each phase of the process should have targets that the support roles have to meet. This includes operating / support hours, response times, time to fix and similar concepts.
- Incidents may only be closed if the resolution is verified
Many organizations tend to close incidents when they believe the issue is gone ("I installed the patch that *should* fix it"). You may use a well defined waiting period (e.g. 2 business days) as verification that everything now runs well, but then you really have to log any issues found during that period against the same incident.
- Incidents are not a goal in themselves
Be aware that targets for incident management may become harder to meet as your IT organization matures. For example, your KPI for the 1st level resolution rate will decline when your problem management process kicks in, because the recurring, easily solved incidents disappear. This is normal and intended.
Activities of Incident Management
The normal flow of activities will be along the following phases.
- Identify / discover incidents
This may sound trivial, but it is worthwhile to think about the ways your organization finds out about incidents. The customer’s information should not be your only source. An incident is only discovered once one of your staff is informed about it. An entry in a database does not qualify here, unless automated actions apply.
- Log the incident
Although it is noted as the second action, this may also occur before the incident is really identified.
- Classify the incident
Very often this is done during logging, but for automatically logged incidents the classification may need manual interpretation and improvement. The classification of the incident drives the prioritization. See below for more details on this.
- Diagnosis & Analysis
First level diagnosis is done by checking the known error database generated by problem management. Second level diagnosis is done by production specialists with proper access to the production environment. They should compare the current configuration of the environment with the planned configuration & architecture. Third level diagnosis is done by the engineers responsible for the design and/or development of the affected service or technology, who check for design flaws or program bugs. The last level is the original manufacturer (if available).
- Resolution & recovery
The solution for the incident may require a change to the environment and/or recovery of the affected systems. If a change is needed, incident management should log an RfC (possibly an emergency RfC).
- Verification & closure
The positive result of the resolution needs to be verified and accepted by the customer. For low to medium impact incidents this may be achieved by leaving the incident in a “resolved” state for a defined period and then closing it if the customer does not request otherwise.
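The verification rule in the last step can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `Incident` fields, the state names and the `try_close` function are hypothetical, and the waiting period is counted in calendar days rather than business days to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    impact: str                          # assumed scale: "low", "medium", "high"
    state: str = "open"                  # assumed states: open, resolved, closed
    resolved_at: Optional[datetime] = None
    customer_confirmed: bool = False     # customer accepted the resolution
    customer_reopened: bool = False      # customer reported it is not fixed


def try_close(incident: Incident, now: datetime,
              waiting_period: timedelta = timedelta(days=2)) -> bool:
    """Close a resolved incident only once the resolution counts as verified."""
    if incident.state != "resolved" or incident.resolved_at is None:
        return False
    if incident.customer_reopened:
        # Issues found during the waiting period go back onto the same incident.
        incident.state = "open"
        return False
    waited_long_enough = now - incident.resolved_at >= waiting_period
    silent_acceptance = incident.impact in ("low", "medium") and waited_long_enough
    if incident.customer_confirmed or silent_acceptance:
        incident.state = "closed"
        return True
    return False
```

High impact incidents never close silently in this sketch; they wait for an explicit confirmation from the customer.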
During these phases the incident may be subjected to the following activities outside of the “normal” flow.
- Functional escalation
Steps 3-5 (& 6) may be performed iteratively during the incident resolution process. When one support level is not able to provide a solution, or if other criteria are reached (e.g. the first level should not try to resolve an incident for more than 15 minutes), the next level of support is activated.
- Functional assignment / Re-classification
During classification and analysis the specialists will form an educated guess about the reasons for the incident. Sometimes this guess is not immediately right. If the origin of the incident lies in a different environment or responsibility, the incident will be re-assigned. This can be part of the normal resolution process (e.g. the 3rd level engineer finds the solution, the 2nd level production specialist implements it), or it may mean that the current classification is wrong. In that case, please make sure that the classification is changed (e.g. if possible, do not allow crosswise assignments without verifying the classification).
- Hierarchic escalation
This is the request for additional assistance to speed up the resolution process if it is in danger of breaching service levels. It may also be used to inform upper management about major incidents before the client's management informs them.
- Reference to problem management
During the diagnosis phase we may find that an existing problem record (or known error) is the cause of the current incident. If the problem record documents a workaround, we may implement it. Otherwise it may be advisable to cease further investigation of this individual incident and just link it to the existing problem. The incident should be resumed or resolved once the problem's cause, workaround or solution becomes available.
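As an illustration of the functional escalation rule from the list above, here is a minimal Python sketch. Only the 15 minute limit for the first level comes from the text; the level names, the other time limits and the `next_level` function are assumptions made for the example.

```python
from datetime import timedelta

# Support levels in escalation order, following the diagnosis levels above.
SUPPORT_LEVELS = ["1st level", "2nd level", "3rd level", "manufacturer"]

# Maximum time a level should spend before handing over.
# Only the 15 minutes are from the text; the other values are made up.
MAX_TIME_AT_LEVEL = {
    "1st level": timedelta(minutes=15),
    "2nd level": timedelta(hours=2),
    "3rd level": timedelta(hours=8),
}


def next_level(current: str, time_spent: timedelta, solution_found: bool) -> str:
    """Return the support level that should work on the incident next."""
    if solution_found:
        return current
    limit = MAX_TIME_AT_LEVEL.get(current)
    if limit is None or time_spent < limit:
        # Either the last level (no limit defined) or still within the limit.
        return current
    index = SUPPORT_LEVELS.index(current)
    return SUPPORT_LEVELS[min(index + 1, len(SUPPORT_LEVELS) - 1)]
```

A real process would usually add further criteria, for example escalating immediately when the current level knows it cannot help.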
Attributes of an incident
ITIL® does define a set of attributes an incident should have. In my opinion some special thought is needed about category, CI, impact, priority, urgency and closure category.
Category & CI
Very often incident management categorization mixes different data into one large and complicated category tree. Any expert on data management will tell you that it is not efficient to mix data values with different meanings. My requirements for categorization would therefore be:
- What is affected? This should be a reference to the service(s) impacted by the incident.
- What is the cause? This should be a reference to the CI that is (probably) causing the incident. If the exact CI cannot (yet) be defined exactly, you should use the CI type categorization as a temporary replacement. When the incident is closed a real CI should be selected. It could (and should) also be possible to select a change as the cause of the current incident.
- What is the resolution time? What are the other targets for this incident? This depends on the affected services and should be realized by a reference to the service level targets or objectives relevant for the incident (this may be quite a list!).
- Who should be working on this incident? This should be a reference to the relevant incident model.
- What kind of incident is reported? Yes, this sounds like a category, but it should not be a multi-level deep selection tree. A few distinct types, depending on the impacted services and/or the causing CI, are enough.
The process of logging an incident may not be rocket science, but the agents in your service desk and the operators need to be able to pick the right classification in a situation where their focus is on solving the issue at hand and helping your customer.
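To make the separation of these questions more concrete, here is a minimal Python sketch of a classification record that keeps each answer in its own field instead of one category tree. All field names and the `closure_problems` check are hypothetical; they only illustrate the requirements listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentClassification:
    affected_services: List[str]            # what is affected?
    incident_type: str                      # one of a few flat types, no tree
    incident_model: str                     # who should be working on it?
    ci_type: Optional[str] = None           # temporary stand-in for the cause
    causing_ci: Optional[str] = None        # the concrete CI, once known
    causing_change: Optional[str] = None    # a change may also be the cause
    service_level_targets: List[str] = field(default_factory=list)

    def closure_problems(self) -> List[str]:
        """List what is still missing before the incident may be closed."""
        problems = []
        if not self.affected_services:
            problems.append("no affected service referenced")
        if self.causing_ci is None:
            problems.append("no real CI selected as cause")
        return problems
```

While the incident is being worked on, `ci_type` alone is sufficient; the check only insists on a real CI at closure time.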
Impact, Urgency & Priority
Many installations of incident management systems use a very simple scheme of impact x urgency = priority = reaction/resolution times. This does not reflect real life. The service level objectives (or targets) are defined in the service level agreement that covers the service(s) affected by the incident. These SLAs may define different objectives for different impact, urgency or priority levels, but this is up to the SLA. Additional timings may come from your OLAs. So there is not just one simple relationship, but a more complex one.
Most systems also allow a simple “pending” state that stops the timings of an incident. This is also unrealistic. A better environment keeps a list of “time outs” along with their start and end date and a reason for each time out. Each service level objective in turn should specify which time out reasons affect its measurement.
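A small Python sketch may help to show the difference. Instead of one global priority table, the targets are looked up in the SLA of each affected service. The service names, priorities and durations below are invented for the example, and `resolution_targets` is a hypothetical helper.

```python
from datetime import timedelta
from typing import Dict, List

# Resolution targets per service and priority, as each SLA defines them.
# All names and values here are made-up sample data.
SLA_RESOLUTION_TARGETS: Dict[str, Dict[int, timedelta]] = {
    "email":   {1: timedelta(hours=4), 2: timedelta(hours=8)},
    "payroll": {1: timedelta(hours=2), 2: timedelta(hours=24)},
}


def resolution_targets(affected_services: List[str],
                       priority: int) -> Dict[str, timedelta]:
    """Collect the resolution target of every affected service's SLA."""
    targets = {}
    for service in affected_services:
        sla = SLA_RESOLUTION_TARGETS.get(service, {})
        if priority in sla:
            targets[service] = sla[priority]
    return targets


targets = resolution_targets(["email", "payroll"], priority=1)
strictest = min(targets.values())   # the target that will be breached first
```

The same incident therefore carries a list of targets, and OLA timings could be added to that list in the same way.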
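The time out idea can be sketched as follows. The class names, the field names and the reasons are assumptions for illustration, and the sketch assumes that time outs do not overlap each other.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Iterable


@dataclass
class TimeOut:
    start: datetime
    end: datetime
    reason: str


@dataclass
class ServiceLevelObjective:
    name: str
    target: timedelta
    pausing_reasons: FrozenSet[str]   # time out reasons that stop this clock


def measured_time(opened: datetime, now: datetime,
                  time_outs: Iterable[TimeOut],
                  slo: ServiceLevelObjective) -> timedelta:
    """Elapsed time for one objective, minus the time outs that apply to it."""
    elapsed = now - opened
    for time_out in time_outs:
        if time_out.reason not in slo.pausing_reasons:
            continue
        # Only count the part of the time out inside the measured window.
        start = max(time_out.start, opened)
        end = min(time_out.end, now)
        if end > start:
            elapsed -= end - start
    return elapsed
```

With this structure the same incident can be within target for one objective and in breach of another, which a single “pending” flag cannot express.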
Closure Category
One error is calling it a closure category. It should be the solution category. This categorization may have two different meanings. One is to correct the categorization (and other incident attributes) so that it reflects what is actually known once the incident is resolved. The other is the type of solution applied to this incident. In many cases this would be a reference to a problem. When defining other types of values, think about why you want to log them. If you plan to analyze solution data, you may want to reduce this to a short list of common options.
Useful System Functionalities
Modeling – Modern IT service management systems provide workflow functionality such as job plans or work orders for some of the processes. This is mostly used for change management. The same or similar functionality can also be used to define incident models, which limit the free choice of 1st, 2nd or 3rd level teams in incident assignment. This reduces misrouting and ping-pong routing, and it improves the data quality of the incident by forcing users to focus on the cause and the resolution process instead of on whom to assign the ticket to.
Quick incidents based on problem management – A tight integration between problem and incident management allows quick and easy incident creation. During the logging process, offer a choice of the most frequently occurring problems for the user to select. Selecting a problem will fill out most of the data needed for incident creation and immediately create a link between the incident and the problem. An existing workaround can then be applied.
Matching problems – During the classification process always show a list of the most probable problems matching your current classification. This can be based on service, classification and CI data.
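A very simple way to build such a list is to count how many of the three attributes agree. The following Python sketch assumes a hypothetical `Problem` record and an unweighted score; a real system would probably weight the attributes differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Problem:
    id: str
    service: str
    incident_type: str
    ci: str
    workaround: Optional[str] = None


def matching_problems(problems: List[Problem], service: str,
                      incident_type: str, ci: Optional[str] = None,
                      limit: int = 5) -> List[Problem]:
    """Rank known problems by how many classification attributes match."""
    scored = []
    for problem in problems:
        score = (
            (problem.service == service)
            + (problem.incident_type == incident_type)
            + (ci is not None and problem.ci == ci)
        )
        if score > 0:
            scored.append((score, problem))
    # The sort is stable, so problems with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [problem for _, problem in scored[:limit]]
```

Because the CI is optional, the list can already be shown while only the service and the incident type are known, and it narrows down as the classification gets more precise.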
No separate solution database – This may sound strange as a useful functionality, but adding a solution database reduces the usage of the problem management process, which should be the one and only source for the management of solutions.
Even though the attention is often on other processes, there is quite some room for improvement in most incident management processes. If you have further ideas and suggestions, please leave a comment.