One of the most misused features in current incident management tools is the pending state (sometimes also called time out). When someone working on an incident decides that the SLA is in danger, they can easily avoid escalation (and often an SLA failure in the report) by setting the state to pending. This effectively stops the SLA clock [1]: the time from the start of the pending state to its end is subtracted from the calculated incident SLA time. Quite often the reporting reuses the pre-calculated and evaluated SLA timings these systems provide, so nobody will notice when an incident misses its service level objective by 1000%.
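To make the arithmetic concrete, here is a minimal Python sketch of how such a calculation might work. The function name `sla_elapsed`, the 4-hour target, and the timestamps are all assumptions for illustration; your tool may calculate this differently (for example using business hours).

```python
from datetime import datetime, timedelta

def sla_elapsed(opened, resolved, pending_periods):
    """Return (raw, adjusted) elapsed time for one incident.

    raw      = wall-clock time from open to resolve
    adjusted = raw minus all time spent in the pending state,
               which is the value many tools compare to the SLA target
    """
    raw = resolved - opened
    paused = sum((end - start for start, end in pending_periods), timedelta())
    return raw, raw - paused

# Hypothetical incident: open for 44 hours, 40 of them in pending.
opened = datetime(2024, 1, 8, 9, 0)
resolved = datetime(2024, 1, 10, 5, 0)
pending = [(datetime(2024, 1, 8, 12, 0), datetime(2024, 1, 10, 4, 0))]

target = timedelta(hours=4)  # assumed resolution target
raw, adjusted = sla_elapsed(opened, resolved, pending)

print(adjusted <= target)  # True: the tool reports the SLA as met
print(raw / target)        # 11.0: the user waited 11 times the target
```

A report built only on the adjusted value shows this incident as on time, while the raw value shows the objective was missed by 1000%. Keeping both numbers makes the difference visible.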
So should we remove the capability of pending states? Should we all alter the given functionality and remove any time outs? My answer would be no.
So do we have to change the way we use this functionality? Yes, now we are talking!
I suggest the following for time outs (my preferred term; pending is not exact enough):
- Define possible time out reasons in your SLA
Get the attention of your client or business to the possible reasons for delays. Make sure these are well understood and agreed upon.
- Make sure the time out reasons are the things out of your control
Everything you cannot control or guarantee should be listed as a time out reason. The first item will help keep this list short, since the business will not like too many exceptions.
- Provide the reason for each individual time out
Have the support personnel select one of the reasons you listed in your SLA (even if there is only one). Log this in a reportable way and require an additional free text field for details.
- Allow only limited time outs
Indefinite time outs lead to indefinitely delayed incident resolution. If you stop escalations for eternity, they will not come back by themselves.
- Inform the user about time outs
Send an automated mail about the time out or show the reason in your web support tool for the user to read. This manages the expectations and also reduces the possibility of pending fraud.
- Generate & review reports on time outs
Report on the delay per time out reason and check for ways of reducing this. Share this report with the business and jointly think about improvement options.
- Define the influence on escalations & SLAs separately
There may be situations where you have a defined action plan (e.g. vendor has to build a patch) which will take longer than your SLA. You may then either omit or reduce the escalations while still reporting the service level miss.
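The points above could be sketched in code roughly as follows. This is only an illustration under assumptions: the reason names, the maximum durations, and the helpers `open_time_out` and `delay_per_reason` are hypothetical and not taken from any particular tool. The real list of reasons and limits has to come from your agreed SLA.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical reasons and limits; replace with what your SLA defines.
MAX_TIMEOUT = {
    "waiting_for_user": timedelta(days=3),
    "agreed_delay": timedelta(days=5),
    "waiting_for_external_service": timedelta(days=10),
}

@dataclass
class TimeOut:
    incident_id: str
    reason: str
    details: str
    start: datetime
    end: datetime

def open_time_out(incident_id, reason, details, start, requested):
    """Create a time out only if it follows the rules listed above."""
    if reason not in MAX_TIMEOUT:
        raise ValueError(f"reason {reason!r} is not listed in the SLA")
    if not details.strip():
        raise ValueError("free text details are required")
    # Cap the duration so escalations resume at a known point in time.
    duration = min(requested, MAX_TIMEOUT[reason])
    return TimeOut(incident_id, reason, details, start, start + duration)

def delay_per_reason(time_outs):
    """Sum the time out duration per reason for the review report."""
    totals = defaultdict(timedelta)
    for t in time_outs:
        totals[t.reason] += t.end - t.start
    return dict(totals)

start = datetime(2024, 1, 8, 9, 0)
log = [
    open_time_out("INC-1", "waiting_for_user", "User on leave", start, timedelta(days=2)),
    # Requested 7 days, but the assumed limit caps it at 3 days.
    open_time_out("INC-2", "waiting_for_user", "No reply to calls", start, timedelta(days=7)),
    open_time_out("INC-3", "agreed_delay", "User is presenting", start, timedelta(days=1)),
]
report = delay_per_reason(log)
print(report["waiting_for_user"])  # 5 days, 0:00:00
```

The time out is stored as its own record rather than as a correction to the SLA timing, so the escalation logic and the service level report can each decide separately how to treat it.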
What kinds of time out reasons can I imagine? Well, that depends on your situation and relationship with the business.
- A very common reason is that you need the customer / user to do something and they are not available.
- Another common one is “Agreed upon delay”. Sometimes the SLA forces me to fix an issue within a few hours while the customer / user does not want it fixed right now (e.g. a battery issue with a laptop, while the user needs to go and present right away).
- If your business is responsible for third-party contracts and IT has no (or not enough) influence on external support, another reason may be “Waiting for external service”.
- I have even seen a reason within an SLA stating that restoration of data is not part of the restoration of service, since the amount of data influences the restore time [2].
This may seem like a lot of effort for a few simple process exceptions, but since they occur quite often it may be useful to think about this a bit more.
1: This is true for many incident workflow systems and in many configurations. It may or may not be the case for your individual setup, so please check it before relying on this information.
2: I would never suggest such a time out; it is just an example. IT should design its restore capability to match the restore requirements and not the other way around.