
Yesterday I posted about The Past, the Present and the Future in IT service management, and my stats showed no reads. Lesson learned: do not try to put something up in the blogosphere when Apple is pushing a new product.

In a discussion I was looking for a way to distinguish Events, Incidents and Problems. I think the distinction in ITIL® is a bit murky. So I refined it into a simple, easy-to-remember definition:
Everything starts in the past with an event. To me an event can be an alert from monitoring (CPU load high, connection dropped, process aborted, job failed, …) or a call from a customer. They report that something has happened that may have an impact. What to do is:
Each of these actions may be run in parallel based upon one or more events.
Posted in ITIL | Tags: event management, incident management, ITIL, itsm, problem management
Recently I have been working on the implementation of an agreed service level agreement. It contained several quality levels (I will refer to them as high, medium and low in this post), each with a different service window, time to fix and availability service level objective.
I distilled an example out of the original values to avoid recognition:
| Service Level | Service Window | Time to Fix | Availability |
|---|---|---|---|
| High | 24 x 7 | 1 h | 99,9 % |
| Medium | Mo-Sa 06:00-22:00 | 4 h | 99,0 % |
| Low | Mo-Fr 08:00-18:00 | 20 h | 95,0 % |
Sounds reasonable, doesn’t it?
So let us have a more detailed look at what these figures really mean. The high version states 99,9% availability out of 24 x 7. Per month this comes down to 99,9% of 720 hours. So how long are we permitted to be down in total? The answer is
0,72 hours or 43 minutes and 12 seconds
So why did we state that the time to fix is one hour? If we adhere to the time to fix SLO, we may still miss the availability SLO even with a single outage!
For the medium SLA we have a 99% availability out of Mo-Sa 06:00 – 22:00, which is approx. 432 hours per month, leading to a maximum down time of 4,32 hours, or about 4 hours and 19 minutes. This is closer to the mark, since it at least exceeds the 4 hour time to fix.
But the greatest danger is in the low SLA. It states 95% out of Mo-Fr 08:00 – 18:00, which reduces the service time per month to 225 hours and results in a maximum down time of 11 hours and 15 minutes. This is little more than half of the SLO for time to fix!
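The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the class and function names are made up for this post, and the monthly service hours (720, 432 and 225) are the approximations used in the text, not exact calendar values.

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    name: str
    service_hours_per_month: float  # hours inside the service window (approximation from the text)
    time_to_fix_hours: float        # time to fix SLO
    availability: float             # availability SLO as a fraction, e.g. 0.999

    def downtime_budget_hours(self) -> float:
        """Maximum total downtime per month that still meets the availability SLO."""
        return self.service_hours_per_month * (1.0 - self.availability)

    def is_consistent(self) -> bool:
        """True if a single outage fixed exactly on time still fits into the budget."""
        return self.time_to_fix_hours <= self.downtime_budget_hours()


def as_h_m_s(hours: float) -> tuple:
    """Convert fractional hours into (hours, minutes, seconds)."""
    total_seconds = round(hours * 3600)
    h, rest = divmod(total_seconds, 3600)
    m, s = divmod(rest, 60)
    return h, m, s


LEVELS = [
    ServiceLevel("High", 720, 1, 0.999),
    ServiceLevel("Medium", 432, 4, 0.99),
    ServiceLevel("Low", 225, 20, 0.95),
]

for level in LEVELS:
    h, m, s = as_h_m_s(level.downtime_budget_hours())
    verdict = "consistent" if level.is_consistent() else "time to fix exceeds downtime budget"
    print(f"{level.name}: {h} h {m} min {s} s allowed per month ({verdict})")
```

Running it flags the high and the low level: in both cases one outage that is fixed within the time to fix already uses up more than the monthly downtime budget.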
What is my conclusion out of this? If you combine service level objectives you have to be very careful about what you do, since you may produce useless SLO targets and misguide people reading your service levels. It is much easier to understand what targets like time to fix mean for my job than a value of 95% availability, which may mean all or nothing. If you create SLAs, please make them consistent and logical without such traps. If you consume SLAs, verify everything and do not take logic for granted.
One of the most misused features in current incident management tools is the pending state (sometimes also called time out). When someone working on the incident decides that the SLA is in danger, they can easily avoid escalation (and often SLA failure in the report) by setting the state to pending. This effectively stops the SLA clock [1]: the time between the start and the end of the pending state is subtracted from the calculated incident SLA timings. Quite often the reporting reuses the pre-calculated and evaluated SLA timings these systems provide, so nobody will notice when an incident misses its service level objective by 1000%.
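To make the effect visible, here is a minimal sketch of how such a calculation might look. It is not taken from any particular tool: the function name and the sample timestamps are invented, it counts plain calendar time (no service windows), and it assumes the pending intervals do not overlap and lie between opening and resolution.

```python
from datetime import datetime, timedelta


def sla_elapsed(opened, resolved, pending_intervals):
    """Return (gross, net) durations for an incident.

    gross: calendar time from opening to resolution, as the customer experiences it.
    net:   gross minus all pending periods, i.e. what the tool counts against the SLA.
    """
    gross = resolved - opened
    paused = sum((end - start for start, end in pending_intervals), timedelta())
    return gross, gross - paused


# Hypothetical incident: open for two days, parked in "pending" for most of that time.
opened = datetime(2010, 1, 11, 9, 0)
resolved = datetime(2010, 1, 13, 9, 0)
pending = [(datetime(2010, 1, 11, 12, 0), datetime(2010, 1, 13, 8, 0))]

gross, net = sla_elapsed(opened, resolved, pending)
print(f"customer waited: {gross}, counted against SLA: {net}")
```

A report built only on the net value would show this incident as fixed within four hours, although the customer waited two days.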
So should we remove the capability of pending states? Should we all alter the given functionality and remove any time outs? No would be my answer.
So do we have to change the way we use this functionality? Yes, now we are talking!
I suggest the following for time outs (my preferred terminology, pending is not exact enough):
What kinds of time out reasons can I imagine? Well, that depends on your situation and relationship with the business.
This may seem like a lot of effort for a few simple process exceptions, but since they occur quite often it may be useful to think about this a bit more.
1: This is true for many incident workflow systems and in many configurations. It may or may not be the case for your individual setup, so please check it before relying on this information.
2: I would never suggest such a time out, it is just a sample. IT should design its restore capability to match the restore requirements and not the other way around.
Posted in ITIL, Job | Tags: incident management, itsm, process, service level management
After a wonderful skiing holiday in Aspen last week I am back thinking about problem management. My post Incident and Problem Management revisited and the ITSkeptic on root cause re-triggered my thoughts on the relationship between problems and causes.
ITIL v3:
ITIL defines a ‘problem’ as the unknown cause of one or more incidents.
ISO 20000:
problem
unknown underlying cause of one or more incidents
So according to these definitions there is a one-to-one relationship between a cause and a problem, only differing on the word underlying. Both do agree that there has to be at least one incident and also that the cause has to be unknown. As soon as the cause is known, it will become a known error. ITIL defines a known error as follows:
A Problem that has a documented Root Cause and a Workaround…
This means that a known error still is a problem. So we can skip the word unknown in the definitions above.
This is not enough to me. Since I am a big fan of proactive problem management (which somehow is neglected in the current ITIL version) I would like to create problems in terms of potential incidents. To support this, let us have a look at the last sentence in the ITIL known error definition:
… Known Errors may also be identified by Development or Suppliers.
As known errors are the subset of problems which have a documented root cause (and a workaround), this indicates that problems may exist without being the cause of a current incident. Developers, for example, will provide information on defects in their software which may cause incidents in the future. This results in the following definition:
A problem is a cause of current or future incidents.
Posted in ITIL | Tags: incident management, ITIL, itsm, problem management, root cause
A discussion on the itskeptic started me thinking about the ways I believed Incident Management and Problem Management should interact and what is in which domain. You may read the original discussion here (it again has links to a LinkedIn discussion, other posts on the itskeptic blog etc.).
The question was whether a problem record should be created in the early stages of the incident, after incident resolution when it does not refer to an existing problem / known error record, or at any other “freakin” time. Even though the initial discussion was focused on major incidents, the implications were greater. We then delved into the question of whether 2nd and 3rd level support is really needed for Incident Management. My first thought was to answer: Of course we still need 2nd & 3rd level. Then I started thinking about the real objectives of Incident and Problem Management again:
Incident Management
Incident management’s goal is to reduce the impact on your customers (the business) and to restore the original service quality as quickly as possible.
Problem Management
The goal of problem management is to reduce the number and/or impact of incidents occurring.
Posted in ITIL | Tags: flow chart, incident management, itsm, problem management
We just published a perfect way to spend 10 minutes or more enjoying yourself while gaining a few insights into how service management processes can help an IT organization. Go have a look at http://www.noventum.com/itsmgame/en.html (or http://www.noventum.de/itsmgame for the German version).
This serious game (sounds serious but is fun) is a great way to show people the benefits of service management, so go have a look and tell others. I would really appreciate feedback on this, so please comment. Thanks!
Posted in ITIL, Job | Tags: game, itsm, serious gaming, simulation
I have just found this video on boingboing.net, and this really is how teaching should be. Enjoy half an hour of witty, interesting knowledge about the most unique of the primates by Robert Sapolsky, renowned professor of neurology, neurological sciences, neurosurgery and biological sciences at Stanford.
Now that Germany has a new government which is not composed of the two larger blocks (CDU & SPD), I would have hoped for an increase in momentum of political progress. Unfortunately Chancellor Merkel and Vice Chancellor Westerwelle continue their old mistake of believing their voters to be extremely dumb.
They have a firm majority and strong backing for implementing change due to the current economic downturn, and what do they do? They promise to reduce tax rates in the belief that the increase in debt will be paid back by increased incomes, which would increase total tax revenue. Every serious estimate of this effect will tell you that it will not return the full 100%, but will basically stop at approx. 60% of the loss. So we happily increase our national debt, which has just been hit extremely hard by the financial crisis. And for what? For reducing taxes, which is not at the heart of our problems.
So what would my proposal be? Boost productivity by setting hard-to-reach missions that a whole country (or even a continent) could rally behind. The current innovations (internet, eco-tech, bio-tech and nano-tech) should allow a country like Germany to produce a turnaround. One of the projects to improve on would be about money. Why do we still have these small printed papers (easily reproduced by modern technology) and strange round pieces of metal? These are too hard to transfer, keeping transaction costs way too high.
So Germany (along with the European Economic and Monetary Union) should create the e-Euro. The e-Euro (e€) should allow the following:
The provision of a monetary system is one of the core tasks of a country, so it should provide a modern version of it. If micro payments can be made affordable, many new business models become viable. For example, news could again become a paid-for resource. News right now is expected to be available for free, along with some advertising. Many of us would pay a small fee for more in-depth news, either per article (1 article = 5 cent) or per time frame (e.g. 1 hour = 50 cent). With the payment models currently available, this is not viable because of the high transaction costs.
Ask-a-question sites could be a source of income for people if a small sum could be paid for a proper answer to a question. Bands would be able to host their own music download website if you could easily transfer money to them. Charities would not have to rely on an overly expensive, undemocratic and often unfair payment system (like PayPal) but could accept money directly from their donors.
My mission will probably not generate the same passion as
I believe that this nation should commit itself to achieving the goal, before this decade is out, of landing a man on the Moon and returning him safely to the Earth.
did, but maybe it could jumpstart our society into a more productive and innovative culture.
From every piece of art you buy, we donate 50% to the project, so join us on November 22nd and get yourself some interesting art! More information at http://www.aRTessen.de (sorry, only in German).
Posted in Round Table | Tags: art, Charity, Essen, kids, Round Table