Yesterday I posted about The Past, the Present and the Future in IT service management, and my stats showed no reads. Lesson learned: do not try to put something up in the blogging sphere when Apple is pushing a new product.

In a discussion I was looking for a way to distinguish Events, Incidents and Problems. I think the distinction in ITIL® is a bit murky, so I refined it into a simple, easy-to-remember definition:

The Past


The past is what has happened. It does not mean you have to do something, but it can result in actions. You have to decide which actions – if any – have to be taken. This is the definition of Events.

The Present


This is what impacts the business right now. The present is the reason IT people are shouted at. You have to make sure that the present pain is relieved as quickly as possible (RIGHT NOW). If your present is not hurtful, there usually is no awareness of the present (in IT). IT is only noticed when it is causing trouble. This is the definition of an Incident.

The Future


This is where we all want to be. We believe the future to be a bright and shiny place where there are no troubles and everything is working in synch (probably fully automated). As for the current pains, we have to make sure that they do not trouble our nice image of the future. Getting there is dealing with Problems.

Everything starts in the past with an event. To me an event can be an alert from monitoring (CPU load high, connection dropped, process aborted, job failed, …) or a call from a customer. Both report that something has happened that may have an impact. The questions to ask are:


  • Does it have an impact on the service now? If yes, open an incident (or correlate one).

  • Do we fear this may happen again, and do we need to avoid it in the future? If yes, open a problem record (or correlate one).

  • Do we have a standard service option that covers the actions the event requires? If yes, open a service request.

  • Do we need to modify the environment to carry out the actions the event requires? If yes, open a request for change [1].

Each of these actions may be run in parallel based upon one or more events.
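
The four questions can be sketched as a small triage function in Python. This is only an illustration: the `Event` fields and the `Record` names are hypothetical and do not come from ITIL or from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Record(Enum):
    INCIDENT = "incident"
    PROBLEM = "problem record"
    SERVICE_REQUEST = "service request"
    CHANGE = "request for change"


@dataclass
class Event:
    # Hypothetical fields, one per question in the list above.
    description: str
    impacts_service_now: bool = False
    may_recur: bool = False
    standard_option_exists: bool = False
    needs_environment_change: bool = False


def triage(event: Event) -> set[Record]:
    """Return every record type to open (or correlate) for one event."""
    records = set()
    if event.impacts_service_now:
        records.add(Record.INCIDENT)
    if event.may_recur:
        records.add(Record.PROBLEM)
    if event.standard_option_exists:
        records.add(Record.SERVICE_REQUEST)
    if event.needs_environment_change:
        records.add(Record.CHANGE)
    return records


job_failed = Event("job failed", impacts_service_now=True, may_recur=True)
to_open = triage(job_failed)
```

Because the checks are independent of each other, a single event can lead to several records at once, and an event that answers "no" to every question leads to none.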



1: This is a bit of a legacy option to me. There should not be many changes created based solely on events. They should be made into standard service requests quickly. If they are not we do have an impact.

Posted by: buzina | January 26, 2010

Time to fix vs. Availability SLO

Recently I have been working on the implementation of an agreed-upon service level agreement. It contained several different quality levels (I will refer to them as high, medium and low in this post), each with a different service window, time to fix and availability service level objective.

I distilled an example out of the original values to avoid recognition:

Service Level | Service Window    | Time to Fix | Availability
High          | 24 x 7            | 1 h         | 99,9 %
Medium        | Mo-Sa 06:00-22:00 | 4 h         | 99,0 %
Low           | Mo-Fr 08:00-18:00 | 20 h        | 95,0 %

Sounds reasonable, doesn’t it?

So let us have a more detailed look at what these figures really mean. The high version states 99,9% availability out of 24 x 7. Per month this gets down to 99,9% of 720 hours. So how long can we permit to be down in total? The answer is

0,72 hours or 43 minutes and 12 seconds

So why did we state that the time to fix is one hour? If we adhere to the time to fix SLO, we may still miss the availability SLO even with a single outage!

For the medium SLA we have a 99% availability out of Mo-Sa 06:00 – 22:00, which is approx. 432 hours per month, leading to a maximum downtime of 4,32 hours, or about 4 hours and 19 minutes. This is closer to the mark.

But the greatest danger is in the low SLA. It states 95% out of Mo-Fr 08:00 – 18:00, which reduces the service time per month to 225 hours and results in a maximum down time of 11 hours and 15 minutes. This is little more than half of the SLO for time to fix!
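
The comparison can be reproduced with a few lines of Python. This is a minimal sketch, assuming the approximate monthly service hours used in this post (720, 432 and 225); the function names are made up for illustration.

```python
from decimal import Decimal

# (service hours per month, availability in %, time to fix in hours)
# The monthly hours are the approximations used in the text.
SERVICE_LEVELS = {
    "high": (720, Decimal("99.9"), 1),
    "medium": (432, Decimal("99.0"), 4),
    "low": (225, Decimal("95.0"), 20),
}


def allowed_downtime_minutes(service_hours, availability_percent):
    """Maximum downtime per month (in minutes) that still meets availability."""
    return Decimal(service_hours) * 60 * (100 - availability_percent) / 100


def is_consistent(service_hours, availability_percent, time_to_fix_hours):
    """True if a single outage using the full time to fix still meets availability."""
    budget = allowed_downtime_minutes(service_hours, availability_percent)
    return time_to_fix_hours * 60 <= budget


for name, (hours, availability, time_to_fix) in SERVICE_LEVELS.items():
    minutes = allowed_downtime_minutes(hours, availability)
    print(name, minutes, is_consistent(hours, availability, time_to_fix))
```

With these inputs the high and low levels fail the check, while the medium level passes. Decimal is used instead of float so that 99,9 % does not pick up rounding noise.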

What is my conclusion out of this? If you combine service level objectives you have to be very careful about what you do, since you may produce useless SLO targets and misguide people reading your service levels. It is much easier to understand what targets like time to fix mean for my job than a value of 95% availability, which may mean all or nothing. If you create SLAs, please make them consistent and logical without such traps. If you consume SLAs, verify everything and do not take logic for granted.

Posted by: buzina | January 25, 2010

Time Out or Pending state

One of the most misused functions in current incident management tools is the pending state (sometimes also called time out). When someone working on an incident decides that the SLA is in danger, they can easily avoid escalation (and often an SLA failure in the report) by setting the state to pending. This effectively stops the SLA clock [1]: the time from the start of the pending state to its end is subtracted from the calculated incident SLA timings. Quite often the reporting reuses the pre-calculated and evaluated SLA timings these systems provide, so nobody will notice when an incident misses its service level objective by 1000%.
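
To make the effect concrete, here is a minimal Python sketch of that subtraction. The function name, the timestamps and the 4 hour target are assumptions for illustration; real tools differ and usually also take service windows into account. The sketch assumes pending periods do not overlap.

```python
from datetime import datetime, timedelta


def sla_durations(opened, resolved, pending_periods):
    """Return (gross, net) resolution time; net excludes all pending periods."""
    gross = resolved - opened
    paused = sum((end - start for start, end in pending_periods), timedelta())
    return gross, gross - paused


target = timedelta(hours=4)  # assumed time to fix
opened = datetime(2010, 1, 25, 8, 0)
resolved = datetime(2010, 1, 27, 4, 0)
pending = [(datetime(2010, 1, 25, 11, 0), datetime(2010, 1, 27, 3, 0))]

gross, net = sla_durations(opened, resolved, pending)
met_in_report = net <= target    # what a report based on net time shows
met_for_user = gross <= target   # what the user actually experienced
```

In this example the net time is exactly the 4 hour target, so the report shows the SLA as met, while the user waited 44 hours.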

So should we remove the capability of pending states? Should we all alter the given functionality and remove any time outs? No would be my answer.

So do we have to change the way we use this functionality? Yes, now we are talking!

I suggest the following for time outs (my preferred terminology, pending is not exact enough):

  1. Define possible time out reasons in your SLA
    Get the attention of your client or business to the possible reasons for delays. Make sure these are well understood and agreed upon.

  2. Make sure the time out reasons are the things out of your control
    All things you cannot control or guarantee should be listed as time out reasons. The first item will help in keeping this list short, since the business will not like too many exceptions.

  3. Provide the reason of each individual time out
    Have the support personnel select one of the reasons you listed in your SLA (even if it is only one). Log this in a reportable way and require an additional free text field for details.

  4. Allow only limited time outs
    Indefinite time outs lead to indefinitely delayed incident resolution. If you stop escalations for eternity, they will not return by themselves.

  5. Inform the user about time outs
    Send an automated mail about the time out or show the reason in your web support tool for the user to read. This manages the expectations and also reduces the possibility of pending fraud.

  6. Generate & review reports on time outs
    Report on the delay per time out reason and check for ways of reducing this. Share this report with the business and jointly think about improvement options.

  7. Define the influence on escalations & SLAs separately
    There may be situations where you have a defined action plan (e.g. vendor has to build a patch) which will take longer than your SLA. You may then either omit or reduce the escalations while still reporting the service level miss.
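
Several of these rules can be enforced by the tool itself. The following Python sketch shows one way to do so; the reason list, the 24 hour limit and all names are hypothetical, and in practice they would come from your agreed SLA and your incident management system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical reason list; in practice it is defined in the agreed SLA.
ALLOWED_REASONS = {
    "waiting for user",
    "agreed upon delay",
    "waiting for external service",
}
MAX_TIME_OUT = timedelta(hours=24)  # assumed limit, not taken from any SLA


@dataclass
class TimeOut:
    reason: str
    details: str
    start: datetime
    end: datetime

    def __post_init__(self):
        # Rule 3: the reason must be one of those listed in the SLA,
        # and the free text details are mandatory.
        if self.reason not in ALLOWED_REASONS:
            raise ValueError(f"reason not defined in the SLA: {self.reason!r}")
        if not self.details.strip():
            raise ValueError("free text details are required")
        # Rule 4: only limited time outs are allowed.
        if self.end <= self.start:
            raise ValueError("a time out must end after it starts")
        if self.end - self.start > MAX_TIME_OUT:
            raise ValueError("time out exceeds the allowed maximum")


def delay_per_reason(time_outs):
    """Rule 6: sum the delay per time out reason for reporting."""
    totals = {}
    for t in time_outs:
        totals[t.reason] = totals.get(t.reason, timedelta()) + (t.end - t.start)
    return totals
```

Rejecting an invalid time out at creation keeps the later report trustworthy, because every logged delay then carries an agreed reason and a bounded duration.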

What kinds of time out reasons can I imagine? Well, that depends on your situation and relationship with the business.


  • A very common reason is that you need the customer / user to do something and he/she is not available.
  • Another common one is “Agreed upon delay”. Sometimes the SLA forces me to fix this within a few hours while the customer / user does not want it to be fixed (e.g. battery issue with a laptop, while the user needs to go and present right now).
  • If your business is responsible for third-party contracts and IT has no influence (or not enough) on external support another reason may be “Waiting for external service”.
  • I have even seen a reason within an SLA stating that restoration of data is not part of the restoration of service, since the amount of data influences this [2].

This may seem like a lot of effort for a few simple process exceptions, but since they occur quite often it may be useful to think about this a bit more.


1: This is true for many incident workflow systems and in many configurations. It may or may not be the case for your individual setup, so please check it before relying on this information.
2: I would never suggest such a time out, it is just a sample. IT should design its restore capability to match the restore requirements and not the other way around.

Posted by: buzina | January 18, 2010

Root cause vs. cause analysis

After a wonderful skiing holiday in Aspen last week I am back thinking about problem management. My post Incident and Problem Management revisited and the ITSkeptic on root cause re-triggered my thoughts on the relationship between problems and causes.

ITIL v3:

ITIL defines a ‘problem’ as the unknown cause of one or more incidents.

ISO 20000:

problem
unknown underlying cause of one or more incidents

So according to these definitions there is a one-to-one relationship between a cause and a problem, only differing on the word underlying. Both do agree that there has to be at least one incident and also that the cause has to be unknown. As soon as the cause is known, it will become a known error. ITIL defines a known error as follows:

A Problem that has a documented Root Cause and a Workaround…

This means that a known error still is a problem. So we can skip the word unknown in the definitions above.

This is not enough to me. Since I am a big fan of proactive problem management (which somehow is neglected in the current ITIL version) I would like to create problems in terms of potential incidents. To support this, let us have a look at the last sentence in the ITIL known error definition:

… Known Errors may also be identified by Development or Suppliers.

As known errors are a subset of problems, namely those with a documented root cause (and a workaround), this indicates that problems may exist without being the cause of a current incident: developers will, for example, provide information on defects in their software which may cause incidents in the future. This results in the following definition:

A problem is a cause of current or future incidents.
Read More…

Posted by: buzina | January 5, 2010

Incident and Problem Management revisited

A discussion on the itskeptic started me thinking about the ways I believed Incident Management and Problem Management should interact and what is in which domain. You may read the original discussion here (it again has links to a LinkedIn discussion, other posts on the itskeptic blog etc.).

The question was whether a problem record should be created in the early stages of the incident, after incident resolution when it does not refer to an existing problem / known error record, or at any other “freakin” time. Even though the initial discussion was focused on major incidents, the implications were greater. We then delved into the question of whether 2nd and 3rd Level support is really needed for Incident Management. My first thought was to answer: Of course we still need 2nd & 3rd Level. Then I started thinking about the real objectives of Incident and Problem Management again:

Incident Management
Incident management's goal is to reduce the impact on your customers (the business) and to restore the original service quality as quickly as possible.

Problem Management
The goal of problem management is to reduce the number and/or impact of incidents occurring.

Read More…

Posted by: buzina | December 7, 2009

ITSM – The Game

We just published the means for a perfect way to spend 10 minutes or more enjoying yourself while gaining a few insights in how service management processes can help an IT organization. Go have a look at http://www.noventum.com/itsmgame/en.html (or http://www.noventum.de/itsmgame for the German version).

Screenshot ITSM The Game

This serious game (sounds serious but is fun) is a great way to show people the benefits of service management, so go have a look and tell others. I would really appreciate feedback on this, so please comment. Thanks!

Posted by: buzina | November 12, 2009

That is how teaching should be!

I have just found this video on boingboing.net, and this really is how teaching should be. Enjoy half an hour of witty, interesting knowledge about the most unique of the primates by Robert Sapolsky, renowned professor of neurology, neurological sciences, neurosurgery and biological sciences at Stanford.

Posted by: buzina | November 3, 2009

Boosting Economy (e-€uro)

Now that Germany has a new government which is not composed of the two larger blocks (CDU & SPD), I had hoped for more momentum in political progress. Unfortunately, Chancellor Merkel and Vice Chancellor Westerwelle continue their old fault of believing their voters to be extremely dumb.

They have a firm majority and strong backing for implementing change due to the current economic downturn, and what do they do? They promise to reduce tax rates in the belief that the increase in debt will be paid back by increased incomes, which would increase total tax revenue. All serious estimates of this effect will tell you that it will not return the full 100%, but will basically stop at approx. 60% of the loss. So we happily increase our national debt, which has just been hit extremely hard by the financial crisis. And for what? For reducing taxes, which is not at the heart of our problems.

So what would my proposal be? Boost productivity by setting hard-to-reach missions that a whole country (or even a continent) could rally behind. The current innovations (internet, eco-tech, bio-tech and nano-tech) should allow a country like Germany to produce a turnaround. One of the projects to improve on would be about money. Why do we still have these small printed papers (easily reproduced by modern technology) and strange round pieces of metal? These are too hard to transfer, keeping transaction costs way too high.

So Germany (along with the European Economic and Monetary Union) should create the e-Euro. The e-Euro (e€) should allow the following:

  • Quick transaction with marginal transaction costs (less than 1%) for businesses
  • No base fee for transactions, enabling small transactions (transferring 2 cents would cost no more than 0,02 cents)
  • Free transaction between individuals (when I give you a 10 € bill, there are no transaction costs)
  • Allowing anonymous online transactions up to a defined limit (a limit will be necessary for law enforcement requirements, cashing large sums at a bank will also trigger inquiries these days).
  • Make sure that the required infrastructure is in the hands of the European Central Bank (normal banks should not control cash, nor should they be able to track its usage)

The provision of a monetary system is one of the core tasks of a country. So it should provide a modern version of it. If micro payments can be made affordable, many new business models become viable. For example you would allow news to again become a paid for resource. News right now is expected to be available for free, along with some advertising. Many of us would pay a small fee for more in-depth news, either per article (1 article = 5 cent) or per time frame (e.g. 1 hour = 50 cent). With all current payment models available, this is not a suitable payment option due to the high transaction costs.

Ask-a-question sites could become a source of income for people if a small sum could be paid for a proper answer to a question. Bands would be able to host their own music download website if you could easily transfer money to them. Charities would not have to rely on an overly expensive, undemocratic and often unfair payment system (like PayPal) but could accept money directly from their donors.

My mission will probably not generate the same passion as

I believe that this nation should commit itself to achieving the goal, before this decade is out, of landing a man on the Moon and returning him safely to the Earth.

did, but maybe it could jumpstart our society into a more productive and innovative culture.

aRTessen auction flyer


Our Round Table (Table 26 in Essen) is organizing our third art auction for charity. We again support the project “Sicherer Start” (secure start) for kids and their mothers in hardship. We sponsor additional aid for new mothers by prolonging the time midwives support them. This helps keep them connected to society and avoids starting the vicious circle of neglect.

From every piece of art you buy, we donate 50% to the project, so join us on November 22nd and get yourself some interesting art! More information at http://www.aRTessen.de (sorry, only in German).
