Recently I have been working on the implementation of an agreed-upon service level agreement (SLA). It contained several quality levels (I will refer to them as high, medium and low in this post), each with its own service window, time to fix and availability service level objectives (SLOs).
I derived the following example from the original values so that the real agreement cannot be recognized:
| Service Level | Service Window | Time to Fix | Availability |
|---|---|---|---|
| High | 24 x 7 | 1 h | 99,9 % |
| Medium | Mo-Sa 06:00-22:00 | 4 h | 99,0 % |
| Low | Mo-Fr 08:00-18:00 | 20 h | 95,0 % |
Sounds reasonable, doesn’t it?
So let us have a more detailed look at what these figures really mean. The high version states 99,9% availability out of 24 x 7. For a 30-day month this comes to 99,9% of 720 hours. So how long may the service be down in total? The answer is
0,72 hours or 43 minutes and 12 seconds
So why did we state that the time to fix is one hour? If we adhere to the time to fix SLO, we may still miss the availability SLO even with a single outage!
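For the medium SLA we have a 99% availability out of Mo-Sa 06:00 – 22:00, which is approx. 432 hours per month, leading to a maximum downtime of 4,32 hours, or about 4 hours and 19 minutes. This is closer to the mark, although a single outage that takes the full 4 hours to fix already uses up almost the entire budget.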
But the greatest danger is in the low SLA. It states 95% out of Mo-Fr 08:00 – 18:00, which reduces the service time per month to approx. 225 hours and results in a maximum downtime of 11 hours and 15 minutes. This is little more than half of the 20 hours that the time to fix SLO allows!
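The arithmetic behind these three figures can be reproduced in a few lines. This is only a minimal sketch: the function name `downtime_budget` and the `levels` dictionary are my own illustration, and the monthly service hours are the approximations used in this post.

```python
from datetime import timedelta


def downtime_budget(service_hours: float, availability_pct: float) -> timedelta:
    """Maximum total downtime per month that an availability SLO permits."""
    return timedelta(hours=service_hours * (100 - availability_pct) / 100)


# Assumption: monthly service hours as approximated in this post.
levels = {
    "High": (720, 99.9),
    "Medium": (432, 99.0),
    "Low": (225, 95.0),
}

for name, (hours, pct) in levels.items():
    # timedelta prints as H:MM:SS
    print(f"{name}: {downtime_budget(hours, pct)}")
# High: 0:43:12
# Medium: 4:19:12
# Low: 11:15:00
```

Using `timedelta` avoids the easy mistake of reading a decimal value such as 4,32 hours as 4 hours and 32 minutes.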
What do I conclude from this? If you combine service level objectives you have to be very careful about what you do, since you may produce useless SLO targets and mislead the people reading your service levels. It is much easier to understand what a target like time to fix means for my job than a value of 95% availability, which may mean all or nothing. If you create SLAs, please make them consistent and logical, without such traps. If you consume SLAs, verify everything and do not take logic for granted.
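One way to do that verification is to compare the downtime budget with the time to fix. The sketch below is only an illustration under my own assumptions: the helper `worst_case_outages` is hypothetical, and the service hours are again the monthly approximations from this post. It computes how many outages, each taking the full time to fix, fit into the monthly budget; a value below 1 means a single worst-case outage already breaks the availability SLO.

```python
def worst_case_outages(service_hours: float, availability_pct: float,
                       time_to_fix_hours: float) -> float:
    """Number of outages of full time-to-fix length that fit into the downtime budget."""
    budget_hours = service_hours * (100 - availability_pct) / 100
    return budget_hours / time_to_fix_hours


# Assumption: (monthly service hours, availability in %, time to fix in hours)
slas = {
    "High": (720, 99.9, 1),
    "Medium": (432, 99.0, 4),
    "Low": (225, 95.0, 20),
}

for name, (hours, pct, ttf) in slas.items():
    ratio = worst_case_outages(hours, pct, ttf)
    verdict = "consistent" if ratio >= 1 else "one slow fix breaks availability"
    print(f"{name}: {verdict}")
```

With the example values only the medium level passes this check, and even there with very little room to spare.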