Sometimes I get the opportunity to talk to students about DevOps patterns in cloud computing. Chances are, they don’t fully understand cloud, so speaking about KPIs, SLAs, and SLOs in that context isn’t going to sink in. These metrics are important aspects of running a cloud service, so I had to find another way to explain them. Luckily, we’re all in possession of a machine our brains ride around in, so speaking about metrics in the context of the human body is something to which we can all relate.
KPI: Key Performance Indicator
These are metrics that you track in any system. In computing, these might be CPU utilization, thread count, memory pressure, things like that. In cloud, we add latency, error, traffic, and saturation rates.
Some KPIs for the human body are:
- Heart rate
- Blood pressure
- Blood sugar
- Any of the obscure metrics my doctor looks for when she thinks I haven’t been eating responsibly.
SLA: Service Level Agreement
This is what you promise to customers about your service. Breaching this typically means you have to give back some of their money, or give them future free stuff.
Cloud computing vendors often offer things like “5 9s” of availability. That means they offer 99.999% up-time of their service. Take all of the minutes in a year, 525,600, multiply them by the remaining 0.001%, and you’ll get a little less than 6 minutes.
Less than 6 minutes of downtime per year.
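If you want to check that arithmetic yourself, here is a minimal Python sketch. The helper name `allowed_downtime_minutes` is my own invention, and it assumes a 365-day year and that "N nines" means an unavailability of 10 to the power of minus N (so 5 nines is 99.999%).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted by an availability of N nines.

    Hypothetical helper for illustration: 5 nines means 99.999% availability,
    so the allowed unavailability is 10 ** -5 of the year.
    """
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Five nines comes out to about 5.26 minutes per year, and every extra nine cuts the allowance by a factor of ten.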
SLAs exist for all sorts of things. How long a business might make a customer sit on hold before talking to a real person. How long an internal team will take to get back to another team’s request. How long your mechanic will keep your car to fix something. How long a child is allowed to draw on the wall with permanent markers before getting in trouble.
An SLA is just an expectation. The world runs on expectations. We’re surrounded by both spoken and unspoken SLAs.
Getting back to our body analogy, an SLA for the human body might be “we won’t die unexpectedly.”
How do we keep that SLA in a way that can be measured and tracked? SLOs.
SLO: Service Level Objective
This is the range you want your metrics, your KPIs, to stay in. This is how health is defined. There’s typically an SLO for every KPI.
For instance, my optimal resting heart rate is somewhere between 40–80 beats per minute (bpm). If I’m in that range, I’m walking, standing, sitting around, or sleeping. Nothing strenuous. My maximum heart rate is 176 bpm, and that’s if I’m running for my life from an angry bear. Anything over that and I’m at high risk of dying from a heart attack (or being eaten by a bear).
Keeping my SLA of “don’t die unexpectedly” in mind, an SLO might be “keep my heart rate between 40–176 bpm”. Anything outside of that range, and we’re likely to breach our SLA of “not dying unexpectedly”.
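In code, an SLO like that is little more than a range check on a KPI. The sketch below is one possible way to write it, not how any particular monitoring product does it: the `RangeSLO` class and the KPI name `heart_rate_bpm` are made up for illustration, and only the 40 and 176 come from my own numbers above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeSLO:
    """An SLO expressed as an inclusive range for one KPI (hypothetical type)."""

    kpi: str
    low: float
    high: float

    def is_met(self, value: float) -> bool:
        """True when the KPI reading sits inside the allowed range."""
        return self.low <= value <= self.high


# Range taken from the text: 40 to 176 bpm.
heart_rate_slo = RangeSLO(kpi="heart_rate_bpm", low=40, high=176)

for bpm in (62, 176, 180):
    status = "ok" if heart_rate_slo.is_met(bpm) else "SLO breach"
    print(f"{bpm} bpm: {status}")
```

I treated the range as inclusive, so exactly 176 bpm still counts as healthy. That is a choice, and a real SLO should say which way the boundary goes.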
If we breach the SLO, we want to know about it. Knowing about it in cloud computing usually involves an alarm, and that alarm means paging someone to fix it. Even at 2 AM.
Using only one SLO/KPI to measure health is not enough. My heart rate might be fine, but my cholesterol levels can indicate severe, long-term risks. So we’ll have another SLO to track that. Blood sugar, same thing. An SLO around that.
Enough SLOs, and we’ll have a comfortable level of observation of the system, the human body, to prevent an SLA breach of “not dying unexpectedly.”
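Checking several SLOs at once might look something like the sketch below. Treat it as an illustration only: the function name `breached_slos` and the KPI names are hypothetical, the heart rate range is the one from this post, and the blood sugar and cholesterol ranges are placeholder numbers I picked to make the example run, not medical guidance.

```python
# KPI name -> (low, high) inclusive range.
# Only the heart rate range comes from the text; the other two are
# made-up placeholders for illustration, not medical guidance.
SLOS = {
    "heart_rate_bpm": (40, 176),
    "blood_sugar_mg_dl": (70, 140),       # placeholder
    "total_cholesterol_mg_dl": (0, 200),  # placeholder
}


def breached_slos(readings: dict[str, float]) -> list[str]:
    """Return the names of KPIs whose reading falls outside its SLO range.

    A KPI with no reading is also reported, since a metric nobody is
    observing cannot be called healthy.
    """
    breached = []
    for kpi, (low, high) in SLOS.items():
        value = readings.get(kpi)
        if value is None or not (low <= value <= high):
            breached.append(kpi)
    return breached


readings = {
    "heart_rate_bpm": 65,
    "blood_sugar_mg_dl": 95,
    "total_cholesterol_mg_dl": 240,
}
print(breached_slos(readings))  # ['total_cholesterol_mg_dl']
```

The heart rate looks fine here, but the system as a whole is not healthy, which is the reason for tracking more than one KPI.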
There’s just one problem. There’s nothing in here about judging pain.
You might not think pain is something normally associated with cloud computing, but you’d get a different picture if you talk to a harassed on-call who keeps getting woken up in the middle of the night to deal with a panicky cloud computing system. Something that breaches its SLOs all the time for weird, random reasons, that somehow go away by the time the on-call wakes up and gets online.
To keep from burning out your on-calls, who are expensive to find and keep, it’s best to have something called an “error budget”. This is the maximum number of SLO breaches you’ll allow in a certain amount of time, before you run the elevated risk of breaching an SLA.
Imagine that, for some reason, my heart rate hit 180 bpm once every three weeks at 3 AM, and stayed there for two seconds. Maybe I’m getting up and sprinting to use the bathroom. Remember, my maximum is 176 bpm. That breaches our SLO. That would page somebody.
Now, imagine that I’m a system running in a cloud. There are thousands of me running there, all doing that weird sprinting thing at 3 AM. Three weeks turns into every night. People say they’ll get around to fixing me, but they have other, higher priority things to deal with. And my thousands of sprinting systems are paging the on-call every night. It’s exhausting.
Enter the error budget. Before, it was “page when the heart rate goes outside 40–176 bpm”. Now, it could be “page when the heart rate goes outside 40–176 bpm for longer than five seconds.” Weird alarm storm at 3 AM goes away, and the on-call gets some more sleep.
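Here is a rough sketch of that "longer than five seconds" rule. It is not taken from any real alerting tool: `should_page` and its parameters are names I made up, and it assumes the samples arrive in time order as (timestamp in seconds, bpm) pairs.

```python
from typing import Iterable, Tuple


def should_page(
    samples: Iterable[Tuple[float, float]],
    low: float = 40,
    high: float = 176,
    max_breach_seconds: float = 5.0,
) -> bool:
    """Return True if the KPI stays outside [low, high] for longer than
    max_breach_seconds.

    samples: (timestamp_seconds, value) pairs, assumed to be in time order.
    """
    breach_started_at = None
    for timestamp, value in samples:
        if low <= value <= high:
            breach_started_at = None  # back in range, so reset the clock
        else:
            if breach_started_at is None:
                breach_started_at = timestamp
            if timestamp - breach_started_at > max_breach_seconds:
                return True
    return False


# The 3 AM sprint: 180 bpm for about two seconds, then back to normal.
sprint = [(0, 70), (1, 180), (2, 180), (3, 120), (4, 90)]
# A sustained problem: 185 bpm for eight seconds straight.
sustained = [(t, 185) for t in range(9)]

print(should_page(sprint))     # False
print(should_page(sustained))  # True
```

The short sprint no longer wakes anyone up, while a breach that lasts still does.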
Remember the 5 9s of availability? Less than 6 minutes of downtime per year? An error budget for that SLA likely exists, it’s just razor-thin. Spread evenly over 365 days, the yearly allowance works out to about 0.86 seconds of downtime per day, so averaging more than that will eventually breach the SLA. An error budget might be a quarter of that. Anything more than about 0.2 seconds of downtime per day would page somebody (it’s probably even lower).
KPIs, SLAs, and SLOs. A KPI is a metric you track, an SLA is something you promise, and an SLO is a range for those KPIs to live in. One KPI for the human body is heart rate, the SLA is “don’t die unexpectedly”, and the SLO is to keep that KPI between 40–176 bpm (for me). All together, we use them to measure and keep a system healthy, along with whatever metrics my doctor uses to spy on my eating habits.
I’m Cy Tidd. I work in the cloud computing industry and write dark fantasy fiction. Thanks for reading!