I just finished the SRE course and wanted to capture my thoughts, both directly and indirectly related to it.
This is a work-in-progress.
100% Reliability
In a perfect world everything is 100% reliable.
Too bad we don’t have one. Expecting perfect reliability is a mistake:
- the environment is unreliable, and systems don’t operate in isolation
- a system is only as reliable as its least reliable component
- returns diminish beyond a certain reliability level
- deliberate un-reliability sets realistic expectations
100% reliability is the wrong target.
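To put a number on the "least reliable component" point, here is a minimal sketch. It rests on assumptions of mine, not on the course: the components form a strictly serial chain (every one must work for a request to succeed), they fail independently, and the three reliability values are made up. Under those assumptions the combined reliability is the product of the parts, so it can never be better than the weakest link.

```python
from math import prod


def serial_reliability(components):
    """Combined reliability of a serial chain, assuming independent failures."""
    return prod(components)


# Hypothetical dependency chain: load balancer, app, database.
deps = [0.999, 0.9995, 0.995]
combined = serial_reliability(deps)

print(f"weakest component: {min(deps):.2%}")
print(f"whole chain:       {combined:.2%}")
```

The chain comes out slightly below its weakest component, which is why adding dependencies tends to cost reliability rather than add it.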
How (un)reliable? That’s the question.
We want to be satisfied with our experiences (aka happy). When it comes to systems, we want both reliability and improvements.
These two are in conflict, since:
- maintaining reliability means minimizing changes
- improvements require making changes
Thus
- reliability requires minimizing improvements?!
Rather than choosing between reliability and improvements, SRE advocates embracing a certain level of un-reliability, which enables improvements while still maintaining the target reliability.
The goal is to find an acceptable balance between being open to change and providing stability, keeping in mind that this balance may change over the system’s lifetime.
The balance depends on the objectives: reliability expectations for an early startup and a mature enterprise are likely very different.
Reliability math
Let’s first get familiar with the simple math involved.
It’s convenient to have reliability expressed as a relative value or percentage.
reliability | name | un-reliability | calculation |
---|---|---|---|
99.99% | 4 nines | 0.01% | 100%-99.99% |
99.9% | 3 nines | 0.1% | 100%-99.9% |
99.5% | 2.5 nines | 0.5% | 100%-99.5% |
and so on…
It’s worth mentioning that there are 2 ways reliability is calculated:
- time based
- event based
Time based calculation
Given a duration of 1 year (525,960 min = 365.25 * 24 * 60 min):
reliability | max un-reliable min / year | calculation |
---|---|---|
99.99% | 52.6 | 525,960 * 0.01 /100 |
99.9% | 526.0 | 525,960 * 0.1 /100 |
99.5% | 2629.8 | 525,960 * 0.5 /100 |
Note: each additional 9 of reliability cuts the allowed un-reliability by 10x.
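The time based calculation above can be sketched in a few lines. The function name `max_unreliable_minutes` is my own, and the year length is the same 365.25 days used in the table. Results should match the table up to rounding.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def max_unreliable_minutes(reliability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of un-reliability a target allows over the given period."""
    return period_minutes * (100 - reliability_pct) / 100


for target in (99.5, 99.9, 99.99):
    print(f"{target}% -> {max_unreliable_minutes(target):.1f} min / year")

# Each extra 9 shrinks the allowance roughly tenfold.
ratio = max_unreliable_minutes(99.9) / max_unreliable_minutes(99.99)
print(f"99.9% vs 99.99%: {ratio:.1f}x")
```

Passing a different `period_minutes` (for example `28 * 24 * 60`) gives the allowance for a shorter window.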
Event based calculation
Given 1,000,000 events:
reliability | max un-reliable events | calculation |
---|---|---|
99.99% | 100 | 1000000 * 0.01 /100 |
99.9% | 1000 | 1000000 * 0.1 /100 |
99.5% | 5000 | 1000000 * 0.5 /100 |
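The event based version looks much the same. This is a rough sketch with names of my own choosing. I use `Decimal` and pass the target as a string because plain floats give `100 - 99.9 = 0.0999...`, which would truncate 1000 allowed events down to 999.

```python
from decimal import Decimal


def max_unreliable_events(reliability_pct: str, total_events: int) -> int:
    """Number of failed events a target allows out of total_events."""
    budget_fraction = (Decimal(100) - Decimal(reliability_pct)) / Decimal(100)
    return int(total_events * budget_fraction)  # truncate to whole events


def within_budget(failed: int, reliability_pct: str, total_events: int) -> bool:
    """True while the failed events still fit in the allowance."""
    return failed <= max_unreliable_events(reliability_pct, total_events)


for target in ("99.99", "99.9", "99.5"):
    print(f"{target}% -> {max_unreliable_events(target, 1_000_000)} events")
```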
Un-reliability aka Error Budget
Let’s assume that we have a desired reliability target in mind (it’ll be discussed in upcoming sections). As the examples above show, the reliability target also sets the target level of un-reliability, also known as the Error Budget (EB).
The Error Budget is the “room” for making mistakes.
Similar to a financial budget, it:
- is designed to be spent
- sets the limits
Let’s say our reliability target is 99.9%, then EB = 100% - 99.9% = 0.1%.
                                        reliability target
---------------|-…---------|-…------------|-…----^-…--------------|
reliability    0          50%           99.5%  99.9%           100%
---------------|-…---------|-…------------|-…----|-…--------------|
period         |                                 |                |
---------------|---------------------------------|----------------|
28d            |########################################**********|
7d             |#####################################*************|
1d             |###########################***********************|
               |                                 |> Error Budget <|
The ASCII bars above show, per period, the current reliability level (#) and the un-reliability (*); the Error Budget is 17 characters wide. Comparing the two gives the share of the budget spent:
28d:
**********             10
-----------------  =  ----  ~ 58.8% spent: budget surplus
*****************      17
Or going over the budget:
1d:
***********************     23
-----------------------  =  ----  ~ 135% spent: budget debt
*****************           17
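The same ratio is easy to compute. This is just a sketch, and the function names are mine. The inputs can be any pair in the same unit: bar widths, minutes, or event counts.

```python
def budget_spent_pct(unreliability: float, error_budget: float) -> float:
    """Share of the Error Budget consumed, as a percentage."""
    if error_budget <= 0:
        raise ValueError("error budget must be positive (target below 100%)")
    return unreliability / error_budget * 100


def budget_status(spent_pct: float) -> str:
    """'surplus' while within budget, 'debt' once more than 100% is spent."""
    return "surplus" if spent_pct <= 100 else "debt"


spent = budget_spent_pct(10, 17)  # bar widths from the first example above
print(f"{spent:.1f}% spent: budget {budget_status(spent)}")

over = budget_spent_pct(20, 17)  # hypothetical over-spend
print(f"{over:.1f}% spent: budget {budget_status(over)}")
```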
Satisfactory Objectives
Users set the objectives
Users seek satisfaction, and they also establish what level is satisfactory.
For a business that means finding the trade-off between:
- maintaining reliability to satisfy users while allowing some un-reliability to enable changes
- managing costs to maintain profitability while staying affordable for users
Along with the Error Budget described previously, SRE introduces several concepts:
- SLO - Service Level Objective: sets the reliability target
- SLI - Service Level Indicator: indicator(s) of current performance
- EB - Error Budget: the budget for mistakes
- SLA - Service Level Agreement: the agreement that spells out the consequences of an SLO breach
For online businesses it’s common to have high reliability targets, i.e. SLOs: 99%, 99.5%, 99.99%, etc.
SLI vs SLO
SLI is the indicator (measure) of the actual performance. It’s normally measured over a period of time, since time is an important factor.
I like to see it as the tension between want and have:
metric | HAVE: SLI | WANT: SLO |
---|---|---|
Reliability | 98.4% | >=99% |
EB | 1.6% | <=1% |
This means that over a period X, the service reliability, as measured by the SLI, was below the target specified by the SLO: 98.4% < 99%. As a result, the Error Budget was over-spent by 60% (1.6% used against a budget of 1%).
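Here is the want vs have comparison as a small sketch. The function name and the report keys are my own invention, and it assumes the SLO is below 100% (otherwise there is no budget to divide by). `Decimal` keeps the percentages exact.

```python
from decimal import Decimal


def error_budget_report(sli_pct: str, slo_pct: str) -> dict:
    """Compare measured reliability (SLI) with the target (SLO)."""
    sli, slo = Decimal(sli_pct), Decimal(slo_pct)
    budget = Decimal(100) - slo  # allowed un-reliability; assumes SLO < 100
    spent = Decimal(100) - sli   # actual un-reliability
    return {
        "slo_met": sli >= slo,
        "budget_pct": budget,
        "spent_pct": spent,
        "overspent_by_pct": max(Decimal(0), (spent / budget - 1) * 100),
    }


report = error_budget_report("98.4", "99")
for key, value in report.items():
    print(f"{key}: {value}")
```

For the numbers in the table this reports the SLO as missed, with a 1% budget, 1.6% spent, and an over-spend of 60%.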
Thank you
That’s it for part 01. Make sure to check out the next parts in the, hopefully, near future.