SRE Course: part 01: abstract

I just finished the SRE course and want to capture my thoughts, both directly and indirectly related to it.

This is a work-in-progress.

100% Reliability

In a perfect world everything is 100% reliable.

Too bad we don’t have one. Expecting perfect reliability is a mistake:

the environment is unreliable, and systems don’t operate in isolation
a system is at most as reliable as its least reliable component
diminishing returns beyond a certain reliability level
deliberate un-reliability sets realistic expectations
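To put a number on the “least reliable component” point, here is a minimal sketch. It assumes every component is required for the system to work and that components fail independently (both are my simplifying assumptions, not something from the course). The component values are made up.

```python
import math


def serial_reliability(component_reliabilities_pct):
    """Reliability (in percent) of a chain where every component must work.

    Assumption: failures are independent, so the individual
    reliabilities multiply.
    """
    return 100.0 * math.prod(r / 100.0 for r in component_reliabilities_pct)


# Hypothetical components: two at 99.9% and one at 99.5%.
components = [99.9, 99.9, 99.5]
print(f"combined: {serial_reliability(components):.3f}%")
print(f"weakest:  {min(components)}%")
```

Under these assumptions the combined figure ends up below the weakest component, so the weakest link is only an upper bound.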

100% reliability is the wrong target.

How (un)reliable? That’s the question.

We want to be satisfied with our experiences (aka happy). When it comes to systems, we want both reliability and improvements.

These two are in conflict, since:

maintaining reliability means minimizing changes
improvements require making changes

Thus

reliability requires minimizing improvements?!

Rather than choosing between reliability and improvements, SRE advocates embracing a certain level of un-reliability to enable improvements while still maintaining the target reliability.

The goal is to find an acceptable balance between being open to change and providing stability, keeping in mind that the balance may change over the system’s lifetime.

The balance depends on the objectives: reliability expectations for an early startup and a mature enterprise are likely very different.

Reliability math

Let’s first get familiar with the simple math involved.

It’s convenient to have reliability expressed as a relative value or percentage.

reliability	name	un-reliability	calculation
99.99%	4 nines	0.01%	100%-99.99%
99.9%	3 nines	0.1%	100%-99.9%
99.5%	2.5 nines	0.5%	100%-99.5%

and so on…
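The same subtraction as a tiny sketch. I pass the targets as strings and use Decimal here (my choice, nothing the course prescribes) so that the results print exactly as in the table instead of picking up binary floating-point noise.

```python
from decimal import Decimal


def unreliability(reliability_pct: str) -> Decimal:
    """Un-reliability in percent: 100% minus the reliability."""
    return Decimal("100") - Decimal(reliability_pct)


for target in ("99.99", "99.9", "99.5"):
    print(f"{target}% reliable -> {unreliability(target)}% un-reliable")
```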

It’s worth mentioning that there are 2 ways reliability is calculated:

time based
event based

Time based calculation

Given a duration of 1 year (525,960 min = 365.25 * 24 * 60 min):

reliability	max un-reliable min / year	calculation
99.99%	52.5	525,960 * 0.01 /100
99.9%	525.9	525,960 * 0.1 /100
99.5%	2629.8	525,960 * 0.5 /100

Note: each additional 9 shrinks the allowed un-reliability 10x
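The time based calculation can be sketched as a small function. The function name and the extra 28-day window are my own additions for illustration; the yearly numbers use the same 365.25-day year as the table.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(reliability_pct: float, window_days: float = 365.25) -> float:
    """Maximum un-reliable minutes allowed within the window."""
    window_minutes = window_days * MINUTES_PER_DAY
    return window_minutes * (100.0 - reliability_pct) / 100.0


for target in (99.99, 99.9, 99.5):
    per_year = downtime_budget_minutes(target)
    per_28d = downtime_budget_minutes(target, window_days=28)
    print(f"{target}%: {per_year:.3f} min/year, {per_28d:.3f} min/28d")
```

The table lists the same yearly values cut to one decimal place.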

Event based calculation

Given 1,000,000 events:

reliability	max un-reliable events	calculation
99.99%	100	1000000 * 0.01 /100
99.9%	1000	1000000 * 0.1 /100
99.5%	5000	1000000 * 0.5 /100
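A minimal sketch of the event based calculation, plus a check of whether an observed number of failed events still fits. Both function names are hypothetical. I use Decimal because plain floats can land just below a whole number (999.999...) and then truncate to the wrong count.

```python
from decimal import Decimal


def allowed_bad_events(total_events: int, reliability_pct: str) -> int:
    """Maximum number of failed events that still meets the target."""
    budget_fraction = (Decimal(100) - Decimal(reliability_pct)) / Decimal(100)
    # int() truncates, so a fractional event is never counted as allowed.
    return int(total_events * budget_fraction)


def within_budget(bad_events: int, total_events: int, reliability_pct: str) -> bool:
    """True if the observed failures do not exceed the allowed count."""
    return bad_events <= allowed_bad_events(total_events, reliability_pct)


for target in ("99.99", "99.9", "99.5"):
    print(f"{target}%: {allowed_bad_events(1_000_000, target)} bad events allowed")
```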

Un-reliability aka Error Budget

Let’s assume that we have a desired reliability target in mind (it will be discussed in upcoming sections). As the examples above show, the reliability target also sets the target level of un-reliability, also known as the Error Budget (EB).

The Error Budget is the “room” for making mistakes.

Similar to a financial budget, it:

is designed to be spent
sets the limits

Let’s say our reliability target is 99.9%, then EB = 100%-99.9% = 0.1%

                                        reliability target
---------------|-…---------|-…------------|-…----^-…--------------|
reliability    0          50%           99.5%  99.9%             100%
---------------|-…---------|-…------------|-…----|-…--------------|
 period        |                                 |                |
---------------|---------------------------------|----------------|
 28d           |########################################**********|
 7d            |#####################################*************|
 1d            |###########################***********************|
               |                                 |> Error Budget <|

The ASCII bars above indicate the current reliability level and the EB spending per period. For example:

7d: 
               **********    10
------------------------- = ---- ~ 58.8% spent: budget surplus
        *****************    17

Or going over the budget:

1d:
 ***********************    24
 ----------------------- = ---- ~ 141% spent: budget debt
       *****************    17
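The two fractions above can be computed with a small helper. This is only a sketch: the function names are mine, and treating exactly 100% spent as still a surplus is an arbitrary assumption.

```python
def budget_spent_pct(spent: float, budget: float) -> float:
    """Share of the Error Budget that has been spent, in percent."""
    if budget <= 0:
        raise ValueError("the budget must be positive")
    return 100.0 * spent / budget


def budget_status(spent: float, budget: float) -> str:
    """'debt' when more than the whole budget is spent, otherwise 'surplus'."""
    # Assumption: spending exactly 100% still counts as 'surplus'.
    return "debt" if budget_spent_pct(spent, budget) > 100.0 else "surplus"


# The same units as in the bars: 10 of 17 spent, then 24 of 17 spent.
for spent in (10, 24):
    print(f"{budget_spent_pct(spent, 17):.1f}% spent: budget {budget_status(spent, 17)}")
```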

Satisfactory Objectives

Users set the objectives

Users seek satisfaction, and they also define what level is satisfactory.

For a business that means finding the trade-off between:

maintaining reliability to satisfy users while allowing some un-reliability to enable changes
managing costs to maintain profitability and keep the service affordable for users

Along with the Error Budget described previously, SRE introduces several concepts:

SLO - Service Level Objective: sets the reliability target
SLI - Service Level Indicator: indicator(s) of current performance
EB - Error Budget: the budget for mistakes
SLA - Service Level Agreement: SLO breach consequences

For online businesses it’s common to have high reliability targets, effectively SLOs: 99%, 99.5%, 99.99%, etc.

SLI vs SLO

SLI is the indicator (measure) of the actual performance. It’s normally measured over a period of time, since time is an important factor.

I like to see it as the tension between want and have:

	HAVE: SLI	WANT: SLO
Reliability	98.4%	>=99%
EB	1.6%	<=1%

This means that over a period X, the service reliability, as measured by the SLI, was below the target, as specified by the SLO: 98.4% < 99%. As a result, the Error Budget was overspent by 60% (1.6% spent against a 1% budget).
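Going straight from the SLI and SLO percentages to the overspend can be sketched like this. The function name is hypothetical, and reporting the result relative to the budget is my reading of the 60% figure.

```python
from decimal import Decimal


def error_budget_overspend_pct(sli_pct: str, slo_pct: str) -> Decimal:
    """Overspend relative to the Error Budget, in percent.

    Positive: the budget is overspent. Negative: budget is left.
    """
    budget = Decimal(100) - Decimal(slo_pct)  # WANT: allowed un-reliability
    spent = Decimal(100) - Decimal(sli_pct)   # HAVE: observed un-reliability
    if budget <= 0:
        # A 100% SLO leaves no budget, so any ratio is undefined.
        raise ValueError("the SLO must be below 100%")
    return (spent - budget) / budget * 100


print(error_budget_overspend_pct("98.4", "99"))
```

A negative result, for example with an SLI of 99.5% against the same 99% SLO, means half of the budget is still unspent.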

Thank you

That’s it for part 01. Make sure to check out the next parts in the, hopefully, near future.

References

Coursera course