Site Reliability Engineering

Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations.

— Wikipedia

In the Shivering-Isles infrastructure, various apps have their own set of SLOs, which are used to detect service degradation when changes are made. This is also good SRE practice in other environments.

Besides maintaining reasonable SLOs, other SRE practices are implemented, such as post mortems and, especially, reducing toil. Every component of the infrastructure has a maintenance budget; once it is depleted, it is time to fix the component or get rid of it.

Service Level Objectives

All public-facing apps and infrastructure components should have a Service Level Objective (SLO). The most basic SLOs for web apps are availability and latency, measured at the ingress controller. An example of an SLO definition is the one for the Shivering-Isles blog.
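As a rough illustration of what these two basic indicators measure, the following minimal sketch computes availability and latency ratios from a list of request records. The `Request` type, the choice to count only 5xx responses as failures, and the 0.5 second threshold are assumptions made for this example; they are not taken from the actual Shivering-Isles SLO definitions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request as seen by the ingress controller."""
    status: int        # HTTP status code
    duration_s: float  # request duration in seconds


def availability(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic means nothing failed
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_s: float) -> float:
    """Fraction of requests answered within threshold_s seconds."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.duration_s <= threshold_s)
    return fast / len(requests)


sample = [
    Request(200, 0.12),
    Request(200, 0.48),
    Request(503, 1.90),
    Request(404, 0.05),
]
print(availability(sample))      # 3 of the 4 requests are not 5xx
print(latency_sli(sample, 0.5))  # 3 of the 4 requests finish within 0.5 s
```

An SLO then states a target for such a ratio over a time window, for example "95 percent of requests succeed over 30 days".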

Apps that provide more insight via metrics can have app-specific SLOs to optimise for user-impacting situations that aren't covered by basic web metrics. An example is the sidekiq SLO for Mastodon.

The actual objectives in the Shivering-Isles infrastructure are often relatively low, at around 95 percent.
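To put such a target into perspective, it can be translated into an error budget. The sketch below assumes a time-based availability objective and a 30-day window; both are assumptions for illustration, since the window used in the real definitions is not stated here.

```python
def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of allowed unavailability in the window for a given SLO target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return (1.0 - slo_target) * window_days * 24


# A 95 percent objective leaves roughly 36 hours per 30 days,
# while 99.9 percent leaves well under one hour.
print(f"{error_budget_hours(0.95):.1f} h")
print(f"{error_budget_hours(0.999):.2f} h")
```

A 95 percent objective therefore leaves a lot of room for maintenance and mistakes in a home infrastructure.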

Self-Hosted Timebudget

In addition to the traditional error budget, the Shivering-Isles infrastructure has a self-hosted time budget. This is the acceptable amount of time per month to be spent on maintenance. Time budgets are set for individual pieces of software as well as for the entire infrastructure.

If the budget is reached or exceeded, work on anything new is halted and the focus shifts to

improving deployment processes,
replacing hard-to-maintain software or
moving it out of self-hosting.

This makes sure that self-hosting doesn't become a time sink while keeping software up to date.
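The bookkeeping behind such a time budget can be sketched in a few lines. The `TimeBudget` class, the example names and hour values, and the rule that either exhausted budget blocks new work are assumptions for illustration, not a description of how the budget is actually tracked.

```python
from dataclasses import dataclass


@dataclass
class TimeBudget:
    """Hypothetical monthly maintenance budget for one app or the whole infrastructure."""
    name: str
    budget_hours: float       # acceptable maintenance time per month
    spent_hours: float = 0.0  # maintenance time logged so far this month

    def log(self, hours: float) -> None:
        """Record maintenance time spent."""
        self.spent_hours += hours

    @property
    def exhausted(self) -> bool:
        """True once the budget is reached or exceeded."""
        return self.spent_hours >= self.budget_hours


def allow_new_work(app: TimeBudget, infrastructure: TimeBudget) -> bool:
    """Assumed rule: new work is allowed only while neither budget is exhausted."""
    return not (app.exhausted or infrastructure.exhausted)


app = TimeBudget("example-app", budget_hours=2.0)
infra = TimeBudget("infrastructure", budget_hours=10.0)

app.log(1.5)
print(allow_new_work(app, infra))  # budget not yet reached

app.log(0.5)
print(allow_new_work(app, infra))  # app budget reached, new work is halted
```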

Incident Response

To follow SRE best practices in the home infrastructure, larger outages and other incidents should be accompanied by a post mortem, which helps to improve the infrastructure and permanently resolve the causes of incidents.

The post mortem template used for this is inspired by the SRE book.

Even if it is never finished or published, the post mortem helps to structure ideas and the situation itself, making incident response much more thorough.

Learning about SRE

A good start is this short video series by Google:

The Google SRE book is also a recommended read.

There are also some good talks from SREcon:
