What is Site Reliability Engineering? Your guide to understanding SRE.

Site Reliability Engineering evolved at Google in the early 2000s as a "prescriptive way of measuring and achieving reliability through engineering and operations work." It is often compared to DevOps, a fellow approach to breaking down organisational silos in order to deliver better software faster, but the two have some key differences. So, what is Site Reliability Engineering?

What is Site Reliability Engineering?

The father of Site Reliability Engineering is Benjamin Treynor Sloss, who introduced the term and approach while working at Google in 2003 and went on to write the introduction to Google's Site Reliability Engineering book. In his words: "Site reliability engineering is what happens when you ask a software engineer to design an operations function."

"Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins."

Site reliability teams are typically responsible for how code is deployed, configured, and monitored, as well as for the availability, latency, capacity management, change management, and emergency response of services in production. Sound like "everything"? Kind of...

As Sloss noted, however: "Google places a 50% cap on the aggregate "ops" work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes."

Still unsure what Site Reliability Engineering is? Seth Vargo and Liz Fong-Jones put it more specifically: "SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs)."

SRE teams also agree service-level agreements (SLAs) that define how reliable the system needs to be for end users. The gap between that target and 100% reliability gives them what amounts to an "error budget". If they use up this budget, meaning the service is operating below the defined SLA, all launches are frozen until the number of errors falls to a level that allows launches to proceed. This gives both SREs and developers (whose bugs may cause the errors...) an incentive to collaborate in order to minimise errors in production.
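
The launch-freeze rule can be made concrete with a little arithmetic. The following is a minimal sketch in Python, not Google's actual tooling: the class name `ErrorBudget`, its fields, and the 99.9% target are all assumptions chosen for illustration, and the budget is counted in failed requests over a single measurement window.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one measurement window (hypothetical helper)."""

    target: Fraction       # agreed reliability target, e.g. Fraction("0.999")
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that counted as errors

    @property
    def allowed_failures(self) -> Fraction:
        # The budget is the share of requests that may fail: (1 - target).
        # Fraction keeps the arithmetic exact, avoiding float rounding.
        return (1 - self.target) * self.total_requests

    @property
    def remaining(self) -> Fraction:
        # Failures still available before the budget is used up.
        return self.allowed_failures - self.failed_requests

    @property
    def launches_frozen(self) -> bool:
        # Budget met or exceeded: launches stay frozen until errors drop.
        return self.failed_requests >= self.allowed_failures


# Illustrative numbers only: a 99.9% target over one million requests
# leaves room for 1,000 failed requests.
budget = ErrorBudget(Fraction("0.999"), total_requests=1_000_000, failed_requests=400)
print(budget.launches_frozen)  # False
```

With these assumed figures, 400 failures leave 600 in the budget, so launches may proceed; at 1,000 failures or more the same check would report a freeze.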

Benefits of Site Reliability Engineering

Site Reliability Engineering, in short, helps large systems function optimally through code. The engineer here uses metrics and software tools to manage and improve operations.

Developers wondering what skills a capable site reliability engineer needs will find that they vary widely.

Roles of a Site Reliability Engineer (SRE)

An SRE has a mix of skills in software development, operations management and business analytics, all of which are needed to solve operational problems through code. While DevOps is largely concerned with automating IT operations, SRE teams focus on planning and design.

They control production systems and monitor their performance, which helps in calculating the cost of outages and preparing for contingencies in advance. This preparation relies on well-maintained runbooks, tools, and documentation. As mentioned above, SREs then roll out updated features guided by service-level agreements (SLAs), service-level indicators (SLIs), and service-level objectives (SLOs).

Examples of building operational resilience

SREs also build resilience against software system and service failures. This can mean involvement in broader IT resilience, for example making sure that:

If a majority of servers in one geography fail, the system automatically adapts by procuring servers from other regions.
If a cloud provider withdraws a preferred CPU type, the system swiftly falls back to the next best available CPU.
When there is a sudden surge in the number of users, the number of servers can be increased to accommodate the growth in traffic.
Good logging reports are set up so tech support can correct errors quickly.
Metrics like mean time to failure (MTTF) and mean time to recovery (MTTR) are tracked to identify faulty components and restore normality.
Backup plans are built into the system.
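
The MTTF and MTTR metrics in the list above can be derived from a simple outage log. This is a rough sketch in Python under stated assumptions: the function names and the `Outage` tuple are hypothetical, outages are sorted and do not overlap, and MTTF is taken here to mean the average healthy running time before each failure, measured from the start of observation.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical record shape: (failed_at, recovered_at) for one outage.
Outage = Tuple[datetime, datetime]


def mean_time_to_recovery(outages: List[Outage]) -> timedelta:
    """Average time from a failure to the matching recovery."""
    if not outages:
        raise ValueError("need at least one outage")
    downtime = sum((end - start for start, end in outages), timedelta())
    return downtime / len(outages)


def mean_time_to_failure(observed_from: datetime, outages: List[Outage]) -> timedelta:
    """Average healthy running time before each failure.

    Assumes outages are sorted by start time and do not overlap.
    """
    if not outages:
        raise ValueError("need at least one outage")
    uptime = timedelta()
    healthy_since = observed_from
    for failed_at, recovered_at in outages:
        uptime += failed_at - healthy_since   # healthy stretch before this failure
        healthy_since = recovered_at          # next stretch starts at recovery
    return uptime / len(outages)


# Made-up outage log for a single component.
start = datetime(2024, 1, 1, 0, 0)
log = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 12, 30), datetime(2024, 1, 2, 13, 0)),
]
print(mean_time_to_recovery(log))        # 0:30:00
print(mean_time_to_failure(start, log))  # 18:00:00
```

A component whose MTTF is much lower, or whose MTTR is much higher, than its peers is a natural candidate for closer inspection.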

Conclusion

Still wondering what Site Reliability Engineering is? Benjamin Treynor Sloss's introduction is a good place to start.

The role is in high demand, not least at Google itself, which is advertising over 600 jobs with Site Reliability Engineering in the title across the broader Alphabet family.