No Grumpy Humans and Other Site Reliability Engineering Lessons from Google – InApps Technology 2022

March 21, 2022 by Phu Nguyen


SRE Fundamentals

Technical skills are teachable, so Fong-Jones looks for engineers who combine technical skills and curiosity with the ability to empathize and build trust with other people. Curiosity is hard to teach, she said, but it’s critical to have engineers who are curious about how systems break and who really try to understand what happened.

It’s always important to hire the right people, Fong-Jones said, but SREs are not the only hires that matter. Hiring production-minded product development software engineers is also key. “If you are in a small organization and you want to get off on the right foot, you cannot have people that are working against your reliability objectives by throwing half-finished stuff over the wall, or not really wanting to write metrics into the software that they’re writing,” she said.

Having people who care deeply about reliability and reinforce best practices with each other is what separates a community of SREs, working together to make things systemically better across the platform or the company, from a set of disconnected teams or lone individuals, Fong-Jones explained.

SLO from the Get-Go

It’s important to define Service Level Objectives (SLOs), the reliability targets that underpin service level agreements, from the very beginning. “Because if you’re not having that reasoned conversation about what’s an appropriate reliability level,” she said, “it’s harder and harder to do as you go along.”

If you don’t have explicit SLOs published, Fong-Jones said, then your SLO is whatever your customers are used to seeing. This can lead to bad assumptions about your architecture, or to spectacular failures.

Her team starts with a helpful attitude, she said. “Hey, you already have risks, let’s enumerate them for you.” Focusing on the quantity of data has been helpful in conversations with their internal customers. For those reluctant to do the work to define new and accurate SLOs, Fong-Jones will sometimes deliberately run their services exactly to their SLO. The resulting failure usually brings them around.

Start with a risk matrix, she suggested. Go straight for the team’s engineers and ask them to enumerate the risks. They may not like talking about it, but they know what they are worried about. “Everyone knows where the skeletons are buried,” she said.

Once the risks are defined and MTTD (Mean Time To Detect), MTTR (Mean Time To Recovery) and MTBF (Mean Time Between Failures) are set, the team can talk about the real business of the SRE. “Is this acceptable risk or not? What’s the cost of enumerating these risks? What do we think that an appropriate level of reliability is?” she asked.
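As a rough illustration of how such a risk matrix might be turned into numbers, the sketch below ranks risks by the expected "bad minutes" each contributes per year. It is only one possible way to do the arithmetic, not a method described by Fong-Jones. The Risk class, the users_affected field, the sample entries and the assumption that MTTR is counted from the moment of detection are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    mttd_min: float        # mean time to detect, in minutes
    mttr_min: float        # mean time to recover after detection, in minutes (assumption)
    mtbf_days: float       # mean time between failures, in days
    users_affected: float  # fraction of users hit, 0.0 to 1.0 (hypothetical field)

def expected_bad_minutes_per_year(risk):
    """Expected user-weighted minutes of failure per year for one risk."""
    incidents_per_year = 365 / risk.mtbf_days
    minutes_per_incident = risk.mttd_min + risk.mttr_min
    return incidents_per_year * minutes_per_incident * risk.users_affected

# Hypothetical entries, invented for illustration only.
risks = [
    Risk("single-zone outage", mttd_min=5, mttr_min=115, mtbf_days=365, users_affected=0.5),
    Risk("bad config push", mttd_min=10, mttr_min=50, mtbf_days=73, users_affected=1.0),
]

# Largest expected impact first, so the conversation starts with the biggest risk.
ranked = sorted(risks, key=expected_bad_minutes_per_year, reverse=True)
for r in ranked:
    print(f"{r.name}: {expected_bad_minutes_per_year(r):.0f} bad minutes/year")
```

With these made-up inputs the config push dominates (about 300 minutes a year against about 60), which is the kind of comparison that makes the "is this acceptable risk or not?" question concrete.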

Teams also need to define a service level indicator (SLI), which is a key performance metric that represents some facet of the business, she said. One example is the fraction of user queries that are successfully completed within 200 milliseconds without error.
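The example indicator above can be sketched in a few lines of Python. This is a minimal illustration that assumes each query is recorded with a latency and a success flag. The Query class and the sli_good_fraction function are hypothetical names, not part of any Google tooling.

```python
from dataclasses import dataclass

@dataclass
class Query:
    latency_ms: float  # time taken to serve the query
    ok: bool           # True if the query completed without error

def sli_good_fraction(queries, threshold_ms=200.0):
    """Fraction of queries that succeeded within threshold_ms."""
    if not queries:
        return None  # no traffic, so the indicator is undefined
    good = sum(1 for q in queries if q.ok and q.latency_ms <= threshold_ms)
    return good / len(queries)

# Hypothetical sample data.
sample = [
    Query(120.0, True),
    Query(180.0, True),
    Query(250.0, True),   # succeeded, but too slow to count as good
    Query(90.0, False),   # fast, but failed
]
print(sli_good_fraction(sample))  # 0.5
```

A query only counts as good when it is both fast and error-free, so a slow success and a fast failure are treated the same way.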

No Grumpy Humans

Striking a balance between having enough visibility into your system and avoiding alert fatigue is tricky, Fong-Jones acknowledged. “It’s not just in terms of reliability, but what’s the effect on the humans? Are the humans going to be grumpy because they’ve been paged five times overnight? Because you can’t run a service off of really grumpy humans.”

To strike that balance, she recommends turning off as many alerts as you can and focusing on the users experiencing pain. Sometimes the situation is so bad, she said, that a group may be constantly failing its SLOs. When that’s the case, re-evaluating your SLOs may be in order.

“You need to either decide, ‘Okay, this is going to be a short-term issue, we know what we need to do, it’ll be fixed in a month, let’s ignore anything except for catastrophic failures,’” she said. Evaluate what is acceptable. “If users are happy and your service is 99 percent available instead of 99.9 percent available, maybe that’s where you should set your SLO. Maybe your business’s requirements were not accurately measured at the start.”
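To see what the gap between those two figures means in practice, the short sketch below converts an availability target into the amount of full outage a window can absorb. The function name and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(slo_percent, window_days=30):
    """Minutes of full outage a window can absorb while still meeting the SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

for slo in (99.0, 99.9):
    print(f"{slo}% over 30 days: {allowed_downtime_minutes(slo):.1f} minutes")
```

Over 30 days, 99 percent leaves room for about 432 minutes of downtime, while 99.9 percent leaves about 43. That tenfold difference is what a team trades away, or gains back, when it moves the target.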

Standardization Is Key

Well, duh. But a large part of this is getting rid of the problem of Shadow IT, where an engineer decides they want to use a shiny new feature that’s outside the approved software. To combat this problem inside Google, they use a bottom-up approach.

When they notice there are six different APIs doing similar things, for example, she talks to the engineers. “Okay, you talk all among yourselves and figure out how to merge. If you’re developing two competing things, and you say, ‘Okay, let’s just fold it into one project,’” she said.

Google is encouraging their engineers to “look left, look right, see what other people are doing,” she said. It goes back to simplicity. They reward people for shutting down projects. At Google, that’s not something that gets you penalized. That’s something that will get you promoted.

Fong-Jones thinks it’s really important to reward people for doing reliability work. Reward people for thinking about systems as a whole, make sure that your job ladder rewards thinking about what makes the product excellent from a reliability perspective, and build a community of practice.

“Are you rewarding people for building a whole bunch of complex stuff that no one can maintain?” she asked. “Or are you rewarding people for doing the simplest thing, even if it means not writing any new software, just integrating something that’s existing. And that’s something that is valuable in software engineers, but even more valuable in site reliability engineers.”

When done right, it can be very rewarding, she said. Take the Home Depot vice president who texted Google’s director of customer reliability engineering on Thanksgiving Day. “The message wasn’t, ‘Oh my god, everything’s on fire.’ What it said was, ‘Thank you, we’ve had a quiet Thanksgiving for the first time in forever.’”

Google is a sponsor of InApps Technology.

Source: InApps.net
