How Can You Tell If Your On-Call System Is Broken? – InApps 2022


The Need for a Good Feedback Loop

It’s always best to start with a baseline definition of what “good” is. Stevi Deter is a principal software engineer at platform-as-a-service company DexCare. After years of coordinating all-hours support for complex systems integrations, she defines a successful on-call process as one with a positive feedback loop. A process fails, she told the conference audience, when on-call is siloed and pages aren’t being captured and brought into the overall engineering process and prioritization.

Tailor the Schedule to the Team

To understand the quality of your on-call feedback loop, you’ve first got to understand the engineers’ workload, argued Ricardo Aravena, site reliability engineering (SRE) manager at Rakuten. He recommends planning a hand-off after every shift, asking:

What were the major alerts?
How many times have you seen this specific alert or incident before?
Is there something you could fix?
Anything you couldn’t fix?

It’s important to balance the shared workload, Aravena said. A larger organization may have six or more people in rotation, while a startup may have only two. Those two may communicate more easily, but as a company expands internationally, it can craft the on-call process around time zones, even providing 24-hour coverage in reasonable eight-hour shifts, so that no one has to be paged at 2 a.m.
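As a rough illustration of 24-hour coverage in eight-hour shifts, the sketch below maps any moment to the regional team on call. The team names and the UTC windows are assumptions made up for this example, not anything the speakers described; real windows would depend on where the engineers actually sit.

```python
from datetime import datetime, timezone

# Hypothetical rotation: three regional teams, each covering one
# eight-hour window expressed in UTC hours (start inclusive, end exclusive).
SHIFTS = [
    ("apac", 0, 8),
    ("emea", 8, 16),
    ("americas", 16, 24),
]


def covers_full_day(shifts) -> bool:
    """Check that the shift windows cover every UTC hour exactly once."""
    covered = sorted(h for _, start, end in shifts for h in range(start, end))
    return covered == list(range(24))


def on_call_team(moment: datetime) -> str:
    """Return the team whose shift contains the given timezone-aware moment."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    hour = moment.astimezone(timezone.utc).hour
    for team, start, end in SHIFTS:
        if start <= hour < end:
            return team
    raise RuntimeError("shift table does not cover all 24 hours")
```

Checking `covers_full_day(SHIFTS)` whenever the table changes is a cheap way to catch gaps or overlaps before they turn into an unanswered page.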

“When you have incidents … just make sure that something actually happens as a result of that. So if you do get woken up in the middle of the night and it turns out it was a major incident that you have this sense of, ‘OK, that sucked, but now we’re actually going to figure out how to not have that happen to us again.’”

—Stevi Deter, principal software engineer, DexCare @smd

On-call schedules will change as company size changes. During the pandemic, Calçado said, his employer, the live-event ticket retailer SeatGeek, had to reduce its team size while still providing on-call support for often unpredictable traffic.

Deter’s team also had to find ways to prioritize what gets done on call. Earlier this year, DexCare spun out as an incubator from the much larger Providence Digital Health. It went from three teams of six engineers on call to just one team of six — supporting the same amount of software, with a growing user base.

“Be realistic about what is expected, what they can achieve during a certain amount of time. And think also about how it affects your overall processes,” she said.

Deter urged the audience to ask, “What can you expect out of a person who is on call? Can you expect them to also be doing sprint work?”

One way DexCare ameliorated the tripled workload was to have a team just focus on paying down technical debt.

They also realized there was a shared dread of Mondays when they were ending shifts that ran through the weekend. They experimented with switching off on Fridays. This had a dramatic, positive effect when teammates finished midday Fridays and got the weekend to recuperate after a week on call.
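A weekly rotation with a midday Friday hand-off can be expressed in a few lines. This is only a sketch under assumed details: the engineer names, the start date, and the noon cut-over are hypothetical, and the datetimes are naive (treated as the team’s local time).

```python
from datetime import datetime, timedelta

# Hypothetical roster and rotation start; 7 January 2022 was a Friday.
ENGINEERS = ["ana", "ben", "chi"]
ROTATION_START = datetime(2022, 1, 7, 12, 0)  # Friday at noon, local time


def on_call_engineer(moment: datetime) -> str:
    """Return who is on call at `moment`, handing off every Friday at noon."""
    if moment < ROTATION_START:
        raise ValueError("moment is before the rotation started")
    # Whole weeks elapsed since the first hand-off decide whose turn it is.
    weeks_elapsed = (moment - ROTATION_START) // timedelta(weeks=1)
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]
```

Because each shift both starts and ends at Friday noon, whoever finishes a week on call goes straight into the weekend.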

Tailor the Process to the Organization

On-call must be a constantly evolving process, noted Jaime Woo, site reliability educator, mindfulness instructor, and co-editor of “97 Things Every SRE Should Know.”

“Keep changing it. Location matters. Team size matters. Needs matter. And I don’t think there’s always that intention of who is going to watch out as it evolves,” he added, to make sure “you have that flexibility and cohesion.”

Calçado has only worked at small to medium-sized businesses, where the on-call process was always in flux. The incident management side of the on-call process, he told the audience, can be pretty straightforward, but the “on-call component is a bit more complicated because it relates so much to people’s personal lives and expectations and just happiness around the company.”

He warned against attempting a company-wide on-call strategy, recommending instead that teams decide their own. After all, they are the ones who should know everyone’s optimal working schedule. Many of his SeatGeek colleagues are based in Israel, where the weekend is Friday to Saturday, which meant the teams shifted on-call rotations a day earlier.

“You need to be flexible and work out even on a day-to-day basis what’s best for each team,” Calçado advised.

Of course, it’s always a good reminder to never page new parents. As Honeycomb founder and CTO Charity Majors has put it: “You should not have more than one thing waking you up in the middle of the night.”

Sync Up with Incident Response and Postmortems

Nothing is more demoralizing than feeling ineffective. You need a plan. For Aravena, the first step toward on-call success is opening a communication channel with your team. This can be regular retrospectives and handovers mixed with anonymous feedback tools like Slack polling — all backed, of course, by incident response managers.

You want to pay attention to key DevOps metrics like mean time to repair (MTTR) or even mean time to detect an issue. Leadership will always want to know if these are getting better or worse over time.
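To make the two metrics concrete, here is a minimal sketch that computes them from a list of incidents. The `Incident` record and its field names are assumptions for illustration. Time to repair is measured here from detection to resolution; some teams measure from the start of the outage instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when an alert or a person noticed it
    resolved: datetime  # when service was restored


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    if not durations:
        raise ValueError("no incidents to average")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    return mean_duration([i.detected - i.started for i in incidents])


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    # Assumption: repair time runs from detection to resolution.
    return mean_duration([i.resolved - i.detected for i in incidents])
```

Running the same functions over incidents grouped by month or quarter gives the trend leadership tends to ask about.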

But you should also be on the lookout for patterns, like whether the same type of alert keeps occurring, or whether engineers can’t do anything to fix recurring incidents. Tracking and reducing false positives, alerts that page someone when nothing is actually wrong, is equally important to avoid burnout.
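One simple way to surface such patterns is to count pages by alert name and track how many of them the engineer could not act on. The `Page` record, its fields, and the threshold below are hypothetical; a real team would pull this from whatever its paging tool exports.

```python
from collections import Counter
from typing import NamedTuple


class Page(NamedTuple):
    alert_name: str
    actionable: bool  # could the on-call engineer do anything about it?


def recurring_alerts(pages: list[Page], min_count: int = 2) -> dict[str, int]:
    """Return alerts that fired at least `min_count` times, with their counts."""
    counts = Counter(p.alert_name for p in pages)
    return {name: n for name, n in counts.items() if n >= min_count}


def non_actionable_ratio(pages: list[Page]) -> float:
    """Fraction of pages where the engineer could not do anything."""
    if not pages:
        return 0.0
    return sum(1 for p in pages if not p.actionable) / len(pages)
```

A rising non-actionable ratio, or the same alert name topping the list shift after shift, is a reasonable prompt for the hand-off and retrospective conversations described here.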

Debriefings and postmortems are essential, Calçado said, and they work best when they become a habit. Regularly ask teammates what happened on call, even if nothing big did. You may need to tune your alert system to be sure you’re not overlooking incidents.

He recommended you regularly create a timeline of what typical incidents look like and try to observe themes. This can help de-personalize discussion of problems, he noted, “so you’re more free to have your criticism instead of thinking that you’re talking about a colleague or a friend.”

Always ask: What would you change in the process? This is especially powerful in a one-on-one setting, when everyone gets a chance to speak.

Always remember, an on-call process has the potential to empower a team. For Woo, it all comes down to how you and your team choose to frame your experience:

“You can hate it — you still have to do it anyway. Or you can learn to not necessarily love it, but learn something from it. Why are we getting these alerts or why are people feeling this way? I think through that curiosity, through that humility, something great will happen.”


Source: InApps.net


Phu Nguyen
