What It Means, Why It Matters


Gremlin sponsored this podcast.

Chaos engineering attracts a lot of interest these days, especially as organizations increasingly rely on widely distributed data infrastructures that can extend across multicloud and on-premises environments, where the risk of failure grows exponentially. But while many agree that chaos engineering involves planning in some way, a widely accepted definition remains elusive.

For Kolton Andrus, CEO and co-founder of Gremlin, chaos engineering “is one of my favorite topics for debate” and “is what makes chaos engineering sound fun and exciting.”

In this edition of the InApps Makers podcast, Andrus defines chaos engineering and describes how organizations can make it work for them. Alex Williams, founder and publisher of InApps, hosted this episode.


The very idea of chaos — and an IT organization’s embrace of it — can conjure up fear in many. “[Chaos engineering] scares the pants off of some old school folks that aren’t comfortable with that kind of chaos in their environments. And so most people think chaos engineering is randomly breaking things and seeing what happens,” said Andrus. “I think that chaos engineering is thoughtful, planned experiments that teach us about our system and one of the key concepts that goes with that is this idea of the ‘blast radius.’ When we run this experiment, whom might we impact? Because the goal is to prevent outages, not to cause an outage and we never want to inadvertently cause customer pain. We never want to cause an outage because we were being cavalier in our approach.”
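To make the "blast radius" idea concrete, here is a minimal Python sketch of a planned experiment that limits how many hosts it may touch and defines when to stop. The `Experiment` class, its fields, and the helper functions are hypothetical names chosen for illustration. They are not Gremlin's API, and a real tool would track far more state.

```python
import random
from dataclasses import dataclass


@dataclass
class Experiment:
    """A planned chaos experiment with an explicit blast radius (hypothetical model)."""
    name: str
    hypothesis: str           # what we expect the system to do under the fault
    max_fraction: float       # largest share of hosts the experiment may touch
    abort_error_rate: float   # halt if the observed error rate exceeds this


def pick_targets(hosts, experiment, seed=None):
    """Choose a bounded subset of hosts so impact stays inside the blast radius."""
    if not 0 < experiment.max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not hosts:
        return []
    # Always target at least one host, but never more than the allowed share.
    limit = min(len(hosts), max(1, int(len(hosts) * experiment.max_fraction)))
    return random.Random(seed).sample(hosts, limit)


def should_abort(observed_error_rate, experiment):
    """Stop the experiment as soon as customers might be affected."""
    return observed_error_rate > experiment.abort_error_rate


if __name__ == "__main__":
    hosts = [f"host-{i}" for i in range(20)]
    exp = Experiment(
        name="api-latency",
        hypothesis="Added latency on a few hosts does not raise the error rate",
        max_fraction=0.25,
        abort_error_rate=0.01,
    )
    print(pick_targets(hosts, exp, seed=1))
```

Under these assumptions, the experiment starts with a small, explicit set of targets and has a stop condition defined before anything is broken, which is the opposite of randomly breaking things and seeing what happens.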


Andrus brings a deep background in the subject to the debate. Before founding Gremlin, and as one of the pioneers of chaos engineering, Andrus worked extensively on preventing service outages, first at Amazon and then at Netflix. “When an outage happens, it’s time-intensive and expensive. It’s damaging to your brand,” he explained. “And if you work at a place like Amazon or Netflix, an outage costs hundreds of thousands to millions of dollars and so preventing every outage and preventing every minute of downtime is worth the investment.”

While his work at Amazon was more infrastructure-focused, his mission at Netflix, as part of the API team, centered on application-level fault injection: injecting failure or delay into a specific service or function, such as those managing customers’ identities, recommendations or recently watched movies.
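Application-level fault injection of this kind can be sketched in a few lines of Python. The decorator below is an illustrative assumption, not the mechanism Netflix or Gremlin actually uses. `ACTIVE_FAULTS`, `fault_point`, and `get_recently_watched` are hypothetical names, and the returned titles are made-up placeholder data.

```python
import functools
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


# Hypothetical in-memory experiment table: service name -> fault settings.
ACTIVE_FAULTS = {}


def fault_point(service):
    """Mark a function as a place where failure or delay can be injected."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            fault = ACTIVE_FAULTS.get(service)
            if fault:
                # Inject delay first, then an optional failure.
                time.sleep(fault.get("delay_seconds", 0))
                if fault.get("fail", False):
                    raise InjectedFault(f"injected failure in {service}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@fault_point("recently_watched")
def get_recently_watched(user_id):
    # Stand-in for a real service call; the data is invented for illustration.
    return ["Title A", "Title B"]
```

With no entry in `ACTIVE_FAULTS`, the function behaves normally. Setting `ACTIVE_FAULTS["recently_watched"] = {"fail": True}` makes only that one call fail, which is what allows an experiment to target a single service or function precisely.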

“What would happen if one of those failed? Well, truthfully, if I can’t get your recently watched movies, I probably shouldn’t just crash the application — we can gracefully degrade and give you a cast list or just not show you that and you can continue on,” said Andrus. “And so that allowed us to go through and be very, very precise about where we wanted to run these experiments.”
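The graceful degradation Andrus describes might look like the following Python sketch. It assumes a hypothetical `build_home_page` function that receives the recently watched lookup as a parameter. The page structure is invented for illustration and does not reflect Netflix's actual code.

```python
def build_home_page(user_id, fetch_recently_watched):
    """Assemble the page, omitting the recently watched row if its service fails."""
    page = {"user_id": user_id, "rows": ["recommendations"]}
    try:
        page["recently_watched"] = fetch_recently_watched(user_id)
        page["rows"].append("recently_watched")
    except Exception:
        # Degrade gracefully: skip the row rather than crash the application.
        # A production system would likely catch narrower exception types
        # and log the failure for later review.
        page["recently_watched"] = None
    return page


def broken_service(user_id):
    """Simulates the dependency failing during an experiment."""
    raise RuntimeError("recently watched service unavailable")


if __name__ == "__main__":
    print(build_home_page("u1", lambda uid: ["Title A"]))
    print(build_home_page("u1", broken_service))
```

In the failing case the page is still returned, just without the optional row, so the customer can continue with what they came to do.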

A main takeaway at Netflix involved the business cases, such as understanding what the customer saw and what “the right behavior for the system is,” he explained. “And then we can go fix things so that when things go wrong, customers don’t see it — and they’re able to do whatever they came to do.”

The core technology infrastructure behind Gremlin’s experiments mainly relies on its agent, while the “future of where we’re going” is helping “people to measure the reliability of their services and to assess the potential risks that happened.” Gremlin’s team will “even run those experiments for them and tell them whether their system behaved correctly, or give them the set of things that weren’t handled correctly, so they have a shortlist of things to go fix and improve,” said Andrus.


Source: InApps.net
