A Discussion About Chaos Engineering and More – InApps 2022

Panelists

Kolton Andrus: The CEO and co-founder of Gremlin, the world’s first fully-hosted chaos engineering platform. Previously worked on building robust systems at Amazon and Netflix.
Charity Majors: The CTO and co-founder of Honeycomb, an observability platform to understand production systems. Previously worked at Facebook as a production engineering manager, focusing on their backend-as-a-service platform Parse.
John Egan: The CEO and co-founder of Kintaba, a modern incident management platform for your entire organization. Previously built a startup that was acquired by Facebook, where he then led product for their enterprise offering Workplace.
Daniel “Spoons” Spoonhower: The CTO and co-founder of Lightstep, a cutting-edge observability and distributed tracing platform. Previously worked at Google and is also the co-founder of the OpenTelemetry project.
Shahar Fogel: The CEO of Rookout, a live debugging platform enabling developers to debug modern applications faster than ever. Previously was the CEO of Brandtix and the VP of Product at Connectik Technologies.

Key Takeaways from the Resilience Roundtable

Major Outages Impact Companies Both Big and Small

Yes, Amazon can lose millions of dollars if it is down for even a few minutes, and Robinhood might lose countless users each time it crashes during a major market movement. But for startups, even if they aren’t losing millions of dollars or hundreds of customers, the relative impact on the business can be much greater. For a startup, losing even a single big customer can mean losing a significant chunk of revenue. So while big companies make for big headlines, startups can feel the pain of major outages just as much, if not more.

Postmortems Should Be Shared Broadly and Publicly

Creating a culture that accepts failure and learns from it is a major shift for many companies. Too often, when something goes wrong within traditional organizations, people who weren’t even there (e.g. management) dole out punishment and blame as the primary response. In modern incident management, blameless postmortems formally document what went wrong and why, in an effort to better understand the incident and prevent it from happening again. These documents should be shared not only with your team but also publicly, so that anyone interested can learn from what happened. (Cross-company resilience FTW)

You Build It, You Own It!

The best way to get software developers to care about the reliability of their applications… is to put them on call! Skin in the game can make a world of difference. If the engineer knows it’s their pager that will fire in the middle of the night or over the holiday break, they are much more likely to write code that stands up.

Resilience Is Shifting Left

This is a core promise of DevOps: that the daylight between who writes the code and who is responsible for that code’s behavior in production becomes narrower and narrower. When we think of shifting more of the operational burden upfront (i.e. Proactive Ops), we may also think of the cutting-edge discipline of Chaos Engineering. Like a vaccine, it injects a little failure upfront, on your own terms, in order to build longer-term resilience. And for software developers, resilience often means more than just checking whether systems are up or down; it means being able to debug customer-facing issues on the fly and provide a seamless online experience even when the unexpected happens.
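To make the "inject a little failure upfront" idea concrete, here is a minimal sketch of a chaos-style experiment in Python. It is not how Gremlin or any other platform mentioned here works; every name in it (`with_chaos`, `call_with_retry`, `fetch_price`, `run_experiment`) is hypothetical and chosen for illustration. It wraps a stand-in dependency so that it fails at a configurable rate, then measures how often a simple retry policy still delivers a result.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Return a version of func that fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts, *args, **kwargs):
    """Call func up to `attempts` times; re-raise the last injected fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFault as err:
            last_error = err
    raise last_error


def fetch_price(symbol):
    """Stand-in for a real downstream call (hypothetical)."""
    return {"symbol": symbol, "price": 100.0}


def run_experiment(failure_rate=0.3, attempts=3, requests=1000, seed=42):
    """Return the fraction of requests that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_fetch = with_chaos(fetch_price, failure_rate, rng)
    succeeded = 0
    for _ in range(requests):
        try:
            call_with_retry(flaky_fetch, attempts, "ACME")
            succeeded += 1
        except InjectedFault:
            pass  # the retry budget was exhausted for this request
    return succeeded / requests


if __name__ == "__main__":
    print(f"success rate under injected faults: {run_experiment():.3f}")
```

With a 30% injected failure rate and three attempts, each request should fail only when all three attempts fail, so the success rate should land near 1 - 0.3³, or roughly 97%. A real experiment would target actual dependencies in a controlled environment with a limited blast radius, but the shape is the same: state a hypothesis about resilience, inject the failure on your own terms, and check whether the system behaves as expected.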

Observability Is Real, AIOps Not So Much

Among the panelists, there was a near-unanimous reaction to the term “AIOps” (eye roll). While the idea of machines solving all of our problems makes for good headlines, the truth is that humans are still very much needed to attribute value to machine-detected anomalies. AIOps also adds another project for your engineers to worry about: before, they just wanted to improve resilience, but now they have to build and maintain the AI that helps with that resilience! Simply adopting the best DevOps/SRE practices will likely get you further, for now.

Lightstep is a sponsor of InApps.

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Real.

Source: InApps.net

Anh Hoang

April 10, 2026 by Anh Hoang
