The Importance of Resilience Testing and Observability – InApps Technology 2025



Key Summary

This article from InApps Technology, authored by Phu Nguyen and featuring James Burns (Developer Advocate at LightStep), emphasizes the critical role of observability and resilience testing in modern software development. Sponsored by LightStep for Failover Conf 2021, it highlights how these practices improve developer experience, feature velocity, and customer satisfaction by addressing the challenges of operating complex systems in production. Observability connects effects to causes across services, enabling developers to understand application behavior, while resilience testing (e.g., Chaos Engineering) proactively identifies weaknesses to minimize failure impacts. Together, they ensure robust systems, reduce downtime, and enhance customer experiences without customers noticing the underlying efforts.

Context:
- Author: James Burns, with expertise in cloud operations and system failures, advocates for integrating observability and resilience into regular development workflows.
- Event: Written in anticipation of Failover Conf (April 21, 2021), sponsored by LightStep.
- Problem: Easy-to-deploy code (via containers, microservices) can lead to operational challenges, causing frequent alerts and reduced developer productivity.
Key Concepts:
- Observability:
  - Definition: An approach to understand system behavior through outputs, connecting effects (e.g., errors) to causes across services, domains, and scales.
  - Not Just Tools: Unlike telemetry or monitoring tools, observability is a mindset for building accurate, fact-based models of application behavior to inform decisions.
  - Purpose: Helps developers diagnose issues in production by providing insights into why something changed, enabling faster restoration or improvement.
- Resilience:
  - Definition: The ability of a socio-technical system (technology and people) to minimize failure impacts and maintain functionality.
  - Tactics: Includes circuit breakers, load shedding, and retries, but resilience is a broader mindset acknowledging inevitable failures in complex systems.
  - Chaos Engineering: A systematic approach to test resilience by introducing failures (e.g., service outages, latency) to identify gaps in observability and system robustness.
- Benefits:
  - Developer Impact: Regular observability and resilience testing improve feature velocity, reduce on-call fatigue, and enhance quality of life.
  - Customer Impact: Ensures faster apps, less downtime, and more features, maintaining functionality during traffic spikes, migrations, or dependency changes.
Practical Applications:
- When It Matters:
  - During scaling, peak loads, traffic spikes, new service deployments, migrations, or unexpected issues.
  - Ensures systems remain operational and developers avoid burnout.
- Outcome: Customers experience seamless service without noticing the underlying observability and resilience efforts.
- Example: Chaos Engineering reveals “unknown unknowns,” allowing teams to strengthen systems proactively.
InApps Insight:
- InApps Technology, ranked 1st in Vietnam and 5th in Southeast Asia for app and software development, specializes in observability and resilience-driven solutions using tools like LightStep.
- Leverages React Native, ReactJS, Node.js, Vue.js, Microsoft’s Power Platform, Azure, Power Fx (low-code), Azure Durable Functions, and GraphQL APIs (e.g., Apollo) to build robust, failure-resistant applications.
- Offers outsourcing services for startups and enterprises, delivering cost-effective solutions at 30% of local vendor costs, supported by Vietnam’s 430,000 software developers and 1.03 million ICT professionals.

What is Observability?

Observability is often explained in terms of control theory, as understanding the state of a system only through its outputs. Others explain it as being built for unknown unknowns. The most practical definition is that observability allows you to connect effects with causes across many different services, domains, and scales.

When a team is trying to understand how their application is behaving in production, it’s not some abstract exercise. There’s something that changed and they need to understand why in a way that lets them restore or improve their application. Observability isn’t a tool and it isn’t telemetry; it’s an approach to having an accurate and fact-based model of application behavior to inform human decision making.
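As a minimal sketch of what "connecting effects with causes across services" can look like, the example below has two hypothetical services (`checkout` and `inventory`) tag every event with a shared trace ID. All names, the in-memory `EVENTS` store, and the failing SKU are assumptions made for illustration; the article does not prescribe any particular implementation or tool.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Event:
    trace_id: str   # shared across every service that handles one request
    service: str
    name: str
    attrs: dict = field(default_factory=dict)


# Hypothetical in-memory event store; a real system would export these.
EVENTS = []


def emit(trace_id, service, name, **attrs):
    EVENTS.append(Event(trace_id, service, name, attrs))


def inventory(trace_id, sku):
    # Hypothetical downstream service that fails for one particular SKU.
    if sku == "sku-42":
        emit(trace_id, "inventory", "lookup_failed", sku=sku, reason="stale cache")
        raise LookupError(sku)
    emit(trace_id, "inventory", "lookup_ok", sku=sku)


def checkout(sku):
    # Hypothetical entry-point service: creates the trace ID and passes it on.
    trace_id = uuid.uuid4().hex
    emit(trace_id, "checkout", "request_started", sku=sku)
    try:
        inventory(trace_id, sku)
        emit(trace_id, "checkout", "request_ok")
    except LookupError:
        emit(trace_id, "checkout", "request_failed", status=500)
    return trace_id


def explain(trace_id):
    # Walk from the visible effect back to its cause by following one trace.
    return [(e.service, e.name) for e in EVENTS if e.trace_id == trace_id]
```

Calling `explain(checkout("sku-42"))` returns the ordered events for that one request, so the failed response in `checkout` can be tied to the failed lookup in `inventory` rather than guessed at from separate logs.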

Resilience and How to Test It

Resilience is the ability of a (socio-technical) system to minimize the impact of failure. Failure happens all the time. Resilient systems take progressive steps to allow the most useful parts of the system to still serve their purpose. Particular tactics like circuit breakers or load shedding, or even basics like retrying failures, are part of building resilience, but resilience is also a mindset. It’s an approach that acknowledges the reality of complex systems and the presence of failure and error.
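Two of the tactics named above, circuit breakers and retries, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a production implementation: the class and function names, the thresholds, and the injectable `clock` and `sleep` parameters are all hypothetical choices made to keep the example small and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # cool-down in seconds
        self.clock = clock               # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("failing fast while dependency recovers")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open after a failed trial.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func with exponential backoff; never retry an open circuit."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller might combine the two as `retry(lambda: breaker.call(fetch_profile, user_id))`, where `fetch_profile` stands in for any hypothetical remote call. Retries absorb brief blips, while the breaker stops a struggling dependency from being hammered with further requests.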

Resilience isn’t something that is achieved and then never considered again. It must be tested, and ideally not just when things happen to break. Chaos Engineering is a systematic approach to testing the resilience of a system (including the people) against many types of failures. By purposefully introducing failures and degradations into the system, the development team can see what happens. Even more than that, they can find out what the gaps in their observability and resilience are, to “see what they can’t see.”
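The idea of purposefully introducing failures can be illustrated with a toy experiment. The sketch below wraps a hypothetical `recommendations` dependency so that a configurable fraction of calls fail, then checks whether a hypothetical `home_page` handler still serves every request. Every name, the failure rate, and the fixed random seed are assumptions for illustration only; real Chaos Engineering tooling works at the infrastructure level and is far more involved.

```python
import random


class InjectedFault(Exception):
    """Marks a failure that the experiment introduced on purpose."""


def with_faults(func, failure_rate, rng):
    # Wrap a dependency so a fraction of calls fail before reaching it.
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected by experiment")
        return func(*args, **kwargs)
    return wrapper


def recommendations(user_id):
    # Hypothetical non-critical dependency.
    return ["item-a", "item-b"]


def home_page(user_id, fetch):
    # Hypothetical handler under test: degrade gracefully instead of erroring.
    try:
        return {"status": 200, "recs": fetch(user_id)}
    except Exception:
        return {"status": 200, "recs": []}


def run_experiment(failure_rate, requests=1000, seed=7):
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    fetch = with_faults(recommendations, failure_rate, rng)
    pages = [home_page(user, fetch) for user in range(requests)]
    ok = sum(page["status"] == 200 for page in pages)
    degraded = sum(not page["recs"] for page in pages)
    return {"success_rate": ok / requests, "degraded": degraded}
```

Running `run_experiment(0.3)` tests the hypothesis that the home page keeps returning successful responses while roughly a third of recommendation calls fail. If the success rate dropped below 1.0, or if the degraded count never appeared in any dashboard, the experiment would have exposed a gap in resilience or observability.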

A Better Customer Experience

Ultimately, resilience testing and observability are about one thing: giving your customers a better experience. Faster apps, less downtime, and more features. By understanding the inner workings of all of your systems, and actively testing them against failures, you gain confidence in your teams, your technology, and your ability to keep customers happy even when things are going wrong. The journey to better sleep, and faster development, starts with a single step.

No customer will ever say, “wow this app is so observable” or “can you believe how resilient this software is?” In many ways, the effects of observability and resilience are unseen by those most impacted by them.

But when your system scales or reaches peak load, when traffic spikes or a large customer changes their behavior, when you deploy new services or your downstream dependencies deploy new services, when you begin a migration, when you complete a migration, or when nothing out of the ordinary appears to be happening but things are still breaking — in all these cases, a tested observability and resilience practice will let your system keep working and your devs keep sleeping.


Source: InApps.net


Phu Nguyen

As a Senior Tech Enthusiast, I bring a decade of experience to the realm of tech writing, blending deep industry knowledge with a passion for storytelling. With expertise ranging from software development to emerging tech trends like AI and IoT, my articles not only inform but also inspire. My journey in tech writing has been marked by a commitment to accuracy, clarity, and engaging storytelling, making me a trusted voice in the tech community.