Top 12 Best Practices for Better Incident Management Postmortems

Key Summary

This article from InApps Technology, published in 2022 and authored by Phu Nguyen with insights from Steve Tidwell, a tech industry veteran with over 20 years of experience, outlines 12 best practices for conducting effective incident management postmortems in IT. It focuses on improving the process, fostering accountability, and avoiding common pitfalls such as finger-pointing. The article defines postmortems as post-incident analyses that identify what went wrong, why, and how to prevent recurrence. It contrasts postmortems with Agile retrospectives, emphasizes blameless postmortems inspired by Google’s SRE Book, and provides actionable steps to streamline the process.

  • Context:
    • Author: Phu Nguyen, with contributions from Steve Tidwell, who has extensive experience in end-user support and global data platforms.
    • Theme: Effective postmortems are critical for continuous improvement in incident management, requiring a blameless culture and structured processes to address root causes.
    • Source: InApps article, aimed at IT professionals and DevOps teams.
  • Key Points:
    • What is a Postmortem?:
      • Definition: An analysis after an incident to determine intended outcome, actual outcome, root cause, and preventive measures.
      • Purpose: Create a feedback loop for continuous improvement across industries, especially in tech for incident management.
      • Format:
        • What was the intended outcome?
        • What actually happened?
        • Why did it happen?
        • How to avoid it in the future?
    • Postmortems vs. Retrospectives:
      • Postmortems: Conducted immediately after incidents (e.g., outages, failed deployments), focus on specific failures.
      • Retrospectives: Part of Agile cycles, held regularly (e.g., end of sprints), assess general performance (what went well, what didn’t, future improvements).
    • Pitfalls to Avoid:
      • Finger-Pointing: Leads to blame game (e.g., developers blaming QA, QA blaming Ops), derailing root cause analysis.
      • Example Scenario:
        • Intended: Deploy new code to production.
        • Actual: Website crashed.
        • Blame Game: Developers blame QA, QA blames Ops, Ops blames developers.
        • Result: No actionable solutions, frustration.
      • Human Nature: People are hard-wired to blame, requiring a blame-aware culture to counteract.
    • Blameless Postmortem (Inspired by Google’s SRE Book):
      • Approach: Objective analysis without blaming individuals/teams.
      • Example:
        • Intended: Deploy new code.
        • Actual: Website crashed.
        • Why: Staging/production environment mismatch caused undetected bug.
        • Solution: Add error checks, ensure identical environments, use canary deployments.
      • Key: Include actionable items with owners and follow-up to ensure completion.
    • 12 Best Practices for Postmortems:
      • Management Buy-In: Secure support from all levels to enforce process adherence.
      • Assign Process Owner: Responsible for scheduling and follow-up.
      • Keep Process Simple: Avoid complexity to ensure compliance.
      • Dedicated Ticketing Project: Track incident workflow in a ticketing system.
      • Simple Ticket Workflow: E.g., In Progress, Resolved, Root Cause Analysis, Follow-Up, Closed.
      • Minimal Ticket Fields: Include title, summary, personnel, incident owner, date, timeline, what/why happened, attachments, sub-tickets, due dates.
      • Enforce Ticket Creation: Mandate tickets for major incidents.
      • Assign Ticket Owner Post-Incident: Ensures root cause analysis and action item completion.
      • Appoint Process Owner for Workflow: Oversees ticket progression and meeting scheduling.
      • Initiate Postmortems for Major Issues: E.g., outages, failed deployments, security breaches, data loss, missed deadlines, recurring issues.
      • Avoid Postmortems for Minor Issues: E.g., minor problems, proactive maintenance, scheduled work (unless it causes incidents).
      • Eliminate Finger-Pointing: Foster blame-awareness and cooperation.
    • Additional Insights:
      • Challenges: Human tendency to blame requires cultural shift to blamelessness.
      • Resources: Articles like “Blameless Postmortems and a Just Culture” and “How to Run a Blameless Postmortem” provide further guidance.
      • Continuous Improvement: Regularly reassess postmortem process to adapt to organizational changes.
    • References:
      • Google’s SRE Book: Advocates blameless postmortems for learning from failures.
      • External Articles: Highlight blame-aware strategies and team collaboration.

Poorly implemented postmortems for IT incidents can be painful for everyone involved; they cost money, and worse yet, they can fail to address the root cause of the problem. In this post, we will discuss some of the pitfalls of postmortems and introduce several best practices that can help smooth the postmortem process — including choosing the right personnel, creating a culture of accountability, and conducting “blameless” postmortems. In short, we will explain what you need to do to improve the postmortem process for everyone involved.

Steve Tidwell

Steve has been working in the tech industry for over two decades, and has done everything from end-user support to scaling a global data ingestion and analysis platform to handle data analysis for some of the largest streaming events on the web. He has worked for a number of companies helping to improve their operations and automate their infrastructure.

What Is a Postmortem?

According to Merriam-Webster, a postmortem is “an analysis or discussion of an event after it is over.” In the tech world, postmortem meetings are a key component of the overall incident management process. They are conducted after an undesirable outcome to determine what went wrong, why it went wrong, and how it can be avoided in the future.

Postmortems are not limited to the tech world. Many industries and organizations utilize this process to create a feedback loop that allows for continuous improvement. Regardless of the industry, though, a postmortem will almost always follow the same basic format:

  1. What was the intended outcome?
  2. What actually happened?
  3. Why did it happen?
  4. How can it be avoided in the future?
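
The four questions above map naturally onto a small record type. The following is a minimal sketch in Python and is not part of the original article; the class name `PostmortemRecord`, its field names, and the `render` method are hypothetical names chosen for illustration, and the sample answers are shortened from the scenario used later in this post.

```python
from dataclasses import dataclass


@dataclass
class PostmortemRecord:
    """Hypothetical record holding answers to the four postmortem questions."""

    intended_outcome: str
    actual_outcome: str
    cause: str
    prevention: str

    def render(self) -> str:
        """Return the record as plain text, one question per section."""
        sections = [
            ("What was the intended outcome?", self.intended_outcome),
            ("What actually happened?", self.actual_outcome),
            ("Why did it happen?", self.cause),
            ("How can it be avoided in the future?", self.prevention),
        ]
        return "\n\n".join(f"{question}\n{answer}" for question, answer in sections)


record = PostmortemRecord(
    intended_outcome="Deploy new code and features to production.",
    actual_outcome="The website went down during a scheduled deployment.",
    cause="Staging and production environments differed.",
    prevention="Align the environments; add a canary deployment step.",
)
print(record.render())
```

Making all four fields required means a record cannot be created with one of the questions left unanswered, which is one simple way to keep the format consistent across incidents.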

Retrospectives vs. Postmortems

Postmortems resemble Agile retrospectives in intent, but there are a few key differences. Postmortems are normally held as soon as possible after an event or incident occurs. Retrospectives are held on a regular schedule, typically at the end of each sprint, as part of a wider Agile strategy that also includes sprint planning and a daily standup.

Although there are different ways to implement a retrospective, they usually look something like this:

  1. What went well during the project, sprint, or prior period?
  2. What didn’t go so well?
  3. What would we like to see in the future?

What to Avoid in a Postmortem Process

So can postmortems go wrong? Very easily, as it turns out. In an organization without proper accountability or a well-planned postmortem process, the most common problem is usually finger-pointing — or what is sometimes called “The Blame Game.”

Many people can probably relate to this scenario. A poorly moderated postmortem discussion would go something like this:

  1. Question: “What was the intended outcome?”
    Answer: “To successfully deploy new code and features to production.”
  2. Question: “What actually happened?”
    Answer: “The website went down during a regularly scheduled deployment.”
  3. Question: “Why did that happen?”
    Developers might answer: “QA signed off. They didn’t have a proper test strategy and let a bug slip into production.”
    QA might answer: “Ops didn’t configure the production environment correctly. If it weren’t for that, we would have caught this before it went out.”
    Ops might answer: “If the code had been written correctly, the application wouldn’t have crashed in the first place.”
  4. Question: “How can it be avoided in the future?”
    Developers might answer: “QA needs to do a better job in the future!”
    QA might answer: “Ops needs to do a better job in the future!”
    Ops might answer: “Developers need to do a better job in the future!”
    Management: “Sigh…”

The Blameless Postmortem

Google’s SRE Book has an excellent postmortem strategy in the chapter entitled, “Postmortem Culture: Learning from Failure.” It discusses why postmortems need to be conducted objectively (hint: people are hard-wired to point fingers) and why collaboration is a better approach (because most people want to learn from their mistakes and make things work better for everyone else too).

A practical implementation of a blameless postmortem would look something like this:

  1. Question: “What was the intended outcome?”
    Answer: “To successfully deploy new code and features to production.”
  2. Question: “What actually happened?”
    Answer: “The website went down during a regularly scheduled deployment.”
  3. Question: “Why did that happen?”
    Answer: “The staging and production environments were different. A bug that didn’t manifest in the staging environment manifested in production. That caused the application to crash.”
  4. Question: “How can it be avoided in the future?”
    Answer: “We should include additional checks in the code to improve our ability to catch error conditions and prevent the application from crashing. We should make sure that the staging and production environments are identical. If that’s not possible, we should implement additional testing using a canary deployment (or other means) to catch bugs before they are fully deployed to production.”

The last step should also include a list of actionable items, with an owner assigned to each one. A routine follow-up should also be conducted to ensure that those action items were actually completed in a timely manner.
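
One way to make the follow-up routine concrete is to track each action item with an owner and a due date, then list the open items that are past due. The sketch below is an illustration only and is not from the original article; `ActionItem`, `overdue`, the team names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical action item produced by a postmortem."""

    description: str
    owner: str
    due: date
    done: bool = False


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Return open items whose due date is before `today`."""
    return [item for item in items if not item.done and item.due < today]


# Illustrative data only; owners are teams here, but could be individuals.
items = [
    ActionItem("Add error checks around startup config", "dev-team", date(2024, 3, 1)),
    ActionItem("Make staging match production", "ops-team", date(2024, 3, 15), done=True),
    ActionItem("Add a canary deployment stage", "ops-team", date(2024, 4, 1)),
]

for item in overdue(items, today=date(2024, 3, 20)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```

In practice this information would normally live in the ticketing system rather than in a script; the point is only that every item carries an owner and a date that the follow-up can check against.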

Notice that at no point in our blameless postmortem scenario did anyone attempt to blame another group. Instead, they conducted an objective analysis of the incident. This process would also include a proper root cause analysis, along with a list of possible remedial actions. You can also get ahead of the blame game by proactively avoiding some common communication mistakes among teams.

Potential Postmortem Pitfalls

The problem with trying to instill an accountable yet blameless culture in organizations is that, as we mentioned earlier, humans tend to be hard-wired to point the finger — whether it’s at themselves or someone else.

For an example of how you can avoid “the blame game,” check out “Blameless postmortems don’t work. Here’s what does.” In short: make sure that your process is solid, hold people to that process, keep in mind that you are dealing with human beings, stay “blame aware,” and work with your teams to help them understand healthier ways to interact and improve.

Postmortem Best Practices

The following are a few best practices and tips to help you on your journey to a better postmortem process:

  1. Obtain buy-in from management, from the bottom all the way to the top. Without some kind of authority behind your process, it will most likely go nowhere.
  2. Assign a process owner. This individual will be responsible for all follow-up, including scheduling meetings.
  3. Keep the overall process simple. Complicated processes make gaining acceptance more difficult. A lack of acceptance begets non-compliance.
  4. Create a project in your ticketing system dedicated solely to tracking incident workflow.
  5. Keep the ticket workflow simple.
    •  For example, a simple workflow might be something like:
      1. Incident in progress
      2. Incident resolved
      3. Root cause analysis
      4. Incident follow-up
      5. Incident closed
  6. Keep the amount of information required for a ticket to a minimum. With fewer fields in the ticket, it is easier for people to identify the information that will facilitate the process. It also increases the likelihood that the ticket will be filled out properly.
    • A minimalist ticket might look like the following:
      1. Title
      2. Executive Summary
      3. List of personnel who participated in resolving the incident
      4. Ticket (incident) owner
      5. Incident date
      6. Start and end time of the incident. (We recommend using UTC if you have an organization that spans more than one timezone. This will also help keep the timeline more accurate when reviewing server or chat logs, since correlation is easier when it doesn’t require conversion.)
      7. Incident timeline
      8. What happened?
      9. Why did it happen (i.e., root cause analysis, or RCA)?
      10. Attachments, links, graphs, logs, or other information
      11. Sub-tickets with suggested follow-up actions
      12. Due date for follow-up
  7. Enforce ticket creation whenever a major incident occurs. This can be done by the individual or team responding to the incident, or by an Incident Coordinator.
  8. Once the incident is over, assign the ticket to an owner. The owner will be responsible for following up on the root cause analysis and ensuring that action items that were created during postmortem discussions are completed.
  9. Appoint a process owner to ensure that tickets in the incident project move through the workflow. In addition, the process owner should be responsible for scheduling meetings as needed.
  10. You should initiate a postmortem when you have:
    1. Major outages that impact end users
    2. Failed deployments
    3. Security breaches
    4. Data loss
    5. Missed deadlines
    6. Repeated or unresolved incidents
  11. You should avoid a postmortem when you have:
    1. Minor problems
    2. Proactive maintenance to prevent larger problems
    3. Scheduled work (unless the work itself causes an incident)
  12. Finally, stamp out finger-pointing wherever possible and try to create a culture of “blame-awareness” and cooperation.
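
To illustrate practices 5, 6, 10, and 11, here is a minimal Python sketch of a ticket that moves through the five workflow stages, stores its start and end times in UTC, and checks whether the incident type calls for a postmortem. It is not part of the original article and does not model any particular ticketing system. The names `Stage`, `IncidentTicket`, `POSTMORTEM_TRIGGERS`, and the `kind` labels are hypothetical, and the ticket carries only a subset of the fields listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Stage(Enum):
    """The five workflow stages from practice 5, in order."""

    IN_PROGRESS = 1
    RESOLVED = 2
    ROOT_CAUSE_ANALYSIS = 3
    FOLLOWUP = 4
    CLOSED = 5


# Incident types that warrant a postmortem (practice 10).
# The string labels are hypothetical, not a standard taxonomy.
POSTMORTEM_TRIGGERS = {
    "major_outage", "failed_deployment", "security_breach",
    "data_loss", "missed_deadline", "repeated_incident",
}


@dataclass
class IncidentTicket:
    """Hypothetical minimal ticket; a subset of the fields in practice 6."""

    title: str
    summary: str
    owner: str
    kind: str
    started: datetime
    ended: datetime
    stage: Stage = Stage.IN_PROGRESS
    responders: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Normalize timestamps to UTC so timelines line up with logs.
        # Naive datetimes are rejected rather than guessed at.
        for name in ("started", "ended"):
            value = getattr(self, name)
            if value.tzinfo is None:
                raise ValueError(f"{name} must be timezone-aware")
            setattr(self, name, value.astimezone(timezone.utc))

    def advance(self) -> Stage:
        """Move to the next stage; a closed ticket cannot advance."""
        if self.stage is Stage.CLOSED:
            raise ValueError("ticket is already closed")
        self.stage = Stage(self.stage.value + 1)
        return self.stage

    def needs_postmortem(self) -> bool:
        """True when the incident type is one of the listed triggers."""
        return self.kind in POSTMORTEM_TRIGGERS


pacific = timezone(timedelta(hours=-8))  # fixed offset, for illustration only
ticket = IncidentTicket(
    title="Website down during deployment",
    summary="Production crashed after a scheduled release.",
    owner="incident-coordinator",
    kind="failed_deployment",
    started=datetime(2024, 3, 1, 9, 0, tzinfo=pacific),
    ended=datetime(2024, 3, 1, 9, 45, tzinfo=pacific),
)
print(ticket.started.isoformat())  # stored in UTC
print(ticket.needs_postmortem())
print(ticket.advance().name)
```

Keeping the workflow strictly linear mirrors the advice to keep it simple: a ticket can only move forward one stage at a time, so the process owner can see at a glance which tickets are stalled.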

This article will point you in the right direction when it comes to postmortems, but there are many variables that organizations will need to assess in order to determine what will work best for them. Keep in mind that the postmortem process itself should be reassessed over time in order to account for changes in requirements and to make sure that it is still optimal for your organization.

There are many excellent articles that describe how different companies have implemented their version of the “blameless postmortem.” In particular, see Blameless PostMortems and a Just Culture, as well as How to run a blameless postmortem, and Tuning Blameless Postmortems.

InApps is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Torq.

Feature image via Pixabay.

Source: InApps.net
