Improving Reliability Through Blameless Postmortems

Technology

By Joe Benik | September 23, 2025

In the technology industry, a postmortem meeting is called in response to an incident, outage, or any disruption of service that impacts a company’s offerings. These retrospective meetings usually occur in the days following the disruption’s resolution to discuss the event. The purpose of this article is to highlight how blameless, well-automated postmortems transform incidents into opportunities for enhanced organizational learning and long-term reliability.

What is a Postmortem?

During postmortems, participants reflect on the incident in question to review what caused it, how it was handled, and which action items can reduce the chance of a similar incident happening in the future. Postmortems are an essential tool in site reliability engineering (SRE), a discipline Google originated in the early 2000s, and companies across the industry perform them regularly.

Postmortems/SRE at FactSet

We started doing our own postmortems many years ago, and it has been exciting to see how these meetings have evolved over time. In the early 2010s, postmortems were scheduled primarily for the largest outages. Although timely, the meetings occasionally involved too many people to stay organized and on topic. Consequently, some retrospectives led to follow-up meetings and delayed action items.

In 2017, we established a policy that every incident categorized as critical or above required a postmortem. We also created a small, dedicated SRE team responsible for continuously improving reliability across all of our firm’s offerings. Among the SRE team’s responsibilities were facilitating retrospectives, identifying areas of improvement, and taking action.

Changing an entire company’s approach was no easy task, but with the support of leadership and colleagues, the SRE team made gradual, steady improvements. To this day, we continue postmortems for every critical-level event; we review incidents, identify causes, and develop action items to reduce the likelihood of an incident occurring again.

Blameless Postmortems

In addition to requiring postmortems on critical incidents, our SRE team improved the quality of all postmortems by implementing the blameless postmortem model. The blameless postmortem is a concept popularized by Google’s SRE practice. Blamelessness is rooted in the assumption that all teams and employees acted with the best intentions based on the information available at the time.

Running a postmortem in this way allows all participants to objectively review the incident without assigning blame to an individual or team. As a result, these retrospectives focus on improving the system so that it catches future mistakes early, rather than on criticizing an individual or team for causing an outage.

For example, instead of asking postmortem questions like “Which team was responsible?” or “Why did they make that change?”, our SRE team asks empowering, solution-focused questions like “How can our process improve so that we catch this next time?” or “What can we change to prevent something like this from happening in the future?” The philosophy of addressing the process instead of the individual(s) is the foundation of a blameless postmortem.

Importantly, choosing not to blame is not the same as pretending that mistakes don’t happen. Conducting a blameless postmortem is a way to acknowledge that mistakes will happen and to facilitate the conversation on how to prevent them from becoming incidents in the future.

Over time, the SRE team guided FactSetters toward the blame-free principles and helped them become more familiar with the retrospective system. Eventually, blameless postmortems became the standard. Blameless postmortems are officially recorded in our policy, new hires are educated on what a blameless postmortem means, and the meetings themselves always start with a reminder of how to operate within a blame-free culture.

Benefits of a Blameless Postmortem

Embracing a blameless approach to postmortems fosters a healthy, collaborative culture within teams and across the company. When employees know that mistakes are the burden of the team’s process rather than of the individual, they are more likely to report issues honestly and share ideas for improvements. This mindset also empowers engineers to try new things and solve problems with confidence.

Adopting a blameless incident response is more than just a cultural decision. By shifting the focus away from individuals, FactSetters are encouraged to make changes that persist beyond one-time incidents. Holding an individual accountable may prevent that individual from making the same mistake for a short time, but a process update (such as an improvement to a system review) will improve stability for an entire team and their clients.

A blameless postmortem acknowledges that mistakes will always happen. Even the most senior employees make mistakes, and it is understood that new employees need time before completely understanding a new system. Instead of trying to solve the impossible task of preventing everyone from making mistakes, the blameless postmortem centers on developing an environment that identifies mistakes before they become incidents.

Consider the impact of a blameless postmortem on a small development team with two experienced senior developers and one junior developer. Although every developer can make mistakes, the senior developers often catch their own before they reach the codebase. The junior developer, however, makes a mistake that turns into an incident.

During a blame-based postmortem, the junior developer’s lack of code experience is identified as the reason for the outage. To resolve the issue, the team decides to teach the junior developer about all common pitfalls so they don’t make another mistake.

When a new hire joins the team several months later, they make the same mistake, which causes another incident. This time, the team hosts a blameless postmortem. Instead of teaching the new hire about every possible mistake, the development team creates a suite of automated tests that will block any new code change that contains known mistakes.

As the project grows, a collaborator from a different team contributes to a new feature. Instead of needing a robust knowledge of the codebase and previous mistakes, the collaborating engineer can propose changes with confidence, assured that automated tests check for and block common mistakes.
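To make the idea of tests that block known mistakes concrete, here is a minimal Python sketch. It is an illustration only, not a description of any real tooling: the `KnownMistake` catalog, the two example patterns, and the function names are all hypothetical, and a real team would build its catalog from its own postmortem action items.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownMistake:
    """A mistake identified in a past postmortem (hypothetical examples below)."""
    name: str
    pattern: re.Pattern
    advice: str


# Hypothetical catalog: each entry encodes a lesson from an earlier incident.
KNOWN_MISTAKES = [
    KnownMistake(
        name="zero-timeout",
        # Matches "timeout=0" but not "timeout=0.5" or "timeout=30".
        pattern=re.compile(r"timeout\s*=\s*0(?![\d.])"),
        advice="Use the shared default timeout instead of zero.",
    ),
    KnownMistake(
        name="bare-except",
        pattern=re.compile(r"^\s*except\s*:", re.MULTILINE),
        advice="Catch specific exceptions so errors are not hidden.",
    ),
]


def find_known_mistakes(changed_code: str) -> list[str]:
    """Return the names of known mistakes found in a proposed change."""
    return [m.name for m in KNOWN_MISTAKES if m.pattern.search(changed_code)]


def change_is_allowed(changed_code: str) -> bool:
    """A change is blocked when it contains any known mistake."""
    return not find_known_mistakes(changed_code)
```

Run as part of a review pipeline, a check like this applies the same rule to a junior developer, a new hire, and an outside collaborator alike, so the lesson lives in the process rather than in any one person's memory.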

The power of a blameless postmortem comes from the action items taken as a result of the meeting. By making improvements to the development process, teams can implement lasting solutions.

Improving Postmortems Through Automation

When incidents occur, response time is everything. Every moment is valuable, and manually handling logistics is the last place we want to spend those moments. That’s why our SRE team has created automated systems that help in times of rapid response.

As we identify incidents, automated systems handle logistics: relevant response teams are notified, and virtual triage rooms are opened and linked in central locations. Additionally, automated tools track and record relevant information through the incident lifecycle to prepopulate the retrospective process. As the incident progresses, this information initializes the postmortem process so that it is ready to assist reviewers once the issue has been resolved.
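The flow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our actual platform: the `IncidentRecord` class, the `open_incident` and `postmortem_draft` functions, the room naming scheme, and the injected `notify` callback are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class IncidentRecord:
    """Everything captured automatically while an incident is open."""
    incident_id: str
    severity: str
    events: list = field(default_factory=list)       # (timestamp, message) pairs
    participants: set = field(default_factory=set)   # teams or people involved

    def log(self, message: str, who: Optional[str] = None) -> None:
        """Record a timestamped event and, optionally, who was involved."""
        self.events.append((datetime.now(timezone.utc), message))
        if who:
            self.participants.add(who)


def open_incident(incident_id: str, severity: str, teams: list,
                  notify: Callable[[str, str], None]) -> IncidentRecord:
    """Open a triage room, notify response teams, and log each step."""
    record = IncidentRecord(incident_id, severity)
    room = f"triage-{incident_id}"  # hypothetical room naming scheme
    record.log(f"Triage room {room} opened")
    for team in teams:
        notify(team, f"[{severity}] Incident {incident_id}: join {room}")
        record.log(f"Notified {team}", who=team)
    return record


def postmortem_draft(record: IncidentRecord) -> dict:
    """Prepopulate a postmortem from what was recorded during the incident."""
    return {
        "incident_id": record.incident_id,
        "severity": record.severity,
        "invitees": sorted(record.participants),
        "timeline": [message for _, message in record.events],
        "action_items": [],  # filled in during the retrospective
    }
```

Passing `notify` in as a callback keeps the sketch independent of any particular paging or chat system; in practice it would wrap whatever notification service a team already uses.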

Prior to the beginning of the retrospective, our specialized postmortem platform uses participant information from the incident to help schedule and notify relevant reviewers. Automation makes sure the meeting is announced in a centralized location and provides the means of accessing the meeting.

During the postmortem, an incident timeline is automatically created from key points recorded during the disruption. A simple workflow guides both SREs and other reviewers through the process from discussion to action items. When action items are created, the platform interfaces with various work management tools, such as Jira or GitHub Issues, to ensure the work is recorded.
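As a rough illustration of these two steps, the Python sketch below builds a timeline from recorded key points and files action items through a generic tracker interface. Everything here is an assumption made for the example: `KeyPoint`, `build_timeline`, `WorkTracker`, `InMemoryTracker`, and `file_action_items` are hypothetical names, and the in-memory tracker stands in for a real integration rather than calling any actual Jira or GitHub API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Protocol


@dataclass(frozen=True)
class KeyPoint:
    """A notable moment recorded during the incident."""
    at: datetime
    note: str


def build_timeline(points: list) -> list:
    """Sort key points chronologically and label each with minutes since the first."""
    ordered = sorted(points, key=lambda p: p.at)
    if not ordered:
        return []
    start = ordered[0].at
    return [
        f"T+{int((p.at - start).total_seconds() // 60)}m {p.note}"
        for p in ordered
    ]


class WorkTracker(Protocol):
    """Hypothetical interface that any work management adapter would satisfy."""
    def create_ticket(self, title: str, body: str) -> str: ...


class InMemoryTracker:
    """Stand-in tracker for illustration; a real adapter would call the tool's API."""
    def __init__(self) -> None:
        self.tickets = {}

    def create_ticket(self, title: str, body: str) -> str:
        ticket_id = f"TICKET-{len(self.tickets) + 1}"
        self.tickets[ticket_id] = (title, body)
        return ticket_id


def file_action_items(incident_id: str, action_items: list, tracker: WorkTracker) -> list:
    """Create one ticket per action item so follow-up work is recorded."""
    body = f"Follow-up from the postmortem for {incident_id}"
    return [tracker.create_ticket(item, body) for item in action_items]
```

Writing against a small interface like `WorkTracker` is one way to let the same postmortem workflow feed different work management tools without changing the workflow itself.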

Conclusion

In summary, postmortems—especially those shaped by blameless principles—have become a cornerstone of our reliability strategy. By encouraging open dialogue and focusing on systemic solutions rather than individual blame, we ensure that every incident becomes an opportunity to learn and further strengthen our processes. By applying automation to our reliability and response workflows, we continue to enhance our postmortem process to make the most out of every opportunity and sustain reliability into the future.

This article was a collaborative effort between FactSet's Site Reliability Engineering group and Development Specialist Evan Murphy.

This blog post is for informational purposes only. The information contained in this blog post is not legal, tax, or investment advice. FactSet does not endorse or recommend any investments and assumes no liability for any consequence relating directly or indirectly to any action or inaction taken based on the information contained in this article.

Joe Benik

Vice President and Principal Software Architect

Mr. Joe Benik is Vice President and Principal Software Architect at FactSet. In this role, he is responsible for driving a culture of reliability across the company. He is a founding member of the company’s Site Reliability Engineering (SRE) team. Prior to his involvement with the company’s reliability efforts, he worked in analytics, focusing on financial risk models. Mr. Benik earned a Bachelor of Science in Computer Science from the University of Maryland, College Park.
