Session + Live Q&A

The Scientific Method for Testing System Resilience

Do you remember the Scientific Method from elementary school science class? It's time to dust off that knowledge and use it to your advantage to test your IT systems! In this session, you'll be re-introduced to the Scientific Method, and learn how Vanguard's software engineers and IT architects draw inspiration from it in their resilience testing efforts. We’ll do a deep dive into the "Failure Modes and Effects Analysis" technique, in which engineers examine complex architecture diagrams, asking themselves questions about the failure modes of various technical components and developing hypotheses based on their expectations of how the system would behave. Then, we’ll discuss how the engineers use these conjectures as inputs into experimentation, selecting and executing chaos experiments accordingly to validate (or disprove!) their hypotheses. We’ll even take a look behind the curtain at how some of these fault injection tests are implemented at Vanguard.
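The hypothesis-then-experiment loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Vanguard's implementation: the `Hypothesis` record, the `run_experiment` helper, and the in-memory "system" with its `kill_primary` fault are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    """One row of a hypothetical FMEA worksheet."""
    component: str          # part of the architecture under examination
    failure_mode: str       # how that component might fail
    expected_behavior: str  # what the team predicts the system will do


def run_experiment(
    hypothesis: Hypothesis,
    inject_fault: Callable[[], None],
    system_behaves_as_expected: Callable[[], bool],
) -> str:
    """Inject the fault, observe the system, and report the outcome."""
    inject_fault()
    outcome = "validated" if system_behaves_as_expected() else "disproved"
    return f"{hypothesis.component} / {hypothesis.failure_mode}: {outcome}"


# Illustrative stand-ins for a real system and a real fault injector.
state = {"primary_up": True, "replica_up": True}


def kill_primary() -> None:
    state["primary_up"] = False


def reads_still_served() -> bool:
    return state["primary_up"] or state["replica_up"]


hypothesis = Hypothesis(
    component="database",
    failure_mode="primary instance becomes unavailable",
    expected_behavior="reads continue from the replica",
)
result = run_experiment(hypothesis, kill_primary, reads_still_served)
print(result)
```

In a real setting the fault injector and the observation would talk to actual infrastructure and monitoring; the point of the sketch is only that each hypothesis names a component, a failure mode, and an expected behavior before any fault is injected.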

Main Takeaways

1. Hear about how Vanguard deals with issues in its software systems.

2. Learn how to use Failure Modes and Effects Analysis.

Christina, what is the focus of your work these days?

Right now, my primary focus is the staffing, onboarding, and subsequent education of site reliability engineers for Vanguard. So I handle everything from what it means to be a site reliability engineer day to day, to what tools and technologies they'll need to be familiar with and how best to get them up to speed. I also work on where we are going to find these SREs, how many we need, and where we should place them within our organization.

What is the motivation for your talk?

I want to share the story of a practice we've adopted across Vanguard, in particular in areas where we have site reliability engineers on staff to do the work: failure modes and effects analysis. I'll cover where the practice started, which involved many challenges and lots of bumps in the road, the various iterations we went through to make it better, and the value we've derived from making it a step for the majority of applications going into production at Vanguard. I'll also share all of the frameworks so that attendees can repeat the successes we've seen at Vanguard in their own systems at their companies.

And how would you describe the persona and the level for the target audience?

I think the right audience for this talk is anyone in a role like technical lead for a software system. Oftentimes, the people involved in the conversations that make up a failure modes and effects analysis are technical leads, architects, or senior engineers who can look at an architecture diagram, ask the right questions, interpret what they're seeing, and make suggestions for how to improve the architecture to make it more resilient.

What do you want this persona to walk away with from your presentation?

I hope that anyone attending my presentation will feel confident taking what they've learned about the failure modes and effects analysis technique and even chaos experimentation, and be able to bring that back to their organizations, to the software systems that they are building and apply it to their own work so that they can reap the same benefits I have.


Christina Yakomin

Senior Site Reliability Engineering Specialist @Vanguard_Group

Christina is a Senior Site Reliability Engineering Specialist in Vanguard's Chief Technology Office. She has worked at the company's Malvern, PA headquarters since graduating from Villanova University with an undergraduate degree in Computer Science. Throughout her career, she...


Wednesday May 18 / 12:30PM EDT (50 minutes)


Resilient Architectures


Architecture, Enterprise Architecture, Resilience, Resiliency


From the same track

Session + Live Q&A Architecture

Resiliency Superpowers with eBPF

Wednesday May 18 / 09:00AM EDT

eBPF is a powerful technology that allows us to run custom programs in the kernel. It’s enabling a whole new generation of tools for networking, security and observability. Let’s explore how it can help us build resilient architectures. This talk - with demos - considers...

Liz Rice

Chief Open Source Officer @Isovalent

Session + Live Q&A Fault Tolerance

How to Test Your Fault Isolation Boundaries in the Cloud

Wednesday May 18 / 11:20AM EDT

Will my system keep working when a server fails? When a data center goes offline? When a service dependency is unavailable? Availability calculations for redundant components require that those components are independent and autonomous of each other. But modern day systems are complex, exhibiting...

Jason Barto

Principal Solutions Architect @AWS

Session + Live Q&A Architecture

Resilient Real-Time Data Streaming Across the Edge and Hybrid Cloud

Wednesday May 18 / 10:10AM EDT

Hybrid cloud architectures are the new black for most companies. A cloud-first strategy is evident for many new enterprise architectures, but some use cases require resiliency across edge sites and multiple cloud regions. Data streaming with the Apache Kafka ecosystem is a perfect technology for...

Kai Waehner

Field CTO @Confluentinc