
Session + Live Q&A

More More More! Why the Most Resilient Companies Want More Incidents

Major tech companies like Facebook, Google, and Netflix want more incidents, not fewer. NASA wants them so urgently that it imports incidents from other organizations. The reason? Postmortems. This talk will focus on how companies of any scale can improve understandability by lowering their barriers to incident reporting and ruthlessly simplifying their processes for documenting knowledge and distributing postmortems.

Main Takeaways

1. Hear how some tech organizations are doing incident management.

2. Learn how to lower the barrier so more incidents are dealt with and documented, leading to long-term resilience.

What is the work that you're doing today?

I'm a cofounder at Kintaba, which means most of my time is spent establishing direction, marketing, PR, and everything else for our company. But the thing that really keeps me going is the work around evangelizing the adoption of modern incident management across entire organizations. With our product, we push really hard to convince entire companies, not just SRE teams, to adopt resilience engineering practices. A lot of our time is spent thinking about how to lower that barrier: how do we make it easier for product engineering teams and non-technical teams to adopt incident response processes?

Can you explain what incident response is?

There are two categories here. There's incident response, which is the direct actions taken against a major outage or incident, generally within a product or engineering team. It's something you might see in public as AWS going down or Google going down: major customer-impacting events. But any major incident that puts the company at risk can generally be categorized as a moment when a real-time response is required. That's as opposed to task or project management, where you're saying, look, these things have deadlines, we're going to get them done in a week or a month. Incident response is all about the real-time reaction to the unexpected: generally black swan situations that can put an entire company at risk at any moment. At a macro level above that, there's incident management, which is the overarching process that encompasses how your company approaches incidents. So just to be clear, there are actually two areas there, and a product like ours is both: incident management at the top level that you would implement, with incident response included as a feature.

What exactly does it mean? Is it managing processes? Is it communication? Is it a way of connecting the right teams or is it tools?

It's a consistent process that you implement for how you're going to deal with the unexpected. As a set of actions, it's everything from how you centrally file the incident such that your entire organization knows that it's happening, to how you contact the appropriate responders and bring that response team together so you can start working toward a resolution, to how you follow up and learn from the incident. Kintaba includes a collaborative space where responders come together to interact and communicate as they work towards mitigation, and it handles the recordkeeping of that space as well. Across the company, Kintaba is the ultimate status tracking system: keeping everyone updated on the status as the incident moves towards mitigation and resolution. This includes the emails that go out to executives and other people in the company so they're aware of how far along the team is in progressing towards mitigation. And finally, it handles the capture of knowledge after an incident. We call these postmortems: the write-up of what was learned once the incident is closed, which is distributed to the rest of the organization so that the company learns and it becomes less likely, and hopefully impossible, for the same incident to happen again. So it's that entire lifecycle of an incident that we call incident management.
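The lifecycle described here — file the incident, assemble responders, track status toward mitigation, then close with a distributed postmortem — can be sketched as a tiny state machine. This is an illustrative model only; the state names, fields, and methods below are assumptions for the sketch, not Kintaba's actual data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Illustrative lifecycle states (not Kintaba's actual model).
class State(Enum):
    OPEN = auto()        # incident filed; org-wide visibility begins
    MITIGATING = auto()  # responders assembled, working toward mitigation
    RESOLVED = auto()    # customer impact has ended
    CLOSED = auto()      # postmortem written and distributed

@dataclass
class Incident:
    title: str
    responders: list = field(default_factory=list)
    state: State = State.OPEN
    postmortem: str = ""

    def page(self, responder: str) -> None:
        """Bring a responder into the shared response space."""
        self.responders.append(responder)
        self.state = State.MITIGATING

    def resolve(self) -> None:
        """Mark the impact as over; learning still remains to be captured."""
        self.state = State.RESOLVED

    def close(self, postmortem: str) -> None:
        """Closing requires a postmortem so the knowledge is captured."""
        if not postmortem:
            raise ValueError("a postmortem is required before closing")
        self.postmortem = postmortem
        self.state = State.CLOSED

# Walk one incident through the full lifecycle.
inc = Incident("checkout errors spiking")
inc.page("oncall-engineer")
inc.resolve()
inc.close("Root cause: expired TLS cert; action item: automate renewal.")
print(inc.state.name)  # CLOSED
```

The one design point worth noting is the guard in `close`: encoding "no close without a postmortem" in the workflow itself is what turns incident response into incident management.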

I guess you need to be a company of a certain size.

A big part of our goal here is to make it easier for companies to implement all of this, because when I say it, it sounds like there are all of these different steps. But in reality, if you have a product like Kintaba taking care of that flow for you, it's pretty lightweight. What you're doing is telling the product to start the process for you, and in the background it's really just working to make sure you complete each step. We see companies as small as five to 10 people able to practice this pretty successfully. It really starts to take off when you get to around 20 or 30 people: the point where, as an organization, the distribution of roles might not necessarily be something that everyone within the company knows. And then certainly as you grow to a hundred people, a thousand people, it becomes critical; it's not even an optional thing to have within the organization anymore. So it becomes, I think, more critical as companies get larger. But one of the big goals and movements in the industry right now is to simplify these processes so that they're useful for companies of all sizes.

The bigger the company, the more impossible it is to know the one person who knows how the network works, and things like that.

Yes, and I think a lot of these side tools really feed in here, things like on-call rotations: knowing who the person is within the engineering team. But with incident management and incident response, you care about the non-technical roles, too. Who's the PR person who's on call? Who's the legal person who's on call? These folks are important, especially in customer-facing incidents. Even in a company of 20 or 30 people, it can be hard to know who on the legal team needs to be part of the response because there is PII, personally identifiable information, involved; an engineer might not know that. Making all of that information easy to access and available becomes pretty valuable pretty early on.

What are your goals for the talk?

The title of my talk is "More More More!" We're talking about why it's important to increase the total number of incidents being tracked in your organization. And we really want to push against what we see at a lot of companies, especially medium-sized companies, where all of the effort internally is thought to be about getting the charts tracking the total number of incidents to go down.

So incident response has been important in industries like airlines. How long has this been going on?

Outside of Silicon Valley, this was born in the aviation industry coming out of World War Two and beyond. In the 50s and 60s, we started looking at the way we deal with catastrophe in high-risk industries like airlines, where an accident means lives are lost. These organizations have spent decades working against the natural human reaction, which is to blame other humans and say, well, this was a failure of that person, and instead recognizing that the incidents themselves represent systemic challenges within the company, or even the industry, that need to be addressed. As soon as you realize that, it changes the way you approach what an incident means in terms of the reaction you're going to take as a company. Rather than saying, an incident has happened, let's fire someone, it becomes: an incident has happened, we need to go and change the way things work to make sure no other human makes that same mistake again. It sounds pretty obvious when we say it today, but there's a solid 40 years there. It was outside of the valley where this was refined, and it really wasn't until the 80s and 90s that we started to get good at what we would call blame-free postmortems and incident management.

The tech world started to adopt this with the Googles and Facebooks back in the early 2000s. Google, I believe, absorbed a lot of its original incident management practices from the Mountain View Fire Department: basically pulling in these concepts of what you are going to do about major outages and incidents, and how you are going to react to them in a predictable and manageable way so that you're not panicking all the time. Google published a book, I think in 2016, that was the first real public statement Google made about this; it was the SRE handbook, which had two whole chapters on incident management and how they practice it. And other organizations like Facebook and Netflix have also been practicing this.
Recently, over the last maybe five years, it's started to filter out from the big companies out into other tech organizations where it's becoming a common practice as opposed to just something you do once you're a unicorn.


John Egan

CEO and Co-Founder @Kintaba

John Egan is CEO and cofounder at Kintaba, the modern incident response and management product for teams. Prior to Kintaba, John helped to lead enterprise products at Facebook.



Tuesday May 18 / 10:00AM EDT (40 minutes)


Observability and Understandability in Production


Incident Management, Resilience, DevOps



From the same track

Session + Live Q&A Observability

Resources & Transactions: A Fundamental Duality in Observability

Tuesday May 18 / 12:00PM EDT

Fundamentally, there are only two types of “things worth observing” when it comes to production systems: Resources, and Transactions. The tricky (and interesting) part is that they’re entirely codependent. “Transactions” are the things that traverse your system and...

Ben Sigelman

CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard

Session + Live Q&A Observability

Observing and Understanding Failures: SRE Apprentices

Tuesday May 18 / 11:00AM EDT

In this session, Tammy will share how Padawans and Jedis can inspire and teach us how to help people of a wide variety of backgrounds, ages, and experience levels to observe and understand failures in production. Tammy will share how she and a colleague created an SRE Apprentice program to hire...

Tammy Bryant Butow

Principal Site Reliability Engineer @Gremlin

PANEL DISCUSSION + Live Q&A Observability

Panel: Observability and Understandability

Tuesday May 18 / 01:00PM EDT

This panel will feature experienced practitioners who have worked in the engineering teams of Google, Facebook, Dropbox, and MongoDB. They are all now working at startups focused on helping engineers improve their ability to reduce downtime and customer-impacting failures. Hear from this panel...

Jason Yee

Director of Advocacy @Gremlin

John Egan

CEO and Co-Founder @Kintaba

Ben Sigelman

CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard
