Track Overview
Observability and Understandability in Production
This track brings together innovative leaders from the world of incident management and large-scale distributed systems to discuss and share their current thoughts on the state of Observability and Understandability. How have we changed how we observe and understand our systems? What do we expect to see in the next 5-10 years as our communities increase their dependency on connected networks and systems? You’ll hear from the technical experts who’ve been in the trenches, achieved incredible results, and learned a ton along the way. The speakers on this panel didn’t just do the work; they created entire movements to bring thousands of people along with them to achieve and celebrate success.
From this track
More More More! Why the Most Resilient Companies Want More Incidents
Tuesday May 18 / 10:00AM EDT
Major tech companies like Facebook, Google, and Netflix want more incidents, not fewer. NASA wants them so urgently that they import incidents from other companies. The reason? Postmortems. This talk will focus on how companies of any scale can improve their ingestion of understandability by...
John Egan
CEO and Co-Founder @Kintaba
Observing and Understanding Failures: SRE Apprentices
Tuesday May 18 / 11:00AM EDT
In this session, Tammy will share how Padawans and Jedis can inspire and teach us how to help people of a wide variety of backgrounds, ages, and experience levels to observe and understand failures in production. Tammy will share how she and a colleague created an SRE Apprentice program to hire...
Tammy Bryant Butow
Principal Site Reliability Engineer @Gremlin
Resources & Transactions: A Fundamental Duality in Observability
Tuesday May 18 / 12:00PM EDT
Fundamentally, there are only two types of “things worth observing” when it comes to production systems: Resources and Transactions. The tricky (and interesting) part is that they’re entirely codependent. “Transactions” are the things that traverse your system and...
Ben Sigelman
CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard
Panel: Observability and Understandability
Tuesday May 18 / 01:00PM EDT
This panel will feature experienced practitioners who have worked in the engineering teams of Google, Facebook, Dropbox, and MongoDB. They are all now working at startups focused on helping engineers improve their ability to reduce downtime and customer-impacting failures. Hear from this panel...
Jason Yee
Director of Advocacy @Gremlin
John Egan
CEO and Co-Founder @Kintaba
Ben Sigelman
CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard
Speakers from this track
John Egan
CEO and Co-Founder @Kintaba
John Egan is CEO and cofounder at Kintaba, the modern incident response and management product for teams. Prior to Kintaba, John helped to lead enterprise products at Facebook.
Tammy Bryant Butow
Principal Site Reliability Engineer @Gremlin
Tammy Butow is the principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Tammy previously led SRE teams at Dropbox...
Ben Sigelman
CEO and co-founder @LightStepHQ, Co-creator @OpenTracing API standard
Ben Sigelman is a Cofounder & CEO at Lightstep, a company that makes complex microservice applications more transparent and reliable. He is an expert in distributed tracing and also co-founded the OpenTelemetry project.
Jason Yee
Director of Advocacy @Gremlin
Jason Yee is Director of Advocacy at Gremlin where he helps people build more resilient systems by learning from how they fail. He also leads the internal Chaos Engineering practices to make Gremlin more reliable. Previously, he worked at Datadog, O’Reilly Media, and MongoDB. His...
Track Host
Tammy Bryant Butow
Principal Site Reliability Engineer @Gremlin