Session + Live Q&A

Robust Foundation for Data Pipelines at Scale - Lessons From Netflix

At Netflix, Data/ML pipelines are widely used and have become central to the business, supporting a wide range of use cases that go beyond recommendations, predictions, and data transformations. As big data and ML gain presence and become more impactful, the scalability and stability of the ecosystem have become increasingly important for our data scientists and the company.

Over the past years, we have developed a robust foundation composed of multiple cloud services and libraries that provides users a consistent way to define, execute, and monitor units of work. In the big data and ML space, our foundation is responsible for reliably executing a large number of Data/ML workflows containing tens of thousands of parallel jobs, in addition to supporting event-driven triggers and conditional branches.
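
As a rough illustration of what such a unit-of-work definition can look like, here is a hypothetical sketch (not the actual Netflix API; all names are illustrative) of a workflow with parallel jobs, an event-driven trigger, and a conditional branch:

```python
# Hypothetical sketch, not the actual Netflix API: a workflow with two
# parallel extract jobs, an event-driven trigger, and a conditional branch.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    name: str
    command: str
    depends_on: List[str] = field(default_factory=list)
    condition: Optional[str] = None  # run only if this expression holds

@dataclass
class Workflow:
    name: str
    trigger: str            # a cron expression or an event/signal name
    steps: List[Step] = field(default_factory=list)

wf = Workflow(
    name="daily_metrics",
    trigger="signal:raw_events_landed",  # event-driven, not time-based
    steps=[
        # These two steps have no dependencies, so they run in parallel.
        Step("extract_us", "spark-submit extract.py --region us"),
        Step("extract_eu", "spark-submit extract.py --region eu"),
        # Fan-in: aggregate runs only after both extracts complete.
        Step("aggregate", "spark-submit aggregate.py",
             depends_on=["extract_us", "extract_eu"]),
        # Conditional branch: backfill runs only on Mondays.
        Step("weekly_backfill", "spark-submit backfill.py",
             depends_on=["aggregate"],
             condition="run_date.weekday() == 0"),
    ],
)
```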

In this talk, we will share our experiences building and operating the orchestration platform for Netflix’s big data ecosystem. We will discuss the challenges we faced in managing hundreds of thousands of pipelines and the lessons we learned automating them over the past years, such as fair resource allocation, scaling problems, and security concerns. We will also share best practices for workflow lifecycle management and our design philosophy for workflow automation, including the patterns we developed and the approaches we took.

Main Takeaways

1 Hear about Netflix’s orchestration platform and how they built and operate it.

2 Learn about the challenges they encountered and the lessons they learned managing hundreds of thousands of pipelines over the years.

3 Find out about some of the best practices for workflow lifecycle management.


What is the work that you are doing today?

Jun: I work on the Big Data Orchestration team at Netflix. Our team owns multiple orchestration services. My work focuses on designing and building the Netflix workflow orchestrator, a robust and scalable platform that provides workflow as a service; it is widely used by thousands of Netflix internal users. Given Netflix’s scale, one of my main tasks is to develop the scheduler so that it not only supports a wide variety of use cases but also scales up and out to automate hundreds of thousands of data pipelines.

Harrington: I work on the Data Platform Orchestration team at Netflix. My work focuses on building the next generation of scheduler tools. This means building all the orchestration layers our users need to execute jobs and schedule DAGs. In addition, I work on building the event-driven platform for the data platform. We want our platform to be smarter, and we have been adopting and leveraging events to react to anything that happens in the platform. This has allowed us to orchestrate executions based on events instead of relying solely on time-based mechanisms. For example, instead of saying, “I want to run this workflow at midnight because a specific metric I need is likely going to be ready around that time”, we allow our users to say, “I want this workflow to run when the metrics I need are available”. By doing this, we make our platform more efficient: instead of running when you think you have to, you run when you actually have to.
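
A minimal sketch of the contrast Harrington describes, assuming illustrative names rather than the platform’s real API:

```python
# Hypothetical sketch with illustrative names, not the platform's real API.
required_metrics = {"daily_plays", "daily_signups"}

def run_workflow():
    print("workflow started: all inputs are ready")

# Time-based trigger: run at midnight and hope the inputs exist by then.
def on_cron_tick(hour: int):
    if hour == 0:
        run_workflow()  # may fail or read stale data if inputs are late

# Event-driven trigger: run as soon as the last required input lands.
arrived = set()

def on_metric_available(metric_name: str):
    arrived.add(metric_name)
    if required_metrics <= arrived:  # every dependency is satisfied
        run_workflow()

# The workflow starts only once both availability events have arrived.
on_metric_available("daily_plays")
on_metric_available("daily_signups")
```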

What are the goals for the talk?

Jun: In this talk, we want to share our experiences building this robust platform. We would like to share the best practices and lessons we have learned while operating the platform over the past several years. For each of the components, we will talk about its design and the technical decisions we made to better serve our users. After the talk, the audience will understand the tradeoffs in building and managing a large data pipeline platform and can apply some of these principles and best practices to their own work where they see fit.

Harrington: We've been working on this for quite a while, several years or so. Over this time, we have learned that by separating everything into components or layers, we have been able to build a solid platform. This has worked out quite well for us. Whenever we have had to make changes or evolve our tools, we have been able to do so with zero to minimal impact on our users and the applications built on top of us.

I expect that during this talk, I can clearly communicate how the separation of concerns that we have adopted into our implementation has helped us build our platform. I also want people to understand what we have learned, as well as the benefits of each of the main components. Finally, I would like people to walk away understanding how the platform has been able to evolve and keep up with the scale.
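
To make the layering idea concrete, here is an assumed sketch (not Netflix’s actual design; the class and method names are hypothetical) of how an orchestration layer can depend only on a narrow interface, so the execution engine underneath can change without impacting what is built on top:

```python
# Assumed sketch of the layering idea, not Netflix's actual design: the
# orchestration layer depends only on a narrow interface, so the execution
# engine underneath can be swapped with minimal impact on users.
from abc import ABC, abstractmethod

class ExecutionEngine(ABC):
    @abstractmethod
    def submit(self, job_spec: dict) -> str:
        """Launch a unit of work and return an execution id."""

class SparkEngine(ExecutionEngine):
    def submit(self, job_spec: dict) -> str:
        return f"spark-{job_spec['name']}"  # placeholder submission

class Orchestrator:
    """Schedules DAG steps; knows nothing about how jobs actually run."""
    def __init__(self, engine: ExecutionEngine):
        self.engine = engine

    def run_step(self, step: dict) -> str:
        return self.engine.submit(step)

# Swapping in a different engine is invisible to workflows built on top.
orchestrator = Orchestrator(SparkEngine())
print(orchestrator.run_step({"name": "aggregate_metrics"}))
```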


Speaker

Jun He

Sr. Software Engineer in the Big Data Orchestration Team @Netflix

Jun He is a Sr. Software Engineer in the Big Data Orchestration team at Netflix, where he is responsible for building the big data workflow scheduler to manage and automate ML and data pipelines at Netflix. Prior to Netflix, he spent a few years building distributed services and search...


Speaker

Harrington Joseph

Sr. Software Engineer @Netflix Data Platform Orchestration Team

Harrington Joseph is a Sr. Software Engineer on the Netflix Data Platform Orchestration team. His work is focused on data orchestration and high-throughput event-driven architectures. Currently, Harrington is actively working on building the next generation of scheduling tools for Netflix Data...


Date

Thursday May 27 / 10:10AM EDT (40 minutes)

Track

Modern Data Pipelines and Data Mesh

Topics

Architecture, Data Pipeline, Machine Learning, Database


From the same track

Session + Live Q&A Architecture

Data Mesh: An Architectural Deep Dive

Thursday May 27 / 11:10AM EDT

Data Mesh is a paradigm shift in how we imagine and build big data management solutions, and, most importantly, a shift in how we form our teams around these solutions and govern them. Data Mesh departs from half-a-century-old assumptions about how we need to manage analytical data to be useful. It...

Zhamak Dehghani

Director of Emerging Technologies @thoughtworks & Creator of the Data Mesh concept

Session + Live Q&A Architecture

Data Mesh in the Real World: Lessons Learnt From the Financial Markets

Thursday May 27 / 09:10AM EDT

CMC Markets, a FTSE 250 financial services company, has been running a successful trading platform for more than 30 years and is now undertaking a broad and ambitious transformation to take advantage of new technologies and ways of working. One vital aspect of a successful transformation to a...

Tareq Abedrabbo

Core Data Principal Engineer @CMCMarkets

Panel Discussion + Live Q&A Architecture

Data Pipelines & Data Mesh: Where We Are and How the Future Looks Like

Thursday May 27 / 12:10PM EDT

In this panel, experts from different backgrounds will discuss the current challenges of building Modern Data Pipelines and applying Data Mesh in the real world. We will also discuss what the future looks like in terms of new techniques, architectures, and tools for effective data-based projects.

Zhamak Dehghani

Director of Emerging Technologies @thoughtworks & Creator of the Data Mesh concept

Tareq Abedrabbo

Core Data Principal Engineer @CMCMarkets

Jacek Laskowski

IT freelancer, Java Champion & Author of "The Internals Of"
