Session + Live Q&A
Data Versioning at Scale: Chaos and Chaos Management
Version control is fundamental when managing code, but what about data? Our data changes over time. First, it accumulates: we get new data points for new points in time. But this is not the only reason. Data for past periods also changes, either because we gain access to additional data sources or because we revise past data in light of new information that arrived late.
Since our data is mutable, version control of the data allows us to reproduce a set of results, provides lineage between the input and output data sets of a process or a model, lets us experiment, supplies the relevant information for auditing, and assists us in production management. In this talk we will go over several technologies that version large data sets. We will understand the use cases they support and look under the hood at the technology developed to best support those use cases.
Speaker
Dr. Einat Orr
Co-creator of @lakeFS, Co-founder & CEO of Treeverse
Einat Orr has 20+ years of experience building R&D organizations and leading the technology vision at multiple companies, most recently Similarweb, which went public on the NYSE last May. Currently she serves as Co-founder and CEO of Treeverse, the company behind lakeFS, an open source platform...
From the same track
Taming the Data Mess, How Not to Be Overwhelmed by the Data Landscape
Wednesday May 18 / 09:00AM EDT
The data engineering field has evolved at a tremendous pace in the last decade. New systems that enable the processing of huge amounts of data have generated enormous opportunities, as well as challenges, for software practitioners. All these new tools and methodologies created a new set of...
Ismaël Mejía
Senior Cloud Advocate @Microsoft
Modern Data Pipelines in AdTech—Life in the Trenches
Wednesday May 18 / 11:20AM EDT
There are various tasks that the modern data pipelines approach helps us solve in different domains, including advertising. Modern data pipelines allow us to process data in a more efficient manner with a diverse set of data transformation tools for both batch and streaming data processing....
Roksolana Diachuk
Big Data Engineer @Captify
Orchestrating Hybrid Workflows with Apache Airflow
Wednesday May 18 / 12:30PM EDT
According to analysts, 87 percent of enterprises have already adopted hybrid cloud strategies. Customers have many reasons why they need to support hybrid environments, from maximizing the value from heritage systems to meeting local compliance and data processing regulations. As they build...
Ricardo Sueiras
Principal Advocate in Open Source @AWS