CIDR 2022 Keynote 1: The Databricks Lakehouse Platform by Matei Zaharia

This is a time of rapid change in both the requirements and technical opportunities for data platforms. In terms of requirements, organizations want to run ever more sophisticated analytics methods (e.g., data science and machine learning) on ever larger volumes of data. In terms of opportunities, cloud storage has made it possible to query all of an organization’s data together for the first time, and open source has led to a broad software ecosystem that can exchange data through standard formats. I’ll describe how we responded to these emerging requirements and opportunities in Databricks, a “lakehouse” platform designed to provide state-of-the-art performance, governance, and scalability for data in open formats that support a wide range of analytics tools. Databricks is used by over 5000 enterprises to process exabytes of data per day on over ten million VMs, with use cases from interactive SQL to real-time machine learning. I’ll describe some of the key things we learned about modern data users when designing the platform, as well as what it takes to operate a cloud platform at this scale reliably.