Big Data Europe Conference 2021 - Management of a Cloud Data Lake using Apache Spark - Josef Habdank

Management of a Cloud Data Lake in Practice: How to Manage 1000s of ETLs Using Apache Spark
Nowadays the problem of processing speed is seemingly solved. Unless you process tens of petabytes, an off-the-shelf toolset will suffice for most problems. Currently, the main challenges in data lake systems lie in data governance:
* how to make sure data is discoverable, reusable, up to date and of high quality?
* how to avoid huge technical debt when developing a massive number of complex data flows?
* how to guarantee that the project can scale despite very scarce human resources and technical talent?

The goal of this talk is to showcase how to design a data lake management system that is scalable in the broadest meaning of the word: one that scales not only with the growth of the data, but also with the growing complexity of the whole enterprise. The talk will outline the business reasoning, the key design principles and the technical solution. Expect some (but not too many) nerdy details related to the Apache Spark implementation.