Everyone (or at least most) companies today talk about digital transformation and treat data as a main asset for this. The question is where to store this data. In a traditional database? In a DWH? Ever heard about the datalake?
What is the datalake?
I think we should take a step back to answer this question. First of all, a Datalake is not a single piece of software. It consists of a large variety of Platforms, where Hadoop is a central one, but not the only one – it includes other tools such as Spark, Kafka, … and many more. Also, it includes relational Databases – such as PostgreSQL for instance. If we look at how truly digital companies such as Facebook, Google or Amazon solve these problems, then the technology stack is also clear; in fact, they heavily contribute to and use Hadoop & similar technologies. So the answer is clear: you don’t need overly expensive DWHs any more.
However, many C-Level executives might now say: “but we’ve invested millions in our DWH over the last years (or even decades)”. Here the question is getting more complex. How should we treat our DWH? Should it be replaced or should the DWH become the single source of truth and should the Datalake be ignored? In my opinion, both options aren’t valid:
Can the datalake replace a data warehouse?
First, replacing a DWH and moving all data to a Datalake will be a massive project that will bind too many resources in a company. Finding people with adequate skills isn’t easy, so this can’t be the solution to it. In addition to that, there are hundreds of business KPIs built, a lot of units within large enterprises built their decisions on these. Moving them to a Datalake will most likely break (important) business processes. Also, previous investments will be vaporised. So a big-bang replacement is clearly a no-go.
Second, keeping everything in the DWH is not feasible. Modern tools such as Python, Tensorflow and many more aren’t well supported by proprietary software (or at least, get the support with delay). From a skills-perspective, most young professionals coming from university get skills in technologies such as Spark, Hadoop and alike and therefore the skills shortage can be solved easier by moving towards a Datalake.
I am speaking at a large number of international conferences; whenever I ask the audience if they want to work with proprietary DWH databases, no hands go up. If I ask them if they want to work with Datalake technologies, everyone raises the hand. The fact is, that employees choose the company they want to work for, not vice versa. We have a skills shortage in this area, everyone ignoring or not accepting that is simply wrong. Also, a DWH is way more expensive then a Datalake. So also this option is not a valid one.
What to do now?
So what is my recommendation or strategy? For large, established enterprises, it is a combination of both steps, but with a clear path towards replacing the DWH in the long run. I am not a supporter of complex, long-running projects that are hard to control and track. Replacing the DWH should be a vision, not a project. This can be achieved by agile project management, combined with a long-term strategy: new projects are solely done by Datalake technologies.
All future investments and platform implementations must use the Datalake as the single source of truth. Once existing KPIs and processes are renewed, it must be ensured that these technologies are implemented on the Datalake and that the data gets shifted to the Datalake from the DWH. To make this succeed, it is necessary to have a strong Metadata management and data governance in place, otherwise the Datalake will be a very messy place – and thus become a data swamp.
This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Wikipedia describes the concept of the datalake very well.