The datalake has been a significant design concept over the last years when we talked about big data and data processing. In recent month, a new concept – the data mesh – got significant attention. But what is the data mesh and how does it impact the datalake? Will it put a sudden death to the datalake?
The data divide
The data mesh was first introduced by Zhamak Dehghani in this blog post. It is a concept based on different challenges when handling data. Some of the arguments Zhamak is using are:
- The focus on ETL processes
- Building a monolith (aka Datalake or Data warehouse)
- Not focusing on the business
According to her, this leads to the “data divide”. Based on my experience, I can fully subscribe to the data divide. Building a datalake isn’t state of the art anymore, since it focuses too much on building a large system for month to years, while business priorities are moving targets that shift during this timeframe. Furthermore, it locks sparse resources (data engineers) into infrastructure work, while they should create value.
The datalake was often perceived as a “solution” to this problem. But it was only a technical answer to a non-technical problem. One monolith (data warehouse) was replaced with the other one (datalake). IT folks argued over what was the better solution, but after years of arguing, implementation and failed projects, companies figured out that not much has changed. But why?
The answer to this is simple
The focus in the traditional (what is called as monolithic approach) is the focus on building ETL processes. The challenge behind that is that BI units, which are often remote to the business, don’t have a clue about the business. The teams of data engineers often work in the dark, fully decoupled from the business. The original goal of centralised data units was to harmonize data and remove silos. However, what was created was quite different: unusable data. Nobody had an idea about what was in the data, why it was produced and for what purpose. If there is no idea about the business process itself, there hardly is an idea why the data comes in a specific format and alike.
I like comparisons to the car industry, which currently is in full disruption: traditional car makers focused on improving gas powered engines. Then comes Elon Musk with Tesla and builds a far better car with great acceleration and ways lower consumption. This is real change. The same is valid for data: replacing a technology that didn’t work with another technology won’t change the problem: the process is the problem.
The Data mesh – focus on what matters
Here comes the data mesh into play. It is based loosely on some aspects that we already know:
- Microservices architecture
- Services meshs
- Cloud
One of the concepts of the data mesh that I really like is its focus on the business and its simplicity. Basically, it asks for an architectural quantum, meaning the simples architecture necessary to run the case. There are several tools available to use and it shifts the focus away from building a monolith were a use case might run at a specific point in time towards doing the use case and use the tools that are available for it to run. And, hey, in the public cloud we have tons of tools for all use cases one might imagine, so no need to build this platform. Again: focus on the business.
Another aspect that I really like about the data mesh is the shift of responsibility towards the business. With that, I mean the data ownership. Data is provided from the place where it is created. Marketing creates their marketing data and makes sure it is properly cleaned, finance their data and so on. Remember: only business knows best why data is created and for what purpose.
But what is the future role of IT?
So, does the data mesh require all data engineers, data scientists and alike to now move to business units? I would say, it depends. Basically, the data mesh requires engineering to work in multi-disciplinary teams with the business. This changes the role of IT to a more strategic one but – requiring IT to deploy the right people to the projects.
Also, IT needs to ensure governance and standards are properly set. The data mesh concept will fail if there is no smart governance behind it. There is a high risk of creating more data silos and thus do no good to the data strategy. If you would like to read more about data strategy, check out this tutorial on data governance.
Also, I want to stress one thing: the data mesh doesn’t replace the data warehouse nor the data lake. Tools used and built in this can be reused.
There is still much more to the data mesh. This is just my summary and thoughts on this very interesting concept. Make sure to read Zhamak’s post on it as well for the full details!
Leave a Reply
Want to join the discussion?Feel free to contribute!