
In some of my previous posts, I shared my thoughts on the data mesh architecture. The data mesh was originally introduced by Zhamak Dehghani in 2019 and is now enjoying huge popularity in the community. Because one of the main ideas of the data mesh architecture is the distributed nature of data, it also leads to a domain-driven design of the data itself. A data circle enables this design.

What is a data circle?

A data circle is a data model that is tailored to the use-case domain. It should follow the approach of the architectural quantum from the microservice architecture. The domain model should contain only the information relevant to the purpose it is built for and no additional data. Also, each circle could or should run within its own environment (e.g. its own database). The technology should be selected for the best use of the data. A circle might easily be confused with a data mart that is built within the data warehouse. However, several data circles do not have to “live” within one (physical) data warehouse; they can use different technologies and be highly distributed.

Each company will have several data circles in place, each tailored to the specific needs of its use cases. When modelling data with data circles, unnecessary information is left out, since it can be connected with other data circles in the company whenever it is needed. Where we previously built our data models in a very comprehensive way (e.g. via the data warehouse), we now build the data models in a distributed way.

Examples from the telco and insurance industries

If we take for example a telco company, data circles might be:

  • The customer data circle: containing the most important customer data
  • The network data circle: containing information about the network
  • The CDR data circle: containing information about calls conducted

If we look at the insurance industry, data circles might be:

  • The customer data circle: containing the most important customer data
  • The claims data circle: containing the data about past claims
  • The health data circle: containing health-related data

If we return to the telco company, the data about the customer might be stored in a relational model within an RDBMS. However, network data might be stored in a graph for better spatial analysis. CDR data might be stored in a file-based setup. For each domain, the best technology is selected and the best model is designed. The same holds true for other industries.
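To make the telco example a bit more tangible, here is a minimal Python sketch of how the three circles could be described. The `DataCircle` class, its fields and the key names (`customer_id`, `cell_id`) are my own assumptions for illustration; they are not part of any data mesh standard or product.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataCircle:
    """Hypothetical descriptor of one data circle (illustration only)."""
    name: str
    storage: str             # technology selected for this domain
    keys: Tuple[str, ...]    # assumed identifiers other circles could connect on


# The three telco circles from the text, each with its own technology.
telco_circles = [
    DataCircle("customer", "relational model in an RDBMS", ("customer_id",)),
    DataCircle("network", "graph", ("cell_id",)),
    DataCircle("cdr", "file-based setup", ("customer_id", "cell_id")),
]

for circle in telco_circles:
    print(f"{circle.name}: {circle.storage}")
```

Each circle carries only what its own domain needs; the keys are the only thing it exposes for connecting to other circles.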

Several data circles make up the design

Different business units will build their own data circles to fit their demands. This, however, makes it necessary to create a central repository that ties it all together: a hub connecting all the circles. The hub stores information about the connectivity of the different circles. Imagine the network data model again: you might want to connect the network data with customer data. There must be a way to connect this data while still keeping its distributed nature. The hub serves as a central data asset management tool and one-stop shop for employees within the company to find the data they need.

[Figure: Circles connected via a hub]

The data hub also allows users to connect to and analyse the data they want to access, for example with tools such as Jupyter. The hub also takes care of the connectivity to the data and thus provides an API for all users. A data hub is all about data governance.
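As a rough sketch of what such a hub could look like, the following Python snippet keeps a small catalogue of circles and the keys that connect them. Everything here is assumed for illustration: the `DataHub` class, its method names, the owning units and the join keys are hypothetical and do not refer to an existing tool.

```python
class DataHub:
    """Hypothetical in-memory hub: a catalogue of circles plus their links."""

    def __init__(self):
        self._circles = {}  # circle name -> metadata
        self._links = {}    # circle name -> {other circle name: join key}

    def register(self, name, owner, technology, description):
        """Add a circle to the catalogue so employees can find it."""
        self._circles[name] = {
            "owner": owner,
            "technology": technology,
            "description": description,
        }
        self._links.setdefault(name, {})

    def link(self, first, second, key):
        """Record that two registered circles can be connected on a key."""
        for name in (first, second):
            if name not in self._circles:
                raise KeyError(f"unknown circle: {name}")
        self._links[first][second] = key
        self._links[second][first] = key

    def search(self, term):
        """Return the names of circles whose name or description match."""
        term = term.lower()
        return sorted(
            name
            for name, meta in self._circles.items()
            if term in name.lower() or term in meta["description"].lower()
        )

    def join_key(self, first, second):
        """Return the key connecting two circles, or None if not linked."""
        return self._links.get(first, {}).get(second)


hub = DataHub()
hub.register("customer", "sales", "RDBMS", "most important customer data")
hub.register("network", "network operations", "graph", "information about the network")
hub.register("cdr", "billing", "files", "information about calls conducted")
hub.link("customer", "cdr", key="customer_id")
hub.link("network", "cdr", key="cell_id")

print(hub.search("network"))           # ['network']
print(hub.join_key("customer", "cdr"))  # customer_id
```

Note that the hub only stores metadata and connectivity. The data itself stays in the circles, which keeps the distributed nature intact.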

What’s next?

I recommend reading the other articles I’ve written about the data mesh architecture. It is fairly easy to get started with this architectural style, and data circles contribute to this.

In my last post, I presented the concept of the data mesh. One thing that is often discussed with regard to the data mesh is how to build an agile architecture with data and microservices. To understand where this needs to go, we must first discuss the architectural quantum.

What is the architectural quantum?

The architectural quantum is the smallest possible item that needs to be deployed in order to run an application. It is described in the book “Building Evolutionary Architectures”. The traditional approach with data lakes was to have a monolith, so the smallest entity was the datalake itself. The goal of the architectural quantum is to reduce complexity. With microservices, this is achieved by decoupling services and making small entities in a shared-nothing approach.

The goal is simplification, and this can only be achieved if there is little to no shared service involved. The original expectation with SOA was to share commonly used infrastructure in order to reduce the development effort. However, it led to higher, rather than lower, complexity: when a change to a shared item was necessary, all teams depending on the item had to be involved. With the shared-nothing architecture, items are copied rather than shared and then used independently of each other.
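The difference between sharing and copying can be shown with a tiny Python sketch. The team names and the schema are made up for illustration; the point is only that a change to a shared item reaches every team, while a change to a copy stays local.

```python
import copy

# Shared approach: both (hypothetical) teams point to the very same schema.
shared_schema = {"customer": ["id", "name"]}
billing_view = shared_schema
marketing_view = shared_schema

# Marketing adds a field, and billing is affected without having asked for it.
marketing_view["customer"].append("segment")
print("segment" in billing_view["customer"])   # True

# Shared-nothing approach: each team starts from its own copy.
template = {"customer": ["id", "name"]}
billing_model = copy.deepcopy(template)
marketing_model = copy.deepcopy(template)

# The same change now stays inside the marketing team.
marketing_model["customer"].append("segment")
print("segment" in billing_model["customer"])  # False
```

The price of the second variant is duplication, but no team has to coordinate a change with all the others.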

Focus on the business problem

Each solution is designed around the business domain and not around the technical platform. Consequently, the technology that fits the business problem should be chosen. As of now, most IT departments have a narrow view of technology, so they try to fit the business problem to the technology until the business problem becomes a technical problem. It should be the other way around.

With the data mesh and the architectural quantum, we focus fully on the domain. Since the goal is to reduce complexity (small quantum size!), we won’t re-use technology but select the appropriate one. The data mesh thus only works well if there is a large set of tools available, which can typically be found at large cloud providers such as AWS, Microsoft Azure or Google Cloud. Remember: you want to find a solution to the business problem, not create a technical problem.

Why we need data microservices

After microservices, it is about time for data microservices. There are several things that we need to change when working with data:

  • Focus on the business. We need to stop solving technical problems and start solving business problems.
  • Reduce complexity. Don’t think in tech porn. Simplify the architecture, don’t over-complicate it.
  • Don’t build it. It already exists in the cloud and is less complicated to use than to build and run on your own.
  • No monoliths. We built them for data for decades; replacing a DWH with a datalake didn’t work out.

It is just about time to start doing so.

If you want to learn more about the data mesh, make sure to read the original description of it by Zhamak Dehghani in this blog post.


The datalake has been a significant design concept over the last years whenever we talked about big data and data processing. In recent months, a new concept, the data mesh, has received significant attention. But what is the data mesh and how does it impact the datalake? Will it bring a sudden death to the datalake?

The data divide

The data mesh was first introduced by Zhamak Dehghani in this blog post. It is a concept based on different challenges when handling data. Some of the arguments Zhamak is using are:

  • The focus on ETL processes
  • Building a monolith (aka Datalake or Data warehouse)
  • Not focusing on the business

According to her, this leads to the “data divide”. Based on my experience, I can fully subscribe to the data divide. Building a datalake isn’t state of the art anymore, since it focuses too much on building a large system for months to years, while business priorities are moving targets that shift during this timeframe. Furthermore, it locks scarce resources (data engineers) into infrastructure work, while they should create value.

The datalake was often perceived as a “solution” to this problem. But it was only a technical answer to a non-technical problem. One monolith (the data warehouse) was replaced with another one (the datalake). IT folks argued over which was the better solution, but after years of arguing, implementation and failed projects, companies figured out that not much had changed. But why?

The answer to this is simple

The traditional (so-called monolithic) approach focuses on building ETL processes. The challenge is that BI units, which are often far removed from the business, don’t have a clue about it. The teams of data engineers often work in the dark, fully decoupled from the business. The original goal of centralised data units was to harmonize data and remove silos. However, what was created was quite different: unusable data. Nobody had an idea about what was in the data, why it was produced and for what purpose. If there is no understanding of the business process itself, there is hardly any understanding of why the data comes in a specific format and the like.

I like comparisons to the car industry, which currently is in full disruption: traditional car makers focused on improving gas-powered engines. Then along comes Elon Musk with Tesla and builds a far better car with great acceleration and much lower consumption. This is real change. The same is valid for data: replacing a technology that didn’t work with another technology won’t solve the problem, because the process is the problem.

The Data mesh – focus on what matters

This is where the data mesh comes into play. It is based loosely on some aspects that we already know:

  • Microservices architecture
  • Service meshes
  • Cloud

One of the concepts of the data mesh that I really like is its focus on the business and its simplicity. Basically, it asks for an architectural quantum, meaning the simplest architecture necessary to run the case. It shifts the focus away from building a monolith where a use case might run at some point in time, towards doing the use case with the tools that are already available to run it. And, hey, in the public cloud we have tons of tools for all use cases one might imagine, so there is no need to build this platform. Again: focus on the business.

Another aspect that I really like about the data mesh is the shift of responsibility towards the business. With that, I mean the data ownership. Data is provided from the place where it is created. Marketing creates their marketing data and makes sure it is properly cleaned, finance their data and so on. Remember: only business knows best why data is created and for what purpose.

But what is the future role of IT?

So, does the data mesh require all data engineers, data scientists and the like to now move to business units? I would say it depends. Basically, the data mesh requires engineering to work in multi-disciplinary teams with the business. This changes the role of IT to a more strategic one, but it also requires IT to deploy the right people to the projects.

Also, IT needs to ensure governance and standards are properly set. The data mesh concept will fail if there is no smart governance behind it. There is a high risk of creating more data silos, which would do no good to the data strategy. If you would like to read more about data strategy, check out this tutorial on data governance.

Also, I want to stress one thing: the data mesh replaces neither the data warehouse nor the data lake. Tools used and built for them can be reused.

There is still much more to the data mesh. This is just my summary and thoughts on this very interesting concept. Make sure to read Zhamak’s post on it as well for the full details!