
In my last post, I presented the concept of the data mesh. One thing that is often discussed with regard to the data mesh is how to build an agile architecture with data and microservices. To understand where this is heading, we must first discuss the architectural quantum.

What is the architectural quantum?

The architectural quantum, described in the book “Building Evolutionary Architectures”, is the smallest item that needs to be deployed in order to run an application. The traditional approach with data lakes was to have a monolith, so the smallest deployable entity was the data lake itself. The goal of the architectural quantum is to reduce complexity. With microservices, this is achieved by decoupling services and building small entities in a shared-nothing approach.

The goal is simplification, and this can only be achieved if there are little to no shared services involved. The original expectation with SOA was to share commonly used infrastructure in order to reduce development effort. However, it led to higher, rather than lower, complexity: when a change to a shared item was necessary, all teams depending on that item had to be involved. With a shared-nothing architecture, items are copied rather than shared and then used independently of each other.

Focus on the business problem

Each solution is designed around the business domain and not around the technical platform. Consequently, this also means that the technology should be chosen to fit the business problem. Today, most IT departments have a narrow view of technology, so they try to fit the business problem to the technology until the business problem becomes a technical problem. It should be the other way around.

With the data mesh and the architectural quantum, we focus fully on the domain. Since the goal is to reduce complexity (small quantum size!), we don’t re-use technology for its own sake but select the appropriate one. The data mesh thus only works well if a large set of tools is available, which is typically the case with large cloud providers such as AWS, Microsoft Azure or Google Cloud. Remember: you want to find a solution to the business problem, not create a technical problem.

Why we need data microservices

After microservices, it is about time for data microservices. There are several things that we need to change when working with data:

  • Focus on the business. We don’t solve technical problems; we need to start solving business problems.
  • Reduce complexity. Don’t think in tech porn. Simplify the architecture, don’t over-complicate it.
  • Don’t build it. It already exists in the cloud and is less complicated to use than building and running it on your own.
  • No monoliths. We built monoliths for data for decades; replacing a DWH with a data lake didn’t work out.

It is just about time to start doing so.

If you want to learn more about the data mesh, make sure to read the original description of it by Zhamak Dehghani in this blog post.


The data lake has been a significant design concept over the last years whenever we talked about big data and data processing. In recent months, a new concept – the data mesh – has gained significant attention. But what is the data mesh and how does it impact the data lake? Will it mean a sudden death for the data lake?

The data divide

The data mesh was first introduced by Zhamak Dehghani in this blog post. It is a concept that addresses several challenges in handling data. Some of the arguments Zhamak makes are:

  • The focus on ETL processes
  • Building a monolith (aka Datalake or Data warehouse)
  • Not focusing on the business

According to her, this leads to the “data divide”. Based on my experience, I can fully subscribe to that. Building a data lake isn’t state of the art anymore, since it focuses too much on building a large system over months to years, while business priorities are moving targets that shift during this timeframe. Furthermore, it locks scarce resources (data engineers) into infrastructure work when they should be creating value.

The data lake was often perceived as a “solution” to this problem. But it was only a technical answer to a non-technical problem. One monolith (the data warehouse) was replaced with another (the data lake). IT folks argued over which was the better solution, but after years of arguing, implementations and failed projects, companies figured out that not much had changed. But why?

The answer to this is simple

The traditional (so-called monolithic) approach focuses on building ETL processes. The challenge is that BI units, which are often remote from the business, don’t have a clue about the business. The teams of data engineers often work in the dark, fully decoupled from the business. The original goal of centralised data units was to harmonize data and remove silos. However, what was created was quite different: unusable data. Nobody had an idea of what was in the data, why it was produced and for what purpose. If there is no understanding of the business process itself, there is hardly any understanding of why the data comes in a specific format and so on.

I like comparisons to the car industry, which is currently in full disruption: traditional car makers focused on improving gas-powered engines. Then Elon Musk comes along with Tesla and builds a far better car with great acceleration and far lower consumption. This is real change. The same is true for data: replacing a technology that didn’t work with another technology won’t solve anything, because the process is the problem.

The Data mesh – focus on what matters

Here comes the data mesh into play. It is based loosely on some aspects that we already know:

  • Microservices architecture
  • Service meshes
  • Cloud

One of the things I really like about the data mesh is its focus on the business and its simplicity. Basically, it asks for an architectural quantum, meaning the simplest architecture necessary to run the use case. It shifts the focus away from building a monolith where a use case might run at some point in time towards delivering the use case with the tools that are already available to run it. And, hey, in the public cloud we have tons of tools for every use case one might imagine, so there is no need to build this platform ourselves. Again: focus on the business.

Another aspect I really like about the data mesh is the shift of responsibility towards the business, by which I mean data ownership. Data is provided from the place where it is created: marketing creates its marketing data and makes sure it is properly cleaned, finance does the same for its data, and so on. Remember: only the business knows best why data is created and for what purpose.
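To make this more concrete, here is a minimal sketch of what a domain-owned data product could look like. The domain, field and attribute names are hypothetical – the data mesh itself does not prescribe any specific implementation:

```python
from dataclasses import dataclass
from datetime import date

# A minimal sketch of a domain-owned "data product" with placeholder names.
# The owning domain (here: marketing) publishes its data together with the
# metadata consumers need in order to use it.
@dataclass
class DataProduct:
    name: str
    owner_domain: str   # the business domain that produces and cleans the data
    description: str
    schema: dict        # field name -> type, defined by the producing domain
    refreshed_on: date

marketing_campaigns = DataProduct(
    name="campaign_performance",
    owner_domain="marketing",
    description="Daily performance figures per campaign, cleaned by the marketing team.",
    schema={"campaign_id": "string", "impressions": "int", "clicks": "int"},
    refreshed_on=date(2021, 1, 1),
)

print(f"{marketing_campaigns.name} is owned by {marketing_campaigns.owner_domain}")
```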

But what is the future role of IT?

So, does the data mesh require all data engineers, data scientists and the like to move to business units? I would say it depends. Basically, the data mesh requires engineering to work in multi-disciplinary teams with the business. This changes the role of IT to a more strategic one, but it also requires IT to deploy the right people to the projects.

Also, IT needs to ensure that governance and standards are properly set. The data mesh concept will fail if there is no smart governance behind it; there is a high risk of creating more data silos and thus doing the data strategy no good. If you would like to read more about this, check out this tutorial on data governance.

Also, I want to stress one thing: the data mesh replaces neither the data warehouse nor the data lake. Tools built and used for them can be reused.

There is still much more to the data mesh. This is just my summary and thoughts on this very interesting concept. Make sure to read Zhamak’s post on it as well for the full details!

Over the last months, I wrote several articles about data governance. One aspect of data governance is the principle of FAIR data. FAIR in the context of data stands for findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated at the beginning: findable, accessible, interoperable and reusable. This tackles most of the requirements around data governance and thus should increase the use of data. It doesn’t really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don’t manage to solve them even at this elementary level.

If a company gets started with the principle of FAIR, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth now.

Findable data

Most data projects start with the question of whether there is data for a specific use case. This is often difficult to answer, since data engineers or data scientists often don’t know what kind of data is available in a large enterprise. They know the problem they want to solve but don’t know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about data that could potentially serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.
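To illustrate the idea, here is a minimal sketch of a data catalog lookup, with purely hypothetical assets, owners and tags. Real catalogs are full-fledged products, but the principle is the same: search by business terms instead of walking from person to person.

```python
from dataclasses import dataclass

# A minimal sketch of a data catalog with placeholder entries.
@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    tags: list

catalog = [
    CatalogEntry("crm_customers", "sales", "Customer master data from the CRM.", ["customer", "crm"]),
    CatalogEntry("web_orders", "e-commerce", "Orders placed in the web shop.", ["orders", "revenue"]),
]

def find_assets(keyword: str):
    """Return all catalog entries whose description or tags match the keyword."""
    keyword = keyword.lower()
    return [e for e in catalog if keyword in e.description.lower() or keyword in e.tags]

print([e.name for e in find_assets("orders")])  # ['web_orders']
```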

Accessible data

Once the first aspect is solved, it is necessary to access the data. This also brings a lot of complexity, since data is often sensitive and data owners simply don’t want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. It is also highly important to have a clear process for data access in place.

Interoperable data

In use cases, data often needs to be combined with other data sets. This means that it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better a data set is described, the faster people can work with it.

A frequent case is that data is bought from other companies or shared among companies. This is the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it is necessary to know how the data was created. A description of the source system and the producing entities needs to be available. It is also necessary to include information about potential transformations applied to the data.

In order to make data reusable, the terms of reusability must be provided. This can be a license or other community standards on the data. Data can be either purchased or made available for free. Different software solutions enable this.

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time needed to access data and provide a clear way forward. It will also indirectly increase data quality and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!

The three data sources

For data, there are a lot of different sources that are needed. Depending on the company and industry, they differ a lot. However, to create a comprehensive view of your company, your own data alone isn’t enough. There are several other data sources you should consider.

Figure: The three data sources

Data you already have

The first data source – data you already have – seems to be the easiest. However, it isn’t as easy as you might believe. Bringing your data in order is actually a very difficult task and can’t be achieved that easily. I’ve written several blog posts here about the challenges around data, and you can review them; basically, all of them focus on your internal data sources. I won’t restate them in detail here, but it is mainly about data governance and access.

Data that you can acquire

The second data source – data you can acquire – is another important aspect. By acquire I basically mean everything that you don’t have to buy from an external party acting as a data provider. You might use surveys (and pay for them as well) or acquire the data from open data platforms. You might also collect data from social media or with other kinds of crawlers. This data source is very important for you, as you can get a great overview of and insights into your specific questions.

In the past, I’ve seen a lot of companies utilising this second source, and we did a lot in that area. You don’t necessarily have to pay for this kind of data – some sources are free. And if you do pay for something, you don’t pay for the data itself but rather for the (semi-)manual way of collecting it. Here too, it differs heavily from industry to industry and depends on what the company is all about. I’ve seen companies collecting data from news sites to get insights into their competition and mentions, or simply scanning social media. A lot is possible with this kind of data source.
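As a simple illustration, here is a minimal sketch of acquiring a data set from an open data portal using Python’s requests library. The URL is a placeholder; any real open data catalog lists the actual download links for its datasets.

```python
import csv
import io

import requests  # third-party HTTP client, assumed to be installed

# A minimal sketch: download a CSV file from an open data portal and read it
# into rows. The URL below is a placeholder, not a real dataset.
OPEN_DATA_CSV_URL = "https://example.org/open-data/some-dataset.csv"

response = requests.get(OPEN_DATA_CSV_URL, timeout=30)
response.raise_for_status()

rows = list(csv.DictReader(io.StringIO(response.text)))
print(f"Downloaded {len(rows)} rows")
```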

Data you can buy

The last one – data you can buy – is easy to get but very expensive in cash-out terms. There are a lot of data providers selling different kinds of data, often demographic data or data about customers. Some platforms collect data from a large number of online sites and thus track individuals and their behavior across those sites. Such platforms then sell this data, enriched with insights, to marketing departments. You can buy this kind of data from those platforms and thus enrich your own first-party and second-party data. Imagine you are operating a retail business selling all kinds of furniture.

You would probably not know much about your web shop visitors, since they are anonymous until they buy something. With data bought from such providers, it would now be possible to figure out whether an anonymous visitor is an outdoor enthusiast and adjust your offers to best match his or her interests. Or you might learn that a person visiting your shop recently bought a countryside house with a garden, and adjust your offers to present garden furniture or barbecue accessories. With this kind of third-party data, you can achieve a lot and better understand your customers and your company.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.


To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that arise. All of them are users of data, but with different needs and demands. In my opinion, they differ in their level of expertise. Basically, I see three different user types for data access within a company.

Figure: Three degrees of data access

Basically, the user types differ in how they use data and in how many of them there are. Let’s start with the lower part of the pyramid – the business users.

Business Users

The first layer are the business users. These are basically users that need data for their daily decisions but are rather consumers of the data. These people look at different reports to make decisions on their business topics. They could be in marketing, sales or technology – depending on the company itself. Basically, these users would use pre-defined reports, but in the long run would rather go for customized reports; one great enabler for that is self-service BI. These users are experienced in interpreting data for their business goals and asking questions of their data, such as reviewing the performance of a campaign or weekly and monthly sales reports. They create huge load on the underlying systems without understanding the implementation and complexity underneath – and they don’t have to. From time to time, they start digging deeper into their data and thus become power users – our next level.

Power Users

Power users often emerge from business users. This is typically a person that is close to the business and understands the needs and processes around it. However, they also have a good technical understanding (or gained it while becoming power users). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (even in the same department) on solving business questions. They also work closely with data engineers on accessing and integrating new data sources, and they use self-service analytics tools to do a basic level of data science. However, they aren’t data scientists, although they might move in this direction if they invest significant time. This brings us to the next level – the data scientists.

Data access for Data Scientists

This is the top level of our pyramid. People working as data scientists aren’t in the majority – business users and power users are far more numerous. However, they work on more challenging topics than the previous two groups. They also work closely with power users and business users; they might still be in the same department, but not necessarily. They work with advanced tools such as R and Python, fine-tune the models the power users built with self-service analytics tools, or translate the business questions raised by the business users into algorithms.

Often, these three groups develop in different directions. However, it is necessary that all of them work together as a team in order to make data projects a success. For data access, it is also necessary to incorporate role-based access control.
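A minimal sketch of what role-based access control for these three user types could look like, with hypothetical role and permission names:

```python
# A minimal sketch of role-based access control; roles and permissions are
# placeholders for whatever a company actually defines.
ROLE_PERMISSIONS = {
    "business_user": {"read_reports"},
    "power_user": {"read_reports", "run_sql", "use_self_service_bi"},
    "data_scientist": {"read_reports", "run_sql", "use_self_service_bi", "read_raw_data"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Check whether a role includes the requested permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("business_user", "read_raw_data"))   # False
print(is_allowed("data_scientist", "read_raw_data"))  # True
```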

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.


Honestly, data scientists are doing a great job. Literally, they are saving entire industries from steep decline. And those heroes are doing all of that alone. Alone? Not quite.

The Data Scientist needs the Data Engineer

There are some poor guys supporting their success: those called data engineers. A huge majority of the tasks are carried out by these guys (and girls) that hardly anyone is talking about. All the fame seems to go to the data scientists, while the data engineers aren’t receiving any credit.

I remember one of the many meetings I had with C-level executives. When I explained the structure of a team dealing with data, everyone in the board room agreed: “we need data scientists”. Then one of the executives raised the question: “but what are these data engineers about? Do we really need them, or could we maybe have more data scientists instead?”

I kept on explaining and they accepted it. But I had the feeling that they still eventually wanted to go with more data scientists than engineers. This basically comes from the trend and hype around data scientists. Everyone knows that they are important. But data-driven projects only succeed when a team with mixed skills and know-how comes together.

A Data Science team needs at least the same number of Data Engineers

None of the data-driven projects I have seen so far would have worked without data engineers. They are relevant for many different things, but mainly – in an ideal world – they work in close cooperation with data scientists. If a company’s data maturity is high, the data engineer prepares the data for the data scientist and then works with the data scientist again on putting the algorithm into production. I saw a lot of projects where the latter part didn’t work: the first step (data preparation) was successful, but the later step (automation) was never done.

But there are more roles involved: one role, which is rather a specialization of the data engineer, is the data system engineer. This is often not a dedicated role but is carried out by data engineers; here we basically talk about infrastructure preparation and set-up for the data scientists and engineers. Other roles include the data architect, who ensures a company-wide approach to data, and of course data owners and data stewards.

I have stated it several times, but it is worth repeating: data science isn’t a one-(wo)man show, it is ALWAYS a team effort.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Another interesting article about the data science team setup can be found here.

Data lifecycle management is a complex and important thing to consider. Although storing data is getting cheaper over time, it is still important to build a data platform that stores data efficiently – efficient in terms of both cost and performance. It is necessary to build a data architecture that allows fast access to data but also stores it in a cost-effective way. These two goals conflict, because cost-effective storage is often slow and thus won’t deliver much throughput, while highly performant storage is often expensive. However, the real question is whether it is necessary to store all data in high-performing storage at all. Therefore, it is necessary to measure the value of your data and decide how much of it to keep in a specific storage tier.

How to manage the data lifecycle

The data architect is in charge of storing data efficiently – both in terms of performance and cost – and also needs to take care of data lifecycle management. Some years ago, the answer was to put all relevant data into the data warehouse. Since this was too expensive for most data, data was put into HDFS (Hadoop) in recent years. But with the cloud, we now have more diverse options. We can store data in message buffers (such as Kafka), on HDFS systems (disk based) and in cloud-based object stores. Especially the latter provide even more options: starting from general-purpose cloud storage, these services have evolved over the last years into premium object stores (with high performance), common-purpose storage and cheap archive stores. This gives more flexibility to store data even more cost effectively. Data would typically be demoted from in-memory (e.g. via instances of Kafka) or premium storage to general-purpose storage or even to archive stores. The data architect now has the possibility to store data in the most effective way (thus making a Kappa architecture useless – the cloud prefers Lambda!).
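As an illustration, here is a minimal sketch of such a demotion policy using an S3 lifecycle rule via boto3. The bucket name, prefix and thresholds are placeholders, and the other cloud providers offer comparable lifecycle features.

```python
import boto3  # AWS SDK for Python, assumed to be installed and configured

# A minimal sketch of demoting data to cheaper storage tiers over time using
# an S3 lifecycle rule. Bucket name, prefix and day thresholds are placeholders.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-platform-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "demote-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    # after 30 days, move to the cheaper infrequent-access tier
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # after 90 days, move to the archive tier
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```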

But this adds additional pressure to the data architect’s job. How would the data architect figure out the value of the data in order to decide where to store it? I recently came across a very interesting article introducing something called “the half-life of data”. Basically, it describes how fast data loses value and thus makes it easier to judge where to store it. For those who want to read it, the article can be found here.

What is the half life of data?

The half-life of data basically categorises data into three different value types:

  • Strategic data: companies use this data for strategic decision making. It still has high value after some days, so it should be easy and fast to access.
  • Operational data: this data still has some value after some hours but then loses it. It should be kept available for some hours up to a few days, then demoted to cheaper storage.
  • Tactical data: this data has value only for some minutes up to a few hours. Its value is lost fast, so it should either be stored in very cheap storage or even deleted.

There is also an interesting infographic that illustrates this:

The half life of data: https://nucleusresearch.com/research/single/guidebook-measuring-the-half-life-of-data/
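To make the idea tangible, here is a minimal sketch of how the three value types above could be mapped to storage tiers. The tier names and thresholds are purely illustrative assumptions, not part of the original article.

```python
# A minimal sketch mapping half-life categories to storage tiers.
# Tier names and age thresholds are illustrative placeholders.
def storage_tier(value_type: str, age_hours: float) -> str:
    """Suggest a storage tier based on the half-life category of the data."""
    if value_type == "strategic":
        return "premium" if age_hours < 24 * 7 else "general_purpose"
    if value_type == "operational":
        return "general_purpose" if age_hours < 24 else "archive"
    if value_type == "tactical":
        return "in_memory" if age_hours < 1 else "delete"
    raise ValueError(f"unknown value type: {value_type}")

print(storage_tier("operational", age_hours=36))  # archive
```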

What do you think? What is your take on it? How do you measure the value of your data? How do you handle your data lifecycle in your company?

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

One of the most important things is how to partition data in an environment. Especially with large-scale systems this matters, as not everything can be stored on a limited number of machines.

How to partition data?

Partitioning is another factor for Big Data applications. It is one of the factors of the CAP theorem (see 1.6.1) and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data applications, it is often not possible to store everything on a single server (Josuttis, 2011).

Figure: Data partitioning

The factors for partitioning illustrated in the figure are described by (Rys, 2011). Functional partitioning basically describes the service-oriented architecture (SOA) approach (Josuttis, 2011). With SOA, different functions are provided by their own services. If we look at a web shop such as Amazon, there are a lot of different services involved: some services handle the order workflow, other services handle the search, and so on.
If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to losing money. Building a service-oriented architecture alone doesn’t solve all partitioning problems, though. Therefore, data also has to be partitioned. With data partitioning, data is distributed over different servers, and it can also be distributed geographically.
Partitioned data is basically identified by a partition key. Since there is a lot of data and single nodes may fail, it is necessary to partition data across the network. This also means that data should be replicated and stored redundantly in order to deal with node failures.
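A minimal sketch of how a partition key can map records to servers and replicas, assuming four hypothetical servers; real systems such as Cassandra or Kafka use more sophisticated schemes like consistent hashing.

```python
import hashlib

# A minimal sketch of partitioning by a partition key with simple replication.
# The server names and replication factor are placeholders.
SERVERS = ["server-0", "server-1", "server-2", "server-3"]
REPLICATION_FACTOR = 2

def partition(key: str) -> int:
    """Map a partition key to a server index via hashing."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % len(SERVERS)

def replicas(key: str) -> list:
    """Return the servers that store a copy of the record for this key."""
    first = partition(key)
    return [SERVERS[(first + i) % len(SERVERS)] for i in range(REPLICATION_FACTOR)]

print(replicas("customer-42"))  # e.g. ['server-1', 'server-2']
```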

I hope you enjoyed this part of the tutorial about big data technology. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorial.

Scalable data processing is necessary for all platforms handling data. In today’s tutorial we will have a look at this.
Scalability is another factor of Big Data applications described by (Rys, 2011). Whenever we talk about Big Data, it mainly involves highly scalable systems. Each Big Data application should be built in a way that eases scaling. (Rys, 2011) describes several scaling needs: user load scalability, data load scalability, computational scalability and scale agility.

What is scalable data?

The figure illustrates the different needs for scalability in Big Data environments as described by (Rys, 2011). Many applications such as Facebook (Fowler, 2012) have a lot of users. Applications should support a large user base and should remain resistant to errors in case the application sees unexpectedly high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often – but not only – comes with a high number of users is the data load.
(Rys, 2011) describes that some or many users can produce this data. However, sensors and other devices that do not directly relate to users can also produce large datasets. Computational scalability is the ability to scale to large datasets: data is often analyzed, and this needs compute power on the analysis side. Distributed algorithms such as Map/Reduce require many nodes in order to run queries and analyses in a performant manner.
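As an illustration of the Map/Reduce idea, here is a minimal single-machine sketch in Python; real frameworks such as Hadoop MapReduce or Spark distribute the map and reduce steps over many nodes, but the structure is the same.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

# A minimal single-machine sketch of Map/Reduce: map a function over chunks
# of data in parallel, then reduce the partial results into one.
def map_word_counts(chunk: list) -> Counter:
    """Map step: count words in one chunk of documents."""
    counts = Counter()
    for document in chunk:
        counts.update(document.split())
    return counts

def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial word counts."""
    return left + right

if __name__ == "__main__":
    chunks = [["big data needs scaling"], ["data load and user load"]]
    with Pool(processes=2) as pool:
        partial_counts = pool.map(map_word_counts, chunks)
    total = reduce(reduce_counts, partial_counts, Counter())
    print(total.most_common(3))
```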
Scale agility describes the ability to change the environment of a system. This basically means that new instances, such as compute nodes, can be added or removed on demand. It requires a high level of automation and virtualization and is very similar to what can be done in cloud computing environments. Platforms such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others enable this level of self-service, which greatly supports scale agility for Big Data environments.

I hope you enjoyed this part of the tutorial about big data technology. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorial.

Agility is an important factor for Big Data applications. Agile data needs to fulfill three different agility factors: model agility, operational agility and programming agility (Rys, 2011).

Figure: Data agility

Agile data: model agility

Model agility describes how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems, such as non-relational databases, allow the database to be changed easily. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases in fast-changing systems such as social media applications, online shops and others require model agility, since updates to such systems occur frequently, often weekly to daily (Paul, 2012).
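As a small illustration, here is a sketch of adding a new attribute to a DynamoDB item via boto3 without any schema migration. The table and attribute names are placeholders and assume the table already exists with user_id as its key.

```python
import boto3  # AWS SDK for Python, assumed to be installed and configured

# A minimal sketch of model agility in a key/value store such as DynamoDB:
# new attributes can be added per item without a schema migration.
# Table and attribute names are placeholders; only the key schema is fixed.
table = boto3.resource("dynamodb").Table("users")  # placeholder table name

# original item
table.put_item(Item={"user_id": "42", "name": "Alice"})

# later the model evolves: simply write the new attribute, no ALTER TABLE needed
table.put_item(Item={"user_id": "42", "name": "Alice", "loyalty_tier": "gold"})
```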

Operational agility

In distributed environments, it is often necessary to change operational aspects of a system. New servers are added frequently, often with different characteristics such as operating system and hardware. Database systems should stay tolerant of operational changes, as this is crucial for growth.

Programming agility

Database systems should support software developers, and this is where programming agility comes into play. Programming agility means that the database and all associated SDKs should ease the life of a developer working with the database. Furthermore, they should also support fast development.

I hope you enjoyed this part of the tutorial about big data technology. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorial.