In my last post, I presented the concept of the data mesh. One thing that is often discussed with regard to the data mesh is how to build an agile architecture with data and microservices. To understand where this is heading, we must first discuss the architectural quantum.

What is the architectural quantum?

The architectural quantum is the smallest item that needs to be deployed in order to run an application. It is described in the book “Building Evolutionary Architectures“. The traditional approach with data lakes was to build a monolith, so the smallest deployable entity was the data lake itself. The goal of the architectural quantum is to reduce complexity. With microservices, this is achieved by decoupling services and building small entities in a shared-nothing approach.

The goal is simplification, and this can only be achieved if there are few or no shared services involved. The original expectation with SOA was to share commonly used infrastructure in order to reduce development effort. However, it led to higher, rather than lower, complexity: when a change to a shared item was necessary, all teams depending on that item had to be involved. With the shared-nothing architecture, items are copied rather than shared and then used independently of each other.

Focus on the business problem

Each solution is designed around the business domain, not around the technical platform. Consequently, this also means that the technology that fits the business problem should be chosen. As of now, most IT departments have a narrow view of technology, so they try to fit the business problem to the technology until the business problem becomes a technical problem. However, it should be the other way around.

With the data mesh and the architectural quantum, we focus fully on the domain. Since the goal is to reduce complexity (small quantum size!), we won’t re-use technology but select the appropriate one. The data mesh thus only works well if there is a large set of tools available, which can typically be found at large cloud providers such as AWS, Microsoft Azure or Google Cloud. Remember: you want to find a solution to the business problem, not create a technical problem.

Why we need data microservices

After microservices, it is about time for data microservices. There are several things that we need to change when working with data:

  • Focus on the business. We don’t solve technical problems; we need to start solving business problems.
  • Reduce complexity. Don’t think in tech porn. Simplify the architecture, don’t over-complicate it.
  • Don’t build it. It already exists in the cloud and is less complicated to use than building and running it on your own.
  • No monoliths. We built data monoliths for decades; replacing a DWH with a data lake didn’t work out.

It is just about time to start doing so.

If you want to learn more about the data mesh, make sure to read the original description of it by Zhamak Dehghani in this blog post.

The data lake has been a significant design concept over the last years when we talked about big data and data processing. In recent months, a new concept – the data mesh – has received significant attention. But what is the data mesh and how does it impact the data lake? Will it mean a sudden death for the data lake?

The data divide

The data mesh was first introduced by Zhamak Dehghani in this blog post. It is a concept addressing different challenges in handling data. Some of the arguments Zhamak uses are:

  • The focus on ETL processes
  • Building a monolith (aka Datalake or Data warehouse)
  • Not focusing on the business

According to her, this leads to the “data divide”. Based on my experience, I can fully subscribe to the data divide. Building a data lake isn’t state of the art anymore, since it focuses too much on building a large system over months to years, while business priorities are moving targets that shift during this timeframe. Furthermore, it locks scarce resources (data engineers) into infrastructure work when they should be creating value.

The data lake was often perceived as a “solution” to this problem. But it was only a technical answer to a non-technical problem. One monolith (the data warehouse) was replaced with another (the data lake). IT folks argued over which was the better solution, but after years of arguing, implementation and failed projects, companies figured out that not much had changed. But why?

The answer to this is simple

The traditional (so-called monolithic) approach focuses on building ETL processes. The challenge behind that is that BI units, which are often remote from the business, don’t have a clue about the business. The teams of data engineers often work in the dark, fully decoupled from the business. The original goal of centralised data units was to harmonise data and remove silos. However, what was created was quite different: unusable data. Nobody had an idea of what was in the data, why it was produced and for what purpose. If there is no understanding of the business process itself, there is hardly an understanding of why the data comes in a specific format.

I like comparisons to the car industry, which is currently in full disruption: traditional car makers focused on improving gas-powered engines. Then came Elon Musk with Tesla and built a far better car with great acceleration and far lower consumption. This is real change. The same is valid for data: replacing a technology that didn’t work with another technology won’t solve the problem, because the process is the problem.

The Data mesh – focus on what matters

Here comes the data mesh into play. It is based loosely on some aspects that we already know:

  • Microservices architecture
  • Service meshes
  • Cloud

One of the concepts of the data mesh that I really like is its focus on the business and its simplicity. Basically, it asks for an architectural quantum, meaning the simplest architecture necessary to run the case. There are several tools available, and the focus shifts away from building a monolith where a use case might run at some point in time towards executing the use case with the tools that are already available. And, hey, in the public cloud we have tons of tools for all use cases one might imagine, so there is no need to build this platform yourself. Again: focus on the business.

Another aspect that I really like about the data mesh is the shift of responsibility towards the business. With that, I mean data ownership. Data is provided from the place where it is created: marketing creates its marketing data and makes sure it is properly cleaned, finance creates its data, and so on. Remember: only the business knows best why data is created and for what purpose.

But what is the future role of IT?

So, does the data mesh require all data engineers, data scientists and the like to move to business units? I would say: it depends. Basically, the data mesh requires engineering to work in multi-disciplinary teams with the business. This changes the role of IT to a more strategic one, requiring IT to deploy the right people to the projects.

Also, IT needs to ensure governance and standards are properly set. The data mesh concept will fail if there is no smart governance behind it. There is a high risk of creating more data silos and thus doing the data strategy no good. If you would like to read more about data strategy, check out this tutorial on data governance.

Also, I want to stress one thing: the data mesh replaces neither the data warehouse nor the data lake. Tools used and built for them can be reused.

There is still much more to the data mesh. This is just my summary and thoughts on this very interesting concept. Make sure to read Zhamak’s post on it as well for the full details!

Over the last months, I have written several articles about data governance. One aspect of data governance is the principle of FAIR data. FAIR in the context of data stands for: findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated at the beginning: findable, accessible, interoperable and reusable. This tackles most of the requirements around data governance and thus should increase the use of data. It doesn’t really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don’t manage to solve them at the elementary level.

If a company gets started with the FAIR principles, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth.

Findable data

Most data projects start with the question of whether there is data for a specific use-case. This is often difficult to answer, since data engineers and data scientists often don’t know what kind of data is available in a large enterprise. They know the problem they want to solve but don’t know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about the data that could potentially serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.
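
To make this concrete, a data catalog can start out as nothing more than a searchable index of dataset descriptions. Below is a minimal Python sketch; the dataset names, owners and tags are purely illustrative, not taken from any real product:

```python
# Hypothetical in-memory data catalog: each entry describes one data asset.
CATALOG = [
    {"name": "crm_customers", "owner": "Marketing", "tags": ["customer", "crm"]},
    {"name": "erp_invoices", "owner": "Finance", "tags": ["invoice", "billing"]},
    {"name": "web_clickstream", "owner": "Marketing", "tags": ["customer", "web"]},
]

def find_datasets(keyword):
    """Return all catalog entries whose name contains the keyword
    or whose tags include it - a crude 'findable data' search."""
    keyword = keyword.lower()
    return [entry for entry in CATALOG
            if keyword in entry["name"].lower()
            or keyword in [t.lower() for t in entry["tags"]]]
```

A real catalog would add metadata such as update frequency, sensitivity and lineage, but even a simple keyword search like this turns weeks of asking around into minutes.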

Accessible data

Once the first aspect is solved, it is necessary to access the data. This also brings a lot of complexity, since data is often sensitive and data owners simply don’t want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. Also, it is highly important to have a clear process for data access in place.
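
As a sketch of such a process, the following hypothetical Python snippet models the two prerequisites just mentioned: a named owner per asset and an explicit request/approve flow. All names and assets are invented for illustration:

```python
# Hypothetical owner registry and grant store (illustrative names only).
DATA_OWNERS = {"crm_customers": "alice@example.com"}
GRANTED = set()  # (user, asset) pairs approved by the owner

def request_access(user, asset):
    """Route an access request to the asset's owner; fail loudly if
    no owner is defined - the most common governance gap."""
    owner = DATA_OWNERS.get(asset)
    if owner is None:
        raise LookupError(f"No data owner defined for '{asset}'")
    return {"asset": asset, "requester": user, "approver": owner}

def approve(user, asset):
    """Record the owner's approval."""
    GRANTED.add((user, asset))

def has_access(user, asset):
    """Check whether access was granted before handing out data."""
    return (user, asset) in GRANTED
```

The point of the sketch is the failure mode: if no owner is defined, the request has nowhere to go, and that is exactly where escalations start in practice.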

Interoperable data

Data often needs to be combined with other data sets in use-cases. This means that it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better the description of a data set, the faster people can work with the data.

A frequent case is that data is bought from other companies or shared among companies. This is the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it is necessary to know how the data was created. A description of the source system and the producing entities needs to be available. Also, it is necessary to include information about potential transformations applied to the data.

In order to make data reusable, the terms of reuse must be provided. This can be a license or other community standards for the data. Data can be either purchased or made available for free. Different software solutions enable this.

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time needed to access data and provide a clear way forward. Also, it will indirectly increase data quality and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!

I talk a lot to different people in my domain – either at conferences or because I know them personally. Most of them have one thing in common: frustration. But why are people working with data frustrated? Why do we see so many frustrated data scientists? Is it the complexity of dealing with data, or is it something else? My experience clearly points to one thing: something else.

Why are people working with Data frustrated?

One pattern is very clear: most frustrated people I talk to work in classical industries. Whenever I talk to people in the IT industry or in startups, they seem to be very happy. This is in stark contrast to people working in “classical” industries or in consulting companies. There are several reasons for that:

  • First, it is often about a lack of support within traditional companies. Processes are complex and employees have worked in the company for quite some time. Bringing in new people (the cool data scientists) often creates friction with the established employees. Doing things differently from how they used to be done isn’t well received by the established employees, and they have the power and the will to block any kind of innovation. No amount of data science magic can compete with their internal network.
  • Second, data is difficult to grasp and organised in silos. Established companies often run IT as a cost center, so things were done or fixed on the fly. It was never really intended to dismantle those silos, as budgets were never reserved or made available for doing so. Even now, most companies don’t look into any kind of data governance to reduce their silos. Data quality isn’t a key aspect they strive for. The new kind of people – data scientists – are often “hunting” for data rather than working with it.
  • Third, the technology stack is heterogeneous, and legacy brings in a lot of frustration as well. This is very similar to the second point. Here, the issue is less about finding data at all and more about not knowing how to get the data out of a system that lacks a clear API.
  • Fourth, everybody forgets about data engineers. Data scientists sit alone, and though they have some skills in Python, they aren’t the ones to operate a technology stack. Often, there is a mismatch between data scientists and data engineers in corporations.
  • Fifth, legacy always kicks in. Mandatory regulatory and finance reporting often takes resources away from the organisation. You can’t just say: “Hey, I am not doing this report for the regulator since I want to find some patterns in the behaviour of my customers”. Traditional industries are more heavily regulated than startups or IT companies. This leads to data scientists being reassigned to standard reporting (not even self-service!). Then the answer often is: “This is not what I signed up for!”
  • Sixth, digitalisation and data units are often created just to show something in the shareholder report. There is no real push from the board for impact. Impact is driven by the business, and the business knows how to deliver it. There won’t be significant growth, just some growth from “business as usual”. (However, startups and companies changing the status quo will get this significant growth!)
  • Seventh, data scientists need to be in the business, whereas data engineers need to be in the IT department, close to the IT systems. Period. However, tribes need to be centrally steered.

How to overcome this frustration?

Basically, there is no fast cure for this problem of frustrated data scientists. The field is still young, so confusion and wrong decisions outside the IT industry are normal. Projects will fail; skilled people will leave and find new jobs. Over time, companies will become more and more mature in their journey, and everything around data will become an established part of the company, just like controlling, marketing or any other function. It is yet to find its place and organisational form.

Data Governance

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest in data science, but they often ignore Data Governance. Some months ago I wrote about this and shared my frustration. Now I’ve decided to take a more pragmatic approach and describe what Data Governance is all about. This should bring some clarity to the topic and reduce emotions.

Why is Data Governance important?

It is important to keep a certain level of quality in the data. Making decisions on bad data quality leads to bad overall decisions. Data Governance efforts increase exponentially when governance is not done at the very beginning of your data strategy.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security often slows down business implementations
  • Initial investments are necessary – and they don’t show value for months to years
  • Benefits are only visible “on top” of governance – e.g. through faster business results or better insights – and thus it is not easy to quantify the impact
  • Data Governance is often considered “unsexy”. Everybody talks about data science, but nobody talks about data governance. In fact, data scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open source doesn’t focus much on it, as there is less “buzz” around it than around AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect the success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal here is to keep data secure against external access. It is built on encryption, access management and accessibility. Often, role-based access is defined in this process. A typical principle here is privacy and security by design
  • Data Quality Management: In this pillar, different measures for Data Quality are defined and tracked. Typically, specific quality measures are tracked for each dataset. This gives data consumers an overview of the data quality.
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company assets. A typical example here is a Data Catalog that shows all available company data to end users.
  • Master Data Management: Master data is the common data of the company – e.g. customer data, supplier data and the like. Data here should be of high quality and consistent. One physical customer should appear exactly once, not as multiple persons
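
As a toy illustration of the master data idea, the following Python sketch merges duplicate customer records using a crude normalization key. A real MDM solution would use fuzzy matching and survivorship rules, so treat this purely as a sketch of the concept:

```python
def normalize(record):
    """Build a crude matching key from name and email.
    Real MDM systems use fuzzy matching; this is only illustrative."""
    return (record["name"].strip().lower(), record["email"].strip().lower())

def deduplicate(records):
    """Keep one record per normalized key, so one physical customer
    appears exactly once in the master data."""
    merged = {}
    for record in records:
        merged.setdefault(normalize(record), record)
    return list(merged.values())
```

Even this trivial version shows why consistency matters: "John Doe" and " john doe " with the same email address should be one customer, not two.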

For each of the above mentioned pillars, I will write individual articles over the next weeks.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

In one of my last posts, I wrote that the Cloud is already more PaaS/FaaS than IaaS. In fact, IaaS doesn’t bring much value over traditional architectures. There are still some advantages, but they remain limited. If you want a future-proof architecture, analytics needs to be serverless analytics. In this article, I will explain why.

What is serverless analytics?

As with other serverless technologies, serverless analytics follows the same concept. Basically, the idea is to significantly reduce the work on infrastructure and servers. Modern environments allow us to “only” bring the code, and the cloud provider takes care of everything else. This is basically the dream of every developer. Do you know the statement “it works on my machine”? With serverless, this gets much easier. You only need to focus on the app itself, without any requirements on the operating system and stack. Also, execution is task- or consumption-based. This means that you only pay for what is used. If your service isn’t utilised, you don’t pay for it. You can also achieve this with IaaS, but with serverless it is part of the concept and not something you need to enable yourself.
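
To illustrate the “only bring the code” idea, here is a minimal Python sketch of a serverless-style function: you write just the handler, and the platform supplies the event, runs the function and bills per invocation. The event/context signature follows a common convention among providers, and the payload fields are made up for this example:

```python
# A hypothetical serverless handler: the platform calls it per event,
# so there is no server, process or scaling logic to write yourself.
def handler(event, context=None):
    """Aggregate the order amounts passed in the (illustrative) event payload."""
    amounts = [item["amount"] for item in event.get("orders", [])]
    return {"count": len(amounts), "total": sum(amounts)}
```

Everything outside this function – provisioning, scaling, patching – is the provider’s problem, which is exactly the maintenance work serverless analytics removes.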

With analytics, we are now also marching towards the serverless approach. But why only now? Serverless has been around for some time. Well, the data analytics community has always been a bit slower than the overall industry. When most tech stacks had already migrated to the Cloud, analytics projects were still carried out with large Hadoop installations in the local data center, even though the Cloud was already superior back then. However, a lot of people still insisted on it. Now, data analytics workloads are moving more and more into the Cloud.

What are the components of Serverless Analytics?

  • Data Integration Tools: Most cloud providers offer easy-to-use tools to integrate data from different sources. A GUI makes this even easier.
  • Data Governance: Data catalogs and quality management tools are also often part of the solution. This enables much better integration.
  • Different storage options: For serverless analytics, storage must always be decoupled from the analytics layer. Normally, different databases are available, but most of the data is stored on object stores. Real-time data is consumed via a real-time engine.
  • Data Science Labs: Data scientists need to experiment with data. Major cloud providers have data science labs available that enable this sort of work.
  • APIs for integration: With APIs, it is possible to bring the results back into production or decision-making systems.

How is it different to Kubernetes or Docker?

At the moment, there is a big discussion about whether Kubernetes or Docker will solve this job for analytics. However, containers again require servers and thus increase maintenance at some point. All cloud providers have different Kubernetes and Docker solutions available, which allows an easy migration later on. However, I would suggest going straight for serverless solutions and avoiding containers where possible.

What are the financial benefits?

It is challenging to measure the benefits. If the only comparison is price, then it is probably not the right lens. Serverless analytics will greatly reduce the cost of maintaining your stack – it will go close to zero! The only thing you need to focus on from now on is your application(s) – and they should eventually produce value. Also, it becomes easier to measure IT by its business impact. You get a bill for the applications, not for maintaining a stack. If you run an analysis, you will get a quote for it, and the business impact may or may not justify the investment.

If you want to learn more about Serverless Analytics, I can recommend you this tutorial. (Disclaimer: I am not affiliated with Udemy!)

Recurrent Neural Network

In the last two posts we introduced the core concepts of Deep Learning, Feedforward Neural Network and Convolutional Neural Network. In this post, we will have a look at two other popular deep learning techniques: Recurrent Neural Network and Long Short-Term Memory.

Recurrent Neural Network

The main difference to the previously introduced networks is that the Recurrent Neural Network provides a feedback loop to the previous neuron. This architecture makes it possible to remember important information about the input the network received and to take that learning into consideration along with the next input. RNNs work very well with sequential data such as sound, time series (sensor) data or written natural language.

The advantage of an RNN over a feedforward network is that the RNN can remember the output and use it to predict the next element in a series, while a feedforward network is not able to feed the output back into the network. Real-time gesture tracking in videos is another important use-case for RNNs.
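
The feedback loop can be illustrated with a tiny single-unit RNN in plain Python. The weights below are arbitrary illustrative values, not trained ones:

```python
import math

def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One step of a minimal single-unit RNN: the new hidden state
    depends on the current input AND the previous hidden state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the recurrent loop, carrying the
    hidden state (the network's 'memory') from one step to the next."""
    h = 0.0
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states
```

Note how an input at step 1 still influences the hidden state at later steps through `h_prev` – that carried-over state is exactly what a feedforward network lacks.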

A Recurrent Neural Network

Long Short-Term Memory

A usual RNN has a short-term memory, which is already great for some aspects. However, there are requirements for more advanced memory functionality. Long Short-Term Memory (LSTM) solves this problem. The researchers Sepp Hochreiter and Jürgen Schmidhuber introduced the LSTM. LSTMs enable RNNs to remember inputs over long periods of time. Therefore, LSTMs are used in combination with RNNs for sequential data with long time lags between relevant events.

An LSTM learns over time which information is relevant and which isn’t. This is done by assigning weights to information. The information then passes through three different gates within the LSTM: the input gate, the output gate and the “forget” gate.
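
The gate mechanics can be sketched as a single-unit LSTM cell in plain Python with scalar weights, purely to illustrate the roles of the three gates. Real implementations use trained weight matrices and bias terms:

```python
import math

def sigmoid(z):
    """Squash a value into (0, 1) - each gate outputs such a fraction."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One step of a minimal single-unit LSTM cell.
    `w` is a dict of scalar weights (illustrative, untrained)."""
    f = sigmoid(w["wf_x"] * x_t + w["wf_h"] * h_prev)  # forget gate: how much old memory to keep
    i = sigmoid(w["wi_x"] * x_t + w["wi_h"] * h_prev)  # input gate: how much new info to store
    o = sigmoid(w["wo_x"] * x_t + w["wo_h"] * h_prev)  # output gate: how much memory to expose
    c_tilde = math.tanh(w["wc_x"] * x_t + w["wc_h"] * h_prev)  # candidate memory content
    c = f * c_prev + i * c_tilde   # new cell state (long-term memory)
    h = o * math.tanh(c)           # new hidden state (short-term output)
    return h, c
```

The cell state `c` is the long-term memory: the forget gate decides what to drop from it, the input gate what to add, and the output gate how much of it to reveal at this step.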

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

Working with data is a complex thing and not done in a few days. It is rather a matter of several sequential steps that lead to a final output. In this post, I present the data science process for project execution.

What is the Data Science Process?

Data science often consists mainly of data wrangling and feature engineering before one can get to the exciting stuff. Since data science is often very exploratory, processes haven’t evolved much around it (yet). In the data science process, I group the work into three main steps, each with several sub-steps. Let’s first start with the three main steps:

  • Data Acquisition
  • Feature Engineering and Selection
  • Model Training and Extraction

Each of the different process main steps contains some sub-steps. I will describe them a bit in detail now.

Step 1: Data Acquisition

Data engineering is the main ingredient in this step. After a business question has been formulated, it is necessary to look for the data. In an ideal setup, you would already have a data catalog in your enterprise. If not, you might need to ask several people until you have found the right place to dig deeper.

First of all, you need to acquire the data. These might be internal sources, but you might also combine them with external sources. In this context, you might want to read about the different data sources you need. Once you are done with a first look at the data, it is necessary to integrate it.

Data integration is often perceived as a challenging task. You need to set up a new environment to store the data, or you need to extend an existing schema. A common practice is to build a data science lab. A data science lab should be an easy platform for data engineers and data scientists to work with data. A best practice is to use a prepared environment in the cloud.

After integrating the data comes the heavy part: cleaning it. In most cases, data is very messy and thus needs a lot of cleaning. This is also mainly carried out by data engineers alongside data analysts. Once you are done with the data acquisition part, you can move on to the feature engineering and selection step.

Typically, this first process step can be very painful and long-lasting. It depends on different factors in an enterprise, such as the data quality itself, the availability of a data catalog and corresponding metadata descriptions. If your maturity in all these items is very high, it can take some days to a week, but on average it is rather 2 to 4 weeks of work.

Step 2: Feature Engineering and Selection

In the next step, we start with a very important part of the data science process: feature engineering. Features are very important for machine learning and have a huge impact on the quality of the predictions. For feature engineering, you have to understand the domain you are in: one needs to understand what data to use and for what reason.

After the feature engineering itself, it is necessary to select the relevant features via feature selection. A common mistake is overfitting the model through “feature explosion”: too many features are created, and the predictions aren’t accurate anymore. Therefore, it is very important to select only those features that are relevant to the use-case and thus carry some significance.

Another important step is the development of the cross-validation structure. This is necessary to check how the model will perform in practice. Cross-validation measures the performance of your model and gives you insights on how to use it. After that comes hyperparameter tuning: hyperparameters are fine-tuned to improve the predictions of your model.
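
The cross-validation idea can be sketched in plain Python: split the samples into k folds, and let each fold serve once as the validation set while the rest is used for training. Libraries such as scikit-learn provide this out of the box; this sketch only shows the mechanics:

```python
def k_fold_indices(n_samples, k):
    """Produce k (train, validation) index splits. Each fold is the
    validation set exactly once, so the model is always evaluated on
    data it was not trained on."""
    indices = list(range(n_samples))
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds
```

Averaging the model’s score over all k validation folds gives a far more honest estimate of real-world performance than a single train/test split.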

This part of the process is carried out mainly by data scientists, but still supported by data engineers. The next and final step in the data science process is model training and extraction.

Step 3: Model Training and Extraction

The last step in the process is model training and extraction. In this step, the algorithm(s) for the model are selected and compared to each other. In order to ease the work here, it is necessary to put your whole process into a pipeline. (Note: I will explain the concept of the pipeline in a later post.) After the training is done, you can move on to the predictions themselves and bring the model into production.

The following illustration outlines the now presented process:

The Data Science Process

The data science process itself can easily be carried out in a Scrum or Kanban approach, depending on your favourite management style. For instance, you could run each of the three process steps as a sprint. The first sprint, “Data Acquisition”, might last longer than the others, or you could even break it into several sprints. For agile data science, I can recommend reading this post.

About 1.5 years ago, I wrote that the Cloud is not the future. Instead, I claimed that it is the present. In fact, most companies are already embracing the Cloud. Today, I want to revisit this statement and take it to the next level: Cloud IaaS is not the future.

What is wrong about Cloud IaaS?

Cloud IaaS was great in the early days of the Cloud. It gave us the freedom to move our workloads to the Cloud in a lift-and-shift scenario. Also, it greatly improved how we can handle workloads in a more dynamic way: adding servers and shutting them down on demand was really easy, whereas in an on-premise scenario this was far from easy. All big Cloud providers today offer a comprehensive toolset and third-party applications for IaaS solutions. But why is it not as great as it used to be?

Honestly, I was never a big fan of IaaS in the Cloud. To put it bluntly, it didn’t improve much (other than scale and flexibility) over the on-premise world. With Cloud IaaS, we still have to maintain all our servers like in the old on-premise days: security patches, updates, fixes and the like remain with those who build the services. Since the early days, I have been a big fan of Cloud PaaS (Platform as a Service).

What is the status of Cloud PaaS?

Over the last 1.5 years, a lot of mature Cloud PaaS services have emerged. Cloud PaaS has been around for almost 10 years, but the current level of maturity is impressive. Some two years ago, these were mainly general-purpose services, but they have now moved into very specific domains. There are now many services available for areas such as IoT or Data Analytics.

The current trend in Cloud PaaS is definitely the move towards “Serverless Analytics”. Analytics has always been a slow mover when it came to the Cloud. Other functional areas already had Cloud-native implementations while analytical workloads were still being developed for the on-premise world. Hadoop was one of these projects, but other projects took over and Hadoop is in decline. From now on, more analytical applications will be developed on a PaaS stack.

What should you do now?

Cloud PaaS isn’t a revolution or anything spectacularly new. If you have no experience with Cloud PaaS, I would urge you to look at these platforms asap. They will become essential for your business and provide a lot of benefits. Again – it isn’t the future, it is the present!

If you want to learn more about Serverless Analytics, I can recommend this tutorial. (Disclaimer: I am not affiliated with Udemy!)

Large enterprises have a lot of legacy systems in their footprint. This has created a lot of challenges (but also opportunities!) for system integrators. Now that companies strive to become data-driven, it becomes an even bigger challenge. But luckily there is a new concept out there that can help: data abstraction.

Data Abstraction: why do you need it?

If a company wants to become data-driven, it is necessary to unlock all the data that is available within the company. However, this is easier said than done. Most companies have a heterogeneous IT landscape and thus struggle to integrate essential data sources into their analytical systems. In the past, there have been several approaches to this. Data was loaded (with some delay) into an analytical data warehouse. This data warehouse didn’t power any operational systems, so it was decoupled.

However, several things proved to be problematic with data warehouses handling analytical workloads: (a) data warehouses tend to be very expensive, both in licensing and in operations. They make sense for KPIs and highly structured data, but not for other datasets. And (b) due to the high cost, data warehouses were loaded with hours or even days of delay. In a real-time world, this isn’t good at all.

But – didn’t the data lake solve this already?

Some years ago, data lakes surfaced. They were more efficient in terms of speed and cost than traditional data warehouses. However, data warehouses kept the master data, which data lakes often need, so a connection between the two had to be established. In the early days, data was simply replicated to achieve this. Next to data lakes, many other systems (mainly NoSQL) surfaced, and business units acquired various additional systems, which made further integration efforts necessary. So there was no end to data silos at all – it even got worse (and will continue to do so).

So, why not give in to the pressure of heterogeneous systems and data stores and try to solve it differently? This is where data abstraction comes into play …

What is it about?

As already introduced, data abstraction should reduce your sleepless nights when it comes to accessing and unlocking your data assets. It is like a virtual layer that you add between your data stores and your data consumers to enable one common access. The following illustration shows this:

Data Abstraction shows how to abstract your data
Data Abstraction

Basically, you build a layer on top of your data sources. Of course, it doesn’t solve the challenges around data integration, but it ensures that consumers can expect one common layer they can plug into. It also enables you to exchange the technical layer of a data source without consumers taking note of it. You might consider re-developing a data source from the ground up in order to make it more performant. Both the old and the new stack will conform to the data abstraction, and thus consumers won’t realize that there are significant changes under the hood.
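The exchange-the-backend idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a real product: all class and method names are assumptions. The consumer codes against one common contract, and a legacy store can be swapped for a re-developed SQL stack without the consumer noticing.

```python
# Sketch of a data abstraction layer: one contract, swappable backends.
import sqlite3

class CustomerSource:
    """Common contract every backend must fulfil (the abstraction layer)."""
    def get_customer(self, customer_id):
        raise NotImplementedError

class LegacyDictSource(CustomerSource):
    # Stands in for an old system, e.g. a file export or a NoSQL store.
    def __init__(self, records):
        self._records = records
    def get_customer(self, customer_id):
        return self._records[customer_id]

class SqlSource(CustomerSource):
    # The re-developed, more performant stack behind the same contract.
    def __init__(self, conn):
        self._conn = conn
    def get_customer(self, customer_id):
        row = self._conn.execute(
            "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        return {"id": row[0], "name": row[1]}

def report(source: CustomerSource, customer_id):
    # A consumer: it only ever sees the abstraction, never the backend.
    return f"Customer {source.get_customer(customer_id)['name']}"

legacy = LegacyDictSource({1: {"id": 1, "name": "ACME"}})
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'ACME')")
modern = SqlSource(conn)

# Swapping backends yields identical results for the consumer.
assert report(legacy, 1) == report(modern, 1) == "Customer ACME"
```

The consumer function `report` never changes, even though the storage underneath is replaced entirely – which is exactly the promise of the abstraction layer.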

This sounds really nice. So what’s the (technical) solution to it?

Basically, I don’t recommend any specific technology at this stage. There are several technologies that enable data abstraction. They can be clustered into three different areas:

  1. Lightweight SQL Engines: There are several products and tools (both Open Source and non-Open Source) available, which enable SQL access to different data sources. They not only plug into relational databases, but also into non-relational databases. Most tools provide easy integration and abstraction.
  2. API Integration: It is possible to integrate your data sources via an API layer that abstracts the underlying data sources. The pain of integration is higher than with SQL engines, but it gives you more flexibility on top and a higher degree of abstraction. In contrast to SQL engines, your consumers won’t plug too deeply into database specifics. If you want to go for a really advanced tech stack, I recommend reading about Graphs.
  3. Full-blown solution: There are several proprietary tools available that provide numerous connectors to data sources. What is really great about these solutions is that they also include caching mechanisms for frequent data access. You get much higher performance with limited implementation cost. However, you will lock yourself into a specific solution.

Which solution you eventually go for is fully up to you. It depends on the company, its know-how and its characteristics. In most cases, it will also be a combination of different solutions.

So what is next?

There are many tools and services out there which enable data abstraction. Data abstraction is more of a concept than a concrete technology – not even an architectural pattern. In some cases, you might acquire a technology for it; in others, you would abstract your data via an API or a Graph. Whichever route you take, there are plenty of options to solve your issues.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.