Large enterprises carry a lot of legacy systems in their footprint. This has created plenty of challenges (but also opportunities!) for system integrators. Now that companies strive to become data driven, the challenge gets even bigger. Luckily, there is a new approach that can help: data abstraction.
Data Abstraction: why do you need it?
If a company wants to become data driven, it has to unlock all the data that is available within the company. However, this is easier said than done. Most companies have a heterogeneous IT landscape and thus struggle to integrate essential data sources into their analytical systems. In the past, there have been several approaches to this. Typically, data was loaded (with some delay) into an analytical data warehouse. This data warehouse didn’t power any operational systems, so it was decoupled from them.
However, using data warehouses for analytical workloads turned out to have two problems: (a) data warehouses tend to be very expensive, both to buy and to operate. That makes sense for KPIs and highly structured data, but not for other datasets. And (b) due to the high cost, data warehouses were loaded with a delay of hours or even days. In a real-time world, this isn’t good at all.
But – didn’t the data lake solve this already?
Some years ago, data lakes surfaced. They were more efficient in terms of speed and cost than traditional data warehouses. However, data warehouses kept the master data, which data lakes often need, so a connection between the two had to be established. In the early days, data was simply replicated to achieve this. Next to data lakes, many other systems (mainly NoSQL) surfaced. Business units acquired yet more systems, which made further integration efforts necessary. So there was no end to data silos at all. It even got worse (and will continue to do so).
So, why not give in to the pressure of heterogeneous systems and data stores and try to solve it differently? This is where data abstraction comes into play …
What is it about?
As already introduced, data abstraction should reduce your sleepless nights when it comes to accessing and unlocking your data assets. It is a virtual layer that you add between your data stores and your data consumers to give them one common way of access. The following illustration shows this:
Basically, you build a layer on top of your data sources. Of course, it doesn’t solve the challenges around data integration, but it ensures that consumers have one common layer that they can plug into. It also enables you to exchange the technical layer of a data source without consumers noticing. You might, for example, redevelop a data source from the ground up in order to make it more performant. Both the old and the new stack conform to the data abstraction, so consumers won’t realize that there are significant changes under the hood.
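To make the "swap the stack without consumers noticing" idea concrete, here is a minimal Python sketch. All names in it (CustomerSource, LegacyCsvSource, NewInMemorySource, count_customers_in) are hypothetical and only for illustration; a real abstraction layer would sit in front of actual databases or services. The point is that the consumer depends on the common interface, not on a specific backend.

```python
import csv
import io
from typing import Iterable, List, Protocol


class CustomerSource(Protocol):
    """The common layer: every backend exposes the same method."""

    def customers(self) -> Iterable[dict]: ...


class LegacyCsvSource:
    """Assumed old stack: customer records available as CSV text."""

    def __init__(self, csv_text: str) -> None:
        self._csv_text = csv_text

    def customers(self) -> Iterable[dict]:
        # Parse the CSV and normalize each row to the common record shape.
        for row in csv.DictReader(io.StringIO(self._csv_text)):
            yield {"id": int(row["id"]), "country": row["country"]}


class NewInMemorySource:
    """Assumed new stack: records already held as Python dicts."""

    def __init__(self, records: List[dict]) -> None:
        self._records = records

    def customers(self) -> Iterable[dict]:
        return iter(self._records)


def count_customers_in(source: CustomerSource, country: str) -> int:
    """A consumer: it only knows the common interface."""
    return sum(1 for c in source.customers() if c["country"] == country)


old = LegacyCsvSource("id,country\n1,AT\n2,DE\n3,AT\n")
new = NewInMemorySource(
    [
        {"id": 1, "country": "AT"},
        {"id": 2, "country": "DE"},
        {"id": 3, "country": "AT"},
    ]
)

# The consumer code is identical for both backends.
print(count_customers_in(old, "AT"))
print(count_customers_in(new, "AT"))
```

Both calls return the same count, which is exactly what the consumer should see when the underlying stack is replaced.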
This sounds really nice. So what’s the (technical) solution to it?
I don’t recommend any specific technology at this stage. There are several technologies that enable data abstraction. They can be clustered into three areas:
- Lightweight SQL engines: There are several products and tools (both open source and proprietary) that enable SQL access to different data sources. They plug not only into relational databases, but also into non-relational ones. Most tools provide easy integration and abstraction.
- API integration: It is possible to integrate your data sources via an API layer that abstracts the underlying data sources. The pain of integration is higher than with SQL engines, but it gives you more flexibility on top and a higher degree of abstraction. In contrast to SQL engines, your consumers won’t plug too deeply into database specifics. If you want to go for a really advanced tech stack, I recommend reading about graphs.
- Full-blown solution: There are several proprietary tools available that provide numerous connectors to data sources. What is really great about these solutions is that they also include caching mechanisms for frequent data access. You get much higher performance at limited implementation cost. However, you lock yourself into a specific solution.
Which solution you eventually go for is fully up to you. It depends on the company, its know-how and its characteristics. In most cases, the answer is a combination of different solutions.
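The API integration and caching ideas from the list above can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular product works: DataAccessLayer, register, get and fetch_orders_from_warehouse are hypothetical names, and the "warehouse" is faked with a plain function so the example runs on its own.

```python
import time
from typing import Callable, Dict, List, Tuple

# A fetcher is any function that returns rows from some underlying source.
Fetcher = Callable[[], List[dict]]


class DataAccessLayer:
    """Hypothetical API facade: consumers ask for a dataset by name only."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self._fetchers: Dict[str, Fetcher] = {}
        self._cache: Dict[str, Tuple[float, List[dict]]] = {}
        self._ttl = ttl_seconds

    def register(self, name: str, fetcher: Fetcher) -> None:
        """Attach (or replace) the backend behind a dataset name."""
        self._fetchers[name] = fetcher
        self._cache.pop(name, None)  # drop stale data from the old backend

    def get(self, name: str) -> List[dict]:
        """Return cached rows while they are fresh, otherwise hit the source."""
        now = time.monotonic()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        rows = self._fetchers[name]()
        self._cache[name] = (now, rows)
        return rows


# Count backend calls so the effect of the cache is visible.
calls = {"orders": 0}


def fetch_orders_from_warehouse() -> List[dict]:
    """Stand-in for a slow query against a real data source (assumption)."""
    calls["orders"] += 1
    return [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 60.0}]


layer = DataAccessLayer(ttl_seconds=60.0)
layer.register("orders", fetch_orders_from_warehouse)

first = layer.get("orders")   # goes to the underlying source
second = layer.get("orders")  # served from the cache
print(calls["orders"])
```

Consumers only ever call get with a dataset name, so the backend behind a name can be replaced via register without changing consumer code. A real setup would also have to think about cache invalidation, access control and error handling, all of which are left out here.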
So what is next?
Data abstraction is more of a concept than a concrete technology, and not even an architectural pattern. In some cases, you might acquire a technology. In others, you would abstract your data via an API or a graph. Either way, there are many technologies, tools and services out there that enable data abstraction.
This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you are looking for open data, I recommend browsing open data catalogs such as the open data catalog from the U.S. government.