The ultimate career goal of a Data Scientist or any other person in the Data industry would be to become a CDO – Chief Data Officer. But is it really that interesting to become one? Is the role of the CDO still relevant nowadays? Let me dig a bit into the role and how it has evolved, and outline why I believe that this role will – sooner or later – become irrelevant.

The Chief Data Officer at the board level

Whenever we talk about a C-level role, the implication is that this person is a board member. In my own career and past roles, however, I have not seen a single CDO who was a real board member. In fact, most of the time the “CDO” was either reporting to the board or sitting several hierarchical levels underneath it. So in all those cases, the “C” should be removed from the job title. In my recent jobs, I had the full job description of a CDO, reported to the board and dealt with the data topic on a group-wide basis. However, my job title was always “Head of Data” rather than carrying the C (since, again, I wasn’t part of the board). My personal opinion is that the data topic shouldn’t sit at board level anyway – it should be in ALL business departments.

A central function for a decentralised job

One of the key reasons why the Chief Data Officer might become irrelevant is the basic nature of data: data is always decentralised, produced in business departments and used in those departments. A central function for data will never be fast enough to keep up with the demands around data. This raises the question of whether a central department is relevant at all – and whether it will stay relevant.

One key consideration is whether the job brings any benefit to the company once installed. Most CDOs I have known used to focus on analytical use-cases. But this is definitely something that needs to be done in the business departments. With the data mesh, not only using data but also preparing data (e.g. data engineering as a task) is embedded in business functions rather than in centralised functions. So several functions that a Chief Data Officer would carry out get decentralised. But what will remain for the Chief Data Officer?

The Chief Data Officer as central steering for Data Governance and Architecture

However, there is still plenty of work left for a “Chief Data Officer”. The main tasks of this function or person will center around the following items:

  • Data Governance: steer all decentralised projects towards common standards and raise awareness for data quality
  • Data Architecture: ensure a common data architecture and set standards for decentralised functions, alongside the data governance standards
  • Drive a data-driven culture: gather the decentralised community within an organisation and ensure that the organisation learns and constantly improves in this area. Become a catalyst for innovation around data

There are some aspects that such a function should not do:

  • Data Engineering: Data Engineering should either be done in IT (if it is source centric) or in business departments (but rather on a limited scale, focused on DataOps!) if it is pipeline centric (and supporting data scientists)
  • Data Science: This should entirely be done in the business

As we can see, there is still a lot that a “CDO” should do. However, the function would rather focus on securing the base than on creating value in the business cases per se. But this might not be entirely true: if nobody takes care of governance and a proper architecture, there is no chance to create value in the business. So it is a very important role to have in organisations. Will this role be called a “Chief Data Officer”? Probably not, but people like titles, so it will stay 😉

Errol and I had the idea of doing a “Data Hike” back in 2020. However, Covid restrictions didn’t allow us to meet, so both of us ended up hiking on our own: Errol around the beautiful lakes of Sweden, me around the beautiful alpine lakes of Austria. Things have gotten easier now, and so we finally made it last weekend!

Our hike took us to the beautiful and scenic area of Semmering in Lower Austria. This place is not just known for its scenery, but also as an area the Austrian emperors frequently visited. The hike took us through forests, alpine pastures and even over a small summit (Großer Pfaff). We touched on different topics around data, and some rather remote to it.

The Data Mesh Hike

The Data Hike in Austria (aerial picture of the Stuhleck)

One of the key topics we talked about was the “Data Mesh”. This trend, started in 2019 by ((LINK)), is getting popular fast. Both Errol and I had our own ideas about it and how to approach it. We were both in line with the idea that it isn’t just a trend but something that will ease how all of us deal with data. We talked about some companies we both know that are already thinking in that direction. We also highlighted the need for more know-how on this topic, since it is new and still emerging.

The Datalake

Talking about the Data Mesh while we had a break

We didn’t really pass a lake during our hike – only a swamp. This was a fitting symbol for what many datalakes eventually became: data swamps. We discussed how this happened and why it ended like this; we argued that most likely it was because companies focused on implementing a complex technology and fully ignored the fact that good governance is necessary. All the tech geeks discussed the fancy technology around the datalake – e.g. Spark, Hive, Kafka, … – but nobody wanted to “pitch” for governance. Our final thought was: the datalake is dead. However, we both agreed that this only concerns the legacy technology, not the concept itself.

Heads up in the Cloud

Once we reached the top of the Großer Pfaff, we had a marvelous view over the area. This led us to another topic: the Cloud. In the cloud, you reach another level of maturity by using microservices and the like. It isn’t necessary anymore to build a big datalake; it already exists and is available. Using services such as S3 is easy and straightforward, and there is no need for a complex HDFS-based solution.

It’s the business, stupid!

We made it to our final destination – Großer Pfaff

After a four-and-a-half-hour hike, we reached our car and rewarded ourselves with nice Topfenknödel. I made my pitch for “Kasnocken”, but they aren’t served in this area of Austria. When we asked ourselves why we did this hike, the answer was simple: data needs to serve a business need. There is no need to go for complex technology; keep it simple (with microservices) and deliver value!

If you are also interested in joining a hike with us, hit me up! One thing is clear to us: this was just the beginning of a series of Data Hikes (or maybe even Data Skis?)

In my previous post, I gave an introduction to Python Libraries for Data Engineering and Data Science. In this post, we will have a first look at NumPy, one of the most important libraries to work with in Python.

NumPy is the simplest library for working with data. It is often re-used by other libraries such as Pandas, so it is necessary to understand NumPy first. The focus of this library is on easy transformations of vectors, matrices and arrays, and it provides a lot of functionality for that. But let’s get our hands dirty with the library and have a look at it!

Before you get started, please make sure to have the Sandbox set up and ready.

Getting started with NumPy

First of all, we need to import the library. This works with the following import statement in Python:

import numpy as np

This should now give us access to the NumPy library. Let us first create an array with 15 values and reshape it to 3×5. In NumPy, this works with the “arange” method: we provide “15” as the number of items and then reshape the result to 3×5:

vals = np.arange(15).reshape(3,5)
vals

This should now give us a two-dimensional output array with 3 rows, each containing 5 values. The values range from 0 to 14:

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

NumPy contains a lot of different constants and functions. To use pi, you simply import “pi” from numpy:

from numpy import pi
pi

We can now use PI for further work and calculations in Python.
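
As a small illustration (an addition to the original example), we can combine pi with NumPy’s trigonometric functions:

angles = np.linspace(0, 2 * pi, 5)   # 5 evenly spaced angles between 0 and 2*pi
np.sin(angles).round(2)              # sine of each angle, rounded to 2 digits

This should return the sines of 0, π/2, π, 3π/2 and 2π, i.e. roughly 0, 1, 0, -1 and 0.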

Simple Calculations with NumPy

Let’s create a new array with 5 values:

vl = np.arange(5)
vl

An easy calculation is raising every value to a power. This works with “**”:

nv = vl**2
nv

Now, this should give us the following output:

array([ 0,  1,  4,  9, 16])

The same applies to 3: if we want to raise everything in the array to the power of 3:

nn = vl**3
nn

And the output should be similar:

array([ 0,  1,  8, 27, 64])

Working with Random Numbers in NumPy

NumPy provides the function “random” to create random numbers. This function takes the dimensions of the array to fit the numbers into. We use a 3×3 array:

nr = np.random.random((3,3))
nr *= 100
nr

Please note that “random” returns numbers between 0 and 1, so in order to get higher numbers we need to “stretch” them; we thus multiply by 100. The output should look something like this:

array([[90.30147522,  6.88948191,  6.41853222],
       [82.76187536, 73.37687372,  9.48770728],
       [59.02523947, 84.56571797,  5.05225463]])
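
As a side note (an addition to the original example), the same result can be achieved in one step with NumPy’s “uniform” function, which takes the lower and upper bound directly:

nr = np.random.uniform(0, 100, size=(3, 3))   # random floats between 0 and 100
nr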

Your numbers will be different, since we are working with random numbers here. We can do the same with a 3-dimensional array:

n3d = np.random.random((3,3,3))
n3d *= 100
n3d

Also here, your numbers would be different, but the overall “structure” should look like the following:

array([[[89.02863455, 83.83509441, 93.94264059],
        [55.79196044, 79.32574406, 33.06871588],
        [26.11848117, 64.05158411, 94.80789032]],

       [[19.19231999, 63.52128357,  8.10253043],
        [21.35001753, 25.11397256, 74.92458022],
        [35.62544853, 98.17595966, 23.10038137]],

       [[81.56526913,  9.99720992, 79.52580966],
        [38.69294158, 25.9849473 , 85.97255179],
        [38.42338734, 67.53616027, 98.64039687]]])

Other ways to work with numbers in NumPy

NumPy provides several other options to work with data, including a number of aggregation functions. Let’s now look for the maximum value in the previously created array:

n3d.max()

In my example this returns 98.6; you will get a different number, since the array is random. It is also possible to return the maximum along a specific axis of the array. For this, we add the keyword “axis” to the “max” function:

n3d.max(axis=1)

This now returns the maximum values along the given axis. In my example, the results look like this:

array([[93.94264059, 79.32574406, 94.80789032],
       [63.52128357, 74.92458022, 98.17595966],
       [81.56526913, 85.97255179, 98.64039687]])

Another option is to compute the sum. We can do this over the entire array or by providing the axis keyword:

n3d.sum(axis=1)
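
For completeness (a small addition), summing over the entire array works without the axis keyword and returns a single number:

n3d.sum()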

In the next sample, we make the data look a bit prettier. This can be done by rounding the numbers to 2 digits:

n3d.round(2)

Iterating arrays in Python

Often, it is necessary to iterate over items. In NumPy, this can be achieved with the built-in iterator, which we get from the function “nditer”. This function takes the array to iterate over, and we can then use it in a for loop:

for val in np.nditer(n3d):
    print(val)

The above sample iterates over all values in the array and prints them. If we want to modify the items within the array, we need to set the flag “op_flags” to “readwrite”. This enables us to modify the array while iterating over it. In the next sample, we iterate over each item and replace it with its value modulo 3:

n3d = n3d.round(0)

with np.nditer(n3d, op_flags=['readwrite']) as it:
    for i in it:
        i[...] = i % 3
        
n3d

These are the basics of NumPy. In our next tutorial, we will have a look at Pandas: a very powerful dataframe library.

If you liked this post, you might also like the tutorial about Python itself, which gives you a great insight into the Python language for Spark. If you want to know more about Python, consider visiting the official page.

In some of my previous posts, I shared my thoughts on the data mesh architecture. The data mesh was originally introduced by Zhamak Dehghani in 2019 and now enjoys huge popularity in the community. Since one of the main ideas of the data mesh architecture is the distributed nature of data, it also leads to a domain-driven design of the data itself. A data circle enables this design.

What is a data circle?

A data circle is a data model that is tailored to the use-case domain. It should follow the approach of the architectural quantum from the microservice architecture. The domain model should only contain the information relevant for the purpose it is built for and no additional data. Also, each circle could or should run within its own environment (e.g. database); the technology should be selected for the best use of the data. A circle might easily be confused with a data mart, which is built within the data warehouse. However, several data circles will typically not “live” within one (physical) data warehouse but use different technologies and are highly distributed.

Each company will have several data circles in place, each tailored to the specific needs of its use-cases. When modelling data with data circles, unnecessary information is skipped, as the circle will – at some point – be connectable with other data circles in the company. Where we previously built our data models in a very comprehensive way (e.g. via the data warehouse), we now build them in a distributed way.

Examples from the telco and financial industries

If we take for example a telco company, data circles might be:

  • The customer data circle: containing the most important customer data
  • The network data circle: containing information about the network
  • The CDR data circle: containing information about calls conducted

If we look at the insurance industry, data circles might be:

  • The customer data circle: containing the most important customer data
  • The claims data circle: containing the data about past claims
  • The health data circle: containing health-related information

Coming back to the telco company: the customer data might be stored in a relational model within an RDBMS, while network data might be stored in a graph for better spatial analysis and CDR data might be stored in a file-based setup. For each domain, the best technology is selected and the best model is designed. The same holds true for other industries.

Several data circles make up the design

Different business units will build their own data circles to fit their demands. This, however, makes it necessary to create a central repository that sticks it all together: a hub connecting all the circles. The hub stores information about the connectivity of the different circles. Imagine the network data model again – you might want to connect the network data with customer data. There must be a way to connect this data while still keeping its distributed aspects. The hub serves as a central data asset management tool and one-stop shop for employees within the company to find the data they need.

Circles connected via a hub (a data hub connecting different data circles)

The data hub also allows users to connect to and analyse the data they want to access, for example with tools such as Jupyter. The hub also takes care of the connectivity to the data and thus provides an API for all users. In short, a data hub is all about data governance.
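
To make this a bit more tangible, here is a minimal sketch of how such a hub could register circles and the keys that connect them. All class and field names are hypothetical and not taken from any product or standard:

from dataclasses import dataclass, field

@dataclass
class DataCircle:
    name: str                                # e.g. "customer" or "network"
    technology: str                          # e.g. "RDBMS", "graph database", "object storage"
    owner: str                               # the business unit owning the circle
    keys: set = field(default_factory=set)   # business keys exposed for connecting circles

@dataclass
class DataHub:
    circles: dict = field(default_factory=dict)

    def register(self, circle: DataCircle):
        # the hub only stores metadata; the data itself stays in the circle
        self.circles[circle.name] = circle

    def connectable(self, a: str, b: str) -> set:
        # keys shared by two circles tell us how they can be connected
        return self.circles[a].keys & self.circles[b].keys

hub = DataHub()
hub.register(DataCircle("customer", "RDBMS", "Marketing", {"customer_id"}))
hub.register(DataCircle("network", "graph database", "Network Operations", {"cell_id", "customer_id"}))
print(hub.connectable("customer", "network"))   # {'customer_id'}

In a real setup, the registry itself would of course be a proper data catalog; the sketch only illustrates the idea of circles staying distributed while the hub knows how they connect.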

What’s next?

I recommend reading the other articles I’ve written about the data mesh architecture. It is fairly easy to get started with this architectural style, and data circles contribute to it.

In my last post, I presented the concept of the data mesh. One thing that is often discussed in regard to the data mesh is how to build an agile architecture with data and microservices. To understand where this needs to go, we must first discuss the architectural quantum.

What is the architectural quantum?

The architectural quantum is the smallest possible item that needs to be deployed in order to run an application. It is described in the book “Building Evolutionary Architectures“. The traditional approach with data lakes was to have a monolith, so the smallest entity was the datalake itself. The goal of the architectural quantum is to reduce complexity. With microservices, this is achieved by decoupling services and making small entities in a shared-nothing approach.

The goal is simplification, and this can only be achieved if there is little to no shared service involved. The original expectation with SOA was to share commonly used infrastructure in order to reduce development effort. However, it led to higher rather than lower complexity: when a change to a shared item was necessary, all teams depending on that item had to be involved. With a shared-nothing architecture, items are copied rather than shared and then used independently of each other.

Focus on the business problem

Each solution is designed around the business domain and not around the technical platform. Consequently, this also means that the technology that fits the business problem should be chosen. As of now, most IT departments have a narrow view of technology, so they try to fit the business problem to the technology until the business problem becomes a technical problem. It should be the other way around.

With the data mesh and the architectural quantum, we focus fully on the domain. Since the goal is to reduce complexity (small quantum size!), we don’t re-use technology for its own sake but select the appropriate one. The data mesh thus only works well if there is a large set of tools available, which can typically be found at large cloud providers such as AWS, Microsoft Azure or Google Cloud. Remember: you want to find a solution to the business problem, not create a technical problem.

Why we need data microservices

After microservices, it is about time for data microservices. There are several things that we need to change when working with data:

  • Focus on the Business. We don’t solve technical problems, we need to start solving business problems
  • Reduce complexity. Don’t think in tech porn. Simplify the architecture, don’t over-complicate.
  • Don’t build it. It already exists in the cloud and is less complicated to use than building and running it on your own
  • No Monoliths. We built them for data for decades; replacing a DWH with a Datalake didn’t work out.

It is just about time to start doing so.

If you want to learn more about the data mesh, make sure to read the original description of it by Zhamak Dehghani in this blog post.

The datalake has been a significant design concept over the last years when we talk about big data and data processing. In recent months, a new concept – the data mesh – has gotten significant attention. But what is the data mesh and how does it impact the datalake? Will it mean a sudden death for the datalake?

The data divide

The data mesh was first introduced by Zhamak Dehghani in this blog post. It is a concept based on different challenges in handling data. Some of the arguments Zhamak uses are:

  • The focus on ETL processes
  • Building a monolith (aka Datalake or Data warehouse)
  • Not focusing on the business

According to her, this leads to the “data divide”. Based on my experience, I can fully subscribe to the data divide. Building a datalake isn’t state of the art anymore, since it means building a large system for months to years, while business priorities are moving targets that shift during this timeframe. Furthermore, it locks scarce resources (data engineers) into infrastructure work when they should be creating value.

The datalake was often perceived as a “solution” to this problem. But it was only a technical answer to a non-technical problem: one monolith (the data warehouse) was replaced with another (the datalake). IT folks argued over which was the better solution, but after years of arguing, implementation and failed projects, companies figured out that not much had changed. But why?

The answer to this is simple

The traditional (often called monolithic) approach focuses on building ETL processes. The challenge is that BI units, which are often remote from the business, don’t have a clue about the business. The teams of data engineers often work in the dark, fully decoupled from the business. The original goal of centralised data units was to harmonise data and remove silos. However, what was created was quite different: unusable data. Nobody had an idea about what was in the data, why it was produced and for what purpose. If there is no understanding of the business process itself, there is hardly an understanding of why the data comes in a specific format and the like.

I like comparisons to the car industry, which is currently in full disruption: traditional car makers focused on improving gas-powered engines. Then came Elon Musk with Tesla and built a far better car with great acceleration and far lower consumption. That is real change. The same is valid for data: replacing a technology that didn’t work with another technology won’t solve the problem, because the process is the problem.

The Data mesh – focus on what matters

Here comes the data mesh into play. It is based loosely on some aspects that we already know:

  • Microservices architecture
  • Service meshes
  • Cloud

One of the things I really like about the data mesh is its focus on the business and its simplicity. Basically, it asks for an architectural quantum, meaning the simplest architecture necessary to run the case. There are plenty of tools available to use, and it shifts the focus away from building a monolith where a use case might run at some point in time towards doing the use case with the tools that are already available for it. And, hey, in the public cloud we have tons of tools for every use case one might imagine, so there is no need to build this platform. Again: focus on the business.

Another aspect I really like about the data mesh is the shift of responsibility towards the business; with that, I mean data ownership. Data is provided from the place where it is created: marketing creates its marketing data and makes sure it is properly cleaned, finance its data, and so on. Remember: only the business knows best why data is created and for what purpose.

But what is the future role of IT?

So, does the data mesh require all data engineers, data scientists and the like to move to business units? I would say it depends. Basically, the data mesh requires engineering to work in multi-disciplinary teams with the business. This changes the role of IT to a more strategic one, requiring IT to deploy the right people to the right projects.

Also, IT needs to ensure that governance and standards are properly set. The data mesh concept will fail if there is no smart governance behind it: there is a high risk of creating more data silos and thus doing no good to the data strategy. If you would like to read more about data strategy, check out this tutorial on data governance.

Also, I want to stress one thing: the data mesh replaces neither the data warehouse nor the data lake. Tools used and built in these can be reused.

There is still much more to the data mesh. This is just my summary and thoughts on this very interesting concept. Make sure to read Zhamak’s post on it as well for the full details!

Over the last months, I wrote several articles about data governance. One aspect of data governance is the principle of FAIR data. FAIR in the context of data stands for: findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated at the beginning: findable, accessible, interoperable and reusable. This tackles most of the requirements around data governance and thus should increase the use of data. It doesn’t really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don’t manage to solve them even at this elementary level.

If a company gets started with the FAIR principles, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth.

Findable data

Most data projects start with the question of whether there is data for a specific use-case. This is often difficult to answer, since data engineers or data scientists often don’t know what kind of data is available in a large enterprise. They know the problem they want to solve but don’t know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about data that could potentially serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.

Accessible data

Once the first aspect is solved, it is necessary to access the data. This also brings a lot of complexity, since data is often sensitive and data owners simply don’t want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. It is also highly important to have a clear process for data access in place.

Interoperable data

Data often needs to be combined with other data sets in use-cases. This means that it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better the description of a data set, the faster people can work with the data.

A frequent case is that data is bought from other companies or shared among companies; this is the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it is necessary to document how the data was created. A description of the source system and the producing entities needs to be available. It is also necessary to include information about potential transformations applied to the data.

In order to make data reusable, the terms of reuse must be provided. This can be a license or other community standards for the data. Data can be either purchased or made available for free; different software solutions enable this.
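
Putting the four principles together: as a small illustration (the field names below are my own and not taken from any standard or catalog product), a single data catalog entry could cover all four FAIR aspects in one structured record:

catalog_entry = {
    # findable: a searchable name, description and tags
    "name": "customer_contracts",
    "description": "All active customer contracts, updated daily",
    "tags": ["customer", "contracts", "sales"],
    # accessible: a clear owner and a defined access process
    "owner": "sales-data-team@example.com",
    "access_process": "request access via the internal data portal",
    # interoperable: metadata about the content and its schema
    "schema": {"contract_id": "string", "customer_id": "string", "start_date": "date"},
    # reusable: provenance, transformations and terms of reuse
    "source_system": "CRM, nightly batch export",
    "transformations": "deduplicated on contract_id",
    "license": "internal use only",
}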

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time it takes to get access to data and provide a clear way forward. It will also indirectly increase data quality and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!

I talk a lot to different people in my domain – either at conferences or because I know them personally. Most of them have one thing in common: frustration. But why are people working with data frustrated? Why do we see so many frustrated data scientists? Is it the complexity of dealing with data, or is it something else? My experience clearly points to one thing: something else.

Why are people working with Data frustrated?

One pattern is very clear: most frustrated people I talk to work in classical industries. Whenever I talk to people in the IT industry or in startups, they seem to be very happy. This is in stark contrast to people working in “classical” industries or in consulting companies. There are several reasons for that:

  • First, it is often about a lack of support within traditional companies. Processes are complex and employees have worked in the company for quite some time. Bringing in new people (or the cool data scientists) often creates friction with the established employees. Doing things differently from how they used to be done isn’t well received by the established type of employee, and they have the power and the will to block any kind of innovation. Any kind of data science magic can’t compete with the internal networks they have.
  • Second, data is difficult to grasp and organised in silos. Established companies often run IT as a cost center, so things were done or fixed on the fly. Dismantling those silos was never really intended, as budgets were never reserved or made available for doing so. Even now, most companies don’t look into any kind of data governance to reduce their silos, and data quality isn’t a key aspect they strive for. The new kind of people – data scientists – are often “hunting” for data rather than working with the data.
  • Third, the technology stack is heterogeneous, and legacy brings in a lot of frustration as well. This is very similar to the second point; here, the issue is less about finding data at all and more about not knowing how to get the data out of a system that has no clear API.
  • Fourth, everybody forgets about data engineers. Data scientists sit alone, and though they have some skills in Python, they aren’t the ones to operate a technology stack. Often, there is a mismatch between data scientists and data engineers in corporations.
  • Fifth, legacy always kicks in. Mandatory regulatory and finance reporting often takes resources away from the organisation. You can’t just say: “Hey, I am not doing this report for the regulator since I want to find some patterns in the behaviour of my customers”. Traditional industries are more heavily regulated than startups or IT companies. This leads to data scientists being reused for standard reporting (not even self-service!). The answer then often is: “This is not what I signed up for!”
  • Sixth, digitalisation and data units are often created just to show up in the shareholder report. There is no real push from the board for impact. Impact is driven from the business, and the business knows how to achieve it. There won’t be significant growth, just some growth from “doing it as usual”. (However, startups and companies changing the status quo will get this significant growth!)
  • Seventh, data scientists need to be in the business, whereas data engineers need to be in the IT department, close to the IT systems. Period. However, tribes need to be centrally steered.

How to overcome this frustration?

Basically, there is no fast cure for this problem and the frustrated data scientists it produces. The field is still young, so confusion and wrong decisions outside of the IT industry are normal. Projects will fail, skilled people will leave and find new jobs. Over time, companies will get more and more mature in their journey, and everything around data will become part of the established functions of a company, just like controlling, marketing or any other function. It is yet to find its place and organisational setup.

Data Governance

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest in data science, but they often ignore data governance. Some months ago I wrote about this and shared my frustration about it. Now I’ve decided to go for a more pragmatic approach and describe what data governance is all about. This should bring some clarity into the topic and take out the emotion.

Why is Data Governance important?

It is important to keep a certain level of quality in the data: making decisions on bad data quality leads to bad overall decisions. Moreover, data governance efforts increase exponentially when they are not part of your data strategy from the very beginning.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security is often slowing down business implementations
  • Initial investments are necessary – and they don’t show value for months to years
  • Benefits are only visible “on top” of governance – e.g. with faster business results or better insights and thus it is not easy to “quantify” the impact
  • Data Governance is often considered as “unsexy” to do. Everybody talks about data science, but nobody about data governance. In fact, Data Scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open Source doesn’t focus too much on it, as there is less “buzz” around it than AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are rather designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal here is to keep the data secure against external access. It is built on encryption, access management and accessibility. Often, role-based access is defined in this process. A typical principle here is privacy and security by design
  • Data Quality Management: In this pillar, different measures for data quality are defined and tracked. Typically, specific quality measures are looked after for each dataset. This gives data consumers an overview of the data quality (a small sketch of such a measure follows after this list).
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company assets. A typical sample here is a Data Catalog, that shows all available company data to end users.
  • Master Data Management: master data is the common data of the company – e.g. the customer data, the data of suppliers and alike. Data in here should be of high quality and consistent. One physical customer should occur exactly as one person and not as multiple persons
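
To illustrate the Data Quality Management pillar mentioned above, here is a minimal, hypothetical completeness check in Python (the dataset, field and threshold are made up for the example):

records = [
    {"customer_id": "C1", "email": "a@example.com"},
    {"customer_id": "C2", "email": None},
    {"customer_id": "C3", "email": "c@example.com"},
]

non_empty = sum(1 for r in records if r["email"])   # records with a filled email field
completeness = non_empty / len(records)             # share of complete records

print(f"email completeness: {completeness:.0%}")    # prints "email completeness: 67%"
# Such a measure would be tracked per dataset and shown to data consumers.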

For each of the above mentioned pillars, I will write individual articles over the next weeks.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike; read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend Kaggle.

In one of my last posts, I wrote about the fact that the cloud is already more about PaaS/FaaS than IaaS. In fact, IaaS doesn’t bring much value over traditional architectures; there are still some advantages, but they remain limited. If you want a future-proof architecture, analytics needs to be serverless analytics. In this article, I will explain why.

What is serverless analytics?

Just as with other serverless technologies, serverless analytics follows the same concept. Basically, the idea is to significantly reduce the work on infrastructure and servers. Modern environments allow us to “only” bring the code, and the cloud provider takes care of everything else. This is basically the dream of every developer. Do you know the statement “it works on my machine”? With serverless, this gets much easier: you only need to focus on the app itself, without any requirements on the operating system and stack. Also, execution is task- or consumption-based, which means that you eventually only pay for what is used. If your service isn’t utilised, you don’t pay for it. You can also achieve this with IaaS, but with serverless it is part of the concept and not something you need to enable yourself.
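
As a rough sketch of what “only bring the code” can look like, here is a hypothetical AWS Lambda-style handler (the bucket and key names are made up; other cloud providers have equivalent function services):

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # read a (made-up) object from S3, count its lines and return the result as JSON
    obj = s3.get_object(Bucket="my-analytics-bucket", Key="input/data.csv")
    body = obj["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": json.dumps({"rows": len(body.splitlines())})}

There is no operating system, cluster or runtime to maintain here; the function is billed per invocation and scales on its own.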

With analytics, we are now also marching towards the serverless approach. But why only now, when serverless has been around for quite some time? Well, the data analytics community has always been a bit slower than the overall industry. When most tech stacks had already migrated to the cloud, analytics projects were still carried out with large Hadoop installations in the local data center, even though the cloud was already superior back then. Many people still insisted on it. Now, data analytics workloads are moving more and more into the cloud.

What are the components of Serverless Analytics?

  • Data Integration Tools: Most cloud providers provide easy to use tools to integrate data from different sources. A GUI makes the use of this easier.
  • Data Governance: Data catalogs and quality management tools are also often part of such solutions. This enables much better integration.
  • Different Storage options: Basically, for serverless analytics, storage must always be decoupled from the analytics layer. Normally, there are different databases available, but most of the data is stored on object stores, while real-time data is consumed via a real-time engine (see the sketch after this list).
  • Data Science Labs: Data Scientists need to experiment with data. Major cloud providers have data science labs available, which enable this sort of work.
  • API for integration: With the use of APIs, it is possible to bring back the results into production- or decision-making systems.
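
To make the decoupling of storage and analytics from the list above concrete, here is a hedged sketch of querying data that lives on an object store through a serverless query service. The example assumes AWS Athena via boto3; the database, table and result bucket names are made up:

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT customer_id, COUNT(*) FROM calls GROUP BY customer_id",
    QueryExecutionContext={"Database": "telco_analytics"},               # made-up database
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},    # made-up result bucket
)
print(response["QueryExecutionId"])   # the query runs fully managed, no cluster to operate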

How is it different from Kubernetes or Docker?

At the moment, there is also a big discussion about whether Kubernetes or Docker will solve this job for analytics. However, this again requires the use of servers and thus increases maintenance at some point. All cloud providers have different Kubernetes and Docker offerings available, which allows an easy migration later on. However, I would suggest going immediately for serverless solutions and avoiding containers where possible.

What are the financial benefits?

It is challenging to measure the benefits. If the only comparison is price, then that is probably not the best way to do it. Serverless analytics will greatly reduce the cost of maintaining your stack – this will go close to zero! The only thing you need to focus on from now on is your application(s), and they should eventually produce value. It also becomes easier to measure IT on business impact: you get a bill for the applications, not for maintaining a stack. If you run an analysis, you will get a quote for it, and the business impact may or may not justify the investment.

If you want to learn more about Serverless Analytics, I can recommend this tutorial. (Disclaimer: I am not affiliated with Udemy!)