For FAIR Data

Over the last months, I have written several articles about data governance. One aspect of data governance is the principle of FAIR data. In the context of data, FAIR stands for: findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated at the beginning: findable, accessible, interoperable and reusable. These tackle most of the requirements around data governance and should thus increase the use of data. FAIR doesn’t really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don’t manage to solve them at this elementary level.

If a company gets started with the FAIR principles, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth.

Findable data

Most data projects start with the question of whether there is data for a specific use-case. This is often difficult to answer, since data engineers and data scientists often don’t know what kind of data is available in a large enterprise. They know the problem they want to solve but don’t know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about the data that could potentially serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.

Accessible data

Once the first aspect is solved, it is necessary to access the data. This also brings a lot of complexity, since data is often sensitive and data owners simply don’t want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. It is also highly important to have a clear process for data access in place.

Interoperable data

Data often needs to be combined with other data sets in use-cases. This means it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better the description of a data set, the faster people can work with the data.

A frequent case is that data is bought from other companies or shared among companies, as in the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it is necessary to know how the data was created. A description of the source system and the producing entities needs to be available. It is also necessary to include information about potential transformations applied to the data.

In order to make data reusable, the terms of reuse must be provided. This can be a license or other community standards for the data. Data can either be purchased or made available for free. Different software solutions enable this.
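To make the four principles concrete, here is a sketch of what a metadata record covering all FAIR aspects might look like. The field names and the completeness check are illustrative assumptions, not a formal standard such as DCAT or DataCite:

```python
# A minimal, illustrative metadata record for a fictional data asset.
dataset_metadata = {
    # Findable: a unique identifier and a searchable description
    "id": "sales-orders-2021",
    "title": "Sales orders 2021",
    "description": "All sales orders of the fictional ACME corp for 2021.",
    "keywords": ["sales", "orders", "finance"],
    # Accessible: a clear owner and a defined access process
    "owner": "sales-data-team@example.com",
    "access_request_url": "https://catalog.example.com/request/sales-orders-2021",
    # Interoperable: format and schema information for consumers
    "format": "parquet",
    "schema": {"order_id": "string", "amount": "decimal", "ordered_at": "timestamp"},
    # Reusable: provenance and terms of reuse
    "source_system": "ERP",
    "license": "CC-BY-4.0",
}

def is_fair_complete(record: dict) -> bool:
    """Check that the record covers all four FAIR aspects (illustrative rule)."""
    required = ["id", "description", "owner", "schema", "license"]
    return all(record.get(field) for field in required)

print(is_fair_complete(dataset_metadata))
```

A data catalog would store one such record per asset, which is exactly what makes data findable in the first place.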

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time needed to access data and provide a clear way forward. It will also indirectly increase data quality and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!

The frustrated Data Scientist

I talk a lot to different people in my domain – either at conferences or because I know them personally. Most of them have one thing in common: frustration. But why are people working with data frustrated? Why do we see so many frustrated data scientists? Is it the complexity of dealing with data, or is it something else? My experience clearly says: something else.

Why are people working with Data frustrated?

One pattern is very clear: most people I talk to who are frustrated with their job work in classical industries. Whenever I talk to people in the IT industry or in startups, they seem to be very happy. This is in stark contrast to people working in “classical” industries or in consulting companies. There are several reasons for that:

  • First, there is often a lack of support within traditional companies. Processes are complex and employees have worked in the company for quite some time. Bringing in new people (or the cool data scientists) often creates friction with the established employees of the company. Doing things differently from how they used to be done isn’t well received by the established type of employees, and they have the power and the will to block any kind of innovation. No data science magic can compete with the internal network they have.
  • Second, data is difficult to grasp and organised in silos. Established companies often run their IT function as a cost center, so things were done or fixed on the fly. Dismantling those silos was never really intended, as budgets were never reserved or made available for doing so. Even now, most companies don’t look into any kind of data governance to reduce their silos. Data quality isn’t a key aspect they strive for. The new kind of people – data scientists – often end up “hunting” for data rather than working with it.
  • Third, the technology stack is heterogeneous, and legacy brings in a lot of frustration as well. This is very similar to the second point. Here, the issue is less about finding data at all and more about not knowing how to get the data out of a system without a clear API.
  • Fourth, everybody forgets about data engineers. Data scientists sit alone, and though they have some skills in Python, they aren’t the ones operating a technology stack. Often, there is a mismatch between data scientists and data engineers in corporations.
  • Fifth, legacy always kicks in. Mandatory regulatory and financial reporting often takes resources away from the organisation. You can’t just say: “Hey, I am not doing this report for the regulator since I want to find some patterns in the behaviour of my customers”. Traditional industries are more heavily regulated than startups or IT companies. This leads to data scientists being reused for standard reporting (not even self-service!). The answer then often is: “This is not what I signed up for!”
  • Sixth, digitalisation and data units are often created just to show them off in the shareholder report. There is no real push from the board for impact. Impact is driven by the business, and the business knows how to achieve it. There won’t be significant growth, only some growth from “doing it as usual”. (However, startups and companies changing the status quo will get this significant growth!)
  • Seventh, data scientists need to be in the business, whereas data engineers need to be in the IT department, close to the IT systems. Period. However, tribes need to be centrally steered.

How to overcome this frustration?

Basically, there is no fast cure for this problem of frustrated data scientists. The field is still young, so confusion and wrong decisions outside the IT industry are normal. Projects will fail, and skilled people will leave and find new jobs. Over time, companies will become more and more mature in their journey, and everything around data will become an established part of the company – just like controlling, marketing or any other function. It has yet to find its place and organisation type.

Data Governance

What is Data Governance?

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest in data science, but they often ignore data governance. Some months ago I wrote about this and shared my frustration about it. Now I’ve decided to go for a more pragmatic approach and describe what Data Governance is all about. This should bring some clarity to the topic and reduce emotions.

Why is Data Governance important?

It is important to keep a certain level of quality in the data. Making decisions on bad data quality leads to bad overall decisions. Data governance efforts grow exponentially when governance is not addressed at the very beginning of your data strategy.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security is often slowing down business implementations
  • Initial investments are necessary – and they don’t show value for months or even years
  • Benefits are only visible “on top” of governance – e.g. with faster business results or better insights and thus it is not easy to “quantify” the impact
  • Data Governance is often considered “unsexy”. Everybody talks about data science, but nobody talks about data governance. In fact, data scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open source doesn’t focus much on it, as there is less “buzz” around it than around AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are rather designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect the success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal here is to keep the data secure against unauthorised access. It builds on encryption, access management and accessibility. Often, role-based access is defined in this process. A typical principle here is privacy and security by design
  • Data Quality Management: In this pillar, different measures for data quality are defined and tracked. Typically, specific quality measures are tracked for each dataset. This gives data consumers an overview of the data quality.
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company’s assets. A typical example is a data catalog that shows all available company data to end users.
  • Master Data Management: Master data is the common data of the company – e.g. customer data, supplier data and alike. Data here should be of high quality and consistent. One physical customer should occur as exactly one record and not as multiple persons
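To illustrate the master data pillar, here is a minimal sketch of how duplicate customer records could be detected. The records and the matching rule (a normalised e-mail address) are made up for illustration; real MDM tools use fuzzy matching across many attributes:

```python
# Toy master data: records 1 and 2 describe the same physical customer.
customers = [
    {"id": 1, "name": "Max Mustermann",   "email": "Max.Mustermann@example.com"},
    {"id": 2, "name": "M. Mustermann",    "email": "max.mustermann@example.com "},
    {"id": 3, "name": "Erika Musterfrau", "email": "erika@example.com"},
]

def find_duplicates(records):
    """Group record ids that share the same normalised e-mail address."""
    seen = {}
    for record in records:
        key = record["email"].strip().lower()
        seen.setdefault(key, []).append(record["id"])
    return [ids for ids in seen.values() if len(ids) > 1]

print(find_duplicates(customers))  # [[1, 2]] – one physical customer, two records
```

Groups like `[1, 2]` would then be merged into a single golden record, so each customer occurs exactly once.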

For each of the above mentioned pillars, I will write individual articles over the next weeks.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

Serverless Analytics

In one of my last posts, I wrote about the fact that the Cloud is already more PaaS/FaaS than IaaS. In fact, IaaS doesn’t bring much value at all over traditional architectures. There are still some advantages, but they remain limited. If you want to go for a future-proof architecture, analytics needs to be serverless analytics. In this article, I will explain why.

What is serverless analytics?

Just as with other serverless technologies, serverless analytics follows the same concept. Basically, the idea is to significantly reduce the work on infrastructure and servers. Modern environments allow us to “only” bring the code, and the cloud provider takes care of everything else. This is basically the dream of every developer. Do you know the statement “it works on my machine”? With serverless, this gets much easier: you only need to focus on the app itself, without any requirements regarding the operating system and stack. Also, execution is task- or consumption-based. This means that you eventually only pay for what is used. If your service isn’t utilised, you don’t pay for it. You can also achieve this with IaaS, but with serverless it is part of the concept and not something you need to enable yourself.

With analytics, we are now also marching towards the serverless approach. But why only now? Serverless has been around for some time. Well, the data analytics community has always been a bit slower than the overall industry. When most tech stacks had already migrated to the Cloud, analytics projects were still carried out with large Hadoop installations in the local data center. Even back then the Cloud was already superior, yet a lot of people still insisted on on-premise setups. Now, data analytics workloads are moving more and more into the Cloud.

What are the components of Serverless Analytics?

  • Data Integration Tools: Most cloud providers offer easy-to-use tools to integrate data from different sources. A GUI makes working with them easier.
  • Data Governance: Data catalogs and quality management tools are also often part of these solutions. This enables much better integration.
  • Different Storage Options: For serverless analytics, storage must always be decoupled from the analytics layer. Normally, different databases are available, but most of the data is stored on object stores. Real-time data is consumed via a real-time engine.
  • Data Science Labs: Data Scientists need to experiment with data. Major cloud providers have data science labs available, which enable this sort of work.
  • API for integration: With the use of APIs, it is possible to bring back the results into production- or decision-making systems.

How is it different to Kubernetes or Docker?

At the moment, there is also a big discussion about whether Kubernetes or Docker will solve this job for analytics. However, this again requires the use of servers and thus increases maintenance at some point. All cloud providers have different Kubernetes and Docker solutions available, which allows an easy migration later on. However, I would suggest going immediately for serverless solutions and avoiding the use of containers where possible.

What are the financial benefits?

It is challenging to measure the benefits. If the only comparison is price, then you are probably not looking at it the right way. Serverless analytics will greatly reduce the cost of maintaining your stack – it will go close to zero! The only thing you need to focus on from now on is your application(s) – and they should eventually produce value. It also becomes easier to measure IT by its business impact: you get a bill for the applications, not for maintaining a stack. If you run an analysis, you will get a quote for it, and the business impact may or may not justify the investment.
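As a rough illustration of consumption-based billing, consider this back-of-the-envelope comparison. All prices are invented assumptions, not actual cloud provider rates:

```python
# Assumed prices for illustration only.
ALWAYS_ON_COST_PER_HOUR = 0.50    # a VM that runs 24/7, whether used or not
SERVERLESS_COST_PER_QUERY = 0.05  # a consumption-based analytical query

def monthly_cost_always_on(hours: int = 730) -> float:
    """An always-on server bills every hour of the month."""
    return hours * ALWAYS_ON_COST_PER_HOUR

def monthly_cost_serverless(queries_per_month: int) -> float:
    """A serverless service bills only per actual query."""
    return queries_per_month * SERVERLESS_COST_PER_QUERY

# With 1,000 queries per month, the consumption-based model wins by far here:
print(monthly_cost_always_on())       # 365.0
print(monthly_cost_serverless(1000))  # 50.0
```

The break-even point obviously depends on the real workload; for rarely used analytical services, paying per query is typically the cheaper model.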

If you want to learn more about Serverless Analytics, I can recommend you this tutorial. (Disclaimer: I am not affiliated with Udemy!)

Recurrent Neural Network

Recurrent Neural Network and Long Short-Term Memory

In the last two posts, we introduced the core concepts of Deep Learning, the Feedforward Neural Network and the Convolutional Neural Network. In this post, we will have a look at two other popular deep learning techniques: the Recurrent Neural Network and Long Short-Term Memory.

Recurrent Neural Network

The main difference to the previously introduced networks is that the Recurrent Neural Network provides a feedback loop to the previous neuron. This architecture makes it possible to remember important information about the input the network has received and to take what was learned into consideration along with the next input. RNNs work very well with sequential data such as sound, time series (sensor) data or written natural language.

The advantage of an RNN over a feedforward network is that the RNN can remember its output and use it to predict the next element in a series, while a feedforward network is not able to feed its output back into the network. Real-time gesture tracking in videos is another important use-case for RNNs.
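The feedback loop can be sketched in a few lines: the hidden state h carries information from earlier inputs into the next step. The weights below are fixed toy values, not trained parameters:

```python
import math

# Weights of a single-unit recurrent cell (toy values for illustration).
W_IN, W_REC, BIAS = 0.8, 0.5, 0.0

def rnn_step(x: float, h_prev: float) -> float:
    """One recurrent step: h_t = tanh(W_in * x_t + W_rec * h_{t-1} + b)."""
    return math.tanh(W_IN * x + W_REC * h_prev + BIAS)

# Feed a short sequence through the cell; h accumulates context over time.
h = 0.0
for x in [1.0, 0.5, -0.25]:
    h = rnn_step(x, h)
print(round(h, 4))
```

Note how each step receives both the current input and the previous hidden state – this is exactly the feedback loop that a feedforward network lacks.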

A Recurrent Neural Network

Long Short-Term Memory

A usual RNN has only a short-term memory, which is already useful for some tasks. However, there are requirements for more advanced memory functionality. Long Short-Term Memory (LSTM) solves this problem. The researchers Sepp Hochreiter and Jürgen Schmidhuber introduced the LSTM in 1997. LSTMs enable RNNs to remember inputs over a long period of time. Therefore, LSTMs are used in combination with RNNs for sequential data with long time lags between relevant events.

An LSTM learns over time which information is relevant and which isn’t. This is done by assigning weights to the information. The information then passes through three different gates within the LSTM: the input gate, the output gate and the “forget” gate.
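The gating mechanism can be sketched conceptually as follows. This uses scalar states and a single made-up weight; a real LSTM learns separate weight matrices for each gate:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w=0.5):
    """One conceptual LSTM step with scalar states and a shared toy weight."""
    forget = sigmoid(w * x + w * h_prev)    # how much of the old cell state to keep
    inp    = sigmoid(w * x + w * h_prev)    # how much new information to let in
    cand   = math.tanh(w * x + w * h_prev)  # candidate cell content
    c = forget * c_prev + inp * cand        # updated long-term memory (cell state)
    out    = sigmoid(w * x + w * h_prev)    # how much of the state to expose
    h = out * math.tanh(c)                  # new short-term state (output)
    return h, c

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c)
print(round(h, 4), round(c, 4))
```

The cell state c is the long-term memory: the forget gate decides what to drop from it, the input gate what to add, and the output gate what to reveal at each step.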

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

The Data Science Process

Working with data is a complex matter and not done in a few days. It is rather a series of sequential steps that lead to a final output. In this post, I present the data science process for project execution in data science.

What is the Data Science Process?

Data science often consists mainly of data wrangling and feature engineering before one can get to the exciting stuff. Since data science is often very exploratory, processes haven’t evolved much around it (yet). In the data science process, I group the work into three main steps that have several sub-steps. Let’s first start with the three main steps:

  • Data Acquisition
  • Feature Engineering and Selection
  • Model Training and Extraction

Each of the different process main steps contains some sub-steps. I will describe them a bit in detail now.

Step 1: Data Acquisition

Data engineering is the main ingredient of this step. After a business question has been formulated, it is necessary to look for the data. In an ideal setup, you would already have a data catalog in your enterprise. If not, you might need to ask several people until you have found the right place to dig deeper.

First of all, you need to acquire the data. These might be internal sources, but you might also combine them with external sources. In this context, you might want to read about the different data sources you need. Once you are done with a first look at the data, it is necessary to integrate it.

Data integration is often perceived as a challenging task. You need to set up a new environment to store the data, or you need to extend an existing schema. A common practice is to build a data science lab. A data science lab should be an easy-to-use platform for data engineers and data scientists to work with data. A best practice is to use a prepared environment in the cloud for it.

After integrating the data comes the heavy part: cleaning it. In most cases, data is very messy and needs a lot of cleaning. This is also mainly carried out by data engineers alongside data analysts in a company. Once you are done with the data acquisition part, you can move on to the feature engineering and selection step.

Typically, this first process step can be very painful and long-lasting. Its duration depends on different factors of an enterprise, such as the data quality itself, the availability of a data catalog and corresponding metadata descriptions. If your maturity in all these items is very high, it can take a few days to a week, but on average it is rather 2 to 4 weeks of work.

Step 2: Feature Engineering and Selection

In the next step, we start with a very important part of the data science process: feature engineering. Features are very important for machine learning and have a huge impact on the quality of the predictions. For feature engineering, you have to understand the domain you are in; one needs to understand what data to use and for what reason.

After the feature engineering itself, it is necessary to select the relevant features via feature selection. A common mistake is overfitting the model through “feature explosion”: too many features are created, and the predictions aren’t accurate anymore. Therefore, it is very important to select only those features that are relevant to the use-case and bring some significance.

Another important step is the development of the cross-validation structure. This is necessary to check how the model will perform in practice. Cross-validation measures the performance of your model and gives you insights on how to use it. After that comes hyperparameter tuning, where hyperparameters are fine-tuned to improve the predictions of your model.
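A minimal sketch of such a cross-validation structure, built by hand (in practice you would use a library such as scikit-learn for this):

```python
def k_fold_indices(n_samples: int, k: int):
    """Split sample indices into k roughly equal, non-overlapping folds."""
    folds = [[] for _ in range(k)]
    for idx in range(n_samples):
        folds[idx % k].append(idx)
    return folds

def train_test_splits(n_samples: int, k: int):
    """For each fold, use it as the test set and the remaining folds for training."""
    folds = k_fold_indices(n_samples, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 10 samples, 5 folds: every sample is tested exactly once.
for train, test in train_test_splits(10, 5):
    print(len(train), len(test))  # 8 2 for every fold
```

Training and evaluating the model once per split, then averaging the scores, gives a much more honest estimate of real-world performance than a single split.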

This process is now carried out mainly by Data Scientists, but still supported by data engineers. The next and final step in the data science process is Model Training and Extraction.

Step 3: Model Training and Extraction

The last step in the process is model training and extraction. In this step, the algorithm(s) for the model prediction are selected and compared to each other. In order to ease the work here, it is necessary to put your entire process into a pipeline. (Note: I will explain the concept of the pipeline in a later post.) After the training is done, you can move on to the predictions themselves and bring the model into production.

The following illustration outlines the now presented process:

The Data Science Process

The data science process itself can easily be carried out in a Scrum or Kanban approach, depending on your preferred management style. For instance, you could treat each of the 3 process steps as a sprint. The first sprint, “Data Acquisition”, might last longer than the other sprints, or you could even break it into several sprints. For agile data science, I can recommend reading this post.

Cloud IaaS is not the future

About 1.5 years ago, I wrote that the Cloud is not the future. Instead, I claimed that it is the present. In fact, most companies are already embracing the Cloud. Today, I want to revisit this statement and take it to the next level: Cloud IaaS is not the future.

What is wrong about Cloud IaaS?

Cloud IaaS was great in the early days of the Cloud. It gave us freedom to move our workloads in a lift-and-shift scenario to the Cloud. Also, it greatly improved how we can handle workloads in a more dynamic way. Adding servers and shutting them down on demand was really easy. In an on-premise scenario, this was far from easy. All big Cloud providers today provide a comprehensive toolset and third-party applications for IaaS solutions. But why is it not as great as it used to be?

Honestly, I was never a big fan of IaaS in the Cloud. To state it bluntly, it didn’t improve much (other than scale and flexibility) over the on-premise world. With Cloud IaaS, we still have to maintain all our servers like in the old days on-premise. Security patches, updates, fixes and alike stay with those that build the services. Since the early days, I have been a big fan of Cloud PaaS (Platform as a Service).

What is the status of Cloud PaaS?

Over the last 1.5 years, a lot of mature Cloud PaaS services have emerged. Cloud PaaS has been around for almost 10 years, but the current level of maturity is impressive. Some two years ago, these were mainly general-purpose services, but they have since moved into very specific domains. There are now a lot of services available for things such as IoT or Data Analytics.

The current trend in Cloud PaaS is definitely the move towards “Serverless Analytics”. Analytics has always been a slow mover when it comes to the Cloud. Other functional areas already had Cloud-native implementations while analytical workloads were still being developed for the on-premise world. Hadoop was one of these projects, but other projects took over and Hadoop is in decline. Now, more analytical applications will be developed with a PaaS stack.

What should you do now?

Cloud PaaS isn’t a revolution or anything spectacularly new. If you have no experience with Cloud PaaS, I would urge you to look at these platforms asap. They will become essential for your business and provide a lot of benefits. Again – it isn’t the future, it is the present!

If you want to learn more about Serverless Analytics, I can recommend you this tutorial. (Disclaimer: I am not affiliated with Udemy!)

Data abstraction: the what and the why

Large enterprises have a lot of legacy systems in their footprint. This has created a lot of challenges (but also opportunities!) for system integrators. Now that companies strive to become data driven, it becomes an even bigger challenge. But luckily there is a concept out there that can help: data abstraction.

Data Abstraction: why do you need it?

If a company wants to become data driven, it is necessary to unlock all the data that is available within the company. However, this is easier said than done. Most companies have a heterogeneous IT landscape and thus struggle to integrate essential data sources into their analytical systems. In the past, there have been several approaches to this: data was loaded (with some delay) into an analytical data warehouse. This data warehouse didn’t power any operational systems, so it was decoupled.

However, several things proved to be problematic with data warehouses handling analytical workloads: (a) data warehouses tend to be super-expensive in both cost and operations – they make sense for KPIs and highly structured data, but not for other datasets; and (b) due to the high cost, data warehouses were loaded with hours or even days of delay. In a real-time world, this isn’t good at all.

But – didn’t the data lake solve this already?

Some years ago, data lakes surfaced. They were more efficient in terms of speed and cost than traditional data warehouses. However, data warehouses kept the master data, which data lakes often need, so a connection between the two had to be established. In the early days, data was simply replicated to do so. Next to data lakes, many other systems (mainly NoSQL) surfaced, and business units acquired various other systems that made more integration efforts necessary. So there was no end to data silos at all – it even got worse (and will continue to do so).

So, why not give in to the pressure of heterogenous systems and data stores and try to solve it differently? This is where data abstraction comes into play …

What is it about?

As already introduced, data abstraction should reduce your sleepless nights when it comes to accessing and unlocking your data assets. It is a virtual layer that you add between your data storages and your data consumers to enable one common way of access. The following illustration shows this:

Data Abstraction

Basically, you build a layer on top of your data sources. Of course, it doesn’t solve the challenges around data integration, but it ensures that consumers can expect one common layer that they can plug into. It also enables you to exchange the technical layer of a data source without consumers taking note of it. You might consider re-developing a data source from the ground up in order to make it more performant. Both the old and the new stack will conform to the data abstraction, and thus consumers won’t notice that there are significant changes under the hood.
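A minimal sketch of this idea in Python: consumers talk to one common interface, while the concrete backend can be swapped without them noticing. The backends here are in-memory stand-ins for real systems:

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """The common contract every backend must fulfil."""
    @abstractmethod
    def fetch(self, entity: str) -> list[dict]:
        ...

class LegacyWarehouseSource(DataSource):
    def fetch(self, entity: str) -> list[dict]:
        # Imagine a slow SQL query against the old data warehouse here.
        return [{"id": 1, "name": "Alice"}]

class NewLakeSource(DataSource):
    def fetch(self, entity: str) -> list[dict]:
        # Imagine a query against the re-developed, faster stack here.
        return [{"id": 1, "name": "Alice"}]

class AbstractionLayer:
    """The single entry point that consumers plug into."""
    def __init__(self, backend: DataSource):
        self._backend = backend

    def get(self, entity: str) -> list[dict]:
        return self._backend.fetch(entity)

layer = AbstractionLayer(LegacyWarehouseSource())
before = layer.get("customers")
layer = AbstractionLayer(NewLakeSource())  # backend swapped under the hood
after = layer.get("customers")
print(before == after)  # consumers see no difference
```

The same pattern underlies SQL engines and API layers: as long as the contract stays stable, the technical layer underneath can change freely.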

This sounds really nice. So what’s the (technical) solution to it?

Basically, I don’t recommend any technology at this stage. There are several technologies that enable Data Abstraction. They can be clustered into 3 different areas:

  1. Lightweight SQL Engines: There are several products and tools (both open source and proprietary) available that enable SQL access to different data sources. They not only plug into relational databases, but also into non-relational databases. Most tools provide easy integration and abstraction.
  2. API Integration: It is possible to integrate your data sources via an API layer that abstracts the underlying data sources. The pain of integration is higher than with SQL engines, but it gives you more flexibility on top and a higher degree of abstraction. In contrast to SQL engines, your consumers won’t plug too deeply into database specifics. If you want to go for a really advanced tech stack, I recommend reading about graphs.
  3. Full-blown solution: There are several proprietary tools available that provide numerous connectors to data sources. What is really great about these solutions is that they also include caching mechanisms for frequent data access. You get much higher performance with limited implementation cost. However, you will lock into a specific solution.

Which solution you eventually go for is fully up to you. It depends on the company, its know-how and its characteristics. In most cases, it is also a combination of different solutions.

So what is next?

There are many tools and services out there that enable data abstraction. Data abstraction is more of a concept than a concrete technology – not even an architectural pattern. In some cases, you might acquire a technology; in others, you would abstract your data via an API or a graph.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.

Convolutional Neural Network (CNN) and Feedforward Neural Network

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two concepts: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose type of neural network. The entry point is the input layer, followed by several hidden layers and an output layer. Each layer has a connection to the previous layer. These connections are one-way only, so that nodes can’t form a cycle. The information in a feedforward network moves in only one direction – from the input layer, through the hidden layers, to the output layer. It is the simplest version of a neural network. The image below illustrates the Feedforward Neural Network.

Feedforward Neural Network

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective at image recognition and similar tasks. For that reason, it is also well suited to video processing. The difference from the feedforward neural network is that the layers of a CNN have three dimensions: width, height and depth. Also, not all neurons in one layer are fully connected to the neurons in the next layer. There are three different types of layers in a Convolutional Neural Network, which also set it apart from feedforward neural networks:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colors or objects. A feature map is then created from them. The deeper the network goes, the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.
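A small sketch of what a single convolutional filter does, using NumPy and a hand-written vertical edge kernel. Keep in mind this is only illustrative: real CNNs learn their kernels during training rather than using fixed ones like this.

```python
import numpy as np

def convolve2d(image, kernel):
    # Naive "valid" 2D convolution: slide the kernel over the image and
    # take the elementwise product-sum at each position.
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Tiny grayscale "image" with a vertical brightness edge in the middle
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hand-written kernel that activates where brightness rises left to right
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

print(convolve2d(image, kernel))  # the resulting feature map
```

Every position in the output responds strongly because every 3×3 window in this image straddles the edge; on a uniform region the same kernel would output zero.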

Rectified linear units (ReLU)

The goal of this layer is to speed up and stabilize training: negative activation values are set to zero, while positive values pass through unchanged.
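In code, ReLU is nothing more than clipping negatives to zero; a minimal NumPy sketch:

```python
import numpy as np

def relu(x):
    # Elementwise: negative activations become 0, positives are unchanged
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))
```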


Pooling Layer

Pooling simplifies the output by performing nonlinear downsampling, which reduces the number of parameters the network needs to learn. In convolutional neural networks, this operation is useful since neighbouring outputs usually carry similar information.
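A small NumPy sketch of 2×2 max pooling, the most common form of this nonlinear downsampling (assuming even dimensions for simplicity):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Keep only the strongest activation in each non-overlapping 2x2 window,
    # halving both the height and the width of the feature map.
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]], dtype=float)

print(max_pool_2x2(fm))  # 4x4 feature map reduced to 2x2
```

The output keeps the dominant activation of each neighbourhood while discarding three quarters of the values, which is exactly the parameter reduction described above.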

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

AI Ethics: towards a sustainable AI and Data business

AI and ethics is a complex and often discussed topic at different conferences, user groups and forums. It has even been picked up by the European Commission. I would argue that it should actually go one step further: it should be part of every corporate responsibility strategy, just like the social and environmental elements.

AI Ethics: what is it about?

Since I am heading the data strategy at a large enterprise, I am confronted not only with technical and use-case challenges, but also with legal and compliance topics around data. This might sound challenging and “boring”, but it is neither. Technical challenges are often more complex than the legal aspects of data. Many companies state that legal is blocking their data initiatives, but often they simply didn’t include Legal and Privacy in their strategy. So what should you consider when talking about AI ethics? Basically, it consists of three building blocks.


Robustness

The first building block of AI ethics is the robustness of data. This is mainly a technical challenge, but it needs to be done right in every sense. It requires platforms that are hardened against errors and vulnerabilities. It is all about access control, access logging and prevention: data systems should track who accessed data and prevent unauthorised access. They should also implement the “need to know” principle: within a large enterprise, one should only access data that is relevant to one’s job purpose. Once a project is finished, access should be revoked.
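As an illustration of the “need to know” principle, here is a hypothetical Python sketch in which access grants are tied to an expiry date and every access attempt is logged. All names are invented for illustration; a real system would integrate with your identity and access management tooling.

```python
from datetime import date

class AccessRegistry:
    """Toy registry: grants expire with the project, and every check is logged."""
    def __init__(self):
        self._grants = {}    # (user, dataset) -> expiry date
        self.audit_log = []  # who tried to access what, and the outcome

    def grant(self, user, dataset, expires):
        self._grants[(user, dataset)] = expires

    def can_access(self, user, dataset, today):
        expiry = self._grants.get((user, dataset))
        allowed = expiry is not None and today <= expiry
        self.audit_log.append((user, dataset, allowed))  # access logging
        return allowed

registry = AccessRegistry()
# Grant tied to a project end date, after which access is revoked automatically
registry.grant("alice", "churn_data", expires=date(2020, 6, 30))

print(registry.can_access("alice", "churn_data", date(2020, 6, 1)))  # during project
print(registry.can_access("alice", "churn_data", date(2020, 7, 1)))  # after project end
print(registry.can_access("bob", "churn_data", date(2020, 6, 1)))    # never granted
```

The audit log gives you the “track who accessed data” part, while the expiry date implements revocation after the project finishes.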


Bias

Ethics in AI is an important topic, and bias happens often. There are numerous examples of algorithms reproducing bias. We are humans and are influenced by bias: it comes from how we grew up, the experiences we made in life and much of our environment. Bias is harmful, though, as it limits our thinking. Psychology describes the mechanism behind it with the terms fast and slow thinking. Imagine you are the interviewer in a job interview. A candidate walks in and immediately reminds you, because of some traits, of a person you met years ago and had difficulties with. During the job interview, you might not like her, even though she is a different person. Your brain went into fast thinking: input, output. Fast thinking is built into our brains to protect us from danger; it helps us drive a car, do sports and the like. If you see an obstacle in your way while driving, you need to react fast; there is no time to think it over. But it often drives bias. When making important decisions, you need to remove bias and think slow.

Slow thinking is challenging, and you need to fully overcome your bias. If you let bias dominate you, you won’t be capable of making good decisions. Coming back to the interview example: you might reject the candidate because of your bias. Some months later, this person finds a job at your competitor and builds more advanced models than your company does. You lost a great candidate because of your bias. That isn’t good, right?

There are other aspects to ethics, and I could probably write an entire series about them. You also need to consider topics such as harassment in algorithms. If your algorithms don’t take ethics into consideration, it isn’t just about acting wrongly: you will also lose credibility with your customers and thus start to see a financial impact as well!


Legal

Last but not least, your data strategy should reflect all relevant legal frameworks. Take the right to be forgotten: it needs to be implemented in your systems, and in enterprise environments this isn’t easy at all. There is a lot of legacy, and many different systems consume data. To tackle this from a technical perspective, it is necessary to harmonize your data models. Depending on your company’s ownership and structure, you need to implement GDPR and/or SOX. Some industries, such as finance, even come with additional regulations, giving you more challenges around data. It is very important to talk to your legal department and make them your friends at an early stage in order to succeed!
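To illustrate why harmonized data models matter for the right to be forgotten, here is a toy Python sketch (all system names and customer IDs are invented) in which a shared customer ID lets a single erasure request reach every consuming system:

```python
# Toy model of several systems that consume customer data. Because all of
# them key their records on the same harmonized customer ID, one erasure
# request can be propagated everywhere.
systems = {
    "crm":       {"c-1001": {"name": "Jane"}, "c-1002": {"name": "Tom"}},
    "billing":   {"c-1001": {"iban": "XX00 1234"}},
    "marketing": {"c-1002": {"segment": "A"}},
}

def forget(customer_id):
    """Remove the customer from every system; return the systems touched."""
    touched = []
    for name, store in systems.items():
        if store.pop(customer_id, None) is not None:
            touched.append(name)
    return touched

print(forget("c-1001"))  # systems that held data about this customer
```

Without a harmonized ID, each legacy system would need its own lookup logic, which is exactly where real-world erasure projects become painful.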

So what is next for AI Ethics?

I stick with the statement I have already mentioned several times: work closely with Legal and Privacy in order to achieve a responsible strategy towards data and AI. A lot of people I know claim that AI ethics blocks their data strategy, but I argue it is the other way around: just because you can do things with data doesn’t justify doing everything you potentially could. At the end of the day, you have customers that should trust you. Don’t misuse this trust; build an ethical strategy on it. Work with the people that know it best: Privacy, Security and Legal. Then, and only then, will you succeed.

I also recommend reading my post about data access.

Credits: the three pillar points weren’t invented by me, so I want to credit the people who gave me the ideas: our corporate lawyer Daniel, our Privacy Officer Paul and our Legal Counsel Doris.