
Over the last months, I have written several articles about data governance. One aspect of data governance is the principle of FAIR data. FAIR, in the context of data, stands for findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated above: findable, accessible, interoperable and reusable. This tackles most of the requirements around data governance and thus should increase the use of data. It doesn't really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don't manage to solve them at this elementary level.

If a company gets started with the principle of FAIR, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth now.

Findable data

Most data projects start with the question of whether data exists for a specific use-case. This is often difficult to answer, since data engineers and data scientists often don't know what kind of data is available in a large enterprise. They know the problem they want to solve but don't know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about the data that could potentially serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.

Accessible data

Once data can be found, it needs to be accessed. This also brings a lot of complexity, since data is often sensitive and data owners simply don't want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. It is also highly important to have a clear process for requesting data access.

Interoperable data

Data often needs to be combined with other data sets in a use-case. This means it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better a data set is described, the faster people can work with it.

A frequent case is that data is bought from other companies or shared among companies, as in the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it is necessary to know how the data was created. A description of the source system and the producing entities needs to be available. It is also necessary to include information about potential transformations applied to the data.

In order to make data reusable, the terms of reusability must be provided. This can be a license or other community standards on the data. Data can be either purchased or made available for free. Different software solutions enable this.
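
To make this concrete, here is a minimal sketch of how provenance and reuse information could be recorded alongside a dataset in Python; the field names and values are invented for illustration and don't follow any particular standard.

```python
# Illustrative provenance record for a dataset; field names are assumptions
# for this sketch, not a formal metadata standard.
provenance = {
    "dataset": "customer_interests",
    "source_system": "CRM export (nightly batch)",
    "produced_by": "crm-export-job",
    "created_at": "2021-03-01T02:00:00Z",
    "transformations": [
        {"step": 1, "description": "Removed test accounts"},
        {"step": 2, "description": "Normalised country codes to ISO 3166-1 alpha-2"},
    ],
    "terms_of_reuse": "internal use only",
}

# A consumer can quickly check how the data was produced before reusing it.
for step in provenance["transformations"]:
    print(step["step"], step["description"])
```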

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time needed to access data and provide a clear way forward. It will also improve data quality indirectly and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!


In this last tutorial of our Data Governance series, we look at Master Data Management. This is the last of our four pillars. Master Data is the core data in the company, which should be clean, accurate and in a clear data model.

What is the goal of Master Data Management?

It is important to have exactly one record for each key data asset within the company. This could, for instance, be the data about a customer or a supplier: each customer should exist exactly once. Many companies have their customer data spread over different systems and thus have issues connecting those systems. If a customer walks into a store, the sales agents often have to use different CRM tools to get a holistic picture of the customer. This often means that the company doesn't fully understand its customers.

In order to reach this, it is necessary to harmonise the data within the company. Reducing duplicate entries and finding the "golden record" is a key challenge in MDM: all data about one customer should be connected and in one place. Today, this is often called "Customer 360". But achieving this isn't easy at all.

How to find the “Golden Record”?

Basically, there are several options to find the golden record within a dataset. Let's imagine we have the following dataset; each of the entries is exactly the same person, but the name is written differently:

| Name       | Social Security Number | Passport   | Matching Group ID |
|------------|------------------------|------------|-------------------|
| Mario Meir | 123-45-6789            |            |                   |
| Meir Mario | 123-45-6789            | P 123456 M |                   |
| M. Meir    |                        | P 123456 M |                   |

How to find the golden record in a dataset

Basically, in this dataset, we see that there is a match on the social security number and on the passport. So, we can apply hierarchical matching: first, we match on those attributes that are rather unique. Normally, the social security number is unique, as is the passport ID. In this case, we can match these entries into one group. This is then represented by matching group IDs:

| Name       | Social Security Number | Passport   | Matching Group ID |
|------------|------------------------|------------|-------------------|
| Mario Meir | 123-45-6789            |            | 1                 |
| Meir Mario | 123-45-6789            | P 123456 M | 1                 |
| M. Meir    |                        | P 123456 M |                   |

Hierarchical matching
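
The following is a minimal sketch of such hierarchical matching in Python with pandas; the column names mirror the tables above, and the merging logic is deliberately simplified (a production implementation would need transitive, union-find style merging and fuzzier name comparison).

```python
import pandas as pd

# Records as in the tables above; empty strings mean the attribute is missing.
df = pd.DataFrame(
    {
        "name": ["Mario Meir", "Meir Mario", "M. Meir"],
        "ssn": ["123-45-6789", "123-45-6789", ""],
        "passport": ["", "P 123456 M", "P 123456 M"],
    }
)

def assign_groups(df, keys):
    """Assign matching group IDs hierarchically: records sharing one of the
    (sufficiently unique) keys, checked in order, end up in the same group.
    A single pass is enough for this small example; real MDM tooling would
    use union-find to make the merges fully transitive."""
    group = pd.Series(range(1, len(df) + 1), index=df.index)  # each row starts alone
    for key in keys:
        for value, idx in df.groupby(key).groups.items():
            if value and len(idx) > 1:  # skip missing values, merge shared ones
                group.loc[idx] = group.loc[idx].min()
    return group

# First match on the social security number, then on the passport ID.
df["matching_group_id"] = assign_groups(df, ["ssn", "passport"])
print(df)
```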

What else can be done to increase the quality of your Master Data?

Basically, in addition to hierarchical matching, there are several other techniques available. The most common one is "manual matching", where employees search for duplicate records and match them by hand. However, a better approach is to match data via machine learning and combine it with manual matching!

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I recommend Kaggle.

Next to Data Security & Privacy and Data Quality Management, Data Access and Search is hugely important. This topic focuses on finding and accessing data within your data assets. Most large enterprises have a lot of data at their fingertips, but different business units don't know where and how to find it. In this tutorial, we will have a look at how to solve this issue.

What are the ingredients for successful Data Access and Search?

There are several preconditions that need to be fulfilled in order to make data accessible. One of them is to have data security and privacy solved. If you want to make data accessible at scale, it is very important to ensure that users can only access the data they are entitled to. As a result, all users should see the data assets in the company via a data catalog, but not the data itself. In this catalog, people should be able to browse the different data assets available in the company and start asking more questions.

A good data catalog constantly scans the underlying data for modifications and keeps its own entries up to date. In addition to the requirements mentioned before, the data catalog checks different data quality measures, as described in the previous tutorial.

What should be inside a data catalog?

Based on the points mentioned above, a data catalog contains a lot of data about data. Next to listing the different data assets available, it should describe each data asset and offer several pieces of information about it (a minimal sketch of such an entry follows the list):

  • Title. Title of the dataset.
  • Description. What this dataset is about.
  • Categories. Tags, to enable search.
  • Business Unit. Unit maintaining the dataset (e.g. Marketing).
  • Data Owner. Person in charge of maintaining the dataset.
  • Data Producer. System that produces the data.
  • Data Steward. Person taking care of the dataset, if not the data owner itself.
  • Timespan. Indicates from when to when the data was recorded.
  • Data refresh interval. If the data is not available in real time, an indication of how often it gets refreshed.
  • Quality metrics. Indications of data quality.
  • Data Access or Sample Data. Information on how to access the data, or a sample dataset to explore the data.
  • Transformations. When and how was the data transformed?
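
As a concrete illustration, here is a minimal sketch of what a single catalog entry could look like as a Python data structure; the field names follow the list above, while the class and example values are invented and not tied to any specific catalog product.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CatalogEntry:
    """One entry of a data catalog, mirroring the fields listed above
    (illustrative structure, not the schema of a real product)."""
    title: str
    description: str
    categories: List[str]
    business_unit: str
    data_owner: str
    data_producer: str
    data_steward: Optional[str] = None
    timespan: Optional[str] = None           # e.g. "2018-01-01 to 2021-12-31"
    refresh_interval: Optional[str] = None   # e.g. "daily", "real-time"
    quality_metrics: dict = field(default_factory=dict)
    access_info: Optional[str] = None        # how to request access or get a sample
    transformations: List[str] = field(default_factory=list)

# Invented example entry.
entry = CatalogEntry(
    title="Customer master data",
    description="Harmonised customer records from all CRM systems.",
    categories=["customer", "crm", "master-data"],
    business_unit="Marketing",
    data_owner="Jane Doe",
    data_producer="CRM system",
    refresh_interval="daily",
)
print(entry.title, entry.categories)
```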

What does a data catalog look like?

The items above are examples of the contents of a data catalog entry. A good data catalog makes it easy for users to find and search within the metadata. The following example shows the data catalog of the US government:

US government open data portal


We started our tutorial with a general intro to Data Governance and then went a bit deeper into data security and data privacy. In this post, we will have a look at how to ensure a certain level of data quality in your data sets. Data quality is a very important aspect. Imagine you have wrong data about your customers and you build your marketing campaign on it. The campaign might return wrong results, which can damage your brand and turn away previously loyal customers. Therefore, data quality is essential.

How to measure data quality?

There are several aspects to measuring data quality. I've summarised them into five core metrics. If you browse the literature, you might find more or fewer metrics. However, these five should give you a core understanding of data quality management.

The 5 dimensions of data quality

Availability

Availability states that data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem; here, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The query retrieving the data should be good enough to return all available records. There should be easy-to-use tools and languages to retrieve the data. Normally, each database provides developers with a query language such as SQL, or with O/R mappers.

Availability also means that the data used for a specific use-case should be available to data analysts in business units. Data relevant for a marketing campaign might exist but not be available for the campaign. For instance, the company might have specific customer data in the data warehouse, but business units don't know that the data actually exists.

Correctness & Completeness

Correctness means that the data has to be correct. If we again query a web portal for all existing users interested in luxury cars, the data should be correct: it should really represent people interested in luxury cars, and fake entries should be removed. A data set is also not correct if a user changed his or her address without the company knowing about it. Therefore, it must be tracked when each dataset was last updated.

Similar to correctness is completeness: data should be complete. Targeting all users interested in luxury cars only makes sense if we can reach them somehow, e.g. by e-mail. If the e-mail field, or any other field we would like to use for targeting, is blank, the data is not complete for our use-case.

Timeliness

Data should be up to date. A user might change their e-mail address after a while, and our database should reflect these changes whenever and wherever possible. If we target our luxury-car users, it won't be good at all if only 50% of the users' e-mail addresses are correct. We might have "big data", but the data is not correct since it hasn't been updated for a while.

Consistency

This shouldn't be confused with the consistency requirement of the CAP theorem. Data might be duplicated, since users might register several times to get various benefits. A user might select "luxury cars" with one account and "budget cars" with another. Duplicate accounts lead to inconsistent data, a frequent problem in large web portals such as Facebook.

Understandability

It should be easy to understand data. If we query our database for people interested in luxury cars, we should be able to easily understand what the data is about. Once the data is returned, we should be able to use our favourite tool to work with it. The data should be self-describing, and we should know how to handle it. If the result contains a "zip" column, we know that this is the ZIP code the individual users live in.

What can you do to improve your data quality?

Basically, it all starts with starting. You need to begin tracking your data quality at some point and then continuously improve it. There are several tools that support this endeavour. But keep in mind: bad data creates bad decisions!
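
As an illustration of how such tracking could start, here is a small sketch using pandas on an invented customer table; the column names, reference date and thresholds are assumptions for this example, not recommendations.

```python
import pandas as pd

# Invented example data: customer records for a campaign.
customers = pd.DataFrame(
    {
        "name": ["A. Huber", "B. Maier", "C. Gruber"],
        "email": ["a@example.com", None, "c@example.com"],
        "last_updated": pd.to_datetime(["2021-01-10", "2019-06-01", "2021-02-20"]),
    }
)

# Completeness: share of rows where the field we want to target on is filled.
completeness = customers["email"].notna().mean()

# Timeliness: share of rows updated within the last year, relative to a reference date.
reference_date = pd.Timestamp("2021-03-01")
timeliness = (reference_date - customers["last_updated"] <= pd.Timedelta(days=365)).mean()

# Consistency (very rough proxy): share of duplicated names.
duplicates = customers["name"].duplicated().mean()

print(f"completeness={completeness:.0%}, timeliness={timeliness:.0%}, duplicates={duplicates:.0%}")
```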


In our previous tutorial intro, we outlined the four pillars that are relevant to data governance. In this post, I will take a deeper dive into the data security and data privacy aspects of data governance.

What is data security?

Data security is all about securing data against intrusions from inside or outside an organisation. Basically, it deals with hardening any systems that store data and making sure that data is only stored in a safe and secure way.

Data security comes in several layers:

  • Infrastructure: ensuring that the physical infrastructure is protected against any unwanted access. This starts with physical access control to servers and any devices associated with the organisation. This layer is only relevant when running on-premise.
  • Operating Systems and virtualisation: here it needs to be ensured that the operating system is in a secure state. If done on-premise, this covers the host OS, the guest OS and the virtualisation software. When done in the cloud, it only applies to IaaS.
  • Databases and Data Stores: any databases need to be constantly checked for vulnerabilities. If other stores such as object stores are used, they also need to be secured. This applies to on-premise and IaaS cloud solutions, but not to PaaS or SaaS cloud solutions.
  • Application Security: when building software on top of the previous stack, it is necessary to write this software in a secure manner. This applies to both on-premise and cloud. When using PaaS or SaaS solutions, it is the only security layer the implementing company is still responsible for. Therefore, it is highly important to have a comprehensive security concept on this layer.

What if you ignore it?

Data security issues are a frequent failure of companies. There are plenty of examples of data leaks, such as those at LinkedIn, Deutsche Telekom or Twitter. Almost nobody is secure, and thus this block needs to be taken into consideration at the highest level when building a data strategy. Experts argue that the question is not whether an intrusion happens, but when. The only real question might be how long the organisation needs to realise it, take counter-measures and minimise the damage.

A key recommendation (but not the only one) is to encrypt all data, so that it is more challenging to get full access.
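
As one illustration of what encryption in application code can look like, here is a minimal sketch using the third-party Python package cryptography (symmetric Fernet encryption); key management, which is the hard part in practice, is deliberately left out.

```python
from cryptography.fernet import Fernet

# Generate a symmetric key; in practice this would come from a key management system.
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a sensitive record before writing it to storage ...
record = b"customer: Mario Meir, ssn: 123-45-6789"
token = cipher.encrypt(record)

# ... and decrypt it again for an authorised consumer.
assert cipher.decrypt(token) == record
print("stored ciphertext:", token[:30], "...")
```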

What is data privacy?

Another important block is data privacy. This deals more with the question of who can read or access data within a company. Basically, algorithms and people should work with (pseudo-)anonymised data whenever possible. Analysts or data scientists shouldn't see any personal information within the data they are dealing with. If we take a marketing campaign, the analysts working with the data should only see the minimum data necessary for them to build the campaign. The marketing tool should then combine the results of their selection with the addresses of the target group. There are several tools available that obfuscate personally identifiable information (PII) and thus make working with it easier.
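
One very simple form of pseudonymisation is to replace identifiers with keyed hashes before handing data to analysts. The following sketch uses Python's standard hmac module; the key and e-mail addresses are invented, and a real setup would keep the key in a secrets manager.

```python
import hmac
import hashlib

# Secret key held by the data platform team, not by the analysts (invented value).
PSEUDONYMISATION_KEY = b"replace-with-a-secret-from-a-vault"

def pseudonymise(value: str) -> str:
    """Replace a personal identifier with a stable, non-reversible token."""
    return hmac.new(PSEUDONYMISATION_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Analysts see tokens instead of e-mail addresses, but can still join and count on them.
emails = ["mario.meir@example.com", "jane.doe@example.com"]
print([pseudonymise(e) for e in emails])
```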

What is described above is also called the "need-to-know principle": people should only see the data that they really need to know. When looking at how companies grant access rights to data, it is often done on a very individual basis. People ask for access, state why they need it, and the data owner grants it. However, this is rather manual and not necessarily fit for the new era of privacy.

A business-driven role-based access model

A much better approach is to build on a role-based access model. Roles here don't necessarily mean Active Directory roles; the model is built on the business roles that users have. For example, a role could be "Marketing Analyst". This user would get access to the specific data that he or she needs for daily work. Access should be given to all data that is relevant, but nothing more than that. The roles in this approach should be clearly business-focused, not technology-focused.
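
Such a business-driven role model can start as something as simple as a mapping from business roles to the data assets they may use; the roles and dataset names in this sketch are invented for illustration.

```python
# Invented example: business roles mapped to the data assets they may access.
ROLE_PERMISSIONS = {
    "marketing_analyst": {"campaign_results", "customer_segments"},
    "data_steward": {"campaign_results", "customer_segments", "customer_master"},
}

def can_access(role: str, dataset: str) -> bool:
    """Check whether a business role is allowed to access a dataset."""
    return dataset in ROLE_PERMISSIONS.get(role, set())

print(can_access("marketing_analyst", "customer_segments"))  # True
print(can_access("marketing_analyst", "customer_master"))    # False
```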

Another key aspect of data privacy is understanding who accessed what data. It is necessary to store a comprehensive audit log of all data access and thus make data breaches traceable.
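
An audit log can start as an append-only record of who accessed which dataset and when. This sketch writes JSON lines to a local file; in a real setup this would go to a central, tamper-evident store.

```python
import json
from datetime import datetime, timezone

def log_access(user: str, dataset: str, action: str, path: str = "data_access_audit.log") -> None:
    """Append one audit entry per data access (illustrative sketch)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "action": action,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

log_access("marketing_analyst_01", "customer_segments", "read")
```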


… this is at least what I hear often. A lot of people working in the data domain state this to be "false but true". Business units often press for quick-and-dirty data delivery and thus force IT units to deliver this kind of data in an ad-hoc manner, with a lack of governance and in bad quality. This ends up with business projects being carried out inefficiently and without a 360-degree view of the data. Business units often trigger this inefficiency themselves and thus projects fail; they are more or less digging their own hole.

The issue with data governance is simple: you hardly see it in the P&L if you did it right. At least, you don't see it directly. If your data is in bad shape, you see it in other results, such as failing projects and bad outcomes in projects that use data. Often the business is blamed for bad results, even though the data was the weak point. It is therefore very important to apply a comprehensive data governance strategy in the entire company (and not just one division or business unit). Governance consists of several topics that need to be addressed:

What is data governance about?

  • Data Security and Access: data needs to stay secure and storage systems need to implement a high level of security. Access should be easy but secure. Data governance should enable self-service analytics, not block it.
  • One common data storage: data should be stored under the same standards in the company. A limited number of storage systems should cover all needs, and the different storage technologies should be connected. No silos should exist.
  • Data Catalog: it should be possible to see what data is available in the company and how to access it. A data catalog should make it possible to browse different data sources and see what is inside (as long as one is allowed to access this data).
  • Systems/Processes using data: there should be clear tracking of data usage. If data changes, it should be possible to see which systems and processes might be affected.
  • Auditing: an audit log should be available, especially to see who accessed data when.
  • Data quality tracking: it should be possible to track the quality of datasets along specific dimensions, such as accuracy, timeliness, correctness, …
  • Metadata about your data: metadata about the data itself should be available. You should know what can be inside your data, and your metadata should describe your data precisely.
  • Master data: you should have a golden record for all your core data. This is challenging, but it should be the target.

Achieving all of this is complex, but it is possible if the company implements a good data strategy. There are many benefits to data governance.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.