
In this last tutorial of the Data Governance series, we look at Master Data Management (MDM), the last of our four pillars. Master data is the core data of a company. It should be clean, accurate and organised in a clear data model.

What is the goal of Master Data Management?

It is important to have exactly one record for each key data asset within the company, for instance a customer or a supplier. Each customer should exist exactly once. Many companies have their customer data spread over different systems and therefore struggle to connect those systems. If a customer walks into a store, sales agents often have to use several CRM tools to get a holistic picture of that customer. As a result, the company often does not fully understand its customers.

To achieve this, a company needs to harmonise its data. Reducing duplicate entries and finding the “golden record” is a key challenge in MDM: all data about one customer should be connected and in one place. Today, this is often called “Customer 360”. But achieving this isn’t easy at all.

How to find the “Golden Record”?

There are several ways to find the golden record within a dataset. Let’s imagine we have the following dataset. All three entries describe the same person, but the name is written differently each time:

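| Name       | Social Security Number | Passport   | Matching Group ID |
|------------|------------------------|------------|-------------------|
| Mario Meir | 123-45-6789            |            |                   |
| Meir Mario | 123-45-6789            | P 123456 M |                   |
| M. Meir    |                        | P 123456 M |                   |

How to find the golden record in a dataset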

In this dataset, there is a match on the social security number and another on the passport, so we can apply hierarchical matching. We first match the entries on attributes that are close to unique, which normally holds for both the social security number and the passport ID. In this case, all three entries can be merged into one record. This is represented by matching groups:

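| Name       | Social Security Number | Passport   | Matching Group ID |
|------------|------------------------|------------|-------------------|
| Mario Meir | 123-45-6789            |            | 1                 |
| Meir Mario | 123-45-6789            | P 123456 M | 1                 |
| M. Meir    |                        | P 123456 M | 1                 |

Hierarchical matching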
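The matching step described above can be sketched in a few lines of Python. This is only a minimal illustration, not a production MDM implementation: the field names `ssn` and `passport`, the key order in `MATCH_KEYS` and the function `assign_matching_groups` are assumptions made for this example. Records that share a value for any key end up in the same matching group, even when they are only linked through a third record.

```python
# Records from the example table; None marks a missing identifier.
records = [
    {"name": "Mario Meir", "ssn": "123-45-6789", "passport": None},
    {"name": "Meir Mario", "ssn": "123-45-6789", "passport": "P 123456 M"},
    {"name": "M. Meir", "ssn": None, "passport": "P 123456 M"},
]

# Identifiers to match on, in the assumed order of the hierarchy.
MATCH_KEYS = ["ssn", "passport"]


def assign_matching_groups(rows, keys=MATCH_KEYS):
    """Return one matching group ID per row; rows sharing a key value are linked."""
    parent = list(range(len(rows)))  # union-find forest, one node per row

    def find(i):
        # Walk up to the root of the group that row i belongs to.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        # Merge the groups of rows i and j.
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for key in keys:
        first_seen = {}  # key value -> index of the first row carrying it
        for idx, row in enumerate(rows):
            value = row.get(key)
            if not value:
                continue  # a missing identifier cannot produce a match
            if value in first_seen:
                union(first_seen[value], idx)
            else:
                first_seen[value] = idx

    # Number the groups 1, 2, 3, ... in order of first appearance.
    group_ids = {}
    result = []
    for idx in range(len(rows)):
        root = find(idx)
        group_ids.setdefault(root, len(group_ids) + 1)
        result.append(group_ids[root])
    return result


print(assign_matching_groups(records))  # prints [1, 1, 1]
```

The first two rows are linked by the social security number, and the third row joins the same group through the passport it shares with the second row.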

What else can be done to increase the quality of your Master Data?

In addition to hierarchical matching, several other techniques are available. The most common one is “manual matching”, where employees search for duplicated data and match it by hand. However, a better approach is to match data via machine learning and combine it with “manual matching”.
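To illustrate how automated and manual matching can work together, here is a small sketch. Note that it does not use machine learning: a plain string similarity from Python's standard `difflib` module stands in for a trained model. The thresholds `AUTO_MATCH` and `NEEDS_REVIEW` and the helper names are assumptions chosen for this example. Clear matches are accepted automatically, unclear ones are routed to an employee for review.

```python
from difflib import SequenceMatcher

# Assumed thresholds; in practice they would be tuned on labelled data.
AUTO_MATCH = 0.9
NEEDS_REVIEW = 0.6


def normalise(name):
    """Lower-case the name, drop dots and sort the tokens to ignore word order."""
    tokens = name.lower().replace(".", " ").split()
    return " ".join(sorted(tokens))


def similarity(name_a, name_b):
    """Similarity score between 0.0 and 1.0 (a stand-in for a model score)."""
    return SequenceMatcher(None, normalise(name_a), normalise(name_b)).ratio()


def route(name_a, name_b):
    """Decide whether a pair is matched automatically, reviewed by hand or rejected."""
    score = similarity(name_a, name_b)
    if score >= AUTO_MATCH:
        return "auto-match"
    if score >= NEEDS_REVIEW:
        return "manual-review"
    return "no-match"


for pair in [("Mario Meir", "Meir Mario"), ("Mario Meir", "M. Meir")]:
    print(pair, route(*pair))
```

With a real model, only the `similarity` function would change. The routing into automatic matches and a manual review queue stays the same.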

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like. You can read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

Next to Data Security & Privacy and Data Quality Management, Data Access and Search is of great importance. This topic focuses on finding and accessing data in your data assets. Most large enterprises have a lot of data at their fingertips, but the different business units don’t know where and how to find it. In this tutorial, we will have a look at how to solve this issue.

What are the ingredients for successful Data Access and Search?

Several preconditions need to be fulfilled in order to make data accessible. One of them is to have data security and privacy solved. If you want to make data accessible at large scale, it is very important to ensure that users can access only the data they are entitled to. As a result, all users should see the data assets of the company via a data catalog, but not the data itself. In this catalog, people should be able to browse the different data assets available in the company and start asking more questions.

A good data catalog continuously checks the underlying data for changes and keeps its own entries up to date. In addition to these requirements, the data catalog checks different data quality measures as described in the previous tutorial.

What should be inside a data catalog?

Based on the points above, a data catalog contains a lot of data about data. Next to listing the available data assets, it should describe each data asset and offer several pieces of information about it:

  • Title. Title of the dataset.
  • Description. What this dataset is about.
  • Categories. Tags to enable search.
  • Business Unit. Unit maintaining the dataset (e.g. Marketing).
  • Data Owner. Person in charge of maintaining the dataset.
  • Data Producer. System that produces the data.
  • Data Steward. Person taking care of the dataset, if not the data owner.
  • Timespan. The period, from start date to end date, in which the data was recorded.
  • Data refresh interval. If the data is not available in real time, an indication of how often it gets refreshed.
  • Quality metrics. Indications of data quality.
  • Data Access or Sample Data. Information on how to access the data, or a sample dataset to explore it.
  • Transformations. When and how was the data transformed?
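The attributes listed above can be sketched as a simple data structure. This is a hypothetical model for illustration only: the class name `CatalogEntry`, its field names and all sample values are assumptions and do not refer to any specific data catalog product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    """Metadata describing one data asset; it holds no actual data rows."""
    title: str
    description: str
    categories: List[str]
    business_unit: str
    data_owner: str
    data_producer: str
    data_steward: Optional[str] = None       # only set if not the data owner
    timespan_start: Optional[date] = None    # first day covered by the data
    timespan_end: Optional[date] = None      # last day covered by the data
    refresh_interval: Optional[str] = None   # e.g. "daily"; None for real time
    quality_metrics: Dict[str, float] = field(default_factory=dict)
    access_info: str = ""                    # how to request access or a sample
    transformations: List[str] = field(default_factory=list)


# Hypothetical sample entry; every value is made up for this example.
entry = CatalogEntry(
    title="Customer transactions",
    description="All purchases made by customers in the web shop.",
    categories=["customer", "sales"],
    business_unit="Marketing",
    data_owner="Jane Doe",
    data_producer="Web shop order system",
    timespan_start=date(2019, 1, 1),
    timespan_end=date(2019, 12, 31),
    refresh_interval="daily",
    quality_metrics={"completeness": 0.98},
    access_info="Request access via the data owner.",
    transformations=["2020-01-05: customer IDs pseudonymised"],
)

print(entry.title, entry.categories)
```

Keeping the entry separate from the data itself matches the idea that everyone may browse the catalog while access to the underlying data stays restricted.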

What does a data catalog look like?

The items above are examples of what a data catalog entry contains. A good data catalog makes it easy for users to find and search within the metadata. The following sample shows the data catalog from the US government:

US government open data portal
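Searching within the metadata can be sketched as a simple keyword search over title, description and categories. This is a deliberately naive illustration and not how any particular catalog or the portal above is implemented. The sample entries and the function `search_catalog` are assumptions made for this example.

```python
# Hypothetical catalog entries; only metadata is stored, not the data itself.
catalog = [
    {
        "title": "Customer master data",
        "description": "Golden records of all customers.",
        "categories": ["customer", "mdm"],
    },
    {
        "title": "Supplier contracts",
        "description": "Active and expired contracts with suppliers.",
        "categories": ["supplier", "procurement"],
    },
    {
        "title": "Web shop clickstream",
        "description": "Page views of customers in the web shop.",
        "categories": ["marketing", "web"],
    },
]


def search_catalog(entries, query):
    """Return the titles of all entries whose metadata contains every search term."""
    terms = query.lower().split()
    hits = []
    for entry in entries:
        # Combine all searchable metadata fields into one lower-case text.
        searchable = " ".join(
            [entry["title"], entry["description"], *entry["categories"]]
        ).lower()
        if all(term in searchable for term in terms):
            hits.append(entry["title"])
    return hits


print(search_catalog(catalog, "customer"))
print(search_catalog(catalog, "supplier procurement"))
```

A real catalog would typically add ranking, stemming and access checks on top of this, but the principle of searching descriptions and tags rather than the data itself stays the same.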
