We started our tutorial with a general intro to Data Governance and then went a bit deeper into data security and data privacy. In this post, we will look at how to ensure a certain level of data quality in your data sets. Data quality is a very important aspect. Imagine you have wrong data about your customers and you build your marketing campaign on it. The campaign might reach the wrong people or deliver misleading results. This can damage your brand and turn away previously loyal customers. Therefore, data quality is essential.
How to measure data quality?
There are several ways to measure data quality. I’ve summarised them into 5 core metrics. If you browse the literature, you might find more or fewer metrics. However, these five should give you a core understanding of data quality management.
Availability states that data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem. Here, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The algorithm querying the data should be as good as possible at retrieving all available data. There should be easy-to-use tools and languages to retrieve the data. Normally, each database provides developers with a query language such as SQL, or with O/R mappers.
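To make the query idea a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `users`, its columns, the sample rows and the helper `users_interested_in` are all assumptions made up for this illustration; your own database will look different.

```python
import sqlite3

# In-memory database with a hypothetical "users" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, interest TEXT)"
)
conn.executemany(
    "INSERT INTO users (name, interest) VALUES (?, ?)",
    [("Anna", "luxury cars"), ("Ben", "budget cars"), ("Carla", "luxury cars")],
)


def users_interested_in(conn, interest):
    """Return the names of ALL users with the given interest, not a subset."""
    rows = conn.execute(
        "SELECT name FROM users WHERE interest = ? ORDER BY name", (interest,)
    )
    return [name for (name,) in rows]


luxury_fans = users_interested_in(conn, "luxury cars")
print(luxury_fans)
```

The query has no LIMIT clause and no sampling, so it returns every matching user, which is what availability asks for in this sense.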
Availability also means that the data needed for a specific use case should be available to data analysts in business units. Data relevant for a marketing campaign might exist but not be available for the campaign. For instance, the company might have specific customer data in the data warehouse, but the business units don’t know that the data actually exists.
Correctness & Completeness
Correctness means that data has to be correct. If we again query for all existing users on a web portal interested in luxury cars, the data about them should be correct. Correctness means that the data should really represent people interested in luxury cars and that fake entries should be removed. A data set is also not correct if a user changed his or her address without the company knowing about it. Therefore, it must be tracked when each dataset was last updated.
Similar to correctness is completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can reach them somehow, e.g. by e-mail. If the e-mail field, or any other field we would like to use to reach our users, is blank, the data is not complete for our use case.
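One simple way to put a number on completeness is the share of records in which the field you need is actually filled. The following is only a sketch: the function `completeness`, the record layout and the sample users are assumptions for illustration.

```python
def completeness(records, field):
    """Return the share of records whose `field` is present and non-blank."""
    if not records:
        return 0.0
    # None, a missing key and whitespace-only strings all count as blank.
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)


# Hypothetical sample data: users interested in luxury cars.
users = [
    {"id": 1, "email": "anna@example.com", "interest": "luxury cars"},
    {"id": 2, "email": "", "interest": "luxury cars"},
    {"id": 3, "email": None, "interest": "luxury cars"},
    {"id": 4, "email": "ben@example.com", "interest": "luxury cars"},
]

email_completeness = completeness(users, "email")
print(f"{email_completeness:.0%}")  # prints "50%"
```

In this made-up sample, only half of the users could be reached by e-mail, so the data is only 50% complete for an e-mail campaign.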
Data should be up to date. A user might change their e-mail address after a while, and our database should reflect such changes whenever and wherever possible. If we target users interested in luxury cars, it won’t help at all if only 50% of the users’ e-mail addresses are still valid. We might have “big data”, but the data is not correct if it hasn’t been updated for a while.
Data should be consistent. This shouldn’t be confused with the consistency requirement of the CAP theorem. Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistent data, a frequent problem in large web portals such as Facebook.
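A first, rough approach to spotting duplicate accounts is to group records by a normalised key and look for groups with more than one entry. The sketch below assumes that duplicates share the same name and ZIP code after trimming and lowercasing; real duplicate detection usually needs fuzzier matching. The function `find_duplicates` and the sample accounts are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Group records by normalised key fields; keep groups with 2+ records."""
    groups = defaultdict(list)
    for r in records:
        # Normalise by trimming whitespace and lowercasing each key field.
        key = tuple(str(r[f]).strip().lower() for f in key_fields)
        groups[key].append(r)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Hypothetical sample data: the same person registered twice.
accounts = [
    {"id": 1, "name": "Anna Meier", "zip": "1010", "interest": "luxury cars"},
    {"id": 2, "name": "anna meier ", "zip": "1010", "interest": "budget cars"},
    {"id": 3, "name": "Ben Huber", "zip": "8010", "interest": "luxury cars"},
]

dupes = find_duplicates(accounts, ["name", "zip"])
for key, group in dupes.items():
    interests = {r["interest"] for r in group}
    status = "conflicting interests" if len(interests) > 1 else "consistent"
    print(key, [r["id"] for r in group], status)
```

Here accounts 1 and 2 end up in the same group and carry conflicting interests, which is exactly the kind of inconsistency described above.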
Data should be easy to understand. If we query our database for people interested in luxury cars, we should be able to easily understand what the data is about. Once the data is returned, we should be able to use our favorite tool to work with it. The data should be self-describing, so that we know how to handle it. If the result contains a “zip” column, we know that it holds the ZIP code of the area each user lives in.
What can you do to improve your data quality?
Basically, it all starts with starting. You need to start tracking your data quality at some point and then continuously improve it. Several tools exist that support this endeavour. But keep in mind: bad data leads to bad decisions!
This tutorial is part of the Data Governance Tutorial, where you can learn more about Data Governance. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; you can read about them in the Big Data Tutorials. If you are looking for great datasets to play with, I would recommend Kaggle.