To analyse data correctly, a high level of data quality is necessary. In this tutorial, we will look at the attributes that data must fulfil to achieve this.
What are the data quality attributes?
When storing data, certain quality attributes must be fulfilled. Heinrich & Stelzer (2011) defined the following data quality attributes.
- Relevance. Data should be relevant to the use case. If a query looks up all users of a web portal who are interested in “luxury cars”, all of these users should be returned. It should be possible to derive some benefit from this data, e.g. for advanced marketing targeting.
- Correctness. Data has to be correct. If we again query for all users of a web portal who are interested in luxury cars, the returned records should really represent people interested in luxury cars, and fake entries should be removed.
- Completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can reach them somehow, e.g. by e-mail. If the e-mail field, or any other field we would use to reach our users, is blank, the data is not complete for our use case.
- Timeliness. Data should be up to date. A user might change their e-mail address after a while, and our database should reflect such changes whenever and wherever possible. If we target our users for luxury cars, it is a problem if only 50% of the users’ e-mail addresses are still correct. We might have “big data”, but the data is no longer correct because it has not been updated for a while.
- Accuracy. Data should be as accurate as possible. Web site users should be able to state “Yes, I am interested in luxury cars” directly instead of only choosing their favorite brand (which could be offered in addition). If users can only select a favorite brand, the information might be accurate, but not accurate enough. Imagine someone selects “BMW” as their favorite brand. BMW could be considered a luxury brand, but it also offers other kinds of models. If someone selects BMW only because they like the sporty features, the targeting mechanism might hit the wrong people.
- Consistency. This shouldn’t be confused with the consistency requirement of the CAP-Theorem (see next section). Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistent data, and they are a frequent problem in large web portals such as Facebook (Kelly, 2012).
- Availability. Data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP-Theorem. Here, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The algorithm querying the data should retrieve as much of the available data as possible. There should also be easy-to-use tools and languages to retrieve the data. Normally, each database offers developers a query language such as SQL, or can be accessed through O/R mappers.
- Understandability. It should be easy to understand data. If we query our database for people interested in luxury cars, we should be able to easily understand what the data is about. Once the data is returned, we should be able to work with it in our favorite tool. The data should be self-describing, so that we know how to handle it. If the result contains a “zip” column, we know that it holds the ZIP code of the area each user lives in.
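Some of these attributes can be checked automatically. The following is a minimal sketch in Python, not a definitive implementation. It measures completeness (the share of non-blank e-mail fields), timeliness (records that have not been updated for a given period) and consistency (several accounts sharing one e-mail address). The sample records, the field names (`email`, `interest`, `updated_at`) and the one-year threshold are assumptions made for illustration only. Matching duplicates by e-mail address is also a simplification, since a user who registers several times may use different addresses.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample records; the field names are assumptions for illustration.
USERS = [
    {"id": 1, "email": "anna@example.com", "interest": "luxury cars",
     "updated_at": datetime(2024, 5, 1)},
    {"id": 2, "email": "", "interest": "luxury cars",
     "updated_at": datetime(2024, 4, 20)},
    {"id": 3, "email": "Anna@Example.com ", "interest": "budget cars",
     "updated_at": datetime(2021, 1, 15)},
    {"id": 4, "email": "ben@example.com", "interest": "luxury cars",
     "updated_at": datetime(2020, 7, 3)},
]


def completeness_ratio(users, field):
    """Completeness: share of records whose `field` is present and not blank."""
    if not users:
        return 0.0
    filled = sum(1 for u in users if str(u.get(field) or "").strip())
    return filled / len(users)


def stale_ids(users, now, max_age):
    """Timeliness: IDs of records not updated within `max_age` before `now`."""
    return [u["id"] for u in users if now - u["updated_at"] > max_age]


def duplicate_accounts(users):
    """Consistency: group IDs by normalised e-mail, keep groups with > 1 account."""
    groups = defaultdict(list)
    for u in users:
        key = (u.get("email") or "").strip().lower()
        if key:  # blank e-mails are a completeness issue, not a duplicate
            groups[key].append(u["id"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}


if __name__ == "__main__":
    now = datetime(2024, 6, 1)  # fixed date so the result is reproducible
    print(completeness_ratio(USERS, "email"))          # 0.75
    print(stale_ids(USERS, now, timedelta(days=365)))  # [3, 4]
    print(duplicate_accounts(USERS))                   # {'anna@example.com': [1, 3]}
```

In this sample, accounts 1 and 3 share one e-mail address but state different interests (“luxury cars” and “budget cars”), which is exactly the kind of inconsistency described above. Attributes such as relevance and understandability depend on the use case and are much harder to check with a simple script.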
I hope you enjoyed the first part of this tutorial about big data technology. It is part of the Big Data Tutorial series, so make sure to read the other parts as well.