
Over the last months, I wrote several articles about data governance. One aspect of data governance is the principle of FAIR data. FAIR, in the context of data, stands for findable, accessible, interoperable and reusable. There are several scientific papers dealing with this topic. Let me explain what it is about.

What is FAIR data?

FAIR builds on the four principles stated at the beginning: findable, accessible, interoperable and reusable. This tackles most of the requirements around data governance and thus should increase the use of data. It doesn’t really deal with the aspect of data quality, but it does deal with the challenge of how to work with data. In my experience, most issues around data governance are very basic, and most companies don’t manage to solve them at this elementary level.

If a company gets started with the principle of FAIR, some elementary groundwork can be done and future quality improvements can be built on top of it. Plus, it is a good and easy starting point for data governance. Let me explain each of the principles in a bit more depth now.

Findable data

Most data projects start with the question of whether there is data for a specific use-case. This is often difficult to answer, since data engineers or data scientists often don’t know what kind of data is available in a large enterprise. They know the problem they want to solve but don’t know where the data is. They have to move from person to person and dig deep into the organisation until they find someone who knows about the data that could serve their business need. This process can take weeks, and data scientists might get frustrated along the way.

A data catalog containing information about the data assets in an enterprise might solve these issues.

Accessible data

Once the first aspect is solved, it is necessary to access the data. This also brings a lot of complexity, since data is often sensitive and data owners simply don’t want to share access to it. Escalations often happen along the way. To solve these problems, it is necessary to have clear data owners defined for all data assets. It is also highly important to have a clear process for data access in place.

Interoperable data

Data often needs to be combined with other data sets in use-cases. This means that it must be known what each data asset is about. It is necessary to have metadata available about the data and to share it with data consumers. Nothing is worse for data scientists than having to constantly ask data owners about the content of a data set. The better a data set is described, the faster people can work with it.

A frequent case is that data is bought from other companies or shared among companies, as in the concept of decentralised data hubs. In this context, it is highly important to have clearly defined metadata available.

Reusable data

Data should eventually be reusable for other business cases as well. Therefore, it is necessary to know how the data was created. A description of the source system and producing entities needs to be available. Information about potential transformations applied to the data also needs to be included.

In order to make data reusable, the terms of reusability must be provided. This can be a license or other community standards on the data. Data can either be purchased or made available for free. Different software solutions enable this.
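As an illustration, such reusability information could be captured right next to the dataset. The following minimal sketch records source system, transformations and license terms as a plain Python dictionary; the field names are my own choice and not taken from a formal metadata standard:

```python
# A minimal, hypothetical provenance record for a dataset.
# Field names are illustrative and not taken from a formal metadata standard.
reusability_metadata = {
    "dataset": "customer_orders_2023",
    "source_system": "ERP (order management module)",
    "producing_entity": "Order Processing Service",
    "transformations": [
        "removed test orders",
        "normalised currency to EUR",
    ],
    "license": "internal use only",  # could also be an open license such as CC-BY
    "terms_of_reuse": "may be reused for analytics, not for external sharing",
}
```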

What’s next on FAIR data?

I believe it is easy to get started with implementing the tools and processes needed for a FAIR data strategy. It will immediately reduce the time it takes to find and access data and provide a clear way forward. It will also indirectly increase data quality and enable future data quality initiatives.

My article was inspired by the discussions I had with Prof. Polleres. Thanks for the insights!

golden record

In this last tutorial of our Data Governance series, we look at Master Data Management, the last of our four pillars. Master Data is the core data of a company, which should be clean, accurate and captured in a clear data model.

What is the goal of Master Data Management?

It is important to have exactly one dataset for each key data asset within the company, for instance the data about a customer or a supplier. Each customer should exist exactly once. Many companies have their customer data spread over different systems and thus have issues connecting those systems. If a customer walks into a store, the sales agents often have to use different CRM tools to get a holistic picture of the customer. This often leads to the company not fully understanding its customers.

In order to reach this, it is necessary to harmonise the data within the company. Reducing duplicate entries and finding the “golden record” is a key challenge in MDM: all data about one customer should be connected and in one place. Today, this is often called “Customer 360”. But achieving this isn’t easy at all.

How to find the “Golden Record”?

Basically, there are several options for finding the golden record within a dataset. Let’s imagine we have the following dataset; each of the entries is exactly the same person, but the names are written differently:

| Name       | Social Security Number | Passport   | Matching Group ID |
|------------|------------------------|------------|-------------------|
| Mario Meir | 123-45-6789            |            |                   |
| Meir Mario | 123-45-6789            | P 123456 M |                   |
| M. Meir    |                        | P 123456 M |                   |

How to find the golden record in a dataset

In this dataset, we see that there is a match on the social security number and on the passport. So, we can apply hierarchical matching. First, we match on those attributes that are rather unique. Normally, the social security number is unique, as is the passport ID. In this case, we can merge the entries into one dataset. This is represented by matching groups:

| Name       | Social Security Number | Passport   | Matching Group ID |
|------------|------------------------|------------|-------------------|
| Mario Meir | 123-45-6789            |            | 1                 |
| Meir Mario | 123-45-6789            | P 123456 M | 1                 |
| M. Meir    |                        | P 123456 M |                   |

Hierarchical matching
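To make hierarchical matching more concrete, here is a minimal sketch in Python using pandas. It rebuilds the toy dataset above and assigns matching group IDs first by social security number and then by passport. The function and column names are my own, and the logic is deliberately simplified compared to real MDM tooling:

```python
import pandas as pd

# Toy dataset from the tables above; None means the value is unknown.
df = pd.DataFrame({
    "name": ["Mario Meir", "Meir Mario", "M. Meir"],
    "ssn": ["123-45-6789", "123-45-6789", None],
    "passport": [None, "P 123456 M", "P 123456 M"],
})

def hierarchical_match(records: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Assign a matching group ID by matching on each key, most reliable first."""
    records = records.copy()
    records["group_id"] = pd.NA
    next_id = 1
    for key in keys:
        for _, rows in records.groupby(key).groups.items():
            existing = records.loc[rows, "group_id"].dropna()
            if not existing.empty:
                group = existing.iloc[0]   # join an already assigned group
            else:
                group = next_id
                next_id += 1
            records.loc[rows, "group_id"] = group
    return records

# Match on the social security number first, then on the passport.
print(hierarchical_match(df, keys=["ssn", "passport"]))
```

In this toy example, all three entries end up in one matching group, which is exactly the “golden record” we are looking for.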

What else can be done to increase the quality of your Master Data?

Basically, in addition to hierarchical matching, there are several other techniques available. The most common one is manual matching, where employees search for duplicate data and match it by hand. A better approach, however, is to match data via machine learning and combine it with manual matching!

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I recommend Kaggle.

Next to Data Security & Privacy and Data Quality Management, Data Access and Search is hugely important. This topic focuses on finding and accessing data in your data assets. Most large enterprises have a lot of data at their fingertips, but different business units don’t know where and how to find it. In this tutorial, we will have a look at how to solve this issue.

What are the ingredients for successful Data Access and Search?

There are several pre-conditions that need to be fulfilled in order to make data accessible. One of them is to have data security and privacy solved. If you want to make data accessible at large scale, it is very important to ensure that users can only access the data they are entitled to. As a consequence, all users should see the data assets of the company via a data catalog, but not the data itself. In this catalog, people should have the possibility to browse the different data assets available in the company and start asking further questions.

A good data catalog constantly checks the underlying data for updates and modifications and reflects them in the catalog itself. In addition to the requirements mentioned before, the data catalog checks for the different data quality measures described in the previous tutorial.

What should be inside a data catalog?

Based on the above, a data catalog contains a lot of data about data. Next to listing the different data assets available, each data asset should be described and offer several pieces of information about it (a sketch of such an entry follows the list):

  • Title. Title of the dataset.
  • Description. What this dataset is about.
  • Categories. Tags, to enable search.
  • Business Unit. Unit maintaining the dataset (e.g. Marketing).
  • Data Owner. Person in charge of maintaining the dataset.
  • Data Producer. System that produces the data.
  • Data Steward. Person taking care of the dataset, if not the data owner itself.
  • Timespan. Indicates from when to when the data was recorded.
  • Data refresh interval. If not available in real time, an indication of how often the data gets refreshed.
  • Quality metrics. Indications of data quality.
  • Data Access or Sample Data. Information on how to access the data, or a sample dataset to explore the data.
  • Transformations. When and how was the data transformed?
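To make this more tangible, the bullet points above could translate into a catalog entry roughly like the following Python sketch. The structure is my own simplification and the concrete values are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """A simplified data catalog entry covering the fields listed above."""
    title: str
    description: str
    categories: list
    business_unit: str
    data_owner: str
    data_producer: str
    data_steward: str
    timespan: str
    refresh_interval: str
    quality_metrics: dict = field(default_factory=dict)
    access_info: str = ""
    transformations: list = field(default_factory=list)

entry = CatalogEntry(
    title="Web Shop Orders",
    description="All orders placed in the web shop, one row per order line.",
    categories=["sales", "e-commerce"],
    business_unit="Marketing",
    data_owner="Jane Doe",
    data_producer="Web Shop Backend",
    data_steward="John Smith",
    timespan="2019-01-01 to today",
    refresh_interval="daily",
    quality_metrics={"completeness": 0.97, "timeliness": "updated nightly"},
    access_info="request access via the data catalog; sample extract available",
    transformations=["currency normalised to EUR"],
)
```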

What does a data catalog look like?

The items above are samples of the contents of a data catalog entry. A good data catalog makes it easy for users to find and search within the metadata. The following sample shows the data catalog of the US government:

US government open data portal

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I recommend Kaggle.

We started our tutorial with a general intro to Data Governance and then went a bit deeper into data security and data privacy. In this post, we will have a look at how to ensure a certain level of data quality in your data sets. Data quality is a very important aspect. Imagine you have wrong data about your customers and you build your marketing campaign on it. The campaign might return poor results. This can damage your brand and turn away previously loyal customers. Therefore, data quality is essential.

How to measure data quality?

There are several aspects to measuring data quality. I’ve summarised them into five core metrics. If you browse the literature, you might find more or fewer metrics; however, these five should give you a core understanding of data quality management.

The 5 dimensions of data quality

Availability

Availability states that data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem; in this case, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The query retrieving the data should be as good as possible at returning all available data. There should be easy-to-use tools and languages to retrieve the data. Normally, each database provides a query language such as SQL, or O/R mappers for developers.

Availability also means that the data used for a specific use-case should be available to data analysts in business units. Data relevant for a marketing campaign might exist but not be available for the campaign. For instance, the company might have specific customer data in the data warehouse, but the business units don’t know that the data actually exists.

Correctness & Completeness

Correctness means that data has to be correct. If we again query for all existing users on a web portal interested in luxury cars, the data about them should be correct. By correctness, it is meant that the data should really represent people interested in luxury cars and that fake entries should be removed. A data set is also not correct if a user changed his or her address without the company knowing about it. Therefore, it must be tracked when each dataset was last updated.

Similar to correctness is completeness: data should be complete. Targeting all users interested in luxury cars only makes sense if we can actually reach them, e.g. by e-mail. If the e-mail field, or any other field we would like to use to target our users, is blank, the data is not complete for our use-case.

Timeliness

Data should be up to date. A user might change their e-mail address after a while, and our database should reflect these changes whenever and wherever possible. If we target our users for luxury cars, it won’t be good at all if only 50% of the users’ e-mail addresses are correct. We might have “big data”, but the data is not correct since updates didn’t occur for a while.

Consistency

This shouldn’t be confused with the consistency requirement of the CAP theorem. Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistency of data, and this is a frequent problem in large web portals such as Facebook.

Understandability

It should be easy to understand data. If we query our database for people interested in luxury cars, we should have the possibility to easily understand what the data is about. Once the data is returned, we should be able to use our favourite tool to work with it. The data should describe itself, and we should know how to handle it. If the data returns a “zip” column, we know that this is the ZIP code the individual users live in.

What can you do to improve your data quality?

Basically, it all starts with getting started. You need to begin tracking your data quality at some point and then continuously improve it. There are several tools that support this endeavour. But keep in mind: bad data creates bad decisions!
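To get started with tracking, a few of the dimensions above can be measured with very little code. The following pandas sketch computes completeness (share of customers with an e-mail address) and timeliness (share of records updated within the last year) for a small, invented customer table; the column names are assumptions for this example:

```python
import pandas as pd

# Hypothetical customer extract; column names are assumptions for this sketch.
customers = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com"],
    "last_updated": pd.to_datetime(["2024-05-01", "2020-01-15", "2023-11-30"]),
})

# Completeness: share of customers with an e-mail address we can actually target.
completeness = customers["email"].notna().mean()

# Timeliness: share of records updated within the last 365 days.
cutoff = pd.Timestamp.today() - pd.Timedelta(days=365)
timeliness = (customers["last_updated"] >= cutoff).mean()

print(f"completeness: {completeness:.0%}, timeliness: {timeliness:.0%}")
```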

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I recommend Kaggle.

Data Governance

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest in data science, but they often ignore Data Governance. Some months ago I wrote about this and shared my frustration about it. Now I’ve decided to go for a more pragmatic approach and describe what Data Governance is all about. This should bring some clarity into the topic and reduce emotions.

Why is Data Governance important?

It is important to keep a certain level of quality in the data. Making decisions based on bad data quality leads to bad overall decisions. Data Governance efforts increase exponentially when not started at the very beginning of your Data Strategy.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security often slows down business implementations
  • Initial investments are necessary – and they don’t show value for months or even years
  • Benefits are only visible “on top” of governance – e.g. through faster business results or better insights – and thus it is not easy to quantify the impact
  • Data Governance is often considered “unsexy”. Everybody talks about data science, but nobody about data governance. In fact, data scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open source doesn’t focus too much on it, as there is less “buzz” around it than around AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are rather designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect the success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal in here is to keep the data secure against external access. It is built on encryption, access management and accessibility. Often, a Roles-based access is defined in this process. A typical definition in here is privacy and security by design
  • Data Quality Management: In this pillar, different measures for Data Quality are defined and tracked. Typically, for each dataset, specific quality measures are looked after. This gives data consumers an overview of the data quality.
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company assets. A typical sample here is a Data Catalog, that shows all available company data to end users.
  • Master Data Management: master data is the common data of the company – e.g. the customer data, the data of suppliers and alike. Data in here should be of high quality and consistent. One physical customer should occur exactly as one person and not as multiple persons
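As a minimal illustration of the first pillar, role-based access can be sketched as a mapping from roles to the data assets they may read. In practice this is handled by an identity provider and a policy engine; the roles and asset names below are purely illustrative:

```python
# Purely illustrative role-based access check; in a real setup this is handled
# by an identity provider and a policy engine, not a hard-coded dictionary.
ROLE_PERMISSIONS = {
    "marketing_analyst": {"customer_profiles", "campaign_results"},
    "data_engineer": {"customer_profiles", "campaign_results", "raw_click_stream"},
}

def can_read(role: str, asset: str) -> bool:
    """Return True if the given role is allowed to read the given data asset."""
    return asset in ROLE_PERMISSIONS.get(role, set())

print(can_read("marketing_analyst", "raw_click_stream"))  # False
```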

For each of the above mentioned pillars, I will write individual articles over the next weeks.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I recommend Kaggle.

Agility is almost everywhere, and it is also starting to reach other hyped domains such as Data Science. One thing I like in this respect is the combination with DevOps, as this eases the process and creates end-to-end responsibility. However, I strongly believe that it doesn’t make much sense to exclude the business. In the case of analytics, I would argue that it is BizDevOps.

There is a huge demand for Data DevOps nowadays. Data Science needs a lot of business integration and works across different domains and functions. I have outlined several times, in different posts here, that Data Science isn’t a job done by data scientists alone. It is much more of a team effort and thus needs different people. With the concept of BizDevOps, this can be easily explained; let’s have a look at the following picture, and I will outline the interdependencies afterwards.

The process for Data Science: BizDevOps is the answer

BizDevOps for Data Science

Basically, there must be exactly one person who takes end-to-end responsibility, ranging from business alignment to translation into an algorithm and finally to making it productive and operating it. This is the typical workflow for BizDevOps. The person taking this end-to-end responsibility is typically a project or program manager working in the data domain. The three steps are outlined in the figure above; let’s now have a look at each of them.

Data DevOps: Biz

The program manager for data (you could also call this person the “Analytics Translator”) works closely with the business – marketing, fraud, risk, shop floor, … – to gather their business requirements and needs. This person also has a good understanding of what is feasible with the internal data, in order to be capable of “translating a business problem into an algorithm”. At this stage, it is mainly about the use-case and not so much about tools and technologies; those come in the next step. Until here, data scientists aren’t necessarily involved yet.

Data DevOps: Dev

In this phase, it is all about implementing the algorithm and working with the data. The program manager mentioned above has already aligned with the business and provided a detailed description. Data scientists and data engineers are now involved. Data engineers start to prepare and fetch the data, and they work with data scientists to find and retrieve the answer to the business question. There are several iterations and feedback loops back to the business as more and more answers arrive. This process should only take a few weeks, ideally 3-6. Once the results are satisfying, it goes over to the next phase: bringing it into operation.

Data DevOps: Ops

This phase is about operating the algorithms that were developed. The data engineer is in charge of integrating them into the live systems. The business unit typically wants to see the result as a (continuously) calculated KPI or some other action that results in impact. Continuous improvement of the models also happens here, since the business might come up with new ideas. In this phase, the data scientist isn’t involved anymore; it is the data engineer or a dedicated DevOps engineer alongside the program manager.

Eventually, once the project is done (I dislike “done” because in my opinion a project is never done), this entire process moves into a CI process.

This post is part of the “Big Data for Business” tutorial. Our focus was on Data DevOps and BizDevOps. In this tutorial, I explain various aspects of handling data right within a company. I also recommend reading about the concept of DevOps.

I use booking.com a lot for my bookings, but one thing that constantly bugs me are the e-mails after a booking, stating “The prices for [CITY YOU JUST BOOKED] just dropped again!”. Really, booking.com?!? It has happened to me several times already that I booked a hotel and some hours later received a message that the prices in this city had just dropped. This is a perfect example of how to do data science wrong.

How to do data science wrong

So, I am wondering whether this happens on purpose or rather by accident. If it were on purpose, I would like to question the purpose of it. I just booked a hotel and was sure that I got a good deal. But – sorry – you spent more 😉 No, I don‘t think so. I believe it is rather the opposite: an accident.

I expect that booking.com has issues with either data silos or the speed of its data. Either there is no connection between the ordering system and the campaigning system, so the data doesn‘t flow between those two systems at all, or the data flows too slowly. Since I only receive these messages some time after booking, I assume that booking.com‘s systems aren‘t built to handle this in real time.

You order something on booking.com; the system is probably optimised for bringing this order process through and for sending and receiving information from their (hotel) partners, but the data isn‘t updated in the CRM or marketing systems that create the ads. My assumption is that once you book a hotel, booking.com tracks that you looked at a specific city. This is then added to their user database and the marketing automation tool is updated.

However, the order process seems to be totally decoupled from this process and doesn‘t receive the data fast enough. Most likely, their marketing automation system is set to „aggressive“ marketing once you have looked up a city and sends recommendations often. This then leads to some discrepancy (or inconsistency) in their systems.

For me, this is also a great example of eventual consistency in database design. At some point, booking.com‘s systems will all be up to date, so they stop re-targeting you. However, the „eventual“ in their consistency comes very, very late 🙂

Let me know what experiences you had.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. All credits in here go to the fabulous booking.com!

… this is at least what I hear often. A lot of people working in the data domain state this to be “false but true”. Business units often press for quick and dirty data delivery and thus force IT units to deliver this kind of data in an ad-hoc manner, with a lack of governance and in bad quality. This ends up with business projects being carried out inefficiently and without a 360-degree view of the data. Business units often trigger this inefficiency themselves and projects fail as a result – more or less digging their own hole.

The issue with data governance is simple: you hardly see it in the P&L if you did it right. At least, you don’t see it directly. If your data is in bad shape, you might see it in other results, such as failing projects and bad outcomes in projects that use data. Often the business is blamed for bad results, even though the data was the weak point. It is therefore very important to apply a comprehensive data governance strategy in the entire company (and not just one division or business unit). Governance consists of several topics that need to be addressed:

What is data governance about?

  • Data Security and Access: data needs to stay secure and storages need to implement a high level of security. Access should be easy but secure. Data Governance should enable self-service analytics and not block it.
  • One common data storage: Data should be stored under the same standards across the company. A specific number of storages should cover all needs, and different storage techniques should be connected. No silos should exist
  • Data Catalog: It should be possible to see what data is available in the company and how to access it. A data catalog should make it possible to browse different data sources and see what is inside (as long as one is allowed to access this data)
  • Systems/Processes using data: There is a clear tracking of data access. If there are changes to data, it should be possible to see what systems and processes might be affected by it.
  • Auditing: An audit log should be available, especially to see who accessed data when
  • Data quality tracking: it should be possible to track the quality of datasets under specific items. These could be: accuracy, timeliness, correctness, …
  • Metadata about your data: Metadata about the data itself should be available. You should know what can be inside your data and your Metadata should describe your data precisely.
  • Master data: you should have a golden record about all your data. This is challenging and difficult, but should be the target

Achieving all of this is very complex, but it can be done if the company implements a good data strategy. There are many benefits to Data Governance.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

One topic every company is currently discussing at a high level is marketing automation and marketing data. It is a key factor in digitalising a company’s marketing approach. With Marketing Automation, we have the chance to make marketing much more precise and to the point: no more unnecessary marketing spend, every cent spent wisely, and no advertisement overload. So far, this is the promise from vendors, assuming we all lived in a perfect world. But what does it take to live in this perfect marketing world? DATA.

What is so hot on Marketing data?

One disclaimer upfront: I am not a marketing expert. Among other tasks, I try to enable marketing to achieve these goals by utilising our data. Data is the weak point in Marketing Automation. If you have bad data, you will end up with bad Marketing Automation. Data is the engine, or the oil, of Marketing Automation. But why is it so crucial to get the data right for it?

Until now, data was never seen as a strategic asset within companies. It was rather treated as something that has to be stored somewhere, so it ended up being stored in silos within different departments. This makes access hard and connections difficult. Governance was and is still neglected as well. When data scientists start to work with data, they often fight governance issues: what is inside the data, why is the data structured in a specific way, and what should the data tell us? This process often takes weeks to overcome and is expensive.

Some industries (e.g. banks) are more mature, but they are struggling with this as well. In the last years, a lot of companies built data warehouses to consolidate their view of the data. Data warehouses are heavily outdated and overly expensive nowadays, and most of them are still poorly structured. More recently, companies started to shift data to data lakes (initially Hadoop) to get a 360° view. Economically, this makes perfect sense, but a holistic customer model remains a challenge there as well. It takes quite some time and resources to build this.

The newest hype in marketing are Customer Data Platforms (CDPs). The value of CDPs isn’t proven yet, but most of them are an abstraction layer to make data handling easier for marketers. However, integrating data into a CDP is challenging in itself, and there is a high risk of creating yet another data silo.

In order to enable Marketing Automation with data, the following steps are necessary:

  • Get your data house in order. Build your data assets on open standards to change technology and vendor if necessary. Don’t lock in your data to one vendor
  • Do the first steps in small chunks, closely aligned with Marketing, in an agile way. Customer journeys are often tied to specific data sources, so a full-blown model isn’t necessary. However, make sure that the model stays extensible and the big picture is always available. A recommendation is to use a NoSQL document store for the model (a sketch of such a model follows this list).
  • Keep the data processing on the datalake, the abstraction layer (I call it Customer 360) interacts with the datalake and uses tools out of it
  • Do Governance in the first steps. It is too difficult to do it at a later stage. Establish a data catalog for easy retrieval, search and data quality metrics/scoring.
  • Establish a central identity management and household management. A 360 degree view of the customer helps a lot.
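As referenced in the list above, a document store lends itself to an extensible Customer 360 model. The following sketch shows what such a document could look like as a plain Python dictionary; all field names and values are invented for illustration:

```python
# Illustrative Customer 360 document; field names and values are invented.
customer_360 = {
    "customer_id": "C-1001",
    "identities": {
        "crm_id": "CRM-778",
        "web_shop_id": "WS-5521",
    },
    "household_id": "H-42",
    "profile": {
        "name": "Mario Meir",
        "email": "mario@example.com",
    },
    "kpis": {
        "churn_score": 0.12,                 # promoted from an analytical batch model
        "preferred_category": "luxury cars",
    },
    "consents": {"email_marketing": True},
}
```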

With Marketing Automation, we basically differentiate between two types of data (which is why a Lambda Architecture is my recommendation for it):

  • Batch data. This kind of data doesn’t change frequently, such as customer details. It also covers data from models that run on larger datasets and thus require time-series data. Analytical models run on this data are promoted as KPIs or fields in the C360 model
  • Event data. Data that needs to feed into Marketing Automation platforms fast. Once an event has happened, unnecessary ads should be removed (otherwise, you would lose money); a sketch of how this could fit together follows below
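A rough sketch of how the two paths could come together: batch results are merged into the Customer 360 documents, while events are forwarded quickly so that, for example, ads can be suppressed right after a purchase. The function names and the marketing platform client are assumptions, not references to any concrete product:

```python
# Hypothetical glue code for the two data paths; all names are illustrative.
def apply_batch_result(c360_store: dict, customer_id: str, kpis: dict) -> None:
    """Merge KPIs computed by analytical batch models into the C360 document."""
    document = c360_store.setdefault(customer_id, {})
    document.setdefault("kpis", {}).update(kpis)

def handle_event(marketing_platform, event: dict) -> None:
    """Forward an event fast enough that unnecessary ads can be removed."""
    if event.get("type") == "purchase":
        # suppress_ads() is an assumed method on a hypothetical client object.
        marketing_platform.suppress_ads(
            customer_id=event["customer_id"],
            category=event.get("category"),
        )
```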

What’s next?

This is just a high-level view, but handling data right for marketing is getting more and more important. And you need to get your own data in order – you can’t outsource this task.

Let me know what challenges you have had with this so far. As always, I am looking forward to discussing this with you 🙂

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you want to learn more about Marketing Automation, I recommend reading this article.

Agility is an important factor for Big Data applications. Agile data needs to fulfil three different agility factors: model agility, operational agility and programming agility. (Rys, 2011)

Data agility

Agile data: model agility

Model agility means how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems, such as non-relational databases, allow easy changes to the database. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases behind fast-changing systems such as social media applications, online shops and others require model agility. Updates to such systems occur frequently, often weekly to daily (Paul, 2012).
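To illustrate model agility, a key/value store such as DynamoDB accepts items with different attributes in the same table, so adding a new field requires no schema migration. A minimal sketch with boto3, assuming a hypothetical table named "users" with "user_id" as its partition key:

```python
import boto3

# Assumes an existing DynamoDB table named "users" with "user_id" as partition key.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("users")

# Items in the same table can carry different attributes; no ALTER TABLE needed.
table.put_item(Item={"user_id": "1", "name": "Mario"})
table.put_item(Item={"user_id": "2", "name": "Anna", "loyalty_tier": "gold"})
```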

Operational agility

In distributed environments, it is often necessary to change operational aspects of a system. New servers get added often, sometimes with different operating systems and hardware. Database systems should stay tolerant to operational changes, as this is a crucial factor for growth.

Programming agility

Database systems should support software developers. This is where programming agility comes into play. Programming agility describes the approach that the database and all associated SDKs should ease the life of a developer working with the database. Furthermore, it should also support fast development.

I hope you enjoyed this first part of the tutorial about big data technology. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorial.