Posts

In our previous tutorial intro, we outlined the four pillars that are relevant to data governance. In this post, I will go for a deeper dive into the data security and data privacy aspects of data governance.

What is data security?

Data security is all about securing the data against intrusions from the in- or outside of an organisation. Basically, it deals with hardening any systems that store data and making sure that data is only stored in a safe and secure way.

When dealing with data privacy, it comes in several layers:

  • Infrastructure: ensuring that the physical infrastructure is protected against any unwanted access. This starts with physical access control to servers and any devices associated with the organisation. This layer is only relevant when done on-premise.
  • Operating Systems and virtualisation: here it needs to be ensured that the operating system is in a secure state. If done on-premise, it requires both the host and the guest OS and the virtualisation software. When done in the cloud, it only applies to IaaS
  • Databases and Data Stores: any databases need to be constantly checked for vulnerabilities. If using any other stores such as object stores, they also need to be secured. This applies to on-premise and IaaS cloud solutions, but not to PaaS or SaaS cloud solutions
  • Application Security: When building a software on top of the previous stack, it is necessary to write this software in a secure manner. This applies to both on-prem and cloud. When using PaaS or SaaS solutions, it is the only relevant security challenge for companies implementing it. Therefore, it is highly important to look for a comprehensive security concept on this layer.

What if you ignore it?

Having issues with data security is a frequent failure of companies. There are a lot of examples of data leaks like with LinkedIn, Deutsche Telekom or Twitter. Almost nobody is secure and thus this block needs to be taken into consideration at the highest level when building a data strategy. Experts argue that it might not be a question when an intrusion happens. The only question might be how long the organisation needs to realise it and thus take counter-measures and minimise the damage.

A key recommendation (but not the only one) is to encrypt all data, so that it is more challenging to get full access.

What is data privacy?

Another important block is data privacy. This now deals more with the question on who can read or access the data within a company. Basically, algorithms and people should work with (pseudo) anonymised data whenever possible. Analysts or Data Scientists shouldn’t see any personal information within the data that they are dealing with. If we take a marketing campaign, the analysts working with the data should only see the minimum available data for them necessary to build the campaign. The marketing tool should then combine the results of their selection with the addresses of their target. There are several tools available that obfuscate personal identifiable data (PID) and thus make the work with it easier.

The above described is also called as the “need to know principle”. People should only see the data that they really need to know. When looking at how companies build their access rights to data, it is often built on a very individual basis. People ask for access, state why they need it and the data owner gives them access. However, this is rather manual and not necessarily fit for the new era of privacy.

A business driven role-based access model

A much better approach is to build on a role-based access model. By roles, it doesn’t necessarily mean Active Directory roles. It is more built on the business roles that users are in. For example, a role would be “Marketing Analyst”. This user would get access to specific data that he or she needs for the daily work. Access to all data that are relevant should be given, but nothing more than that. The roles in this approach should be clearly business focused and not technology-focused.

Another key aspect in data privacy is to understand who was accessing what data. It is necessary to store a comprehensive audit log about all data access and thus make data breaches trackable.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

Data Governance

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest into doing data science, but they often ignore Data Governance. Some month ago I wrote about this and shared my frustration about it. Now I’ve decided to go for a more pragmatic approach and describe what Data Governance is all about. This should bring some clarity into the topic and reduce emotions.

Why is Data Governance important?

It is important to keep a certain level of quality in the data. Making decisions on Bad Data Quality leads to bad overall decisions. Data Governance efforts are increasing exponentially when not done in the very beginning of your Data Strategy.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security is often slowing down business implementations
  • Initial investments are necessary – that don’t show value for month to years
  • Benefits are only visible “on top” of governance – e.g. with faster business results or better insights and thus it is not easy to “quantify” the impact
  • Data Governance is often considered as “unsexy” to do. Everybody talks about data science, but nobody about data governance. In fact, Data Scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open Source doesn’t focus too much on it, as there is less “buzz” around it than AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are rather designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

4 data governance pillars
The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal in here is to keep the data secure against external access. It is built on encryption, access management and accessibility. Often, a Roles-based access is defined in this process. A typical definition in here is privacy and security by design
  • Data Quality Management: In this pillar, different measures for Data Quality are defined and tracked. Typically, for each dataset, specific quality measures are looked after. This gives data consumers an overview of the data quality.
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company assets. A typical sample here is a Data Catalog, that shows all available company data to end users.
  • Master Data Management: master data is the common data of the company – e.g. the customer data, the data of suppliers and alike. Data in here should be of high quality and consistent. One physical customer should occur exactly as one person and not as multiple persons

For each of the above mentioned pillars, I will write individual articles over the next weeks.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

… this is at least what I hear often. A lot of people working in the data domain state this to be “false but true”. Business units are often pressing data delivery to be dirty and thus force IT units to deliver this kind of data in an ad-hoc manner with a lack of governance and in bad quality. This ends up having business projects being carried out inefficient and with a lack to a 360 degree view on the data. Business units often trigger inefficiency in data and thus projects fail – more or less digging their own hole.

The issue about data governance is simple: you hardly see it in P&L if you did it right. At least, you don’t see it directly. If your data is in bad shape, you might see it from other results such as failing projects and bad results in projects which use data. Often business in the blamed for bad results – even though the data was the weak point. It is therefore very important to apply a comprehensive data governance strategy in the entire company (and not just one division or business unit). Governance consists of several topics that need to be addressed:

What is data governance about?

  • Data Security and Access: data needs to stay secure and storages need to implement a high level of security. Access should be easy but secure. Data Governance should enable self-service analytics and not block it.
  • One common data storage: Data should stored under same standards in the company. A specific number of storages should cover all needs and different storage techniques should be connected. No silos should exist
  • Data Catalog: It should be possible to see what data is available in the company and how to access it. A data catalog should make it possible to browse different data sources and see what is inside (as long as one is allowed to access this data)
  • Systems/Processes using data: There is a clear tracking of data access. If there are changes to data, it should be possible to see what systems and processes might be affected by it.
  • Auditing: An audit log should be available, especially to see who accessed data when
  • Data quality tracking: it should be possible to track the quality of datasets under specific items. These could be: accuracy, timeliness, correctness, …
  • Metadata about your data: Metadata about the data itself should be available. You should know what can be inside your data and your Metadata should describe your data precisely.
  • Master data: you should have a golden record about all your data. This is challenging and difficult, but should be the target

Achieving this is very complex but can be achieved if the company is implementing a good data strategy. There are many benefits for Data Governance.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.