Data comes from many different sources, and which ones matter differs a lot by company and industry. However, to build a comprehensive view of your company, your own data alone isn’t enough. There are several other data sources you should consider.

The three data sources

Data you already have

The first data source, the data you already have, seems to be the easiest. However, it isn’t as easy as you might believe. Getting your data in order is actually a very difficult task and can’t be achieved that easily. I’ve written several blog posts here about the challenges around data, and you can review them. Basically, all of them focus on your internal data sources. I won’t restate them in detail here, but they are mainly about data governance and access.

Data that you can acquire

The second data source, data you can acquire, is another important aspect. By “acquire” I basically mean everything for which you don’t have to pay an external party as a data provider. You might run surveys (and pay for running them) or get the data from open data platforms. You might also collect data from social media or with other kinds of crawlers. This data source is very important, as it can give you a good overview of and insights into your specific questions.

In the past, I’ve seen a lot of companies using this second source, and we did a lot of work on it ourselves. You don’t necessarily have to pay for this kind of data, since some sources are free. And if you do pay for something, you don’t pay for the data itself but rather for the (semi-)manual way of collecting it. Here too, it differs heavily from industry to industry and depends on what the company is all about. I’ve seen companies collecting data from news sites to get insights into their competition and their own mentions, or simply scanning social media. A lot is possible with this kind of data source.

Data you can buy

The last one, data you can buy, is easy to get but very expensive in terms of cash-out. There are a lot of data providers selling different kinds of data. Often, it is demographic data or data about customers. Some platforms collect data from a large number of online sites and thus track individuals and their behavior across different sites. Such platforms then sell this data to marketing departments that want more insights. You, too, can buy this kind of data from those platforms and thus enrich your own first-party and second-party data. Imagine you are operating a retail business selling all kinds of furniture.

You would probably not know much about your web shop visitors, since they are anonymous until they buy something. With data bought from such providers, it would now be possible for you to figure out whether an anonymous visitor is an outdoor enthusiast. You might then adjust your offers to best match his or her interests. Or you might learn that the person visiting your shop recently bought a countryside house with a garden. You might then adjust your offers to present garden furniture or barbecue accessories. With this kind of third-party data, you can achieve a lot and better understand your customers and your company.
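To make the enrichment idea a bit more tangible, here is a minimal Python sketch. All record layouts, field names such as `recently_bought_house_with_garden`, and the offer rules are assumptions made up for illustration; a real data provider will deliver its data in its own format.

```python
# Hypothetical first-party data: what the web shop itself knows about visitors.
visitors = [
    {"visitor_id": "v1", "last_page": "/sofas"},
    {"visitor_id": "v2", "last_page": "/tables"},
]

# Hypothetical third-party data, keyed by the same visitor ID (an assumption).
purchased_segments = {
    "v1": {"interests": ["outdoor"], "recently_bought_house_with_garden": True},
}


def enrich(visitor, segments):
    """Return a copy of the visitor record with third-party attributes added."""
    extra = segments.get(visitor["visitor_id"], {})
    return {**visitor, **extra}


def suggest_offer(profile):
    """Pick an offer category from the enriched profile (illustrative rules)."""
    if profile.get("recently_bought_house_with_garden"):
        return "garden furniture"
    if "outdoor" in profile.get("interests", []):
        return "outdoor accessories"
    return "general catalogue"


enriched = [enrich(v, purchased_segments) for v in visitors]
offers = {p["visitor_id"]: suggest_offer(p) for p in enriched}
```

Visitors without a match in the purchased data simply keep their first-party attributes and fall back to the general offer.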

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you are looking for open data, I would recommend browsing some open data catalogs, like the open data catalog from the U.S. government.

To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that exist there. All of them are users of data, but with different needs and demands. In my opinion, they differ mainly in their level of expertise. Basically, I see three different user types for data access within a company.

Data access on 3 different levels

Figure: Three degrees of Data Access

Basically, the user types differ in how deeply they work with data and in how many users of each type there are. Let’s start with the bottom of the pyramid: business users.

Business Users

The first layer is the business users. These are users who need data for their daily decisions but are rather consumers of the data. They look at different reports to make decisions on their business topics. They could be in Marketing, Sales or Technology, depending on the company itself. Initially, these users would work with pre-defined reports, but in the long run they would rather go for customized reports. Self-service BI is a great fit for that. These users are experienced in interpreting data for their business goals and in asking questions of their data. This could be about reviewing the performance of a campaign, weekly or monthly sales reports, and so on. They create a huge load on the underlying systems without understanding the implementation and complexity underneath, and they don’t have to. From time to time, they start digging deeper into their data and thus become power users, our next level.

Power Users

Power users often emerge from business users. A power user is typically a person who is close to the business and understands the needs and processes around it. However, power users also have a good technical understanding (or gained it in the process of becoming power users). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (even in the same department) on solving business questions. They also work closely with data engineers on accessing data sources and integrating new ones, and they use self-service analytics tools to get a basic level of data science done. However, they aren’t data scientists, although they might move in this direction if they invest significant time in it. This brings us to the next level: the data scientists.

Data access for Data Scientists

This is the top level of our pyramid. Data scientists aren’t in the majority; there are many more business users and power users. However, they work on more challenging topics than the previous two groups. They also work closely with power users and business users. They might still be in the same department, but not necessarily. They work with advanced tools such as R and Python, and they fine-tune the models the power users built with self-service analytics tools or translate the business questions raised by the business users into algorithms.

Often, those three groups develop in different directions. However, all of them need to work together as a team to make data projects a success. With data access, it is also necessary to incorporate role-based access controls.
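As a rough illustration of role-based access control for the three user types, consider the following Python sketch. The role names, the actions and the mapping between them are my own assumptions for this example; a real platform would manage them in its identity or data governance tooling.

```python
# Hypothetical mapping of the three user types to permitted actions.
ROLE_PERMISSIONS = {
    "business_user": {"view_report"},
    "power_user": {"view_report", "run_sql_query"},
    "data_scientist": {"view_report", "run_sql_query", "train_model"},
}


def is_allowed(role, action):
    """Return True if the given role may perform the action."""
    # Unknown roles get an empty permission set, so access is denied by default.
    return action in ROLE_PERMISSIONS.get(role, set())
```

Denying everything for unknown roles is a deliberate choice here: a forgotten role then leads to a visible error rather than to silent over-permissioning.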

Honestly, data scientists are doing a great job. They are literally saving all industries from a steep decline. And those heroes are doing all of that alone. Alone? Not quite.

The Data Scientist needs the Data Engineer

There are some poor guys who support their success: the ones called data engineers. The vast majority of tasks are carried out by these guys (and girls), whom hardly anyone talks about. All the fame seems to go to the data scientists, while the data engineers don’t receive any credit.

I remember one of the many meetings I had with C-level executives. When I explained the structure of a team dealing with data, everyone in the boardroom agreed: “We need data scientists.” Then one of the executives raised the question: “But what are these data engineers about? Do we really need them, or could we maybe have more data scientists instead?”

I kept on explaining, and they accepted it. But I had the feeling that they still wanted to end up with more data scientists than data engineers. This basically comes from the trend and hype we see around data scientists. Everyone knows that they are important. But data-driven projects only succeed when a team with mixed skills and know-how comes together.

A Data Science team needs at least the same number of Data Engineers

None of the data-driven projects I have seen so far would have worked without data engineers. They are relevant for many different things, but mainly, and in an ideal world, they work in close cooperation with data scientists. If a company’s data maturity is high, the data engineer prepares the data for the data scientist and then works with the data scientist again on putting the algorithm into production. I have seen a lot of projects where the latter didn’t work: the first step (data preparation) was successful, but the later step (automation) was never done.

But there are more roles involved. One of them, which is rather a specialization of the data engineer, is the data system engineer. This is often not a dedicated role but is carried out by data engineers. Here, we basically talk about infrastructure preparation and set-up for the data scientists or engineers. Another role is the data architect, who ensures a company-wide approach to data. And of course, there are data owners and data stewards.

I have stated it several times, but it is worth stating over and over again: data science isn’t a one-(wo)man show, it is ALWAYS a team effort.

Another interesting article about the data science team setup can be found here.

Data lifecycle management is a complex and important topic. Although storing data is getting cheaper over time in absolute terms, it is still important to build a data platform that stores data efficiently. By efficient, I mean both cost-wise and performance-wise. It is necessary to build a data architecture that allows fast access to data but also stores data cost-effectively. These two goals are somewhat conflicting, because cost-effective storage is often slow and thus won’t deliver much throughput. High-performance storage, in contrast, is often expensive to build. However, the real question is whether it is necessary to store all data in high-performance storage at all. Therefore, it is necessary to measure the value of your data and decide how much of it to keep in which kind of storage.

How to manage the data lifecycle

The data architect is in charge of storing data efficiently, both in terms of performance and cost. The architect also needs to take care of data lifecycle management. Some years ago, the answer was to put all relevant data into the data warehouse. Since this was too expensive for most data, data was put into HDFS (Hadoop) in recent years. But with the cloud, we now have more diverse options. We can store data in message buffers (such as Kafka), on HDFS systems (disk-based) and in cloud-based object stores. Especially the latter provide even more options. Starting out as general-purpose cloud storage, these stores have evolved over the last years into premium object stores (with high performance), general-purpose storage and cheap archive stores. This gives more flexibility to store data even more cost-effectively. Data would typically be demoted from in-memory storage (e.g. via instances on Kafka) or premium storage to general-purpose storage or even to archive stores. The data architect now has the possibility to store data in the most effective way (which makes a Kappa Architecture useless; cloud prefers Lambda!).

But this adds pressure to the data architect’s job. How would the data architect figure out the value of the data in order to decide where to store it? I recently came across a very interesting article introducing something called “the half-life of data”. Basically, this article describes how fast data loses value, which makes it easier to judge where to store the data. For those who want to read it, the article can be found here.

What is the half life of data?

The half-life of data basically categorises data into three different value types:

  • Strategic data: companies use this data for strategic decision making. The data still has high value after some days, so it should be easy and fast to access.
  • Operational data: the data still has some value after some hours but then loses value. It should be kept available for some hours up to a few days; then it should be demoted to cheaper storage.
  • Tactical data: the data has value only for some minutes up to a few hours. Value is lost fast, so it should either be stored in very cheap storage or even deleted.
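The three value types above can be turned into a simple placement rule. The following Python sketch is only an illustration: the tier names (`premium`, `general_purpose`, `archive`), the concrete time windows and the factor of ten used for demotion are assumptions of mine, not numbers taken from the article.

```python
from datetime import timedelta

# Hypothetical windows during which each value type stays in fast storage.
# The concrete numbers are assumptions for illustration only.
HOT_WINDOW = {
    "strategic": timedelta(days=7),
    "operational": timedelta(days=2),
    "tactical": timedelta(hours=2),
}


def choose_storage(value_type, age):
    """Return a storage tier name for data of the given value type and age."""
    window = HOT_WINDOW[value_type]
    if age <= window:
        return "premium"
    # Tactical data loses its value fast: move it to the cheapest tier
    # (or delete it) as soon as the hot window has passed.
    if value_type == "tactical":
        return "archive"
    # Strategic and operational data are demoted step by step.
    if age <= 10 * window:
        return "general_purpose"
    return "archive"
```

In practice, a rule like this would usually run as a scheduled job or be expressed as a lifecycle policy of the storage service itself.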

There is also an interesting infographic that illustrates the half-life of data.

What do you think? What is your take on it? How do you measure the value of your data? How do you handle your data lifecycle in your company?

One of the most important things in a data environment is to partition the data. This is especially important with large-scale systems, as not everything can be stored on a limited number of systems.

How to partition data?

Partitioning is another factor for Big Data applications. It is closely related to partition tolerance, one of the three properties in the CAP theorem (see 1.6.1), and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data applications, it is often not possible to store everything on one server (Josuttis, 2011).

Data Partitioning

The factors for partitioning illustrated in the figure “Partitioning” are described by Rys (2011). Functional partitioning basically describes the service-oriented architecture (SOA) approach (Josuttis, 2011). With SOA, different functions are provided by their own services. If we talk about a web shop such as Amazon, there are a lot of different services involved. Some services handle the order workflow, other services handle the search, and so on.

If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to losing money. However, building a service-oriented architecture alone doesn’t solve all partitioning problems. Therefore, data also has to be partitioned. With data partitioning, all data is distributed over different servers. These servers can also be distributed geographically.

A partition key basically identifies the partition a piece of data belongs to. Since there is a lot of data and single nodes may fail, it is necessary to partition data across the network. In addition, data should be replicated and stored redundantly in order to deal with node failures.
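A minimal Python sketch of how a partition key could be mapped to servers, including a simple form of replication, might look like this. The node names, the replica count and the hash-modulo placement are assumptions for illustration; production systems often use consistent hashing instead, so that adding a node does not reshuffle most of the keys.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical server names
REPLICAS = 2  # assumption: each record is stored on two nodes


def nodes_for(partition_key, nodes=NODES, replicas=REPLICAS):
    """Return the nodes that should hold the record with this partition key."""
    # A stable hash (unlike Python's built-in hash() for strings) gives the
    # same placement across processes and restarts.
    digest = hashlib.sha256(partition_key.encode("utf-8")).hexdigest()
    primary = int(digest, 16) % len(nodes)
    # Replicas go to the following nodes, wrapping around the end of the list.
    return [nodes[(primary + i) % len(nodes)] for i in range(replicas)]


placement = nodes_for("customer-42")
```

If the first node in `placement` fails, the record can still be read from the second one.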

I hope you enjoyed this part of the tutorial about data partitioning. It is part of the Big Data Tutorial, so make sure to read the other parts as well.

Scalable data processing is necessary for all platforms handling data. In today’s tutorial we will have a look at this.

Scalability is another factor of Big Data applications described by Rys (2011). Whenever we talk about Big Data, it mainly involves highly scalable systems. Each Big Data application should be built in a way that eases scaling. Rys (2011) describes several needs for scaling: user load scalability, data load scalability, computational scalability and scale agility.

What is scalable data?

The figure illustrates the different needs for scalability in Big Data environments as described by Rys (2011). Many applications, such as Facebook (Fowler, 2012), have a lot of users. Applications should support a large user base and stay robust against errors in case they see unexpectedly high user numbers. Various techniques can be applied to support different needs, such as fast data access. A factor that often, but not only, comes with a high number of users is the data load.

Rys (2011) describes that this data can be produced by a few or by many users. However, things such as sensors and other devices that do not directly relate to users can also produce large datasets. Computational scalability is the ability to scale to large datasets. Data is often analyzed, and this needs compute power on the analysis side. Distributed algorithms such as Map/Reduce need a lot of nodes in order to run queries and analyses with good performance.
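To give an idea of what a Map/Reduce computation looks like, here is a small single-process Python sketch of the map, shuffle and reduce phases using a word count. It is not the API of Hadoop or any other framework, and the two input chunks merely stand in for data that would live on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum up the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Two chunks stand in for data stored on two different nodes.
chunks = [["big data", "data load"], ["big users"]]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

Because each chunk is mapped independently, the map phase can run on as many nodes as there are chunks, which is where the need for many nodes comes from.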

Scale agility describes the ability to change the environment of a system. This basically means that new instances, such as compute nodes, can be added or removed on demand. This requires a high level of automation and virtualization and is very similar to what can be done in cloud computing environments. Several platforms, such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others, enable this level of self-service, which greatly supports scale agility in Big Data environments.
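The decision of when to add or remove instances can be sketched in a few lines of Python. This is a simplified, hypothetical rule based on average CPU utilisation; the function name, the 60 percent target and the limits are assumptions for illustration and do not reflect the autoscaling API of any of the platforms mentioned.

```python
def desired_instances(current, cpu_percent, target_percent=60, minimum=1, maximum=20):
    """Return how many instances to run so that average CPU moves toward the target."""
    if current < 1:
        raise ValueError("current must be at least 1")
    # Scale the instance count in proportion to observed versus target load,
    # rounding up with integer arithmetic to avoid floating-point surprises.
    wanted = -(-(current * cpu_percent) // target_percent)
    # Never go below the minimum or above the maximum fleet size.
    return max(minimum, min(maximum, wanted))
```

For example, four instances running at 90 percent CPU would be scaled out to six, while four instances at 30 percent would be scaled in to two.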

Agility is an important factor for Big Data applications. Agile data needs to fulfill three different agility factors: model agility, operational agility and programming agility (Rys, 2011).

Data agility

Agile data: model agility

Model agility describes how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems, such as non-relational databases, allow easy changes to the model. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases in fast-changing systems such as social media applications, online shops and others require model agility. Updates to such systems occur frequently, often weekly or even daily (Paul, 2012).
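The following Python sketch illustrates the idea of model agility with a schema-less key/value store. A plain dictionary stands in for the store here; this is not the DynamoDB API, and the item layout and attribute names are assumptions for illustration.

```python
# A plain dict stands in for a key/value store; this is not DynamoDB's API.
store = {}


def put_item(key, item):
    """Store an item; no schema is declared or enforced."""
    store[key] = dict(item)


def get_attribute(key, name, default=None):
    """Read an attribute, tolerating items written before it existed."""
    return store[key].get(name, default)


# Version 1 of the application only knows a name.
put_item("user-1", {"name": "Alice"})
# Version 2 adds a new attribute without migrating existing items.
put_item("user-2", {"name": "Bob", "newsletter": True})
```

The price of this flexibility is that the application code has to handle missing attributes itself, as `get_attribute` does with its default value.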

Operational agility

In distributed environments, it is often necessary to change operational aspects of a system. New servers get added often, and they may differ in aspects such as operating system and hardware. Database systems should stay tolerant of operational changes, as this is a crucial factor for growth.

Programming agility

Database systems should support software developers. This is where programming agility comes into play. Programming agility means that the database and all associated SDKs should ease the life of a developer who works with the database. Furthermore, they should also support fast development.

There are two main characteristics that data needs to fulfill: it needs to be transformable and filterable. In this tutorial, I will describe both.

Transformable Data

If data is transformable, it can be changed to a different format or layout. This could mean a format change, e.g. from binary to JSON or XML, as well as a totally new representation. If someone wants to look at a specific dataset (which, for instance, could be filtered), not all of the data might be interesting.

Let’s assume that a manager wants to filter for all customers younger than 18 in a specific district. The manager is probably not interested in the names of the customers but rather in their number. Instead of returning a huge list of names with addresses and the like, a single number is returned.

Or the online marketing department wants to target all customers who match specific criteria such as age. Here, the address might not be relevant, but names and e-mail addresses are. Transformability is also a necessary characteristic if data has to be exported to another database, e.g. for analytics.
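The three transformations described above (a count for the manager, a reduced view for marketing and an export in another format) can be sketched in Python as follows. The sample records, field names and the age criterion for the mailing list are made up for illustration.

```python
import json

customers = [  # hypothetical sample records
    {"name": "Ann", "email": "ann@example.com", "age": 17, "district": "North", "address": "1 Main St"},
    {"name": "Ben", "email": "ben@example.com", "age": 34, "district": "North", "address": "2 Main St"},
    {"name": "Cleo", "email": "cleo@example.com", "age": 16, "district": "South", "address": "3 Main St"},
]

# Manager's view: a single number instead of a list of names and addresses.
minors_in_north = sum(1 for c in customers if c["age"] < 18 and c["district"] == "North")

# Marketing's view: only names and e-mail addresses; the address is dropped.
mailing_list = [{"name": c["name"], "email": c["email"]} for c in customers if c["age"] >= 18]

# Export view: the same records in another format (JSON), e.g. for analytics.
exported = json.dumps(mailing_list)
```

All three results come from the same underlying records; only the representation changes.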

Filterable Data

Filterability is a key characteristic of datasets. Analytics software uses filtering frequently, and it is absolutely necessary, since most analytics simply don’t run on all data but rather on selected data. In databases, filtering is often expressed with “Select … Where” clauses.

Most of what filtering is good for was already discussed under “Transformable Data”; however, it is still worth going into more detail. If we analyze data, it is often necessary to work on specific datasets.

Imagine a Google search query where you search for “Big Data”. All data within Google’s index gets filtered for exactly these words, and a consolidated list is returned. If the online marketing department mentioned under “Transformable Data” wants a list of customers in a specific area, this list is also filtered based on the zip code or other geographical data. Hence, it is an important characteristic of data to support filtering.
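The zip code example can be written as an actual “Select … Where” query. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database; the table layout, the sample rows and the zip codes are assumptions for illustration.

```python
import sqlite3

# An in-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT, zip_code TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Ann", "ann@example.com", "1010"),
        ("Ben", "ben@example.com", "8010"),
        ("Cleo", "cleo@example.com", "1010"),
    ],
)


def customers_in_zip(zip_code):
    """Return (name, email) rows for one zip code via SELECT ... WHERE."""
    cursor = conn.execute(
        "SELECT name, email FROM customers WHERE zip_code = ? ORDER BY name",
        (zip_code,),
    )
    return cursor.fetchall()


selected = customers_in_zip("1010")
```

Note that the query filters (the `WHERE` clause) and transforms (only `name` and `email` are selected) in one step, which shows how closely the two characteristics are related.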
