
Big Data is a disruptive technology that is changing major industries from the inside. In the next posts, we will look at how Big Data changes different industries.
Today’s focus: Big Data in Manufacturing.
Manufacturing is a traditional industry relevant to almost every country in the world. It emerged during the industrial revolution, when machines took over and production became increasingly automated. Big Data has the potential to substantially change the manufacturing industry once again, and it brings various opportunities.
Manufacturers can use Big Data in several ways. First, it is all about quality. In production chains, whether they produce cars or simple metal works, quality is key. Who wants to buy a broken car? Exactly, nobody. Improving quality is therefore a central use case of Big Data for manufacturers, and it involves several aspects. First of all, data must be collected about the production line(s) and all devices that are connected or connectable. When errors occur or a product is not as desired, the production data can be analyzed and reviewed; this is where data scientists do great work. Real-time analytics then allow the company to improve material and product quality, for example by analyzing images of products or materials and removing items from the production line if they don't meet certain standards.
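As a small illustrative sketch (not from the original text; the part names, dimensions and tolerance are made up), such a quality gate could compare measured values against a specification and flag parts for removal:

# Minimal sketch of a quality gate: flag parts whose measured diameter
# falls outside the specified tolerance. All values are hypothetical.
SPEC_DIAMETER_MM = 50.0
TOLERANCE_MM = 0.2

measurements = [
    {"part_id": "A-001", "diameter_mm": 50.05},
    {"part_id": "A-002", "diameter_mm": 50.31},  # out of spec
    {"part_id": "A-003", "diameter_mm": 49.97},
]

def out_of_spec(measurement):
    deviation = abs(measurement["diameter_mm"] - SPEC_DIAMETER_MM)
    return deviation > TOLERANCE_MM

rejects = [m["part_id"] for m in measurements if out_of_spec(m)]
print("Remove from production line:", rejects)  # ['A-002']
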
A key challenge in manufacturing today is the high degree of product customization. When buying a new car, Henry Ford's words (you can have the Model T in any color as long as it is black) no longer apply. Whatever type of product customers order, they expect it to reflect their own personality. If a company fails to deliver that, it risks losing customers. But what is the connection to Big Data? This customization is a strong shift towards Industry 4.0, which is heavily promoted by the German industry. In order to make products customizable, it is necessary to have an automated production line and to know what customers might want, for example by analyzing recent sales and trends from social networks and the like.
Changing the output of a production line is often difficult and inefficient. Big Data analytics allows manufacturers to better understand future demand and to reduce production peaks. This enables the manufacturer to plan better, act better in the market and become more efficient.

In order to analyze data correctly, a high level of data quality is necessary. In this tutorial, we will look at how to achieve this.

What are the data quality attributes?

When data is stored, certain quality attributes must be fulfilled. Heinrich and Stelzer (2011) define a set of data quality attributes that data should meet; a small code sketch after the list illustrates how some of them can be checked.

Data quality attributes

  • Relevance. Data should be relevant to the use-case. If a query looks up all users of a web portal interested in “luxury cars”, all of these users should be returned, and it should be possible to derive some value from this data, e.g. for advanced marketing targeting.
  • Correctness. Data has to be correct. If we again query for all existing users of a web portal interested in luxury cars, the returned data should be correct: it should really represent people interested in luxury cars, and fake entries should be removed.
  • Completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can actually reach them, e.g. by e-mail. If the e-mail field, or any other field we want to use for targeting, is blank, the data is not complete for our use-case.
  • Timeliness. Data should be up to date. A user might change their e-mail address after a while, and our database should reflect these changes whenever and wherever possible. If we target our users for luxury cars, it is of little use if only 50% of the users’ e-mail addresses are still valid. We might have “big data”, but the data is not reliable because it hasn’t been updated for a while.
  • Accuracy. Data should be as accurate as possible. Web site users should have the possibility to specify, “Yes, I am interested in luxury cars” instead of defining their favorite brand (which could be done additionally). If the users have the possibility to select a favorite brand, it might be accurate but not accurate enough. Imagine someone selects “BMW” as favorite brand. BMW could be considered as luxury car but they also have different models. If someone selects BMW just because one likes the sport features, the targeting mechanism might hit the wrong people.
  • Consistency. This shouldn’t be confused with the consistency requirement of the CAP theorem (see next section). Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistent data, which is a frequent problem in large web portals such as Facebook (Kelly, 2012).
  • Availability. Availability states that data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem; in this case, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The algorithm querying the data should be as good as possible at retrieving all available data, and there should be easy-to-use tools and languages to retrieve it. Normally, each database provides a query language such as SQL, or O/R mappers for developers.
  • Understandability. It should be easy to understand the data. If we query our database for people interested in luxury cars, we should easily understand what the returned data is about and be able to work with it in our favorite tool. The data should be self-describing so that we know how to handle it; if it contains a “zip” column, we know that this is the ZIP code of the individual user.
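As a small, hedged sketch (the records and rules below are invented for illustration), some of these attributes, such as completeness, timeliness and consistency, can be checked programmatically:

from datetime import datetime, timedelta

# Hypothetical user records from a web portal.
users = [
    {"id": 1, "email": "a@example.com", "interest": "luxury cars", "updated": "2012-11-01"},
    {"id": 2, "email": "",              "interest": "luxury cars", "updated": "2012-12-01"},
    {"id": 3, "email": "a@example.com", "interest": "budget cars", "updated": "2010-01-15"},
]

now = datetime(2012, 12, 31)
stale_after = timedelta(days=365)

incomplete = [u["id"] for u in users if not u["email"]]                       # completeness
stale = [u["id"] for u in users
         if now - datetime.strptime(u["updated"], "%Y-%m-%d") > stale_after]  # timeliness
emails = [u["email"] for u in users if u["email"]]
duplicates = {e for e in emails if emails.count(e) > 1}                       # consistency

print("incomplete:", incomplete)          # [2]
print("stale:", stale)                    # [3]
print("duplicate e-mails:", duplicates)   # {'a@example.com'}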

I hope you enjoyed this part of the tutorial about Big Data technology. It is part of the Big Data Tutorial series; make sure to read the other parts as well.

Big Data analysis requires an iterative approach. A famous (fictional) example of analyzing a lot of data is “The Hitchhiker’s Guide to the Galaxy”, in which a supercomputer is asked for the answer to the Ultimate Question of Life, the Universe, and Everything. As this is a rather difficult problem to solve, iteration is necessary.
(Bizer, Boncz, Brodie, & Erling, 2011) describe some iteration steps for creating Big Data analysis applications. Five steps are mentioned in the paper: Define, Search, Transform, Entity Resolution and Answer the Query.
Define deals with the problem that needs to be solved. This is when the marketing manager asks: “we need to find a location in county ‘xy’ where customers are between 18 and 30 years old and we have no store yet”. In our initial “Hitchhiker’s Guide to the Galaxy” example, this would be the question about the answer to everything.
Next, we identify candidate elements in the Big Data space; (Bizer, Boncz, Brodie, & Erling, 2011) call this “Search”. In the marketing example, this means scanning the data of all users between 18 and 30 and combining it with store locations. In the Hitchhiker’s Guide example, it means scanning all data, since we are trying to find the answer to everything.
Transform means that the identified data has to be “extracted, transformed and loaded” into appropriate formats. This is part of the preparation phase: data is extracted from different sources and transformed into a single format that can be used for the analysis. In the marketing example, we need to combine sources from the government with our own customer data, and we also need map data; all of this is then stored in our database for analysis. It is more complicated for the Hitchhiker’s problem: since we need to analyze ALL data available in the universe, we simply can’t copy it to a new system, and the analysis has to be done on the systems where the data is stored.
After the data elements are prepared, we need to resolve entities. In this phase, we ensure that data entities are unique, relevant and comprehensive. In the marketing example, this means resolving all entities with an age between 18 and 30. In the Hitchhiker’s problem, we can’t resolve entities: we need the answer to everything and can’t afford to exclude any data.
In the last step, the data is finally analyzed; (Bizer, Boncz, Brodie, & Erling, 2011) describe this as “answer the query”. Big Data analysis usually needs many nodes to compute results from the available datasets. In the marketing example, we would compare the resolved datasets with our store locations; the result would be a list of counties where the condition is fulfilled and no store exists yet. In the Hitchhiker’s example, we would analyze all data and look for the ultimate answer.
The following figure summarizes the five steps for Big Data analysis described above.

Data iteration
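As an illustration of the marketing example (not from the cited paper; the data and field names are made up), the five steps can be sketched as a small pipeline:

# Hypothetical data for the marketing example: users and existing store locations.
users = [
    {"name": "Alice", "age": 25, "county": "xy", "email": "alice@example.com"},
    {"name": "Bob",   "age": 42, "county": "xy", "email": "bob@example.com"},
    {"name": "Carol", "age": 22, "county": "ab", "email": "carol@example.com"},
]
stores = [{"county": "ab"}]  # counties where we already have a store

# Define: find counties with customers aged 18-30 where we have no store yet.

def search(users):
    # Search: identify candidate elements in the data space.
    return [u for u in users if 18 <= u["age"] <= 30]

def transform(candidates):
    # Transform: reduce each record to the fields needed for the analysis.
    return [{"county": u["county"], "email": u["email"]} for u in candidates]

def resolve(records):
    # Entity resolution: keep one entry per unique e-mail address.
    seen, unique = set(), []
    for r in records:
        if r["email"] not in seen:
            seen.add(r["email"])
            unique.append(r)
    return unique

def answer(records, stores):
    # Answer the query: counties with matching customers but no store.
    covered = {s["county"] for s in stores}
    return sorted({r["county"] for r in records} - covered)

print(answer(resolve(transform(search(users))), stores))  # ['xy']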

Data is often stored in one system, but the analytical systems are often somewhere else. In this tutorial, we will look at the challenges of moving data for analysis.

Moving data for analysis

Another issue with Big Data is pointed out by (Alexander, Hoisie, & Szalay, 2011): data can’t be moved easily for analysis. With Big Data, we often deal with terabytes or more, and moving that amount over a network connection is difficult or even impossible.
If real-time data is analyzed, moving that amount of data to another cluster is practically impossible, since the data would be outdated or not yet available by the time it arrives. Real-time analysis is also necessary for fraud protection; if the data first has to be moved to another cluster, it might already be too late.
In traditional databases, this wasn’t much of a problem, since a single database often held only some gigabytes. With Big Data, data comes in various formats, at high volume and at high velocity. Under these conditions, moving the data to another cluster might simply not be possible.
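A rough back-of-the-envelope calculation (with made-up volumes and idealized, sustained throughput, ignoring protocol overhead) shows why:

# How long does it take to move a dataset over a network link?
# Idealized: sustained line rate, no protocol overhead, no contention.
def transfer_hours(terabytes, gigabit_per_second):
    bits = terabytes * 8 * 10**12          # decimal terabytes to bits
    seconds = bits / (gigabit_per_second * 10**9)
    return seconds / 3600

for tb in (1, 50):
    for gbps in (1, 10):
        print(f"{tb:3d} TB over {gbps:2d} Gbit/s: {transfer_hours(tb, gbps):6.1f} h")
# 50 TB over 1 Gbit/s takes roughly 111 hours, i.e. more than four days.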

What are the challenges?

(Alexander, Hoisie, & Szalay, 2011) describe some factors that make moving data to another cluster challenging: high-flux data, structured and unstructured data, real-time decisions and data organization.
High-flux data is data that arrives in real time. If this data must be analyzed, the analysis also has to happen in real time, because the data might be gone or modified later. In Big Data applications, data arrives both structured and unstructured.
Decisions on data must often be made in real time. If there is a stream of financial transactions, an algorithm must decide in real time whether a transaction needs more detailed analysis. If not all data is stored, an algorithm must also decide what to keep. Data organization is another challenge when it comes to moving data.
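As a minimal sketch (the threshold, fields and whitelist are made up, not taken from the cited paper), such a real-time decision can be a simple filter applied to each incoming transaction before it is stored or discarded:

# Decide on the fly which transactions deserve a detailed (and expensive) analysis.
SUSPICIOUS_AMOUNT = 10_000          # hypothetical threshold in EUR
TRUSTED_COUNTRIES = {"AT", "DE"}    # hypothetical whitelist

def needs_detailed_analysis(tx):
    return tx["amount"] >= SUSPICIOUS_AMOUNT or tx["country"] not in TRUSTED_COUNTRIES

stream = [
    {"id": 1, "amount": 50,     "country": "AT"},
    {"id": 2, "amount": 12_500, "country": "AT"},
    {"id": 3, "amount": 300,    "country": "US"},
]

for tx in stream:                   # in practice this would be an unbounded stream
    if needs_detailed_analysis(tx):
        print("flag for detailed analysis:", tx["id"])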


Along with the consistency of data comes another challenge: data concurrency. This tutorial describes what that means.

What are the challenges with data concurrency?

Data needs to be partitioned if it can’t be stored on a single system. With Big Data applications, we are not talking about small storage but about distributed systems. Data might be partitioned over hundreds or thousands of nodes, and the database must scale out to that demand.

Data partitioning is a key concept for databases and it serves Big Data applications well. However, if data is distributed over several servers, it might take a while until all nodes are informed about changes.

To avoid concurrency issues, data must be locked. This can result in poor database performance if the database is to be kept consistent at all times. One solution is to give up data consistency in favor of data partitioning. This approach is described in more detail in a later section, where we focus on the CAP theorem.

How does this play out?

Let’s imagine a web shop with two users, User A and User B, who both want to buy product P. There is exactly one item in stock. User A sees this and proceeds to checkout, and so does User B. They complete their orders at about the same time.

The database in our example is designed so that partitioning is preferred over consistency, and both users get the acknowledgement that their order was processed. Now we have -1 items in stock, because no database trigger or other mechanism told us that we had run out. We either have to tell one user to “forget” the order or find a way to deliver the item to both users.
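The underlying race is a check-then-act problem. The following is a minimal single-process sketch (in reality the race spans distributed replicas, but the pattern is the same); the artificial delay just makes the race reproducible:

import threading
import time

stock = 1  # exactly one item of product P

def place_order(user):
    global stock
    if stock >= 1:              # both users still see one item in stock
        time.sleep(0.1)         # artificial delay: replication / processing lag
        stock -= 1              # both decrements get applied
        print(f"{user}: order confirmed")
    else:
        print(f"{user}: out of stock")

threads = [threading.Thread(target=place_order, args=(u,)) for u in ("User A", "User B")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("items in stock:", stock)  # typically -1: the shop has oversold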

In any case, one user might get angry. Some web shops solve this issue in a non-technical way: they tell the user “sorry, we are unable to deliver in time” and give them the option to cancel the order or take a voucher. There is no simple technical solution to this.

How to solve data concurrency issues?

In most cases, it will cost the company money. If the web shop used a system built for consistency, it might run into database outages, and users might not buy products because the web site is simply “not available”. The web shop can lose money either through users that were unable to buy products because of delays in the database, or through consistency issues.

In the case of a web shop outage, users might not return and buy products, since they are annoyed about the “bad performance of the website” and the “inability to process the order”. In contrast, people tend to return and buy other products if they get a voucher for issues caused by data partitioning and concurrency.


Another challenge for Big Data arises from the use of different storage systems. This creates a lot of variety in the data and thus increases complexity. In this tutorial, we will discuss this.

What are the problems of different storage systems?

A main factor of Big Data is the variety of data. Data may not only change over time (e.g. a web shop that starts out selling books and later also sells cars) but also comes in different formats. Databases must support this.
Companies often don’t store all their data in one single database but in several different databases, and different APIs consume different formats such as JSON, XML or other types. Facebook, for instance, uses MySQL, Cassandra and HBase to store its data (Harris, 2011) (Muthukkaruppan, 2010): three different storage systems, each serving a different need.
(Helland, 2011) describes the challenges for datastores with four key principles:

  • unlocked data
  • inconsistent schema
  • extract, transform and load
  • too much to be accurate

What are these aspects about?

By unlocked data, (Helland, 2011) means that data is traditionally locked for updates, but Big Data applications don’t rely on locked data, and working with unlocked data can lead to semantic changes in a database. With inconsistent schema, (Helland, 2011) describes the challenge of data coming from different sources and in different formats: the schema needs to be somewhat flexible to deal with extensibility. As stated earlier, businesses change over time, and so does the data schema.
Extract, transform and load is very specific to Big Data systems, since data comes from many different sources and needs to be brought into a specific target system. Too much to be accurate outlines the “velocity” problem of Big Data applications: by the time a result is calculated, the underlying data might already have changed. (Helland, 2011) states that you may not be able to be accurate at all and can only approximate results.
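To make the “inconsistent schema” and “extract, transform and load” points concrete, here is a small sketch (the sources and field names are invented) that normalizes records from two differently shaped sources into one common format:

# Records about the same kind of entity, delivered by two hypothetical sources
# with different field names and optional fields.
source_a = [{"user_id": 1, "mail": "alice@example.com", "interest": "luxury cars"}]
source_b = [{"id": "u-2", "email": "bob@example.com"}]  # no interest recorded

def normalize(record):
    # Map whichever fields are present onto a common target schema.
    return {
        "id": str(record.get("user_id", record.get("id"))),
        "email": record.get("mail") or record.get("email"),
        "interest": record.get("interest"),  # may legitimately be missing
    }

unified = [normalize(r) for r in source_a + source_b]
print(unified)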


Big Data needs big storage, and storage is, in the end, a physical device. Until now, most storage devices have been hard disks that require mechanical movement. In this tutorial we will discuss the storage performance challenges.

What are the storage performance challenges?

A fast hard drive available today (December 2012) runs at 15,000 revolutions per minute (rpm) (Seagate, 2013), while a desktop hard drive runs at around 7,200 rpm. In any case, this means there is significant latency until the read head is in place. The mechanical approach to storage has been around for decades, and scientists as well as engineers complain about storage performance.

In-memory access has always been faster than hard disk storage, and network speed is higher than what hard disks can deliver. (Anthes, 2012) states that disk-based storage is about 10-100 times slower than the network and about 1,000 times slower than main memory. This means there is a significant bottleneck when delivering data from disk-based storage to an application.

As Big Data is about storing and analyzing data, this is a major challenge for Big Data applications. It doesn’t help much to have enough compute power to analyze data if the disks simply can’t deliver the data fast enough.

Data is distributed

When we look at today’s supercomputers, they are usually measured in cores and teraflops (Top 500 Supercomputers Site, 2012). This is fine for heavy computation such as analyzing the human genome, but it doesn’t tell us anything about disk performance when we want to store or analyze data. (Zverina, 2011) cites Allan Snavely, who proposes including disk performance in such metrics as well:

“I’d like to propose that we routinely compare machines using the metric of data motion capacity, or their ability to move data quickly” – Allan Snavely

Allan Snavely also stated that with increasing data sizes it is getting harder to find data: hard disks keep growing in capacity while access times stay the same.

This can be illustrated easily: you have an external hard disk with a capacity of 1 TB, operating at 7,200 rpm with a 16 MB cache. There are 1,000 videos stored on this hard drive, each 1 GB in size, filling the entire disk. As your video collection grows, you switch to a larger 2 TB system.

Once this system is full, you won’t be able to transfer the videos to another system in the same time as with the 1 TB drive; the 2 TB system will need roughly twice as long to transfer the data. While compute performance keeps growing, the performance for accessing data stays about the same, and given the growth of data and storage capacity, it effectively gets slower. Allan Snavely (Zverina, 2011) describes this with the following statement:

“The number of cycles for computers to access data is getting longer – in fact disks are getting slower all the time as their capacity goes up but access times stay the same. It now takes twice as long to examine a disk every year, or put another way, this doubling of capacity halves the accessibility to any random data on a given media.”
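A rough calculation (with an assumed, typical sequential throughput; the numbers are illustrative and not from the cited article) shows the effect: if capacity doubles but throughput stays roughly the same, the time to read the whole disk doubles as well.

# Time to read an entire disk sequentially at a (roughly constant) throughput.
def full_scan_hours(capacity_tb, throughput_mb_per_s=150):  # ~150 MB/s assumed
    megabytes = capacity_tb * 10**6          # decimal TB to MB
    return megabytes / throughput_mb_per_s / 3600

for tb in (1, 2, 4):
    print(f"{tb} TB at 150 MB/s: {full_scan_hours(tb):.1f} h for a full scan")
# 1 TB ~ 1.9 h, 2 TB ~ 3.7 h, 4 TB ~ 7.4 h: capacity doubles, scan time doubles.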

How to overcome these challenges?

In the same article, Snavely suggests including DRAM, flash memory and disk capacity in a computer’s performance metrics.

But what can enterprises do to achieve higher throughput in their systems? There is already some research on this, and most resources point towards solid state disks (SSDs) as storage. SSDs are becoming commodity hardware in high-end personal computers, but they are not that common for servers and distributed systems yet.

SSDs normally offer better performance but less disk space, and the price per GB is higher. For large-scale databases that need performance, SSDs might be the better choice. The San Diego Supercomputer Center (SDSC) built a supercomputer with SSDs, called “Gordon”, which can handle data up to 100 times faster than with normal drives (Zverina, 2011).

Another prototype, called “Moneta” (Anthes, 2012), used phase-change memory to boost I/O performance. It was about 9.5 times faster than a normal RAID system and about 2.8 times faster than a flash-based RAID system.

There is significant research in this area, as storage performance is a problem for large-scale, data-centric systems such as today’s Big Data applications.


Data representation is an often-mentioned characteristic of Big Data and goes well with the “Variety” aspect of the definition stated above. Every piece of data is represented in a specific form, whatever that form may be. Well-known forms are XML, JSON, CSV and binary formats. Depending on the representation, different possibilities for expressing relations are available.

XML and JSON, for instance, allow us to nest child objects or express relations within the data, whereas this is rather hard with CSV or binary formats. An example of such a relation is a dataset of type “person”: each person has some identifying attributes (e.g. last name, age, sex) and an address, which is an independent entity. To retrieve this data as CSV or binary, you either have to run two queries or create a new, merged entity for a single query. XML and JSON allow us to nest entities within other entities.

What is Data representation?

data-entity

The entity shown in the figure would look like the following when represented as XML:

<person>
  <common>
    <firstname>Mario</firstname>
    <lastname>Meir-Huber</lastname>
    <age>29</age>
  </common>
  <address>
    <zipcode>1150</zipcode>
    <city>Vienna</city>
  </address>
</person>

Listing 1: XML representation of the entity “person”

The JSON representation of our “person” model looks similar:

{
  "person": {
    "common": {
      "firstname": "Mario",
      "lastname": "Meir-Huber",
      "age": 29
    },
    "address": {
      "zipcode": "1150",
      "city": "Vienna"
    }
  }
}

Listing 2: JSON representation of the entity “person”

The traditional way of data representation: SQL

If we now look at how we could represent this data as a flat, tabular result from a relational database, we need to join two different datasets. This is exactly what SQL supports. A possible representation could look like the following:

p.Firstname | p.Lastname | p.Age | a.Zipcode | a.City
Mario       | Meir-Huber | 29    | 1150      | Vienna

Listing 3: SQL-based binary representation
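As a hedged sketch of how such a flat result can be produced (the table and column names are chosen for this example, and SQLite is used purely for illustration), the join could look like this:

import sqlite3

# In-memory database with a person table and a separate address table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, zipcode TEXT, city TEXT);
    CREATE TABLE person  (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
                          age INTEGER, address_id INTEGER REFERENCES address(id));
    INSERT INTO address VALUES (1, '1150', 'Vienna');
    INSERT INTO person  VALUES (1, 'Mario', 'Meir-Huber', 29, 1);
""")

# The join flattens the nested person/address structure into one row.
row = con.execute("""
    SELECT p.firstname, p.lastname, p.age, a.zipcode, a.city
    FROM person p JOIN address a ON a.id = p.address_id
""").fetchone()
print(row)  # ('Mario', 'Meir-Huber', 29, '1150', 'Vienna')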

The representation of data isn’t limited to what was described in this chapter so far. Several other formats are available and more may arise in the future. However, data must have a clear and documented representation in a form that can be processed by the tools built upon that data.


Big Data involves a lot of different technologies, and each of them requires different knowledge. I’ve described the necessary knowledge in an earlier post. Today, we will have a look at the Big Data technology layers.

In this post, I want to outline all necessary technologies in the Big Data stack. The following image shows them:

Technologies in the Big Data Stack

The layers are:

  • Management: This layer addresses how data is stored, on hardware or in the cloud, and which resources need to be scheduled. It basically covers the knowledge involved in datacenter design and/or cloud computing for Big Data.
  • Platforms: This layer is all about Big Data technologies such as Hadoop and how to use them.
  • Analytics: This layer is about the mathematical and statistical techniques necessary for Big Data. It is about asking the questions you need to answer.
  • Utilisation: The last and most abstract layer is about the visualization of Big Data. This is mainly used by visual artists and presentation software.

Each of these layers requires different knowledge as well as different hardware and software. As described earlier, it is simply not possible to have one piece of software that fits all needs, and you need to build a team that has knowledge in all of these areas.


Big Data is definitely a very complex “thing”. Why do I call it “a thing” here? Because it is simply not a technology itself! Hadoop is a technology, Lucene is a technology, but Big Data is more of a concept; it is nothing you can touch. Ever tried installing Big Data on your machine? Or said “I need this Big Data software”? When you talk about software or a technology, you talk about a very concrete product or open source tool.
The concept of Big Data is rather complicated when it comes to implementing it. There are several major dimensions you have to be aware of.

Big Data Dimensions


The dimensions are:

  • Legal dimension: What is necessary in terms of data protection legislation? What do you need to know about legal impacts, what kind of data are you allowed to store or collect/process?
  • Social dimension: What social impacts will you generate with your application? How will your users react to that?
  • Business dimension: What is the business model you want to generate with your Big Data platform? How can your Big Data platform support your business? What kind of pricing do you want to calculate?
  • Technology dimension: How can you achieve your targets? What technology would you use to get there? What scalable software can you use?
  • Application dimension: What industry solutions are available for your needs? How can you enable decision support based on data for your company?

If you want to address all of these questions, you need a team that is capable of covering them. In the next posts I will talk about the Big Data technology stack and what it takes to be a data scientist.
Header Image copyright:  Michael Coghlan. Distributed under the Creative Commons license 2.0 by Creative Commons Australia Pool.