Big Data in Manufacturing


Big Data is a disruptive technology: it is changing major industries from the inside. In the next posts, we will look at how Big Data changes different industries. Today’s focus: Big Data in Manufacturing. Manufacturing is a traditional industry relevant to almost every country in the world. It emerged during the industrial revolution, when machines took over and production became more and more automated. Big Data has the potential to substantially change the manufacturing industry again, with various opportunities. Manufacturers can utilize Big Data for various reasons. First, it is all about quality. When we look at production chains, be it producing a car or just some metal works, quality is key. Who wants to buy a car that is broken? Exactly, nobody. Improving quality is a key aspect of Big Data for manufacturers, and it involves several aspects. First of all, it is necessary to collect data about the production line(s)
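
To make the quality idea concrete, here is a minimal sketch (in Python) of how collected production-line measurements could be screened for defects. All field names, the target dimension and the tolerance band are hypothetical and only illustrate the principle.

```python
# A minimal sketch: flag out-of-tolerance parts from production-line data.
# machine_id, measured_mm, TARGET_MM and TOLERANCE_MM are illustrative
# assumptions, not values from any real production system.
from dataclasses import dataclass

@dataclass
class Measurement:
    machine_id: str
    part_id: str
    measured_mm: float  # measured dimension of the produced part

TARGET_MM = 50.0        # assumed target dimension
TOLERANCE_MM = 0.1      # assumed tolerance band

def is_defective(m: Measurement) -> bool:
    """Flag parts whose measurement falls outside the tolerance band."""
    return abs(m.measured_mm - TARGET_MM) > TOLERANCE_MM

readings = [
    Measurement("press-01", "p-1001", 50.03),
    Measurement("press-01", "p-1002", 50.21),  # out of tolerance
]
defects = [m for m in readings if is_defective(m)]
print(f"{len(defects)} defective part(s) detected")
```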

read more Big Data in Manufacturing

Big Data – how to achieve data quality


To store data, some quality attributes regarding the data must be fulfilled. Heinrich & Stelzer (2011) defined several data quality attributes that should be met.

Relevance. Data should be relevant to the use case. If a query should look up all users interested in “luxury cars” in a web portal, all of these users should be returned, and it should be possible to derive value from this data, e.g. for advanced marketing targeting.

Correctness. Data has to be correct. If we again query for all users of a web portal interested in luxury cars, the returned data should really represent people interested in luxury cars, and faked entries should be removed.

Completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can reach them somehow, e.g. by e-mail. If the e-mail field is blank or any other field we
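
A hedged sketch of what applying these three attributes to user records could look like. The record layout and the fake-entry heuristic are assumptions for illustration, not a complete data quality framework:

```python
# Apply Relevance, Correctness and Completeness checks to user records.
users = [
    {"name": "Alice", "interest": "luxury cars", "email": "alice@example.com"},
    {"name": "Bob",   "interest": "luxury cars", "email": ""},            # incomplete
    {"name": "test",  "interest": "luxury cars", "email": "x@test.test"}, # likely fake
    {"name": "Carol", "interest": "bicycles",    "email": "carol@example.com"},
]

def relevant(u):   # Relevance: the record matches the use case
    return u["interest"] == "luxury cars"

def correct(u):    # Correctness: drop obviously faked entries (toy heuristic)
    return u["name"] != "test"

def complete(u):   # Completeness: we must be able to target the user
    return bool(u["email"])

targets = [u for u in users if relevant(u) and correct(u) and complete(u)]
print(targets)  # only Alice survives all three checks
```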

read more Big Data – how to achieve data quality

How to create Data analytics projects


Big Data analysis is something that needs some iteration principles. A famous novel in which a lot of data was analyzed is “The Hitchhiker’s Guide to the Galaxy”. In the novel, a supercomputer is asked for the “Answer to the Ultimate Question of Life, the Universe, and Everything”. As this is quite a difficult problem to solve, iteration is necessary. Bizer, Boncz, Brodie, & Erling (2011) describe some iteration steps for creating Big Data analysis applications. Five easy steps are mentioned in this paper: Define, Search, Transform, Entity Resolution and Answer the Query. Define deals with the problem that needs to be solved. This is when the marketing manager asks: “we need to find a location in county ‘xy’ where customers’ age is over 18 and below 30 and we have no store yet”. In our initial example from “The Hitchhiker’s Guide to the Galaxy”, this would be the question about the answer
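
A minimal sketch of the five iteration steps from Bizer et al. (2011), using the marketing manager’s question from above. All data structures, the sample records and the matching rules are simplified assumptions:

```python
# Define -> Search -> Transform -> Entity Resolution -> Answer the Query.

def define():
    # Define: state the question to be solved.
    return {"county": "xy", "age_min": 18, "age_max": 30}

def search(query):
    # Search: gather candidate records from the available sources.
    customers = [
        {"name": "Ann",  "county": "xy", "age": 25},
        {"name": "Ben",  "county": "xy", "age": 45},
        {"name": "Ann.", "county": "xy", "age": 25},  # duplicate spelling
    ]
    return [c for c in customers if c["county"] == query["county"]]

def transform(records):
    # Transform: normalize the raw records into a common shape.
    return [{**r, "name": r["name"].strip(".").lower()} for r in records]

def resolve_entities(records):
    # Entity Resolution: collapse records referring to the same person.
    seen, unique = set(), []
    for r in records:
        key = (r["name"], r["age"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def answer(query, records):
    # Answer the Query: apply the original question to the cleaned data.
    return [r for r in records
            if query["age_min"] < r["age"] < query["age_max"]]

q = define()
print(answer(q, resolve_entities(transform(search(q)))))
```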

read more How to create Data analytics projects

Big Data challenges: moving data for analysis


Another issue with Big Data is indicated by Alexander, Hoisie, & Szalay (2011). The problem is that data cannot be moved easily for analysis. With Big Data, we often have some terabytes or more. Moving this over a network connection is difficult or even impossible. If real-time data is analyzed, it is practically impossible to move that amount of data to another cluster, since the data would be incorrect or no longer available by that time. Real-time data analysis is also necessary in fraud protection: if the data first has to be moved to another cluster, it might already be too late. In traditional databases, this wasn’t that hard, since the data was often only some gigabytes in a single database. With Big Data, data comes in various formats, at high volume and at high velocity. Complying with all these factors while moving data to another cluster might simply not be possible. (Alexander, Hoisie, & Szalay
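
A back-of-the-envelope calculation shows why moving terabytes is so different from moving gigabytes. The dataset sizes and the link speed below are illustrative assumptions:

```python
# How long does it take to push a dataset over a network link?
def transfer_hours(size_tb: float, link_gbit_per_s: float) -> float:
    size_bits = size_tb * 1e12 * 8                # terabytes -> bits
    seconds = size_bits / (link_gbit_per_s * 1e9)
    return seconds / 3600

# A few gigabytes (the traditional database case) vs. tens of terabytes:
print(f"{transfer_hours(0.005, 1):.2f} h for 5 GB over 1 Gbit/s")  # ~0.01 h
print(f"{transfer_hours(10, 1):.2f} h for 10 TB over 1 Gbit/s")    # ~22 h
```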

read more Big Data challenges: moving data for analysis

Big Data challenges: data partitioning and concurrency


Data needs to be partitioned if it can’t be stored on a single system. With Big Data applications, we don’t talk about small storage but rather about distributed systems. Data might be partitioned over hundreds or thousands of nodes, and the database must scale out to that demand. Data partitioning is a key concept for databases and serves Big Data applications as well. However, if data is distributed over several servers, it might take a while until all nodes are informed about changes. To avoid concurrency issues, the data must be locked. This can result in poor database performance if the database should be kept consistent at all times. One solution is to give up data consistency in favor of data partitioning. This approach is described in detail in section 1.6.2, where we will focus on the CAP theorem. Let’s imagine a web shop. There are 2 users in our sample; both of them (let’s call them
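
A minimal sketch of one common partitioning scheme, hash partitioning: every client can compute which node owns a key without a central lookup. The node count is an assumption; real systems add replication and rebalancing (e.g. consistent hashing) on top of this idea:

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]

def owner(key: str) -> str:
    # Hash the key and map it onto one of the nodes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Deterministic placement: every client computes the same owner. Updates
# still take time to propagate to replicas, which is exactly the
# consistency-vs-performance trade-off discussed above.
print(owner("user:alice"), owner("user:bob"))
```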

read more Big Data challenges: data partitioning and concurrency

Big Data challenges: different storage systems


A main factor in Big Data is the variety of data. Data may not only change over time (e.g. a web shop no longer only wants to sell books but also cars) but will also come in different formats. Databases must support this. Companies might not store all their data in one single database but rather in different databases, and different APIs consume different formats such as JSON, XML or any other type. Facebook, for instance, uses MySQL, Cassandra and HBase to store their data: three different storage systems, each of them serving a different need (Harris, 2011; Muthukkaruppan, 2010). Helland (2011) described the challenges for datastores with 4 key principles: unlocked data, inconsistent schema, extract-transform-load, and too much to be accurate. By unlocked data, it is meant that data is usually locked, but Big Data applications don’t rely on locked data, which might result in problems. On the other hand, unlocked data leads
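
A small sketch of what variety means in practice: the same record served in the different formats that different APIs might require. The record layout is a made-up example:

```python
# One record, three representations: JSON, XML and CSV.
import csv, io, json
import xml.etree.ElementTree as ET

record = {"id": 42, "name": "Alice", "interest": "luxury cars"}

as_json = json.dumps(record)

root = ET.Element("user")
for k, v in record.items():
    ET.SubElement(root, k).text = str(v)
as_xml = ET.tostring(root, encoding="unicode")

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

print(as_json)
print(as_xml)
print(as_csv)
```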

read more Big Data challenges: different storage systems

Big Data challenges: storage performance


Big Data needs big storage, and storage is, in the end, a physical device. Until now, most storage devices have been hard disks that require mechanical movement. A common enterprise hard drive available today (December 2012) runs at 15,000 revolutions per minute (rpm) (Seagate, 2013), and a desktop hard drive at some 7,200 rpm. In any case, this means there is significant latency until the read head is in place. The mechanical approach to storage has been around for decades, and scientists as well as engineers complain about storage performance. In-memory access has always been faster than hard disk storage, and even network transfers are faster than what hard disks can deliver. Anthes (2012) states that disk-based storage is about 10-100 times slower than a network and about 1,000 times slower than main memory. This means there is a significant “bottleneck” when it comes to delivering data from disk-based storage to an application. As big data is about storing
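
The latency follows directly from the rpm figures above: on average, the platter must spin half a revolution before the read head sees the requested sector. A short calculation:

```python
# Average rotational latency of a spinning hard disk.
def avg_rotational_latency_ms(rpm: int) -> float:
    ms_per_revolution = 60_000 / rpm   # 60,000 ms per minute
    return ms_per_revolution / 2       # on average half a revolution

print(f"15,000 rpm: {avg_rotational_latency_ms(15_000):.1f} ms")  # 2.0 ms
print(f" 7,200 rpm: {avg_rotational_latency_ms(7_200):.1f} ms")   # ~4.2 ms
```

Milliseconds per access, versus microseconds or less for main memory, is where the 1,000x gap Anthes (2012) describes comes from.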

read more Big Data challenges: storage performance

Big Data 101: Data Representation as part of Variety


Representation is an often-mentioned characteristic of Big Data. It goes well with “Variety” in the definition stated above. All data is represented in some specific form, whatever that form may be. Well-known forms of data are XML, JSON, CSV or binary. Depending on the representation of data, different possibilities for expressing relations exist. XML and JSON, for instance, allow us to set child objects or relations for data, whereas this is rather hard with CSV or binary formats. An example of such a relation is a dataset of the type “Person”. Each person consists of some attributes that identify the person (e.g. the last name, age, sex) and an address that is an independent entity. To retrieve this data as CSV or binary, you either have to do two queries or create a new entity for the query where the data is merged. XML and JSON allow us to nest entities in other entities. The entity described in the figure
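
A sketch of the Person example: JSON can nest the address entity directly, while CSV needs two flat tables joined by a key (or a pre-merged row). The concrete attribute values and the person_id join key are illustrative assumptions:

```python
import json

# JSON/XML style: the address entity is nested inside the person.
person = {
    "last_name": "Smith",
    "age": 34,
    "sex": "f",
    "address": {
        "street": "Main St 1",
        "city": "Springfield",
    },
}
print(json.dumps(person, indent=2))

# CSV style: the nested entity must be split into two flat tables,
# requiring a second query (or a merged row) to retrieve everything.
persons_csv = "person_id,last_name,age,sex\n1,Smith,34,f\n"
addresses_csv = "person_id,street,city\n1,Main St 1,Springfield\n"
```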

read more Big Data 101: Data Representation as part of Variety

Big Data: Technology Layers for Big Data


Big Data involves a lot of different technologies, and each of these technologies requires different knowledge. I’ve described the necessary knowledge in an earlier post. In this post, I want to outline all the necessary technologies in the Big Data stack. The following image shows the layers:

Management: This layer addresses how to store data on hardware or in the cloud and what resources need to be scheduled. It is basically the knowledge involved in datacenter design and/or cloud computing for Big Data.

Platforms: This layer is all about Big Data technologies such as Hadoop and how to use them.

Analytics: This layer covers the mathematical and statistical techniques necessary for Big Data. It is about asking the questions you need answered.

Utilisation: The last and most abstract layer is about the visualization of Big Data. This is mainly used by visual artists and presentation software.

Each of the layers needs different knowledge and

read more Big Data: Technology Layers for Big Data

Big Data: Why it is not so simple as you might think!


Big Data is definitely a very complex “thing”. Why do I call it “a thing” here? Because it is simply not a technology itself! Hadoop is a technology, Lucene is a technology, but Big Data is more of a concept, since it is nothing you can touch. Ever tried installing Big Data on your machine? Or said “I need this Big Data software”? When you talk about a software or technology, you talk about a very concrete product or open-source tool. The concept of Big Data is rather complicated when it comes to implementing it. There are several major dimensions you have to be aware of:

Legal dimension: What is necessary in terms of data protection legislation? What do you need to know about legal impacts, and what kind of data are you allowed to store, collect or process?

Social dimension: What social impact will your application generate? How will your users react to it?

Business

read more Big Data: Why it is not so simple as you might think!