This section features tutorials in the Big Data field.

There are several recurring discussions around Hadoop, and some of them are simply wrong. First, a small number of people believe that Hadoop is a hype that will end at some point. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. Beyond that, two major claims keep coming up: one group states that Hadoop is cheap because it is open source, while the other states that Hadoop is expensive because it is very complicated. (Note: by Hadoop, I also mean Spark and similar technologies.)

Neither claim is entirely true.

First, you can download it for free and install it on your system. This makes it basically free in terms of licenses, but not in terms of running it. When you run a vanilla Hadoop, you have to think about hotfixes, updates, services, integration and many more tasks that quickly get very complicated. You end up spending a lot of money on Hadoop experts to solve your problems. Remember: you haven’t solved any business problem or question yet, because you are busy just running the system! You spend dollar after dollar on expensive operational topics instead of spending them on creating value for your business.

Now to the opposite claim: Hadoop is expensive. Is it? In the past years I saw a lot of Hadoop projects that went more or less badly. Costs were always higher than expected and the project timeframe was never kept. Hadoop experts earn high salaries as well, which makes consulting hours even more expensive. Plus, you probably won’t find them on the market, as they can pick which projects to take. So you have two major problems: high implementation cost and low resource availability.

The pain of cluster sizing

Another factor relevant to the cost discussion is cluster utilization. In many projects I saw the same trend: when cluster sizing is discussed, there are two main options: (a) sizing the cluster for the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you get another problem: the cluster will be under-utilized most of the time. What I often see at my clients is the following: 20% of the time they have full utilization on the cluster, but 80% of the time the cluster utilization is below 20%. As a rough illustration, that averages out to at most 0.2 × 100% + 0.8 × 20% = 36% utilization, so roughly two thirds of the capacity you pay for sits idle. This makes your cluster very expensive in any business case calculation. If you select (b), you lose business agility and your projects/analytics might require long compute times.

At the beginning of this article, I promised to explain why Hadoop is still cost-effective. So far, I have only argued that it can be expensive, which sounds like the opposite. Hadoop is still cost-effective, and I will present a solution for this in my next blog post, so stay tuned 😉

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

I am happy to announce that I’ve created a new e-book for Amazon Kindle. As a promotional offer, the e-book will cost only 0.99 for the next 6 days, and the price will then go back up to its original level! Make sure to get it now 🙂
For more details about the e-book, read this page.
You can obtain the e-book here.

I am happy to announce the work we did over the last months within Teradata. We developed a light-weight process model for Big Data Analytics projects, which is called “RACE”. The model is agile and captures the know-how of more than 25 consultants who worked on over 50 Big Data Analytics projects in recent months. Teradata also co-developed CRISP-DM, the industry-leading process for data mining. Now we have created a new process for agile projects that addresses the new challenges of Big Data Analytics.
Where does the ROI come from?
This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different from traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (usually 4 to 8 weeks). Each meaningful insight or successful use case that can be actioned generates ROI, and the total ROI is the sum of all successful use cases. Competitive advantage is therefore driven by the capability to produce both a high volume of insights and creative insights that generate a high ROI.
What is the purpose of RACE?
RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge and creativity to produce high-ROI insights.
What does the process look like?

RACE – an agile process for Big Data Analytic Projects


The process itself is divided into several short phases:

  • Roadmap. That’s an optional first step (but heavily recommended) to build a roadmap of where the customer wants to go in terms of Big Data.
  • Align. Use-cases are detailed and data is confirmed.
  • Create. Data is loaded, prepared and analyzed. Models are developed.
  • Evaluate. Recommendations for the business are given.

In the next couple of weeks we will publish much more on RACE, so stay tuned!

When working with the main Hadoop services, it is not necessary to work with the console all the time (even though this is the most powerful way of doing so). Most Hadoop distributions also come with a user interface. This user interface is called “Apache Hue” and is a web-based interface running on top of a distribution. Apache Hue integrates major Hadoop projects in the UI, such as Hive, Pig and HCatalog. The nice thing about Apache Hue is that it makes the management of your Hadoop installation pretty easy with a great web-based UI.
The following screenshot shows Apache Hue on the Cloudera distribution.
Apache Hue

Hadoop Common is one of the easiest things to explain in the Hadoop context – even though it can get complicated when working with it. Hadoop Common is a collection of libraries and utilities that are needed when working with Hadoop. These libraries and tools are then used by various projects in the Hadoop ecosystem. Samples include:

  • A CLI MiniCluster, which starts a single-node Hadoop cluster for testing purposes
  • Native libraries for Hadoop
  • Authentication and superuser configuration
  • A Hadoop secure mode

You might not use all of the tools and libraries in Hadoop Common, as some of them are only needed when you develop your own Hadoop projects.
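The CLI MiniCluster mentioned above is started from the command line; programmatically, a similar single-node test cluster can be brought up with the MiniDFSCluster test helper (available via the hadoop-minicluster artifact). The following is a minimal, hedged Java sketch, and the path /tmp/hello.txt is just a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Start a single-node HDFS inside the current JVM (for testing only)
Configuration conf = new Configuration();
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
try {
    FileSystem fs = cluster.getFileSystem();
    Path file = new Path("/tmp/hello.txt");  // placeholder path
    fs.create(file).close();                 // write an empty test file
    System.out.println("File exists: " + fs.exists(file));
} finally {
    cluster.shutdown();                      // always tear the mini cluster down
}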

Apache Avro is the data serialization framework in the Hadoop ecosystem. The main tasks of Avro are:

  • Provide complex data structures
  • Provide a compact and fast binary data format
  • Provide a container to persist data
  • Provide RPCs to access the data
  • Enable the integration with dynamic languages

Avro schemas are defined in JSON and support several different types:

Elementary types

  • Null, Boolean, Int, Long, Float, Double, Bytes and String

Complex types

  • Record, Enum, Array, Map, Union and Fixed

The sample below demonstrates an Avro schema:

{"namespace": "person.avro",
 "type": "record",
 "name": "Person",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "age", "type": ["int", "null"]},
   {"name": "street", "type": ["string", "null"]}
 ]
}

Listing 4: an Avro schema
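To show how such a schema is used from code, here is a short, hedged Java sketch that writes one record with Avro’s GenericRecord API; the file names person.avsc and people.avro are assumptions for illustration:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Parse the Person schema shown above (assumed to be stored as person.avsc)
Schema schema = new Schema.Parser().parse(new File("person.avsc"));

// Build a record that follows the schema
GenericRecord person = new GenericData.Record(schema);
person.put("name", "Alice");
person.put("age", 30);
person.put("street", null);  // allowed because the type is a union with null

// Serialize the record into an Avro container file
try (DataFileWriter<GenericRecord> writer =
         new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
    writer.create(schema, new File("people.avro"));
    writer.append(person);
}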

Apache Sqoop is in charge of moving large datasets between different storage systems, for example from relational databases to Hadoop. Sqoop supports a large number of connectors, such as JDBC, to work with different data sources. Sqoop makes it easy to import existing data into Hadoop.

Sqoop supports the following databases:

  • HSQLDB starting version 1.8
  • MySQL starting version 5.0
  • Oracle starting version 10.2
  • PostgreSQL
  • Microsoft SQL Server

Sqoop provides several options to import and export data from and to Hadoop. The service also provides several mechanisms to validate data.
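As a hedged example of what a typical import looks like, the following invocation pulls a MySQL table into HDFS; the host, database, user and target directory are placeholders, not values from the original text:

sqoop import \
  --connect jdbc:mysql://dbserver/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /user/hadoop/customers \
  --num-mappers 4

The -P switch prompts for the password interactively, and --num-mappers controls how many parallel map tasks Sqoop uses for the transfer.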

Most IT departments produce a large amount of log data. This is especially true when server systems are monitored, but the same applies to device monitoring. Apache Flume comes into play when this log data needs to be collected and analyzed.

Flume is all about data collection and aggregation. It is built on a flexible architecture based on streaming data flows, and the service allows you to extend the data model. The key elements of Flume are (a sample agent configuration follows the list):

  • Event. An event is a unit of data that is transported from one place to another.
  • Flow. A flow consists of several events that are transported between several places.
  • Client. A client is the starting point of a transport. There are several clients available; a frequently used one is the Log4j appender.
  • Agent. An agent is an independent process that hosts Flume components such as sources, channels and sinks.
  • Source. An interface implementation that can consume events delivered to it, for example an Avro source.
  • Channels. When a source receives an event, it passes it on to one or more channels. A channel is a store that buffers the event, e.g. a JDBC-backed channel.
  • Sink. A sink takes an event from the channel and transports it to the next stage or its final destination.
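
As a hedged sample of the agent configuration mentioned above, the following minimal properties file (the names a1, r1, c1 and k1 are placeholders) wires a netcat source to a logger sink through an in-memory channel:

# name the components of agent "a1"
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source listening on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# in-memory channel that buffers the events
a1.channels.c1.type = memory

# logger sink that writes events to the agent log
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Such a file is typically passed to the flume-ng agent command via its --conf-file and --name options.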

The following figure illustrates the typical workflow for Apache Flume with its components.

Apache Flume

Apache Mahout is the Hadoop project that covers what is often called “data science”. Mahout is all about machine learning algorithms, pattern recognition and the like. An interesting fact about Mahout is that, under the hood, MapReduce has been replaced by Spark.

Mahout is in charge of the following tasks:

  • Machine Learning. Learning from existing data.
  • Recommendation Mining. This is what we often see on websites. Remember the “You bought X, you might be interested in Y”? This is exactly what Mahout can do for you.
  • Cluster data. Mahout can cluster documents and data that have some similarities.
  • Classification. Learn from existing classifications.

A Mahout program is written in Java. The next listing shows how the recommendation builder works.

// Imports for the Mahout Taste recommender API
import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

// Load the preference data from a file
DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));

// Evaluator that scores predictions by their average absolute difference
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

// MyRecommenderBuilder is a custom class that builds the actual recommender
RecommenderBuilder builder = new MyRecommenderBuilder();

// Train on 90% of each user's data and evaluate on the remainder
double res = eval.evaluate(builder, null, model, 0.9, 1.0);

System.out.println(res);

Both Apache Giraph and Apache Tez are focused on graph processing. Apache Giraph is a very popular tool for graph processing. A famous use case for Giraph is the social graph at Facebook: Facebook uses Giraph to analyze how one person might know another in order to suggest potential friends. A classic graph-processing problem is the travelling salesman problem, which asks for the shortest route that visits all customers.

Apache Tez is focused on improving performance when working with graphs of processing steps. It makes development much easier and significantly reduces the number of MapReduce jobs that are executed underneath. Apache Tez delivers much better performance than typical MapReduce queries and optimizes resource management.

The following figure demonstrates graph processing with and without Tez.

MapReduce without Tez

With Tez