This section features tutorials in the Big Data field

S4 is another near-real-time project for Hadoop. S4 is built with a decentralized architecture in mind, focusing on a scaleable and event-oriented architecture. S4 is a long-running process that analyzes streaming data.

S4 is built with Java and with flexibility in mind. This is done via dependency injection, which makes the platform very easy to extend and change. S4 heavily relies on Loose-coupling and dynamic association via the Publish/Subscribe pattern. This makes it easy for S4 to integrate sub-systems into larger systems and updating services on sub-systems can be done independently.

S4 is built to be highly fault-tolerant. Mechanisms built into S4 allow fail-over and recovery.

Apache Storm is in charge for analyzing streaming data in Hadoop. Storm is extremely powerful when analyzing streaming data and is capable of working near real-time. Storm was initially developed by Twitter to power their streaming API. At present, Storm is capable of processing 1 million tuples per node and second. The nice thing about Storm is that it scales linearly.

The Storm architecture is similar to other Hadoop projects. However, Storm comes with different challenges. First, there is Nimbus. Nimbus is the controller for Storm, which is similar to the JobTracker in Hadoop. Apache Storm also utilizes ZooKeeper. The Supervisor is on each instance and takes care of the tuples once they come in. The following figure shows this.

Storm Topology
Storm Topology

Major concepts in Apache Storm are 4 elements: streams, spouts, bolts and topologies.

Storm Tuples
Storm Tuples

Streams are an unbound sequence of Tuples, a Spout is a source of streams, Bolts process input streams and create new output streams and a topology is a network of Bolts and Spouts.

Apache Pig is an abstract language that puts data in the middle. Apache Pig is a “Data-flow” language. In contrast to SQL (and Hive), Pig goes an iterative way and lets data flow from one statement to another. This gives more powerful options when it comes to data. The language used for Apache Pig is called “PigLatin”. A key benefit of Apache Pig is that it abstracts complex tasks in MapReduce such as Joins to very easy functions in Apache Pig. Apache Pig is ways easier for Developers to write complex queries in Hadoop. Pig itself consists of two major components: PigLatin and a runtime environment.

When running Apache Pig, there are two possibilities: the first one is the stand alone mode which is intended to rather small datasets within a virtual machine. On processing Big Data, it is necessary to run Pig in the MapReduce Mode on top of HDFS. Pig applications are usually script files (with the extension .pig) that consist of a series of operations and transformations, that create output data from input data. Pig itself transforms these operations and transformations to MapReduce functions. The set of operations and transformations available by the language can easily be extended via custom code. When compared to the performance of “pure” MapReduce, Pig is a bit slower, but still very close to the native MapReduce performance. Especially for that not experienced in MapReduce, Pig is a great tool (and ways easier to learn than MapReduce)

When writing a Pig application, this application can easily be executed as a script in the Hadoop environment. Especially when using the previously demonstrated Hadoop VM’s, it is easy to get started. Another possibility is to work with Grunt, which allows us to execute Pig commands in the console. The third possibility to run Pig is to embed them in a Java application.

The question is, what differentiates Pig from SQL/Hive. First, Pig is a data-flow language. It is oriented on the data and how it is transformed from one statement to another. It works on a step-by-step iteration and transforms data. Another difference is that SQL needs a schema, but Pig doesn’t. The only dependency is that data needs to be able to work with it in parallel.

The table below will show a sample program. We will look at the possibilities within the next blog posts.

A = LOAD ‘student‘ USING PigStorage() AS (name:chararray, age:int, gpa:float);







One of the easiest to use tools in Hadoop is Hive. Hive is very similar to SQL and is easy to learn for those that have a strong SQL background. Apache Hive is a data-warehousing tool for Hadoop, focusing on large datasets and how to create a structure on them.

Hive queries are written in HiveQL. HiveQL is very similar to SQL, but not the same. As already mentioned, HiveQL translates to MapReduce and therefore comes with minor performance trade-offs. HiveQL can be extended by custom code and MapReduce queries. This is useful, when additional performance is required.

The following listings will show some Hive queries. The first listing will show how to query two rows from a dataset.

hive> SELECT column1, column2 FROM dataset2 5

4 9

5 7

5 9

Listing 2: simple Hive query

The next sample shows how to include a where-clause.

hive> SELECT DISTINCT column1 FROM dataset WHERE column2 = 91

Listing 3: where in Hive

HCatalog is an abstract table manager for Hadoop. The target of HCatalog is to make it easier for users to work with data. Users see everything like it would be a relational database. To access HCatalog, it is possible to use a Rest API.

MapReduce is the elementary data access for Hadoop. MapReduce provides the fastest way in terms of performance, but maybe not in terms of time-to-market. Writing MapReduce queries might be trickier than Hive or Pig. Other projects, such as Hive or Pig translate the code you entered into native MapReduce queries and therefore often come with a tradeoff.

A typical MapReduce function follows the following process:

  • The Input Data is distributed on different Map-processes. The Map processes work on the provided Map-function.
  • The Map-processes are executed in parallel.
  • Each Map-process issues intermediate results. These results are stored, which is often called the shuffle-phase.
  • Once all intermediate results are available, the Map-function has finished and the reduce function starts.
  • The Reduce-function works on the intermediate results. The Reduce-function is also provided by the user (just like the Map-function).

A classical way to demonstrate MapReduce is via the Word-count example. The following listing will show this.

map(String name, String content):

for each word w in content:

EmitIntermediate(w, 1);

reduce(String word, Iterator intermediateList):

int result = 0;

for each v in intermediateList:


Emit(word, result);

Hadoop is very flexible and it is possible to integrate almost any kind of database into the system. Many database vendors extended their products to work with Hadoop. One database that is often used with Hadoop is Apache Cassandra. Cassandra isn’t part of the Hadoop project itself but is often seen in connection with Hadoop projects.

Cassandra comes with several benefits. First, it is a NoSQL-Database as well, working with a key/value store. Beeing developed by Facebook initially, it is now maintained by Datastax. Cassandra comes with a great performance and linear scalability.

Apache Accumulo is another NoSQL Database in the Hadoop stack. Accumulo is based on Google’s Big Table design and is a sorted and distributed key/value storage.

Key/Value storages are basically not operating on rows, but it is possible to query them – which comes with a performance trade-off often. Accumulo allows us to query large rows which typically wouldn’t fit into the memory.

Accumulo is also built for high availability, scalability and fault tolerance. As of the ACID-topology, Accumulo supports “Isolation”. This basically means that recently inserted data isn’t displayed in case that the insert was after the query was sent.

Accumulo is built with a PlugIn-based architecture and provides a comprehensive API. With Accumulo, it is possible to execute MapReduce jobs, bulk- and batch operations.

The following Figure outlines how a Key/Value is displayed in Accumulo. The Key consists of the Row id, a column specifier and a timestamp. The column contains informations about the column family, the qualifier and the visibility.

Apache Accumulo
Apache Accumulo

The next sample will display how Accumulo code is written. The sample displays how to write a text to the database.

Text uid = new Text(“columid”);

Text family = new Text(“columnFamily”);

Text qualifier = new Text(“columnQualifier”);

ColumnVisibility visibility = new ColumnVisibility(“public”);

long timestamp = System.currentTimeMillis();

Value value = new Value(“Here is my text”.getBytes());

Mutation mutation = new Mutation(uid);

mutation.put(family, qualifier, visibility, timestamp, value);

HBase is one of the most popular databases in the Hadoop and NoSQL ecosystem. HBase is a highly-scaleable database that works with fulfilling the partition tolerance and availability of the CAP-Theorem. In case you aren’t familiar with the CAP-Theorem: the theorem states that requirements for a database are consistency, availability and partition tolerance. However, you can only have two of them and the third one comes with a trade-off.

HBase uses a Key/Value storage. The schema of a table in HBase is not present (schema-less), which gives you much more flexibility than with a traditional relational database. HBase takes care of the failover and sharding of data for you.

HBase uses HDFS as storage and ZooKeeper for the coordination. There are several region servers that are controlled by a master server. This is displayed in the next image.

Apache HBase
Apache HBase

Apache YARN can easily be called “the answer to everything”. YARN takes care of most of the things in Hadoop and you will use YARN always without noticing it. YARN is the central point of contact for all operations in the Hadoop ecosystem. YARN executes all MapReduce jobs among other things. What YARN takes care of:

  • Resource Management
  • Job Management
  • Job Tracking
  • Job Scheduling

YARN is built of 3 major components. The first one is the resource manager. The resource manager takes care of distributing the resources for individual applications. Next, there is the node manager. This component is running on the node that a specific job is running on. The third component is the Application Master. The Application Master is in charge of retrieving tasks from the resource manager and to ensure the work with the node manager. The Application Master typically works with one or more tasks.

Yarn components
Yarn components

The following image displays a common workflow in YARN.

YARN architecture
YARN architecture

YARN is used by all other projects such as Hive and Pig. It is possible to access YARN via Java Applications or a REST-Interface.

The Hadoop Distributed File System (HDFS) is one of the key services for Hadoop. HDFS is a distributed file system that abstracts each individual hard disk file system form a specific node. With HDFS, you get a virtual file system that spans over several nodes and allows you to store large amounts of data. HDFS can also operate in a non-distributed way as a standalone system but the purpose of it is to serve as a distributed file system.

One of the nice things about HDFS is that it runs on almost any hardware – which gives us the possibility to integrate existing systems into Hadoop. HDFS is also fault tolerant, reliable, scalable and easy to extend – just like any other Hadoop project!

HDFS works with the assumption that failures do happen – and is built to work fault-tolerant. HDFS is built to reboot in case of failures. Recovery is also easy with HDFS.

As streaming is a major trend in Big Data analytics, HDFS is built to serve that. HDFS allows to access streaming data via batch-processes.

HDFS is built for large amounts of data – you would usually store some terabytes of data in HDFS. The model of HDFS is built for a “write once, read many” approach, which means that it is fast and easy to read data, but writing data might not be as performant. This means that you wouldn’t use Hadoop to build an application on top of it that serves other purposes than providing analytics. That’s not the target for HDFS.

With HDFS, you basically don’t move data around. Once the data is in HDFS, it will likely stay there since it is “big”. Moving this data to another place might not be effective.

HDFS architecture

The above figure shows the HDFS architecture. HDFS has NamedNodes, which take care of the Metadata handling, distribution of files and alike. The client talks to HDFS itself to write and read files, without knowing on which (physical) node the file resides.

There are several possibilities to access HDFS:

  • REST: HDFS exposes a Rest-API which is called WebHDFS. This REST-API is also used from Java.
  • Libhdfs: This is what you use when accessing HDFS from C or C++.