This section features tutorials in the Big Data field

Apache Oozie is the workflow scheduler for Hadoop Jobs. Oozie basically takes care of the step-wise workflow iteration in Hadoop. Oozie is like all other Hadoop projects built for high scalability, fault tolerance and extensible.

An Oozie Workflow is started by data availability or after a specific time. Oozie is the root for all MapReduce jobs as they get scheduled via Oozie. This also means that all other projects such as Pig and Hive (which we will discuss later on) also take advantage of Oozie.

Oozie workflows are described in an XML-Dialect, which is called hPDL. Oozie knows two different types of nodes:

  • Control-Flow-Nodes that take do exactly what the name says: controlling the flow.
  • Action-Nodes take care of the actual execution of a job.

The following illustration shows the iteration process in an Oozie Workflow. The first step for Oozie is to start a task (MapReduce Job) on a remote system. Once the task has completed, the remote system sends the result back to the remote system via a callback function.

Apache Oozie
Apache Oozie

One of the key infrastructure services for Hadoop is Apache ZooKeeper. ZooKeeper is in charge of coordinating nodes in the Hadoop cluster. Key challenges for ZooKeeper in that domain are to provide high availability for Hadoop and to take care of the distributed coordination.

Under these challenges, Hadoop takes care of managing the cluster configuration for Hadoop. A key challenge in the Hadoop Cluster is naming, which has to be applied to all nodes within a cluster. Apache ZooKeeper takes care of that by providing unique names to individual nodes based on naming conventions.

The hierarchy in Zookeeper
The hierarchy in Zookeeper

As shown in Figure 7, naming is hierarchical. This means that naming also occurs via a path. The Root instance starts with a “/”, all successors have their unique name, and their successors also apply this naming schema. This enables the cluster to have nodes with child-nodes, which in return has positive effects on maintainability.

ZooKeeper takes care of synchronization within the distributed environment and provides some group services to the Hadoop Cluster. As of synchronization, there is one server in the ZooKeeper Service that acts as the “Leader” of all servers running under the ZooKeeper Service. The following illustration shows this.

Synchronisation in the ZooKeeper Service
Synchronisation in the ZooKeeper Service

To ensure a high uptime and availability, individual servers in the ZooKeeper service are mirrored. Each of the servers in the service knows any other server. In case that one server has a failure and isn’t available any more, clients connect to other servers. The ZooKeeper service itself is built for failover and is also highly scalable.

Apache Ambari was developed by the Hadoop distributor Hortonworks and also comes with their distribution. The aim of Ambari is to make the management of Hadoop clusters easier. Ambari is useful, if you run large server farms based on Hadoop. Ambari automates much of the manual work you would need to do with Hadoop when managing your cluster from the console.

Ambari comes with three key aspects around cluster management: first, it is about provisioning instances. This is helpful when you want to add new instances to your Hadoop cluster. Ambari takes care of automating all aspects of adding new instances. Next, there is monitoring. Ambari monitors your server farm and gives you an overview on what is going on. The last aspect is the management of your server farm itself.

Provisioning has always been a very tricky part of Hadoop. When someone wanted to add new nodes to a cluster, this was basically not an easy thing to do and included a lot of manual work. Most organizations abstracted this problem by creating scripts and using automation software, but this simply couldn’t fill the scope that is often necessary in Hadoop clusters. Ambari provides an easy-to-use assistant that enables users to install new services or activate/deactivate them. Ambari takes care of the entire cluster provisioning and configuration with an easy UI.

Ambari also includes comprehensive monitoring capabilities for the cluster. This allows user to view the status of the cluster in a dashboard and to get to know immediately what the cluster is up to (or not). Ambari uses Apache Ganglia to collect the metrics. Ambari also integrates the possibility to send System messages via Apache Nagios. This includes alerts and other things that are necessary for the administrator of the cluster.

Other key aspects of Ambari are:

  • Extensibility. Ambari is built on a plug-in architecture, which basically allows you to extend Ambari with your own functionality used within your company or organization. This is useful if you want to integrate Hadoop into your business processes.
  • Fault Tolerance. Ambari takes care of errors and reacts to them. For example, if an instance has an error, Ambari restarts this instance. This takes away much of the headache you got in previous, pre-Ambari, versions of Hadoop.
  • Secure. Ambari uses a role-based authentication. This gives you more control over sensitive information in your cluster(s) and enables you to apply different roles.
  • Feedback. Ambari provides Feedback to the user(s) about long-running processes. This is especially useful for stream processing and near-real-time processes that basically have no end of their lifespan.

Apache Ambari can be accessed easily via two different ways: first, Ambari provides a mature UI that enables you to access the cluster management via a Browser. Furthermore, Ambari can also be accessed via ReSTful Web Services, which gives you additional possibilities in working with the service.

The following illustration outlines the Ambari Server and the Agents Communication.

Apache Ambari
Apache Ambari

As of the architecture, Ambari leverages several projects. As key elements, Ambari uses message queues for communication. The configuration within Apache Ambari is done by Puppet. The next figure shows the overall architecture of Ambari.

Apache Ambari Architecture
Apache Ambari Architecture

Last week I wrote a blog post introducing the Hadoop project and gave an overview of the Map/Reduce algorithm. This week, I will outline the Hadoop stack and major technologies in the Hadoop step. Please note: there are many projects in the Hadoop stack and this is not complete. The following figure will outline major Hadoop projects.

The Hadoop technology stack
The Hadoop technology stack

I have clustered the Hadoop stack into several areas. The lowest area is the cluster management. This level is everything about managing and running Hadoop. Projects on this layer include Ambari for provisioning, monitoring and management, Zookeeper for the coordination and reliability and Oozie for Workflow-scheduling. This layer is focused on infrastructure and if you work on this layer, you normally don’t analyse data (yet).

Moving one level up, we find ourselves in the “Infrastructure” layer. This layer is not about physical or virtual machines or disk storage. I called it “Infrastructure” since it contains projects that are used by other Hadoop components. This includes Apache Commons, a shared library, and the HDSF (Hadoop Distributed File System). HDFS is used by all other projects and it is a virtual file system that can span over many different servers and abstracts individual (machine-based) file systems to one common file system.

The next layer could also be called the 42 layer. Apache YARN is the core of almost everything you do in Hadoop. YARN takes care of the Map/Reduce jobs and many other things including resource management, job management, job tracking and job scheduling.

The next layer is all about data. As we can see here, this layer contains a lot of projects for the 3 core things when it comes to data: data storage, data access and data science. As of data storage, a key project is HBase, a distributed, key/value database. It is built for large amounts of data. We will dig deeper into HBase in a couple of weeks from now. Data access includes several important projects such as Hive (a SQL-like query language), Pig (a data flow language), streaming and in-memory processing for real-time applications such as Spark and Storm, and Graph processing with Giraph. Mahout is the only project in the data science layer. Mahout is useful for machine learning, clustering and recommendation mining.

On the next layer, we have several tools for data governance and integration. When it is necessary to import data into Hadoop, we can find projects on this layer.

The last layer consists of Apache Hue. This is the Hadoop UI that makes our lives easier 😉

Next week, I will give more insights on the individual layers discussed here. Stay tuned 😉

Hadoop is one of the most popular Big Data technologies, or maybe the key Big Data technology. Due to large demand for Hadoop, I’ve decided to write a short Hadoop tutorial series here. In the next weeks, I will write several articles on the Hadoop platform and key technologies.

When we talk about Hadoop, we don’t talk about one specific software or a service. The Hadoop project features several projects, each of them serving different topics in the Big Data ecosystem. When handling Data, Hadoop is very different to traditional RDBMS systems. Key differences are:

  • Hadoop is about large amounts of data. Traditional database systems were only about some gigabyte or terabyte of data, Hadoop can handle much more. Petabytes are not a problem for Hadoop
  • RDBMS work with an interactive access to data, whereas Hadoop is batch-oriented.
  • With traditional database systems, the approach was “read many, write many”. That means, that data gets written often but also modified often. With Hadoop, this is different: the approach now is “write once, read many”. This means that data is written once and then never gets changed. The only purpose is to read the data for analytics.
  • RDBMS systems have schemas. When you design an application, you first need to create the schema of the database. With Hadoop, this is different: the schema is very flexible, it is actually schema-less
  • Last but not least, Hadoop scales linear. If you add 10% more compute capacity, you will get about the same amount of performance. RDBMS are different; at a certain point, scaling them gets really difficult.

Central to Hadoop is the Map/Reduce algorithm. This algorithm was usually introduced by Google to power their search engine. However, the algorithm turned out to be very efficient for distributed systems, so it is nowadays used in many technologies. When you run queries in Hadoop with languages such as Hive or Pig (I will explain them later), these queries are translated to Map/Reduce algorithms by Hadoop. The following figure shows the Map/Reduce algorithm:

Map Reduce function
Map Reduce function

The Map/Reduce function has some steps:

  1. All input data is distributed to the Map functions
  2. The Map functions are running in parallel. The distribution and failover is handled entirely by Hadoop.
  3. The Map functions emit data to a temporary storage
  4. The Reduce function now calculates the temporary stored data

A typical sample is the word-count. With word-count, input data as text is put to a Map function. The Map function adds all words of the same kind to a list in the temporary store. The reduce-function now counts the words and builds a sum.

Next week I will blog about the different Hadoop projects. As already mentioned earlier, Hadoop consists of several other projects.