Big Data Big Data Big Data Technologies Hadoop Tutorials

Hadoop Tutorial – an overview of Hadoop projects

Last week I wrote a blog post introducing the Hadoop project and gave an overview of the Map/Reduce algorithm. This week, I will outline the Hadoop stack and major technologies in the Hadoop step. Please note: there are many projects in the Hadoop stack and this is not complete. The following figure will outline major Hadoop projects.

The Hadoop technology stack
The Hadoop technology stack

I have clustered the Hadoop stack into several areas. The lowest area is the cluster management. This level is everything about managing and running Hadoop. Projects on this layer include Ambari for provisioning, monitoring and management, Zookeeper for the coordination and reliability and Oozie for Workflow-scheduling. This layer is focused on infrastructure and if you work on this layer, you normally don’t analyse data (yet).

Moving one level up, we find ourselves in the “Infrastructure” layer. This layer is not about physical or virtual machines or disk storage. I called it “Infrastructure” since it contains projects that are used by other Hadoop components. This includes Apache Commons, a shared library, and the HDSF (Hadoop Distributed File System). HDFS is used by all other projects and it is a virtual file system that can span over many different servers and abstracts individual (machine-based) file systems to one common file system.

The next layer could also be called the 42 layer. Apache YARN is the core of almost everything you do in Hadoop. YARN takes care of the Map/Reduce jobs and many other things including resource management, job management, job tracking and job scheduling.

The next layer is all about data. As we can see here, this layer contains a lot of projects for the 3 core things when it comes to data: data storage, data access and data science. As of data storage, a key project is HBase, a distributed, key/value database. It is built for large amounts of data. We will dig deeper into HBase in a couple of weeks from now. Data access includes several important projects such as Hive (a SQL-like query language), Pig (a data flow language), streaming and in-memory processing for real-time applications such as Spark and Storm, and Graph processing with Giraph. Mahout is the only project in the data science layer. Mahout is useful for machine learning, clustering and recommendation mining.

On the next layer, we have several tools for data governance and integration. When it is necessary to import data into Hadoop, we can find projects on this layer.

The last layer consists of Apache Hue. This is the Hadoop UI that makes our lives easier 😉

Next week, I will give more insights on the individual layers discussed here. Stay tuned 😉

I lead a team of Senior Experts in Data & Data Science as Head of Data & Analytics and AI at A1 Telekom Austria Group. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data & Data Science.

1 comment on “Hadoop Tutorial – an overview of Hadoop projects

  1. Pingback: Hadoop Tutorial – Getting started with Apache Hadoop

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: