Hadoop Tutorial – Working with the Apache Hue GUI


When working with the main Hadoop services, it is not necessary to use the console all the time (even though this is the most powerful way of doing so). Most Hadoop distributions also ship with a user interface. This user interface is called "Apache Hue" and is a web-based interface running on top of the distribution. Apache Hue integrates major Hadoop projects such as Hive, Pig and HCatalog into the UI. The nice thing about Apache Hue is that it makes the management of your Hadoop installation considerably easier through a well-designed web-based UI. The following screenshot shows Apache Hue on the Cloudera distribution.

Figure: Apache Hue

Hadoop Tutorial – Serialising Data with Apache Avro


Apache Avro is a service in Hadoop that enables data serialization. The main tasks of Avro are:

- Provide complex data structures
- Provide a compact and fast binary data format
- Provide a container file to persist data
- Provide RPCs for the data
- Enable integration with dynamic languages

Avro is built around a JSON schema that allows several different types:

- Elementary types: Null, Boolean, Int, Long, Float, Double, Bytes and String
- Complex types: Record, Enum, Array, Map, Union and Fixed

The sample below demonstrates an Avro schema.

{"namespace": "person.avro",
 "type": "record",
 "name": "Person",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "age", "type": ["int", "null"]},
   {"name": "street", "type": ["string", "null"]}
 ]
}

Table 4: an Avro schema
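Serialising a record that follows this schema is straightforward from Java using Avro's GenericRecord API. The following is a minimal sketch, assuming the schema above is stored in a file called person.avsc; the output file name persons.avro and the field values are made up for illustration.

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroWriterSketch {
  public static void main(String[] args) throws IOException {
    // Parse the Person schema shown above (assumed to be saved as person.avsc)
    Schema schema = new Schema.Parser().parse(new File("person.avsc"));

    // Build a record that follows the schema
    GenericRecord person = new GenericData.Record(schema);
    person.put("name", "Jane Doe");
    person.put("age", 42);
    person.put("street", "Main Street 1");

    // Write the record into Avro's compact binary container format
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
    try (DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter)) {
      fileWriter.create(schema, new File("persons.avro"));
      fileWriter.append(person);
    }
  }
}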

Hadoop Tutorial – Importing large amounts of data with Apache Sqoop


Apache Sqoop is in charge of moving large datasets between different storage systems, for example from relational databases into Hadoop. Sqoop supports a large number of connectors, such as JDBC, to work with different data sources and makes it easy to import existing data into Hadoop. Sqoop supports the following databases:

- HSQLDB starting with version 1.8
- MySQL starting with version 5.0
- Oracle starting with version 10.2
- PostgreSQL
- Microsoft SQL Server

Sqoop provides several ways to import and export data from and to Hadoop, and it also offers several mechanisms to validate data.
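Imports are normally started from the sqoop command line, but the same tool can be invoked programmatically. The following Java sketch assumes a Sqoop 1 installation whose client library is on the classpath; the JDBC URL, credentials, table name and target directory are made-up example values.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
  public static void main(String[] args) {
    // Same arguments the sqoop CLI would receive for a table import
    String[] importArgs = {
        "import",
        "--connect", "jdbc:mysql://dbhost:3306/sales",
        "--username", "etl",
        "--password-file", "/user/etl/.db-password",
        "--table", "orders",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4"
    };
    // Sqoop.runTool is the entry point used by the sqoop shell script; returns 0 on success
    int exitCode = Sqoop.runTool(importArgs);
    System.exit(exitCode);
  }
}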

Hadoop Tutorial – Analysing Log Data with Apache Flume


Most IT departments produce a large amount of log data. This is especially true when server systems are monitored, but it is also necessary for device monitoring. Apache Flume comes into play when this log data needs to be analyzed. Flume is all about data collection and aggregation. It is built on a flexible architecture based on streaming data flows, and the service allows you to extend the data model. Key elements of Flume are:

- Event. An event is data that is transported from one place to another.
- Flow. A flow consists of several events that are transported between several places.
- Client. A client is the start of a transport. There are several clients available; a frequently used one is the Log4j appender.
- Agent. An agent is an independent process that provides components to Flume.
- Source. An interface implementation that is capable of transporting events. An example is an Avro source.
- Channel. A channel buffers events between a source and a sink.
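To give an idea of the client side, the sketch below sends a single event to an agent's Avro source via Flume's RpcClient API; the host name and port are assumptions and have to match the agent's configuration.

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to the Avro source of a Flume agent (assumed host and port)
    RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
    try {
      // Wrap a log line into a Flume event and hand it to the agent
      Event event = EventBuilder.withBody("application log line", StandardCharsets.UTF_8);
      client.append(event);
    } finally {
      client.close();
    }
  }
}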


Hadoop Tutorial – Data Science with Apache Mahout


Apache Mahout is the service on Hadoop that is in charge of what is often called "data science". Mahout is all about learning algorithms, pattern recognition and the like. An interesting fact about Mahout is that, under the hood, MapReduce has been replaced by Spark. Mahout is in charge of the following tasks:

- Machine Learning. Learning from existing data.
- Recommendation Mining. This is what we often see on websites. Remember the "You bought X, you might be interested in Y"? This is exactly what Mahout can do for you.
- Clustering. Mahout can cluster documents and data that have some similarities.
- Classification. Learning from existing classifications.

A Mahout program is written in Java. The next listing shows how a recommender is evaluated.

// Load the user/item data from a local file
DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));
// Evaluator: average absolute difference between estimated and real ratings
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
// User-defined builder that creates the recommender under test
RecommenderBuilder builder = new MyRecommenderBuilder();
// Train on 90% of the data and evaluate the result
double result = eval.evaluate(builder, null, model, 0.9, 1.0);
System.out.println(result);

A Mahout program
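The MyRecommenderBuilder used above is not a Mahout class but a user-defined one. A hypothetical implementation based on Mahout's Taste API could look like the following, building a user-based recommender with Pearson correlation and a neighbourhood of ten users; all of these choices are assumptions.

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MyRecommenderBuilder implements RecommenderBuilder {
  @Override
  public Recommender buildRecommender(DataModel model) throws TasteException {
    // Compare users by the Pearson correlation of their ratings
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Consider the ten most similar users for each recommendation
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    return new GenericUserBasedRecommender(model, neighborhood, similarity);
  }
}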

Hadoop Tutorial – Graph Data in Hadoop with Giraph and Tez


Both Apache Giraph and Apache Tez are focused on graph processing. Apache Giraph is a very popular tool for graph processing. A famous use case for Giraph is the social graph at Facebook: Facebook uses Giraph to analyze how one might know a person in order to find out which other persons could be friends. Graph processing also applies to the travelling-salesperson problem, which tries to answer the question of the shortest route to reach all customers. Apache Tez is focused on improving performance when working with graphs. It makes development much easier and significantly reduces the number of MapReduce jobs that are executed underneath. Apache Tez greatly increases performance compared to typical MapReduce queries and optimizes resource management. The following figure demonstrates graph processing with and without Tez.

Figure: MapReduce without Tez vs. with Tez
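To illustrate the vertex-centric programming model that Giraph uses, the following sketch is a single-source shortest-path computation modelled on Giraph's well-known example; the class name and the choice of vertex 1 as the source are assumptions.

import java.io.IOException;

import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPathsSketch
    extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1L; // assumed start vertex

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Shortest distance reported to this vertex so far
    double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      // Tell all neighbours about the improved distance
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt(); // wake up again only when new messages arrive
  }
}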

Hadoop Tutorial – Real-Time Data with Apache S4


S4 is another near-real-time project for Hadoop. S4 is built with a decentralized architecture in mind, focusing on a scalable and event-oriented design. S4 runs as a long-running process that analyzes streaming data. S4 is written in Java and built with flexibility in mind; this is achieved via dependency injection, which makes the platform very easy to extend and change. S4 relies heavily on loose coupling and dynamic association via the publish/subscribe pattern. This makes it easy to integrate S4 sub-systems into larger systems, and services on sub-systems can be updated independently. S4 is built to be highly fault-tolerant: mechanisms built into S4 allow fail-over and recovery.

Hadoop Tutorial – Accessing streaming data with Apache Storm


Apache Storm is in charge of analyzing streaming data in Hadoop. Storm is extremely powerful when analyzing streaming data and is capable of working in near real-time. Storm was initially developed by Twitter to power their streaming API. At present, Storm is capable of processing one million tuples per node per second, and a nice property of Storm is that it scales linearly. The Storm architecture is similar to that of other Hadoop projects; however, Storm introduces some concepts of its own. First, there is Nimbus. Nimbus is the controller for Storm, comparable to the JobTracker in Hadoop. Apache Storm also utilizes ZooKeeper. The Supervisor runs on each instance and takes care of the tuples once they come in. The following figure shows this. The major concepts in Apache Storm are four elements: streams, spouts, bolts and topologies. Streams are unbounded sequences of tuples, a spout is a source of streams, bolts process input streams and create new output streams, and a topology is a graph of spouts and bolts that wires the whole processing together.
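As a sketch of how these elements fit together in code, the following minimal topology wires one spout to one bolt that splits sentences into words. It assumes a recent Storm release where the classes live under org.apache.storm; the component names and the constant example sentence are made up for illustration.

import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordSplitTopologySketch {

  // Spout: emits an unbounded stream of (identical) sentence tuples
  public static class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void nextTuple() {
      Utils.sleep(1000); // throttle the example spout
      collector.emit(new Values("hadoop storm stream processing"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("sentence"));
    }
  }

  // Bolt: consumes sentence tuples and emits one tuple per word
  public static class SplitBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      for (String word : tuple.getStringByField("sentence").split(" ")) {
        collector.emit(new Values(word));
      }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    // Topology: the graph of spouts and bolts described above
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new SentenceSpout(), 1);
    builder.setBolt("words", new SplitBolt(), 2).shuffleGrouping("sentences");

    // Run in-process for testing; a real cluster deployment would use StormSubmitter
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-split", new Config(), builder.createTopology());
  }
}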


Hadoop Tutorial – Getting started with Apache Pig


Apache Pig is an abstraction language that puts the data in the center. Apache Pig is a "data-flow" language: in contrast to SQL (and Hive), Pig takes an iterative approach and lets data flow from one statement to the next. This gives more powerful options when working with data. The language used by Apache Pig is called "PigLatin". A key benefit of Apache Pig is that it abstracts complex MapReduce tasks, such as joins, into very simple functions, which makes it much easier for developers to write complex queries in Hadoop. Pig itself consists of two major components: PigLatin and a runtime environment. When running Apache Pig, there are two possibilities: the first is the standalone mode, which is intended for rather small datasets within a virtual machine; for processing Big Data it is necessary to run Pig in MapReduce mode on top of HDFS. Pig applications are usually script files (with the extension .pig).
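PigLatin statements can also be embedded in Java through the PigServer class, which also makes the two execution modes explicit. The following is a sketch; the input path, the field layout and the output path are assumptions for illustration.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
  public static void main(String[] args) throws IOException {
    // ExecType.LOCAL = standalone mode for small datasets,
    // ExecType.MAPREDUCE = run on top of HDFS for Big Data
    PigServer pig = new PigServer(ExecType.LOCAL);
    // Data flows from one statement to the next
    pig.registerQuery("records = LOAD 'input/people.csv' USING PigStorage(',') AS (name:chararray, age:int);");
    pig.registerQuery("adults = FILTER records BY age >= 18;");
    pig.store("adults", "output/adults");
  }
}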


Hadoop Tutorial – Apache Hive and Apache HCatalog


One of the easiest tools to use in Hadoop is Hive. Hive is very similar to SQL and is easy to learn for those who have a strong SQL background. Apache Hive is a data-warehousing tool for Hadoop, focusing on large datasets and on how to give them a structure. Hive queries are written in HiveQL. HiveQL is very similar to SQL, but not the same. As already mentioned, HiveQL translates to MapReduce and therefore comes with minor performance trade-offs. HiveQL can be extended with custom code and MapReduce queries; this is useful when additional performance is required. The following listings show some Hive queries. The first listing shows how to query two columns from a dataset.

hive> SELECT column1, column2 FROM dataset
2 5
4 9
5 7
5 9

Listing 2: simple Hive query

The next sample shows how to include a where-clause.

hive> SELECT DISTINCT column1 FROM dataset WHERE column2 = 9

Listing 3: where-clause in Hive
