Apache Mahout is the Hadoop service in charge of what is often called “data science”. Mahout is all about learning algorithms, pattern recognition, and the like. An interesting fact about Mahout is that, under the hood, MapReduce has been replaced by Spark.

Mahout is in charge of the following tasks:

  • Machine Learning. Learning from existing data in order to derive models that can be applied to new data.
  • Recommendation Mining. This is what we often see on websites. Remember the “You bought X, you might be interested in Y”? This is exactly what Mahout can do for you.
  • Clustering. Mahout can cluster documents and data that share similarities.
  • Classification. Learning from existing classifications in order to assign new data to the right category.

A Mahout program is written in Java. The next listing shows how a recommender is built and evaluated.

// Mahout's Taste (collaborative filtering) API
import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

// Load the preference data (FileDataModel expects comma-separated
// "userID,itemID,value" lines)
DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));

// Evaluator that measures the average absolute difference between
// predicted and actual preferences
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

// MyRecommenderBuilder is a user-defined class implementing RecommenderBuilder
RecommenderBuilder builder = new MyRecommenderBuilder();

// Train on 90% of the data, test on the remaining 10%
double result = eval.evaluate(builder, null, model, 0.9, 1.0);

System.out.println(result);
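Here, the 0.9 tells the evaluator to train the recommender produced by MyRecommenderBuilder on 90 percent of the data and to test it against the remaining 10 percent, while the 1.0 means that all users are included in the evaluation. The printed result is the average absolute difference between predicted and actual preferences, so the lower the value, the better the recommender.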

Both Apache Giraph and Apache Tez are focused on graph processing. Apache Giraph is a very popular tool for graph processing. A famous use case for Giraph is the social graph at Facebook. Facebook uses Giraph to analyze how one person might know another in order to figure out who else could become friends. Graph processing also tackles problems such as the travelling salesman problem, trying to answer the question of the shortest route to reach all customers.
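To give an impression of what such a graph algorithm looks like, the following listing is a minimal sketch of single-source shortest paths in Giraph, loosely following the SimpleShortestPathsComputation example that ships with Giraph; class and method names may differ slightly between Giraph versions, and the source vertex ID is a made-up value.

import java.io.IOException;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Every vertex keeps the shortest distance found so far and propagates
// improvements to its neighbours until no vertex changes anymore.
public class ShortestPathsComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1L; // hypothetical start vertex

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // The source starts at distance 0, all others take the best incoming message
    double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If a shorter path was found, store it and inform all neighbours
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt();
  }
}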

Apache Tez is focused on improving performance when working with such graphs of processing steps: instead of chaining many separate MapReduce jobs, a job is expressed as a single directed acyclic graph (DAG) of tasks. This makes development much easier and significantly reduces the number of MapReduce jobs executed underneath. Apache Tez thereby delivers much better performance than typical MapReduce queries and optimizes resource management.

The following figure demonstrates graph processing with and without Tez.

MapReduce without Tez

With Tez

S4 is another near-real-time streaming project for Hadoop. S4 is built with a decentralized architecture in mind, focusing on a scalable and event-oriented design. S4 runs as a long-running process that analyzes streaming data.
S4 is built with Java and with flexibility in mind. This is achieved via dependency injection, which makes the platform very easy to extend and change. S4 relies heavily on loose coupling and dynamic association via the publish/subscribe pattern. This makes it easy to integrate S4 sub-systems into larger systems, and services on individual sub-systems can be updated independently.
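As a rough illustration of the loose coupling that publish/subscribe provides (this is not S4's actual API, just a minimal generic sketch in Java): producers publish events under a stream name and consumers subscribe to that name, so neither side needs to know about the other.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Minimal publish/subscribe broker: publishers and subscribers only share
// the stream name, so either side can be replaced or updated independently.
public class EventBroker {
    private final Map<String, List<Consumer<Object>>> subscribers = new ConcurrentHashMap<>();

    public void subscribe(String stream, Consumer<Object> handler) {
        subscribers.computeIfAbsent(stream, s -> new CopyOnWriteArrayList<>()).add(handler);
    }

    public void publish(String stream, Object event) {
        subscribers.getOrDefault(stream, List.of()).forEach(handler -> handler.accept(event));
    }

    public static void main(String[] args) {
        EventBroker broker = new EventBroker();
        // An analytics component subscribes without knowing who produces the events
        broker.subscribe("clicks", event -> System.out.println("analyzing " + event));
        // A front-end component publishes without knowing who consumes the events
        broker.publish("clicks", "user 42 viewed product 123");
    }
}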
S4 is built to be highly fault-tolerant. Mechanisms built into S4 allow fail-over and recovery.

Apache Storm is in charge of analyzing streaming data in Hadoop. Storm is extremely powerful at analyzing streaming data and is capable of working in near real-time. Storm was initially developed by Twitter to power their streaming API. At present, Storm is capable of processing one million tuples per node per second. The nice thing about Storm is that it scales linearly.

The Storm architecture is similar to that of other Hadoop projects. However, Storm brings its own components. First, there is Nimbus. Nimbus is the controller for Storm, similar to the JobTracker in Hadoop. Apache Storm also utilizes ZooKeeper for coordination. A Supervisor runs on each instance and takes care of the tuples once they come in. The following figure shows this.

Storm Topology

Apache Storm is built around four major concepts: streams, spouts, bolts, and topologies.

Storm Tuples

Streams are unbounded sequences of tuples, a spout is a source of streams, bolts consume input streams and produce new output streams, and a topology is a network of spouts and bolts.
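A rough sketch of how these concepts translate into code is shown below; the spout and bolt names are made up for illustration, the package names refer to recent Storm releases (older releases used backtype.storm instead of org.apache.storm), and SentenceSpout stands for any spout that emits tuples with a “sentence” field.

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// A bolt that consumes a stream of sentences and emits a stream of words
public class WordSplitBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Every incoming tuple carries one sentence; emit one tuple per word
        for (String word : input.getStringByField("sentence").split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// Wiring the topology: the bolt subscribes to the spout's output stream
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentences", new SentenceSpout());   // hypothetical spout emitting sentences
builder.setBolt("split", new WordSplitBolt(), 4)      // four parallel bolt tasks
       .shuffleGrouping("sentences");                 // subscribe to the spout's stream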

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, as I will focus on business functions.
This post’s focus: Logistics.
Big Data is a key driver for logistics. By logistics, I mean both companies that provide logistics solutions and companies that take advantage of them. On the one hand, Big Data can significantly improve the supply chain of a company. For years, or even decades, companies have relied on “just in time” delivery. However, “just in time” wasn't always “just in time”. In many cases, the time an item spent in stock was simply reduced, but it still needed to be stored somewhere: either in a temporary warehouse on-site or in the delivery trucks themselves. The first approach is capital-intensive, since these warehouses need to be built (and extended in case of growth). The second approach keeps the delivery vehicles waiting, which creates expenses on the operational side: every minute a driver has to wait costs money. With analytics, just in time delivery can be further improved and optimized to lower costs and increase productivity.
Another key driver for Big Data in logistics is route optimization. Algorithms can improve routes and make them faster, which lowers costs and also significantly benefits the environment. But this is not the end of the possibilities: routes can also be optimized in real time. This includes traffic prediction and jam avoidance. Real-time algorithms will not only calculate the fastest route but also the most environmentally friendly and the cheapest route. This again saves the company both costs and time.
Header Image by Nick Saltmarsh / CC BY