Learn about common Big Data technologies, such as Apache Hadoop, Apache Spark and the projects associated with them.

The Apache Software Foundation announced that Apache Storm is now a top-level Apache project. But what is Apache Storm about? Basically, Apache Storm is a project to analyse data streams in near real time. Storm works with messages and analyses what is going on as the data arrives. Storm originated at Twitter, which uses it for its streaming API. Storm is about processing time-critical data, and it guarantees that your data gets processed. It is fault tolerant and scalable.

Apache Storm is useful for fraud prevention in gambling, banking and financial services, but not only there. Storm can be used wherever real-time or time-critical applications are necessary. At the moment, Storm can process one million tuples per second per node. This is massive, given the fact that Storm is all about scaling out. Imagine adding 100 nodes!

Apache Storm works with tuples that come from spouts. A spout is the source of a stream, typically fed by a messaging system such as Apache Kafka. Storm supports many more messaging systems and can easily be extended through its abstraction layer.

Storm consists of some major concepts, illustrated in the following image. Nimbus is the master node, similar to Hadoop's JobTracker. ZooKeeper is used for cluster coordination, and the Supervisor runs the worker processes. Each worker process consists of several subsets: executors, which are threads spawned by the worker, and the tasks themselves.
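To make the tuple flow concrete, here is a minimal bolt sketch using the standard Storm Java API. The class name and the field names ("word", "upper_word") are illustrative assumptions, not part of any fixed Storm schema; the package names assume a recent release under org.apache.storm, while older releases shipped the same classes under backtype.storm.

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative bolt: consumes one tuple at a time from an upstream spout or bolt
// and emits a transformed tuple downstream.
public class UppercaseBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Read a field from the incoming tuple ("word" is an assumed field name) ...
        String word = input.getStringByField("word");
        // ... and emit a new tuple. BaseBasicBolt acknowledges tuples automatically,
        // which is part of Storm's guarantee that every tuple gets processed.
        collector.emit(new Values(word.toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declare the schema of the tuples this bolt emits.
        declarer.declare(new Fields("upper_word"));
    }
}
```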

Apache Storm has four major concepts: streams, spouts, bolts and topologies.

Tuples in Apache Storm

Streams are unbounded sequences of tuples, a spout is a source of streams, bolts process input streams and create new output streams, and a topology is a network of spouts and bolts; a minimal wiring sketch of these concepts follows below.

The header image is provided under a Creative Commons license by MattysFlicks.
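The following sketch wires the four concepts together in Java. It reuses the UppercaseBolt from the earlier sketch and assumes a hypothetical WordSpout that emits tuples with a single "word" field; in a real deployment the spout would typically read from a messaging system such as Kafka. It runs on an in-process LocalCluster for testing only; on a real cluster the topology would be submitted to Nimbus via StormSubmitter.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the source of the stream. WordSpout is a hypothetical spout
        // that emits tuples with a single "word" field (e.g. read from Kafka).
        builder.setSpout("words", new WordSpout(), 2);

        // Bolt: subscribes to the spout's stream. fieldsGrouping routes all
        // tuples with the same "word" value to the same bolt task.
        builder.setBolt("uppercase", new UppercaseBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2); // worker processes distributed across supervisors

        // In-process cluster for local testing; on a real cluster you would use
        // StormSubmitter.submitTopology(...) to hand the topology to Nimbus.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", conf, builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```

The parallelism hints (2 spout executors, 4 bolt executors) and the two worker processes map directly onto the worker/executor/task concepts described above: the supervisors spawn the worker processes, and each worker runs a share of the executors as threads.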

Big Data involves a lot of different technologies, and each of them requires different knowledge. I've described the knowledge necessary in an earlier post. Today, we will have a look at the Big Data technology layers.

In this post, I want to outline all necessary technologies in the Big Data stack. The following image shows them:

Technologies in the Big Data Stack

The layers are:

  • Management: This layer addresses the problem of how to store data on hardware or in the cloud and what resources need to be scheduled. It is basically the knowledge involved in datacenter design and/or cloud computing for Big Data.
  • Platforms: This layer is all about Big Data technologies such as Hadoop and how to use them.
  • Analytics: This layer is about the mathematical and statistical techniques necessary for Big Data. It is about asking the questions you need to answer.
  • Utilisation: The last and most abstract layer is about the visualization of Big Data. This is mainly used by visual artists and presentation software.

Each of the layers needs different knowledge and also different hardware and software. As described earlier, it is simply not possible to have a single piece of software that fits all needs, and you need to build a team that has knowledge in all of these areas.

I hope you enjoyed the first part of this tutorial about Big Data technology. This post is part of the Big Data Tutorial; make sure to read the entire series.