Hadoop Tutorial – Apache Accumulo

Apache Accumulo is another NoSQL database in the Hadoop stack. Accumulo is based on Google’s BigTable design and is a sorted, distributed key/value store.

Key/value stores basically do not operate on rows, but querying them is still possible – often with a performance trade-off. Accumulo also allows us to query large rows that typically wouldn’t fit into memory.

Accumulo is also built for high availability, scalability and fault tolerance. Regarding the ACID properties, Accumulo supports isolation. This basically means that recently inserted data isn’t returned if the insert happened after the query was sent.

Accumulo is built with a plug-in-based architecture and provides a comprehensive API. With Accumulo, it is possible to execute MapReduce jobs as well as bulk and batch operations.

The following figure outlines how a key/value pair is represented in Accumulo. The key consists of the row id, a column specifier and a timestamp. The column itself contains information about the column family, the qualifier and the visibility.

Apache Accumulo
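To illustrate the sort order behind this key layout, here is a minimal, self-contained sketch. It is not the real org.apache.accumulo.core.data.Key class – all names and sample rows are made up for the example – it only models the idea that entries are kept sorted by row id, then column family, then qualifier:

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified model of Accumulo's sort order: entries are ordered by
// row id, then column family, then column qualifier.
public class SortedKeyValueSketch {

    // Join the key parts with a separator so that lexicographic ordering
    // of the combined string mirrors row/family/qualifier ordering.
    static String key(String row, String family, String qualifier) {
        return row + "\u0000" + family + "\u0000" + qualifier;
    }

    public static void main(String[] args) {
        Map<String, String> table = new TreeMap<>(); // TreeMap keeps keys sorted
        table.put(key("row2", "content", "text"), "second row");
        table.put(key("row1", "meta", "author"), "first row, meta");
        table.put(key("row1", "content", "text"), "first row, content");

        // Iteration happens in sorted key order:
        // row1/content/text, row1/meta/author, row2/content/text
        table.forEach((k, v) -> System.out.println(k.replace('\u0000', '/') + " -> " + v));
    }
}
```

In the real system this ordering is what makes range scans over rows efficient.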

The next sample shows how Accumulo code is written. It demonstrates how to write a text value to the database.

import org.apache.hadoop.io.Text;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

// Row id, column family, column qualifier and visibility of the new entry
Text uid = new Text("columnid");
Text family = new Text("columnFamily");
Text qualifier = new Text("columnQualifier");
ColumnVisibility visibility = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value("Here is my text".getBytes());

// A Mutation collects the changes to a single row; to persist it,
// it would be handed to a BatchWriter.
Mutation mutation = new Mutation(uid);
mutation.put(family, qualifier, visibility, timestamp, value);

Big Data is everywhere! In all major industries

Over the last weeks, I outlined several industries that can benefit from Big Data. However, those posts gave just a short overview of what is possible. Let me use this post to sum up the industries that benefit from Big Data. You can get an overview via this tag.
In the first post, I started with manufacturing. This traditional industry sees major benefits from Big Data, especially with Industry 4.0. You can read the full post here. Big Data is already used heavily by another industry – the finance sector. Major banks, insurers and financial service providers use Big Data. I outlined the possibilities in this post.
Big Data is also a big deal for the public sector. Not only has the Obama administration announced that it will make more data available – Big Data also brings major benefits to smart cities and the like. You can read the full post here. Healthcare is often included in the public sector, and it sees great benefits from using Big Data as well. I’ve summed up the benefits here.
The oil and gas industry can also benefit from Big Data by applying analytics to sensor data while drilling. A sector where you might not expect benefits from IT or Big Data is agriculture. But Big Data can bring major benefits to this industry as well – as described here.
Next week I will start to look at the functions within a company – to see where Big Data is within a company – independent from the industry.

Get the brand new Hadoop e-book for 0.99 USD instead of 4.99 USD

I’ve created a new e-book providing an overview of the Hadoop technology. The usual price is 4.99 USD, but until the end of the week it is available for only 0.99 USD – a massive discount for early buyers. The e-book gives an overview of Hadoop projects and is intended for those who need to get started fast with Hadoop. It focuses on explaining the technology stack rather than the details of each individual technology.
From the cover:
Kick Start: Hadoop is an e-book on the Hadoop technology. The focus of the Kick Start series is to provide a very fast entry into a new technology. This e-book is useful if you need to build up knowledge on Hadoop within hours and don’t want to spend weeks learning the content. It is aimed at consultants, managers, trainers, students and sales staff who need an overview of all Hadoop technologies but don’t need to understand the technical details. This book is all about getting you started fast, without spending days or even weeks trying to understand the technology.
From the Index:
1 Introduction
1.1 Overview on Big Data
1.2 What is Hadoop and why is it important for Big Data?
1.3 The Hadoop Stack
2 Cluster Management with Hadoop
2.1 Apache Ambari
2.2 ZooKeeper
2.3 Oozie
3 Infrastructure and Support
3.1 The Hadoop File System (HDFS)
3.2 Hadoop Commons
3.3 Apache YARN
4 Storing Data with Hadoop
4.1 HBase
4.2 Accumulo
4.3 Other Databases
5 Accessing Data with Hadoop
5.1 MapReduce for Native Data Access
5.2 SQL Tools in Hadoop with Apache Hive and Apache HCatalog
5.3 Scripting Data with Apache Pig
5.4 Accessing Streaming Data with Apache Storm
5.5 Accessing Real-Time Data with Apache S4
5.6 Graph Data in Hadoop with Apache Giraph and Tez
6 Data Science in Hadoop with Apache Mahout
7 Data Governance and Data Integration In Hadoop
7.1 Apache Falcon
7.2 Apache Flume
7.3 Apache Sqoop
7.4 Apache Avro
8 User Interface in Hadoop with Apache Hue
You can obtain the E-Book on Amazon for Kindle here:

Hadoop Tutorial – Apache HBase

HBase is one of the most popular databases in the Hadoop and NoSQL ecosystem. HBase is a highly scalable database that fulfills the partition tolerance and availability properties of the CAP theorem. In case you aren’t familiar with the CAP theorem: it states that the desirable guarantees for a database are consistency, availability and partition tolerance. However, you can only have two of them at a time – the third one comes with a trade-off.

HBase is a key/value store. Tables in HBase have no fixed schema (they are schema-less), which gives you much more flexibility than a traditional relational database. HBase takes care of failover and sharding of data for you.
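As a rough illustration of this schema-less model, the following sketch stores rows as nested maps of column family, qualifier and value. All names are made up for the example; the real client API lives in the org.apache.hadoop.hbase.client package and is not used here:

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of schema-less rows: each row maps column families
// to qualifier/value pairs, and different rows may carry entirely
// different qualifiers - no schema is declared up front.
public class SchemalessRows {
    // row -> family -> qualifier -> value
    static Map<String, Map<String, Map<String, String>>> table = new TreeMap<>();

    static void put(String row, String family, String qualifier, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .put(qualifier, value);
    }

    public static void main(String[] args) {
        put("user1", "info", "name", "Alice");
        put("user1", "info", "email", "alice@example.org");
        put("user2", "info", "name", "Bob");    // this row has no email column
        put("user2", "stats", "logins", "17");  // a family user1 never uses

        System.out.println(table);
    }
}
```

The point of the sketch: columns exist per row, not per table, which is exactly the flexibility the schema-less design provides.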

HBase uses HDFS as storage and ZooKeeper for the coordination. There are several region servers that are controlled by a master server. This is displayed in the next image.

Apache HBase

Big Data in Agriculture

Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.
Today’s focus: Big Data for Agriculture.
Well wait – farming and IT? Really? Short answer: YES!
I believe that we are at the brink of something revolutionary in agriculture. The sector was largely ignored during industrialization and the ongoing digitalization. Agriculture (at least in Europe) is done by many farmers cultivating rather small plots of land. Big Data is not about to change this in favor of a few farmers with large holdings – the changes are more about performance, quality and quantity.
I recently had a very interesting discussion with someone from a European ministry working on IT and agriculture. They expect a lot from Big Data. First, they want to improve how terrain is used by integrating geo-data from satellites. Analyzing the terrain and its former usage gives additional insight into what to grow in a specific place. The ministry also wants to integrate weather data in combination with what grains grow in a specific place. This would give additional information on where water is missing. The long-term idea is to integrate drones that take care of watering grains and plants that have had too little water so far. This is also useful for “premier” goods such as wine: better quality means higher prices and profits for farmers.
At present, companies such as John Deere are working on integrating data into their products and services. We can expect some very interesting things to happen here 😉

CloudVane once again rated among the Top Blogs!

I am happy that my blog, CloudVane, once again received a great honor: a UK-based consultancy named CloudVane one of the top blogs on Cloud Computing and Big Data! Thanks for that! It is a great honor to be named next to sites like InfoWorld and ReadWrite!
CloudVane continues its success in the online world 😉 Stay tuned for more great articles! If you haven’t subscribed to the blog yet, you should definitely do so – feel free to subscribe here.
And finally, here is the link to the nomination.

Hadoop Tutorial – Apache YARN

Apache YARN can easily be called “the answer to everything”. YARN takes care of most things in Hadoop, and you will constantly use YARN without noticing it. YARN is the central point of contact for all operations in the Hadoop ecosystem and, among other things, executes all MapReduce jobs. YARN takes care of:

  • Resource Management
  • Job Management
  • Job Tracking
  • Job Scheduling

YARN is built of three major components. The first one is the Resource Manager, which takes care of distributing resources to individual applications. Next, there is the Node Manager. This component runs on the node that a specific job is executed on. The third component is the Application Master. The Application Master is in charge of retrieving tasks from the Resource Manager and coordinating the work with the Node Manager. An Application Master typically handles one or more tasks.

Yarn components

The following image displays a common workflow in YARN.

YARN architecture

YARN is used by all other projects, such as Hive and Pig. It is possible to access YARN from Java applications or via a REST interface.
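As a hedged illustration of the REST interface, the following sketch builds the documented ResourceManager endpoint URLs (the /ws/v1/cluster/... paths are part of the RM REST API); the host “rm-host”, port 8088 and the application id are placeholder assumptions for a default-configured cluster:

```java
import java.net.URI;

// Sketch of how the YARN ResourceManager REST API is addressed.
// An HTTP GET against these URIs returns JSON (or XML) describing
// the cluster and its applications.
public class YarnRestUrls {
    static final String BASE = "http://rm-host:8088/ws/v1/cluster";

    static URI clusterInfo()  { return URI.create(BASE + "/info"); }   // general cluster info
    static URI listApps()     { return URI.create(BASE + "/apps"); }   // all applications
    static URI app(String id) { return URI.create(BASE + "/apps/" + id); } // one application

    public static void main(String[] args) {
        System.out.println(clusterInfo());
        System.out.println(app("application_1400000000000_0001"));
    }
}
```

Any HTTP client can then issue GET requests against these URIs; no YARN client libraries are required on the caller’s side.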

How is Big Data used by online casinos?

If you spend much time in a casino, you’ll quickly notice the familiar relationships that develop between croupiers and their regular players. Casinos are an unusual and unique world, many patrons visit to unwind or to meet a friendly community; a certain level of player/dealer intimacy certainly helps in that aim.
Online establishments can sometimes struggle to offer the same level of community experience. Many of the latest, most advanced casinos include chat functions, live games and “hosts” to help recreate the “live playing experience”. However, Big Data can also play an important part in improving the consumer’s gaming experience.
Big Data refers to the extremely large data sets available to modern companies and researchers; often derived from cookies, loyalty cards and other tracking tools. In few industries do consumers reveal as much about their preferences as they do while gambling and casinos record it all through cameras, chip scanners and loyalty cards.
An online casino will know, for example, exactly what games a customer plays, when and with what pattern. Some players will make regular, predictable deposits and play slot games with an unwavering stake, while another might play Poker tournaments and turn to roulette if winning. The biggest sectors of the online gambling market are sports, bingo and table games. Given that the games are generally provided by a small pool of developers, and sports odds set by tote, the main factor that distinguishes competing gaming brands is bonus offering. Customer data can be invaluable in crafting personalized bonus messages, arriving at relevant times.
Part of the challenge is gathering useful data from as many sources as possible. Fortunately a new generation of social games is emerging which will likely connect directly into social media. If they take off, gambling businesses will be able to plumb Facebook and Twitter accounts for telling indicators. Betting sites often offer hospitality packages and prize draws to reward loyal players and incentivise regular play; they would have much more emotional draw if they could be individually targeted to a customer’s favourite team or band.
Casinos already offer bonuses tailored to a player’s favourite game type. However, Big Data at sites including Uptown Aces promises the ability to tailor bonuses on a completely individual level. For example, some casinos now time bonus emails to coincide with a player’s return from work and add extra offers for their favourite teams.
Sports betting has long been the richest online gambling market, with the vast majority of gambling advertising focused on, and screened around major sporting events. Online sportsbooks have adopted a strategy of competing indirectly on “insurance” or “money back” deals – which play to natural superstition and avoid damaging price competition.
The problem is, we each have our individual bugbears and bogey men when sports betting. One player may be constantly undone by last minute penalties, while another may constantly see his team reduced to 10 men. The ultimate objective of Big Data will be achieved when sportsbooks can seamlessly anticipate our worries and choices, providing individual markets based on the fate of previous bets.
Big Data has become a buzzword over the last decade, but for good reason. The human race now generates vastly more data than it can currently find a use for, finding that use will undoubtedly create many more efficiencies in our lives. As companies learn how to simplify and sort their vast Excel spreadsheets, they should take some pointers from the gaming industry – where information has been successfully leveraged for long term profit.

Big Data in Oil and Gas

Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.
Today’s focus: Big Data for Oil and Gas.
There are several benefits for the oil & gas industry with Big Data. A key benefit comes from real-time monitoring of sensors along the production chain. This starts with monitoring during drilling and continues when hauling oil, making it possible to react immediately to changed pressure and other factors. Big Data can also largely improve results during the refining phase. Last but not least, it is possible to adjust global operations in the oil & gas industry by using data.

Hadoop Tutorial – The Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is one of the key services of Hadoop. HDFS is a distributed file system that abstracts the file system of each individual hard disk away from the specific node. With HDFS, you get a virtual file system that spans several nodes and allows you to store large amounts of data. HDFS can also operate in a non-distributed way as a standalone system, but its purpose is to serve as a distributed file system.

One of the nice things about HDFS is that it runs on almost any hardware – which gives us the possibility to integrate existing systems into Hadoop. HDFS is also fault tolerant, reliable, scalable and easy to extend – just like any other Hadoop project!

HDFS works under the assumption that failures do happen, and it is built to tolerate them: failed nodes can simply be restarted, and recovery is easy with HDFS.

As streaming is a major trend in Big Data analytics, HDFS is built to serve that as well: it allows streaming access to data via batch processes.

HDFS is built for large amounts of data – you would usually store terabytes of data in it. HDFS follows a “write once, read many” approach, which means that reading data is fast and easy, but writing data might not be as performant. As a consequence, you wouldn’t build an application on top of HDFS that serves purposes other than analytics – that is not what HDFS is designed for.

With HDFS, you basically don’t move data around. Once data is in HDFS, it will likely stay there, since it is “big” – moving it somewhere else might not be efficient.

HDFS architecture

The above figure shows the HDFS architecture. HDFS has NameNodes, which take care of metadata handling, distribution of files and the like. The client talks to HDFS itself to read and write files, without knowing on which (physical) node a file resides.

There are several possibilities to access HDFS:

  • REST: HDFS exposes a REST API called WebHDFS. This REST API can also be used from Java.
  • libhdfs: a C library that you use when accessing HDFS from C or C++.
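To illustrate the WebHDFS option, the following sketch builds the documented request URLs (the /webhdfs/v1 prefix and the op= parameters are part of the WebHDFS API); the NameNode host “namenode”, the port 9870 and the sample paths are placeholder assumptions:

```java
// Sketch of how WebHDFS REST calls are formed. Reading a file is an
// HTTP GET against op=OPEN; listing a directory uses op=LISTSTATUS.
public class WebHdfsUrls {
    static final String BASE = "http://namenode:9870/webhdfs/v1";

    static String open(String path)       { return BASE + path + "?op=OPEN"; }
    static String listStatus(String path) { return BASE + path + "?op=LISTSTATUS"; }

    public static void main(String[] args) {
        System.out.println(open("/user/alice/data.txt")); // GET -> file contents
        System.out.println(listStatus("/user/alice"));    // GET -> JSON directory listing
    }
}
```

Because these are plain HTTP calls, WebHDFS can be used from any language or tool that speaks HTTP, not only from Java.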