Hadoop, where are you heading?


One of my 5 predictions for 2019 is about Hadoop. I expect that many projects won’t adopt Hadoop as a full-blown solution anymore. Why is that? One of the most exciting pieces of news in 2018 was the merger between Hortonworks and Cloudera. The two main competitors joining forces? How could this happen? I believe this didn’t come out of strength, or because the two somehow started to “love” each other, but rather out of economic calculations. The competition is no longer Hortonworks vs. Cloudera (it wasn’t even before the merger); it is Hadoop vs. new solutions. These solutions are highly diversified: Apache Spark is one of its top competitors, but there are also other platforms such as Apache Kafka, NoSQL databases such as MongoDB, and the emerging TensorFlow. One could now argue that all of that is included in a Cloudera …

read more Hadoop, where are you heading?

How to: Start and Stop Cloudera on Azure with the Azure CLI


The Azure CLI is my favorite tool to manage Hadoop clusters on Azure. Why? Because I can use the tools I am used to from Linux, now from my Windows PC. On Windows 10, I use the Ubuntu Bash for that, which gives me all the major tools for managing remote Hadoop clusters. One thing I do frequently is starting and stopping Hadoop clusters based on Cloudera. If you are coming from PowerShell, this might be rather painful, since you can only start each VM in the cluster sequentially, meaning that a cluster consisting of 10 or more nodes is rather slow to start and might take hours! With the Azure CLI I can easily do this by specifying “--no-wait”, so everything runs in parallel. The only disadvantage is that I won’t get any notification when the cluster is ready. But I solve this with a simple hack: ssh’ing into the cluster (since I …
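To make the idea concrete, here is a minimal sketch in Python that wraps the Azure CLI; the resource group name “cloudera-rg” is a made-up placeholder, and the sketch relies only on the standard az vm list, az vm start and az vm deallocate commands:

import json
import subprocess

RESOURCE_GROUP = "cloudera-rg"  # placeholder: the resource group holding the cluster VMs

# Ask the Azure CLI for all VMs in the cluster's resource group
vm_list = subprocess.check_output(
    ["az", "vm", "list", "--resource-group", RESOURCE_GROUP, "--output", "json"]
)

# Issue one start command per node; --no-wait returns immediately,
# so all nodes boot in parallel instead of one after the other
for vm in json.loads(vm_list):
    subprocess.run(
        ["az", "vm", "start",
         "--resource-group", RESOURCE_GROUP,
         "--name", vm["name"],
         "--no-wait"],
        check=True,
    )

# Stopping works the same way: swap "start" for "deallocate"
# to shut the nodes down (and stop paying for compute) in parallel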

read more How to: Start and Stop Cloudera on Azure with the Azure CLI

Why building Hadoop on your own doesn’t make sense


There are several things people discuss when it comes to Hadoop, and some of these discussions go wrong. First, there is a small number of people believing that Hadoop is a hype that will end at some point in time. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. But there are also two major claims in circulation: the first group of people states that Hadoop is cheap because it is open source, and the second group states that Hadoop is expensive because it is very complicated. (Info: by Hadoop, I also include Spark and the like.) Neither the one nor the other is true. Yes, you can download it for free and install it on your system. This makes it basically free in terms of licenses, but not in terms of running it. When you get a vanilla Hadoop, you will have to think about hotfixes, updates, services, …

read more Why building Hadoop on your own doesn’t make sense

My Big Data predictions for 2016


As 2016 is around the corner, the question is what this year will bring for Big Data. Here are my top assumptions for the year to come:

- The growth of relational databases will slow down, as more companies will evaluate Hadoop as an alternative to the classic RDBMS.
- The Hadoop stack will get more complicated, as more and more projects are added. It will almost take a team to understand what each of these projects does.
- Spark will lead the market for handling data. It will change the entire ecosystem again.
- Cloud vendors will add more and more capability to their solutions to deal with the increasing demand for workloads in the cloud.
- We will see a dramatic increase in successful use cases with Hadoop, as the first projects come to a successful end.

What do you think about my predictions? Do you agree or disagree?

Big Data and Hadoop E-Books at a reduced price


2 Big Data and Hadoop E-Books are available at a special promotion. The reduced price is only valid for 1 week, so make sure to order soon! The offer expires on the 21st of December, and both E-Books are available in the Kindle store. The two E-Books are:

- Big Data (Introduction); $0.99 instead of $5: Get it here
- Hadoop (Introduction); $0.99 instead of $5: Get it here

Have fun reading them!

How to kill your Big Data initiative


Everyone is doing Big Data these days. If you don’t work on Big Data projects within your company, you are simply not up to date and don’t know how things work. Big Data solves all of your problems, really! Well, in reality this is different. It doesn’t solve all your problems. It actually creates more problems than you think! Most companies I saw recently working on Big Data projects failed. They started a Big Data project and successfully wasted thousands of dollars on it. But what exactly went wrong? First of all, Big Data is often seen as Hadoop only. We live with the misperception that Hadoop alone can solve all Big Data topics. This simply isn’t true. Hadoop can do many things, but real data science is often not done with the core of Hadoop. Ever talked to someone doing the analytics (e.g., someone good at math or statistics)? They are not OK with writing Java …

read more How to kill your Big Data initiative

Hadoop Tutorial – Working with the Apache Hue GUI


When working with the main Hadoop services, it is not necessary to work with the console at all times (even though this is the most powerful way of doing so). Most Hadoop distributions also come with a user interface. The user interface is called “Apache Hue” and is a web-based interface running on top of a distribution. Apache Hue integrates major Hadoop projects into the UI, such as Hive, Pig and HCatalog. The nice thing about Apache Hue is that it makes the management of your Hadoop installation pretty easy with a great web-based UI. The following screenshot shows Apache Hue on the Cloudera distribution.

[Screenshot: Apache Hue]

Hadoop Tutorial – Hadoop Common


Hadoop Common is one of the easiest things to explain in the Hadoop context, even though it might get complicated when working with it. Hadoop Common is a collection of libraries and tools that are often necessary when working with Hadoop. These libraries and tools are then used by various projects in the Hadoop ecosystem. Samples include:

- A CLI MiniCluster that enables a single-node Hadoop installation for testing purposes
- Native libraries for Hadoop
- Authentication and superusers
- A Hadoop secure mode

You might not use all of the tools and libraries in Hadoop Common, as some of them are only needed for specific Hadoop projects.

Hadoop Tutorial – Serialising Data with Apache Avro


Apache Avro is a framework in Hadoop that enables data serialization. The main tasks of Avro are to:

- Provide complex data structures
- Provide a compact and fast binary data format
- Provide a container to persist data
- Provide remote procedure calls (RPCs) for the data
- Enable the integration with dynamic languages

An Avro schema is written in JSON and allows several different types:

- Elementary types: null, boolean, int, long, float, double, bytes and string
- Complex types: record, enum, array, map, union and fixed

The sample below demonstrates an Avro schema:

{"namespace": "person.avro",
 "type": "record",
 "name": "Person",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "age", "type": ["int", "null"]},
   {"name": "street", "type": ["string", "null"]}
 ]
}
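To show how such a schema is used from code, here is a minimal sketch in Python; it relies on the third-party fastavro package, which is my own assumption for illustration, not something the tutorial prescribes:

from fastavro import parse_schema, reader, writer

# The Person schema from above, expressed as a Python dict
schema = parse_schema({
    "namespace": "person.avro",
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["int", "null"]},
        {"name": "street", "type": ["string", "null"]},
    ],
})

# Two sample records; age and street may be None thanks to the unions with "null"
people = [
    {"name": "Alice", "age": 30, "street": "Main Street 1"},
    {"name": "Bob", "age": None, "street": None},
]

# Serialize the records into a compact binary Avro container file
with open("people.avro", "wb") as out:
    writer(out, schema, people)

# Read them back; the schema travels inside the file
with open("people.avro", "rb") as src:
    for person in reader(src):
        print(person)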

Hadoop Tutorial – Import large amounts of data with Apache Sqoop


Apache Sqoop is in charge of moving large datasets between different storage systems, for instance from relational databases into Hadoop. Sqoop supports a large number of connectors, such as JDBC, to work with different data sources, and it makes it easy to import existing data into Hadoop. Sqoop supports the following databases:

- HSQLDB starting with version 1.8
- MySQL starting with version 5.0
- Oracle starting with version 10.2
- PostgreSQL
- Microsoft SQL Server

Sqoop provides several possibilities to import and export data from and to Hadoop; a typical import is sketched below. The service also provides several mechanisms to validate data.
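As a rough illustration, here is a minimal Python sketch that wraps the sqoop command line; the MySQL host, database, table, user and target directory are made-up placeholders, and only standard Sqoop flags are used:

import subprocess

# Import one table from a relational database into HDFS.
# Host, database, table, user and target directory are placeholders.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/shop",  # JDBC connection string
        "--username", "sqoop_user",
        "-P",                               # prompt for the password at runtime
        "--table", "customers",             # source table in the database
        "--target-dir", "/data/customers",  # HDFS directory for the imported files
        "--num-mappers", "4",               # four parallel map tasks do the copying
    ],
    check=True,
)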