One of my five predictions for 2019 is about Hadoop: I expect that many projects will no longer adopt Hadoop as a full-blown solution. Why is that? What is the future of Hadoop?

What happened to the future of Hadoop?

One of the most exciting pieces of news in 2018 was the merger between Hortonworks and Cloudera. The two main competitors joining forces? How could this happen? I believe the merger didn’t come out of strength, or because the two suddenly started to “love” each other, but rather out of economic calculation. The competition is no longer Hortonworks vs. Cloudera (it wasn’t even before the merger); it is Hadoop vs. new solutions.

These solutions are highly diversified. Apache Spark is one of Hadoop’s top competitors, but there are also other platforms such as Apache Kafka, NoSQL databases such as MongoDB, and the emerging TensorFlow. One could argue that all of these are included in a Cloudera or Hortonworks distribution, but it isn’t that simple. The companies founded by the creators of Spark and Kafka (Databricks and Confluent) provide their own distributions of their stacks, which are more lightweight than the complex Hadoop stack. In many use cases it simply isn’t necessary to run a full-blown solution when a lightweight one will do.

The Cloud is the real threat to Hadoop

But the real threat comes from something else: the cloud. Hadoop has always run better on bare metal, and both pre-merger companies still argue that it does. Other solutions such as Spark perform well in the cloud and were built for it. This is the real threat to Hadoop, because the cloud simply won’t go away, and most companies are switching to it.

Object stores provide a great and cheap alternative to HDFS, and managing them is much easier. I only call them an alternative here because object stores still lack several enterprise features. However, I expect that the large cloud providers such as AWS and Microsoft will invest significantly in this space and ship substantial additions to their object stores this year. Object stores in the cloud will catch up fast, and will probably surpass HDFS in functionality by 2020. If that happens, and the cost advantage over bare-metal Hadoop remains, there is really no need for HDFS anymore.

On the analytics layer, the cloud is also far superior. Running dynamic Spark jobs against data in object stores (or managed NoSQL databases) is impressive: you don’t have to manage clusters anymore, which takes a lot of pain and headache away from large IT departments and increases development speed. Another disadvantage I see for the leading Hadoop vendors is their sales force: salespeople are compensated better for on-premise deals, so they advise companies to stay out of the cloud, which isn’t the best strategy in 2019.

What about enterprise adoption of Hadoop?

However, there is still some hope in enterprise integration, which Hadoop distributions often handle better. And even though the entire world is moving to the cloud, many legacy systems still run on-premise. Also, after the HWX/Cloudera merger, the combined company’s mission statement became to be the leading company for big data in the cloud. If they fully execute on this, I am sure there is a huge market ahead of them, and the threats described at the beginning could even be averted. Let’s see what 2019 and 2020 will bring in this respect and what the future of Hadoop holds.

This post is part of the “Big Data for Business” tutorial, in which I explain various aspects of handling data right within a company.

The Azure CLI is my favorite tool for managing Hadoop clusters on Azure. Why? Because I can use the tools I know from Linux on my Windows PC. On Windows 10, I use the Ubuntu Bash for that, which gives me all the major tools for managing remote Hadoop clusters.
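By the way, if the Azure CLI isn’t installed in your Ubuntu Bash yet, Microsoft documents a one-line installer for Debian/Ubuntu, followed by a login. This is the installer URL documented at the time of writing, so verify it against the current docs:

curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
az login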
One thing I do frequently is starting and stopping Hadoop clusters based on Cloudera. If you are coming from PowerShell, this can be rather painful, since there you can only start each VM in the cluster sequentially, meaning that a cluster of 10 or more nodes is rather slow to start and might take hours! With the Azure CLI, I simply specify “--no-wait” and everything runs in parallel. The only disadvantage is that I won’t get a notification when the cluster is ready. I work around this with a simple hack: ssh’ing into the cluster (which I have to do anyway). SSH will succeed once the master nodes are ready, and then I can perform some tasks on the nodes, such as restarting Cloudera Manager, since CM is usually a bit “dizzy” after being sent to sleep and woken up again :)
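The hack itself is simple enough to sketch in a few lines. In this sketch, the host name “master0.example.com”, the user “azureuser”, and the BatchMode/ConnectTimeout settings are placeholders and assumptions for illustration; cloudera-scm-server is the standard service name on a Cloudera Manager host:

# Poll SSH until the master node accepts connections. BatchMode=yes
# makes ssh fail instead of prompting, so the loop doesn't hang.
until ssh -o BatchMode=yes -o ConnectTimeout=5 azureuser@master0.example.com 'exit' 2>/dev/null; do
  echo "Master node not reachable yet, retrying in 30 seconds..."
  sleep 30
done
# Master is up: give Cloudera Manager its wake-up restart.
ssh azureuser@master0.example.com 'sudo systemctl restart cloudera-scm-server'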
Let’s start with the easiest step: stopping the cluster. Every Azure CLI command starts with “az” (meaning Azure, of course). The command for stopping one or more VMs is “vm stop”. The only two things I need to provide are the IDs of the VMs I want to stop and “--no-wait”, since I want the script to return right away.
So, the script would look like the following:

az vm stop --ids YOUR_IDS --no-wait

However, this still has one major disadvantage: you would need to hardcode all the IDs. That doesn’t matter if your cluster never changes, but in my case I add and remove VMs, so this script doesn’t work well for me. Luckily, the CLI is very flexible (and so is Bash), and I can query all the VMs in a resource group. This gives me the IDs currently in the cluster (assuming I delete dropped VMs from and add new VMs to the resource group). The query for retrieving all VMs in a resource group is easy:

az vm list --resource-group YOUR_RESOURCE_GROUP --query "[].id" -o tsv

This gives me all the IDs in the resource group. The real fun starts when combining the two commands into one statement:

az vm stop --ids $(az vm list --resource-group clouderarg --query "[].id" -o tsv) --no-wait

This is really nice and easy 🙂
Starting the VMs in a resource group works the same way:

az vm start --ids $(az vm list --resource-group mmhclouderarg --query "[].id" -o tsv) --no-wait
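Since starting is the case where you actually want to know when the cluster is back, the start command combines nicely with the SSH polling hack from above into one small script. As before, “mmhclouderarg” is my resource group, while the host name and user are placeholders for your own master node:

az vm start --ids $(az vm list --resource-group mmhclouderarg --query "[].id" -o tsv) --no-wait

# Wait until the master node answers, then report that the cluster is ready.
until ssh -o BatchMode=yes -o ConnectTimeout=5 azureuser@master0.example.com 'exit' 2>/dev/null; do
  sleep 30
done
echo "Cluster is ready."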