We learned the basics of Hive in the previous tutorial. In the following tutorials, we will work with Hive on the Hortonworks Sandbox. Hortonworks is one of the Hadoop distributions (next to Cloudera and MapR), and the Sandbox is a pre-configured environment that needs no additional setup or installation. It is delivered as different VMs or as a Docker container. We use the Docker variant, as it is the easiest way (and you don’t need to install any VM tools). To get started, download the latest Docker version for your PC or Mac: https://www.docker.com/get-started. Then we can start setting up the Hortonworks Sandbox with Docker.

Follow the installation routine to the end; it is easy and straightforward. Once done, download the Hortonworks image from https://hortonworks.com/downloads/#sandbox

As an install type, select “Docker” and make sure that you have the latest version. At the time of writing, the current version of HDP (Hortonworks Data Platform) is 3.0. Once the download has finished, execute the downloaded script (on Linux and Mac: docker-deploy-hdp30.sh). The script then pulls several repositories. This takes some time, as they are several GB in size, so be patient!

The script also installs the HDP proxy tool, which might cause some errors. If your directory names contain whitespace, you need to edit the HDP proxy sh file (e.g. with vim) and wrap all paths in double quotes (""). Then everything should be fine.
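To see why unquoted paths with whitespace cause errors, here is a minimal Python sketch that uses the standard-library shlex module, which splits a command line roughly the way a POSIX shell does. The path and the cp command are made-up examples and are not taken from the HDP proxy script.

```python
import shlex

# Hypothetical path containing a space (not from the HDP proxy script).
path = "/Users/me/My Projects/hdp/proxy.conf"

# Unquoted: the path is split at the space into two separate arguments.
unquoted = shlex.split(f"cp {path} /tmp/")

# Wrapped in double quotes: the path stays together as one argument.
quoted = shlex.split(f'cp "{path}" /tmp/')

print(unquoted)
print(quoted)
```

In the unquoted case the command receives four arguments instead of three, which is the kind of failure the quoting in the sh file avoids.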

The next step is to change the admin password in your image. To do this, open a shell inside the running container with the following command:

docker exec -it sandbox-hdp /bin/bash

Execute the following command:


Now re-type the password when prompted and the services will restart. After that, you are ready to use HDP 3.0. To access your HDP, use your local IP with port 8080. You should now see the Ambari login screen. Enter “admin” and your password (the one you reset in the previous step). You are then redirected to your administration interface and should see the following screen:

Figure: HDP 3.0 with Ambari. The Ambari environment shows services that aren’t started yet in the Hortonworks Sandbox with Docker.

You might see that most of your services are shown in red. To get them to work, you need to restart them. This again takes some time, so be patient. Once your services have turned green, we are ready to go. As you can see, setting up the Hortonworks Sandbox with Docker is really easy and straightforward.

Have fun exploring HDP – we will use it in the next tutorial, where we will look at how Hive abstracts Tables and Databases.

Strategy by Nick Youngson CC BY-SA 3.0 Alpha Stock Images

Digitalisation has been a key driver for companies over the last two years. However, many companies forget that data is the oil for the digitalisation engine. Most companies have no data strategy in place, or at best a very blurry one. A lot of digitalisation strategies fail, often because data is not treated and managed properly. In this blog post, I will write about the most common errors I have seen so far. Disclaimer: I won’t offer answers for now, but it is useful to give you an insight into what you should probably avoid doing. The following steps help you to destroy your data strategy.

Step 1: Hire Data Scientists. Really: you need them

Being a Data Scientist is a damn sexy job. It is even considered to be the sexiest job of the 21st century. So why should you not have one? Or two or three? Don’t worry, just hire them. They do the magic and solve almost all of your problems around data. Don’t think about it, just do it. If you have no Data Scientist for your digitalisation strategy, it isn’t complete. Think about what they can or should do later.

In my experience, this has happened a lot in recent years. Only a few industries (e.g. banking) have experience with Data Scientists, as working with them comes naturally there. Over the last years I saw Data Scientists joining companies without a clear strategy. These Data Scientists then had to deal with severe issues:

  • Lack of data availability. Often, they have issues getting to the data. Long processes, siloed systems and commodity systems prevent them from doing so.
  • Poor data quality. Once they get to the data and want to start doing things with it, it becomes even more complex: no governance, no description of the data, poor overall quality.

So, what most companies are missing is the counterpart each Data Scientist needs: a Data Engineer. Without Data Engineers, Data Scientists can often achieve very little.

But what I have described so far is actually an almost advanced state; often, companies hire Data Scientists (at high salaries!) and then just let them do BI tasks like reporting. I saw this often, and people got frustrated. Frustration led to them leaving their jobs after just a few months. The company gained no learnings and no business benefits from it. So it clearly failed.

Step 2: Deliver & Work in silence. Let nobody know what you are doing

Digitalisation is dangerous and disruptive. It will lead to major changes in companies. This is a fact, not fiction. And you don’t need science to figure that out. So why should you talk about it? Just do it, let other units continue doing their job and don’t disrupt them.

Digitalisation is a complex topic and humans by nature tend to interpret. Also, they will start to interpret things from this topic to fit to their comfort zone. This will lead to different strategies and approaches, creating even more failed projects and a lot of uncertainty.

The approach here should be to be consistent about communication within the company and to take away fear from different units. Digitalisation is by nature disruptive, but do it with the people, not against them!

Step 3: Build even more silos to destroy the data strategy

Step 2 will most likely lead to different silos. A digital company should be capable of building and running its digital products, services and solutions on its own. There is always a high risk that different business units will create data silos. As a result, there will never be a holistic view of all of your data. Integrating the silos later on is tough and will burn a lot of money. For business units, it is often a quick win to implement one solution or another, but integrating these solutions afterwards, especially when it comes to data, is very tricky.

A lot of companies have no 360-degree view of their data. This is because business units often confront IT departments with “we need this tool now, please integrate”. This leads to issues, since IT departments are often understaffed anyway. So a swamp is created in the IT landscape, leading to an even bigger swamp of data. Integration then never really happens, as it is too expensive. Will you become digital with this? Clearly not.

Step 4: Build a sophisticated structure when the company isn’t sophisticated with this topic yet.

Data Scientists tend to sit in business units. For a data driven enterprise, this is exactly how it should be. However, only a small percentage of companies are data driven. I would argue that traditional companies aren’t data driven, only the Facebooks, Googles and Amazons of our world are.

However, traditional companies now tend to copy this system, and business units hire Data Scientists, who are then disconnected from other units and only loosely connected via internal communities. A distributed layout of your company in terms of data only makes sense once the company has reached a high level of maturity. In my opinion, it needs to be steered from a central unit first. As maturity improves, it can be decentralised step by step and then put back fully into business units.

One thing: put digitalisation very close to the CEO of the company. It needs to have some fire power as there will always be obstacles.

I’ve seen quite a lot of failures when it comes to where to place data units. In my opinion, it only makes sense in a technical unit or, if available, in the digitalisation unit. It should never be in business functions. Place it there anyway, and you will definitely succeed in destroying the data strategy.

Step 5: Don’t invest into people to destroy your data strategy

Last but not least, never invest into people. Especially Data Scientists – they should be really happy to have a job with you, so why would you also invest into them and give them education?

This is also a challenge I see a lot in companies. They simply don’t treat their employees well, and those who are in high demand (like Data Scientists) then tend to leave quickly. This is one of the key failures in data-driven strategies. Keeping people is key to a successful strategy, and a lot of companies don’t manage this well. Not investing in people is probably one of the most effective ways to destroy a data strategy.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Now it is about time to twist it around and destroy your competitors with data.


This is the kick-off to the Apache Hive Tutorial. Over the next weeks, I will post different tutorials on how to use Hive. Hive is a key component of Hadoop, and today we will start with a general description of it.

What is Apache Hive?

Basically, what is Hive all about? Hive is a distributed query engine and language (called HiveQL) for Hadoop. Its main purpose is to enable a large number of people to work with data stored in Hadoop. For this reason, Facebook introduced Hive for its analysts. Below you can see the typical data flow in a Hive project.

Hive Data Flow

The above image shows how the workflow goes: first, a Hive client sends a request to the Hive Server. After that, the driver takes over and submits to the JobClient. Jobs are then executed on a Hadoop or Spark Cluster. In our samples over the next tutorials, we will however use the Web UI from Hortonworks. But we will have a look at that later. First, let’s have a look at another component: HCatalog.

HCatalog is a service that makes it easy to use Hive. With this, files on HDFS are abstracted to look like databases and tables. HCatalog is therefore a metadata repository about the files on HDFS. Other tools on Hadoop or Spark take advantage of this and use HCatalog.
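The idea of a metadata repository can be illustrated with a small Python sketch. This is a toy in-memory catalog, not the HCatalog API: the class names, the table and the HDFS path are all hypothetical and only show how a table name can resolve to a file location, a file format and a column list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMeta:
    """Metadata describing how the files in a directory map to a table."""
    database: str
    name: str
    location: str      # directory that holds the data files
    file_format: str   # e.g. "TEXTFILE", "ORC", "PARQUET"
    columns: tuple     # (column name, type) pairs


class ToyCatalog:
    """In-memory stand-in for a metadata repository (not the HCatalog API)."""

    def __init__(self):
        self._tables = {}

    def register(self, meta: TableMeta) -> None:
        self._tables[(meta.database, meta.name)] = meta

    def lookup(self, database: str, name: str) -> TableMeta:
        return self._tables[(database, name)]


catalog = ToyCatalog()
catalog.register(TableMeta(
    database="sales",
    name="orders",
    location="hdfs:///data/sales/orders",  # hypothetical path
    file_format="ORC",
    columns=(("order_id", "bigint"), ("amount", "double")),
))

# A tool that only knows "sales.orders" can find the files and their schema.
meta = catalog.lookup("sales", "orders")
print(meta.location, meta.file_format)
```

Any tool that shares such a catalog can work with table names instead of raw file paths, which is the benefit described above.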

With traditional data warehouse or RDBMS systems, one worked in databases, and SQL was the language used to access data in these systems. Hive provides HiveQL (which we will look at in more detail in the coming blog posts). HiveQL basically works on Hadoop files, such as plain text files, ORC or Parquet.

One key aspect of Hive is that it is mainly read-oriented. This means that you usually don’t update data, as everything you do in Hadoop is built for analytics. Hive still provides the possibility to update data, but this is done as an append update (meaning that the original data isn’t altered, in contrast to RDBMS systems).
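The append-style update can be pictured with a short Python sketch. It is a conceptual illustration only and does not reproduce Hive's actual storage layout: the function name, the record structure and the sample rows are assumptions. Changes are appended as separate records and merged with the untouched original data when reading.

```python
def current_view(base_rows, delta_rows):
    """Merge appended changes over the base rows; the base is never modified."""
    view = {row["id"]: row for row in base_rows}
    for change in delta_rows:          # changes are applied in append order
        if change["op"] == "delete":
            view.pop(change["id"], None)
        else:                          # "insert" or "update"
            view[change["id"]] = change["row"]
    return sorted(view.values(), key=lambda r: r["id"])


# Hypothetical sample data.
base = [{"id": 1, "city": "Vienna"}, {"id": 2, "city": "Graz"}]
deltas = [
    {"op": "update", "id": 2, "row": {"id": 2, "city": "Linz"}},
    {"op": "insert", "id": 3, "row": {"id": 3, "city": "Salzburg"}},
    {"op": "delete", "id": 1},
]

result = current_view(base, deltas)
print(result)
print(base)  # the original rows are still intact
```

Readers see the merged result, while the original rows stay exactly as they were written.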

Apache Hive Security

One key element of Hive is security. In all enterprise environments, it is very important to secure your tables against different kinds of access. Hive therefore supports different options:

  • Storage-based authorization: Hive doesn’t care about the authorization. Authorization is handled by the storage layer (ACLs in a cloud bucket/object store or HDFS ACLs).
  • Standard-based authorization via HiveServer2 over databases: Storage-based authorization is all or nothing for a table, which is not fine-grained enough. Hive can also work with fine-grained authorization from databases to only show the columns/rows relevant to the user.
  • Authorization via Ranger or Sentry: Apache projects that do advanced authorization in Hadoop and abstract the authorization issues. They allow advanced rules and access to data.
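The difference between all-or-nothing and fine-grained authorization can be sketched in a few lines of Python. This is a toy model, not how Hive, Ranger or Sentry implement it; the function names, the sample rows and the policy are hypothetical.

```python
ROWS = [
    {"customer": "A", "country": "AT", "revenue": 100},
    {"customer": "B", "country": "DE", "revenue": 250},
]


def storage_based(rows, can_read_files: bool):
    """All or nothing: either the whole table is readable or none of it."""
    if not can_read_files:
        raise PermissionError("no read access to the table's files")
    return rows


def fine_grained(rows, allowed_columns, row_filter):
    """Return only the permitted rows, reduced to the permitted columns."""
    return [
        {col: row[col] for col in allowed_columns}
        for row in rows
        if row_filter(row)
    ]


# Hypothetical policy: an analyst for Austria must not see the revenue column.
analyst_view = fine_grained(
    ROWS, ["customer", "country"], lambda r: r["country"] == "AT"
)
print(analyst_view)
```

With the storage-based check the analyst would see both rows including revenue, or nothing at all; the fine-grained variant returns only the Austrian row without the revenue column.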

To work with Hive, you will typically use HiveQL. In the next tutorial, we will look at how to set up an environment where you can work with Hive.

This tutorial is part of the Apache Hive Tutorials. For more information about Hive, you might also visit the official page.

Header image: https://www.flickr.com/photos/karen_roe/32417107542

When Kappa first appeared as an architecture style (introduced by Jay Kreps), I was really fond of this new approach. I carried out several projects that went with Kafka as the main “thing”, without the trade-offs of Lambda. But the more complex the projects got, the more I figured out that it isn’t the answer to everything and that we ended up with Lambda again … somehow.

Kappa vs. Lambda Architecture

First of all, what is the benefit of Kappa and the trade-off with Lambda? It all started with Jay Kreps and the blog post in which he questioned the Lambda Architecture. Basically, with the different layers in the Lambda Architecture (Speed Layer, Batch Layer and Presentation Layer) you need to use different tools and programming languages. This leads to code complexity and the risk that you end up with inconsistent versions of your processing logic. A change to the logic on one layer requires changes on the other layers as well. Complexity is something we want to remove from our architecture at all times, so we should also do so in data processing.

The Kappa Architecture came with the promise to put everything into one system: Apache Kafka. The speed at which data can be processed with it is tremendous, and the simplicity is great as well. You only need to change code once, not twice or three times as with Lambda. This also leads to lower labour costs, as fewer people are necessary to maintain and produce code. Also, all our data is available at our fingertips, without the major delays of batch processing. This brings great benefits to business units, as they don’t need to wait forever for processing.

So what is the problem with Kappa Architecture?

However, my initial statement was about something else: that I mistrust the Kappa Architecture. I implemented this architecture style in several IoT projects, where we had to deal with sensor data. There was no question whether Kappa was the right thing, as we were in a rather isolated use case. But as soon as you have to look at a Big Data architecture for a large enterprise (and not only at isolated use cases), you end up with one major issue around Kappa: cost.

In use cases where data doesn’t need to be available within minutes, Kappa seems to be overkill. Especially in the cloud, Lambda brings major cost benefits with object storage in combination with automated processing capabilities such as Azure Databricks. In enterprise environments, cost does matter and an architecture should also be cost-efficient. This also holds true when it comes to the half-life of data, which I wrote about recently. Basically, data that loses its value fast should be stored on cheap storage systems from the very beginning.

Cost of Kappa Architecture

An easy way to compare Kappa to Lambda is the cost per terabyte stored or processed. As a scenario, we will store 32 TB. With a Kappa Architecture running 24/7, we would have to spend an estimated $16,000 per month (no discounts, no reserved instances, pay-as-you-go pricing; E64 instances with 64 cores per node, 432 GB RAM and E80 SSDs attached with 32 TB per disk). If we used Lambda and only processed once per day, we would need 32 TB on a blob store, which costs $680 per month. We would then use the cluster above for processing with Spark for 1 hour per day: $544. In total, this comes to $1,224 per month, a cost ratio of roughly 1:13.
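The arithmetic behind this comparison can be reproduced in a few lines of Python. The three monthly figures are taken as given from the scenario above; the variable names are only illustrative.

```python
# Monthly figures in USD, taken from the 32 TB pay-as-you-go scenario.
kappa_monthly = 16_000   # cluster running 24/7
lambda_storage = 680     # 32 TB on a blob store
lambda_compute = 544     # same cluster, used 1 hour per day

lambda_monthly = lambda_storage + lambda_compute
ratio = kappa_monthly / lambda_monthly

print(f"Lambda: {lambda_monthly} USD per month")
print(f"Kappa costs about {ratio:.1f} times as much")
```

Changing the processing frequency or the storage price in this sketch quickly shows how sensitive the ratio is to the assumptions of the scenario.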

However, this is a very simple calculation, and it can still be optimised on both sides. In the broader enterprise context, Kappa is only a specialisation of Lambda and won’t exist entirely on its own. The choice between Kappa and Lambda can only be made per use case, and this is what I recommend you do.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.