Posts

When Kappa first appeared as an architecture style (introduced by Jay Kreps) I was really fond of this new approach. I carried out several projects that went with Kafka as the main “thing” and not having the trade-offs as Lambda. But the more complex projects got, the more I figured out that it isn’t the answer to everything and that we ended up with Lambda again … somehow.

Kappa vs. Lambda Architecture

First of all, what is the benefit of Kappa and the trade-off with Lambda? It all started with Jay Kreps in his blog post when he questioned the Lambda Architecture. Basically, with different layers in the Lambda Architecture (Speed Layer, Batch Layer and Presentation Layer) you need to use different tools and programming languages. This leads to code complexity and the risk that you end up having inconsistent versions of your processing capabilities. A change to the logic on the one layer requires changes on the other layer as well. Complexity is basically one thing we want to remove from our architecture at all times, so we should also do it with Data Processing.

The Kappa Architecture came with the promise to put everything into one system: Apache Kafka. The speed that data can be processed with it is tremendous and also the simplicity is great. You only need to change code once and not twice or three times as compared to Lambda. This leads to cheaper labour costs as well, as less people are necessary to maintain and produce code. Also, all our data is available at our fingertips, without major delays as with batch processing. This brings great benefits to business units as they don’t need to wait forever for processing.

So what is the problem about Kappa Architecture?

However, my initial statement was about something else – that I mistrust Kappa Architecture. I implemented this architecture style at several IoT projects, where we had to deal with sensor data. There was no question if Kappa is the right thing – as we were in a rather isolated Use-Case. But as soon as you have to look at a Big Data architecture for a large enterprise (and not only into isolated use-cases) you end up with one major issue around Kappa: Cost.

In use-cases where data don’t need to be available within minutes, Kappa seems to be an overkill. Especially in the cloud, Lambda brings major cost benefits with Object Storages in combination with automated processing capabilities such as Azure Databricks. In enterprise environments, cost does matter and an architecture should also be cost efficient. This also holds true when it comes to the half-live of data which I was recently writing about. Basically, data that looses its value fast should be stored on cheap storage systems at the very beginning already.

Cost of Kappa Architecture

An easy way to compare Kappa to Lambda is the comparison per Terabyte stored or processed. Basically, we will use a scenario to store 32 TB. With a Kappa Architecture running 24/7, this would mean that we have an estimated 16.000$ per month to spend (no discounts, no reserved instances – pay as you go pricing; E64 CPUs with 64 cores per node, 432 GB Ram and E80 SSDs attached with 32TB per disk). If we would use Lambda and only process once per day, this would mean that we need 32TB on a Blob Store – that costs 680$ per month. Now we would take the cluster above for processing with Spark and use it 1 hour per day: 544$. Summing up, this would equal to 1.224$ per month – a cost ratio of 1:13.

However, this is a very easy calculation and it can still be optimised on both sides. In the broader enterprise context, Kappa is only a specialisation of Lambda but won’t exist all alone at all time. Kappa vs. Lambda can only be selected by the use-case, and this is what I recommend you to do.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company

The Azure CLI is my favorite tool to manage Hadoop Clusters on Azure. Why? Because I can use the tools I am used to from Linux now from my Windows PC. In Windows 10, I am using the Ubuntu Bash for that, which gives me all the major tools for managing remote Hadoop Clusters.
One thing I am doing frequently, is starting and stopping Hadoop Clusters based on Cloudera. If you are coming from Powershell, this might be rather painfull for you, since you can only start each vm in the cluster sequentially, meaning that a cluster consisting of 10 or more nodes is rather slow to start and might take hours! In the Azure CLI I can easily do this by specifiying “–nowait” and all runs in parallel. The only disadvantage is that I won’t get any notifications on when the cluster is ready. But I am doing this with a simple hack: ssh’ing into the cluster (since I have to do this anyway). SSH will succeed once the Masternodes are ready and so I can perform some tasks on the nodes (such as restarting Cloudera Manager since CM is usually a bit “dizzy” after sending it to sleep and waking it up again :))
Let’s start with the easiest step: stopping the cluster. The Azure CLI always starts with “az” as command (meaning Azure of course). The command for stopping one or more vm’s with the Azure CLI is “vm stop”. The only two things I need to provide now are the id’s I want to stop and “–nowait” since I want to quit the script right after.
So, the script would look like the following:

az vm stop --ids YOUR_IDS --no-wait

However, this has still one major disadvantage: you would need to provide all ID’s Hardcoded. This doesn’t matter at all if your cluster never changes, but in my case I add and delete vm’s to or from the cluster, so this script doesn’t play well for my case. However, the CLI is very flexible (and so is bash) and I can query all my vm’s in a resource group. This will give me the IDs which are currently in the cluster (let’s assume I delete dropped vm’s and add new vm’s to the RG). The Query for retrieving all VMs in a Resource Group is easy:

az vm list --resource-group YOUR_RESOURCE_GROUP --query "[].id" -o tsv

This will give me all IDs in the RG. The real fun starts when doing this in one statement:

az vm stop --ids $(az vm list --resource-group clouderarg --query "[].id" -o tsv) --no-wait

Which is really nice and easy 🙂
It is similar with starting VMs in a Resource Group:

az vm start --ids $(az vm list --resource-group mmhclouderarg --query "[].id" -o tsv) --no-wait

2016 is around the corner and the question is, what the next year might bring. I’ve added my top 5 predictions that could become relevant for 2016:

  • The Cloud war will intensify. Amazon and Azure will lead the space, followed (with quite some distance) by IBM. Google and Oracle will stay far behind the leading 2+1 Cloud providers. Both Microsoft and Amazon will see significant growth, with Microsoft’s growth being higher, meaning that Microsoft will continue to catch up with Amazon
  • More PaaS Solutions will arrive. All major vendors will provide PaaS solutions on their platform for different use-cases (e.g. Internet of Things). These Solutions will become more industry-specific (e.g. a Solution specific for manufacturing workflows, …)
  • Vendors currently not using the cloud will see declines in their income, as more and more companies move to the cloud
  • Cloud Data Centers will become more often outsourced from the leading providers to local companies, in order to overcome local legislation
  • Big Data in the Cloud will grow significantly in 2016 as more companies will put workload to the Cloud for these kind of applications

What do you think? What are your predictions?

Amazon announced details about their Q2 earnings yesterday. Their cloud business grew with incredible 81%. This is massive, given the fact that Amazon is already the number #1 company in that area. This quarter, they earned 1.8 billion USD from cloud computing.
Summing up this number, their revenue would definitively reach some 7 billion this year. However, if this growth continues to increase so fast, I guess they could even get double-digit by the end of this year. Will Amazon reach 10 billion in 2015? If so, this would be incredible! Microsoft stated that their growth was somewhere well above the 100% mark, so I am interested in where Microsoft will stand by the end of the year.
But what does this tell us? Both Microsoft and Amazon are growing fast in this business and we can expect that we will see many more interesting services in the coming month and years in the Cloud. My opinion is that the market is already consolidated between Microsoft and Amazon. Other companies such as Google and Oracle are rather niche players in the Cloud market.