Learn about common Big Data technologies, such as Apache Hadoop, Apache Spark and their associated projects.

In our last tutorial section, we looked at filtering, joining and sorting data. Today, we will look at more Spark data transformations on RDDs. Our focus topics for today are distinct, groupBy and union in Spark. First, we will use the "distinct" transformation.

Distinct

Distinct enables us to return exactly one instance of each item. If a value appears more than once in a dataset, distinct reduces it to a single occurrence. A sample would be the following array:

[1, 2, 3, 4, 1]

However, we only want to return each number exactly once. This is done via the distinct transformation. The following example illustrates this:

ds_distinct = sc.parallelize([1, 2, 3, 4, 1]).distinct().collect()
ds_distinct

GroupBy

A very important task when working with data is grouping. In Spark, we have the groupBy transformation for this; for key/value RDDs it is called "groupByKey". It groups the dataset by key, and you define what should happen with the grouped values by calling the "mapValues" function. Any Python function can be passed there; handy built-ins are "len" for the number of occurrences and "list" for the actual values. The following sample illustrates this with the dataset introduced in the previous tutorial:

ds_set = sc.parallelize([("Mark", 1984), ("Lisa", 1985), ("Mark", 2015)])
ds_grp = ds_set.groupByKey().mapValues(list).collect()
ds_grp

If you want to have a count instead, simply use “len” for it:

ds_set.groupByKey().mapValues(len).collect()

The output should now show the number of occurrences per key – for our sample something like [('Mark', 2), ('Lisa', 1)], though the ordering may differ:

The GroupByKey keyword in Apache Spark
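
If all you need is the number of values per key, a shortcut worth knowing is the countByKey() action, which returns the counts directly to the driver as a dictionary-like object. A minimal sketch with the same ds_set as above:

ds_counts = ds_set.countByKey()  # action that returns key -> count to the driver
dict(ds_counts)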

Union

A union joins two datasets into one. In contrast to the "join" transformation that we already looked at in our last tutorial, it doesn't take any keys into account and simply appends one dataset to the other. The syntax is straightforward: "dsone.union(dstwo)". Let's have a look at it:

ds_one = sc.parallelize([("Mark", 1984), ("Lisa", 1985)])
ds_two = sc.parallelize([("Luke", 2015), ("Anastasia", 2017)])
sorted(ds_one.union(ds_two).collect())

Now, the output of this should look like the following:

[('Anastasia', 2017), ('Lisa', 1985), ('Luke', 2015), ('Mark', 1984)]

Today, we learned a lot about Spark data transformations. In our next tutorial – part 3 – we will have a look at even more of them.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

In the last post, we learned the basics of Spark RDDs. Now, we will have a look at data transformations in Apache Spark. One data transformation was already used in the last tutorial: filter. We will look at it now as the first transformation. But before we start, we need to look at one major thing: lambda expressions. Each of the functions we use for data transformations in Spark takes a function in the form of a lambda expression. In our last tutorial, we filtered data based on a lambda expression. Let's recall the code to have a deeper look at it:

sp_pos = spark_data.filter(lambda x: x>0.0).collect()

What is crucial for us is the statement inside the parentheses "()" after the filter command. In there, we see the following: "lambda x: x>0.0". Basically, with "lambda x" we state that the following evaluation should be applied to all items in the dataset. For each iteration, we use "x" as the variable. So it reads like: "for each item x in our dataset, keep it if it is greater than 0". If x were a complex dataset, we could also use its fields and methods. But in our case, it is a simple number.

One more thing that is important for transformations: transformations in Spark are always "lazy", meaning they are not executed right away. So calling the filter function does nothing until you call an action on the RDD. The action we used above is ".collect()". Calling collect() triggers the execution of the filter.
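
To see this lazy behaviour, consider the following small sketch, using the spark_data RDD from the previous tutorial:

sp_filtered = spark_data.filter(lambda x: x > 0.0)  # returns a new RDD immediately, nothing is computed yet
sp_filtered.collect()                               # collect() is an action and triggers the actual execution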

Filtering

Filtering is one of the most frequently used transformations on data. The filter criterion is passed as a lambda expression. You can also easily chain different filter criteria, since everything is evaluated lazily.

We extend the above sample by only showing numbers that are also smaller than 3.0. The code for that looks like the following:

sp_pos = spark_data.filter(lambda x: x>0.0).filter(lambda y: y<3.0).collect()
sp_pos
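
The same filter can also be written as a single lambda expression that combines both criteria; a minimal equivalent sketch:

sp_pos = spark_data.filter(lambda x: x > 0.0 and x < 3.0).collect()
sp_pos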

Sorting

Another important transformation is sorting data. At some point, you want to arrange the data in either ascending or descending order. This is done with the "sortBy" function. The "sortBy" function takes a lambda expression that selects the field to sort by. In our example above, this isn't very relevant, since each element is just a single number. Let's have a look at how to use it with our dataset:

sp_sorted = spark_data.sortBy(lambda x: x).collect()
sp_sorted

Now, if we want to sort in the opposite order, we can set the optional "ascending" parameter to False. Let's have a look at this:

sp_sorted = spark_data.sortBy(lambda x: x, False).collect()
sp_sorted
The sorted dataset
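
With key/value pairs, the lambda expression passed to sortBy selects the field to sort by. A small sketch, using an illustrative pair RDD similar to the ones in the join example below:

ds_pairs = sc.parallelize([("Mark", 1984), ("Lisa", 1985), ("Anastasia", 2017)])
ds_pairs.sortBy(lambda x: x[1], False).collect()  # sort by the year (the second field), descending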

Joining data

Often, it is necessary to join two different datasets together. This is done with the "join" transformation. To use the join, we need to create new datasets first (and we will need them further on as well):

ds_one = sc.parallelize([("Mark", 1984), ("Lisa", 1985)])
ds_two = sc.parallelize([("Mark", 2015), ("Anastasia", 2017)])
sorted(ds_one.join(ds_two).collect())

The join transformation performs an inner join – in our sample, only the key "Mark" appears in both datasets, so the result is [('Mark', (1984, 2015))]. Outer joins are available on RDDs as well.
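
For the outer variants, PySpark offers leftOuterJoin, rightOuterJoin and fullOuterJoin; keys missing in one dataset get None as their value. A short sketch with the datasets from above:

sorted(ds_one.leftOuterJoin(ds_two).collect())   # keeps all keys from ds_one
sorted(ds_one.fullOuterJoin(ds_two).collect())   # keeps all keys from both datasets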

Today, we looked at some basic data transformation in Spark. Over the next couple of tutorial posts, I will walk you through more of them.


In the last blog post, we looked at how to access the Jupyter Notebook and work with Spark. Now, we will get hands-on with Spark and go through some first samples. One of our main goals is to write a lambda expression in Spark with Python.

Intro to the Spark RDD

We start with RDDs, the basic API in Spark. Even though it is much more convenient to use DataFrames, we begin with RDDs in order to get a basic understanding of Spark.

RDD stands for "Resilient Distributed Dataset". Basically, this means the dataset is partitioned across the cluster so it can be used for parallel computation. In RDDs, we can store any kind of data and apply functions to it (such as the sum or the average). Let's create a new Python notebook and import the required libraries.

Create a new Notebook in Python 3
The new Notebook with pyspark

Once the new notebook is created, let’s start to work with Spark.

First, we need to import the necessary libraries. We use "pyspark" and "random". Also, we need to create the Spark context that is used throughout our application. Note that you can only execute this line once, because creating a second SparkContext in the same application will raise an error.

Create the Spark Context in Python

import pyspark
import random
sc = pyspark.SparkContext(appName="Cloudvane_S01")

When done with this, hit the "Run" button in the notebook. Next to the current cell, you will see the [ ] turn into [*]. This means that the process is currently running. Once it has finished, it will turn into [1] or another incremental number. If errors occur, you will see them below your code. If this part of the code succeeded, you will see no output other than the [1].
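
A small side note: if you ever execute the SparkContext cell a second time by accident, a safer pattern is to ask Spark for an existing context instead of creating a new one. A minimal sketch:

import pyspark

# returns the already running SparkContext if there is one, otherwise creates a new one
sc = pyspark.SparkContext.getOrCreate()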

Create random data in Spark

Next, let's produce some data that we can work with. We create an array and fill it with 100 items, each being a random value generated with random.expovariate.

someValues = []
for i in range(0,100):
    someValues.append(random.expovariate(random.random()-0.5))
someValues

When the data is created, we distribute it over the network by calling “sc.parallelize”. This creates an RDD now and enables us to work with Spark.

spark_data = sc.parallelize(someValues)

Different Functions on RDD in Spark

We can apply various functions to the RDD. One sample would be to use the “Sum” function.

sp_sum = spark_data.sum()
sp_sum

Another sample is the “Count” function.

sp_ct = spark_data.count()
sp_ct

We can also do more complex calculations by defining our own functions. In Python, this is done with "def functionname(params)". The following sample calculates the average of the RDD that is passed to the function.

def average(vals):
    return vals.sum() / vals.count()

The function is simply invoked with our data.

average(spark_data)
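
As a side note, numeric RDDs also ship with a built-in mean() action that does the same thing; a quick sketch:

sp_avg = spark_data.mean()  # built-in equivalent of our average() function
sp_avg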

A lambda function in Spark and Python

Last but not least, we can also filter data. In the following sample, we only include positive values. We do this with a simple Lambda function. I’ve explained Lambda functions in detail in the Python tutorial, in case you want to learn more.

sp_pos = spark_data.filter(lambda x: x>0.0).collect()
sp_pos

Now you are ready to go and write your own lambda expression with spark in Python. There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

In our last tutorial, we had a brief introduction to Apache Spark. Now, in this tutorial, we will have a look at how to set up an environment to work with Apache Spark. To make things easy, we will set up Spark in Docker. If you are not familiar with Docker, you can learn about Docker here. To get started, we first need to install Docker. If you don't have it yet, find out how to install it from this link: https://docs.docker.com/install/. The installation procedure will take some time to finish, so please be patient.

Docker comes with an easy tool called "Kitematic", which allows you to easily download and install Docker containers. Luckily, the Jupyter team provides a comprehensive container for Spark, including Python and of course Jupyter itself. Once Docker is installed successfully, download the container for Spark via Kitematic. Select "all-spark-notebook" for our samples. Note that the download will take a while.

Download Apache Spark for Docker

Once your download has finished, it is time to start your Docker container. When you download the container via Kitematic, it is started by default. Within the container logs, you can see the URL and port to which Jupyter is mapped. Open the URL and enter the token. When everything works as expected, you can now create new notebooks in Jupyter.

Enter the URL and the Token
Jupyter is running


In this tutorial, I will provide a first introduction to Apache Spark. Apache Spark is the number one Big Data tool nowadays. It is even considered the "killer" of Hadoop, even though Hadoop isn't that old yet. However, Apache Spark has several advantages over "traditional" Hadoop. One of the key benefits is that Spark is far better suited for Big Data analytics in the cloud than Hadoop is. Hadoop itself was never built for the cloud, since it was created years before the cloud took over major workloads. Apache Spark, in contrast, was built while the cloud was becoming mainstream and thus has several benefits – e.g. it can use object stores such as Amazon S3 to access data.

However, Spark also integrates well into an existing Hadoop environment. Apache Spark runs natively on Hadoop as a YARN application and it re-uses different Hadoop components such as HDFS, HBase and Hive. Spark replaces MapReduce for batch processing with its own engine, which is much faster. Hive can also run on Spark and thus becomes significantly faster. Additionally, Spark comes with new technologies for interactive queries, streaming and machine learning.

Apache Spark is great in terms of performance. Sorting 100 TB of data, Spark is three times faster than MapReduce while using only a tenth of the nodes. It is well suited for sorting petabytes of data and it won several sorting benchmarks such as the GraySort and CloudSort benchmarks.

Spark is written in Scala

Spark is written in Scala, but it is often used from Python. However, if you want to use the newest features of Spark, it is often necessary to work with Scala. Spark uses micro-batches for "real-time" processing, meaning that it isn't true real-time; the smallest interval for micro-batches is 0.5 seconds. Spark should run in the same LAN as the data it processes. In terms of the cloud, this means the same datacenter or availability zone. Spark shouldn't run on the same nodes where the data is stored (e.g. with HBase). With Hadoop, this is the other way around: Hadoop processes the data where it is stored. There are several options to run Spark: standalone, on Apache Mesos, on Hadoop/YARN or via Kubernetes/Docker. For our future tutorials, we will use Docker.

Spark has 4 main components. Over the next tutorials, we will have a look at each of them. These 4 components are:

  • Spark SQL: Provides a SQL Language, Dataframes and Datasets. This is the most convenient way to use Spark
  • Spark Streaming: Provides Micro-batch execution for near real-time applications
  • Spark ML: Built-In Machine Learning Library for Spark
  • GraphX: Built-In Library for Graph Processing

To develop Apache Spark applications, it is possible to use either Scala, Java, Python or R. Each Spark application starts with a "driver program". The driver program executes the "main" function.
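
As a rough sketch (the file name and app name are just placeholders), a minimal PySpark driver program could look like this:

# minimal_driver.py – submitted e.g. with: spark-submit minimal_driver.py
import pyspark

def main():
    sc = pyspark.SparkContext(appName="MinimalDriver")
    data = sc.parallelize(range(100))
    print(data.sum())  # 4950
    sc.stop()

if __name__ == "__main__":
    main()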

RDD – Resilient Distributed Datasets

Data elements in Spark are called "RDDs – Resilient Distributed Datasets". They can be created from files on HDFS, an object store such as S3 or any other kind of dataset. RDDs are partitioned across different nodes, and Spark takes care of the distribution for you. RDDs can also be kept in memory for faster execution. Spark additionally works with "shared variables". These are variables shared across different nodes, e.g. for computation, and there are two types (a short sketch follows after the list):

  • Broadcast variables: used to cache values in memory on all nodes (e.g. commonly used values)
  • Accumulators: used to add values (e.g. counters, sums or similar)
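
A minimal sketch of both shared variable types in PySpark, assuming a SparkContext sc as created in the hands-on tutorials (the values are arbitrary examples):

lookup = sc.broadcast({"AT": "Austria", "DE": "Germany"})  # broadcast variable, cached on every node
counter = sc.accumulator(0)                                # accumulator, workers can only add to it

def count_known(code):
    if code in lookup.value:
        counter.add(1)

sc.parallelize(["AT", "DE", "US"]).foreach(count_known)
counter.value  # the driver can read the result: 2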

I hope you enjoyed this introduction to Apache Spark. In the next tutorial, we will have a look at how to set up the environment to work with it.


A current trend in AI is not so much a technical one – it is rather a societal one. Technologies around AI, Machine Learning and Deep Learning are getting more and more complex. This makes it ever harder for humans to understand what is happening and why a certain prediction is made. The current approach of "throwing data in, getting a prediction out" doesn't necessarily work for that. It is somewhat dangerous to build knowledge and make decisions based on algorithms that we don't understand. To solve this problem, we need explainable AI.

What is explainable AI?

Explainable AI is getting even more important with new developments in the AI space such as AutoML. With AutoML, the system takes over most of the data scientist's work. It needs to be ensured that everyone understands what's going on with the algorithms and why a prediction happens exactly the way it does. So far (and without AutoML), data scientists were basically in charge of the algorithms, so at least there was someone who could explain an algorithm. Note: that didn't prevent bias, nor will AutoML. With AutoML, where tuning and algorithm selection happen more or less automatically, we need to ensure that vital and relevant documentation of the predictions is available.

And one last note: this isn't an argument against AutoML and the tools that provide it – I believe that the democratisation of AI is an absolute must and a good thing. However, we need to ensure that it stays explainable!

This post is part of the "Big Data for Business" tutorial. In this tutorial, I explain various aspects of handling data right within a company. A comprehensive article about explainable AI can also be found on Wikipedia.

When Kappa first appeared as an architecture style (introduced by Jay Kreps), I was really fond of this new approach. I carried out several projects that used Kafka as the central piece, without the trade-offs of Lambda. But the more complex the projects got, the more I figured out that it isn't the answer to everything and that we ended up with Lambda again … somehow.

Kappa vs. Lambda Architecture

First of all, what is the benefit of Kappa and the trade-off with Lambda? It all started with Jay Kreps' blog post in which he questioned the Lambda Architecture. With the different layers of the Lambda Architecture (speed layer, batch layer and serving layer), you need to use different tools and programming languages. This leads to code complexity and the risk that you end up with inconsistent versions of your processing logic. A change to the logic in one layer requires changes in the other layers as well. Complexity is something we want to remove from our architectures at all times, so we should do the same with data processing.

The Kappa Architecture came with the promise to put everything into one system: Apache Kafka. The speed at which data can be processed with it is tremendous, and the simplicity is great as well. You only need to change code once, not two or three times as with Lambda. This also leads to lower labour costs, as fewer people are necessary to maintain and produce code. Also, all our data is available at our fingertips, without the major delays of batch processing. This brings great benefits to business units, as they don't have to wait forever for their data.

So what is the problem about Kappa Architecture?

However, my initial statement was about something else – that I have come to mistrust the Kappa Architecture. I implemented this architecture style in several IoT projects, where we had to deal with sensor data. There was no question whether Kappa was the right choice, as we were in a rather isolated use-case. But as soon as you look at a Big Data architecture for a large enterprise (and not only at isolated use-cases), you end up with one major issue around Kappa: cost.

In use-cases where data doesn't need to be available within minutes, Kappa seems to be overkill. Especially in the cloud, Lambda brings major cost benefits by combining object storage with on-demand processing capabilities such as Azure Databricks. In enterprise environments, cost does matter and an architecture should also be cost efficient. This also holds true when it comes to the half-life of data, which I was recently writing about. Basically, data that loses its value fast should be stored on cheap storage systems from the very beginning.

Cost of Kappa Architecture

An easy way to compare Kappa to Lambda is the cost per terabyte stored or processed. Let's use a scenario where we store 32 TB. With a Kappa Architecture running 24/7, this means an estimated $16,000 per month (no discounts, no reserved instances – pay-as-you-go pricing; E64 VMs with 64 cores per node, 432 GB RAM and E80 SSDs attached with 32 TB per disk). If we used Lambda and only processed once per day, we would need 32 TB on a blob store, which costs $680 per month. Taking the cluster above for processing with Spark and using it one hour per day adds $544. Summing up, this equals $1,224 per month – a cost ratio of roughly 1:13.
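
To make the arithmetic transparent, here is a tiny sketch of the calculation using the assumed pay-as-you-go figures from above:

kappa_per_month = 16000                      # 24/7 streaming cluster (assumed list price)
blob_store = 680                             # 32 TB on a blob store per month
spark_one_hour_per_day = 544                 # the same cluster, used 1 hour per day
lambda_per_month = blob_store + spark_one_hour_per_day
print(lambda_per_month)                      # 1224
print(round(kappa_per_month / lambda_per_month))  # roughly 13 -> a cost ratio of about 1:13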

However, this is a very simple calculation and it can still be optimised on both sides. In the broader enterprise context, Kappa is only a specialisation of Lambda and won't stand entirely on its own. Kappa vs. Lambda can only be decided per use-case, and this is what I recommend you do.

This post is part of the "Big Data for Business" tutorial. In this tutorial, I explain various aspects of handling data right within a company.

One of my 5 predictions for 2019 is about Hadoop. Basically, I do expect that a lot of projects won’t take Hadoop as a full-blown solution anymore. Why is that? What is the future of Hadoop?

What happened to the future of Hadoop?

One of the most exciting pieces of news in 2018 was the merger between Hortonworks and Cloudera. The two main competitors joining forces? How could this happen? I believe that much of it didn't come from a position of strength, or because they suddenly started to "love" each other, but rather from economic calculations. The competition isn't Hortonworks vs. Cloudera anymore (it wasn't even before the merger); it is rather Hadoop vs. new solutions.

These solutions are highly diversified – Apache Spark is one of the top competitors. But there are also other platforms such as Apache Kafka and NoSQL databases such as MongoDB, plus TensorFlow, emerging. One could argue that all of that is included in a Cloudera or Hortonworks distribution, but it isn't as simple as that. The companies behind Spark and Kafka provide their own distributions of their stacks, more lightweight than the complex Hadoop stack. In several use-cases, it is simply not necessary to have a full-blown solution; a lightweight one does the job.

The Cloud is the real threat to Hadoop

But the real threat comes from something else: the cloud. Hadoop was always running better on bare metal, and both pre-merger companies still argue that this is the case. Other solutions such as Spark perform better in the cloud and were built for it. This is the real threat for Hadoop, since the cloud is simply something that won't go away – with most companies switching to it.

Object stores provide a great and cheap alternative to HDFS, and managing object stores is much easier. I only call it an alternative here since object stores still miss several enterprise features. However, I expect that the large cloud providers such as AWS and Microsoft will invest significantly in this space and provide great additions to their object stores even this year. Object stores in the cloud will catch up fast – and probably surpass HDFS functionality by 2020. If this happens and the cost benefits remain better than bare-metal Hadoop, there is really no need for it anymore.

On the analytics layer, the cloud is also far superior. Running dynamic Spark jobs against data in object stores (or managed NoSQL databases) is impressive. You don't have to manage clusters anymore, which takes a lot of pain and headache away from large IT departments. This increases performance and development speed. Another disadvantage I see for the leading Hadoop vendors is their salesforce: they are compensated better for on-prem solutions, so they try to talk companies out of the cloud – which isn't the best strategy in 2019.

What about enterprise adoption of Hadoop?

However, there is still some hope around enterprise integration, which is often handled better by the Hadoop distributions. And even though the entire world is moving to the cloud, there are still many legacy systems running on-premise. Also, after the HWX/Cloudera merger, their mission statement became being the leading company for big data in the cloud. If they fully execute on this, I am sure there is a huge market ahead of them – and the threats described initially could even be turned around. Let's see what 2019 and 2020 will bring in this respect and what the future of Hadoop might look like.

This post is part of the "Big Data for Business" tutorial. In this tutorial, I explain various aspects of handling data right within a company.

Now you probably think: is Mario crazy? In fact, in this post, I will explain why cloud is not the future.

First, let's have a look at the economic facts of the cloud. If we look at the share prices of companies providing cloud services, it is rather easy to say: those shares are skyrocketing! (Not mentioning recent drops in some shares, but those are rather market dynamics than real valuations.) Overall company performance tells the same story: the income of companies providing cloud services has increased a lot. Have a look at the major cloud providers such as AWS, Google, Oracle or Microsoft: they now make quite a lot of their revenue with cloud services. So, obviously, my initial statement seems to be wrong. So why did I choose this title? Still crazy?

Let's look at another explanation: it might all be about technology, right? I was recently playing with AWS API Gateway and AWS Lambda. Wow, how easy is it to write a great API? I could program an API for an Android app in a few hours, and deployment was easy. Remember when you first had to deploy your full stack for this? Make sure to have all libraries set up and alike? Another sample: data analytics. Currently, much of this is moving from "classical" Hadoop-backed HDFS to decoupled architectures (object stores as the "data lake" and Spark for compute/analytics). This also clearly favours the cloud, because both can be scaled individually and utilisation is easier to handle. When you need more compute power, you spin up new instances and disconnect them again when you are done. This simply can't be done with on-prem or a private cloud, since the available capacity is sized to match some corporate requirements.

But what else? Let's look at how new applications and services are developed. Nowadays, almost every service is developed "cloud first", which means it isn't available without the cloud, or at least only becomes available at a very late stage or with substantial delay. So if you want to stay ahead in innovation, it is necessary to embrace the cloud here. And please don't tell me that you would rather wait, as it isn't necessary to be among the first to move. Answer: of course it is fine to wait until your business is dead ;).

So, there are no real points against the cloud, so why did I then formulate the title like this? Provocation? Clickbaiting? NO: Cloud is not the future, it is the present!

The Azure CLI is my favorite tool to manage Hadoop Clusters on Azure. Why? Because I can use the tools I am used to from Linux now from my Windows PC. In Windows 10, I am using the Ubuntu Bash for that, which gives me all the major tools for managing remote Hadoop Clusters.

One thing I am doing frequently is starting and stopping Hadoop clusters based on Cloudera. If you are coming from PowerShell, this might be rather painful for you, since you can only start each VM in the cluster sequentially, meaning that a cluster consisting of 10 or more nodes is rather slow to start and might take hours! With the Azure CLI I can easily do this by specifying "--no-wait", and everything runs in parallel. The only disadvantage is that I won't get any notification when the cluster is ready. But I handle this with a simple hack: ssh'ing into the cluster (since I have to do this anyway). SSH will succeed once the master nodes are ready, and then I can perform some tasks on the nodes (such as restarting Cloudera Manager, since CM is usually a bit "dizzy" after sending it to sleep and waking it up again :))

Let's start with the easiest step: stopping the cluster. Azure CLI commands always start with "az" (meaning Azure, of course). The command for stopping one or more VMs is "az vm stop". The only two things I need to provide are the IDs of the VMs I want to stop and "--no-wait", since I want the script to return right away.

So, the script would look like the following:

az vm stop --ids YOUR_IDS --no-wait

However, this still has one major disadvantage: you would need to provide all IDs hardcoded. This doesn't matter if your cluster never changes, but in my case I add and delete VMs to and from the cluster, so this script doesn't play well for my case. However, the CLI is very flexible (and so is Bash), and I can query all VMs in a resource group. This gives me the IDs that are currently in the cluster (assuming I delete dropped VMs from and add new VMs to the resource group). The query for retrieving all VMs in a resource group is easy:

az vm list --resource-group YOUR_RESOURCE_GROUP --query "[].id" -o tsv

This will give me all IDs in the RG. The real fun starts when doing this in one statement:

az vm stop --ids $(az vm list --resource-group clouderarg --query "[].id" -o tsv) --no-wait

Which is really nice and easy 🙂

It is similar with starting VMs in a Resource Group:

az vm start --ids $(az vm list --resource-group mmhclouderarg --query "[].id" -o tsv) --no-wait