In the last blog post, we looked at how to access the Jupyter Notebook and work with Spark. Now, we will have some hands-on with Spark and do some first samples. One of our main ambitions is to write a lambda expression in spark with python

Intro to the Spark RDD

Therefore, we start with RDDs. RDDs are the basic API in Spark. Even though it is much more convenient to use Dataframes, we start with the RDDs in order to get a basic understanding of Spark.

RDDs stand for “Resilient Distributed Dataset”. Basically, this means that the Dataset is used for parallel computation. In RDDs, we can store any kind of data and apply some functions to it (such as the sum or the average). Let’s create a new Python notebook and import the required libraries.

Create a new Notebook in Python 3
The new Notebook with pyspark

Once the new notebook is created, let’s start to work with Spark.

First, we need to import the necessary libraries. Therefore, we use “pyspark” and “random”. Also, we need to create the Spark context that is used throughout our application. Note that you can only execute this line once, because if you create the SparkContext twice – it is case sensitive!

Create the Spark Context in Python

import pyspark
import random
sc = pyspark.SparkContext(appName="Cloudvane_S01")

When done with this, hit the “Run” Button in the Notebook. Next to the current cell, you will now see the [ ] turning into [*]. This means that the process is currently running and something is happening. Once it has finished, it will turn into [1] or any other incremental number. Also, if errors occur, you will see them below your code. If this part of the code succeeded, you will see no other output than the [1].

Create random data in Spark

Next, let’s produce some data that we can work with. Therefore, we create an array and fill it with some data. In our case, we add 100 items. Each item is of a random value calculated with expovariate.

someValues = []
for i in range(0,100):
    someValues.append(random.expovariate(random.random()-0.5))
someValues

When the data is created, we distribute it over the network by calling “sc.parallelize”. This creates an RDD now and enables us to work with Spark.

spark_data = sc.parallelize(someValues)

Different Functions on RDD in Spark

We can apply various functions to the RDD. One sample would be to use the “Sum” function.

sp_sum = spark_data.sum()
sp_sum

Another sample is the “Count” function.

sp_ct = spark_data.count()
sp_ct

We can also do more complex calculations by defining methods that do some calculations. In Python, this is done by “def functionname(params)”. The following sample creates the average of the array that is passed onto the function.

def average(vals):
    return vals.sum() / vals.count()

The function is simply invoked with our data.

average(spark_data)

A lambda function in Spark and Python

Last but not least, we can also filter data. In the following sample, we only include positive values. We do this with a simple Lambda function. I’ve explained Lambda functions in detail in the Python tutorial, in case you want to learn more.

sp_pos = spark_data.filter(lambda x: x>0.0).collect()
sp_pos

Now you are ready to go and write your own lambda expression with spark in Python. There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!