Python for Spark Tutorial – Getting started with Python


A common discussion among Data Engineers is which language to use. One of the main platforms for data processing nowadays is Apache Spark, and a lot of Data Engineers and Data Scientists use it for their workloads. Spark is written in Scala, which runs on the Java Virtual Machine and interoperates with Java. When working with Spark, however, most people nowadays use another language: Python.

Python is not only used among Data Engineers. In fact, it is even more popular with Data Scientists, even though there is a “religious” fight between R and Python going on. Python is a very popular language now, and a lot of people use it for Big Data and Data Science workloads as well.

I’ve created a tutorial on Apache Spark recently and decided to do it with Python rather than with Scala. My background is in the object-oriented, statically typed world of languages such as Java or C#, and I will write this tutorial through the eyes of the many people coming from the exact same background, highlighting the differences and the places where you might have to watch out.

Python is a great fit for the world of the web: it offers a lot of libraries for HTTP interactions, so it is very popular among web developers. Python also offers a lot of libraries for machine learning, and thus the number of Data Scientists using Python is high as well. I would argue that both are partly a result of Google, which pushed Python a lot and also contributed a lot to it in both areas. Google App Engine, one of the first Platform as a Service solutions, was mainly usable via Python. Now, Google offers many more services and tools with Python; TensorFlow is just one of them. But how did this happen? Guido van Rossum, the creator of Python, worked at Google for a long time. Another important thing: Python has nothing to do with the snake; the name comes from “Monty Python”. However, I will use the snake from time to time as a header image 😉

But now, let’s come to the main aspects of Python and what they mean for Developers and Data Scientists. There are several differences when you come to Python from Java or similar languages. I want to start with some of them; the following list isn’t complete, and you will see many more differences in the following weeks throughout this tutorial. For comparison, I keep referring to “C-like languages”. By this, I basically mean C, C++, C# and Java.

  • Dynamic. Python is dynamically but strongly typed. You don’t have to declare a type such as Integer or String when you create a new variable; the type belongs to the value, not to the name. A name can later be re-assigned to a value of a different type, but Python will not silently convert between unrelated types such as numbers and strings.
  • Indentation. Unlike Java or any other C-like language, Python doesn’t use {} to structure your code. All of that is done with indentation (by convention four spaces per level; mixing tabs and spaces can lead to errors). In C-like languages, you structure your code blocks with {} and use indentation only for better readability; in Python, the indentation is part of the language itself!
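  • No Switch. Yes, you read it right: Python has no classic switch statement. You typically handle this with if/elif/else chains. (Since Python 3.10 there is a match statement for structural pattern matching, which covers many of the same cases.)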
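To make the three points above a bit more concrete, here is a minimal sketch. The function describe_status and its status codes are made up for illustration only; they are not part of any library.

```python
# Dynamic typing: the same name can be re-assigned to a value of another type.
value = 42
first_type = type(value).__name__    # "int"
value = "forty-two"
second_type = type(value).__name__   # "str"


def describe_status(code):
    """Hypothetical helper: maps a numeric status code to a label.

    The indented lines form the function body; there are no {} and no
    semicolons. The if/elif/else chain plays the role of a switch statement.
    """
    if code == 200:
        return "ok"
    elif code == 404:
        return "not found"
    else:
        return "unknown"


print(first_type, second_type)
print(describe_status(200), describe_status(404), describe_status(500))
```

If you remove the indentation in front of one of the return lines, Python refuses to run the code with an IndentationError, which shows that the whitespace really is part of the syntax.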

Now, let’s look at the most important thing: assigning variables in Python. Since Python is dynamic, you don’t need to declare any types; you just state the name of the variable and assign a value to it. Also, you don’t need any semicolons, because the end of the line is the end of the statement. The syntax for this is the following:

variable_name = VALUE

Now, we need to start coding in Python. I’ve decided to use Jupyter, since it is also a common tool to work with Spark. If you haven’t installed Jupyter yet, you can read how to do it in this tutorial. Below are several variable assignments for different types. If you want to display the content of a variable in Jupyter, simply put the variable name on the last line of the cell. (In a regular Python script you would use print() instead.)

data = 123
data

And the corresponding output is:

123

Let’s now use a String:

name = "Mario"
name

Again, the output is as expected:

'Mario'

I stated that Python is dynamic, but it is also strongly typed: it will not silently convert between unrelated types. You can see this behaviour with the following code:

res = data + name

Umpf. Yes, an error occurred:

TypeError: unsupported operand type(s) for +: 'int' and 'str'
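If you actually want to combine the two values, you have to convert explicitly. The following is a minimal sketch that reuses the data and name variables from above; the names as_text, greeting and total are only chosen for illustration.

```python
data = 123
name = "Mario"

# Convert the int to a str explicitly, then concatenate.
as_text = str(data) + name

# An f-string formats both values into one string.
greeting = f"{name} is number {data}"

# Going the other way only works if the string really contains a number.
total = data + int("7")

print(as_text)     # 123Mario
print(greeting)    # Mario is number 123
print(total)       # 130
```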

Python doesn’t allow this much dynamism. So, the good old C-like world is still fine, isn’t it? Let’s try to mix an integer with a float:

daf = 1.23
daf + data

As you can see, this works fine, because Python converts the int to a float automatically:

124.23
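If you are unsure which type you ended up with, type() tells you. A small sketch using the same two variables; the name result is only for illustration:

```python
data = 123
daf = 1.23

result = daf + data
print(type(data).__name__)     # int
print(type(daf).__name__)      # float
print(type(result).__name__)   # float: int + float gives a float

# Division with / always produces a float; // does floor division and keeps an int.
print(7 / 2)     # 3.5
print(7 // 2)    # 3
```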

When talking about types, Python knows the common types such as int, float, str (string) and bool (boolean). There is no differentiation between “Long” and “Int”: a whole number is always an int, and it can grow arbitrarily large. The same applies to float: there is no separate double type, because Python’s float is usually already a double-precision number. However, Python has some additional basic types. I won’t describe all here, only the most relevant ones:

  • Complex: A complex number has a real and an imaginary part and is written with a j suffix, for example 3+4j. (Notation such as 9.12e21, which you often see when a number has too many digits to display, is something else: it is scientific notation for a float.)
  • Dict: A dictionary, where each key has a value associated with it. A dict is initialised with {}.
  • Tuple: An ordered, immutable sequence of values. You could compare it to an array, but an array in C-like languages can only contain values of the same type, whereas a tuple can mix types. A tuple is initialised with ().
  • Set: A set is like a list, but it is unordered and can’t contain any duplicates. A non-empty set is written with {}, for example {1, 2}; an empty set needs set(), because {} on its own creates an empty dict.
  • List: An ordered, mutable sequence that can contain duplicates and mixed types. A list is initialised with [].
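A few of these points are easy to verify yourself. The following is a small sketch; all variable names in it are only chosen for illustration.

```python
# int has no fixed size, so there is no separate "long" type.
big = 2 ** 100
print(type(big).__name__, big)

# Scientific notation such as 9.12e21 is just a float literal.
large = 9.12e21
print(type(large).__name__)     # float

# A complex number has a real and an imaginary part, written with a j suffix.
z = 3 + 4j
print(z.real, z.imag, abs(z))   # 3.0 4.0 5.0

# {} is an empty dict; an empty set needs set().
empty_dict = {}
empty_set = set()
print(type(empty_dict).__name__, type(empty_set).__name__)   # dict set
```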

Now, let’s have a look at how these types are used. We start with the Dict(ionary):

persons = {"mario": 35, "vienna": "austria", 3: "age"}
persons
{'mario': 35, 'vienna': 'austria', 3: 'age'}
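Creating a dict is only half of the story; you usually also want to read and change entries. A minimal sketch with the same persons dict (the keys "berlin" and "graz" are made up for this example):

```python
persons = {"mario": 35, "vienna": "austria", 3: "age"}

# Look up a value by its key; keys can be of different types.
print(persons["mario"])    # 35
print(persons[3])          # age

# get() returns a default instead of raising a KeyError for a missing key.
print(persons.get("berlin", "unknown"))   # unknown

# Assigning to a new key adds an entry; assigning to an existing key overwrites it.
persons["graz"] = "austria"
persons["mario"] = 36
print(len(persons))        # 4
```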

Was easy, right? Let’s continue with the Tuple:

pers = ("Mario", 35, 9.99)
pers
('Mario', 35, 9.99)
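Two things are worth knowing about tuples: you can unpack them into separate variables, and you cannot change them after creation. A short sketch with the same tuple (the names first_name, age, price and changed are only for illustration):

```python
pers = ("Mario", 35, 9.99)

# Access by position (zero-based).
print(pers[0])        # Mario

# Unpack all elements into separate names in one statement.
first_name, age, price = pers
print(age, price)     # 35 9.99

# Tuples are immutable: item assignment raises a TypeError.
try:
    pers[1] = 36
except TypeError:
    changed = False
else:
    changed = True
print(changed)        # False
```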

Also, here we have the expected output. Next up is the set:

pset = {"mario", 35, True}
pset
{35, True, 'mario'}
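Note that the output order differs from the order in which the values were typed: a set is unordered, so you should not rely on any particular order. The sketch below shows the behaviour around duplicates; the names numbers and unique_cities are only for illustration.

```python
pset = {"mario", 35, True}

# Duplicates are dropped when the set is built.
numbers = {1, 2, 2, 3, 3, 3}
print(len(numbers))          # 3

# Membership tests are a typical use of a set.
print("mario" in pset)       # True
print("luigi" in pset)       # False

# A common idiom: remove duplicates from a list (the order is not guaranteed).
unique_cities = set(["vienna", "graz", "vienna"])
print(len(unique_cities))    # 2

# Careful: True == 1 in Python, so both count as the same element.
print(len({1, True}))        # 1
```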

Last but not least is the List:

plist = ["Mario", 35, True]
plist
['Mario', 35, True]
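Unlike a tuple, a list can be changed after it was created. A minimal sketch with the same list:

```python
plist = ["Mario", 35, True]

# Lists are mutable: append adds an element at the end.
plist.append(9.99)
print(len(plist))      # 4

# Elements can be replaced by position.
plist[1] = 36
print(plist[1])        # 36

# Slicing returns a new list; negative indexes count from the end.
print(plist[:2])       # ['Mario', 36]
print(plist[-1])       # 9.99

# Duplicates are allowed.
plist.append("Mario")
print(plist.count("Mario"))   # 2
```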

We can also combine different types into nested structures. Let’s assume we want to store more than one value per key. To achieve that, we use the main entity as the key and a list as its value, where the list contains further key-value pairs. The following sample shows the products sold in a shop. The first level is the product; its value is a list holding the age of the product and the price.

shop = {"Milk": [{"age": 35}, {"price": 9.99}], "Bread": [{"age": 22}, {"price": 2.99}]}
shop
{'Milk': [{'age': 35}, {'price': 9.99}],
 'Bread': [{'age': 22}, {'price': 2.99}]}
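To read a value out of such a nested structure, you chain the lookups from the outside in. The sketch below uses only the Milk entry to keep it short; shop_flat is an alternative layout that I add purely as an assumption for comparison, not the structure used above.

```python
# Only the Milk entry from the sample above, to keep the sketch short.
shop = {"Milk": [{"age": 35}, {"price": 9.99}]}

# Dict key first, then the list position, then the inner dict key.
milk_age = shop["Milk"][0]["age"]
milk_price = shop["Milk"][1]["price"]
print(milk_age, milk_price)          # 35 9.99

# Alternative layout (an assumption, not the structure used above):
# one inner dict per product avoids the positional lookup.
shop_flat = {"Milk": {"age": 35, "price": 9.99}}
print(shop_flat["Milk"]["price"])    # 9.99
```

The positional lookup works, but it breaks as soon as the order inside the list changes, which is why a single inner dict per product is often easier to handle.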

Now, we have learned about data types. In the next tutorial, I will write about control structures.

I lead a team of Senior Experts in Data & Data Science as Head of Data & Analytics and AI at A1 Telekom Austria Group. I also teach this topic at various universities and frequently speak at various Conferences. In 2010 I wrote a book about Cloud Computing, which is often used at German & Austrian Universities. In my home country (Austria) I am part of several organisations on Big Data & Data Science.
