A common discussion among Data Engineers is which language to use. One of the main platforms for data processing nowadays is Apache Spark, and many Data Engineers and Data Scientists use it for their workloads. Spark is written in Scala, a language that runs on the Java Virtual Machine and interoperates with Java. When working with Spark, however, most people nowadays use another language: Python. Follow me through the next posts in this Python tutorial.

Part I of the Python tutorial: getting started

Python is not only used by Data Engineers. In fact, it is even more popular with Data Scientists, even though there is a “religious” fight between R and Python going on. Either way, Python is a very popular language now and a lot of people use it, including for Big Data and Data Science workloads.

I’ve recently created a tutorial on Apache Spark and decided to do it with Python rather than with Scala. My background is in the object-oriented, statically typed world of languages such as Java or C#, and I will write this tutorial through the eyes of the many people coming from the exact same background, highlighting the differences and the places where you might have to watch out.

Python is a great fit for the world of the web: it offers a lot of libraries for HTTP interactions, so it is very popular among web developers. Python also offers a lot of libraries for machine learning, and thus the number of Data Scientists using Python is high as well. I would argue that both are, to a large degree, a result of Google, which pushed Python a lot and also contributed a lot to it in both areas. Google App Engine, one of the first Platform as a Service solutions, was initially usable mainly via Python. Now Google offers many more services and libraries with Python support; TensorFlow is just one of them. But how did this happen? One reason is that the creator of Python, Guido van Rossum, worked at Google for several years. Another important thing: Python has nothing to do with the snake. The name comes from “Monty Python”. However, I will use the snake from time to time as a header image 😉

But now, let’s come to the main aspects of Python and what they mean for Developers and Data Scientists. There are several differences when you come to Python from Java or similar languages. I want to start with some of them; the following list isn’t complete, and you will see many more differences in the following weeks throughout this tutorial. For comparison, I keep referring to “C-like languages”. By this, I mean C, C++, C# and Java.

  • Dynamic. Python is a dynamically but strongly typed language. This means that you don’t have to declare a type such as Integer or String when you create a new variable. The type belongs to the value, not to the name, so a variable can later be assigned a value of a different type. However, Python will not silently convert between unrelated types, for example when you add a number to a string.
  • Indentation. Unlike Java or any other C-like language, Python doesn’t use {} to structure your code. All of that is done with indentation (the PEP 8 style guide recommends four spaces per level rather than tabs). In C-like languages, you structure your code blocks with both {} and indentation, but the indentation is only there for better readability. Python uses it for the language itself!
  • No Switch. Yes, you read it right: Python has no classic switch statement. Traditionally, you do all of that via if/elif/else statements. Since Python 3.10 there is a match statement for structural pattern matching, which covers many of the same use cases.
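To make the three points above a bit more concrete, here is a minimal sketch. The function name describe and the sample values are my own illustration, not part of any library.

```python
# Dynamic typing: the same name can be rebound to a value of another type.
value = 123
first_type = type(value).__name__    # 'int'
value = "now a string"
second_type = type(value).__name__   # 'str'


# Indentation: the indented lines form the body of the function and of each branch.
# if/elif/else takes the place of a classic switch statement.
def describe(code):  # hypothetical helper, for illustration only
    if code == 1:
        return "one"
    elif code == 2:
        return "two"
    else:
        return "other"


print(first_type, second_type, describe(2))
```

If you remove the indentation in front of one of the return lines, Python refuses to run the code with an IndentationError, which shows that the whitespace really is part of the language.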

Variables in Python

Now, let’s look at the most important thing: assigning variables in Python. Since Python is dynamically typed, you don’t need to declare a type; you just state the name of the variable and assign a value to it. You also don’t need any semicolons, because the end of the line is the end of the statement. The syntax for this is the following:

variable_name = VALUE

Now, we need to start coding in Python. I’ve decided to use Jupyter, since it is also a common tool to work with Spark. If you haven’t installed Jupyter yet, you can read how to do it in this tutorial. Below are several variable assignments for different types. If you want to display the content of a variable in Jupyter, simply put the variable name on the last line of a cell; outside of Jupyter, use print(variable_name) instead.

data = 123
data

And the corresponding output is:

123

Let’s now use a String:

name = "Mario"
name

Again, the output is as expected:

'Mario'

Dynamic variables in Python

I stated that Python is dynamic, but it is also strongly typed: once a value is assigned, that value has a fixed type, and Python won’t silently convert it to an unrelated one. You can see this behaviour with the following code:

res = data + name

Umpf. Yes, an error occurred:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Python doesn’t allow this much dynamic behaviour: it refuses to guess whether you meant a number or a string. So, the good old C-like world is still fine, isn’t it? Let’s try to mix an integer with a float:

daf = 1.23
daf + data

As you can see, the output is really fine:

124.23
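The failed addition of data and name from above can be made to work by converting explicitly. The following is a small sketch that reuses the variable names from this post; the names text, formatted and total are only there for illustration.

```python
data = 123
name = "Mario"
daf = 1.23

# Explicit conversion: turn the int into a str before concatenating.
text = str(data) + name            # '123Mario'

# An f-string formats values of any type into a string.
formatted = f"{name} is {data}"    # 'Mario is 123'

# int and float are both numeric, so Python promotes the int to a float here.
total = daf + data

print(text, formatted, type(total).__name__)
```

So numeric types mix without any extra work, while a number and a string only combine once you state which of the two you want.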

Complex Data Types in Python

When talking about types, Python knows the common types such as int, float, str (String) and bool (Boolean). In Python 3 there is no differentiation between “Long” and “Int”: a non-floating point number is always an int, and it can grow arbitrarily large. Similarly, there is no separate double, because Python’s float is already a double-precision number. However, Python has some additional built-in types. I won’t describe all of them here, only the most relevant ones:

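  • Complex: A number with a real and an imaginary part, written for example as 3+4j. This is not to be confused with scientific notation: when a float is too large to display in full, you often see something like 9.12e21, which is still a normal float.
  • Dict: A dictionary, where each key has a value associated with it. A dict is initialised with {}.
  • Tuple: An ordered sequence of values that can’t be changed after creation. You could roughly compare it to an array, but an array in C-like languages can only contain values of the same type, whereas a tuple can mix types. A tuple is initialised with ().
  • Set: A set is an unordered collection that can’t contain any duplicates. A set with values is initialised with {}, for example {1, 2, 3}. Watch out: empty braces {} create an empty dict, so an empty set has to be created with set().
  • List: A list is ordered and changeable, and it can contain duplicates and mixed types. A list is initialised with [].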
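A few of the points from the list above are easy to check directly. This is just a small sketch; all variable names are made up for illustration.

```python
# A complex number has a real and an imaginary part.
z = 3 + 4j
magnitude = abs(z)                 # 5.0, the distance from zero

# Scientific notation is just another way to write a float.
big = 9.12e21

# Empty braces create a dict; an empty set needs set().
empty_braces = {}
empty_set = set()

# A set drops duplicates.
unique = set([1, 1, 2, 3, 3])      # {1, 2, 3}

# A tuple can't be changed after creation.
point = ("Mario", 35)
try:
    point[0] = "Luigi"
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(magnitude, type(big).__name__, type(empty_braces).__name__,
      type(empty_set).__name__, unique, tuple_changed)
```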

Now, let’s have a look at how these types are used. We first start with Dict(ionary)

persons = {"mario": 35, "vienna": "austria", 3: "age"}
persons
{'mario': 35, 'vienna': 'austria', 3: 'age'}

Was easy, right? Let’s continue with the Tuple:

pers = ("Mario", 35, 9.99)
pers
('Mario', 35, 9.99)

Also, here we have the expected output. Next up is the set:


pset = {"mario", 35, True}
pset
{35, True, 'mario'}

Last but not least is the List:

plist = ["Mario", 35, True]
plist
['Mario', 35, True]

Compound Types in Python

We can also combine different types into compound types. Let’s assume we want to store more than one value per key. To achieve that, we create a dict as the main entity and use a list as the value of each key; the list in turn contains further key-value pairs. The following sample shows the products sold in a shop. The first level is the product; the list behind it holds the age of the product and its price.

shop = {"Milk": [{"age": 35}, {"price": 9.99}], "Bread": [{"age": 22}, {"price": 2.99}]}
shop
{'Milk': [{'age': 35}, {'price': 9.99}],
 'Bread': [{'age': 22}, {'price': 2.99}]}
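Reading values back out of such a nested structure works level by level. The sketch below uses a shortened copy of the shop with only the Milk entry; the names milk_price and milk are my own and only serve as illustration.

```python
# Shortened copy of the shop structure from above.
shop = {"Milk": [{"age": 35}, {"price": 9.99}]}

# First the key of the outer dict, then the list index, then the inner key.
milk_price = shop["Milk"][1]["price"]   # 9.99

# Merging the small dicts of the list into one dict makes lookups easier.
milk = {}
for part in shop["Milk"]:
    milk.update(part)

print(milk_price, milk)   # 9.99 {'age': 35, 'price': 9.99}
```

If every product always has an age and a price, storing one dict per product from the start, instead of a list of single-entry dicts, would probably be the simpler design.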

Now, we have learned about data types. In the next Python tutorial, I will write about control structures.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.
