## NumPy in Python Tutorial for Data Science

In my previous post, I gave an introduction to Python Libraries for Data Engineering and Data Science. In this post, we will have a first look at NumPy, one of the most important libraries to work with in Python.

NumPy is one of the most fundamental libraries for working with data. It is re-used by other libraries such as Pandas, so it is necessary to understand NumPy first. The focus of the library is on easy transformations of vectors, matrices and arrays, and it provides a lot of functionality for that. But let’s get our hands dirty with the library and have a look at it!

Before you get started, please make sure to have the Sandbox set up and ready.

## Getting started with NumPy

First of all, we need to import the library. This works with the following import statement in Python:

``import numpy as np``

This should now give us access to the NumPy library. Let us first create a two-dimensional array with 15 values in it. In NumPy, this works with the “arange” function. We provide “15” as the number of items and then reshape the result to 3×5:

```
vals = np.arange(15).reshape(3,5)
vals
```

This should now give us an output array with 2 dimensions: 3 rows with 5 values each. The values range from 0 to 14:

```
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
```

NumPy also contains a number of useful constants and functions. To get pi, you simply import “pi” from numpy:

```
from numpy import pi
pi
```

We can now use PI for further work and calculations in Python.
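For instance, pi plugs directly into NumPy’s vectorized functions. A quick sketch (my own example values):

```python
import numpy as np
from numpy import pi

# Sine of a few multiples of pi; rounding hides tiny floating-point noise.
angles = np.array([0, pi / 2, pi])
print(np.sin(angles).round(6))  # -> [0. 1. 0.]
```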

## Simple Calculations with NumPy

Let’s create a new array with 5 values:

```
vl = np.arange(5)
vl
```

An easy first calculation is raising each value to a power. This works with the “**” operator:

```
nv = vl**2
nv
```

Now, this should give us the following output:

``array([ 0,  1,  4,  9, 16])``

The same applies to the power of 3: if we want to raise every value in the array to the power of 3:

```
nn = vl**3
nn
```

And the output should be similar:

``array([ 0,  1,  8, 27, 64])``

## Working with Random Numbers in NumPy

NumPy contains the function “random” (in the module of the same name) to create random numbers. This function takes the dimensions of the array to fit the numbers into. We use a 3×3 array:

```
nr = np.random.random((3,3))
nr *= 100
nr
```

Please note that random returns numbers between 0 and 1, so in order to create larger numbers we need to “stretch” them. We thus multiply by 100. The output should be something like this:

```
array([[90.30147522,  6.88948191,  6.41853222],
       [82.76187536, 73.37687372,  9.48770728],
       [59.02523947, 84.56571797,  5.05225463]])
```

Your numbers will be different, since we are working with random numbers here. We can do the same with a 3-dimensional array:

```
n3d = np.random.random((3,3,3))
n3d *= 100
n3d
```

Here, too, your numbers will be different, but the overall “structure” should look like the following:

```
array([[[89.02863455, 83.83509441, 93.94264059],
        [55.79196044, 79.32574406, 33.06871588],
        [26.11848117, 64.05158411, 94.80789032]],

       [[19.19231999, 63.52128357,  8.10253043],
        [21.35001753, 25.11397256, 74.92458022],
        [35.62544853, 98.17595966, 23.10038137]],

       [[81.56526913,  9.99720992, 79.52580966],
        [38.69294158, 25.9849473 , 85.97255179],
        [38.42338734, 67.53616027, 98.64039687]]])
```

## Other means to work with Numbers in Python

NumPy provides several other options to work with data. There are several aggregation functions available that we can use. Let’s now look for the maximum value in the previously created array:

``n3d.max()``

In my example this would return 98.6. You would get a different number, since the array is random. It is also possible to return the maximum along a specific axis of the array. For this, we add the keyword “axis” to the “max” function:

``n3d.max(axis=1)``

This now returns, within each of the three blocks, the maximum of each column (axis 1 runs over the rows of a block). In my example, the results look like this:

```
array([[89.02863455, 83.83509441, 94.80789032],
       [35.62544853, 98.17595966, 74.92458022],
       [81.56526913, 67.53616027, 98.64039687]])
```

Another option is to compute the sum, either over the entire array or again per axis by providing the axis keyword:

``n3d.sum(axis=1)``
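To make the axis keyword concrete, here is a small sketch on a 2×3 array with known values (my own example, separate from the random array above):

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
print(m.sum())        # 15: sum over all elements
print(m.sum(axis=0))  # [3 5 7]: collapse the rows, sum each column
print(m.sum(axis=1))  # [ 3 12]: collapse the columns, sum each row
```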

In the next sample, we make the data look prettier. This can be done by rounding the numbers to 2 digits:

``n3d.round(2)``

## Iterating arrays in Python

Often, it is necessary to iterate over items. In NumPy, this can be achieved by using the built-in iterator, which we get from the function “nditer”. This function takes the array to iterate over, and we can then use it in a for loop:

```
for val in np.nditer(n3d):
    print(val)
```

The above sample iterates over all values in the array and prints them. If we want to modify the items within the array, we need to set the flag “op_flags” to “readwrite”. This enables us to modify the array while iterating over it. In the next sample, we iterate over each item and replace it with its value modulo 3:

```
n3d = n3d.round(0)

for i in np.nditer(n3d, op_flags=['readwrite']):
    i[...] = i % 3

n3d
```

These are the basics of NumPy. In our next tutorial, we will have a look at Pandas: a very powerful dataframe library.

If you liked this post, you might also consider the tutorial about Python itself, which gives you a great insight into the Python language. If you want to know more about Python, you should consider visiting the official page.

## Python for Spark Tutorial – Statistics and Mathematics in Python

One of the reasons why Python is so popular for Data Science is that Python has a very rich set of functionality for mathematics and statistics. In this tutorial, I will show the very basic functions; don’t expect too much, since they really are basic. When we talk about real data science, you might rather consider learning scikit-learn, PyTorch or Spark ML. However, today’s tutorial will focus on these elements, before moving on to the more complex tutorials.

## Basic Mathematics in Python from the math Library

The math library in Python provides most of the relevant functionality you might want when working with numbers. The following samples provide some overview of it:

```
import math
vone = 1.2367
print(math.ceil(vone))
```

First, we import “math” from the standard library and then we create a value. The first function we use is ceil(), which rounds a number up. In the following sample, we calculate the greatest common divisor of two numbers.

`math.gcd(44,77)`

Other functions are logarithms, powers, cosine and many more. Some of them are displayed in the following sample:

```
math.log(5)
math.pow(2,3)
math.cos(4)
math.pi
```
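For reference, these calls evaluate roughly as follows (rounded here for readability; gcd added for completeness):

```python
import math

print(round(math.log(5), 4))  # natural logarithm -> 1.6094
print(math.pow(2, 3))         # -> 8.0 (pow always returns a float)
print(round(math.cos(4), 4))  # argument in radians -> -0.6536
print(round(math.pi, 4))      # -> 3.1416
print(math.gcd(44, 77))       # -> 11
```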

## Basic statistics in Python from the statistics library

The standard library offers some elementary statistical functions in the statistics module. We will first import it and then calculate the mean of 5 values:

```
from statistics import *
values = [1,2,3,4,5]
mean(values)
```

Some other possible functions are:

```
median(values)
stdev(values)
variance(values)
```
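With the values list from above, these evaluate as follows. Note that stdev and variance are the sample (not the population) variants:

```python
from statistics import mean, median, stdev, variance

values = [1, 2, 3, 4, 5]
print(mean(values))             # -> 3
print(median(values))           # -> 3
print(variance(values))         # sample variance -> 2.5
print(round(stdev(values), 4))  # sample standard deviation -> 1.5811
```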

Have a look at those two libraries – there is quite a lot to explore.

### What’s next?

Now, the tutorial series for Python is over. You should now be fit to use PySpark. If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

## Python for Spark Tutorial – FuncTools and IterTools in Python

Python has a really great standard library. In the next two tutorial sessions, we will have a first look at it. We will mainly focus on what is relevant for Spark developers in the long run. Today, we will focus on FuncTools and IterTools in Python; the next tutorial will deal with some mathematical functions. But first, let’s start with “reduce”.

## The reduce() function from functools in Python

Basically, the reduce function takes an iterable and cumulatively executes a function on it. In most cases, this will be a lambda function, but it could also be a normal function. In our sample, we take some values and create their sum by moving from left to right:

```
from functools import reduce
values = [1,4,5,3,2]
reduce(lambda x,y: x+y, values)
```

And we get the expected output

`15`
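reduce() also accepts an optional third argument: an initial value that the fold starts from. A small sketch:

```python
from functools import reduce

values = [1, 4, 5, 3, 2]
# Start the accumulation at 100 instead of at the first element.
print(reduce(lambda x, y: x + y, values, 100))  # -> 115
```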

## The sorted() function

Another very useful function is “sorted”. Basically, it sorts values, or pairs of tuples, in a list. The easiest way to apply it is on our previous values (which were unsorted!):

`print(sorted(values))`

The output is now in the expected sorting:

`[1, 2, 3, 4, 5]`

However, we can go further and sort complex objects as well. sorted takes a key to sort on, and this is passed as a lambda expression. We state that we want to sort by age. Make sure that you still have the “Person” class from our previous tutorial:

```
perli = [Person("Mario", "Meir-Huber", 35, 1.0), Person("Helena", "Meir-Huber", 5, 1.0)]
print(perli)
print(sorted(perli, key=lambda p: p.age))
```

As you can see, our values are now sorted based on the age member.

```
[Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0), Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0)]
[Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0), Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)]
```

## The chain() function

The chain() function is very helpful if you want to hook up two lists containing the same kind of objects. Basically, we take the Person class again and create a new instance. We then chain the two lists together:

```
import itertools
perstwo = [Person("Some", "Other", 46, 1.0)]
persons = itertools.chain(perli, perstwo)
for pers in persons:
    print(pers.firstname)
```

Here, too, we get the expected output:

```
Mario
Helena
Some
```

## The groupby() function

Another great feature when working with data is grouping. Python allows us to do this as well. The groupby() function takes two parameters: the list to group, and the key as a lambda expression. We create a new list of tuple pairs and group by the family name:

```
from itertools import groupby
pl = [("Meir-Huber", "Mario"), ("Meir-Huber", "Helena"), ("Some", "Other")]
for k,v in groupby(pl, lambda p: p[0]):
    print("Family {}".format(k))
    for p in v:
        print("\tFamily member: {}".format(p[1]))
```

Basically, groupby() yields the key (as the value type) and the objects of that key’s group as an iterable. This means that another iteration is necessary in order to access the elements in the group. The output of the above sample looks like this:

```
Family Meir-Huber
	Family member: Mario
	Family member: Helena
Family Some
	Family member: Other
```
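One caveat worth knowing: groupby() only groups consecutive elements with equal keys, so the input usually has to be sorted by the grouping key first. A small sketch of what happens otherwise:

```python
from itertools import groupby

pl = [("Meir-Huber", "Mario"), ("Some", "Other"), ("Meir-Huber", "Helena")]
# Unsorted input: "Meir-Huber" appears as two separate groups.
print([k for k, _ in groupby(pl, lambda p: p[0])])
# Sorting by the key first merges them into one group per family.
print([k for k, _ in groupby(sorted(pl, key=lambda p: p[0]), lambda p: p[0])])
```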

## The repeat() function

A nice function is repeat(). Basically, it yields an element several times. For instance, if we want to repeat a person list 4 times, this can be done like this:

```
lst = itertools.repeat(perstwo, 4)
for p in lst:
    print(p)
```

And also the output is just as expected:

```
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
```

## The takewhile() and the dropwhile() function in IterTools in Python

Two functions, takewhile and dropwhile, are also very helpful in Python. They are very similar, but their results are the opposite of each other: takewhile yields elements as long as a condition is true, while dropwhile discards elements as long as the condition is true and then yields all the rest. With a predicate such as “lower than 20”, takewhile only returns elements until the first value of 20 or above is reached, whereas dropwhile skips all elements up to that point and returns everything from there on. The following sample shows this:

```
vals = range(1,40)
for v in itertools.takewhile(lambda vl: vl<20, vals):
    print(v)
print("######")
for v in itertools.dropwhile(lambda vl: vl<20, vals):
    print(v)
```

And also here, the output is as expected:

```
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
######
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
```

As you can see, these are quite helpful functions. In our last Python tutorial, we will have a look at some basic mathematical and statistical functions.


## Python for Spark Tutorial – The Dataclass in Python

One thing everyone who deals with data needs is classes that make data accessible to the code as objects. In all languages – and Python isn’t different here – wrapper classes and O/R mappers have to be written. However, Python has a powerful decorator at hand that allows us to ease up our work. This decorator is called “dataclass”.

## The dataclass in Python

The nice thing about the dataclass decorator is that it enables us to add a great set of functionality to an object containing data, without the need to re-write it every time. Basically, this decorator adds the following functionality:

• __init__: the constructor with all defined member variables. In order to use this, the member variables must be annotated with their types, which is rather uncommon in Python
• __repr__: pretty-prints the class with all its member variables as a string
• __eq__: a function to compare two instances for equality
• order functions: creates several ordering functions such as __lt__ (lower than), __gt__ (greater than), __le__ (lower equals) and __ge__ (greater equals)
• __hash__: adds a hash function to the class
• frozen: prevents instances from modifying their attributes at runtime

The definition for a dataclass in Python is easy:

```
@dataclass
class Classname():
    CLASS-BLOCK
```

You can also add each of the above described properties separately, e.g. with frozen=True or alike.
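As a small sketch of those options: order=True generates the comparison methods, and frozen=True makes instances read-only, so assigning to a field raises a FrozenInstanceError:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(order=True, frozen=True)
class Point:
    x: int
    y: int

print(Point(1, 2) < Point(1, 3))  # -> True, fields are compared in order
try:
    Point(1, 2).x = 5
except FrozenInstanceError:
    print("frozen: cannot assign to a field")
```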

In the following sample, we will create a Person-Dataclass.

```
from dataclasses import dataclass

@dataclass
class Person:
    firstname: str
    lastname: str
    age: int
    score: float

p = Person("Mario", "Meir-Huber", 35, 1.0)
print(p)
```

Please note the difference in how the member variables are annotated. You can see that there is no need to write a constructor anymore, since it is already generated for you. When you print the instance, the __repr__() function is called. The output should look like the following:

`Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)`

As you can see, the dataclass abstracts away a lot of boilerplate. In the next tutorial we will have a look at IterTools and FuncTools.


## Python for Spark Tutorial – Logging in Python

Once you put your applications into production, you won’t be able to debug them any more. This creates some issues, since you won’t know what is going on in the background. Imagine a user does something and an error occurs – maybe you don’t even know that this behaviour can lead to an error. To overcome this obstacle, we have a powerful tool in almost any programming environment: logging.

## How to do logging in Python

Basically, the logger is imported from the “logging” module and is used as a singleton. This means that you don’t need to create any classes or alike. First, you need to configure the logger with some information, such as the path to store the logs in and the format to be used. In our sample, we will use these parameters:

• filename: The name of the file to write to
• filemode: how the file should be created or appended
• format: how the log lines should be formatted in the file (timestamp, level, message, …)

Then, you can log at different levels. This is done by simply typing “logging” and calling the action:

`logging.<<ACTION>>`

Basically, we use these actions:

• debug: a debug message that something was executed, …
• info: some information, e.g. that a new routine was started
• warning: something didn’t work as expected, but no error occurred
• error: a severe error occurred that led to wrong behaviour of the program
• exception: an exception occurred. It is logged as “error”, but in addition it includes the exception information

```
import logging
logging.basicConfig(filename="../data/logs/log.log", filemode="w", format="%(asctime)s - %(levelname)s - %(message)s")
```

We store the log itself in a directory that first needs to be created. Then, we provide a format with the time, the name of the level (e.g. INFO) and the message itself. Now, we can write to the log:

```
logging.debug("Application started")
logging.warning("The user did an unexpected click")
logging.info("Ok, all is fine (still!)")
logging.error("Now it has crashed ... ")
```

This writes some log entries to the file. Now, let’s see how this works with exceptions. Basically, we “provoke” an exception and log it with “exception”. We also set the parameter “exc_info” to True, which includes the exception traceback without passing it on explicitly (Python handles that for us :))

## Logging exceptions in Python

```
try:
    4/0
except ZeroDivisionError as ze:
    logging.exception("oh no!", exc_info=True)
```

Now, we can review our file and the output should be like this:

```
2019-08-13 16:21:04,329 - WARNING - The user did an unexpected click
2019-08-13 16:21:04,889 - ERROR - Now it has crashed ...
2019-08-13 16:21:05,461 - ERROR - oh no!
Traceback (most recent call last):
  File "<ipython-input-9-5d33bb8d3dd6>", line 2, in <module>
    4/0
ZeroDivisionError: division by zero
```

As you can see, logging is really straightforward and easy to use in Python. So, no more excuses not to do it :). Have fun logging!


## Python for Spark Tutorial – Dynamically creating classes in Python

In our previous tutorial, we had a look at how to (de)serialise objects from and to JSON in Python. Now, let’s have a look at how to dynamically create and extend classes in Python. Basically, we are using the mechanism that Python itself is using: the dynamic type() function. This function takes several parameters; we will only focus on the three that are relevant for our sample.

## How to use the dynamic type function in Python

Basically, this function takes several parameters. We utilize 3 of them. These are:

`type(CLASS_NAME, INHERITS, PARAMETERS)`

These parameters have the following meaning:

• CLASS_NAME: The name of the new class
• INHERITS: from which the new type should inherit
• PARAMETERS: new methods or parameters added to the class

In our following example, we want to extend the Person class with a new attribute called “location”. We call our new class “PersonNew” and instruct Python to inherit from “Person”, which we created some tutorials earlier. Note that the base classes are passed as a tuple, since Python supports multiple inheritance. Last, we specify the attribute “location” as a key-value pair. Our sample looks like the following:

```
pn = type("PersonNew", (Person,), {"location": "Vienna"})
pn.age = 35
pn.name = "Mario"
```
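As a self-contained sketch of what type() does here (with a minimal stand-in Person class, since the original one lives in an earlier tutorial):

```python
# Minimal stand-in for the Person class from the earlier tutorial.
class Person:
    pass

# Equivalent to: class PersonNew(Person): location = "Vienna"
PersonNew = type("PersonNew", (Person,), {"location": "Vienna"})

p = PersonNew()
print(isinstance(p, Person))  # -> True
print(p.location)             # -> Vienna (a class attribute)
```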

If you test the code, it will just work as expected. All other attributes such as age and name can also be retrieved. Now, let’s make it a bit more complex. We extend our previous sample with the JSON serialisation, to be capable of dynamically creating an object from a JSON string.

## Dynamically creating a class in Python from JSON

We therefore create a new function that takes the object to serialise and extracts all values from it. In addition, we add one more key-value pair, which we call “__class__”, in order to store the name of the class. Getting the class name is a bit more complex, since it is written like “<class '__main__.PersonNew'>”. Therefore, we first split the object name on “.”, take the last entry and split it again on the ' character, taking the first part. There are more elegant ways to do this, but I want to keep it simple. Once we have the class name, we store it in the dictionary and return the dictionary. The complete sample is here:

```
def map_proxy(obj):
    dict = {}
    for k in obj.__dict__.keys():
        dict.update({k : obj.__dict__.get(k)})
    cls_name = str(obj).split(".")[1].split("'")[0]
    dict.update({"__class__" : cls_name})
    return dict
```

We can now use the json.dumps method and call the map_proxy function to return the JSON string:

```
st_pn = json.dumps(map_proxy(pn))
print(st_pn)
```

Now, we are ready to dynamically create a new class with the “type” function. We name the class after the class name that was provided above, which can be retrieved via “__class__”. We let it inherit from Person and pass the dictionary of the entire object into it, since it already is a set of key/value pairs:

```
def dyn_create(obj):
    return type(obj["__class__"], (Person, ), obj)
```

We can now also invoke the json.loads method to dynamically create the class:

```
obj = json.loads(st_pn, object_hook=dyn_create)
print(obj)
print(obj.location)
```

And the output should be like that:

```
{"location": "Vienna", "__module__": "__main__", "__doc__": null, "age": 35, "name": "Mario", "__class__": "PersonNew"}
<class '__main__.PersonNew'>
Vienna
```

As you can see, it is very easy to dynamically create new classes in Python. We could improve this code a lot, but I’ve created this tutorial for explanatory reasons rather than usability ;).

In our next tutorial, we will have a look at logging.


## Python for Spark Tutorial – Python serialization with Objects and JSON

One important aspect of working with data is serialisation. Basically, this means that objects can be persisted to a storage (e.g. the file system, HDFS or S3). With Spark, a lot of file formats are possible. However, in this tutorial we will have a look at how to deal with JSON, a very popular file format that is often used with Spark. Now we will have a look at Python serialization.

## What is it and how does Python serialization work?

JSON stands for “JavaScript Object Notation” and was originally developed for client-server applications, with JavaScript as its main user. It was built to have less overhead than XML.

First, let’s start with copying objects. Basically, Python knows two ways: shallow copies and deep copies. The difference is that with shallow copies, references to objects within the copied object are kept. This is relevant when using objects as attributes. In a deep copy, no references are kept; every value is copied to the new object. This means that you can then use it independently from the previous one.

To copy one object to another, you only need to import copy and call the copy or deepcopy function. The following code shows how this works.

```
import copy
ps1 = Person("Mario", 35)
pss = copy.copy(ps1)
psd = copy.deepcopy(ps1)
ps1.name = "Meir-Huber"
print(ps1.name)
print(pss.name)
print(psd.name)
```

And the output should be this:

```
Meir-Huber
Mario
Mario
```
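The difference only becomes visible with nested mutable data: a shallow copy shares the inner objects, while a deep copy does not. A small sketch:

```python
import copy

original = {"name": "Mario", "scores": [1, 2]}
shallow = copy.copy(original)
deep = copy.deepcopy(original)

original["scores"].append(3)
print(shallow["scores"])  # -> [1, 2, 3]: the inner list is shared
print(deep["scores"])     # -> [1, 2]: fully independent copy
```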

## JSON serialization in Python

Now, let’s look at how we can serialise an object with the use of JSON. Basically, you need to import “json”. An object that you want to serialise needs to be serialisable, and a lot of classes in Python already implement that. However, when we want to serialise our own object (e.g. the “Person” class that we created in this tutorial), we need to implement a serialise function or a custom serialiser. Luckily, Python provides us the possibility to access all variables of an object via the “__dict__” dictionary. This means that we don’t have to write our own serialiser and can simply call “dumps” of “json”:

```
import json
js = json.dumps(ps1.__dict__)
print(js)
```

The above call creates a JSON representation of the entire object:

`{"name": "Meir-Huber", "age": 35}`

We might want to add more information to the JSON string – e.g. the name of the class it was originally stored in. We can do this by passing a custom function to the “dumps” method. This function gets the object to be serialised as its only parameter. We then only pass the original object (Person) and the function we want to execute. We name this function “make_nice”. In the function, we create a dictionary and add the name of the class as the first entry, under the key “obj_name”. We then merge the dictionary of the object into the new dictionary and return it.

## Finishing the serialization

Another parameter added to the “dumps” function is “indent”. The only thing it does is pretty-printing, by adding line breaks and indents. This is just for improved readability. The function and call look like this:

```
def make_nice(obj):
    dict = {
        "obj_name": obj.__class__.__name__
    }
    dict.update(obj.__dict__)
    return dict

js_pretty = json.dumps(ps1, default=make_nice, indent=3)
print(js_pretty)
```

And the result should now look like the following:

```
{
   "obj_name": "Person",
   "name": "Meir-Huber",
   "age": 35
}
```

Now we know how to serialise an object to a JSON string. Basically, you can now store this string in a file or an object on S3. The only thing that we haven’t discussed yet is how to get an object back from a string. We therefore take the JSON string we “dumped” before. Our goal is to create a Person object from it. This can be done via the “loads” call of the json module. We also define a method to do the conversion via the “object_hook” parameter. This object_hook method has one argument: the parsed JSON object itself. We access each of the parameters of the object with named indexers and return the new object.

```
str_json = "{\"name\": \"Meir-Huber\", \"age\": 35}"

def create(obj):
    print(obj)
    return Person(obj["name"], obj["age"])

obj = json.loads(str_json, object_hook=create)
print(obj)
```

The output should now look like this.

```
{'name': 'Meir-Huber', 'age': 35}
<__main__.Person object at 0x7fb84831ddd8>
```

Now we know how to serialise objects to JSON and how to get them back from a string value. In the next tutorial, we will have a look at how to improve this and make it more dynamic: by dynamic class creation in Python.


## Python for Spark Tutorial – String manipulations in Python

In the last tutorials, we already worked a lot with strings and even manipulated some of them. Now it is about time to have a look at the theory behind it. Basically, formatting strings is very easy. The only thing you need is the “format” method, called on a string, with a variable number of arguments. If you pass numbers, the str() function is applied to them automatically, so there is no need to convert them. This tutorial is about string manipulations in Python.

## String manipulations in Python

Basically, the notation is very similar to other string formatters you may be used to. One really nice thing is that you don’t need to provide the positional arguments: Python assumes that the positions are in line with the parameters you provide. An easy sample is this:

```
str01 = "This is my string {} and the value is {}".format("Test", 11)
print(str01)
```

And the output should look like this:

`This is my string Test and the value is 11`
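The placeholders can also be numbered or named explicitly, which allows reordering and reusing arguments:

```python
# Positional indices: arguments can be reused or reordered.
print("{0} and {1}, then {0} again".format("a", "b"))  # -> a and b, then a again
# Named placeholders read better in longer templates.
print("The value of {name} is {value}".format(name="x", value=11))  # -> The value of x is 11
```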

You can also use classes for this. Therefore, we define a class “Person”:

```
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p = Person("Mario Meir-Huber", 35)
str02 = "The author \"{}\" is {} years old".format(p.name, p.age)
print(p.name)
print(str02)
```

The output for this should look like this:

```
Mario Meir-Huber
The author "Mario Meir-Huber" is 35 years old
```

## The difflib in Python

One nice thing in Python is difflib. This library enables us to easily check two arrays of strings for differences. One use case would be to check my last name for differences. Note that my last name is one of the most frequent last-name combinations in the German-speaking countries and thus allows different ways of writing it.

To work with difflib, simply import it and call the context_diff function. It marks the differences detected with “!”.

```
import difflib
arr01 = ["Mario", "Meir", "Huber"]
arr02 = ["Mario", "Meier", "Huber"]
for line in difflib.context_diff(arr01, arr02):
    print(line)
```

Below you can see the output. One difference was spotted. You can easily use this for spotting differences in datasets and creating golden records from it.

```
*** 
--- 
***************
*** 1,3 ****
  Mario
! Meir
  Huber
--- 1,3 ----
  Mario
! Meier
  Huber
```

## Textwrap in Python

Another nice feature in Python is textwrap. This library has some basic features for text “prettifying”. In the following sample, we use 5 different functions:

• Indent: adds an indent to a text, e.g. a tab before the text
• Wrap: wraps the text into an array of strings in case it is longer than the maximum width. This is useful to split text into lines of a maximum length
• Fill: does the same as Wrap, but joins the result with new lines
• Shorten: shortens the text to a specified maximum width. The cut-off is written as “[...]”, and you might use it to add a “read more” around it
• Dedent: removes the whitespace that is common to the beginning of all lines of the text

The functions are used in simple statements:

```
from textwrap import *
print(indent("Mario Meir-Huber", "\t"))
print(wrap("Mario Meir-Huber", width=10))
print(fill("Mario Meir-Huber", width=10))
print(shorten("Mario Meir-Huber Another", width=15))
print(dedent(" Mario Meir-Huber "))
```

And the output should look like this:

```
	Mario Meir-Huber
['Mario', 'Meir-Huber']
Mario
Meir-Huber
Mario [...]
Mario Meir-Huber 
```

Today’s tutorial was more of a “housekeeping” session, since we had used much of this already. In the next tutorial, I will write about object serialisation with JSON, as this is also very useful.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

## Python for Spark Tutorial – Python Async and await functionality

In the last tutorials, we had a look at methods, classes and decorators. Now, let’s have a brief look at asynchronous operations in Python. Most of the time this is abstracted away for us by Spark, but it is nevertheless relevant to have a basic understanding of it. In this tutorial, we will look at Python’s async and await functionality.

## Python Async and await functionality

Basically, you define a method to be asynchronous by simply adding “async” as keyword ahead of the method definition. This is written like that:

```async def FUNCTION_NAME():
    FUNCTION-BLOCK```

Another keyword in this context is “await”. Basically, every function that does something asynchronous is awaitable. When adding “await”, nothing else happens until the awaited function has finished. This means that you might lose the benefit of asynchronous execution, but you get simpler handling, e.g. when working with web data. In the following code, we create an async function that sleeps for a random number of seconds (between 1 and 10). We call the function twice with the “await” operator.

```import asyncio
import random
async def func():
    tim = random.randint(1,10)
    await asyncio.sleep(tim)
    print(f"Function finished after {tim} seconds")
# top-level await works in a notebook; in a script, wrap these calls in asyncio.run()
await func()
await func()```

In the output, you can see that execution waited for the first function to finish before the second one started. Everything happened sequentially, not in parallel.

```Function finished after 9 seconds
Function finished after 9 seconds```
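If you want to await several coroutines concurrently without managing tasks yourself, asyncio also provides “gather”. Here is a minimal sketch; the function names are made up for illustration (in a notebook, you would write “await main()” instead of calling asyncio.run):

```python
import asyncio

async def greet(name, delay):
    await asyncio.sleep(delay)
    return f"hello {name}"

async def main():
    # both coroutines run concurrently; gather collects their results in call order
    results = await asyncio.gather(greet("a", 0.2), greet("b", 0.1))
    print(results)  # → ['hello a', 'hello b']
    return results

asyncio.run(main())
```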

Python also knows parallel execution. This is done via tasks: we use the method “create_task” from the asyncio library in order to execute a function in parallel. To see how this works, we invoke the function several times and add a print statement at the end of the code.

## Parallel execution in Python async

```for _ in range(12):
    asyncio.create_task(func())
print("doing something else ...")```

This now looks very different from the previous sample. The print statement is the first to show up, and all tasks finish after 10 seconds at most. This is because (a) the first execution finishes after 1 second, while the print statement is shown immediately since it is executed right away, and (b) everything is executed in parallel and the maximum sleep interval is 10 seconds.

```doing something else ...
Function finished after 1 seconds
Function finished after 1 seconds
Function finished after 3 seconds
Function finished after 4 seconds
Function finished after 5 seconds
Function finished after 7 seconds
Function finished after 7 seconds
Function finished after 7 seconds
Function finished after 8 seconds
Function finished after 10 seconds
Function finished after 10 seconds
Function finished after 10 seconds```

However, there are also some issues with async operations. You can never say how long a task will take to execute. It could finish fast, or it could take forever due to a weak network connection or an overloaded server. Therefore, you might want to specify a timeout, which is the maximum time an operation should be waited for. In Python, this is done via the “wait_for” method. It takes the function to execute and the timeout in seconds. In case the call runs into a timeout, a “TimeoutError” is raised, which allows us to surround the call with a try block.

## Dealing with TimeoutError in Python

```try:
    await asyncio.wait_for(func(), timeout=3.0)
except asyncio.TimeoutError:
    print("Timeout occurred")```

Since our function sleeps between 1 and 10 seconds, it will run into the timeout in roughly seven out of ten cases. The code should then print this:

`Timeout occurred`
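For completeness: when the awaited function does finish within the timeout, wait_for simply returns its result. A minimal sketch with a made-up helper function:

```python
import asyncio

async def quick():
    await asyncio.sleep(0.1)
    return 42

async def main():
    try:
        # finishes well within the 1-second timeout, so the result comes back
        result = await asyncio.wait_for(quick(), timeout=1.0)
        print(result)  # → 42
    except asyncio.TimeoutError:
        print("Timeout occurred")

asyncio.run(main())
```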

Each task that is executed can also be controlled. Whenever you call the “create_task” function, it returns a Task object. A task can either be done, be cancelled or contain an error. In the next sample, we create a new task and wait for its completion. We then check whether the task was done or cancelled. You could also check for an error and retrieve the error message from it.

```task = asyncio.create_task(func())
print("running task")
await task
if task.cancelled():
    print("Task was cancelled")
elif task.done():
    print("Task done")```

In our case, no error should have occurred and thus the output should be similar to the following:

```running task
Function finished after 8 seconds
Task done```

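Tasks can also be cancelled explicitly via their “cancel” method; afterwards, “cancelled()” returns True. A minimal sketch with a made-up helper function:

```python
import asyncio

async def forever():
    await asyncio.sleep(3600)

async def main():
    task = asyncio.create_task(forever())
    await asyncio.sleep(0.1)   # give the task a chance to start
    task.cancel()              # request cancellation
    try:
        await task
    except asyncio.CancelledError:
        pass
    print(task.cancelled())  # → True
    return task.cancelled()

asyncio.run(main())
```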
Now we know how to work with async operations in Python. In our next tutorial, we will have a deeper look into how to work with Strings.

## Python for Spark Tutorial – Python decorator – Part 2

Decorators are powerful things in most programming languages. They help us make code more readable and add functionality to a method or class. Decorators are added above the method or class declaration in order to create some behaviour. Basically, we differentiate between two kinds of decorators: method decorators and class decorators. In this tutorial, we will look at classes and how a Python class decorator works.

## The Python decorator for a class

Class decorators are used to add some behaviour to a class. Normally, you would use this when you want to add some kind of behaviour to a class that is outside of its inheritance structure – e.g. by adding something that is too abstract to bring it to the inheritance structure itself.

The definition of that is very similar to the method decorators:

```@DECORATORNAME
class CLASSNAME():
    CLASS-BLOCK```

The decorator definition is very similar to the last tutorial’s sample. We create a function that takes a class and, within it, define the method that we want to “append” to the class. We call this method “fly”; it simply prints “Now flying …” to the console. Note that it takes “self”, like any instance method. To add this function to the class, we call the “setattr” function of Python. We then return the class itself.

```def altitude(cls):
    def fly(self):
        print("Now flying ... ")
    # attach the new method to the class, then return the class itself
    setattr(cls, "fly", fly)
    return cls```

## How to use the Python decorator

Now, our decorator is ready to be used. We first need to create a class. Therefore, we re-use the sample of the vehicles, but simplify it a bit. We create a class “Vehicle” that has a function “accelerate” and create two sub classes “Car” and “Plane” that both inherit from “Vehicle”. The only difference now is that we add a decorator to the class “Plane”. We want to add the possibility to fly to the Plane.

```class Vehicle:
speed = 0
def accelerate(self, speed):
self.speed = speed
class Car(Vehicle):
pass
@altitude
class Plane(Vehicle):
pass```

Now, we want to test our output:

```c = Car()
p = Plane()
c.accelerate(100)
print(c.speed)
p.fly()```

Output:

```100
Now flying ... ```

Basically, there are a lot of scenarios where you would use class decorators. For instance, you can add functionality to classes that contain data in order to convert them into a more readable table or the like.
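As an illustration of that last point, here is a small sketch of a class decorator that attaches a readable “__repr__” to a data-holding class; the names are made up for this example:

```python
def printable(cls):
    # attach a __repr__ that lists all instance attributes
    def __repr__(self):
        fields = ", ".join(f"{k}={v!r}" for k, v in vars(self).items())
        return f"{cls.__name__}({fields})"
    setattr(cls, "__repr__", __repr__)
    return cls

@printable
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

print(Point(1, 2))  # → Point(x=1, y=2)
```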

In our next tutorial, we will look at the await-operator.
