Python has a really great standard library. In the next two tutorial sessions, we will take a first look at it, focusing mainly on what is relevant for Spark developers in the long run. Today, we will cover functools and itertools in Python; the next tutorial will deal with some mathematical functions. But first, let’s start with “reduce”.
The reduce() function from functools in Python
Basically, the reduce function takes a function of two arguments and an iterable, and applies the function cumulatively to the items, from left to right, until a single value remains. In most cases, this will be a lambda function, but it could also be a normal function. In our sample, we take some values and sum them up:
from functools import reduce
values = [1,4,5,3,2]
reduce(lambda x,y: x+y, values)
And we get the expected output:
15
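As a small sketch of the two variations mentioned above: the same reduction with a normal function instead of a lambda, plus the optional starting value that reduce() accepts as a third argument. The helper name `add` and the variable names are only chosen for illustration.

```python
from functools import reduce
import operator

values = [1, 4, 5, 3, 2]


# A normal (named) function works just as well as a lambda.
def add(x, y):
    return x + y


total = reduce(add, values)              # ((((1 + 4) + 5) + 3) + 2)
product = reduce(operator.mul, values)   # 1 * 4 * 5 * 3 * 2

# The optional third argument is the starting value. It is also what
# reduce() returns for an empty iterable; without it, an empty input
# raises a TypeError.
empty_total = reduce(add, [], 0)

print(total, product, empty_total)       # 15 120 0
```

For plain sums, the built-in sum(values) is usually the simpler choice; reduce() is more useful when the combining step is something other than addition.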
The sorted() function
Another very useful function is the built-in “sorted” function, which needs no import. Basically, it returns a new sorted list built from the items of any iterable and leaves the original untouched. The easiest way to try it is with our previous values (which were unsorted!):
print(sorted(values))
The output is now in the expected order:
[1, 2, 3, 4, 5]
However, we can go further and sort complex objects as well. sorted takes a key function that returns the value to sort on; here it is passed as a lambda expression. We state that we want to sort by age. Make sure that you still have the “Person” class from our previous tutorial:
perli = [Person("Mario", "Meir-Huber", 35, 1.0), Person("Helena", "Meir-Huber", 5, 1.0)]
print(perli)
print(sorted(perli, key=lambda p: p.age))
As you can see, the first line shows the original order and the second line shows the persons sorted by the age member:
[Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0), Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0)]
[Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0), Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)]
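If the Person class from the previous tutorial is not at hand, the following sketch can be used to experiment. Note that this dataclass is only a hypothetical stand-in: its field names are taken from the printed output above, while the real class may be defined differently. The sketch also shows two further options of sorted(): descending order and sorting by several fields at once.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the Person class from the previous tutorial;
# the field names are inferred from the printed output.
@dataclass
class Person:
    firstname: str
    lastname: str
    age: int
    score: float


people = [
    Person("Mario", "Meir-Huber", 35, 1.0),
    Person("Helena", "Meir-Huber", 5, 1.0),
    Person("Some", "Other", 46, 1.0),
]

# reverse=True sorts in descending order.
oldest_first = sorted(people, key=lambda p: p.age, reverse=True)

# A tuple key sorts by lastname first and by firstname second.
by_name = sorted(people, key=lambda p: (p.lastname, p.firstname))

print([p.firstname for p in oldest_first])  # ['Some', 'Mario', 'Helena']
print([p.firstname for p in by_name])       # ['Helena', 'Mario', 'Some']
```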
The chain() function
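The chain() function is very helpful if you want to hook up two lists that contain the same kind of objects. It returns an iterator that walks through the first list and then the second, without copying them. Basically, we take the Person class again and create a new instance in a second list. We then chain the two lists together: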
import itertools
perstwo = [Person("Some", "Other", 46, 1.0)]
persons = itertools.chain(perli, perstwo)
for pers in persons:
    print(pers.firstname)
Here, too, we get the expected output:
Mario
Helena
Some
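One detail worth knowing is that chain() returns a one-shot iterator rather than a list. The sketch below uses plain strings instead of Person objects to stay self-contained, and also shows chain.from_iterable() for the case where the lists themselves are stored in a list. All variable names are illustrative.

```python
import itertools

first = ["Mario", "Helena"]
second = ["Some"]

# chain() returns a lazy iterator; the input lists are not copied.
names = itertools.chain(first, second)
first_pass = list(names)    # consumes the iterator
second_pass = list(names)   # the iterator is exhausted by now

print(first_pass)    # ['Mario', 'Helena', 'Some']
print(second_pass)   # []

# chain.from_iterable() does the same for an iterable of iterables.
nested = [first, second, ["Another"]]
flat = list(itertools.chain.from_iterable(nested))
print(flat)          # ['Mario', 'Helena', 'Some', 'Another']
```

If the combined sequence is needed more than once, wrapping the result in list() as above is the simplest option.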
The groupby() function
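Another great feature when working with data is grouping. Python also allows us to do so. The groupby() function takes two parameters: the iterable to group and the key as a function (here a lambda expression). Note that groupby() only groups consecutive elements with the same key, so the input should already be sorted by that key. We create a new list of tuple pairs and group by the family name: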
from itertools import groupby
pl = [("Meir-Huber", "Mario"), ("Meir-Huber", "Helena"), ("Some", "Other")]
for k,v in groupby(pl, lambda p: p[0]):
    print("Family {}".format(k))
    for p in v:
        print("\tFamily member: {}".format(p[1]))
Basically, groupby() returns pairs of the key and an iterator over the objects in that key group. This means that another iteration is necessary in order to access the elements in the group. The output of the above sample looks like this:
Family Meir-Huber
	Family member: Mario
	Family member: Helena
Family Some
	Family member: Other
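The sample works because the two “Meir-Huber” entries sit next to each other: groupby() only merges neighbouring items that share a key. The following sketch, with the tuples deliberately reordered, shows what happens otherwise and how sorting by the same key first fixes it. The helper name `family` is only used for illustration.

```python
from itertools import groupby

# Same data as above, but the two "Meir-Huber" entries are no longer adjacent.
pl = [("Meir-Huber", "Mario"), ("Some", "Other"), ("Meir-Huber", "Helena")]


def family(p):
    return p[0]


# Without sorting, "Meir-Huber" shows up as two separate groups.
unsorted_groups = [(k, [p[1] for p in g]) for k, g in groupby(pl, family)]
print(unsorted_groups)
# [('Meir-Huber', ['Mario']), ('Some', ['Other']), ('Meir-Huber', ['Helena'])]

# Sorting by the same key first yields one group per family.
sorted_groups = [
    (k, [p[1] for p in g]) for k, g in groupby(sorted(pl, key=family), family)
]
print(sorted_groups)
# [('Meir-Huber', ['Mario', 'Helena']), ('Some', ['Other'])]
```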
The repeat() function
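A nice function is the repeat() function. Basically, it returns the same element several times. Note that it does not copy the element; each iteration yields the very same object. For instance, if we want to repeat our list perstwo 4 times, this can be done like this: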
lst = itertools.repeat(perstwo, 4)
for p in lst:
    print(p)
The output is just as expected:
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
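To illustrate that repeat() hands out the same object each time rather than copies, here is a small sketch with a plain list. It also shows repeat() without a count, which runs forever and is therefore typically combined with map() or zip(). The variable names are made up for this example.

```python
import itertools

item = ["a"]
repeated = list(itertools.repeat(item, 3))

# Every entry is the very same list object, not a copy.
same_object = all(r is item for r in repeated)
print(same_object)   # True

# Changing the original is therefore visible in every entry.
item.append("b")
print(repeated)      # [['a', 'b'], ['a', 'b'], ['a', 'b']]

# Without a count, repeat() never ends; map() stops as soon as the
# shortest input (here range(5)) is exhausted.
squares = list(map(pow, range(5), itertools.repeat(2)))
print(squares)       # [0, 1, 4, 9, 16]
```

If independent copies are needed, something like copy.deepcopy() in a list comprehension is the safer route.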
The takewhile() and dropwhile() functions from itertools in Python
Two functions, takewhile and dropwhile, are also very helpful in Python. Basically, they are very similar, but their results are the opposite of each other. takewhile yields elements from an iterable as long as the predicate is true and stops at the first element for which it is false (e.g. with the predicate “lower than 20”, elements are only returned until the first value of 20 or more appears). dropwhile with the same predicate skips elements as long as their values are below 20 and then yields all remaining elements, without testing them again. The following sample shows this:
vals = range(1,40)
for v in itertools.takewhile(lambda vl: vl<20, vals):
    print(v)
print("######")
for v in itertools.dropwhile(lambda vl: vl<20, vals):
    print(v)
Here, too, the output is as expected:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
######
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
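With an ascending range, takewhile and dropwhile look like simple filters. The difference becomes visible with unsorted input, as in this short sketch (the sample values and the helper name `below_20` are made up for illustration): both functions only react to the first element that fails the predicate.

```python
import itertools

vals = [5, 12, 25, 8, 30, 3]


def below_20(v):
    return v < 20


# takewhile stops at the first element that fails the predicate (25).
taken = list(itertools.takewhile(below_20, vals))
print(taken)      # [5, 12]

# dropwhile skips the leading matches and then yields everything else,
# including later values below 20.
dropped = list(itertools.dropwhile(below_20, vals))
print(dropped)    # [25, 8, 30, 3]

# A plain filter, in contrast, tests every element.
filtered = [v for v in vals if below_20(v)]
print(filtered)   # [5, 12, 8, 3]
```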
As you can see, these are quite helpful functions. In the next and last Python tutorial of this series, we will have a look at some basic mathematical and statistical functions.
If you are not yet familiar with Spark, have a look at the Spark Tutorial I created. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official Python page.