In recent years, many traditional companies founded digital labs that were supposed to serve their digitalisation efforts. But if you look at the results of these labs, they are rather limited. Most of the “products” or PoCs never made it back into the companies’ real products. Overall, these labs could be considered a failure. But why? Why is it the wrong data strategy?

What is a Silicon Valley lab?

Let’s first look at what those labs are and why they were founded. Everywhere in the world, there is increased pressure on companies to digitalise. Basically, the way C-level executives in traditional companies handled that was by looking at (successful) Silicon Valley startups. They did trips to the Valley and found a very cool culture there.

Basically, a lot of companies there were built in garages or old factory buildings, which gives them a very industrial style. Back in good old Europe, the executives decided: “We need to have something very similar”. What they did is: they rented a factory hall somewhere, equipped it with IT and hired the smartest people available on the market to create their new digital products. The idea was also to keep them (physically) away from the traditional company premises in order to build something new and not look too much at the company itself. A lot of money was burned with this approach. What C-level executives weren’t told are a few things:

(A) Silicon Valley companies don’t work in garages or factory halls because it is fancy and the way they like it. Often, they simply don’t have the money to rent expensive office space, especially since prices in the Valley are very high. The culture that is typical for the Valley emerged out of necessity, not out of coolness.

(B) Being remote from the traditional business works best when you develop a product from the ground up, completely new. However, most traditional companies still earn most of their money with their core business, and in most cases it will stay like that. A car manufacturer will earn most of its money with cars; digital products come on top. With this remote type of development, it often proved impossible to integrate the results of the labs into the real products.

So what can executives do to overcome this dilemma of failed PoCs in Data Science projects?

There is no silver bullet for this challenge. The popular website VentureBeat even claims that 87% of Data Science projects never make it into production. It depends mainly on what should be achieved. When we look at startups, their founders often come from large enterprises and were unhappy with how their business used to work.

I would argue that most large enterprises basically have the innovation power they are seeking, but it is often under-utilised or not utilised at all. One thing is crucial: keep the right balance between distance from and closeness to the legacy products. It is necessary to understand and build on top of the legacy products, but it is also necessary not to get corrupted by them – often, people keep on doing their things for years and simply don’t question it.

To successfully change products and services, the best thing is to bring in someone external who doesn’t know the company that well but has the competence to respect its history. These persons should not be engineers (engineers are also needed) but rather senior executives with a strong background in digital technologies. Seeing things differently brings new ideas and can move any company forward in its digitalisation efforts 🙂

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.


In the previous tutorial posts, we looked at the Linear Regression and discussed some basics of statistics such as the Standard Deviation and the Standard Error. Today, we will look at the Logistic Regression. It is similar in name to the linear regression, but different in usage. Let’s have a look.

The Logistic Regression explained

One of the main differences between the Linear Regression and the Logistic Regression is that the logistic regression is binary – it calculates values between 0 and 1 and thus states whether something is rather true or false. This means that the result of a prediction could be “fail” or “succeed” for a test. In a churn model, this would mean that a customer either stays with the company or leaves the company.

Another key difference to the Linear Regression is that the regression curve can’t be calculated analytically with a closed-form solution. Therefore, in the Logistic Regression, the regression curve is “estimated” and optimised. There is a mathematical method to do this estimation – called “Maximum Likelihood Estimation”. Normally, these parameters are calculated by different machine learning tools so that you don’t have to do it yourself.
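As a minimal sketch (my own example, not part of the original tutorial), this is roughly how such a tool estimates the parameters for you – here with scikit-learn on some made-up churn data:

# Hedged sketch: fitting a logistic regression with scikit-learn on invented data.
# scikit-learn estimates the coefficients via maximum likelihood internally.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical churn data: x = months since the last purchase, y = 1 if the customer left
X = np.array([[1], [2], [3], [10], [15], [20]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of staying (class 0) vs. leaving (class 1) for a customer at 8 months
print(model.predict_proba([[8]]))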

Another aspect is the concept of “odds”. Basically, the odds of a certain event happening or not happening are calculated. This could be a certain team winning a soccer game: let’s assume that Team X wins 7 out of 10 games (thus losing 3; we don’t consider a draw). The odds in this case would be 7:3 on winning or 3:7 on losing, whereas the probability of winning is 7 out of 10.
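To illustrate the relation between probability, odds and log-odds – the quantity the logistic regression models linearly – here is a small example of my own (not from the original post):

import math

p_win = 7 / 10                # Team X wins 7 out of 10 games
odds = p_win / (1 - p_win)    # 7:3, i.e. roughly 2.33
log_odds = math.log(odds)     # the "logit" of the win probability

print(odds, log_odds)         # ~2.33  ~0.85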

This time we won’t calculate the Logistic Regression by hand, since that would take way too long. In the next tutorial, I will focus on classifiers such as Random Forest and Naive Bayes.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

In our last tutorial, we looked at different aggregation functions in Apache Spark. This time, we will look at what different options are available to join data in Apache Spark. Joining Data is an essential thing when it comes to working with data.

Join Data in Spark

In most cases, you will sooner or later come across data joining in whatever form. In our tutorial, we will look at three different ways of joining data: via a join, a union or an intersect.

But first, we need to create some datasets we can join. Our financial dataset from the previous samples isn’t comprehensive enough, so we create some other dummy data:

names = spark.createDataFrame([(1, "Mario"), (2, "Max"), (3, "Sue")], ("nid", "name"))
revenues = spark.createDataFrame([(1, 9.99), (2, 189.99), (3, 1099.99)], ("rid", "revenue"))

We simply create two dataframes: names and revenues. These dataframes are initialised as a list of key/value pairs. Now, it is about time to learn the first item: the “join” function.

Creating a join in Spark: the Join() in PySpark

A join() operation is performed on the dataframe itself. Basically, the join operation takes several parameters. We will look at the most important ones:

  • Other Dataset: the other dataset to join with
  • Join by: the IDs to join on (must exist in both datasets)
  • Type: inner, outer, left/right outer join

Basically, the inner join returns all rows that are matched (i.e. available in both datasets). The outer join also returns rows that haven’t got a matching “partner” in the other dataset; non-matching rows are filled with null values. The difference between the left and the right outer join is whether all rows from the left (first) or the right (second) dataset are kept. The left/right syntax comes from SQL, where you write the query from left to right. In our following sample, we use an inner join:

joined = names.join(revenues, names.nid == revenues.rid, "inner").drop("rid")
joined.show()

Basically, all items should have a “matching partner” from both datasets. The result should be this:

+---+-----+-------+
|nid| name|revenue|
+---+-----+-------+
|  1|Mario|   9.99|
|  3|  Sue|1099.99|
|  2|  Max| 189.99|
+---+-----+-------+
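The tutorial only runs an inner join; as an additional sketch of my own (not part of the original post), a full outer join keeps the non-matching rows and fills them with nulls. The revenues_partial dataframe below is hypothetical extra data just for this illustration:

# Hypothetical data: nid 2 and 3 have no revenue here, rid 4 has no name
revenues_partial = spark.createDataFrame([(1, 9.99), (4, 49.99)], ("rid", "revenue"))

# "outer" (full outer) join: unmatched rows from both sides are kept with null values
names.join(revenues_partial, names.nid == revenues_partial.rid, "outer").show()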

Creating the union in Spark: the union() method in PySpark

Creating a union is the second way to bring data together. Basically, a union appends data to the original dataset without taking the structure or the IDs into account. The union simply copies the rows of the other dataset to the end of the first one. The following sample shows the union:

names.union(revenues).show()

The output should be this:

+---+-------+
|nid|   name|
+---+-------+
|  1|  Mario|
|  2|    Max|
|  3|    Sue|
|  1|   9.99|
|  2| 189.99|
|  3|1099.99|
+---+-------+

Creating the intersection in Spark: the intersect() in PySpark

The third way in our tutorial today is the intersect() method. Basically, it only returns those rows that appear in both of the compared datasets. For our sample, we first need to create a new dataset with new id/name pairs and then call the intersect statement:

names2 = spark.createDataFrame([(1, "Mario"), (4, "Tom"), (5, "Lati")], ("nid", "name"))
names2.intersect(names).show()

And the output should be this:

+---+-----+
|nid| name|
+---+-----+
|  1|Mario|
+---+-----+

Now you should be familiar with the different ways of joining data. In the next sample, we will look at how to limit data results.

There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

Python has a really great standard library. In the next two tutorial sessions, we will have a first look at this standard library. We will mainly focus on what is relevant for Spark developers in the long run. Today, we will focus on FuncTools and IterTools in Python; the next tutorial will deal with some mathematical functions. But first, let’s start with “reduce”.

The reduce() function from FuncTools in Python

Basically, the reduce function takes an iterable and cumulatively applies a function to its elements. In most cases, this will be a lambda function, but it could also be a normal function. In our sample, we take some values and create the sum of them by moving from left to right:

from functools import reduce
values = [1,4,5,3,2]
reduce(lambda x,y: x+y, values)

And we get the expected output

15
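As mentioned above, a normal function works just as well as a lambda. A small variation of my own (not in the original post), also using operator.add from the standard library:

from functools import reduce
from operator import add

def multiply(x, y):
    return x * y

values = [1, 4, 5, 3, 2]
print(reduce(add, values))       # 15, the same sum as above
print(reduce(multiply, values))  # 120, the product of all values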

The sorted() function

Another very useful function is the built-in “sorted” function. Basically, it sorts values or tuples in a list. The easiest way to apply it is to use our previous values (which were unsorted!):

print(sorted(values))

The output is now in the expected sorting:

[1, 2, 3, 4, 5]

However, we can go further and sort complex objects as well. sorted() takes a key to sort on, and this is passed as a lambda expression. We state that we want to sort by age. Make sure that you still have the “Person” class from our previous tutorial:

perli = [Person("Mario", "Meir-Huber", 35, 1.0), Person("Helena", "Meir-Huber", 5, 1.0)]
print(perli)
print(sorted(perli, key=lambda p: p.age))

As you can see, our values are now sorted based on the age member.

[Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0), Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0)]
[Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0), Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)]

The chain() function

The chain() method is very helpful if you want to chain two lists containing the same kind of objects. Basically, we take the Person class again and create a new instance in a second list. We then chain the two lists together:

import itertools
perstwo = [Person("Some", "Other", 46, 1.0)]
persons = itertools.chain(perli, perstwo)
for pers in persons:
    print(pers.firstname)

Also here, we get the expected output:

Mario
Helena
Some

The groupby() function

Another great feature when working with data is grouping. Python also allows us to do so. The groupby() method takes two parameters: the list to group and the key as a lambda expression. We create a new list of tuple pairs (already sorted by family name) and group by the family name:

from itertools import groupby
pl = [("Meir-Huber", "Mario"), ("Meir-Huber", "Helena"), ("Some", "Other")]
for k,v in groupby(pl, lambda p: p[0]):
    print("Family {}".format(k))
    for p in v:
        print("\tFamily member: {}".format(p[1]))

Basically, the groupby() method returns the key (as the value type) and the objects of that group as an iterator. This means that another iteration is necessary in order to access the elements in the group. The output of the above sample looks like this:

Family Meir-Huber
	Family member: Mario
	Family member: Helena
Family Some
	Family member: Other
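One caveat: groupby() only groups consecutive elements, so an unsorted list would split the same family into several groups. A small sketch of my own (not part of the original post) showing the usual sorted() + groupby() combination:

from itertools import groupby

pl_unsorted = [("Meir-Huber", "Mario"), ("Some", "Other"), ("Meir-Huber", "Helena")]

# Sort by the same key first, otherwise "Meir-Huber" would show up as two separate groups
for k, v in groupby(sorted(pl_unsorted, key=lambda p: p[0]), lambda p: p[0]):
    print(k, [p[1] for p in v])
# Meir-Huber ['Mario', 'Helena']
# Some ['Other']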

The repeat() function

A nice function is the repeat() function. Basically, it copies an element several times. For instance, if we want to copy a person 4 times, this can be done like this:

lst = itertools.repeat(perstwo, 4)
for p in lst:
    print(p)

And the output is just as expected – note that perstwo is itself a one-element list, so each repetition prints the whole list:

[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]

The takewhile() and the dropwhile() function in IterTools in Python

Two functions – takewhile and dropwhile – are also very helpful in Python. Basically, they are very similar, but their results are the opposite of each other. takewhile yields elements as long as the condition is true and stops at the first element where it is false; dropwhile skips elements as long as the condition is true and yields everything from the first false element onwards. With a predicate like “lower than 20”, takewhile would only return elements as long as they are below 20, while dropwhile would discard elements as long as they are below 20 and return the rest. The following sample shows this:

vals = range(1,40)
for v in itertools.takewhile(lambda vl: vl<20, vals):
    print(v)
print("######")
for v in itertools.dropwhile(lambda vl: vl<20, vals):
    print(v)

And also here, the output is as expected:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
######
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

As you can see, these are quite helpful functions. In our last Python tutorial, we will have a look at some basic mathematical and statistical functions.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

In my previous posts, we covered some fundamentals of machine learning and looked at the linear regression. Today, we will look at another statistical topic: false positives and false negatives. You will come across these terms quite often when working with data, so let’s have a look at them.

The false positive

In statistics, there is one type of error called the false positive. This happens when the prediction states something to be true, but in reality it is false. To easily remember the false positive, you can think of it as a false alarm. A simple example is the airport security check: when you pass the security check, you have to walk through a metal detector. If you don’t carry any metal items with you (since you left them for the x-ray!), no alarm will go off. But in some rather rare cases, the alarm might still go off. Either you forgot something or the metal detector made an error – in this case, a false positive. The metal detector predicted that you have metal items somewhere with you, but in fact you don’t.

Another example of a false positive in machine learning would be image recognition: imagine your algorithm is trained to recognise cats. There are so many cat pictures on the web, so it is easy to train this algorithm. However, you then feed the algorithm the image of a dog and the algorithm calls it a cat, even though it is a dog. This again is a false positive.

In a business context, your algorithm might predict that a specific customer is going to buy a certain product for sure, but in fact this customer didn’t buy it. Again, we have a false positive. Now, let’s have a look at the other error: the false negative.

The false negative

The other error in statistics is the false negative. Like the false positive, it is something that should be avoided – it is basically the same kind of mistake, just the other way around. Let’s look at the airport example one more time: you wear a metal item (such as a watch) and go through the metal detector. You simply forgot to take off the watch. And – the metal detector doesn’t go off this time. Now, you have a false negative: the metal detector stated that you don’t wear any metal items, but in fact you did. A condition was predicted to be false, but in fact it was true.
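As a small illustration of my own (not from the original post), counting false positives and false negatives from predictions and actual outcomes could look like this:

# Hypothetical predictions (1 = "metal detected") versus what was actually true
predicted = [1, 0, 1, 0, 1, 0]
actual    = [1, 0, 0, 1, 1, 0]

false_positives = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
false_negatives = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)

print(false_positives, false_negatives)  # 1 false alarm, 1 missed detection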

The false positive and false negative rates are often used to score the quality of a model. Now that you understand some of the most important basics of statistics, we will have a look at another machine learning algorithm in my next post: the logistic regression.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

In our last tutorial section, we worked with filters and groups. Now, we will look at different aggregation functions in Spark and Python. Today, we will look at the aggregation functions in Spark that allow us to apply different aggregations on columns.

The agg() function in PySpark

Basically, there is one function that takes care of all the aggregations in Spark. This is called “agg()” and takes some other function as an argument. There are various possibilities; the most common ones are building sums, calculating the average or the max/min values. The “agg()” function is called on a grouped dataset and is executed on one column. In the following samples, some possibilities are shown:

from pyspark.sql.functions import *
df_ordered.groupby().agg(max(df_ordered.price)).collect()

In this sample, we imported all available functions from pyspark.sql.functions. We called the “agg” function on the df_ordered dataset that we created in the previous tutorial. We then use the “max()” function that retrieves the highest value of the price. The output should be the following:

[Row(max(price)=99.99)]

Now, we want to calculate the average value of the price. Similar to the above example, we use the “agg()” function and instead of “max()” we call the “avg” function.

df_ordered.groupby().agg(avg(df_ordered.price)).collect()

The output should be this:

[Row(avg(price)=42.11355000000023)]

Now, let’s get the sum of all orders. This can be done with the “sum()” function. It is also very similar to the previous samples:

df_ordered.groupby().agg(sum(df_ordered.price)).collect()

And the output should be this:

[Row(sum(price)=4211355.000000023)]

To calculate the mean value, we can use “mean()”, which is an alias for “avg()”:

df_ordered.groupby().agg(mean(df_ordered.price)).collect()

This should be the output:

[Row(avg(price)=42.11355000000023)]

Another useful function is “count”, where we can simply count all rows for a column:

df_ordered.groupby().agg(count(df_ordered.price)).collect()

And the output should be this:

[Row(count(price)=100000)]
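As a side note (a sketch of my own, not part of the original tutorial), several aggregations can also be combined in a single agg() call:

# Several aggregations at once; min/max/avg come from pyspark.sql.functions (imported above)
df_ordered.groupby().agg(
    min(df_ordered.price),
    max(df_ordered.price),
    avg(df_ordered.price)
).show()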

Today’s tutorial was rather easy. It dealt with aggregation functions in Spark and how you can use them with the agg() method. In the next sample, we will look at different ways to join data together.

There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

One thing everyone who deals with data comes across is classes that make data accessible to the code as objects. In all cases – and Python isn’t different here – wrapper classes and O/R mappers have to be written. However, Python has a powerful decorator at hand that allows us to ease our work. This decorator is called “dataclass”.

The dataclass in Python

The nice thing about the dataclass decorator is that it enables us to add a great set of functionality to an object containing data without the need to re-write it always. Basically, this decorator adds the following functionality:

  • __init__: the constructor with all defined member variables. In order to use this, the member variables must be annotated with their types – which is rather uncommon in Python
  • __repr__: this pretty prints the class with all its member variables as a string
  • __eq__: a function to compare two instances for equality
  • order functions: this creates several order functions such as __lt__ (less than), __gt__ (greater than), __le__ (less than or equal) and __ge__ (greater than or equal)
  • __hash__: adds a hash-function to the class
  • frozen: makes instances immutable, i.e. prevents attributes from being changed or deleted after creation

The definition for a dataclass in Python is easy:

@dataclass
class Classname:
    CLASS-BLOCK

You can also enable each of the above described features separately, e.g. with @dataclass(frozen=True), @dataclass(order=True) or alike.
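As a small sketch of my own (not from the original post), this is roughly how the order and frozen options behave:

from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Point:
    x: int
    y: int

a = Point(1, 2)
b = Point(1, 3)
print(a < b)   # True: order=True generated __lt__ and friends, comparing field by field
# a.x = 5      # would raise a FrozenInstanceError because of frozen=True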

In the following sample, we will create a Person-Dataclass.

from dataclasses import dataclass
@dataclass
class Person:
    firstname: str
    lastname: str
    age: int
    score: float
p = Person("Mario", "Meir-Huber", 35, 1.0)
print(p)

Please note how the member variables are annotated with their types. You can see that there is no need to write a constructor anymore, since this is already done for you. When you print the instance, the __repr__() function is called. The output should look like the following:

Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)

As you can see, the dataclass abstracts away a lot of the boilerplate for us. In the next tutorial we will have a look at IterTools and FuncTools.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.


In our last tutorial, we learned how to write a Linear Regression model by hand. We also had a look at the prediction error and the standard error. Today, we want to focus on a way to measure the performance of a model. In marketing, a common methodology for this is lift and gain charts. They can also be used for other things, but in today’s sample we will use a marketing scenario.

The marketing scenario for Lift and Gain charts

Let’s assume that you are in charge of an outbound call campaign. Basically, your goal is to increase conversions of the people contacted via this campaign. Like with most campaigns, you have a certain – limited – budget and thus need to plan the campaign smartly. This is where machine learning comes into play: you only want to contact those people that are most likely to buy the product. Therefore, you contact the top X percent of customers where you expect a conversion and avoid contacting those customers that are very unlikely to convert. We assume that you have already built a model for that and that we now run the campaign. We will measure our results with a gain chart, but first let’s create some data.

Our sample data represents all our customers, grouped into deciles. Basically, we group the customers into the top 10%, top 20%, … until we reach all customers. We add the number of conversions as well:

Decile | # of Customers | Conversions
1      | 200            | 33
2      | 200            | 30
3      | 200            | 27
4      | 200            | 25
5      | 200            | 23
6      | 200            | 19
7      | 200            | 15
8      | 200            | 11
9      | 200            | 7
10     | 200            | 2

As you can see in the above table, the first decile contains the most conversions and is thus our top group. The share each group contributes to all conversions, in percent, is:

Decile | % of Conversions
1      | 17.2%
2      | 15.6%
3      | 14.1%
4      | 13.0%
5      | 12.0%
6      | 9.9%
7      | 7.8%
8      | 5.7%
9      | 3.6%
10     | 1.0%

As you can see, the top 10% of customers account for 17.2% of all conversions, and the contribution declines from group to group. So, the best approach is to contact the top customers first. As a next step, we add the cumulative conversions. This number is then used for our cumulative gain chart.

Decile | Cumulative % of Conversions
1      | 17.2%
2      | 32.8%
3      | 46.9%
4      | 59.9%
5      | 71.9%
6      | 81.8%
7      | 89.6%
8      | 95.3%
9      | 99.0%
10     | 100.0%

Cumulative Gain Chart

With this data, we can now create the cumulative gain chart. In our case, this would look like the following:

(Figure: the cumulative gain chart for our sample data)

The Lift factor

Now, let’s have a look at the lift factor. The baseline for the lift factor is always a lift of 1, which corresponds to selecting a random sample with no structured approach. Basically, the lift factor is the ratio between the cumulative percentage of conversions captured and the percentage of customers contacted. With our sample data, the lift values would look like the following:

Decile | Lift
1      | 1.72
2      | 1.64
3      | 1.56
4      | 1.50
5      | 1.44
6      | 1.36
7      | 1.28
8      | 1.19
9      | 1.10
10     | 1.00

Thus we have a lift factor of 1.72 for the first decile, decreasing towards 1.0 for the full customer set.
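As a minimal sketch of my own (not part of the original post), the cumulative gain and lift values above can be reproduced from the conversion counts like this:

# Conversions per decile from the sample table above
conversions = [33, 30, 27, 25, 23, 19, 15, 11, 7, 2]
total = sum(conversions)                  # 192 conversions overall

cumulative = 0
for i, c in enumerate(conversions, start=1):
    cumulative += c
    gain = cumulative / total             # cumulative share of conversions captured
    contacted = i / len(conversions)      # share of customers contacted (deciles)
    lift = gain / contacted
    print(f"Decile {i}: gain {gain:.1%}, lift {lift:.2f}")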

In this tutorial, we’ve learned how to evaluate a machine learning model. In the next tutorial, we will have a look at false positives and some other important topics before moving on to the Logistic Regression.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

Data Science is often seen as this mystical thing – only a few understand it, and finding people who can do it is very hard. The skill gap is everywhere and companies are facing issues staffing their projects. However, most companies want to become “data driven” and thus need to have the skills available. In this context, we talk about data science democratisation.

What is data science democratisation?

Data science democratisation means making data and analytical capabilities available to far more people in the company than just the data science team. I would argue that we are currently doing it somewhat wrong – we need to enable more people to do what data scientists do, without the need for them to do complex algorithmic things. This is where “self-service analytics” comes into play: giving business users more power and enabling them to do “data science” with easy tools.

In an ideal world, each business user would have some basic data capabilities and be fully capable of deriving her own insights in some way – driving all decisions with data, not with gut feeling. There are already some tools out there that enable exactly that – self-service analytics. This would also mean that processes in companies have to shift a lot – away from traditional processes and data ownership. The goals of self-service analytics are diverse:

  • Reducing the FTE input for Data Science. At the moment, we need data scientists to do the job. However, those people aren’t available on the market at scale and are very hard to find. This leads to several issues in doing data science.
  • Reducing the time to market (TTM). If every business question required the help of a data scientist, every question would become a project that takes weeks. Decisions often need to be made fast, otherwise they might not be relevant at all.

What needs to be done to achieve it?

In my previous paragraph, I was writing about the “ideal world”. Now you might ask what the business reality out there is and what needs to be done in order to achieve this. Well, it is easier said than done. Basically, there are some organisational and technical measures that need to be applied:

  • No silos. People can only work with data if they have a full view of all available data. There should be no “hidden” data, and everyone in the company should be capable of checking data for integrity. Knowledge means power, and if one unit possesses all the data, it is very powerful. Therefore, data should be “free” within the company.
  • Self-service Data access. People and business units in the company should be capable of accessing data in an easy (and self-service) manner. It must be easy for them to find, search, retrieve and visualize data.
  • Data thinking and mindset. Everyone in the company – ranging from top managers to business users – needs to have a data thinking and mindset. This means that they should use data for all of their daily decisions rather than “gut feelings”. They should challenge their decisions with provability and data.

Technical enablers for data democratisation

  • Governance, metadata management and data catalogs: I keep repeating myself, but as long as these elementary things aren’t solved, the above ones are impossible to reach. Most companies only do governance to the extent of legal and regulatory requirements, but they should do much more than that – enabling a self-service environment.
  • Data abstraction / virtualisation: This is one of the key things to enable easy data access at some level. For all data sources, an easy interface – ideally with an SQL-like feel – should be available. This gives business users an easy tool to access all data, not just parts of it.

You might now think that the data scientist will become jobless? I would argue the contrary. Self-service analytics isn’t made to handle the complex things – it is made for quick insights and proving that a business hypothesis might work. Based on this, many more questions will arise and thus create more work for data scientists. Also, achieving self-service analytics will lead to a lot of work for data engineers who finally have to integrate that data.

I hope you enjoyed this post about data science democratisation. If you want to learn more about how to deal with Big Data and Data Science in Business, read this tutorial about Big Data Business.

In our last tutorial section, we had a brief introduction to Spark Dataframes. Now, let’s have a look at filtering, grouping and sorting data with Apache Spark Dataframes. First, we will start with the orderBy statement in Spark, which allows us to order data. Next, we will continue with the filter statement, which allows us to filter data, and then we will focus on the groupBy statement, which will eventually allow us to group data in Spark.

Order Data in Spark

First, let’s look at how to order data. Basically, there is an easy function on all dataframes, called “orderBy”. This function takes several parameters, but the most important parameters for us are:

  • Columns: a list of columns to order the dataset by. This is either one or more items
  • Order: ascending (=True) or descending (ascending=False)

If we again load our dataset from the previous sample, we can now easily apply the orderBy() function. In our sample, we order everything by personid:

df_new = spark.read.parquet(fileName)
df_ordered = df_new.orderBy(df_new.personid, ascending=True)
df_ordered.show()

The output should be:

+--------+----------+---+---------------+-----+-----+
|personid|personname|age|    productname|price|state|
+--------+----------+---+---------------+-----+-----+
|       0|    Amelia| 33|Bicycle gearing|19.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33|   Bicycle bell| 9.99|   NY|
|       0|    Amelia| 33|         Fender|29.99|   NY|
|       0|    Amelia| 33|         Fender|29.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33| Bicycle saddle|59.99|   NY|
|       0|    Amelia| 33| Bicycle saddle|59.99|   NY|
|       0|    Amelia| 33|  Bicycle brake|49.99|   NY|
|       0|    Amelia| 33|   Bicycle bell| 9.99|   NY|
|       0|    Amelia| 33|Bicycle gearing|19.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|  Bicycle Frame|99.99|   NY|
|       0|    Amelia| 33|Luggage carrier|34.99|   NY|
|       0|    Amelia| 33|Luggage carrier|34.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|  Bicycle Frame|99.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
+--------+----------+---+---------------+-----+-----+
only showing top 20 rows
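As a side note (my own sketch, not part of the original tutorial), descending order simply means setting ascending=False, as listed in the parameters above:

# The same dataframe, this time ordered by age in descending order
df_new.orderBy(df_new.age, ascending=False).show()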

Filter Data in Spark

Next, we want to filter our data. We take the ordered data again and only keep customers that are between 33 and 35 years old (i.e. exactly 34, since both bounds are exclusive). Just like the previous “orderBy” function, we can also apply an easy function here: filter. This function basically takes one filter condition. In order to have a “between”, we chain two filter statements together. Like in the previous sample, the column to filter on needs to be specified:

df_filtered = df_ordered.filter(df_ordered.age < 35).filter(df_ordered.age > 33)
df_filtered.show()

The output should look like this:

+--------+----------+---+---------------+-----+-----+
|personid|personname|age|    productname|price|state|
+--------+----------+---+---------------+-----+-----+
|      47|      Jake| 34|  Bicycle chain|39.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|   Bicycle fork|79.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|      Saddlebag|24.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34| Bicycle saddle|59.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|         Fender|29.99|   TX|
|      47|      Jake| 34|      Kickstand|14.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|  Bicycle brake|49.99|   TX|
|      47|      Jake| 34|      Saddlebag|24.99|   TX|
+--------+----------+---+---------------+-----+-----+
only showing top 20 rows
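As an alternative of my own (not in the original tutorial), the two chained filters can also be written as a single condition using the & operator:

# The same age filter expressed as one condition
df_filtered = df_ordered.filter((df_ordered.age > 33) & (df_ordered.age < 35))
df_filtered.show()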

Group Data in Spark

In order to make more complex queries, we can also group data by different columns. This is done with the “groupBy” statement. Basically, this statement also takes just one argument – the column(s) to group by. The following sample is a bit more complex, but I will explain it after the sample:

from pyspark.sql.functions import bround
df_grouped = df_ordered \
    .groupBy(df_ordered.personid) \
    .sum("price") \
    .orderBy("sum(price)") \
    .select("personid", bround("sum(price)", 2)) \
    .withColumnRenamed("bround(sum(price), 2)", "value")
df_grouped.show()

So, what has happened here? We had several steps:

  • Take the previously ordered dataset and group it by personid
  • Create the sum of each person’s items
  • Order everything ascending by the column for the sum. NOTE: the column is named “sum(price)” since it is a new column
  • We round the column “sum(price)” to two decimal places so that it looks nicer. Note again that the name of the column changes to “bround(sum(price), 2)”
  • Since the column now has a name that is really hard to interpret, we call the “withColumnRenamed” function to give the column a much nicer name. We call our column “value”

The output should look like this:

+--------+--------+
|personid|   value|
+--------+--------+
|      37|18555.38|
|      69|18825.24|
|       6|18850.19|
|     196|19050.34|
|      96|19060.34|
|     144|19165.37|
|     108|19235.21|
|      61|19275.52|
|      63|19330.23|
|     107|19390.22|
|      46|19445.41|
|      79|19475.16|
|     181|19480.35|
|      17|19575.33|
|     123|19585.29|
|      70|19655.19|
|     134|19675.31|
|     137|19715.16|
|      25|19720.07|
|      45|19720.14|
+--------+--------+
only showing top 20 rows

You can also run each of the statements step-by-step in case you want to see each transformation and its impact.
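As a side note (a sketch of my own, not from the original tutorial), the renaming step can also be avoided by giving the rounded column an alias right away:

# Same result, but naming the rounded column directly via alias()
df_grouped2 = df_ordered \
    .groupBy(df_ordered.personid) \
    .sum("price") \
    .select("personid", bround("sum(price)", 2).alias("value")) \
    .orderBy("value")
df_grouped2.show()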

In our next tutorial, we will have a look at aggregation functions in Apache Spark.

There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.