In our last tutorial section, we worked with Filters and Groups. Today, we will look at aggregation functions in Spark and Python, which allow us to apply different aggregations to the columns of a dataframe.

The agg() function in PySpark

Basically, there is one function that takes care of all aggregations in Spark. It is called “agg()” and takes another aggregation function as its argument. There are various possibilities; the most common ones are building sums, calculating the average or finding the max/min values. The “agg()” function is called on a grouped dataset and is executed on one column. The following samples show some of the possibilities:

from pyspark.sql.functions import *
df_ordered.groupby().agg(max(df_ordered.price)).collect()

In this sample, we imported all available functions from pyspark.sql.functions. We called the “agg” function on the df_ordered dataset that we created in the previous tutorial. We then use the “max()” function to retrieve the highest price. The output should be the following:

[Row(max(price)=99.99)]

Now, we want to calculate the average value of the price. Similar to the above example, we use the “agg()” function and instead of “max()” we call the “avg” function.

df_ordered.groupby().agg(avg(df_ordered.price)).collect()

The output should be this:

[Row(avg(price)=42.11355000000023)]

Now, let’s get the sum of all orders. This can be done with the “sum()” function. It is also very similar to the previous samples:

df_ordered.groupby().agg(sum(df_ordered.price)).collect()

And the output should be this:

[Row(sum(price)=4211355.000000023)]

To calculate the mean value, we can use the “mean()” function, which is an alias for “avg()”:

df_ordered.groupby().agg(mean(df_ordered.price)).collect()

This should be the output:

[Row(avg(price)=42.11355000000023)]

Another useful function is “count()”, which simply counts all entries in a column:

df_ordered.groupby().agg(count(df_ordered.price)).collect()

And the output should be this:

[Row(count(price)=100000)]
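Of course, several aggregations can also be combined in a single agg() call and computed per group instead of over the whole dataset. The following is a minimal sketch; grouping by personid and the alias() calls are my additions and not part of the original samples:

from pyspark.sql.functions import avg, count, max, sum

# several aggregations at once, computed per person instead of over the whole dataset
df_ordered.groupby("personid").agg(
    count(df_ordered.price).alias("orders"),
    sum(df_ordered.price).alias("total"),
    avg(df_ordered.price).alias("average"),
    max(df_ordered.price).alias("highest")
).show()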

Today’s tutorial was rather easy: it dealt with aggregation functions in Spark and how to apply them with the agg() method. In the next tutorial, we will look at different ways to join data together.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.

One thing that everyone who deals with data needs is classes that make data accessible to the code as objects. In most cases – and Python isn’t different here – wrapper classes and O/R mappers have to be written. However, Python has a powerful decorator at hand that allows us to ease up our work. This decorator is called “dataclass”.

The dataclass in Python

The nice thing about the dataclass decorator is that it enables us to add a great set of functionality to an object containing data without the need to re-write it every time. Basically, this decorator adds the following functionality:

  • __init__: the constructor with all defined member variables. In order to use this, the member variables must be declared with type annotations – which is otherwise rather uncommon in Python
  • __repr__: this pretty-prints the class with all its member variables as a string
  • __eq__: a function to compare two instances of the class for equality
  • order functions: this creates several comparison functions such as __lt__ (less than), __gt__ (greater than), __le__ (less than or equal) and __ge__ (greater than or equal)
  • __hash__: adds a hash function to the class
  • frozen: prevents the attributes of an instance from being changed after initialisation

The definition for a dataclass in Python is easy:

@dataclass
class Classname():
CLASS-BLOCK

You can also enable or disable each of the properties described above separately by passing parameters to the decorator, e.g. frozen=True or order=True, as shown in the following sketch.
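For instance, a frozen and orderable dataclass could be declared like this (a minimal sketch; the Point class is only an illustration and not part of the original tutorial):

from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Point:
    x: int
    y: int

p1 = Point(1, 2)
p2 = Point(1, 3)
print(p1 < p2)   # True – the generated order functions compare the fields like a tuple
print(p1 == p2)  # False – the generated __eq__ compares all fields
# p1.x = 5 would raise a FrozenInstanceError, because the instance is frozen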

In the following sample, we will create a Person-Dataclass.

from dataclasses import dataclass
@dataclass
class Person:
    firstname: str
    lastname: str
    age: int
    score: float
        
p = Person("Mario", "Meir-Huber", 35, 1.0)
print(p)

Please note how the member variables are annotated with their types. You can see that there is no need to write a constructor anymore, since this is already done for you. When you print the object, the __repr__() function is called. The output should look like the following:

Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)

As you can see, the dataclass decorator abstracts away a lot of our problems. In the next tutorial we will have a look at itertools and functools.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

In our last tutorial, we learned how to write a Linear Regression model by hand, and we also had a look at the prediction error and the standard error. Today, we want to focus on a way to measure the performance of a model. In marketing, a common methodology for this is lift and gain charts. They can also be used for other things, but in today’s sample we will use a marketing scenario.

The marketing scenario for Lift and Gain charts

Let’s assume that you are in charge of an outbound call campaign. Basically, your goal is to increase conversions of the people contacted via this campaign. Like with most campaigns, you have a certain – limited – budget and thus need to plan the campaign smartly. This is where machine learning comes into play: you only want to contact those people that are most likely to buy the product. Therefore, you contact the top X percent of customers where you expect a conversion and avoid contacting those customers that are very unlikely to convert. We assume that you have already built a model for that and that we now run the campaign. We will measure our results with a gain chart, but first let’s create some data.

Our sample data represents all our customers, grouped into deciles. Basically, we group the customers into the top 10%, top 20%, … until we reach all customers. We add the number of conversions to each group as well:

Decile | # of Customers | Conversions
1 | 200 | 33
2 | 200 | 30
3 | 200 | 27
4 | 200 | 25
5 | 200 | 23
6 | 200 | 19
7 | 200 | 15
8 | 200 | 11
9 | 200 | 7
10 | 200 | 2

As you can see in the above table, the first decile contains the most conversions and is thus our top group. The share of total conversions that each group contributes, in percent, is:

Decile | % of Total Conversions
1 | 17,2%
2 | 15,6%
3 | 14,1%
4 | 13,0%
5 | 12,0%
6 | 9,9%
7 | 7,8%
8 | 5,7%
9 | 3,6%
10 | 1,0%

As you can see, 17.2% of all conversions come from the top 10% of customers, and the share declines from group to group. So, the best approach is to contact the top customers first. As a next step, we add the cumulative share of conversions. This number is then used for our cumulative gain chart.

Decile | Cumulative % of Conversions
1 | 17,2%
2 | 32,8%
3 | 46,9%
4 | 59,9%
5 | 71,9%
6 | 81,8%
7 | 89,6%
8 | 95,3%
9 | 99,0%
10 | 100,0%

Cumulative Gain Chart

With this data, we can now create the cumulative gain chart. In our case, this would look like the following:

A cumulative gain chart
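If you want to reproduce such a chart yourself, a few lines of matplotlib are enough. The following is a minimal sketch (matplotlib is my assumption here, it is not used in the original post) that plots the cumulative figures from the table above against a random-selection baseline:

import matplotlib.pyplot as plt

# cumulative % of customers contacted and cumulative % of conversions (from the table above)
contacted = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
gain = [0, 17.2, 32.8, 46.9, 59.9, 71.9, 81.8, 89.6, 95.3, 99.0, 100.0]

plt.plot(contacted, gain, marker="o", label="model")
plt.plot([0, 100], [0, 100], linestyle="--", label="random selection")
plt.xlabel("% of customers contacted")
plt.ylabel("cumulative % of conversions")
plt.title("Cumulative gain chart")
plt.legend()
plt.show()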

The Lift factor

Now, let’s have a look at the lift factor. The baseline for the lift is always 1 – this is what you would get with a random selection of customers instead of a structured, model-based approach. Basically, the lift factor is the ratio between the cumulative share of conversions captured and the cumulative share of customers contacted for each decile. With our sample data, the lift values look like the following:

Decile | Lift
1 | 1,72
2 | 1,64
3 | 1,56
4 | 1,50
5 | 1,44
6 | 1,36
7 | 1,28
8 | 1,19
9 | 1,10
10 | 1,00

Thus we have a lift factor of 1.72 for the first decile, decreasing towards 1.0 for the full customer set.
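These values can also be computed directly from the conversion counts of the first table. The following sketch (my addition, not part of the original post) divides the cumulative share of conversions by the share of customers contacted:

conversions = [33, 30, 27, 25, 23, 19, 15, 11, 7, 2]   # conversions per decile from the table above
total_conversions = sum(conversions)                    # 192 conversions in total

cumulative = 0
for decile, conv in enumerate(conversions, start=1):
    cumulative += conv
    gain = cumulative / total_conversions   # cumulative share of all conversions
    contacted = decile / 10                 # share of customers contacted so far
    lift = gain / contacted
    print(f"decile {decile}: lift {lift:.2f}")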

In this tutorial, we’ve learned how to evaluate a machine learning model with lift and gain charts. In the next tutorial, we will have a look at false positives and some other important topics before moving on to Logistic Regression.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

Data Science is often seen as a mystical thing – only a few understand it, and finding people who can do it is very hard. The skill gap is everywhere and companies are facing issues staffing their projects. However, most companies want to become “data driven” and thus need to have the skills available. In this context, we talk about data science democratisation.

What is data science democratisation?

However, I still think that we are currently doing it somewhat wrong – we need to enable more people to do what data scientists are doing, without the need for them to do complex algorithmic things. This is where “self-service analytics” comes into play – giving business users more power and enabling them to do “data science” with easy tools.

In an ideal world, every business user would have some basic data capabilities and would be fully capable of creating her own insights – driving all decisions with data, not with gut feelings. There are already some tools out there that enable exactly that: self-service analytics. This would also mean that the processes in companies have to shift a lot – away from traditional processes and data ownership. The goals of self-service analytics are diverse:

  • Reducing the FTE input for Data Science. At the moment, we need data scientists to do the job. However, those people aren’t available on the market at large scale and are very hard to find. This leads to several issues in doing data science.
  • Reducing the time to market (TTM). If we needed the help of a data scientist for every business question, every question would become a project that takes weeks. Decisions often need to be made fast, otherwise they might not be relevant at all.

What needs to be done to achieve it?

In my previous paragraph, I was writing about the “ideal world”. Now you might ask what the business reality out there looks like and what needs to be done in order to achieve this. Well, it is easier said than done. Basically, there are some organisational and technical measures that need to be applied:

  • No silos. People can only work with data if they have the full view of all available data. There should be no “hidden” data, and everyone in the company should be capable of checking data for integrity. Knowledge means power, and if one unit possesses all the data, it is very powerful. Therefore, data should be “free” within the company.
  • Self-service Data access. People and business units in the company should be capable of accessing data in an easy (and self-service) manner. It must be easy for them to find, search, retrieve and visualize data.
  • Data thinking and mindset. Everyone in the company – ranging from top managers to business users – needs to have a data mindset. This means that they should base their daily decisions on data rather than “gut feelings” and challenge their decisions with data and provable facts.

Technical enablers for data democratisation

  • Governance, Metadata Management and Data Catalogs: I keep repeating myself – but as long as these elementary things aren’t solved, the points above are impossible to reach. Most companies only do governance to the extent of legal and regulatory requirements, but they should do much more than that – enabling a self-service environment.
  • Data Abstraction / Virtualisation: This is one of the key things to enable easy data access at some level. An easy interface – ideally with an SQL-like feeling – should be available on top of all data sources. This gives business users an easy tool to access all data, not just parts of it.

You might now think that data scientists will become jobless? I would argue the contrary. Self-service analytics isn’t made to handle the complex things – it is made for quick insights and for proving that a business hypothesis might work. Based on this, many more questions will arise and thus create more work for data scientists. Also, achieving self-service analytics will lead to a lot of work for data engineers, who finally have to integrate all that data.

I hope you enjoyed this post about data science democratisation. If you want to learn more about how to deal with Big Data and Data Science in Business, read this tutorial about Big Data Business.

In our last tutorial section, we had a brief introduction to Spark Dataframes. Now, let’s have a look into filtering, grouping and sorting data with Apache Spark Dataframes. First, we will start with the orderBy statement, which allows us to order data in Spark. Next, we will continue with the filter statement, which allows us to filter data, and finally we will focus on the groupBy statement, which allows us to group data in Spark.

Order Data in Spark

First, let’s look at how to order data. Basically, there is an easy function on all dataframes, called “orderBy”. This function takes several parameters, but the most important parameters for us are:

  • Columns: a list of columns to order the dataset by. This is either one or more items
  • Order: ascending (=True) or descending (ascending=False)

If we again load our dataset from the previous sample, we can now easily apply the orderBy() function. In our sample, we order everything by personid:

df_new = spark.read.parquet(fileName)
df_ordered = df_new.orderBy(df_new.personid, ascending=True)
df_ordered.show()

The output should be:

+--------+----------+---+---------------+-----+-----+
|personid|personname|age|    productname|price|state|
+--------+----------+---+---------------+-----+-----+
|       0|    Amelia| 33|Bicycle gearing|19.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33|   Bicycle bell| 9.99|   NY|
|       0|    Amelia| 33|         Fender|29.99|   NY|
|       0|    Amelia| 33|         Fender|29.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33| Bicycle saddle|59.99|   NY|
|       0|    Amelia| 33| Bicycle saddle|59.99|   NY|
|       0|    Amelia| 33|  Bicycle brake|49.99|   NY|
|       0|    Amelia| 33|   Bicycle bell| 9.99|   NY|
|       0|    Amelia| 33|Bicycle gearing|19.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|  Bicycle Frame|99.99|   NY|
|       0|    Amelia| 33|Luggage carrier|34.99|   NY|
|       0|    Amelia| 33|Luggage carrier|34.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|  Bicycle Frame|99.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
+--------+----------+---+---------------+-----+-----+
only showing top 20 rows

Filter Data in Spark

Next, we want to filter our data. We take the ordered data again and only keep customers that are older than 33 and younger than 35 years. Just like the previous “orderBy” function, we can apply an easy function here: filter. This function basically takes one filter condition. In order to have a “between”, we chain two filter statements together. And just like in the previous sample, the column to filter on needs to be provided:

df_filtered = df_ordered.filter(df_ordered.age < 35).filter(df_ordered.age > 33)
df_filtered.show()

The output should look like this:

+--------+----------+---+---------------+-----+-----+
|personid|personname|age|    productname|price|state|
+--------+----------+---+---------------+-----+-----+
|      47|      Jake| 34|  Bicycle chain|39.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|   Bicycle fork|79.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|      Saddlebag|24.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34| Bicycle saddle|59.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|         Fender|29.99|   TX|
|      47|      Jake| 34|      Kickstand|14.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|  Bicycle brake|49.99|   TX|
|      47|      Jake| 34|      Saddlebag|24.99|   TX|
+--------+----------+---+---------------+-----+-----+
only showing top 20 rows
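By the way, the two chained filter() calls can also be written as a single filter with a combined condition – a small sketch, equivalent to the code above:

df_filtered = df_ordered.filter((df_ordered.age > 33) & (df_ordered.age < 35))
df_filtered.show()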

Group Data in Spark

In order to make more complex queries, we can also group data by different columns. This is done with the “groupBy” statement. Basically, this statement also takes just one argument – the column(s) to group by. The following sample is a bit more complex, but I will explain it afterwards:

from pyspark.sql.functions import bround
df_grouped = df_ordered \
    .groupBy(df_ordered.personid) \
    .sum("price") \
    .orderBy("sum(price)") \
    .select("personid", bround("sum(price)", 2)) \
    .withColumnRenamed("bround(sum(price), 2)", "value")
df_grouped.show()

So, what has happened here? We had several steps:

  • Take the previously ordered dataset and group it by personid
  • Create the sum of each person’s items
  • Order everything by the new sum column – ascending, which is the default. NOTE: the column is named “sum(price)”, since it is a newly created column
  • Round the column “sum(price)” to two decimal places so that it looks nicer. Note again that the name of the column changes, this time to “bround(sum(price), 2)”
  • Since the column now has a name that is really hard to interpret, we call the “withColumnRenamed” function to give it a much nicer name. We call our column “value”

The output should look like this:

+--------+--------+
|personid|   value|
+--------+--------+
|      37|18555.38|
|      69|18825.24|
|       6|18850.19|
|     196|19050.34|
|      96|19060.34|
|     144|19165.37|
|     108|19235.21|
|      61|19275.52|
|      63|19330.23|
|     107|19390.22|
|      46|19445.41|
|      79|19475.16|
|     181|19480.35|
|      17|19575.33|
|     123|19585.29|
|      70|19655.19|
|     134|19675.31|
|     137|19715.16|
|      25|19720.07|
|      45|19720.14|
+--------+--------+
only showing top 20 rows

You can also do each of the statements step-by-step in case you want to see each transformation and its impact.
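A minimal sketch of such a step-by-step version could look like this (the intermediate variable names are mine):

from pyspark.sql.functions import bround

df_grouped_data = df_ordered.groupBy(df_ordered.personid)   # a GroupedData object, not a dataframe yet
df_sum = df_grouped_data.sum("price")                        # adds a column named "sum(price)"
df_sum.show()

df_sorted = df_sum.orderBy("sum(price)")                     # ascending by default
df_rounded = df_sorted.select("personid", bround("sum(price)", 2))
df_final = df_rounded.withColumnRenamed("bround(sum(price), 2)", "value")
df_final.show()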

In our next tutorial, we will have a look at aggregation functions in Apache Spark.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.

Once you put your applications into production, you won’t be able to debug them any more. This creates some issues, since you won’t know what is going on in the background. Imagine a user does something and an error occurs – maybe you don’t even know that this behaviour can lead to an error. To overcome this obstacle, we have a powerful tool available in almost any programming environment: logging in Python.

How to do logging in Python

Basically, the logger is imported from the “logging” module and used like a singleton – you don’t need to create any classes or alike. First, you need to configure the logger with some information, such as the path to store the logs in and the format to be used. In our sample, we will use these parameters:

  • filename: The name of the file to write to
  • filemode: how the file should be created or appended
  • format: how each log entry should be written to the file (a format string with placeholders for the timestamp, the level, the message, …)

Then, you can log at different levels. This is done by simply typing “logging” followed by the corresponding action:

logging.<<ACTION>>

Basically, we use these actions:

  • debug: a debug message that something was executed, …
  • info: some information that a new routine or alike is started
  • warning: something didn’t work as expected, but no error occurred
  • error: a severe error occurred that led to wrong behaviour of the program
  • exception: an exception occurred. It is logged at the “error” level, but in addition it includes the exception traceback

Now, let’s start with the logging configuration:

import logging
logging.basicConfig(filename="../data/logs/log.log", filemode="w", format="%(asctime)s - %(levelname)s - %(message)s")

We store the log itself in a directory that first needs to be created. Then, we provide a format with the time, the name of the level (e.g. INFO) and the message itself. Now, we can go into writing the log itself:

logging.debug("Application started")
logging.warning("The user did an unexpected click")
logging.info("Ok, all is fine (still!)")
logging.error("Now it has crashed ... ")

This writes some log entries to the file. Now, let’s see how this works with exceptions. Basically, we “provoke” an exception and log it with “exception”. We also set the parameter “exc_info” to True so that the exception information is included without passing it on explicitly (for “exception” this is actually the default – Python handles it for us :))

Logging exceptions in Python

try:
    4/0
except ZeroDivisionError as ze:
    logging.exception("oh no!", exc_info=True)

Now, we can review our file and the output should be like this:

2019-08-13 16:21:04,329 - WARNING - The user did an unexpected click
2019-08-13 16:21:04,889 - ERROR - Now it has crashed ... 
2019-08-13 16:21:05,461 - ERROR - oh no!
Traceback (most recent call last):
  File "<ipython-input-9-5d33bb8d3dd6>", line 2, in <module>
    4/0
ZeroDivisionError: division by zero
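Note that the debug and info calls did not end up in the file: since we did not pass a level to basicConfig, the default level WARNING applies, so only warnings and above are written. If you also want debug and info messages, you can set the level explicitly – a minimal sketch:

import logging

# same configuration as above, but with an explicit level so that DEBUG and INFO are written as well
logging.basicConfig(filename="../data/logs/log.log", filemode="w",
                    format="%(asctime)s - %(levelname)s - %(message)s",
                    level=logging.DEBUG)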

As you can see, logging is really straight-forward and easy to use in Python. So, no more excuses to not do it :). Have fun logging!

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

In my previous posts, I explained Linear Regression and stated that there are some errors in it. These are the errors of prediction (for the individual predictions), and there is also a standard error. A prediction is good if the individual errors of prediction and the standard error are small. Let’s now start by examining the errors of prediction and then use them to calculate the standard error of our linear regression model.

Error of prediction in Linear regression

Let’s recall the table from the previous tutorial:

Year | Ad Spend (X) | Revenue (Y) | Prediction (Y')
2013 | € 345.126,00 | € 41.235.645,00 | € 48.538.859,48
2014 | € 534.678,00 | € 62.354.984,00 | € 65.813.163,80
2015 | € 754.738,00 | € 82.731.657,00 | € 85.867.731,47
2016 | € 986.453,00 | € 112.674.539,00 | € 106.984.445,76
2017 | € 1.348.754,00 | € 156.544.387,00 | € 140.001.758,86
2018 | € 1.678.943,00 | € 176.543.726,00 | € 170.092.632,46
2019 | € 2.165.478,00 | € 199.645.326,00 | € 214.431.672,17

We can see that there is a clear difference between the predictions and the actual numbers. We calculate the error of each prediction by taking the real value minus the predicted value:

Year | Y-Y'
2013 | -€ 7.303.214,48
2014 | -€ 3.458.179,80
2015 | -€ 3.136.074,47
2016 | € 5.690.093,24
2017 | € 16.542.628,14
2018 | € 6.451.093,54
2019 | -€ 14.786.346,17

In the above table, we can see how each prediction differs from the real value. This is our prediction error on the actual values.

Calculating the Standard Error

Now, we want to calculate the standard error. First, let’s have a look at the formula:

Standard Error = sqrt( Σ(Y – Y')² / N )

Basically, we take the sum of all squared errors, divide it by the number of observations (N) and take the square root of the result. We already have Y-Y' calculated, so we only need to square it:

Year | Y-Y' | (Y-Y')^2
2013 | -€ 7.303.214,48 | € 53.336.941.686.734,40
2014 | -€ 3.458.179,80 | € 11.959.007.558.032,20
2015 | -€ 3.136.074,47 | € 9.834.963.088.101,32
2016 | € 5.690.093,24 | € 32.377.161.053.416,10
2017 | € 16.542.628,14 | € 273.658.545.777.043,00
2018 | € 6.451.093,54 | € 41.616.607.923.053,70
2019 | -€ 14.786.346,17 | € 218.636.033.083.835,00

The sum of these squared errors is € 641.419.260.170.216,00.

N is 7, since we have 7 observations. Divided by 7, this gives € 91.631.322.881.459,50.

The last step is to take the square root, which results in a standard error of € 9.572.425,13 for our linear regression.
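The whole calculation can also be reproduced in a few lines of Python, using the prediction errors from the table above (a minimal sketch, not part of the original post):

import math

# prediction errors Y - Y' taken from the table above
errors = [-7303214.48, -3458179.80, -3136074.47, 5690093.24,
          16542628.14, 6451093.54, -14786346.17]

sum_of_squares = sum(e ** 2 for e in errors)             # roughly 641,419,260,170,216
standard_error = math.sqrt(sum_of_squares / len(errors))
print(round(standard_error, 2))                          # roughly 9,572,425.13, as calculated above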

Now, we have most items cleared for our linear regression and can move on to the logistic regression in our next tutorial.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

We have already learned a couple of things about RDDs in Spark – Spark’s low-level API. Today, it is time for the Apache Spark Dataframes tutorial. Basically, Spark Dataframes are easier to work with than RDDs: working with them is more straightforward and gives a more natural feeling.

Introduction to Spark Dataframes

To get started with Dataframes, we need some data to work with. In the RDD examples, we created random values. Now, we want to work with a more comprehensive dataset. I’ve prepared a “table” about revenues and uploaded it to my Git repository. You can download it here. [NOTE: if you did the Hive tutorial, you should already have the dataset]

After you have downloaded the data, upload the data to your Jupyter environment. First, create a new folder named “data” in Jupyter. Now, navigate to this folder and click on “Upload” in the upper right corner. With this, you can now upload a new file.

Upload a file to Jupyter

Once you have done this, your file will be available under “data/revenues.csv”. It will make sense to explore the file so that you understand the structure of it.

Dataframes are very comfortable to work with. First of all, you can use them like tables and add headers and datatypes to them. If you have already looked at our dataset, you can see that there are no types or headers assigned, so we need to create them first. This works by using the “StructType” schema from the pyspark.sql.types package within the Dataframes API. For each header entry, you provide a “StructField” to the StructType. A StructField has the following arguments:

  • Name: name of the field provided as a string value
  • Type: data type of the field. This can be any basic datatype from the pyspark.sql.types package. Normally, they are:
    • Numeric values: ByteType, ShortType, IntegerType, LongType
    • Decimal values: FloatType, DoubleType, DecimalType
    • Date values: DateType, TimestampType
    • Characters: StringType
    • Boolean: BooleanType
    • Null: NullType
  • Nullable: True or False if the field should be nullable

Working with Spark Dataframes

Once this is done, we can read the file from the path we uploaded it to by calling the “spark.read.csv” method. But before this happens, it is necessary to create the Spark session. With dataframes, this is done with the session builder: we provide an app name and some config parameters and call the “getOrCreate” function of SparkSession.

from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.master("local") \
.appName("Cloudvane-Sample-02") \
.config("net.cloudvane.sampleId", "Cloudvane.net Spark Sample 02").getOrCreate()
structure = StructType([StructField("personid", IntegerType(), False),
                      StructField("personname", StringType(), False),
                      StructField("state", StringType(), True),
                      StructField("age", IntegerType(), True),
                      StructField("productid", IntegerType(), False),
                      StructField("productname", StringType(), False),
                      StructField("price", DoubleType(), False)])
df = spark.read.csv("data/revenues.csv", structure)

Now, let’s show the dataframe:

df

The output of this should look like the following:

DataFrame[personid: int, personname: string, state: string, age: int, productid: int, productname: string, price: double]

However, this only shows the header names and the associated data types. Spark is built for large datasets, often having billions of rows – so displaying the entire result is often not possible. However, you can show the top items by calling the “show()” function of a dataframe.

df.show()

This will now show some of the items in the form of a table. With our dataset, this looks like this:

The output of the Dataframe show() function

Modifying a Spark Dataframe

In the Spark dataframe, there are some columns we don’t need. A common task for data engineers and data scientists is data cleansing. For our following samples, we won’t need “productid” – even though it might be useful for other tasks. Therefore, we simply remove the entire column by calling the “drop” function in Spark. After that, we show the result again.

df_clean = df.drop("productid")
df_clean.show()

The output of this again looks like the following:

Using the df.drop function in Spark

Now we have a modified dataframe. At some point it would be useful to persist this dataset somewhere. In our case, we store it simply within our Jupyter environment, because this is the easiest way to access it for our samples. We store the result as a Parquet file – one of the most common file formats to work with in Spark. We use the “write” function of the dataframe and then call “parquet” on it. The “parquet” writer knows several parameters that we will use:

  • FileName: the name of the file. For our sample below, we will create the filename from the current date and time. Therefore, we import “datetime” from the Python standard library. The method “strftime” formats the datetime in the specified format; our format is “dd-mm-yyyy hh:mm”. Also, we import the “os” package, which provides information about the filesystem used by the operating system. We will use the “path.join” function out of this package.
  • PartitionBy: With large datasets, it is best to partition them in order to increase efficiency of access. Partitions are done on Columns, which has a positive impact on read/scan operations on data. In our case, we provide the state as the partition criteria.
  • Compression: in order to decrease the size on disk, we use a compression algorithm. In our case, this should be gzip.

This is now all that we need to know in order to run the next code snippet:

import datetime
import os
 
today = datetime.datetime.today()
fileName = os.path.join("data", today.strftime('%d-%m-%Y %H:%M'))
df_clean.write.parquet(fileName, partitionBy="state", compression="gzip")

In today’s Spark Dataframes Tutorial, we have learned about the basics. In our next tutorial, we will look at grouping, filtering and sorting data.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue.

In our previous tutorial, we had a look at how to (de)serialise objects from and to JSON in Python. Now, let’s have a look into how to dynamically create and extend classes in Python. Basically, we are using the mechanism that Python itself uses: the dynamic type() function. This function takes several parameters; we will only focus on the three that are relevant for our sample.

How to use the dynamic type function in Python

Basically, we call the function with three parameters:

type(CLASS_NAME, INHERITS, PARAMETERS)

These parameters have the following meaning:

  • CLASS_NAME: The name of the new class
  • INHERITS: a tuple of base classes the new type should inherit from
  • PARAMETERS: a dictionary of attributes and methods that are added to the class

In the following example, we want to extend the Person class with a new attribute called “location”. We call our new class “PersonNew” and instruct Python to inherit from “Person”, which we created some tutorials earlier. Note that the base classes are passed as a tuple, since Python supports multiple inheritance. Last, we specify the attribute “location” as a key-value pair. Our sample looks like the following:

pn = type("PersonNew", (Person,), {"location": "Vienna"})
pn.age = 35
pn.name = "Mario"
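A quick check (my addition, not part of the original sample) shows that the new attribute as well as the values we just assigned are accessible:

print(pn.location)  # Vienna
print(pn.age)       # 35
print(pn.name)      # Mario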

As you can see, it just works as expected – the new attribute as well as all other values such as age and name can be retrieved. Now, let’s make it a bit more complex. We extend our previous sample with the JSON serialisation to be capable of dynamically creating an object from a JSON string.

Dynamically creating a class in Python from JSON

We therefore create a new function that takes the object to serialise and reads all values out of it. In addition, we add one more key-value pair, which we call “__class__”, in order to store the name of the class. Getting the class name is a bit more complex, since the string representation of the class looks like “<class '__main__.PersonNew'>”. Therefore, we first split it at the “.”, take the last entry and split that again at the “'”, taking the first part. There are more elegant ways to do this, but I want to keep it simple. Once we have the class name, we store it in the dictionary and return the dictionary. The complete sample is here:

def map_proxy(obj):
    dict = {}

    # copy all attributes of the object into a plain dictionary
    for k in obj.__dict__.keys():
        dict.update({k: obj.__dict__.get(k)})

    # extract the class name from a string like "<class '__main__.PersonNew'>"
    cls_name = str(obj).split(".")[1].split("'")[0]
    dict.update({"__class__": cls_name})

    return dict

We can now use the json.dumps method and call the map_proxy function to return the JSON string:

st_pn = json.dumps(map_proxy(pn))
print(st_pn)

Now, we are ready to dynamically create a new class with the “type” method. We name the new class after the class name that was stored above, which we retrieve via “__class__”. We let it inherit from Person and pass the entire dictionary as the attributes, since it already is a set of key-value pairs:

def dyn_create(obj):
    
    return type(obj["__class__"], (Person, ), obj)

We can now also invoke the json.loads method to dynamically create the class:

obj = json.loads(st_pn, object_hook=dyn_create)
print(obj)
print(obj.location)

And the output should be like that:

{"location": "Vienna", "__module__": "__main__", "__doc__": null, "age": 35, "name": "Mario", "__class__": "PersonNew"}
<class '__main__.PersonNew'>
Vienna

As you can see, it is very easy to dynamically create new classes in Python. We could largely improve this code, but I’ve created this tutorial for explanatory reasons rather than usability ;).

In our next tutorial, we will have a look at logging.

Here you can go to the overview of the Python tutorial. If you want to dig deeper into the language, have a look at the official Python documentation.

In my previous posts, I introduced the basics of machine learning. Today, I want to focus on two elementary algorithms: linear and logistic regression. Basically, you learn them at the very beginning of your machine learning journey and might not use them much later on, but they are very helpful for understanding the underlying concepts.

Linear Regression

A Linear Regression is the simplest model in Data Science. Linear Regression is a supervised learning method and is used in trend analysis, time-series analysis, risk models in banking and many more areas.

In a linear regression, the relationship between a dependent variable y and a set of independent variables x is linear. This basically means that if there is data showing a specific trend, a future trend can be predicted. Let’s assume that there is a significant relation between ad spend and sales. We would have the following table:

Year | Ad Spend | Revenue
2013 | € 345.126,00 | € 41.235.645,00
2014 | € 534.678,00 | € 62.354.984,00
2015 | € 754.738,00 | € 82.731.657,00
2016 | € 986.453,00 | € 112.674.539,00
2017 | € 1.348.754,00 | € 156.544.387,00
2018 | € 1.678.943,00 | € 176.543.726,00
2019 | € 2.165.478,00 | € 199.645.326,00

If you look at the data, it is very easy to figure out that there is some kind of relation between how much money you spend on ads and the revenue you get. Basically, the ratio of ad spend to revenue ranges from about 1:92 to 1:119. Please note that I totally made up the numbers. However, based on these numbers, you could predict what revenue to expect when spending a certain amount on ads. The relation between them is therefore linear and we can easily plot it on a line chart:

Linear Regression

As you can see, some of the values are above the line and others below. Let’s now manually calculate the linear function. There are some steps necessary that eventually lead to the predicted values. Let’s assume we want to know what revenue we can expect if we spend a specific amount of money on ads – say, 1 million. The linear regression function for this is:

predicted score (Y') = bX + intercept (A)

This means that we now need to calculate two values: the slope (our “b”) and the intercept (our “A”). X is the only value we know – our 1 million spend. Let’s first calculate the slope.

Calculating the Slope

The first thing we need to do is calculate the slope. For this, we need the squared deviations of the ad spend (X) and the cross products of the ad spend and revenue deviations. Let’s first start with X – our ad spend. The deviation is calculated for each value individually. There are some steps involved:

  • Creating the average of the ad spend
  • Subtracting the individual ad spend from it
  • Squaring the result

The first step is to create the average of both values. The average for the revenues should be:  € 118.818.609,14 and the average for the spend should be:  € 1.116.310,00.

Next, we need to create the squared deviation of each ad spend value. We do this by subtracting each individual ad spend from the average and squaring the result. The table for this should look like the following:

The formula is: (Average ad spend – ad spend) ^ 2

Year | Ad Spend | Squared Deviation (X)
2013 | € 345.126,00 | € 594.724.761.856,00
2014 | € 534.678,00 | € 338.295.783.424,00
2015 | € 754.738,00 | € 130.734.311.184,00
2016 | € 986.453,00 | € 16.862.840.449,00
2017 | € 1.348.754,00 | € 54.030.213.136,00
2018 | € 1.678.943,00 | € 316.555.892.689,00
2019 | € 2.165.478,00 | € 1.100.753.492.224,00

Quite huge numbers already, right? Now, let’s create the cross products with the revenues. This is done by multiplying (average ad spend – ad spend) with (average revenue – revenue) for each year. This should result in:

Year | Revenue | Cross Product (X, Y)
2013 | € 41.235.645,00 | € 59.830.740.619.545,10
2014 | € 62.354.984,00 | € 32.841.051.219.090,30
2015 | € 82.731.657,00 | € 13.048.031.460.197,10
2016 | € 112.674.539,00 | € 797.850.516.541,00
2017 | € 156.544.387,00 | € 8.769.130.708.225,71
2018 | € 176.543.726,00 | € 32.478.055.672.684,90
2019 | € 199.645.326,00 | € 84.800.804.871.574,80

Now, we only need to sum up the two columns. The sums should be:

€ 2.551.957.294.962,00 for the X column (squared deviations)
€ 232.565.665.067.859,00 for the Y column (cross products)

Now, we divide the Y column sum by the X column sum and get the following slope: 91,1322715

Calculating the Intercept

The intercept is somewhat easier. The formula for it is: average(Y) – slope * average(X). We already have all relevant variables calculated in our previous steps. Our intercept should equal € 17.086.743,14.

Predicting the value with the Linear Regression

Now, we can build our function. This is: Y = 91,1322715X + 17.086.743,14

As stated in the beginning, our X should be 1 Million and we want to know our revenue:  € 108.219.014,64
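The manual calculation can be reproduced in a few lines of Python using the values from the table above (a minimal sketch; the variable names are mine):

# ad spend (X) and revenue (Y) taken from the table above
x = [345126, 534678, 754738, 986453, 1348754, 1678943, 2165478]
y = [41235645, 62354984, 82731657, 112674539, 156544387, 176543726, 199645326]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# slope = sum of the cross products of the deviations / sum of the squared deviations of X
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

print(slope)                          # roughly 91.1322715
print(intercept)                      # roughly 17,086,743.14
print(slope * 1_000_000 + intercept)  # predicted revenue for 1 million ad spend, roughly 108,219,014.64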

The prediction is actually lower than the actual revenues of the data points closest to it (the 2016 and 2017 values). If you change the input to 2 million or 400k, the prediction will again get closer to the actual values. Predictions always produce some errors, and they are normally reported as well. Therefore, the error table looks like the following:

Year | Ad Spend | Real Revenue (Y) | Prediction (Y') | Error
2013 | € 345.126,00 | € 41.235.645,00 | € 48.538.859,48 | -€ 7.303.214,48
2014 | € 534.678,00 | € 62.354.984,00 | € 65.813.163,80 | -€ 3.458.179,80
2015 | € 754.738,00 | € 82.731.657,00 | € 85.867.731,47 | -€ 3.136.074,47
2016 | € 986.453,00 | € 112.674.539,00 | € 106.984.445,76 | € 5.690.093,24
2017 | € 1.348.754,00 | € 156.544.387,00 | € 140.001.758,86 | € 16.542.628,14
2018 | € 1.678.943,00 | € 176.543.726,00 | € 170.092.632,46 | € 6.451.093,54
2019 | € 2.165.478,00 | € 199.645.326,00 | € 214.431.672,17 | -€ 14.786.346,17

The error calculation is done by taking the real value and subtracting the predicted value from it – and voilà, you have your error. One common goal in machine learning is to reduce this error and make predictions more accurate.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.