The SciPy Notebook

Python is one of the most widely used tools for Data Science. Since its inception, Python has received a lot of attention from both Data Scientists and Data Engineers. In a modern data scientist’s toolbox, it is hard not to know about Python. Some months ago, I created a tutorial on the Python language itself. Now it is about time to have a tutorial dedicated to Python data science!

Python Data Science Tutorial: the tools you need

First, let’s get started with a list of tools that we need for our Python data science tutorial. The most important thing is Python itself, so you should already be familiar with it. If not, consider learning Python with this tutorial first. Once done, we need some additional libraries. Don’t worry, they will be used from a pre-defined environment. But let’s first have a look at all the libraries that we will use throughout this tutorial series.

Basically, we will focus on three libraries in this series: NumPy, Pandas and MatPlotLib. All three libraries are briefly described in this post and will get more comprehensive coverage over the next weeks.

NumPy

NumPy is an easy-to-use open source library for scientific computing. The main element of NumPy is the n-dimensional array, which provides powerful means to do vector-based calculations. The library is implemented at a very low level (in C) and thus comes with high performance. NumPy is very useful for elementary mathematical functions, and you will definitely use the random number generator of NumPy from time to time!
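To give you a first impression, here is a tiny, self-contained example (the numbers are made up for illustration):

import numpy as np

# Create a 2-dimensional array and run vectorised calculations on it
salaries = np.array([[4400, 3420], [5500, 6700]])
print(salaries.mean())        # average over all elements
print(salaries.sum(axis=0))   # column-wise sums

# Draw a few random numbers from NumPy's random number generator
rng = np.random.default_rng(seed=42)
print(rng.normal(loc=0, scale=1, size=5))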

It is essential to learn about NumPy, since it is used in quite some Data Science projects. The next tutorial in this series will be an intro to NumPy.

Pandas

Pandas is a great open source library for data manipulation. The key element of Pandas is the DataFrame, a concept that is also frequently used in Spark. Pandas offers a lot of opportunities to both Data Scientists and Data Engineers when it comes to handling data.

Pandas can deal with different data types: it is well suited for Time-Series data and also for tabular data. In the next posts, we will explore Pandas more.
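As a small preview (the values are made up), a Pandas DataFrame with a date index can be summarised and resampled in just a few lines:

import pandas as pd

# A small tabular dataset with a date index (time series)
sales = pd.DataFrame(
    {"revenue": [120, 135, 128, 160]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)
print(sales.describe())            # summary statistics
print(sales.resample("2D").sum())  # aggregate into 2-day buckets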

MatPlotLib

Last but not least, it is also useful to present the data in a visual format. This is the strength of MatPlotLib, which provides great tools for data visualization in Python. Our tutorial series will end with a description of MatPlotLib.
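Just to give you an idea of how little code a first chart needs (the values below are made up):

import matplotlib.pyplot as plt

# Plot a simple line chart from two lists of values
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]
plt.plot(x, y, marker="o")
plt.xlabel("x")
plt.ylabel("y")
plt.title("A first plot with MatPlotLib")
plt.show()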

Setting up the environment for our Tutorial

Starting with the next tutorial in this series, we will start to code (yay!). However, this also means that you need to have an environment up and running. There are several options available for this. My preference is to use Jupyter (a notebook app) in a Docker environment. To set this up, you need Docker running on your device. If you are not familiar with Docker, you can learn about it here and find the installation instructions at https://docs.docker.com/install/. Please install Docker for this series first; the installation will take some time to finish, so please be patient.

Docker comes with an easy tool called “Kitematic”, which allows you to easily download and install Docker containers. Luckily, the Jupyter team provides a comprehensive container for Python and, of course, Jupyter itself. Once Docker is installed successfully, download the following container: “scipy-notebook”. Note that the download will take a while.
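If you prefer the command line over Kitematic, the image can – to my knowledge – also be pulled and started like this:

docker pull jupyter/scipy-notebook
docker run -p 8888:8888 jupyter/scipy-notebook

Jupyter then prints a URL with a token to the console, which you can open in your browser.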

The SciPy notebook in Docker

If you liked this post, you might also consider the tutorial about Python itself. It gives you a great insight into the Python language for Spark. If you want to know more about Python, you should consider visiting the official page.

Golden Record

In this last tutorial on Data Governance, we look at Master Data Management, the last of our four pillars. Master Data is the core data of the company, which should be clean, accurate and kept in a clear data model.

What is the goal of Master Data Management?

It is important to have exactly one dataset for each key data asset within the company. This could, for instance, be the data about a customer or a supplier. It is useful to have each customer exactly once. Many companies have their customer data spread over different systems and thus have trouble connecting those systems. If a customer walks into a store, the sales agents often have to use different CRM tools to get a holistic picture of the customer. This often means the company never fully understands the customer.

In order to achieve this, it is necessary to harmonise the data within a company. Reducing duplicate entries and finding the “golden record” is a key challenge in MDM: all data about one customer should be connected and in one place. Today, this is often called “Customer 360”. But achieving it isn’t easy at all.

How to find the “Golden Record”?

Basically, there are several options to find the golden record within a dataset. Let’s imagine we have the following dataset; each of the entries is exactly the same person, but the names are written differently:

Name          Social Security Number    Passport      Matching Group ID
Mario Meir    123-45-6789
Meir Mario    123-45-6789               P 123456 M
M. Meir                                 P 123456 M

How to find the golden record in a dataset

Basically, in this dataset, we see that there is a match on the social security number and on the passport. So, we can apply hierarchical matching. First, we match those entries that are rather unique. Normally, the social security number is unique, as is the passport ID. In this case, we can merge the matching entries into one record. This is now represented in matching groups (a small code sketch follows after the table):

Name          Social Security Number    Passport      Matching Group ID
Mario Meir    123-45-6789                             1
Meir Mario    123-45-6789               P 123456 M    1
M. Meir                                 P 123456 M

Hierarchical matching
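To make the idea a bit more tangible, here is a minimal sketch in Pandas (the column names and values are hypothetical and simply mirror the table above; a real MDM tool would propagate matches transitively and consider many more attributes):

import pandas as pd

# Hypothetical records, mirroring the table above
records = pd.DataFrame({
    "name": ["Mario Meir", "Meir Mario", "M. Meir"],
    "ssn": ["123-45-6789", "123-45-6789", None],
    "passport": [None, "P 123456 M", "P 123456 M"],
})

# Hierarchical matching: use the most unique identifier first (the SSN)
# and fall back to the passport where the SSN is missing
records["match_key"] = records["ssn"].fillna(records["passport"])
records["matching_group_id"] = records.groupby("match_key").ngroup() + 1
print(records)

# A second pass over the passport would then link "M. Meir" into the
# same group as "Meir Mario", giving us one golden record per person.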

What else can be done to increase the quality of your Master Data?

Basically, in addition to hierarchical matching, there are several other techniques available. The most common one is “manual matching”, where employees search for duplicate data and match it by hand. However, a better approach is to match data via machine learning and combine it with manual matching!


Next to Data Security & Privacy as well as Data Quality Management, Data Access and Search is of huge importance. This topic focuses on finding and accessing data in your data assets. Most large enterprises have a lot of data at their fingertips, but different business units don’t know where and how to find it. In this tutorial, we will have a look at how to solve this issue.

What are the ingredients for successful Data Access and Search?

There are several pre-conditions that need to be fulfilled in order to make data accessible. One of them is to have data security and privacy solved. If you want to make data accessible at large scale, it is very important to ensure that users can only access the data they are supposed to access. As a result, all users should see the data assets of the company via a data catalog, but not the data itself. In this catalog, people should have the possibility to browse the different data assets available in the company and start asking more questions.

A good data catalog constantly checks the data for updates and possible modifications and reflects them in the catalog itself. In addition to the requirements mentioned before, the data catalog checks for different data quality measures as described in the previous tutorial.

What should be inside a data catalog?

Based on the points mentioned above, a data catalog contains a lot of data about data. Next to the different data assets available, each data asset should be described and offer several pieces of information about it (a sketch of such an entry follows after the list):

  • Title. Title of the dataset.
  • Description. What this dataset is about.
  • Categories. Tags, to enable search.
  • Business Unit. The unit maintaining the dataset (e.g. Marketing).
  • Data Owner. The person in charge of maintaining the dataset.
  • Data Producer. The system that produces the data.
  • Data Steward. The person taking care of the dataset, if not the data owner.
  • Timespan. Indicates from when to when the data was recorded.
  • Data refresh interval. If the data is not available in real time, an indication of how often it gets refreshed.
  • Quality metrics. Indications of data quality.
  • Data Access or Sample Data. Information on how to access the data, or a sample dataset to explore it.
  • Transformations. When and how was the data transformed?
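As a rough illustration, such an entry could be represented as a simple structure like this (all values are made up):

catalog_entry = {
    "title": "Customer Orders 2019",
    "description": "All orders placed in the web shop during 2019",
    "categories": ["sales", "orders", "customers"],
    "business_unit": "Marketing",
    "data_owner": "jane.doe@example.com",
    "data_steward": "john.doe@example.com",
    "data_producer": "Web Shop Backend",
    "timespan": {"from": "2019-01-01", "to": "2019-12-31"},
    "refresh_interval": "daily",
    "quality_metrics": {"completeness": 0.97, "duplicates": 0.01},
    "access": "jdbc:postgresql://dwh.example.com/orders",
    "transformations": ["2020-01-05: anonymised e-mail addresses"],
}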

What does a data catalog look like?

The items above are samples of the contents of a data catalog entry. A good data catalog makes it easy for users to find and search within the metadata. The following sample shows the data catalog of the US government:

US government open data portal


We started our tutorial with a general intro to Data Governance and then went a bit deeper into data security and data privacy. In this post, we will have a look at how to ensure a certain level of data quality in your datasets. Data Quality is a very important aspect. Imagine you have wrong data about your customers and you build your marketing campaign on it. The campaign might return wrong results. This can damage your brand and turn away previously loyal customers. Therefore, data quality is essential.

How to measure data quality?

There are several aspects to measuring data quality. I’ve summarised them into 5 core metrics. If you browse different literature, you might find more or fewer metrics. However, these five should give you a core understanding of data quality management.

The 5 dimensions of data quality

Availability

Availability states that data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem. In this case, however, it doesn’t refer to the general availability of the database, but to the availability of each dataset itself. The query retrieving the data should be as good as possible at returning all available data, and there should be easy-to-use tools and languages to retrieve it. Normally, each database provides a query language such as SQL, or O/R mappers, to developers.

Availability also means that the data needed for a specific use case should be available to data analysts in business units. Data relevant for a marketing campaign might exist but not be available for the campaign. For instance, the company might have specific customer data in the data warehouse, but the business units don’t know that the data actually exists.

Correctness & Completeness

Correctness means that data has to be correct. If we again query for all existing users on a web portal interested in luxury cars, the data about them should be correct. By correctness, it is meant that the data should really represent people interested in luxury cars and that fake entries should be removed. A dataset is also not correct if a user changed his or her address without the company knowing about it. Therefore, it must be tracked when each dataset was last updated.

Similar to correctness is completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can target them somehow, e.g. by e-mail. If the e-mail field, or any other field we would like to use to target our users, is blank, the data is not complete for our use case.
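As a small, hypothetical example, completeness for this use case can be measured with a few lines of Pandas (the data is made up):

import pandas as pd

customers = pd.DataFrame({
    "name": ["Max", "Sue", "Tom"],
    "email": ["max@example.com", None, "tom@example.com"],
    "interest": ["luxury cars", "luxury cars", "budget cars"],
})

# Share of luxury-car prospects we can actually reach by e-mail
prospects = customers[customers["interest"] == "luxury cars"]
completeness = prospects["email"].notna().mean()
print(f"E-mail completeness: {completeness:.0%}")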

Timeliness

Data should be up to date. A user might change their e-mail address after a while, and our database should reflect these changes whenever and wherever possible. If we target our luxury-car users, it won’t be good at all if only 50% of the users’ e-mail addresses are correct. We might have “big data”, but the data is not correct since updates didn’t occur for a while.

Consistency

This shouldn’t be confused with the consistency requirement of the CAP theorem. Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistency of data, and this is a frequent problem in large web portals such as Facebook.

Understandability

It should be easy to understand data. If we query our database for people interested in luxury cars, we should have the possibility to easily understand what the data is about. Once the data is returned, we should be able to use our favorite tool to work with it. The data should describe itself, and we should know how to handle it. If the data returns a “zip” column, we know that this is the ZIP code the individual users live in.

What can you do to improve your data quality?

Basically, it all starts with getting started. You need to begin tracking your data quality at some point and then continuously improve it. There are several tools that support this endeavour. But keep in mind: bad data creates bad decisions!


In our previous tutorial intro, we outlined the four pillars that are relevant to data governance. In this post, I will take a deeper dive into the data security and data privacy aspects of data governance.

What is data security?

Data security is all about securing data against intrusions from inside or outside of an organisation. Basically, it deals with hardening any systems that store data and making sure that data is only stored in a safe and secure way.

Data security comes in several layers:

  • Infrastructure: ensuring that the physical infrastructure is protected against any unwanted access. This starts with physical access control to servers and any devices associated with the organisation. This layer is only relevant when running on-premise.
  • Operating systems and virtualisation: here it needs to be ensured that the operating system is in a secure state. If done on-premise, this covers both the host and the guest OS as well as the virtualisation software. When done in the cloud, it only applies to IaaS.
  • Databases and data stores: any databases need to be constantly checked for vulnerabilities. If other stores such as object stores are used, they also need to be secured. This applies to on-premise and IaaS cloud solutions, but not to PaaS or SaaS cloud solutions.
  • Application security: when building software on top of the previous stack, it is necessary to write this software in a secure manner. This applies to both on-premise and cloud. When using PaaS or SaaS solutions, it is the only relevant security challenge for companies implementing them. Therefore, it is highly important to have a comprehensive security concept on this layer.

What if you ignore it?

Data security issues are a frequent failure of companies. There are plenty of examples of data leaks, like those at LinkedIn, Deutsche Telekom or Twitter. Almost nobody is fully secure, and thus this block needs to be taken into consideration at the highest level when building a data strategy. Experts argue that the question is not whether an intrusion happens; the only question is how long the organisation needs to realise it, take counter-measures and minimise the damage.

A key recommendation (but not the only one) is to encrypt all data, so that it is more challenging to get full access.
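As a small illustration of this recommendation – a sketch only, using the third-party cryptography package and deliberately leaving key management out:

from cryptography.fernet import Fernet

# Symmetric encryption of a sensitive record; in practice the key would
# live in a key management system, never next to the data itself
key = Fernet.generate_key()
cipher = Fernet(key)

token = cipher.encrypt(b"Mario Meir, 123-45-6789")
print(token)                  # the ciphertext that is stored at rest
print(cipher.decrypt(token))  # readable again only with the key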

What is data privacy?

Another important block is data privacy. This deals more with the question of who can read or access the data within a company. Basically, algorithms and people should work with (pseudo-)anonymised data whenever possible. Analysts or Data Scientists shouldn’t see any personal information within the data that they are dealing with. If we take a marketing campaign, the analysts working with the data should only see the minimum data necessary for them to build the campaign. The marketing tool should then combine the results of their selection with the addresses of their targets. There are several tools available that obfuscate personally identifiable data (PID) and thus make working with it easier.
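A very simple form of such obfuscation is to replace direct identifiers with a salted hash before handing the data to analysts. Here is a minimal sketch (the salt and column names are made up):

import hashlib
import pandas as pd

customers = pd.DataFrame({
    "email": ["max@example.com", "sue@example.com"],
    "interest": ["luxury cars", "budget cars"],
})

SALT = "replace-with-a-secret-salt"

def pseudonymise(value: str) -> str:
    # One-way hash, so analysts can still join on the column
    # without ever seeing the real e-mail address
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

customers["customer_id"] = customers["email"].apply(pseudonymise)
analyst_view = customers.drop(columns=["email"])
print(analyst_view)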

The approach described above is also called the “need-to-know principle”. People should only see the data that they really need to know. When looking at how companies grant access rights to data, it is often done on a very individual basis: people ask for access, state why they need it, and the data owner gives them access. However, this is rather manual and not necessarily fit for the new era of privacy.

A business driven role-based access model

A much better approach is to build on a role-based access model. Roles don’t necessarily mean Active Directory roles; the model is built on the business roles that users are in. For example, a role could be “Marketing Analyst”. This user would get access to the specific data that he or she needs for the daily work. Access should be given to all data that is relevant, but nothing more than that. The roles in this approach should be clearly business-focused and not technology-focused.
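To make the idea concrete, a purely hypothetical role-to-dataset mapping could look like this:

# Business roles mapped to the datasets they may access
ROLE_ACCESS = {
    "Marketing Analyst": {"campaign_results", "customer_segments"},
    "Finance Controller": {"invoices", "payments"},
}

def can_access(role: str, dataset: str) -> bool:
    # Need-to-know: access is granted only if the dataset is
    # explicitly listed for the given business role
    return dataset in ROLE_ACCESS.get(role, set())

print(can_access("Marketing Analyst", "campaign_results"))  # True
print(can_access("Marketing Analyst", "invoices"))          # False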

Another key aspect of data privacy is to understand who accessed what data. It is necessary to store a comprehensive audit log of all data access and thus make data breaches traceable.


Data Governance

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest in data science, but they often ignore Data Governance. Some months ago I wrote about this and shared my frustration about it. Now I’ve decided to go for a more pragmatic approach and describe what Data Governance is all about. This should bring some clarity into the topic and take out some of the emotion.

Why is Data Governance important?

It is important to keep a certain level of quality in the data. Making decisions on bad data quality leads to bad overall decisions. Data Governance efforts increase exponentially when they are not addressed at the very beginning of your Data Strategy.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security is often slowing down business implementations
  • Initial investments are necessary – and they don’t show value for months or even years
  • Benefits are only visible “on top” of governance – e.g. through faster business results or better insights – and thus it is not easy to “quantify” the impact
  • Data Governance is often considered “unsexy”. Everybody talks about data science, but nobody about data governance. In fact, Data Scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open Source doesn’t focus too much on it, as there is less “buzz” around it than AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are rather designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect the success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal here is to keep the data secure against external access. It is built on encryption, access management and accessibility. Often, role-based access is defined in this process. A typical principle here is privacy and security by design.
  • Data Quality Management: In this pillar, different measures for Data Quality are defined and tracked. Typically, specific quality measures are looked after for each dataset. This gives data consumers an overview of the data quality.
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company’s assets. A typical example here is a Data Catalog that shows all available company data to end users.
  • Master Data Management: Master data is the common data of the company – e.g. the customer data, the data of suppliers and the like. Data here should be of high quality and consistent. One physical customer should occur as exactly one person and not as multiple persons.

For each of the above-mentioned pillars, I will write individual articles over the next weeks.

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

During the past tutorials, we have acquired a lot of knowledge about Spark. Now we are at the last tutorial on Spark, where we will have a look at cube and rollup. Both are useful for preparing multi-dimensional data for further processing.

Data for Spark Rollup and Cube functions

First, let’s create a dataset that we later want to work with. Our dataset is the monthly salary of people working in Finance or Sales:

employees = spark.createDataFrame([("Mario", 4400, "Sales")\
                                  , ("Max", 3420, "Finance")\
                                  , ("Sue", 5500, "Sales")\
                                  , ("Tom", 6700, "Finance")]\
                                 , ("name", "salary", "department"))

We then use the first function – rollup. We want the rollup to be on the department and the name of the person.

employees.rollup(employees.department, employees.name)\
            .sum()\
            .withColumnRenamed("sum(salary)", "salary")\
            .orderBy("department", "salary")\
            .show()

Here you can see the output (I will discuss it after you reviewed it):

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+

There are several lines in the output now. Let’s look at it line by line:

  • The first line consists of two null values and the sum of all salaries. This represents the entire company. Basically, department and name are filled with null, since it is neither one department nor one specific person – it is all departments and all persons in them.
  • The second and third lines are Max and Tom, who work in the finance department
  • The fourth line is the sum of the finance department; here you see “null” in the name, since it isn’t a name, but the entire department
  • The same story continues for the following lines with the sales department

So, basically, we get three different things: (A) the sum of all salaries, (B) the individual values and (C) the salaries per department. Now, let’s build the cube:

employees.cube(employees.department, employees.name)\
            .sum()\
            .withColumnRenamed("sum(salary)", "salary")\
            .orderBy("department", "salary")\
            .show()

Here, the results cover even more dimensions. First, we have the values of each person, but without the department. Then we have the overall total, and then again the departments and the individuals in them. The cube isn’t very relevant for this particular calculation. The background is that a cube creates all possible combinations, whereas the rollup only creates hierarchies. The cube also treats null as a possible combination, which is why the individuals appear here several times. Here is the output:

+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null|  Max|  3420|
|      null|Mario|  4400|
|      null|  Sue|  5500|
|      null|  Tom|  6700|
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+

I hope you liked the tutorials on Spark. There is much more to learn – e.g. about machine learning or different libraries for that. Make sure to check out the tutorial section in order to figure that out.


In the previous tutorial, we learned about data cleaning in Spark. Today, we will look at different options to work with columns and rows in Spark. First, we will start with renaming columns. We have already done this several times, and it is a frequent task in data engineering. In the following sample, we rename a column:

thirties = clean.select(clean.name, clean.age.between(30, 39)).withColumnRenamed("((age >= 30) AND (age <= 39))", "goodage")
thirties.show()

As you can see, we took the old name – which was very complicated – and renamed it to “goodage”. The output should be the following:

+-----+-------+
| name|goodage|
+-----+-------+
|  Max|  false|
|  Tom|   true|
|  Sue|  false|
|Mario|   true|
+-----+-------+

In the next sample, we want to filter rows on a string expression. This can be done with the “endswith” method applied to the column that should be filtered. In the following sample, we want to filter all contacts that are from Austria:

austrian = clean.filter(clean.lang.endswith("at"))
austrian.show()

As you can see, only one result is returned (as expected):

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

Removing Null-Values in Spark

In our next sample, we want to filter all rows that contain null values in a specific column. This is useful to get a glimpse of null values in datasets. This can easily be done by applying the “isNull” function on a column:

nullvalues = dirtyset.filter(dirtyset.age.isNull())
nullvalues.show()

Here, we get the two results containing these null values:

+---+----+----+-----+
|nid|name| age| lang|
+---+----+----+-----+
|  4| Tom|null|AT-ch|
|  5| Tom|null|AT-ch|
+---+----+----+-----+

Another useful function in Spark is the “like” function. If you are familiar with SQL, it should be easy to apply. If not: basically, it scans text in a column for one or more specific literals. You can use different expressions to filter for patterns. The following one filters all people whose language contains “DE”, independent of what follows afterwards (“%”):

langde = clean.filter(clean.lang.like("DE%"))
langde.show()

Here, we get all matching items:

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  2|  Max| 46|DE-de|
|  4|  Tom| 34|DE-ch|
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

Shorten Strings in a Column in Spark

Several times, we want to shorten string values. The following sample takes the first 2 letters with the “substr” function on the column. Afterwards we apply the “alias” function, which renames the column (similar to the “withColumnRenamed” function above).

shortnames = clean.select(clean.name.substr(0,2).alias("sn")).collect()
shortnames

Also here, we get the expected output; please note that it isn’t unique anymore (names!):

[Row(sn='Ma'), Row(sn='To'), Row(sn='Su'), Row(sn='Ma')]

Spark offers much more functionality to manipulate columns, so just play with the API :). In the next tutorial, we will have a look at how to build cubes and rollups in Spark.



Over the past tutorials, we have acquired quite some knowledge about Python and Spark. We know most of the relevant statements in Apache Spark and are now ready for some data cleaning, which is one of the key tasks of a data engineer. Before we get started, we need to have something that a data engineer (unfortunately) often works with: dirty data. We will focus on how to deal with missing, corrupt and wrong data in Spark.

Dealing with wrong data in Spark

In the following sample, we create some data. Note, that there are some errors in it:

  1. The first field (id: 1) has the wrong language – there is no “AT-at”; it needs to be “DE-at”
  2. The fourth field (id: 4) has a null value (“None”) and also the wrong language – AT-ch, whereas it should be “DE-ch”
  3. The fifth field (id: 5) is a duplicate of the previous one

dirtyset = spark.createDataFrame([(1, "Mario", 35, "AT-at")\
                                  , (2, "Max", 46, "DE-de")\
                                  , (3, "Sue", 22, "EN-uk")\
                                  , (4, "Tom", None, "AT-ch")\
                                  , (5, "Tom", None, "AT-ch")]\
                                 , ("nid", "name", "age", "lang"))
dirtyset.show()

The dataset should look like this:

+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  1|Mario|  35|AT-at|
|  2|  Max|  46|DE-de|
|  3|  Sue|  22|EN-uk|
|  4|  Tom|null|AT-ch|
|  5|  Tom|null|AT-ch|
+---+-----+----+-----+

Deleting duplicates in Spark

The first thing we want to do is remove duplicates. There is an easy function for that in Apache Spark, called “dropDuplicates”. When you look at the previous dataset, it might be very easy to figure out that the last entry is a duplicate – but wait! For Apache Spark, it isn’t that easy, because the id is different – it is 4 vs 5. Spark doesn’t figure out on its own which columns are relevant for detecting duplicates. If we applied “dropDuplicates” to the dataframe as-is, it wouldn’t remove anything. So, we need to specify the columns it should take into account when removing duplicates. We tell Spark that we want to work with “name” and “age” for this purpose and pass a list of these columns to the function:

nodub = dirtyset.dropDuplicates(["name", "age"])
nodub.show()

When you now execute the code, it should result in the cleaned dataset:

+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  2|  Max|  46|DE-de|
|  4|  Tom|null|AT-ch|
|  3|  Sue|  22|EN-uk|
|  1|Mario|  35|AT-at|
+---+-----+----+-----+

This was very easy so far. Now, let’s take care of the wrong language values.

Replacing wrong values in columns in Apache Spark

The Spark “na” functions provide this for us. There is a function called “na.replace” that takes the old value and the new value to replace it with. Since we have two different values, we need to call the function twice:

reallangs = nodub.na.replace("AT-at", "DE-at").na.replace("AT-ch", "DE-ch")
reallangs.show()

As you can see, the values are replaced accordingly, and the dataset now looks like this:

+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  2|  Max|  46|DE-de|
|  4|  Tom|null|DE-ch|
|  3|  Sue|  22|EN-uk|
|  1|Mario|  35|DE-at|
+---+-----+----+-----+

Only one last thing is now necessary: replacing null values in the dataset.

Replacing null values in Apache Spark

Dealing with null values is always a complicated thing. A null value basically means that we don’t know something, and that might have a negative impact on our future predictions and analysis. However, there are some solutions:

  • Ignoring null values by removing the rows containing them
  • Adding a default value (e.g. in our case either 0 or a very high value)
  • Using statistics to calculate the most appropriate value

The first one would take away a lot of data. The more columns you have, the higher the chance of having null values! This would reduce the number of relevant samples and thus the accuracy of predictions. The second one would add other challenges in our case: it would either pull the average down towards a very low number or push it up towards a very high one. So only the last option remains for us: calculating a value for the dataset. If you have several features, such as name, it is somewhat easier to do so. In our sample, we use a very easy method: just calculating the average from the correct values. The following sample does exactly that; I will explain each step after the sample:

from pyspark.sql.functions import *
avage = reallangs.groupby().agg(avg(reallangs.age)).collect()
rage = int(avage[0][0])
clean = reallangs.na.fill(rage)
clean.show()

So, what has happened here?

  • We start by calculating the average of all ages we have. This is done with the agg() function
  • Next, we convert the average (which is a float) to an int. Since the result is a list of Row objects, we have to use two indices to get to our value
  • We then call “na.fill” with the previously calculated average

The output of that should look like the following:

+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  2|  Max| 46|DE-de|
|  4|  Tom| 34|DE-ch|
|  3|  Sue| 22|EN-uk|
|  1|Mario| 35|DE-at|
+---+-----+---+-----+

That’s all! Easy, isn’t it? Of course, we could add much more intelligence to it, but let’s keep it as is right now. More will be added in the tutorials on Data Science, which are about to come soon :). Dealing with wrong data in Spark is one of the key things you will do before going for Data Science.


In our previous tutorial, we looked at how to join data in Apache Spark. Another frequently needed operation when working with data is to reduce the number of results by limiting the data in Spark to a specific number of rows. This is done with the limit statement.

Limit Data in Spark with the limit() method

Basically, the limit statement is very easy to use, since it only takes the number of results to return as a parameter. The limit statement is usually applied together with an order statement. In the following sample, we use the limit statement on the df_ordered dataset, which we introduced in the tutorial on filtering and ordering data in Spark. After the sample, I will explain the steps.

from pyspark.sql.functions import sum  # use Spark's aggregate sum, not Python's built-in

sumed = df_ordered.groupby(df_ordered.personid) \
                  .agg(sum(df_ordered.price)) \
                  .toDF("pid", "ordervalue")
newPers = df_ordered.join(sumed, sumed.pid == df_ordered.personid, "inner") \
                    .drop("productname", "price", "pid").distinct() \
                    .orderBy("ordervalue", ascending=False) \
                    .limit(10)
newPers.show()

Basically, the above sample shows the top 10 customers from our dataset. The following steps are applied:

  1. Grouping the dataset by the person id
  2. Creating the sum of products bought by the customer
  3. And creating a new dataframe from it

We then join the dataset of order values back into the person data, since Spark doesn’t allow appending this data while keeping all the original values (like personname, age, …). In the next statement, we do the following:

  1. We join the newly created dataset into the original dataset
  2. Remove the unnecessary items such as productname, price and pid
  3. Order everything by ordervalue descending
  4. and limit the results to only have the top 10 customers.

Now, the result should look like the following:

+--------+----------+---+-----+------------------+
|personid|personname|age|state|        ordervalue|
+--------+----------+---+-----+------------------+
|     162|     Heidi| 37|   GA|24269.340000000226|
|      38|     Daisy| 45|   CA|23799.450000000204|
|     140|     Elsie| 64|   FL|  23759.5400000002|
|      18|      Ruby| 47|   GA|23414.710000000185|
|     180|   Caitlin| 65|   NY| 23124.71000000019|
|     159|    Taylor| 41|   NY|23054.670000000162|
|     131|     Aaron| 67|   TX| 23049.63000000016|
|      49|     Dylan| 47|   TX| 23029.68000000018|
|     136|    Isabel| 52|   CA| 22839.85000000014|
|      43|     Mason| 30|   CA|22834.710000000185|
+--------+----------+---+-----+------------------+

The limit statement itself is very easy; however, getting to the point where you can use it is a bit more complex ;). In the next tutorial, we will look at how to deal with corrupt data – get ready for some data cleaning!

There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can deepen your experience, and your learning journey can continue from there.