Over the past tutorials, we have built up quite some knowledge about Python and Spark. We know most of the relevant statements in Apache Spark and are now ready for some data cleaning, which is one of the key tasks of a data engineer. Before we get started, we need something that a data engineer (unfortunately) often works with: dirty data. In this tutorial, we will focus on how to deal with missing, corrupt and wrong data in Spark.
Dealing with wrong data in Spark
In the following sample, we create some data. Note that there are some errors in it:
- The first record (id: 1) has the wrong language: there is no “AT-at”; it should be “DE-at”
- The fourth record (id: 4) has a null value (“None”) for the age and also the wrong language: “AT-ch”, whereas it should be “DE-ch”
- The fifth record (id: 5) is a duplicate of the previous one
dirtyset = spark.createDataFrame([(1, "Mario", 35, "AT-at")
                                 , (2, "Max", 46, "DE-de")
                                 , (3, "Sue", 22, "EN-uk")
                                 , (4, "Tom", None, "AT-ch")
                                 , (5, "Tom", None, "AT-ch")]
                                 , ("nid", "name", "age", "lang"))
dirtyset.show()
The dataset should look like this:
+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  1|Mario|  35|AT-at|
|  2|  Max|  46|DE-de|
|  3|  Sue|  22|EN-uk|
|  4|  Tom|null|AT-ch|
|  5|  Tom|null|AT-ch|
+---+-----+----+-----+
Deleting duplicates in Spark
The first thing we want to do is remove duplicates. There is an easy function for that in Apache Spark, called “dropDuplicates”. When you look at the previous dataset, it might seem very easy to figure out that the last row is a duplicate. But wait! For Apache Spark, it isn’t that easy, because the id differs: it is 4 vs. 5. Spark cannot figure out on its own which columns are relevant for detecting duplicates. If we applied “dropDuplicates” to the dataframe without any arguments, it wouldn’t remove anything, because it compares entire rows (we will verify this in a moment). So, we need to specify the columns it should take into account when removing duplicates. We tell Spark that we want to work with “name” and “age” for this purpose and pass a list of these columns to the function:
nodub = dirtyset.dropDuplicates(["name", "age"])
nodub.show()
When you now execute the code, it should result in the cleaned dataset:
+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  2|  Max|  46|DE-de|
|  4|  Tom|null|AT-ch|
|  3|  Sue|  22|EN-uk|
|  1|Mario|  35|AT-at|
+---+-----+----+-----+
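To verify the claim from above: “dropDuplicates” without arguments compares entire rows, and since the nid values 4 and 5 differ, nothing would be removed. A quick check (just a sketch, using the dirtyset dataframe created earlier):

# without arguments, all columns are compared; the rows differ in "nid", so all 5 remain
dirtyset.dropDuplicates().count()                 # 5
# with the column subset, the duplicate is detected
dirtyset.dropDuplicates(["name", "age"]).count()  # 4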
This was very easy so far. Now, let’s take care of the wrong language values.
Replacing wrong values in columns in Apache Spark
The Spark “na” functions provide this for us. There is a function called “na.replace” that takes the old value and the new value to replace it with. Since we have two different values to correct, we need to call the function twice:
reallangs = nodub.na.replace("AT-at", "DE-at").na.replace("AT-ch", "DE-ch")
reallangs.show()
As you can see, the values are replaced accordingly, and the output should now look like this:
+---+-----+----+-----+
|nid| name| age| lang|
+---+-----+----+-----+
|  2|  Max|  46|DE-de|
|  4|  Tom|null|DE-ch|
|  3|  Sue|  22|EN-uk|
|  1|Mario|  35|DE-at|
+---+-----+----+-----+
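As a side note: instead of chaining two “na.replace” calls, newer versions of PySpark also accept a dictionary that maps old values to new ones, optionally restricted to certain columns via the subset parameter. A minimal sketch (assuming a recent Spark version):

# replace both wrong language codes in a single call, only in the "lang" column
reallangs = nodub.na.replace({"AT-at": "DE-at", "AT-ch": "DE-ch"}, subset=["lang"])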
Only one last thing is now necessary: replacing null values in the dataset.
Replacing null values in Apache Spark
Dealing with null values is always a complicated thing. A null value basically means that we don’t know something, and that can have a negative impact on our future predictions and analysis. However, there are some possible solutions:
- Ignoring null values by removing the rows containing them
- Adding a standard value (e.g. in our case either 0 or a very high value); the first two options are sketched in the code right after this list
- Using statistics to calculate the most appropriate value
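Here is the quick sketch of the first two options, using the “reallangs” dataframe from above (the variable names are only for illustration):

# Option 1: drop every row that still contains a null value
no_nulls = reallangs.na.drop()

# Option 2: fill null values in numeric columns with a fixed standard value, e.g. 0
zero_filled = reallangs.na.fill(0)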
The first option would take away a lot of data: the more columns you have, the higher the chance that some of them contain null values, so this would reduce the number of usable samples and thus the accuracy of predictions. The second option would add another challenge in our case: it would either push the average age up to a very high number or down to a very low one. So, only the last option remains: calculating a value from the dataset. If you have several features, such as the name, it is somewhat easier to calculate an appropriate value. In our sample, we use a very simple method: we just calculate the average of the existing, correct values. The following sample does exactly that; I will explain each step after the sample:
from pyspark.sql.functions import avg

avage = reallangs.groupby().agg(avg(reallangs.age)).collect()
rage = int(avage[0][0])
clean = reallangs.na.fill(rage)
clean.show()
So, what has happened here?
- We start by calculating the average of all the ages we have. This is done with the agg() function.
- Next, we convert the average (which is a float) to an int. Since collect() returns a list of Row objects, we have to use two indices to get to our value.
- We then call “na.fill” with the previously calculated average.
The output of that should look like the following:
+---+-----+---+-----+
|nid| name|age| lang|
+---+-----+---+-----+
|  2|  Max| 46|DE-de|
|  4|  Tom| 34|DE-ch|
|  3|  Sue| 22|EN-uk|
|  1|Mario| 35|DE-at|
+---+-----+---+-----+
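One detail worth knowing: “na.fill” with a number fills the nulls in every numeric column. In our small sample only “age” contains nulls, but you can be explicit by restricting the fill to that column via the subset parameter; the average can also be fetched with first() instead of collect(). A slightly more compact variant (just a sketch, producing the same result as above):

# compute the mean age and fill only the "age" column
avg_age = int(reallangs.agg(avg("age")).first()[0])
clean = reallangs.na.fill(avg_age, subset=["age"])
clean.show()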
That’s all! Easy, isn’t it? Of course, we could add much more intelligence to it, but let’s keep it as it is right now. More will be added in the tutorials on Data Science, which are coming soon :). Dealing with wrong data in Spark is one of the key things you will do before moving on to Data Science.
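If you want a slightly more “intelligent” and reusable approach, Spark’s ML library ships an Imputer that fills missing values with the mean or median of a column. The following is only a sketch (it assumes Spark 2.2 or later; the output column name “age_filled” is made up for illustration, and the age column is cast to double because the Imputer expects floating point input):

from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

# the Imputer works on floating point columns, so cast "age" first
ages = reallangs.withColumn("age", col("age").cast("double"))

# fill nulls in "age" with the column mean; strategy could also be "median"
imputer = Imputer(strategy="mean", inputCols=["age"], outputCols=["age_filled"])
imputed = imputer.fit(ages).transform(ages)
imputed.show()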
There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page is also worth a look to deepen your knowledge. Your learning journey can still continue.