In our last tutorial section, we had a brief introduction to Spark DataFrames. Now, let’s have a look at filtering, grouping and sorting data with Apache Spark DataFrames. First, we will start with the orderBy statement, which allows us to order data in Spark. Next, we will continue with the filter statement, which allows us to filter data, and then we will focus on the groupBy statement, which will eventually allow us to group data in Spark.

Order Data in Spark

First, let’s look at how to order data. Every DataFrame provides a convenient method called “orderBy”. This method takes several parameters, but the most important ones for us are:

  • Columns: one or more columns to order the dataset by
  • Order: ascending (ascending=True, the default) or descending (ascending=False)

If we again load our dataset from the previous sample, we can now easily apply the orderBy() function. In our sample, we order everything by personid:

df_new = spark.read.parquet(fileName)
df_ordered = df_new.orderBy(df_new.personid, ascending=True)
df_ordered.show()

The output should be:

+--------+----------+---+---------------+-----+-----+
|personid|personname|age|    productname|price|state|
+--------+----------+---+---------------+-----+-----+
|       0|    Amelia| 33|Bicycle gearing|19.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33|   Bicycle bell| 9.99|   NY|
|       0|    Amelia| 33|         Fender|29.99|   NY|
|       0|    Amelia| 33|         Fender|29.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33| Bicycle saddle|59.99|   NY|
|       0|    Amelia| 33| Bicycle saddle|59.99|   NY|
|       0|    Amelia| 33|  Bicycle brake|49.99|   NY|
|       0|    Amelia| 33|   Bicycle bell| 9.99|   NY|
|       0|    Amelia| 33|Bicycle gearing|19.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|  Bicycle Frame|99.99|   NY|
|       0|    Amelia| 33|Luggage carrier|34.99|   NY|
|       0|    Amelia| 33|Luggage carrier|34.99|   NY|
|       0|    Amelia| 33|      Kickstand|14.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
|       0|    Amelia| 33|  Bicycle Frame|99.99|   NY|
|       0|    Amelia| 33|   Bicycle fork|79.99|   NY|
+--------+----------+---+---------------+-----+-----+
only showing top 20 rows

Filter Data in Spark

Next, we want to filter our data. We take the ordered data again and keep only customers whose age is strictly between 33 and 35 years (i.e., 34). Just like the previous “orderBy” method, there is a convenient method for this: filter. It takes one filter condition per call, so in order to express a “between”, we chain two filter statements together. As in the previous sample, we reference the column to filter on:

df_filtered = df_ordered.filter(df_ordered.age < 35).filter(df_ordered.age > 33)
df_filtered.show()

The output should look like this:

+--------+----------+---+---------------+-----+-----+
|personid|personname|age|    productname|price|state|
+--------+----------+---+---------------+-----+-----+
|      47|      Jake| 34|  Bicycle chain|39.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|Bicycle gearing|19.99|   TX|
|      47|      Jake| 34|   Bicycle fork|79.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|      Saddlebag|24.99|   TX|
|      47|      Jake| 34|Luggage carrier|34.99|   TX|
|      47|      Jake| 34| Bicycle saddle|59.99|   TX|
|      47|      Jake| 34|  Bicycle Frame|99.99|   TX|
|      47|      Jake| 34|         Fender|29.99|   TX|
|      47|      Jake| 34|      Kickstand|14.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|   Bicycle bell| 9.99|   TX|
|      47|      Jake| 34|  Bicycle brake|49.99|   TX|
|      47|      Jake| 34|      Saddlebag|24.99|   TX|
+--------+----------+---+---------------+-----+-----+
only showing top 20 rows

Group Data in Spark

In order to run more complex queries, we can also group data by one or more columns. This is done with the “groupBy” statement. Basically, this statement also takes just one argument – the column(s) to group by. The following sample is a bit more complex, but I will explain it after the sample:

from pyspark.sql.functions import bround
df_grouped = df_ordered \
    .groupBy(df_ordered.personid) \
    .sum("price") \
    .orderBy("sum(price)") \
    .select("personid", bround("sum(price)", 2)) \
    .withColumnRenamed("bround(sum(price), 2)", "value")
df_grouped.show()

So, what has happened here? We had several steps:

  • Take the previously ordered dataset and group it by personid
  • Create the sum of each person’s prices
  • Order everything ascending by the sum column. NOTE: the column is named “sum(price)” since it is a newly created column
  • Round the “sum(price)” column to two decimal places so that it looks nicer. Note that the column name changes again, this time to “bround(sum(price), 2)”
  • Since the column now has a name that is hard to read, we call the “withColumnRenamed” function to give it a much nicer name. We call our column “value”

The output should look like this:

+--------+--------+
|personid|   value|
+--------+--------+
|      37|18555.38|
|      69|18825.24|
|       6|18850.19|
|     196|19050.34|
|      96|19060.34|
|     144|19165.37|
|     108|19235.21|
|      61|19275.52|
|      63|19330.23|
|     107|19390.22|
|      46|19445.41|
|      79|19475.16|
|     181|19480.35|
|      17|19575.33|
|     123|19585.29|
|      70|19655.19|
|     134|19675.31|
|     137|19715.16|
|      25|19720.07|
|      45|19720.14|
+--------+--------+
only showing top 20 rows

You can also do each of the statements step-by-step in case you want to see each transformation and its impact.

In our next tutorial, we will have a look at aggregation functions in Apache Spark.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.
