In our previous tutorial, we looked at how to join data in Apache Spark. Another common task when working with data is to restrict a result to a specific number of rows. In Spark, this is done with the limit() method.

Limit Data in Spark with the limit() method

The limit() method is easy to use: it takes a single parameter, the number of rows to return. It is usually combined with an ordering step, since otherwise there is no guarantee about which rows are returned. In the following sample, we use limit() on the df_ordered dataset, which we introduced in the tutorial on filtering and ordering data in Spark. After the sample, I will explain the individual steps.

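from pyspark.sql import functions as F

# Total order value per customer; columns renamed to pid and ordervalue
sumed = df_ordered.groupby(df_ordered.personid) \
                  .agg(F.sum(df_ordered.price)) \
                  .toDF("pid", "ordervalue")
# Join the totals back to the person data, keep one row per customer,
# sort by total descending, and keep the first 10 rows
newPers = df_ordered.join(sumed, sumed.pid == df_ordered.personid, "inner") \
                    .drop("productname", "price", "pid").distinct() \
                    .orderBy("ordervalue", ascending=False) \
                    .limit(10)
newPers.show()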

The above sample shows the top 10 customers in our dataset by total order value. The first statement does the following:

  1. Groups the dataset by the person id
  2. Sums the prices of the products bought by each customer
  3. Creates a new dataframe from the result, with the columns pid and ordervalue

We then join these totals back to the person data. The aggregated dataframe only contains the grouping column and the sum, so the join is needed to keep the original values (like personname, age, …) in the result. In the next statement, we do the following:

  1. Join the newly created dataset to the original dataset
  2. Remove the columns that are no longer needed (productname, price and pid) and drop the resulting duplicate rows with distinct()
  3. Order everything by ordervalue descending
  4. Limit the result to the top 10 customers

Now, the result should look like the following:

+--------+----------+---+-----+------------------+
|personid|personname|age|state|        ordervalue|
+--------+----------+---+-----+------------------+
|     162|     Heidi| 37|   GA|24269.340000000226|
|      38|     Daisy| 45|   CA|23799.450000000204|
|     140|     Elsie| 64|   FL|  23759.5400000002|
|      18|      Ruby| 47|   GA|23414.710000000185|
|     180|   Caitlin| 65|   NY| 23124.71000000019|
|     159|    Taylor| 41|   NY|23054.670000000162|
|     131|     Aaron| 67|   TX| 23049.63000000016|
|      49|     Dylan| 47|   TX| 23029.68000000018|
|     136|    Isabel| 52|   CA| 22839.85000000014|
|      43|     Mason| 30|   CA|22834.710000000185|
+--------+----------+---+-----+------------------+

The limit statement itself is very easy; getting the data into a shape where it is useful takes a bit more work ;). In the next tutorial, we will look at how to deal with corrupt data, so get ready for some data cleaning!

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. Your learning journey can still continue. For full details about Apache Spark, make sure to visit the official page.
