In our last tutorial section, we worked with filters and groups. Today, we will look at the aggregation functions in Spark and PySpark, which allow us to apply different aggregations to columns in Spark.

The agg() function in PySpark

Basically, one function takes care of all the aggregations in Spark. It is called “agg()” and takes another aggregation function as its argument. There are various possibilities; the most common ones are calculating sums, averages or max/min values. The “agg()” function is called on a grouped dataset and is executed on one column. The following samples show some of the possibilities:

from pyspark.sql.functions import *
df_ordered.groupby().agg(max(df_ordered.price)).collect()

In this sample, we imported all available functions from pyspark.sql.functions. We then called the “agg()” function on the df_ordered dataset that we created in the previous tutorial and used the “max()” function, which retrieves the highest value of the price column. The output should be the following:

[Row(max(price)=99.99)]
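
The samples in this tutorial call “groupby()” without a column, so the aggregation runs over the whole dataset. As a minimal sketch of the grouped case mentioned above, assuming df_ordered also had a hypothetical “product” column, the same “max()” would then be computed once per product:

# "product" is a hypothetical column used here only for illustration
df_ordered.groupby("product").agg(max(df_ordered.price)).collect()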

Now, we want to calculate the average value of the price. Similar to the above example, we use the “agg()” function, and instead of “max()” we call the “avg()” function.

df_ordered.groupby().agg(avg(df_ordered.price)).collect()

The output should be this:

[Row(avg(price)=42.11355000000023)]
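
By the way, the column does not have to be referenced through the DataFrame; passing the column name as a string works as well. A minimal sketch producing the same average:

# the column can also be addressed by its name as a string
df_ordered.groupby().agg(avg("price")).collect()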

Now, let’s get the sum of all order prices. This can be done with the “sum()” function and is again very similar to the previous samples:

df_ordered.groupby().agg(sum(df_ordered.price)).collect()

And the output should be this:

[Row(sum(price)=4211355.000000023)]
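
Several of these aggregations can also be combined into a single “agg()” call, so Spark computes them in one pass. A short sketch using the functions from above:

# max, avg and sum of the price column, computed together
df_ordered.groupby().agg(
    max(df_ordered.price),
    avg(df_ordered.price),
    sum(df_ordered.price)
).collect()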

To calculate the mean value, we can use “mean()”, which in PySpark is simply an alias for “avg()”:

df_ordered.groupby().agg(mean(df_ordered.price)).collect()

This should be the output:

[Row(avg(price)=42.11355000000023)]

Another useful function is “count()”, with which we can simply count all records of a column:

df_ordered.groupby().agg(count(df_ordered.price)).collect()

And the output should be this:

[Row(count(price)=100000)]
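
Since the generated column names like “count(price)” can be unwieldy, every aggregation expression can be renamed with “alias()”. A minimal sketch of the same count with a friendlier name:

# alias() renames the result column of the aggregation
df_ordered.groupby().agg(count(df_ordered.price).alias("order_count")).collect()
# expected output: [Row(order_count=100000)]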

Today’s tutorial was rather easy: it dealt with the aggregation functions in Spark and how you can apply them with the agg() method. In the next sample, we will look at different ways to join data together.

There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can also deepen your knowledge. Your learning journey can still continue.
