In the previous tutorials, we have acquired a lot of knowledge about Spark. Now we arrive at the last tutorial on Spark, where we will have a look at cube and rollup. Both are useful for aggregating multi-dimensional data for further processing.
Data for Spark Rollup and Cube functions
First, let’s create a dataset that we later want to work with. Our dataset is the monthly salary of people working in Finance or Sales:
employees = spark.createDataFrame([("Mario", 4400, "Sales"),
                                   ("Max", 3420, "Finance"),
                                   ("Sue", 5500, "Sales"),
                                   ("Tom", 6700, "Finance")],
                                  ("name", "salary", "department"))
We then use the first function – rollup. We want the rollup to be on the department and the name of the person.
employees.rollup(employees.department, employees.name) \
    .sum() \
    .withColumnRenamed("sum(salary)", "salary") \
    .orderBy("department", "salary") \
    .show()
Here you can see the output (I will discuss it after you have reviewed it):
+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+
The output now has several rows. Let's look at it row by row:
- The first row consists of two null values and the sum of all salaries, so it represents the entire company. Department and name are filled with null because the row stands for neither a single department nor a specific person – it covers all departments and everyone in them.
- The second and third rows are Max and Tom, who work in the Finance department
- The fourth row is the sum of the Finance department; the name is null because the row refers to the entire department, not a single person
- The same pattern continues in the following rows for the Sales department
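The grouping logic behind these rows can be sketched in plain Python (a toy illustration of what rollup computes, not Spark code): rollup(department, name) aggregates over the grouping sets (department, name), (department,) and (), i.e. it only walks down the hierarchy, with null standing in for the rolled-up column.

```python
# Toy illustration (plain Python, not Spark): rollup(department, name)
# aggregates over the grouping sets (dept, name), (dept,) and (),
# using None where Spark shows null.
rows = [("Mario", 4400, "Sales"), ("Max", 3420, "Finance"),
        ("Sue", 5500, "Sales"), ("Tom", 6700, "Finance")]

def rollup_sum(rows):
    totals = {}
    for name, salary, dept in rows:
        # each row contributes to: full detail, department subtotal, grand total
        for key in [(dept, name), (dept, None), (None, None)]:
            totals[key] = totals.get(key, 0) + salary
    return totals

result = rollup_sum(rows)
print(result[(None, None)])       # grand total: 20020
print(result[("Finance", None)])  # Finance subtotal: 10120
```

This reproduces exactly the seven rows of the output above: four individuals, two department subtotals and one grand total.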
So, basically, we get three kinds of results: (A) the sum of all salaries, (B) the individual values and (C) the sum per department. Now, let's build the cube:
employees.cube(employees.department, employees.name) \
    .sum() \
    .withColumnRenamed("sum(salary)", "salary") \
    .orderBy("department", "salary") \
    .show()
Here, the results cover even more combinations. First, we see the value of each person without their department, then the grand total, and then again the departments and the individuals in them. The cube isn't particularly useful for this calculation. The background is that a cube creates all possible combinations of the grouping columns, whereas a rollup only creates the hierarchy. Because the cube also groups by the name alone (with the department as null), the individuals appear several times. Here is the output:
+----------+-----+------+
|department| name|salary|
+----------+-----+------+
|      null|  Max|  3420|
|      null|Mario|  4400|
|      null|  Sue|  5500|
|      null|  Tom|  6700|
|      null| null| 20020|
|   Finance|  Max|  3420|
|   Finance|  Tom|  6700|
|   Finance| null| 10120|
|     Sales|Mario|  4400|
|     Sales|  Sue|  5500|
|     Sales| null|  9900|
+----------+-----+------+
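The difference to rollup can again be sketched in plain Python (a toy illustration, not Spark code): cube(department, name) aggregates over every subset of the grouping columns – (department, name), (department,), (name,) and () – which is where the extra per-name rows with a null department come from.

```python
# Toy illustration (plain Python, not Spark): cube(department, name)
# aggregates over all four subsets of the grouping columns,
# using None where Spark shows null.
rows = [("Mario", 4400, "Sales"), ("Max", 3420, "Finance"),
        ("Sue", 5500, "Sales"), ("Tom", 6700, "Finance")]

def cube_sum(rows):
    totals = {}
    for name, salary, dept in rows:
        # full detail, per-department, per-name, grand total
        for key in [(dept, name), (dept, None), (None, name), (None, None)]:
            totals[key] = totals.get(key, 0) + salary
    return totals

result = cube_sum(rows)
print(result[(None, "Max")])  # Max across all departments: 3420
```

The extra grouping set (None, name) is exactly what produces the eleven rows above instead of rollup's seven.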
I hope you liked the tutorials on Spark. There is much more to learn – for example about machine learning or the different libraries for it. Make sure to check out the tutorial section to find out more.
To keep learning about Spark, make sure to read the entire Apache Spark Tutorial. I regularly update it with new content. I also created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page is another great resource to deepen your knowledge. Your learning journey can still continue.