Apache Spark Tutorial: Data Transformation on RDD

In the last post, we learned the basics of Spark RDDs. Now, we will have a look at data transformations in Apache Spark. One transformation, filter, was already used in the last tutorial, so we will look at it first. But before we start, we need to cover one major building block: lambda expressions. Each of the transformation functions in Spark takes a function, usually written as a lambda expression. In our last tutorial, we filtered data based on a lambda expression. Let’s recall the code to have a deeper look at it:

sp_pos = spark_data.filter(lambda x: x>0.0).collect()

What is crucial for us is the expression inside the parentheses „()“ after the filter command. In there, we see the following: „lambda x: x>0.0“. Basically, with „lambda x“ we state that the following evaluation should be applied to every item in the dataset. For each item, we use „x“ as the variable. So it reads like: „for every item x in our dataset, keep it if it is greater than 0“. If x were a complex dataset, we could also use its fields and methods. But in our case, it is a simple number.
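As a quick illustration of that last point, here is a small sketch (not part of the original example) that filters an RDD of (name, age) tuples by one of their fields. It assumes the SparkContext „sc“ from the previous tutorial is available; the data is made up:

people = sc.parallelize([("Mark", 31), ("Lisa", 29), ("Anastasia", 17)])
# x is now a tuple, so the lambda can address its fields by index
adults = people.filter(lambda x: x[1] >= 18).collect()
adults
# [('Mark', 31), ('Lisa', 29)]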

One more thing that is important for transformations: transformations in Spark are always „lazy“, meaning their execution is deferred. So calling the filter function does nothing until you call an action on the result. The action we used above is „.collect()“. In our case, collect() is what actually triggers the filter.
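To make the laziness visible, a minimal sketch: the filter alone only returns a new RDD object, and nothing is computed until collect() is called.

lazy_rdd = spark_data.filter(lambda x: x > 0.0)
print(type(lazy_rdd))    # still an RDD object - nothing has been computed yet
result = lazy_rdd.collect()
print(type(result))      # a plain Python list - collect() triggered the filter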

Filtering

Filtering data is one of the most frequently used transformations. The filter criterion is passed as a lambda expression. You can also chain different filter criteria easily, since the evaluation is lazy anyway.

We extend the above sample by additionally keeping only numbers that are smaller than 3.0. The code for that looks like this:

sp_pos = spark_data.filter(lambda x: x>0.0).filter(lambda y: y<3.0).collect()
sp_pos

Sorting

Another important transformation is sorting data. At some point, you want to arrange the data in either ascending or descending order. This is done with the „sortBy“ function. The „sortBy“ function takes a lambda expression that selects the field to sort by. In our above example, this isn’t very relevant, since each item in the RDD is just a single number. Let’s have a look at how to use it with our dataset:

sp_sorted = spark_data.sortBy(lambda x: x).collect()
sp_sorted

Now, if we want to sort in the opposite order, we can set the optional „ascending“ parameter to False. Let’s have a look at this:

sp_sorted = spark_data.sortBy(lambda x: x, False).collect()
sp_sorted
The sorted dataset
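To show what the key function of „sortBy“ is actually for, here is a small sketch with hypothetical key-value pairs (not part of the dataset above), sorted by their second field in descending order:

pairs = sc.parallelize([("Mark", 1984), ("Lisa", 1985), ("Anastasia", 2017)])
by_year = pairs.sortBy(lambda x: x[1], ascending=False).collect()
by_year
# [('Anastasia', 2017), ('Lisa', 1985), ('Mark', 1984)]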

Joining data

Often, it is necessary to join two different datasets together. This is done with the „join“ function. To use the join, we first need to create two new datasets (and we will need them further on as well):

ds_one = sc.parallelize([("Mark", 1984), ("Lisa", 1985)])
ds_two = sc.parallelize([("Mark", 2015), ("Anastasia", 2017)])
sorted(ds_one.join(ds_two).collect())

The „join“ used above is the inner join: only keys that exist in both datasets survive, so the result is [('Mark', (1984, 2015))]. Spark RDDs also offer left, right and full outer joins.
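Here is a small sketch of the outer variants with the same two datasets; missing matches are filled with None:

sorted(ds_one.leftOuterJoin(ds_two).collect())
# [('Lisa', (1985, None)), ('Mark', (1984, 2015))]

sorted(ds_one.fullOuterJoin(ds_two).collect())
# [('Anastasia', (None, 2017)), ('Lisa', (1985, None)), ('Mark', (1984, 2015))]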

Today, we looked at some basic data transformations in Spark. Over the next couple of tutorial posts, I will walk you through more of them.

If you enjoyed this tutorial, make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

Python for Spark Tutorial – Method and Function in Python

In the last tutorial, we learned about the different control structures in Python. Now that we know the control structures and the basics of Python programming, let’s have a look at how to encapsulate code into functions. Therefore, we will have a look at methods and functions in Python.

Defining a Function in Python

Due to the overall differences from C-like languages, the function definition is also slightly different. However, it is very similar to the layout of the control structures. Let’s have a look:

def functionname(args):
    FUNCTION BLOCK
    return value

A function in Python is always defined with „def“. After that, a function name is provided. Values passed to the function are then listed in parentheses. Due to the dynamic nature of Python, the parameters don’t carry any dedicated type declarations; values are simply passed by their parameter name to the function. After the „:“, the function block starts. Everything within the function block needs to be indented. A function can also return values by adding the „return“ statement. The following function adds one to a number:

def myiter(n):
    return n + 1
val = myiter(33)
val

Output:

34

As you would expect, it is also possible to add further levels by indenting control structures and alike. The following sample creates the numbers 0-4 (range(5)) and calls a function that iterates over each item and prints it.

def printall(vals):
    for val in vals:
        print(val)
values = range(5)
printall(values)

Output:

0
1
2
3
4

Lambda Expressions

A very cool feature of most modern programming languages is the availability of lambda expressions. With them, it is possible to significantly reduce code complexity by writing simple functions as one-liners. For instance, the first sample, the function that adds one, could also be written as a lambda expression. Basically, a lambda expression is an anonymous function defined in-line, typically applied to each item in a list or an array. It is very useful for data manipulation. A lambda expression is introduced with the following statement:

lambda variable: STATEMENT

In this case, variable stands for one or more variables to work with in the following statement. In the previous sample it would be a number, so we would only use one variable. If the input were a dictionary, it could also be just one variable (with the keys and values available through its methods), or you could provide several variables, for instance as „x, y“ (a short sketch of that follows after the next example). A lambda expression matching the previous add-one function looks like this:

v = map(lambda x: x + 1, values)
printall(v)

Output:

1
2
3
4
5

In the above statement, we used Python’s built-in „map“ function, which calls a function – here our lambda expression – on every item of the specified iterable (our values from above). We re-used the existing values defined in the earlier sample, so each item is increased by one. As you can see, it is very easy to work with lambda expressions in Python, and they are very useful to keep your code simple and clean.
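Since we mentioned lambdas with more than one variable earlier, here is a minimal sketch (with made-up lists) of „map“ being fed two iterables, so the lambda receives one value from each:

a = [1, 2, 3]
b = [10, 20, 30]
sums = list(map(lambda x, y: x + y, a, b))
sums
# [11, 22, 33]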

In the next tutorial, we will have a look at how to encapsulate methods and functions into classes and packages. We will also have a look at inheritance.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

Python for Spark Tutorial – Control structures: the Python If Statement

In the last tutorial, we learned about the basics of Python. Now, we focus on something every developer needs – control structures. Basically, in this post, I will explain the Python if statement and two loops.

The Python If Statement

This is something every programmer learns at the very beginning. The good news is: Python can do it too :). Basically, the syntax is very easy:

if expression:
    IF-BLOCK
elif expression:
    ELSE-IF-BLOCK
else:
    ELSE-BLOCK

An if-statement starts with „if“ and is then immediately followed by the expression. Please note that there are no parentheses around the expression like in C-languages. After the expression comes a „:“. The if-block is then written with an indent; everything that should be executed in the if-block stays at that indentation level. After the if-block is finished, there is either an elif (else-if) or else block – or the end of the entire statement. The following example shows this:

ds = 12
if ds > 10:
    print("TRUE")
else:
    print("FALSE")
if ds > 15:
    print("TRUE")
else:
    print("FALSE")

Output:

TRUE
FALSE

The if-statement also knows an else-if. Basically, you can check for different conditions within one statement. The following shows the else-if (elif) block:

if ds < 10:
    print("TRUE")
elif ds > 11:
    print("FALSE")

Output:

FALSE

While-Loop

A very important loop is the while-loop. The while-loop executes code as long as a condition is true. What is special in Python is the existence of an „else“ block for the while-loop. Basically, the else-block is executed once the condition of the while-loop becomes false. You can use this for cleanup or alike. The syntax of the while-loop is as follows:

while(expression):
    WHILE-BLOCK
else:
    ELSE-BLOCK

In Python, you can also use „continue“ and „break“ in your loops. Both have different effects: continue skips the rest of the current iteration, whereas break terminates the entire loop. You might need break for error handling in a loop, for example. A simple loop counting down from 12 looks like the following:

ds = 12
while(ds > 0):
    ds -= 1
    if ds == 0: continue
    print(ds)
else:
    print("we're done here")

Output:

11
10
9
8
7
6
5
4
3
2
1
we're done here

In the above loop, the else-block was used, and we added a check that skips the print once the value reaches 0. Basically, we count down from 12 (but start printing at 11, since we already decrease the value at the beginning of each iteration).

In the following sample, we exchange the „continue“ with „break“. Check what happens: since break terminates the loop, the else-block is not executed this time:

ds = 12
while(ds > 0):
    ds -= 1
    if ds == 0: break
    print(ds)
else:
    print("we're done here")

Output:

11
10
9
8
7
6
5
4
3
2
1

For-Loop

The for-loop is the other important loop in Python, and the one you will use most when working with Apache Spark. It is mainly used to iterate over datasets. With the for-loop, we have to let go of how for-loops look in C-like languages: there is no numeric iterator variable any more. We only specify a name for the item of the current iteration and the collection/list to iterate over. The syntax is very easy:

for iterator in iterable:
    FOR-BLOCK

Normally, you would iterate over an array, list, dictionary or alike. In our sample, we will use the „persons“ dictionary we created in a previous sample. Please note one thing: we used different types in it, so not all values are strings. If you want to print and concatenate them, you first need to convert each non-string value. That’s why we use „str()“ for conversion:

for person in persons:
    print(str(person) + " is " + str(persons[person]))

Output:

mario is 35
vienna is austria
3 is age
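If you prefer to have the key and the value as separate variables right away, the dictionary’s items() method yields (key, value) pairs. The following small sketch produces the same output:

for key, value in persons.items():
    print(str(key) + " is " + str(value))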

In both variants, the output is very clear. Now you might be missing the classic „counter“ for-loop. The good thing is that you can still get something similar by using the „range“ function. It isn’t exactly what you might be used to, but it might get you into Python faster ;). With range, the sample looks like this:

for i in range(5):
    print(i)

Output:

0
1
2
3
4
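If you really need a classic counter together with the items of a list, one simple sketch (with a made-up list) is to combine range() and len():

names = ["Mario", "Lisa", "Anastasia"]
for i in range(len(names)):
    print(str(i) + ": " + names[i])
# 0: Mario
# 1: Lisa
# 2: Anastasia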

Easy, isn’t it? Now, we are ready to have a look at functions in our following tutorial.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

How to do Data Science wrong – the booking.com way

I use booking.com a lot for my bookings, but one thing that constantly bugs me are the e-mails after a booking – stating „The prices for [CITY YOU JUST BOOKED] just dropped again!“. Really, booking.com?!? It has happened to me several times already that I booked a hotel and some hours later received a message that the prices in this city had just dropped. This is a perfect example of how to do data science wrong.

How to do data science wrong

So, I am wondering whether this happens on purpose or rather by accident. If it happens on purpose, I would like to question the purpose: I just booked a hotel and was sure that I got a good deal – but, sorry, you spent too much? 😉 No, I don't think so. I believe it is rather the opposite: an accident.

I expect that booking.com has issues with either data silos or with the speed of their data. Most likely there is no connection between the ordering system and the campaigning system, and thus the data doesn't flow between those two systems. The messages only arrive some time after the booking, so I think that the booking.com systems aren't built to handle this in real time.

You order something on booking.com – the system is probably optimised for bringing this order process through and for sending and receiving information from their (hotel) partners – but the data isn't updated in the CRM or marketing systems that create the ads. My guess is that when you book a hotel, booking.com also tracks that you looked at a specific city. This is then added to their user database and the marketing automation tool is updated.

However, the order process seems to be totally de-coupled from this, and the marketing system doesn't receive the booking data fast enough – and most likely, their marketing automation is set to „aggressive“ mode once you have looked up a city, so it sends recommendations often. This then leads to some discrepancy (or, eventually, consistency) between their systems.

For me, this is also a great example of eventual consistency in database design. At some point, booking.com's systems will all be up to date, and they stop re-targeting you. However, the „eventual“ in their consistency arrives very, very late 🙂

Let me know what experiences you had.

This post is part of the „Big Data for Business“ tutorial. In this tutorial, I explain various aspects of handling data right within a company. All credits in here go to the fabulous booking.com!

Python Tutorial – Getting started with Python

A common discussion among Data Engineers is which language to use. One of the main platforms for data processing nowadays is Apache Spark, and a lot of Data Engineers and Data Scientists use it for their workloads. Spark is written in Scala, which itself derives from Java. When working with Spark, however, most people nowadays use another language: Python. Follow me over the next posts during this Python tutorial.

Part I of the Python tutorial: getting started

Python is not only used among Data Engineers. In fact, it is much more popular with Data Scientists – even though there is a „religious“ fight between R and Python going on. However, Python is a very popular language now and a lot of people use it – also for Big Data and Data Science workloads.

I’ve created a tutorial on Apache Spark recently and decided to do it with Python rather than with Scala. My background is in the object oriented, strongly typed world such as Java or C#, and I will write this tutorial now with the eyes of the many people coming from the exact same background – by highlighting the differences and where you might have to watch out.

When it comes to Python, it is perfect for the world of the web – it offers a lot of libraries for HTTP interactions. Therefore, its popularity among web developers is very high. Python also offers a lot of libraries for machine learning, so the number of Data Scientists using Python is high as well. Both of these, I would argue, are partly a result of Google: Google was pushing Python a lot and also contributed a lot to it. Google AppEngine, one of the first Platform as a Service solutions, was mainly usable via Python. Now, Google offers many more services with Python – TensorFlow is just one of them. But how did this happen? Basically, the creator of Python – Guido van Rossum – worked for a long time at Google. Another important thing: Python has nothing to do with the snake – the name comes from „Monty Python“. However, I will use the snake from time to time as a header image 😉

But now, let's come to the main aspects of Python and what it means for developers and Data Scientists. Basically, there are several differences when working with Python if you come from Java or similar languages. I want to start with some of them; the following list isn't complete, and you will see many more differences over the following weeks throughout this tutorial. For comparison, I keep on writing „C-like languages“. With this, I basically mean C, C++, C# and Java.

  • Dynamic. Python is a dynamically typed language. This means that you don't have to declare a type – such as Integer or String – when you create a new variable. However, Python is also strongly typed: once a value is assigned, its type matters and incompatible types can't simply be mixed.
  • Indentation. Unlike Java or any other C-like language, Python doesn't use {} to structure your code. All of that is done with indentation (tabs or spaces). In C-like languages, you structure your code blocks with both – {} and indentation. However, C-like languages only use the indentation for better readability – Python uses it as part of the language itself!
  • No switch. Yes, you read that right – there is no switch statement in Python. You have to do all of that via if/elif statements (see the sketch below).
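As a small sketch of the last point, a switch from a C-like language simply becomes an if/elif/else chain; the function and values below are made up for illustration:

def weekday_name(day):
    if day == 1:
        return "Monday"
    elif day == 2:
        return "Tuesday"
    elif day == 3:
        return "Wednesday"
    else:
        return "some other day"

weekday_name(2)
# 'Tuesday'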

Variables in Python

Now, let's look at the most important thing – assigning variables in Python. Since Python is dynamic, you don't need to declare any types and can just assign a value by stating the name of the variable. Also, you don't need any semicolons – the line end is the end of the statement. The syntax for this is the following:

variable_name = VALUE

Now, we need to start coding in Python. I’ve decided to use Jupyter, since it is also a common tool to work with Spark. If you haven’t installed Jupyter yet, you can read how to do it in this tutorial. Below are several variable assignments for different types. If you want to print the content of a variable, simply put the variable name into a new line.

data = 123
data

And the corresponding output is:

123

Let’s now use a String:

name = "Mario"
name

Again, the output is as expected:

'Mario'

Dynamic variables in Python

I stated that Python is dynamic, but it is also strongly typed once a variable gets assigned. You can see this behaviour with the following code:

res = data + name

Umpf. Yes, an error occurred:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Python doesn't allow that much dynamic-ness. So, the good old C-like world is still fine, isn't it? Let's try to mix an integer with a float:

daf = 1.23
daf + data

As you can see, the output is really fine:

124.23

Complex Data Types in Python

When talking about types, Python knows the common types such as Int, Float, String, Boolean. There is no differentiation between „Long“ and „Int“, a non-floating point number is always an int; the same applies to Float – there is no double. However, Python has some additional basic types. I won’t describe all here, only the most relevant ones:

  • Complex: a number with a real and an imaginary part, written for instance as 3+4j (not to be confused with scientific notation such as 9.12e21, which is still a float).
  • Dict: a dictionary, where each key is stored together with a value. A dict is initialised with {}.
  • Tuple: an immutable sequence of values; you could basically call it an array, but an array in C-like languages can only contain one type, whereas a tuple can mix types. A tuple is initialised with ().
  • Set: a set is like a list, but it can't contain any duplicates. A set is initialised with {} (note that an empty {} creates a dict, so an empty set is created with set()).
  • List: a list can contain duplicates and mixed types. Initialise a list with [].

Now, let’s have a look at how these types are used. We first start with Dict(ionary)

persons = {"mario": 35, "vienna": "austria", 3: "age"}
persons
{'mario': 35, 'vienna': 'austria', 3: 'age'}

Was easy, right? Let’s continue with the Tuple:

pers = ("Mario", 35, 9.99)
pers
('Mario', 35, 9.99)

Also, here we have the expected output. Next up is the set:


pset = {"mario", 35, True}
pset
{35, True, 'mario'}

Last but not least is the List:

plist = ["Mario", 35, True]
plist
['Mario', 35, True]

Compound Types in Python

We can also combine different types into complex types. Let’s assume we want to add more values to a key-value pair. To achieve that, we create the main entity and add a list, where we have different key-value pairs inside. The following sample shows the products sold in a shop. The first level is the product, then in a list are the age of the product and the price.

shop = {"Milk": [{"age": 35}, {"price": 9.99}], "Bred": [{"age": 22}, {"price": 2.99}]}
shop
{'Milk': [{'age': 35}, {'price': 9.99}],
 'Bred': [{'age': 22}, {'price': 2.99}]}
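To read a single value back out of such a nested structure, you simply chain the accessors; for instance, the price of "Milk" sits in the second list entry:

shop["Milk"][1]["price"]
# 9.99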

Now, we have learned about data types. In the next Python tutorial, I will write about control structures.

If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

Apache Spark Tutorial: RDDs, Lambda Expressions and Loading Data

In the last blog post, we looked at how to access the Jupyter Notebook and work with Spark. Now, we will get some hands-on experience with Spark and do some first samples. One of our main goals is to write a lambda expression in Spark with Python.

Intro to the Spark RDD

Therefore, we start with RDDs. RDDs are the basic API in Spark. Even though it is much more convenient to use Dataframes, we start with the RDDs in order to get a basic understanding of Spark.

RDD stands for „Resilient Distributed Dataset“. Basically, it is a dataset that is distributed over a cluster so that it can be processed in parallel. In RDDs, we can store any kind of data and apply functions to it (such as the sum or the average). Let’s create a new Python notebook and import the required libraries.

Create a new Notebook in Python 3
The new Notebook with pyspark

Once the new notebook is created, let’s start to work with Spark.

First, we need to import the necessary libraries. Therefore, we use „pyspark“ and „random“. Also, we need to create the Spark context that is used throughout our application. Note that you can only execute this line once, because creating a second SparkContext raises an error!

Create the Spark Context in Python

import pyspark
import random
sc = pyspark.SparkContext(appName="Cloudvane_S01")

When done with this, hit the „Run“ Button in the Notebook. Next to the current cell, you will now see the [ ] turning into [*]. This means that the process is currently running and something is happening. Once it has finished, it will turn into [1] or any other incremental number. Also, if errors occur, you will see them below your code. If this part of the code succeeded, you will see no other output than the [1].

Create random data in Spark

Next, let's produce some data that we can work with. Therefore, we create a list and fill it with some data. In our case, we add 100 items. Each item is a random value calculated with random.expovariate, using a random parameter that can be positive or negative – which is why some of the generated values will be negative.

someValues = []
for i in range(0,100):
    someValues.append(random.expovariate(random.random()-0.5))
someValues

When the data is created, we distribute it over the network by calling „sc.parallelize“. This creates an RDD now and enables us to work with Spark.

spark_data = sc.parallelize(someValues)
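As a side note, „parallelize“ optionally takes the number of partitions as a second argument; the 4 below is an arbitrary choice for illustration:

partitioned = sc.parallelize(someValues, 4)
partitioned.getNumPartitions()
# 4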

Different Functions on RDD in Spark

We can apply various functions to the RDD. One sample would be to use the „Sum“ function.

sp_sum = spark_data.sum()
sp_sum

Another sample is the „Count“ function.

sp_ct = spark_data.count()
sp_ct

We can also do more complex calculations by defining methods that do some calculations. In Python, this is done by „def functionname(params)“. The following sample creates the average of the array that is passed onto the function.

def average(vals):
    return vals.sum() / vals.count()

The function is simply invoked with our data.

average(spark_data)
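For comparison, the RDD API also ships a built-in „mean“ action, which should return the same value as our own function:

spark_data.mean()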

A lambda function in Spark and Python

Last but not least, we can also filter data. In the following sample, we only include positive values. We do this with a simple Lambda function. I’ve explained Lambda functions in detail in the Python tutorial, in case you want to learn more.

sp_pos = spark_data.filter(lambda x: x>0.0).collect()
sp_pos

Now you are ready to go and write your own lambda expression with Spark in Python. There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

Apache Spark Tutorial: Setting up Apache Spark in Docker

In our last tutorial, we had a brief introduction to Apache Spark. Now, in this tutorial we will have a look at how to set up an environment to work with Apache Spark. To make things easy, we will set up Spark in Docker. If you are not familiar with Docker, you can learn about Docker here. To get started, we first need to install Docker. If you don’t have it yet, find out how to install it from this link: https://docs.docker.com/install/. The installation procedure will take some time to finish, so please be patient.

Docker comes with an easy tool called „Kitematic“, which allows you to easily download and install Docker containers. Luckily, the Jupyter team provides a comprehensive container for Spark, including Python and of course Jupyter itself. Once Docker is installed successfully, download the container for Spark via Kitematic. Select „all-spark-notebook“ for our samples. Note that the download will take a while.

Download Apache Spark for Docker

Once your download has finished, it is about time to start your Docker container. When you download the container via Kitematic, it will be started by default. Within the container logs, you can see the URL and port to which Jupyter is mapped. Open the URL and enter the Token. When everything works as expected, you can now create new Notebooks in Jupyter.

Enter the URL and the Token
Jupyter is running

Now you are ready to go and write your own lambda expression with Spark in Python. There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.

Apache Spark Tutorial: An introduction to Apache Spark

In this tutorial, I will provide a first introduction to Apache Spark. Apache Spark is the number one Big Data tool nowadays. It is even considered the „killer“ of Hadoop, even though Hadoop isn’t that old yet. However, Apache Spark has several advantages over „traditional“ Hadoop. One of the key benefits is that Spark is much better suited for Big Data analytics in the Cloud than Hadoop is. Hadoop itself was never built for the Cloud, since it was created years before the Cloud took over major workloads. Apache Spark, in contrast, was built while the Cloud was becoming mainstream and thus has several benefits over Hadoop – e.g. it can use object stores such as Amazon S3 to access data.

However, Spark also integrates well into an existing Hadoop environment. Apache Spark runs natively on Hadoop as a YARN application and it re-uses different Hadoop components such as HDFS, HBase and Hive. Spark replaces MapReduce for batch processing with its own engine, which is much faster. Hive can also run on Spark and is then considerably faster. Additionally, Spark comes with new components for interactive queries, streaming and machine learning.

Apache Spark is great in terms of performance. To sort 100 TB of data, Spark is 3 times faster than MapReduce while using only 1/10th of the nodes. It is well suited for sorting petabytes of data and it won several sorting benchmarks such as the GraySort and CloudSort benchmark.

Spark is written in Scala

Spark is written in Scala, but is often used from Python. However, if you want to use the newest features of Spark, it is often necessary to work with Scala. Spark uses micro-batches for „real-time“ processing, meaning that it isn't true real-time; the smallest interval for micro-batches is 0.5 seconds. Spark should run in the same LAN as the data is stored – in terms of the Cloud, this means the same datacenter or availability zone. Spark doesn't need to run on the same nodes where the data is stored (e.g. with HBase); with Hadoop, this is the other way around: Hadoop processes the data where it is stored. There are several options to run Spark: standalone, on Apache Mesos, on Hadoop YARN or via Kubernetes/Docker. For our future tutorials, we will use Docker.

Spark has 4 main components. Over the next tutorials, we will have a look at each of them. These 4 components are:

  • Spark SQL: Provides a SQL Language, Dataframes and Datasets. This is the most convenient way to use Spark
  • Spark Streaming: Provides Micro-batch execution for near real-time applications
  • Spark ML: Built-In Machine Learning Library for Spark
  • GraphX: Built-In Library for Graph Processing

To develop Apache Spark applications, it is possible to use either Scala, Java, Python or R. Each Spark application starts with a „driver program“. The driver program executes the „main“ function.
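To make the idea of a driver program concrete, here is a minimal, hypothetical sketch in Python; the names and values are made up, but the overall shape – create the context, run some work, stop the context – is typical:

import pyspark

if __name__ == "__main__":
    # the driver program: creates the SparkContext and coordinates the work
    sc = pyspark.SparkContext(appName="MyFirstDriver")
    rdd = sc.parallelize(range(10))
    print(rdd.sum())   # 45
    sc.stop()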

RDD – Resilient Distributed Datasets

Data elements in Spark are organised as „RDDs – Resilient Distributed Datasets“. An RDD can be built from files on HDFS, an object store such as S3 or any other kind of dataset. RDDs are partitioned across different nodes, and Spark takes care of the distribution for you. RDDs can also be kept in memory for faster execution. Spark also works with „shared variables“. These are variables shared across different nodes, e.g. for computation. There are two types:

  • Broadcast variables: used to cache values in memory on all nodes (e.g. commonly used values)
  • Accumulators: used to add up values (e.g. counters, sums or similar) – a short sketch of both types follows below
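A minimal sketch of both shared variable types; it assumes an existing SparkContext „sc“, and the country codes are made up for illustration:

lookup = sc.broadcast({"AT": "Austria", "DE": "Germany"})   # read-only, cached on every node
counter = sc.accumulator(0)                                  # can only be added to

codes = sc.parallelize(["AT", "DE", "XX"])

# Broadcast variable: translate the codes using the cached dictionary
print(codes.map(lambda c: lookup.value.get(c, "unknown")).collect())
# ['Austria', 'Germany', 'unknown']

# Accumulator: count the codes that are missing in the dictionary
codes.foreach(lambda c: counter.add(1) if c not in lookup.value else None)
print(counter.value)
# 1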

I hope you enjoyed this introduction. In the next tutorial about Apache Spark, we will have a look at how to set up the environment to work with Apache Spark.

There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.