In our last tutorial, we had some brief introduction to Apache Spark. Now, in this tutorial we will have a look into how to setup an environment to work with Apache Spark. To make things easy, we will setup Spark in Docker. If you are not familiar with Docker, you can learn about Docker here. To get started, we first need to install Docker. If you don’t have it yet, find out how to install it from this link: https://docs.docker.com/install/. The installation procedure will take some time to finish, so please be patient.
Docker comes with an easy tool called “Kitematic”, which allows you to easily download and install docker containers. Luckily, the Jupyter Team provided a comprehensive container for Spark, including Python and of course Jupyter itself. Once your docker is installed successfully, download the container for Spark via Kitematic. Select “all-spark-notebook” for our samples. Note that the download will take a while.
Once your download has finished, it is about time to start your Docker container. When you download the container via Kitematic, it will be started by default. Within the container logs, you can see the URL and port to which Jupyter is mapped. Open the URL and enter the Token. When everything works as expected, you can now create new Notebooks in Jupyter.
Now you are ready to go and write your own lambda expression with spark in Python. There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. Also, I created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page can intensify your experience. Your learning journey can still continue.