In this tutorial, I will give a first introduction to Apache Spark. Apache Spark is one of the most widely used Big Data tools today. It is even considered the “killer” of Hadoop, even though Hadoop isn’t that old yet. Apache Spark has several advantages over “traditional” Hadoop. One of the key benefits is that Spark is much better suited for Big Data analytics in the Cloud than Hadoop is. Hadoop was never built for the Cloud, since it was designed years before the Cloud took over major workloads. Apache Spark, in contrast, was built while the Cloud was becoming mainstream and thus benefits from it, e.g. by using object stores such as Amazon S3 to access data.
However, Spark also integrates well into an existing Hadoop environment. Apache Spark runs natively on Hadoop as a YARN application and re-uses different Hadoop components such as HDFS, HBase and Hive. Spark replaces MapReduce for batch processing with its own engine, which is much faster. Hive can also run on top of Spark and is thus much faster as well. Additionally, Spark comes with new technologies for interactive queries, streaming and machine learning.
Apache Spark is great in terms of performance. When sorting 100 TB of data, Spark was 3 times faster than MapReduce while using only 1/10th of the nodes. It is well suited for sorting petabytes of data and has won several sorting benchmarks, such as the GraySort and CloudSort benchmarks.
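To get a feeling for what these two ratios mean together, the following small sketch combines them. It only uses the two figures quoted above (3 times faster, one tenth of the nodes); the variable names are my own, and no benchmark data beyond these two ratios is assumed.

```python
# Figures taken from the text above; no further benchmark data is assumed.
speedup = 3          # Spark finished the sort 3 times faster
node_reduction = 10  # while using only 1/10th of the nodes

# Work done per node and unit of time, relative to MapReduce.
per_node_gain = speedup * node_reduction

print(f"Each Spark node did roughly {per_node_gain}x the work per unit of time.")
```

In other words, finishing 3 times faster on a tenth of the hardware corresponds to roughly 30 times more work per node.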
Spark is written in Scala
It is written in Scala, but is often used from Python. However, if you want to use the newest features of Spark, it is often necessary to work with Scala.
Spark uses micro-batches for “real-time” processing, meaning that it isn’t true real-time. The shortest interval for micro-batches is 0.5 seconds.
Spark should run in the same LAN in which the data is stored. In terms of the Cloud, this means the same datacenter or availability zone. Spark shouldn’t run on the same nodes on which the data is stored (e.g. with HBase). With Hadoop, this is the other way around; Hadoop processes the data where it is stored. There are several options to run Spark: standalone, on Apache Mesos, on Hadoop or via Kubernetes/Docker. For our future tutorials, we will use Docker.
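The micro-batch idea mentioned above can be illustrated without Spark at all. The following is a minimal plain-Python sketch and not Spark's API: the function `to_micro_batches` and the sample events are hypothetical and only show how incoming events are collected into fixed 0.5-second windows before they are processed.

```python
from collections import defaultdict

BATCH_INTERVAL = 0.5  # seconds; the shortest interval named in the text


def to_micro_batches(events, interval=BATCH_INTERVAL):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # All events inside the same interval end up in the same batch.
        batch_index = int(timestamp // interval)
        batches[batch_index].append(value)
    return [batches[i] for i in sorted(batches)]


# Hypothetical event stream: (arrival time in seconds, payload).
events = [(0.05, "a"), (0.30, "b"), (0.55, "c"), (0.90, "d"), (1.20, "e")]

for batch in to_micro_batches(events):
    print(batch)
```

The event arriving at 0.05 seconds is only processed once its batch closes at 0.5 seconds. This delay is the reason why micro-batching is called near real-time rather than true real-time.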
Spark has 4 main components. Over the next tutorials, we will have a look at each of them. These 4 components are:
- Spark SQL: Provides a SQL interface, DataFrames and Datasets. This is the most convenient way to use Spark
- Spark Streaming: Provides Micro-batch execution for near real-time applications
- Spark ML (MLlib): Built-In Machine Learning Library for Spark
- GraphX: Built-In Library for Graph Processing
To develop Apache Spark applications, it is possible to use Scala, Java, Python or R. Each Spark application starts with a “driver program”. The driver program executes the “main” function.
RDD – Resilient Distributed Datasets
Data elements in Spark are called “RDDs” (Resilient Distributed Datasets). An RDD can be created from files on HDFS, from an object store such as S3 or from any other kind of dataset. RDDs are distributed across different nodes; Spark handles this distribution, so you don’t have to take care of it yourself. RDDs can also be kept in memory for faster execution. Spark also works with “shared variables”. These are variables that are shared across different nodes, e.g. for computation. There are two types:
- Broadcast variables: used to cache values in memory on all nodes (e.g. commonly used values)
- Accumulators: used to add values (e.g. counters, sums or similar)
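To make the two kinds of shared variables more tangible, here is a minimal plain-Python model. It is a sketch under stated assumptions and not Spark's API: the `Accumulator` class, the sample records and the thread pool standing in for worker nodes are all hypothetical. The read-only lookup table plays the role of a broadcast variable, and the add-only counter plays the role of an accumulator.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from types import MappingProxyType

# Hypothetical data: (country_code, amount) records split into partitions,
# similar to how an RDD is split across nodes.
partitions = [
    [("DE", 10), ("AT", 5)],
    [("XX", 7), ("DE", 3)],
    [("CH", 8), ("YY", 1)],
]

# "Broadcast variable": a read-only lookup table that every worker can read.
country_names = MappingProxyType(
    {"DE": "Germany", "AT": "Austria", "CH": "Switzerland"}
)


class Accumulator:
    """Add-only counter: workers add to it, the driver reads the result."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = Lock()

    def add(self, amount):
        with self._lock:
            self._value += amount

    @property
    def value(self):
        return self._value


unknown_codes = Accumulator()


def process_partition(records):
    """Worker task: resolve country names and count unknown codes."""
    resolved = []
    for code, amount in records:
        if code in country_names:
            resolved.append((country_names[code], amount))
        else:
            unknown_codes.add(1)
    return resolved


# The "driver" hands one partition to each worker and collects the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = [row for part in pool.map(process_partition, partitions) for row in part]

print(results)
print("unknown codes:", unknown_codes.value)
```

The workers only read the lookup table and only add to the counter. This restriction is what makes both kinds of shared variables safe to use from many nodes at once.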
I hope you enjoyed this introduction to Apache Spark. In the next tutorial, we will have a look at how to set up the environment to work with Apache Spark.
There is of course much more to learn about Spark, so make sure to read the entire Apache Spark Tutorial. I regularly update this tutorial with new content. I have also created several other tutorials, such as the Machine Learning Tutorial and the Python for Spark Tutorial. The official Apache Spark page is a good resource to deepen your knowledge.