linear regression in spark

In my previous post I’ve briefly introduced Spark ML. In this post I want to show how you can actually work with Spark ML, before continuing with some more theory on Spark ML. We will have a look at how to predict the wine quality with a Linear Regression in Spark. In order to get started, please make sure to setup your environment based on this tutorial. If you haven’t heard of a Linear Regression, I recommend you reading the introduction to the linear regression first.

The Linear Regression in Spark

There are several Machine Learning Models available in Apache Spark. The easiest one is the Linear Regression. In this post, we will only use the linear regression. Our goal is to have a quick start into Spark ML and then extend it over the next couple of tutorials and get much deeper into it. By now, you should have a working environment of Spark ready. Next, we need some data. Luckily, the wine quality dataset is a often used one and you can download it from here. Load it into the same folder as your new PySpark 3 Notebook.

First, we need to import some packages from pyspark. SparkSession and LinearRegression are very obvious. The only one that isn’t obvious at first is the VectorAssembler. I will explain later what we need this class for.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

Create the SparkContext and Load the Data

We first start by creating the SparkContext. This is a standard procedure and not yet rocket science.

spark = SparkSession.builder.master("local") \
.appName("Cloudvane-Sample-03") \
.config("net.cloudvane.sampleId", "Cloudvane.net Spark Sample 03").getOrCreate()

Next, we load our data. We specify that the format is of type “csv” (Comma Separated Values). The file is however delimited with “;” instead of “,”, so we need to specify this as well. Also, we want Spark to get the schema without any manual intervention from us, so we set “inferSchema” to True. Spark should now figure out how the data types are. Also, we specify that our file has headers. Last but not least, we need to load the file with its filename.

data = spark.read.format("csv").options(delimiter=";", inferSchema=True, header=True).load("winequality-white.csv")

We briefly check how our Dataset looks like. We just use one line in Jupyter with “data”:

data

… and the output should be the following:

DataFrame[fixed acidity: double, volatile acidity: double, citric acid: double, residual sugar: double, chlorides: double, free sulfur dioxide: double, total sulfur dioxide: double, density: double, pH: double, sulphates: double, alcohol: double, quality: int]

Remember, if you want to see what is inside your data, use “data.show()”. Your dataframe should contain this data:

The Wine Quality Dataframe
The Wine Quality Dataframe

Time for some Feature Engineering

In order for Spark to process this data, we need to create a vector out of our data. In order to do this, we use the VectorAssembler that was imported above. Basically, the VectorAssembler takes the data and moves it into a simple Vector. We take the first 11 columns, since the “quality” column should serve as our Label. The Label is the value we later want to predict. We name this Vector now “features” and transform the data.

va = VectorAssembler(inputCols=data.columns[:11], outputCol="features")
adj = va.transform(data)
adj.show()

The new Dataset – called “adj” – now has an additional column named “features”. For Machine Learning, we only need the features, so we can get rid of the other data columns. Also, we want to rename the column “quality” to “label” to make it clear on what we are working with.

lab = adj.select("features", "quality")
training_data = lab.withColumnRenamed("quality", "label")

Now, the dataframe should be cleaned and we are ready for the Linear Regression in Spark!

Running the Linear Regression

First, we create the Linear Regression. We set the maximum Iterations to 30, the ElasticNet mixing Parameter to 0.3 and the regularization parameter to 0.3. Also, we need to make sure to set the features column to “features” and the label column to “label”. Once the Linear Regression is created, we fit the training data into it. After that, we create our predictions with the “transform” function. The code for that is here:

lr = LinearRegression(maxIter=30, regParam=0.3, elasticNetParam=0.3, featuresCol="features", labelCol="label")
lrModel = lr.fit(training_data)
predictionsDF = lrModel.transform(training_data)
predictionsDF.show()

This should now create a new dataframe with the features, the label and the prediction. When you review you output, it already predicts quite ok-ish values for a wine:

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[6.2,0.32,0.16,7....|    6|5.6645781552987655|
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.22,0.43,1....|    6| 6.020023174935914|
|[8.1,0.27,0.41,1....|    5| 6.178863965783833|
|[8.6,0.23,0.4,4.2...|    5| 5.756611684447172|
|[7.9,0.18,0.37,1....|    5| 6.012659811971332|
|[6.6,0.16,0.4,1.5...|    7| 6.343695124494296|
|[8.3,0.42,0.62,19...|    5| 5.605663225763592|
|[6.6,0.17,0.38,1....|    7| 6.139779557853963|
|[6.3,0.48,0.04,1....|    6| 5.537802384697061|
|[6.2,0.66,0.48,1....|    8| 6.028338973062226|
|[7.4,0.34,0.42,1....|    6|5.9853604241636615|
|[6.5,0.31,0.14,7....|    5| 5.652874078868445|
+--------------------+-----+------------------+
only showing top 20 rows

You could now go into a supermarket of your choice and aquire a wine and fit the data of the wine into your model. The model would tell you how good the wine is and if you should buy one or not.

This is already our first linear regression with Spark – a very easy model. However, there is much more to learn:

  • We would need to understand the standard deviation of this model and how accurate it is. If you review some predictions, we ware not very acuarate at all. So it needs to be tweaked
  • We will later compare different ML algorithms and build a pipeline

However, it is good for a start!

This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply