About 1,5 years ago I was writing that Cloud is not the future. Instead, I claimed that it is the present. In fact, most companies are already embracing the Cloud. Today, I want to revisit this statement and take it to the next level: Cloud IaaS is not the Future

What is wrong about Cloud IaaS?

Cloud IaaS was great in the early days of the Cloud. It gave us freedom to move our workloads in a lift-and-shift scenario to the Cloud. Also, it greatly improved how we can handle workloads in a more dynamic way. Adding servers and shutting them down on demand was really easy. In an on-premise scenario, this was far from easy. All big Cloud providers today provide a comprehensive toolset and third-party applications for IaaS solutions. But why is it not as great as it used to be?

Honestly, I was never a big fan of IaaS in the Cloud. To state it blunt, it didn’t improve much (other than scale and flexibility) to an on-premise world. With Cloud IaaS, we still have to maintain all our servers like in the old days with on-premise. Security patches, updates, fixes and alike stays with those that build the services. Since the early days, I was a big fan of Cloud PaaS (Platform as a Service). 

What is the status of Cloud PaaS?

Over the last 1.5 years, a lot of mature Cloud PaaS services emerged. Cloud PaaS has been around for almost 10 years, but the current level of maturity is impressive. Some two years ago, they have mainly been general purpose Services, but now they moved into very specific domains. There are now a lot of services available for things such as IoT or Data Analytics.

The current trend in Cloud PaaS is definitely the move towards „Serverless Analytics“. Analytics has always been a slow-mover when it came to the Cloud. Other functional areas had already Cloud-native implementations, when Analytical workloads were still developed for the on-premise world. Hadoop was one of these projects, but other projects took over and Hadoop is in the decline. Now, more analytical applications will be developed with a PaaS-stack.

What should you do now?

Cloud PaaS isn’t a revolution or anything spectacular new. If you have no experiences with Cloud PaaS, I would urge you to look at these platforms asap. They will become essential for your business and do provide a lot of benefits. Again – it isn’t the future, it is the present!

If you want to learn more about Serverless Analytics, I can recommend you this tutorial. (Disclaimer: I am not affiliated with Udemy!)

linear regression in spark

In my previous post I’ve briefly introduced Spark ML. In this post I want to show how you can actually work with Spark ML, before continuing with some more theory on Spark ML. We will have a look at how to predict the wine quality with a Linear Regression in Spark. In order to get started, please make sure to setup your environment based on this tutorial. If you haven’t heard of a Linear Regression, I recommend you reading the introduction to the linear regression first.

The Linear Regression in Spark

There are several Machine Learning Models available in Apache Spark. The easiest one is the Linear Regression. In this post, we will only use the linear regression. Our goal is to have a quick start into Spark ML and then extend it over the next couple of tutorials and get much deeper into it. By now, you should have a working environment of Spark ready. Next, we need some data. Luckily, the wine quality dataset is a often used one and you can download it from here. Load it into the same folder as your new PySpark 3 Notebook.

First, we need to import some packages from pyspark. SparkSession and LinearRegression are very obvious. The only one that isn’t obvious at first is the VectorAssembler. I will explain later what we need this class for.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

Create the SparkContext and Load the Data

We first start by creating the SparkContext. This is a standard procedure and not yet rocket science.

spark = SparkSession.builder.master("local") \
.appName("Cloudvane-Sample-03") \
.config("net.cloudvane.sampleId", "Cloudvane.net Spark Sample 03").getOrCreate()

Next, we load our data. We specify that the format is of type “csv” (Comma Separated Values). The file is however delimited with “;” instead of “,”, so we need to specify this as well. Also, we want Spark to get the schema without any manual intervention from us, so we set “inferSchema” to True. Spark should now figure out how the data types are. Also, we specify that our file has headers. Last but not least, we need to load the file with its filename.

data = spark.read.format("csv").options(delimiter=";", inferSchema=True, header=True).load("winequality-white.csv")

We briefly check how our Dataset looks like. We just use one line in Jupyter with “data”:


… and the output should be the following:

DataFrame[fixed acidity: double, volatile acidity: double, citric acid: double, residual sugar: double, chlorides: double, free sulfur dioxide: double, total sulfur dioxide: double, density: double, pH: double, sulphates: double, alcohol: double, quality: int]

Remember, if you want to see what is inside your data, use “data.show()”. Your dataframe should contain this data:

The Wine Quality Dataframe
The Wine Quality Dataframe

Time for some Feature Engineering

In order for Spark to process this data, we need to create a vector out of our data. In order to do this, we use the VectorAssembler that was imported above. Basically, the VectorAssembler takes the data and moves it into a simple Vector. We take the first 11 columns, since the “quality” column should serve as our Label. The Label is the value we later want to predict. We name this Vector now “features” and transform the data.

va = VectorAssembler(inputCols=data.columns[:11], outputCol="features")
adj = va.transform(data)

The new Dataset – called “adj” – now has an additional column named “features”. For Machine Learning, we only need the features, so we can get rid of the other data columns. Also, we want to rename the column “quality” to “label” to make it clear on what we are working with.

lab = adj.select("features", "quality")
training_data = lab.withColumnRenamed("quality", "label")

Now, the dataframe should be cleaned and we are ready for the Linear Regression in Spark!

Running the Linear Regression

First, we create the Linear Regression. We set the maximum Iterations to 30, the ElasticNet mixing Parameter to 0.3 and the regularization parameter to 0.3. Also, we need to make sure to set the features column to “features” and the label column to “label”. Once the Linear Regression is created, we fit the training data into it. After that, we create our predictions with the “transform” function. The code for that is here:

lr = LinearRegression(maxIter=30, regParam=0.3, elasticNetParam=0.3, featuresCol="features", labelCol="label")
lrModel = lr.fit(training_data)
predictionsDF = lrModel.transform(training_data)

This should now create a new dataframe with the features, the label and the prediction. When you review you output, it already predicts quite ok-ish values for a wine:

|            features|label|        prediction|
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[6.2,0.32,0.16,7....|    6|5.6645781552987655|
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.22,0.43,1....|    6| 6.020023174935914|
|[8.1,0.27,0.41,1....|    5| 6.178863965783833|
|[8.6,0.23,0.4,4.2...|    5| 5.756611684447172|
|[7.9,0.18,0.37,1....|    5| 6.012659811971332|
|[6.6,0.16,0.4,1.5...|    7| 6.343695124494296|
|[8.3,0.42,0.62,19...|    5| 5.605663225763592|
|[6.6,0.17,0.38,1....|    7| 6.139779557853963|
|[6.3,0.48,0.04,1....|    6| 5.537802384697061|
|[6.2,0.66,0.48,1....|    8| 6.028338973062226|
|[7.4,0.34,0.42,1....|    6|5.9853604241636615|
|[6.5,0.31,0.14,7....|    5| 5.652874078868445|
only showing top 20 rows

You could now go into a supermarket of your choice and aquire a wine and fit the data of the wine into your model. The model would tell you how good the wine is and if you should buy one or not.

This is already our first linear regression with Spark – a very easy model. However, there is much more to learn:

  • We would need to understand the standard deviation of this model and how accurate it is. If you review some predictions, we ware not very acuarate at all. So it needs to be tweaked
  • We will later compare different ML algorithms and build a pipeline

However, it is good for a start!

This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.

Large enterprises have a lot of legacy systems in their footprint. This created a lot of challenges (but also opportunities!) for system integrators. Now, since companies strive to become data driven, it becomes an even bigger challenge. But luckily there is a new thing out there that can help: data abstraction.

Data Abstraction: why do you need it?

If a company wants to become data driven, it is necessary to unlock all the data that is available within a company. However, this is easier said than done. Most companies have a heterogenous IT landscape and thus struggle to integrate essential data sources into their analytical systems. In the past, there have been several approaches to it. Data was loaded (with some delay) to an analytical data warehouse. This data warehouse didn’t power any operational systems, so it was decoupled.

However, several things proved to be wrong with Data warehouses handling analytical workload: (a) data warehouses tend to be super-expensive in both cost and operations. It makes sense for KPIs and highly structured data, but not for other datasets. And (b) due to the high cost, data warehouses were loaded with some hours to even days of delay. In a real-time world, this isn’t good at all.

But – didn’t the datalake solve this already?

Some years ago, Data lakes surfaced. They were more efficient in terms of speed and cost than traditional data warehouses. However, data warehouses kept the master data, which data lakes often need. So a connection between the two needed to be established. In early days, data was simply replicated in order to do so. Next to datalakes, many other systems (NoSQL mainly) surfaced. Business units aquired different other systems, that made more integration efforts necessary. So, there was no end to data siles at all – it even got worse (and will continue to do so)

So, why not give in to the pressure of heterogenous systems and data stores and try to solve it differently? This is where data abstraction comes into play …

What is it about?

As already introduced, Data Abstraction should reduce your sleepless nights when it comes to accessing and unlocking your data assets. It is like a virtual layer that you add in between your data storages and your data consumers to enable one common access. The following illustration shows this:

Data Abstraction shows how to abstract you data
Data Abstraction

Basically, you build a layer on top of your data sources. Of course, it doesn’t solve the challenges around data integration, but it ensures that consumers can expect to have one common layer that they can plug into. Also, it enables you to exchange the technical layer of a data source without consumers taking note of it. You might consider to re-develop a data source from the ground up, in order to make it more performant. Both the old and the new stack will conform to the data abstraction and thus consumers won’t realize that there are significant changes under the hood.

This sounds really nice. So what’s the (technical) solution to it?

Basically, I don’t recommend any technology at this stage. There are several technologies that enable Data Abstraction. They can be clustered into 3 different areas:

  1. Lightweight SQL Engines: There are several products and tools (both Open Source and non-Open Source) available, which enable SQL access to different data sources. They not only plug into relational databases, but also into non-relational databases. Most tools provide easy integration and abstraction.
  2. API Integration: It is possible to integrate your data sources via an API layer that eventually abstracts the below data sources. The pain of integration is higher than with SQL Engines, but it gives you more flexbility on top and a higher degree of abstraction. In contrast to SQL engines, your consumers won’t plug too deep into database specifics. If you want to go for a really advanced tech stack, I recommend you reading about Graphs.
  3. Full-blown solution: There are several proprietary tools available, that provide numerous connectors to data sources. What is really great about these solutions is that they also include chaching mechanisms for frequent data access. You get much higher performance with limited implementation cost. However, you will lock into a specific soltuion.

Which solution you eventually consider to go for, is fully up to you. It depends on the company and its know-how and characterists. In most cases, it is also a combination of different solutions.

So what is next?

There are many tools and services out there which enable data abstraction. Data Abstraction is more of a concept than a concrete technology – not even an architectural pattern. In some cases, you might acquire a technology. Or you would abstract your data via an API or Graph. There are many technologies, tools and services out there to solve your issues.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.

Spark ML is Apache Spark’s answer to machine learning and data science. The library has several powerful features for typical machine learning and data science tasks. In the following posts I will introduce Spark ML.

What is Spark ML?

The goals of MLlib is to solve complex Machine Learning and Data Science tasks in an easy API. Basically, Spark provides a Dataframe-based API for common Machine Learning tasks. These include different machine learning algorithms, options for feature engineering and data transformations, persisting models and different mathematical utilities.

A key concept in Data Science are pipelines. This is also included with a comprehensive library. Pipelines are used to abstract the work with Machine Learning models and the data around it. I will explain the concept of Pipelines in a later post with some examples. Basically, Pipelines enable us to use different algorithms in one workflow alongside with its data.

Feature Engineering in MLlib

The Library also includes several aspects for feature engineering. Basically, this is a thing that every data science process contains. These tasks include:

  • Feature Extraction: extracting features from RAW-Data. This is for instance converting text to a vector.
  • Feature Transformers: Transforming Features to a different status. This includes Scalers and alike.
  • Feature Selection: Selecting features, for instance into a VectorSlicer

The library basically gives you a lot of possibilities for Feature engineering. I will explain Feature Engineering capabilities also in a later tutorial.

Machine Learning Models

Of course, the core task of the machine learning library is on Machine Learning models itself. There are a large number of standard algorithms available for Clustering, Regression and Classification. We will use different algorithms over the next couple of posts, so stay tuned for more details about them. In the next post, we will create a first model with a Linear Regression. In order to get started, please make sure to setup your environment based on this tutorial. If you haven’t heard of a Linear Regression, I recommend you reading the introduction to the linear regression first.

This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.