Spark ML is Apache Spark’s answer to machine learning and data science. The library has several powerful features for typical machine learning and data science tasks. In the following posts I will introduce Spark ML.
What is Spark ML?
The goals of MLlib is to solve complex Machine Learning and Data Science tasks in an easy API. Basically, Spark provides a Dataframe-based API for common Machine Learning tasks. These include different machine learning algorithms, options for feature engineering and data transformations, persisting models and different mathematical utilities.
A key concept in Data Science are pipelines. This is also included with a comprehensive library. Pipelines are used to abstract the work with Machine Learning models and the data around it. I will explain the concept of Pipelines in a later post with some examples. Basically, Pipelines enable us to use different algorithms in one workflow alongside with its data.
Feature Engineering in MLlib
The Library also includes several aspects for feature engineering. Basically, this is a thing that every data science process contains. These tasks include:
- Feature Extraction: extracting features from RAW-Data. This is for instance converting text to a vector.
- Feature Transformers: Transforming Features to a different status. This includes Scalers and alike.
- Feature Selection: Selecting features, for instance into a VectorSlicer
The library basically gives you a lot of possibilities for Feature engineering. I will explain Feature Engineering capabilities also in a later tutorial.
Machine Learning Models
Of course, the core task of the machine learning library is on Machine Learning models itself. There are a large number of standard algorithms available for Clustering, Regression and Classification. We will use different algorithms over the next couple of posts, so stay tuned for more details about them. In the next post, we will create a first model with a Linear Regression. In order to get started, please make sure to setup your environment based on this tutorial. If you haven’t heard of a Linear Regression, I recommend you reading the introduction to the linear regression first.
This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.