In my previous post, I gave an introduction to Python Libraries for Data Engineering and Data Science. In this post, we will have a first look at NumPy, one of the most important libraries to work with in Python.
NumPy is one of the most fundamental libraries for working with data. It is used under the hood by other libraries such as Pandas, so it is necessary to understand NumPy first. The focus of the library is on easy transformations of vectors, matrices and arrays, and it provides a lot of functionality for that. But let’s get our hands dirty with the library and have a look at it!
Before you get started, please make sure to have the Sandbox set up and ready.
Getting started with NumPy
First of all, we need to import the library. This works with the following import statement in Python:
import numpy as np
This should now give us access to the NumPy library. Let us first create an array with 15 values and reshape it into a 3×5 matrix. In NumPy, this works with the “arange” method: we provide “15” as the number of items and then reshape it to 3×5:
vals = np.arange(15).reshape(3,5)
vals
This should now give us a 2-dimensional output array with 3 rows, where each row contains 5 values. The values range from 0 to 14:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
NumPy also contains a lot of constants and functions. To use pi, you simply import “pi” from NumPy:
from numpy import pi
pi
We can now use PI for further work and calculations in Python.
Simple Calculations with NumPy
Let’s create a new array with 5 values:
vl = np.arange(5)
vl
An easy calculation is raising each value to a power. This works with the “**” operator:
nv = vl**2
nv
Now, this should give us the following output:
array([ 0, 1, 4, 9, 16])
The same works for the power of 3, if we want to raise every value in the array to the power of 3:
nn = vl**3
nn
And the output should be similar:
array([ 0, 1, 8, 27, 64])
Working with Random Numbers in NumPy
NumPy provides the “random” module to create random numbers. Its “random” function takes the dimensions of the array to fill with numbers. We use a 3×3 array:
nr = np.random.random((3,3))
nr *= 100
nr
Please note that random returns numbers between 0 and 1, so in order to create higher numbers we need to “stretch” them. We thus multiply by 100. The output should be something like this:
array([[90.30147522,  6.88948191,  6.41853222],
       [82.76187536, 73.37687372,  9.48770728],
       [59.02523947, 84.56571797,  5.05225463]])
Your numbers will be different, since we are working with random numbers here. We can do the same with a 3-dimensional array:
n3d = np.random.random((3,3,3))
n3d *= 100
n3d
Here too, your numbers will be different, but the overall “structure” should look like the following:
array([[[89.02863455, 83.83509441, 93.94264059],
        [55.79196044, 79.32574406, 33.06871588],
        [26.11848117, 64.05158411, 94.80789032]],
       [[19.19231999, 63.52128357,  8.10253043],
        [21.35001753, 25.11397256, 74.92458022],
        [35.62544853, 98.17595966, 23.10038137]],
       [[81.56526913,  9.99720992, 79.52580966],
        [38.69294158, 25.9849473 , 85.97255179],
        [38.42338734, 67.53616027, 98.64039687]]])
Other Ways to Work with Numbers in NumPy
NumPy provides several other options to work with data. There are several aggregation functions available that we can use. Let’s now look for the maximum value in the previously created array:
n3d.max()
In my example this would return 98.6. You will get a different number, since the values are random. It is also possible to return the maximum along a specific axis of the array. For this, we add the keyword “axis” to the “max” function:
n3d.max(axis=1)
This now returns the maximum values along the chosen axis of the array. In my example, the results look like this:
array([[93.94264059, 79.32574406, 94.80789032],
       [63.52128357, 74.92458022, 98.17595966],
       [81.56526913, 85.97255179, 98.64039687]])
Another option is to calculate the sum. We can do this over the entire array, or again by providing the axis keyword:
n3d.sum(axis=1)
In the next sample, we make the data look a bit nicer by rounding the numbers to 2 digits:
n3d.round(2)
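Besides “max” and “sum”, NumPy offers more aggregation functions that work the same way and also accept the “axis” keyword. A quick sketch:
n3d.min()    # smallest value in the entire array
n3d.mean()   # average over all values
n3d.std()    # standard deviation
n3d.sum()    # sum over the entire array when no axis is given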
Iterating arrays in Python
Often, it is necessary to iterate over items. In NumPy, this can be achieved with the built-in iterator, which we get from the “nditer” function. This function takes the array to iterate over, and we can then use it in a for loop:
for val in np.nditer(n3d):
    print(val)
The above sample iterates over all values in the array and prints them. If we want to modify the items within the array, we need to set “op_flags” to “readwrite”. This enables us to modify the array while iterating over it. In the next sample, we iterate over each item and replace it with its value modulo 3:
n3d = n3d.round(0)
with np.nditer(n3d, op_flags=['readwrite']) as it:
    for i in it:
        i[...] = i % 3
n3d
These are the basics of NumPy. In our next tutorial, we will have a look at Pandas: a very powerful dataframe library.
If you liked this post, you might want to look at the tutorial about Python itself, which gives you a great insight into the Python language for Spark. If you want to know more about Python, you should consider visiting the official page.
I talk a lot to different people in my domain – either at conferences or because I know them personally. Most of them have one thing in common: frustration. But why are people working with data frustrated? Why do we see so many frustrated data scientists? Is it the complexity of dealing with data, or is it something else? My experience clearly points to one answer: something else.
Why are people working with Data frustrated?
One pattern is very clear: most people I talk to that are frustrated with their job work in classical industries. Whenever I talk to people in the IT industry or in startups, they seem to be very happy. This is in stark contrast to people working in “classical” industries or in consulting companies. There are several reasons for that:
- First, it is often about a lack of support within traditional companies. Processes are complex and employees have worked in the company for quite some time. Bringing in new people (or the cool data scientists) often creates friction with the established employees. Doing things differently from how they used to be done isn’t well received by the established employees, and they have the power and the will to block any kind of innovation. No data science magic can compete with the internal networks they have built.
- Second, data is difficult to grasp and organised in silos. Established companies often treat their IT function as a cost center, so things were done or fixed on the fly. Dismantling those silos was never really intended, as budgets were never reserved or made available for doing so. Even now, most companies don’t invest in any kind of data governance to reduce their silos. Data quality isn’t a key aspect they strive for. The new kind of people – data scientists – end up “hunting” for data rather than working with the data.
- Third, the technology stack is heterogeneous, and legacy systems bring in a lot of frustration as well. This is very similar to the second point, but here the issue is less about finding the data at all and more about not knowing how to get it out of a system that lacks a clear API.
- Fourth, everybody forgets about data engineers. Data scientists sit alone, and though they do have some skills in Python, they aren’t the ones to operate a technology stack. Often, there is a mismatch between data scientists and data engineers in corporations.
- Fifth, legacy always kicks in. Mandatory regulatory and finance reporting often takes resources away from the organisation. You can’t just say: “Hey, I am not doing this report for the regulator since I want to find some patterns in the behaviour of my customers”. Traditional industries are more heavily regulated than startups or IT companies. This leads to data scientists being reused for standard reporting (not even self-service!). The reaction often is: “This is not what I signed up for!”
- Sixth, digitalisation and data units are often created just to show off in the shareholder report. There is no real demand from the board for impact. Impact is driven by the business, and the business knows how to achieve it. There won’t be significant growth at all, just some growth from “doing it as usual”. (However, startups and companies challenging the status quo will get this significant growth!)
- Seventh, data scientists need to sit in the business, whereas data engineers need to be in the IT department, close to the IT systems. Period. However, tribes need to be steered centrally.
How to overcome this frustration?
Basically, there is no fast cure for this problem and no quick way to reduce the number of frustrated data scientists. The field is still young, so confusion and wrong decisions outside of the IT industry are normal. Projects will fail, and skilled people will leave and find new jobs. Over time, companies will become more and more mature in their journey, and everything around data will become an established part of the company, just like controlling, marketing or any other function. It is yet to find its place and organisational form.
A colleague of mine, Vivien Roussez, wrote a nice library in R to predict time series. The package is called “autoTS” and provides a high-level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. You can find the package as an open source project on GitHub. Over the last few weeks, we have seen a lot of data science happening around Corona, and one of the challenges there is choosing the right forecast.
Introduction to autoTS
by Vivien Roussez
The autoTS package provides a high-level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. The main goals of the package are:
- Simplify the preparation of the time series;
- Train the algorithms and compare their results, to choose the best one;
- Gather the results in a final tidy dataframe
What are the inputs?
The package is designed to work on one time series at a time. Parallel calculations can be put on top of it (see the example below). The user has to provide two simple vectors:
- One with the dates (s.t. the lubridate package can parse them)
- The second with the corresponding values
Warnings
This package implements each algorithm with a unique parametrization, meaning that the user cannot tweak the algorithms (e.g. modify SARIMA-specific parameters).
Example on real-world data
Before getting started, you need to have the “autoTS” package installed. The following code sets the options and loads the required libraries:
knitr::opts_chunk$set(warning = F,message = F,fig.width = 8,fig.height = 5)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
library(autoTS)
For this example, we will use the quarterly GDP data of the European countries provided by Eurostat. The database can be downloaded from this page: choose “GDP and main components (output, expenditure and income) (namq_10_gdp)”, adjust the time dimension to select all available data, and download it as a CSV file with the correct formatting (1 234.56). The CSV is in the “Data” folder of this notebook.
dat <- read.csv("Data/namq_10_gdp_1_Data.csv")
str(dat)
## 'data.frame': 93456 obs. of 7 variables:
## $ TIME : Factor w/ 177 levels "1975Q1","1975Q2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ GEO : Factor w/ 44 levels "Albania","Austria",..: 15 15 15 15 15 15 15 15 15 15 ...
## $ UNIT : Factor w/ 3 levels "Chain linked volumes (2010), million euro",..: 2 2 2 2 3 3 3 3 1 1 ...
## $ S_ADJ : Factor w/ 4 levels "Calendar adjusted data, not seasonally adjusted data",..: 4 2 1 3 4 2 1 3 4 2 ...
## $ NA_ITEM : Factor w/ 1 level "Gross domestic product at market prices": 1 1 1 1 1 1 1 1 1 1 ...
## $ Value : Factor w/ 19709 levels "1 008.3","1 012.9",..: 19709 19709 19709 19709 19709 19709 19709 19709 19709 19709 ...
## $ Flag.and.Footnotes: Factor w/ 5 levels "","b","c","e",..: 1 1 1 1 1 1 1 1 1 1 ...
head(dat)
## TIME GEO
## 1 1975Q1 European Union - 27 countries (from 2019)
## 2 1975Q1 European Union - 27 countries (from 2019)
## 3 1975Q1 European Union - 27 countries (from 2019)
## 4 1975Q1 European Union - 27 countries (from 2019)
## 5 1975Q1 European Union - 27 countries (from 2019)
## 6 1975Q1 European Union - 27 countries (from 2019)
## UNIT
## 1 Chain linked volumes, index 2010=100
## 2 Chain linked volumes, index 2010=100
## 3 Chain linked volumes, index 2010=100
## 4 Chain linked volumes, index 2010=100
## 5 Current prices, million euro
## 6 Current prices, million euro
## S_ADJ
## 1 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 2 Seasonally adjusted data, not calendar adjusted data
## 3 Calendar adjusted data, not seasonally adjusted data
## 4 Seasonally and calendar adjusted data
## 5 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 6 Seasonally adjusted data, not calendar adjusted data
## NA_ITEM Value Flag.and.Footnotes
## 1 Gross domestic product at market prices :
## 2 Gross domestic product at market prices :
## 3 Gross domestic product at market prices :
## 4 Gross domestic product at market prices :
## 5 Gross domestic product at market prices :
## 6 Gross domestic product at market prices :
Data preparation
First, we have to clean the data (it is not too ugly though). The first thing is to convert the TIME column into a well-known date format that lubridate can handle. In this example, the yq function can parse the dates without modifying the column. Then, we have to remove the blanks that separate thousands in the values… Finally, we only keep the data since 2000 and the unadjusted series in current prices.
After that, we should get one time series per country:
dat <- mutate(dat,dates=yq(as.character(TIME)),
values = as.numeric(stringr::str_remove(Value," "))) %>%
filter(year(dates)>=2000 &
S_ADJ=="Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)" &
UNIT == "Current prices, million euro")
filter(dat,GEO %in% c("France","Austria")) %>%
ggplot(aes(dates,values,color=GEO)) + geom_line() + theme_minimal() +
labs(title="GDP of (completely) random countries")

Now we’re good to go!
Prediction on a random country
Let’s see how to use the package on one time series:
- Extract dates and values of the time series you want to work on
- Create the object containing all you need afterwards
- Train algo and determine which one is the best (over the last known year)
- Implement the best algorithm on full data
ex1 <- filter(dat,GEO=="France")
preparedTS <- prepare.ts(ex1$dates,ex1$values,"quarter")
## What is in this new object ?
str(preparedTS)
## List of 4
## $ obj.ts : Time-Series [1:77] from 2000 to 2019: 363007 369185 362905 383489 380714 ...
## $ obj.df :'data.frame': 77 obs. of 2 variables:
## ..$ dates: Date[1:77], format: "2000-01-01" "2000-04-01" ...
## ..$ val : num [1:77] 363007 369185 362905 383489 380714 ...
## $ freq.num : num 4
## $ freq.alpha: chr "quarter"
plot.ts(preparedTS$obj.ts)

ggplot(preparedTS$obj.df,aes(dates,val)) + geom_line() + theme_minimal()

Get the best algorithm for this time series :
## What is the best model for prediction ?
best.algo <- getBestModel(ex1$dates,ex1$values,"quarter",graph = F)
names(best.algo)
## [1] "prepedTS" "best" "train.errors" "res.train" "algos"
## [6] "graph.train"
print(paste("The best algorithm is",best.algo$best))
## [1] "The best algorithm is my.ets"
best.algo$graph.train

In the result of this function you find:
- The name of the best model
- The errors of each algorithm on the test set
- The graphic of the train step
- The prepared time series
- The list of used algorithm (that you can customize)
The result of this function can be used as direct input to the my.predictions function:
## Build the predictions
final.pred <- my.predictions(bestmod = best.algo)
tail(final.pred,24)
## # A tibble: 24 x 4
## dates type actual.value ets
## <date> <chr> <dbl> <dbl>
## 1 2015-04-01 <NA> 548987 NA
## 2 2015-07-01 <NA> 541185 NA
## 3 2015-10-01 <NA> 566281 NA
## 4 2016-01-01 <NA> 554121 NA
## 5 2016-04-01 <NA> 560873 NA
## 6 2016-07-01 <NA> 546383 NA
## 7 2016-10-01 <NA> 572752 NA
## 8 2017-01-01 <NA> 565221 NA
## 9 2017-04-01 <NA> 573720 NA
## 10 2017-07-01 <NA> 563671 NA
## # … with 14 more rows
ggplot(final.pred) + geom_line(aes(dates,actual.value),color="black") +
geom_line(aes_string("dates",stringr::str_remove(best.algo$best,"my."),linetype="type"),color="red") +
theme_minimal()

Not too bad, right?
Scaling predictions
Let’s say we want to make a prediction for each country at the same time and be as fast as possible → let’s combine the package’s functions with parallel computing. We have to reshape the data to get one column per country and then iterate over the columns of the data frame.
Prepare data
suppressPackageStartupMessages(library(tidyr))
dat.wide <- select(dat,GEO,dates,values) %>%
group_by(dates) %>%
spread(key = "GEO",value = "values")
head(dat.wide)
## # A tibble: 6 x 45
## # Groups: dates [6]
## dates Albania Austria Belgium `Bosnia and Her… Bulgaria Croatia Cyprus
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2000-01-01 NA 50422. 62261 NA 2941. 5266. 2547.
## 2 2000-04-01 NA 53180. 65046 NA 3252. 5811 2784.
## 3 2000-07-01 NA 53881. 62754 NA 4015. 6409. 2737.
## 4 2000-10-01 NA 56123. 68161 NA 4103. 6113 2738.
## 5 2001-01-01 NA 52911. 64318 NA 3284. 5777. 2688.
## 6 2001-04-01 NA 54994. 67537 NA 3669. 6616. 2946.
## # … with 37 more variables: Czechia <dbl>, Denmark <dbl>, Estonia <dbl>, `Euro
## # area (12 countries)` <dbl>, `Euro area (19 countries)` <dbl>, `Euro area
## # (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013,
## # EA18-2014, EA19)` <dbl>, `European Union - 15 countries (1995-2004)` <dbl>,
## # `European Union - 27 countries (from 2019)` <dbl>, `European Union - 28
## # countries` <dbl>, Finland <dbl>, France <dbl>, `Germany (until 1990 former
## # territory of the FRG)` <dbl>, Greece <dbl>, Hungary <dbl>, Iceland <dbl>,
## # Ireland <dbl>, Italy <dbl>, `Kosovo (under United Nations Security Council
## # Resolution 1244/99)` <dbl>, Latvia <dbl>, Lithuania <dbl>,
## # Luxembourg <dbl>, Malta <dbl>, Montenegro <dbl>, Netherlands <dbl>, `North
## # Macedonia` <dbl>, Norway <dbl>, Poland <dbl>, Portugal <dbl>,
## # Romania <dbl>, Serbia <dbl>, Slovakia <dbl>, Slovenia <dbl>, Spain <dbl>,
## # Sweden <dbl>, Switzerland <dbl>, Turkey <dbl>, `United Kingdom` <dbl>
## Compute bulk predictions
library(doParallel)
pipeline <- function(dates,values)
{
pred <- getBestModel(dates,values,"quarter",graph = F) %>%
my.predictions()
return(pred)
}
doMC::registerDoMC(parallel::detectCores()-1) # parallel backend (for UNIX)
system.time({
res <- foreach(ii=2:ncol(dat.wide),.packages = c("dplyr","autoTS")) %dopar%
pipeline(dat.wide$dates,pull(dat.wide,ii))
})
## user system elapsed
## 342.339 3.405 66.336
names(res) <- colnames(dat.wide)[-1]
str(res)
## List of 44
## $ Albania :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ stlm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Austria :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 50422 53180 53881 56123 52911 ...
## ..$ sarima : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Belgium :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 62261 65046 62754 68161 64318 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Bosnia and Herzegovina :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ stlm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Bulgaria :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 2941 3252 4015 4103 3284 ...
## ..$ tbats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Croatia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 5266 5811 6409 6113 5777 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Cyprus :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 2547 2784 2737 2738 2688 ...
## ..$ sarima : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Czechia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 15027 16430 17229 18191 16677 ...
## ..$ tbats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Denmark :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 42567 44307 43892 47249 44143 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Estonia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 1391 1575 1543 1662 1570 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Euro area (12 countries) :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Euro area (19 countries) :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19):Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ European Union - 15 countries (1995-2004) :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ European Union - 27 countries (from 2019) :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ European Union - 28 countries :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ prophet : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Finland :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 31759 33836 34025 36641 34474 ...
## ..$ bats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ France :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 363007 369185 362905 383489 380714 ...
## ..$ ets : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Germany (until 1990 former territory of the FRG) :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 515500 523900 536120 540960 530610 ...
## ..$ sarima : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Greece :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 33199 34676 37285 37751 35237 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Hungary :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 11516 12630 13194 13955 12832 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Iceland :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 2304 2442 2557 2447 2232 ...
## ..$ stlm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Ireland :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 25583 26751 27381 28666 29766 ...
## ..$ tbats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Italy :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 292517 309098 298655 338996 309967 ...
## ..$ ets : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Kosovo (under United Nations Security Council Resolution 1244/99) :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ ets : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Latvia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 1848 2165 2238 2382 2005 ...
## ..$ ets : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Lithuania :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 2657 3124 3267 3505 2996 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Luxembourg :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 5646 5730 5689 6015 5811 ...
## ..$ sarima : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Malta :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 979 1110 1158 1152 1031 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Montenegro :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ ets : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Netherlands :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 109154 113124 110955 118774 118182 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ North Macedonia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 901 1052 1033 1108 986 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Norway :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 44900 43730 46652 50638 48355 ...
## ..$ sarima : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Poland :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 41340 44210 46944 54163 47445 ...
## ..$ bagged : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Portugal :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 30644 31923 32111 33788 31927 ...
## ..$ stlm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Romania :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 7901 9511 11197 11630 8530 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Serbia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 0 0 0 0 0 ...
## ..$ sarima : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Slovakia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 5100 5722 5764 5752 5343 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Slovenia :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 5147 5591 5504 5667 5407 ...
## ..$ bats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Spain :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 153378 162400 158526 171946 166204 ...
## ..$ bagged : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Sweden :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 67022 73563 68305 73399 66401 ...
## ..$ bats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Switzerland :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 70048 72725 74957 77476 76092 ...
## ..$ bats : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ Turkey :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 59944 70803 82262 82981 60075 ...
## ..$ shortterm : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
## $ United Kingdom :Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 4 variables:
## ..$ dates : Date[1:85], format: "2000-01-01" "2000-04-01" ...
## ..$ type : chr [1:85] NA NA NA NA ...
## ..$ actual.value: num [1:85] 438090 440675 446918 462127 441157 ...
## ..$ bagged : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
There is no free lunch…
There is no best algorithm in general ⇒ it depends on the data!
sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) ) %>% table()
## .
## bagged bats ets prophet sarima shortterm stlm tbats
## 3 4 5 7 6 12 4 3
sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) )
## Albania
## "stlm"
## Austria
## "sarima"
## Belgium
## "shortterm"
## Bosnia and Herzegovina
## "stlm"
## Bulgaria
## "tbats"
## Croatia
## "shortterm"
## Cyprus
## "sarima"
## Czechia
## "tbats"
## Denmark
## "prophet"
## Estonia
## "shortterm"
## Euro area (12 countries)
## "prophet"
## Euro area (19 countries)
## "prophet"
## Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19)
## "prophet"
## European Union - 15 countries (1995-2004)
## "prophet"
## European Union - 27 countries (from 2019)
## "prophet"
## European Union - 28 countries
## "prophet"
## Finland
## "bats"
## France
## "ets"
## Germany (until 1990 former territory of the FRG)
## "sarima"
## Greece
## "shortterm"
## Hungary
## "shortterm"
## Iceland
## "stlm"
## Ireland
## "tbats"
## Italy
## "ets"
## Kosovo (under United Nations Security Council Resolution 1244/99)
## "ets"
## Latvia
## "ets"
## Lithuania
## "shortterm"
## Luxembourg
## "sarima"
## Malta
## "shortterm"
## Montenegro
## "ets"
## Netherlands
## "shortterm"
## North Macedonia
## "shortterm"
## Norway
## "sarima"
## Poland
## "bagged"
## Portugal
## "stlm"
## Romania
## "shortterm"
## Serbia
## "sarima"
## Slovakia
## "shortterm"
## Slovenia
## "bats"
## Spain
## "bagged"
## Sweden
## "bats"
## Switzerland
## "bats"
## Turkey
## "shortterm"
## United Kingdom
## "bagged"
We hope you enjoy working with this package to build your time series predictions in the future. Now you should be capable of extending your Corona data science work with time series predictions. If you want to learn more about data science, I recommend doing this tutorial.
In the last two posts we introduced the core concepts of Deep Learning, Feedforward Neural Network and Convolutional Neural Network. In this post, we will have a look at two other popular deep learning techniques: Recurrent Neural Network and Long Short-Term Memory.
Recurrent Neural Network
The main difference to the previously introduced Networks is that the Recurrent Neural Network provides a feedback loop to the previous neuron. This architecture makes it possible to remember important information about the input the network received and takes the learning into consideration along with the next input. RNNs work very well with sequential data such as sound, time series (sensor) data or written natural languages.
The advantage of an RNN over a feedforward network is that the RNN can remember its output and use it to predict the next element in a series, while a feedforward network is not able to feed the output back into the network. Real-time gesture tracking in videos is another important use case for RNNs.
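To make the feedback loop a bit more concrete, here is a minimal sketch of a single recurrent cell in plain NumPy (not from the original post; the weights are random placeholders and the sizes are chosen arbitrarily):
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.standard_normal((4, 3))   # input-to-hidden weights (hidden size 4, input size 3)
W_h = rng.standard_normal((4, 4))   # hidden-to-hidden weights: this is the feedback loop
h = np.zeros(4)                     # hidden state, carried from step to step

sequence = rng.standard_normal((5, 3))    # a toy sequence with 5 time steps
for x_t in sequence:
    h = np.tanh(W_x @ x_t + W_h @ h)      # the new state depends on the input AND the previous state
print(h)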

Long Short-Term Memory
A usual RNN has a short-term memory, which is already useful for some tasks. However, there are requirements for more advanced memory functionality, and Long Short-Term Memory solves this problem. LSTM was introduced by the researchers Sepp Hochreiter and Jürgen Schmidhuber. LSTMs enable RNNs to remember inputs over a long period of time. Therefore, LSTMs are used within RNNs for sequential data with long time lags between relevant events.
An LSTM learns over time which information is relevant and which isn’t. This is done by assigning weights to the information, which then flows through three different gates within the LSTM: the input gate, the output gate and the “forget” gate.
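As a rough sketch of how the three gates interact (a conceptual toy example in NumPy with placeholder shapes, not a production implementation), a single LSTM step could look like this:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W and b hold the stacked weights and biases for all gates
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input and output gate
    c = f * c_prev + i * np.tanh(g)                # cell state: keep some old information, add some new
    h = o * np.tanh(c)                             # hidden state passed on to the next step
    return h, c

# toy dimensions: hidden size 2, input size 3 -> stacked gate size 4 * 2 = 8
rng = np.random.default_rng(1)
W, b = rng.standard_normal((8, 5)), np.zeros(8)
h, c = lstm_step(rng.standard_normal(3), np.zeros(2), np.zeros(2), W, b)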
This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.
Working with data is a complex thing and not done in a few days. It is rather a matter of several sequential steps that lead to a final output. In this post, I present the data science process for executing data science projects.
What is the Data Science Process?
Data science often consists mainly of data wrangling and feature engineering before one can get to the exciting stuff. Since data science is very exploratory, few processes have evolved around it (yet). I group the data science process into three main steps, each of which has several sub-steps. Let’s start with the three main steps:
- Data Acquisition
- Feature Engineering and Selection
- Model Training and Extraction
Each of these main steps contains several sub-steps, which I will now describe in a bit more detail.
Step 1: Data Acquisition
Data engineering is the main ingredient in this step. After a business question has been formulated, it is necessary to look for the data. In an ideal setup, you would already have a data catalog in your enterprise. If not, you might need to ask several people until you have found the right place to dig deeper.
First of all, you need to acquire the data. It might come from internal sources, but you might also combine it with external sources. In this context, you might want to read about the different data sources you need. Once you have had a first look at the data, it is necessary to integrate it.
Data integration is often perceived as a challenging task. You need to set up a new environment to store the data, or you need to extend an existing schema. A common practice is to build a data science lab: an easy-to-use platform for data engineers and data scientists to work with data. A best practice is to use a prepared environment in the cloud for this.
After integrating the data comes the heavy part: cleaning it. In most cases, data is very messy and needs a lot of cleaning. This is also mainly carried out by data engineers, alongside data analysts in a company. Once you are done with the data acquisition part, you can move on to the feature engineering and selection step.
Typically, this first process step can be very painful and long. It depends on different factors within an enterprise, such as the data quality itself, the availability of a data catalog, and corresponding metadata descriptions. If your maturity in all these items is very high, it can take a few days to a week, but on average it is rather 2 to 4 weeks of work.
Step 2: Feature Engineering and Selection
Next comes a very important step in the data science process: feature engineering. Features are very important for machine learning and have a huge impact on the quality of the predictions. For feature engineering, you have to understand the domain you are in and what can be used from it. One needs to understand what data to use and for what reason.
After the feature engineering itself, it is necessary to pick the relevant features during feature selection. A common mistake is overfitting the model, often caused by so-called “feature explosion”: too many features are created and the predictions aren’t accurate anymore. Therefore, it is very important to select only those features that are relevant to the use case and actually carry some significance.
Another important step is the development of the cross-validation setup. This is necessary to check how the model will perform in practice: cross-validation measures the performance of your model and gives you insights into how well it generalises. After that comes hyperparameter tuning, where hyperparameters are fine-tuned to improve the predictions of your model.
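To make this less abstract, here is a minimal sketch of cross-validation combined with hyperparameter tuning in scikit-learn (the post does not prescribe a specific library, so take this as one possible way of doing it):
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_wine(return_X_y=True)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}   # hyperparameters to tune
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, search.best_score_)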
This part of the process is carried out mainly by data scientists, still supported by data engineers. The next and final step in the data science process is model training and extraction.
Step 3: Model Training and Extraction
The last step in the process is model training and extraction. In this step, the algorithm(s) for the prediction are selected and compared to each other. In order to ease the work here, it is necessary to put the whole process into a pipeline. (Note: I will explain the concept of the pipeline in a later post.) After the training is done, you can move on to the predictions themselves and bring the model into production.
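As a small teaser for the pipeline concept (a scikit-learn sketch under the assumption that X_train, y_train and X_new already exist; the post itself does not prescribe a library), preprocessing and the model are chained so they can be trained and reused as one unit:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),        # feature transformation step
    ("model", LogisticRegression()),    # the algorithm selected for the prediction
])
# pipe.fit(X_train, y_train)            # train the whole chain in one go
# pipe.predict(X_new)                   # the same chain is applied to new data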
The following illustration outlines the now presented process:

The data science process itself can easily be carried out in a Scrum or Kanban approach, depending on your favourite management style. For instance, you could treat each of the three process steps as a sprint. The first sprint, “Data Acquisition”, might last longer than the others, or you could even break it into several sprints. For agile data science, I can recommend reading this post.
In my previous post I’ve briefly introduced Spark ML. In this post I want to show how you can actually work with Spark ML, before continuing with some more theory on it. We will have a look at how to predict wine quality with a linear regression in Spark. In order to get started, please make sure to set up your environment based on this tutorial. If you haven’t heard of a linear regression, I recommend reading the introduction to the linear regression first.
The Linear Regression in Spark
There are several machine learning models available in Apache Spark. The easiest one is the linear regression, and it is the only one we will use in this post. Our goal is to get a quick start with Spark ML and then extend it over the next couple of tutorials and go much deeper into it. By now, you should have a working Spark environment ready. Next, we need some data. Luckily, the wine quality dataset is an often used one, and you can download it from here. Place it in the same folder as your new PySpark 3 notebook.
First, we need to import some packages from pyspark. SparkSession and LinearRegression are very obvious. The only one that isn’t obvious at first is the VectorAssembler. I will explain later what we need this class for.
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
Create the SparkContext and Load the Data
We first start by creating the SparkContext. This is a standard procedure and not yet rocket science.
spark = SparkSession.builder.master("local") \
    .appName("Cloudvane-Sample-03") \
    .config("net.cloudvane.sampleId", "Cloudvane.net Spark Sample 03") \
    .getOrCreate()
Next, we load our data. We specify that the format is of type “csv” (Comma Separated Values). The file is however delimited with “;” instead of “,”, so we need to specify this as well. Also, we want Spark to get the schema without any manual intervention from us, so we set “inferSchema” to True; Spark should then figure out what the data types are. Also, we specify that our file has headers. Last but not least, we need to load the file with its filename.
data = spark.read.format("csv").options(delimiter=";", inferSchema=True, header=True).load("winequality-white.csv")
We briefly check what our dataset looks like. We just type “data” in a Jupyter cell:
data
… and the output should be the following:
DataFrame[fixed acidity: double, volatile acidity: double, citric acid: double, residual sugar: double, chlorides: double, free sulfur dioxide: double, total sulfur dioxide: double, density: double, pH: double, sulphates: double, alcohol: double, quality: int]
Remember, if you want to see what is inside your data, use “data.show()”. Your dataframe should contain this data:

Time for some Feature Engineering
In order for Spark to process this data, we need to create a vector out of it. For this, we use the VectorAssembler that was imported above. Basically, the VectorAssembler takes several columns and combines them into a single vector column. We take the first 11 columns, since the “quality” column should serve as our label. The label is the value we later want to predict. We name this vector column “features” and transform the data.
va = VectorAssembler(inputCols=data.columns[:11], outputCol="features")
adj = va.transform(data)
adj.show()
The new dataset – called “adj” – now has an additional column named “features”. For machine learning, we only need the features, so we can get rid of the other data columns. Also, we want to rename the column “quality” to “label” to make it clear what we are working with.
lab = adj.select("features", "quality")
training_data = lab.withColumnRenamed("quality", "label")
Now, the dataframe should be cleaned and we are ready for the Linear Regression in Spark!
Running the Linear Regression
First, we create the linear regression. We set the maximum number of iterations to 30, the ElasticNet mixing parameter to 0.3 and the regularization parameter to 0.3. Also, we need to make sure to set the features column to “features” and the label column to “label”. Once the linear regression is created, we fit the training data to it. After that, we create our predictions with the “transform” function. The code for that is here:
lr = LinearRegression(maxIter=30, regParam=0.3, elasticNetParam=0.3, featuresCol="features", labelCol="label")
lrModel = lr.fit(training_data)
predictionsDF = lrModel.transform(training_data)
predictionsDF.show()
This should now create a new dataframe with the features, the label and the prediction. When you review your output, it already predicts quite OK-ish values for a wine:
+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[6.2,0.32,0.16,7....|    6|5.6645781552987655|
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.22,0.43,1....|    6| 6.020023174935914|
|[8.1,0.27,0.41,1....|    5| 6.178863965783833|
|[8.6,0.23,0.4,4.2...|    5| 5.756611684447172|
|[7.9,0.18,0.37,1....|    5| 6.012659811971332|
|[6.6,0.16,0.4,1.5...|    7| 6.343695124494296|
|[8.3,0.42,0.62,19...|    5| 5.605663225763592|
|[6.6,0.17,0.38,1....|    7| 6.139779557853963|
|[6.3,0.48,0.04,1....|    6| 5.537802384697061|
|[6.2,0.66,0.48,1....|    8| 6.028338973062226|
|[7.4,0.34,0.42,1....|    6|5.9853604241636615|
|[6.5,0.31,0.14,7....|    5| 5.652874078868445|
+--------------------+-----+------------------+
only showing top 20 rows
You could now go into a supermarket of your choice, acquire a wine and feed the data of that wine into your model. The model would tell you how good the wine is and whether you should buy it or not.
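For example, you could build a one-row dataframe with the same eleven feature columns, push it through the same VectorAssembler and ask the model for a prediction. This is only a sketch; the values below are placeholders for a hypothetical wine:
new_wine = spark.createDataFrame(
    [(7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8)],
    data.columns[:11])                                    # same feature columns as the training data
lrModel.transform(va.transform(new_wine)).select("prediction").show()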
This is already our first linear regression with Spark – a very easy model. However, there is much more to learn:
- We would need to understand the standard deviation of the errors of this model and how accurate it is. If you review some predictions, we are not very accurate at all, so the model needs to be tweaked (see the small evaluation sketch after this list)
- We will later compare different ML algorithms and build a pipeline
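To get a first feeling for the accuracy, one option (a small sketch, not covered in this post) is Spark's RegressionEvaluator, which can compute the root mean squared error on the predictions dataframe:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictionsDF)   # roughly how far the predicted quality is off on average
print(rmse)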
However, it is good for a start!
This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.
Spark ML is Apache Spark’s answer to machine learning and data science. The library has several powerful features for typical machine learning and data science tasks. In the following posts I will introduce Spark ML.
What is Spark ML?
The goal of MLlib is to solve complex machine learning and data science tasks with an easy API. Basically, Spark provides a DataFrame-based API for common machine learning tasks. These include different machine learning algorithms, options for feature engineering and data transformations, persisting models, and various mathematical utilities.
A key concept in data science is the pipeline, and it is also included in the library. Pipelines are used to abstract the work with machine learning models and the data around them. I will explain the concept of pipelines in a later post with some examples. Basically, pipelines enable us to chain different processing steps and algorithms into one workflow alongside the data.
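Just to give a first impression of the idea (a minimal sketch; the column names and dataframes are placeholders, not from this post), a pipeline in Spark ML could look roughly like this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# assemble raw columns into a feature vector, then fit a model, both as one pipeline
assembler = VectorAssembler(inputCols=["col_a", "col_b"], outputCol="features")   # hypothetical column names
lr = LinearRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])
# model = pipeline.fit(training_df)          # training_df is assumed to exist
# predictions = model.transform(test_df)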
Feature Engineering in MLlib
The library also covers several aspects of feature engineering. Basically, this is something that every data science process contains. These tasks include:
- Feature Extraction: extracting features from raw data, for instance converting text to a vector.
- Feature Transformers: transforming features into a different representation. This includes scalers and the like.
- Feature Selection: selecting features, for instance with a VectorSlicer (see the short sketch after this list)
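To give a feeling for what these transformers look like in code, here is a minimal sketch (the column names and the dataframe df are placeholders, not from this post):
from pyspark.ml.feature import StandardScaler, VectorSlicer

# scale an existing feature vector column to unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withStd=True)
# keep only the first two entries of the feature vector
slicer = VectorSlicer(inputCol="features", outputCol="selected_features", indices=[0, 1])
# scaled = scaler.fit(df).transform(df)      # df is assumed to hold a "features" vector column
# selected = slicer.transform(df)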
The library basically gives you a lot of possibilities for feature engineering. I will also explain the feature engineering capabilities in a later tutorial.
Machine Learning Models
Of course, the core of the machine learning library is the machine learning models themselves. A large number of standard algorithms are available for clustering, regression and classification. We will use different algorithms over the next couple of posts, so stay tuned for more details about them. In the next post, we will create a first model with a linear regression. In order to get started, please make sure to set up your environment based on this tutorial. If you haven’t heard of a linear regression, I recommend reading the introduction to the linear regression first.
This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.
In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of it: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two concepts: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.
The Feedforward Neural Network
Feedforward neural networks are the most general-purpose neural networks. The entry point is the input layer, followed by several hidden layers and an output layer. Each layer has a connection to the previous layer, and these connections are one-way only, so that the nodes can’t form a cycle. The information in a feedforward network only moves in one direction – from the input layer, through the hidden layers, to the output layer. It is the simplest version of a neural network. The image below illustrates the feedforward neural network.
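As a tiny illustration of this one-way flow of information, here is a toy forward pass in NumPy (random weights, arbitrary layer sizes, no training; just a sketch of the idea):
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                       # input layer with 4 values
W1, W2 = rng.standard_normal((8, 4)), rng.standard_normal((3, 8))

hidden = np.maximum(0, W1 @ x)                   # hidden layer (ReLU), fed only by the input layer
output = W2 @ hidden                             # output layer, fed only by the hidden layer
print(output)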

Convolutional Neural Networks (CNN)
The Convolutional Neural Network is very effective for image recognition and similar tasks. For that reason it is also well suited for video processing. The difference to the feedforward neural network is that the CNN works with 3 dimensions: width, height and depth, and not all neurons in one layer are fully connected to the neurons in the next layer. There are three types of layers in a Convolutional Neural Network that also set it apart from feedforward neural networks:
Convolution Layer
Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colors or objects, and a feature map is created from them. The deeper the network goes, the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.
Rectified linear units (ReLU)
The goal of this layer is to improve training speed and effectiveness. Negative values in the activations are removed.
Pooling/Subsampling
Pooling simplifies the output by performing nonlinear downsampling, which reduces the number of parameters the network needs to learn. In convolutional neural networks, this operation is useful since neighbouring activations usually carry similar information.
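To see the three layer types side by side, here is a toy sketch in NumPy (a single hand-written 2x2 filter and 2x2 pooling; real networks learn many filters and use optimized libraries):
import numpy as np

image = np.random.default_rng(0).random((6, 6))   # a tiny grayscale "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])      # one 2x2 convolutional filter

# convolution: slide the filter over the image and build the feature map
feature_map = np.array([[np.sum(image[i:i+2, j:j+2] * kernel)
                         for j in range(5)] for i in range(5)])

relu = np.maximum(0, feature_map)                 # ReLU: negative activations are removed

# 2x2 max pooling (non-overlapping): downsample the feature map
pooled = np.array([[relu[i:i+2, j:j+2].max()
                    for j in range(0, 4, 2)] for i in range(0, 4, 2)])
print(pooled.shape)   # (2, 2)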
This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.
For data itself, there are a lot of different sources that are needed. Depending on the company and industry, they differ a lot. However, to create a comprehensive view of your company, your own data isn’t the only thing you need. There are several other data sources you should consider.
Data you already have
The first data source – data you already have – seems to be the easiest. However, it isn’t as easy as you might believe. Bringing your data into order is actually a very difficult task and can’t be achieved that easily. I’ve written several blog posts here about the challenges around data, and you can review them. Basically, all of them focus on your internal data sources. I won’t restate them in detail here, but it is mainly about data governance and access.
Data that you can acquire
The second data source – data you can acquire – is another important aspect. By acquire I basically mean everything where you don’t have to pay an external party as a data provider for the data itself. You might use surveys (and pay for them as well) or acquire the data from open data platforms. You might also collect data from social media or with other kinds of crawlers. This data source is very important for you, as you can get a great overview of and insights into your specific questions.
In the past, I’ve seen a lot of companies utilising this second source, and we did a lot in that area. You don’t necessarily have to pay for this kind of data – some sources are free. And if you do pay for something, you don’t pay for the data itself but rather for the (semi-)manual way of collecting it. Here too, it differs heavily from industry to industry and depends on what the company is all about. I’ve seen companies collecting data from news sites to get insights into their competition and mentions, or simply scanning social media. A lot is possible with this kind of data source.
Data you can buy
The last one – data you can buy – is easy to get but very expensive in cash-out terms. There are a lot of data providers selling different kinds of data, often demographic data or data about customers. Such platforms collect data from a large number of online sites and thus track individuals and their behaviour across different sites. They then sell this kind of data, together with additional insights, to marketing departments. Here too, you can buy this kind of data from those platforms and thus enrich your own first-party and second-party data. Imagine you are operating a retail business selling all kinds of furniture.
You would probably not know much about your web shop visitors, since they are anonymous until they buy something. With data bought from such data providers, it would now be possible for you to figure out whether an anonymous visitor is an outdoor enthusiast, and you might adjust your offers to match his or her interests best. Or you might learn that the person visiting your shop recently bought a countryside house with a garden, and adjust your offers to present garden furniture or barbecue accessories. With this kind of third-party data, you can achieve a lot and better understand your customers and your company.
This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you look for open data, I would recommend you browsing some open data catalogs like the open data catalog from the U.S. government.