A colleague of mine, Vivien Roussez, wrote a nice R library for predicting time series. The package is called “autoTS” and provides a high-level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. You can find the package as an open source project on GitHub. Over the last few weeks, we have seen a lot of data science happening around Corona, and one of the challenges there is to use the right forecast.

Introduction to autoTS

by Vivien Roussez

The autoTS package provides a high-level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. The main goals of the package are:

  • Simplify the preparation of the time series;
  • Train the algorithms and compare their results, to choose the best one;
  • Gather the results in a final tidy dataframe

What are the inputs?

The package is designed to work on one time series at a time. Parallel calculations can be put on top of it (see example below). The user has to provide 2 simple vectors:

  • One with the dates (such that the lubridate package can parse them)
  • The second with the corresponding values

Warnings

This package implements each algorithm with a unique parametrization, meaning that the user cannot tweak the algorithms (e.g., modify SARIMA-specific parameters).

Example on real-world data

Before getting started, you need to install the “autoTS” package (you can get it from GitHub, as mentioned above) and load the required libraries:

knitr::opts_chunk$set(warning = F,message = F,fig.width = 8,fig.height = 5)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
library(autoTS)

For this example, we will use the quarterly GDP data of the European countries provided by Eurostat. The database can be downloaded from this page: choose “GDP and main components (output, expenditure and income) (namq_10_gdp)”, adjust the time dimension to select all available data, and download it as a csv file with the correct formatting (1 234.56). The csv is in the “Data” folder of this notebook.

dat <- read.csv("Data/namq_10_gdp_1_Data.csv")
str(dat)
## 'data.frame':    93456 obs. of  7 variables:
##  $ TIME              : Factor w/ 177 levels "1975Q1","1975Q2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GEO               : Factor w/ 44 levels "Albania","Austria",..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ UNIT              : Factor w/ 3 levels "Chain linked volumes (2010), million euro",..: 2 2 2 2 3 3 3 3 1 1 ...
##  $ S_ADJ             : Factor w/ 4 levels "Calendar adjusted data, not seasonally adjusted data",..: 4 2 1 3 4 2 1 3 4 2 ...
##  $ NA_ITEM           : Factor w/ 1 level "Gross domestic product at market prices": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Value             : Factor w/ 19709 levels "1 008.3","1 012.9",..: 19709 19709 19709 19709 19709 19709 19709 19709 19709 19709 ...
##  $ Flag.and.Footnotes: Factor w/ 5 levels "","b","c","e",..: 1 1 1 1 1 1 1 1 1 1 ...
head(dat)
##     TIME                                       GEO
## 1 1975Q1 European Union - 27 countries (from 2019)
## 2 1975Q1 European Union - 27 countries (from 2019)
## 3 1975Q1 European Union - 27 countries (from 2019)
## 4 1975Q1 European Union - 27 countries (from 2019)
## 5 1975Q1 European Union - 27 countries (from 2019)
## 6 1975Q1 European Union - 27 countries (from 2019)
##                                   UNIT
## 1 Chain linked volumes, index 2010=100
## 2 Chain linked volumes, index 2010=100
## 3 Chain linked volumes, index 2010=100
## 4 Chain linked volumes, index 2010=100
## 5         Current prices, million euro
## 6         Current prices, million euro
##                                                                           S_ADJ
## 1 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 2                          Seasonally adjusted data, not calendar adjusted data
## 3                          Calendar adjusted data, not seasonally adjusted data
## 4                                         Seasonally and calendar adjusted data
## 5 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 6                          Seasonally adjusted data, not calendar adjusted data
##                                   NA_ITEM Value Flag.and.Footnotes
## 1 Gross domestic product at market prices     :
## 2 Gross domestic product at market prices     :
## 3 Gross domestic product at market prices     :
## 4 Gross domestic product at market prices     :
## 5 Gross domestic product at market prices     :
## 6 Gross domestic product at market prices     :

Data preparation

First, we have to clean the data (it is not too messy, though). The first step is to convert the TIME column into a well-known date format that lubridate can handle. In this example, the yq function can parse the dates without any modification of the column. Then, we have to remove the blanks that separate thousands in the values… Finally, we only keep data since 2000 and the unadjusted series in current prices.

After that, we should get one time series per country.

dat <- mutate(dat,dates=yq(as.character(TIME)),
              values = as.numeric(stringr::str_remove(Value," "))) %>%
  filter(year(dates)>=2000 &
           S_ADJ=="Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)" &
           UNIT == "Current prices, million euro")
filter(dat,GEO %in% c("France","Austria")) %>%
  ggplot(aes(dates,values,color=GEO)) + geom_line() + theme_minimal() +
  labs(title="GDP of (completely) random countries")

Now we’re good to go!

Prediction on a random country

Let’s see how to use the package on one time series:

  • Extract the dates and values of the time series you want to work on
  • Create the object containing everything you need afterwards
  • Train the algorithms and determine which one is best (over the last known year)
  • Implement the best algorithm on the full data
ex1 <- filter(dat,GEO=="France")
preparedTS <- prepare.ts(ex1$dates,ex1$values,"quarter")
## What is in this new object ?
str(preparedTS)
## List of 4
##  $ obj.ts    : Time-Series [1:77] from 2000 to 2019: 363007 369185 362905 383489 380714 ...
##  $ obj.df    :'data.frame':  77 obs. of  2 variables:
##   ..$ dates: Date[1:77], format: "2000-01-01" "2000-04-01" ...
##   ..$ val  : num [1:77] 363007 369185 362905 383489 380714 ...
##  $ freq.num  : num 4
##  $ freq.alpha: chr "quarter"
plot.ts(preparedTS$obj.ts)
ggplot(preparedTS$obj.df,aes(dates,val)) + geom_line() + theme_minimal()

Get the best algorithm for this time series:

## What is the best model for prediction ?
best.algo <- getBestModel(ex1$dates,ex1$values,"quarter",graph = F)
names(best.algo)
## [1] "prepedTS"     "best"         "train.errors" "res.train"    "algos"
## [6] "graph.train"
print(paste("The best algorithm is",best.algo$best))
## [1] "The best algorithm is my.ets"
best.algo$graph.train

The result of this function contains:

  • The name of the best model
  • The errors of each algorithm on the test set
  • The graph of the training step
  • The prepared time series
  • The list of algorithms used (which you can customize)

The result of this function can be used as direct input for the my.predictions function.

## Build the predictions
final.pred <- my.predictions(bestmod = best.algo)
tail(final.pred,24)
## # A tibble: 24 x 4
##    dates      type  actual.value   ets
##    <date>     <chr>        <dbl> <dbl>
##  1 2015-04-01 <NA>        548987    NA
##  2 2015-07-01 <NA>        541185    NA
##  3 2015-10-01 <NA>        566281    NA
##  4 2016-01-01 <NA>        554121    NA
##  5 2016-04-01 <NA>        560873    NA
##  6 2016-07-01 <NA>        546383    NA
##  7 2016-10-01 <NA>        572752    NA
##  8 2017-01-01 <NA>        565221    NA
##  9 2017-04-01 <NA>        573720    NA
## 10 2017-07-01 <NA>        563671    NA
## # … with 14 more rows
ggplot(final.pred) + geom_line(aes(dates,actual.value),color="black") +
  geom_line(aes_string("dates",stringr::str_remove(best.algo$best,"my."),linetype="type"),color="red") +
  theme_minimal() 

Not too bad, right?

Scaling predictions

Let’s say we want to make a prediction for each country at the same time, as fast as possible: let’s combine the package’s functions with parallel computing. We have to reshape the data to get one column per country and then iterate over the columns of the data frame.

Prepare data

suppressPackageStartupMessages(library(tidyr))
dat.wide <- select(dat,GEO,dates,values) %>%
  group_by(dates) %>%
  spread(key = "GEO",value = "values")
head(dat.wide)
## # A tibble: 6 x 45
## # Groups:   dates [6]
##   dates      Albania Austria Belgium `Bosnia and Her… Bulgaria Croatia Cyprus
##   <date>       <dbl>   <dbl>   <dbl>            <dbl>    <dbl>   <dbl>  <dbl>
## 1 2000-01-01      NA  50422.   62261               NA    2941.   5266.  2547.
## 2 2000-04-01      NA  53180.   65046               NA    3252.   5811   2784.
## 3 2000-07-01      NA  53881.   62754               NA    4015.   6409.  2737.
## 4 2000-10-01      NA  56123.   68161               NA    4103.   6113   2738.
## 5 2001-01-01      NA  52911.   64318               NA    3284.   5777.  2688.
## 6 2001-04-01      NA  54994.   67537               NA    3669.   6616.  2946.
## # … with 37 more variables: Czechia <dbl>, Denmark <dbl>, Estonia <dbl>, `Euro
## #   area (12 countries)` <dbl>, `Euro area (19 countries)` <dbl>, `Euro area
## #   (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013,
## #   EA18-2014, EA19)` <dbl>, `European Union - 15 countries (1995-2004)` <dbl>,
## #   `European Union - 27 countries (from 2019)` <dbl>, `European Union - 28
## #   countries` <dbl>, Finland <dbl>, France <dbl>, `Germany (until 1990 former
## #   territory of the FRG)` <dbl>, Greece <dbl>, Hungary <dbl>, Iceland <dbl>,
## #   Ireland <dbl>, Italy <dbl>, `Kosovo (under United Nations Security Council
## #   Resolution 1244/99)` <dbl>, Latvia <dbl>, Lithuania <dbl>,
## #   Luxembourg <dbl>, Malta <dbl>, Montenegro <dbl>, Netherlands <dbl>, `North
## #   Macedonia` <dbl>, Norway <dbl>, Poland <dbl>, Portugal <dbl>,
## #   Romania <dbl>, Serbia <dbl>, Slovakia <dbl>, Slovenia <dbl>, Spain <dbl>,
## #   Sweden <dbl>, Switzerland <dbl>, Turkey <dbl>, `United Kingdom` <dbl>

Compute bulk predictions

library(doParallel)
pipeline <- function(dates,values)
{
  pred <- getBestModel(dates,values,"quarter",graph = F)  %>%
    my.predictions()
  return(pred)
}
doMC::registerDoMC(parallel::detectCores()-1) # parallel backend (for UNIX)
system.time({
  res <- foreach(ii=2:ncol(dat.wide),.packages = c("dplyr","autoTS")) %dopar%
  pipeline(dat.wide$dates,pull(dat.wide,ii))
})
##    user  system elapsed
## 342.339   3.405  66.336
names(res) <- colnames(dat.wide)[-1]
str(res)
## List of 44
##  $ Albania                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Austria                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 50422 53180 53881 56123 52911 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Belgium                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 62261 65046 62754 68161 64318 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Bosnia and Herzegovina                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Bulgaria                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2941 3252 4015 4103 3284 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Croatia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5266 5811 6409 6113 5777 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Cyprus                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2547 2784 2737 2738 2688 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Czechia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 15027 16430 17229 18191 16677 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Denmark                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 42567 44307 43892 47249 44143 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Estonia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 1391 1575 1543 1662 1570 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (12 countries)                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (19 countries)                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19):Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 15 countries (1995-2004)                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 27 countries (from 2019)                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 28 countries                                                                :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Finland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 31759 33836 34025 36641 34474 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ France                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 363007 369185 362905 383489 380714 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Germany (until 1990 former territory of the FRG)                                             :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 515500 523900 536120 540960 530610 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Greece                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 33199 34676 37285 37751 35237 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Hungary                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 11516 12630 13194 13955 12832 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Iceland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2304 2442 2557 2447 2232 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Ireland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 25583 26751 27381 28666 29766 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Italy                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 292517 309098 298655 338996 309967 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Kosovo (under United Nations Security Council Resolution 1244/99)                            :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Latvia                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 1848 2165 2238 2382 2005 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Lithuania                                                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2657 3124 3267 3505 2996 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Luxembourg                                                                                   :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5646 5730 5689 6015 5811 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Malta                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 979 1110 1158 1152 1031 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Montenegro                                                                                   :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Netherlands                                                                                  :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 109154 113124 110955 118774 118182 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ North Macedonia                                                                              :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 901 1052 1033 1108 986 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Norway                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 44900 43730 46652 50638 48355 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Poland                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 41340 44210 46944 54163 47445 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Portugal                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 30644 31923 32111 33788 31927 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Romania                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 7901 9511 11197 11630 8530 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Serbia                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovakia                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5100 5722 5764 5752 5343 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovenia                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5147 5591 5504 5667 5407 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Spain                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 153378 162400 158526 171946 166204 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Sweden                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 67022 73563 68305 73399 66401 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Switzerland                                                                                  :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 70048 72725 74957 77476 76092 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Turkey                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 59944 70803 82262 82981 60075 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ United Kingdom                                                                               :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 438090 440675 446918 462127 441157 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...

There is no free lunch…

There is no best algorithm in general: it depends on the data!

sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) ) %>% table()
## .
##    bagged      bats       ets   prophet    sarima shortterm      stlm     tbats
##         3         4         5         7         6        12         4         3
sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) )
##                                                                                       Albania
##                                                                                        "stlm"
##                                                                                       Austria
##                                                                                      "sarima"
##                                                                                       Belgium
##                                                                                   "shortterm"
##                                                                        Bosnia and Herzegovina
##                                                                                        "stlm"
##                                                                                      Bulgaria
##                                                                                       "tbats"
##                                                                                       Croatia
##                                                                                   "shortterm"
##                                                                                        Cyprus
##                                                                                      "sarima"
##                                                                                       Czechia
##                                                                                       "tbats"
##                                                                                       Denmark
##                                                                                     "prophet"
##                                                                                       Estonia
##                                                                                   "shortterm"
##                                                                      Euro area (12 countries)
##                                                                                     "prophet"
##                                                                      Euro area (19 countries)
##                                                                                     "prophet"
## Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19)
##                                                                                     "prophet"
##                                                     European Union - 15 countries (1995-2004)
##                                                                                     "prophet"
##                                                     European Union - 27 countries (from 2019)
##                                                                                     "prophet"
##                                                                 European Union - 28 countries
##                                                                                     "prophet"
##                                                                                       Finland
##                                                                                        "bats"
##                                                                                        France
##                                                                                         "ets"
##                                              Germany (until 1990 former territory of the FRG)
##                                                                                      "sarima"
##                                                                                        Greece
##                                                                                   "shortterm"
##                                                                                       Hungary
##                                                                                   "shortterm"
##                                                                                       Iceland
##                                                                                        "stlm"
##                                                                                       Ireland
##                                                                                       "tbats"
##                                                                                         Italy
##                                                                                         "ets"
##                             Kosovo (under United Nations Security Council Resolution 1244/99)
##                                                                                         "ets"
##                                                                                        Latvia
##                                                                                         "ets"
##                                                                                     Lithuania
##                                                                                   "shortterm"
##                                                                                    Luxembourg
##                                                                                      "sarima"
##                                                                                         Malta
##                                                                                   "shortterm"
##                                                                                    Montenegro
##                                                                                         "ets"
##                                                                                   Netherlands
##                                                                                   "shortterm"
##                                                                               North Macedonia
##                                                                                   "shortterm"
##                                                                                        Norway
##                                                                                      "sarima"
##                                                                                        Poland
##                                                                                      "bagged"
##                                                                                      Portugal
##                                                                                        "stlm"
##                                                                                       Romania
##                                                                                   "shortterm"
##                                                                                        Serbia
##                                                                                      "sarima"
##                                                                                      Slovakia
##                                                                                   "shortterm"
##                                                                                      Slovenia
##                                                                                        "bats"
##                                                                                         Spain
##                                                                                      "bagged"
##                                                                                        Sweden
##                                                                                        "bats"
##                                                                                   Switzerland
##                                                                                        "bats"
##                                                                                        Turkey
##                                                                                   "shortterm"
##                                                                                United Kingdom
##                                                                                      "bagged"

We hope you enjoy working with this package to build your time series predictions in the future. You should now be able to extend your Corona data science work with time series predictions. If you want to learn more about data science, I recommend doing this tutorial.

Linear Regression in Spark

In my previous post I briefly introduced Spark ML. In this post I want to show how you can actually work with Spark ML, before continuing with some more theory on Spark ML. We will have a look at how to predict the wine quality with a Linear Regression in Spark. In order to get started, please make sure to set up your environment based on this tutorial. If you haven’t heard of a Linear Regression, I recommend reading the introduction to the linear regression first.

The Linear Regression in Spark

There are several Machine Learning models available in Apache Spark. The easiest one is the Linear Regression, and it is the only one we will use in this post. Our goal is to have a quick start into Spark ML and then extend it over the next couple of tutorials, going much deeper into it. By now, you should have a working environment of Spark ready. Next, we need some data. Luckily, the wine quality dataset is an often-used one, and you can download it from here. Put it into the same folder as your new PySpark 3 Notebook.

First, we need to import some packages from pyspark. SparkSession and LinearRegression are very obvious. The only one that isn’t obvious at first is the VectorAssembler. I will explain later what we need this class for.

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

Create the SparkSession and Load the Data

We first start by creating the SparkSession. This is a standard procedure and not yet rocket science.

spark = SparkSession.builder.master("local") \
.appName("Cloudvane-Sample-03") \
.config("net.cloudvane.sampleId", "Cloudvane.net Spark Sample 03").getOrCreate()

Next, we load our data. We specify that the format is of type “csv” (Comma Separated Values). The file is, however, delimited with “;” instead of “,”, so we need to specify this as well. Also, we want Spark to infer the schema without any manual intervention from us, so we set “inferSchema” to True; Spark should then figure out what the data types are. We also specify that our file has headers. Last but not least, we load the file by its filename.

data = spark.read.format("csv").options(delimiter=";", inferSchema=True, header=True).load("winequality-white.csv")

We briefly check what our Dataset looks like. We just use one line in Jupyter with “data”:

data

… and the output should be the following:

DataFrame[fixed acidity: double, volatile acidity: double, citric acid: double, residual sugar: double, chlorides: double, free sulfur dioxide: double, total sulfur dioxide: double, density: double, pH: double, sulphates: double, alcohol: double, quality: int]

Remember, if you want to see what is inside your data, use “data.show()”. Your dataframe should contain this data:

The Wine Quality Dataframe

Time for some Feature Engineering

In order for Spark to process this data, we need to create a vector out of it. To do this, we use the VectorAssembler that was imported above. Basically, the VectorAssembler takes the data and moves it into a simple Vector. We take the first 11 columns, since the “quality” column should serve as our Label. The Label is the value we later want to predict. We name this Vector “features” and transform the data.

va = VectorAssembler(inputCols=data.columns[:11], outputCol="features")
adj = va.transform(data)
adj.show()

The new Dataset – called “adj” – now has an additional column named “features”. For Machine Learning, we only need the features, so we can get rid of the other data columns. Also, we want to rename the column “quality” to “label” to make it clear on what we are working with.

lab = adj.select("features", "quality")
training_data = lab.withColumnRenamed("quality", "label")

Now, the dataframe should be cleaned and we are ready for the Linear Regression in Spark!

Running the Linear Regression

First, we create the Linear Regression. We set the maximum Iterations to 30, the ElasticNet mixing Parameter to 0.3 and the regularization parameter to 0.3. Also, we need to make sure to set the features column to “features” and the label column to “label”. Once the Linear Regression is created, we fit the training data into it. After that, we create our predictions with the “transform” function. The code for that is here:

lr = LinearRegression(maxIter=30, regParam=0.3, elasticNetParam=0.3, featuresCol="features", labelCol="label")
lrModel = lr.fit(training_data)
predictionsDF = lrModel.transform(training_data)
predictionsDF.show()

This should now create a new dataframe with the features, the label and the prediction. When you review your output, it already predicts quite reasonable values for a wine:

+--------------------+-----+------------------+
|            features|label|        prediction|
+--------------------+-----+------------------+
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[7.2,0.23,0.32,8....|    6| 5.793638052734819|
|[8.1,0.28,0.4,6.9...|    6| 5.794350562842575|
|[6.2,0.32,0.16,7....|    6|5.6645781552987655|
|[7.0,0.27,0.36,20...|    6| 5.546350842823183|
|[6.3,0.3,0.34,1.6...|    6|5.6602634543897645|
|[8.1,0.22,0.43,1....|    6| 6.020023174935914|
|[8.1,0.27,0.41,1....|    5| 6.178863965783833|
|[8.6,0.23,0.4,4.2...|    5| 5.756611684447172|
|[7.9,0.18,0.37,1....|    5| 6.012659811971332|
|[6.6,0.16,0.4,1.5...|    7| 6.343695124494296|
|[8.3,0.42,0.62,19...|    5| 5.605663225763592|
|[6.6,0.17,0.38,1....|    7| 6.139779557853963|
|[6.3,0.48,0.04,1....|    6| 5.537802384697061|
|[6.2,0.66,0.48,1....|    8| 6.028338973062226|
|[7.4,0.34,0.42,1....|    6|5.9853604241636615|
|[6.5,0.31,0.14,7....|    5| 5.652874078868445|
+--------------------+-----+------------------+
only showing top 20 rows

You could now go into a supermarket of your choice, acquire a wine and fit the data of the wine into your model. The model would tell you how good the wine is and whether you should buy it or not.

This is already our first linear regression with Spark – a very easy model. However, there is much more to learn:

  • We would need to understand the standard deviation of this model and how accurate it is. If you review some predictions, we are not very accurate at all, so the model needs to be tweaked (a hedged evaluation sketch follows this list)
  • We will later compare different ML algorithms and build a pipeline
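
As a first step in that direction, Spark’s RegressionEvaluator can compute standard error metrics on the prediction dataframe. This is a minimal sketch, reusing the predictionsDF from above; the choice of metrics is just an illustration:

from pyspark.ml.evaluation import RegressionEvaluator

# Quantify the fit of the model on the training predictions.
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print("RMSE:", evaluator.evaluate(predictionsDF))
print("R2:", evaluator.setMetricName("r2").evaluate(predictionsDF))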

However, it is good for a start!

This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.

Spark ML is Apache Spark’s answer to machine learning and data science. The library has several powerful features for typical machine learning and data science tasks. In the following posts I will introduce Spark ML.

What is Spark ML?

The goal of MLlib is to solve complex Machine Learning and Data Science tasks through an easy API. Basically, Spark provides a Dataframe-based API for common Machine Learning tasks. These include different machine learning algorithms, options for feature engineering and data transformations, persisting models and different mathematical utilities.

A key concept in Data Science is the pipeline, and a comprehensive pipeline library is also included. Pipelines are used to abstract the work with Machine Learning models and the data around them. I will explain the concept of Pipelines in a later post with some examples. Basically, Pipelines enable us to use different algorithms in one workflow alongside the data.
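
To make this less abstract, here is a minimal, hedged sketch of a Pipeline that chains a feature transformer and a model into one workflow. The column names (“x1”, “x2”, “y”) and the dataframe training_df are assumptions for illustration:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Stage 1: assemble raw columns into a single feature vector.
va = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
# Stage 2: fit a regression on that vector.
lr = LinearRegression(featuresCol="features", labelCol="y")

# Both stages wrapped into one workflow.
pipeline = Pipeline(stages=[va, lr])
# model = pipeline.fit(training_df)       # training_df is assumed to exist
# predictions = model.transform(training_df)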

Feature Engineering in MLlib

The library also includes several aspects of feature engineering. Basically, this is something that every data science process contains. These tasks include:

  • Feature Extraction: extracting features from raw data. This is for instance converting text to a vector.
  • Feature Transformers: transforming features to a different state. This includes Scalers and alike.
  • Feature Selection: selecting features, for instance with a VectorSlicer

The library basically gives you a lot of possibilities for Feature engineering. I will explain Feature Engineering capabilities also in a later tutorial.
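
As a small, hedged illustration of a feature transformer, the following sketch scales an assembled feature vector; a dataframe df with a vector column named “features” is an assumption here:

from pyspark.ml.feature import StandardScaler

# Scale the "features" vector to unit standard deviation.
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
# scaled_df = scaler.fit(df).transform(df)   # df is assumed to exist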

Machine Learning Models

Of course, the core task of the machine learning library is on Machine Learning models itself. There are a large number of standard algorithms available for Clustering, Regression and Classification. We will use different algorithms over the next couple of posts, so stay tuned for more details about them. In the next post, we will create a first model with a Linear Regression. In order to get started, please make sure to setup your environment based on this tutorial. If you haven’t heard of a Linear Regression, I recommend you reading the introduction to the linear regression first.

This tutorial is part of the Apache Spark MLlib Tutorial. If you are not yet familiar with Spark or Python, I recommend you first reading the tutorial on Spark and the tutorial on Python. Also, you need to understand the core concepts of Machine Learning, which you can learn in this tutorial. Also, you might refer to the official Apache Spark ML documentation.

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two concepts: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose neural networks. The entry point is the input layer, which is followed by several hidden layers and an output layer. Each layer has a connection to the previous layer, one-way only, so that nodes can’t form a cycle. The information in a feedforward network only moves in one direction: from the input layer, through the hidden layers, to the output layer. It is the easiest version of a Neural Network. The image below illustrates the Feedforward Neural Network.

Feedforward Neural Network
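
To make the one-way information flow concrete, here is a minimal, hedged numpy sketch of a forward pass through one hidden layer; all sizes and weights are arbitrary illustration values:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)            # input layer (3 features)
W1 = rng.normal(size=(4, 3))      # weights into the hidden layer
W2 = rng.normal(size=(1, 4))      # weights into the output layer

hidden = np.maximum(0.0, W1 @ x)  # hidden layer with ReLU activation
output = W2 @ hidden              # output layer; no cycles anywhere
print(output)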

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in image recognition and similar tasks. For that reason it is also well suited for video processing. The difference to the Feedforward neural network is that the CNN works in 3 dimensions: width, height and depth. Also, not all neurons in one layer are fully connected to the neurons in the next layer. There are three different types of layers in a Convolutional Neural Network, which also set it apart from feedforward neural networks:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colors or objects, and the feature map is created out of them. The deeper the network goes, the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.

Rectified linear units (ReLU)

The goal of this layer is to improve training speed and effectiveness: negative values in the layers are set to zero.
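
A hedged one-line illustration of what ReLU does to a batch of values:

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(np.maximum(0.0, x))  # [0. 0. 0. 1.5] -- negatives are removed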

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling, which reduces the number of parameters that the network needs to learn. In convolutional neural networks, the operation is useful since the outgoing connections usually receive similar information.
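
As a hedged sketch, 2x2 max pooling reduces each 2x2 patch of a feature map to its maximum, shrinking a toy 4x4 map to 2x2:

import numpy as np

fmap = np.arange(16.0).reshape(4, 4)                 # a toy 4x4 feature map
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))   # 2x2 max pooling
print(pooled)  # [[ 5.  7.] [13. 15.]]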

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.


In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. In this post, I will give an introduction to deep learning, which over the last couple of years has been the hype around AI. But what is so exciting about Deep Learning? First, let’s have a look at its concepts.

A brief introduction to Deep Learning

Basically, Deep Learning should function similarly to the human brain. Everything is built around Neurons, which work in networks (neural networks). The smallest element in a neural network is the neuron, which takes an input parameter and creates an output parameter, based on the bias and weight it has. The following image shows the Neuron in Deep Learning:

The Neuron in a Neural Network in Deep Learning
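
A minimal, hedged numpy sketch of such a neuron: weighted inputs plus a bias, passed through an activation function (ReLU here). All numbers are arbitrary illustration values:

import numpy as np

def neuron(x, w, b):
    # weighted sum of the inputs plus bias, then ReLU activation
    return np.maximum(0.0, np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, 0.4])    # weights
b = 0.2                          # bias
print(neuron(x, w, b))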

Next, there are Layers in the Network, which consist of several Neurons. Each Layer applies some transformations that will eventually lead to an end result, getting closer to the target result with every Layer. If your Deep Learning model is built to recognise handwriting, the first layer would probably recognise gray-scales, the second layer connections between different pixels, the third layer simple figures, and the fourth layer the letter itself. The following image shows a typical neural net:

A neural net for Deep Learning

A typical workflow in a neural net calculation for image recognition could look like this (a minimal sketch follows the list):

  • All images are split into batches
  • Each batch is sent to the GPU for calculation
  • The model starts the analysis with random weights
  • A cost function gets specified that compares the results with the truth
  • Backpropagation of the result happens
  • Once a model calculation is finished, the result is merged and returned
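
Here is a minimal, hedged numpy sketch of one such pass for a single batch: a forward pass with random weights, a mean-squared-error cost against the truth, and one backpropagated gradient update (a plain linear model stands in for the network, for brevity):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # one batch of 8 samples
y = rng.normal(size=(8,))             # the "truth"
w = rng.normal(size=(3,))             # random initial weights

pred = X @ w                          # forward pass
cost = np.mean((pred - y) ** 2)       # cost function vs. the truth
grad = 2 * X.T @ (pred - y) / len(y)  # backpropagated gradient
w -= 0.1 * grad                       # one learning step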

How is it different to Machine Learning?

Although Deep Learning is often considered to be a “subset” of Machine Learning, it is quite different. In many areas, Deep Learning achieves better results than “traditional” machine learning models. The following table provides an overview of these differences:

Machine Learning | Deep Learning
Feature extraction happens manually | Feature extraction is done automatically
Features are used to create a model that categorises elements | Performs “end-to-end learning”
Shallow learning | Deep learning algorithms scale with data

This is only the basic overview of Deep Learning. Deep Learning knows several different methods. In the next tutorial, we will have a look at different interpretations of Deep Learning.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In the first posts, I introduced different types of Machine Learning concepts. One of them is classification. Basically, classification is about identifying to which set of categories a certain observation belongs. Classification is normally a supervised learning technique. A typical classification task is spam detection in e-mails – the two possible classes in this case are either “spam” or “no spam”. The two most common classification algorithms are the Naive Bayes classification and the Random Forest classification.

What classification algorithms are there?

Basically, there are a lot of classification algorithms available, and when working in the field of Machine Learning you will constantly discover new ones. In this tutorial, we will only focus on the two most important ones (Random Forest, Naive Bayes) and the basic one (Decision Tree).

The Decision Tree classifier

The basic classifier is the Decision Tree classifier. It builds classification models in the form of a tree structure: the dataset is broken down into smaller subsets, with each leaf refining the result. It can be compared to a survey, where each answer determines the next question. Let’s assume the following case: Tom was captured by the police and is a suspect in robbing a bank. The questions could form the following tree structure:

Basic sample of a Decision Tree

Basically, by going from one leaf to the next, you get closer to the result of either “guilty” or “not guilty”. Also, each leaf has a weight.
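
A minimal, hedged sketch with scikit-learn: two made-up yes/no features per suspect (say, “was near the bank” and “has an alibi”) and a guilty / not-guilty label; all data is pure illustration:

from sklearn.tree import DecisionTreeClassifier

X = [[1, 0], [1, 1], [0, 1], [0, 0]]  # [near the bank, has an alibi]
y = ["guilty", "not guilty", "not guilty", "not guilty"]

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1, 0]]))  # walks the learned questions down the tree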

The Random Forest classification

Random forest is a really great classifier: often used and often very efficient. It is an ensemble classifier built from many decision tree models, where the ensemble combines the different results. The random forest model can run both regression and classification models.

Basically, it divides the data set into subsets and then runs the trees on them. Random forest models run efficiently on large datasets, since the computation can be split up and the model is thus easy to run in parallel. It can handle thousands of input variables without variable deletion. It computes proximities between pairs of cases that can be used in clustering, for locating outliers or (by scaling) to give interesting views of the data.

There are also some disadvantages with the random forest classifier: the main problem is its complexity. Working with random forest is more challenging than classic decision trees and thus needs skilled people. Also, the complexity creates large demands for compute power.

Random Forest is often used by financial institutions. A typical use-case is credit risk prediction. If you have ever applied for a credit, you might know the questions being asked by banks. They are often fed into random forest models.
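
A hedged scikit-learn sketch of the ideas above: an ensemble of 100 decision trees on synthetic data, with n_jobs=-1 splitting the work across all cores (the parallelism advantage mentioned above). A real use case would be, e.g., credit records:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # accuracy on the training data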

The Naive Bayes classifier

The Naive Bayes classifier is based on prior knowledge of conditions that might relate to an event. It is based on the Bayes Theorem, and a strong independence between features is assumed. It uses categorical data to calculate ratios between events.

The benefits of Naive Bayes are manifold. It can predict classes of data sets easily and fast, and it can also handle multiple classes. Naive Bayes often performs well compared to models such as logistic regression, while needing a lot less training data.

A key challenge is that if a categorical variable has a category that was not present in the training data set, the model will assign it a zero probability, which makes a prediction impossible. This is known as the zero-frequency problem and is usually addressed with smoothing techniques such as Laplace smoothing. Also, Naive Bayes is known to be a rather bad probability estimator.
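A minimal sketch in R, using the naiveBayes function from the e1071 package (again on the iris dataset); for categorical features, its laplace argument applies the Laplace smoothing mentioned above:

library(e1071)

# train a Naive Bayes classifier (numeric features are modelled as Gaussians)
fit <- naiveBayes(Species ~ ., data = iris)

# predict the class of a few observations (one of each species)
predict(fit, iris[c(1, 51, 101), ])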

As stated, there are many more algorithms available. In the next tutorial, we will have a look at Deep Learning.



In the previous tutorial posts, we looked at the Linear Regression and discussed some basics of statistics such as the Standard Deviation and the Standard Error. Today, we will look at the Logistic Regression. It is similar in name to the linear regression, but different in usage. Let’s have a look.

The Logistic Regression explained

One of the main differences between the Logistic Regression and the Linear Regression is that the logistic regression is binary – it calculates values between 0 and 1 and thus states whether something is rather true or false. This means that the result of a prediction could be “fail” or “succeed” for a test. In a churn model, this would mean that a customer either stays with the company or leaves the company.

Another key difference to the Linear Regression is that the regression curve can’t be calculated analytically. Therefore, in the Logistic Regression, the regression curve is “estimated” and optimised iteratively. The standard mathematical approach for this estimation is the “Maximum Likelihood Method”. Normally, these parameters are calculated by Machine Learning tools, so you don’t have to do it by hand.

Another aspect is the concept of “odds”. Basically, the odds of a certain event happening or not happening are calculated. This could be a certain team winning a soccer game: let’s assume that Team X wins 7 out of 10 games (thus losing 3; we ignore draws). The odds in this case would be 7:3 on winning or 3:7 on losing – whereas the probability of winning would be 7 out of 10.
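As a minimal sketch, here is a logistic regression in R using the built-in glm function with a binomial family on the mtcars dataset, predicting whether a car has a manual transmission from its weight:

# logistic regression: does a car have a manual transmission (am = 1)?
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)  # coefficients are estimated via maximum likelihood

# predicted probability for a car weighing 2.5 (in 1000 lbs)
predict(fit, data.frame(wt = 2.5), type = "response")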

This time, we won’t calculate the Logistic Regression by hand, since the calculation would be far too long. In the next tutorial, I will focus on classifiers such as Random Forest and Naive Bayes.


In my previous posts, we covered some fundamentals of machine learning and looked at the linear regression. Today, we will look at another statistical topic: false positives and false negatives. You will come across these terms quite often when working with data, so let’s have a look at them.

The false positive

In statistics, there is one error called the false positive error. This happens when the prediction states something to be true, but in reality it is false. An easy way to remember the false positive is to think of it as a false alarm. A simple example is the airport security check: when you pass the security check, you have to walk through a metal detector. If you don’t carry any metal items (since you left them for the x-ray!), no alarm will go off. But in some rather rare cases, the alarm might still go off – either you forgot something or the metal detector had an error. In this case, it is a false positive: the metal detector predicted that you have metal items with you, but in fact you don’t.

Another example of a false positive in machine learning would be image recognition: imagine your algorithm is trained to recognise cats. There are so many cat pictures on the web, so it is easy to train this algorithm. However, if you then feed the algorithm the image of a dog and the algorithm calls it a cat, even though it is a dog, this again is a false positive.

In a business context, your algorithm might predict that a specific customer is going to buy a certain product for sure, but in fact this customer didn’t buy it. Again, we have a false positive. Now, let’s have a look at the other error: the false negative.

The false negative

The other error in statistics is the false negative. Like the false positive, it is something that should be avoided. It is very similar to the false positive, just the other way around. Let’s look at the airport example one more time: you wear a metal item (such as a watch) and go through the metal detector – you simply forgot to take off the watch. And the metal detector doesn’t go off this time. Now, you are a false negative: the metal detector stated that you don’t carry any metal items, but in fact you did. A condition was predicted to be false, but in fact it was true.
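Both error types can be counted in a so-called confusion matrix. Here is a minimal sketch in base R with hypothetical predicted and actual labels:

# hypothetical actual and predicted labels (1 = positive, 0 = negative)
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# confusion matrix: the off-diagonal cells are the errors
# predicted = 1 / actual = 0  ->  false positive
# predicted = 0 / actual = 1  ->  false negative
table(Predicted = predicted, Actual = actual)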

False positive and false negative rates are often used to score the quality of a model. Now that you understand some of the most important basics of statistics, we will have a look at another machine learning algorithm in my next post: the logistic regression.



In our last tutorial, we learned how to fit a Linear Regression model by hand. Also, we had a look at the prediction error and the standard error. Today, we want to focus on a way to measure the performance of a model. In marketing, a common methodology for this is lift and gain charts. They can also be used for other things, but in today’s sample we will use a marketing scenario.

The marketing scenario for Lift and Gain charts

Let’s assume that you are in charge of an outbound call campaign. Basically, your goal is to increase the conversions of the people contacted via this campaign. Like with most campaigns, you have a certain – limited – budget and thus need to plan the campaign smartly. This is where machine learning comes into play: you only want to contact those people that are most likely to buy the product. Therefore, you contact the top X percent of customers where you expect a conversion and avoid contacting those customers that are very unlikely to convert. We assume that you have already built a model for that and that we now run the campaign. We will measure our results with a gain chart, but first let’s create some data.

Our sample data represents all our customers, grouped into deciles. Basically, we group the customers into the top 10%, top 20%, … until we reach all customers. We add the number of conversions as well:

| Decile | # of Customers | Conversions |
|--------|----------------|-------------|
| 1      | 200            | 33          |
| 2      | 200            | 30          |
| 3      | 200            | 27          |
| 4      | 200            | 25          |
| 5      | 200            | 23          |
| 6      | 200            | 19          |
| 7      | 200            | 15          |
| 8      | 200            | 11          |
| 9      | 200            | 7           |
| 10     | 200            | 2           |

As you can see in the above table, the first decile contains the most conversions and is thus our top group. The share of the total conversions that each group contributes, in percent, is:

| Decile | % of Total Conversions |
|--------|------------------------|
| 1      | 17.2%                  |
| 2      | 15.6%                  |
| 3      | 14.1%                  |
| 4      | 13.0%                  |
| 5      | 12.0%                  |
| 6      | 9.9%                   |
| 7      | 7.8%                   |
| 8      | 5.7%                   |
| 9      | 3.6%                   |
| 10     | 1.0%                   |

As you can see, the top 10% of customers account for 17.2% of all conversions, and the share declines from group to group. So, the best approach is to contact the top customers first. As a next step, we add the cumulative conversions. This number is then used for our cumulative gain chart.

| Decile | Cumulative % of Conversions |
|--------|-----------------------------|
| 1      | 17.2%                       |
| 2      | 32.8%                       |
| 3      | 46.9%                       |
| 4      | 59.9%                       |
| 5      | 71.9%                       |
| 6      | 81.8%                       |
| 7      | 89.6%                       |
| 8      | 95.3%                       |
| 9      | 99.0%                       |
| 10     | 100.0%                      |

Cumulative Gain Chart

With this data, we can now create the cumulative gain chart. In our case, this would look like the following:

A cumulative gain chart
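If you want to reproduce the numbers and the chart yourself, here is a minimal sketch in base R, using the conversion counts from the table above:

# conversions per decile from the table above
conversions <- c(33, 30, 27, 25, 23, 19, 15, 11, 7, 2)

# share of total conversions per decile and the cumulative gain
pct_conversions <- conversions / sum(conversions) * 100
cum_gain <- cumsum(pct_conversions)
round(cum_gain, 1)  # 17.2, 32.8, 46.9, 59.9, 71.9, 81.8, 89.6, 95.3, 99.0, 100.0

# cumulative gain chart with the random-selection baseline
plot(seq(10, 100, by = 10), cum_gain, type = "b",
     xlab = "% of customers contacted", ylab = "cumulative % of conversions")
abline(0, 1, lty = 2)  # baseline: random selection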

The Lift factor

Now, let’s have a look at the lift factor. The baseline for the lift factor is always a lift of 1, which corresponds to selecting a random sample with no structured approach. Basically, the lift factor is the ratio between the cumulative percentage of conversions and the percentage of customers contacted. With our sample data, the lift looks like the following:

| Decile | Lift |
|--------|------|
| 1      | 1.72 |
| 2      | 1.64 |
| 3      | 1.56 |
| 4      | 1.50 |
| 5      | 1.44 |
| 6      | 1.36 |
| 7      | 1.28 |
| 8      | 1.19 |
| 9      | 1.10 |
| 10     | 1.00 |

Thus, we have a lift factor of 1.72 for the first decile, decreasing towards 1.00 for the full customer set.
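Continuing the sketch from above, the lift factors can be computed in one line:

# lift = cumulative % of conversions / % of customers contacted
lift <- cum_gain / seq(10, 100, by = 10)
round(lift, 2)  # 1.72, 1.64, 1.56, 1.50, 1.44, 1.36, 1.28, 1.19, 1.10, 1.00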

In this tutorial, we’ve learned how to measure the performance of a machine learning model. In the next tutorial, we will have a look at false positives and some other important topics before moving on to the Logistic Regression.


In my previous posts, I explained the Linear Regression and stated that it comes with some errors: the error of prediction (for individual predictions) and the standard error. A prediction is good if the individual errors of prediction and the standard error are small. Let’s start by examining the errors of prediction and then use them to calculate the standard error of our linear regression model.

Error of prediction in Linear Regression

Let’s recall the table from the previous tutorial:

| Year | Ad Spend (X)   | Revenue (Y)      | Prediction (Y’)  |
|------|----------------|------------------|------------------|
| 2013 | € 345.126,00   | € 41.235.645,00  | € 48.538.859,48  |
| 2014 | € 534.678,00   | € 62.354.984,00  | € 65.813.163,80  |
| 2015 | € 754.738,00   | € 82.731.657,00  | € 85.867.731,47  |
| 2016 | € 986.453,00   | € 112.674.539,00 | € 106.984.445,76 |
| 2017 | € 1.348.754,00 | € 156.544.387,00 | € 140.001.758,86 |
| 2018 | € 1.678.943,00 | € 176.543.726,00 | € 170.092.632,46 |
| 2019 | € 2.165.478,00 | € 199.645.326,00 | € 214.431.672,17 |

We can see that there is a clear difference between the prediction and the actual numbers. We calculate the error in each prediction by taking the real value minus the prediction:

| Y − Y’            |
|-------------------|
| -€ 7.303.214,48   |
| -€ 3.458.179,80   |
| -€ 3.136.074,47   |
| € 5.690.093,24    |
| € 16.542.628,14   |
| € 6.451.093,54    |
| -€ 14.786.346,17  |

In the above table, we can see how each prediction differs from the real value. This is our prediction error on the actual values.

Calculating the Standard Error

Now, we want to calculate the standard error. First, let’s have a look at the formula:

SE = √( Σ (Y − Y’)² / N )

Basically, we take the sum of all squared errors, divide it by the number of observations and take the square root of the result. We already have Y − Y’ calculated, so we only need to square it:

| Y − Y’            | (Y − Y’)²                  |
|-------------------|----------------------------|
| -€ 7.303.214,48   | € 53.336.941.686.734,40    |
| -€ 3.458.179,80   | € 11.959.007.558.032,20    |
| -€ 3.136.074,47   | € 9.834.963.088.101,32     |
| € 5.690.093,24    | € 32.377.161.053.416,10    |
| € 16.542.628,14   | € 273.658.545.777.043,00   |
| € 6.451.093,54    | € 41.616.607.923.053,70    |
| -€ 14.786.346,17  | € 218.636.033.083.835,00   |

The sum of these squared errors is 641.419.260.170.216,00 €.

And N is 7, since our dataset contains 7 elements. Divided by 7, this gives 91.631.322.881.459,50 €.

The last step is to take the square root, which results in a standard error of 9.572.425,13 € for our linear regression.
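As a minimal sketch, the same calculation can be done in R, using the values from the tables above:

# actual revenues (Y) and predictions (Y') from the table above
y     <- c(41235645, 62354984, 82731657, 112674539,
           156544387, 176543726, 199645326)
y_hat <- c(48538859.48, 65813163.80, 85867731.47, 106984445.76,
           140001758.86, 170092632.46, 214431672.17)

errors <- y - y_hat                                 # errors of prediction
standard_error <- sqrt(sum(errors^2) / length(y))   # SE = sqrt(sum((Y - Y')^2) / N)
standard_error                                      # ~ 9,572,425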

Now, we have most items cleared for our linear regression and can move on to the logistic regression in our next tutorial.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.