A colleague of mine, Vivien Roussez, wrote a nice library in R to predict time series. The package is called “autoTS” and provides a high-level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. You can find the package as an open source project on GitHub. Over the last few weeks we have seen a lot of data science work prompted by the coronavirus pandemic. One of the challenges there is choosing the right forecasting method.

Introduction to autoTS

by Vivien Roussez

The autoTS package provides a high-level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. The main goals of the package are:

  • Simplify the preparation of the time series;
  • Train the algorithms and compare their results, to choose the best one;
  • Gather the results in a final tidy dataframe.

What are the inputs?

The package is designed to work on one time series at a time. Parallel calculations can be put on top of it (see example below). The user has to provide two simple vectors:

  • One with the dates (in a form that the lubridate package can parse)
  • The second with the corresponding values
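
As a minimal sketch of what these two inputs can look like, here is a quarterly series built with base R only. The numbers are invented for illustration, and the names `dates` and `values` are simply our own choice.

```r
# Two plain vectors of equal length: one with dates, one with values.
# The numbers are made up for illustration only.
dates  <- seq(as.Date("2018-01-01"), by = "quarter", length.out = 8)
values <- c(101.2, 103.5, 99.8, 107.1, 104.0, 106.3, 102.9, 110.4)

# Basic sanity checks before handing the vectors to any forecasting tool
stopifnot(length(dates) == length(values), !anyNA(dates), is.numeric(values))

# Side-by-side view of the two inputs
data.frame(dates, values)
```

Dates stored as `Date` objects like these need no further parsing; character dates would first have to be converted.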

Warnings

This package implements each algorithm with a single, fixed parametrization, meaning that the user cannot tweak the algorithms (e.g. modify SARIMA-specific parameters).

Example on real-world data

Before getting started, you need to have the “autoTS” package installed. The following code then loads it together with the other packages used in this example:

knitr::opts_chunk$set(warning = F,message = F,fig.width = 8,fig.height = 5)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
library(autoTS)

For this example, we will use the quarterly GDP data of the European countries provided by Eurostat. The database can be downloaded from this page: choose “GDP and main components (output, expenditure and income) (namq_10_gdp)”, adjust the time dimension to select all available data, and download it as a csv file with the correct number formatting (1 234.56). The csv file is in the “Data” folder of this notebook.

dat <- read.csv("Data/namq_10_gdp_1_Data.csv")
str(dat)
## 'data.frame':    93456 obs. of  7 variables:
##  $ TIME              : Factor w/ 177 levels "1975Q1","1975Q2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GEO               : Factor w/ 44 levels "Albania","Austria",..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ UNIT              : Factor w/ 3 levels "Chain linked volumes (2010), million euro",..: 2 2 2 2 3 3 3 3 1 1 ...
##  $ S_ADJ             : Factor w/ 4 levels "Calendar adjusted data, not seasonally adjusted data",..: 4 2 1 3 4 2 1 3 4 2 ...
##  $ NA_ITEM           : Factor w/ 1 level "Gross domestic product at market prices": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Value             : Factor w/ 19709 levels "1 008.3","1 012.9",..: 19709 19709 19709 19709 19709 19709 19709 19709 19709 19709 ...
##  $ Flag.and.Footnotes: Factor w/ 5 levels "","b","c","e",..: 1 1 1 1 1 1 1 1 1 1 ...
head(dat)
##     TIME                                       GEO
## 1 1975Q1 European Union - 27 countries (from 2019)
## 2 1975Q1 European Union - 27 countries (from 2019)
## 3 1975Q1 European Union - 27 countries (from 2019)
## 4 1975Q1 European Union - 27 countries (from 2019)
## 5 1975Q1 European Union - 27 countries (from 2019)
## 6 1975Q1 European Union - 27 countries (from 2019)
##                                   UNIT
## 1 Chain linked volumes, index 2010=100
## 2 Chain linked volumes, index 2010=100
## 3 Chain linked volumes, index 2010=100
## 4 Chain linked volumes, index 2010=100
## 5         Current prices, million euro
## 6         Current prices, million euro
##                                                                           S_ADJ
## 1 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 2                          Seasonally adjusted data, not calendar adjusted data
## 3                          Calendar adjusted data, not seasonally adjusted data
## 4                                         Seasonally and calendar adjusted data
## 5 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 6                          Seasonally adjusted data, not calendar adjusted data
##                                   NA_ITEM Value Flag.and.Footnotes
## 1 Gross domestic product at market prices     :
## 2 Gross domestic product at market prices     :
## 3 Gross domestic product at market prices     :
## 4 Gross domestic product at market prices     :
## 5 Gross domestic product at market prices     :
## 6 Gross domestic product at market prices     :

Data preparation

First, we have to clean the data (it is not too messy, though). The first step is to convert the TIME column into a well-known date format that lubridate can handle. In this example, the yq function can parse the dates without any modification of the column. Then, we have to remove the blanks that separate thousands in the values. Finally, we only keep data since 2000 and the unadjusted series in current prices.

After that, we should get one time series per country.

dat <- mutate(dat,dates=yq(as.character(TIME)),
              # str_remove_all drops every thousands separator; str_remove would only drop the first one
              values = as.numeric(stringr::str_remove_all(Value," "))) %>%
  filter(year(dates)>=2000 &
           S_ADJ=="Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)" &
           UNIT == "Current prices, million euro")
filter(dat,GEO %in% c("France","Austria")) %>%
  ggplot(aes(dates,values,color=GEO)) + geom_line() + theme_minimal() +
  labs(title="GDP of (completely) random countries")
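
To make the two cleaning steps more concrete, here is a small base R sketch applied to a few hand-written strings. The helpers `parse_quarter` and `parse_value` are hypothetical names introduced for this illustration only; they are not part of lubridate, stringr or autoTS. Values above one million contain more than one blank, so every blank has to be removed, not just the first one.

```r
# Hypothetical helper: turn "2000Q2" into the first day of that quarter
parse_quarter <- function(x) {
  yr  <- as.integer(substr(x, 1, 4))
  qtr <- as.integer(substr(x, 6, 6))
  as.Date(sprintf("%d-%02d-01", yr, (qtr - 1L) * 3L + 1L))
}

# Hypothetical helper: "1 234 567.8" -> 1234567.8; the ":" marker becomes NA
parse_value <- function(x) {
  x <- gsub(" ", "", as.character(x), fixed = TRUE)  # drop every blank
  x[x == ":"] <- NA_character_                       # ":" marks a missing value
  as.numeric(x)
}

parse_quarter(c("1975Q1", "2000Q2", "2019Q4"))
parse_value(c("1 008.3", "1 234 567.8", ":"))
```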

Now we’re good to go!

Prediction on a random country

Let’s see how to use the package on one time series:

  • Extract dates and values of the time series you want to work on
  • Create the object containing all you need afterwards
  • Train the algorithms and determine which one is the best (over the last known year)
  • Implement the best algorithm on full data

ex1 <- filter(dat,GEO=="France")
preparedTS <- prepare.ts(ex1$dates,ex1$values,"quarter")
## What is in this new object ?
str(preparedTS)
## List of 4
##  $ obj.ts    : Time-Series [1:77] from 2000 to 2019: 363007 369185 362905 383489 380714 ...
##  $ obj.df    :'data.frame':  77 obs. of  2 variables:
##   ..$ dates: Date[1:77], format: "2000-01-01" "2000-04-01" ...
##   ..$ val  : num [1:77] 363007 369185 362905 383489 380714 ...
##  $ freq.num  : num 4
##  $ freq.alpha: chr "quarter"
plot.ts(preparedTS$obj.ts)
ggplot(preparedTS$obj.df,aes(dates,val)) + geom_line() + theme_minimal()
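
For intuition about the obj.ts element, here is how a quarterly time series can be built by hand with base R's `ts()`. This is only a sketch and not the internal code of prepare.ts; the values are the first five observations of the French series printed above.

```r
# First five observations of the French series shown above
vals <- c(363007, 369185, 362905, 383489, 380714)

# Quarterly series starting in the first quarter of 2000
quarterly <- ts(vals, start = c(2000, 1), frequency = 4)

frequency(quarterly)                   # number of observations per year
cycle(quarterly)                       # quarter index of each observation
window(quarterly, start = c(2000, 3))  # subset from 2000 Q3 onwards
```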

Get the best algorithm for this time series:

## What is the best model for prediction ?
best.algo <- getBestModel(ex1$dates,ex1$values,"quarter",graph = F)
names(best.algo)
## [1] "prepedTS"     "best"         "train.errors" "res.train"    "algos"
## [6] "graph.train"
print(paste("The best algorithm is",best.algo$best))
## [1] "The best algorithm is my.ets"
best.algo$graph.train

The result of this function contains:

  • The name of the best model
  • The errors of each algorithm on the test set
  • The graphic of the train step
  • The prepared time series
  • The list of used algorithms (that you can customize)
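
The idea of choosing the best algorithm over the last known year can be sketched in a few lines of base R. This is not the autoTS implementation: the function `pick_best`, the two toy forecasters and the use of the mean absolute error are all assumptions made for this illustration.

```r
# Hold out the last `h` observations, fit each candidate on the rest,
# and keep the candidate with the lowest mean absolute error (MAE).
pick_best <- function(values, h = 4, candidates) {
  n      <- length(values)
  train  <- values[seq_len(n - h)]
  test   <- values[(n - h + 1):n]
  errors <- sapply(candidates, function(f) mean(abs(f(train, h) - test)))
  list(best = names(errors)[which.min(errors)], errors = errors)
}

# Two deliberately simple forecasters (placeholders, not autoTS algorithms)
candidates <- list(
  mean_fc     = function(train, h) rep(mean(train), h),
  seasonal_fc = function(train, h) rep(tail(train, 4), length.out = h)
)

# Synthetic quarterly series: linear trend plus a fixed seasonal pattern
values <- 100 + 2 * (1:24) + rep(c(-5, 0, 3, 8), times = 6)

res_best <- pick_best(values, h = 4, candidates = candidates)
res_best$best
res_best$errors
```

On this synthetic series the seasonal forecaster wins, because repeating last year's quarters tracks the seasonal pattern while the overall mean ignores both trend and seasonality.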

The result of getBestModel can be used directly as input to the my.predictions function:

## Build the predictions
final.pred <- my.predictions(bestmod = best.algo)
tail(final.pred,24)
## # A tibble: 24 x 4
##    dates      type  actual.value   ets
##    <date>     <chr>        <dbl> <dbl>
##  1 2015-04-01 <NA>        548987    NA
##  2 2015-07-01 <NA>        541185    NA
##  3 2015-10-01 <NA>        566281    NA
##  4 2016-01-01 <NA>        554121    NA
##  5 2016-04-01 <NA>        560873    NA
##  6 2016-07-01 <NA>        546383    NA
##  7 2016-10-01 <NA>        572752    NA
##  8 2017-01-01 <NA>        565221    NA
##  9 2017-04-01 <NA>        573720    NA
## 10 2017-07-01 <NA>        563671    NA
## # … with 14 more rows
ggplot(final.pred) + geom_line(aes(dates,actual.value),color="black") +
  geom_line(aes_string("dates",stringr::str_remove(best.algo$best,"my."),linetype="type"),color="red") +
  theme_minimal() 

Not too bad, right?

Scaling predictions

Let’s say we want to make a prediction for each country at the same time, as fast as possible: let’s combine the package’s functions with parallel computing. We have to reshape the data to get one column per country and then iterate over the columns of the data frame.

Prepare data

suppressPackageStartupMessages(library(tidyr))
dat.wide <- select(dat,GEO,dates,values) %>%
  group_by(dates) %>%
  spread(key = "GEO",value = "values")
head(dat.wide)
## # A tibble: 6 x 45
## # Groups:   dates [6]
##   dates      Albania Austria Belgium `Bosnia and Her… Bulgaria Croatia Cyprus
##   <date>       <dbl>   <dbl>   <dbl>            <dbl>    <dbl>   <dbl>  <dbl>
## 1 2000-01-01      NA  50422.   62261               NA    2941.   5266.  2547.
## 2 2000-04-01      NA  53180.   65046               NA    3252.   5811   2784.
## 3 2000-07-01      NA  53881.   62754               NA    4015.   6409.  2737.
## 4 2000-10-01      NA  56123.   68161               NA    4103.   6113   2738.
## 5 2001-01-01      NA  52911.   64318               NA    3284.   5777.  2688.
## 6 2001-04-01      NA  54994.   67537               NA    3669.   6616.  2946.
## # … with 37 more variables: Czechia <dbl>, Denmark <dbl>, Estonia <dbl>, `Euro
## #   area (12 countries)` <dbl>, `Euro area (19 countries)` <dbl>, `Euro area
## #   (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013,
## #   EA18-2014, EA19)` <dbl>, `European Union - 15 countries (1995-2004)` <dbl>,
## #   `European Union - 27 countries (from 2019)` <dbl>, `European Union - 28
## #   countries` <dbl>, Finland <dbl>, France <dbl>, `Germany (until 1990 former
## #   territory of the FRG)` <dbl>, Greece <dbl>, Hungary <dbl>, Iceland <dbl>,
## #   Ireland <dbl>, Italy <dbl>, `Kosovo (under United Nations Security Council
## #   Resolution 1244/99)` <dbl>, Latvia <dbl>, Lithuania <dbl>,
## #   Luxembourg <dbl>, Malta <dbl>, Montenegro <dbl>, Netherlands <dbl>, `North
## #   Macedonia` <dbl>, Norway <dbl>, Poland <dbl>, Portugal <dbl>,
## #   Romania <dbl>, Serbia <dbl>, Slovakia <dbl>, Slovenia <dbl>, Spain <dbl>,
## #   Sweden <dbl>, Switzerland <dbl>, Turkey <dbl>, `United Kingdom` <dbl>
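
The same long-to-wide reshaping can also be sketched with base R's `reshape()` on a tiny hand-made data frame. The object names `long` and `wide` are our own, and the figures are rounded values taken from the outputs above.

```r
# Long format: one row per (country, date) pair
long <- data.frame(
  GEO    = rep(c("France", "Austria"), each = 3),
  dates  = rep(as.Date(c("2000-01-01", "2000-04-01", "2000-07-01")), times = 2),
  values = c(363007, 369185, 362905, 50422, 53180, 53881)
)

# Wide format: one row per date, one column per country
wide <- reshape(long, idvar = "dates", timevar = "GEO", direction = "wide")
names(wide) <- sub("^values\\.", "", names(wide))  # drop the "values." prefix
wide

# Iterate over the country columns (everything except the dates)
sapply(wide[-1], mean)
```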

Compute bulk predictions

library(doParallel)
pipeline <- function(dates,values)
{
  pred <- getBestModel(dates,values,"quarter",graph = F)  %>%
    my.predictions()
  return(pred)
}
doMC::registerDoMC(parallel::detectCores()-1) # parallel backend (for UNIX)
system.time({
  res <- foreach(ii=2:ncol(dat.wide),.packages = c("dplyr","autoTS")) %dopar%
  pipeline(dat.wide$dates,pull(dat.wide,ii))
})
##    user  system elapsed
## 342.339   3.405  66.336
names(res) <- colnames(dat.wide)[-1]
str(res)
## List of 44
##  $ Albania                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Austria                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 50422 53180 53881 56123 52911 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Belgium                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 62261 65046 62754 68161 64318 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Bosnia and Herzegovina                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Bulgaria                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2941 3252 4015 4103 3284 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Croatia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5266 5811 6409 6113 5777 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Cyprus                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2547 2784 2737 2738 2688 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Czechia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 15027 16430 17229 18191 16677 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Denmark                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 42567 44307 43892 47249 44143 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Estonia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 1391 1575 1543 1662 1570 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (12 countries)                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (19 countries)                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19):Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 15 countries (1995-2004)                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 27 countries (from 2019)                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 28 countries                                                                :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Finland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 31759 33836 34025 36641 34474 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ France                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 363007 369185 362905 383489 380714 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Germany (until 1990 former territory of the FRG)                                             :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 515500 523900 536120 540960 530610 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Greece                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 33199 34676 37285 37751 35237 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Hungary                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 11516 12630 13194 13955 12832 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Iceland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2304 2442 2557 2447 2232 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Ireland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 25583 26751 27381 28666 29766 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Italy                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 292517 309098 298655 338996 309967 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Kosovo (under United Nations Security Council Resolution 1244/99)                            :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Latvia                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 1848 2165 2238 2382 2005 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Lithuania                                                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2657 3124 3267 3505 2996 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Luxembourg                                                                                   :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5646 5730 5689 6015 5811 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Malta                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 979 1110 1158 1152 1031 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Montenegro                                                                                   :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Netherlands                                                                                  :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 109154 113124 110955 118774 118182 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ North Macedonia                                                                              :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 901 1052 1033 1108 986 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Norway                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 44900 43730 46652 50638 48355 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Poland                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 41340 44210 46944 54163 47445 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Portugal                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 30644 31923 32111 33788 31927 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Romania                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 7901 9511 11197 11630 8530 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Serbia                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovakia                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5100 5722 5764 5752 5343 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovenia                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5147 5591 5504 5667 5407 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Spain                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 153378 162400 158526 171946 166204 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Sweden                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 67022 73563 68305 73399 66401 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Switzerland                                                                                  :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 70048 72725 74957 77476 76092 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Turkey                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 59944 70803 82262 82981 60075 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ United Kingdom                                                                               :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 438090 440675 446918 462127 441157 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
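
The doMC backend used above relies on forking, hence the “for UNIX” comment. As an alternative sketch, the same column-wise iteration can be written with `mclapply()` from base R's parallel package. The wrapper `run_all`, the toy data and the toy function are hypothetical and only stand in for dat.wide and pipeline; since `mclapply()` also forks, `cores` has to stay at 1 on Windows.

```r
library(parallel)

# Run `fun(dates, column)` for every non-date column of a wide data frame.
# `fun` stands in for the pipeline() function defined above.
run_all <- function(wide, fun, cores = 1L) {
  cols <- setdiff(names(wide), "dates")
  out  <- mclapply(cols, function(nm) {
    # Catch errors so that one failing series does not abort the whole batch
    tryCatch(fun(wide$dates, wide[[nm]]), error = function(e) NULL)
  }, mc.cores = cores)
  setNames(out, cols)
}

# Toy data and a toy "pipeline" that only returns the last observed value
toy <- data.frame(
  dates = seq(as.Date("2000-01-01"), by = "quarter", length.out = 4),
  A     = c(1, 2, 3, 4),
  B     = c(10, 20, 30, 40)
)
res_toy <- run_all(toy, function(dates, values) tail(values, 1))
str(res_toy)
```

Naming the result list inside the wrapper keeps each prediction attached to its column, and the `tryCatch` means a single problematic series yields `NULL` instead of stopping the run.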

There is no free lunch…

There is no best algorithm in general: it depends on the data!

sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) ) %>% table()
## .
##    bagged      bats       ets   prophet    sarima shortterm      stlm     tbats
##         3         4         5         7         6        12         4         3
sapply(res,function(xx) colnames(select(xx,-dates,-type,-actual.value)) )
##                                                                                       Albania
##                                                                                        "stlm"
##                                                                                       Austria
##                                                                                      "sarima"
##                                                                                       Belgium
##                                                                                   "shortterm"
##                                                                        Bosnia and Herzegovina
##                                                                                        "stlm"
##                                                                                      Bulgaria
##                                                                                       "tbats"
##                                                                                       Croatia
##                                                                                   "shortterm"
##                                                                                        Cyprus
##                                                                                      "sarima"
##                                                                                       Czechia
##                                                                                       "tbats"
##                                                                                       Denmark
##                                                                                     "prophet"
##                                                                                       Estonia
##                                                                                   "shortterm"
##                                                                      Euro area (12 countries)
##                                                                                     "prophet"
##                                                                      Euro area (19 countries)
##                                                                                     "prophet"
## Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19)
##                                                                                     "prophet"
##                                                     European Union - 15 countries (1995-2004)
##                                                                                     "prophet"
##                                                     European Union - 27 countries (from 2019)
##                                                                                     "prophet"
##                                                                 European Union - 28 countries
##                                                                                     "prophet"
##                                                                                       Finland
##                                                                                        "bats"
##                                                                                        France
##                                                                                         "ets"
##                                              Germany (until 1990 former territory of the FRG)
##                                                                                      "sarima"
##                                                                                        Greece
##                                                                                   "shortterm"
##                                                                                       Hungary
##                                                                                   "shortterm"
##                                                                                       Iceland
##                                                                                        "stlm"
##                                                                                       Ireland
##                                                                                       "tbats"
##                                                                                         Italy
##                                                                                         "ets"
##                             Kosovo (under United Nations Security Council Resolution 1244/99)
##                                                                                         "ets"
##                                                                                        Latvia
##                                                                                         "ets"
##                                                                                     Lithuania
##                                                                                   "shortterm"
##                                                                                    Luxembourg
##                                                                                      "sarima"
##                                                                                         Malta
##                                                                                   "shortterm"
##                                                                                    Montenegro
##                                                                                         "ets"
##                                                                                   Netherlands
##                                                                                   "shortterm"
##                                                                               North Macedonia
##                                                                                   "shortterm"
##                                                                                        Norway
##                                                                                      "sarima"
##                                                                                        Poland
##                                                                                      "bagged"
##                                                                                      Portugal
##                                                                                        "stlm"
##                                                                                       Romania
##                                                                                   "shortterm"
##                                                                                        Serbia
##                                                                                      "sarima"
##                                                                                      Slovakia
##                                                                                   "shortterm"
##                                                                                      Slovenia
##                                                                                        "bats"
##                                                                                         Spain
##                                                                                      "bagged"
##                                                                                        Sweden
##                                                                                        "bats"
##                                                                                   Switzerland
##                                                                                        "bats"
##                                                                                        Turkey
##                                                                                   "shortterm"
##                                                                                United Kingdom
##                                                                                      "bagged"
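
A long named vector like the one above is hard to read at a glance. As a minimal sketch, the selected algorithms can be tabulated to see which ones are chosen most often. The name `best_algos` is hypothetical and stands in for the vector printed above, and only a few of its entries are copied here to keep the example short.

```r
# Hypothetical name: `best_algos` stands in for the named vector printed above.
# Only the first few entries are copied here to keep the sketch short.
best_algos <- c(
  Albania  = "stlm",
  Austria  = "sarima",
  Belgium  = "shortterm",
  Bulgaria = "tbats",
  Croatia  = "shortterm",
  Cyprus   = "sarima",
  Denmark  = "prophet",
  Estonia  = "shortterm"
)

# Count how often each algorithm was selected, most frequent first
algo_counts <- sort(table(best_algos), decreasing = TRUE)
print(algo_counts)

# List the countries for which a given algorithm was selected
shortterm_countries <- names(best_algos)[best_algos == "shortterm"]
print(shortterm_countries)
```

Run on the full vector, the same two steps give a quick overview of which algorithms dominate and for which countries.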

We hope you enjoy working with this package to build your time series predictions in the future. You should now be able to extend your data science work on corona with time series predictions. If you want to learn more about data science, I recommend working through this tutorial.

Almost every company today talks about digital transformation and treats data as a main asset for it. The question is where to store this data. In a traditional database? In a data warehouse (DWH)? Ever heard about the datalake?

What is the datalake?

I think we should take a step back to answer this question. First of all, a Datalake is not a single piece of software. It consists of a large variety of platforms. Hadoop is a central one, but not the only one: a Datalake also includes tools such as Spark, Kafka and many more, as well as relational databases such as PostgreSQL. If we look at how truly digital companies such as Facebook, Google or Amazon solve these problems, the technology stack is also clear; in fact, they heavily use and contribute to Hadoop and similar technologies. So the answer is clear: you don’t need overly expensive DWHs any more.

However, many C-level executives might now say: “but we’ve invested millions in our DWH over the last years (or even decades)”. Here the question gets more complex. How should we treat our DWH? Should it be replaced, or should the DWH become the single source of truth while the Datalake is ignored? In my opinion, neither option is valid:

Can the datalake replace a data warehouse?

First, replacing a DWH and moving all data to a Datalake would be a massive project that ties up too many resources in a company. Finding people with adequate skills isn’t easy, so this can’t be the solution. In addition, hundreds of business KPIs have been built on the DWH, and many units within large enterprises base their decisions on them. Moving them to a Datalake will most likely break (important) business processes. Previous investments would also be written off. So a big-bang replacement is clearly a no-go.

Second, keeping everything in the DWH is not feasible. Modern tools such as Python, TensorFlow and many more aren’t well supported by proprietary software (or are at least supported only with a delay). From a skills perspective, most young professionals coming from university have learned technologies such as Spark, Hadoop and the like, so the skills shortage is easier to address by moving towards a Datalake.

I speak at a large number of international conferences; whenever I ask the audience if they want to work with proprietary DWH databases, no hands go up. If I ask them if they want to work with Datalake technologies, everyone raises their hand. The fact is that employees choose the company they want to work for, not vice versa. We have a skills shortage in this area, and anyone who ignores or does not accept that is simply wrong. Also, a DWH is far more expensive than a Datalake. So this option is not a valid one either.

What to do now?

So what is my recommendation or strategy? For large, established enterprises, it is a combination of both approaches, but with a clear path towards replacing the DWH in the long run. I am not a supporter of complex, long-running projects that are hard to control and track. Replacing the DWH should be a vision, not a project. This can be achieved by agile project management combined with a long-term strategy: new projects are implemented solely with Datalake technologies.

All future investments and platform implementations must use the Datalake as the single source of truth. Whenever existing KPIs and processes are renewed, it must be ensured that they are implemented on the Datalake and that the underlying data is shifted from the DWH to the Datalake. For this to succeed, strong metadata management and data governance must be in place; otherwise the Datalake will become a very messy place, a so-called data swamp.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Wikipedia describes the concept of the datalake very well.

Since there are many cloud providers out there and I often run into the problem of switching between different platforms (such as Google AppEngine, Amazon S3, …), I have decided to write a single client that will work with all of these platforms, or at least with as many as possible. I’ve created a project on Google Code here and I will start by writing a first draft of the interfaces. As a first step, I will include Amazon S3. I hope that more people will join this project and help me create a great project 😉