We started our tutorial with a general intro to Data Governance and then went a bit deeper into data security and data privacy. In this post, we will have a look at how to ensure a certain level of data quality in your data sets. Data quality is a very important aspect. Imagine you have wrong data about your customers and you build your marketing campaign on it. The campaign might return wrong results. This can damage your brand and turn away previously loyal customers. Therefore, data quality is essential.

How to measure data quality?

There are several ways to measure data quality. I’ve summarised them into 5 core metrics. If you browse different literature, you might find more or fewer metrics. However, these five metrics should give you a core understanding of data quality management.

The 5 dimensions of data quality

Availability

Availability states that data should be available. If we want to query all existing users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a term used in the CAP theorem. Here, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The algorithm querying the data should be as good as possible at retrieving all available data. There should be easy-to-use tools and languages to retrieve the data. Normally, each database provides developers with a query language such as SQL, or with O/R mappers.

Availability also means that the data needed for a specific use case should be available to data analysts in business units. Data relevant for a marketing campaign might exist but not be available for the campaign. For instance, the company might have specific customer data in the data warehouse, but the business units don’t know that the data actually exists.

Correctness & Completeness

Correctness means that data has to be correct. If we again query for all existing users on a web portal interested in luxury cars, the data about that should be correct. Correctness here means that the data should really represent people interested in luxury cars and that fake entries should be removed. A data set is also not correct if the user changed his or her address without the company knowing about it. Therefore, you need to track when each dataset was last updated.

Similar to correctness is completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can reach them somehow, e.g. by e-mail. If the e-mail field, or any other field we would like to use to reach our users, is blank, the data is not complete for our use case.
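
To make completeness measurable, one option is to compute the share of non-empty values per field. The following is a minimal sketch in base R; the customers table and its column names are made up for illustration, and treating blank strings as missing is an assumption that may not fit every data set.

```r
# Hypothetical customer table; the column names are assumptions for illustration.
customers <- data.frame(
  id    = 1:5,
  email = c("anna@example.com", "", NA, "ben@example.com", "  "),
  zip   = c("1010", "8010", NA, "4020", "5020"),
  stringsAsFactors = FALSE
)

# A value counts as present when it is neither NA nor blank after trimming.
is_present <- function(x) !is.na(x) & nzchar(trimws(as.character(x)))

# Share of present values per column (1 means fully complete).
completeness <- vapply(customers, function(col) mean(is_present(col)), numeric(1))
print(round(completeness, 2))
```

A field with a low share, such as the e-mail column in this toy table, would then be a candidate for a data quality rule or a clean-up effort.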

Timeliness

Data should be up to date. A user might change the e-mail address after a while and our database should reflect these changes whenever and wherever possible. If we target our users for luxury cars, it won’t be good at all if only 50% of the users’ e-mail addresses are correct. We might have “big data”, but the data is not correct since updates didn’t occur for a while.
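
Timeliness can be tracked in a similar way if each record carries a last-updated date. The sketch below is only an illustration in base R: the records table, the last_updated column and the 365-day threshold are all assumptions, not a recommendation.

```r
# Hypothetical records with the date each one was last updated (assumed column names).
records <- data.frame(
  id           = 1:4,
  last_updated = as.Date(c("2020-03-01", "2019-01-15", "2020-02-20", "2017-06-30"))
)

# Share of records updated within `max_age_days` of a reference date.
timeliness <- function(last_updated, reference_date, max_age_days = 365) {
  age_days <- as.numeric(reference_date - last_updated)
  mean(age_days <= max_age_days)
}

reference <- as.Date("2020-04-01")  # fixed so the example is reproducible
fresh_share <- timeliness(records$last_updated, reference)
print(fresh_share)
```

In practice the reference date would typically be the current date, and the acceptable age depends on the use case.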

Consistency

This shouldn’t be confused with the consistency requirement of the CAP theorem. Data might be duplicated, since users might register several times to get various benefits. A user might select “luxury cars” with one account and “budget cars” with another. Duplicate accounts lead to inconsistent data, which is a frequent problem in large web portals such as Facebook.
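
A first, rough check for this kind of inconsistency is to look for accounts that share the same normalised e-mail address and disagree on an attribute. The sketch below uses base R and invented data; matching on the e-mail address alone is an assumption, and real duplicate detection usually needs more signals than that.

```r
# Hypothetical accounts; column names and values are made up for illustration.
accounts <- data.frame(
  account_id = 1:5,
  email      = c("Anna@Example.com", "anna@example.com ", "ben@example.com",
                 "cara@example.com", "BEN@example.com"),
  interest   = c("luxury cars", "budget cars", "luxury cars",
                 "budget cars", "luxury cars"),
  stringsAsFactors = FALSE
)

# Normalise the e-mail address (trim and lower-case) to use as a matching key.
accounts$email_key <- tolower(trimws(accounts$email))

# Keys that appear more than once point to likely duplicate accounts.
dup_keys <- names(which(table(accounts$email_key) > 1))
duplicates <- accounts[accounts$email_key %in% dup_keys, ]

# Among the duplicates, flag keys whose accounts disagree on the stated interest.
conflicting <- vapply(split(duplicates$interest, duplicates$email_key),
                      function(x) length(unique(x)) > 1, logical(1))
print(duplicates)
print(conflicting)
```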

Understandability

It should be easy to understand data. If we query our database for people interested in luxury cars, we should be able to easily understand what the data is about. Once the data is returned, we should be able to use our favorite tool to work with it. The data should describe itself and we should know how to handle it. If the data returns a “zip” column, we know that this is the ZIP code of the area individual users are living in.

What can you do to improve your data quality?

Basically, it all starts with starting. You need to start tracking your data quality at some point and then continuously improve it. There are several tools that support your endeavour. But keep in mind: bad data creates bad decisions!

This tutorial is part of the Data Governance Tutorial. You can learn more about Data Governance by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In our previous tutorial intro, we outlined the four pillars that are relevant to data governance. In this post, I will go for a deeper dive into the data security and data privacy aspects of data governance.

What is data security?

Data security is all about securing the data against intrusions from the in- or outside of an organisation. Basically, it deals with hardening any systems that store data and making sure that data is only stored in a safe and secure way.

Data security comes in several layers:

  • Infrastructure: ensuring that the physical infrastructure is protected against any unwanted access. This starts with physical access control to servers and any devices associated with the organisation. This layer is only relevant when done on-premise.
  • Operating Systems and virtualisation: here it needs to be ensured that the operating system is in a secure state. If done on-premise, this covers the host OS, the guest OS and the virtualisation software. When done in the cloud, it only applies to IaaS.
  • Databases and Data Stores: any databases need to be constantly checked for vulnerabilities. If using any other stores such as object stores, they also need to be secured. This applies to on-premise and IaaS cloud solutions, but not to PaaS or SaaS cloud solutions.
  • Application Security: When building a software on top of the previous stack, it is necessary to write this software in a secure manner. This applies to both on-prem and cloud. When using PaaS or SaaS solutions, it is the only relevant security challenge for companies implementing it. Therefore, it is highly important to look for a comprehensive security concept on this layer.

What if you ignore it?

Data security issues are a frequent failure of companies. There are a lot of examples of data leaks, such as those at LinkedIn, Deutsche Telekom or Twitter. Almost nobody is secure, and thus this block needs to be taken into consideration at the highest level when building a data strategy. Experts argue that the question might not be whether an intrusion happens. The only question might be how long the organisation needs to realise it and thus take counter-measures and minimise the damage.

A key recommendation (but not the only one) is to encrypt all data, so that it is more challenging to get full access.

What is data privacy?

Another important block is data privacy. This deals more with the question of who can read or access the data within a company. Basically, algorithms and people should work with (pseudo-)anonymised data whenever possible. Analysts or Data Scientists shouldn’t see any personal information within the data that they are dealing with. If we take a marketing campaign, the analysts working with the data should only see the minimum data necessary to build the campaign. The marketing tool should then combine the results of their selection with the addresses of the targets. There are several tools available that obfuscate personally identifiable data (PID) and thus make working with it easier.
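
As a rough illustration of the campaign example, the sketch below replaces direct identifiers with a surrogate ID before the data reaches analysts. It uses base R and invented data; the sequential IDs and the key_table name are assumptions for readability, and dedicated tools would typically use stronger tokenisation. Note that pseudonymised data can still be linked back through the key table, so it is not the same as anonymised data.

```r
# Hypothetical customer data; names and columns are made up for illustration.
customers <- data.frame(
  name    = c("Anna Gruber", "Ben Huber", "Anna Gruber"),
  email   = c("anna@example.com", "ben@example.com", "anna@example.com"),
  segment = c("luxury cars", "budget cars", "luxury cars"),
  stringsAsFactors = FALSE
)

# Build a key table that maps each e-mail address to a surrogate ID.
# This table stays with the data owner or the campaign tool, not with analysts.
key_table <- data.frame(email = unique(customers$email), stringsAsFactors = FALSE)
key_table$pseudo_id <- sprintf("C%04d", seq_len(nrow(key_table)))

# The analyst view keeps the surrogate ID and drops the direct identifiers.
analyst_view <- data.frame(
  pseudo_id = key_table$pseudo_id[match(customers$email, key_table$email)],
  segment   = customers$segment,
  stringsAsFactors = FALSE
)
print(analyst_view)
```

The marketing tool can later join the analysts’ selection back to the addresses through the key table, without the analysts ever seeing them.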

The approach described above is also called the “need to know principle”. People should only see the data that they really need to know. When looking at how companies build their access rights to data, it is often done on a very individual basis. People ask for access, state why they need it and the data owner gives them access. However, this is rather manual and not necessarily fit for the new era of privacy.

A business driven role-based access model

A much better approach is to build on a role-based access model. Roles here don’t necessarily mean Active Directory roles; they are built on the business roles that users are in. For example, a role would be “Marketing Analyst”. A user in this role would get access to the specific data that he or she needs for the daily work. Access to all data that are relevant should be given, but nothing more than that. The roles in this approach should be clearly business-focused and not technology-focused.
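
The idea can be sketched as a mapping from business roles to the columns each role may see. Everything in the following base R example is hypothetical, including the role names beyond “Marketing Analyst”, the column names and the view_for_role helper; in a real setup this would be enforced in the database or data platform rather than in analysis code.

```r
# Hypothetical mapping from business roles to the columns they may see.
role_columns <- list(
  "Marketing Analyst" = c("pseudo_id", "segment", "zip"),
  "Campaign Manager"  = c("pseudo_id", "segment", "zip", "email")
)

# Return only the columns the role is entitled to; unknown roles get an error.
view_for_role <- function(data, role) {
  allowed <- role_columns[[role]]
  if (is.null(allowed)) stop("Unknown role: ", role)
  data[, intersect(names(data), allowed), drop = FALSE]
}

# Invented customer data for illustration.
customers <- data.frame(
  pseudo_id = c("C0001", "C0002"),
  email     = c("anna@example.com", "ben@example.com"),
  segment   = c("luxury cars", "budget cars"),
  zip       = c("1010", "8010"),
  stringsAsFactors = FALSE
)

print(view_for_role(customers, "Marketing Analyst"))
```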

Another key aspect in data privacy is to understand who was accessing what data. It is necessary to store a comprehensive audit log about all data access and thus make data breaches trackable.
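
A minimal sketch of such an audit log, again in base R, could wrap every read in a function that first records who accessed which columns. The read_with_audit helper, the log layout and the in-memory storage are assumptions for illustration only; a production log would live in protected, append-only storage.

```r
# In-memory audit log kept in an environment so the function can append to it.
audit <- new.env()
audit$log <- NULL

# Record who read which columns of which dataset, then return the data.
read_with_audit <- function(data, dataset, user, columns) {
  entry <- data.frame(
    timestamp = Sys.time(),
    user      = user,
    dataset   = dataset,
    columns   = paste(columns, collapse = ","),
    stringsAsFactors = FALSE
  )
  audit$log <- rbind(audit$log, entry)
  data[, columns, drop = FALSE]
}

# Invented data and user name for illustration.
customers <- data.frame(
  pseudo_id = c("C0001", "C0002"),
  segment   = c("luxury cars", "budget cars"),
  stringsAsFactors = FALSE
)

selection <- read_with_audit(customers, "customers", "analyst_1", c("pseudo_id", "segment"))
print(audit$log)
```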

Data Governance

Everybody is talking about Data Science and Big Data, but one heavily ignored topic is Data Governance and Data Quality. Executives all over the world want to invest in doing data science, but they often ignore Data Governance. Some months ago I wrote about this and shared my frustration about it. Now I’ve decided to go for a more pragmatic approach and describe what Data Governance is all about. This should bring some clarity into the topic and reduce emotions.

Why is Data Governance important?

It is important to keep a certain level of quality in the data. Making decisions based on bad data quality leads to bad overall decisions. The effort required for Data Governance grows considerably when it is not addressed at the very beginning of your Data Strategy.

Also, there are a lot of challenges around Data Governance:

  • Keeping a high level of security is often slowing down business implementations
  • Initial investments are necessary, and they don’t show value for months or even years
  • Benefits are only visible “on top” of governance – e.g. with faster business results or better insights and thus it is not easy to “quantify” the impact
  • Data Governance is often considered “unsexy” to do. Everybody talks about data science, but nobody about data governance. In fact, Data Scientists can do almost nothing without data governance
  • Data Governance tools are rare – and those that are available are very expensive. Open Source doesn’t focus too much on it, as there is less “buzz” around it than AI. However, this also creates opportunities for us

Companies can basically follow three different strategies. Each strategy differs in the level of maturity:

  • Reactive Governance: Efforts are rather designed to respond to current pains. This happens when the organization has suffered a regulatory breach or a data disaster
  • Pre-emptive Governance: The organization is facing a major change or threat. This strategy is designed to ward off significant issues that could affect success of the company. Often it is driven by impending regulatory & compliance needs
  • Proactive Governance: All efforts are designed to improve capabilities to resolve risk and data issues. This strategy builds on reactive governance to create an ever-increasing body of validated rules, standards, and tested processes. It is also part of a wider Information Management strategy

The 4 pillars

The 4 pillars of Data Governance

As you can see in the image, there are basically 4 main pillars. During the next weeks, I will describe each of them in detail. But let’s have a first look at them now:

  • Data Security & Data Privacy: The overall goal here is to keep the data secure against external access. It is built on encryption, access management and accessibility. Often, role-based access is defined in this process. A typical principle here is privacy and security by design.
  • Data Quality Management: In this pillar, different measures for Data Quality are defined and tracked. Typically, for each dataset, specific quality measures are looked after. This gives data consumers an overview of the data quality.
  • Data Access & Search: This pillar is all about making data accessible and searchable within the company assets. A typical sample here is a Data Catalog, that shows all available company data to end users.
  • Master Data Management: master data is the common data of the company, e.g. the customer data, the data of suppliers and alike. Data in here should be of high quality and consistent. One physical customer should appear exactly once, as one person, and not as multiple persons.

For each of the above mentioned pillars, I will write individual articles over the next weeks.

A colleague of mine, Vivien Roussez, wrote a nice library in R to predict time series. The package is called “autoTS” and provides a high level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. You can find the package as an open source project on GitHub. Over the last few weeks we saw a lot of Data Science happening due to Corona. One of the challenges here is to use the right forecast.

Introduction to autoTS

by Vivien Roussez

The autoTS package provides a high level interface for univariate time series predictions. It implements many algorithms, most of them provided by the forecast package. The main goals of the package are :

  • Simplify the preparation of the time series ;
  • Train the algorithms and compare their results, to choose the best one ;
  • Gather the results in a final tidy dataframe

What are the inputs ?

The package is designed to work on one time series at a time. Parallel calculations can be put on top of it (see example below). The user has to provide 2 simple vectors :

  • One with the dates (s.t. the lubridate package can parse them)
  • The second with the corresponding values

Warnings

This package implements each algorithm with a unique parametrization, meaning that the user cannot tweak the algorithms (e.g. modify SARIMA-specific parameters).

Example on real-world data

Before getting started, you need to have the “autoTS” package installed. The following code then loads it together with the other packages used in this example:

knitr::opts_chunk$set(warning = F,message = F,fig.width = 8,fig.height = 5)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
library(autoTS)

For this example, we will use the quarterly GDP data of the European countries provided by Eurostat. The database can be downloaded from this page: choose “GDP and main components (output, expenditure and income) (namq_10_gdp)”, adjust the time dimension to select all available data, and download it as a csv file with the correct formatting (1 234.56). The csv is in the “Data” folder of this notebook.

dat <- read.csv("Data/namq_10_gdp_1_Data.csv")
str(dat)
## 'data.frame':    93456 obs. of  7 variables:
##  $ TIME              : Factor w/ 177 levels "1975Q1","1975Q2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ GEO               : Factor w/ 44 levels "Albania","Austria",..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ UNIT              : Factor w/ 3 levels "Chain linked volumes (2010), million euro",..: 2 2 2 2 3 3 3 3 1 1 ...
##  $ S_ADJ             : Factor w/ 4 levels "Calendar adjusted data, not seasonally adjusted data",..: 4 2 1 3 4 2 1 3 4 2 ...
##  $ NA_ITEM           : Factor w/ 1 level "Gross domestic product at market prices": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Value             : Factor w/ 19709 levels "1 008.3","1 012.9",..: 19709 19709 19709 19709 19709 19709 19709 19709 19709 19709 ...
##  $ Flag.and.Footnotes: Factor w/ 5 levels "","b","c","e",..: 1 1 1 1 1 1 1 1 1 1 ...
head(dat)
##     TIME                                       GEO
## 1 1975Q1 European Union - 27 countries (from 2019)
## 2 1975Q1 European Union - 27 countries (from 2019)
## 3 1975Q1 European Union - 27 countries (from 2019)
## 4 1975Q1 European Union - 27 countries (from 2019)
## 5 1975Q1 European Union - 27 countries (from 2019)
## 6 1975Q1 European Union - 27 countries (from 2019)
##                                   UNIT
## 1 Chain linked volumes, index 2010=100
## 2 Chain linked volumes, index 2010=100
## 3 Chain linked volumes, index 2010=100
## 4 Chain linked volumes, index 2010=100
## 5         Current prices, million euro
## 6         Current prices, million euro
##                                                                           S_ADJ
## 1 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 2                          Seasonally adjusted data, not calendar adjusted data
## 3                          Calendar adjusted data, not seasonally adjusted data
## 4                                         Seasonally and calendar adjusted data
## 5 Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)
## 6                          Seasonally adjusted data, not calendar adjusted data
##                                   NA_ITEM Value Flag.and.Footnotes
## 1 Gross domestic product at market prices     :
## 2 Gross domestic product at market prices     :
## 3 Gross domestic product at market prices     :
## 4 Gross domestic product at market prices     :
## 5 Gross domestic product at market prices     :
## 6 Gross domestic product at market prices     :

Data preparation

First, we have to clean the data (it’s not too ugly, though). The first step is to convert the TIME column into a well-known date format that lubridate can handle. In this example, the yq function can parse the date without modification of the column. Then, we have to remove the blanks that separate thousands in the values. Finally, we only keep data since 2000 and the unadjusted series in current prices.

After that, we should get one time series per country

dat <- mutate(dat,dates=yq(as.character(TIME)),
              # Value looks like "1 234.56": drop every blank, not only the first one
              values = as.numeric(stringr::str_remove_all(as.character(Value)," "))) %>%
  filter(year(dates)>=2000 &
           S_ADJ=="Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data)" &
           UNIT == "Current prices, million euro")
filter(dat,GEO %in% c("France","Austria")) %>%
  ggplot(aes(dates,values,color=GEO)) + geom_line() + theme_minimal() +
  labs(title="GDP of (completely) random countries")
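
To see what the value cleaning in the code above has to deal with, here is a small stand-alone sketch in base R. The sample strings are invented in the “1 234.56” export format, and parse_value is a hypothetical helper, not part of autoTS. It removes every blank and turns the “:” placeholder, which the raw data shown earlier uses for empty values, into NA explicitly.

```r
# Eurostat-style values: a blank separates thousands and ":" marks a missing value.
raw <- c("1 234.56", "363 007.0", "2 150 301.4", ":")

parse_value <- function(x) {
  x <- gsub(" ", "", as.character(x), fixed = TRUE)  # drop every blank
  x[x %in% ":"] <- NA                                # make missing values explicit
  as.numeric(x)
}

print(parse_value(raw))
```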

Now we’re good to go !

Prediction on a random country

Let’s see how to use the package on one time series :

  • Extract dates and values of the time series you want to work on
  • Create the object containing all you need afterwards
  • Train algo and determine which one is the best (over the last known year)
  • Implement the best algorithm on full data
ex1 <- filter(dat,GEO=="France")
preparedTS <- prepare.ts(ex1$dates,ex1$values,"quarter")
## What is in this new object ?
str(preparedTS)
## List of 4
##  $ obj.ts    : Time-Series [1:77] from 2000 to 2019: 363007 369185 362905 383489 380714 ...
##  $ obj.df    :'data.frame':  77 obs. of  2 variables:
##   ..$ dates: Date[1:77], format: "2000-01-01" "2000-04-01" ...
##   ..$ val  : num [1:77] 363007 369185 362905 383489 380714 ...
##  $ freq.num  : num 4
##  $ freq.alpha: chr "quarter"
plot.ts(preparedTS$obj.ts)
ggplot(preparedTS$obj.df,aes(dates,val)) + geom_line() + theme_minimal()

Get the best algorithm for this time series :

## What is the best model for prediction ?
best.algo <- getBestModel(ex1$dates,ex1$values,"quarter",graph = F)
names(best.algo)
## [1] "prepedTS"     "best"         "train.errors" "res.train"    "algos"
## [6] "graph.train"
print(paste("The best algorithm is",best.algo$best))
## [1] "The best algorithm is my.ets"
best.algo$graph.train

You find in the result of this function :

  • The name of the best model
  • The errors of each algorithm on the test set
  • The graphic of the train step
  • The prepared time series
  • The list of algorithms used (which you can customize)

The result of this function can be used as direct input of the my.predictions function

## Build the predictions
final.pred <- my.predictions(bestmod = best.algo)
tail(final.pred,24)
## # A tibble: 24 x 4
##    dates      type  actual.value   ets
##    <date>     <chr>        <dbl> <dbl>
##  1 2015-04-01 <NA>        548987    NA
##  2 2015-07-01 <NA>        541185    NA
##  3 2015-10-01 <NA>        566281    NA
##  4 2016-01-01 <NA>        554121    NA
##  5 2016-04-01 <NA>        560873    NA
##  6 2016-07-01 <NA>        546383    NA
##  7 2016-10-01 <NA>        572752    NA
##  8 2017-01-01 <NA>        565221    NA
##  9 2017-04-01 <NA>        573720    NA
## 10 2017-07-01 <NA>        563671    NA
## # … with 14 more rows
ggplot(final.pred) + geom_line(aes(dates,actual.value),color="black") +
  geom_line(aes_string("dates",stringr::str_remove(best.algo$best,"my."),linetype="type"),color="red") +
  theme_minimal() 

Not too bad, right ?
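
To make the selection step in this section more tangible, the following sketch shows the general idea of comparing candidates over the last known year. It is not the package’s implementation: the series is synthetic, the two candidates are deliberately simple stand-ins, and the mean absolute error is an assumed error measure. It only uses base R.

```r
# Synthetic quarterly series with trend and seasonality (not the Eurostat data).
set.seed(1)
n <- 40
y <- ts(100 + 2 * seq_len(n) + rep(c(-5, 3, -2, 4), n / 4) + rnorm(n),
        start = c(2010, 1), frequency = 4)

# Hold out the last known year (4 quarters) as the evaluation window.
train <- window(y, end = c(2018, 4))
test  <- window(y, start = c(2019, 1))
h <- length(test)

# Two simple candidates standing in for the package's algorithms.
candidates <- list(
  naive          = function(tr, h) rep(as.numeric(tail(tr, 1)), h),               # repeat last value
  seasonal_naive = function(tr, h) as.numeric(tail(tr, frequency(tr)))[seq_len(h)] # repeat last year
)

# Mean absolute error of each candidate on the held-out year.
errors <- vapply(candidates,
                 function(f) mean(abs(f(train, h) - as.numeric(test))),
                 numeric(1))
best <- names(which.min(errors))
print(errors)
print(best)
```

The candidate with the lowest error on the held-out year would then be refitted on the full series to produce the actual forecast.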

Scaling predictions

Let’s say we want to make a prediction for each country at the same time and be as fast as possible: let’s combine the package’s functions with parallel computing. We have to reshape the data to get one column per country and then iterate over the columns of the data frame.

Prepare data

suppressPackageStartupMessages(library(tidyr))
dat.wide <- select(dat,GEO,dates,values) %>%
  group_by(dates) %>%
  spread(key = "GEO",value = "values")
head(dat.wide)
## # A tibble: 6 x 45
## # Groups:   dates [6]
##   dates      Albania Austria Belgium `Bosnia and Her… Bulgaria Croatia Cyprus
##   <date>       <dbl>   <dbl>   <dbl>            <dbl>    <dbl>   <dbl>  <dbl>
## 1 2000-01-01      NA  50422.   62261               NA    2941.   5266.  2547.
## 2 2000-04-01      NA  53180.   65046               NA    3252.   5811   2784.
## 3 2000-07-01      NA  53881.   62754               NA    4015.   6409.  2737.
## 4 2000-10-01      NA  56123.   68161               NA    4103.   6113   2738.
## 5 2001-01-01      NA  52911.   64318               NA    3284.   5777.  2688.
## 6 2001-04-01      NA  54994.   67537               NA    3669.   6616.  2946.
## # … with 37 more variables: Czechia <dbl>, Denmark <dbl>, Estonia <dbl>, `Euro
## #   area (12 countries)` <dbl>, `Euro area (19 countries)` <dbl>, `Euro area
## #   (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013,
## #   EA18-2014, EA19)` <dbl>, `European Union - 15 countries (1995-2004)` <dbl>,
## #   `European Union - 27 countries (from 2019)` <dbl>, `European Union - 28
## #   countries` <dbl>, Finland <dbl>, France <dbl>, `Germany (until 1990 former
## #   territory of the FRG)` <dbl>, Greece <dbl>, Hungary <dbl>, Iceland <dbl>,
## #   Ireland <dbl>, Italy <dbl>, `Kosovo (under United Nations Security Council
## #   Resolution 1244/99)` <dbl>, Latvia <dbl>, Lithuania <dbl>,
## #   Luxembourg <dbl>, Malta <dbl>, Montenegro <dbl>, Netherlands <dbl>, `North
## #   Macedonia` <dbl>, Norway <dbl>, Poland <dbl>, Portugal <dbl>,
## #   Romania <dbl>, Serbia <dbl>, Slovakia <dbl>, Slovenia <dbl>, Spain <dbl>,
## #   Sweden <dbl>, Switzerland <dbl>, Turkey <dbl>, `United Kingdom` <dbl>

Compute bulk predictions

library(doParallel)
pipeline <- function(dates,values)
{
  pred <- getBestModel(dates,values,"quarter",graph = F)  %>%
    my.predictions()
  return(pred)
}
doMC::registerDoMC(parallel::detectCores()-1) # parallel backend (for UNIX)
system.time({
  res <- foreach(ii=2:ncol(dat.wide),.packages = c("dplyr","autoTS")) %dopar%
  pipeline(dat.wide$dates,pull(dat.wide,ii))
})
##    user  system elapsed
## 342.339   3.405  66.336
names(res) <- colnames(dat.wide)[-1]
str(res)
## List of 44
##  $ Albania                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Austria                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 50422 53180 53881 56123 52911 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Belgium                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 62261 65046 62754 68161 64318 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Bosnia and Herzegovina                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Bulgaria                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2941 3252 4015 4103 3284 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Croatia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5266 5811 6409 6113 5777 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Cyprus                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2547 2784 2737 2738 2688 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Czechia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 15027 16430 17229 18191 16677 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Denmark                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 42567 44307 43892 47249 44143 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Estonia                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 1391 1575 1543 1662 1570 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (12 countries)                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (19 countries)                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19):Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 15 countries (1995-2004)                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 27 countries (from 2019)                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ European Union - 28 countries                                                                :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ prophet     : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Finland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 31759 33836 34025 36641 34474 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ France                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 363007 369185 362905 383489 380714 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Germany (until 1990 former territory of the FRG)                                             :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 515500 523900 536120 540960 530610 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Greece                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 33199 34676 37285 37751 35237 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Hungary                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 11516 12630 13194 13955 12832 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Iceland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2304 2442 2557 2447 2232 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Ireland                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 25583 26751 27381 28666 29766 ...
##   ..$ tbats       : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Italy                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 292517 309098 298655 338996 309967 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Kosovo (under United Nations Security Council Resolution 1244/99)                            :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Latvia                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 1848 2165 2238 2382 2005 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Lithuania                                                                                    :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 2657 3124 3267 3505 2996 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Luxembourg                                                                                   :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5646 5730 5689 6015 5811 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Malta                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 979 1110 1158 1152 1031 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Montenegro                                                                                   :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ ets         : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Netherlands                                                                                  :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 109154 113124 110955 118774 118182 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ North Macedonia                                                                              :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 901 1052 1033 1108 986 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Norway                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 44900 43730 46652 50638 48355 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Poland                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 41340 44210 46944 54163 47445 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Portugal                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 30644 31923 32111 33788 31927 ...
##   ..$ stlm        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Romania                                                                                      :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 7901 9511 11197 11630 8530 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Serbia                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 0 0 0 0 0 ...
##   ..$ sarima      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovakia                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5100 5722 5764 5752 5343 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Slovenia                                                                                     :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 5147 5591 5504 5667 5407 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Spain                                                                                        :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 153378 162400 158526 171946 166204 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Sweden                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 67022 73563 68305 73399 66401 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Switzerland                                                                                  :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 70048 72725 74957 77476 76092 ...
##   ..$ bats        : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ Turkey                                                                                       :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 59944 70803 82262 82981 60075 ...
##   ..$ shortterm   : num [1:85] NA NA NA NA NA NA NA NA NA NA ...
##  $ United Kingdom                                                                               :Classes 'tbl_df', 'tbl' and 'data.frame':   85 obs. of  4 variables:
##   ..$ dates       : Date[1:85], format: "2000-01-01" "2000-04-01" ...
##   ..$ type        : chr [1:85] NA NA NA NA ...
##   ..$ actual.value: num [1:85] 438090 440675 446918 462127 441157 ...
##   ..$ bagged      : num [1:85] NA NA NA NA NA NA NA NA NA NA ...

There is no free lunch…

There is no single best algorithm in general: which model wins depends on the data!

# Count how often each model was selected as the best one across all series
sapply(res, function(xx) colnames(select(xx, -dates, -type, -actual.value))) %>% table()
## .
##    bagged      bats       ets   prophet    sarima shortterm      stlm     tbats
##         3         4         5         7         6        12         4         3
# List the selected model for each individual series
sapply(res, function(xx) colnames(select(xx, -dates, -type, -actual.value)))
##                                                                                       Albania
##                                                                                        "stlm"
##                                                                                       Austria
##                                                                                      "sarima"
##                                                                                       Belgium
##                                                                                   "shortterm"
##                                                                        Bosnia and Herzegovina
##                                                                                        "stlm"
##                                                                                      Bulgaria
##                                                                                       "tbats"
##                                                                                       Croatia
##                                                                                   "shortterm"
##                                                                                        Cyprus
##                                                                                      "sarima"
##                                                                                       Czechia
##                                                                                       "tbats"
##                                                                                       Denmark
##                                                                                     "prophet"
##                                                                                       Estonia
##                                                                                   "shortterm"
##                                                                      Euro area (12 countries)
##                                                                                     "prophet"
##                                                                      Euro area (19 countries)
##                                                                                     "prophet"
## Euro area (EA11-2000, EA12-2006, EA13-2007, EA15-2008, EA16-2010, EA17-2013, EA18-2014, EA19)
##                                                                                     "prophet"
##                                                     European Union - 15 countries (1995-2004)
##                                                                                     "prophet"
##                                                     European Union - 27 countries (from 2019)
##                                                                                     "prophet"
##                                                                 European Union - 28 countries
##                                                                                     "prophet"
##                                                                                       Finland
##                                                                                        "bats"
##                                                                                        France
##                                                                                         "ets"
##                                              Germany (until 1990 former territory of the FRG)
##                                                                                      "sarima"
##                                                                                        Greece
##                                                                                   "shortterm"
##                                                                                       Hungary
##                                                                                   "shortterm"
##                                                                                       Iceland
##                                                                                        "stlm"
##                                                                                       Ireland
##                                                                                       "tbats"
##                                                                                         Italy
##                                                                                         "ets"
##                             Kosovo (under United Nations Security Council Resolution 1244/99)
##                                                                                         "ets"
##                                                                                        Latvia
##                                                                                         "ets"
##                                                                                     Lithuania
##                                                                                   "shortterm"
##                                                                                    Luxembourg
##                                                                                      "sarima"
##                                                                                         Malta
##                                                                                   "shortterm"
##                                                                                    Montenegro
##                                                                                         "ets"
##                                                                                   Netherlands
##                                                                                   "shortterm"
##                                                                               North Macedonia
##                                                                                   "shortterm"
##                                                                                        Norway
##                                                                                      "sarima"
##                                                                                        Poland
##                                                                                      "bagged"
##                                                                                      Portugal
##                                                                                        "stlm"
##                                                                                       Romania
##                                                                                   "shortterm"
##                                                                                        Serbia
##                                                                                      "sarima"
##                                                                                      Slovakia
##                                                                                   "shortterm"
##                                                                                      Slovenia
##                                                                                        "bats"
##                                                                                         Spain
##                                                                                      "bagged"
##                                                                                        Sweden
##                                                                                        "bats"
##                                                                                   Switzerland
##                                                                                        "bats"
##                                                                                        Turkey
##                                                                                   "shortterm"
##                                                                                United Kingdom
##                                                                                      "bagged"
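The tally works because each data frame in res has exactly one column besides dates, type and actual.value, and that column is named after the selected model. As a minimal sketch of the same idea in base R, on made-up data rather than the real res (make_series, toy_res and fixed_cols are hypothetical names used only for illustration):

```r
# Toy stand-in for `res`: one data frame per series, each holding the three
# fixed columns plus a single column named after the selected model.
make_series <- function(model) {
  df <- data.frame(
    dates = seq(as.Date("2000-01-01"), by = "quarter", length.out = 4),
    type = NA_character_,
    actual.value = c(1, 2, 3, 4)
  )
  df[[model]] <- NA_real_  # forecast column, named after the model
  df
}

toy_res <- list(
  A = make_series("ets"),
  B = make_series("sarima"),
  C = make_series("ets")
)

# The model name is whatever column remains once the fixed ones are dropped.
fixed_cols <- c("dates", "type", "actual.value")
best_model <- vapply(
  toy_res,
  function(df) setdiff(names(df), fixed_cols),
  character(1)
)

# Frequency of each selected model, most frequent first.
counts <- sort(table(best_model), decreasing = TRUE)
print(counts)
```

Using vapply with character(1) instead of sapply makes the call fail loudly if a series ever carries zero or several model columns, rather than silently returning a list.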

We hope you enjoy working with this package for your future time series predictions. You should now be able to extend your data science algorithms on corona data with time series predictions. If you want to learn more about data science, I recommend working through this tutorial.