A linear regression model

In my previous posts, I introduced the basics of machine learning. Today, I want to focus on the two elementary algorithms: linear and logistic regression. Basically, you would learn them at the very beginning of your journey for machine learning, but eventually not use them much later on any more. But to understand the concepts of it, it is helpful to understand them.

Linear Regression

A Linear Regression is the simplest model for Data Science. Linear Regression is of supervised learning and used in Trend Analysis, Time-Series Analysis, Risk in Banking and many more.

In a linear regression, a relationship between a dependent variable y and a dataset of xn is linear. This basically means, that if there is data of a specific trend, a future trend can be predicted. Let’s assume that there is a significant relation between ad spendings and sales. We would have the following table:

YearAd SpendRevenue
2013 €      345.126,00  €      41.235.645,00
2014 €      534.678,00  €      62.354.984,00
2015 €      754.738,00  €      82.731.657,00
2016 €      986.453,00  €    112.674.539,00
2017 €   1.348.754,00  €    156.544.387,00
2018 €   1.678.943,00  €    176.543.726,00
2019 €   2.165.478,00  €    199.645.326,00

If you look at the data, it is very easy to figure out that that there is some kind of relation between how much money you spend on the ads and the revenue you get. Basically, the ratio is 1:92 to 1:119. Please not that I totally made up the numbers. however, based on this numbers, you could basically predict what revenues to obtain when spending X amount of data. The relation between them is therefore linear and we can easily plot it on a line chart:

Linear Regression

As you can see, some of the values are above the line and others below. Let’s now manually calculate the linear function. There are some steps necessary that should eventually lead to the prediction values. Let’s assume we want to know if we spend a specific money on ads, what revenue we can expect. Let’s assume we want to know how much value we create for 1 Million spend on ads. The linear regression function for this is:

predicted score (Y') = bX + intercept (A)

This means that we now need to calculate several values: (A) the slope (it is our “b” and the intercept (it is our A). X is the only value we know – our 1 Million spend. Let’s first calculate the slope

Calculating the Slope

The first thing we need to do is calculating the slope. For this, we need to have the standard deviation of both X and XY. Let’s first start with X – our revenues. The standard deviation is calculated for each revenue individually. There are some steps involved:

  • Creating the average of the revenues
  • Subtracting the individual revenue
  • Building the square

The first step is to create the average of both values. The average for the revenues should be:  € 118.818.609,14 and the average for the spend should be:  € 1.116.310,00.

Next, we need to create the standard deviation of each item. For the ad spend, we do this by substracting each individual ad spend and building the square. The table for this should look like the following:

The formular is: (Average of Ad spend – ad spend) ^ 2

YearAd spendStddev (X)
2013 €    345.126,00  €              594.724.761.856,00
2014 €    534.678,00  €              338.295.783.424,00
2015 €    754.738,00  €              130.734.311.184,00
2016 €    986.453,00  €                16.862.840.449,00
2017 € 1.348.754,00  €                54.030.213.136,00
2018 € 1.678.943,00  €              316.555.892.689,00
2019 € 2.165.478,00  €           1.100.753.492.224,00

Quite huge numbers already, right? Now, let’s create the standard deviation for the revenues. This is done by taking the average of the ad spend – ad spend and multiplying it with the same procedure for the revenues. This should result in:

YearRevenueY_Ad_Stddev
2013 €                  41.235.645,00  €    59.830.740.619.545,10
2014 €                  62.354.984,00  €    32.841.051.219.090,30
2015 €                  82.731.657,00  €    13.048.031.460.197,10
2016 €                112.674.539,00  €         797.850.516.541,00
2017 €                156.544.387,00  €      8.769.130.708.225,71
2018 €                176.543.726,00  €    32.478.055.672.684,90
2019 €                199.645.326,00  €    84.800.804.871.574,80

Now, we only need to sum up the columns for Y and X. The sums should be:

€ 2.551.957.294.962,00 for the X-Row
€ 232.565.665.067.859,00 for the Y-Row

Now, we need to divide the Y-Row by the X-Row and would get the following slope: 91,1322715

Calculating the Intercept

The intercept is somewhat easier. The formular for it is: average(y) – Slope * average(x). We already have all relevant variables calculated in our previous step. Our intercept should equal:  € 17.086.743,14.

Predicting the value with the Linear Regression

Now, we can build our function. This is: Y = 91,1322715X + 17.086.743,14

As stated in the beginning, our X should be 1 Million and we want to know our revenue:  € 108.219.014,64

The prediction is actually lower than the values which are closer (2016 and 2017 values). If you change the values, e.g. to 2 Million or 400k, it will again get closer. Predictions always produce some errors and they are normally shown. In our case, the error table would look like the following:

ad spentreal revenue (Y)prediction (Y’)error
2013 €                       345.126,00  €                  41.235.645,00  €                  48.538.859,48 -€     7.303.214,48
2014 €                       534.678,00  €                  62.354.984,00  €                  65.813.163,80 -€     3.458.179,80
2015 €                       754.738,00  €                  82.731.657,00  €                  85.867.731,47 -€     3.136.074,47
2016 €                       986.453,00  €                112.674.539,00  €                106.984.445,76  €     5.690.093,24
2017 €                    1.348.754,00  €                156.544.387,00  €                140.001.758,86  €   16.542.628,14
2018 €                    1.678.943,00  €                176.543.726,00  €                170.092.632,46  €     6.451.093,54
2019 €                    2.165.478,00  €                199.645.326,00  €                214.431.672,17 -€   14.786.346,17

The error calculation is done by using the real value and deducting the predicted value from it. And voila – you have your error. One common thing in machine learning is to reduce the error and make predictions more accurate.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!