Posts

two native american playing wind instrument

The ultimate career goal of a Data Scientist or any other person in the Data industry would be to become a CDO – Chief Data Officer. But is it really that interesting to become one? Is the role of the CDO still relevant nowadays? Let me dig a bit into the role and how it has evolved and outline the facts on why I believe that this role will – sooner or later – become irrelevant

The Chief Data Officer at the board level

Whenever we talk about a C-role, it would imply that this person is a board member. In my past and my own career, I have however not seen any CDO being a real board member. In fact, most of the time the “CDO” was either reporting to the board or several hierarchical levels underneath it. So in all cases, the “C” should be removed from the job title per se. In my recent jobs, I had the full job description of a CDO, reported to the board and dealt with the data topic on a group-wide basis of the company. However, my job title was always “Head of Data”, rather than adding the C to my title (since, again, I wasn’t part of the board). My personal opinion is that anyway the data topic shouldn’t be on a board level – it should be in ALL business departments.

A central function for a decentralised job

One of the key aspects, why the Chief Data Officer might become irrelevant is the basic nature of Data: Data is always decentral, produced in business departments and used in these departments. A central function for data will never be fast enough to catch up with the demands around data. This leads to the question on why a central department might be relevant? Or will it still be relevant at all?

One key consideration is, if the job brings any benefits to the company, if installed. Most CDOs I knew used to focus on analytical use-cases. But this is definitely something that needs to be done in the business departments. With the data mesh, not only using data but also preparing data (e.g. data engineering as a task) would rather be embedded in business functions than in centralised functions. So several functions, that a Chief Data Officer would carry out get decentralised. But what will stay for the Chief Data Officer?

The Chief Data Officer as central Data Governance and Architecture steering

However, there is still plenty of work left for a “Chief Data Officer”. The main tasks of this function or person will center around the following items:

  • Data Governance: steer all decentral projects into common standards and raise awareness for data quality
  • Data Architecture: ensure a common data architecture and set standards for decentral functions, alongside the data governance standards
  • Drive a data driven culture: Gather the decentral community within an organisation and ensure that the organisation learns and constantly improves in this topic. Become a catalyst for innovation in the data topic

There are some aspects that such a function should not do:

  • Data Engineering: Data Engineering should either be done in IT (if it is source centric) or in business departments (but rather on a limited scale, focused on DataOps!) if it is pipeline centric (and supporting data scientists)
  • Data Science: This should entirely be done in the business

As we can see, there are a lot of things that a “CDO” should still do. However, the function would rather focus on securing the base, not creating value in the business cases per se. But this might not be entirely true: if there is nobody that takes care of governance and a proper architecture, there is no chance to create value in the business. So it is a very important role to have in organisations. Will this role be called a “Chief Data Officer”? Probably not, but people like titles, so it will stay 😉

Shows the code editor in Python

In my previous post, I gave an introduction to Python Libraries for Data Engineering and Data Science. In this post, we will have a first look at NumPy, one of the most important libraries to work with in Python.

NumPy is the simplest library for working with data. It is often re-used by other libraries such as Pandas, so it is necessary to first understand NumPy. The focus of this library is on easy transformations of Vectors, Matrizes and Arrays. It provides a lot of functionality on that. But let’s get our hands dirty with the library and have a look at it!

Before you get started, please make sure to have the Sandbox setup and ready

Getting started with NumPy

First of all, we need to import the library. This works with the following import statement in Python:

import numpy as np

This should now give us access to NumPy libraries. Let us first create an 3-dimensional array with 5 values in it. In NumPy, this works with the “arange” method. We provide “15” as the number of items and then let it re-shape to 3×5:

vals = np.arange(15).reshape(3,5)
vals

This should now give us an output array with 2 dimensions, where each dimension contains 5 values. The values range from 0 to 14:

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

NumPy contains a lot of different variables and functions. To have PI, you simply import “pi” from numpy:

from numpy import pi
pi

We can now use PI for further work and calculations in Python.

Simple Calculations with NumPy

Let’s create a new array with 5 values:

vl = np.arange(5)
vl

An easy way to calculate is to calculate something to the power. This works with “**”

nv = vl**2
nv

Now, this should give us the following output:

array([ 0,  1,  4,  9, 16])

The same applies to “3”: if we want to calculate everything in an array to the power of 3:

nn = vl**3
nn

And the output should be similar:

array([ 0,  1,  8, 27, 64])

Working with Random Numbers in NumPy

NumPy contains the function “random” to create random numbers. This method takes the dimensions of the array to fit the numbers into. We use a 3×3 array:

nr = np.random.random((3,3))
nr *= 100
nr

Please note that random returns numbers between 0 and 1, so in order to create higher numbers we need to “stretch” them. We thus multiply by 100. The output should be something like this:

array([[90.30147522,  6.88948191,  6.41853222],
       [82.76187536, 73.37687372,  9.48770728],
       [59.02523947, 84.56571797,  5.05225463]])

Your numbers should be different, since we are working with random numbers in here. We can do this as well with a 3-dimensional array:

n3d = np.random.random((3,3,3))
n3d *= 100
n3d

Also here, your numbers would be different, but the overall “structure” should look like the following:

array([[[89.02863455, 83.83509441, 93.94264059],
        [55.79196044, 79.32574406, 33.06871588],
        [26.11848117, 64.05158411, 94.80789032]],

       [[19.19231999, 63.52128357,  8.10253043],
        [21.35001753, 25.11397256, 74.92458022],
        [35.62544853, 98.17595966, 23.10038137]],

       [[81.56526913,  9.99720992, 79.52580966],
        [38.69294158, 25.9849473 , 85.97255179],
        [38.42338734, 67.53616027, 98.64039687]]])

Other means to work with Numbers in Python

NumPy provides several other options to work with data. There are several aggregation functions available that we can use. Let’s now look for the maximum value in the previously created array:

n3d.max()

In my example this would return 98.6. You would get a different number, since we made it random. Also, it is possible to return the maximum number of a specific axis within an array. We therefore add the keyword “axis” to the “max” function:

n3d.max(axis=1)

This would now return the maximum number for each of the axis within the array. In my example, the results look like this:

array([[93.94264059, 79.32574406, 94.80789032],
       [63.52128357, 74.92458022, 98.17595966],
       [81.56526913, 85.97255179, 98.64039687]])

Another option is to create the sum. We can do this by the entire array, or by providing the axis keyword:

n3d.sum(axis=1)

In the next sample, we make the data look more pretty. This can be done by rounding the numbers to 2 digits:

n3d.round(2)

Iterating arrays in Python

Often, it is necessary to iterate over items. In NumPy, this can be achieved by using the built-in iterator. We get it by the function “nditer”. This function needs the array to iterate over and then we can include it in a for-each loop:

or val in np.nditer(n3d):
    print(val)

The above sample would iterate over all values in the array and then prints the values. If we want to modify the items within the array, we need to set the flag “op_flags” to “readwrite”. This enables us to do modifications to the array while iterating it. In the next sample, we iterate over each item and then create the modulo of 3 from it:

n3d = n3d.round(0)

with np.nditer(n3d, op_flags=['readwrite']) as iter:
    for i in iter:
        i[...] = i%3
        
n3d

These are the basics of NumPy. In our next tutorial, we will have a look at Pandas: a very powerful dataframe library.

If you liked this post, you might consider the tutorial about Python itself. This gives you a great insight into the Python language for Spark itself. If you want to know more about Python, you should consider visiting the official page.

I am talking a lot to different people in my domain – either on conferences or as I know them personally. One thing most of them have in common is one thing: frustration. But why are people working with data frustrated? Why do we see so many frustrated data scientists? Is it the complexity of the job on dealing with data or is it something else? My experience is clearly one thing: something else.

Why are people working with Data frustrated?

One pattern is very clear: most people I talk to that are frustrated with their job working in classical industries. Whenever I talk to people in the IT industry or in Startups, they seem to be very happy. This is largely in contrast to people working in “classical” industries or in consulting companies. There are several reasons to that:

  • First, it is often about a lack of support within traditional companies. Processes are complex and employees work in that company for quite some time. Bringing in new people (or the cool data scientists) often creates frictions with the established employees of the company. Doing things different to how they used to be done isn’t well perceived by the established type of employees and they have the power and will to block any kind of innovation. The internal network they have can’t compete with any kind of data science magic.
  • Second, data is difficult to grasp and organised in silos. Established companies often have an IT function as a cost center, so things were done or fixed on the fly. It was never really intended to dismantle those silos, as budgets were never reserved or made available in doing so. Even now, most companies don’t look into any kind of data governance to reduce their silos. Data quality isn’t a key aspect they strive for. The new kind of people – data scientists – are often “hunting” for data rather than working with the data.
  • Third, the technology stack is heterogenous and legacy brings in a lot of frustration as well. This is very similar to the second point. Here, the issue is rather about not knowing how to get the data out of a system without a clear API rather than finding data at all.
  • Fourth, everybody forgets about data engineers. Data Scientists sit alone and though they do have some skills in Python, they aren’t the ones operating a technology stack. Often, there is a mismatch between data scientists and data engineers in corporations.
  • Fifth, legacy always kicks in. Mandatory regulatory reporting and finance reporting is often taking away resources from the organisation. You can’t just say: “Hey, I am not doing this report for the regulatory since I want to find some patterns in the behaviour of my customers”. Traditional industries are more heavy regulated than Startups or IT companies. This leads to data scientists being reused for standard reporting (not even self-service!). Then the answer often is: “This is not what I signed up for!”
  • Sixth, Digitalisation and Data units are often created in order to show it to the shareholder report. There is no real need from the board for impact. Impact is driven from the business and the business knows how to do so. There won’t be significant growth at all but some growth with “doing it as usual”. (However, startups and companies changing the status quo will get this significant growth!)
  • Seventh, Data scientists need to be in the business, whereas data engineers need to be in the IT department close to the IT systems. Period. However, Tribes need to be centrally steered.

How to overcome this frustration?

Basically, there is no fast cure available to this problem to reduce the frustrated data scientists. The field is still young, so confusion and wrong decisions outside of the IT industry is normal. Projects will fail, skilled people will leave and find new jobs. Over time, companies will get more and more mature in their journey and thus everything around data will become part of the established parts of a company. Just like controlling, marketing or any other function. It is yet to find its place and organisation type.

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on other aspects of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two concepts: the Convolutional Neural Network (CNN) and the Feedforward Neural Network

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose neural network. The entry point is the input layer and it consists of several hidden layers and an output layer. Each layer has a connection to the previous layer. This is one-way only, so that nodes can’t for a cycle. The information in a feedforward network only moves into one direction – from the input layer, through the hidden layers to the output layer. It is the easiest version of a Neural Network. The below image illustrates the Feedforward Neural Network.

Feedforward Neural Network

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in Image recognition and similar tasks. For that reason it is also good for Video processing. The difference to the Feedforward neural network is that the CNN contains 3 dimensions: width, height and depth. Not all neurons in one layer are fully connected to neurons in the next layer. There are three different type of layers in a Convolutional Neural Network, which are also different to feedforward neural networks:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as: edges, colors or objects. Next, the feature map is created out of them. The deeper the network goes the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.

Rectified linear units (ReLU)

The goal of this layer is to improve the training speed and impact. Negative values in the layers are removed.

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling. The number of parameters that the network needs to learn about gets reduced. In convolutional neural networks, the operation is useful since the outgoing connections usually receive similar information.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

A linear regression model

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on other aspects of Machine Learning: Deep Learning. In this post, I will give an introduction to deep learning. Over the last couple of years, this was the hype around AI. But what is so exciting about Deep Learning? First, let’s have a look at the concepts of Deep Learning.

A brief introduction to Deep Learning

Basically, Deep Learning should function similar to the human brain. Everything is built around Neurons, which work in networks (neural networks). The smallest element in a neural network is the neuron, which takes an input parameter and creates an output parameter, based on the bias and weight it has. The following image shows the Neuron in Deep Learning:

The Neuron in a Neuronal Network in Deep Learning
The Neuron in a Neuronal Network in Deep Learning

Next, there are Layers in the Network, which consists of several Neurons. Each Layer has some transformations, that will eventually lead to an end result. Each Layer will get much closer to the target result. If your Deep Learning model built to recognise hand writing, the first layer would probably recognise gray-scales, the second layer a connection between different pixels, the third layer would recognise simple figures and the fourth layer would recognise the letter. The following image shows a typical neural net:

A neural net for Deep Learning
A neural net for Deep Learning

A typical workflow in a neural net calculation for image recognition could look like this:

  • All images are split into batches
  • Each batch is sent to the GPU for calculation
  • The model starts the analysis with random weights
  • A cost function gets specified, that compares the results with the truth
  • Back propagation of the result happens
  • Once a model calculation is finished, the result is merged and returned

How is it different to Machine Learning?

Although Deep Learning is often considered to be a “subset” of Machine Learning, it is quite different. For different aspects, Deep Learning often achieves better results than “traditional” machine learning models. The following table should provide an overview of these differences:

Machine Leaning Deep Learning
Feature extraction happens manuallyFeature extraction is done automatically
Features are used to create a model that categorises elementsPerforms “end-to-end learning” 
Shallow learning  Deep learning algorithms scale with data

This is only the basic overview of Deep Learning. Deep Learning knows several different methods. In the next tutorial, we will have a look at different interpretations of Deep Learning.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In the first posts, I introduced different type of Machine Learning concepts. On of them is classification. Basically, classification is about identifying to which set of categories a certain observation belongs. Classifications are normally of supervised learning techniques. A typical classification is Spam detection in e-mails – the two possible classifications in this case are either “spam” or “no spam”. The two most common classification algorithms are the naive bayes classification and the random forest classification.

What classification algorithms are there?

Basically, there are a lot of classification algorithms available and when working in the field of Machine Learning, you will discover a large number of algorithms every time. In this tutorial, we will only focus on the two most important ones (Random Forest, Naive Bayes) and the basic one (Decision Tree)

The Decision Tree classifier

The basic classifier is the Decision tree classifier. It basically builds classification models in the form of a tree structure. The dataset is broken down into smaller subsets and gets detailed by each leave. It could be compared to a survey, where each question has an effect on the next question. Let’s assume the following case: Tom was captured by the police and is a suspect in robing a bank. The questions could represent the following tree structure:

Basic sample of a Decision Tree
Basic sample of a Decision Tree

Basically, by going from one leave to another, you get closer to the result of either “guilty” or “not guilty”. Also, each leaf has a weight.

The Random Forest classification

Random forest is a really great classifier, often used and also often very efficient. It is an ensemble classifier made using many decision tree models. There are ensemble models that combine the different results. The random forest model can both run regression and classification models.

Basically, it divides the data set into subsets and then runs on the data. Random forest models run efficient on large datasets, since all compute can be split and thus it is easier to run the model in parallel. It can handle thousands of input variables without variable deletion. It computes proximities between pairs of cases that can be used in clustering, locating outliers or (by scaling) give interesting views of the data.

There are also some disadvantages with the random forest classifier: the main problem is its complexity. Working with random forest is more challenging than classic decision trees and thus needs skilled people. Also, the complexity creates large demands for compute power.

Random Forest is often used by financial institutions. A typical use-case is credit risk prediction. If you have ever applied for a credit, you might know the questions being asked by banks. They are often fed into random forest models.

The Naive Bayes classifier

The Naive Bayes classifier is based on prior knowledge of conditions that might relate to an event. It is based on the Bayes Theorem. There is a strong independence between features assumed. It uses categorial data to calculate ratios between events.

The benefit of Naive Bayes are different. It can easily and fast predict classes of data sets. Also, it can predict multiple classes. Naive Bayes performs better compared to models such as logistic regression and there is a lot less training data needed.

A key challenge is that if a categorical variable has a category which was not checked in the training data set, then model will assign a 0 (zero) probability, which makes it unable for prediction. Also, it is known to be a rather bad estimator. Also, it is rather complex to use.

As stated, there are many more algorithms available. In the next tutorial, we will have a look at Deep Learning.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

A linear regression model

In the previous tutorial posts, we looked at the Linear Regression and discussed some basics of statistics such as the Standard Deviation and the Standard Error. Today, we will look at the Logistic Regression. It is similar in name to the linear regression, but different in usage. Let’s have a look

The Logistic Regression explained

One of the main difference to the Linear Regression for the Logistic Regression is that you the logistic regression is binary – it calculates values between 0 and 1 and thus states if something is rather true or false. This means that the result of a prediction could be “fail” or “succeed” for a test. In a churn model, this would mean that a customer either stays with the company or leaves the company.

Another key difference to the Linear Regression is that the regression curve can’t be calculated. Therefore, in the Logistic Regression, the regression curve is “estimated” and optimised. There is a mathematical function to do this estimation – called the “Maximum Likelihood Method”. Normally, these Parameters are calculated by different Machine Learning Tools so that you don’t have to do it.

Another aspect is the concept of “Odds”. Basically, the odd of a certain event happening or not happening is calculated. This could be a certain team winning a soccer game: let’s assume that Team X wins 7 out of 10 games (thus loosing 3, we don’t take a draw). The odds in this case would be 7:10 on winning or 3:10 on loosing.

This time we won’t calculate the Logistic Regression, since it is way too long. In the next tutorial, I will focus on classifiers such as Random Forest and Naive Bayes.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In my previous posts we had a look at some fundamentals of machine learning and had a look at the linear regression. Today, we will look at another statistical topic: false positives and false negatives. You will come across these terms quite often when working with data, so let’s have a look at them.

The false positive

In statistics, there is one error, called the false positive error. This happens when the prediction states something to be true, but in reality it is false. To easily remember the false positive, you could describe this as a false alarm. A simple example for that is the airport security check: when you pass the security check, you have to walk through a metal detector. If you don’t wear any metal items with you (since you left them for the x-ray!), no alarm will go on. But in some rather rare cases, the alarm might still go on. Either you forgot something or the metal detector had an error – in this case, a false positive. The metal detector predicted that you have metal items somewhere with you, but in fact you don’t.

Another sample of a false positive in machine learning would be in image recognition: imagine your algorithm is trained to recognise cats. There are so many cat pictures on the web, so it is easy to train this algorithm. However, you would then feed the algorithm the image of a dog and the algorithm would call it a cat, even though it is a dog. This again is a false positive.

In a business context, your algorithm might predict that a specific customer is going to buy a certain product for sure. but in fact, this customer didn’t buy it. Again, here we have our false positive. Now, let’s have a look at the other error: the false negative.

The false negative

The other error in statistics is the false negative. Similar to the false positive, it is something that should be avoided. It is very similar to the false positive, just the other way around. Let’s look at the airport example one more time: you wear a metal item (such as a watch) and go through the metal detector. You simply forgot to take off the watch. And – the metal detector doesn’t go on this time. Now, you are a false negative: the metal detector stated that you don’t wear any metal items, but in fact you did. A condition was predicted to be true but in fact it was false.

A false positive is often useful to score your data quality. Now that you understand some of the most important basics of statistics, we will have a look at another machine learning algorithm in my next post: the logistic regression.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

A linear regression model

Now we have learned how to write a Linear Regression model from hand in our last tutorial. Also, we had a look at the prediction error and standard error. Today, we want to focus on a way how to measure the performance of a model. In marketing, a common methodology for this is lift and gain charts. They can also be used for other things, but in our today’s sample we will use a marketing scenario.

The marketing scenario for Lift and Gain charts

Let’s assume that you are in charge of an outbound call campaign. Basically, your goal is to increase conversions of people contacted via this campaign. Like with most campaigns, you have a certain – limited – budget and thus need to plan the campaign smart. This is where machine learning comes into play: you only want to contact those people that are most relevant to buy the product. Therefore, you contact the top X percent of customers where you rather expect a conversion and avoid contacting those customers that are very unlikely to get converted. We assume that you already built a model for that and that we now do the campaign. We will measure our results with a gain chart, but first let’s create some data.

Our sample data represents all our customers, grouped into decentiles. Basically, we group the customers into top 10%, top 20%, … until we reach all customers. We add the number of conversions to it as well:

Decantile# of CustomersConversions
120033
220030
320027
420025
520023
620019
720015
820011
92007
102002

As you can see in the above table, the first decentile contains most conversions and is thus our top group. The conversion rates for each group in percent are:

% Conversions
17,2%
15,6%
14,1%
13,0%
12,0%
9,9%
7,8%
5,7%
3,6%
1,0%

As you can see, 17.2% of all top 10% customers could be converted. From each group, it declines. So, the best approach is to first contact the top customers. As a next step, we add the cumulative conversions. This number is then used for our cumulative gain chart.

Cumulative % Conversions
17,2%
32,8%
46,9%
59,9%
71,9%
81,8%
89,6%
95,3%
99,0%
100,0%

Cumulative Gain Chart

With this data, we can now create the cumulative gain chart. In our case, this would look like the following:

A cumulative gain chart
A cumulative gain chart

The Lift factor

Now, let’s have a look at the lift factor. The base for the lift factor is always the lift 1. This means that there was a random sample selected and no structured approach was done. Basically, the lift factor is the ratio you get between the number of customers contacted in % and the number of conversions for the decentile in %. With our sample data, this lift data would look like the following:

Lift
1,72
1,64
1,56
1,50
1,44
1,36
1,28
1,19
1,10
1,00

Thus we would have a lift factor of 1.72 with the first percentile, decreasing towards the full customer set.

In this tutorial, we’ve learned about how to verify a machine learning model. In the next tutorial, we will have a look at false positives and some other important topics before moving on with Logistic Regression.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In my previous posts, I explained the Linear Regression and stated that there are some errors in it. This is called the error of prediction (for individual predictions) and there is also a standard error. A prediction is good if the individual errors of prediction and the standard error are small. Let’s now start by examining the error of prediction, which is called the standard error in a linear regression model.

Error of prediction in Linear regression

Let’s recall the table from the previous tutorial:

YearAd Spend (X)Revenue (Y)Prediction (Y’)
2013 €    345.126,00  €   41.235.645,00  €   48.538.859,48
2014 €    534.678,00  €   62.354.984,00  €   65.813.163,80
2015 €    754.738,00  €   82.731.657,00  €   85.867.731,47
2016 €    986.453,00  € 112.674.539,00  € 106.984.445,76
2017 € 1.348.754,00  € 156.544.387,00  € 140.001.758,86
2018 € 1.678.943,00  € 176.543.726,00  € 170.092.632,46
2019 € 2.165.478,00  € 199.645.326,00  € 214.431.672,17

We can see that there is a clear difference in between the prediction and the actual numbers. We calculate the error in each prediction by taking the real value minus the prediction:

Y-Y’
-€   7.303.214,48
-€   3.458.179,80
-€   3.136.074,47
 €   5.690.093,24
 € 16.542.628,14
 €   6.451.093,54
-€ 14.786.346,17

In the above table, we can see how each prediction differs from the real value. Thus it is our prediction error on the actual values.

Calculating the Standard Error

Now, we want to calculate the standard error. First, let’s have a look at the formular:

Basically, we take the sum of all error to the square, divide it by the number of occurrences and take the square root of it. We already have Y-Y’ calculated, so we only need to make the square of it:

Y-Y’(Y-Y’)^2
-€   7.303.214,48  €    53.336.941.686.734,40
-€   3.458.179,80  €    11.959.007.558.032,20
-€   3.136.074,47  €      9.834.963.088.101,32
 €   5.690.093,24  €    32.377.161.053.416,10
 € 16.542.628,14  €  273.658.545.777.043,00
 €   6.451.093,54  €    41.616.607.923.053,70
-€ 14.786.346,17  €  218.636.033.083.835,00

The sum of it is 641.419.260.170.216,00 €

And N is 7, since it contains 7 Elements. Divided by 7, it is: 91.631.322.881.459,50 €

The last step is to take the square root, which results in the standard error of 9.572.425,13 € for our linear regression.

Now, we have most items cleared for our linear regression and can move on to the logistic regression in our next tutorial.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.