Posts

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will have a look at two network types: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose type of neural network. The entry point is the input layer, followed by several hidden layers and an output layer. Each layer is connected to the previous layer. The connections are one-way only, so nodes can’t form a cycle. Information in a feedforward network moves in one direction only: from the input layer, through the hidden layers, to the output layer. It is the simplest version of a Neural Network. The image below illustrates the Feedforward Neural Network.

Feedforward Neural Network
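To make the one-way flow more tangible, here is a minimal sketch in plain Python. The network size, the weights and the sigmoid activation are made-up assumptions for illustration only; a real network would learn its weights during training.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """One fully connected layer: each neuron sees every input value."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def feedforward(inputs, layers):
    """Pass the values through the layers in order, never backwards."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = feedforward([1.0, 2.0], layers)
print(output)
```

Each layer only reads the output of the layer before it, which is exactly why no cycle can form.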

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in image recognition and similar tasks. For that reason, it is also well suited to video processing. The difference to the Feedforward Neural Network is that the layers of a CNN are arranged in three dimensions: width, height and depth. Not all neurons in one layer are connected to all neurons in the next layer. There are three different types of layers in a Convolutional Neural Network, which also set it apart from feedforward neural networks:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colors or objects. The responses of the filters form the feature map. The deeper the network goes, the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.
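As a rough sketch of what a single filter does, the following plain Python code slides a small filter over a tiny image. The image and the hand-written edge filter are assumptions for illustration; in a real CNN the filter values are learned. Like most Deep Learning libraries, the sketch does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 4x4 grayscale image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Hand-written vertical edge filter; a CNN would learn such values itself.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)
```

The feature map is only non-zero in the middle column, which is where the dark half meets the bright half: the filter has "activated" on the edge.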

Rectified linear units (ReLU)

The goal of this layer is to make training faster and more effective. Negative values in the layers are set to zero, while positive values pass through unchanged.
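In code, the ReLU operation is tiny. The input values below are made up for illustration:

```python
def relu(values):
    """Replace every negative value with zero and keep the rest as it is."""
    return [max(0.0, v) for v in values]

activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)
```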

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling. This reduces the number of parameters that the network needs to learn. In convolutional neural networks, the operation is useful since the outgoing connections usually receive similar information.
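One common form of pooling is max pooling, which keeps only the largest value of each small window. The sketch below assumes max pooling with a 2x2 window and a made-up feature map; other pooling variants exist.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value of every non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 7],
    [1, 1, 8, 5],
]
pooled = max_pool(feature_map)
print(pooled)
```

The 4x4 input shrinks to 2x2, so the following layers have far fewer values to deal with.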

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and alike, read about them in the Big Data Tutorials here. If you look for great datasets to play with, I would recommend you Kaggle.

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. In this post, I will give an introduction to Deep Learning. Over the last couple of years, it has been at the centre of the hype around AI. But what is so exciting about Deep Learning? First, let’s have a look at the concepts of Deep Learning.

A brief introduction to Deep Learning

Basically, Deep Learning is meant to function similarly to the human brain. Everything is built around neurons, which work in networks (neural networks). The smallest element in a neural network is the neuron, which takes an input and creates an output based on its weight and bias. The following image shows the Neuron in Deep Learning:

The Neuron in a Neuronal Network in Deep Learning
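A single neuron can be sketched in a few lines of Python. The input values, weights and bias are made up, and the sigmoid activation is an assumption, since other activation functions are just as common.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

output = neuron(inputs=[0.5, 0.8], weights=[0.4, -0.6], bias=0.1)
print(output)
```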

Next, there are layers in the network, each of which consists of several neurons. Each layer applies some transformations that eventually lead to an end result, and each layer gets closer to the target result. If your Deep Learning model is built to recognise handwriting, the first layer would probably recognise gray-scales, the second layer connections between different pixels, the third layer simple figures and the fourth layer the letter. The following image shows a typical neural net:

A neural net for Deep Learning

A typical workflow in a neural net calculation for image recognition could look like this:

  • All images are split into batches
  • Each batch is sent to the GPU for calculation
  • The model starts the analysis with random weights
  • A cost function is specified that compares the results with the truth
  • Back propagation of the result happens
  • Once a model calculation is finished, the result is merged and returned
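The workflow above can be sketched with a single neuron in plain Python. This is a heavily simplified illustration under several assumptions: it runs on the CPU instead of a GPU, uses a toy dataset instead of images, and uses the cross-entropy cost with a sigmoid output. All names and numbers are made up.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(samples, epochs=2000, batch_size=2, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # The model starts with random weights.
    weights = [rng.uniform(-1, 1) for _ in samples[0][0]]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):
        # All samples are split into batches.
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            grad_w = [0.0] * len(weights)
            grad_b = 0.0
            for x, truth in batch:
                prediction = predict(x, weights, bias)
                # Cost: cross-entropy between prediction and truth. Its gradient
                # with respect to the weighted sum is (prediction - truth).
                delta = prediction - truth
                for i, xi in enumerate(x):
                    grad_w[i] += delta * xi
                grad_b += delta
            # Back propagation: push the error back into the weights.
            weights = [
                w - learning_rate * g / len(batch)
                for w, g in zip(weights, grad_w)
            ]
            bias -= learning_rate * grad_b / len(batch)
    # Once the calculation is finished, the result is returned.
    return weights, bias

# Hypothetical toy data: the label is 1 only when both "pixels" are bright.
samples = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
weights, bias = train(samples)
for x, truth in samples:
    print(x, truth, round(predict(x, weights, bias), 3))
```

A real image model has many layers and millions of weights, but the loop of batching, predicting, measuring the cost and updating the weights stays the same.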

How is it different to Machine Learning?

Although Deep Learning is often considered to be a “subset” of Machine Learning, it is quite different. In several respects, Deep Learning often achieves better results than “traditional” machine learning models. The following table provides an overview of these differences:

Machine Learning | Deep Learning
Feature extraction happens manually | Feature extraction is done automatically
Features are used to create a model that categorises elements | Performs “end-to-end learning”
Shallow learning | Deep learning algorithms scale with data

This is only a basic overview of Deep Learning. There are several different Deep Learning methods. In the next tutorial, we will have a look at different types of Deep Learning networks.


In the first posts, I introduced different types of Machine Learning concepts. One of them is classification. Basically, classification is about identifying to which of a set of categories a certain observation belongs. Classification is normally a supervised learning technique. A typical classification task is spam detection in e-mails: the two possible classes in this case are “spam” and “no spam”. Two of the most common classification algorithms are Naive Bayes and Random Forest.

What classification algorithms are there?

Basically, there are a lot of classification algorithms available and when working in the field of Machine Learning, you will keep discovering new ones. In this tutorial, we will only focus on the two most important ones (Random Forest, Naive Bayes) and the basic one (Decision Tree).

The Decision Tree classifier

The basic classifier is the Decision Tree classifier. It builds classification models in the form of a tree structure. The dataset is broken down into smaller and smaller subsets, becoming more detailed with each node until a leaf is reached. It could be compared to a survey, where each answer has an effect on the next question. Let’s assume the following case: Tom was captured by the police and is a suspect in robbing a bank. The questions could form the following tree structure:

Basic sample of a Decision Tree

Basically, by going from one node to the next, you get closer to the result of either “guilty” or “not guilty”. Also, each leaf has a weight.

The Random Forest classification

Random forest is a really great classifier, often used and also often very efficient. It is an ensemble classifier built from many decision tree models. Ensemble models combine the results of their individual models. The random forest model can be used for both regression and classification.

Basically, it divides the data set into subsets and trains a tree on each of them. Random forest models run efficiently on large datasets, since the computation can be split and thus the model is easy to run in parallel. It can handle thousands of input variables without variable deletion. It computes proximities between pairs of cases that can be used in clustering, locating outliers or (by scaling) giving interesting views of the data.
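The idea of training many trees on random subsets and letting them vote can be sketched in plain Python. This is a strongly simplified illustration and not a full random forest: each "tree" is only a decision stump with a single split, the random sampling of features is left out, and the loan data is made up.

```python
import random
from collections import Counter

def train_stump(rows):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for f in range(len(rows[0][0])):
        for features, _ in rows:
            t = features[f]
            left = [y for xs, y in rows if xs[f] <= t]
            right = [y for xs, y in rows if xs[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda x: l_lab if x[f] <= t else r_lab

def train_forest(rows, n_trees=25, seed=0):
    """Train every stump on its own random subset drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(train_stump(sample))
    return forest

def predict(forest, x):
    """Combine the individual results by majority vote."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical loan data: (income in thousands, debt in thousands) -> outcome.
loans = [
    ((20, 15), "default"), ((25, 5), "default"),
    ((30, 20), "default"), ((35, 10), "default"),
    ((60, 10), "repaid"), ((70, 20), "repaid"),
    ((80, 5), "repaid"), ((90, 15), "repaid"),
]
forest = train_forest(loans)
print(predict(forest, (95, 12)), predict(forest, (18, 12)))
```

Since every stump is trained independently of the others, the loop in train_forest is the part that could be split across several machines.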

There are also some disadvantages with the random forest classifier: the main problem is its complexity. Working with random forest is more challenging than classic decision trees and thus needs skilled people. Also, the complexity creates large demands for compute power.

Random Forest is often used by financial institutions. A typical use-case is credit risk prediction. If you have ever applied for a loan, you might know the questions being asked by banks. The answers are often fed into random forest models.

The Naive Bayes classifier

The Naive Bayes classifier is based on prior knowledge of conditions that might relate to an event. It is based on Bayes’ Theorem. It assumes strong independence between features. It uses categorical data to calculate ratios between events.

Naive Bayes has several benefits. It can predict the classes of data sets easily and quickly. Also, it can predict multiple classes. When the independence assumption holds, Naive Bayes can perform better than models such as logistic regression, and it needs a lot less training data.

A key challenge is that if a categorical variable has a category which was not observed in the training data set, the model will assign it a probability of 0 (zero), which makes it unable to make a prediction. Also, it is known to be a rather bad estimator, so its probability outputs should be treated with care. Also, it is rather complex to use.
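A common remedy for the zero probability problem is additive (Laplace) smoothing, which is not covered in this post but is worth knowing. The sketch below shows the effect on a single likelihood; the spam data and the category names are made up.

```python
def smoothed_likelihood(value, observed_values, n_categories, alpha=1.0):
    """P(value | class) with additive (Laplace) smoothing; alpha=0 disables it."""
    count = observed_values.count(value)
    return (count + alpha) / (len(observed_values) + alpha * n_categories)

# Hypothetical training data: subject category of e-mails labelled "spam".
spam_subjects = ["offer", "offer", "prize"]
n_categories = 3  # "offer", "prize" and the unseen category "invoice"

unsmoothed = smoothed_likelihood("invoice", spam_subjects, n_categories, alpha=0.0)
smoothed = smoothed_likelihood("invoice", spam_subjects, n_categories, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing, the unseen category gets a likelihood of zero, which wipes out the whole product of probabilities. With alpha set to 1, it gets a small but non-zero value.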

As stated, there are many more algorithms available. In the next tutorial, we will have a look at Deep Learning.


In the previous tutorial posts, we looked at Linear Regression and discussed some basics of statistics such as the Standard Deviation and the Standard Error. Today, we will look at Logistic Regression. It is similar in name to linear regression, but different in usage. Let’s have a look.

The Logistic Regression explained

One of the main differences between Linear Regression and Logistic Regression is that the outcome of logistic regression is binary: it calculates values between 0 and 1 and thus states if something is rather true or false. This means that the result of a prediction could be “fail” or “succeed” for a test. In a churn model, this would mean that a customer either stays with the company or leaves the company.

Another key difference to Linear Regression is that the regression curve can’t be calculated directly. Therefore, in Logistic Regression, the regression curve is “estimated” and optimised. There is a mathematical method for this estimation, called the “Maximum Likelihood Method”. Normally, these parameters are calculated by Machine Learning tools so that you don’t have to do it.

Another aspect is the concept of “odds”. Basically, the odds express how likely an event is to happen compared to not happening. This could be a certain team winning a soccer game: let’s assume that Team X wins 7 out of 10 games (thus losing 3; we ignore draws). The probability of winning is 7 out of 10, so the odds in this case would be 7:3 for winning or 3:7 for losing.
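The relation between probability, odds and the logistic function can be checked with a few lines of Python, using the Team X numbers (7 wins, 3 losses). Odds compare wins to losses, not wins to games played. The variable names are just for illustration.

```python
import math

wins, losses = 7, 3

probability = wins / (wins + losses)      # 7 out of 10 games
odds = probability / (1 - probability)    # 7 wins for every 3 losses
log_odds = math.log(odds)                 # the quantity logistic regression models

# The logistic function turns the log-odds back into a probability.
recovered = 1 / (1 + math.exp(-log_odds))
print(probability, odds, log_odds, recovered)
```

Logistic regression models the log-odds as a linear function of the inputs and then uses the logistic function to get back to a value between 0 and 1.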

This time we won’t calculate the Logistic Regression, since it is way too long. In the next tutorial, I will focus on classifiers such as Random Forest and Naive Bayes.


In my previous posts we had a look at some fundamentals of machine learning and had a look at the linear regression. Today, we will look at another statistical topic: false positives and false negatives. You will come across these terms quite often when working with data, so let’s have a look at them.

The false positive

In statistics, there is one error called the false positive error. This happens when the prediction states something to be true, but in reality it is false. To easily remember the false positive, you could describe it as a false alarm. A simple example is the airport security check: when you pass the security check, you have to walk through a metal detector. If you don’t carry any metal items (since you left them for the x-ray!), no alarm will go off. But in some rather rare cases, the alarm might still go off. Either you forgot something or the metal detector made an error, in this case a false positive. The metal detector predicted that you have metal items somewhere with you, but in fact you don’t.

Another example of a false positive in machine learning would be in image recognition: imagine your algorithm is trained to recognise cats. There are so many cat pictures on the web, so it is easy to train this algorithm. However, you then feed the algorithm the image of a dog and the algorithm calls it a cat, even though it is a dog. This again is a false positive.

In a business context, your algorithm might predict that a specific customer is going to buy a certain product for sure, but in fact, this customer didn’t buy it. Again, here we have our false positive. Now, let’s have a look at the other error: the false negative.

The false negative

The other error in statistics is the false negative. Similar to the false positive, it is something that should be avoided. It is very similar to the false positive, just the other way around. Let’s look at the airport example one more time: you wear a metal item (such as a watch) and go through the metal detector. You simply forgot to take off the watch. And the metal detector doesn’t go off this time. Now you are a false negative: the metal detector stated that you don’t carry any metal items, but in fact you did. A condition was predicted to be false, but in fact it was true.
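Counting both kinds of errors is straightforward once you have the predictions and the truth side by side. The following sketch uses made-up metal detector results, where True means “metal present”:

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1   # alarm, and there was metal
        elif guess and not truth:
            counts["FP"] += 1   # alarm, but no metal: false alarm
        elif not guess and truth:
            counts["FN"] += 1   # no alarm, but there was metal
        else:
            counts["TN"] += 1   # no alarm, and no metal
    return counts

# Hypothetical results for six passengers.
actual = [True, False, False, True, False, True]
predicted = [True, True, False, False, False, True]
counts = confusion_counts(actual, predicted)
print(counts)
```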

Counting false positives and false negatives is often useful to score the quality of your predictions. Now that you understand some of the most important basics of statistics, we will have a look at another machine learning algorithm in my next post: the logistic regression.


In our last tutorial, we learned how to write a Linear Regression model by hand. Also, we had a look at the prediction error and standard error. Today, we want to focus on a way to measure the performance of a model. In marketing, a common methodology for this is lift and gain charts. They can also be used for other things, but in today’s example we will use a marketing scenario.

The marketing scenario for Lift and Gain charts

Let’s assume that you are in charge of an outbound call campaign. Basically, your goal is to increase conversions of people contacted via this campaign. Like with most campaigns, you have a certain, limited budget and thus need to plan the campaign smartly. This is where machine learning comes into play: you only want to contact those people that are most likely to buy the product. Therefore, you contact the top X percent of customers where you expect a conversion and avoid contacting those customers that are very unlikely to convert. We assume that you already built a model for that and that we now run the campaign. We will measure our results with a gain chart, but first let’s create some data.

Our sample data represents all our customers, grouped into deciles. Basically, we group the customers into top 10%, top 20%, … until we reach all customers. We add the number of conversions to it as well:

Decile | # of Customers | Conversions
1 | 200 | 33
2 | 200 | 30
3 | 200 | 27
4 | 200 | 25
5 | 200 | 23
6 | 200 | 19
7 | 200 | 15
8 | 200 | 11
9 | 200 | 7
10 | 200 | 2

As you can see in the above table, the first decile contains the most conversions and is thus our top group. Each group’s share of all 192 conversions, in percent, is:

% Conversions
17,2%
15,6%
14,1%
13,0%
12,0%
9,9%
7,8%
5,7%
3,6%
1,0%

As you can see, 17.2% of all conversions come from the top 10% of customers. From group to group, this share declines. So, the best approach is to first contact the top customers. As a next step, we add the cumulative conversions. This number is then used for our cumulative gain chart.

Cumulative % Conversions
17,2%
32,8%
46,9%
59,9%
71,9%
81,8%
89,6%
95,3%
99,0%
100,0%

Cumulative Gain Chart

With this data, we can now create the cumulative gain chart. In our case, this would look like the following:

A cumulative gain chart

The Lift factor

Now, let’s have a look at the lift factor. The baseline for the lift factor is always a lift of 1. This corresponds to selecting a random sample without any structured approach. Basically, the lift factor is the ratio between the cumulative share of conversions in % and the share of customers contacted in %. With our sample data, the lift would look like the following:

Lift
1,72
1,64
1,56
1,50
1,44
1,36
1,28
1,19
1,10
1,00

Thus we would have a lift factor of 1.72 with the first decile, decreasing towards 1 for the full customer set.
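The gain and lift columns can be reproduced from the customer and conversion numbers with a short Python sketch. The variable names are made up; the numbers are the sample data from this post.

```python
customers = [200] * 10
conversions = [33, 30, 27, 25, 23, 19, 15, 11, 7, 2]

total_customers = sum(customers)
total_conversions = sum(conversions)

gain = []  # cumulative share of all conversions
lift = []  # cumulative share of conversions / share of customers contacted
seen_customers = 0
seen_conversions = 0
for n_customers, n_conversions in zip(customers, conversions):
    seen_customers += n_customers
    seen_conversions += n_conversions
    share_contacted = seen_customers / total_customers
    share_converted = seen_conversions / total_conversions
    gain.append(share_converted)
    lift.append(share_converted / share_contacted)

for decile, (g, l) in enumerate(zip(gain, lift), start=1):
    print(f"{decile:>2}  {g:6.1%}  {l:4.2f}")
```

Plotting the gain values against the share of customers contacted gives the cumulative gain chart.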

In this tutorial, we’ve learned about how to verify a machine learning model. In the next tutorial, we will have a look at false positives and some other important topics before moving on with Logistic Regression.


In my previous posts, I explained Linear Regression and stated that its predictions contain some errors. There is the error of prediction (for individual predictions) and there is also the standard error. A prediction is good if the individual errors of prediction and the standard error are small. Let’s start by examining the error of prediction before moving on to the standard error of our linear regression model.

Error of prediction in Linear regression

Let’s recall the table from the previous tutorial:

Year | Ad Spend (X) | Revenue (Y) | Prediction (Y’)
2013 | € 345.126,00 | € 41.235.645,00 | € 48.538.859,48
2014 | € 534.678,00 | € 62.354.984,00 | € 65.813.163,80
2015 | € 754.738,00 | € 82.731.657,00 | € 85.867.731,47
2016 | € 986.453,00 | € 112.674.539,00 | € 106.984.445,76
2017 | € 1.348.754,00 | € 156.544.387,00 | € 140.001.758,86
2018 | € 1.678.943,00 | € 176.543.726,00 | € 170.092.632,46
2019 | € 2.165.478,00 | € 199.645.326,00 | € 214.431.672,17

We can see that there is a clear difference between the prediction and the actual numbers. We calculate the error in each prediction by taking the real value minus the prediction:

Y-Y’
-€   7.303.214,48
-€   3.458.179,80
-€   3.136.074,47
 €   5.690.093,24
 € 16.542.628,14
 €   6.451.093,54
-€ 14.786.346,17

In the above table, we can see how each prediction differs from the real value. Thus it is our prediction error on the actual values.

Calculating the Standard Error

Now, we want to calculate the standard error. First, let’s have a look at the formula:

standard error = square root of ( sum of (Y - Y’)^2 / N )

Basically, we take the sum of all squared errors, divide it by the number of occurrences and take the square root of it. We already have Y-Y’ calculated, so we only need to square it:

Y-Y’ | (Y-Y’)^2
-€ 7.303.214,48 | € 53.336.941.686.734,40
-€ 3.458.179,80 | € 11.959.007.558.032,20
-€ 3.136.074,47 | € 9.834.963.088.101,32
€ 5.690.093,24 | € 32.377.161.053.416,10
€ 16.542.628,14 | € 273.658.545.777.043,00
€ 6.451.093,54 | € 41.616.607.923.053,70
-€ 14.786.346,17 | € 218.636.033.083.835,00

The sum of it is 641.419.260.170.216,00 €

And N is 7, since it contains 7 Elements. Divided by 7, it is: 91.631.322.881.459,50 €

The last step is to take the square root, which results in the standard error of 9.572.425,13 € for our linear regression.
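The same calculation can be double-checked with a few lines of Python, using the errors (Y-Y’) from the table above. The variable names are made up for illustration.

```python
import math

# Errors (Y - Y') per year, taken from the table above.
errors = [
    -7303214.48, -3458179.80, -3136074.47, 5690093.24,
    16542628.14, 6451093.54, -14786346.17,
]

sum_of_squares = sum(e ** 2 for e in errors)
standard_error = math.sqrt(sum_of_squares / len(errors))
print(round(standard_error, 2))
```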

Now, we have most items cleared for our linear regression and can move on to the logistic regression in our next tutorial.


In my previous posts, I introduced the basics of machine learning. Today, I want to focus on the two elementary algorithms: linear and logistic regression. Basically, you would learn them at the very beginning of your machine learning journey, but probably not use them much later on. Still, understanding them helps you understand the underlying concepts.

Linear Regression

Linear Regression is the simplest model in Data Science. It is a supervised learning technique and is used in Trend Analysis, Time-Series Analysis, Risk in Banking and many more.

In a linear regression, the relationship between a dependent variable y and one or more variables x is assumed to be linear. This basically means that if the data follows a specific trend, a future trend can be predicted. Let’s assume that there is a significant relation between ad spend and sales. We would have the following table:

Year | Ad Spend | Revenue
2013 | € 345.126,00 | € 41.235.645,00
2014 | € 534.678,00 | € 62.354.984,00
2015 | € 754.738,00 | € 82.731.657,00
2016 | € 986.453,00 | € 112.674.539,00
2017 | € 1.348.754,00 | € 156.544.387,00
2018 | € 1.678.943,00 | € 176.543.726,00
2019 | € 2.165.478,00 | € 199.645.326,00

If you look at the data, it is very easy to figure out that there is some kind of relation between how much money you spend on ads and the revenue you get. Basically, the ratio is 1:92 to 1:119. Please note that I totally made up the numbers. However, based on these numbers, you could basically predict what revenue to expect when spending X amount of money. The relation between them is therefore linear and we can easily plot it on a line chart:

Linear Regression

As you can see, some of the values are above the line and others below. Let’s now manually calculate the linear function. There are some steps necessary that eventually lead to the prediction values. Let’s assume we want to know what revenue we can expect if we spend a specific amount of money on ads, in our case 1 Million. The linear regression function for this is:

predicted score (Y') = bX + intercept (A)

This means that we now need to calculate two values: the slope (it is our “b”) and the intercept (it is our “A”). X is the only value we know: our 1 Million spend. Let’s first calculate the slope.

Calculating the Slope

The first thing we need to do is calculate the slope. For this, we need the squared deviation from the average for X and the product of the deviations for X and Y (labelled Stddev in the tables below, although strictly speaking they are not standard deviations yet). Let’s start with X, our ad spend. The value is calculated for each ad spend individually. There are some steps involved:

  • Creating the average of the ad spend
  • Subtracting the individual ad spend
  • Building the square

The first step is to create the average of both values. The average for the revenues should be:  € 118.818.609,14 and the average for the spend should be:  € 1.116.310,00.

Next, we need to create the squared deviation of each item. For the ad spend, we do this by subtracting each individual ad spend from the average and building the square. The table for this should look like the following:

The formula is: (average of ad spend - ad spend) ^ 2

Year | Ad spend | Stddev (X)
2013 | € 345.126,00 | € 594.724.761.856,00
2014 | € 534.678,00 | € 338.295.783.424,00
2015 | € 754.738,00 | € 130.734.311.184,00
2016 | € 986.453,00 | € 16.862.840.449,00
2017 | € 1.348.754,00 | € 54.030.213.136,00
2018 | € 1.678.943,00 | € 316.555.892.689,00
2019 | € 2.165.478,00 | € 1.100.753.492.224,00

Quite huge numbers already, right? Now, let’s create the matching values for the revenues. This is done by taking (average of ad spend - ad spend) and multiplying it by (average of revenue - revenue) for each year. This should result in:

Year | Revenue | Y_Ad_Stddev
2013 | € 41.235.645,00 | € 59.830.740.619.545,10
2014 | € 62.354.984,00 | € 32.841.051.219.090,30
2015 | € 82.731.657,00 | € 13.048.031.460.197,10
2016 | € 112.674.539,00 | € 797.850.516.541,00
2017 | € 156.544.387,00 | € 8.769.130.708.225,71
2018 | € 176.543.726,00 | € 32.478.055.672.684,90
2019 | € 199.645.326,00 | € 84.800.804.871.574,80

Now, we only need to sum up the Stddev (X) column and the Y_Ad_Stddev column. The sums should be:

€ 2.551.957.294.962,00 for the X column
€ 232.565.665.067.859,00 for the Y column

Now, we need to divide the sum of the Y column by the sum of the X column and get the following slope: 91,1322715

Calculating the Intercept

The intercept is somewhat easier. The formula for it is: average(y) - slope * average(x). We already have all relevant variables calculated in our previous step. Our intercept should equal: € 17.086.743,14.

Predicting the value with the Linear Regression

Now, we can build our function. This is: Y = 91,1322715X + 17.086.743,14

As stated in the beginning, our X should be 1 Million and we want to know our revenue:  € 108.219.014,64
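The whole calculation, from the averages to the prediction, can be reproduced with a short Python sketch. The variable and function names are made up for illustration; the numbers are the sample data from this post.

```python
ad_spend = [345126, 534678, 754738, 986453, 1348754, 1678943, 2165478]
revenue = [41235645, 62354984, 82731657, 112674539,
           156544387, 176543726, 199645326]

mean_x = sum(ad_spend) / len(ad_spend)
mean_y = sum(revenue) / len(revenue)

# Sum of (average of ad spend - ad spend) ^ 2
sum_xx = sum((mean_x - x) ** 2 for x in ad_spend)
# Sum of (average of ad spend - ad spend) * (average of revenue - revenue)
sum_xy = sum((mean_x - x) * (mean_y - y) for x, y in zip(ad_spend, revenue))

slope = sum_xy / sum_xx
intercept = mean_y - slope * mean_x

def predict(x):
    """Predicted revenue (Y') for a given ad spend (X)."""
    return slope * x + intercept

print(slope, intercept, predict(1_000_000))
```

Calling predict with 2 Million or 400k instead lets you try out other ad spends without redoing the tables by hand.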

The prediction is actually lower than the nearest real values (2016 and 2017). If you change the ad spend to 2 Million or 400k, the prediction gets closer to the real values again. Predictions always produce some errors, and these are normally reported alongside the prediction. Therefore, the error table would look like the following:

Year | Ad spend | Real revenue (Y) | Prediction (Y’) | Error
2013 | € 345.126,00 | € 41.235.645,00 | € 48.538.859,48 | -€ 7.303.214,48
2014 | € 534.678,00 | € 62.354.984,00 | € 65.813.163,80 | -€ 3.458.179,80
2015 | € 754.738,00 | € 82.731.657,00 | € 85.867.731,47 | -€ 3.136.074,47
2016 | € 986.453,00 | € 112.674.539,00 | € 106.984.445,76 | € 5.690.093,24
2017 | € 1.348.754,00 | € 156.544.387,00 | € 140.001.758,86 | € 16.542.628,14
2018 | € 1.678.943,00 | € 176.543.726,00 | € 170.092.632,46 | € 6.451.093,54
2019 | € 2.165.478,00 | € 199.645.326,00 | € 214.431.672,17 | -€ 14.786.346,17

The error calculation is done by using the real value and deducting the predicted value from it. And voila – you have your error. One common thing in machine learning is to reduce the error and make predictions more accurate.


I teach Big Data & Data Science at several universities and I work in that field also. Since I wrote a lot here on Big Data itself and there are now many young professionals deciding if they want to go for data science, I decided to write a short intro series to machine learning. After this intro, you should be capable of getting deeper into this topic and know where to start. To kick off the series, we’ll go over some basics of machine learning. The first part for this is about supervised and unsupervised learning.

One of the main ideas behind machine learning is to find patterns in data and make predictions on that data without the need to develop each and every use-case from scratch. For this purpose, a number of algorithms are available. These algorithms can be “classified” by how they work. The three main principles (which can then also be split further) are:

  • Supervised Learning
  • Unsupervised Learning
  • Semi-supervised Learning

Supervised Learning

With supervised learning, the algorithm basically learns from existing data, that is “from the past”. This means that there is a lot of training data that allows the algorithm to find the patterns by learning from it. We can also call this “a teacher”. It works similarly to how we as humans learn: we get information from our parents, teachers and friends and combine this to make future predictions. Examples are:

  • Manufacturing: if a material had certain properties, the quality was either good or bad (or maybe rated on a numeric scale). Now, if we produce a new material and we look at the properties of the material, based on the existing data we have from former productions, we can say how the quality will be. Properties of a material might be: hardness, color, …
  • Banking: based on several properties of a potential borrower, we can predict if the person is capable of paying back the loan. This can be based on existing data of former customers and what the bank “learned” from them. The algorithm takes a lot of different variables into consideration. These variables can be income, monthly liabilities, education, job, etc.

Unsupervised Learning

With unsupervised learning, we have no “teacher” available. The algorithms get data and try to find patterns in it. This can be done by clustering data (e.g. customers with high income, customers with low income, …) and making predictions based on that. An unsupervised learning algorithm can be useful for several use-cases. Below are some examples:

  • Manufacturing: find anomalies in the production lines (e.g. the average output of units per hour was between 200 and 250, but on day D at time T, the output was only 20 units). The algorithm can cluster data into normal output and an anomaly that was detected.
  • Banking: normally, the customer would only spend money in his home country. The algorithm detects an abnormal behaviour, like money transferred in a country that he normally isn’t in. This can be an indicator for fraud.

Last, but not least, there is semi-supervised learning, which is a combination of both. In many machine learning projects, not all training data that is needed for supervised learning is available, so values might need to be predicted. This can be done by combining supervised and unsupervised learning algorithms and then working with the “curated” data.

Now that we basically understand the 3 main concepts of supervised, unsupervised and semi supervised learning, we can continue with variations within these concepts and some statistical background in the next post.
