In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. After introducing the key concepts of Deep Learning in the previous post, we will now have a look at two architectures: the Convolutional Neural Network (CNN) and the Feedforward Neural Network.

The Feedforward Neural Network

Feedforward neural networks are the most general-purpose type of neural network. The entry point is the input layer, followed by several hidden layers and an output layer. Each layer is connected to the previous layer. These connections are one-way only, so nodes can’t form a cycle. The information in a feedforward network moves in only one direction – from the input layer, through the hidden layers, to the output layer. It is the simplest version of a Neural Network. The image below illustrates the Feedforward Neural Network.

Feedforward Neural Network
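To make this more concrete, here is a minimal sketch of a forward pass through a small feedforward network with one hidden layer, written in plain NumPy (my own illustration, not code from the post – the layer sizes and random weights are arbitrary assumptions, and a real network would learn its weights during training):

import numpy as np

def relu(x):
    # simple non-linear activation: negative values become 0
    return np.maximum(0, x)

# arbitrary layer sizes: 4 inputs -> 3 hidden neurons -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(2)

x = np.array([0.5, -1.2, 3.3, 0.0])  # one input sample

# information flows one way only: input -> hidden -> output, no cycles
hidden = relu(x @ W1 + b1)
output = hidden @ W2 + b2
print(output)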

Convolutional Neural Networks (CNN)

The Convolutional Neural Network is very effective in image recognition and similar tasks, and for that reason it is also well suited for video processing. The difference from the Feedforward neural network is that the CNN works with 3 dimensions: width, height and depth. Also, not all neurons in one layer are fully connected to the neurons in the next layer. There are three different types of layers in a Convolutional Neural Network which also distinguish it from feedforward neural networks:

Convolution Layer

Convolution puts the input image through several convolutional filters. Each filter activates certain features, such as edges, colors or objects. The outputs of these filters form the feature map. The deeper the network goes, the more sophisticated those filters become. The convolutional layer automatically learns which features are most important to extract for a specific task.

Rectified linear units (ReLU)

The goal of this layer is to allow for faster and more effective training. Negative values in the layers are mapped to zero, while positive values are kept.

Pooling/Subsampling

Pooling simplifies the output by performing nonlinear downsampling. This reduces the number of parameters that the network needs to learn. In convolutional neural networks, the operation is useful since neighbouring outputs usually carry similar information.
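To illustrate the three layer types, here is a small sketch in plain NumPy (my own simplification, not code from the post): a single hand-crafted filter is convolved over a toy image, ReLU removes the negative values, and max pooling downsamples the result. In a real CNN, the filters are learned during training rather than hand-crafted:

import numpy as np

def conv2d(image, kernel):
    # naive "valid" convolution: slide the kernel over the image
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(0, x)  # negative values are removed

def max_pool(x, size=2):
    # nonlinear downsampling: keep the maximum of each size x size block
    out_h, out_w = x.shape[0] // size, x.shape[1] // size
    trimmed = x[:out_h * size, :out_w * size]
    return trimmed.reshape(out_h, size, out_w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)          # a toy grayscale "image"
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])  # a simple vertical-edge filter

feature_map = max_pool(relu(conv2d(image, edge_filter)))
print(feature_map.shape)              # (3, 3)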

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like – read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

For the data itself, a lot of different sources are needed. They differ a lot depending on the company and industry. However, to create a comprehensive view of your company, your own data alone isn’t enough. There are several other data sources you should consider.

The three data sources


Data you already have

The first data source – data you already have – seems to be the easiest. However, it isn’t as easy as you might believe. Bringing your data in order is actually a very difficult task and can’t be achieved that easily. I’ve written several blog posts here about the challenges around data, and you can review them; basically, all of them focus on your internal data sources. I won’t re-state them in detail here, but it is mainly about data governance and access.

Data that you can acquire

The second data source – data you can acquire – is another important aspect. By acquire I basically mean everything that you don’t have to buy from an external data provider. You might use surveys (and pay for them as well) or acquire the data from open data platforms. Also, you might collect data from social media or with other kinds of crawlers. This data source is very important, as it can give you a great overview and insights into your specific questions.

In the past, I’ve seen a lot of companies utilising this second source, and we did a lot in that area. For this kind of data, you don’t necessarily have to pay – some data sources are free. And if you pay for something, you don’t pay for the data itself but rather for the (semi-)manual way of collecting it. Here, too, it differs heavily from industry to industry and depends on what the company is all about. I’ve seen companies collecting data from news sites to get insights into their competition and mentions, or simply scanning social media. A lot is possible with this data source.

Data you can buy

The last one – data you can buy – is easy to get but very expensive in cash-out terms. There are a lot of data providers selling different kinds of data, often demographic data or data about customers. Some platforms collect data from a large number of online sites, tracking individuals and their behavior across those sites. Such platforms then sell this data to marketing departments that want more insights. You can buy this kind of data from these platforms and thus enrich your own first-party and second-party data. Imagine you are operating a retail business selling all kinds of furniture.

You would probably not know much about your web shop visitors, since they are anonymous until they buy something. With data bought from such data providers, it would now be possible to figure out that an anonymous visitor is an outdoor enthusiast, and you might adjust your offers to match his or her interests. Or you might learn that the person visiting your shop recently bought a countryside house with a garden, and present garden furniture or barbecue accessories instead. With this kind of third-party data, you can achieve a lot and better understand your customers and your company.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. If you are looking for open data, I would recommend browsing some open data catalogs, like the open data catalog from the U.S. government.

In the last couple of posts, we’ve learned about various aspects of Machine Learning. Now, we will focus on another aspect of Machine Learning: Deep Learning. In this post, I will give an introduction to Deep Learning. Over the last couple of years, it has been at the center of the hype around AI. But what is so exciting about Deep Learning? First, let’s have a look at its key concepts.

A brief introduction to Deep Learning

Basically, Deep Learning is meant to function similarly to the human brain. Everything is built around Neurons, which work in networks (neural networks). The smallest element in a neural network is the neuron, which takes input values and creates an output value based on its weights and bias. The following image shows the Neuron in Deep Learning:

The Neuron in a Neural Network in Deep Learning
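As a minimal sketch of this description (my own illustration, not code from the post), a neuron computes a weighted sum of its inputs, adds its bias and passes the result through an activation function:

import numpy as np

def neuron(inputs, weights, bias):
    # weighted sum of the inputs plus the bias, squashed by a sigmoid activation
    z = np.dot(inputs, weights) + bias
    return 1 / (1 + np.exp(-z))

print(neuron(np.array([0.5, 0.3]), np.array([0.8, -0.2]), 0.1))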

Next, there are Layers in the Network, each of which consists of several Neurons. Each Layer applies some transformations that eventually lead to the end result, and each Layer gets closer to the target result. If your Deep Learning model is built to recognise handwriting, the first layer would probably recognise gray-scales, the second layer connections between different pixels, the third layer simple figures and the fourth layer the letter itself. The following image shows a typical neural net:

A neural net for Deep Learning

A typical workflow in a neural net calculation for image recognition could look like this:

  • All images are split into batches
  • Each batch is sent to the GPU for calculation
  • The model starts the analysis with random weights
  • A cost function is specified that compares the results with the ground truth
  • The error is propagated back through the network (back propagation)
  • Once a model calculation is finished, the result is merged and returned
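As a hedged sketch of this workflow (assuming PyTorch, which the post does not prescribe; the layer sizes and the random stand-in data are made up), the cost function and back propagation steps could look like this:

import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()                  # the cost function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(32, 1, 28, 28)              # one random batch standing in for real images
labels = torch.randint(0, 10, (32,))             # the "truth" for that batch

for epoch in range(3):
    optimizer.zero_grad()
    outputs = model(images)                      # forward pass with the current (initially random) weights
    loss = loss_fn(outputs, labels)              # compare the results with the truth
    loss.backward()                              # back propagation
    optimizer.step()                             # update the weights
    print(loss.item())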

How is it different from Machine Learning?

Although Deep Learning is often considered to be a “subset” of Machine Learning, it is quite different. For many tasks, Deep Learning achieves better results than “traditional” machine learning models. The following table provides an overview of these differences:

Machine Learning | Deep Learning
Feature extraction happens manually | Feature extraction is done automatically
Features are used to create a model that categorises elements | Performs “end-to-end learning”
Shallow learning | Deep learning algorithms scale with data

This is only a basic overview of Deep Learning. There are several different Deep Learning methods, and in the next tutorial we will have a look at different types of Deep Learning networks.


In the first posts, I introduced different types of Machine Learning concepts. One of them is classification. Basically, classification is about identifying to which set of categories a certain observation belongs. Classification is normally a supervised learning technique. A typical classification task is spam detection in e-mails – the two possible classes in this case are either “spam” or “no spam”. Two of the most common classification algorithms are the Naive Bayes classifier and the Random Forest classifier.

What classification algorithms are there?

Basically, there are a lot of classification algorithms available, and when working in the field of Machine Learning, you will keep discovering new ones. In this tutorial, we will only focus on the two most important ones (Random Forest, Naive Bayes) and the basic one (Decision Tree).

The Decision Tree classifier

The most basic classifier is the Decision Tree classifier. It builds classification models in the form of a tree structure: the dataset is broken down into smaller and smaller subsets at each node. It can be compared to a survey in which each answer influences the next question. Let’s assume the following case: Tom was captured by the police and is a suspect in robbing a bank. The questions could be represented by the following tree structure:

Basic sample of a Decision Tree

Basically, by going from one node to the next, you get closer to the result of either “guilty” or “not guilty”. Also, each leaf has a weight.
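As a hedged sketch (the features and labels are invented just for illustration, and scikit-learn is my assumption), a decision tree classifier can be trained in a few lines:

from sklearn.tree import DecisionTreeClassifier

# invented features: [was near the bank, has an alibi], label: 1 = guilty, 0 = not guilty
X = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y = [1, 0, 0, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(X, y)
print(clf.predict([[1, 0]]))  # e.g. [1] -> "guilty"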

The Random Forest classification

Random Forest is a really great classifier, often used and often very efficient. It is an ensemble classifier made up of many decision tree models; the ensemble combines their individual results. The random forest model can be used for both regression and classification.

Basically, it divides the data set into subsets and trains a tree on each of them. Random forest models run efficiently on large datasets, since the computation can be split up, making it easy to run the model in parallel. It can handle thousands of input variables without variable deletion. It computes proximities between pairs of cases that can be used in clustering, locating outliers or (by scaling) giving interesting views of the data.

There are also some disadvantages with the random forest classifier: the main problem is its complexity. Working with random forest is more challenging than classic decision trees and thus needs skilled people. Also, the complexity creates large demands for compute power.

Random Forest is often used by financial institutions. A typical use case is credit risk prediction. If you have ever applied for credit, you might know the questions banks ask – the answers are often fed into random forest models.
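A hedged sketch of such a model in scikit-learn (the credit-risk features and labels below are purely invented) could look like this:

from sklearn.ensemble import RandomForestClassifier

# invented applicant data: [income in kEUR, years employed, existing loans], label: 1 = default
X = [[30, 1, 2], [80, 10, 0], [45, 3, 1], [25, 0, 3], [60, 7, 1], [90, 15, 0]]
y = [1, 0, 0, 1, 0, 0]

model = RandomForestClassifier(n_estimators=100, n_jobs=-1)  # the trees can be built in parallel
model.fit(X, y)
print(model.predict([[40, 2, 2]]))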

The Naive Bayes classifier

The Naive Bayes classifier is based on the Bayes Theorem and on prior knowledge of conditions that might relate to an event. A strong independence between the features is assumed. It uses categorical data to calculate ratios between events.
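A minimal sketch using scikit-learn (the spam example and the word counts are my own assumptions, not from the post):

from sklearn.naive_bayes import MultinomialNB

# toy word counts per e-mail: [count of "free", count of "meeting"], label: 1 = spam
X = [[3, 0], [2, 0], [0, 2], [0, 3], [1, 1], [4, 0]]
y = [1, 1, 0, 0, 0, 1]

nb = MultinomialNB()
nb.fit(X, y)
print(nb.predict([[2, 0]]))        # likely [1] -> spam
print(nb.predict_proba([[2, 0]]))  # class probabilities derived via the Bayes Theorem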

Naive Bayes has several benefits. It can predict classes of data sets easily and quickly, and it can also predict multiple classes. Naive Bayes often performs well compared to models such as logistic regression, and a lot less training data is needed.

A key challenge is that if a categorical variable has a category which was not observed in the training data set, the model will assign it a zero probability and will be unable to make a prediction. It is also known to be a rather bad estimator, and it can be complex to apply correctly.

As stated, there are many more algorithms available. In the next tutorial, we will have a look at Deep Learning.


One of the reasons why Python is so popular for Data Science is that it has a very rich set of functionality for Mathematics and Statistics. In this tutorial, I will show the very basic functions; don’t expect too much, since they really are basic. When we talk about real data science, you might rather consider learning scikit-learn, PyTorch or Spark ML. However, today’s tutorial focuses on these elementary building blocks before moving on to the more complex tutorials.

Basic Mathematics in Python from the math Library

The math library in Python provides most of the relevant functionality you might want when working with numbers. The following samples provide some overview of it:

import math
vone = 1.2367
print(math.ceil(vone))

First, we import “math” from the standard library and then create a value. The first function we use is math.ceil(), which rounds a number up to the next integer. In the following sample, we calculate the greatest common divisor of two numbers.

math.gcd(44,77)

Other functions are the logarithm, power, cosine and many more. Some of them are shown in the following sample:

math.log(5)
math.pow(2,3)
math.cos(4)
math.pi

Basic statistics in Python from the statistics library

The standard library offers some elementary statistical functions. We will first import the library and then calculate the mean of 5 values:

from statistics import mean, median, stdev, variance
values = [1,2,3,4,5]
mean(values)

Some other possible functions are:

median(values)
stdev(values)
variance(values)

Have a look at those two libraries – there is quite a lot to explore.

What’s next?

Now, the tutorial series for Python is over. You should now be fit to use PySpark. If you are not yet familiar with Spark, have a look at the Spark Tutorial I created here. Also, I will create more tutorials on Python and Machine Learning in the future, so make sure to check back often to the Big Data & Data Science tutorial overview. I hope you liked this tutorial. If you have any suggestions on what to improve, please feel free to get in touch with me! If you want to learn more about Python, I also recommend the official page.

In recent years, a lot of traditional companies founded digital labs that were supposed to serve their digitalisation efforts. But if you look at the results of these labs, they are rather limited. Most of the “products” or PoCs never made it back into the real products. Overall, these labs could be considered a failure. But why? Why is this the wrong data strategy?

What is a Silicon Valley lab?

Let’s first look at what those labs are and why they were founded. Everywhere in the world, there is increased pressure on companies to digitalise themselves. Basically, the way C-Level executives in traditional companies handled that was by looking at (successful) Silicon Valley startups. They did trips to the Valley and found a very cool culture there.

Basically, a lot of those companies were built in garages or old factory buildings, which gives them a very industrial style. Back in good old Europe, the executives decided: “We need to have something very similar”. What they did is: they rented a factory hall somewhere, equipped it with IT and hired the smartest people available on the market to create their new digital products. Their idea was also to keep them (physically) away from the traditional company premises in order to build something new and not look too much at the company itself. A lot of money was burned with this approach. What C-Level executives weren’t told are a few things:

(A) Silicon Valley companies don’t work in garages or factory halls because it is fancy and the way they like it. Often, they simply don’t have the money to rent expensive office space, especially since prices in the Valley are very high. The culture that is typical for the Valley is rather something that arose out of necessity, not out of coolness.

(B) Being remote from the traditional business works best when you develop a completely new product from the ground up. However, most traditional companies still earn most of their money with their core business, and in most cases it will stay like that. A car manufacturer will earn most of its money with cars; digital products then come on top. With this remote type of development, it often proved impossible to integrate the results of the labs into the real products.

So what can executives do to overcome this dilemma of failed PoCs in Data Science projects?

There is no silver bullet available for this challenge. The popular website VentureBeat even claims that 87% of Data Science projects never make it into production. It depends mainly on what should be achieved. When we look at startups, their founders often come from large enterprises and were unhappy with how their business used to work.

I would argue that most large enterprises basically have the innovation power they are seeking, but it is often under-utilised or not utilised at all. One thing is crucial: keep the right balance between distance from and closeness to the legacy products. It is necessary to understand and build on top of the legacy products, but it is also necessary not to get corrupted by them – often, people keep doing their things for years and simply don’t question it.

To successfully change products and services, the best thing is to bring in someone external who doesn’t know the company that well but has the competence to accept its history. This person (or persons) should not be engineers (they are also needed), but rather senior executives with a strong background in digital technologies. Seeing things differently brings new ideas and can bring a company forward in its digitalisation efforts 🙂


In the previous tutorial posts, we looked at the Linear Regression and discussed some basics of statistics, such as the Standard Deviation and the Standard Error. Today, we will look at the Logistic Regression. It is similar in name to the linear regression, but different in usage. Let’s have a look.

The Logistic Regression explained

One of the main differences between the Linear Regression and the Logistic Regression is that the logistic regression is binary – it calculates values between 0 and 1 and thus states whether something is rather true or false. This means that the result of a prediction could be “fail” or “succeed” for a test. In a churn model, this would mean that a customer either stays with the company or leaves the company.

Another key difference to the Linear Regression is that the regression curve can’t be calculated directly. Therefore, in the Logistic Regression, the regression curve is “estimated” and optimised. There is a mathematical method to do this estimation – the “Maximum Likelihood Method”. Normally, these parameters are calculated by Machine Learning tools, so you don’t have to do it by hand.

Another aspect is the concept of “Odds”. Basically, the odds of a certain event happening or not happening are calculated. This could be a certain team winning a soccer game: let’s assume that Team X wins 7 out of 10 games (thus losing 3; we don’t consider a draw). The probability of winning is then 7/10, and the odds are 7:3 on winning or 3:7 on losing.
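Since the parameters are normally estimated by a Machine Learning tool anyway, here is a hedged sketch using scikit-learn (the churn data is invented purely for illustration):

from sklearn.linear_model import LogisticRegression

# invented churn data: [months as customer, support tickets], label: 1 = customer left
X = [[2, 5], [36, 0], [12, 3], [48, 1], [6, 4], [60, 0]]
y = [1, 0, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[10, 2]]))        # predicted class: stay (0) or leave (1)
print(model.predict_proba([[10, 2]]))  # the underlying values between 0 and 1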

This time we won’t calculate the Logistic Regression by hand, since it is way too long. In the next tutorial, I will focus on classifiers such as Random Forest and Naive Bayes.


Python has a really great standard library. In the next two tutorial sessions, we will have a first look at this standard library. We will mainly focus on what is relevant for Spark developers in the long run. Today, we will focus on FuncTools and IterTools in Python; the next tutorial will deal with some mathematical functions. But first, let’s start with “reduce”.

The reduce() function from FuncTools in Python

Basically, the reduce function takes an iterable and cumulatively applies a function to its elements. In most cases, this will be a lambda function, but it could also be a normal function. In our sample, we take some values and create their sum by moving from left to right:

from functools import reduce
values = [1,4,5,3,2]
reduce(lambda x,y: x+y, values)

And we get the expected output:

15

The sorted() function

Another very useful function is the “sorted” function (actually a Python built-in). Basically, it sorts values or pairs of tuples in a list. The easiest way to apply it is to our previous values (which were unsorted!):

print(sorted(values))

The output is now in the expected sorting:

[1, 2, 3, 4, 5]

However, we can go further and sort complex objects as well. sorted() takes a key to sort on, and this is passed as a lambda expression. We state that we want to sort by age. Make sure that you still have the “Person” class from our previous tutorial:

perli = [Person("Mario", "Meir-Huber", 35, 1.0), Person("Helena", "Meir-Huber", 5, 1.0)]
print(perli)
print(sorted(perli, key=lambda p: p.age))

As you can see, our values are now sorted based on the age member.

[Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0), Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0)]
[Person(firstname='Helena', lastname='Meir-Huber', age=5, score=1.0), Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)]

The chain() function

The chain() function is very helpful if you want to hook up two lists containing the same kind of objects. Basically, we take the Person class again and create a new instance. We then chain the two lists together:

import itertools
perstwo = [Person("Some", "Other", 46, 1.0)]
persons = itertools.chain(perli, perstwo)
for pers in persons:
    print(pers.firstname)

Also here, we get the expected output:

Mario
Helena
Some

The groupby() function

Another great feature when working with data is grouping of data. Python also allows us to do so. The groupby() function takes two parameters: the list to group and the key as a lambda expression. We create a new list of tuple pairs and group by the family name:

from itertools import groupby
pl = [("Meir-Huber", "Mario"), ("Meir-Huber", "Helena"), ("Some", "Other")]

for k, v in groupby(pl, lambda p: p[0]):
    print("Family {}".format(k))
    for p in v:
        print("\tFamily member: {}".format(p[1]))

Basically, the groupby() function returns the key (as the value type) and the objects of that group as an iterator, so another iteration is necessary to access the elements in the group. Note that groupby() only groups consecutive elements, so the list should already be sorted by the same key. The output of the above sample looks like this:

Family Meir-Huber
	Family member: Mario
	Family member: Helena
Family Some
	Family member: Other

The repeat() function

A nice function is the repeat() function. Basically, it repeats an element a given number of times. For instance, if we want to repeat our one-person list 4 times, this can be done like this:

lst = itertools.repeat(perstwo, 4)
for p in lst:
    print(p)

And also the output is just as expected:

[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]
[Person(firstname='Some', lastname='Other', age=46, score=1.0)]

The takewhile() and the dropwhile() function in IterTools in Python

Two functions – takewhile and dropwhile – are also very helpful in Python. They are closely related, but their results are the opposite of each other. takewhile takes elements as long as a condition is true and stops at the first element for which it is false; dropwhile drops elements as long as the condition is true and returns everything from the first false element onwards. With the predicate “lower than 20”, takewhile only returns elements as long as they are below 20, while dropwhile removes elements as long as their values are below 20 and returns the rest. The following sample shows this:

vals = range(1,40)
for v in itertools.takewhile(lambda vl: vl<20, vals):
    print(v)
    
print("######")
for v in itertools.dropwhile(lambda vl: vl<20, vals):
    print(v)

And also here, the output is as expected:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
######
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39

As you can see, these are quite helpful functions. In our last Python tutorial, we will have a look at some basic mathematical and statistical functions.


In my previous posts, we had a look at some fundamentals of machine learning and at the linear regression. Today, we will look at another statistical topic: false positives and false negatives. You will come across these terms quite often when working with data, so let’s have a look at them.

The false positive

In statistics, there is one error called the false positive error. This happens when the prediction states something to be true, but in reality it is false. To easily remember the false positive, you can think of it as a false alarm. A simple example is the airport security check: when you pass the security check, you have to walk through a metal detector. If you don’t carry any metal items (since you left them for the X-ray!), no alarm will go off. But in some rather rare cases, the alarm might still go off. Either you forgot something or the metal detector made an error – in this case, a false positive. The metal detector predicted that you have metal items with you, but in fact you don’t.

Another example of a false positive in machine learning would be in image recognition: imagine your algorithm is trained to recognise cats. There are so many cat pictures on the web, so it is easy to train this algorithm. However, if you then feed the algorithm the image of a dog, the algorithm might call it a cat, even though it is a dog. This again is a false positive.

In a business context, your algorithm might predict that a specific customer is going to buy a certain product for sure, but in fact this customer didn’t buy it. Again, here we have our false positive. Now, let’s have a look at the other error: the false negative.

The false negative

The other error in statistics is the false negative. Just like the false positive, it is something that should be avoided. It is very similar to the false positive, just the other way around. Let’s look at the airport example one more time: you wear a metal item (such as a watch) and go through the metal detector. You simply forgot to take off the watch. And – the metal detector doesn’t go off this time. Now, you have a false negative: the metal detector stated that you don’t wear any metal items, but in fact you did. A condition was predicted to be false, but in fact it was true.
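To make the two error types concrete, here is a small sketch (assuming scikit-learn; the labels are made up) that counts false positives and false negatives with a confusion matrix:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 1, 0, 0]  # 1 = carries metal, 0 = does not
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # what the metal detector "predicted"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # 2 false alarms (false positives), 1 missed item (false negative)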

Counting false positives and false negatives is often useful to score the quality of your model. Now that you understand some of the most important basics of statistics, we will have a look at another machine learning algorithm in my next post: the logistic regression.


One thing that everyone who deals with data has to work with is classes that make data accessible to the code as objects. In all cases – and Python isn’t different here – wrapper classes and O/R mappers have to be written. However, Python has a powerful decorator at hand that allows us to ease up our work. This decorator is called “dataclass”.

The dataclass in Python

The nice thing about the dataclass decorator is that it enables us to add a great set of functionality to an object containing data without having to re-write it every time. Basically, this decorator adds the following functionality:

  • __init__: the constructor with all defined member variables. In order to use this, the member variables must be annotated with their types – which is rather uncommon in Python
  • __repr__: pretty-prints the class with all its member variables as a string
  • __eq__: a function to compare two instances for equality
  • order functions: with order=True, several ordering functions are created, such as __lt__ (lower than), __gt__ (greater than), __le__ (lower or equal) and __ge__ (greater or equal)
  • __hash__: adds a hash function to the class
  • frozen: with frozen=True, instances become immutable – attributes can’t be changed at runtime

The definition for a dataclass in Python is easy:

@dataclass
class ClassName:
    # annotated member variables go here
    ...

You can also enable each of the properties described above separately, e.g. with @dataclass(frozen=True), @dataclass(order=True) or a combination of them.
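As a minimal sketch of these options (the Point class and its fields are just illustrative, not from the post), enabling ordering and immutability could look like this:

from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Point:
    x: float
    y: float

p1 = Point(1.0, 2.0)
p2 = Point(1.0, 3.0)
print(p1 < p2)   # True – the generated ordering methods compare the fields in order
# p1.x = 5.0     # would raise FrozenInstanceError, since the instance is immutable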

In the following sample, we will create a Person-Dataclass.

from dataclasses import dataclass
@dataclass
class Person:
    firstname: str
    lastname: str
    age: int
    score: float
        
p = Person("Mario", "Meir-Huber", 35, 1.0)
print(p)

Please note how the member variables are annotated with their types. You can see that there is no need to write a constructor anymore, since this is already done for you. When you print the instance, the __repr__() function is called. The output should look like the following:

Person(firstname='Mario', lastname='Meir-Huber', age=35, score=1.0)

As you can see, the dataclass abstracts a lot of our problems. In the next tutorial we will have a look at IterTools and FuncTools.
