
In my previous posts, I introduced the basics of machine learning. Today, I want to focus on two elementary algorithms: linear and logistic regression. Basically, you learn them at the very beginning of your machine learning journey and may eventually not use them much later on. But they are very helpful for understanding the underlying concepts.

Linear Regression

Linear regression is the simplest model in data science. It is a supervised learning technique and is used in trend analysis, time-series analysis, risk assessment in banking and many more areas.

In a linear regression, the relationship between a dependent variable y and a set of values x is linear. This basically means that if there is data showing a specific trend, a future trend can be predicted. Let's assume that there is a significant relationship between ad spending and sales. We would have the following table:

Year | Ad Spend | Revenue
2013 | € 345.126,00 | € 41.235.645,00
2014 | € 534.678,00 | € 62.354.984,00
2015 | € 754.738,00 | € 82.731.657,00
2016 | € 986.453,00 | € 112.674.539,00
2017 | € 1.348.754,00 | € 156.544.387,00
2018 | € 1.678.943,00 | € 176.543.726,00
2019 | € 2.165.478,00 | € 199.645.326,00

If you look at the data, it is very easy to see that there is some kind of relationship between how much money you spend on ads and the revenue you get. Basically, the ratio ranges from 1:92 to 1:119. Please note that I totally made up the numbers. However, based on these numbers, you could predict what revenue to expect when spending a given amount on ads. The relationship between them is therefore linear and we can easily plot it on a line chart:

Linear Regression

As you can see, some of the values are above the line and others below. Let's now manually calculate the linear function. There are some steps necessary that eventually lead to the predicted values. Let's assume we want to know what revenue to expect for a specific amount spent on ads, say, 1 million. The linear regression function for this is:

predicted score (Y') = bX + intercept (A)

This means that we now need to calculate two values: the slope (our b) and the intercept (our A). X is the only value we know, our 1 million of ad spend. Let's first calculate the slope.

Calculating the Slope

The first thing we need to do is calculate the slope. For this, we need the sum of the squared deviations of X and the sum of the deviation products of X and Y. Let's first start with X, our ad spend. The squared deviation is calculated for each value individually. There are some steps involved:

  • Calculating the average of the ad spend
  • Subtracting each individual value from the average
  • Squaring the result

The first step is to calculate the averages of both columns. The average of the revenues should be € 118.818.609,14 and the average of the ad spend should be € 1.116.310,00.

Next, we calculate the squared deviation of each item. For the ad spend, we do this by subtracting each individual ad spend from the average and squaring the result. The table for this should look like the following:

The formula is: (average ad spend – ad spend) ^ 2

Year | Ad Spend | Squared deviation (X)
2013 | € 345.126,00 | 594.724.761.856,00
2014 | € 534.678,00 | 338.295.783.424,00
2015 | € 754.738,00 | 130.734.311.184,00
2016 | € 986.453,00 | 16.862.840.449,00
2017 | € 1.348.754,00 | 54.030.213.136,00
2018 | € 1.678.943,00 | 316.555.892.689,00
2019 | € 2.165.478,00 | 1.100.753.492.224,00
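As a cross-check, the squared-deviation column can be reproduced in a few lines of Python (the figures are the made-up ones from the table above):

```python
# Ad spend per year, 2013-2019 (made-up figures from the table)
ad_spend = [345126, 534678, 754738, 986453, 1348754, 1678943, 2165478]

mean_x = sum(ad_spend) / len(ad_spend)  # the average: 1.116.310,00

# (average - value)^2 for every year; the sign of the difference
# does not matter because the result is squared
squared_dev = [(mean_x - x) ** 2 for x in ad_spend]

print(mean_x)          # 1116310.0
print(squared_dev[0])  # 594724761856.0, the 2013 row
```

Summing this list gives the X-column total used below.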

Quite huge numbers already, right? Now, let's create the deviation products for the revenues. This is done by taking the deviation of each ad spend from its average and multiplying it by the corresponding deviation of the revenue from its average. This should result in:

Year | Revenue | Deviation product (X, Y)
2013 | € 41.235.645,00 | 59.830.740.619.545,10
2014 | € 62.354.984,00 | 32.841.051.219.090,30
2015 | € 82.731.657,00 | 13.048.031.460.197,10
2016 | € 112.674.539,00 | 797.850.516.541,00
2017 | € 156.544.387,00 | 8.769.130.708.225,71
2018 | € 176.543.726,00 | 32.478.055.672.684,90
2019 | € 199.645.326,00 | 84.800.804.871.574,80

Now, we only need to sum up the two deviation columns. The sums should be:

€ 2.551.957.294.962,00 for the X column
€ 232.565.665.067.859,00 for the XY column

Now, we divide the XY sum by the X sum and get the following slope: 91,1322715
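The same slope can be computed programmatically. Here is a minimal Python sketch of the calculation described above (sum of the deviation products divided by the sum of the squared deviations):

```python
ad_spend = [345126, 534678, 754738, 986453, 1348754, 1678943, 2165478]
revenue = [41235645, 62354984, 82731657, 112674539,
           156544387, 176543726, 199645326]

mean_x = sum(ad_spend) / len(ad_spend)
mean_y = sum(revenue) / len(revenue)

# sum of squared deviations of X (the X column above)
sxx = sum((x - mean_x) ** 2 for x in ad_spend)
# sum of deviation products of X and Y (the XY column above)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, revenue))

slope = sxy / sxx
print(round(slope, 7))  # 91.1322715
```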

Calculating the Intercept

The intercept is somewhat easier. The formula for it is: average(y) – slope * average(x). We already calculated all relevant variables in the previous step. Our intercept should equal € 17.086.743,14.

Predicting the value with the Linear Regression

Now, we can build our function. This is: Y = 91,1322715X + 17.086.743,14

As stated in the beginning, our X should be 1 Million and we want to know our revenue:  € 108.219.014,64
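Putting slope and intercept together, the prediction itself is a one-liner:

```python
slope = 91.1322715        # from the slope calculation above
intercept = 17086743.14   # average(y) - slope * average(x)

def predict_revenue(ad_spend):
    """Apply the fitted line Y' = bX + A."""
    return slope * ad_spend + intercept

print(round(predict_revenue(1_000_000), 2))  # 108219014.64
```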

The prediction is actually lower than the actual revenues of the closest data points (the 2016 and 2017 values). If you change the input, e.g. to 2 million or 400k, it will again get closer. Predictions always produce some error, and the errors are normally reported. In our case, the error table would look like the following:

Year | Ad Spend | Real revenue (Y) | Prediction (Y') | Error
2013 | € 345.126,00 | € 41.235.645,00 | € 48.538.859,48 | -€ 7.303.214,48
2014 | € 534.678,00 | € 62.354.984,00 | € 65.813.163,80 | -€ 3.458.179,80
2015 | € 754.738,00 | € 82.731.657,00 | € 85.867.731,47 | -€ 3.136.074,47
2016 | € 986.453,00 | € 112.674.539,00 | € 106.984.445,76 | € 5.690.093,24
2017 | € 1.348.754,00 | € 156.544.387,00 | € 140.001.758,86 | € 16.542.628,14
2018 | € 1.678.943,00 | € 176.543.726,00 | € 170.092.632,46 | € 6.451.093,54
2019 | € 2.165.478,00 | € 199.645.326,00 | € 214.431.672,17 | -€ 14.786.346,17

The error is calculated by taking the real value and deducting the predicted value from it. And voilà – you have your error. One common task in machine learning is to reduce this error and make predictions more accurate.
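The whole error table can be reproduced the same way; each error is simply the real revenue minus the predicted revenue (small differences in the last cents come from rounding the slope and intercept):

```python
ad_spend = [345126, 534678, 754738, 986453, 1348754, 1678943, 2165478]
revenue = [41235645, 62354984, 82731657, 112674539,
           156544387, 176543726, 199645326]

slope, intercept = 91.1322715, 17086743.14

for year, x, y in zip(range(2013, 2020), ad_spend, revenue):
    prediction = slope * x + intercept
    error = y - prediction  # real value minus predicted value
    print(year, round(prediction, 2), round(error, 2))
```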

One of the frequent statements vendors make is "Agile Analytics". In pitches to business units, they often claim that it would only take them a few weeks to do agile analytics. However, this isn't necessarily true, since they can easily abstract away the hardest part of "agile" analytics: data access, retrieval and preparation. On the one hand, this creates "bad blood" within a company: business units might ask why it takes their internal department so long (and there has most likely been some history to get the emotions going). On the other hand, it is necessary to solve this problem, as agile analytics is still possible – if done right.

In my opinion, several aspects are necessary for agile analytics. First, it is about culture. Second, it is about organisation, and third, it is about technology. Let's start with culture.

Culture

The company must be silo-free. Sounds easy; in fact it is very difficult. Different business units use data as a "weapon", which could easily become thermo-nuclear. If you own the data, you can easily create your own truth. This means that marketing could create their own view of the market in terms of reach, sales could tweak the numbers (until the overall performance is measured by controlling), and so on. So, business units might fight giving away data and will try to keep it in their ownership. However, data should be a company-wide good that is available to all units (of course, on a need-to-know basis and in adherence to legal and regulatory standards). This can only be achieved if the data unit is close to the CEO or another powerful board member. Once this is achieved, it is easier to go for self-service analytics.

Organisation

As with culture, it is necessary to organise yourself for agile analytics. This is more focused on the internal structure of an organisation (e.g. the data unit). There is no silver bullet for this; it very much depends on the overall culture of a company. However, certain aspects have to be fulfilled:

  • BizDevOps: I outlined it in one of my previous posts, and I insist that this approach is necessary for many things around data. One of them is agile analytics, since the handover of tasks is always complicated. End-to-end responsibility is crucial for agile analytics.
  • Data Governance: There is no way around it; either do it or forget about anything close to agile analytics. It is necessary to have security and privacy under control and to allow users to access data easily but securely. It is also very important to log what is going on (SOX!).
  • Self-Service Tools: Have tools available that enable data access without complex processes. I will write about this under "Technology".

Technology

Last but not least, agile analytics is delivered via technology. Technology is just an enabler, so if you don't get the previous two right, you will most likely fail here, even if you invest millions into it. You will need different tools that handle security and privacy, but also a clear and easy-to-use metadata repository (let's face it: a data catalog!). You also need tools that allow easy access to data via a data science workbench, a fully functional data lake and a data abstraction layer. That sounds like quite a lot – and it is. The good news, though, is that most of that comes for free, as most of these are open-source tools. At some point, you might need an enterprise license, but cost-wise it is still manageable. And remember one thing: technology comes last. If you don't fix culture and organisation, you won't be able to deliver.

Agility is almost everywhere, and it is also starting to reach other hyped domains, such as Data Science. One thing I like in this respect is the combination with DevOps, as this streamlines the process and creates end-to-end responsibility. However, I strongly believe that it doesn't make much sense to exclude the business. In the case of analytics, I would argue that it is BizDevOps.

Basically, Data Science needs a lot of business integration and works across different domains and functions. I have outlined several times in different posts here that Data Science isn't a job done by Data Scientists alone. It is rather teamwork and thus needs different people. With the concept of BizDevOps, this can be easily explained; let's have a look at the following picture, and I will outline the interdependencies afterwards:

BizDevOps for Data Science

Basically, there must be exactly one person who takes the end-to-end responsibility, ranging from business alignment to translation into an algorithm and finally to making it productive by operating it. This is the typical workflow for BizDevOps. This person taking the end-to-end responsibility is typically a project or program manager working in the data domain. The three steps are outlined in the figure above; let's now have a look at each of them.

Biz

The program manager for data (you could also call this person the "Analytics Translator") works closely with the business – either marketing, fraud, risk, shop floor, … – on gathering their business requirements and needs. This person also has a great understanding of what is feasible with the internal data, in order to be capable of "translating a business problem into an algorithm". At this stage, it is mainly about the use case and not so much about tools and technologies; those come in the next step. Until here, Data Scientists aren't necessarily involved yet.

Dev

In this phase, it is all about implementing the algorithm and working with the data. The program manager mentioned above has already aligned with the business and written a detailed description. Data Scientists and Data Engineers are now brought in. Data Engineers start to prepare and fetch the data. They also work with Data Scientists on finding and retrieving the answer to the business question. There are several iterations and feedback loops back to the business as more and more answers arrive. Anyway, this process should only take a few weeks, ideally 3-6. Once the results are satisfying, it goes over to the next phase: bringing it into operation.

Ops

This phase is about operating the algorithms that were developed. Basically, the data engineer is in charge of integrating them into the live systems. The business unit typically wants to see the result as a (continuously) calculated KPI or any other output that could create some sort of impact. Continuous improvement of the models also happens here, since the business might come up with new ideas. In this phase, the data scientist isn't involved anymore; it is the data engineer or a dedicated DevOps engineer alongside the program manager.

Eventually, once the project is done (I dislike “done” because in my opinion a project is never done), this entire process moves into a CI process.

To get the most out of your data strategy in an enterprise, it is necessary to cluster the different user types that might arise. All of them are users of data, but with different needs and demands. In my opinion, they differ in their levels of expertise. Basically, I see three different user types:

Three degrees of Data Access

Basically, the user types are differentiated by how they use data and by the number of users in each group. Let's first start with the lower part of the pyramid: the business users.

Business Users

The first layer is the business users. These are basically users that need data for their daily decisions but are rather consumers of the data. These people look at different reports to make decisions on their business topics. They could be in Marketing, Sales or Technology, depending on the company itself. Basically, these users would use pre-defined reports, but in the long run would rather go for customized reports. One great enabler for that is self-service BI. These users are experienced in interpreting data for their business goals and in asking questions of their data. This could be about reviewing the performance of a campaign, weekly or monthly sales reports, and so on. They create a huge load on the underlying systems without understanding the implementation and complexity underneath – and they don't have to. From time to time, they start digging deeper into their data and thus become power users, our next level.

Power Users

Power users often emerge from business users. This is typically a person close to the business who understands the needs and processes around it. However, they also have a great technical understanding (or gained it in the process of becoming power users). They have some level of SQL know-how or know the basics of other scripting tools. They often work with the business users (even in the same department) on solving business questions. They also work closely with data engineers on accessing data sources and integrating new ones. In addition, they use self-service analytics tools to do a basic level of data science. However, they aren't data scientists, though they might develop in that direction if they invest significant time. This brings us to the next level: the data scientists.

Data Scientists

This is the top level of our pyramid. People working as data scientists aren't the majority; business users and power users are far more numerous. However, they work on more challenging topics than the previous two. They also work closely with power users and business users. They might still be in the same department, but not necessarily. They work with advanced tools such as R and Python and fine-tune the models the power users built with self-service analytics tools, or translate the business questions raised by the business users into algorithms.

Often, these three groups develop in different directions. However, it is necessary that all of them work together, as a team, in order to make data projects a success.

A current trend in AI is not so much a technical one; it is rather a societal one. Technologies around AI, machine learning and deep learning are getting more and more complex, thus making it even harder for humans to understand what is happening and why a prediction is being made. The current approach of "throwing data in, getting a prediction out" does not necessarily work for that. It is somewhat dangerous to build knowledge and make decisions based on algorithms that we don't understand.

Explainable AI is getting even more important with new developments in the AI space such as AutoML, where the system takes over most of the data scientist's work. It needs to be ensured that everyone understands what's going on with the algorithms and why a prediction happens exactly the way it does. So far (and without AutoML), data scientists were basically in charge of the algorithms, so at least there was someone who could explain an algorithm (note: this didn't prevent bias in it, nor will AutoML). With AutoML, where tuning and algorithm selection are done more or less automatically, we need to ensure that vital and relevant documentation of the predictions is available.

And one last note: this isn't an argument against AutoML and the tools that provide it – I believe that the democratisation of AI is an absolute must and a good thing. However, we need to ensure that it stays explainable!

Data itself, and data science especially, is one of the drivers of digitalization. Many companies have experimented with data science over the last years and gained significant insights and learnings from it. Often, people dealing with statistics started to do this magic thing called data science. But technical units also used machine learning and the like to further improve their businesses. However, for many other units within traditional companies, all of this seems like magic – and dangerous. So how do you include those not dealing with the topic in detail and thus de-mystify it?

First of all, machine learning and data science aren't the revolution – units started implementing them in order to gain new insights and improve their business results. However, they are often also acquired via business projects from consulting companies. The newer and more complex a topic is, the higher the risk that people will reject it – out of fear, misunderstanding or lack of understanding. When you are deep in the topic of data and data science, you might be treated with fame by some – those who think you are a magician – and rejected by others. Both are poisonous in my opinion. The first group will try to get very close to you and expects a lot. However, you are often not capable of meeting their expectations. After a while, they get frustrated by far too high expectations. In corporate environments, it is very important to identify this group at the very beginning and clearly state what they can expect and what not. It is also important to state what they won't get – saying "no" is very important with them as well. Being transparent with this group is essential in order to keep them close supporters in a growing environment. You will depend a lot on these people if you want to succeed. So be clear with them.

The other group – which in digitalization I would say is the bigger one – is the group that will meet you with fears and doubts. This group is by far the larger one, and it is highly important that you cover them well. You can easily recognize people in this group by their not being open towards your topics. Some will probably actively refuse it; others might be less active and just poison the climate. But be aware: they usually don't do it because they hate you. They are just acting human and are either afraid, feel that they are not included, or have other doubts about you and your unit. It is essential to work on a communication strategy for this group and pro-actively include them. Bringing clarity and de-mystifying your topic in easy terms is vital. It is important that you draw a lot of comparisons to your traditional business and keep it simple. Once you have gained their trust and interest, you can go much deeper into your topic and provide learning paths and skill development for these people. If you succeed in that, you have created strong supporters who will come up with great ideas to improve your business even further. Keep in mind: just because you are in a "hot topic" like big data and data science and might be treated like a rock star by some, others are also great at what they do, and it all boils down to: we are just humans.

Digitalization needs trust to succeed. If you fail to deliver trust and don't include the human aspect, your digitalization and data strategy is destined to fail, independent of the budget and C-level support you might have for your initiative. So, make sure to work on that – with high focus!

In my last post of this series, I explained the concept of supervised, unsupervised and semi-supervised machine learning. In this post, we will go a bit deeper into machine learning (but don’t worry, it won’t be that deep yet!) and look at more concrete topics. But first of all, we have to define some terms, which basically derive from statistics or mathematics. These are:

  • Features
  • Labels

Features are known values, which are often used to calculate results. These are the variables that have an impact on a prediction. If we talk about manufacturing, we might want to reduce junk in our production line. Known features of a machine could then be: temperature, humidity, operator, time since last service. Based on these features, we can later calculate the quality of the machine output.

Labels are the values we want to predict. In training data, labels are mostly known, but for new data they are not. In the machine data example from above, the label would be the quality. All of the features together account for a good or bad quality, and algorithms can then calculate the quality based on them.
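To make this concrete, here is a minimal sketch of one training observation from the machine example; the feature names and values are made up for illustration:

```python
# One training observation: the features are the known inputs,
# the label is the value we later want to predict
observation = {
    "features": {
        "temperature_c": 74.2,          # machine temperature
        "humidity_pct": 41.0,           # humidity in the plant
        "operator": "shift_b",          # who ran the machine
        "hours_since_service": 118,     # time since last service
    },
    "label": "good",  # output quality; known only in training data
}

print(list(observation["features"]))
print(observation["label"])  # good
```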

Let’s now go on another “classification” of machine learning techniques. We “cluster” them by supervised/unsupervised.

The first one is clustering. Clustering is an unsupervised technique. With clustering, the algorithm tries to find patterns in data sets without labels associated with them. This could be a clustering of the buying behaviour of customers. Features for this would be household income, age, and so on, and clusters of different consumer types could then be built.
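As a toy illustration of the idea (not a production algorithm), the following sketch runs a tiny two-cluster k-means on made-up household incomes; no labels are given, and the two groups emerge from the data alone:

```python
# Made-up household incomes; the goal is to find groups without labels
incomes = [21000, 23500, 25000, 80000, 85000, 91000]

centroids = [min(incomes), max(incomes)]  # naive initialisation
for _ in range(10):                       # a few refinement rounds
    clusters = [[], []]
    for value in incomes:
        # assign each value to the nearest centroid
        nearest = min((0, 1), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    # move each centroid to the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(clusters)  # [[21000, 23500, 25000], [80000, 85000, 91000]]
```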

The next one is classification. In contrast to clustering, classification is a supervised technique. Classification algorithms look at existing data and predict which class new data belongs to. Classification has been used for spam detection for years now, and these algorithms are more or less mature in classifying something as spam or not. With machine data, it could be used to predict material quality from several known parameters (e.g. humidity, strength, colour, …). The output of the material prediction would then be the quality type (either "good" or "bad", or a number in a defined range like 1-10). Another well-known example is whether someone would have survived the Titanic: the classification is "true" or "false" and the input parameters are age, sex and class. If you were 55, male and in 3rd class, chances were low; but if you were 12, female and in first class, chances were rather high.
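The Titanic example can be sketched as a hand-written decision rule; a real classification algorithm would learn such rules from the training data instead of having them hard-coded:

```python
def would_survive(age, sex, passenger_class):
    """Toy stand-in for a trained classifier: predicts True/False
    from the features age, sex and passenger class."""
    if sex == "female" and passenger_class in (1, 2):
        return True
    if age < 15 and passenger_class in (1, 2):
        return True
    return False

print(would_survive(55, "male", 3))    # False, chances are low
print(would_survive(12, "female", 1))  # True, chances are rather high
```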

The last technique for this post is regression. Regression is often confused with classification, but it is different from it: with a regression, no class labels (such as good or bad, spam or not spam, …) are predicted. Instead, regression outputs continuous, often unbounded, numbers. This makes it useful for financial predictions and the like. A commonly known example is the prediction of housing prices, where several values (features!) are known, such as distance to specific landmarks, plot size, and so on. The algorithm can then predict a price for your house and the amount you can sell it for.

In my next post, I will talk about different algorithms that can be used for such problems.

Hi,

I am happy to share this exciting conference I am keynoting at. Also, Mike Olson from Cloudera will deliver a keynote at the conference.

About the conference:

June 12th – 13th 2017 | Salzburg, Austria | www.idsc.at

The 1st International Data Science Conference (iDSC 2017) organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH seeks to establish a key Data Science event, providing a forum for an international exchange on Data Science technologies and applications.

The International Data Science Conference gives the participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. Besides the two parallel tracks, the Research Track and the Industry Track, on the second day a Symposium is taking place presenting the outcomes of a European Project on Text and Data Mining (TDM). These events are open to all participants.

Also we are proud to announce keynote presentations from Mike Olson (Chief Strategy Officer Cloudera), Ralf Klinkenberg (General Manager RapidMiner), Euro Beinat (Data-Science Professor and Managing Director CS Research), Mario Meir-Huber (Big Data Architect Microsoft). These keynotes will be distributed over both conference days, providing times for all participants to come together and share views on challenges and trends in Data Science.

The Research Track offers a series of short presentations from Data Science researchers on their own, current papers. On both conference days, we are planning a morning and an afternoon session presenting the results of innovative research into data mining, machine learning, data management and the entire spectrum of Data Science.

The Industry Track showcases real practitioners of data-driven business and how they use Data Science to help achieve organizational goals. Though not restricted to these topics only, the industry talks will concentrate on our broad focus areas of manufacturing, retail and social good. Users of data technologies can meet with peers and exchange ideas and solutions to the practical challenges of data-driven business.

Furthermore, the Symposium is organized in collaboration with the FutureTDM Consortium. FutureTDM is a European project which, over the last two years, has been identifying the legal and technical barriers, as well as the skills stakeholders/practitioners lack, that inhibit the uptake of text and data mining (TDM) for researchers and innovative businesses. The Symposium will focus on the recommendations and guidelines identified and proposed to counterbalance these barriers, so as to ensure broader TDM uptake and thus boost Europe's research and innovation capacities.

Our sponsors (Cloudera, F&F and um etc.) will have their own special platform: half-day workshops to provide hands-on interaction with tools or to learn approaches to developing concrete solutions. In addition, there will be an exhibition of the sponsors' products and services throughout the conference, with the opportunity for participants to seek contact and advice.

The iDSC 2017 is therefore a unique meeting place for researchers, business managers, and data scientists to discover novel approaches and to share solutions to the challenges of a data-driven world.

Apache Mahout is the service on Hadoop that is in charge of what is often called "data science". Mahout is all about learning algorithms, pattern recognition and the like. An interesting fact about Mahout is that, under the hood, MapReduce was replaced by Spark.

Mahout is in charge of the following tasks:

  • Machine Learning. Learning from existing data and making predictions for new data.
  • Recommendation Mining. This is what we often see on websites. Remember the "You bought X, you might be interested in Y"? This is exactly what Mahout can do for you.
  • Clustering. Mahout can cluster documents and data that have some similarities.
  • Classification. Learning from existing classifications.

A Mahout program is written in Java. The next listing shows how a recommender is built and evaluated:

// load the ratings data from a file
DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));

// evaluator that scores the recommender by the average absolute
// difference between predicted and real ratings
RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

// our own builder that creates the recommender to be evaluated
RecommenderBuilder builder = new MyRecommenderBuilder();

// use 90% of the data for training, evaluate with all of the rest
double result = eval.evaluate(builder, null, model, 0.9, 1.0);
System.out.println(result);

Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.

Today’s focus: Big Data in Manufacturing.

Manufacturing is a traditional industry relevant to almost any country in the world. It started to emerge in the industrial revolution, when machines took over and production became more and more automated. Big Data has the possibility to substantially change the manufacturing industry again – with various opportunities.

Manufacturers can utilize Big Data for various reasons. First, it is all about quality. When we look at production chains, be it producing a car or just some metal works, quality is key. Who wants to buy a car that is broken? Exactly: nobody. Improving quality is a key aspect of Big Data for manufacturers, and it involves several steps. First of all, it is necessary to collect data about the production line(s) and all devices that are connected or connectable. When errors occur or a product isn't as desired, the production data can be analyzed and reviewed. Data scientists basically do a great job at that. Real-time analytics allow the company to further improve material quality and product quality. This can be done by analyzing images of products or materials and removing them from the production line in case they don't fulfill certain standards.

A key challenge in manufacturing today is the high degree of product customization. When buying a new car, the words of Henry Ford (you can have any type of the T-model as long as it is black) are no longer true. When customers order whatever type of product, they expect their own personality to be reflected by the product. If a company fails to deliver that, it risks losing customers. But what is the connection with Big Data? Well, this customization is a strong shift towards Industry 4.0, which is heavily promoted by German industry. In order to make products customizable, it is necessary to have an automated production line and to know what customers might want – by analyzing recent sales and trends from social networks and the like.

Changing the output of a production line is often difficult and inefficient. Big Data analytics allows manufacturers to better understand future demand and to reduce production peaks. This enables the manufacturer to plan better and act in the market – and become more efficient.