Machine Learning 101 – Clustering, Regression and Classification


In my last post of this series, I explained the concept of supervised, unsupervised and semi-supervised machine learning. In this post, we will go a bit deeper into machine learning (but don’t worry, it won’t be that deep yet!) and look at more concrete topics. But first of all, we have to define two terms that derive from statistics and mathematics: features and labels. Features are known values that are used to calculate results. These are the variables that have an impact on a prediction. If we talk about manufacturing, we might want to reduce scrap in our production line. Known features from a machine could then be: temperature, humidity, operator, time since last service. Based on these features, we can later calculate the quality of the machine output. Labels are the values we want to predict. In training data, labels are mostly known, but for the prediction they are not known. When we
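
The distinction between features and labels can be sketched in a few lines of Java. This is only an illustration under assumptions: the class `MachineReading`, its field names and the sample values are hypothetical, and the label is modelled as a simple "output is OK" flag, which is one possible choice rather than anything the manufacturing example prescribes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: MachineReading, its fields and the sample values
// are hypothetical and not taken from a real production data set.
class MachineReading {
    // Features: known values that influence the prediction.
    final double temperatureCelsius;
    final double humidityPercent;
    final String operator;
    final double hoursSinceLastService;

    // Label: the value to predict. Known in training data,
    // null for new readings where it still has to be predicted.
    final Boolean outputOk;

    MachineReading(double temperatureCelsius, double humidityPercent,
                   String operator, double hoursSinceLastService, Boolean outputOk) {
        this.temperatureCelsius = temperatureCelsius;
        this.humidityPercent = humidityPercent;
        this.operator = operator;
        this.hoursSinceLastService = hoursSinceLastService;
        this.outputOk = outputOk;
    }

    boolean isLabeled() {
        return outputOk != null;
    }

    // Keeps only the readings that carry a label and can be used for training.
    static List<MachineReading> trainingRows(List<MachineReading> all) {
        List<MachineReading> labeled = new ArrayList<>();
        for (MachineReading reading : all) {
            if (reading.isLabeled()) {
                labeled.add(reading);
            }
        }
        return labeled;
    }

    public static void main(String[] args) {
        List<MachineReading> readings = List.of(
            new MachineReading(71.5, 40.0, "operator-a", 120.0, Boolean.TRUE),
            new MachineReading(88.0, 65.0, "operator-b", 910.0, Boolean.FALSE),
            new MachineReading(75.2, 42.0, "operator-a", 130.0, null));
        System.out.println("Rows usable for training: " + trainingRows(readings).size());
    }
}
```

The third reading has features but no label, which is the situation at prediction time: the features are known and the label is what a trained model would have to supply.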


International Data Science Conference, Salzburg


Hi, I am happy to share this exciting conference I am keynoting at. Also, Mike Olson from Cloudera will deliver a keynote at the conference. About the conference: June 12th – 13th 2017 | Salzburg, Austria | www.idsc.at The 1st International Data Science Conference (iDSC 2017) organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH seeks to establish a key Data Science event, providing a forum for an international exchange on Data Science technologies and applications. The International Data Science Conference gives the participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. Besides the two parallel tracks, the Research Track and the Industry Track, on the second day a Symposium is taking place presenting the outcomes of a European Project on Text and Data Mining (TDM). These events are open to all participants. Also we are proud to announce keynote presentations from Mike Olson (Chief Strategy Officer


Hadoop Tutorial – Data Science with Apache Mahout


Apache Mahout is the service on Hadoop that is in charge of what is often called “data science”. Mahout is all about learning algorithms, pattern recognition and the like. An interesting fact about Mahout is that under the hood MapReduce was replaced by Spark. Mahout is in charge of the following tasks: Machine learning: learning from existing data. Recommendation mining: this is what we often see on websites. Remember the “You bought X, you might be interested in Y”? This is exactly what Mahout can do for you. Clustering: Mahout can cluster documents and data that have some similarities. Classification: learning from existing classifications. A Mahout program is written in Java. The next listing shows how the recommendation builder works.

    // Load the preference data from a file.
    DataModel model = new FileDataModel(new File("/home/var/mydata.xml"));
    RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();
    // MyRecommenderBuilder is the author's own RecommenderBuilder implementation.
    RecommenderBuilder builder = new MyRecommenderBuilder();
    // Train on 90% of the data and evaluate on 100% of the users.
    double res = eval.evaluate(builder, null, model, 0.9, 1.0);
    System.out.println(res);

A Mahout program
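
The number printed by the listing is the score of `AverageAbsoluteDifferenceRecommenderEvaluator`: the average absolute difference between predicted and actual preference values, where lower means better. The following plain-Java sketch shows only that calculation. It does not use Mahout, and the class name, method name and sample values are hypothetical.

```java
// Plain-Java sketch of the metric only, independent of Mahout.
// The class name, method name and sample values are hypothetical.
class AverageAbsoluteDifference {

    // Mean of |actual - predicted| over all pairs; 0.0 means perfect predictions.
    static double compute(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        double[] actual = {4.0, 3.0, 5.0};     // preferences the users really gave
        double[] predicted = {3.5, 4.0, 5.0};  // preferences a recommender estimated
        System.out.println(compute(actual, predicted));
    }
}
```

With ratings on a scale from 1 to 5, a score of 1.0 would mean the recommender is off by one rating point on average.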

Big Data in Manufacturing


Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries. Today’s focus: Big Data in Manufacturing. Manufacturing is a traditional industry relevant to almost any country in the world. It started to emerge in the industrial revolution, when machines took over and production became more and more automated. Big Data has the potential to substantially change the manufacturing industry again, and it brings various opportunities. Manufacturers can utilize Big Data for various reasons. First, it is all about quality. When we look at production chains, be it producing a car or just some metal works, quality is key. Who wants to buy a car that is broken? Exactly, nobody. Improving quality is a key aspect of Big Data for manufacturers. With Big Data, this involves several aspects. First of all, it is necessary to collect data about the production line(s)


Big Data 101: Partitioning


Partitioning is another factor for Big Data Applications. It is one of the factors of the CAP-Theorem (see 1.6.1) and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data Applications, it is often not possible to store everything on one server (Josuttis, 2011). The factors for partitioning illustrated in the Figure: Partitioning are described by Rys (2011). Functional partitioning basically describes the service-oriented architecture (SOA) approach (Josuttis, 2011). With SOA, different functions are provided by their own services. If we talk about a Web shop such as Amazon, there are a lot of different services involved. Some services handle the order workflow; other services handle the search and so on. If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to losing money. Building a service-oriented


Big Data 101: Scalability


Scalability is another factor of Big Data Applications described by Rys (2011). Whenever we talk about Big Data, it mainly involves high-scaling systems. Each Big Data Application should be built in a way that eases scaling. Rys (2011) describes several needs for scaling: user load scalability, data load scalability, computational scalability and scale agility. The figure illustrates the different needs for scalability in Big Data environments as described by Rys (2011). Many applications such as Facebook (Fowler, 2012) have a lot of users. Applications should support the large user base and should remain resilient to errors when the application sees unexpectedly high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often, but not only, comes with a high number of users is the data load. Rys (2011) describes that some or many users can produce this data. However, things such as sensors and other devices that


Big Data 101: Data agility


Agility is an important factor for Big Data Applications. Rys (2011) describes 3 different agility factors: model agility, operational agility and programming agility. Model agility means how easy it is to change the data model. Traditionally, in SQL systems it is rather hard to change a schema. Other systems such as non-relational databases allow easy changes to the database. If we look at key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases in fast-changing systems such as social media applications, online shops and others require model agility. Updates to such systems occur frequently, often weekly to daily (Paul, 2012). In distributed environments, it is often necessary to change operational aspects of a system. New servers get added often, also with different characteristics such as operating system and hardware. Database systems should stay tolerant to operational changes, as this is a crucial factor for growth. Database Systems should support


Are you a Data Scientist or what is necessary to become one?


Working in Big Data is considered to be the job you simply have to go for. Some call it sexy, some call it the best job of the future. But what exactly is a Data Scientist? Is it someone you can simply hire from university or is it more complicated? Definitely the latter. When we think about a Data Scientist, we often say that the perfect Data Scientist is a kind of hybrid between a Statistician and a Computer Scientist. I think this needs to be redefined, since much more knowledge is necessary. A Data Scientist should also be good at analysing business cases and at talking to line executives to understand the problem and model an ideal solution. Furthermore, extensive knowledge of current (international) law is necessary. In a recent study we did, we defined 5 major challenges. The 5 topics are: Big Data Business Developer: The person needs to know what questions to ask, how
