Another year has passed, and 2018 has been a really great year for the cloud. We can now say that the cloud has become a commodity and common sense. After years of arguing about why the cloud is useful, this discussion is now gone. Nobody doubts the benefits of the cloud anymore. In the coming year, most developments that already started in 2018 will continue and intensify. My predictions for 2019 won’t be revolutionary; rather, they describe the trends we will see over the short period of this year. So, here are my 5 predictions for 2019:

1. Strong growth in the cloud will continue, but it won’t be hyper-growth anymore

In past years, companies such as Amazon or Microsoft saw significant growth rates in their cloud business. These numbers will still go up at large, double-digit growth rates for all major cloud providers (not just those two). However, overall growth will be slower than in previous years as the market matures. To win market share, it will no longer be enough to simply offer cloud products; providers will also need a significant market presence and a sales force available in all markets to win against the competition. Also, more companies are now looking for a dual-vendor cloud strategy in order to overcome potential vendor lock-in. To make this easier for customers, cloud companies will offer more open source products in the cloud, which gives them additional arguments against the vendor lock-in concern.

2. PaaS Solutions will now see significant uptake, driven by containerisation

Containerised services have been around for some years now, and services such as AWS Lambda or Azure Functions are really great solutions for building web-based services. However, many (traditional) companies still struggle with this approach and with how to use such services. As software engineers and architects experiment with these kinds of services, they will bring them back into enterprise environments. Therefore, significant growth in these services will happen over the next months. This will partially eat into the share of pure IaaS services, but it is the path along which the cloud’s service stack matures as well.
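
To give you an idea of how lightweight this is, here is a minimal sketch of an AWS Lambda function in Python (the handler name and the greeting logic are made up for illustration); the response format shown is the one an API Gateway proxy integration expects:

```python
import json

def handler(event, context):
    # "event" carries the request data; with an API Gateway proxy
    # integration, query parameters arrive under "queryStringParameters".
    name = (event.get("queryStringParameters") or {}).get("name", "world")

    # Return a response in the shape API Gateway expects:
    # status code, headers and a JSON-encoded body.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

No servers, no scaling logic: you deploy just this function and the platform takes care of the rest.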

3. The number of domain-specific SaaS solutions will grow significantly

Just like software products in the past emerged on the Windows platform, or apps emerged on the iPhone and Android, new SaaS solutions have emerged in the cloud over the last years. This trend will now speed up, with more domain-specific platforms (e.g. for finance, marketing and the like) becoming more popular and widely accepted. Business functions in traditional companies don’t question the cloud the way IT departments do, so these solutions will see fast growth. IT departments have to accept this development and shouldn’t block it.

4. Cloud will become local

All major cloud providers are continuing to invest significantly in data centers around the world. This trend won’t stop in 2019. The first data centers were built in the US; later on, we saw data centers in APAC and Europe. Until some years ago, all major cloud providers had 2 data centers in Europe, which were simply called “North Europe”, “West Europe” or the like. In the last 1 or 2 years, this changed: data centers are now dedicated to specific markets (but still usable by others). These markets now include Germany, the UK, France, Sweden, Italy and Switzerland. However, many more data centers will emerge, also covering smaller markets as maturity grows. The first data centers have opened in Africa, which is a very interesting market with huge but still underestimated potential. As for Europe, I see the CEE and SEE markets not covered well and would expect a dedicated CEE or SEE data center to open in the next 1-2 years.

5. Google will catch up fast in the cloud, mainly driven by its strength in the AI space

When it comes to the cloud, the #3 in the market is definitely Google. They entered the market somewhat later than AWS or Microsoft did. However, they offer a very interesting portfolio and competitive pricing. A key strength Google has is its AI and analytics services, as the company itself is very data-driven. Google knows how to handle and analyse data much better than the other two do, so it is very likely that Google will use this advantage to gain share from its competitors. I am excited about the next Google I/O and what will be shown there in terms of analytics and AI.
These are just my ideas about the Cloud and what we will see in the next year. What is your opinion? Where do you agree or disagree? Looking forward to your comments!

Now you probably think: is Mario crazy? In fact, in this post, I will explain why the cloud is not the future.
First, let’s have a look at the economic facts of the cloud. If we look at the share prices of companies providing cloud services, it is rather easy to say: those shares are skyrocketing! (Leaving aside recent drops in some shares, which reflect market dynamics rather than real valuations.) The same holds for overall company performance: the income of companies providing cloud services has increased a lot. Have a look at the major cloud providers such as AWS, Google, Oracle or Microsoft: they now make quite a lot of their revenue with cloud services. So, obviously, my initial statement seems to be wrong here. So why did I choose this title? Still crazy?
Let’s look at another possible explanation: it might be all about technology, right? I was recently playing with AWS API Gateway and AWS Lambda. Wow, how easy it is to write a great API! I could program an API for an Android app in a few hours, and deployment was easy. Remember back when you first had to deploy your full stack for this? Make sure to have all libraries set up and the like? Another example: data analytics. Currently, much of this is moving from “classical” Hadoop-backed HDFS to decoupled architectures (object stores as the “data lake” and Spark for compute/analytics). This also clearly favours the cloud, because both layers can be scaled individually and utilisation is easier to handle. When you need more compute power, you spin up new instances and disconnect them again when you are done. This simply can’t be done on-prem or in a private cloud, since the available capacity is sized to match fixed corporate requirements.
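To illustrate the decoupled architecture: the object store acts as the data lake, and a Spark cluster exists only for the duration of the analysis. A minimal sketch (the bucket, path and column names are hypothetical, and reading from S3 assumes the hadoop-aws connector is configured):

```python
from pyspark.sql import SparkSession

# Spin up a Spark session; in the cloud, this cluster exists only
# for the duration of the analysis and is terminated afterwards.
spark = SparkSession.builder.appName("objectstore-analytics").getOrCreate()

# Read data directly from the object store ("data lake");
# the s3a:// bucket and path are hypothetical.
sales = spark.read.parquet("s3a://my-data-lake/sales/2018/")

# Run the analysis: revenue per country.
sales.groupBy("country").sum("revenue").show()
```

Storage stays put and cheap; compute comes and goes with demand. That is exactly the elasticity an on-prem cluster can’t offer.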
But what else? Let’s look at how new applications or services are developed. Nowadays, almost every service is developed “cloud first”, which means it isn’t available without the cloud, or at least it only becomes available at a very late stage or with substantial delay. So if you want to stay ahead in innovation, it is necessary to embrace the cloud here. And please don’t tell me that you would rather wait because it isn’t necessary to be among the first to move. Answer: of course it is fine to wait until your business is dead ;).
So there are no real arguments against the cloud. Why, then, did I formulate the title like this? Provocation? Clickbaiting? No: the cloud is not the future, it is the present!

Data Science Conference 4.0 is taking place in Belgrade again soon this year, and I am happy to be one of the keynote speakers there! The program will be fantastic, and I am looking forward to seeing as many of you as possible there :). From the organisers:
Data Science Conference / 4.0 (DSC 4.0) will be held on 18 and 19 September at the Hyatt Regency Belgrade hotel and is organized by the Institute for Contemporary Sciences (ISN). During these two days, Belgrade will be the epicenter of data science, and guests will have an opportunity to hear 62 talks and 4 discussion groups, as well as take part in exclusive content such as workshops and more than 12 technical tutorials. More than 1,000 attendees are expected from over 25 countries around the world, which makes DSC 4.0 one of the three biggest data science conferences in Europe.
 
Hadley Wickham, Chief Scientist at RStudio and an Adjunct Professor at Stanford University, will open the Conference. He is a prominent and active member of the R user community and has developed several notable and widely used packages (known as the ‘tidyverse’). Additionally, many notable speakers from around Europe and the US will take the stage, such as Mario Meir-Huber (A1 Telekom Austria Group), Vanja Paunic (Microsoft London) and Miha Pelko (BMW Group). Moreover, a great number of women will speak this year, which is noteworthy as one of the goals of the Conference is to empower women in this field. Dorothea Wiesmann, who works at IBM Research Zurich, is one of the 6 keynote speakers.
 
The program of the conference will be divided into 4 parallel tracks and will cover a wide range of topics, from Artificial Intelligence, Machine Learning, Data Monetisation and Data Science Education to Big Data and Engineering and more! Attendees will be able to choose the talk level that is most suitable for their background – beginner, intermediate or advanced – as well as choose between technical, business and academic talk types, all marked in the schedule. Additionally, there will be an UnConference in parallel to the talks, where any attendee will have a chance to hold a small presentation on a topic or an idea in data science and have a discussion with other guests of the Conference.
 
The program includes workshops as well, whose goal is to give future data scientists experience with the problems that companies face day to day. They are free of charge and applications are still open. Furthermore, attendees will have an opportunity to hear more than 130 hours of technical tutorials on the days leading up to the Conference (15-17.9.), on topics such as basic and advanced visualisation using Tableau, using Amazon Web Services, the basics of Artificial Intelligence, Machine Learning in Python and R, and many more.
Additional information can be found on the Data Science Conference / 4.0 official website, where you can also book a ticket, or you can contact dsc.info@isn.rs with any questions you might have. One more thing: you can take a look at last year’s aftermovie here.

Every company (or at least most companies) today talks about digital transformation and treats data as a main asset for it. The question is where to store this data. In a traditional database? In a DWH? Ever heard about the datalake?

What is the datalake?

I think we should take a step back to answer this question. First of all, a datalake is not a single piece of software. It consists of a large variety of platforms: Hadoop is a central one, but not the only one – a datalake also includes tools such as Spark, Kafka and many more, as well as relational databases such as PostgreSQL. If we look at how truly digital companies such as Facebook, Google or Amazon solve these problems, the technology stack is also clear; in fact, they heavily contribute to and use Hadoop and similar technologies. So the answer seems clear: you don’t need overly expensive DWHs anymore.
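
As a small illustration of how these tools play together, here is a minimal PySpark sketch (the topic, broker and path names are hypothetical, and it assumes the Spark-Kafka connector package is available) that ingests a Kafka stream into the datalake:

```python
from pyspark.sql import SparkSession

# Build a Spark session; in a real datalake this would run on a cluster.
spark = SparkSession.builder.appName("datalake-ingest").getOrCreate()

# Read a stream of events from Kafka (the topic name "events" is hypothetical).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Persist the raw events to the datalake as Parquet files
# (the HDFS paths are hypothetical).
query = (events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///datalake/raw/events")
         .option("checkpointLocation", "hdfs:///datalake/checkpoints/events")
         .start())
```

Kafka handles the ingest, Spark the processing, and the storage layer keeps the raw data available for any future analysis.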

However, many C-level executives might now say: “but we’ve invested millions in our DWH over the last years (or even decades)”. Here, the question gets more complex. How should we treat our DWH? Should it be replaced, or should the DWH become the single source of truth and the datalake be ignored? In my opinion, neither option is valid:

Can the datalake replace a data warehouse?

First, replacing a DWH and moving all data to a datalake would be a massive project that would tie up too many resources in a company. Finding people with adequate skills isn’t easy, so this can’t be the solution. In addition, hundreds of business KPIs have been built on the DWH, and many units within large enterprises base their decisions on them. Moving them to a datalake will most likely break (important) business processes. Also, previous investments would be vaporised. So a big-bang replacement is clearly a no-go.

Second, keeping everything in the DWH is not feasible either. Modern tools such as Python, TensorFlow and many more aren’t well supported by proprietary software (or at least get that support only with a delay). From a skills perspective, most young professionals coming from university are trained in technologies such as Spark, Hadoop and the like, so the skills shortage can be addressed more easily by moving towards a datalake.

I speak at a large number of international conferences; whenever I ask the audience if they want to work with proprietary DWH databases, no hands go up. If I ask them if they want to work with datalake technologies, everyone raises their hand. The fact is that employees choose the company they want to work for, not vice versa. We have a skills shortage in this area; ignoring or denying that is simply wrong. Also, a DWH is way more expensive than a datalake. So this option is not a valid one either.

What to do now?

So what is my recommendation or strategy? For large, established enterprises, it is a combination of both approaches, but with a clear path towards replacing the DWH in the long run. I am not a supporter of complex, long-running projects that are hard to control and track. Replacing the DWH should be a vision, not a project. This can be achieved through agile project management combined with a long-term strategy: new projects are done solely with datalake technologies.

All future investments and platform implementations must use the datalake as the single source of truth. Whenever existing KPIs and processes are renewed, it must be ensured that they are implemented on the datalake and that the data is shifted from the DWH to the datalake. To make this succeed, it is necessary to have strong metadata management and data governance in place; otherwise the datalake will become a very messy place – and thus a data swamp.

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company. Wikipedia describes the concept of the datalake very well.

In my last post of this series, I explained the concepts of supervised, unsupervised and semi-supervised machine learning. In this post, we will go a bit deeper into machine learning – clustering, regression and classification (but don’t worry, it won’t be that deep yet!) – and look at more concrete topics. But first of all, we have to define some terms, which basically derive from statistics and mathematics.

Features and Labels in Machine Learning

  • Features
  • Labels

Features are known values, which are often used to calculate results. These are the variables that have an impact on a prediction. If we talk about manufacturing, we might want to reduce junk in our production line. Known features of a machine could then be: temperature, humidity, operator, time since the last service. Based on these features, we can later calculate the quality of the machine’s output.

Labels are the values we want to predict. In training data, the labels are mostly known, but at prediction time they are not. In the machine data example from above, the label would be the quality. So all of the features together determine whether the quality is good or bad, and algorithms can then calculate the quality based on them.
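
To make the distinction concrete, here is a small sketch (all numbers are made up) showing the machine data from above as a table of features plus a label:

```python
import pandas as pd

# Hypothetical machine data: four features and the label we want to predict.
data = pd.DataFrame({
    "temperature": [71.2, 85.9, 70.8, 90.1],      # feature
    "humidity": [0.32, 0.65, 0.30, 0.71],          # feature
    "operator": ["A", "B", "A", "C"],              # feature
    "hours_since_service": [12, 310, 25, 402],     # feature
    "quality": ["good", "bad", "good", "bad"],     # label
})

features = data.drop(columns=["quality"])  # the known input values
labels = data["quality"]                   # the value we want to predict
```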

Let’s now move on to another “classification” of machine learning techniques. We “cluster” them by whether they are supervised or unsupervised.

Machine Learning: Clustering, Classification and Regression

The first one is clustering. Clustering is an unsupervised technique: the algorithm tries to find patterns in data sets that have no labels associated with them. An example would be clustering customers by their buying behaviour. Features for this could be household income, age and so on, and clusters of different consumer types could then be built.
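
As a minimal sketch (with made-up customer data), k-means in scikit-learn finds such clusters purely from the features, without any labels:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [household income in EUR, age].
customers = np.array([
    [25000, 23], [27000, 25], [90000, 48],
    [95000, 52], [30000, 30], [88000, 45],
])

# Find 2 clusters in the unlabelled data.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(customers)
print(kmeans.labels_)  # cluster assignment for each customer
```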

The next one is classification. In contrast to clustering, classification is a supervised technique. Classification algorithms look at existing data and predict which class a new data point belongs to. Classification has been used for spam filtering for years now, and these algorithms are more or less mature at classifying something as spam or not. With machine data, it could be used to predict material quality from several known parameters (e.g. humidity, strength, colour, …). The output of the material prediction would then be the quality type (either “good” or “bad”, or a number in a defined range such as 1-10). Another well-known example is predicting whether someone would have survived the Titanic – the classification is “true” or “false” and the input parameters are “age”, “sex” and “class”. If you were 55, male and in 3rd class, your chances were low; if you were 12, female and in first class, your chances were rather high.
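
Here is a sketch of the Titanic example with scikit-learn (the training sample is tiny and made up; the real dataset is available on Kaggle):

```python
from sklearn.tree import DecisionTreeClassifier

# Tiny, made-up training sample: [age, sex (0=male, 1=female), class].
X_train = [[55, 0, 3], [12, 1, 1], [40, 0, 2], [8, 1, 1], [60, 0, 3], [30, 1, 2]]
y_train = [False, True, False, True, False, True]  # survived?

# Learn the pattern from the labelled training data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict survival for a 55-year-old man in 3rd class
# and a 12-year-old girl in 1st class.
print(model.predict([[55, 0, 3], [12, 1, 1]]))
```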

The last technique for this post is regression. Regression is often confused with classification, but it is still different. With regression, no class labels (such as good or bad, spam or not spam, …) are predicted. Instead, regression outputs continuous, often unbounded, numbers. This makes it useful for financial predictions and the like. A commonly known example is the prediction of housing prices, where several values (features!) are known, such as the distance to specific landmarks, the plot size and so on. The algorithm can then predict a price for your house and the amount you could sell it for.
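
A minimal regression sketch with scikit-learn (all numbers are made up): the model learns from known house features and outputs a continuous price rather than a class:

```python
from sklearn.linear_model import LinearRegression

# Made-up training data: [distance to city centre in km, plot size in m²].
X_train = [[2.0, 300], [5.0, 450], [10.0, 600], [1.0, 250], [7.5, 500]]
y_train = [520000, 410000, 350000, 560000, 380000]  # sale price in EUR

model = LinearRegression().fit(X_train, y_train)

# Predict a price for a new house: 3 km from the centre, 400 m² plot.
print(model.predict([[3.0, 400]]))  # continuous, unbounded output
```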

What’s next?

In my next post, I will talk about different algorithms that can be used for such problems.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.

I teach Big Data & Data Science at several universities, and I also work in that field. Since I have written a lot here on Big Data itself, and there are now many young professionals deciding whether they want to go into data science, I decided to write a short intro series on machine learning. After this intro, you should be capable of digging deeper into the topic and know where to start. To kick off the series, we’ll go over some basics of machine learning. The first part is about supervised and unsupervised learning.

One of the main ideas behind machine learning is to find patterns in data and make predictions on that data without the need to develop each and every use case from scratch. For this, a number of algorithms are available. These algorithms can be “classified” by how they work. The main principles (which can then be split further) are:

  • Supervised Learning
  • Unsupervised Learning
  • Semi-supervised Learning

Supervised Learning

With supervised learning, the algorithm basically learns from existing data – it learns “from the past”. This means that there is a lot of training data that allows the algorithm to find patterns. We can also call this data “a teacher”. It works similarly to how we as humans learn: we get information from our parents, teachers and friends and combine it to make future predictions. Examples are:

  • Manufacturing: if several properties of a material had specific values, the quality was either good or bad (or maybe rated on a numeric scale). Now, if we produce a new material and look at its properties, we can say, based on the existing data from former production runs, what the quality will be. Properties of a material might be: hardness, colour, …
  • Banking: based on several properties of a potential borrower, we can predict whether the person is capable of paying back a loan. This can be based on existing data from former customers and what the bank “learned” from them. The algorithm takes a lot of different variables into consideration: these variables can be income, monthly liabilities, education, job, etc. (a minimal sketch of such a model follows below).
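
Here is a minimal sketch of the banking example with scikit-learn (all numbers are made up): a logistic regression model is trained on former customers and then predicts whether a new applicant will pay back the loan.

```python
from sklearn.linear_model import LogisticRegression

# Made-up former customers: [income, monthly liabilities] in EUR.
X_train = [[3000, 500], [1500, 900], [4200, 700], [1200, 1000], [5000, 400]]
y_train = [1, 0, 1, 0, 1]  # 1 = paid back the loan, 0 = defaulted

# The labelled history of former customers acts as the "teacher".
model = LogisticRegression().fit(X_train, y_train)

# Predict for a new applicant: 2500 EUR income, 600 EUR liabilities.
print(model.predict([[2500, 600]]))
```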

Unsupervised Learning

With unsupervised learning, we have no “teacher” available. The algorithm gets data and tries to find patterns in it. This can be done by clustering the data (e.g. customers with high income, customers with low income, …) and making predictions based on that. An unsupervised learning algorithm can be useful for several use cases. Below are some examples:

  • Manufacturing: find anomalies in the production lines (e.g. the average output was between 200 and 250 units per hour, but on day D at time T, the output was only 20 units). The algorithm can cluster the data into normal output and a detected anomaly.
  • Banking: normally, a customer only spends money in his home country. The algorithm detects abnormal behaviour, like money being transferred in a country the customer normally isn’t in. This can be an indicator of fraud (a minimal sketch of such anomaly detection follows below).
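
A minimal sketch of the manufacturing anomaly example, using scikit-learn’s IsolationForest (an unsupervised algorithm; the numbers are made up):

```python
from sklearn.ensemble import IsolationForest

# Hypothetical hourly output of a production line (units per hour);
# the 20 is the drop on day D at time T.
output = [[230], [215], [248], [222], [240], [20]]

# Fit an unsupervised model on the unlabelled data; no "teacher" needed.
model = IsolationForest(contamination=0.2, random_state=0).fit(output)
print(model.predict(output))  # 1 = normal, -1 = anomaly
```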

Last but not least, there is semi-supervised learning, which is a combination of both. In many machine learning projects, not all of the training data needed for supervised learning is labelled, so the missing labels might need to be predicted first. This can be done by combining supervised and unsupervised learning algorithms and then working with the “curated” data.
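
A small sketch of this idea using scikit-learn’s LabelPropagation (the data is made up): unlabelled samples are marked with -1, and the algorithm infers their labels from the labelled ones:

```python
from sklearn.semi_supervised import LabelPropagation

# Made-up data: only some samples are labelled; -1 marks a missing label.
X = [[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]]
y = [0, -1, 0, 1, -1, 1]

# Propagate the known labels to the unlabelled samples.
model = LabelPropagation().fit(X, y)
print(model.transduction_)  # inferred labels for all samples
```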

Now that we basically understand the 3 main concepts of supervised, unsupervised and semi-supervised learning, we can continue with variations within these concepts and some statistical background in the next post.

This tutorial is part of the Machine Learning Tutorial. You can learn more about Machine Learning by going through this tutorial. On Cloudvane, there are many more tutorials about (Big) Data, Data Science and the like; read about them in the Big Data Tutorials here. If you are looking for great datasets to play with, I would recommend Kaggle.