Big Data was a buzz-word over the last years, but it started to prove value in the last years as well. Major companies started to develop Big Data strategies and made their organisation fit to become “data driven”. However there is still some way to go. Therefore, my 5 predictions for 2019 are:

1. Big Data Services will become more popular on the Cloud, impacting the business model for On-Premise Hadoop providers

One obvious development in the last year was that the object stores (such as Amazon S3) became more popular for processing large amounts of data. This heavily threats the “traditional” HDFS based solutions, which are fully built on Hadoop. With Object Stores, HDFS and thus Hadoop becomes obsolete. Processing is done now with Spark. Also, most established Cloud providers started to create automated Spark services. This gives the customers more flexibility over traditional solutions. However, traditional HDFS still brings some advantages over Object stores such as fine-granular security and data governance. We will see improvements in cloud based object stores over the next year(s) to overcome those obstacles. But anyhow: in my opinion, the Hortonworks/Cloudera merger this year didn’t come out of a position of strength but rather from the future threats that arise – from the cloud. And running Hadoop in a full distribution in the cloud isn’t smart from an economy point of view.

2. Traditional database providers will see shrinking revenues for their proprietary solutions

Data warehouse providers such as Teradata struggle with decreasing revenues from their core business model. We could see this over the last years with shrinking revenue statements and declining market capitalisation. This trend will continue in 2019 and increases in it’s pace – but isn’t at the fasted depreciation yet. With companies becoming more mature in data, they will increasingly see that overly expensive data warehouses aren’t necessary anymore. However, data warehousing will always exist – also in a relational way. But this might also move to economically more relevant platforms. Anyway, data warehouse providers need to change their business models. They have huge potential to become a major player in the Big Data and Analytical ecosystem.

3. Big Data will become faster

This is another trend that emerged over the last year. With the discussion of Kappa vs. Lambda for the Big Data technology stack, this trend was becoming more attention recently. Real-Time platforms become more and more economical and thus making it easier to move towards faster execution. Also, customers expect fast results and internal (business) departments don’t want to wait forever for results. 

4. Big Data (and Analytics) isn’t a buzz-word anymore and sees significant investment within companies

As already mentioned in my opening statement, Big Data isn’t a buzzword anymore and companies putting some significant investments into it. This is now also perceived from the c-level. Executives see the need for their companies to become data driven in all aspects. The backbone of digitalisation is data, and if they want to succeed in digitalisation, the data aspects has to be mastered first. Banking and Telecommunication already started this journey in the last years and thus has significant knowledge gathered in this, other industries – also very traditional ones – will follow in 2019. Initiatives will now turn into programs with organisational alignments. 

5. Governance is now perceived as a key challenge for Big Data

Data Governance was always something nobody wanted to care about. It didn’t give much benefits to business functions (you don’t see governance) and it wasn’t put into the big context. Now, with big data being put into production in large enterprises, data governance comes up as an important topic again. Basically, companies should start with data governance at the very beginning, since it is too hard to do it afterwards. Also, a good data governance strategy enables the company to be faster with analytics. The aim on this should be self-service analytics, which can only be achieved with a great data governance strategy. 

Data Science Conference 4.0 is going to be in Belgrade again soon this year, and I am happy to be one of the keynote speakers there! The program will be fantastic, so looking forward to see as many as possible there :). From the organisers:
Data Science Conference / 4.0 (DSC 4.0) will be held on 18 and 19 September in hotel Hyatt Regency Belgrade and is organized by Institute for Contemporary Sciences (ISN). During the two days Belgrade will be the epicenter of Data Science and guests will have an opportunity to hear 62 talks and 4 discussion groups, as well as take part in exclusive content such as workshops and more than 12 technical tutorials. More than a 1000 attendees are expected from over 25 countries around the work, which makes DSC 4.0 one of the three biggest data science conferences in Europe.
Hadley Wickham, Chief Scientist at RStudio and an adjunct Professor at Stanford University, will open the Conference. He is a prominent and active member of the R user community and has developed several notable and widely used packages (known as the ‘tidyverse’). Additionally, many notable speakers from around Europe and the US will take the stage, such as Mario Meir-Huber (A1 Telekom Austria Group), Vanja Paunic (Microsoft London) and Miha Pelko (BMW Group). Moreover, a great number of women will speak this year, which is noteworthy as one of the goals of the Conference is to empower women in this field. Dorothea Wiesmann working at IBM Research Zurich is one of the 6 keynote speakers.
The program of the conference will be divided into 4 parallel tracks and will cover a wide range of topics from Artificial Intelligence, Machine Learning, Data Monetisation and Data Science Education to Big Data and Engineering and more! Attendees will be able to choose a level of a talk that is most suitable for their background – be it beginner, intermediate or advanced, as well as choose talk types between technical, business or academic types – all mentioned in our schedule. Additionally, there will be an UnConference in parallel to the talks, where any attendee will have a chance to hold a small presentation of a topic or an idea in data science and have a discussion with other guests of the Conference.
The program includes Workshops as well, and their goal is to provide experience to future data scientists on the problems that companies face day to day. They are free of charge and the applications are still open. Furthermore, attendees will have an opportunity to hear more than 130 hours of Technical Tutorials on the days leading up to the Conference (15-17.9.) on topics such as the basics and advanced visualisation using Tableau, using Amazon Web Services, the basics of Artificial Intelligence, Machine Learning in Python and R and many more.
Additional information can be found on Data Science Conference / 4.0 official website, where you can book a ticket as well or contact for any questions you might have. One more thing, you can take a look at the last year’s aftermovie here.

I am happy to share this exciting conference I am keynoting at. Also, Mike Ohlsen from Cloudera will deliver a keynote at the conference.
About the conference:
June 12th – 13th 2017 | Salzburg, Austria |
The 1st International Data Science Conference (iDSC 2017) organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH seeks to establish a key Data Science event, providing a forum for an international exchange on Data Science technologies and applications.
The International Data Science Conference gives the participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. Besides the two parallel tracks, the Research Track and the Industry Track, on the second day a Symposium is taking place presenting the outcomes of a European Project on Text and Data Mining (TDM). These events are open to all participants.
Also we are proud to announce keynote presentations from Mike Olson (Chief Strategy Officer Cloudera), Ralf Klinkenberg (General Manager RapidMiner), Euro Beinat (Data-Science Professor and Managing Director CS Research), Mario Meir-Huber (Big Data Architect Microsoft). These keynotes will be distributed over both conference days, providing times for all participants to come together and share views on challenges and trends in Data Science.
The Research Track offers a series of short presentations from Data Science researchers on their own, current papers. On both conference days, we are planning a morning and an afternoon session presenting the results of innovative research into data mining, machine learning, data management and the entire spectrum of Data Science.
The Industry Track showcases real practitioners of data-driven business and how they use Data Science to help achieve organizational goals. Though not restricted to these topics only, the industry talks will concentrate on our broad focus areas of manufacturing, retail and social good. Users of data technologies can meet with peers and exchange ideas and solutions to the practical challenges of data-driven business.
Futhermore the Symposium is organized in collaboration with the FutureTDM Consortium. FutureTDM is a European project which over the last two years has been identifying the legal and technical barriers, as well as the skills stakeholders/practitioners lack, that inhibit the uptake of text and data mining for researchers and innovative businesses. The recommendations and guidelines recognized and proposed to counterbalance these barriers, so as to ensure broader TDM uptake and thus boost Europe’s research and innovation capacities, will be the focus of the Symposium.
Our sponsors ClouderaF&F and um etc. will have their own, special platform: half-day workshops to provide hands-on interaction with tools or to learn approaches to developing concrete solutions. In addition, there will be an exhibition of the sponsors’ products and services throughout the conference, with the opportunity for the participants to seek contact and advice.
The iDSC 2017 is therefore a unique meeting place for researchers, business managers, and data scientists to discover novel approaches and to share solutions to the challenges of a data-driven world.

There are several things people discuss when it comes to Hadoop and there are some wrong discussions. First, there is a small number of people believing that Hadoop is a hype that will end at some point in time. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. But there are also some people that basically coin two major sayings: the first group of people states that Hadoop is cheap because it is open source and the second group of people states that Hadoop is expensive because it is very complicated. (Info: by Hadoop, I also include Spark and alike)

Neither the one nor the other is true.

First, you can download it for free and install it on your system. This makes it basically free in terms of licenses, but not in terms of running it. When you get a vanilla Hadoop, you will have to think about hotfixes, updates, services, integration and many more tasks that will get very complicated. This ends up in spending many dollars on Hadoop experts to solve your problems. Remember: you didn’t solve any business problem/question so far, as you are busy running the system! You spend dollars and dollars on expensive operational topics instead of spending them on creating value for your business.

Now, we have the opposite. Hadoop is expensive. Is it? In the past years I saw a lot of Hadoop projects the went more or less bad. Costs were always higher than expected and the project timeframe was never kept. Hadoop experts have a high income as well, which makes consulting hours even more expensive. Plus: you probably won’t find them on the market, as they can select what projects to make. So you have two major problems: high implementation cost and low ressource availability.

The pain of cluster sizing

Another factor that is relevant to the cost discussion is the cluster utilization. In many projects I could see one trend: when the discussion about cluster sizing is on, there are two main decisions: (a) sizing the cluster to the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have another problem: the cluster might be under-utilized. What I could see and what my clients often have, is the following: 20% of the time, they have full utilization on the cluster, but 80% of the time the cluster utilization is below 20%. This basically means that your cluster is very expensive when it comes to business case calculation. If you select (b), you will loose business agility and your projects/analytics might require long compute times.

At the beginning of this article, I promised to explain that Hadoop is still cost-effective. So far, I only stated that it might be expensive, but this would mean that it isn’t cost effective. Hadoop is still cost effective but I will give you a solution in my next blog post on that, so stay tuned 😉

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company

As 2016 is around the corner, the question is what this year will bring for Big Data. Here are my top assumptions for the year to come:

  • The growth for relational databases will slow down, as more companies will evaluate Hadoop as an alternative to classic rdbms
  • The Hadoop stack will get more complicated, as more and more projects are added. It will almost take a team to understand what each of these projects does
  • Spark will lead the market for handling data. It will change the entire ecosystem again.
  • Cloud vendors will add more and more capability to their solutions to deal with the increasing demand for workloads in the cloud
  • We will see a dramatic increase of successful use-cases with Hadoop, as the first projects come to a successful end

What do you think about my predictions? Do you agree or disagree?

2 Big Data and Hadoop E-Books are available at a special promotion. The reduced price is only valid for 1 week, so make sure to order soon! The offer expires on 21th of December and are available on the Kindle store. The two E-Books are:

  • Big Data (Introduction); 0.99$ instead of 5$: Get it here
  • Hadoop (Introduction); 0.99$ instead of 5$: Get it here

Have fun reading it!

I saw so many Big Data “initiatives” in the last month in companies. And guess what? Most of them failed either completely or simply didn’t deliver the results expected. A recent Gartner study even mentioned that only 20% of Hadoop projects are put “live”. But why do these projects fail? What is everyone doing wrong?
Whenever customers are coming to me, they “heard” of what Big Data can help them with. So they looked at 1-3 use cases and now want to have them put into production. However, this is where the problem starts: they are not aware of the fact that also Big Data needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. TelCo, Banking, …) and associated opportunities. To achieve that, a Big Data roadmap has to be built. This is normally done in a couple of workshops with the business. This roadmap will then outline what projects are done in what priority and how to measure results. Therefore, we have a Business Value Framework for different industries, where possible projects are defined.
The other thing I often see is that customers come and say: so now we built a data lake. What should we do with it? We simply can’t find value in our data. This is a totally wrong approach. We often talk about the data lake, but it is not as easy as IT marketing tells us; whenever you build a data lake, you first have to think about what you want to do with it. Why should you know what you might find if you don’t really know what you are looking for? Ever tried searching “something”? If you have no strategy, it is worth nothing and you will find nothing. Therefore, a data lake makes sense, but you need to know what you want to build on top of it. Building a data lake for Big Data is like buying bricks for a house – without knowing where you gonna construct that house and without knowing what the house should finally look like. However, a data lake is necessary to provide great analytics and to run projects on top of that.

Big Data and IT Business alignment

Big Data and IT Business alignment

Summing it up, what is necessary for Big Data is to have a clear strategy and vision in place. If you fail to do so, you will end up like many others – being desperate about the promises that didn’t turn out to be true.

The AWS Java SDK Version 1.8.10 comes with a critical bug, affecting uploads. A fix was provided by AWS and normally the SDK is updated automatically, so you don’t need to worry.
However, if automatic updates are disabled in your Eclipse Version, you might loose data when uploading via the SDK Version 1.8.10. Here is what AWS has to say about the bug:
// //

AWS Message

Users of AWS SDK for Java 1.8.10 are urged to immediately update to the latest version of the SDK, version 1.8.11.
If you’ve already upgraded to 1.8.11, you can safely ignore this message.
Version 1.8.10 has a potential for data loss when uploading data to Amazon S3 under certain conditions. Data loss can occur if an upload request using an InputStream with no user-specified content-length fails and is automatically retried by the SDK.
The latest version of the AWS SDK for Java can be downloaded here:
And is also available through Maven central:

The bug itself is repaired, in case you didn’t update the AWS SDK and are on the SDK Version 1.8.10 you should update that. Normally, the AWS SDK updates itself automatically in Eclipse.

// //
Apache Software Foundation announced that Apache Storm is now a top level Hadoop project. But what is Apache Storm about? Well, basically Apache Storm is a project to analyse data streams that are near real time. Storm works with messages and analyses what is going on. Storm originates from Twitter, which is using it for their streaming API. Storm is about processing time-critical data and Storm guarantees that your data gets processed. It is basically fault tolerant and scalable. Apache Storm is useful for fraud protection in gambling, banking and financial services, but not only there. Storm can be used wherever real-time or time-critical applications are necessary. At the moment, Storm allows to process 1 million tupels per second and node. This is massive, given the fact that Storm is all about scaling out. Imagine adding 100 nodes! Apache Storm works with Tupels that come from spouts. A spout is a messaging system such as Apache Kafka. Storm supports much more Messaging systems and it can easily be extended by it’s abstraction layer. Storm consists of some major concepts illustrated in the following image: Apache Storm Nimbus is the Master Node, similar to Hadoop‘s Job Tracker. ZooKeeper is used for Cluster coordination and the Supervisor runs the worker process. Each worker process consists of some subsets: an executor that is a thread spanned by the worker and a task itself.

Enjoy this article?

Make sure to subscribe to Cloudvane to receive regular updates here

Major concepts in Apache Storm are 4 elements: streams, spouts, bolts and topologies.
Tuples in Apache Storm

Tuples in Apache Storm

Streams are an unbound sequence of Tuples, a Spout is a source of streams, Bolts process input streams and create new output streams and a topology is a network of Bolts and Spouts.   The header image is provided as Creative Commons license by MattysFlicks.