The last few years have been exciting for telcos: 5G is the “next big thing” in communications. It promises ultra-high speeds with low latency; our internet speed will never be the same again. I worked in the telco business until recently, but I would say that the good times for telcos will soon be gone. Elon Musk will destroy this industry and shake it up entirely.

Why will Elon Musk disrupt the Telco industry?

Before we get to the answer, let’s first have a look at what one of his companies is currently “building”. You might have heard of SpaceX. Yes – these are the folks capable of shooting rockets into orbit and landing them again, which significantly reduces the cost per launch. Even NASA relies on SpaceX. And it doesn’t stop there: Elon Musk is telling us how to get to the Moon (again) and even bring the first people to Mars. That is really visionary, isn’t it?

However, amid all these Moon and Mars plans, there is one thing we tend to overlook: SpaceX is bringing a lot of satellites into orbit. Some of them are for other companies, but a significant number are for SpaceX itself. They have launched some 1,700 satellites and are already the largest operator of satellites. But what are these satellites for? Well – you might already have guessed it: for providing satellite-powered internet. Initially, the network was positioned as a solution for areas with poor coverage. Recently, however, the service (named “Starlink”) announced that it now offers global coverage.

One global network …

Wait, did I just write “global coverage”? That’s “insane”. One company can provide internet for each and every person on the planet, regardless of where they are. All 7.9 billion people in the world. That is a huge market to address! What is even more impressive, though, is the cost at which they can build this network. Right now, they have something like 1,700 satellites out there. Each Falcon 9 rocket (which they own!) can transport around 40 of these satellites. All together, the per-satellite launch cost for SpaceX comes to around $300,000. According to Morgan Stanley, SpaceX might need well below 60 billion dollars to build a satellite internet of around 30,000 satellites. That is a far higher number than the 1,700 already up there. However, think about speed and latency: right now, with 1,700 satellites in orbit, Starlink offers around 300 Mbit/s with 20 ms latency. That is already great compared to 4G, where you merely get up to 150 Mbit/s. Curious what happens once all 30,000 are up? I would expect something like 1 Gbit/s and very low latency. Then it would be a strong competitor to 5G.
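To see how these figures hang together, here is a quick back-of-the-envelope calculation. All inputs are the rough estimates quoted above (40 satellites per launch, about $300,000 launch cost per satellite, a 30,000-satellite target), not official SpaceX numbers:

```python
# Back-of-the-envelope Starlink launch economics, using only the rough
# figures quoted in the text above (not official SpaceX numbers).

SATS_TARGET = 30_000           # planned constellation size
SATS_PER_LAUNCH = 40           # rough Falcon 9 payload per flight
LAUNCH_COST_PER_SAT = 300_000  # rough per-satellite launch cost, USD

launches_needed = SATS_TARGET // SATS_PER_LAUNCH
total_launch_cost = SATS_TARGET * LAUNCH_COST_PER_SAT

print(f"Launches needed: {launches_needed}")         # 750
print(f"Total launch cost: ${total_launch_cost:,}")  # $9,000,000,000
```

Launch cost alone comes to roughly $9 billion, so the bulk of Morgan Stanley’s ~$60 billion estimate would go into building the satellites themselves, ground stations and operations.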

Again, the cost …

Morgan Stanley estimated the cost of this network at around 60 billion USD. That is quite a lot of money for Starlink to gather. It sounds like a lot, but it isn’t. Let’s compare it to 5G again. Accenture estimates that the 5G network for the United States alone will cost some 275 billion dollars! One market. Compare the 60 billion of Starlink, a global network addressing 7.9 billion people, with the U.S., where you can address 328 million people. It is roughly 20 times the market at a fraction of the cost! Good night, 5G.
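A tiny sketch makes the comparison explicit; the inputs are the estimates quoted above (Morgan Stanley and Accenture), so treat the outputs as rough orders of magnitude:

```python
# Addressable market vs. build-out cost: global Starlink vs. US-only 5G,
# using the estimates quoted above (Morgan Stanley / Accenture).

STARLINK_COST_USD = 60e9   # estimated cost of the global constellation
US_5G_COST_USD = 275e9     # estimated cost of the US 5G build-out
WORLD_POPULATION = 7.9e9   # people addressable by a global network
US_POPULATION = 328e6      # people addressable by a US-only network

market_multiple = WORLD_POPULATION / US_POPULATION  # ~24x the market
cost_ratio = US_5G_COST_USD / STARLINK_COST_USD     # US 5G ~4.6x dearer

print(f"Market multiple: {market_multiple:.0f}x")   # 24x
print(f"Cost ratio: {cost_ratio:.1f}x")             # 4.6x
```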

Internet of things via satellites rather than 5G

Building out 5G might not win the race for the future of IoT applications. Just think about autonomous cars: one key issue there is steady connectivity. 5G might not be available everywhere, and connectivity might be bad in many sparsely populated regions; it simply doesn’t pay off for telcos to build 5G everywhere. Starlink, in contrast, will be everywhere. So large IoT applications will likely opt for Starlink. Imagine Ford or Mercedes having one partner to negotiate with rather than 50 different telco providers around the globe. It makes things easier from both a technical and a commercial point of view.

Are Telcos doomed?

I would say: not yet. Starlink is at a very early stage and still in beta. There might be some issues coming up. However, telcos should definitely be afraid. I was in the business until recently, and most telco executives don’t think much about Starlink. If they do, they laugh at it. But remember what happened to the automotive industry? Yep, we are all going electric now. Automotive executives used to laugh at Tesla. A low-volume niche player, they said. And now? Tesla is more valuable than any other automotive company in the world and is mass-producing cars.

However, one thing is different: automotive companies could attach to the new normal fairly easily. Building a car is not just about the engine; it is also a lot about the process, the assembly lines and the like. All major car manufacturers now offer electric cars and can build them competitively with Tesla. With Starlink vs. 5G, this will be different: telco companies can’t build rockets. Elon Musk will disrupt another industry – again!

This post is an off-topic post from my Big Data tutorials

Big Data was a buzzword over the last years, but in that time it has also started to prove its value. Major companies have developed Big Data strategies and reshaped their organisations to become “data driven”. However, there is still some way to go. Therefore, my 5 predictions for 2019 are:

1. Big Data Services will become more popular on the Cloud, impacting the business model for On-Premise Hadoop providers

One obvious development in the last year was that object stores (such as Amazon S3) became more popular for processing large amounts of data. This heavily threatens “traditional” HDFS-based solutions, which are fully built on Hadoop. With object stores, HDFS – and thus Hadoop – becomes obsolete; processing is now done with Spark. Also, most established cloud providers have started to offer automated Spark services, which gives customers more flexibility than traditional solutions. However, traditional HDFS still brings some advantages over object stores, such as fine-grained security and data governance. We will see improvements in cloud-based object stores over the next year(s) to overcome those obstacles. But anyhow: in my opinion, the Hortonworks/Cloudera merger this year didn’t come out of a position of strength but rather out of the future threats arising from the cloud. And running a full Hadoop distribution in the cloud isn’t smart from an economic point of view.

2. Traditional database providers will see shrinking revenues for their proprietary solutions

Data warehouse providers such as Teradata struggle with decreasing revenues from their core business model. We could see this over the last years in shrinking revenue statements and declining market capitalisation. This trend will continue in 2019 and pick up pace – and the steepest decline is yet to come. As companies become more mature with data, they will increasingly see that overly expensive data warehouses aren’t necessary anymore. Data warehousing itself will always exist – also in relational form – but it might move to more economical platforms. Either way, data warehouse providers need to change their business models. They have huge potential to become major players in the Big Data and analytics ecosystem.

3. Big Data will become faster

This is another trend that emerged over the last year. With the discussion of Kappa vs. Lambda architectures for the Big Data technology stack, this trend has received more attention recently. Real-time platforms are becoming more and more economical, making it easier to move towards faster execution. Also, customers expect fast results, and internal (business) departments don’t want to wait forever for them.

4. Big Data (and Analytics) isn’t a buzz-word anymore and sees significant investment within companies

As already mentioned in my opening statement, Big Data isn’t a buzzword anymore, and companies are putting significant investment into it. This is now also perceived at the C-level. Executives see the need for their companies to become data driven in all aspects. The backbone of digitalisation is data, and if they want to succeed in digitalisation, the data aspect has to be mastered first. Banking and telecommunications already started this journey in recent years and have gathered significant knowledge; other industries – including very traditional ones – will follow in 2019. Initiatives will now turn into programs with organisational alignment.

5. Governance is now perceived as a key challenge for Big Data

Data governance was always something nobody wanted to care about. It didn’t bring visible benefits to business functions (you don’t see governance) and it wasn’t put into the big context. Now, with Big Data being put into production in large enterprises, data governance is coming up as an important topic again. Basically, companies should start with data governance at the very beginning, since it is much harder to retrofit afterwards. Also, a good data governance strategy enables a company to be faster with analytics. The aim should be self-service analytics, which can only be achieved with a great data governance strategy.

Data Science Conference 4.0 is coming to Belgrade again soon this year, and I am happy to be one of the keynote speakers there! The program will be fantastic, so I am looking forward to seeing as many of you as possible there :). From the organisers:
Data Science Conference / 4.0 (DSC 4.0) will be held on 18 and 19 September at the Hyatt Regency Belgrade hotel and is organized by the Institute for Contemporary Sciences (ISN). During the two days, Belgrade will be the epicenter of Data Science, and guests will have the opportunity to hear 62 talks and 4 discussion groups, as well as take part in exclusive content such as workshops and more than 12 technical tutorials. More than 1,000 attendees are expected from over 25 countries around the world, which makes DSC 4.0 one of the three biggest data science conferences in Europe.
Hadley Wickham, Chief Scientist at RStudio and an adjunct professor at Stanford University, will open the conference. He is a prominent and active member of the R user community and has developed several notable and widely used packages (known as the ‘tidyverse’). Additionally, many notable speakers from around Europe and the US will take the stage, such as Mario Meir-Huber (A1 Telekom Austria Group), Vanja Paunic (Microsoft London) and Miha Pelko (BMW Group). Moreover, a great number of women will speak this year, which is noteworthy, as one of the goals of the conference is to empower women in this field. Dorothea Wiesmann of IBM Research Zurich is one of the 6 keynote speakers.
The program of the conference will be divided into 4 parallel tracks and will cover a wide range of topics, from Artificial Intelligence, Machine Learning, Data Monetisation and Data Science Education to Big Data, Engineering and more! Attendees will be able to choose the level of talk that best suits their background – beginner, intermediate or advanced – as well as choose between technical, business and academic talk types, all marked in the schedule. Additionally, there will be an UnConference in parallel to the talks, where any attendee can hold a small presentation of a topic or an idea in data science and discuss it with other guests of the conference.
The program includes workshops as well, whose goal is to give future data scientists hands-on experience with the problems companies face day to day. They are free of charge and applications are still open. Furthermore, attendees will have the opportunity to hear more than 130 hours of technical tutorials on the days leading up to the conference (15–17 September), on topics such as basic and advanced visualisation with Tableau, using Amazon Web Services, the basics of Artificial Intelligence, Machine Learning in Python and R, and many more.
Additional information can be found on the Data Science Conference / 4.0 official website, where you can also book a ticket or get in touch with any questions you might have. One more thing: you can take a look at last year’s aftermovie here.

I am happy to share this exciting conference I am keynoting at. Also, Mike Olson from Cloudera will deliver a keynote at the conference.
About the conference:
June 12th – 13th 2017 | Salzburg, Austria |
The 1st International Data Science Conference (iDSC 2017) organized by Salzburg University of Applied Sciences (Information Technology and Systems Management) in cooperation with Information Professionals GmbH seeks to establish a key Data Science event, providing a forum for an international exchange on Data Science technologies and applications.
The International Data Science Conference gives the participants the opportunity, over the course of two days, to delve into the most current research and up-to-date practice in Data Science and data-driven business. Besides the two parallel tracks, the Research Track and the Industry Track, a Symposium will take place on the second day, presenting the outcomes of a European project on Text and Data Mining (TDM). These events are open to all participants.
Also, we are proud to announce keynote presentations from Mike Olson (Chief Strategy Officer, Cloudera), Ralf Klinkenberg (General Manager, RapidMiner), Euro Beinat (Data Science Professor and Managing Director, CS Research) and Mario Meir-Huber (Big Data Architect, Microsoft). These keynotes will be distributed over both conference days, providing time for all participants to come together and share views on challenges and trends in Data Science.
The Research Track offers a series of short presentations from Data Science researchers on their own current papers. On both conference days, we are planning a morning and an afternoon session presenting the results of innovative research into data mining, machine learning, data management and the entire spectrum of Data Science.
The Industry Track showcases real practitioners of data-driven business and how they use Data Science to help achieve organizational goals. Though not restricted to these topics only, the industry talks will concentrate on our broad focus areas of manufacturing, retail and social good. Users of data technologies can meet with peers and exchange ideas and solutions to the practical challenges of data-driven business.
Furthermore, the Symposium is organized in collaboration with the FutureTDM Consortium. FutureTDM is a European project which, over the last two years, has been identifying the legal and technical barriers, as well as the skills stakeholders and practitioners lack, that inhibit the uptake of text and data mining by researchers and innovative businesses. The focus of the Symposium will be the recommendations and guidelines proposed to counterbalance these barriers, so as to ensure broader TDM uptake and thus boost Europe’s research and innovation capacities.
Our sponsors, Cloudera and F&F among others, will have their own special platform: half-day workshops to provide hands-on interaction with tools or to learn approaches to developing concrete solutions. In addition, there will be an exhibition of the sponsors’ products and services throughout the conference, with the opportunity for the participants to seek contact and advice.
The iDSC 2017 is therefore a unique meeting place for researchers, business managers, and data scientists to discover novel approaches and to share solutions to the challenges of a data-driven world.

There are several things people discuss when it comes to Hadoop, and some of these discussions go in the wrong direction. First, there is a small number of people who believe Hadoop is a hype that will end at some point. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. Then there are two camps with opposing claims: the first states that Hadoop is cheap because it is open source, the second that Hadoop is expensive because it is very complicated. (Info: by Hadoop, I also include Spark and the like.)

Neither the one nor the other is true.

First, yes, you can download it for free and install it on your system. That makes it free in terms of licenses, but not in terms of running it. With vanilla Hadoop, you will have to think about hotfixes, updates, services, integration and many more tasks that get very complicated. You end up spending a lot of money on Hadoop experts to solve your problems. Remember: you haven’t solved any business problem or question yet, because you are busy running the system! You spend dollar after dollar on expensive operational topics instead of spending them on creating value for your business.

Now for the opposite claim: Hadoop is expensive. Is it? In the past years I saw a lot of Hadoop projects that went more or less badly. Costs were always higher than expected, and the project timeframe was never kept. Hadoop experts earn high incomes as well, which makes consulting hours even more expensive. Plus, you probably won’t find them on the market, as they can pick which projects to take. So you have two major problems: high implementation cost and low resource availability.

The pain of cluster sizing

Another factor relevant to the cost discussion is cluster utilization. In many projects I saw the same trend: when cluster sizing is discussed, there are two main options: (a) sizing the cluster for the highest expected utilization, or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have a problem: the cluster will be under-utilized most of the time. What I often saw at my clients is the following: 20% of the time they have full utilization on the cluster, but 80% of the time the utilization is below 20%. This basically means that your cluster is very expensive when it comes to the business case calculation. If you select (b), you lose business agility, and your projects/analytics might require long compute times.
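The numbers above translate into a strikingly low average utilization for option (a). A minimal sketch of that calculation, using the rough percentages from the text:

```python
# Average utilization of a cluster sized for peak load, given the usage
# pattern described above: 20% of the time at full load, 80% of the time
# at roughly 20% load.

peak_share, peak_util = 0.20, 1.00   # share of time at full utilization
rest_share, rest_util = 0.80, 0.20   # share of time at ~20% utilization

avg_utilization = peak_share * peak_util + rest_share * rest_util
print(f"Average utilization: {avg_utilization:.0%}")  # 36%
```

In other words, a peak-sized cluster sits at roughly 36% average utilization: almost two thirds of the hardware spend is idle capacity, which is exactly what wrecks the business case.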

At the beginning of this article, I promised to explain why Hadoop is still cost-effective. So far, I have only argued that it can be expensive, which would suggest the opposite. Hadoop is still cost-effective, but I will give you the solution in my next blog post, so stay tuned 😉

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company

As 2016 is just around the corner, the question is what the new year will bring for Big Data. Here are my top predictions for the year to come:

  • The growth of relational databases will slow down, as more companies evaluate Hadoop as an alternative to the classic RDBMS
  • The Hadoop stack will get more complicated as more and more projects are added. It will almost take a team just to understand what each of these projects does
  • Spark will lead the market for handling data. It will change the entire ecosystem again.
  • Cloud vendors will add more and more capabilities to their solutions to deal with the increasing demand for workloads in the cloud
  • We will see a dramatic increase in successful Hadoop use cases, as the first projects come to a successful end

What do you think about my predictions? Do you agree or disagree?

2 Big Data and Hadoop E-Books are available at a special promotion. The reduced price is only valid for 1 week, so make sure to order soon! The offer expires on the 21st of December, and both books are available in the Kindle store. The two E-Books are:

  • Big Data (Introduction); $0.99 instead of $5: Get it here
  • Hadoop (Introduction); $0.99 instead of $5: Get it here

Have fun reading it!

I have seen so many Big Data “initiatives” in companies over the last months. And guess what? Most of them either failed completely or simply didn’t deliver the expected results. A recent Gartner study even mentioned that only 20% of Hadoop projects go “live”. But why do these projects fail? What is everyone doing wrong?
Whenever customers come to me, they have “heard” what Big Data can help them with. So they looked at 1–3 use cases and now want them put into production. However, this is where the problem starts: they are not aware that Big Data, too, needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. telco, banking, …) and the associated opportunities. To achieve that, a Big Data roadmap has to be built, normally in a couple of workshops with the business. This roadmap then outlines which projects are done in what priority and how to measure results. For this purpose, we have a Business Value Framework for different industries, where possible projects are defined.
The other thing I often see is customers saying: “So, now we built a data lake. What should we do with it? We simply can’t find value in our data.” This is a totally wrong approach. We often talk about the data lake, but it is not as easy as IT marketing tells us: whenever you build a data lake, you first have to think about what you want to do with it. How should you know what you might find if you don’t know what you are looking for? Ever tried searching for “something”? Without a strategy, the lake is worth nothing and you will find nothing. A data lake does make sense, but you need to know what you want to build on top of it. Building a data lake without that is like buying bricks for a house without knowing where you are going to build the house or what it should finally look like. Still, a data lake is necessary to provide great analytics and to run projects on top of it.

Big Data and IT Business alignment


Summing up: what Big Data needs is a clear strategy and vision. If you fail to put one in place, you will end up like many others – desperate about the promises that didn’t turn out to be true.

The AWS Java SDK version 1.8.10 ships with a critical bug affecting uploads. A fix has been provided by AWS, and normally the SDK updates automatically, so you don’t need to worry.
However, if automatic updates are disabled in your Eclipse installation, you might lose data when uploading via SDK version 1.8.10. Here is what AWS has to say about the bug:

AWS Message

Users of AWS SDK for Java 1.8.10 are urged to immediately update to the latest version of the SDK, version 1.8.11.
If you’ve already upgraded to 1.8.11, you can safely ignore this message.
Version 1.8.10 has a potential for data loss when uploading data to Amazon S3 under certain conditions. Data loss can occur if an upload request using an InputStream with no user-specified content-length fails and is automatically retried by the SDK.
The latest version of the AWS SDK for Java can be downloaded here:
And is also available through Maven central:

The bug itself has been fixed; if you haven’t updated and are still on SDK version 1.8.10, you should do so now. Normally, the AWS SDK updates itself automatically in Eclipse.