How big is Big Data really? This and similar questions are discussed here.

The Azure CLI is my favorite tool for managing Hadoop clusters on Azure. Why? Because I can now use the tools I know from Linux on my Windows PC. On Windows 10, I use Ubuntu Bash for that, which gives me all the major tools for managing remote Hadoop clusters.

One thing I do frequently is starting and stopping Cloudera-based Hadoop clusters. If you are coming from PowerShell, this might be rather painful, since you can only start the VMs in the cluster sequentially, which means that a cluster of 10 or more nodes is slow to start and might take hours! With the Azure CLI, I can easily do this by specifying “--no-wait”, and everything runs in parallel. The only disadvantage is that I don’t get a notification when the cluster is ready. I work around this with a simple hack: ssh’ing into the cluster (which I have to do anyway). SSH will succeed once the master nodes are ready, and then I can perform some tasks on the nodes (such as restarting Cloudera Manager, since CM is usually a bit “dizzy” after being sent to sleep and woken up again :))

Let’s start with the easiest step: stopping the cluster. Every Azure CLI command starts with “az” (meaning Azure, of course). The command for stopping one or more VMs is “vm stop”. The only two things I need to provide are the IDs of the VMs I want to stop and “--no-wait”, since I want the script to return right away.

So, the script would look like the following:

az vm stop --ids YOUR_IDS --no-wait

However, this still has one major disadvantage: you would need to hardcode all the IDs. This doesn’t matter if your cluster never changes, but in my case I add VMs to and delete VMs from the cluster, so this script doesn’t work well for me. However, the CLI is very flexible (and so is Bash), and I can query all VMs in a resource group. This gives me the IDs that are currently in the cluster (let’s assume I delete dropped VMs from the resource group and add new VMs to it). The query for retrieving all VMs in a resource group is easy:

az vm list --resource-group YOUR_RESOURCE_GROUP --query "[].id" -o tsv

This will give me all IDs in the resource group. The real fun starts when doing this in one statement:

az vm stop --ids $(az vm list --resource-group clouderarg --query "[].id" -o tsv) --no-wait

Which is really nice and easy 🙂
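One caveat with the one-liner: if the resource group contains no VMs, the inner query returns nothing and “--ids” ends up without a value. The following is a minimal sketch of a guard around that, not a definitive script. The function name stop_cluster is my own choice for illustration, and the sketch assumes that the IDs contain no spaces, so plain word splitting is safe.

```shell
#!/usr/bin/env bash

# stop_cluster RESOURCE_GROUP
# Hypothetical helper: stops all VMs in the given resource group,
# but only if the query actually returned at least one ID.
stop_cluster() {
  local rg=$1
  local ids

  # Collect the IDs of all VMs in the resource group (one per line).
  ids=$(az vm list --resource-group "$rg" --query "[].id" -o tsv)

  if [ -z "$ids" ]; then
    echo "No VMs found in resource group $rg" >&2
    return 1
  fi

  # $ids is intentionally unquoted so that each ID becomes its own argument.
  az vm stop --ids $ids --no-wait
}
```

Calling it would then look like “stop_cluster clouderarg”.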

Starting the VMs in a resource group works the same way:

az vm start --ids $(az vm list --resource-group clouderarg --query "[].id" -o tsv) --no-wait
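The SSH hack mentioned earlier can also be scripted. Below is a minimal sketch of a retry helper that keeps running a command until it succeeds. The function name wait_for, the user name and the host name are made up for illustration, and the example call assumes key-based SSH authentication.

```shell
#!/usr/bin/env bash

# wait_for MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
wait_for() {
  local max_attempts=$1
  local delay=$2
  shift 2

  local attempt
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      return 0            # command succeeded, stop retrying
    fi
    if (( attempt < max_attempts )); then
      sleep "$delay"      # wait before the next attempt
    fi
  done
  return 1                # gave up after MAX_ATTEMPTS failures
}

# Example (hypothetical user and host, not executed here):
# try for up to 10 minutes, every 10 seconds, until SSH to the master node works.
# wait_for 60 10 ssh -o ConnectTimeout=5 -o BatchMode=yes admin@master1.example.com true \
#   && echo "Master node is reachable"
```

Once the call returns successfully, follow-up tasks such as restarting Cloudera Manager can run in the same script.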

People discuss several things when it comes to Hadoop, and some of these discussions are misguided. First, a small number of people believe that Hadoop is a hype that will end at some point. They often come from a strong DWH background and won’t accept (or simply ignore) the new normal. But there are also two other camps: the first states that Hadoop is cheap because it is open source, and the second states that Hadoop is expensive because it is very complicated. (Info: by Hadoop, I also mean Spark and the like.)

Neither is true.

First, you can download Hadoop for free and install it on your systems. This makes it basically free in terms of licenses, but not in terms of running it. With a vanilla Hadoop, you have to take care of hotfixes, updates, services, integration and many more tasks that get very complicated. You end up spending a lot of money on Hadoop experts to solve these problems. Remember: you haven’t answered a single business question so far, as you are busy running the system! You spend dollar after dollar on expensive operational topics instead of on creating value for your business.

Now to the opposite claim: Hadoop is expensive. Is it? Over the past years, I have seen a lot of Hadoop projects that went more or less badly. Costs were always higher than expected and the project timeline was never kept. Hadoop experts also command high salaries, which makes consulting hours even more expensive. Plus: you probably won’t find them on the market, as they can choose which projects to take on. So you have two major problems: high implementation costs and low resource availability.

The pain of cluster sizing

Another factor that is relevant to the cost discussion is cluster utilization. In many projects I have seen the same pattern: when cluster sizing is discussed, there are two main options: (a) sizing the cluster for the highest expected utilization or (b) making the cluster smaller than the highest expected utilization. If you select (a), you have another problem: the cluster might be under-utilized. What I have seen at many of my clients is the following: 20% of the time the cluster is fully utilized, but 80% of the time the utilization is below 20%. This means that your cluster is very expensive when it comes to the business case calculation. If you select (b), you will lose business agility and your projects and analytics might require long compute times.
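To put a number on option (a), here is a small back-of-the-envelope calculation based on the figures above. It is only a sketch: it treats “below 20%” as exactly 20%, so the result is an upper bound on the average utilization, and the variable names are my own.

```shell
#!/usr/bin/env bash

# Share of time in each phase and the utilization during that phase (in percent).
peak_share=20; peak_util=100
idle_share=80; idle_util=20    # "below 20%", so 20 is an upper bound

# Time-weighted average utilization in percent (integer arithmetic).
avg_util=$(( (peak_share * peak_util + idle_share * idle_util) / 100 ))

# Capacity that is paid for but not used, in percent.
unused=$(( 100 - avg_util ))

echo "Average utilization: at most ${avg_util}%"
echo "Unused capacity: at least ${unused}%"
```

Under these assumptions the cluster is used at no more than 36% on average, which means that at least 64% of the capacity you pay for sits idle.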

At the beginning of this article, I promised to explain that Hadoop is still cost-effective. So far, I have only stated that it might be expensive, which would suggest that it isn’t cost-effective. It still is, and I will present a solution in my next blog post, so stay tuned 😉

This post is part of the “Big Data for Business” tutorial. In this tutorial, I explain various aspects of handling data right within a company.

On the 15th of December, a Big Data Meetup will take place in Vienna, with leading figures from Fraunhofer, RapidMiner, Teradata et al.

About the Meetup:

The growing digitization and networking process within our society has a large influence on all aspects of everyday life. Large amounts of data are being produced permanently, and when these are analyzed and interlinked they have the potential to create new knowledge and intelligent solutions for economy and society. Big Data can make important contributions to the technical progress in our societal key sectors and help shape business. What is needed are innovative technologies, strategies and competencies for the beneficial use of Big Data to address societal needs.

Climate, Energy, Food, Health, Transport, Security, and Social Sciences – are the most important societal challenges tackled by the European Union within the new research and innovation framework program “Horizon 2020”. In every one of these fields, the processing, analysis and integration of large amounts of data plays a growing role – such as the analysis of medical data, the decentralized supply with renewable energies or the optimization of traffic flow in large cities.

Big Data Europe (BDE, http://www.big-data-europe.eu) will undertake the foundational work for enabling European companies to build innovative multilingual products and services based on semantically interoperable, large-scale, multilingual data assets and knowledge, available under a variety of licenses and business models.

On 14-15 December 2015, the whole BDE team is meeting in Vienna for a project plenary. Around 35 experts on the topic will therefore take part in the Big Data Europe MeetUp on 15 December 2015 at the Impact Hub Vienna to discuss challenges, requirements and proven solutions for big data management together with the audience.

Agenda
16:00 – 16:10, Welcome & the BDE MeetUp, Vienna – Martin Kaltenböck (SWC)
16:10 – 16:30, The Big Data Europe Project – Sören Auer (Fraunhofer IAIS, BDE Project Lead)
16:30 – 16:45, Big Data Management Models (e.g. RACE) – Mario Meir-Huber (Big Data Lead CEE, Teradata, Vienna – Austria)
16:45 – 17:00, Selected Big Data Projects in Budapest & above – Zoltan C Toth (Senior Big Data Engineer, RapidMiner Inc., Budapest – Hungary)
17:00 – 17:30, Open Discussion with the Panel on Big Data Requirements, Challenges and Solutions
17:30 – 19:00, Networking & Drinks
Remark: 19:00/30 end of event…

Register here or here.

I am happy to announce that there is a partnership between the Data Natives conference and Cloudvane. Once again, one lucky person can get a free ticket to this conference. The conference takes place from 19th to 20th November in Berlin.

What’s necessary for you to get the ticket:

  • Share the blog post (Twitter, LinkedIn, Facebook) and send the proof of that to me via mail
  • Write a review (ideally with some pictures)

Data Natives focuses on three key areas of innovation: Big Data, IoT and FinTech. The intersection of these product categories is home to the most exciting technology innovation happening today. Whether it’s for individual consumers or multi-billion dollar industries, the opportunity is immense. Come and learn more from leading scientists, founders, analysts, investors and economists from Google, SAP, Rocket Internet, Gartner and Forrester, among others. Expect two days full of interesting talks, shared knowledge from 50+ speakers and the chance to engage with a data-driven community of more than 500 people.

More information on www.datanatives.io 

Thursday, November 19, 8:30AM to Friday, November 20 7:00PM  

NHow Hotel Berlin

Stralauer Allee 3

10245 Berlin

Germany

I am happy to announce the conference Big Data Week. I managed to get one free ticket, which I will give to a reader of my blog. What’s necessary for you to get the ticket:

  • Share the blog post (Twitter, LinkedIn, Facebook) and send the proof of that to me via mail
  • Write a review (ideally with some pictures)

About the conference:

You are invited to attend the Big Data Conference, which is going to take place in London on November 25.

This year’s conference theme is Big Data in Use: presenting innovative use cases from the retail, advertising, publishing, IoT and gaming domains. Companies that implemented such projects will showcase their impact on the business, the benefits and the challenges, both technical and business-related.

Get your ticket now and learn from industry experts, put your existing knowledge to work and forge lasting relationships within one of the most exciting big data communities!

Why should you attend?

Confirmed speakers and themes for the 2015 lineup include:

  • New business models:  Exterion, Honest Caffe, Copenhagen City Exchange
  • Big Data in Retail: Shop Direct, Dunnhumby, EBI Solutions
  • Grow your business with machine learning: Yandex Data Factory
  • How to value data: Dunnhumby, The Economist, Skimlinks, Exterion
  • Data Models and Architectures: Excelian, ShopDirect, Skimlinks
  • 3 Panels: Big Data in Retail, How to become a data driven company, Data Scientists & the Business
  • 1 Workshop: How to become a data scientist? (Technical Track)

 

Your VIP ticket extra-benefits include:

  • 4 Trainings – Big Data in Retail and Real time processing of data – sessions on 23, 24, 26 and 27 November
  • 70% discount on a second conference ticket – One Day Pass
  • VIP Lounge and after conference networking party access

*** A little special something for our community: the organizers are offering you an exclusive 20% off! Just use this code: CloudVane_20_Off***

Super Early Bird Tickets on sale until October 16th!

Want to find out more? Check out the Conference Website.

I am happy to announce what we developed over the last months within Teradata: a lightweight process model for Big Data Analytic projects, called “RACE”. The model is agile and reflects the know-how of more than 25 consultants who worked on over 50 Big Data Analytic projects in recent months. Teradata also co-developed CRISP-DM, the industry-leading process for data mining. Now we have created a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI come from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different from traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (usually 4 to 8 weeks). Each meaningful insight or successful use case that can be actioned generates ROI. The total ROI is the sum over all successful use cases. Competitive advantage is therefore driven by the capability to produce both a high volume of insights and creative insights that generate a high ROI.

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge & creativity to produce high-ROI insights.

What does the process look like?

RACE - an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap. This is an optional (but heavily recommended) first step to build a roadmap of where the customer wants to go in terms of Big Data.
  • Align. Use cases are detailed and data is confirmed.
  • Create. Data is loaded, prepared and analyzed. Models are developed.
  • Evaluate. Recommendations for the business are given.

In the next couple of weeks we will publish much more on RACE, so stay tuned!

I have seen so many Big Data “initiatives” in companies over the last months. And guess what? Most of them either failed completely or simply didn’t deliver the expected results. A recent Gartner study even mentioned that only 20% of Hadoop projects are put “live”. But why do these projects fail? What is everyone doing wrong?

Whenever customers come to me, they have “heard” what Big Data can help them with. So they have looked at 1-3 use cases and now want to put them into production. However, this is where the problem starts: they are not aware that Big Data, too, needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. TelCo, Banking, …) and the associated opportunities. To achieve that, a Big Data roadmap has to be built. This is normally done in a couple of workshops with the business. The roadmap then outlines which projects are done with what priority and how results are measured. For this purpose, we have a Business Value Framework for different industries, in which possible projects are defined.

The other thing I often see is that customers come and say: “So now we have built a data lake. What should we do with it? We simply can’t find value in our data.” This is a totally wrong approach. We often talk about the data lake, but it is not as easy as IT marketing tells us; whenever you build a data lake, you first have to think about what you want to do with it. How should you know what you might find if you don’t really know what you are looking for? Ever tried searching for “something”? If you have no strategy, the data lake is worth nothing and you will find nothing. A data lake does make sense, but you need to know what you want to build on top of it. Building a data lake for Big Data without a strategy is like buying bricks for a house without knowing where you are going to build that house or what it should finally look like. Still, a data lake is necessary to provide great analytics and to run projects on top of it.

Big Data and IT Business alignment

 

To sum up, what Big Data needs is a clear strategy and vision. If you fail to put these in place, you will end up like many others: disappointed by promises that didn’t turn out to be true.

 

Everyone is doing Big Data these days. If you don’t work on Big Data projects within your company, you are simply not up to date and don’t know how things work. Big Data solves all of your problems, really!

Well, reality is different. It doesn’t solve all your problems. It actually creates more problems than you think! Most companies I have recently seen working on Big Data projects failed. They started a Big Data project and successfully wasted thousands of dollars on it. But what exactly went wrong?

First of all, Big Data is often equated with Hadoop. We live with the misperception that Hadoop alone can solve all Big Data topics. This simply isn’t true. Hadoop can do many things, but real data science is often not done with the core of Hadoop. Ever talked to someone doing the analytics (e.g. someone good at math or statistics)? They are not OK with writing Java MapReduce jobs or Pig/Hive scripts. They want to work with other tools that are far more interactive.

The other thing is that most Big Data initiatives are handled wrongly. Most of them simply don’t include someone who is good at analytics. You usually don’t find this type of person in an IT team; the person has to be found somewhere else. Failing to include someone with these skills often leads to finding “nothing” in the data, because IT staff are good at writing queries but not at doing complex analytics. These skills are not taught in IT classes; they require a totally different field of study.

Many IT departments see Hadoop as the solution to everything. However, projects often stop with implementing Hadoop. Most Hadoop implementations never leave the pilot phase. This is often because IT departments see Hadoop as a fun thing to play with, but getting it into production requires a different approach. There are actually more solutions out there that can be used when delivering a Big Data project.

A key to ruining your Big Data project is not involving the line of business (LoB). The IT department often doesn’t know what questions to ask. So how can they know the answer and then try to find the question? The LoB sees that differently. They see an answer and know what the question would be.

The key to killing your Big Data initiative is exactly one thing: go with the hype. Implement Hadoop and don’t think about what you actually want to achieve with it. Forget the use case, just go and play with the fancy technology. NOT!

As long as companies stick to that, I am sure I will have enough work to do. I have “inherited” several failed projects and turned them into successes. So, please continue.

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, focusing on the business functions.

This post’s focus: Logistics.

Big Data is a key driver for logistics. By logistics, I mean both companies that provide logistics solutions and companies that take advantage of logistics. For one, Big Data can significantly improve a company’s supply chain. For years, or even decades, companies have relied on “just in time” delivery. However, “just in time” wasn’t always “just in time”. In many cases, the time an item spent in stock was simply reduced, but it still needed to be stored somewhere: either in a temporary warehouse on-site or in the delivery trucks themselves. The first approach is capital-intensive, since these warehouses need to be built (and extended in case of growth). The second approach is to keep the delivery vehicles waiting, which creates expenses on the operational side: each minute a driver has to wait costs money. With analytics, just-in-time delivery can be further improved and optimized to lower costs and increase productivity.

Another key driver for Big Data in logistics is route optimization. Algorithms can improve routes and make them faster. This lowers costs and also significantly reduces the environmental impact. But this is not the end of the possibilities: routes can also be optimized in real time. This includes traffic prediction and jam avoidance. Real-time algorithms will not only calculate the fastest route but also the most environmentally friendly and the cheapest route. This again saves the company money and time.

Header Image by  Nick Saltmarsh / CC BY

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, focusing on the business functions.

This post’s focus: IT.

Big Data is a hot IT topic. Not just because it comes from IT, but also because it brings great benefits to overall IT operations. In a recent project, I worked with a large European corporation in the manufacturing/production sector. Their IT department had some 400 employees, serving more than 50,000 corporate employees and operating a large number of servers that run specific services. A key challenge for them was the reliability of their services. To find out how each service is utilised, large amounts of log data were analysed so that the different services could be prioritised. This gave them detailed insights into where they want to move their services to, since different services had different utilisation patterns. As a result, the company could improve its server utilisation. New services are integrated into that approach as well, which means that the company can deliver them without having to invest in new hardware.

Another great approach, and another hot topic, is Big Data for IT security. With Big Data analytics, companies can find security issues before they become serious threats. Patterns in website access can provide insights into DoS attacks and similar issues. These analytics are often provided in real time and offer fast ways to react when problems occur.

As described in today’s article, Big Data is not just a topic coming from IT, it is a topic MADE for IT.