Learn what is necessary for Big Data Management and how you can implement Big Data Projects in your company

I have seen so many Big Data “initiatives” in companies over the last months. And guess what? Most of them either failed completely or simply didn’t deliver the expected results. A recent Gartner study even mentioned that only 20% of Hadoop projects go “live”. But why do these projects fail? What is everyone doing wrong?

Whenever customers come to me, they have “heard” what Big Data can do for them. They looked at one to three use cases and now want them put into production. However, this is where the problem starts: they are not aware that Big Data also needs a strategic approach. To get this right, it is necessary to understand the industry (e.g. TelCo, Banking, …) and the opportunities associated with it. To achieve that, a Big Data roadmap has to be built, normally in a couple of workshops with the business. This roadmap then outlines which projects are done in which priority and how results are measured. For this, we have a Business Value Framework for different industries, in which possible projects are defined.

The other thing I often see is that customers come and say: “So, now we have built a data lake. What should we do with it? We simply can’t find value in our data.” This is completely the wrong approach. We often talk about the data lake, but it is not as easy as IT marketing tells us: whenever you build a data lake, you first have to think about what you want to do with it. How should you know what you might find if you don’t know what you are looking for? Ever tried searching for “something”? Without a strategy the lake is worth nothing and you will find nothing. A data lake does make sense, but you need to know what you want to build on top of it. Building a data lake for Big Data is like buying bricks for a house without knowing where you are going to build that house and what it should finally look like. A data lake is still necessary, though, to provide great analytics and to run projects on top of it.

[Figure: Big Data and IT Business alignment]

Summing it up: what Big Data needs is a clear strategy and vision. If you fail to put one in place, you will end up like many others – frustrated by promises that didn’t come true.

Everyone is doing Big Data these days. If you don’t work on Big Data projects within your company, you are simply not up to date and don’t know how things work. Big Data solves all of your problems, really!

Well, in reality this is different. Big Data doesn’t solve all your problems. It actually creates more problems than you think! Most companies I have seen working on Big Data projects recently failed. They started a Big Data project and successfully wasted thousands of dollars. But what exactly went wrong?

First of all, Big Data is often equated with Hadoop. We live with the misperception that Hadoop alone can solve every Big Data problem. This simply isn’t true. Hadoop can do many things, but real data science is often not done with the Hadoop core. Ever talked to the people doing the analytics (e.g. people good at maths or statistics)? They are not happy writing Java MapReduce jobs or Pig/Hive scripts. They want to work with tools that are far more interactive.
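
To make that concrete, here is a minimal sketch of the kind of interactive analysis such people prefer. I am assuming PySpark purely as an example – R, notebooks or other interactive tools would make the same point – and the table and column names are made up for illustration.

```python
# Illustrative only: an interactive, SQL-like aggregation that would require
# far more boilerplate as a hand-written Java MapReduce job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("interactive-analytics").getOrCreate()

# Hypothetical call-record data set stored on the cluster
calls = spark.read.parquet("hdfs:///data/telco/call_records")

# Which regions see the most dropped calls?
dropped_per_region = (
    calls.where(F.col("status") == "dropped")
         .groupBy("region")
         .agg(F.count("*").alias("dropped_calls"))
         .orderBy(F.desc("dropped_calls"))
)

dropped_per_region.show(10)
```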

The other thing is that most Big Data initiatives are handled wrongly. They often simply don’t include someone who is good at analytics. You usually don’t find this type of person in an IT team – they have to be found somewhere else. Failing to include someone with these skills often leads to finding “nothing” in the data, because IT staff are good at writing queries, but not at doing complex analytics. These skills are not taught in IT classes – it takes a completely different field of study to build up this skill set.

For many IT departments, Hadoop is the solution to everything. However, projects often stop with implementing Hadoop, and most Hadoop implementations never leave the pilot phase. This is often because IT departments see Hadoop as a fun thing to play with – getting it into production requires a different approach. And there are actually many more solutions than Hadoop that can be used when delivering a Big Data project.

A sure way to ruin your Big Data project is not involving the line of business (LoB). The IT department often doesn’t know which questions to ask – so how should they recognise an answer in the data? The LoB sees this differently: they see an answer and know which question it belongs to.

The key to killing your Big Data initiative is exactly one thing: going with the hype. Implement Hadoop and don’t think about what you actually want to achieve with it. Forget the use case, just go and play with the fancy technology. NOT.

As long as companies stick to that, I am sure I will have enough work to do. I have “inherited” several failed projects and turned them into successes. So, please continue.

During my professional career, I managed several IT projects, mainly in the distributed systems environment. Initially, these were cloud projects, which were rather easy. I worked with IT departments in different domains and industries, and we all shared the same “vocabulary”. When talking with IT staff, everyone uses the same terms to describe things; no special explanation is needed.

I soon realized that Big Data projects are VERY different. I have written several posts over the last months about Big Data challenges and the requirements for data scientists and the like. What I keep coming across when managing Big Data projects is the different approach one has to take to manage these kinds of projects successfully.

Let me first explain what I do. First of all, I don’t code, implement or create any kind of infrastructure. I work with senior (IT) staff on ideas that will eventually be transformed into Big Data projects (directly or indirectly). My task is to work with them on what Big Data can achieve for their organization and/or problem. I am not discussing what their Hadoop solution will look like; I am working on use cases and on the challenges and opportunities around their problems, independent of a concrete technology. I am also not focused on any specific industry or domain.

However, all strategic Big Data projects follow a similar pattern. The most challenging part is understanding the problem. Over the last months I have faced such challenges in different industries; whenever I run these kinds of projects, it is mainly about cooperating with the domain experts. They often have no idea about the possibilities of Big Data – and they don’t have to. I, in contrast, have no idea about the domain itself. That is challenging on the one hand, but very helpful on the other. The more experience a person gains within a specific domain, the more that person thinks and acts in the methodology of that domain. They often don’t see the solution because they work on the assumption “I’ve made this experience before, so the new case has to be very similar”. The same applies to me as a Big Data expert. All the workshops I ran were essentially about mixing the concrete domain with the possibilities of Big Data.

I had a number of interesting projects lately. One of them was in the geriatric care domain. We worked on how data can make the lives of elderly people better and what type of data is needed for that. It was very interesting to work with domain experts and see what challenges they actually face. An almost funny discussion arose around Open Data: we looked at several data sources provided by the state and I said, “Sorry, but we can’t use these data sources. They aren’t big, and they are about the locations of toilets in our capital city.” Their opinion was different, because the location of toilets is very important for them – and data doesn’t always need to be big, it needs to be valuable. Another project was in the utilities domain and was about improving the supply chain by optimizing it with data. Yet another project, for a device manufacturer, was about improving the reliability of their devices by analyzing large amounts of log data. When a device has an outage, a service technician has to travel to the city of the outage, which takes several days to a week. I worked on reducing this time and brought in a data scientist. For the three major error codes we could reduce the downtime to just a few hours by finding patterns weeks before the outage occurs. There is still much work to be done in that area, though. Further projects were in the utilities and government sectors.
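
To give a feeling for that last example, here is a deliberately small sketch of the underlying idea – checking which error codes cluster in the weeks before an outage. This is not the project’s actual implementation; the file names, column names and the 14-day window are assumptions for illustration.

```python
import pandas as pd

# Hypothetical inputs: one row per log entry and one row per outage
logs = pd.read_csv("device_logs.csv", parse_dates=["timestamp"])    # device_id, timestamp, error_code
outages = pd.read_csv("outages.csv", parse_dates=["outage_start"])  # device_id, outage_start

window = pd.Timedelta(days=14)
rows = []
for _, outage in outages.iterrows():
    # All log entries of this device in the two weeks before its outage
    mask = (
        (logs["device_id"] == outage["device_id"])
        & (logs["timestamp"] >= outage["outage_start"] - window)
        & (logs["timestamp"] < outage["outage_start"])
    )
    counts = logs.loc[mask, "error_code"].value_counts()
    rows.extend({"error_code": code, "count": n} for code, n in counts.items())

# Error codes that pile up before outages are candidates for early-warning rules
pre_outage = (
    pd.DataFrame(rows)
    .groupby("error_code")["count"]
    .sum()
    .sort_values(ascending=False)
)
print(pre_outage.head(10))
```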

All of these projects had a common iteration phase but were otherwise very different – each project had its own challenges. The key success factor for me was how to deal with people: it was very important to work with different people from different domains with different mindsets, which also improved my knowledge and broadened my horizon. That is challenging on the one hand, but very exciting on the other.

As I outlined in an earlier post here, becoming a data scientist requires a lot of knowledge.

Coming back to that, a data scientist needs knowledge in several IT domains:

  • A general understanding of distributed systems and how they work. This includes Linux administration skills as well as hardware-related skills such as networking.
  • Knowledge of Hadoop or similar technologies. This builds on the former, but is somewhat different and requires more software-focused knowledge.
  • Strong statistical/mathematical knowledge. This is necessary to actually work on the required tasks and to figure out how they can be turned into real algorithms.
  • Presentation skills. Everything else is worth nothing if the person can’t present the data or the findings in it. Management might not see the point if the data isn’t presented in an appropriate way.

In addition, there are some other skills necessary:

  • Knowledge of the legal situation. The legal basics differ from country to country. Even though the European Union sets some common legal boundaries for its member states, there are still differences between them.
  • Knowledge of the impact on society. It is also necessary to understand how society might react to data analysis. Especially in marketing it is absolutely necessary to handle this correctly.

Since more and more IT companies are looking for the ideal data scientist, they should first ask who could actually cover all of these skills. The answer might well be: no single person can. It is likely that someone is great at distributed systems and Hadoop but fails at transforming questions into algorithms and finally presenting the results.

Data Science is more of a team effort than the work of a single person who can handle all of it. It is therefore necessary to build a team that can address all of these challenges.

Big Data is all about limiting our privacy. With Big Data, we get no privacy at all. Hello, Big Brother is watching us and we have to stop it right now!

Well, this is far too harsh. Big Data is NOT all about limiting our privacy. Just to make it clear: I see the benefits of Big Data. However, there are a lot of people out there who are afraid of Big Data because of privacy. The first thing I want to state: Big Data is not the NSA, not Facebook, and not whatever surveillance technology you can think of – even if such surveillance is often enabled by Big Data technologies. I see this discussion often, and I recently came across an event that stated that Big Data is bad and that it limits our privacy. I say: this is bullsh##.

The event I am talking about claimed that Big Data is bad, that it limits our privacy and that it needs to be stopped. That statement only sees one side of the topic. I agree that the continuous monitoring of people by secret services isn’t great and that we need to do something about it – but this is not Big Data. I agree that Facebook is limiting my privacy; I have significantly reduced the time I spend on Facebook and don’t use the mobile apps. This needs to change.

However, this is not Big Data. These are companies and organisations doing something that is not OK. Big Data is much more than that. Big Data is not simply evil; it is great in many areas:

  • Big Data in healthcare can save thousands, if not millions, of lives by improving medicine and vaccination and by finding correlations for chronically ill people to improve their treatment. Nowadays we can decode DNA in a short time, which helps a lot of people!
  • Big Data in agriculture can improve how we produce food. Since the global population is growing, we need to become more productive in order to feed everyone.
  • Big Data can improve the stability and reliability of IT systems by providing real-time analytics: logs are analysed in real time to react to incidents before they happen (see the small sketch after this list).
  • Big Data can – and actually does – improve the reliability of devices and machines. One example is medical devices: a company in this field could reduce the time its devices were down from weeks to only hours! This does not just save money, it also saves lives!
  • There are many other use cases where Big Data does great things.
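
As a toy illustration of the real-time log analytics mentioned above, the sketch below raises an alert when the error rate within a short sliding window crosses a threshold. The window size, the threshold and the plain-Python approach are assumptions; a production setup would typically sit on a streaming engine.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
THRESHOLD = 50            # errors within the window that should trigger an alert
recent_errors = deque()   # timestamps of recent error lines

def process_line(timestamp: datetime, level: str) -> None:
    """Feed one parsed log line into the sliding-window error counter."""
    if level == "ERROR":
        recent_errors.append(timestamp)
    # Drop entries that have fallen out of the window
    while recent_errors and recent_errors[0] < timestamp - WINDOW:
        recent_errors.popleft()
    if len(recent_errors) >= THRESHOLD:
        print(f"ALERT: {len(recent_errors)} errors in the last {WINDOW} (as of {timestamp})")
```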

We need to start working together instead of calling something bad just because it seems to be. No technology is good or evil per se; there are always bad aspects, but also good ones, and it is necessary to see all sides of a technology. The event I was talking about inspired me to write this article, because its view is so narrow-minded.

Big Data is considered to be the job you simply have to go for. Some call it sexy, some call it the job of the future. But what exactly is a Data Scientist? Is it someone you can simply hire straight from university, or is it more complicated than that? Definitely the latter.

When we think about a Data Scientist, we often say that the perfect Data Scientist is a kind of hybrid between a statistician and a computer scientist. I think this needs to be redefined, since much more knowledge is necessary. A Data Scientist should also be good at analysing business cases and at talking to line executives to understand the problem and model an ideal solution. Furthermore, extensive knowledge of current (international) law is necessary. In a recent study we did, we defined 5 major challenges:

[Figure: The perfect Data Scientist]

Each of the five topics is about the following:

  • Big Data Business Developer: This person needs to know what questions to ask, how to cooperate with line of business (LOB) decision makers, and must have the social skills to work with all of them.
  • Big Data Technologist: If your company isn’t using the cloud for Big Data analytics, you also need to be into infrastructure. This person must know a lot about system infrastructure, distributed systems, datacenter design and operating systems. It is also important to know how to run the software – Hadoop doesn’t install itself, and some maintenance is necessary.
  • Big Data Analyst: This is the fun part; it is all about writing queries, running Hadoop jobs, doing fancy MapReduce work and so on. However, the person should know what to analyse and how to implement such algorithms. It is also about machine learning and more advanced topics (a small sketch of this kind of analysis follows after this list).
  • Big Data Developer: This is more about writing extensions, add-ons and other components. It is also about distributed programming, which isn’t the easiest part in itself.
  • Big Data Artist: Got the hardware/datacenter right? Know what to analyse? Wrote the algorithms? What about presenting the results to your management? Exactly – this is also necessary, and you simply shouldn’t forget about it. The best data is worth nothing if nobody is interested in it because of poor presentation. Knowing how to present your data matters.
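
To make the analyst role a bit more tangible, here is a deliberately small sketch of the “machine learning and more advanced topics” part, assuming scikit-learn and an already prepared feature table. The file, the feature columns and the “churned” label are made up for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical, already cleaned feature table with one label column
data = pd.read_csv("prepared_features.csv")
X = data.drop(columns=["churned"])   # e.g. usage and billing features
y = data["churned"]                  # the outcome the business cares about

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# How well do the learned patterns generalise to unseen cases?
print(classification_report(y_test, model.predict(X_test)))
```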

As you can see, it is very hard to become a data scientist. Things are not as easy as they might seem. A Data Scientist would have to be a nerd in each of these fields – some kind of “super nerd”, the superhero of the future.

Most likely, you won’t find one person that is good in all of these fields. Therefore, it is necessary to build an effective team.

Header Image Copyright: Chase Elliott Clark

Big Data is definitely a very complex “thing”. Why do I call it “a thing” here? Because it is simply not a technology in itself! Hadoop is a technology, Lucene is a technology, but Big Data is more of a concept – it is nothing you can touch. Ever tried installing Big Data on your machine? Or said “I need this Big Data software”? When you talk about software or a technology, you talk about a very concrete product or open-source tool.

The concept of Big Data is rather complicated when it comes to implementing it. There are several major dimensions you have to be aware of.

[Figure: Big Data Dimensions]

The dimensions are:

  • Legal dimension: What is necessary in terms of data protection legislation? What do you need to know about the legal impact, and what kind of data are you allowed to collect, process and store?
  • Social dimension: What social impact will your application have? How will your users react to it?
  • Business dimension: What business model do you want to build on your Big Data platform? How can the platform support your business? What kind of pricing do you want to offer?
  • Technology dimension: How can you achieve your targets? Which technology would you use to get there? Which scalable software can you use?
  • Application dimension: Which industry solutions are available for your needs? How can you enable data-driven decision support in your company?

If you want to address all of these questions, you need a team that is capable of doing so. In the next posts I will talk about the Big Data technology stack and what it takes to be a data scientist.

Header Image copyright:  Michael Coghlan. Distributed under the Creative Commons license 2.0 by Creative Commons Australia Pool.