Big Data involves a lot of different technologies, and each of these technologies requires different knowledge. I’ve described the necessary knowledge in an earlier post. Today, we will have a look at the Big Data technology layers.

In this post, I want to outline all necessary technologies in the Big Data stack. The following image shows them:

Technologies in the Big Data Stack

The layers are:

  • Management: This layer addresses how to store data on hardware or in the cloud and which resources need to be scheduled. It basically covers the knowledge involved in datacenter design and/or cloud computing for Big Data.
  • Platforms: This layer is all about Big Data technologies such as Hadoop and how to use them.
  • Analytics: This layer is about the mathematical and statistical techniques necessary for Big Data. It is about asking the questions you need to answer.
  • Utilisation: The last and most abstract layer is about the visualization of Big Data. This is mainly the domain of visual artists and presentation software.

Each of the layers needs different knowledge and also different hardware and software. As described earlier, it is simply not possible to have a single piece of software that “fits it all”. Instead, you need to build a team that has knowledge in all of these areas.

I hope you enjoyed the first part of this tutorial about Big Data technology. This post is part of the Big Data Tutorial. Make sure to read the entire tutorial.

Working with Big Data is considered to be the job you simply have to go for. Some call it sexy, some call it the best job of the future. But what exactly is a Data Scientist? Is it someone you can simply hire from university, or is it more complicated? Definitely the latter.

When we think about a Data Scientist, we often say that the perfect Data Scientist is a kind of hybrid between a Statistician and a Computer Scientist. I think this needs to be redefined, since much more knowledge is necessary. A Data Scientist should also be good at analysing business cases and at talking to line executives to understand the problem and model an ideal solution. Furthermore, extensive knowledge of current (international) law is necessary. In a recent study we did, we defined 5 major challenges:

The perfect Data Scientist

The 5 topics are:

  • Big Data Business Developer: The person needs to know what questions to ask and how to work with line of business (LOB) decision makers, and must have the social skills to cooperate with all of them.
  • Big Data Technologist: In case your company isn’t using the cloud for Big Data Analytics, you also need to be into infrastructure. The person must know a lot about system infrastructure, distributed systems, datacenter design and operating systems. Furthermore, it is also important to know how to run your software. Hadoop doesn’t install itself and there is some maintenance necessary.
  • Big Data Analyst: This is the fun part; here it is all about writing your queries, running Hadoop jobs, doing fancy MapReduce queries and so on! However, the person should know what to analyse and how to implement such algorithms. It is also about machine learning and more advanced topics.
  • Big Data Developer: Here it is more about writing extensions, add-ons and other stuff. It is also about distributed programming, which isn’t easy in itself.
  • Big Data Artist: Got the hardware/datacenter right? Know what to analyse? Wrote the algorithms? What about presenting the results to your management? Exactly! This is also necessary, and you simply shouldn’t forget about it. The best data is worth nothing if nobody is interested in it because of poor presentation, so it is also necessary to know how to present your data.
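To give a feeling for the “MapReduce queries” mentioned for the Big Data Analyst, here is a minimal sketch of the map, shuffle and reduce steps as a word count. It is plain, in-memory Python and not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration only.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values to get one count per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Chain the three steps; a real cluster would distribute them."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


if __name__ == "__main__":
    # Hypothetical sample input, just to show the call.
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

On a real platform the same three steps run on many machines at once, which is exactly where the knowledge of the Big Data Technologist and the Big Data Developer comes in.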

As you can see, it is very hard to become a data scientist. Things are not as easy as they might seem. The Data Scientist should be a nerd in each of these fields, so the person should be some kind of “super nerd”. This might be the super hero of the future.

Most likely, you won’t find one person who is good at all of these fields. Therefore, it is necessary to build an effective team.

Header Image Copyright: Chase Elliott Clark

Big Data is definitely a very complex “thing”. Why do I call it “a thing” here? Because it is simply not a technology itself! Hadoop is a technology, Lucene is a technology, but Big Data is more of a concept, since it is nothing you can touch. Ever tried installing Big Data on your machine? Or said “I need this Big Data software”? When you talk about software or a technology, you talk about a very concrete product or open source tool.

The concept of Big Data is rather complicated when it comes to implementing it. There are several major dimensions you have to be aware of.

Big Data Dimensions

The dimensions are:

  • Legal dimension: What is necessary in terms of data protection legislation? What do you need to know about legal impacts? What kind of data are you allowed to store, collect or process?
  • Social dimension: What social impacts will you generate with your application? How will your users react to that?
  • Business dimension: What is the business model you want to generate with your Big Data platform? How can your Big Data platform support your business? What kind of pricing do you want to calculate?
  • Technology dimension: How can you achieve your targets? What technology would you use to get there? What scalable software can you use?
  • Application dimension: What industry solutions are available for your needs? How can you enable decision support based on data for your company?

If you want to address all of these questions, you need a team that is capable of doing so. In the next posts I will talk about the Big Data technology stack and what it takes to be a data scientist.

Header Image copyright:  Michael Coghlan. Distributed under the Creative Commons license 2.0 by Creative Commons Australia Pool.