Hadoop Tutorial – Getting started with Apache Hadoop

Hadoop is one of the most popular Big Data technologies – maybe even the key Big Data technology. Due to the large demand for Hadoop, I’ve decided to write a short Hadoop tutorial series here. Over the next weeks, I will write several articles on the Hadoop platform and its key technologies.

When we talk about Hadoop, we don’t talk about one specific piece of software or a single service. The Hadoop project comprises several subprojects, each of them serving a different purpose in the Big Data ecosystem. When it comes to handling data, Hadoop is very different from traditional RDBMS systems. The key differences are:

  • Hadoop is about large amounts of data. Traditional database systems handle some gigabytes or terabytes of data; Hadoop can handle much more – petabytes are not a problem for Hadoop.
  • RDBMS offer interactive access to data, whereas Hadoop is batch-oriented.
  • With traditional database systems, the approach was “read many, write many”: data gets written often and also modified often. With Hadoop, this is different: the approach is “write once, read many”. Data is written once and then never changed; its only purpose is to be read for analytics (a minimal code sketch of this model follows after this list).
  • RDBMS systems have schemas. When you design an application, you first need to create the schema of the database. With Hadoop, this is different: the schema is very flexible – it is effectively schema-less.
  • Last but not least, Hadoop scales linearly. If you add 10% more compute capacity, you get roughly 10% more performance. RDBMS are different; at a certain point, scaling them becomes really difficult.
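
To make the “write once, read many” model concrete, here is a minimal sketch using the HDFS Java client API (org.apache.hadoop.fs). This is just an illustration under simple assumptions: the file path and its content are made up, and the configuration is taken from the cluster’s default config files.

    // Minimal sketch of HDFS's "write once, read many" model.
    // The path and the written content are illustrative only.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteOnceReadMany {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events/2015-01-01.log"); // hypothetical path

        // Data is written once ...
        try (FSDataOutputStream out = fs.create(file, false /* don't overwrite */)) {
          out.writeUTF("event-id=1;type=click");
        }

        // ... and afterwards only read for analytics. HDFS offers no API for
        // in-place updates; the contents of an existing file cannot be modified.
        try (FSDataInputStream in = fs.open(file)) {
          System.out.println(in.readUTF());
        }
      }
    }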

Central to Hadoop is the Map/Reduce algorithm. This algorithm was originally introduced by Google to power its search engine. However, the algorithm turned out to be very efficient for distributed systems, so it is nowadays used in many technologies. When you run queries in Hadoop with languages such as Hive or Pig (I will explain them later), these queries are translated into Map/Reduce jobs by Hadoop. The following figure shows the Map/Reduce algorithm:

[Figure: The Map/Reduce function]

The Map/Reduce function consists of several steps:

  1. All input data is distributed to the Map functions.
  2. The Map functions run in parallel. The distribution and failover are handled entirely by Hadoop.
  3. The Map functions emit data to temporary storage.
  4. The Reduce function then aggregates the temporarily stored data.

A typical example is word count. With word count, text input is fed to the Map function. The Map function adds all words of the same kind to a list in the temporary store. The Reduce function then counts the words and builds a sum. A minimal code sketch follows below.
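
Since this is a tutorial, here is a minimal word-count sketch using Hadoop’s Java MapReduce API to make the steps above concrete. The class names are illustrative; the job reads text files from an input directory and writes (word, count) pairs to an output directory, both passed as arguments.

    // Word count with the classic Hadoop MapReduce Java API.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Steps 1-3: each Map call receives one line of text and emits (word, 1) pairs.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emitted to temporary storage
          }
        }
      }

      // Step 4: the Reduce call receives all counts for one word and builds the sum.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Hadoop distributes the Map tasks across the cluster, handles failover, and shuffles the intermediate (word, 1) pairs to the Reduce tasks grouped by word, where the sums are built.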

Next week I will blog about the different Hadoop projects. As mentioned earlier, Hadoop consists of several subprojects.

Why Big Data projects are challenging – and why I love them

During my professional career, I have managed several IT projects, mainly in the distributed systems environment. Initially, these were cloud projects, which were rather easy. I worked with IT departments in different domains/industries, and we all shared the same “vocabulary”. When talking with IT staff, everyone uses the same terms to describe things; no special explanation is needed.
I soon realized that Big Data projects are VERY different. I have written several posts on Big Data challenges in the last months and on the requirements for data scientists and the like. What I keep coming across when managing Big Data projects is the different approach one has to take to manage these kinds of projects successfully.
Let me first explain what I do. First of all, I don’t code, implement, or create any kind of infrastructure. I work with senior (IT) staff to talk about ideas that will eventually be transformed into Big Data projects (either directly or indirectly). My task is to work with them on what Big Data can achieve for their organization and/or problem. I don’t discuss what their Hadoop solution will look like; I work on use cases and challenges/opportunities for their problems, independent of a concrete technology. Also, I am not focused on any specific industry or domain.
However, all strategic Big Data projects follow a common pattern. The most challenging part is to understand the problem. In the last months, I faced challenges in different industries; whenever I run these kinds of projects, it is mainly about cooperating with the domain experts. They often have no idea about the possibilities of Big Data – and they don’t have to. I, in contrast, have no idea about the domain itself. This is challenging on the one hand – but very helpful on the other. The more experience a person gains within a specific domain, the more that person thinks and acts within that domain’s methodology. They often don’t see the solution because they operate on the assumption “I’ve made this experience before, and it has to be very similar”. The same applies to me as a Big Data expert. All the workshops I ran were mainly about mixing the concrete domain with the possibilities of Big Data.
I had a number of interesting projects lately. One of them was in the geriatric care domain. We worked on how data can make the lives of the elderly better and what type of data is needed. It was very interesting to work with domain experts and see what challenges they actually face. An almost funny discussion arose around Open Data: we looked at several data sources provided by the state, and I remarked: “Sorry, but we can’t use these data sources. They aren’t big, and they are about the locations of toilets within our capital city.” Their opinion, however, was different, because the location of toilets is very important to them – data doesn’t always need to be big, it needs to be valuable. Another project was in the utilities domain, where the goal was to improve the supply chain by optimizing it with data. Yet another project, for a company providing devices, was about improving the reliability of those devices by analyzing large amounts of log data. When one of their devices has an outage, service personnel have to travel to the city of the outage, which takes several days to a week. I worked on reducing this time and brought in a data scientist. By finding patterns weeks before an outage occurs, we were able to reduce the downtime to just a few hours for the three major error codes. However, there is still much work to be done in that area. Further projects were in the utilities and government sectors.
All of these projects had a common iteration phase but were otherwise very different – each project had its own challenges. The key success factor for me was how to deal with people: it was very important to work with different people from different domains with different mindsets, which also improved my knowledge and broadened my horizon. That’s challenging on the one hand but very exciting on the other.

Impact of self-driving cars and Smart Logistics on Cloud and Big Data

Self-driving cars are gaining more and more momentum. In 2014, Tesla introduced the “Autopilot” feature for its Model S, which allows autonomous driving. The technology for self-driving cars has been around for years, though – there are other reasons why it is still not here. It is mainly a legal question, not a technical one.
However, autonomous systems will be here some years from now, and they will have a positive impact on cloud computing and big data. Some use cases were already described in an earlier post on smart cities, but there are several others. One positive effect of self-driving cars is improved safety: sensors need milliseconds to react to threats, whereas humans need about a second. This leaves more time for better reactions. Autonomous systems can also communicate with other cars and warn them in advance; this is called “Vehicle to Vehicle communication”. Communication also happens with infrastructure (called “Vehicle to Infrastructure communication”). A street, for instance, can warn the car about problems ahead – e.g., that the road surface is deteriorating.
The car’s onboard IT itself doesn’t need the cloud and big data – but the services around it will heavily use cloud and big data services.
Self-driving cars also bring a side effect: Smart Logistics. Smart Logistics means fully automated logistics devices that drive without the need for a driver and deliver goods to a destination. This can start in China with a truck that brings a container to a ship. The ship is also fully automated and operates independently. It sails to New York, where the goods are picked up by a self-driving truck again. The truck brings the container to a distribution center, where robots unload it and drones deliver the goods to the customers. All of this is handled by cloud and big data systems that often operate in real time.

Hadoop and Big Data in Central and Eastern Europe

Over the last months, I was working on a report about how Hadoop and Big Data are adopted in CEE (Central and Eastern Europe). I am happy to announce that the survey is now available at idc.com. We carried out comprehensive end-user research in the region with 600 IT decision makers. Here is the abstract:

This IDC Survey shows the overall acceptance of big data analytics and Hadoop technology in different company types across CEE and outlines the differences between countries in the region. In addition, it shows the differences of opinion on big data technologies between line-of-business managers and IT managers.
“Although big data solutions are emerging worldwide, the adoption rate is currently low in CEE. We expect this technology to take off in 2015. Providers should therefore be ready to offer Hadoop solutions to their customers in the region now.” — Lead Analyst Mario Meir-Huber, Big Data, IDC CEE

The report contains the following topics: IDC Opinion, Situation Overview, and Survey Findings, including insights on different Hadoop factors by country, industry, and number of employees, such as:

  • Attitude Toward Hadoop as a Technology
  • Hadoop Drivers
  • Hadoop Inhibitors

It further covers Vendor and Product Placement, Vendor Association with Hadoop, and a Comparison of IT Managers and LoB Managers in Terms of Big Data Analytics/Hadoop Contributions.

Companies covered: Apache Corporation, Oracle Corporation, Cloudera, Inc., IBM, SAS Institute Inc., Hewlett-Packard Company, Amazon.com, Inc., Google Inc., Microsoft Corporation
Regions covered: Central and Eastern Europe
Topics covered: Big Data analytics and discovery, Databases, Hadoop
 
If you are interested in the report, feel free to drop me a line at mmeir-huber – at – idc.com, or simply find the report on idc.com.
 

Impact of Industry 4.0 and Smart Production on Cloud and Big Data

According to various sources, we are in the middle of the so-called 4th industrial revolution. This revolution is basically led by a very high degree of automation and IT systems. Until recently, IT mainly played a supporting role in industry, but with new technologies this role will change dramatically: IT will lead the industry. Industry 4.0 (or “Industrie 4.0”) is mainly driven by Germany, which is betting heavily on the topic. Germany’s industrial output is high, and in order to maintain its global position, the German industry has to – and will – change dramatically.
Let’s first look at the past industrial revolutions:

  • The first industrial revolution took place in the 18th century, when the mechanical loom was introduced.
  • The second industrial revolution took place in the early 20th century, when assembly lines were introduced.
  • The third industrial revolution took place in the 1970s and 1980s, when machines could work on repeatable tasks and robots were first introduced.

The 4th industrial revolution is now led by the IT industry. It is not only about supporting the assembly lines but about replacing them. Customers can define their own products and make them truly individual. Designers can offer templates in online stores, and the product then knows how it will be produced. The product selects the factory in which it will be produced and tells the machines how it should be handled.
Everything in this process is fully automated. It starts with ordering something online. The transportation process is automated as well – autonomous systems deliver individual parts to the factories, and this goes well beyond traditional just-in-time delivery. This is also a democratization of design: just as individuals can now publish their books as e-books without a publisher, designers can provide their designs online on new platforms. This opens new opportunities for designers as well as customers.
As with Smart Homes and Smart Cities, this not only produces a lot of data – it also requires sophisticated back-end systems in the cloud that take care of these complex processes. Business processes need to be adjusted to the new challenges, and they are more complex than ever. This can’t be handled by one single system – it needs a complex system running in the cloud.

Guest Blog: Sphares, a tool to unify collaboration in the Cloud

By Dietmar Gombotz, CEO and Founder of Sphares
With the introduction and growth of different cloud and Software-as-a-Service offerings, a rapid transition has taken shape, driven by the blending of professional and personal space. Not only are users all over the world using modern, flexible, new products like Dropbox at home, they also want the same usability and “ease of use” in the corporate world. This, of course, conflicts with internal policies and external compliance requirements, especially when data is shared through such tools.
I will focus mainly on the aspect of sharing data (usually in the form of files, but it could be other data objects like calendar information or CRM data).
Many organizations have not yet formulated a consistent and universal strategy for how to handle this aspect of their daily work. We assume an organizational structure where data sharing with clients/partners/suppliers is a regular process, which will surely be the case in more than 80% of all businesses nowadays.
There are different strategies to handle this:
No Product Policy
The most well-known policy is simply to not allow the usage of modern tools and to stick with internal infrastructure or in-house-built tools.
Pro: data storage is 100% transparent; no need for further clarification
Con: an unrealistic expectation, especially in fields with a lot of data sharing; email will be used to transfer data to partners anyway, so the data ends up distributed across multiple places and stages
 
One-Product Policy
The most widely used proactive policy is to define one solution (e.g., “we use Google Drive”) for which a business account is taken, or which can be installed on the company’s own hardware (ownCloud, …).
Pro: data storage can be defined, employees have access to a working solution, clarifications are not needed
Con: partners need accounts on this system and have to make an extra effort to integrate it into their processes
Product-As-You-Need
Often seen at small shops. They use whatever their partners are using and create accounts when their partners propose a solution. They usually have a preferred product but will switch whenever the client wants to use something else.
Pro: no need of adjustment on side of partner
Con: dozens of accounts, often shared with private accounts and with no central control; data gets copied into internal systems, as with email
 
Usage of Aggregation Services
The organization takes the “Product-As-You-Need” approach combined with aggregation tools like JoliCloud or CloudKafe.
Pro: no need for adjustment on the partner’s side; a single view of the data on the company’s side
Con: data is still in dozens of systems and on private accounts (no central control); integration into processes is not possible, as the data stays on the different systems
 
Usage of Rule-Engines
There are a couple of rule engines, such as IFTTT (“If This Then That”) or Zapier, that can help you connect different tools and trigger actions, much like the filter rules you are used to in email inboxes. In combination with a preferred tool, this can be a valid way to get data pre-processed and put into your system.
Pro: rudimentary integration of different systems; employees stay within their own system
Con: usually one-way actions, so updates do not get back to your partners; usually set up on a per-user basis, so there is no central control
Service Integration
Service integration allows the sharing of data via an intermediate layer. There are solutions that synchronize data (SPHARES), thereby ensuring data consistency. Additionally, there are services that connect to multiple cloud storage facilities to retrieve data (Zoho CRM).
Pro: data is integrated into processes; everybody stays within the system they use
Con: additional cost for the integration service
 

Big Data: what or who is the data scientist?

In an earlier post here, I outlined that becoming a data scientist requires a lot of knowledge.
To recap, a data scientist needs to have knowledge in different IT domains:

  • A general understanding of distributed systems and how they work. This includes administration skills for Linux as well as hardware-related skills such as networking.
  • Knowledge of Hadoop or similar technologies. This basically builds on the former but is somewhat different and requires more software-focused knowledge.
  • Strong statistical/mathematical knowledge. This is necessary to actually work on the required tasks and to figure out how they can be turned into real algorithms.
  • Presentation skills. Everything is worth nothing if one can’t present the data or the findings in it. Management might not see the point if the person can’t present the data in an appropriate way.

In addition, there are some other skills necessary:

  • Knowledge of the legal situation. The legal basics differ from country to country. Although the European Union sets some legal boundaries for its member states, there are still differences.
  • Knowledge of the impact on society. It is also necessary to understand how society might react to data analysis. Especially in marketing, it is absolutely necessary to handle this correctly.

Since more and more IT companies are looking for the ideal data scientist, people should first try to find out who is capable of covering all of these skills. The answer might be: there is no one person who can handle them all. It is likely that someone is great at distributed systems and Hadoop but might fail at transforming questions into algorithms and finally presenting the results.
Data science is more of a team effort than a single-person job. Therefore, it is necessary to build a team that can address all of these challenges.

Impact of Smart Cities and Smart Homes on Cloud and Big Data

Cities and homes are getting smarter and smarter. People living in these cities demand more services from their local government. It shouldn’t be necessary to go to the city administration for standard tasks; these tasks should be doable online. A key driver for smart cities is e-government, but there is much more to it than e-government (which, in fact, has been around for years).
Cities need to get smarter, and this happens in various ways. The city can automatically adapt to new developments such as heavier traffic in a certain area. If more people want to go to a specific area (maybe because of an event), public and private transport automatically adapt. As for private transport, cars in a smart city often drive “automatically” – there is no driver (more on this later). This opens up some interesting opportunities: cars communicate to the city where they want to go, so the city has an overview of all desired destinations and can adapt in real time to challenges that might arise. If a destination is in high demand, the city can tell individual cars that a traffic jam is likely and prioritize cars or select alternative routes so that no car ends up in the jam. There could also be differentiated charging: e.g., if you want to get somewhere fast, you might have to pay a little more. A very similar system can be found in Singapore, where you pay for using streets based on traffic and time of day. This can significantly reduce private traffic, make the city “cleaner”, and cause inhabitants less stress. Some people might even decide to take public transport instead. Furthermore, private traffic could become public: companies might offer their cars to individuals, just like taxis but without drivers.
Of course, this needs a lot of technology in the background. Real-time systems have to be available, and complex calculations have to be performed. Smart cities need Big Data and Cloud Computing in order to provide all of these things.
A similar story can be seen with Smart Homes. More and more home automation is underway; Google’s Nest and Apple’s HomeKit are big bets by these companies on this emerging market. Future homes are highly connected and optimized. When the home “is not in use” – e.g., the children are at school and the parents are at work – the home stops heating or keeps it at a low level. Before the family comes back home, the house starts heating up again to reach the required temperature (or vice versa: the home cools down for those living in warmer regions). The home itself can be opened simply with a smartphone, and devices within the house are connected as well. There are sensors for elderly people that prevent danger, and advanced surveillance systems protect the home from unwanted visitors.
As with smart cities, this also requires a lot of back-end technology that is delivered via the cloud and uses big data technologies.

Privacy killed the Big Data star

Big Data is all about limiting our privacy. With Big Data, we get no privacy at all. Hello, Big Brother is watching us and we have to stop it right now!
Well, this is far too harsh. Big Data is NOT all about limiting our privacy. Just to make it clear: I see the benefits of Big Data. However, there are a lot of people out there who are afraid of Big Data because of privacy. The first thing I want to state: Big Data is not the NSA, Facebook, or whatever surveillance technology you can think of – even if such surveillance is often enabled by Big Data technologies. I see this discussion often, and I recently came across an event that stated that Big Data is bad and limits our privacy. I say this is bullsh##.
The event I am talking about claimed that Big Data is bad, that it limits our privacy, and that it needs to be stopped. This statement only sees one side of the topic. I agree that the continuous monitoring of people by secret services isn’t great and that we need to do something about it. But this is not Big Data. I agree that Facebook is limiting my privacy; I have significantly reduced the amount of time I spend on Facebook and don’t use the mobile apps. This needs to change.
However, this is not Big Data. These are companies/organisations doing something that is not OK. Big Data is much more than that. Big Data is not just evil; it is great in many respects:

  • Big Data in healthcare can save thousands, if not millions, of lives by improving medicine and vaccination and by finding correlations for chronically ill people to improve their treatment. Nowadays we can decode DNA in a short time, which helps a lot of people!
  • Big Data in agriculture can improve how we produce food. Since the global population is growing, we need to become more productive in order to feed everyone.
  • Big Data can improve the stability and reliability of IT systems by providing real-time analytics. Logs are analyzed in real time to react to incidents before they happen.
  • Big Data can – and actually does – improve the reliability of devices and machines. An example is medical devices: a company in this field was able to reduce device downtime from weeks to only hours! This does not just save money, it also saves lives!
  • There are many other use cases in this field where Big Data is great.

We need to start working together instead of just calling something bad because it seems to be so. No technology is good or evil; there are always some bad aspects but also some good ones. It is necessary to see all sides of a technology. The event I was talking about inspired me to write this article, as its position is so narrow-minded.