Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.
Today’s focus: Big Data for Healthcare.
Big Data offers several benefits for the healthcare industry. Decoding the human genome was one of the first Big Data applications in IT. The first time, it took years to decode the DNA sequence; nowadays it is a matter of hours! This opens up entirely new approaches to research in the health industry, enabled by Big Data algorithms.
However, Big Data in healthcare is not only about decoding DNA. There are several other benefits. Analyzing data helps with illnesses we don’t know enough about yet. There is a large number of chronic illnesses where doctors are still not sure where they come from and how best to treat them. Progress can be made by collecting large amounts of data on a specific illness and comparing it on a broad base against different factors. However, it is necessary to keep the data anonymous and to respect the data rights and privacy of individuals. The goal should be to improve healthcare.
Another benefit of Big Data in healthcare concerns medical devices. A large number of devices are used in the healthcare environment today. Outages of these devices are often a problem, as they are tied to very important functions. When a device that is used for analysis has an outage, problems will occur. It is either necessary to keep spare devices in case of a failure or to simply wait for the device to come back. In recent years, I had several projects in the predictive maintenance area, where Big Data analytics were integrated to improve the stability of devices and to predict when a failure might occur. I saw several companies that could reduce the time a device “stands still” from several days to only hours by applying such algorithms.
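
To make the predictive maintenance idea a bit more concrete, here is a minimal sketch in Python. It is not the method used in the projects mentioned above; it simply assumes that a device reports one sensor value per interval and flags readings that deviate strongly from the recent past. The function name, the window size, the threshold and the sample values are all made up for illustration.

```python
from statistics import mean, stdev


def flag_anomalies(readings, window=5, threshold=3.0):
    """Return the indices of readings that deviate strongly from the preceding window."""
    flagged = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all is suspicious.
            if readings[i] != mu:
                flagged.append(i)
            continue
        # Distance from the recent mean, measured in standard deviations.
        if abs(readings[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical vibration values from one device, one reading per interval.
vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 0.51, 0.95, 0.50]
print(flag_anomalies(vibration))  # the spike at index 7 is flagged
```

A real system would work with many sensors and a trained model, but the principle is the same: spot the unusual pattern early enough that the device can be serviced before it fails.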

Apache Oozie is the workflow scheduler for Hadoop jobs. Oozie takes care of the step-wise execution of workflows in Hadoop. Like all other Hadoop projects, Oozie is built for high scalability, fault tolerance and extensibility.

An Oozie workflow is started either by data availability or at a specific time. Oozie acts as the entry point for MapReduce jobs that are scheduled through it. This also means that other projects such as Pig and Hive (which we will discuss later on) can take advantage of Oozie.

Oozie workflows are described in an XML dialect called hPDL. Oozie knows two different types of nodes:

  • Control-flow nodes do exactly what the name says: they control the flow.
  • Action nodes take care of the actual execution of a job.

The following illustration shows the iteration process in an Oozie workflow. The first step for Oozie is to start a task (a MapReduce job) on a remote system. Once the task has completed, the remote system sends the result back to Oozie via a callback function.

Apache Oozie
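
The interplay of control-flow nodes and action nodes can be sketched in a few lines of Python. This is a toy in-memory model, not Oozie’s API and not hPDL: the class names, the node names and the `run` callable (which stands in for the remote job and its callback) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class ControlNode:
    kind: str      # "start", "end" or "kill"
    to: str = ""   # next node; unused for "end" and "kill"


@dataclass
class ActionNode:
    run: Callable[[], bool]  # stands in for the remote job; True means success
    ok: str                  # node to continue with after a successful callback
    error: str               # node to continue with after a failed callback


Node = Union[ControlNode, ActionNode]


def run_workflow(nodes: Dict[str, Node], start: str = "start") -> List[str]:
    """Walk the workflow from the start node and return the visited node names."""
    visited: List[str] = []
    current = start
    while True:
        node = nodes[current]
        visited.append(current)
        if isinstance(node, ControlNode):
            if node.kind in ("end", "kill"):
                return visited
            current = node.to
        else:
            # The return value plays the role of the callback from the remote system.
            succeeded = node.run()
            current = node.ok if succeeded else node.error


workflow = {
    "start": ControlNode("start", to="import-data"),
    "import-data": ActionNode(run=lambda: True, ok="aggregate", error="fail"),
    "aggregate": ActionNode(run=lambda: True, ok="end", error="fail"),
    "fail": ControlNode("kill"),
    "end": ControlNode("end"),
}

print(run_workflow(workflow))
```

The control nodes only decide where to go next, while the action nodes do the work and report back. That report is what determines whether the workflow continues or is killed.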

The 70-page e-book on Big Data is free until Sunday on Amazon! It is available exclusively for Amazon Kindle.

Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.
Today’s focus: Big Data for Governments.
Big Data is of great benefit for governments on several layers. First, there is open data. Governments, cities and other public institutions store large amounts of data that can be made available to the public. Many cities around the globe make their data available via open data catalogs. At present, this is still far from being Big Data, but we will soon be there.
The public sector has several other challenges that can be addressed by Big Data technologies. As in the banking industry, fraud is a key issue here. Taxpayers can be analyzed with Big Data technologies, and those who avoid paying tax can be found by the financial departments. This increases tax revenue and reduces the possibilities to hide from taxation. In addition, fraud and abuse of social services such as health insurance and retirement plans can be reduced (if these are covered by the state).
Another key topic for governments is the “Smart City” approach. A smart city operates on large amounts of data that need to be processed somewhere, somehow. Smart cities have several interesting benefits for their inhabitants: within a smart city, traffic is constantly improved. For instance, a car (which might drive on its own) will ask the city, not the navigation system, for the best route to a destination. The benefit is that the city can collect all requests and then check which routes might become overcrowded. The smart city will then route the cars in a way that reduces traffic jams to a minimum by re-arranging and re-scheduling routes. Current navigation systems can react to traffic jams and change the routes, which will eventually create a traffic jam on another route. A smart city knows where the cars want to go and can arrange the routes before traffic jams occur.
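
The difference between reacting to traffic jams and planning ahead can be illustrated with a small Python sketch. It assumes a very simplified city in which all cars want to make the same trip and every candidate route has a fixed capacity; the route names, capacities and the greedy strategy are hypothetical and only meant to show the idea of central route assignment.

```python
from typing import Dict, List, Tuple


def assign_routes(
    cars: List[str],
    capacity: Dict[str, int],
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Assign each car to the route that is currently least loaded relative to its capacity."""
    load = {route: 0 for route in capacity}
    assignment: Dict[str, str] = {}
    for car in cars:
        # Relative load = cars already sent there / capacity of the route.
        best = min(capacity, key=lambda route: load[route] / capacity[route])
        assignment[car] = best
        load[best] += 1
    return assignment, load


cars = ["car-1", "car-2", "car-3", "car-4", "car-5"]
capacity = {"highway": 3, "ring road": 2}  # hypothetical routes for the same trip

assignment, load = assign_routes(cars, capacity)
print(assignment)
print(load)
```

Because the city sees all requests, it can spread the cars across the routes before any of them is overcrowded, instead of sending everyone down the same “fastest” road.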

One of the key infrastructure services for Hadoop is Apache ZooKeeper. ZooKeeper is in charge of coordinating nodes in the Hadoop cluster. Key challenges for ZooKeeper in that domain are to provide high availability for Hadoop and to take care of the distributed coordination.

As part of these challenges, ZooKeeper takes care of managing the cluster configuration for Hadoop. A key challenge in a Hadoop cluster is naming, which has to be applied to all nodes within the cluster. Apache ZooKeeper takes care of that by providing unique names to individual nodes based on naming conventions.

The hierarchy in Zookeeper

As shown in Figure 7, naming is hierarchical. This means that naming also occurs via a path. The root instance starts with a “/”, all successors have their own unique name, and their successors follow the same naming schema. This enables the cluster to have nodes with child nodes, which in turn has positive effects on maintainability.
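
The hierarchical naming can be modeled as a simple tree in Python. This is only a sketch of the path idea, not the ZooKeeper client API; the class `NamedNode` and the node names are hypothetical.

```python
class NamedNode:
    """A node in a hierarchical namespace, addressed by a slash-separated path."""

    def __init__(self, name="", parent=None):
        self.name = name
        self.parent = parent
        self.children = {}

    def create(self, name):
        """Create a child with a name that is unique among its siblings."""
        if not name or "/" in name:
            raise ValueError("a node name must be non-empty and must not contain '/'")
        if name in self.children:
            raise KeyError(f"node {name!r} already exists under {self.path}")
        child = NamedNode(name, parent=self)
        self.children[name] = child
        return child

    @property
    def path(self):
        # The root is "/"; every other node appends its name to the parent's path.
        if self.parent is None:
            return "/"
        prefix = self.parent.path
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name


root = NamedNode()
app = root.create("app1")
worker = app.create("worker1")
print(worker.path)  # /app1/worker1
```

Since a name only has to be unique among its siblings, two different parents can each have a child called `worker1` without any conflict.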

ZooKeeper takes care of synchronization within the distributed environment and provides some group services to the Hadoop cluster. As for synchronization, one server in the ZooKeeper service acts as the “Leader” of all servers running under the service. The following illustration shows this.

Synchronisation in the ZooKeeper Service

To ensure high uptime and availability, individual servers in the ZooKeeper service are mirrored. Each of the servers in the service knows every other server. If one server fails and is no longer available, clients connect to other servers. The ZooKeeper service itself is built for failover and is also highly scalable.
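
From the client’s point of view, the failover described above boils down to “try the next server”. The following Python sketch illustrates that idea under simple assumptions: the server addresses are placeholders, and the `is_alive` check is a stand-in for a real connection attempt. It does not use a real ZooKeeper client library.

```python
from typing import Callable, Iterable


def connect(servers: Iterable[str], is_alive: Callable[[str], bool]) -> str:
    """Return the first reachable server from the list, or fail if none responds."""
    for server in servers:
        if is_alive(server):
            return server
    raise ConnectionError("no server in the service is reachable")


# Hypothetical service with three mirrored servers; the first one is down.
servers = ["zk1:2181", "zk2:2181", "zk3:2181"]
down = {"zk1:2181"}

print(connect(servers, lambda server: server not in down))  # zk2:2181
```

As long as at least one server in the list responds, the client keeps working and does not need to know which server failed.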

Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.
Today’s focus: Big Data in Finance.
The finance sector benefits heavily from Big Data analytics. First, there is banking. Banks have large amounts of transaction data that have to be processed every day. This data needs to be checked for fraud. Real-time analytics tools such as Apache Storm play a vital role in that process. Improving security and detecting fraud before it can happen greatly decreases financial loss for the banks. But not only banks adopt Big Data for that: credit card companies such as Visa or MasterCard also apply these techniques in order to prevent fraud. An example is when you travel: imagine you travel from New York to London. You didn’t pay for anything during the trip with your credit card, nor did you pay for the trip itself with it. Once in London, you get a coffee at the airport, but your credit card is rejected (or at least, authorization is required). Had you paid for the trip with your credit card, the credit card company would know that you are in London and could accept the payment.
Insurance companies face a similar problem with financial fraud. By analyzing data, the validity of a claim can be checked. Insurance companies can also lower their risk by analyzing data (and thus increase our bill).
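
The travel example can be written down as a very simple rule in Python. Real fraud detection relies on far richer models and many more signals; this sketch only assumes that the card company knows the cardholder’s home city and the destinations of trips paid for with the card. All names and values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Payment:
    city: str
    amount: float


def needs_authorization(payment: Payment, home_city: str, known_destinations: Set[str]) -> bool:
    """Flag a payment made in a city the card company has no reason to expect."""
    expected_cities = {home_city} | known_destinations
    return payment.city not in expected_cities


coffee = Payment(city="London", amount=3.50)

# The trip was not paid with the card, so London comes as a surprise.
print(needs_authorization(coffee, "New York", set()))       # True
# The flight was booked with the card, so London is expected.
print(needs_authorization(coffee, "New York", {"London"}))  # False
```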

Apache Ambari was developed by the Hadoop distributor Hortonworks and also comes with their distribution. The aim of Ambari is to make the management of Hadoop clusters easier. Ambari is useful if you run large server farms based on Hadoop. It automates much of the manual work you would otherwise need to do when managing your cluster from the console.

Ambari covers three key aspects of cluster management. First, there is provisioning of instances. This is helpful when you want to add new instances to your Hadoop cluster; Ambari takes care of automating all aspects of adding them. Next, there is monitoring. Ambari monitors your server farm and gives you an overview of what is going on. The last aspect is the management of the server farm itself.

Provisioning has always been a very tricky part of Hadoop. Adding new nodes to a cluster was not an easy thing to do and included a lot of manual work. Most organizations worked around this problem by creating scripts and using automation software, but this simply couldn’t cover the scope that is often necessary in Hadoop clusters. Ambari provides an easy-to-use assistant that enables users to install new services or to activate and deactivate them. Ambari takes care of the entire cluster provisioning and configuration with an easy UI.

Ambari also includes comprehensive monitoring capabilities for the cluster. This allows users to view the status of the cluster in a dashboard and to see immediately what the cluster is up to (or not). Ambari uses Ganglia to collect the metrics. Ambari also integrates the possibility to send system messages via Nagios. This includes alerts and other things that are necessary for the administrator of the cluster.

Other key aspects of Ambari are:

  • Extensibility. Ambari is built on a plug-in architecture, which basically allows you to extend Ambari with your own functionality used within your company or organization. This is useful if you want to integrate Hadoop into your business processes.
  • Fault Tolerance. Ambari takes care of errors and reacts to them. For example, if an instance has an error, Ambari restarts this instance. This takes away much of the headache you got in previous, pre-Ambari, versions of Hadoop.
  • Security. Ambari uses role-based authentication. This gives you more control over sensitive information in your cluster(s) and enables you to apply different roles.
  • Feedback. Ambari provides Feedback to the user(s) about long-running processes. This is especially useful for stream processing and near-real-time processes that basically have no end of their lifespan.

Apache Ambari can be accessed in two different ways. First, Ambari provides a mature UI that lets you access the cluster management via a browser. Second, Ambari can also be accessed via RESTful web services, which gives you additional possibilities in working with the service.
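
As a rough idea of what access via the web services could look like, the following Python sketch builds (but does not send) an HTTP request for listing clusters. The host name, the credentials and the port are placeholders, and the `/api/v1/clusters` path and basic authentication are assumptions about a typical Ambari setup; please check the REST API documentation of your Ambari version before relying on them.

```python
import base64
import urllib.request


def build_cluster_list_request(host, user, password, port=8080):
    """Build a GET request for the (assumed) cluster list endpoint of an Ambari server."""
    url = f"http://{host}:{port}/api/v1/clusters"
    # HTTP basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder host and credentials, for illustration only.
request = build_cluster_list_request("ambari.example.com", "admin", "secret")
print(request.get_method(), request.full_url)

# Sending the request requires a reachable Ambari server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The same pattern makes it possible to script tasks that would otherwise be done by hand in the browser UI.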

The following illustration outlines the communication between the Ambari server and the agents.

Apache Ambari

As for the architecture, Ambari leverages several projects. As a key element, Ambari uses message queues for communication. Configuration within Apache Ambari is done by Puppet. The next figure shows the overall architecture of Ambari.

Apache Ambari Architecture

Big Data is a disruptive technology. It is changing major industries from the inside. In the next posts, we will learn how Big Data changes different industries.
Today’s focus: Big Data in Manufacturing.
Manufacturing is a traditional industry relevant to almost any country in the world. It started to emerge in the industrial revolution, when machines took over and production became more and more automated. Big Data has the possibility to substantially change the manufacturing industry again – with various opportunities.
Manufacturers can utilize Big Data for various reasons. First, it is all about quality. When we look at production chains, be it for producing a car or just some metal works, quality is key. Who wants to buy a car that is broken? Exactly, nobody. Improving quality is a key aspect of Big Data for manufacturers, and it has several parts. First of all, it is necessary to collect data about the production line(s) and all devices that are connected or connectable. When errors occur or a product isn’t as desired, the production data can be analyzed and reviewed. Data scientists do a great job on that. Real-time analytics allow the company to improve material quality and product quality further. This can be done by analyzing images of products or materials and removing them from the production line in case they don’t fulfill certain standards.
A key challenge in manufacturing today is the high degree of product customization. When buying a new car, the words of Henry Ford (you can have the Model T in any color as long as it is black) are not true any more. Whatever type of product customers order, they expect that their own personality is reflected by the product. If a company fails to deliver that, it risks losing customers. But what is the connection with Big Data? Well, this customization is a strong shift towards Industry 4.0, which is heavily promoted by German industry. In order to make products customizable, it is necessary to have an automated production line and to know what customers might want, by analyzing recent sales and trends from social networks and the like.
Changing the output of a production line is often difficult and inefficient. Big Data analytics allow manufacturers to better understand future demand and to reduce production peaks. This enables the manufacturer to plan and act better in the market, and to become more efficient.

Last week I wrote a blog post introducing the Hadoop project and gave an overview of the Map/Reduce algorithm. This week, I will outline the Hadoop stack and the major technologies in it. Please note: there are many projects in the Hadoop stack, and this overview is not complete. The following figure outlines the major Hadoop projects.

The Hadoop technology stack

I have clustered the Hadoop stack into several areas. The lowest area is cluster management. This level is all about managing and running Hadoop. Projects on this layer include Ambari for provisioning, monitoring and management, ZooKeeper for coordination and reliability, and Oozie for workflow scheduling. This layer is focused on infrastructure, and if you work on this layer, you normally don’t analyse data (yet).

Moving one level up, we find ourselves in the “Infrastructure” layer. This layer is not about physical or virtual machines or disk storage. I called it “Infrastructure” since it contains projects that are used by other Hadoop components. This includes Hadoop Common, a shared library, and HDFS (the Hadoop Distributed File System). HDFS is used by all other projects. It is a virtual file system that can span many different servers and abstracts individual (machine-based) file systems into one common file system.

The next layer could also be called the 42 layer. Apache YARN is the core of almost everything you do in Hadoop. YARN takes care of the Map/Reduce jobs and many other things including resource management, job management, job tracking and job scheduling.

The next layer is all about data. As we can see here, this layer contains a lot of projects for the three core things when it comes to data: data storage, data access and data science. As for data storage, a key project is HBase, a distributed key/value database. It is built for large amounts of data. We will dig deeper into HBase in a couple of weeks from now. Data access includes several important projects such as Hive (a SQL-like query language), Pig (a data flow language), streaming and in-memory processing for real-time applications such as Spark and Storm, and graph processing with Giraph. Mahout is the only project in the data science layer. Mahout is useful for machine learning, clustering and recommendation mining.

On the next layer, we have several tools for data governance and integration. Projects for importing data into Hadoop are found on this layer.

The last layer consists of Apache Hue. This is the Hadoop UI that makes our lives easier 😉

Next week, I will give more insights on the individual layers discussed here. Stay tuned 😉

Despite all those collaboration and cloud services, a lot of us have found that working together has not become much easier since their introduction. As every organization today uses its own infrastructure, either self-hosted or as an online service, the borders have only moved but have not become transparent when needed. The walls between collaborating organisations are as strong as ever.
SPHARES is here to change this.

We are allowing sharing like Dropbox, but between different systems. Even hosted on your own systems. - Dietmar Gombotz, CEO of SPHARES

SPHARES is a small start-up team of five from Vienna with the mission to make working life and collaboration much easier by providing a tool that allows you to integrate different work environments without having to actually change tools.
It works as a service integrator between different systems in the background. The sync engine lets you transparently share data with colleagues who use different systems (or even the same system) as you do.
As integration types, it currently supports one-way and two-way synchronization between heterogeneous systems.

Our goal is to make sharing between organisations as easy as sitting beside each other in the same office, even at the same desk. - Hannes Schmied, BizDev SPHARES

Overview SPHARES


SPHARES either runs on your own server or is hosted online for you on a dedicated virtual machine. It allows you to integrate your partners directly via your own server, where you control the environment. Even if you have a virtual machine from us, we will not have access to the users’ data, and neither will you. We secured the communication with double encryption.
Current use cases SPHARES focuses on:

  • Marketing Agencies for collaborator integration
  • Tax Advisors in the digital agency
  • Unique System Integration for integrating bigger solutions
  • Technology Providing for Platforms

SPHARES provides the system either under a service agreement for a monthly fee, which includes all costs for license, updates and support handling via a web interface, or as a technology license for a one-time fee plus maintenance.
If you are interested, please simply drop the team a line and they will get back to you as soon as possible.