On the IaaS layer, work on cloud interoperability was already conducted [Ste13a], [Ste13b]. The authors described in “Challenges in the Management of Federated Heterogeneous Scientific Clouds” the problem and a feasible solution to migrate virtual machines between providers. Another challenge identified is the layer between the vendor API and the user. This problem is addressed by the same authors in the paper “Building an On-Demand Virtual Computing Market in Non-Commercial Communities”, where a market is introduced to handle that problem. The concept of the market is then described in “Take a Penny, Leave a Penny Scaling out to Off-premise Unused Cloud Resources” [Ste13b] in detail, where a solution is presented that allows users to use different cloud vendors with one abstract API.
Gonidis et al. [Gon11] give a first hint at challenges addressed for Platform as a Service interoperability. Platform as a Service gives the promise of speeding up application development [Mei11] by utilizing services. Platforms such as Microsoft’s Azure, Amazon’s Elastic Beanstalk or Google’s AppEngine offer a large number of services. The possibilities range from object storage, databases, messaging to analytics. A comprehensive overview is given in section 4. However, this leads to new challenges in terms of vertical interoperability. The challenge with PaaS is not like with IaaS, where one has to move the virtual machine. Now, it is about moving individual services and the corresponding data. The leading PaaS providers such as Amazon, Google and Microsoft provide their own frameworks and tools to access their platforms. When a customer decides to move the application to another provider, this becomes a challenge.
Loutas et al. [Lou11b] describe similar challenges. First, they state that cloud providers promote their own, incompatible formats over standards. This is largely due to the ongoing battle for dominance in the cloud. Since each provider has its own format and there is no common standard yet, new and smaller providers cannot enter the market easily. Interoperability is described as the missing element so far, even though it would benefit both customers and providers [Lou11b].
Stravoskoufos et al. [Str] define different cloud interoperability challenges on each of the cloud layers. On the IaaS layer, interoperability is about controlling the infrastructure-specific services. For Platform as a Service, interoperability is about using the APIs and services. For Software as a Service, the challenge is to exchange messages and data.
This post is part of a work done on Cloud interoperability. You can access the full work here and the list of references here.

Interoperability matters at various levels of software and systems. With Cloud Computing, several major interoperability questions arise. The first is infrastructure interoperability, which is essentially the question of how to move virtual instances from one cloud provider to another.
Another interoperability question is about software and tools interoperability. This is not about how to transfer software from one provider to another; it is about how different programming languages and tools collaborate within one cloud ecosystem or across different cloud ecosystems. An example is message-based communication between a Java-based solution and a Microsoft .NET solution. Mayrbäurl et al. described such a scenario, in which a .NET-based cloud service on Microsoft Azure communicates with an on-premise solution written in Java [May11].
The third interoperability question is about the services offered by different providers. When someone builds a cloud solution based on a specific provider and uses a software stack such as J2EE or Microsoft .NET, it is not that difficult to move the application to another instance or Cloud provider. However, when this application consumes platform-specific services such as a distributed storage service, a messaging service, e-mail and the like, migration becomes challenging. Therefore, providing interoperability on the platform services level is another key issue in Cloud computing interoperability.
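One common way to soften the dependency on platform-specific services is to hide them behind a small interface that the application owns. The following is only a minimal sketch of that idea in Python, not a real SDK: `BlobStore`, `InMemoryBlobStore` and `migrate` are hypothetical names chosen for illustration, and a real adapter would wrap the respective vendor's client library instead of a dictionary.

```python
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Hypothetical provider-neutral interface for an object storage service."""

    @abstractmethod
    def put(self, key, data):
        """Store the bytes in `data` under `key`."""

    @abstractmethod
    def get(self, key):
        """Return the bytes stored under `key`."""


class InMemoryBlobStore(BlobStore):
    """Stand-in adapter (assumption); a real one would call a vendor SDK."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def migrate(source, target, keys):
    """Copy the given keys from one store to another; return how many were copied."""
    copied = 0
    for key in keys:
        target.put(key, source.get(key))
        copied += 1
    return copied
```

Because the application only talks to `BlobStore`, moving to another provider means writing one new adapter and running something like `migrate`, rather than touching every call site. The data still has to be moved, which is exactly the cost the paragraph above describes.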
Gonidis et al. distinguish interoperability from portability [Gon11]. According to this paper, interoperability is when two components in different cloud environments collaborate, whereas portability is the ability to move from one cloud platform to another. Liu et al. define portability in terms of system portability, which means that a virtual machine can easily be moved to another cloud provider’s platform [Nis11]. In general, they define portability as the ability to move an application and all its data from one cloud provider to another with little effort.
As for interoperability and portability, the current literature takes different approaches. In Gonidis et al., portability and interoperability are distinct features [Gon11], whereas Dowel et al. define portability as a subset of interoperability [Dow11]. Petcu et al. discuss interoperability as consisting of two dimensions: a horizontal dimension and a vertical dimension [Pet11]. Horizontal interoperability (as illustrated in figure 2) means that two cloud services on the same service level (e.g. PaaS) can communicate with each other, whereas vertical interoperability describes the ability to host an application on different cloud providers. Implementing vertical interoperability therefore requires supporting different cloud platforms.
Figure 2: cloud interoperability dimensions
Another description of interoperability for cloud solutions is given by Parameswaran et al. [Par12]. The authors describe 4 major factors of interoperability challenges. The first is portability, which was already discussed. The second is interoperability itself, which focuses on efficient and reliable protocols. The third is heterogeneity. This states that there are several protocols such as SOAP and Representational State Transfer (REST) [Fie00] and different formats such as XML and JSON. A PaaS application will need to deal with all of these protocols and formats in order to ensure interoperability in a heterogeneous environment. Last, there is geo-diversity. This means that smaller, but geographically more distributed datacentres might be more effective than large datacentres that are only available in some regions.
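To make the heterogeneity point more concrete, the sketch below serializes the same record to JSON and to XML using only the Python standard library. The function names (`to_json`, `to_xml`, `from_xml`) and the flat record layout are assumptions made for this illustration; real services define their own message formats.

```python
import json
import xml.etree.ElementTree as ET


def to_json(record):
    """Serialize a flat dict to a JSON string with stable key order."""
    return json.dumps(record, sort_keys=True)


def to_xml(record, root_tag="record"):
    """Serialize a flat dict to XML, one child element per key."""
    root = ET.Element(root_tag)
    for key in sorted(record):
        child = ET.SubElement(root, key)
        child.text = str(record[key])
    return ET.tostring(root, encoding="unicode")


def from_xml(payload):
    """Parse the XML produced by to_xml back into a dict of strings."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}
```

Note that the XML round trip returns every value as a string, while JSON keeps numbers as numbers. Small differences like this are part of what an application has to handle when it talks to services that use different formats.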
Why is interoperability so important to the cloud?
Leavitt [Lea09] addresses potential issues with vendor lock-in in cloud computing. When it is not possible to move from one platform to another at low cost and effort, it is very likely that one gets locked into a certain platform. Should the provider decide to increase prices, the customer has to accept the new conditions without being able to move the application to a cheaper provider. Another reason is that cloud providers can fail. Even large providers such as Microsoft [Oky14] and Amazon [Kep14] fail. This could mean that the platform is not available for several hours or even days. In their SLAs, cloud providers typically don’t reimburse customers for lost revenue [Ama14b], [Mic14c]. But it could become worse: the cloud provider might actually disappear, and customers then have to move entirely to another platform. This happened in February 2009 when Coghead, a cloud platform, went out of service [Kin09]. Users could export their data, but couldn’t port their applications.

Everyone is doing Big Data these days. If you don’t work on Big Data projects within your company, you are simply not up to date and don’t know how things work. Big Data solves all of your problems, really!
Well, in reality this is different. It doesn’t solve all your problems. It actually creates more problems than you think! Most companies I recently saw working on Big Data projects failed. They started a Big Data project and successfully wasted thousands of dollars on it. But what exactly went wrong?
First of all, Big Data is often only seen as Hadoop. We live with the misperception that only Hadoop can solve all Big Data topics. This simply isn’t true. Hadoop can do many things, but real data science is often not done with the core of Hadoop. Ever talked to someone doing the analytics (e.g. someone good at math or statistics)? They are not OK with writing Java Map/Reduce jobs or Pig/Hive scripts. They want to work with other tools that are way more interactive.
The other thing is that most Big Data initiatives are handled wrong. Most initiatives simply don’t include someone who is good at analytics. One usually doesn’t find this type of person in an IT team, so the person has to be found somewhere else. Failing to include someone with these skills often leads to finding “nothing” in the data, because IT staff are good at writing queries but not at doing complex analytics. These skills are actually not taught in IT classes; they require a totally different field of study.
Hadoop is seen as the solution to everything by many IT departments. However, projects often stop with implementing Hadoop. Most Hadoop implementations never leave the pilot phase. This is often due to the fact that IT departments see Hadoop as a fun thing to play with, but getting it into production requires a different approach. There are actually more solutions out there that can be used when delivering a Big Data project.
A key to ruining your Big Data project is not involving the LoB. The IT department often doesn’t know what questions to ask. So how can they know the answer and try to find the question? The LoB sees that differently. They see an answer and know what question it belongs to.
The key to kill your Big Data initiative is exactly one thing: go with the hype. Implement Hadoop and don’t think about what you actually want to achieve with it. Forget the use-case, just go and play with the fancy technology. NOT
As long as companies stick to that, I am sure I will have enough work to do. I “inherited” several failed projects and turned them into successes. So, please continue.

Amazon announced details about their Q2 earnings yesterday. Their cloud business grew by an incredible 81%. This is massive, given that Amazon is already the number one company in that area. This quarter, they earned 1.8 billion USD from cloud computing.
Extrapolating this number to four quarters, their revenue would definitely reach some 7 billion this year. However, if growth keeps accelerating, I guess they could even reach double digits by the end of this year. Will Amazon reach 10 billion in 2015? If so, this would be incredible! Microsoft stated that their growth was somewhere well above the 100% mark, so I am interested in where Microsoft will stand by the end of the year.
But what does this tell us? Both Microsoft and Amazon are growing fast in this business, and we can expect to see many more interesting services in the Cloud in the coming months and years. My opinion is that the market is already consolidated between Microsoft and Amazon. Other companies such as Google and Oracle are rather niche players in the Cloud market.

When working with the main Hadoop services, it is not necessary to work with the console at all times (even though this is the most powerful way of doing so). Most Hadoop distributions also come with a user interface. The user interface is called “Apache Hue” and is a web-based interface running on top of a distribution. Apache Hue integrates major Hadoop projects such as Hive, Pig and HCatalog in the UI. The nice thing about Apache Hue is that it makes the management of your Hadoop installation pretty easy with a great web-based UI.
The following screenshot shows Apache Hue on the Cloudera distribution.
Apache Hue

Hadoop Common is one of the easiest things to explain in the Hadoop context, even though it might get complicated when working with it. Hadoop Common is a collection of libraries and tools that are often necessary when working with Hadoop. These libraries and tools are then used by various projects in the Hadoop ecosystem. Samples include:

  • A CLI MiniCluster that enables a single-node Hadoop installation for testing purposes
  • Native libraries for Hadoop
  • Authentication and superusers
  • A Hadoop secure mode

You might not use all of the tools and libraries in Hadoop Common, as some of them are only needed when you work on Hadoop projects.

Apache Avro is a service in Hadoop that enables data serialization. The main tasks of Avro are:

  • Provide complex data structures
  • Provide a compact and fast binary data format
  • Provide a container to persist data
  • Provide remote procedure calls (RPC)
  • Enable the integration with dynamic languages

Avro schemas are defined in JSON and allow several different types:

Elementary types

  • Null, Boolean, Int, Long, Float, Double, Bytes and String

Complex types

  • Record, Enum, Array, Map, Union and Fixed

The sample below demonstrates an Avro schema

{"namespace": "person.avro",
 "type": "record",
 "name": "Person",
 "fields": [
     {"name": "name",   "type": "string"},
     {"name": "age",    "type": ["int", "null"]},
     {"name": "street", "type": ["string", "null"]}
 ]
}

Table 4: an avro schema
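
To illustrate how such a schema constrains data, here is a small sketch that checks Python dictionaries against the Person schema. It deliberately does not use the Avro library: `SCHEMA`, `PRIMITIVES` and `matches` are hypothetical names for this example, and the checker only understands a handful of primitive types, unions and records. It is an assumption-laden toy, not how Avro itself validates or serializes data.

```python
import json

# The Person schema from above, parsed with the standard json module.
SCHEMA = json.loads("""
{"namespace": "person.avro",
 "type": "record",
 "name": "Person",
 "fields": [
     {"name": "name",   "type": "string"},
     {"name": "age",    "type": ["int", "null"]},
     {"name": "street", "type": ["string", "null"]}
 ]}
""")

# Python-side checks for the few primitive types this sketch supports.
PRIMITIVES = {
    "null": lambda v: v is None,
    "boolean": lambda v: isinstance(v, bool),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def matches(schema, value):
    """Return True if value fits the (simplified) schema."""
    if isinstance(schema, list):
        # A union: the value may match any one of the listed types.
        return any(matches(branch, value) for branch in schema)
    if isinstance(schema, str):
        return PRIMITIVES[schema](value)
    if schema.get("type") == "record":
        return isinstance(value, dict) and all(
            f["name"] in value and matches(f["type"], value[f["name"]])
            for f in schema["fields"]
        )
    raise ValueError("schema construct not supported by this sketch")
```

With this, `matches(SCHEMA, {"name": "Ada", "age": 36, "street": None})` is accepted because `street` may be null, while a record whose `age` is the string `"36"` is rejected.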

Apache Sqoop is in charge of moving large datasets between different storage systems, for example from relational databases into Hadoop. Sqoop supports a large number of connectors, such as JDBC, to work with different data sources. Sqoop makes it easy to import existing data into Hadoop.

Sqoop supports the following databases:

  • HSQLDB, version 1.8 and later
  • MySQL, version 5.0 and later
  • Oracle, version 10.2 and later
  • PostgreSQL
  • Microsoft SQL Server

Sqoop provides several possibilities to import and export data from and to Hadoop. The service also provides several mechanisms to validate data.

As of July 1st, I am taking a new step in my professional career and moving to Teradata. I will work as the Big Data Leader for CEE, developing the business in the region. It is a major career step for me. In the upcoming years, I will work closely with our teams in the region to build great Big Data applications.

Most IT departments produce a large amount of log data. This occurs especially when server systems are monitored, but it is also necessary for device monitoring. Apache Flume comes into play when this log data needs to be analyzed.

Flume is all about data collection and aggregation. Its architecture is flexible and based on streaming data flows. The service allows you to extend the data model. Key elements of Flume are:

  • Event. An event is data that is transported from one place to another place.
  • Flow. A flow consists of several events that are transported between several places.
  • Client. A client is the start of a transport. There are several clients available. A frequently used client for example is the Log4j appender.
  • Agent. An agent is an independent process that provides components to Flume.
  • Source. This is an interface implementation that is capable of transporting events. A sample of that is an Avro source.
  • Channels. If a source receives an event, this event is passed on to several channels. A channel is a storage that can handle the event, e.g. JDBC.
  • Sink. A sink takes an event from the channel and transports it to the next process.
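
The interplay of the elements listed above can be sketched in a few lines of Python. This is a conceptual toy, not Flume's actual API (Flume is configured and extended in Java): `Event`, `MemoryChannel`, `LineSource` and `ListSink` are hypothetical classes that only mirror the roles described in the list.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """A unit of data transported from one place to another."""
    body: str
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Buffers events between a source and a sink (in memory only)."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        """Return the oldest event, or None if the channel is empty."""
        return self._queue.popleft() if self._queue else None


class LineSource:
    """Turns incoming log lines into events and passes them to all channels."""

    def __init__(self, channels):
        self.channels = channels

    def receive(self, line):
        event = Event(body=line.rstrip("\n"))
        for channel in self.channels:
            channel.put(event)


class ListSink:
    """Takes events from one channel and 'delivers' them to a list."""

    def __init__(self, channel):
        self.channel = channel
        self.delivered = []

    def drain(self):
        """Move all buffered events to the destination; return the count."""
        count = 0
        event = self.channel.take()
        while event is not None:
            self.delivered.append(event.body)
            count += 1
            event = self.channel.take()
        return count
```

In this sketch a client would call `LineSource.receive` for each log line, and the sink would periodically call `drain`. The channel decouples the two, which is why a slow sink does not block the source as long as the channel can buffer events.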

The following figure illustrates the typical workflow for Apache Flume with its components.

Apache Flume