Apache Pig is an abstraction layer that puts the data in the centre: it is a "data-flow" language. In contrast to SQL (and Hive), Pig works iteratively and lets data flow from one statement to the next, which gives more powerful options for transforming data. The language used by Apache Pig is called "PigLatin". A key benefit of Apache Pig is that it abstracts complex MapReduce tasks such as joins into very simple functions, which makes it far easier for developers to write complex queries on Hadoop. Pig itself consists of two major components: PigLatin and a runtime environment.

When running Apache Pig, there are two possibilities: the first one is the standalone (local) mode, which is intended for rather small datasets, for instance within a virtual machine. For processing Big Data, it is necessary to run Pig in MapReduce mode on top of HDFS. Pig applications are usually script files (with the extension .pig) that consist of a series of operations and transformations that create output data from input data. Pig translates these operations and transformations into MapReduce functions. The set of operations and transformations available in the language can easily be extended via custom code. Compared to the performance of "pure" MapReduce, Pig is a bit slower, but still very close to native MapReduce performance. Especially for those not experienced in MapReduce, Pig is a great tool (and far easier to learn than MapReduce).

A Pig application can easily be executed as a script in the Hadoop environment; especially when using the previously demonstrated Hadoop VMs, it is easy to get started. Another possibility is to work with Grunt, Pig's interactive shell, which allows us to execute Pig commands in the console. The third possibility is to embed Pig in a Java application.

The question is what differentiates Pig from SQL/Hive. First, Pig is a data-flow language: it is oriented towards the data and how it is transformed from one statement to the next, working on the data step by step. Another difference is that SQL requires a schema, whereas Pig doesn't. The only requirement is that the data can be processed in parallel.

The listing below shows a sample program. We will look at the possibilities within the next blog posts.

A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
X = FOREACH A GENERATE name,$2;
DUMP X;
(John,4.0F)
(Mary,3.8F)
(Bill,3.9F)
(Joe,3.8F)
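To make the data flow concrete, here is a minimal Python sketch of what the Pig script above does, assuming the 'student' input is tab-separated (the default delimiter of PigStorage); the ages are made up for illustration, since the input file itself is not shown:

```python
# Hypothetical tab-separated input, mirroring PigStorage()'s default format.
students_tsv = "John\t18\t4.0\nMary\t19\t3.8\nBill\t20\t3.9\nJoe\t18\t3.8"

# A = LOAD 'student' ... AS (name:chararray, age:int, gpa:float);
A = [(name, int(age), float(gpa))
     for name, age, gpa in (line.split("\t") for line in students_tsv.splitlines())]

# X = FOREACH A GENERATE name, $2;  -- project the name and the third column (gpa)
X = [(name, gpa) for name, _age, gpa in A]

# DUMP X;
for row in X:
    print(row)  # ('John', 4.0) ... ('Joe', 3.8)
```

Each statement produces a new relation from the previous one, which is exactly the step-by-step data flow Pig is built around.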

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, focusing on the business functions.
This post’s focus: IT.
Big Data is a hot IT topic, not just because it comes from the IT, but also because it gives great benefits to overall IT operations. In a recent project, I've been working with a large European corporation in the manufacturing/production sector. Their IT department had some 400 employees, serving more than 50,000 corporate employees and operating a large number of servers that run specific services. A key challenge for them was the reliability of these services. To find out how a service is utilised, large amounts of log data were analysed in order to prioritise the different services. This gave them detailed insight into where to move their services, since different services had different utilisation patterns, and the company could improve the utilisation of its servers. New services are integrated into that approach as well, which means they can deliver new services without needing to invest in new hardware.
Another great approach – and another hot topic – is Big Data for IT security. With Big Data analytics, companies can find security issues before they become serious threats. Patterns in web-site access can provide insight into DoS attacks and similar issues. These analytics often run in real time and provide fast ways to react in case problems occur.
As described in today’s article, Big Data is not just a topic coming from the IT, it is a topic MADE for the IT.

One of the easiest-to-use tools in Hadoop is Hive. Hive is very similar to SQL and easy to learn for those with a strong SQL background. Apache Hive is a data-warehousing tool for Hadoop, focused on large datasets and on how to impose a structure on them.

Hive queries are written in HiveQL. HiveQL is very similar to SQL, but not identical. As already mentioned, HiveQL translates to MapReduce and therefore comes with minor performance trade-offs. HiveQL can be extended with custom code and MapReduce queries, which is useful when additional performance is required.

The following listings show some Hive queries. The first listing shows how to query two columns from a dataset.

hive> SELECT column1, column2 FROM dataset;
2 5
4 9
5 7
5 9

Listing 2: simple Hive query

The next sample shows how to include a where-clause.

hive> SELECT DISTINCT column1 FROM dataset WHERE column2 = 91;

Listing 3: where in Hive
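As a rough illustration of what these relational operations do (not of how Hive executes them), the projection, filter and DISTINCT from the listings above can be mimicked in plain Python over the sample rows of Listing 2; the filter literal 9 is just an assumed value for illustration:

```python
# Sample rows from Listing 2: (column1, column2)
dataset = [(2, 5), (4, 9), (5, 7), (5, 9)]

# SELECT column1, column2 FROM dataset
projection = [(c1, c2) for c1, c2 in dataset]

# SELECT DISTINCT column1 FROM dataset WHERE column2 = 9
# (a set removes duplicates, like DISTINCT; the WHERE clause becomes the if-filter)
distinct_filtered = sorted({c1 for c1, c2 in dataset if c2 == 9})
print(distinct_filtered)  # [4, 5]
```

In Hive, each of these operations is compiled into one or more MapReduce jobs behind the scenes.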

HCatalog is an abstract table manager for Hadoop. The goal of HCatalog is to make it easier for users to work with data: users see the data as if it were stored in a relational database. HCatalog can also be accessed through a REST API.

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, focusing on the business functions.
This post’s focus: Customer Services.
Big Data is great for customer services, where it offers several benefits. A key benefit can be seen in the IT help desk: help-desk applications can be greatly improved by Big Data. Analysing past incidents and calls, their occurrence and their impact, provides great benefits for future calls. On the one hand, a knowledge base can be built to give employees or customers an initial start. For challenging cases, training can be developed to reduce the number of tickets opened. This reduces costs on one side and improves customer acceptance on the other.
Big Data can have a large impact here. When customers feel treated well, they are very likely to come back and buy more from the company. Big Data can serve as an enabler here.

MapReduce is the elementary data-access method in Hadoop. MapReduce provides the fastest way in terms of performance, but not necessarily in terms of time to market: writing MapReduce queries can be trickier than Hive or Pig. Projects such as Hive and Pig translate the code you write into native MapReduce jobs and therefore often come with a trade-off.

A typical MapReduce function follows the following process:

  • The input data is distributed across different Map processes, each of which applies the user-provided Map function.
  • The Map processes are executed in parallel.
  • Each Map process emits intermediate results. These results are stored and grouped by key, which is often called the shuffle phase.
  • Once all intermediate results are available, the Map phase has finished and the Reduce phase starts.
  • The Reduce function works on the intermediate results. The Reduce function is also provided by the user (just like the Map function).

A classical way to demonstrate MapReduce is via the Word-count example. The following listing will show this.

map(String name, String content):
    for each word w in content:
        EmitIntermediate(w, 1);

reduce(String word, Iterator intermediateList):
    int result = 0;
    for each v in intermediateList:
        result++;
    Emit(word, result);
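The pseudocode above can be turned into a small, runnable Python simulation of the map, shuffle and reduce phases; the sample documents are made up for illustration:

```python
from collections import defaultdict

def map_phase(name, content):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(w, 1) for w in content.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, intermediate_list):
    # Sum the occurrences emitted for this word.
    return (word, sum(intermediate_list))

# Hypothetical input documents.
docs = {"doc1": "big data big hadoop", "doc2": "hadoop big"}

intermediate = []
for name, content in docs.items():
    intermediate.extend(map_phase(name, content))

result = dict(reduce_phase(w, vs) for w, vs in shuffle(intermediate).items())
print(result)  # {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real cluster, the map calls run in parallel on different nodes and the shuffle moves data over the network; this sketch only illustrates the logical flow of the three phases.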

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, focusing on the business functions.
This post’s focus: Sales.
Last week I outlined Marketing possibilities (and downsides) with Big Data. Sales is very similar to Marketing; often, the two come together. However, I would say it needs to be treated separately. In this post, I won't discuss the Big Data sales opportunities of web shops and the like. Today, I want to focus on Big Data opportunities that respect privacy but still have an impact.
Last year, I attended a conference where a company outlined their Big Data case. It was about analysing bills issued in their chain stores. The data from the bills included no personal details such as credit card numbers, bonus card numbers and the like; it was only about what was in the basket. With that, they could figure out which products get more attention at a specific store and how that differs from other stores. This data was joined with open data from public sources and other demographic data. They could also find that specific products get bought together, which means that if customer X buys product C, the customer is very likely to buy product D as well. For instance, if you buy a skirt, you are also likely to buy a top.
The latter example focused on analysing data for fashion stores, but most stores can benefit from Big Data. I recently had the chance to talk to the CIO of a large supermarket chain. They also use Big Data algorithms to improve their chain stores. The company's policy is to respect their customers' privacy, and they don't work on personal data. They figured out when a neighbourhood changes, e.g. because a university was built: they could see that other products were in demand and changed the assortment of goods accordingly.
There are many opportunities where Big Data can improve Sales, and as shown in these two examples, they don’t necessarily need to violate someone’s privacy.

Hadoop is very flexible and it is possible to integrate almost any kind of database into the system. Many database vendors extended their products to work with Hadoop. One database that is often used with Hadoop is Apache Cassandra. Cassandra isn’t part of the Hadoop project itself but is often seen in connection with Hadoop projects.
Cassandra comes with several benefits. First, it is a NoSQL database as well, working with a key/value store. Initially developed by Facebook, it is now maintained by DataStax. Cassandra comes with great performance and linear scalability.

In the last weeks, I outlined several Big Data benefits by industry. In the next posts, I want to outline use cases where Big Data is relevant in any company, focusing on the business functions.
This post’s focus: Marketing.
Marketing is one of the Big Data use cases that are discussed controversially. On the one hand, it gives companies opportunities to adjust offers to their customers and make the offers more "individual". I will describe the opportunities here before I discuss the downsides.
With customer loyalty programs, companies can better "target" their customers. When the company understands the behaviour of a customer, special offers and promotions can be sent to that customer. We all know this from large online shops, where you get regular offers by e-mail. But this also applies to the retail stores around you: with loyalty programs, retailers also collect data about their customers and can improve their portfolio. Furthermore, they can make their advertising more individual and increase revenue. Marketing gains valuable insights in all industries: retail is the most common, but industries outside retail can benefit as well. Companies working in B2B can create value from Big Data by adjusting their sales processes based on data, and react to new trends before competitors notice them.
On the other hand, this is somewhat frightening. I am basically in favour of Big Data; however, there must be some kind of assurance that personal privacy is respected. At present, it is hard to opt out of such programs.