Big Data 101: Transformable and Filterable Data

Data needs to fulfill two main characteristics: it has to be transformable and it has to be filterable. In this tutorial, I will describe both.

Transformable Data

If data is transformable, it can be changed to a different format or layout. This can mean a format change, for example from binary to Json or XML, as well as a completely new representation. If someone wants to look at a specific dataset (which, for instance, could be filtered), not all of the data might be interesting.

Let’s assume that a manager wants to filter for all customers younger than 18 in a specific district. The manager is probably not interested in the names of the customers but rather in how many there are. Instead of returning a huge list of names with addresses and the like, a single number is returned.

Or the online marketing department wants to target all customers that match specific criteria such as age. The address might not be relevant, but names and e-mail addresses are. Transformability is also a necessary characteristic if data has to be exported to another database, e.g. for analytics.
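The two scenarios above can be sketched in a few lines of Python. This is only a minimal illustration: the Customer record, its fields and the sample values are assumptions made up for this example, not a real customer schema.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical record layout, invented for this example
    name: str
    email: str
    age: int
    district: str
    address: str


customers = [
    Customer("Anna", "anna@example.com", 16, "North", "1 Main St"),
    Customer("Ben", "ben@example.com", 17, "North", "2 Main St"),
    Customer("Cara", "cara@example.com", 34, "North", "3 Main St"),
    Customer("Dan", "dan@example.com", 15, "South", "4 Main St"),
]


def count_minors(records, district):
    """Manager's view: reduce the matching records to a single number."""
    return sum(1 for c in records if c.age < 18 and c.district == district)


def marketing_view(records, max_age):
    """Marketing's view: keep name and e-mail only, drop the address."""
    return [{"name": c.name, "email": c.email} for c in records if c.age <= max_age]


minors_in_north = count_minors(customers, "North")
contacts = marketing_view(customers, 17)
print(minors_in_north)
print(contacts)
```

Both functions read the same source records but return a different shape: a count for the manager, a reduced list of contact fields for marketing.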

Filterable Data

This is a key characteristic of datasets. Analytics software uses filtering frequently, and it is absolutely necessary since most analytics simply don’t run on all data but rather on selected data. Filtered data is often expressed with “Select … Where” clauses in databases.

Most of what filtering is good for was already discussed under “Transformable Data”; however, we will still go into more detail here. If we analyze data, it is often necessary to work on specific datasets.

Imagine a Google search query where you search for “Big Data”. All data within Google’s index gets filtered for exactly these words and a consolidated list is returned. If the online marketing department mentioned under “Transformable Data” wants a list of customers in a specific area, this list is also filtered, based on the zip code or other geographical data. Hence it is an important characteristic for data to support filtering.
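A “Select … Where” filter on the zip code could look like the following. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table name, the columns and the sample rows are assumptions made for this example.

```python
import sqlite3

# In-memory database with a hypothetical customers table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT, zipcode TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Anna", "anna@example.com", "1150"),
        ("Ben", "ben@example.com", "1010"),
        ("Cara", "cara@example.com", "1150"),
    ],
)


def customers_in_zip(connection, zipcode):
    """Return name and e-mail of all customers in the given zip code."""
    rows = connection.execute(
        "SELECT name, email FROM customers WHERE zipcode = ? ORDER BY name",
        (zipcode,),
    )
    return rows.fetchall()


in_1150 = customers_in_zip(conn, "1150")
print(in_1150)
```

The WHERE clause does the filtering, while the column list after SELECT already applies a small transformation by leaving out the zip code itself.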

I hope you enjoyed the first part of this tutorial about transformable and filterable data. This tutorial is part of the Big Data Tutorial. Make sure to read the entire tutorial series.

BREAKING NEWS: New AWS Region in Germany, Frankfurt

Amazon Web Services today announced its new datacenter region in Frankfurt, Germany. This is AWS region number 11 and the second in Europe. AWS will support a large number of services from that datacenter.
Here is the original press release:

SEATTLE - Oct. 23, 2014 - (NASDAQ:AMZN) - Amazon Web Services, Inc. (AWS, Inc.), an Amazon.com company, today announced the launch of its new AWS EU (Frankfurt) region, which is the 11th technology infrastructure region globally for AWS and the second region in the European Union (EU), joining the AWS EU (Ireland) region. All customers can now leverage AWS to build their businesses and run applications on infrastructure located in Germany. As with every AWS region, customers can do this knowing that their content will stay within the region they choose. The newly launched AWS EU (Frankfurt) region comes as a result of the rapid growth AWS has been experiencing and is available now for any business, organization or software developer to sign up and get started at:

All AWS infrastructure regions around the world are designed, built, and regularly audited to meet rigorous compliance standards including, ISO 27001, SOC 1 (Formerly SAS 70), PCI DSS Level 1, and many more, providing high levels of security for all AWS customers. AWS is fully compliant with all applicable EU Data Protection laws, and for customers that require it, AWS provides data processing agreements to help customers comply with EU data protection requirements. More information on how customers using AWS can meet EU data protection requirements and local certifications such as BSI IT Grundschutz, can be found on the AWS Data Protection webpage at: A full list of compliance certifications can be found on the AWS compliance webpage at:

The new AWS EU (Frankfurt) region consists of two separate Availability Zones at launch. Availability Zones refer to datacenters in separate, distinct locations within a single region that are engineered to be operationally independent of other Availability Zones, with independent power, cooling, and physical security, and are connected via a low latency network. AWS customers focused on high availability can architect their applications to run in multiple Availability Zones to achieve even higher fault-tolerance. For customers looking for inter-region redundancy, the new AWS EU (Frankfurt) region, in conjunction with the AWS EU (Ireland) region, gives them flexibility to architect across multiple AWS regions within the EU.

“Our European business continues to grow dramatically,” said Andy Jassy, Senior Vice President, Amazon Web Services. “By opening a second European region, and situating it in Germany, we’re enabling German customers to move more workloads to AWS, allowing European customers to architect across multiple EU regions, and better balancing our substantial European growth.”

Many German customers are already using AWS including Talanx, in the highly regulated insurance sector. Talanx is one of the top three largest insurers in Germany and one of the largest insurance companies in the world with over €28 billion in premium income in 2013. “For Talanx, like many companies that hold sensitive customer data, data privacy is paramount,” says Achim Heidebrecht, Head of Group IT, Talanx AG. “Using AWS we are already seeing a 75% reduction in calculation time, and €8 million in annual savings, when running our Solvency II simulations while still complying with our very strict data policies. With the launch of the AWS region on German soil, we will now move even more of our sensitive and mission critical workloads to AWS.”

Hubert Burda Media is one of the largest media companies in Europe with over 400 brands and revenues in excess of $3.6 billion. JP Schmetz, Chief Scientist of Hubert Burda Media said of the announcement, “Now that AWS is available in Germany it gives our subsidiaries the option to move certain assets to the cloud. We have long had policies preventing data to be hosted outside of German soil and this new German region gives us the option to use AWS more meaningfully.”

Academics in Germany were also quick to welcome the new region, “The arrival of an Amazon Web Services Region in Germany marks an important occasion for the German business and technology community,” said Prof. Dr Helmut Krcmar, Vice Dean of the Computer Science Faculty, and Chair of Information Systems at the Technical University of Munich. “We work with a number of DAX listed companies in Germany. Many have been holding off moving sensitive workloads to the cloud until they had computing and service facilities on German soil as this could help them comply with their internal processes. This new region from AWS answers this and we expect to see innovation amongst Germany, and Europe’s, companies flourish as a result.”

The Header Image was published by Martin aka Maha under the Creative Commons License.

Big Data 101: Data Representation as part of Variety

Data representation is an often-mentioned characteristic of Big Data. It goes well with “Variety” in the definition stated above. Every piece of data is represented in a specific form, and it doesn’t matter which form that is. Well-known forms of data are XML, Json, CSV or binary. Depending on the representation of the data, different possibilities for expressing relations are available.

XML and Json, for instance, allow us to set child objects or relations for data, whereas this is rather hard with CSV or binary. An example of such a relation is a dataset of the type “Person”. Each person consists of some attributes that identify the person (e.g. the last name, age, sex) and an address that is an independent entity. To retrieve this data as CSV or binary, you either have to run two queries or create a new entity for a query in which the data is merged. XML and Json allow us to nest entities in other entities.

What is Data representation?



The entity described in the figure would look like the following if presented in XML:












Listing 1: XML representation of the entity “person”
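Such an XML document can also be built and read back programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The element names (person, common, address and their children) are assumptions chosen to mirror the Json listing, and the values are taken from that listing.

```python
import xml.etree.ElementTree as ET

# Build the nested entity: person -> common / address
person = ET.Element("person")
common = ET.SubElement(person, "common")
ET.SubElement(common, "firstname").text = "Mario"
ET.SubElement(common, "lastname").text = "Meir-Huber"
ET.SubElement(common, "age").text = "29"
address = ET.SubElement(person, "address")
ET.SubElement(address, "zipcode").text = "1150"
ET.SubElement(address, "city").text = "Vienna"

# Serialize to a string and parse it again to show the round trip
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)

parsed = ET.fromstring(xml_text)
city = parsed.find("address/city").text
print(city)
```

The address is a child element of the person, so one document carries both entities and their relation.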

Similarly, the Json representation of our model “Person” would look like this:

{
  "Person": {
    "Common": { "firstname": "Mario", "lastname": "Meir-Huber", "Age": 29 },
    "Address": { "zipcode": "1150", "city": "Vienna" }
  }
}


Listing 2: Json interpretation
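To show the difference between a nested and a flat format, the sketch below reads the nested Json document with Python's standard json module and merges the two entities into a single CSV row. The key names follow the Json listing; merging via a dictionary is just one possible approach, chosen here for brevity.

```python
import csv
import io
import json

document = """
{
  "Person": {
    "Common": {"firstname": "Mario", "lastname": "Meir-Huber", "Age": 29},
    "Address": {"zipcode": "1150", "city": "Vienna"}
  }
}
"""

person = json.loads(document)["Person"]

# Nested access works directly on the parsed structure
city = person["Address"]["city"]

# CSV cannot nest, so the two entities are merged into one flat record
flat = {**person["Common"], **person["Address"]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat), lineterminator="\n")
writer.writeheader()
writer.writerow(flat)
csv_text = buffer.getvalue()
print(csv_text)
```

The flat record loses the information that zipcode and city belong to a separate address entity, which is exactly the limitation of CSV described above.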

The traditional way of data representation: SQL

If we now look at how we could represent this data from a database as binary data, we need to join two different datasets. SQL supports this directly with joins. A possible representation could look like the following:

p.Firstname  p.Lastname  p.Age  a.Zipcode  a.City
Mario        Meir-Huber  29     1150       Vienna

Listing 3: SQL-based binary representation
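A join that produces such a row can be sketched with Python's built-in sqlite3 module. The table layout (a person table that references an address table through an address_id column) is an assumption made for this example; only the selected columns and values come from the listing above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE address (id INTEGER PRIMARY KEY, zipcode TEXT, city TEXT);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        firstname TEXT,
        lastname TEXT,
        age INTEGER,
        address_id INTEGER REFERENCES address(id)
    );
    INSERT INTO address VALUES (1, '1150', 'Vienna');
    INSERT INTO person VALUES (1, 'Mario', 'Meir-Huber', 29, 1);
    """
)

# Join the two datasets into one flat result row
joined = conn.execute(
    """
    SELECT p.firstname, p.lastname, p.age, a.zipcode, a.city
    FROM person AS p
    JOIN address AS a ON a.id = p.address_id
    """
).fetchall()
print(joined)
```

The aliases p and a correspond to the column prefixes used in the listing.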

The representation of data isn’t limited to what was described in this chapter so far. There are several other formats available, and others might arise in the future. However, data must have a clear and documented representation in a form that can be processed by the tools that build upon that data.


Big Data or what is the Data Lake?

When it comes to Big Data, people often talk about the “Data Lake”. But what is this?

Historically, we lived in “Data Ponds”. With the data pond architecture, each department within a company has its own data storage, often in different formats and technologies. HR, for instance, uses different technologies than the marketing department. The reasons for that vary, but mostly these applications are simply too different.

With data ponds, we used to have many different storage technologies in place, such as SQL, NoSQL, XML, unstructured data and many more.

The major difference in a data lake, which is the new approach, is that all data is now seen as one thing, regardless of where it is stored, which department owns the data and so on. All data within a company is the company’s entire knowledge. With new technologies such as Hadoop, we have the possibility to use all available data. Hadoop offers many data integration and governance tools to work with different data types.

With the Data Lake, all existing data ponds are joined together in one place, which forms the data lake. The company or organisation gets a much better view of what data is available, and it also gets more comprehensive insight.
Header Image copyright under the creative commons license by Dave Bloggs.

Amazon Web Services S3 upload bug

The AWS SDK for Java version 1.8.10 comes with a critical bug affecting uploads. A fix was provided by AWS, and normally the SDK is updated automatically, so you don’t need to worry.
However, if automatic updates are disabled in your Eclipse installation, you might lose data when uploading via SDK version 1.8.10. Here is what AWS has to say about the bug:

AWS Message

Users of AWS SDK for Java 1.8.10 are urged to immediately update to the latest version of the SDK, version 1.8.11.
If you’ve already upgraded to 1.8.11, you can safely ignore this message.
Version 1.8.10 has a potential for data loss when uploading data to Amazon S3 under certain conditions. Data loss can occur if an upload request using an InputStream with no user-specified content-length fails and is automatically retried by the SDK.
The latest version of the AWS SDK for Java can be downloaded here:
And is also available through Maven central:

The bug itself is fixed. If you haven’t updated the AWS SDK yet and are still on version 1.8.10, you should update to version 1.8.11. Normally, the AWS SDK updates itself automatically in Eclipse.