Data is often stored in one system, while the analytical systems run somewhere else. In this tutorial, we will look at the challenges of moving data for analysis.

Moving data for analysis

Another issue with Big Data is pointed out by (Alexander, Hoisie, & Szalay, 2011): data can't be moved easily for analysis. With Big Data, we often deal with terabytes or more, and moving that amount over a network connection is difficult or even impossible.
If real-time data has to be analyzed, moving it to another cluster first is practically impossible, since the data may already be outdated or no longer available by the time it arrives. Real-time analysis is also necessary for fraud protection: if the data first has to be moved to another cluster, it might already be too late.
In traditional databases this was less of a problem, since the data often amounted to a few gigabytes in a single database. With Big Data, data comes in various formats, at high volume and at high velocity, and moving it to another cluster while satisfying all of these constraints may simply not be possible.

What are the challenges?

(Alexander, Hoisie, & Szalay, 2011) describe several factors that make moving data to another cluster challenging: high-flux data, structured and unstructured data, real-time decisions and data organization.
High-flux data is data that arrives in real time. If it has to be analyzed, the analysis must also happen in real time, because the data might be gone or modified later. In Big Data applications, data arrives both structured and unstructured.
Decisions on data must often be made in real time. In a stream of financial transactions, for example, an algorithm has to decide immediately whether a record needs more detailed analysis. If not all data is kept, an algorithm must also decide which records are stored at all. Data organization is another challenge when it comes to moving data.
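
As a rough illustration, here is a minimal sketch of such a real-time decision step in Python; the thresholds and record layout are purely illustrative assumptions, not part of the cited work:

from typing import Iterable

ANALYSIS_THRESHOLD = 10_000.0   # flag large transactions for detailed analysis (assumed value)
STORE_THRESHOLD = 10.0          # drop very small transactions instead of storing them (assumed value)

def process_stream(transactions: Iterable[dict], storage: list) -> list:
    """Decide per record, in real time, whether to store it and whether to flag it."""
    flagged = []
    for tx in transactions:
        if tx["amount"] >= STORE_THRESHOLD:
            storage.append(tx)          # decision 1: keep the record at all
        if tx["amount"] >= ANALYSIS_THRESHOLD:
            flagged.append(tx)          # decision 2: hand it over for detailed analysis
    return flagged

storage = []
stream = [{"id": 1, "amount": 5.0}, {"id": 2, "amount": 25_000.0}]
print(process_stream(stream, storage))  # only the large transaction is flagged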


With the consistency of data comes another challenge: data concurrency. What it means is described in this tutorial.

What are the challenges with data concurrency?

Data needs to be partitioned if it can't be stored on a single system. In Big Data applications, we don't talk about small storage systems but rather about distributed systems. Data might be partitioned over hundreds or thousands of nodes, and the database must scale out to that demand.

Data partitioning is a key concept for databases and it serves Big Data applications as well. However, if data is distributed over several servers, it can take a while until all nodes are informed about a change.

To avoid concurrency issues, the data must be locked. This can result in poor database performance if the database is to be kept consistent at all times. One solution is to give up strict data consistency in favor of data partitioning. This approach is described in detail in section 1.6.2, where we focus on the CAP theorem.

How does this play out?

Let's imagine a web shop. There are two users in our sample, User A and User B, and both want to buy a product P. There is exactly one item in stock. User A sees this and proceeds to checkout, and so does User B. They complete their orders at about the same time.

The database in our sample is designed so that partitioning is preferred over consistency, and both users get the acknowledgement that their order was processed. Now we would have -1 items in stock, since no database trigger or other mechanism told us that we had run out of items. We either have to tell one user to "forget" the order or find a way to deliver the item to both users.
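
The following minimal Python sketch reproduces this race with two threads and an in-memory stock counter; it is an illustration only, not how a real web shop or database is implemented. The locked variant shows how one of the two orders would be rejected instead:

import threading, time

stock = {"product_p": 1}
lock = threading.Lock()

def naive_order(user):
    # both users check the stock before either one decrements it
    if stock["product_p"] > 0:
        time.sleep(0.01)              # simulated processing delay widens the race window
        stock["product_p"] -= 1
        print("user", user, "order accepted")

def safe_order(user):
    # check and decrement atomically; the second user is rejected
    with lock:
        if stock["product_p"] > 0:
            stock["product_p"] -= 1
            print("user", user, "order accepted")
        else:
            print("user", user, "item is out of stock")

threads = [threading.Thread(target=naive_order, args=(u,)) for u in ("A", "B")]
for t in threads: t.start()
for t in threads: t.join()
print("stock after naive orders:", stock["product_p"])    # typically -1

stock["product_p"] = 1                 # reset and retry with the locked variant
for u in ("A", "B"): safe_order(u)
print("stock after safe orders:", stock["product_p"])      # 0, one order rejected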

In any case, one user will probably be upset. Some web shops solve this issue in a non-technical way: they tell the user "sorry, we are unable to deliver in time" and offer the option to cancel the order or take a voucher. There is no simple technical solution to this.

How to solve data concurrency issues?

In most cases, it will cost the company money. If the web shop used a system built for strict consistency, it might run into database outages or delays, and users might not buy products because the web site is simply "not available". The web shop can lose money either through users who were unable to buy products because of delays in the database, or through consistency issues.

In the case of a web shop outage, users might not return since they are annoyed by the "bad performance of the website" and the "inability to process the order", whereas people are more likely to return and buy other products if they receive a voucher for issues caused by data partitioning and concurrency.


Another challenge for Big Data comes from the use of different storage systems. This creates a lot of variety in data and thus increases complexity. In this tutorial, we will discuss this.

What are the problems of different storage systems?

A main factor in Big Data is the variety of data. Data may not only change over time (e.g. a web shop that starts out selling books and later also sells cars) but also comes in different formats. Databases must support this.
Companies rarely store all their data in one single database; they use different databases, and different APIs consume different formats such as JSON, XML or other types. Facebook, for instance, uses MySQL, Cassandra and HBase to store its data: three different storage systems (Harris, 2011; Muthukkaruppan, 2010), each serving a different need.
(Helland, 2011) describes the challenges for datastores with four key principles:

  • unlocked data
  • inconsistent schema
  • extract, transform and load
  • too much to be accurate

What are these aspects about?

Unlocked data refers to the fact that traditional databases usually lock data during updates; Big Data systems often cannot rely on such locks, but giving up locking can lead to semantic inconsistencies in the database. With inconsistent schema, (Helland, 2011) describes the challenge of data coming from different sources and in different formats. The schema needs to be somewhat flexible to allow for extensibility; as stated earlier, businesses change over time and so does the data schema.
Extract, transform and load (ETL) is central to Big Data systems, since data comes from many different sources and needs to be brought into the right place and shape in a specific target system. Too much to be accurate outlines the "velocity" problem of Big Data applications: by the time a result is calculated, the data the calculation was built upon might already have changed. (Helland, 2011) states that you might not be able to be accurate at all and can only approximate results.
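
As a hedged illustration of the extract-transform-load step, here is a minimal sketch assuming JSON input and an SQLite target; the field names and table layout are made up for the example:

import json, sqlite3

raw = '[{"firstname": "Mario", "lastname": "Meir-Huber", "age": 29}]'

# Extract: parse the source format
records = json.loads(raw)

# Transform: reshape each record for the target schema
rows = [(r["firstname"] + " " + r["lastname"], r["age"]) for r in records]

# Load: write into the target datastore
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (name TEXT, age INTEGER)")
db.executemany("INSERT INTO person VALUES (?, ?)", rows)
print(db.execute("SELECT * FROM person").fetchall())   # [('Mario Meir-Huber', 29)]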


Big Data needs big storage, and storage is, in the end, a physical device. To date, most storage devices are hard disks that rely on mechanical movement. In this tutorial we will discuss the storage performance challenges.

What are the storage performance challenges?

A fast server hard drive available today (December 2012) runs at 15,000 revolutions per minute (rpm) (Seagate, 2013), while a typical desktop hard drive runs at about 7,200 rpm. In either case, there is significant latency until the read head is in place over the requested data. This mechanical approach to storage has been around for decades, and scientists as well as engineers have long complained about storage performance.
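
A rough back-of-the-envelope sketch of what those spindle speeds mean for average rotational latency (seek and transfer times come on top; the formula only covers waiting half a revolution on average):

def avg_rotational_latency_ms(rpm: int) -> float:
    # one revolution takes 60/rpm seconds; on average we wait half a revolution
    return (60.0 / rpm) / 2 * 1000

for rpm in (15_000, 7_200):
    print(rpm, "rpm ->", round(avg_rotational_latency_ms(rpm), 2), "ms average")
# 15000 rpm -> 2.0 ms average, 7200 rpm -> 4.17 ms average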

Main memory has always been faster than hard disk storage, and network speed is also higher than what hard disks can deliver. (Anthes, 2012) states that disk-based storage is about 10-100 times slower than the network and about 1,000 times slower than main memory. This means there is a significant bottleneck when it comes to delivering data from disk-based storage to an application.

As Big Data is about storing and analyzing data, this is a major challenge for Big Data applications. It doesn't help us much to have enough compute power to analyze the data if our disks simply can't deliver it fast enough.

Data is distributed

When we look at supercomputers today, they are usually measured in cores and teraflops (Top 500 Supercomputers Site, 2012). This is fine if you want to do heavy computation, such as analyzing the human genome, but it tells us nothing about disk performance when we want to store or analyze data. (Zverina, 2011) cites Allan Snavely, who proposes including disk performance in such metrics as well:

“I’d like to propose that we routinely compare machines using the metric of data motion capacity, or their ability to move data quickly” – Allan Snavely

Allan Snavely also stated that with increasing data size it gets harder to find data: hard disks keep growing in capacity, but access times stay the same.

This can be illustrated easily: imagine an external hard disk with a capacity of 1 TB, operating at 7,200 rpm with a 16 MB cache. It stores 1,000 videos of 1 GB each, which fills the entire disk. As your video collection grows, you switch to a 2 TB disk.

Once this disk is full, you won't be able to transfer the videos to another system in the same time as with the 1 TB drive; the 2 TB disk will very likely need about twice as long, because its throughput has hardly changed. Whereas compute performance keeps growing, the performance of accessing data stays about the same, and relative to the growth of data and storage capacity it effectively gets slower. Allan Snavely (Zverina, 2011) describes this with the following statement, and the small calculation after the quote illustrates the effect:

“The number of cycles for computers to access data is getting longer – in fact disks are getting slower all the time as their capacity goes up but access times stay the same. It now takes twice as long to examine a disk every year, or put another way, this doubling of capacity halves the accessibility to any random data on a given media.”
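
A small calculation illustrates the point, under the simplifying assumption that sustained throughput stays at roughly 150 MB/s while capacity doubles:

def full_disk_transfer_hours(capacity_tb: float, throughput_mb_s: float = 150.0) -> float:
    # time to read or copy the entire (full) disk at a fixed sustained throughput
    capacity_mb = capacity_tb * 1_000_000
    return capacity_mb / throughput_mb_s / 3600

for tb in (1, 2):
    print(tb, "TB ->", round(full_disk_transfer_hours(tb), 1), "hours")
# 1 TB -> 1.9 hours, 2 TB -> 3.7 hours: double the capacity, double the time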

How to overcome these challenges?

In the same article, Snavely suggests including the following metrics in a computer's performance rating: DRAM, flash memory, and disk capacity.

But what can enterprises do to achieve higher throughput in their systems? There is already some research on this, and most resources point towards solid state disks (SSDs) as storage. SSDs are becoming commodity hardware in high-end personal computers, but they are not that common in servers and distributed systems yet.

SSDs normally offer better performance but less capacity, and the price per GB is higher. For large-scale databases that need performance, SSDs might nevertheless be the better choice. The San Diego Supercomputer Center (SDSC) built a supercomputer with SSDs, called "Gordon", which can handle data up to 100 times faster than with conventional drives (Zverina, 2011).

Another prototype, called "Moneta" (Anthes, 2012), used phase-change memory to boost I/O performance. It was about 9.5 times faster than a conventional RAID system and about 2.8 times faster than a flash-based RAID system.

There is significant research around this topic, as storage performance is a problem for the large-scale, data-centric systems we now have with Big Data applications.


One of the most important things is to partition data across an environment. This is especially important for large-scale systems, as not everything can be stored on a limited number of machines.

How to partition data?

Partitioning is another factor for Big Data applications. It is one of the factors of the CAP theorem (see 1.6.1) and is also important for scaling applications. Partitioning basically describes the ability to distribute a database over different servers. In Big Data applications, it is often not possible to store everything on one server (Josuttis, 2011).

Figure: Data Partitioning

The partitioning factors illustrated in the figure are described by (Rys, 2011). Functional partitioning essentially describes the service-oriented architecture (SOA) approach (Josuttis, 2011): different functions are provided by their own services. In a web shop such as Amazon, many different services are involved; some handle the order workflow, others handle search, and so on.
If there is high load on a specific service such as the shopping cart, new instances can be added on demand. This reduces the risk of an outage that would lead to losing money. A service-oriented architecture alone doesn't solve all partitioning problems, so the data itself also has to be partitioned. With data partitioning, the data is distributed over different servers, which can also be spread geographically.
A partition key identifies which partition a piece of data belongs to. Since there is a lot of data and single nodes may fail, data should also be replicated and stored redundantly in order to cope with node failures.
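
A minimal sketch of key-based partitioning with simple replication; the node list, replication factor and record layout are illustrative assumptions, not a real partitioning scheme of any particular database:

import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]   # assumed cluster
REPLICAS = 2                                        # assumed replication factor

def nodes_for_key(partition_key: str) -> list:
    # hash the partition key to a primary node, replicate on the following node(s)
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    start = h % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

order = {"order_id": "A-1001", "customer": "Mario"}
print(nodes_for_key(order["order_id"]))   # e.g. ['node-2', 'node-3']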


Scalable data processing is necessary for all platforms handling data. In today’s tutorial we will have a look at this.
Scalability is another factor of Big Data applications described by (Rys, 2011). Big Data almost always involves highly scalable systems, and every Big Data application should be built in a way that eases scaling. (Rys, 2011) describes several scaling needs: user load scalability, data load scalability, computational scalability and scale agility.

What is scalable data?

The figure illustrates the different needs for scalability in Big Data environments as described by (Rys, 2011). Many applications such as Facebook (Fowler, 2012) have a lot of users. Applications should support a large user base and remain stable even when they see unexpectedly high user numbers. Various techniques can be applied to support different needs such as fast data access. A factor that often, but not only, comes with a high number of users is the data load.
(Rys, 2011) notes that this data can be produced by a few or by many users. However, sources such as sensors and other devices that are not directly related to users can also produce large datasets. Computational scalability is the ability to scale the analysis to large datasets: analyzing data needs compute power, and distributed algorithms such as MapReduce require many nodes in order to run queries and analyses with acceptable performance.
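
To make the MapReduce idea concrete, here is a tiny single-process sketch (map records to key/value pairs, then reduce per key); in a real cluster the map and reduce calls would run distributed over many nodes, and the record layout is an assumption for the example:

from collections import defaultdict

def map_phase(record):
    # emit (key, value) pairs; here: revenue per country
    yield record["country"], record["amount"]

def reduce_phase(key, values):
    return key, sum(values)

records = [{"country": "AT", "amount": 10}, {"country": "DE", "amount": 5},
           {"country": "AT", "amount": 7}]

grouped = defaultdict(list)
for r in records:
    for k, v in map_phase(r):
        grouped[k].append(v)

print([reduce_phase(k, vs) for k, vs in grouped.items()])   # [('AT', 17), ('DE', 5)]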
Scale agility describes the ability to change a system's environment, meaning that new instances, for example compute nodes, can be added or removed on demand. This requires a high level of automation and virtualization and is very similar to what cloud computing environments offer. Platforms such as Amazon EC2, Windows Azure, OpenStack, Eucalyptus and others provide this level of self-service, which greatly supports scale agility in Big Data environments.


Agility is an important factor for Big Data applications. Agile data needs to fulfill three different agility factors: model agility, operational agility and programming agility (Rys, 2011).

Figure: Data agility

Agile data: model agility

Model agility describes how easy it is to change the data model. Traditionally, it is rather hard to change a schema in SQL systems, whereas non-relational databases allow easy changes. In key/value stores such as DynamoDB (Amazon Web Services, 2013), changing the model is very easy. Databases behind fast-changing systems such as social media applications and online shops require model agility, since updates to such systems occur frequently, often weekly or even daily (Paul, 2012).
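
The contrast can be sketched roughly as follows, using SQLite for the relational side and a plain dictionary as a stand-in for a key/value or document store; attribute names are illustrative only:

import sqlite3

# Relational side: adding an attribute requires an explicit schema migration
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (firstname TEXT, lastname TEXT)")
db.execute("ALTER TABLE person ADD COLUMN age INTEGER")   # schema change needed first
db.execute("INSERT INTO person VALUES ('Mario', 'Meir-Huber', 29)")

# Key/value or document side: new attributes are simply written with the item
store = {}
store["person:1"] = {"firstname": "Mario", "lastname": "Meir-Huber"}
store["person:2"] = {"firstname": "Ada", "lastname": "Lovelace", "age": 36}   # no migration

print(db.execute("SELECT * FROM person").fetchall(), store["person:2"])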

Operational agility

In distributed environments, it is often necessary to change operational aspects of a system. New servers are added frequently, often with different operating systems and hardware. Database systems should stay tolerant of such operational changes, as this is a crucial factor for growth.

Programming agility

Database systems should support software developers; this is where programming agility comes into play. Programming agility means that the database and all associated SDKs should ease the life of the developer working with the database and support fast development.


There are two main characteristics that data needs to fulfill: it must be transformable and filterable. In this tutorial, I will describe both.

Transformable Data

If data is transformable, it can be changed to a different format or layout. This can mean a format change, for example from binary to JSON or XML, as well as a totally new representation. If someone wants to look at a specific dataset (which, for instance, could be filtered), not all of the data might be interesting.

Let's assume a manager wants to filter for all customers younger than 18 in a specific district. The manager is probably not interested in the names of these customers but rather in their total number. Instead of returning a huge list of names, addresses and the like, a single number is returned.

Or the online marketing department wants to target all customers matching specific criteria such as age; the address might not be relevant, but names and e-mail addresses are. Transformability is also a necessary characteristic if data has to be exported to another database, e.g. for analytics.
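
Both transformations can be sketched with a hypothetical in-memory customer list; the field names are assumptions for the example:

customers = [
    {"name": "Anna", "age": 16, "district": "1150", "email": "anna@example.com"},
    {"name": "Ben",  "age": 17, "district": "1150", "email": "ben@example.com"},
    {"name": "Cara", "age": 35, "district": "1010", "email": "cara@example.com"},
]

# Manager's view: only the number of under-18 customers in district 1150
count = sum(1 for c in customers if c["age"] < 18 and c["district"] == "1150")
print(count)                                   # 2

# Marketing view: drop the address-like fields, keep name and e-mail
targets = [{"name": c["name"], "email": c["email"]} for c in customers if c["age"] < 18]
print(targets)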

Filterable Data

This is a key characteristic of datasets. Analytics software uses filtering frequently, and it is absolutely necessary since most analyses simply don't run on all data but rather on selected data. In databases, filtered data is typically expressed with "SELECT ... WHERE" clauses.

Most of what filtering is good for was already discussed under "Transformability"; nevertheless, let's go into a bit more detail. When we analyze data, it is often necessary to work on specific subsets.

Imagine a Google search query for "Big Data": all data within Google's index gets filtered for exactly these words and a consolidated list is returned. If the online marketing department mentioned under "Transformability" wants a list of customers in a specific area, this list is likewise filtered by zip code or other geographical data. Hence, supporting filtering is an important characteristic of data.
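
A minimal filtering sketch, essentially the in-memory equivalent of a "SELECT ... WHERE" clause, again using a hypothetical customer list:

customers = [
    {"name": "Anna", "zipcode": "1150", "city": "Vienna"},
    {"name": "Cara", "zipcode": "1010", "city": "Vienna"},
]

# Equivalent of: SELECT name FROM customers WHERE zipcode = '1150'
in_area = [c["name"] for c in customers if c["zipcode"] == "1150"]
print(in_area)   # ['Anna']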


Amazon Web Services today announced its new datacenter region in Frankfurt, Germany. This is AWS region number 11 and the second in Europe. AWS will offer a large number of services from that datacenter.
Here is the original press release:

SEATTLE—Oct, 23, 2014– (NASDAQ:AMZN) — Amazon Web Services, Inc. (AWS, Inc.), an Amazon.com company, today announced the launch of its new AWS EU (Frankfurt) region, which is the 11th technology infrastructure region globally for AWS and the second region in the European Union (EU), joining the AWS EU (Ireland) region. All customers can now leverage AWS to build their businesses and run applications on infrastructure located in Germany. As with every AWS region, customers can do this knowing that their content will stay within the region they choose. The newly launched AWS EU (Frankfurt) region comes as a result of the rapid growth AWS has been experiencing and is available now for any business, organization or software developer to sign up and get started at: http://aws.amazon.com.

All AWS infrastructure regions around the world are designed, built, and regularly audited to meet rigorous compliance standards including, ISO 27001, SOC 1 (Formerly SAS 70), PCI DSS Level 1, and many more, providing high levels of security for all AWS customers. AWS is fully compliant with all applicable EU Data Protection laws, and for customers that require it, AWS provides data processing agreements to help customers comply with EU data protection requirements. More information on how customers using AWS can meet EU data protection requirements and local certifications such as BSI IT Grundschutz, can be found on the AWS Data Protection webpage at: aws.amazon.com/de/data-protection. A full list of compliance certifications can be found on the AWS compliance webpage at: http://aws.amazon.com/compliance/.

The new AWS EU (Frankfurt) region consists of two separate Availability Zones at launch. Availability Zones refer to datacenters in separate, distinct locations within a single region that are engineered to be operationally independent of other Availability Zones, with independent power, cooling, and physical security, and are connected via a low latency network. AWS customers focused on high availability can architect their applications to run in multiple Availability Zones to achieve even higher fault-tolerance. For customers looking for inter-region redundancy, the new AWS EU (Frankfurt) region, in conjunction with the AWS EU (Ireland) region, gives them flexibility to architect across multiple AWS regions within the EU.

“Our European business continues to grow dramatically,” said Andy Jassy, Senior Vice President, Amazon Web Services. “By opening a second European region, and situating it in Germany, we’re enabling German customers to move more workloads to AWS, allowing European customers to architect across multiple EU regions, and better balancing our substantial European growth.”

Many German customers are already using AWS including Talanx, in the highly regulated insurance sector. Talanx is one of the top three largest insurers in Germany and one of the largest insurance companies in the world with over €28 billion in premium income in 2013. “For Talanx, like many companies that hold sensitive customer data, data privacy is paramount,” says Achim Heidebrecht, Head of Group IT, Talanx AG. “Using AWS we are already seeing a 75% reduction in calculation time, and €8 million in annual savings, when running our Solvency II simulations while still complying with our very strict data policies. With the launch of the AWS region on German soil, we will now move even more of our sensitive and mission critical workloads to AWS.”

Hubert Burda Media is one of the largest media companies in Europe with over 400 brands and revenues in excess of $3.6 billion. JP Schmetz, Chief Scientist of Hubert Burda Media said of the announcement, “Now that AWS is available in Germany it gives our subsidiaries the option to move certain assets to the cloud. We have long had policies preventing data to be hosted outside of German soil and this new German region gives us the option to use AWS more meaningfully.”

Academics in Germany were also quick to welcome the new region, “The arrival of an Amazon Web Services Region in Germany marks an important occasion for the German business and technology community,” said Prof. Dr Helmut Krcmar, Vice Dean of the Computer Science Faculty, and Chair of Information Systems at the Technical University of Munich. “We work with a number of DAX listed companies in Germany. Many have been holding off moving sensitive workloads to the cloud until they had computing and service facilities on German soil as this could help them comply with their internal processes. This new region from AWS answers this and we expect to see innovation amongst Germany, and Europe’s, companies flourish as a result.”


Data representation is an often-mentioned characteristic of Big Data. It goes well with the "variety" aspect of the definition stated above. Every piece of data is represented in some specific form, whatever that form may be; well-known forms are XML, JSON, CSV and binary. Depending on the representation, different possibilities for expressing relations are available.

XML and JSON, for instance, allow us to express child objects or relations in the data, whereas this is rather hard with CSV or binary formats. An example of such a relation is a dataset of type "person": each person consists of attributes that identify the person (e.g. last name, age, sex) and an address that is an independent entity. To retrieve this data as CSV or binary, you either have to run two queries or create a new, merged entity for the query. XML and JSON allow us to nest entities within other entities.

What is Data representation?

Figure: data-entity

The entity shown in the figure would look like the following when represented in XML:

<person>
  <common>
    <firstname>Mario</firstname>
    <lastname>Meir-Huber</lastname>
    <age>29</age>
  </common>
  <address>
    <zipcode>1150</zipcode>
    <city>Vienna</city>
  </address>
</person>

Listing 1: XML representation of the entity “person”

The JSON representation of our model "person" looks similar:

{
  "person": {
    "common": {
      "firstname": "Mario",
      "lastname": "Meir-Huber",
      "age": 29
    },
    "address": {
      "zipcode": "1150",
      "city": "Vienna"
    }
  }
}

Listing 2: JSON representation of the entity "person"

The traditional way of data representation: SQL

If we now look at how we could represent this data coming from a database as flat (binary) records, we need to join two different datasets. This is exactly what SQL supports. A possible representation could look like the following:

p.Firstname | p.Lastname | p.Age | a.Zipcode | a.City
Mario       | Meir-Huber | 29    | 1150      | Vienna

Listing 3: SQL-based binary representation
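
Such a joined, flattened result can be produced with a SQL query, sketched here via Python's built-in SQLite; the table and column names mirror the listing and are otherwise assumptions:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (id INTEGER, firstname TEXT, lastname TEXT, age INTEGER)")
db.execute("CREATE TABLE address (person_id INTEGER, zipcode TEXT, city TEXT)")
db.execute("INSERT INTO person VALUES (1, 'Mario', 'Meir-Huber', 29)")
db.execute("INSERT INTO address VALUES (1, '1150', 'Vienna')")

rows = db.execute("""
    SELECT p.firstname, p.lastname, p.age, a.zipcode, a.city
    FROM person p JOIN address a ON a.person_id = p.id
""").fetchall()
print(rows)   # [('Mario', 'Meir-Huber', 29, '1150', 'Vienna')]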

The representation of data isn't limited to what has been described in this chapter. Several other formats exist and more might arise in the future. However, data must have a clear and documented representation in a form that can be processed by the tools built upon it.
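
As a small illustration of the last point, the representations from Listings 1 and 2 can be processed with standard parsers, for example Python's built-in XML and JSON modules:

import json
import xml.etree.ElementTree as ET

xml_doc = ("<person><common><firstname>Mario</firstname><age>29</age></common>"
           "<address><city>Vienna</city></address></person>")
json_doc = '{"person": {"common": {"firstname": "Mario", "age": 29}, "address": {"city": "Vienna"}}}'

root = ET.fromstring(xml_doc)
print(root.findtext("common/firstname"), root.findtext("address/city"))   # Mario Vienna

data = json.loads(json_doc)
print(data["person"]["common"]["firstname"], data["person"]["address"]["city"])   # Mario Vienna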

I hope you enjoyed this part of the tutorial about Big Data technology. It belongs to the Big Data Tutorial series; make sure to read the other parts as well.