Future Technologies that have an impact on Cloud and Big Data

When it comes to future technologies, Cloud Computing and Big Data are no longer the future. They are here right now, and more and more of us deal with them every day. Even when you watch TV, the cloud is often mentioned. But there are several other technologies that will have a significant impact on Cloud Computing and Big Data. These technologies are different from Cloud and Big Data, but they will use them as an important basis and back end.

Future Emerging Technologies using Cloud and Big Data

The technologies are:

  • Smart Cities
  • Smart Homes
  • Smart Production
  • Autonomous Systems
  • Smart Logistics
  • Internet of Things

All of these technologies work together and use the Cloud as their back end. Furthermore, they build on Big Data concepts and technologies. Taken together, they can be described as "cyber-physical systems". This basically means that the virtual world we have been used to so far moves further into the physical world. These two worlds will merge and form something entirely new. In the upcoming weeks I will outline each topic in detail, so stay tuned and subscribe to this tag to get the updates.
Header Image Copyright by Pascal, licensed under the Creative Commons 2.0 license.

Software defined Storage (SDS) in the Cloud

Cloud Computing changed how we handle IT nowadays in several ways. Common tasks that used to take a lot of time are now largely automated, and much more is still to come. Another interesting development is "Software defined X". This basically means that infrastructure elements are automated to a larger degree as well, which makes them more scalable and easier to utilize from applications. A term used frequently lately is the "Software defined Networking" approach; however, there is another one that sounds promising, especially for Cloud Computing and Big Data: Software defined Storage.
Software defined Storage promises to abstract the way we use storage. This is especially useful for large-scale systems, as no one really wants to care about how content is distributed across different servers. This should basically be hidden from end-users (software developers). For instance, if you are using a storage system for your website, you want an API like Amazon's S3: there is no need to worry about which physical machine your files are stored on – you just specify the desired region. The back-end system (in this case, Amazon S3) takes care of the rest.
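As a minimal sketch, this is roughly how such an API looks from the developer's side, here using the boto3 client for Amazon S3 in Python; the bucket name, key and region below are made-up placeholders, not values from this article:

# Minimal sketch: storing and reading an object through an S3-style API.
# Bucket name, key and region are placeholders; where the bytes physically
# live is handled entirely by the storage back end.
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")   # the region is the only "location" you choose

with open("logo.png", "rb") as f:
    s3.put_object(Bucket="my-website-assets", Key="images/logo.png", Body=f)

obj = s3.get_object(Bucket="my-website-assets", Key="images/logo.png")
data = obj["Body"].read()   # reading the file back through the same abstraction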

Software defined Storage explained

As for the architecture, you simply communicate with the abstraction layer, which takes care of distribution, redundancy and other factors.
At present, there are several systems available that take care of this: besides well-known services such as Amazon S3, there are also other solutions such as the Hadoop Distributed File System (HDFS) or GlusterFS.
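To illustrate how such an abstraction layer feels in practice, here is a small hypothetical sketch using the Python fsspec library, where only the URL scheme decides whether a file ends up on local disk, Amazon S3 or HDFS; the bucket, host and paths are invented for this example, and the relevant back-end packages (s3fs, pyarrow) would need to be installed:

# Sketch: the same write code works against different storage back ends,
# only the URL scheme changes. Bucket names, hosts and paths are placeholders.
import fsspec

for url in (
    "file:///tmp/report.csv",                  # local disk
    "s3://my-analytics-bucket/report.csv",     # Amazon S3 (via s3fs)
    "hdfs://namenode:8020/data/report.csv",    # HDFS (via pyarrow)
):
    with fsspec.open(url, "wb") as f:          # distribution and redundancy live below this API
        f.write(b"user_id,interest\n42,luxury cars\n")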
 
Header Image Copyright: nyuhuhuu. Licensed under the Creative Commons 2.0.

Big Data – how to achieve data quality

In order to analyse data correctly, a high level of data quality is necessary. In this tutorial, we will look at how to achieve this.

What are the data quality attributes?

Before data is stored and analysed, certain attributes regarding the data must be fulfilled. Heinrich and Stelzer (2011) defined a set of data quality attributes that should be met; a small code sketch illustrating a few of them follows after the list.

Data quality attributes

  • Relevance. Data should be relevant to the use-case. If a query looks up all users interested in "luxury cars" on a web portal, all of these users should be returned. It should be possible to gain some advantage from this data, e.g. for advanced marketing targeting.
  • Correctness. Data has to be correct. If we again query for all users on a web portal interested in luxury cars, the data should be correct. By correctness, we mean that the data should really represent people interested in luxury cars and that fake entries should be removed.
  • Completeness. Data should be complete. Targeting all users interested in luxury cars only makes sense if we can actually reach them, e.g. by e-mail. If the e-mail field (or any other field we want to target our users by) is blank, the data is not complete for our use-case.
  • Timeliness. Data should be up to date. A user might change their e-mail address after a while, and our database should reflect these changes whenever and wherever possible. If we target our users for luxury cars, it won't be of much use if only 50% of the users' e-mail addresses are still correct. We might have "big data", but the data is not correct because it has not been updated for a while.
  • Accuracy. Data should be as accurate as possible. Web site users should have the possibility to state "Yes, I am interested in luxury cars" instead of only selecting their favorite brand (which could be offered additionally). If users can only select a favorite brand, the data might be accurate, but not accurate enough. Imagine someone selects "BMW" as their favorite brand. BMW could be considered a luxury car maker, but it also has very different models. If someone selects BMW just because they like the sporty models, the targeting mechanism might hit the wrong people.
  • Consistency. This shouldn't be confused with the consistency requirement of the CAP theorem (see next section). Data might be duplicated, since users might register several times to get various benefits. A user might select "luxury cars" with one account and "budget cars" with another. Duplicate accounts lead to inconsistent data, which is a frequent problem in large web portals such as Facebook (Kelly, 2012).
  • Availability. Availability states that data should be available. If we want to query all users interested in luxury cars, we are not interested in a subset but in all of them. Availability is also a challenge addressed by the CAP theorem. In this case, however, the focus is not on the general availability of the database but on the availability of each dataset itself. The algorithm querying the data should be as good as possible at retrieving all available data. There should also be easy-to-use tools and languages to retrieve the data; normally, each database provides a query language such as SQL, or O/R mappers, to developers.
  • Understandability. It should be easy to understand the data. If we query our database for people interested in luxury cars, we should be able to easily understand what the data is about. Once the data is returned, we should be able to use our favorite tool to work with it. The data should be self-describing so that we know how to handle it: if the result contains a "zip" column, we know that this is the ZIP code the individual users live in.
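As a rough illustration rather than a complete framework, the following Python sketch checks three of these attributes (completeness, timeliness and consistency) on a small list of user records; the field names, sample values and the one-year timeliness threshold are assumptions made up for this example:

# Rough sketch: checking completeness, timeliness and consistency (duplicates)
# on user records. Field names and the one-year threshold are assumptions.
from datetime import datetime, timedelta

users = [
    {"id": 1, "email": "anna@example.com", "interest": "luxury cars", "updated": datetime(2024, 5, 1)},
    {"id": 2, "email": "",                 "interest": "luxury cars", "updated": datetime(2020, 1, 1)},
    {"id": 3, "email": "anna@example.com", "interest": "budget cars", "updated": datetime(2024, 6, 1)},
]

# Completeness: targeting by e-mail only works if the e-mail field is filled
incomplete = [u for u in users if not u["email"]]

# Timeliness: flag records that have not been updated for more than a year
stale = [u for u in users if datetime.now() - u["updated"] > timedelta(days=365)]

# Consistency: the same e-mail address on several accounts hints at duplicates
seen, duplicates = set(), []
for u in users:
    if u["email"] and u["email"] in seen:
        duplicates.append(u)
    seen.add(u["email"])

print(len(incomplete), "incomplete,", len(stale), "stale,", len(duplicates), "duplicate records")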

I hope you enjoyed this first part of the tutorial about transformable and filterable data. It is part of the Big Data Tutorial series, so make sure to read the other parts as well.

How to create Data analytics projects

Big Data analysis is something that needs to be done iteratively. A famous novel in which a lot of data was analyzed is "The Hitchhiker's Guide to the Galaxy". In the novel, a supercomputer is asked for the "Answer to the Ultimate Question of Life, the Universe, and Everything". As this is a rather difficult problem to solve, iteration is necessary.
Bizer, Boncz, Brodie and Erling (2011) describe some iteration steps for creating Big Data analysis applications. Five steps are mentioned in their paper: Define, Search, Transform, Entity Resolution and Answer the Query.
Define deals with the problem that needs to be solved. This is when the marketing manager asks: "We need to find a location in county 'xy' where customers are between 18 and 30 years old and we have no store yet." In our initial example from "The Hitchhiker's Guide to the Galaxy", this would be the question about the answer to everything.
Next, we identify candidate elements in the Big Data space. Bizer, Boncz, Brodie and Erling (2011) call this "Search". In the marketing example, this means that we have to scan the data of all users who are between 18 and 30. This data must then be combined with store locations. In "The Hitchhiker's Guide to the Galaxy", this would mean that we have to scan all data, since we are trying to find the answer to everything.
Transform means that the identified data has to be "extracted, transformed and loaded" into appropriate formats. This is part of the preparation phase, after which the data is almost ready for computation. Data is extracted from different sources and transformed into a uniform format that can be used for the analysis. In the marketing example, we need to use government sources and combine them with our own customer data. Furthermore, we need map data. All of this data is then stored in our database for analysis. It is more complicated with the "Hitchhiker's" problem: since we need to analyze ALL data available in the universe, we simply can't copy it to a new system. The analysis has to be done on the systems where the data is stored.
After the data elements are prepared, we need to resolve entities. In this phase, we ensure that data entities are unique, relevant and comprehensive. In the marketing example, this means that all entities representing customers aged 18 to 30 are resolved and de-duplicated. In the Hitchhiker's problem, we can't resolve entities: once again, we need to find the answer to everything and can't afford to exclude data.
In the last step, the data is finally analyzed. Bizer, Boncz, Brodie and Erling (2011) describe this as "Answer the Query". Basically, this means that the actual data analysis is carried out. Big Data analysis usually needs a lot of nodes that compute the results from the available datasets. In our marketing example, we would look at the resolved datasets and compare them with our store locations. The result would be a list of counties where the age condition is fulfilled but no store is available yet. In our Hitchhiker's example, we would analyze all data and look for the ultimate answer.
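To make the five steps a bit more tangible, here is a toy Python sketch that runs Define, Search, Transform, Entity Resolution and Answer the Query on a tiny in-memory version of the marketing example; all field names and values are invented and not part of the cited paper:

# Toy walk-through of the five steps for the marketing example (invented data).

# 1. Define: counties where customers are 18 to 30 years old and no store exists yet
MIN_AGE, MAX_AGE = 18, 30

customers = [
    {"name": "Ann",  "age": "25", "county": "xy"},
    {"name": "Ann",  "age": "25", "county": "xy"},   # duplicate record from a second registration
    {"name": "Bob",  "age": "42", "county": "ab"},
    {"name": "Cleo", "age": "19", "county": "cd"},
]
stores = [{"county": "cd"}]

# 2. Search: identify candidate records in the raw data
candidates = [c for c in customers if MIN_AGE <= int(c["age"]) <= MAX_AGE]

# 3. Transform: bring the candidates into a uniform format (age as int, county lower-case)
transformed = [{"name": c["name"], "age": int(c["age"]), "county": c["county"].lower()}
               for c in candidates]

# 4. Entity resolution: remove duplicate customer entries
unique = {(c["name"], c["age"], c["county"]): c for c in transformed}.values()

# 5. Answer the query: counties with matching customers but no store yet
store_counties = {s["county"] for s in stores}
print(sorted({c["county"] for c in unique} - store_counties))   # ['xy']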
The following figure displays the five steps for Big Data analysis as described above.

Data iteration
