2 Big Data and Hadoop E-Books are available at a special promotion. The reduced price is only valid for 1 week, so make sure to order soon! The offer expires on 21th of December and are available on the Kindle store. The two E-Books are:

  • Big Data (Introduction); 0.99$ instead of 5$: Get it here
  • Hadoop (Introduction); 0.99$ instead of 5$: Get it here

Have fun reading it!

2016 is around the corner and the question is, what the next year might bring. I’ve added my top 5 predictions that could become relevant for 2016:

  • The Cloud war will intensify. Amazon and Azure will lead the space, followed (with quite some distance) by IBM. Google and Oracle will stay far behind the leading 2+1 Cloud providers. Both Microsoft and Amazon will see significant growth, with Microsoft’s growth being higher, meaning that Microsoft will continue to catch up with Amazon
  • More PaaS Solutions will arrive. All major vendors will provide PaaS solutions on their platform for different use-cases (e.g. Internet of Things). These Solutions will become more industry-specific (e.g. a Solution specific for manufacturing workflows, …)
  • Vendors currently not using the cloud will see declines in their income, as more and more companies move to the cloud
  • Cloud Data Centers will become more often outsourced from the leading providers to local companies, in order to overcome local legislation
  • Big Data in the Cloud will grow significantly in 2016 as more companies will put workload to the Cloud for these kind of applications

What do you think? What are your predictions?

I am happy to announce the development we did over the last month within Teradata. We developed a light-weight process model for Big Data Analytic projects, which is called “RACE”. The model is agile and resembles the know-how of more than 25 consultants that worked in over 50 Big Data Analytic projects in the recent month. Teradata also developed CRISP-DM, the industry leading process for data mining. Now we invented a new process for agile projects that addresses the new challenges of Big Data Analytics.

Where does the ROI comes from?

This was one of the key questions we addressed when developing RACE. The economics of Big Data Discovery Analytics are different to traditional Integrated Data Warehousing economics. ROI comes from discovering insights in highly iterative projects run over very short time periods (4 to 8 weeks usually) Each meaningful insight or successful use case that can be actioned generates ROI. The total ROI is a sum of all the successful use cases. Competitive Advantage is therefore driven by the capability to produce both a high volume of insights as well as creative insights that generate a high ROI.

What is the purpose of RACE?

RACE is built to deliver a high volume of use cases, focusing on speed and efficiency of production. It fuses data science, business knowledge & creativity to produce high ROI insights

How does the process look like?

RACE - an agile process for Big Data Analytic Projects

RACE – an agile process for Big Data Analytic Projects

The process itself is divided into several short phases:

  • Roadmap.That’s an optional first step (but heavily recommended) to built a roadmap on where the customer wants to go in terms of Big Data.
  • Align. Use-cases are detailed and data is confirmed.
  • Create. Data is loaded, prepared and analyzed. Models are developed
  • Evaluate. Recommendations for the business are given

In the next couple of weeks we will publish much more on RACE, so stay tuned!

Amazon announced details about their Q2 earnings yesterday. Their cloud business grew with incredible 81%. This is massive, given the fact that Amazon is already the number #1 company in that area. This quarter, they earned 1.8 billion USD from cloud computing.

Summing up this number, their revenue would definitively reach some 7 billion this year. However, if this growth continues to increase so fast, I guess they could even get double-digit by the end of this year. Will Amazon reach 10 billion in 2015? If so, this would be incredible! Microsoft stated that their growth was somewhere well above the 100% mark, so I am interested in where Microsoft will stand by the end of the year.

But what does this tell us? Both Microsoft and Amazon are growing fast in this business and we can expect that we will see many more interesting services in the coming month and years in the Cloud. My opinion is that the market is already consolidated between Microsoft and Amazon. Other companies such as Google and Oracle are rather niche players in the Cloud market.

Hadoop is one of the most popular Big Data technologies, or maybe the key Big Data technology. Due to large demand for Hadoop, I’ve decided to write a short Hadoop tutorial series here. In the next weeks, I will write several articles on the Hadoop platform and key technologies.

When we talk about Hadoop, we don’t talk about one specific software or a service. The Hadoop project features several projects, each of them serving different topics in the Big Data ecosystem. When handling Data, Hadoop is very different to traditional RDBMS systems. Key differences are:

  • Hadoop is about large amounts of data. Traditional database systems were only about some gigabyte or terabyte of data, Hadoop can handle much more. Petabytes are not a problem for Hadoop
  • RDBMS work with an interactive access to data, whereas Hadoop is batch-oriented.
  • With traditional database systems, the approach was “read many, write many”. That means, that data gets written often but also modified often. With Hadoop, this is different: the approach now is “write once, read many”. This means that data is written once and then never gets changed. The only purpose is to read the data for analytics.
  • RDBMS systems have schemas. When you design an application, you first need to create the schema of the database. With Hadoop, this is different: the schema is very flexible, it is actually schema-less
  • Last but not least, Hadoop scales linear. If you add 10% more compute capacity, you will get about the same amount of performance. RDBMS are different; at a certain point, scaling them gets really difficult.

Central to Hadoop is the Map/Reduce algorithm. This algorithm was usually introduced by Google to power their search engine. However, the algorithm turned out to be very efficient for distributed systems, so it is nowadays used in many technologies. When you run queries in Hadoop with languages such as Hive or Pig (I will explain them later), these queries are translated to Map/Reduce algorithms by Hadoop. The following figure shows the Map/Reduce algorithm:

Map Reduce function
Map Reduce function

The Map/Reduce function has some steps:

  1. All input data is distributed to the Map functions
  2. The Map functions are running in parallel. The distribution and failover is handled entirely by Hadoop.
  3. The Map functions emit data to a temporary storage
  4. The Reduce function now calculates the temporary stored data

A typical sample is the word-count. With word-count, input data as text is put to a Map function. The Map function adds all words of the same kind to a list in the temporary store. The reduce-function now counts the words and builds a sum.

Next week I will blog about the different Hadoop projects. As already mentioned earlier, Hadoop consists of several other projects.

Big Data is all about limiting our privacy. With Big Data, we get no privacy at all. Hello, Big Brother is watching us and we have to stop it right now!

Well, this is far too cruel. Big Data is NOT all about limiting our privacy. Just to make it clear: I see the benefits of Big Data. However, there are a lot of people out there that are afraid of Big Data because of privacy. The thing I want to state first: Big Data is not NSA, Privacy, Facebook or whatever surveillance technology you can think of. Of course, it is often enabled by Big Data technologies. I see this discussion often and I recently came across an event, that stated stated that Big Data is bad and it limits our privacy. I say, this is bullsh##.

The event I am talking about stated that Big Data is bad, it is limiting our privacy and it needs to be stopped. It is a statement that only sees one side of the topic. I agree that the continuous monitoring of people by secret services isn’t great and we need to do something about it. But this is not Big Data. I agree that Facebook is limiting my privacy. I significantly reduced the amount of time spending on Facebook and don’t use the mobile Apps. This needs to change.

However, this not Big Data. This are companies/organisations doing something that is not ok. Big Data is much more than that. Big Data is not just evil, it is great for many aspects:

  • Big Data in healthcare can save thousands, if not millions of lives by improving medicine, vaccination and finding correlations for chronically ill people to improve their treatment. Nowadays, we can decode the DNA in short time, which helps a lot of people!
  • Big Data in agriculture can improve how we produce foods. Since the global population is growing, we need to get more productive in order to feed everyone.
  • Big Data can improve the stability and reliability of IT systems by providing real-time analytics. Logs are analysed in real-time to react to incidents before they happen.
  • Big Data can – and actually does – improve the reliability of devices and machines. An example is that of medicine devices. A company in this field could reduce the time the devices had an outage from weeks to only hours! This does not just save money, it also saves lives!
  • There are many other use-cases in that field, where Big Data is great

We need to start to work together instead of just calling something bad because it seems to be so. No technology is good or evil, there are always some bad things but also some good things. It is necessary to see all sides of a technology. The conference I was talking about gave me the inspiration to write this article as it is so small-minded.

As of future technologies, Cloud Computing and Big Data aren’t a future anymore. They are here, right now and more and more of us start to deal with these technologies. Even when you watch TV, a reference to the cloud is often made. But there are several other technologies that will have a certain impact on Cloud Computing and Big Data. These technologies are different to Cloud and Big Data but will utilize that and use it as an important basis and back end.

Future Emerging Technologies using Cloud and Big Data

Future Emerging Technologies using Cloud and Big Data

The technologies are:

  • Smart Cities
  • Smart Homes
  • Smart Production
  • Autonomous Systems
  • Smart Logistics
  • Internet of Things

All these technologies work together and have the Cloud as back end. Furthermore, they use Big Data concepts and technologies. Summing these technologies up, they can be described as “cyber-physical systems”. This basically means that the virtual world we were used to until now moves stronger into the physical world. These two worlds will merge together and form something totally new. In the upcoming weeks I will outline each topic in detail, so stay tuned and subscribe to this tag to get the updates.

Header Image Copyright by Pascal, licensed under the Creative Commons 2.0 license.

Cloud Computing gave us several changes in how we handle IT nowadays. Common tasks that used to take a lot of time received great automation and much more is still about to come. Another interesting development is the “Software defined X”. This basically means that infrastructure elements receive larger automation as well, which ends up being more scale able and better to utilize from applications. A frequent term used lately is the “Software defined Networking” approach, however, there is another one that sounds promising, especially for Cloud Computing and Big Data: Software defined Storage.

Software defined Storage gives us the promise to abstract the way how we use storage. This is especially useful for large scale systems, as no one really wants to care about how to distribute the content to different servers. This should basically be opaque to end-users (software developers). For instance, if you are using a storage system for your website, you want to have an API like Amazon’s S3. there is no need to worry about on which physical machine your files are stored – you just specify the desired region. The back-end system (in this case, Amazon S3) takes care of that.

Software defined Storage explained

Software defined Storage explained

As of the architecture, you simply communicate with the abstraction layer, that takes care of the distribution, redundancy and other factors.

At present, there are several systems available that takes care of that: next to the well-know systems such as Amazon S3, there are also other solutions such as the Hadoop Distributed File System (HDFS) or GlusterFS.

 

Header Image Copyright: nyuhuhuu. Licensed under the Creative Commons 2.0.

//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js //

Apache Software Foundation announced that Apache Storm is now a top level Hadoop project. But what is Apache Storm about? Well, basically Apache Storm is a project to analyse data streams that are near real time. Storm works with messages and analyses what is going on. Storm originates from Twitter, which is using it for their streaming API. Storm is about processing time-critical data and Storm guarantees that your data gets processed. It is basically fault tolerant and scalable. Apache Storm is useful for fraud protection in gambling, banking and financial services, but not only there. Storm can be used wherever real-time or time-critical applications are necessary. At the moment, Storm allows to process 1 million tupels per second and node. This is massive, given the fact that Storm is all about scaling out. Imagine adding 100 nodes! Apache Storm works with Tupels that come from spouts. A spout is a messaging system such as Apache Kafka. Storm supports much more Messaging systems and it can easily be extended by it’s abstraction layer. Storm consists of some major concepts illustrated in the following image: Apache Storm Nimbus is the Master Node, similar to Hadoop‘s Job Tracker. ZooKeeper is used for Cluster coordination and the Supervisor runs the worker process. Each worker process consists of some subsets: an executor that is a thread spanned by the worker and a task itself.

Enjoy this article?

Make sure to subscribe to Cloudvane to receive regular updates here

Major concepts in Apache Storm are 4 elements: streams, spouts, bolts and topologies.

Tuples in Apache Storm

Tuples in Apache Storm

Streams are an unbound sequence of Tuples, a Spout is a source of streams, Bolts process input streams and create new output streams and a topology is a network of Bolts and Spouts.   The header image is provided as Creative Commons license by MattysFlicks.

Kick Start: Big Data is an E-Book about Big Data. A kick start is an ebook that readers can read within short amount and get started really fast without the need to invest days in reading a book. The target of Kick starts is to learn all the important things about a specific topic in a short and easy to read ebook. The first of this series is on Big Data. Readers will learn what Big Data is, what core technologies are involved and where you can go from there. Some technologies featured in this ebook are: Hadoop, NoSQL Databases, Data Storage techniques, Data analytic techniques and many more.

Availabe in Amazon Stores:

Index:

Introduction to Big Data…………………………………………………………………. 7

  1. 1.1  Defining Big Data……………………………………………………………………. 7
  2. 1.2  Characteristics for Big Data……………………………………………………. 14

Challenges for Big Data ………………………………………………………………… 23

  1. 2.1  Storage Performance ……………………………………………………………. 23
  2. 2.2  Different Storage Systems …………………………………………………….. 25
  3. 2.3  Data partitioning and concurrency …………………………………………. 26
  4. 2.4  Moving Data for Analysis ………………………………………………………. 27

Creating Big Data Applications………………………………………………………. 29

3.1 Big Data Analysis iteration …………………………………………………….. 29

Big Data Management …………………………………………………………………. 32

4.1 Hardware Foundations …………………………………………………………. 32

  1. 4.1.1  Storage devices …………………………………………………………….. 32
  2. 4.1.2  Raid Systems ………………………………………………………………… 33
  3. 4.1.3  Requirements for private and public Cloud Solutions ………… 34

4.2 Data Storage and Software attributes …………………………………….. 39

  1. 4.2.1  Data Quality Attributes ………………………………………………….. 40
  2. 4.2.2  CAP Theorem ……………………………………………………………….. 42
  3. 4.2.3  Relational Database Management Systems ……………………… 45
  1. 4.2.4  NoSQL………………………………………………………………………….. 48
  2. 4.2.5  Hybrid RDBMS/NoSQL Systems ………………………………………. 52

Big Data Platforms ………………………………………………………………………. 55

5.1 Apache Hadoop……………………………………………………………………. 55

5.1.1 Hadoop Projects……………………………………………………………. 55

Big Data Analytics………………………………………………………………………… 58

  1. 6.1  Machine Learning…………………………………………………………………. 58
  2. 6.2  Data Mining…………………………………………………………………………. 58
  3. 6.3  Apache Mahout……………………………………………………………………. 60

Big Data Utilization………………………………………………………………………. 61

Appendix ……………………………………………………………………………………. 63

  1. 8.1  Table of Figures ……………………………………………………………………. 63
  2. 8.2  Table of Listings……………………………………………………………………. 64

References …………………………………………………………………………………. 65

 

Cover Image Copyright: Pete (https://www.flickr.com/photos/comedynose/) Cover Image Licensed under the Creative Commons License 2.0 (https://creativecommons.org/licenses/by/2.0/)