Why one of the Top100 Blogs moved from AWS to Windows Azure

The blog I started on Cloud Computing and Big Data some years ago has been steadily growing in readership. CloudVane is also listed as one of the Top 100 Blogs on Cloud Computing (Source), which is reflected in the number of visits I get per day. To handle the increased traffic, I had to scale up my blog.
There was no question that I would use some kind of cloud platform. Until now, I had used Amazon Web Services. As I am always keen on using the newest technology, I decided to go with a Platform as a Service provider. The reasons for that vary, but the decisive factor is that I don’t want to take care of VM management and the like. What I was looking for is a platform that eases administration; ideally, I would have little or no administration at all.
I looked at the three most common platforms: Amazon Elastic Beanstalk, Google App Engine and Windows Azure. After playing with all three platforms, running load tests, comparing pricing and looking at the scalability aspects, I decided to use Windows Azure. To me it seemed to be the most mature platform in terms of PaaS (this is my personal opinion after doing some research and does not represent the opinion of my employer). Windows Azure Web Sites is very easy to handle, and the features it offers are great.
Moving to Windows Azure Web Sites was straightforward: I created a WordPress instance from the templates provided in the Windows Azure gallery, and after two configuration steps WordPress was ready to go via its one-click setup. The hardest part of the migration was moving the existing blog entries to the new blog; thanks to the Import/Export capabilities of WordPress, this was done in short time as well. Installing the plugins and so on took some more hours, but it went smoothly too.
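For those curious what the export step actually produces: WordPress exports posts as a WXR (WordPress eXtended RSS) XML file. Here is a minimal sketch of inspecting such a file with Python’s standard library; the sample snippet and post titles below are simplified illustrations, not the real export of this blog:

```python
import xml.etree.ElementTree as ET

# A minimal, simplified WXR (WordPress eXtended RSS) snippet.
# Real export files carry more namespaces and per-post metadata.
SAMPLE_WXR = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>CloudVane</title>
    <item>
      <title>Why one of the Top100 Blogs moved to Windows Azure</title>
      <content:encoded>Post body goes here.</content:encoded>
    </item>
    <item>
      <title>Are you a Data Scientist?</title>
      <content:encoded>Another post body.</content:encoded>
    </item>
  </channel>
</rss>"""

def list_post_titles(wxr_xml: str) -> list[str]:
    """Return the titles of all posts (<item> elements) in a WXR export."""
    root = ET.fromstring(wxr_xml)
    return [item.findtext("title") for item in root.iter("item")]

if __name__ == "__main__":
    for title in list_post_titles(SAMPLE_WXR):
        print(title)
```

The WordPress importer on the target site consumes exactly this kind of file, which is why the content migration was the quick part.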
In the next posts, I will talk about performance and setup/architecture of WordPress on Windows Azure.
Header Image Copyright by: leolintang

Are you a Data Scientist or what is necessary to become one?

Big Data is considered the job you simply have to go for. Some call it sexy, some call it the best job of the future. But what exactly is a Data Scientist? Is it someone you can simply hire fresh from university, or is it more complicated? Definitely the latter.
When we think about a Data Scientist, we often say that the perfect Data Scientist is a kind of hybrid between a statistician and a computer scientist. I think this needs to be redefined, since much more knowledge is necessary. A Data Scientist should also be good at analysing business cases and talking to line executives to understand the problem and model an ideal solution. Furthermore, extensive knowledge of current (international) law is necessary. In a recent study we did, we defined 5 major challenges:
perfect-data-scientist
Each of the 5 roles covers the following:

  • Big Data Business Developer: The person needs to know what questions to ask and how to cooperate with line-of-business (LOB) decision makers, and must have good social skills to work with all of them.
  • Big Data Technologist: In case your company isn’t using the cloud for Big Data analytics, you also need to be into infrastructure. The person must know a lot about system infrastructure, distributed systems, datacenter design and operating systems. It is also important to know how to run your software: Hadoop doesn’t install itself, and some maintenance is necessary.
  • Big Data Analyst: This is the fun part; here it is all about writing your queries, running Hadoop jobs, doing fancy MapReduce queries and so on! However, the person should know what to analyse and how to implement such algorithms. It is also about machine learning and more advanced topics.
  • Big Data Developer: Here it is more about writing extensions, add-ons and other stuff. It is also about distributed programming, which isn’t the easiest part itself.
  • Big Data Artist: Got the hardware/datacenter right? Know what to analyse? Wrote the algorithms? What about presenting them to your management? Exactly! This is also necessary, and you simply shouldn’t forget about it. The best data is worth nothing if nobody is interested in it because of poor presentation. Knowing how to present your data is essential.

As you can see, it is very hard to become a Data Scientist. Things are not as easy as they might seem. The Data Scientist should be a nerd in each of these fields, so the person should be some kind of „super nerd“. This might be the superhero of the future.
Most likely, you won’t find one person that is good in all of these fields. Therefore, it is necessary to build an effective team.
Header Image Copyright: Chase Elliott Clark

Big Data: Why it is not as simple as you might think!

Big Data is definitely a very complex „thing“. Why do I call it „a thing“ here? Because it is simply not a technology itself! Hadoop is a technology, Lucene is a technology, but Big Data is more of a concept, since it is nothing you can touch. Ever tried installing Big Data on your machine? Or said „I need this Big Data software“? When you talk about software or technology, you talk about a very concrete product or open-source tool.
The concept of Big Data is rather complicated when it comes to implementing it. There are several major dimensions you have to be aware of.

Big Data Dimensions

The dimensions are:

  • Legal dimension: What is necessary in terms of data protection legislation? What do you need to know about legal impacts, what kind of data are you allowed to store or collect/process?
  • Social dimension: What social impacts will you generate with your application? How will your users react to that?
  • Business dimension: What is the business model you want to generate with your Big Data platform? How can your Big Data platform support your business? What kind of pricing do you want to calculate?
  • Technology dimension: How can you achieve your targets? What technology would you use to get there? What scalable software can you use?
  • Application dimension: What industry solutions are available for your needs? How can you enable decision support based on data for your company?

If you want to address all of these questions, you need a team that is capable of fulfilling this request. In the next posts I will talk about the Big Data technology stack and what it takes to become a Data Scientist.
Header Image copyright:  Michael Coghlan. Distributed under the Creative Commons license 2.0 by Creative Commons Australia Pool.

Defining Platform as a Service – PaaS

Platform as a Service is one of the major topics when it comes to cloud computing. However, it is well below the usage and acceptance of the other levels, SaaS and IaaS. Its advantages are often not seen by the large majority, and a definition of what PaaS actually is is not always given. Therefore, I’ve started a discussion on PaaS – what it is and how it is defined. Feel free to comment on the topics and add your points if you believe I’ve missed some.
A very simple definition could be: „Platform as a Service takes away the pain of software-stack administration found on IaaS platforms and allows you to focus on your application.“
Platform as a Service attributes and characteristics
Basically, all attributes that apply to Cloud Computing also apply to PaaS. However, it is necessary to add some attributes that are specific to PaaS itself. The topics mentioned below are my first findings:

  • Advanced Service Management. It is easy to run workflows on a PaaS Solution and they are often supported by a visual designer. Monitoring is very easy and comprehensive. It is often not based on the VM level since we are talking about abstraction here.
  • Elasticity, Flexibility and automated Resources. A key feature for PaaS is the elasticity, flexibility and resource automation that is going on in the background. In a PaaS environment you simply don’t realize the automation that is going on since this problem is abstracted from you.
  • Development-focused. Developers, Developers, Developers! A PaaS-Platform is all about developers. It is dedicated to them to get rid of the pain of using a complex deployment process or to install a software stack. With a PaaS platform, you simply don’t need to take care of your stack any more. You can focus on what is key – being a developer that delivers great apps.
  • Abstract and easy to use APIs. The APIs that are available are easy to use and abstracted. If you use a messaging service that comes with the platform, it is very simple, and you don’t need to read a book first to get started.
  • Ease of operations. As stated earlier, it is also very easy to deploy new apps. In many cases, it is possible to have a test-environment and staging platforms. Deployment is often done from the Development environment itself.

Platform as a Service Types
When we look at currently available PaaS Solutions, we can see that there are some major differences in these solutions. Therefore, I believe that it is necessary to define them individually and focus on some key service types:

  • Application PaaS (aPaaS). This is a fully featured PaaS platform that allows you to build applications on a pre-configured stack. With Application PaaS, it is possible to build all kinds of applications.
  • Dedicated PaaS (dPaaS). A dedicated PaaS platform is made for a specific SaaS application; its main focus is to extend that SaaS application. The possibilities are somewhat limited compared to an aPaaS.
  • Integration PaaS (iPaaS). An iPaaS platform is made to integrate different platforms, services and applications. It is also often called a middleware.
  • Big Data PaaS (bPaaS). A bPaaS is built to handle all kinds of data driven applications. A common example is an easy to use Hadoop platform that can be run out of the box and enables the above described points.

Feel free to express your thoughts/opinion on this!

Data Protection legislation killed Big Data. Did it?

In Europe, it is rather difficult to dig into the different member states’ legislation to find out how to build data-driven applications that comply with regional law. The safe-harbour principle basically tells us that data referring to a specific person may not leave Europe. But what does that mean?
Especially US cloud and Big Data providers might find this difficult, since US law forces these companies to share data with the US government (especially the intelligence services). This is a conflict within the legislation itself. American companies are under heavy pressure, risking the loss of large European customers that want their data „safe“ – from a legal point of view.
Another problem is associated with the collection and storage of personal data. If we look at (not only) social media platforms, once you post something, it is there forever – even though you delete it.
If we focus on retail, what does that mean? Many of us have customer loyalty cards, which means that data about our behaviour is collected. This gives us the possibility to get discounts that fit our behaviour and needs, and it allows retail companies to do better marketing. On the other hand, what happens if I want my data deleted? As of now, I am not aware of any legal provisions for that. What happens with my data? Will it stay with the retail company forever, and is there anything I can do about it?
What we need is a liberal but still solid data protection standard that helps both the individual and the economy – full security for individuals isn’t possible, but companies shouldn’t be allowed to do everything either. We need to meet somewhere in the middle, which might be a difficult task for the coming years.
I invite you to join the discussion about that!