Windows Azure Tutorial: Getting Started with Windows Azure

Windows Azure is Microsoft’s answer to Cloud Computing. It was first introduced by Microsoft in 2008 at the PDC in Los Angeles. Since then, many new services have arrived for Windows Azure. In this tutorial series, we will cover all services and take a detailed look at what you can achieve with Windows Azure.
Let us first start with an infographic on Windows Azure and its services.

Overview of Windows Azure Services


Each Service Section contains some Services, so this is only “the big picture” of Windows Azure. Let us start by digging deeper into each Service Section.
Web Sites
Windows Azure Web Sites is the entry-level product for the Cloud. With Windows Azure Web Sites, Microsoft offers an easy-to-use, out-of-the-box Cloud solution that enables .NET, PHP or Node.js development for the Cloud. The advantage is that you can scale out once your platform grows.
Virtual Machines
With Virtual Machines, Microsoft offers an Infrastructure as a Service (IaaS) platform. It was introduced after the Platform as a Service (PaaS) platform and offers companies with existing applications an easier-to-use platform than the PaaS approach. Virtual Machines can run both Linux and Windows.
Cloud Services
Windows Azure Cloud Services is the Platform as a Service (PaaS) platform from Microsoft. It was one of the originally introduced platforms and gives developers more options regarding scalability and elasticity. Windows Azure Cloud Services consists of Web and Worker Roles.
Data Management
In the section “Data Management”, Microsoft offers several Windows Azure services for data-intensive computing. We can find a traditional database built on top of Microsoft’s SQL Server, as well as two NoSQL stores: a non-relational database for large, schema-less datasets and a Blob (Binary Large Object) Storage.
Business Analytics
Business Analytics serves the need for Big Data Applications in the Cloud. With these services, customers can use SQL Reporting (Windows Azure SQL Reporting), Data Marketplace for large Datasets and Hadoop for MapReduce operations.
Caching features some services that aim to speed up Websites for end users. One of the services is the Windows Azure Content Delivery Network (CDN), with several edge locations all over the globe. The other service is Windows Azure Caching, which allows Web Applications to be more performant.
Some businesses have requirements for more security and privacy. Microsoft delivers services to facilitate that with Windows Azure Networking. The services are the Virtual Network, which allows users to build up a Virtual Private Network (VPN), Windows Azure Connect, which allows direct connections between machines, and the Windows Azure Traffic Manager for load balancing.
Windows Azure Identity offers the service Windows Azure Active Directory, which is used for hybrid cloud solutions. If you need to connect the Cloud to an existing Active Directory, this is a service that might be interesting for you.
Between different Cloud instances or services, communication is often necessary. To do this, Messaging can be used. Windows Azure offers two services for that: the Windows Azure Service Bus and Windows Azure Queues.
Media Services
Windows Azure Media Services offers a workflow service to build, manage and distribute media. This service can be used to deliver media content such as videos or music to your customers.
Windows Azure Marketplace is a store for selling Software as a Service (SaaS) Applications or Datasets. It is used by developers who build their applications on top of Windows Azure services.
In the next posts, we will go into more detail on each service. In the tutorial, we will also feature hands-on sections for developers.
The Image “Windows Azure” was taken from the Microsoft Website at the official Press Site. This Image is copyright protected by Microsoft.

The art of Cloud Computing Scalability

With Cloud Computing, we often hear that “Scaling” is very easy. In most cases, it actually is! You can simply add new virtual machines/instances on demand, with only seconds or minutes to provision them. However, there are some other factors that improve or limit scalability. The reason is simple: if your software is built in a way that disallows scaling, you can never use the benefits of the Cloud. Just because you can scale your instances on demand doesn’t mean that your software allows that!
Imagine the following scenario: your Web Application stores data like images in the file system. Suddenly, you need more performance as your business grows. So what would you do? Exactly, simply add a new instance. After a while, your users and customers complain that they can’t find their images all the time. Eventually, you figure out that the load balancer directs each request to either one of the virtual machines you are using. One machine contains the images (stored on its file system), the other doesn’t (as it is the new one). If your software had been built in an SOA style, you might have used a service instead of the file system. To really scale out your software, you now need to invest a significant amount of time into re-engineering.
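To make this concrete, here is a toy Python sketch of the SOA idea: uploads go through a storage service that every instance shares, instead of the local file system. The class and all names are invented for illustration; a real system would put a blob store service behind this interface.

```python
# Toy sketch of the SOA idea from the scenario above: uploads go through a
# shared storage service instead of each machine's local file system. The
# class and names are invented for illustration.

class SharedBlobStorage:
    """Stand-in for a shared blob store that every web instance can reach."""
    _blobs = {}  # class-level dict: shared across all "instances" in this toy

    def save(self, name, data):
        SharedBlobStorage._blobs[name] = data

    def load(self, name):
        return SharedBlobStorage._blobs.get(name)

# Two web servers behind a load balancer each use their own storage object,
# yet both see the same uploaded image:
server_a, server_b = SharedBlobStorage(), SharedBlobStorage()
server_a.save("avatar.png", b"\x89PNG...")
assert server_b.load("avatar.png") == b"\x89PNG..."
```

With the file-system approach, `server_b` would have returned nothing for an image uploaded through `server_a`; hiding storage behind a service is what keeps the instances interchangeable.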
First of all, we have to clear up some confusion: when talking about scalability or scaling, people often talk about “Elasticity” as well. But what is the difference between those two terms? Aren’t they the same thing? There is a significant difference between “Scalability” and “Elasticity”. When we talk about scaling, we talk about the possibility to add and remove instances. This is just the possibility itself. Elasticity means that we have exactly the power we need for our services. If we need, for instance, 4 virtual machines, we can scale up or down to them, but we still have to do some work to achieve that (e.g. starting a new instance in the AWS Management Console). If we talk about elasticity, it means that we have exactly the power of 4 machines when they are necessary and we don’t even have to care about how they get provisioned – it is automated! The goal for your software should be to enable elasticity, not just scalability.
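To illustrate the difference, here is a minimal Python sketch of the kind of rule an elasticity mechanism automates; all thresholds and names are illustrative, not a real cloud API:

```python
# Minimal sketch of an elasticity rule: capacity follows demand automatically.
# All names and thresholds here are illustrative, not a real cloud API.

def desired_instance_count(current_instances, avg_cpu_percent,
                           scale_up_at=70, scale_down_at=30,
                           min_instances=1, max_instances=10):
    """Return how many instances we *should* run, given average CPU load."""
    if avg_cpu_percent > scale_up_at and current_instances < max_instances:
        return current_instances + 1   # demand grew: add an instance
    if avg_cpu_percent < scale_down_at and current_instances > min_instances:
        return current_instances - 1   # demand shrank: remove one
    return current_instances           # load is in the comfort zone
```

With scalability alone, an operator applies this decision by hand in a management console; with elasticity, a loop like this runs unattended against the platform’s provisioning API.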
Whenever we talk about Scaling, we mainly talk about the factor “Hardware” but forget about other important things to scale out, such as:

Scaling Software Teams


If you add more and more instances as your business grows, you might also add more features to your application. However, more features also mean more code to maintain, more bugs to eliminate and more updates to deliver. Therefore, you need to increase your team size, which isn’t easy at all. Just because you add 2 new developers to a team of 2 developers (doubling the number of developers), you simply don’t double the performance. Most of the time, the performance is even below the original performance for the first weeks, until they become a “real team”. In all cases, there is a significant overhead in communication and organisation involved, which can’t be solved that easily. Before taking care of scaling your software, you might consider how to scale the teams that are in charge of delivering your software.
In the next posts, we will go deeper into different technologies and architectural styles that enable your software to scale out, not just up.
The Image used as Header is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license by José Ramón Polo López.

Starting a new Cloud Computing Project

More and more companies start Cloud Computing projects, with a lot of startups among them. But what do you have to take into account when you start a new Cloud Computing project? Is it safe to just start, or should you consider some best practices in project management? And if so, what do you need to do regarding Cloud Computing? In this article we will give a brief overview of what is necessary to get started with project management for Cloud Computing.
Project Management itself consists of the following 5 iterations:

1. Project Initialisation
2. Project Creation
3. Project Planning
4. Execution
5. Introduction

Each iteration contains some sub-processes, which we will now describe in more detail.

Project Management in the cloud


The first iteration, Project Initialisation, basically starts when someone realizes that there is a need for a project or someone has a business idea. Normally, you start by defining some KO (knock-out) criteria, like feasibility or already existing products/platforms. If your project “passes” these KO criteria, you move on to evaluating project risks. This task often comes with a SWOT analysis, in which you check the project for Strengths, Weaknesses, Opportunities and Threats. Another important factor to analyse are the stakeholders, since they can affect the project. Stakeholders are people or organizations that have a special interest in the project or product, such as the top management, customers, owners or co-workers.
Once you are done with the basics, it is necessary to do a less coarse-grained evaluation of the project. Initially, you would analyse which platform will be used. This is a very important part for Cloud Computing, as different platforms serve different technologies and offer various possibilities. A major threat is the risk of a vendor lock-in, which shouldn’t happen at all. Therefore, it is necessary to look at the platform and figure out how easy it is to move to other platforms or services. Once you decide to use a specific platform, it is necessary to evaluate what knowledge about this platform is available within the team/company. Other important factors are the platform costs and the interoperability with external services.

In the next phase, Project Creation, you start with planning the project and setting some variables for it. You would also create the initial project plan and select the project organisation. The project organisation heavily depends on team size and the knowledge of team members. If you are in a very agile environment, you might not want to have a very strict project organisation, whereas other environments require a rather “fixed” organisation form.

With Project Planning, you start detailing your project with costs and detailed delivery dates. In Project Planning, you would also select the iteration model for the project. As in “Project Creation”, you need to select the iteration model based on your environment. If it is heavily agile, you might not use the “V-Model” but rather XP or Scrum-like techniques.

The phase Execution requires detailed monitoring of what is going on in the project. Some key indicators are the project milestones and the budget. If these indicators are out of bounds, adjustments might be necessary.

The last phase, Introduction, is often underestimated, since it requires additional knowledge of go-to-market strategies and very good marketing. Often, there is no budget left after the project has finished. Now imagine you have a great product but no money to tell potential customers about it. A clear go-to-market strategy is necessary in order to complete a project successfully.

Pricing Models for Cloud Computing

Whenever we talk about Cloud Computing, we also talk about the positive effects it has on costs. The reason for that is simple: at first sight, Cloud Computing offers look very cheap. Most of the time, we have per-hour prices far below $1 per hour. But the truth is that in most cases we have to keep an instance up and running for the full month. Different Cloud Computing providers offer different pricing models. This is our guide to navigate you through the pricing models we have to deal with in the Cloud.
Fixed Costs

Fixed costs in Cloud Computing


The easiest costs to calculate with are fixed costs. This means that there is exactly one price per month. The price consists of the monthly fee multiplied by the number of service units. A service unit is the billing metric, which is most of the time the number of users. If you consume a service in your company and you need 10 users for this service, the service unit is the user. Software as a Service (SaaS) is the most common usage for this. Let’s assume you rent an e-mail service for your company. Your company consists of 10 people, and each person needs an e-mail account. The price per user is $5 per month. Summing it up, you need to pay $50 per month for your e-mail service. The formula, in short:
Service charge per month × service units.
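The e-mail example above can be sketched in a few lines of Python (the numbers are the ones from the text):

```python
# Fixed costs: monthly fee per service unit multiplied by the service units.
def fixed_monthly_cost(price_per_unit, service_units):
    return price_per_unit * service_units

# The e-mail example from the text: 10 users at $5 per user and month.
print(fixed_monthly_cost(5, 10))  # 50
```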
Variable Costs
Whereas Software as a Service solutions are very easy to calculate, Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) get more complicated.

Flexible costs in Cloud Computing

What we can see here is a more flexible cost approach, where you have to pay for several different items. These costs are common with the big players in Cloud Computing such as Amazon, Google, Microsoft and Rackspace. We have to deal with several different structures. At the core of each platform you might rent an instance. Each instance runs for a certain number of hours, named compute hours. Those platforms often require storage, e.g. for media data (images, videos) or databases. To calculate storage, the providers charge a per-GB price. This price has to be multiplied by the number of GB stored. Since PaaS and IaaS solutions consume network and Internet traffic, we also get charged for incoming and outgoing bandwidth. The same applies as with storage: the charge is based on a per-GB fee. Last but not least, storage transactions also fall into this topic. A storage transaction is an operation you perform on objects or datasets, such as deleting, creating, updating or reading data. Most of the time, this is measured via the REST keywords. The formula in short:
Costs per Compute Hour × Compute Hours
+ Costs per GB Stored × GB Stored
+ Costs per GB Incoming Bandwidth × GB of Incoming Bandwidth
+ Costs per GB Outgoing Bandwidth × GB of Outgoing Bandwidth
+ Costs per Storage Transaction × Number of Storage Transactions
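As a sketch, the formula above can be turned into a small Python function; the prices in the example are made-up illustration values, not any provider’s real rates:

```python
# Variable costs: each metered dimension is priced and summed, as in the
# formula above. The example prices below are invented for illustration and
# are not any provider's real rates.
def variable_monthly_cost(compute_hours, price_per_hour,
                          gb_stored, price_per_gb_stored,
                          gb_in, price_per_gb_in,
                          gb_out, price_per_gb_out,
                          transactions, price_per_transaction):
    return (compute_hours * price_per_hour
            + gb_stored * price_per_gb_stored
            + gb_in * price_per_gb_in
            + gb_out * price_per_gb_out
            + transactions * price_per_transaction)

# One instance for a 30-day month (720 compute hours) plus storage, traffic
# and a million storage transactions:
month = variable_monthly_cost(720, 0.10,            # compute
                              50, 0.12,             # GB stored
                              100, 0.05,            # GB incoming bandwidth
                              200, 0.05,            # GB outgoing bandwidth
                              1_000_000, 0.00001)   # storage transactions
print(round(month, 2))  # roughly 103 dollars for the month
```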
Hybrid Costs
Hybrid costs are a mixed form of the former two cost structures, where you find a variable part and a fixed part. This doesn’t occur too often, but it makes sense in some cases, since it adds possibilities to save money. Currently, this model is offered by Amazon Web Services (AWS) under the name “Reserved Instances”. The key is that you pay a certain amount upfront and your hourly fee gets reduced. Normally, the break-even point for this Cloud Computing pricing model is reached after 6 months or more. So if you expect your instance to run for more than 6 months, you might look into that.
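The break-even reasoning can be sketched as a small calculation; the upfront and hourly prices below are invented for illustration, not actual AWS rates:

```python
# Break-even for a reserved instance (hybrid cost model): after how many
# compute hours does the upfront payment pay off? All prices are invented
# illustration values, not actual AWS rates.
def break_even_hours(upfront, reserved_hourly, on_demand_hourly):
    return upfront / (on_demand_hourly - reserved_hourly)

hours = break_even_hours(300, 0.04, 0.10)   # $300 upfront, $0.06/h saved
months_24_7 = hours / 720                   # 720 hours in a 30-day month
# With these numbers the break-even is ~5000 hours, i.e. about 7 months of
# around-the-clock usage – consistent with the "6 months or more" above.
```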
Flexible Costs
Often talked about and offering some great benefits, but what are flexible costs? Normally, flexible costs are referred to as “spot instances”. This means that you enter a certain amount that you are willing to pay for an instance. If the spot price is lower than the amount you entered, your instance is ready to use. This pricing model is currently offered by Amazon Web Services (AWS). Spot prices give us more flexibility on our costs, since your instances are only available if the price is at a certain level. Especially startups or universities will find this very interesting.
Having a closer look at pricing models in the Cloud gives greater flexibility and enables us to save money. There are also more advanced tools that help to save money, such as newvem, which we will talk about later.

Economic Principles for Cloud Computing

Cloud Computing is not just a new hype; it also delivers some great benefits for businesses. Cloud Computing is nothing ground-breaking new, but rather a new name for already existing technologies. This also applies to businesses, since factors that apply to outsourcing mostly apply to Cloud Computing as well.
This Article is inspired by the Whitepaper “Wildemann, 1987” on outsourcing.
Wildemann describes 5 key factors for outsourcing: Strategy, Performance, Costs, Financials and Human Resources. Each key factor consists of several sub-factors.

Strategy as indicator for Cloud Computing


Strategy describes the opportunities for a business using outsourcing or Cloud Computing. It doesn’t describe benefits for the Cloud Computing provider. Imagine you are a company that manufactures supplies for cars. It is definitely important to have a performant IT up and running, but why would you build all the IT on your own? For a company, it is important to concentrate on the core business. IT should support the company, but it shouldn’t use too many resources. If a company outsources its IT department, it also gains higher flexibility. Another factor comes with risk – why would you want to care about possible server outages? With Cloud Computing, you simply transfer this risk to your Cloud Computing provider. Last but not least, outsourcing gives greater standardisation possibilities. With Cloud Computing, standards get more and more important.
Performance as indicator for Cloud Computing


Performance describes the outsourcer’s point of view. This means the increased performance a Cloud Computing provider can deliver. If you do something very often, you might get really good at it after a while. Different companies focused on delivering their products and improving them, as they specialized in them. Specialisation leads to higher performance by the service provider. Between Cloud Computing providers and companies using their services, SLAs are often used to define the services and responsibilities between them. If the IT is not outsourced but within your company, you might have no defined services. What we can already see in the Cloud is the high level of service orientation. All Cloud platforms contain “as a Service” in their names and the services are built for service orientation. Companies that keep their IT on-premise often run into scaling issues, as services are not available on demand. Cloud Computing providers have on-demand availability of services.

Financial effects of Cloud Computing


Costs are often mentioned and discussed when we talk about Cloud Computing. Unfortunately, they are often referred to as the “main” factor to move to the Cloud, while the other 4 factors are not even mentioned in many cases. However, a good thing about Cloud Computing is definitely the fact that costs can be planned easily. If you need another instance, you know exactly what the rate per hour will be. On the other hand, your Chief Financial Officer will love your IT department for transferring Capex into Opex. Controlling and accounting departments often prefer operational expenses (Opex) over capital expenses (Capex). If you buy a new car, would you rather pay the full sum at once or would you rather lease it? Often leasing is preferred, as it doesn’t require you to have all the money in cash.
Human Resources is always a difficult thing in IT companies. To find highly qualified staff, expensive recruiting is often necessary. It is no secret that there is a lack of IT staff. If you outsource your IT department, this problem is also “outsourced”, since it is no longer necessary for you to find qualified IT staff. And you can focus on the topic mentioned in “Strategy”: focus on your core competency and find staff for what you need in your company.

Financials is the last, but not least, topic we discuss when it comes to Cloud Computing. As discussed in “Costs”, Capex is transformed into Opex. This has positive effects on the balance sheet at the end of the year: costs are spread over several years rather than incurred in a single year.

So, there are several positive effects of Cloud Computing, not just money. Unfortunately, money is the one that is referred to in most cases. If you talk about the Cloud again, try to address the other topics as well.

Creating a distributed, scalable WordPress Platform on Amazon Web Services (AWS)

For CloudVane we wanted to have a highly scalable, distributed and performant platform that is also easy to maintain. These challenges weren’t that easy to achieve, and initially we had to find a system. As CloudVane is all about the Cloud, the solution was easy: it must be a Cloud provider. We selected Amazon Web Services to serve our magazine.
To better understand the performance of WordPress, we wanted to have a system that allows us to handle about 8 million hits per day. So we started with a standard WordPress installation on Ubuntu with MySQL, just to figure out what is possible (and what isn’t). We didn’t add any plugins; the first tests were on a really plain system.
For the test, we used a load-testing service, which returns great statistics about the test run. Our first test gave us the following results:

  • Delay: 477 ms from Virginia
  • 60-second test run with a maximum of 20 users per second
  • Response time with 20 users per second was about 1 second

So what does this mean? First of all, we can handle about 20 users per second. However, the delay of 1 second is not good. Per day, we would handle about 560,000 hits, so we are still far away from our target of 8 million hits per day. The CPU utilization wasn’t good either – it turned out that our instance runs at 100%. So this is the very maximum of an out-of-the-box WordPress installation. Below you can see some graphics on the test run.
Test Run #1:
60 Seconds, maximum of 20 Users per Second:

Performance for an AWS Micro Instance measured by a Load Test


Amazon Performance with WordPress and a Micro Instance on EC2


As you can imagine, this simply does not meet our requirements. As a first step, we wanted to achieve better scaling effects for CloudVane. Therefore, we started up another Micro Instance with Amazon RDS. On the RDS instance, we took advantage of the ready-to-use MySQL database and connected it as the primary database for our WordPress platform. This gives us better scaling effects, since the WordPress instance itself doesn’t store our data anymore. We can now scale out our database and Web frontend(s) independently of each other.
But what about images stored on the platform? They are still stored on the Web frontend. This is a tough problem! As long as we store our images on the instance, scaling gets really tough. So we wanted to find a way to store those images on blob storage. Good to know that Amazon Web Services offers a service called “Simple Storage Service”, or “S3” in short. We integrated this service to replace the default storage system of WordPress. To boost performance, we also added a Content Delivery Network. There is another service by Amazon Web Services, called “CloudFront”. With CloudFront, content is delivered from various edge locations all over the globe. This should boost the performance of our platform.
As a final add-on, we installed “W3 Total Cache” to boost performance by caching data. This should also significantly boost our performance. But now let’s have a look at the new load test, again with the same load-testing service. For our test, we used the maximum we can do with our free tier: 250 concurrent users.
The output was:

  • An average of 15 ms in delay
  • More than 10 million hits per day

Summing this up, it means that we achieved what we wanted: a fully scalable, distributed and performant WordPress platform. It is nice what you can do with a really great architecture and some really easy tweaks. Below are some graphics of our test run.

Load Testing an Amazon Web Service Micro Instance with Caching


Amazon CPU Load on a Micro Instance with Caching and CDN


Creating a simple WordPress Blog with the Bitnami Stack on Amazon EC2

It is really easy to create a simple WordPress Blog on Amazon EC2 with the Bitnami Stack. To do so, simply click on “Launch Instance” in the Console.

Launch a new AWS Instance


Next, we get a Dialog where we can select the Wizard. For our sample, we use the “Classic” Wizard.
Create a new AWS EC2 Instance with the Wizard


In the “Request Instances Wizard”, we now select the tab “Community AMIs” and type “wordpress” into the search box. This lists several WordPress-enabled instances.
Available AWS Community AMIs


We select an AMI that has the most recent WordPress version installed. In the current case, it is “ami-018c8875”, but it might change over time.
In the next Dialog, we make sure to have “Micro” as Instance Type selected. This is the cheapest available instance type on EC2.
Select an instance type on AWS


We simply confirm the next few dialogs until we get to the point where we need to create a key pair. This is necessary later when we connect to the instance.
Create a new Key-Pair for an EC2 Instance on AWS


In the last dialog, simply click “Launch” and the instance will be started.
Don’t forget to configure the security groups. If it is your first time with AWS, you might not yet have allowed HTTP connections through the firewall.
Amazon Web Services, the “Powered by Amazon Web Services” logo, are trademarks of, Inc. or its affiliates in the United States and/or other countries.

NoSQL as the Trend for databases in the Cloud?

SQL seems to be somewhat old fashioned when it comes to scalable databases in the cloud. Non-relational databases (also called NoSQL) seem to take over in most data storage fields. But why do those databases seem to be more popular than the “classic” relational databases? Is it due to the fact that professors at universities “tortured” us with relational databases and therefore reduced our interest – the interest of the “new” generation for relational databases? Or are there some hard facts that tell us why relational databases are somewhat out of date?
I was at a user group meeting in Vienna, Austria, a month ago, where I talked about NoSQL databases. The topic seemed to be of interest to a lot of people. However, we sat together for about four hours (my talk was planned for one hour only) discussing NoSQL versus SQL. I decided to summarize some of the ideas in a short article, as this is useful for cloud computing.
If we look at what NoSQL offers, we’ll find numerous NoSQL databases. Some of the most popular ones are MongoDB, Amazon Dynamo (Amazon SimpleDB), CouchDB, and Cassandra. Some people might think that non-relational databases are for those who are too “lazy” to implement their complex business logic in the database. In fact, this logic reduces the performance of a system. If there is a need for a highly responsive and available system, SQL databases might not be your best choice. But why is NoSQL more responsive than SQL-based systems? And why is there this saying that NoSQL allows better scalability than SQL-based systems? To understand this topic, we need to go back 10 years.
Dr. Eric A. Brewer, in his keynote at the Symposium on Principles of Distributed Computing 2000 (“Towards Robust Distributed Systems”, 2000), addressed a problem that arises when we need high availability and scalability. This was the birth of the so-called “CAP Theorem.” The CAP Theorem says that a distributed system can only achieve two out of the three properties “Consistency, Availability and Partition tolerance.” This means:

  • That every node in a distributed system should see the same data as all other nodes at the same time (consistency)
  • That the failure of a node must not affect the availability of the system (availability)
  • That the system stays tolerant to the loss of some messages (partition tolerance)

Nowadays when talking about databases we often use the term “ACID,” but NoSQL is related to another term: BASE. BASE stands for “Basically Available, Soft state, Eventually consistent.” If you want to go deeper into eventual consistency, read the post by Werner Vogels – Eventually Consistent revisited. BASE states that all updates that occur to a distributed system will become consistent eventually, after a period of no updates. For distributed systems such as cloud-based systems, it is simply not possible to keep a system consistent at all times; insisting on that results in bad availability.
To understand eventual consistency, it might be helpful to look at how Facebook handles their data. Facebook uses MySQL, which is a relational (SQL) database. However, they simply don’t use features such as joins that MySQL offers them; Facebook joins data on the web server. You might think “What, are they crazy?” However, the problem is that the joins Facebook needs would sooner or later result in a very slow system. David Recordon, Manager at Facebook, stated that joins perform better on the web server [1]. Facebook must know what good performance is, as they will store some 50 petabytes of data by the end of 2010. Twitter, another social platform that needs to scale, should also think about switching to NoSQL platforms. This would hopefully reduce the “fail whale” to a minimum [2].
Summing it up, NoSQL is relevant for large-scale global Internet applications. But are there any other benefits of NoSQL databases? Another benefit is that there is often no schema associated with a table. This allows the database to adapt to new business requirements. I’ve seen a lot of projects where the requirements changed over the years. As this is rather hard to handle with traditional databases, NoSQL allows easy adaptation to such requirements. A good example of this is Amazon. Amazon stores a lot of data on their products. As they offer products of different types – such as personal computers, smartphones, music, home entertainment systems and books – they need a flexible database. This is a challenge for traditional databases. With NoSQL databases it’s easy to implement some kind of inheritance hierarchy – just by calling the table “product” and letting every product have its own fields. Databases such as Amazon Dynamo handle this with key/value storage. If you want to dig deeper into Amazon Dynamo, read Eventually Consistent [3] by Werner Vogels.
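As a toy illustration of such schema-less storage, here is a minimal in-memory key/value sketch in Python; it only mimics the idea, not Amazon Dynamo’s actual API:

```python
# Minimal in-memory sketch of schema-less key/value storage: each "product"
# carries only the fields that make sense for it. This mimics the idea only,
# not Amazon Dynamo's actual API.
products = {}

def put(key, attributes):
    products[key] = dict(attributes)

def get(key):
    return products.get(key)

put("book-001",  {"type": "book",  "title": "Dune",    "pages": 412})
put("phone-001", {"type": "phone", "title": "Model X", "display_inches": 4.3})
# No shared schema: the book has "pages", the phone has "display_inches",
# and nothing forces either item to carry the other's columns.
```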
Will there be some sort of “war” between NoSQL and SQL supporters like the one of REST versus SOAP? The answer is maybe. Who will win this case? As with SOAP versus REST, there won’t be a winner or a loser. We will have more opportunities to choose our database systems in the future. For data warehousing and systems that require business intelligence to be in the database, SQL databases might be your choice. If you need high-responsive, scalable and flexible databases, NoSQL might be better for you.

  1. Facebook infrastructure
  2. Twitter switches to NoSQL
  3. Eventually Consistent
This post was originally posted by Mario Meir-Huber on Sys-Con Media.


Design Guidelines for Cloud Computing and Distributed Systems

Infrastructure as a Service and Platform as a Service offer us easy scaling of services. However, scaling is not as easy as it seems to be in the Cloud. If your software architecture isn’t done right, your services and applications might not scale as expected, even if you add new instances. As for most distributed systems, there are a couple of guidelines you should consider. I have summed up the ones I use most often for designing distributed systems.
Design for Failure
As Murphy’s law states, everything that can fail will fail. So it is very clear that a distributed system will fail at some point, even though cloud computing providers tell us that it is very unlikely. We had some outages [1][2] of some of the major platforms in the last year, and there might be even more of them. Therefore, your application should be able to deal with an outage of your cloud provider. This can be done with different techniques, such as distributing an application over more than one availability zone (which should be done anyway). Netflix has a very interesting approach to steadily testing their software for errors – they have employed an army of “Chaos Monkeys” [3]. Of course, they are not real monkeys. It is software that randomly takes down different instances. Netflix produces errors on purpose to see how their system reacts and if it is still performing well. The question is not if there will be another outage; the question is when the next outage will be.
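A toy version of the idea can be sketched in Python; a real chaos monkey would call the cloud provider’s API to terminate actual instances, while everything here is an in-memory stand-in with invented names:

```python
import random

# Toy "chaos monkey": randomly terminates instances in a simulated fleet so
# you can check whether the surviving system still serves traffic. A real
# implementation would call the cloud provider's API; everything here is an
# in-memory stand-in with invented names.

def unleash_chaos(instances, kill_probability=0.2, rng=random):
    """Return the instances that survive this round of random terminations."""
    return [i for i in instances if rng.random() >= kill_probability]

fleet = ["web-1", "web-2", "web-3", "worker-1"]
remaining = unleash_chaos(fleet, kill_probability=0.5)
# The architecture passes the test only if the application keeps working
# with whatever is left in `remaining`.
```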
Design for at Least Three Running Systems
For on-premise systems, we always used to do an “N+1” design, and this still applies in the cloud: there should always be one more system available than actually necessary. In the cloud, this can easily be achieved by running your instances in different geographical locations and availability zones. In case one region fails, the other region takes over. Some platforms offer intelligent routing and can easily forward traffic to another zone if one zone is down. In addition, there is the “rule of three,” which basically says you should have three systems available: one for me, one for the customer and one in case of a failure. This will significantly reduce the risk of an outage.
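The routing side of the N+1 idea can be sketched very simply. This is a toy illustration, not a platform API: the zone names and the `healthy` callback are assumptions standing in for your provider’s health checks and traffic manager.

```python
def pick_zone(zones, healthy):
    """Route to the first healthy zone; fall through to the N+1 spares on failure."""
    for zone in zones:
        if healthy(zone):
            return zone
    raise RuntimeError("all zones are down")
```

With three zones provisioned, a single-zone outage is absorbed without any manual intervention; only the simultaneous loss of all three raises an error.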
Design for Monitoring
We all need to know what is going on in our datacenters and on our systems, so monitoring is an important aspect of every application you build. If you want to design intelligent monitoring, I/O performance and similar metrics are not the only things that matter. Ideally, your system should be able to “predict” future load, either from statistical data in your application’s history or from your application’s domain. If your application is about sports betting, you might see high load during major sports events; if it is for social games, your load might be higher during the day or when the weather is bad outside. In any case, your system should be monitored at all times, and it should warn you when a major failure might be coming up.
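A very simple way to move from reactive to predictive monitoring is to extrapolate a trend from recent samples. The sketch below is a naive forecast under my own assumptions (a fixed sample window, capacity expressed in requests per second); real systems would use proper time-series models.

```python
from collections import deque

class LoadMonitor:
    """Track recent request rates and flag when predicted load nears capacity."""

    def __init__(self, capacity, window=5):
        self.capacity = capacity
        self.samples = deque(maxlen=window)  # keeps only the last `window` samples

    def record(self, requests_per_sec):
        self.samples.append(requests_per_sec)

    def predicted_load(self):
        # naive forecast: average of the window plus the trend across it
        if len(self.samples) < 2:
            return self.samples[-1] if self.samples else 0.0
        average = sum(self.samples) / len(self.samples)
        trend = self.samples[-1] - self.samples[0]
        return average + trend

    def should_scale_out(self, threshold=0.8):
        # warn before capacity is reached, not after
        return self.predicted_load() > threshold * self.capacity
```

The point is the shape of the design: the monitor raises the alarm while load is still rising, giving you time to add instances before users notice.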
Design for Rollback
Large systems are typically owned by different teams in your company. This means that a lot of people work on your systems and rollouts happen often. Even with a lot of testing involved, it will still happen that new features affect other services of your application. To guard against that, your application should provide an easy rollback mechanism.
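The core of an easy rollback mechanism is simply keeping previous releases around instead of overwriting them. This is a minimal sketch of that idea; the class and method names are my own, not any particular deployment tool’s API.

```python
class Deployment:
    """Keep prior releases so a bad rollout can be reverted in a single step."""

    def __init__(self):
        self.history = []  # ordered list of released versions, newest last

    def roll_out(self, version):
        self.history.append(version)

    def current(self):
        return self.history[-1] if self.history else None

    def roll_back(self):
        if len(self.history) < 2:
            raise RuntimeError("no previous release to roll back to")
        self.history.pop()  # discard the bad release
        return self.current()
```

Blue-green deployments follow the same principle: the previous release stays deployed and reachable, so reverting is a routing change rather than a re-deployment.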
Design No State
State kills. If you store state on your servers, load balancing becomes much more complicated. State should be eliminated wherever and whenever possible, and there are several techniques to reduce or remove it. Modern devices such as tablets and smartphones have sufficient performance to keep state information on the client. Every service call should be independent, so it shouldn’t be necessary to hold session state on the server; all session state should be transferred to the client, as described by Roy Fielding [4]. Architectural styles such as ROA (Resource-Oriented Architecture) support this idea and help you make your services stateless. I will dig into REST and ROA in one of my upcoming articles, since they are really great for distributed systems.
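One common technique for moving session state to the client is to hand it out as a signed token: the server keeps nothing, and any server behind the load balancer can verify the token. The following is a bare-bones sketch of that pattern (the secret key and token format are illustrative assumptions, not a production scheme):

```python
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # assumption: a key shared by all server instances

def issue_token(state: dict) -> str:
    """Serialize session state and sign it so the client can carry it."""
    payload = json.dumps(state, sort_keys=True)
    signature = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + signature

def verify_token(token: str) -> dict:
    """Recover the state from the token; reject it if the client tampered with it."""
    payload, signature = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("tampered session state")
    return json.loads(payload)
```

Because each request carries everything the server needs, any instance can handle it, which is exactly what makes stateless services trivial to load-balance and scale out.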
Design to Disable Services
It should be easy to disable services that are performing badly or influencing your system in a way that poisons the entire application. Therefore, it is important to isolate services from each other so that one failing service cannot take down the entire system’s functionality. Imagine the comment function on Amazon is not working: comments might be essential for making up your mind about a book, but their absence wouldn’t prevent you from buying it.
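In practice this is often implemented with feature switches (kill switches) that operators can flip at runtime. The sketch below illustrates the idea with the Amazon-comments example from above; the names are hypothetical, not a real library.

```python
class FeatureSwitch:
    """Central registry of kill switches so a misbehaving service can be disabled."""

    def __init__(self):
        self._disabled = set()

    def disable(self, name):
        self._disabled.add(name)

    def enable(self, name):
        self._disabled.discard(name)

    def enabled(self, name):
        return name not in self._disabled

switches = FeatureSwitch()

def render_product_page(title):
    page = {"title": title, "buy_button": True}
    # comments are optional: the page (and the purchase) still works without them
    if switches.enabled("comments"):
        page["comments"] = ["great book!"]
    return page
```

The essential design choice is that the caller degrades gracefully: disabling a feature removes it from the page instead of raising an error that would break the whole request.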
Design Different Roles
With distributed systems, we have a lot of servers involved, and it is necessary to scale not a monolithic front-end or back-end server, but individual services. If there is exactly one front-end system that hosts all roles and a specific service experiences high load, why should it be necessary to scale up all services, even those with minor load? You can improve your systems by splitting them up into different roles. As already described by Bertrand Meyer [5] with Command Query Separation, your application should be split into different roles. This is basically a key principle of SOA applications; however, I still see that most services are not separated. There should be more separation of concerns based on the services. Implement some kind of role separation for your application and services to improve scaling.
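Applied to services, Command Query Separation means splitting the read path and the write path into roles that can be deployed and scaled independently. A minimal sketch, with an in-memory dict standing in for the shared store or read replica:

```python
class OrderReads:
    """Query role: answers reads only; scale this role out on read-heavy load."""

    def __init__(self, store):
        self.store = store

    def get(self, order_id):
        return self.store.get(order_id)

class OrderWrites:
    """Command role: mutates state only; scaled separately from the read role."""

    def __init__(self, store):
        self.store = store

    def place(self, order_id, item):
        self.store[order_id] = item
```

Because each role is its own deployable unit, a spike in reads is handled by adding `OrderReads` instances alone, leaving the write role untouched.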
There might be additional principles for distributed systems. I see this article as a “living” one and will extend it over time. I would be interested in your feedback: what are your thoughts on distributed systems? Email me, use the comment section here, or get in touch with me via Twitter at @mario_mh.

  1. Azure Management Outage, Ars Technica
  2. Amazon EC2 Outage, TechCrunch
  3. The Netflix Simian Army, Netflix Tech Blog
  4. Representational State Transfer (REST), Roy Fielding, 2000
  5. Command Query Separation, Wikipedia
This post was originally published by Mario Meir-Huber on Sys-Con Media.
The image displayed for this post is licensed under Creative Commons; further details about the picture can be found here.