The book “The Art of Scalability” describes a very interesting approach to software architectures for distributed systems. A key challenge is that a software architecture should be smart. But what exactly is “smart”? The book uses the word in a specific sense, with the letters capitalised: we talk about a SMART architecture. Each letter represents an individual requirement a software architecture has to meet:

  • Specific
  • Measurable
  • Achievable
  • Realistic
  • Testable

Specific: The architecture should solve a concrete problem. It doesn't need to be the “coolest” one.
Measurable: The key properties of the application must be measurable. Example: the service must return the data within 1 second when 1 million people access it. Wrong: the service must be fast when a lot of people access it.
Achievable: The goals set by the architecture must be achievable. An architecture that allows everything but is too complex for the developers to implement is of no use.
Realistic: The architecture should build on the potential that already exists within the organisation. If the developers in a company work with Java, switching to a completely different technology is likely to fail.
Testable: The results must be testable.
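To make the “Measurable” and “Testable” points concrete, here is a minimal sketch in Python of how such a requirement can be expressed as an automated check rather than a vague statement. The endpoint URL, the sample size and the thresholds are hypothetical and only serve as an illustration:

```python
# Sketch: the "1 second for the service response" requirement as a testable check.
# The endpoint and the sample size are illustrative assumptions.
import time
import urllib.request

ENDPOINT = "https://example.com/api/data"  # hypothetical service under test
MAX_LATENCY_SECONDS = 1.0                  # the measurable requirement
SAMPLES = 100                              # toy sample; a real load test uses far more

def measure_latency(url: str) -> float:
    """Return the wall-clock time of a single request in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=5) as response:
        response.read()
    return time.perf_counter() - start

latencies = sorted(measure_latency(ENDPOINT) for _ in range(SAMPLES))
p95 = latencies[int(0.95 * len(latencies)) - 1]

# Testable: the requirement either holds or it does not.
assert p95 <= MAX_LATENCY_SECONDS, f"95th percentile {p95:.2f}s exceeds the limit"
print(f"95th percentile latency: {p95:.3f}s - requirement met")
```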

Whenever we think about scaling our applications, we basically think about building a software architecture that supports scaling and selecting a technology such as an IaaS or PaaS platform to achieve that goal. But scaling is more complicated than it seems. It is not only something that has to be achieved in technology or software architectures.
An important question is what actually needs to be scaled in an enterprise. It is not only the software architecture; other factors matter as well:

  • Websites
  • Applications
  • Teams
  • Organisations

When talking about scaling organisations, some questions arise:

  • How easy is it to add a person to a company or team, or to remove one?
  • How can the workforce be measured within the organisational structure?
  • What effort is required when a new person joins the company?
  • Does the company structure allow rapid organisational growth?

The output of a team is not proportional to the number of people in it. The same is true for applications!
 

Scaling teams in a software project


To achieve scalability, it is not only necessary to build an architecture that is made for scale but also to think about how to scale a team. Imagine you start with 5 employees and your start-up becomes super-famous. Your team might grow to 1,000 employees within a few years. You need to think about how to handle this growth.
The following picture demonstrates how scaling problems might start in a company:
Productivity inhibitors in an IT project


 
 


For cloud solutions, scalability and elasticity are key requirements. Private cloud solutions should support them as well, even though scalability and elasticity may have a lower ceiling than in the public cloud. Scaling an application means that we can add new instances, for example additional Linux or Windows servers. Elasticity is something “more advanced” than that, as described by Reuven Cohen, an opinion leader in cloud computing (Cohen, 2010). Cohen describes scalability as the ability to “grow to the demands of the users on a platform”, whereas elasticity reflects real-time conditions. A platform might have millions of users, but if it is only available in the United States, there will be significantly less load on the servers during the night. The load will be much higher at peak times, and elasticity means that unnecessary instances are shut down when the load is lower and new instances are started when the load is higher. (Owens, 2010) calls elasticity “the golden nugget of Cloud Computing” and a key driver for moving to cloud environments. A very similar definition of elasticity is provided by the National Institute of Standards and Technology (Mell & Grance, 2011):
 

“Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.” – (Mell & Grance, 2011)

 

Elasticity can lower costs, but it requires a lot of up-front work: it has to be supported by the software architecture, and resource automation has to be put in place.
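As a rough illustration of what such an elasticity rule can look like, here is a minimal sketch in Python. The cloud_api module, the instance group name and the thresholds are hypothetical; real platforms provide their own autoscaling mechanisms:

```python
# Sketch of an elasticity rule: scale out when load is high, scale in when it drops.
# cloud_api is a hypothetical client standing in for your provider's SDK.
import time
import cloud_api  # hypothetical

SCALE_OUT_THRESHOLD = 0.75  # average CPU utilisation above which we add an instance
SCALE_IN_THRESHOLD = 0.25   # average CPU utilisation below which we remove one
MIN_INSTANCES = 2           # never scale below this baseline

def autoscale_once(group: str) -> None:
    instances = cloud_api.list_instances(group)
    avg_load = sum(i.cpu_utilisation for i in instances) / len(instances)

    if avg_load > SCALE_OUT_THRESHOLD:
        cloud_api.start_instance(group)            # demand is high: scale out
    elif avg_load < SCALE_IN_THRESHOLD and len(instances) > MIN_INSTANCES:
        cloud_api.stop_instance(instances[-1].id)  # demand is low: scale in

while True:
    autoscale_once("web-frontend")
    time.sleep(60)  # re-evaluate the real-time conditions every minute
```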


The resource automation series is finished for now, but it will continue once there is new exciting content. There is still a lot to be covered regarding resource automation. With this post, I want to give an overview of what was discussed.
In the first post, I discussed if and how cloud computing will change datacenter design. We talked about standardisation in the datacenter and how cloud computing affects it. The next post was about automation in datacenters: cloud computing requires us to think differently, since we are no longer talking about 10 or 100 servers but rather tens of thousands. This also leads to various problems, which are discussed in that blog post.
Resource automation also requires cooperation between development teams and operations teams, which is often referred to as “DevOps”. Deployment itself can be done with different strategies. Resource automation brings several problems that need to be addressed; one approach is to apply datacenter event lifecycle management. Resource automation should ultimately lead to self-service IT, also called the “software-defined datacenter” (VMware). Monitoring is another important task for resource automation in the cloud. Finally, resource automation gives us several possibilities to automate processes; how to identify those processes is covered in the post on datacenter automation and integration.
I hope you liked this tutorial on cloud computing and resource automation. Should you have any ideas, feel free to comment below. Our next series will be about software architectures for the cloud, so stay tuned ;).

This is the last post in our series about resource automation in the cloud. Today, we will look at datacenter automation and integration.

What are the benefits of data center automation? First of all, it frees up IT staff: the more things are automated, the fewer resources you need to allocate to them, and your IT can take care of more important things. You should automate tasks that are repeatable, such as provisioning machines.
With data center integration, you leverage the best capabilities of:
  • Existing systems
  • Processes
  • Environments

Key areas for Datacenter Automation are:

  • Reducing labor costs by allowing reduction or reallocation of people
  • Improving service levels through faster measurement and reaction times
  • Improving efficiency by freeing up skilled resources to do smarter work
  • Improving performance and availability by reducing human errors and delays
  • Improving productivity by allowing companies to do more work with the same resources
  • Improving agility by allowing rapid reaction to change, delivering new processes and applications faster
  • Reducing reliance on high-value technical resources and personal knowledge

A key driver for datacenter integration and automation is SOA (Service-Oriented Architecture). It allows much better integration of different services across the datacenter. Drivers for integration are:

  • Flexibility. Rapid response to change, enabling shorter time to value
  • Improved performance and availability. Faster reactions producing better service levels
  • Compliance. Procedures are documented, controlled and audited
  • Return on investment. Do more with less, reduce cost of operations and management

If you decide to automate tasks in your datacenter, there are some areas where you should start:

  • The most manual process
  • The most time-critical process
  • The most error-prone process
Once these three processes are identified, enterprises should (a minimal sketch of such a process definition follows this list):
  • Break down high-level processes into smaller, granular components
  • Identify where lower-level processes can be “packaged” and reused in multiple high-level components
  • Identify process triggers (e.g. end-user requests, time events) and end-points (e.g. notifications, validation actions)
  • Identify linkages and interfaces between the steps of each process
  • Codify any manual steps wherever possible
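As a rough illustration, here is a minimal sketch in Python of what such a broken-down process definition could look like. The process, its step names, trigger and end-points are purely hypothetical examples:

```python
# Sketch: a high-level process broken down into granular, reusable components
# with explicit triggers and end-points. All names are illustrative.
provision_web_server = {
    "trigger": "end-user request via self-service portal",  # what starts the process
    "steps": [
        "allocate_virtual_machine",    # reusable low-level component
        "install_base_image",          # reusable low-level component
        "configure_network",           # reusable low-level component
        "register_in_monitoring",      # reusable low-level component
    ],
    "end_points": [
        "notify_requester_by_mail",    # notification
        "validate_service_reachable",  # validation action
    ],
}

def run_process(process: dict, actions: dict) -> None:
    """Execute each codified step; the linkage between steps is their order."""
    for step in process["steps"] + process["end_points"]:
        actions[step]()  # each step is a small, reusable function
```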

 

This is a follow-up post in the series about resource automation in the cloud. In this part, we will look at monitoring. Monitoring is not the easiest thing to do in distributed systems: you have to monitor a large number of instances, and the challenge is to find out what you actually want to monitor. If you run an application (such as a SaaS platform), you might not be interested in the performance of a single instance but in the performance of the application itself; not in the I/O performance of an instance but in the overall experience your application delivers. Finding the right metrics requires significant effort and experience.
Let us look at how monitoring basically works. There are two key approaches to monitoring instances:

  • Agent-less Monitoring
  • Agent-based Monitoring
With agent-less monitoring, we have two possibilities:

  • Remotely analyse the system via a remote API (e.g. log data on the file system)
  • Analyse network packets: SNMP (Simple Network Management Protocol) is often used for this
What is good about agent-less monitoring?
  • No client agent to deploy
  • Lightweight
  • No application to install or run on the client; typically doesn't consume resources on the system
  • Lower cost
  • Option to close or lock down a system so that no new applications can be installed

What is bad about agent-less monitoring?

  • No in-depth metrics for granular analysis
  • Can be affected by networking issues
  • Security concerns
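Despite these drawbacks, a basic agent-less check is cheap to set up. Here is a minimal sketch in Python that polls a system from the outside without installing anything on it; the health endpoint URL is a hypothetical assumption:

```python
# Sketch of agent-less monitoring: poll the system remotely, install nothing on it.
# The health URL is an illustrative assumption.
import urllib.request

def check_remote(url: str, timeout: float = 5.0) -> bool:
    """Return True if the remote system answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        return False  # network problems show up as failed checks

if not check_remote("https://server01.example.com/health"):
    print("server01 is unreachable or unhealthy")
```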

On the other hand, we can use agent-based monitoring.

With agent-based monitoring, a software component runs on each server. It collects data about different metrics such as CPU load, I/O throughput, application performance and so on, and sends this data to a master server that aggregates it. This gives an overall view of the system performance. Since the agent itself consumes resources on the monitored machine, it can influence the system performance if it is not kept under control, so a lightweight agent is necessary.
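As a rough illustration, here is a minimal sketch in Python of such a lightweight agent: it collects a local metric and pushes it to a master server. The master URL and the JSON payload format are hypothetical assumptions:

```python
# Sketch of a lightweight monitoring agent: collect local metrics, push to a master.
# The master URL and payload format are illustrative assumptions.
import json
import os
import socket
import time
import urllib.request

MASTER_URL = "http://monitoring-master.example.com/metrics"  # hypothetical

def collect_metrics() -> dict:
    load_1min, _, _ = os.getloadavg()  # CPU load average (Unix-like systems)
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_load_1min": load_1min,
    }

def push_metrics(metrics: dict) -> None:
    request = urllib.request.Request(
        MASTER_URL,
        data=json.dumps(metrics).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=5)

while True:
    push_metrics(collect_metrics())
    time.sleep(30)  # stay lightweight: collect and send only every 30 seconds
```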
What is bad about agent-based monitoring?
  • Need to deploy agents to the systems
  • Each running system needs to have an agent installed in order to be monitored; this can be automated
  • Internal certification for deployment on production systems is required in some companies
  • Up-front cost
  • Requires software or custom development

What is good about agent-based monitoring?

  • Deeper and more granular data collection, e.g. about the performance of a specific application and its CPU utilisation
  • Tighter service integration
  • Control applications and services on remote nodes
  • Higher network security
  • Encrypted proprietary protocols
  • Lower risk of downtime
  • Easier to react, e.g. if “Apache” has a high load

This is a follow-up post in the series on resource automation in the cloud. This time we will talk about self-service IT, which is an important factor for automation. It basically enables users to solve their problems themselves instead of having to deal with the technology. For instance, if the marketing department needs to run a website for a campaign, IT should enable the department to start it “out of the box”: an empty website template (e.g. WordPress-based) should be launched, scaling should be enabled since the load will change over time, and the website should be branded in the corporate design. The goal of self-service IT is that the IT department provides tools and services that give users more independence.
Another example of self-service IT is the launch of virtual instances. This is rather easy to accomplish, as it can be handled by self-service platforms such as OpenStack, Eucalyptus or various vendor platforms. Achieving the scenario described above requires much more work: if you want to ease the job of your marketing colleagues, you have to prepare not only virtual images but also scripts and templates to build the website. However, more and more self-service platforms are emerging, and they will come with more features and possibilities over time.
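For the simpler case, the launch of a virtual instance, here is a minimal sketch using the openstacksdk Python library. The cloud name, image, flavor and network names are assumptions and depend entirely on your environment:

```python
# Sketch: launching a virtual instance on OpenStack via openstacksdk.
# Cloud, image, flavor and network names are illustrative assumptions.
import openstack

conn = openstack.connect(cloud="my-private-cloud")  # credentials from clouds.yaml

image = conn.compute.find_image("ubuntu-22.04")
flavor = conn.compute.find_flavor("m1.small")
network = conn.network.find_network("internal")

server = conn.compute.create_server(
    name="campaign-website-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)  # block until the instance is active
print(f"Instance {server.name} is running")
```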

This is the follow-up post to our series “Resource Automation”. In this part we will focus on event lifecycle management and the associated challenges we face in terms of resource automation. By event lifecycle management, we basically mean what happens when events occur in the datacenter or cloud. Events can be different things, but most of the time they are errors or warnings. If an error occurs, it is raised as an event and the necessary steps are taken.

In each of the steps of the lifecycle, errors can and will occur, so we have to take that into account. Responding to errors is critical for the business. Just imagine what happens if you fail to deliver a service level: more calls will be received by your support desk, and you not only have a technical problem but also an organisational one. In many cases, the support desk is outsourced and can be scaled on demand to some extent; however, this again costs money.
Event lifecycle management consists of four steps:
  1. Alerting. The time needed to realise that there is a problem; typically between 15 minutes and more than an hour.
  2. Identification. Identifying the cause of the problem and the likely solution.
  3. Correction. Correcting the error.
  4. Validation. Validating that the error is gone.

Correcting an error often takes up to a day! Optimising each phase therefore leads to a significant cost reduction.
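To make the four steps a bit more tangible, here is a minimal sketch in Python of an automated event handler that walks through them. The event structure and the corrective actions are purely illustrative placeholders:

```python
# Sketch: the four event-lifecycle steps as an automated handler.
# Event fields, fixes and checks are illustrative placeholders.
import time

def restart_service(host: str) -> None:
    print(f"restarting service on {host}")        # placeholder correction

def clean_temp_files(host: str) -> None:
    print(f"cleaning temporary files on {host}")  # placeholder correction

def escalate_to_operator(event: dict) -> None:
    print(f"escalating to an operator: {event}")  # manual fallback

def is_healthy(host: str) -> bool:
    return True                                   # placeholder validation check

def handle_event(event: dict) -> None:
    # 1. Alerting: the event arrives immediately instead of via user complaints
    print(f"ALERT: {event['message']} on {event['host']}")

    # 2. Identification: map the event type to a known cause and a likely fix
    known_fixes = {"service_down": restart_service, "disk_full": clean_temp_files}
    fix = known_fixes.get(event["type"])
    if fix is None:
        escalate_to_operator(event)  # unknown cause: a human has to identify it
        return

    # 3. Correction: apply the fix automatically
    fix(event["host"])

    # 4. Validation: confirm that the error is really gone before closing the event
    time.sleep(10)
    if not is_healthy(event["host"]):
        escalate_to_operator(event)

handle_event({"type": "service_down", "message": "HTTP 503", "host": "web-03"})
```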

Event Lifecycle Management in the Cloud

This is the last part of the interview with Mario Szpuszta, who works for Microsoft as a Technical Evangelist for Windows Azure. Mario's answers are marked with [MS], the questions with [MMH]. The interview was published in several parts over the past weeks.

Mario Szpuszta, Microsoft Corporation


[MMH] You do a lot in Europe, and your main partners are here as well. What problems do you see in Europe with cloud computing?
[MS] I think in Europe you can most often bring it down to legal and compliance. Everyone brings that to the table and very often uses it as an argument for not going to the cloud.
[MMH] Do you think that some of these problems will go away over time?
[MS] Well, I think it will get easier and it started getting easier, already! Cloud vendors are investing a lot in certifications and the like to make sure they are more compliant with regional data regulation and compliance policies. E.g. on Azure we have recently finalized the ISO 27001 certifications for our core services. There were some recent announcements on SSAE 16 (the successor of SAS70) and even HIPAA. The best way to fully understand those is to take a look at the Windows Azure Trust Center http://www.windowsazure.com/en-us/support/trust-center/.
All of these steps make it easier to drive discussions on cloud also in Europe…
[MMH] Now let's talk a little more about the technical details. I know you are more passionate about the technology. So could you give us some basic design considerations for software architectures with distributed systems?
[MS] We could fill a whole interview just with that topic:)
So I need to be short and precise. First of all when it comes to web apps and web services I think most people should start with simple yet effective things. Still I see so many stateful apps and services. It is really hard to scale with those in a load balanced environment. So the first step in my opinion is to make sure that you get to a stateless design and implementation or at least outsource state into a separate state server or cache, for example. In my opinion that is the first big thing to make sure it’s in the blood of your application. That way you really scale across machines and can improve your performance simply by adding additional servers with your bits deployed.
Another design consideration is trying to think and design more in an asynchronous fashion. Leverage queues and outsource complex tasks to background processes whenever possible. That really helps boost the perceived performance of your application. And it truly helps again to distribute load across multiple nodes in your deployment effectively.
Distribution of load in your app and web server tiers is fine, but if you’re still running on one database in the backend that is going to become your bottleneck and can destroy all the efforts you’ve made on the tiers above. So you should think about distributing load by partitioning/sharding your data across multiple databases. When it comes to scalability having many small databases with load distributed across all of them (running on different servers, of course) is way more scalable than having one really big database that needs to deal with the whole load.
I think these are the practical things you can start thinking about immediately. Of course there are many other theories that are applied by the truly big internet companies such as Facebook and the like. Many of those large-scale, global players think about CAP and BASE instead of ACID transactions when it comes to writing back to the store. Just look at http://en.wikipedia.org/wiki/CAP_theorem if you want to learn more. I don't cover them in detail because that's (a) too complex and (b) in my opinion not relevant for most of the traditional ISVs, as it goes way too far for many of them. I think most of us take a really big step forward by applying the principles I mentioned before: stateless, work in load-balanced environments, distribute load across multiple databases and the like. These are practical, and most people can implement them sooner as compared to completely rethinking how they deal with transactions in their system. Of course, if you want to have millions of customers on a global basis with thousands of concurrent users, then you should rather think early about CAP and BASE instead of too late…
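As a small editorial illustration of the partitioning/sharding idea described above, here is a minimal sketch in Python. The shard connection strings and the partition key are hypothetical:

```python
# Sketch: hash-based sharding, distributing records across several small databases.
# The shard connection strings and the key are illustrative assumptions.
import hashlib

SHARDS = [
    "postgresql://db-shard-0.example.com/app",
    "postgresql://db-shard-1.example.com/app",
    "postgresql://db-shard-2.example.com/app",
    "postgresql://db-shard-3.example.com/app",
]

def shard_for(customer_id: str) -> str:
    """Pick a shard deterministically from the partition key."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every read and write for a given customer goes to the same small database,
# so the overall load is spread evenly across all shards.
print(shard_for("customer-4711"))
```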
[MMH]  That actually sounds like a huge effort to bring applications to Azure. How can people deal with that?
[MS] One simple thought I tend to follow: stay simple, be pragmatic and work on an architecture that is good enough for your business goals. If you want to address millions of customers then you should rather think about all of these changes and principles I mentioned before sooner. But if you want to stay, let’s say in your region, and your customer base should increase but not up to the millions or your scenario is for specific target groups then many of these CAP and BASE things are just over-engineering. As you can see – the decision on how far you need to go depends on your business goals and business plans;) And in the context of those I tend to stay pragmatic and simple…
[MMH] A final statement: what excites you most about Cloud Computing?
[MS] For me that is super-easy and comes down to one specific point: cloud computing and the principles that are being established with cloud computing brings business and technology closer together than I’ve seen it ever before. Just to give you one example: from a pure technical point-of-view in the past it was really hard, if not impossible, to differentiate an effective architecture from a less effective architecture. Of course I know many people will argue different, but at the end of the day it’s all about opinions in the world of architecture very often. In the context of cloud I can do that much better: an effective architecture leads to less monthly cost for operating an environment in the cloud as compared to a not so effective architecture. Of course that always has to be seen in the context of the business goals and is a bit simplified, but at the end of the day that’s what it is in my opinion. Breaking efficiency of architecture down at that level has been tremendously hard in the past – and now we’re moving into that direction. That excites me most!!

Jeremy Geelan from Cloud Computing Journal / Cloud Computing Expo listed Mario Meir-Huber from Cloudvane as one of the Top 100 Bloggers on Cloud Computing! Thanks a lot! This is great news for the growing platform Cloudvane!
Link: Top 100 Blogs on Cloud Computing.