In this category, we describe everything related to resource automation for Cloud environments.

Cloud Computing has changed how we handle IT. Common tasks that used to take a lot of time are now largely automated, and much more is still to come. Another interesting development is “Software defined X”. This basically means that infrastructure elements become more automated as well, which makes them more scalable and easier for applications to use. A term used frequently of late is “Software defined Networking”; however, there is another approach that sounds promising, especially for Cloud Computing and Big Data: Software defined Storage.

Software defined Storage promises to abstract the way we use storage. This is especially useful for large-scale systems, as no one really wants to care about how content is distributed across different servers. This should be opaque to end users (software developers). For instance, if you use a storage system for your website, you want an API like Amazon’s S3. There is no need to worry about which physical machine stores your files; you just specify the desired region. The back-end system (in this case, Amazon S3) takes care of the rest.

Software defined Storage explained

As for the architecture, you simply communicate with the abstraction layer, which takes care of distribution, redundancy and other factors.
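
To make the idea of an abstraction layer more concrete, here is a minimal in-memory sketch. It is not how Amazon S3, HDFS or GlusterFS work internally; the class `StorageLayer`, the node names and the hash-based placement are assumptions made purely for illustration. The caller only sees `put` and `get`, while the layer decides which servers hold the copies.

```python
import hashlib


class StorageLayer:
    """Hypothetical abstraction layer: callers see put/get, not servers."""

    def __init__(self, nodes, replicas=2):
        if replicas > len(nodes):
            raise ValueError("more replicas requested than nodes available")
        self.nodes = {name: {} for name in nodes}  # node name -> stored objects
        self.order = sorted(nodes)
        self.replicas = replicas

    def _placement(self, key):
        # Hash the key to pick a starting node, then take the following
        # nodes in order so that the copies land on distinct servers.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.order)
        return [self.order[(start + i) % len(self.order)]
                for i in range(self.replicas)]

    def put(self, key, data):
        # Redundancy: write the object to every node chosen for this key.
        for node in self._placement(key):
            self.nodes[node][key] = data

    def get(self, key, unavailable=()):
        # Read from the first replica that is not marked as unavailable.
        for node in self._placement(key):
            if node not in unavailable and key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)


store = StorageLayer(["server-a", "server-b", "server-c"], replicas=2)
store.put("site/logo.png", b"image-bytes")

# Simulate the loss of the first replica; the read still succeeds.
primary = store._placement("site/logo.png")[0]
print(store.get("site/logo.png", unavailable={primary}))
```

The user of `store` never names a server, which is the point of the abstraction: placement and redundancy can change without touching application code.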

At present, there are several systems available that take care of that: next to well-known systems such as Amazon S3, there are also other solutions such as the Hadoop Distributed File System (HDFS) or GlusterFS.

 

Header Image Copyright: nyuhuhuu. Licensed under the Creative Commons 2.0.

The Resource Automation series is finished for now, but it will continue once there is exciting new content. There is still a lot to cover regarding resource automation. With this post, I want to give an overview of what was discussed.

In the first post, I discussed if and how Cloud Computing will change datacenter design. We talked about standardisation in the datacenter and how Cloud Computing affects it. The next post was about automation in datacenters. Cloud Computing requires us to think differently, since we are no longer talking about 10 or 100 servers but rather tens of thousands. This also leads to various problems, which are discussed in this blog post.

Resource Automation also requires cooperation between development teams and operations teams. This is often referred to as “DevOps”. Deployment can be done in several ways and requires different strategies. Resource Automation leads to several problems that need to be addressed. One approach is to apply Datacenter Event Lifecycle Management. Resource Automation should lead to self-service IT, also called the “Software defined Datacenter” (VMware). Monitoring is an important task for resource automation in the Cloud. With Resource Automation, we have several possibilities to automate processes; how to identify them is discussed in the post on datacenter automation and integration.

I hope you liked this tutorial on Cloud Computing and resource automation. Should you have any ideas, feel free to comment below. Our next series will be about Software Architectures for the Cloud so stay tuned ;).

This is the last post in our series about resource automation in the Cloud. Today, we will look at datacenter automation and integration.

What are the benefits of datacenter automation? First of all, it frees up IT staff. The more things are automated, the fewer people you need to allocate to them, and your IT staff can take care of more important things. You should automate things that are repeatable, such as provisioning machines.
With datacenter integration, you leverage the best capabilities of:
  • Existing systems
  • Processes
  • Environments

Key areas for Datacenter Automation are:

  • Reducing labor costs by allowing reduction or reallocation of people
  • Improving service levels through faster measurement and reaction times
  • Improving efficiency by freeing up skilled resources to do smarter work
  • Improving performance and availability by reducing human errors and delays
  • Improving productivity by allowing companies to do more work with the same resources
  • Improving agility by allowing rapid reaction to change, delivering new processes and applications faster
  • Reducing reliance on high-value technical resources and personal knowledge

A key driver for Datacenter Integration and Automation is SOA (Service Oriented Architectures). This allows much better integration of different services all over the datacenter. Drivers for Integration are:

  • Flexibility. Rapid response to change, enabling shorter time to value
  • Improved performance and availability. Faster reactions producing better service levels
  • Compliance. Procedures are documented, controlled and audited
  • Return on investment. Do more with less, reduce cost of operations and management

If you decide to automate tasks in your datacenter, there are some areas where you should start:

  • The most manual process
  • The most time-critical process
  • The most error-prone process
Once these three processes are identified, enterprises should:
  • Break down high-level processes into smaller, granular components
  • Identify where lower-level processes can be “packaged” and reused in multiple high-level components
  • Identify process triggers (e.g. end-user requests, time events) and end-points (e.g. notifications, validation actions)
  • Identify linkages and interfaces between the steps of each process
  • Codify any manual steps wherever possible
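
The decomposition described above can be sketched in a few lines of Python. Everything here is made up for illustration: the `Process` class, the step functions and the IP scheme are assumptions, not part of any real automation product. The sketch shows small steps that are “packaged” once and reused in two high-level processes, each with its own trigger and a shared end-point.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Process:
    """A high-level process built from small, reusable steps."""
    name: str
    trigger: str                       # e.g. "end-user request", "time event"
    steps: List[Callable[[Dict], None]]
    end_point: Callable[[Dict], None]  # e.g. a notification or validation action

    def run(self, context: Dict) -> Dict:
        # The shared context dictionary is the interface between the steps.
        for step in self.steps:
            step(context)
        self.end_point(context)
        return context


# Lower-level steps, codified once and reused by several processes.
def allocate_ip(ctx):
    ctx["ip"] = "10.0.0.%d" % ctx["machine_id"]


def install_base_image(ctx):
    ctx["image"] = "base-linux"


def register_monitoring(ctx):
    ctx["monitored"] = True


def notify_requester(ctx):
    ctx["log"] = ctx.get("log", []) + ["notified " + ctx["requester"]]


provision = Process("provision machine", "end-user request",
                    [allocate_ip, install_base_image, register_monitoring],
                    notify_requester)
rebuild = Process("rebuild machine", "time event",
                  [install_base_image, register_monitoring],
                  notify_requester)

result = provision.run({"machine_id": 7, "requester": "marketing"})
print(result)
```

Because `install_base_image` and `register_monitoring` appear in both processes, a fix to either step benefits every process that uses it.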

 

This is a follow-up post in the series about resource automation in the Cloud. In this part, we will look at monitoring. Monitoring is not the easiest thing to do in distributed systems. You have to monitor a large number of instances, and the challenge is to find out what you actually want to monitor. If you run an application (such as a SaaS platform), you might not be interested in the performance of a single instance but in the performance of the application itself. Likewise, you might not be interested in the I/O performance of an instance but in the overall experience your application delivers. Finding the right metrics takes considerable monitoring experience.

Let us look at how monitoring basically works. There are two key concepts for monitoring instances:

  • Agent-less Monitoring
  • Agent-based Monitoring
If we talk about agent-less monitoring, we have two possibilities:

  • Remotely analyse the system with a remote API (e.g. log data on the file system)
  • Analyse network packets: SNMP (Simple Network Management Protocol) is often used for that

What is good about agent-less monitoring?

  • No client agent to deploy
  • Lightweight
  • No application to install or run on the client. Typically doesn’t consume resources on the system
  • Lower cost
  • Option to close or lock down a system so that no new applications can be installed

What is bad about agent-less monitoring?

  • No in-depth metrics for granular analysis
  • Can be affected by networking issues
  • Security

On the other hand, we can use agent-based monitoring.

With agent-based monitoring, a software component runs on each server. The software collects data on the server about different metrics such as CPU load, I/O throughput and application performance. It then sends this data to a master server, which is in charge of aggregating it. This gives an overview of overall system performance. If an agent is not managed by a monitoring station, it might affect system performance, which is why a lightweight agent is necessary.
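
The agent/master pattern can be sketched as follows. This is a simplified illustration, not the design of any particular monitoring product: the classes `Agent` and `MasterServer`, the metric names and the fixed sample values are all assumptions, and the network transfer is replaced by a direct method call.

```python
from collections import defaultdict
from statistics import mean


class MasterServer:
    """Aggregates the samples reported by all agents."""

    def __init__(self):
        self.samples = defaultdict(list)  # metric name -> list of values

    def receive(self, host, metrics):
        for name, value in metrics.items():
            self.samples[name].append(value)

    def overview(self):
        # One averaged value per metric across all reporting servers.
        return {name: mean(values) for name, values in self.samples.items()}


class Agent:
    """Lightweight collector running on one server."""

    def __init__(self, host, master, collect):
        self.host = host
        self.master = master
        self.collect = collect  # function returning a dict of metrics

    def report(self):
        self.master.receive(self.host, self.collect())


master = MasterServer()
agents = [
    # Fixed values stand in for real measurements on each server.
    Agent("web-1", master, lambda: {"cpu_load": 0.25, "io_mb_s": 40.0}),
    Agent("web-2", master, lambda: {"cpu_load": 0.75, "io_mb_s": 60.0}),
]
for agent in agents:
    agent.report()

print(master.overview())
```

In a real deployment, `collect` would read actual system counters and `report` would send the data over the network; keeping the agent this small is what makes it lightweight.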
What is bad about agent-based monitoring?
  • Need to deploy agents to systems
  • Each running System needs to have an Agent installed in order to work. This can be automated
  • Internal certification for deployment on production systems in some companies
  • Up-front Cost
  • Requires Software or custom Development

What is good about agent-based monitoring?

  • Deeper and more granular data collection, e.g. about the performance of a specific application and its CPU utilization
  • Tighter service integration
  • Control applications and services on remote nodes
  • Higher network security
  • Encrypted proprietary protocols
  • Lower risk of downtime
  • Easier to react, e.g. if “Apache” has high load

This is a follow-up post in the series on resource automation in the Cloud. This time we will talk about self-service IT. Self-service IT is an important factor for automation. It basically enables users to solve their problems without having to talk about the technology. For instance, if the marketing department needs to run a website for a campaign, IT should enable the department to start this “out of the box”: an empty website template (e.g. WordPress-based) should be started. Furthermore, scaling should be enabled, since the load will change over time. The website should also be branded in the corporate design. The goal of self-service IT is that the IT department provides tools and services that give users more independence.

Another example of self-service IT is the launch of virtual instances. This is rather easy to accomplish, as it can be handled by self-service platforms such as OpenStack, Eucalyptus or various vendor platforms. Achieving the marketing scenario described above takes much more work. If you want to ease the job of your marketing team, you have to prepare not only virtual images but also scripts and templates to build the website. However, more and more self-service platforms are emerging, and they will certainly come with more features and possibilities over time.

This is the follow-up post in our series “Resource Automation”. In this part, we will focus on Event Lifecycle Management and the associated challenges we face in terms of resource automation. By Event Lifecycle Management, we basically mean what happens when events occur in the datacenter or Cloud. Events can be different things, but most likely they are of the type “error” or “warning”. If an error occurs, it is raised as an event and the necessary steps are taken.

In each of the steps of the lifecycle, errors can and will occur, so we have to take that into account. Responding to errors is critical for the business. Just imagine what happens if you fail to deliver a service level: your support desk will receive more calls, and you have not only a technical problem but also an organisational one. In many cases, the support desk is outsourced and can be scaled on demand to some degree; however, this costs money again.
Event Lifecycle Management consists of 4 steps:
  1. Alerting. The time needed to realise that there is a problem: anywhere between 15 minutes and more than an hour.
  2. Identification. Identifying the cause of a problem and the likely solution.
  3. Correction. Correcting the error.
  4. Validation. Validating that the error is now gone.

Correcting an error often takes up to a day! Optimising each phase therefore leads to a significant cost reduction.
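
To see where the time goes, the four steps can be tracked per event. The following is a minimal sketch; the `Event` class is hypothetical and the minute values are invented for the example, not measurements.

```python
from dataclasses import dataclass, field

# The four steps of Event Lifecycle Management, in order.
PHASES = ("alerting", "identification", "correction", "validation")


@dataclass
class Event:
    description: str
    minutes: dict = field(default_factory=dict)  # phase -> minutes spent

    def record(self, phase, minutes):
        # Phases have to be completed in the order defined above.
        done = len(self.minutes)
        if done >= len(PHASES) or PHASES[done] != phase:
            raise ValueError("unexpected phase: " + phase)
        self.minutes[phase] = minutes

    @property
    def resolved(self):
        return len(self.minutes) == len(PHASES)

    def total_minutes(self):
        return sum(self.minutes.values())


# Invented durations, only to illustrate the bookkeeping.
event = Event("web server returns HTTP 500")
event.record("alerting", 30)
event.record("identification", 120)
event.record("correction", 240)
event.record("validation", 30)

print(event.resolved, event.total_minutes())
```

With data like this per event, it becomes visible which phase dominates the total and is therefore worth automating first.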

Event Lifecycle Management in the Cloud

The next few posts will deal with resource automation in the Cloud. As with other tasks, this is not something you solve in one day, nor will it be easy. In this first post on resource automation, we will look at the main issues before we dig deeper into resource automation itself.

The major areas of resource automation problems are:

  • Service Level Monitoring
  • Event Lifecycle Management
  • IT Self Service

Service Level Monitoring

SLAs are key metrics in Cloud-based environments (but not only there!). One key issue is that response times are seldom part of SLAs. Metrics that should be contained in SLAs are: application availability, storage availability, CPU use, network I/O, events and many more. Another key issue is that SLAs have to be built from several sources:
  • Management Tools
  • Monitors
  • Logs
  • Services

There are several reasons why building SLAs manually is not that effective:

  • Higher staff levels and training costs
  • Higher service desk costs
  • Increased risk of errors and downtime
  • Slower remediation and missed service levels
  • Increased business pressure on IT service
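
As a contrast to building SLA reports by hand, here is a small sketch of an automated check. The functions `availability` and `sla_report`, the sample data and the target values are assumptions for illustration; in practice the samples would be merged from the management tools, monitors, logs and services listed above.

```python
def availability(samples):
    """Share of successful checks, as a percentage."""
    if not samples:
        raise ValueError("no samples")
    return 100.0 * sum(1 for ok in samples if ok) / len(samples)


def sla_report(sources, targets):
    """Compare measured availability per metric against its SLA target."""
    report = {}
    for metric, samples in sources.items():
        measured = availability(samples)
        report[metric] = {
            "measured": measured,
            "target": targets[metric],
            "met": measured >= targets[metric],
        }
    return report


# Made-up check results (True = check passed).
sources = {
    "application": [True] * 998 + [False] * 2,
    "storage": [True] * 95 + [False] * 5,
}
targets = {"application": 99.5, "storage": 99.0}

report = sla_report(sources, targets)
for metric, row in report.items():
    print(metric, row)
```

Here the application target is met while the storage target is missed, which is exactly the kind of result that should raise an event automatically instead of being discovered in a manual report.
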
We will talk about the other two issues – Event Lifecycle Management and IT Self Service – in the next posts.


As already mentioned in the last post, application deployment in the Cloud is not an easy task. It requires a lot of work and knowledge, unless you use Platform as a Service. The latter might reduce this complexity significantly. In order to deploy your applications to an Infrastructure as a Service layer, you might need additional knowledge. Let us now build upon what we have heard in the last post by looking at several strategies and how they are implemented by different enterprises.

Some years ago, before joining IDC, I worked for a local software company. We had some large-scale deployments in place and used Cloud services to run our systems. In this company, my task was to work on the deployment and iteration strategy (but not only on that ;)). We developed in an agile iteration model (basically Scrum), which meant that our team had a new “release” every two weeks. During these two weeks, new features were developed and bugs fixed. The testing department checked for bugs and the development team had to fix them.

We enforced daily check-ins and gated them. A gated check-in means that the check-in is only accepted if the source code compiles on the build server(s). The benefit was a working version every morning, as some people started early (e.g. 6am) and others rather late (e.g. 10am). Before we introduced this, our teams often had to debug bad code that someone had checked in the day before.

Different teams worked on different features, and once a team declared a feature final, it was merged into the “stable” branch. Usually on Thursdays (we figured out that Fridays are not good for deployment for various reasons), we deployed the stable branch. The features were first deployed to a staging system, where the operations team tested them. If a feature didn’t work, it had to be disabled and removed from the stable release. Once there was a final version with working features, we deployed it to the live system. This usually happened on Friday evening or on Saturday. Over the years, the development and operations teams became a real “team”, and fewer features had to be removed from the staging system due to bugs. My task was to keep the process running, since it was now “easy” to forget about the rules once everything seemed to run well.
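
The gated check-in rule can be illustrated with a short sketch. This is not the tooling we used back then: the function `gated_check_in` is hypothetical, the “branch” is just a dictionary, and Python’s built-in `compile()` stands in for the build server.

```python
def gated_check_in(changes, stable):
    """Accept a set of changed files only if every file compiles.

    `changes` and `stable` map file names to Python source text.
    Python's built-in compile() stands in for a real build server.
    """
    for name, source in changes.items():
        try:
            compile(source, name, "exec")
        except SyntaxError as error:
            # The gate: nothing is merged if a single file fails to build.
            return False, "%s rejected: %s" % (name, error.msg)
    stable.update(changes)
    return True, "accepted"


stable_branch = {"app.py": "print('hello')\n"}

# A check-in with a missing parenthesis is rejected; the branch stays intact.
broken_ok, broken_message = gated_check_in(
    {"app.py": "print('hello'\n"}, stable_branch)

# A check-in that compiles is accepted and merged.
good_ok, good_message = gated_check_in(
    {"feature.py": "x = 1\n"}, stable_branch)

print(broken_ok, good_ok, sorted(stable_branch))
```

The important property is that a rejected check-in leaves the branch untouched, so whoever starts work first the next morning always finds a version that builds.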


A very advanced environment is Windows Azure. Windows Azure is Microsoft’s Cloud platform and was originally designed as a PaaS platform. However, Microsoft has since added several IaaS features. In this post, I will only focus on the PaaS capabilities of Windows Azure. There are several steps involved:

  1. The full package with assemblies is uploaded
  2. When a new package is added, a staging environment is started. The package can now be tested in that staging environment
  3. Once the administrator considers the package “safe”, staging is switched to production
  4. New roles with the new binaries are started
  5. Old roles that still have sessions keep running until all sessions are closed
Deployment for Windows Azure PaaS

This is a very mature deployment process and solves a lot of problems that are normally associated with deployment for large-scale Cloud Systems.
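
The staging-to-production switch with session draining can be modelled in a few lines. Note that this is a toy model and not the Windows Azure API: the `Deployment` class, its method names and the session counts are assumptions used only to show the sequence of the five steps.

```python
class Deployment:
    """Hypothetical model of a staging slot, a production slot and draining roles."""

    def __init__(self, production_version):
        self.production = production_version
        self.staging = None
        self.draining = {}  # old version -> number of still-open sessions

    def upload(self, version):
        # Steps 1-2: a new package starts in staging, where it can be tested.
        self.staging = version

    def swap(self, open_sessions):
        # Steps 3-4: staging becomes production; new roles run the new binaries.
        if self.staging is None:
            raise RuntimeError("nothing in staging")
        if open_sessions > 0:
            self.draining[self.production] = open_sessions
        self.production, self.staging = self.staging, None

    def close_session(self, version):
        # Step 5: an old role is shut down once its last session closes.
        self.draining[version] -= 1
        if self.draining[version] == 0:
            del self.draining[version]


deployment = Deployment("v1")
deployment.upload("v2")
deployment.swap(open_sessions=2)
still_draining = dict(deployment.draining)  # v1 still serves two sessions

deployment.close_session("v1")
deployment.close_session("v1")
print(deployment.production, still_draining, deployment.draining)
```

Users with open sessions never see a hard cut-over: they finish on the old version while all new sessions go to the new one.
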
Facebook described their deployment process very clearly:
  1. Facebook takes advantage of BitTorrent
  2. New versions of the software are added to a tracker
  3. Each server listens to the trackers
  4. Once a new version is available, it is downloaded to the server
  5. The server then unloads the current version and loads the new one
  6. Old versions always stay on the server
  7. This makes rollback easy
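
Why keeping old versions makes rollback easy can be shown with a toy model. It does not reproduce Facebook’s system and does not use BitTorrent; the classes `Tracker` and `Server` are assumptions, and the download is replaced by a dictionary lookup.

```python
class Tracker:
    """Announces which versions exist (stand-in for a BitTorrent tracker)."""

    def __init__(self):
        self.versions = {}  # version -> payload
        self.latest = None

    def publish(self, version, payload):
        self.versions[version] = payload
        self.latest = version


class Server:
    def __init__(self, tracker):
        self.tracker = tracker
        self.local = {}   # downloaded versions; old ones are never deleted
        self.loaded = []  # load history, newest last

    def poll(self):
        latest = self.tracker.latest
        if latest is not None and latest not in self.local:
            self.local[latest] = self.tracker.versions[latest]  # "download"
            self.loaded.append(latest)  # unload the current, load the new one

    @property
    def current(self):
        return self.loaded[-1] if self.loaded else None

    def rollback(self):
        if len(self.loaded) < 2:
            raise RuntimeError("no older version to roll back to")
        self.loaded.pop()  # the previous version is still in self.local


tracker = Tracker()
server = Server(tracker)
tracker.publish("v1", b"build-1")
server.poll()
tracker.publish("v2", b"build-2")
server.poll()

server.rollback()
print(server.current, sorted(server.local))
```

The rollback needs no download at all, because the previous build never left the server.
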
Any Questions/Issues on deployment? Post your comments below.

When we talk about Cloud Computing, we also talk about automation in datacenters. Cloud Computing is transforming datacenters so that we see much more automation than before. A significant transformation is under way, and more and more projects that enable it are being launched. Well-known automation platforms in the open source area are Eucalyptus and OpenStack. Microsoft and VMware also offer automation tools for the Cloud. But what are the concepts behind Cloud automation?

Let us first look at the illustration below to find out how automation in datacenters works.

Datacenter Automation in the Cloud

As shown in the illustration above, there are several steps involved. First, we add a new physical server. This usually happens when a new rack or container is deployed to a datacenter. The new physical server boots a Maintenance OS, usually a lightweight version of Windows Server or Linux. The Maintenance OS is the basis for virtualisation. It connects to the Controller, a server that acts as a kind of master in the datacenter. The Controller tells the Maintenance OS what to do; normally, this means which Host VM to start. The Host VM is the container for the different virtual instances, called “Guest VMs”. Guest VMs run the applications the user wants. This can be either an operating system (Infrastructure as a Service) or a platform.
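
The boot sequence described above can be sketched as a small model. It is not how OpenStack, Eucalyptus or any vendor platform is implemented; the classes `Controller` and `HostVM`, the function `boot_physical_server` and the image name are assumptions for illustration.

```python
class HostVM:
    """Container for the guest VMs running on one physical server."""

    def __init__(self, server_name, image):
        self.server_name = server_name
        self.image = image
        self.guests = []

    def start_guest(self, workload):
        # A guest VM runs what the user wants: an OS (IaaS) or a platform.
        self.guests.append(workload)
        return "%s/guest-%d" % (self.server_name, len(self.guests))


class Controller:
    """Hypothetical master server that tells new machines what to run."""

    def __init__(self, host_image):
        self.host_image = host_image
        self.hosts = {}  # server name -> HostVM

    def register(self, server_name):
        # The maintenance OS asks what to do; the answer is a host VM to start.
        host = HostVM(server_name, self.host_image)
        self.hosts[server_name] = host
        return host


def boot_physical_server(name, controller):
    # The new server boots its maintenance OS, which contacts the controller.
    return controller.register(name)


controller = Controller(host_image="host-vm-image")
host = boot_physical_server("rack7-server3", controller)
guest_id = host.start_guest("linux-iaas-instance")
print(guest_id, sorted(controller.hosts))
```

Because every new server only needs to know how to reach the controller, adding a whole rack requires no per-machine configuration.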

 


In Cloud Computing environments, we see a significant shift in how the industry treats datacenters. Cloud Computing and Big Data are transforming the way we think about datacenters. But why should that be so?

First of all, we have to look at what servers and the server market looked like some years ago. Many companies bought their servers. Small and medium businesses (SMBs) used to own, run and maintain their own servers. A datacenter often consisted of only one rack or a few servers. This means that there was a polypoly on the demand side (many buyers). On the supply side, we had a few companies such as Dell, HP, IBM and others. This is called an oligopoly. In this market situation, the supply side is stronger and negotiations are traditionally harder for buyers. Only large enterprises could generally get better conditions.

Nowadays, with Cloud Computing, datacenter design is shifting dramatically. Small and medium businesses will rent services such as software (SaaS), platforms (PaaS), data storage or infrastructure. We will see fewer datacenter providers, but large providers will have more datacenters at high capacity. This means that the demand side of the market is shifting from a polypoly to an oligopoly. Large providers of Cloud platforms will now buy more hardware at better conditions. Datacenters will look very homogeneous, as buyers switch towards commodity hardware and systems that are easy to replace.