Posts

The Resource Automation Series is finished so far – but it will continue once there is new exciting content. There is still a lot to be covered regarding resource automation. With this post, I want to give an Overview of what was discussed.
In the first post, I discussed if and how Cloud Computing will change the datacenter design. We were talking about standardisation in the datacenter and how Cloud Computing affects this. The next post was about how automations in datacenters. Cloud Computing needs us to think different since we don’t talk about 10 or 100 servers but rather tens of thousands. This also leads to various problems, which are discussed in this blog post.
Resource Automation also requires a cooperation between Developer Teams and Operation Teams. This is often referenced as “DevOps“. How deployment can be done, requires different strategies. Resource Automation leads to several problems that need to be addressed. One approach is to apply Datacenter Event Lifecycle Management. Resource Automation should lead to self Service IT, or also called “Software defined Datacenter” (VMware). Monitoring is an important task for resource automation in the Cloud. With Resource Automation, we have several possibilities to automate processes. How to identify processes, we discuss datacenter automation and integration in this post.
I hope you liked this tutorial on Cloud Computing and resource automation. Should you have any ideas, feel free to comment below. Our next series will be about Software Architectures for the Cloud so stay tuned ;).

This is the last post on our series about resource automation in the Cloud. Today, we will look at datacenter automation and integration.

What are the Benefits for Data Center Automation? First of all, it frees up IT staff. If more things are automated, you don’t need to allocate resources for that. Your IT can care about more important things. You should automate things that are repeatable, such as provisioning machines.
With Data Center Integration, you leverage best capabilities of
  • Existing systems
  • Processes
  • Environments

Key areas for Datacenter Automation are:

  • Reducing labor costs by allowing reduction or reallocation of people
  • Improving service levels through faster measurement and reaction times
  • Improving efficiency by freeing up skilled resources to do smarter work
  • Improving performance and availability by reducing human errors and delays
  • Improving productivity by allowing companies to do more work with the same resources
  • Improving agility by allowing rapid reaction to change, delivering new processes and applications faster
  • Reducing reliance on high-value technical resources and personal knowledge

A key driver for Datacenter Integration and Automation is SOA (Service Oriented Architectures). This allows much better integration of different services all over the datacenter. Drivers for Integration are:

  • Flexibility. Rapid response to change, enabling shorter time to value
  • Improved performance and availability. Faster reactions producing better service levels
  • Compliance. Procedures are documented, controlled and audited
  • Return on investment. Do more with less, reduce cost of operations and management

If you decide to automate tasks in your datacenter, there are some areas where you should start:

  • The most manual process
  • The most time-critical process
  • The most error-prone processes
Once these 3 processes are identified, enterprises should
  • Break donw high-level processes into smaller, granular components
  • Identify where lower-level processes can be „packaged“ and reused in multiple high-level components
  • Identify process triggers (e.g. End-user requests, time events) and end-points (e.g. Notifications, validation actions
  • Identify linkages and interfaces between steps of each process
  • Codify any manual steps wherever possible

 

This is a follow-up post on the series about resource automation in the Cloud. In this part, we will look at monitoring. Monitoring is not the easiest thing to do in distributed systems. You have to monitor a large number of instances. The challenge is to find out what you want to monitor. If you run your application (such as a SaaS-Platform) you might not be interested in the performance of a single instance but in the performance of the application itself. You might not be interested in the I/O performance of an instance but again of the overall experience your application delivers. To find that metrics, you have to invest significant experience into monitoring.
Let us look at how monitoring works basically. There are 2 key concepts to monitor instances:

  • Agent-less Monitoring
  • Agent-based Monitoring
If we talk about Agent-less monitoring, we have two possibilities:

  • Remotly analyse the System with a remote API (e.g. Log Data on File System)
  • Analyse Network Packets: SNMP (Simple Network Management Protocol) is often used for that
What is good about agent-less monitoring?
  • No client agend to deploy
  • Lightweight
  • No application to install or run on the client. Typically doesn‘t consume resources on the System
  • Lower cost
  • Option to close or lock down a system, don‘t allow to install new Applications

What is bad about agent-less monitoring?

  • No in depth metrics for granular analysis
  • Can be affected by networking issues
  • Security

On the other hand, we can use Agend-based monitoring.

With Agend-based monitoring, a Software Component is running on each Server.  The software collects Data on the Server about different Metrics such as CPU Load, IO throughoutput, Application Performance, … The Software now sends this Data to a Master Server, which is in charge of aggregating the data. This gives an overall overview of the system performance. If Agent is not managed by a monitoring station, the System Performance might be influenced. This leads to the fact that a Lightweight Agend is necessary.
What is bad about Agend-based monitoring?
  • Need to deploy agents to systems
  • Each running System needs to have an Agent installed in order to work. This can be automated
  • Internal certification for deployment on production systems in some companies
  • Up-front Cost
  • Requires Software or custom Development

What is good about Agend-based monitoring?

  • Deeper and more granular data collection, E.g. About performance of a specific application and the CPU Utilization
  • Tighter service integration
  • Control applications and services on remote nodes
  • Higher network security
  • Encrypted proprietary protocols
  • Lower risk of downtime
  • Easier to react, e.g. If „Apache“ has high load

This is a follow-up Post to the Series on Resource Automation in the Cloud. This time we will talk about self-service IT. Self-service IT is an important factor for Automation. It basically enables users to solve their problems and not to talk about the technology. For instance, if the marketing department needs to run a website for a campaign, the IT should enable the department to start this “out of the box”: an empty website template (e. g. WordPress-based) should be started. Furthermore, scaling should be enabled, since the load will change over time. The website should also be branded in the corporate design. The target of self-service IT is that the IT department provides Tools and Services that gives the users more independence.
Another sample for self-service IT is the launch of virtual instances. This is a rather easy thing to accomplish as this can be handled by self-service platforms such as OpenStack, Eucalyptus or various vendor platforms. To achieve the one explained initially, much more work is necessary. If you plan to ease the job of your marketing guys, you would have to prepare not only virtual images but also scripts and templates to build the website. However, more and more self-service platforms emerge nowadays and they will definitely come with more features and possibilities over time.

This is the follow-up post to our Series “Resource Automation”. In this part we will focus on Event Lifecycle Management and the associated challenges we face in terms of resource Automation. By Event Lifecycle Managment, we basically mean what happens if events in the datacenter or cloud occur. Events can be different things but most likely they are of the type “errors” or “warnings”. If an error occurs, this is triggered as an event and necessary steps will be taken.

In each of the Steps of the Lifecylce, erros can and will occur. So we have to take that into account. Responding to errors is critical for the business. Just imagine what happens if you fail to deliver a service level. More calls will be received by your support Desk and you not only have a technical problem but also an organisational. In many cases, the support desk is outsourced and it can be sort of scaled on demand – however, this costs money again.
Event Lifecycle Management Consists of 4 Steps:
  1. Alerting. Time that is necessary to realise that there is a problem. In between 15 minutes to more than an hour.
  2. Identification. Identifying the cause of a problem and the likely solution.
  3. Correction. Correcting the Error
  4. Validation. Validating that the error is now gone

Correcting this error often takes up to a day! So optimising in each phase leads to a signifiant cost reduction.

Event Lifecycle Management in the Cloud

Event Lifecycle Management in the Cloud

The next few posts will deal with resource automation in the Cloud. As with other tasks, this is not a thing you solve in one day nor will it be easy. In the first post on resource automation, we will look at the main issues related to resource automation before we dig deeper into resource automation itself.
The major areas of resource automation problems are:

  • Service Level Monitoring
  • Event Lifecycle Management
  • IT Self Service

Service Level Monitoring

SLAs are key metrics in Cloud-based environments (but not only!). A key issue is that Response Times are seldom in SLAs. Metrics that should be contained in SLAs are:  Application availability, Storage availability, CPU use, Network I/O, Events and many more. A key issue is that there are several sources that SLAs have to be built from:
  • Management Tools
  • Monitors
  • Logs
  • Services

There are several reasons why building SLAs manually is not that effective:

  • Higher staff levels and traing costs
  • Higher service desk costs
  • Increased risk of errors and downtime
  • Slower remediation and downtime
  • Slower remediation and missed service levels
  • Increased business pressure on IT service
We will talk about the other two issues – Event Lifecycle Management and IT Self Service – in the next posts.
Want to receive updates about new posts? Subscribe to the Newsletter:

YTo4OntzOjk6IndpZGdldF9pZCI7czoyMDoid3lzaWphLW5sLTEzNTQ1MjEwMjMiO3M6NToibGlzdHMiO2E6MTp7aTowO3M6MToiMyI7fXM6MTA6Imxpc3RzX25hbWUiO2E6MTp7aTozO3M6MjI6Ik5ld3NsZXR0ZXIgU3Vic2NyaWJlcnMiO31zOjEyOiJhdXRvcmVnaXN0ZXIiO3M6MTc6Im5vdF9hdXRvX3JlZ2lzdGVyIjtzOjEyOiJsYWJlbHN3aXRoaW4iO3M6MTM6ImxhYmVsc193aXRoaW4iO3M6Njoic3VibWl0IjtzOjEwOiJTdWJzY3JpYmUhIjtzOjc6InN1Y2Nlc3MiO3M6ODY6IkNoZWNrIHlvdXIgaW5ib3ggbm93IHRvIGNvbmZpcm0geW91ciBzdWJzY3JpcHRpb24uIFBMRUFTRSBBTFNPIENIRUNLIFlPVVIgU1BBTSBGT0xERVIhIjtzOjEyOiJjdXN0b21maWVsZHMiO2E6MTp7czo1OiJlbWFpbCI7YToxOntzOjU6ImxhYmVsIjtzOjU6IkVtYWlsIjt9fX0=