How clouds falter... and how to straighten them

When we talk about clouds ‘breaking’, we mean it in the sense of the IT in question failing to achieve its specified level of performance and availability. Now, Machine Learning tools are coming forward to automate our engineers’ multifarious remediation responsibilities.


Clouds don’t actually break. That core truism comes from the fact that clouds are, by definition, virtualised resources of compute, storage and analytics, all created as an instance inside a server unit, somewhere inside a datacentre.

Clouds are usually backed up for disaster recovery purposes with appropriate levels of redundancy, so the only way for a cloud to physically break would be for someone to take a sledgehammer to a server rack and wreak havoc upon the chunks of metal and silicon therein. Even then, there ought to be a backup.

What cloud breakage really means

When we do talk about clouds breaking, we mean it in the sense of the entire IT instance (or smaller cloud component) in question failing to achieve its specified level of performance. This happens when its functions become log-jammed in some way so that its core level of availability fails to serve the users who access the services it feeds.

In real-world operational terms, there are many and various reasons for cloud service degradation or failure. Whatever the reason, cloud breakage is more commonly known simply as downtime.

Cloud application downtime events can be caused by faulty code or configuration changes, unbalanced cloud container clusters, or resource exhaustion (CPU, memory, disk or other capacity running dry), all of which inevitably lead to bad customer experiences and lost revenue.
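Resource exhaustion in particular tends to build gradually before it takes an application down, so a sustained-utilisation check (rather than a single-sample spike alert) is a common first line of defence. A minimal sketch in Python; the 90% ceiling and five-sample window are illustrative assumptions, not any provider's defaults:

```python
# Flag resource exhaustion when utilisation stays above a ceiling
# for several consecutive samples (a lone spike is usually noise).
# The 0.9 ceiling and window of 5 are illustrative assumptions.

def exhausted(samples, ceiling=0.9, window=5):
    """Return True if the last `window` samples all exceed `ceiling`."""
    if len(samples) < window:
        return False
    return all(s > ceiling for s in samples[-window:])

cpu = [0.42, 0.55, 0.93, 0.95, 0.97, 0.94, 0.96]
print(exhausted(cpu))                # sustained high CPU: True
print(exhausted([0.3, 0.97, 0.4]))  # a single spike, not exhaustion: False
```

The same check works for memory, disk or connection-pool utilisation; only the ceiling changes.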

Companies invest a considerable amount of resources, time and money to deploy multiple monitoring tools, often managed separately. They then also have to develop, augment and maintain custom alerts for common issues like spikes in load balancer errors or drops in application request rates.

We’re so ‘over’ thresholds

Setting thresholds to identify and alert when application resources are behaving abnormally is difficult to get right: it involves manual setup, and the thresholds must be continually updated as application usage changes (for example, an unusually large number of requests during a sales promotion).

If a threshold is set too high, cloud engineers and software developers don’t see alarms until operational performance is severely impacted. When a threshold is set too low, engineers get too many false positives, which they are prone to ignore.
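One way to see why adaptive baselines beat hand-set thresholds is a rolling z-score: the alert level is derived from recent history, so it rises and falls with the traffic itself, and the same absolute request count stops looking anomalous once a promotion lifts the baseline. A hedged sketch; the three-sigma cut-off and the sample values are illustrative choices, not what any particular monitoring product uses:

```python
import statistics

def anomalous(history, value, sigmas=3.0):
    """Flag `value` if it deviates more than `sigmas` standard
    deviations from the mean of the recent `history` window."""
    mean = statistics.fmean(history)
    sd = statistics.stdev(history)
    if sd == 0:
        return value != mean
    return abs(value - mean) > sigmas * sd

# Steady traffic: roughly 1000 requests/min with mild noise.
quiet = [990, 1010, 1005, 995, 1000, 1008, 992, 1003]
print(anomalous(quiet, 1012))   # within normal variation: False
print(anomalous(quiet, 1400))   # a genuine spike: True

# During a promotion the history window itself shifts upward,
# so the same absolute level is no longer flagged.
busy = [1350, 1420, 1390, 1405, 1380, 1415, 1395, 1410]
print(anomalous(busy, 1400))    # False
```

A static threshold would have to be re-tuned by hand for the promotion; the rolling baseline re-tunes itself.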

Even when engineers get alerted to a potential operational issue with thresholds, the process of identifying the root cause can still prove difficult. Using existing tools, developers often have difficulty triangulating the root cause of an operational issue from graphs and alarms. Even when they are able to find the root cause, they are often left without the right information to fix it.
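A crude but useful triangulation heuristic is temporal ordering: when several components alarm at once, the one that turned anomalous first is the most likely origin, and everything downstream is probably symptom rather than cause. A toy sketch under that assumption; the component names and timestamps are invented for illustration:

```python
# Toy root-cause triangulation: given the first anomalous timestamp
# per component, suspect the component that degraded earliest.
# Component names and times are invented for illustration.

first_anomaly = {
    "load-balancer-errors": 102.4,   # seconds since incident window start
    "app-request-latency": 98.7,
    "db-connection-pool": 61.2,      # degraded well before the others
}

suspect, at = min(first_anomaly.items(), key=lambda kv: kv[1])
print(f"probable root cause: {suspect} (first anomalous at t={at}s)")
# prints: probable root cause: db-connection-pool (first anomalous at t=61.2s)
```

Real tools also weigh dependency graphs and change history, but even this ordering step narrows the search considerably.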

So what is the industry doing about the situation and what tools are being developed to work at the coalface?

Platform-level mismatches

Some cloud pitfalls are caused by the nature of cloud itself and the way it manifests itself. We know that Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure shoulder the lion’s share of the Cloud Service Provider (CSP) market, but even among these three there can be platform-level mismatches that slow down cloud connections.

With many organisations seeking to deploy hybrid multi-cloud strategies, putting some apps and data with one CSP and some with another, things get complicated. Add in poly-cloud, where some application workloads are ‘separated out’ and shared across different providers, and things get even more complex.

Customers looking at their cloud estate will be thinking about observability technologies designed to look inside clouds and assess their performance. These will often be twinned with Application Performance Management (APM) tools, which straddle the observability space as it continues to evolve.

Where we go next with cloud management and the mission to fix cloud breaks is probably not hard to guess; it comes down to new threads of Artificial Intelligence (AI) provided by in-cloud Machine Learning (ML).

It’s all about availability

Let’s remember, we’re not questioning whether clouds are powerful, flexible, well-secured, interoperable, cost-manageable and Operational Expenditure (OpEx)-friendly. We already know those factors; that’s what it says on the packet. This discussion is all about cloud availability and the things we need to do to ensure it exists at the highest level.

Earlier this year, AWS itself tabled Amazon DevOps Guru (it’s a service, not a person) as an effort to increase availability. C-suite managers won’t physically touch this product; it is designed to work in the cloud engineering department and be handled by developers and data scientists.

The guru knowledge factor here stems not from Sanskrit teachings, but from analysis spanning years of previous operational cloud service metrics. This gender-neutral guru applies ML to analyse data such as application metrics, logs, events and traces (the vital signs that any application or database puts out) for behaviours that deviate from normal operating patterns.

When anomalous application behaviour that could cause potential outages or service disruptions is found, automatic alerts are generated to give human technicians the chance to intervene and manage the parameters of the cloud at hand.

That doesn’t mean a sudden power-down and switch to manual control; this technology provides remediation suggestions designed to deliver the shortest fix, or ‘time to resolution’ as the measure is known.
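The shape of such remediation advice can be as simple as a lookup from anomaly class to suggested action. A toy sketch; the categories and suggestions here are invented for illustration, not DevOps Guru’s actual output:

```python
# Toy mapping from anomaly class to a suggested remediation.
# Categories and advice are invented for illustration only.

PLAYBOOK = {
    "memory_exhaustion": "Raise the instance memory limit or fix the leak; check recent deploys.",
    "error_spike": "Roll back the last configuration change and inspect load balancer logs.",
    "unbalanced_cluster": "Rebalance container placement or scale out the hot nodes.",
}

def suggest(anomaly_class):
    """Return remediation advice, falling back to human escalation."""
    return PLAYBOOK.get(anomaly_class, "No playbook entry; escalate to on-call engineer.")

print(suggest("error_spike"))
print(suggest("dns_flap"))   # unknown class falls through to escalation
```

The point is that the human stays in the loop: the engineer decides, the suggestion just shortens the path to the fix.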

No more cold start clouds

“With Amazon DevOps Guru, we have taken our years of expertise and built specialised machine learning models to detect, troubleshoot and prevent operational issues long before they impact customers and without dealing with cold starts each time an issue arises,” said Swami Sivasubramanian, vice president, Amazon Machine Learning, AWS.

Sivasubramanian and team further comment on this whole issue of cloud availability and say that as more organisations move to cloud-based application deployment and microservice architectures to scale their businesses, applications have become increasingly distributed.

This distributed reality means cloud engineers need more automated practices to maintain application availability and reduce the time and effort spent detecting, debugging and resolving operational issues.

Google Cloud and Microsoft Azure are working to finesse their availability-related services and provide offerings to play in this space, all of which aim to be as self-service (which generally means they can be accessed via a software-based management console) as possible.

It’s a little like spinning plates: at some stage you’re going to need motorised muscle, because human multi-tasking has physical limits.

The bottom line with cloud is all about software application development and the mission to find any given organisation’s ‘most expensive line of code’, i.e. the one that’s log-jamming availability and is about to drop the plate.