Mitigating outages with a digital twin and chaos engineering

Digital transformation and the rise of remote working have heightened the demands on IT infrastructure over the last year. This has, in turn, increased the importance of maintaining uptime to prevent disruption to services and avoid financial repercussions. Digital twins and chaos engineering offer a practical answer to this challenge. With this technology in place, data center managers can mitigate the impact of an outage by preparing for failure scenarios and planning for the safe implementation of changes.


Over the course of the last year, the demands placed on IT have irreversibly changed, in no small part due to the vast acceleration of digital transformation through remote working. This means uptime has never been more important for preventing commercial and consumer disruption, as well as avoiding financial losses. In fact, the 2021 Uptime Institute Global Survey of Data Center Managers found that 40% of outages now cost between $100,000 and $1 million, and that is before counting the reputational damage.

To minimise the likelihood of outages and their consequences, data centre managers need to assess potential risks more thoroughly than ever. Making their facilities resilient is essential, and to achieve it they must be able to pre-empt and avoid potential downtime triggers, as well as minimise the impact of failure scenarios.

Using a digital twin to understand the impact of infrastructure changes

One effective way to reduce the chance of an outage is to deploy a data centre digital twin. This is a virtual representation of the physical facility that uses Computational Fluid Dynamics (CFD) to simulate airflow through the facility and to reveal thermal challenges that could cause downtime in the future.

Using this technology, operators can trial and assess the impact of any given change in the digital realm before they apply it to the real-life facility. This mitigates risk and reduces the chance of outages as potential issues can be identified before they occur.

Conducting chaos engineering through a digital twin

First coined by Netflix during its move to AWS in 2011, chaos engineering is the principle of “breaking things on purpose” to test system reliability in the face of unexpected and challenging disruptions. In the application layer, this means performing experiments like deliberately failing servers and clusters, dropping packets, and filling up hard drives. According to chaos engineering provider Gremlin, this results in increased system availability and reduced Mean Time to Resolution (MTTR) for incidents.
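As a rough illustration of the idea, the sketch below models a "fail one rack" experiment against a toy description of where each service's replicas live, and reports which services would not survive. It injects no real faults; the function names, the placement map and the rack labels are all hypothetical and exist only for this example.

```python
def surviving_services(placement, failed_racks):
    """Return the services that still have at least one replica running.

    placement maps a service name to the list of racks hosting its replicas
    (a made-up data model for illustration).
    """
    return {
        service
        for service, racks in placement.items()
        if any(rack not in failed_racks for rack in racks)
    }


def single_rack_failures(placement):
    """For each rack, list the services lost if that rack alone fails."""
    all_racks = sorted({rack for racks in placement.values() for rack in racks})
    report = {}
    for rack in all_racks:
        lost = set(placement) - surviving_services(placement, {rack})
        report[rack] = sorted(lost)
    return report


# Hypothetical placement: "cache" has no replica outside rack C.
placement = {
    "api": ["A", "B"],
    "db": ["B", "C"],
    "cache": ["C"],
}
report = single_rack_failures(placement)
for rack, lost in report.items():
    print(rack, "->", lost if lost else "no services lost")
```

In this made-up layout, only the loss of rack C takes a service down, which is the kind of weakness a real experiment is meant to expose before an incident does.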

However, running chaos engineering experiments at the physical layer in the data centre is hard to do safely, and some questions simply cannot be answered with physical testing. Say you want some insight into how the facility will respond to a complete failure of the chiller plant. Or you want to test what happens when the racks running the passive half of an active-passive redundant application ramp up while the local cooling unit has failed, on the hottest day of the year. Tricky, but these are the kinds of black swan events that cause painful outages, and this is where the digital twin comes into its own.

Using digital twin software, the data centre can be put in any configuration and simulated to see what would happen in the event of unplanned, problematic and even catastrophic data centre conditions, such as a series of cooling or airflow units failing, or circuit breakers malfunctioning. Consequently, operators can uncover and resolve weaknesses, as well as understand how much time they would have to rectify problems in a disaster situation. This ultimately empowers them to safely test resilience and bolster their incident response, reducing the likelihood of extended downtime and its associated impacts.
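To give a feel for the "how much time would we have" question, here is a deliberately crude sketch. It is not CFD and not how any digital twin product works: it treats the whole room as a single thermal mass and assumes the temperature rises linearly once the heat load exceeds the remaining cooling capacity. The unit names, capacities, thermal mass and temperature limits are invented for the example.

```python
from itertools import combinations


def minutes_to_limit(it_load_kw, cooling_kw, thermal_mass_kj_per_k, start_c, limit_c):
    """Minutes until the room reaches limit_c, or None if cooling keeps up.

    Lumped model (an assumption): net heat in kW (kJ/s) warms a single
    thermal mass, so time = mass * temperature rise / net heat.
    """
    net_kw = it_load_kw - cooling_kw
    if net_kw <= 0:
        return None
    seconds = thermal_mass_kj_per_k * (limit_c - start_c) / net_kw
    return seconds / 60.0


def rank_scenarios(units_kw, it_load_kw, thermal_mass_kj_per_k,
                   start_c, limit_c, max_failures=2):
    """List failure combinations that overheat the room, shortest time first."""
    results = []
    for count in range(1, max_failures + 1):
        for failed in combinations(sorted(units_kw), count):
            remaining_kw = sum(
                kw for name, kw in units_kw.items() if name not in failed
            )
            minutes = minutes_to_limit(
                it_load_kw, remaining_kw, thermal_mass_kj_per_k, start_c, limit_c
            )
            if minutes is not None:
                results.append((minutes, failed))
    return sorted(results)


# Hypothetical room: three 100 kW cooling units serving a 180 kW IT load.
units_kw = {"CRAH-1": 100.0, "CRAH-2": 100.0, "CRAH-3": 100.0}
scenarios = rank_scenarios(
    units_kw, it_load_kw=180.0, thermal_mass_kj_per_k=3000.0,
    start_c=24.0, limit_c=32.0,
)
for minutes, failed in scenarios:
    print(f"{' + '.join(failed)} failed: about {minutes:.1f} minutes to limit")
```

With these invented numbers, any single unit can fail without the room overheating, while any two failing together leave about five minutes to respond. A CFD-based twin would replace the single-mass assumption with a spatial simulation, which is what exposes local hot spots that an averaged figure like this hides.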

Looking ahead

The damage caused by outages can be severe. Take the Microsoft Azure outage at the end of last year as an example. The UK facility was brought offline due to cooling-related challenges, with serious implications for those relying on it, including the UK government’s Covid-19 information portal.

It’s clear that taking the necessary measures to avoid data centre outages must remain a top priority. The good news? Deploying a digital twin enables operators to prepare for failure scenarios as well as plan for the safe implementation of changes. As a result, they can keep their systems up and running, serving customers efficiently as demand continues to grow.

Dave King is a Product Manager at Future Facilities and has over 15 years of experience with data center simulation. His knowledge of data center cooling techniques and thermal performance helps data center managers get the most out of their facilities.