Escaping operational black holes with unified ‘full-fidelity’ observability

As tech leaders now look to gain deep, granular management control across their IT estates, there is a reasonable (if not compelling) argument for questioning the form, focus and fidelity of our observability viewpoint. The alternative may be something like a journey down an operational black hole, which is clearly a fairly suffocating experience for everyone.

A ladder coming up through a keyhole

No enterprise IT system implementation exists in a vacuum. By their very nature, organisations need to build, manage and manipulate a corporate software services layer that is characterised by its ability to compute, interconnect and deliver.

That last word is important. The delivery factor here is a mechanism by which users get results at the upper-tier level but, more fundamentally, all enterprise software systems need to deliver observability in the first instance. Otherwise, they risk existing in some sort of operational black hole, or disconnected vacuum.

Distributed complex black holes

A lot of what gets written today about observability is in the context of the challenges that DevOps and Site Reliability Engineering (SRE) teams face in cloud-native environments, which are highly distributed and complex.

In those environments, identifying and resolving system issues is tough. Some IT industry vendors and commentators are calling that ‘looking inside’ process observability but, according to Mike Marks, vice president of product marketing at network visibility and end user experience management company Riverbed, that’s really an extension of Application Performance Monitoring (APM).

Looking at the modern observability challenge facing customers today, Marks suggests that cloud-native infrastructures aren’t the only highly distributed environments that are challenging the ability of IT to manage.

“Hybrid work is a great example since it’s the new normal. If you’re part of the digital workplace team of a big company, you may have 10,000 employees who depend on you for an excellent digital experience. Those 10,000 employees are working from home, at least part of the time, with 10,000 unique combinations of laptop resources, Wi-Fi signal strength, ISP connectivity, portfolios of SaaS applications, not to mention shadow IT,” explained Marks.

Unified full-fidelity observability

To manage that complexity, Marks and team suggest that enterprises need observability tools that unify the sources of telemetry across all domains and devices at full fidelity, to gather and ingest data in its purest state.

Just to stop and define this expression in more precise terms, full-fidelity data is information created by compute events, machines, sensors, people or objects that exists in its original, rawest form, so that its ‘state’ retains its actual meaning and granularity. Typically unstructured and often held in a data lake, full-fidelity data sits at the opposite end of the data science information spectrum from dummy, denormalised or otherwise redundant data.

“The ability to collect, extract, transform and analyse data is increasingly critical for decision-making. If you’re working from partial, incomplete, or inaccurate datasets with lower fidelity than the source, all you’re doing is automating bad decisions faster. If the data isn’t right, you might as well throw the whole analytics model out the window,” explained David Sweenor, senior director of product marketing at analytics automation company Alteryx.

“Full fidelity data mirrors high-fidelity (Hi-Fi) in music. It’s an original, accurate, and unchanged source which is then faithfully transmuted with no information loss. In many use cases, such as anomaly or fraud detection, sampling data can remove critical information… an experience to which any audiophile who has uploaded an album to their computer can attest.”

We need to define this point because, in the context of this discussion, we are talking about how wide a sweep our observability viewpoint takes in. If an enterprise doesn’t capture all domains, applications and devices (and if it doesn’t capture every data point within those domains), then it’s only sampling, which can miss huge problems.

Imagine capturing only one out of ten website transactions on Black Friday – how many poor experiences would that create inside the supply chain and at the consumer level?
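The sampling argument can be sketched in a few lines of code. This is a purely hypothetical illustration (the transaction volume, failure rate and 1-in-10 capture rate are invented numbers, not drawn from any Riverbed product): sampling lets you estimate how many failures occurred, but the specific failing transactions fall outside the capture and can never be diagnosed.

```python
# Hypothetical illustration of 1-in-10 sampling versus full-fidelity capture.
# All figures are invented for the sketch.
import random

random.seed(42)

# 100,000 transactions; roughly 0.1% carry a rare failure
transactions = [{"id": i, "failed": random.random() < 0.001}
                for i in range(100_000)]

# Full-fidelity capture sees every failure, with its transaction ID
full_failures = sum(t["failed"] for t in transactions)

# A 10% capture rate keeps only every tenth transaction
sampled = transactions[::10]
sampled_failures = sum(t["failed"] for t in sampled)

# Scaling the sampled count back up gives an *estimate* of the failure
# count, but the nine-in-ten failing transactions that were never
# captured cannot be inspected or root-caused at all
estimate = sampled_failures * 10
print(full_failures, sampled_failures, estimate)
```

The estimate may land near the true count, but for triage that is beside the point: the sampled dataset simply does not contain most of the poor experiences.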

“But capturing everything means many more alerts. A typical medium-sized enterprise might get as many as 1,000 alerts per hour, which is too many to manually resolve. Observability filters and prioritized alerts (using AI and ML) enable us to use observability with alerts in context, so IT teams can quickly identify the root cause of any problem,” explained Riverbed’s Marks.
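The triage idea Marks describes can be made concrete with a toy sketch. The rule-based scoring below stands in for the AI/ML prioritisation the article refers to (real products are far more sophisticated); the alert sources, severities and user counts are all invented. The point is only that ranking by severity and blast radius lets a team start at the top of a 1,000-alert-per-hour stream rather than resolve it manually.

```python
# Toy alert prioritisation: score by severity weight times blast radius.
# A rule-based stand-in for the ML-driven scoring described in the article;
# every alert below is invented.
from dataclasses import dataclass

SEVERITY_WEIGHT = {"critical": 100, "warning": 10, "info": 1}

@dataclass
class Alert:
    source: str
    severity: str
    affected_users: int

def priority(alert: Alert) -> int:
    # Weight the severity, then scale by how many users are affected
    return SEVERITY_WEIGHT[alert.severity] * alert.affected_users

alerts = [
    Alert("wifi-ap-17", "warning", 12),
    Alert("payments-db", "critical", 4_000),
    Alert("laptop-4411", "info", 1),
    Alert("sso-gateway", "critical", 9_500),
]

# Highest-impact alerts first; the team works down from the top
triaged = sorted(alerts, key=priority, reverse=True)
print([a.source for a in triaged])
```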

IT team war rooms

Previously (or at least in the age before observability specialists like Riverbed), the only way to uncover difficult problems was through resource-intensive IT team war rooms. A war room typically has the network team, the server team, the application team, the database team and so on, all together on a call or in a room trying to figure out a problem.

But these often floundered as a result of senior-level IT people strategising and solutionising, rather than getting down to solid work plugging leaks.

According to Digital Enterprise Journal, high-performing organisations (roughly the top 20% worldwide) can proactively detect 79% of performance problems with an average Mean Time To Resolution (MTTR) of 38 minutes. All other companies catch only around 40% of problems and their MTTR is more than three and a half hours, which clearly wastes money.
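A quick back-of-envelope calculation puts those Digital Enterprise Journal figures in money terms. The MTTR numbers come from the article above; the per-hour outage cost is an invented assumption purely for illustration.

```python
# Per-incident resolution gap, using the MTTR figures quoted above
top_mttr_min = 38          # top 20% of organisations
rest_mttr_min = 3.5 * 60   # "more than three and a half hours", so a floor

gap_min = rest_mttr_min - top_mttr_min
print(gap_min)  # each incident runs at least this many minutes longer

# With a hypothetical (invented) outage cost of $1,000 per hour,
# the minimum per-incident cost gap:
cost_gap = gap_min / 60 * 1_000
print(round(cost_gap))
```

Even at that modest assumed cost, each incident runs nearly three hours longer and costs thousands more for the lower performers, before accounting for the 39-point gap in proactive detection.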

Sweenor elaborates, highlighting: “The core goal here is having the right information at the right time at a high enough quality to make the right decisions. The challenge, however, is one of resource versus business benefit. Resource-intensive analysis of data at this scale is only really possible through automation. For those looking to complete this manually, visibility will require a razor-sharp prioritisation focus on what delivers the most business value.”

Top performing companies are therefore more likely to plough money back into R&D to better help the organisation and leave the weaker competition bailing out a leaking boat instead of zipping along under full sail.

A finicky zero-sum game

“It's important to get this business-technology calculation right,” said Marks. “Customers are finicky and won't tolerate slow page loads. In this zero-sum game, if you can't stay close to the leaders in providing a great digital experience, revenue suffers as finicky customers wander. For employees, even slight delays with business-critical applications multiply into real dollars.”

The suggestion here is that observability solves those problems through full-fidelity telemetry and intelligent analytics that deliver actual (and actionable) insights. Riverbed also has preconfigured libraries that can be customised to deliver automated actions for frequently encountered problems, freeing IT staff to focus on higher-level tasks.

Down the (black) rabbit hole

This article opened with a reasonable (if not compelling) argument for questioning the form, focus and fidelity of our observability viewpoint as tech leaders look to gain deep, granular management control across their IT estates. It closes with the same warning: the alternative may be a journey down an operational black hole, which is a fairly suffocating experience for everyone.