Why distributed processing helps data scale-out & speed-up

For CIOs working across the full spectrum of industry verticals in small, medium-sized or large enterprises, there is an increased focus on data-driven business -- so, as data needs to scale out and speed up, could distributed processing be a key enabling technology for the next wave of data growth?

Data is on the move, figuratively, literally and architecturally. We are now building enterprise software systems with increasingly complex and distributed data channels that push data workflows around at a rapid (often real-time) pace. Organisations are also moving their use of data outward, with some of that processing, analytics and storage happening locally, out on the ‘edge’, on the Internet of Things (IoT).

Technology analysts are fond of sticking their finger in the air to assess potential growth in this space. The magical eye-of-newt soothsayers at Gartner have estimated that some 75% of enterprise-generated data will be created and processed at the edge, outside a traditional centralised datacentre, by 2025.

Whether it is in fact 66% by 2026 or perhaps 77% by 2027 doesn’t matter as much as acknowledging that data is becoming more distributed. This reality raises one core question: if data is becoming more distributed, shouldn’t data processing become more distributed too?

Monte Zweben, CEO of Machine Learning (ML) modelling and predictive cloud applications company Splice Machine, says that to use all of the data gathered by edge and IoT devices, CIOs will need to implement data architectures that can support both distributed scale-out and real-time processes.

Remember good old megabytes?

Machine-generated data is constantly being created and stored, and today it approaches petabyte scale very quickly, says Zweben. This is a marked change from the past. Before the millennium, human-generated systems of record (such as order management, inventory and employee record systems) generated megabytes of data a day at most, amounting to no more than terabyte-scale systems over their lifetimes.

To run applications on petabyte-scale systems without the software crashing, Zweben says organisations need distributed processing. The technology proposition here hinges on the suggestion that so-called ‘scale-out computing’ will become the standard tool for anyone interested in taking advantage of data gathered by IoT and edge devices.

It’s an (arguably) reasonable suggestion. We know that a few seconds can make all the difference in both industrial and consumer settings. We also know that real-time applications are essential for industrial deployments: sensors inside a reactor, for example, need to report changes as soon as they happen to prevent damage before it disrupts an entire plant. Equally, for consumer-facing IoT applications, an organisation can waste precious marketing dollars by showing a customer a recommendation for a product they purchased minutes or hours ago.

But firms weren’t used to this deluge of data when it first appeared. Splice Machine’s Zweben reminds us that when massive amounts of data first started coming in without any immediate or obvious purpose, that data was often stored (we could say dumped) for later use in a data lake.

"For businesses that didn’t organize their data, this data lake could morph into a data swamp, where data scientists had to wade through mountains of unusable data in order to find a small useful piece. Customer-facing industries are the most likely to accumulate data sprawl and store it in data lakes, as their customers produce massive amounts of data constantly that are not always easy to analyse," said Zweben.

Sprawl-prone clickstream swamps

Prime examples of sprawl-prone industries include financial services, retail and healthcare. These kinds of companies will often retain petabytes of ‘clickstream data’. But without analytics to explain what those clicks mean, no useful information on customer behaviour can be extracted.

So how should firms think about architecting for a distributed, data-centric future? The answer lies in remembering that traditional storage systems house all their data in one place (whether on a physical hard drive or in the cloud), which limits data analysis to the CPU capability and network bandwidth available at that single location.

"In contrast, distributed processing systems break data up and store it across different storage nodes. Because the data is distributed, it can then be processed separately in parallel. This allows for far faster data analysis that is infinitely scalable because it isn’t limited by CPU, network bandwidth, or anything else. Because distributing processing can process concurrently, it allows for data analysis much faster and at infinite scale," said Zweben.

A hybrid approach for transactions & analytics

So then, what types of data architectures support scale-out and real-time processes?

Streaming data pipelines, which carry real-time information from websites, applications or IoT technologies, are needed to drive ML models that react directly to end-user interactions in real time. This kind of operation typically affects a small number of rows at a time but has potential for very high concurrency. This low-latency, high-concurrency processing is usually characteristic of Online Transactional Processing (OLTP) databases.
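
As a rough illustration of that streaming, OLTP-style pattern (the event shape, the in-memory ‘table’ and the toy ‘model’ below are invented for the sketch), each incoming event touches a single record and is scored immediately:

```python
# A toy illustration of the streaming, OLTP-style pattern: each incoming event
# upserts one row and is scored straight away. The event shape, the in-memory
# 'table' and the placeholder 'model' are assumptions made for this sketch.
import time

user_profile_table = {}  # stand-in for an OLTP table keyed by user id

def score(profile):
    """Placeholder 'ML model': recommend once a user has clicked three times."""
    return "show_offer" if profile.get("clicks", 0) >= 3 else "wait"

def handle_event(event):
    # Low-latency, row-level upsert followed by an immediate model call.
    profile = user_profile_table.setdefault(event["user_id"], {"clicks": 0})
    profile["clicks"] += 1
    return score(profile)

def event_stream():
    for i in range(10):
        yield {"user_id": i % 3, "ts": time.time()}

for event in event_stream():
    print(event["user_id"], handle_event(event))
```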

"Batch data pipelines carry less time-sensitive data and they occur periodically (typically once a day or weekly). They process large amounts of source data by extracting, loading, cleansing, aggregating and otherwise curating data into usable features. Transforming large amounts of data usually requires parallel processing that can scale. This high-volume data processing found in massively parallel processing database engines is usually referred to as Online Analytical Processing (OLAP)," explained Zweben.

A hybrid HTAP (Hybrid Transactional/Analytical Processing) database combines the capabilities of OLTP and OLAP databases, so it is optimised for use cases that require both real-time and scale-out capabilities. Splice Machine, PingCAP and SingleStore (previously MemSQL) are a few of the companies that offer this kind of database.
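
To show what serving both workloads from one copy of the data means, the sketch below uses SQLite purely as a stand-in (SQLite is not an HTAP engine and this is not any vendor’s API): a row-level transactional write is immediately visible to a scan-heavy analytical query against the same table.

```python
# Illustrative only: SQLite is *not* an HTAP engine, but the two statements below
# show the access patterns a hybrid database serves from one copy of the data:
# a row-level transactional write (OLTP) and a scan-heavy aggregate (OLAP).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, amount REAL, ts TEXT)")

# OLTP-style: a small, latency-sensitive transaction touching one row.
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)", ("u1", 9.99, "2021-06-01"))

# OLAP-style: an analytical scan over the whole table, immediately consistent
# with the write above because both run against the same store.
for row in conn.execute(
        "SELECT user_id, COUNT(*), SUM(amount) FROM orders GROUP BY user_id"):
    print(row)
```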

It seems the best way to overcome data sprawl is to optimise data consumption, and there’s an emerging technology that allows companies to do just this. A ‘feature store’ automates the transformation pipelines needed to take raw data (from warehouses, lakes, databases or streaming data sources) and turn it into usable features that machine learning models can consume.
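
As a rough sketch of the idea (the class and transformation below are invented for illustration and do not reflect any particular vendor’s API), a feature store registers a transformation once and then serves the resulting features both to batch training jobs and to live models:

```python
# A hypothetical, minimal 'feature store' sketch: run a registered transformation
# over raw data once, then serve the output for batch training (offline) and for
# low-latency lookups by a live model (online). Names are illustrative only.
class MiniFeatureStore:
    def __init__(self):
        self._features = {}   # entity id -> feature dict

    def ingest(self, raw_rows, transform):
        """Run the registered transformation over raw data and store the output."""
        for entity_id, features in transform(raw_rows):
            self._features[entity_id] = features

    def get_online(self, entity_id):
        """Online path: single-key lookup for a model serving a live request."""
        return self._features.get(entity_id)

    def get_offline(self):
        """Offline path: full snapshot for training, consistent with the online view."""
        return dict(self._features)

def clickstream_to_features(raw_rows):
    for row in raw_rows:
        yield row["user_id"], {"clicks_today": row["clicks"]}

store = MiniFeatureStore()
store.ingest([{"user_id": "u1", "clicks": 7}], clickstream_to_features)
print(store.get_online("u1"))
print(store.get_offline())
```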

Zweben and team insist that hybrid databases are an especially powerful platform upon which to build a feature store. Most feature stores have two separate databases for online and offline features, which forces users to manage the two kinds of features manually and allows them to drift out of sync. So, again, it’s a more distributed data processing universe that we’re building here.

The comparatively smaller Splice Machine, PingCAP and SingleStore aren’t the only firms vocal here; the usual-suspect tech behemoths also want a share of voice in the distributed data space.

Distribute & decentralise, manage & monetise

Hewlett Packard Enterprise (HPE) says it is working on a ‘radically new approach’ to unlock the power of decentralised data distributed across the edge and cloud, solving a global issue holding companies back from embracing the next wave of digital transformation.

Tapping into GAIA-X, which connects centralised and decentralised infrastructures to strengthen the ability to access and share data securely and confidently, HPE’s product releases in late spring 2021 are aligned with this push, with a special focus on how organisations can manage their data monetisation capabilities.

Data is very definitely on the move. It is moving to cloud datacentres, it’s moving between datacentres and going out to work on the IoT, it’s moving between data lakes and feature store drainage pipelines and it’s moving between online and offline data-compute resources.

Sitting still on data is not a good idea; embracing the inherent flow of the pipe appears to be the best way to avoid drowning.