
Glyn Bowden (Global) - The ABCs of Big Data

First we have analytics. The workloads in this set typically come from scientific, exploration or financial institutions with huge pools of data that need to be processed quickly to discover a desired result. Often the data will need to be processed over and over again, either to find multiple possible resolutions or to tweak the algorithms between runs. From a data profile perspective, this gives us an ocean model: the large pool is fed continuously by new streams, but once data becomes part of the pool it is primarily read and never rewritten. Essentially we're looking at a write once, read many configuration.
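That access pattern can be pictured in a few lines of Python. This is a purely illustrative sketch (the pool, the data values and the threshold parameter are all invented), showing the shape of the workload: ingest appends new data, nothing is ever rewritten, and each analysis run re-reads the whole pool with a tweaked parameter.

```python
# Hypothetical sketch of the write once, read many "ocean" model:
# new streams feed the pool, existing data is never mutated, and
# analysis passes re-read the same data with tweaked parameters.

pool = []  # the "ocean": append-only, never rewritten in place


def ingest(stream):
    """New streams feed the pool; each record is written exactly once."""
    pool.extend(stream)


def analyse(threshold):
    """Each run re-reads the entire pool with a tweaked algorithm setting."""
    return [x for x in pool if x > threshold]


ingest([3, 14, 15, 9, 26])
first_pass = analyse(threshold=10)   # one algorithm setting
second_pass = analyse(threshold=20)  # tweaked, then re-run over the same data
```

The point of the sketch is that the expensive operation is the repeated full read, which is why the storage profile is optimised for reads rather than rewrites.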

The next big data segment is bandwidth intensive. From a storage model this is similar to the analytics write once, read many profile, with the subtle difference that the writes themselves are huge. The focus is on bringing in huge amounts of data, often from disparate sources, combining it and writing it to a single pool. For this, the profile of the storage device needs to allow high-bandwidth writes. Low read latency isn't necessarily required and will be determined by the application.

Finally we have content. What does this mean in the real world? Well, for instance, a doctor will likely want access to the data of a single person, and only recent images from, say, the past 12 months. That's a relatively small data set, but it will be stored alongside every patient's data, with a requirement to retain it for the life of that person plus 25 years. It is not common for tens of thousands of scans to be ingested at once; they are usually taken only a couple at a time, with perhaps hundreds of patients in parallel. So the storage model for content is one based on availability and compliance. The key technology in content is actually going to be search: locating that needle in the haystack is vital. Analysis models are used when it's not known exactly what is being searched for; in the case of content, the metadata is usually well known, and it just needs to be translated into a function that describes where the data blocks physically reside.
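The "well-known metadata translated into a physical location" idea can be sketched as a simple lookup. Everything here is hypothetical (the patient IDs, dates and block locations are invented), but it shows the shape of content search: the query is not exploratory analysis, it is metadata (patient, date range) mapped to where the data resides.

```python
# Hypothetical sketch: in the content model the metadata is well
# known, so "search" is a translation from metadata to the place
# the data blocks physically reside.
from datetime import date

# Illustrative index only: (patient_id, scan_date) -> physical location
scan_index = {
    ("patient-042", date(2024, 11, 3)): "shelf-7/volume-2/block-9001",
    ("patient-042", date(2019, 5, 20)): "archive-1/volume-9/block-0042",
    ("patient-117", date(2025, 1, 8)):  "shelf-2/volume-1/block-1337",
}


def locate_recent_scans(patient_id, since):
    """Return physical locations of one patient's scans taken since a date."""
    return [loc for (pid, d), loc in scan_index.items()
            if pid == patient_id and d >= since]


# The doctor's query: one patient, recent images only.
locations = locate_recent_scans("patient-042", since=date(2024, 1, 1))
```

Older scans stay in the index for the full retention period; they simply fall outside the date filter, which is why availability and compliance, not raw throughput, drive this storage model.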

So those are the three models that I tend to work with, and pretty much all big data will slot into one of them. In my experience it's currently rare to find workloads that fall into other categories at any one point in time, but with the nature of data, this will come. It's at this point that technologies and standards such as SNIA's Cloud Data Management Interface (CDMI) can be used to locate the data and allow it to be moved to infrastructure focused on different capabilities. Currently CDMI has obvious value to add to the content group, but I see it becoming more and more relevant as we start to knit these areas together, making all the content we are creating really begin to work for us.
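As a rough illustration of where CDMI fits, the sketch below builds the HTTP request a CDMI client might use to read an object's metadata before deciding where to move it. The host and object path are invented; the version and metadata query header conventions follow SNIA's CDMI specification, but treat the details as an assumption rather than a copy-paste recipe (no request is actually sent here).

```python
# Hypothetical sketch: constructing a CDMI "read metadata" request.
# CDMI is a RESTful HTTP interface, so locating data starts with a
# GET against the object's URI asking only for its metadata fields.
# Host and path are invented for illustration.

def build_cdmi_metadata_request(host, object_path):
    """Return the URL and headers for a CDMI metadata GET request."""
    url = f"https://{host}{object_path}?metadata"
    headers = {
        # Assumed headers per the SNIA CDMI specification's conventions
        "X-CDMI-Specification-Version": "1.1.1",
        "Accept": "application/cdmi-object",
    }
    return url, headers


url, headers = build_cdmi_metadata_request(
    "storage.example.com", "/cdmi/patients/patient-042/scan-001.dcm")
```

The response to such a request would carry the object's metadata as JSON, which is what lets a data-mobility layer decide that a given object now belongs on analytics-optimised or bandwidth-optimised infrastructure instead.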

By Glyn Bowden, enterprise infrastructure architect, NetApp


