Database architects help scientists deal with data floods


Scientists are, by nature, fascinated by everything, but that innate sense of curiosity can be both a blessing and a curse. The more opportunities for exploration that present themselves, the more likely they are to be side-tracked.

The Internet of Things (IoT) is a case in point, because it is generating endless possibilities and the potential areas for investigation far outstrip the supply of scientists. Unlike us narrow-minded types, who tend to stick to a single track, many scientists can be tempted by each new intellectual challenge that presents itself, and before long the discipline is all over the place like Einstein’s hair.

It’s already been established that the IoT multiplies the complexity of a scientific investigation in an equation that involves volume times velocity to the power of variety. Virtualisation, by way of cloud computing, could improve the situation – but first it had to make things worse. Yes, the cloud allows computing power to be ramped up and down at will in response to demand, but that now means the sheer liquidity of CPU, storage and memory far outstrips the speed at which databases can respond. Ironically, it’s the software that is now the bottleneck while the hardware is today so fast and fluid that it can shapeshift to meet all eventualities.

This is why the database industry has had to rethink its processes to get around the limited movement of two-dimensional structures of rows and tables stored in a single location. Getting data from different systems, and interrogating it, has become a massive problem thanks to the cloud and the IoT, mainly because it exists in so many formats. This punishes scientists in particular, since so many different fields of information are gathered from so many different disciplines. If, for example, researchers wanted to cross-compare genomics data with patient records and with, say, occupational information and the geographic spread of the subjects, this would call for collaboration among multiple specialties, each of which has its own uniquely shaped, intransigent silo of information.

This presents another distraction to scientists, because it offers them the chance to learn new programming languages in order to extract the relevant bits of information and bring them all together. Sadly, it’s an intellectual challenge that too many can’t resist, and research is often held up while scientists take on the extra task of learning how to code. That distraction could, however, be eliminated.

This is the problem that scientific database start-up Paradigm4 set out to solve with its SciDB system. At the risk of over-simplifying the rationale behind the system, SciDB industrialises the process of discovery by automating the work of stripping out data and connecting up the relevant parts.

“An epidemiologist doesn’t need to spend time learning about four different file formats, multiple interfaces and three new programming languages,” says Paul Brown, chief scientist at Paradigm4. “You don’t want to waste the expensive time of scientists on remembering HDF5 or Python interfaces.”

The IoT makes it even harder to keep pace with the variety of work, since machines are dumping information on the poor scientists at even faster speeds. Machine data is structured even more ‘wrongly’ than traditional information created by humans and set into rows and tables. Robots, it seems, like uniformity and structured data even less than humans do. So the data models of the relational database era are even further out of step with the randomness of the multidimensional reporting styles created by machines.

A couple of use cases from Paradigm4’s clients exemplify this.

The National Aeronautics and Space Administration (NASA) built a SciDB database that helped it manage and process a mass influx of Doppler radar data as part of its study of the intensity and frequency of storms in the US. Since the technology for gathering this data is constantly evolving, the data feeds come in a variety of file formats and levels of detail. Some of the information is measured to the nearest degree, while other sources record changes in tenths of a degree. Meanwhile, the representation of the data is equally diverse, with grey scale and raw signal data jostling alongside derived data values like precipitation rates and wind speed.

Without SciDB, NASA’s scientists would miss the bigger picture as it evolves, because they’d be too busy writing code to map the various representations together, says Brown.

Creating a consistent, abstracted representation of their information saves NASA’s scientists from having to hand-code a file data management system, even though they might enjoy it. This means they can cut straight to the job of examining multiple lines of IoT evidence by running ad-hoc queries. The more lines of inquiry they pursue, the more likely they are to crack an investigation, Brown adds.
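
To give a flavour of the mapping work involved, here is a minimal sketch in plain Python. It is not SciDB’s interface and not NASA’s code; the function names, the one-degree cell size and the sample readings are all invented for illustration. It snaps readings taken at whole-degree and tenth-of-a-degree resolution onto one shared grid and averages whatever lands in the same cell.

```python
from collections import defaultdict


def to_grid_cell(lat, lon, cell_size=1.0):
    """Snap a coordinate to the index of the grid cell that contains it.

    Floor division keeps negative longitudes in the correct cell.
    """
    return (int(lat // cell_size), int(lon // cell_size))


def merge_feeds(feeds, cell_size=1.0):
    """Average every reading, from every feed, that falls in the same cell.

    Each feed is a list of (latitude, longitude, value) tuples. The feeds
    may have been recorded at different resolutions (hypothetical data).
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [running sum, count]
    for feed in feeds:
        for lat, lon, value in feed:
            cell = to_grid_cell(lat, lon, cell_size)
            totals[cell][0] += value
            totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


# Invented sample data: one feed in whole degrees, one in tenths of a degree.
coarse_feed = [(35.0, -97.0, 4.0)]
fine_feed = [(35.2, -96.9, 6.0), (35.7, -96.1, 8.0), (36.1, -96.5, 1.0)]

grid = merge_feeds([coarse_feed, fine_feed])
for cell, mean_value in sorted(grid.items()):
    print(cell, mean_value)
```

Even this toy version has to settle questions such as cell size and how to combine overlapping readings, which is the kind of housekeeping Brown argues scientists should not be hand-coding for every new file format.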

Meanwhile, further south, Brazil’s National Institute for Space Research (INPE) uses data from NASA’s MODIS (Moderate Resolution Imaging Spectroradiometer) satellite instrument for its study of rainforest ecology. Its teams are using satellite data to monitor events in the Brazilian rainforest as climate change alters rainfall patterns. They’re also obliged to combine data from multiple satellites and to include a lot of “ground truth” sensor data.

The additional advantage of using SciDB is that it helps the team exploit the array data model to reduce the complexity of the data scientist’s analytic operations. Much of this IoT scientific data is ‘intrinsically ordered’, says Brown. “Data values are ‘left’ or ‘right’ of each other, meaning that each lies west, east, north or south, in relation to another value. Or it can be ‘up’ or ‘down’, meaning higher or lower in the atmosphere. To make things more complicated, the chronology of its collection is measured as being ‘before’ or ‘after’ related events.”
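
The idea of ‘intrinsically ordered’ data can be sketched in a few lines of plain Python. This is an illustration only, not SciDB code; the `readings` grid and the `neighbours` function are hypothetical. When values sit in an array indexed by time step, row and column, the reading to the north, the east or just before is found by adjusting an index rather than by searching through unordered rows.

```python
# readings[t][row][col]: t = time step, rows run north to south,
# columns run west to east. The values are invented for illustration.
readings = [
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],
    [[11, 12, 13],
     [14, 15, 16],
     [17, 18, 19]],
]


def neighbours(grid, t, row, col):
    """Return the values adjacent to grid[t][row][col] in space and time.

    Neighbours that would fall outside the array are simply left out.
    """
    offsets = {
        "north": (0, -1, 0),
        "south": (0, 1, 0),
        "west": (0, 0, -1),
        "east": (0, 0, 1),
        "before": (-1, 0, 0),
        "after": (1, 0, 0),
    }
    found = {}
    for name, (dt, dr, dc) in offsets.items():
        nt, nr, nc = t + dt, row + dr, col + dc
        in_bounds = (
            0 <= nt < len(grid)
            and 0 <= nr < len(grid[nt])
            and 0 <= nc < len(grid[nt][nr])
        )
        if in_bounds:
            found[name] = grid[nt][nr][nc]
    return found


# Neighbours of the centre cell at the second (and last) time step.
centre = neighbours(readings, 1, 1, 1)
print(centre)
```

In a table of unordered rows, the same question would typically mean filtering or joining on coordinate columns for every lookup.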

The ordering of satellite and IoT data may suit an array data model, but SQL-based database management systems, being characteristically unordered, cannot create the conditions under which parallel access to information can take place. This is where SciDB re-orders information and prepares it for a new way of working.

“IoT data is more than just Nest thermostat data, or your fridge’s complaints about the biological condition of your crème fraîche,” says Brown. “IoT data is going to be dominated by high bandwidth signal data from satellites, RADAR, LIDAR, radio frequency and infra-red cameras.”

The sheer volume of data, the number of users and the algorithmic complexity mean that scientists need a massively parallel cluster to cope. You can’t have a productive IoT without creating logical interfaces, so we can actually understand what all these machines are telling us, according to Brown.

The life sciences span a huge spectrum of biological and man-made systems, and no research will yield insights into the links between nature and nurture until all these multiple lines of evidence are integrated, says Brown.

You could argue that the IoT is creating too much data for scientists to cope with and that SciDB wants to save them from drowning in the rising seas of information.


Nick Booth

Nick Booth worked in IT in the UK’s National Health Service, financial services and The Met Police, witnessing at first hand the disruptive effects of new technology. As a journalist and analyst, his mission is to stop history repeating itself.
