How can we handle Big 'Genome' Data?

Steven Newhouse of The European Bioinformatics Institute discusses the challenges of storing and managing enormous volumes of biological data

“The biggest surprise is really the relentless growth of the data,” says Steven Newhouse, head of technical services at The European Bioinformatics Institute (EMBL-EBI), over the phone from the UK.

EMBL-EBI acts as a reference archive for large biological datasets. These are public-domain datasets contributed by scientists and research organisations that wish to support open publication and aid the field of medicine by providing the data needed for better statistical analysis.

EMBL-EBI is one of a number of similar organisations across the globe working to support the growth of personalised medicine. At present, one of the core challenges is how to store and manage the data.

“However much work is being done on, say, compression formats and things like that, the ability of the new instruments to sequence [genomes] with increasing accuracy [is rising]. This allows a lot of [good] data to be generated,” explains Newhouse. “Then there is the breadth of the data. We don’t only deal in human data; we also deal in non-human,” such as plants.

“What we’re seeing across the life sciences area is the cost of sequencing genomes is dropping dramatically,” adds Newhouse. “It is becoming possible for labs, hospitals and so forth to start sequencing genetic samples and start contributing those samples to databases such as ours, hence the massive exponential increase we’re seeing in the data that is being deposited into our archive.”

Where in the past it might have cost a billion dollars and taken ten years to sequence a genome, “now for about $1,000 you can get a genome sequence overnight,” says Newhouse.

To provide some context on just how big this is: “EMBL-EBI manages well over 50 petabytes of data (about 100,000 laptops’ worth), and this amount is approximately doubling in size every year,” Delphix stated in a recent press release announcing the deployment of its Data as a Service platform by EMBL-EBI. The platform is set to decrease EMBL-EBI’s data storage footprint by 70% and make it far faster to deliver data to interested parties.
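To get a feel for what annual doubling means, a back-of-the-envelope sketch helps. The function names and the five-year horizon below are illustrative assumptions; only the ~50 PB starting point, the annual doubling, and the 70% footprint reduction come from the figures quoted above.

```python
# Illustrative projection of archive growth (hypothetical helper names).
# Assumes: ~50 PB today, doubling every year, and a 70% reduction in
# physical footprint from the virtualisation platform, as quoted above.

def projected_archive_pb(start_pb: float = 50, years: int = 5) -> float:
    """Raw archive size in petabytes after `years` of annual doubling."""
    return start_pb * 2 ** years

def physical_footprint_pb(raw_pb: float, reduction: float = 0.70) -> float:
    """Physical storage needed after the stated footprint reduction."""
    return raw_pb * (1 - reduction)

raw = projected_archive_pb(start_pb=50, years=5)
print(raw)                        # 1600.0 PB of raw data after 5 years
print(physical_footprint_pb(raw))  # 480.0 PB of physical storage needed
```

Even with a 70% footprint reduction, five more years of doubling would leave nearly ten times today's raw volume to store, which is why the archiving models discussed below are being weighed up now.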

Solutions like this matter because the rise in data volumes is only likely to snowball. There are a “number of scenarios being discussed” as to how this will be handled in future, says Newhouse. These range from one extreme, where data is analysed as it arrives and then discarded to save space, to the other end of the spectrum, where a more distributed form of archiving is used.

“This is something that the community as a whole recognises as a challenge,” explains Newhouse. “There are a number of models. The selection of one model over another is not something that has been settled yet.”

This sheer volume of data also has implications for individual researchers. When the datasets were smaller, researchers simply downloaded the information they needed and worked on it with their own facilities. That soon ceased to be viable, and in the next phase researchers would interrogate EMBL-EBI’s database via the web. Yet again, only those with enough networking and storage capacity could use it effectively.

Now the organisation has put in place a cloud-based solution, Embassy Cloud. This allows researchers to “gain access to our data and our services and get all the flexibility of using their own infrastructure without downloading the data to their own resources,” explains Newhouse.

The real promise of personalised medicine – which not everyone sees as a good thing – is that it will boost our understanding of our own genome sequences, and help predict how an individual might react to any particular drug based on their genetic profile correlated against other parts of the database.

Of course, the reliability of this will depend heavily on the size of the database. In fact, this hit the news this week when the Wall Street Journal reported: “A new study has triggered a dispute about the accuracy of genomic tests that are increasingly used to match cancer patients with drugs that attack their tumours.”

It is likely to be a while before a truly usable big database exists. “My focus is to provide the infrastructure to support that research activity,” says Newhouse. “This is not something that is going to be routinely available in the next few years. But the infrastructure is being put in place to make this sort of thing available in the future… although I couldn’t put a date on that.”