Guy Harrison (Global) - Big Data and Emerging NoSQL Databases Shift to Hybrid Database Environments

The future of big data is in flux. Guy Harrison, director of research and development at Quest Software, discusses how emerging NoSQL databases are contributing to a shift toward hybrid database environments.

Many will be familiar with the Gartner hype cycle, which describes how new technologies are initially subject to a "peak of inflated expectations," followed by a "trough of disillusionment." These two overreactions are typically followed by the "slope of enlightenment" and the "plateau of productivity," after which the technology finds its proper place in the enterprise IT portfolio.

Two to three years ago, the first serious challenges to the relational database (RDBMS) emerged. The increasing prevalence of truly massive Web 2.0 consumer applications strained the traditional RDBMS stack to breaking point. At the same time, the emergence of credible cloud computing infrastructures - notably Amazon AWS and Microsoft Azure - demanded a more elastic provisioning model than the traditional RDBMS could provide.

By the middle of 2009, more than a dozen significant alternatives to the RDBMS had emerged, and the blanket term "NoSQL" was used to describe this disparate mix of technologies. An immediate "peak of inflated expectations" followed, with many proclaiming the beginning of the end for the RDBMS. Now - in mid-2011 - there are signs that we may be entering the trough of disillusionment, with some suggesting that the entire NoSQL movement was merely a passing fad with no enterprise relevance.

The term NoSQL was arguably poorly chosen and definitely over-hyped. Yet there are some definite cases where a non-relational alternative can be more practical or economic than the traditional relational approach.

Massive Web-scale applications pose particular challenges for database management systems. When you approach the scale of Twitter, Facebook or Amazon, it's obvious that a single database is unlikely to provide the performance, fault tolerance and economy of scale that are required. Many of the larger Web 2.0 sites - and quite a few smaller ones - have resorted to "sharding" databases across multiple hosts, but this is widely recognized as an inadequate solution. Massively distributed online databases such as Cassandra and HBase have emerged to try to meet this need.
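
As a rough illustration of what application-level sharding involves, the sketch below routes each key to a host by hashing it and taking the remainder. The hashing scheme, the function names and the key format are assumptions made for this example, not a description of how any particular site shards its data. It also shows one commonly cited pain point: with simple modulo routing, adding a single host reassigns most keys, so data has to be moved between hosts.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard for a key using a stable hash (hypothetical routing rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical user keys, used only to exercise the routing rule.
keys = [f"user{i}" for i in range(10_000)]

# Growing from 4 to 5 hosts relocates roughly four keys in five.
fraction = moved_fraction(keys, 4, 5)
print(f"{fraction:.0%} of keys change shard")
```

Distributed stores typically hide this routing and rebalancing from the application, which is part of their appeal over hand-built sharding.
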

The other major driver for enterprise NoSQL adoption is the increasing desire to manage and exploit "big data." Many applications generate masses of unstructured data - web logs and so forth - containing information that can potentially create great competitive advantage. Predictive analytics, churn forecasts, social networking, default prediction, and many other business-critical functions can be tackled by processing this unstructured data.

Until recently, there have been few practical options for processing this big data, other than to load it into an RDBMS - often at great risk and expense. This expense includes both the cost of data warehouse hardware and software, and the consulting and project costs involved in the Extract, Transform and Load (ETL) of the data.

Hadoop - an open source Apache project - provides an economical means of processing unstructured data without undertaking an expensive and risky ETL project. Hadoop includes open source implementations of many of the technologies that allowed Google to pioneer massively parallel indexing of the Web. The economics of Hadoop can be compelling: for example, the cost of storing data in a high-end SAN-backed database can be as much as 20 times greater per gigabyte than storing it in a commodity Hadoop cluster.
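
To give a feel for the processing style that Hadoop popularized, here is a minimal single-machine sketch of the map, shuffle and reduce phases applied to raw web log lines. It is plain Python and does not use Hadoop's API; the log format, the sample lines and all function names are invented for illustration. The point is that the raw lines are parsed at processing time, with no up-front load into a schema.

```python
from collections import defaultdict

# Hypothetical raw log lines in the form "<ip> <path> <status>".
LOG_LINES = [
    "10.0.0.1 /home 200",
    "10.0.0.2 /cart 200",
    "10.0.0.1 /cart 500",
    "10.0.0.3 /home 200",
]


def map_phase(line):
    """Emit a (path, 1) pair for each log line."""
    _ip, path, _status = line.split()
    yield path, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


hits = run_job(LOG_LINES)
print(hits)  # {'/home': 2, '/cart': 2}
```

In a real cluster the map and reduce steps would run in parallel across many commodity machines, which is where the economics described above come from.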

Enterprises that don't require a large online presence may find the scalability goals of databases such as Cassandra and HBase unnecessary. But many are seeing the competitive advantages and cost savings that can be gained from big data technologies such as Hadoop.

Consequently, many enterprises are seriously investigating Hadoop and other non-relational databases. As these pilot projects graduate to production status, one of the most obvious challenges will be data and information integration. The data held in the RDBMS and in newer non-relational stores cannot remain isolated - effective business decision-making depends on the integration of all data sources, and a lot of unstructured data does not make sense until married with relational master data.
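
The integration problem can be sketched in a few lines. In the example below, an in-memory SQLite table stands in for relational master data, and a plain dictionary stands in for the output of a non-relational job; the table layout, customer names and view counts are all hypothetical. The counts only become meaningful once each customer id is matched against the master record.

```python
import sqlite3

# Relational side: a tiny customer master table (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "enterprise"), (2, "Bolt", "smb")],
)

# Non-relational side: page views per customer id, as a log-processing
# job might produce them (hypothetical values).
views_by_customer = {1: 120, 2: 15, 3: 7}


def enrich(conn, counts):
    """Attach master data to each count; unmatched ids are marked unknown."""
    rows = []
    for customer_id, views in sorted(counts.items()):
        master = conn.execute(
            "SELECT name, segment FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        name, segment = master if master else ("unknown", "unknown")
        rows.append((customer_id, name, segment, views))
    return rows


report = enrich(conn, views_by_customer)
for row in report:
    print(row)
```

Customer 3 appears in the log-derived data but not in the master table, which is exactly the kind of gap that integration tooling has to surface.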

It's no surprise, therefore, to see BI and data integration vendors rush to support Hadoop and other non-relational technologies. They anticipate that these new databases will be a significant, although hardly dominant, part of enterprise IT database infrastructure.

IT decision-makers need to become familiar with the strengths and weaknesses of non-relational systems so they can make informed decisions as to their possible place in the IT infrastructure. The "one size fits all" RDBMS has made database technology decisions relatively easy; in a hybrid future, picking the right database tool may become more complex.

Guy Harrison is the director of research and development for Quest Software in Australia, and the author of many books, articles and presentations on database technology. He is @guyharrison on Twitter.