Big Data and the harsh reality of the Hadoop hype

The following is a contributed article by Greg Rahn, director of product management at Snowflake Computing, a company that offers a SQL data warehouse for the cloud.

One of the clearest trends of the last several years is how data analytics has gone from being something almost invisible behind the scenes to a strategic initiative. Although this focus on extracting insights from data has led to a lot of projects, the painful reality is that a large share are failing to deliver. One recent survey found that only 5% of respondents had a fully deployed Big Data initiative and only 11% even had a pilot in place. That’s not a smashing success, to say the least.

What’s behind that reality?

When companies embark on a Big Data initiative, one of the key technical challenges is the result of significant changes in the data they have. Twenty years ago, all data worth analyzing was structured data coming from transactional business systems. That data is still important but in today’s world, many of the nuggets of data gold can be found in a different place—in machine-generated data that is typically semi-structured and hierarchically organized, for example in JSON or Avro formats.
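As a hypothetical illustration (the event fields below are invented, not drawn from any particular system), a single machine-generated event in JSON nests objects and variable-length arrays in a way that a fixed relational schema can't easily accommodate:

```python
import json

# Hypothetical machine-generated event: semi-structured and
# hierarchical, with a nested object and a variable-length array
# that don't map cleanly onto fixed relational columns.
event = json.loads("""
{
  "timestamp": "2016-04-12T09:30:00Z",
  "device": {"os": "Android", "model": "Nexus 5"},
  "events": [
    {"type": "click", "target": "signup"},
    {"type": "scroll", "depth": 0.8}
  ]
}
""")

# Flattening one level of the hierarchy into row-like records --
# the kind of reshaping a rigid up-front schema would force on you.
rows = [
    {"timestamp": event["timestamp"],
     "os": event["device"]["os"],
     **e}
    for e in event["events"]
]
print(rows[0]["type"])  # click
```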

Because traditional databases weren’t built to handle this type of data, many people have looked to Hadoop for a solution. That’s a key reason why Hadoop has received a deafening amount of attention and hype, often being called on as the miracle cure for data problems. However, the hype obscures the fact that it’s easy to take an interesting technology like Hadoop and head down the wrong path with it.

The fact that Hadoop is available as open source software that can run on any commodity or cloud hardware makes it really easy to experiment with it for lots of different projects — data lakes, machine learning, data processing, ETL and more. Those experiments turn into grassroots projects, and soon the momentum of those experiments leads to a larger project. So far so good.

But then things start to go off track. When companies try to transition from experiment to deployment, reality sets in. It's at that point that people commonly discover that Hadoop is ill-suited for many of the applications they had in mind. So they end up investing more and more time and resources in trying to force-fit the technology into a role it was never designed for.

So what are some best practices to avoid being side-tracked with Hadoop?

Here are a few:

Clearly identify the problem you’re trying to solve. Picking the technology before identifying the problem is a critical mistake, particularly with Hadoop. A number of organizations are struggling because they made the decision to implement Hadoop before they fully understood their business need.

Learn what Hadoop is good at — and not good at — before kicking off a Hadoop project. Expecting Hadoop to deliver interactive analytics would be a mistake. But if you’re looking to do sophisticated machine learning in batch mode, Hadoop can be the right option.
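To make that batch orientation concrete, here is a toy sketch of the map-then-shuffle-then-reduce processing model Hadoop was built around, run locally in plain Python purely for illustration (no cluster, no Hadoop APIs involved):

```python
from collections import defaultdict

# Toy sketch of the MapReduce batch model: a map phase emits
# key-value pairs, a shuffle groups them by key, and a reduce
# phase aggregates each group. The whole dataset is processed
# as one batch job, not as an interactive query.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    groups = defaultdict(int)
    for word, count in pairs:  # shuffle and reduce, combined here
        groups[word] += count
    return dict(groups)

corpus = ["big data big hype", "big plans"]
counts = reduce_phase(map_phase(corpus))
print(counts["big"])  # 3
```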

Don’t confuse theoretically possible with practically achievable. The flexibility of Hadoop means it’s possible to do many different things with it, but that doesn’t mean Hadoop is the right tool for every job at hand. After all, just because a Swiss army knife has a saw blade doesn’t mean you should try to use it to cut down trees.

Assess the fit between Hadoop and your organization. Hadoop requires retraining and retooling if you want to scale it to support use beyond sophisticated data science programmers. Its complexity also requires dedicated operations teams to keep it up and running. Make sure you understand the cost and effort involved so you can decide whether that is a good investment.

Applied honestly, these guidelines will sometimes conclude that Hadoop isn’t the right technology for a given project. Many existing technologies, including the venerable data warehouse, continue to evolve and for many needs deserve a fresh look. New capabilities in cloud infrastructure, automatic optimization and elasticity can make them dramatically more capable and accessible than before.

New technologies open up new possibilities, but also the risk of a quagmire of unrealistic expectations and failed projects. By looking past hype to choose the right technology for the right task, companies can focus on what matters: gaining insight from data to drive business success.


Greg Rahn has been a performance engineer for over a decade working on both RDBMS and Hadoop SQL engines. He spent eight years running competitive data warehouse benchmarks at Oracle as a member of the esteemed Real-World Performance Group as well as working on Impala while at Cloudera. He now runs SQL benchmarks for fun and education.
