Open Source Cracks the Cloud Data-at-Scale Challenge

A few years ago, exploring structured and unstructured data using a single solution wasn’t possible. Today it is...

Building, integrating and scaling search solutions to meet the challenges posed by structured and unstructured data has historically proven daunting. Part of the problem is the sheer volume of data that society and businesses generate. In addition, the commoditization of storage and the lower costs that come with it lead businesses to keep data longer than they previously did, further adding to the volume of data available to analyze. A second issue is that traditional analytics technologies have been limited to processing structured data. Until a few years ago, exploring structured and unstructured data using a single solution wasn’t possible. Today it is, even in real time.

Today’s search delivers a powerful combination of real-time unstructured data and text search, structured search, and advanced analytics to reveal information previously hidden deep within those terabytes or petabytes of data. Some open source search solutions take entirely new approaches to this challenge and are creating a next-generation platform to power search and analytics as a core capability within business-to-business and consumer-facing web applications.

During the late stages of the 2012 Presidential campaign, thousands (if not millions) of tweets gave daily hints about the upcoming vote and sentiments about the candidates. While that unstructured text alone provided interesting insight, there was also an underlying wealth of information to be found and analyzed in the structured information associated with each of those tweets: metadata such as who sent it, where it was sent from, and when.

One campaign in particular indexed Twitter data streams and then analyzed them using Elasticsearch, a popular open source search and analytics engine. Using Elasticsearch, the campaign could search for tweets expressing positive sentiment about its candidate and then, by drawing on the structured data beneath them, present the results on a timeline, broken down by location. The campaign could then alter those analytical dimensions, reformulating the analysis and receiving a new response seconds later to learn more about the election in another state or a different timeframe. Even running the same analysis for the competing candidate was possible by simply changing the search term. Real-time analysis such as this became increasingly important as the election grew closer in battleground states like Ohio.
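To make the idea concrete, here is a minimal sketch in plain Python of the kind of analysis described above: a text match on the unstructured tweet body, bucketed by the structured metadata (location and day). It does not use the Elasticsearch API. The sample tweets, the candidate names "Smith" and "Jones", the field names, and the function `timeline_by_location` are all hypothetical, and a simple term match stands in for real sentiment analysis.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; names and field layout are illustrative assumptions.
TWEETS = [
    {"user": "a", "text": "Great speech by Smith tonight", "state": "OH",
     "created_at": "2012-10-30T20:15:00"},
    {"user": "b", "text": "Smith has my vote", "state": "OH",
     "created_at": "2012-10-30T20:45:00"},
    {"user": "c", "text": "Jones rally was packed", "state": "FL",
     "created_at": "2012-10-30T21:05:00"},
    {"user": "d", "text": "Not convinced by Smith", "state": "FL",
     "created_at": "2012-10-31T09:30:00"},
]


def timeline_by_location(tweets, term):
    """Count tweets whose text mentions `term`, grouped by (state, day)."""
    counts = Counter()
    needle = term.lower()
    for tweet in tweets:
        # Unstructured part: case-insensitive match on the free text.
        if needle in tweet["text"].lower():
            # Structured part: bucket by location and calendar day.
            day = datetime.fromisoformat(tweet["created_at"]).date().isoformat()
            counts[(tweet["state"], day)] += 1
    return dict(counts)


smith_timeline = timeline_by_location(TWEETS, "smith")
jones_timeline = timeline_by_location(TWEETS, "jones")
```

Switching the analysis to the other candidate only requires changing the search term, which mirrors the point made above.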

Examples like this prove that search is no longer an either/or proposition when it comes to structured and unstructured data. Solutions must let users apply structure to unstructured data through any attribute that data holds. Unfortunately, traditional enterprise search solutions were initially concerned with how, and how much, structured data was stored and indexed. In many ways, it became a race to see which solution could connect to the most external systems and index the most documents and document types, rather than how those documents could be explored. And while those enterprise search solutions seemed valuable because of their connectors to third-party systems, the result was highly complex data ecosystems that in the end provided very limited business insight.

Open source solutions today take the opposite approach to search and analytics, focusing first on how data can be explored more powerfully, with little concern for what the data is or where it resides. By using an inverted index data structure, open source solutions like Elasticsearch allow businesses or consumers to more easily explore the explosion of both structured and unstructured data such as web content, business transactions, or social media streams. This is a big step forward from what solutions with similar objectives were capable of five years ago.
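For readers unfamiliar with the term, an inverted index maps each term to the set of documents that contain it, so a query looks up terms directly instead of scanning every document. The following is a toy sketch of that idea in Python, not how Elasticsearch is implemented internally. The sample documents and the helper names `tokenize`, `build_inverted_index`, and `search` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start from the first term's postings, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
DOCS = {
    1: "Election night results in Ohio",
    2: "Ohio business transactions report",
    3: "Concert review and photos",
}
INDEX = build_inverted_index(DOCS)
```

With this index, `search(INDEX, "ohio election")` intersects the document sets for "ohio" and "election" rather than reading each document in turn.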

Modern and powerful solutions like Elasticsearch apply structure to all stored data using a data format called JSON. JSON is a data representation format that is easy for both humans and systems to digest. Because JSON applies structure to data, it can power advanced analytics on data of all sorts, thereby opening a whole new world of data analytics possibilities.
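As an illustration, a single JSON document can carry free text alongside structured fields. The sketch below parses a hypothetical tweet-like document with Python's standard `json` module and lists the fields it exposes. The field names and the helper `field_paths` are assumptions made for this example, not a schema that any product requires.

```python
import json

# Hypothetical document: free text plus structured metadata in one record.
raw = """
{
  "user": "example_user",
  "text": "Polls open early in Ohio",
  "location": {"state": "OH"},
  "created_at": "2012-11-06T07:00:00"
}
"""

doc = json.loads(raw)


def field_paths(value, prefix=""):
    """List dotted paths to every leaf field in a parsed JSON object."""
    if isinstance(value, dict):
        paths = []
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths.extend(field_paths(child, child_prefix))
        return paths
    return [prefix]


paths = field_paths(doc)
```

Every path in `paths`, including the nested `location.state`, is an attribute that a search or analytics query could filter or group on.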

By combining structured and unstructured data search and analytics, enterprises can throw whatever data they want at a search and analytics solution, making all of that data searchable and useful in real time. Ultimately, the power and usefulness come from how users get to the data and analyze it in additional ways, not from how that data is stored.

Open source, by its nature, also empowers developers to explore, adopt, and utilize search in new and interesting ways within applications or web properties. Elegantly designed open APIs allow them to easily integrate robust search into whatever they are building. Take the social network application Path, which just passed 10 million users: a single post about an event (say, a pop concert) could contain between 60 and 100 different structured and unstructured attributes. The ability to search and discover across all of those slices of information gives users a powerful, in-depth way to get a richer experience from their social network.

Experiences like that will drive deeper social relationships. More broadly, search and analytics, properly harnessed, will drive innovation, create new applications and help enterprises meet constantly changing market demands.

By Shay Banon, Founder and CTO, Elasticsearch