Analytics Software

Why IBM is backing Spark

Although it’s relatively new, Apache Spark is gathering quite a bit of momentum in the Big Data space. The Open Source cluster computing framework provides an API that enables Big Data processing, and is quickly gaining a dedicated following.

One of the biggest companies to jump on the Spark bandwagon is IBM. In 2015, Big Blue announced it was investing $300 million in Spark as part of its drive for better analytics. In a press release announcing its commitment to the cause, the company labelled it “Potentially the Most Significant Open Source Project of the Next Decade.”

“We started this journey a little more than a year ago, where we decided that Spark is going to be our execution engine,” says Dinesh Nirmal, Vice President of Big Data at IBM, during a chat at last year’s Spark Summit. “For two reasons: one, it gives us a standardised platform for easy development; two, it gives us an execution engine that's scalable.”

Although Spark is often pitched as an answer to the limitations of MapReduce, Nirmal sees it as a way to overcome Hadoop’s complexity.

“It is not easy to set up a big data Hadoop infrastructure,” he says. “There's 30, 40 components, and all of a sudden customers need a tremendous amount of knowledge and skills to implement those things.”

“We are seeing more and more customers embracing Spark as we - I don't want to say go away from Hadoop - but we see a lot of customers looking at Spark much more seriously.”

“Spark makes this really easy. You get the performance; you get the faster development. I think that's one reason why we see that shift happening where Spark is gaining real momentum.”


IBM taking Spark seriously

There’s been plenty of action from IBM around Spark. The company set up the Spark Technology Center in 2015, and announced Spark for Mainframes in March. And, perhaps most significantly, Spark was used as the basis for the recently announced Watson Data Platform, the company’s new Machine Learning-based analytics offering.

“We have 50+ developers in the Spark Technology Center, we have 45+ products that are using or planning to use Spark as the execution engine. We have built our Watson Data Platform on top of Spark. All those things are what makes us very unique in the investment side of Spark.”

So what makes Spark such an appealing prospect? According to Nirmal, there are three key elements: Data, Cloud, and Open Source.

“The proliferation of data is huge, and do we have an execution engine that can process that data at high speed? How can we get an execution engine that can scale on Cloud? And there is an embracement happening in Open Source, who is the one who is going to play a huge role in the coming decade?”


Watson & Machine Learning

During his keynote, Ali Ghodsi, one of Spark’s creators and CEO of Databricks, talked about how Spark can help “democratise” Machine and Deep Learning, which he described as currently something of a “black art”. The new Watson Data Platform is a prime example of that.

IBM has been in the analytics and data space with the likes of SPSS and InfoSphere Streams, but the company wanted something faster and easier.

“I don't think we could have done this without Spark,” explains Nirmal. “It was a concept six months ago, and we have gone into closed beta and soon to be GA.”

“Even if you are a professional who has no knowledge of data science, you can bring your data [to the WDP],” says Nirmal. “We can take it from one end of a novice or very low-skilled professional, to the other end of an enterprise customer who has complex models, who wants to manage, monitor, deploy in a large scale.”

“So if you're a real estate agent and you have last five years’ worth of data but you want to predict the next five years but you don't want to go hire a data scientist to do it, we’ve allowed that individual to just throw in the data, and we do all the modelling.”

“I look at it as an end to end data life cycle, where you can ingest the data at high speeds to visualise the data in that lifecycle without ever dropping the data. You never have to write on the disk, we keep it in the data lake.”

“We believe that the ML that we are putting out as part of the Watson Data Platform definitely is very intuitive, very simple, very collaborative, converged, and High Performance, and that we couldn't have done without Spark.”


Adapt or Die

You’ve probably noticed that IBM has been betting big with Watson and Machine Learning of late. But at a recent Deep Learning event, a speaker said that while we live in the era of “Big Data”, we were still lacking big insights and big actions as a result.

“I think companies are adapting,” Nirmal retorts. “The proliferation of data is happening, and it's happening at a very fast pace. Companies have to adapt their processes and policies to meet that data avalanche.”

“So it will probably take some time: ‘Here is the social data coming in, how do I mix that with the weather data, or how do I bring in other unstructured data like Twitter data?’ Then you obviously have your transaction data that is still sitting in the systems like the mainframe or other huge hardwares out there. How do you combine all those data together to make insights out of the data?”

“It will take some time for customers to figure out, but it's not about will they participate; it's about how soon they will participate in machine learning,” he says. “There's no question about it, because without that they won't be able to survive.”



Dan Swinhoe

Dan is a journalist at CSO Online. Previously he was Senior Staff Writer at IDG Connect.
