From Berkeley Labs to democratising AI with APIs

Apache Spark is an open source cluster computing framework which provides an API that enables Big Data processing.

Originally developed at Berkeley’s AMPLab in 2009, Spark was created to overcome the limitations of the MapReduce framework and speed up large-scale data processing within one unified package.

Spark's creators open-sourced the project in 2010, donated it to the Apache Software Foundation in 2013, and set up Databricks the same year. The company provides a commercial, eponymous data platform based on Spark. As the home of Spark's developers and the largest contributor to the project, Databricks is intrinsically linked to Spark.


Democratisation of AI

With the latest 2.0 version of Spark, Databricks CEO and co-founder Ali Ghodsi is hoping to bring Artificial Intelligence to the masses.

“People have been working on it [Machine Learning] for 20, 30, 40 years, but it's really breaking through now. We're seeing it everywhere,” he says. “The same algorithms that existed in the 70s or 60s now become powerful because you have a lot of data.”

Citing a Google research paper from 2014, Ghodsi suggests the most challenging parts of Machine Learning have very little to do with writing the code.

“Most of the time and effort of building machine learning systems goes to configuring them, collecting these massive amounts of data that these algorithms need, doing feature engineering, extracting the features that you need, tuning that, and then running it through machine learning, then doing the verification, using tools to make sure that you’re managing all these resources that you have.”
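
The stages Ghodsi lists can be made concrete with a toy sketch. The example below is illustrative only: it uses plain Python rather than Spark, the data is synthetic, and every function name (collect, fit_scaler, scale, fit_threshold, accuracy, run_pipeline) is a hypothetical stand-in rather than anything from Spark or Databricks. It shows that the step of actually fitting the model is a single call surrounded by collection, feature engineering, tuning and verification.

```python
import random
import statistics


def collect(n, seed=0):
    """Stand-in for data collection: synthetic (raw_value, label) rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)
        rows.append((x, 1 if x > 5.0 else 0))
    return rows


def fit_scaler(rows):
    """Feature engineering, part 1: learn mean and spread from training rows."""
    xs = [x for x, _ in rows]
    return statistics.mean(xs), statistics.pstdev(xs)


def scale(rows, mean, std):
    """Feature engineering, part 2: standardise the raw value."""
    return [((x - mean) / std, y) for x, y in rows]


def accuracy(rows, threshold):
    """Verification: share of rows the threshold classifies correctly."""
    hits = sum(1 for f, y in rows if int(f > threshold) == y)
    return hits / len(rows)


def fit_threshold(train, candidates):
    """The 'algorithm' plus tuning: pick the cut-off that scores best on train."""
    return max(candidates, key=lambda t: accuracy(train, t))


def run_pipeline(n=200, seed=0):
    raw = collect(n, seed)
    split = int(0.8 * len(raw))              # hold out 20% for verification
    mean, std = fit_scaler(raw[:split])      # statistics come from train only
    train = scale(raw[:split], mean, std)
    test = scale(raw[split:], mean, std)
    candidates = [i / 10 for i in range(-20, 21)]
    threshold = fit_threshold(train, candidates)
    return threshold, accuracy(test, threshold)


if __name__ == "__main__":
    threshold, test_accuracy = run_pipeline()
    print(f"threshold={threshold:.1f} test accuracy={test_accuracy:.2f}")
```

Only fit_threshold corresponds to "running the algorithm"; everything else in run_pipeline is the surrounding work the quote describes.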


He says that up to 50% of a data scientist’s time can go on feature engineering, and part of Databricks’, and Spark’s, mission is to streamline the whole process.

“The hard part of this is really all the other stuff that goes around it, not necessarily running the algorithm. So how do we democratize this?”

He says that what makes Spark uniquely positioned to be a great democratizer is that it unifies lots of different paradigms into one language, offering far wider opportunities than what he calls single-use-case frameworks such as Storm, MapReduce, or even Google’s TensorFlow.

“When people want to solve a problem, they're not looking at ‘Oh, how do I do this particular step really well?’ They have an end-to-end problem. It’s just much more powerful to be able to do that in one framework, or speak one language that lets you do that.”


So easy, anyone can mock Boris Johnson

During the 2016 Spark Summit in Brussels, Belgium, Ghodsi showed live on stage how easy it can be to start using Machine Learning. In a 10-minute demonstration, he took a picture of UK foreign secretary Boris Johnson and trained the system to hallucinate images of dogs within it. The results were pretty upsetting, but effective at getting the ease-of-use message across.

“The technology isn't Excel yet, but it will be, I think.”

When he talks about making Machine Learning more Excel-like, Ghodsi means simplification and commoditisation, and he likens the technology to the history of compilers.

“If you look at it [AI/Machine Learning etc.] right now, it's a little bit of a black art. It's much more art than science. It's how you know how to tune it, and you need to be the expert that knows how to do these things.”

“No one cares about compilers anymore, but there was an era where compilers were also considered a black art, and there would be these very few experts on the planet who knew just how to hardcode in the grammar and make it work and then it would just work, and it was a really hard thing.”

“Then it became commoditised, people came up with abstractions, now there are books on it, there is software you can use to write your own compiler, and it's totally not considered a difficult problem anymore. That will happen to Machine Learning. That will happen to Big Data processing. To AI.”


Spark and Databricks

Although Ghodsi was part of the original team that developed Spark back at Berkeley, he only became CEO of Databricks in January 2016. The fates of his company and his creation are closely tied.

“The goal is to make it the big data platform that others can build their applications on top of.”

As the biggest contributor to Spark, Databricks sees itself as the steward of the whole community. The company runs meetup groups across the world that boast over 200,000 members, and takes education seriously.

“The biggest problem is the skills gap. This stuff is hard. You need to train lots of people to learn to use this stuff and solve their problems with it.”

As well as offering a freemium community edition of its product so people can get to grips with the technology, the company has free MOOCs on its site.

“It's not just volunteers sitting in their cellars coding and submitting some patch and go and having a different daytime job. When we're talking to some CIOs I ask, ‘How many of you know what Spark is, how many of you took our free MOOC?’ and there's always a bunch of hands in the room.”

“Those people that have used it, they will be the champions inside that company that say ‘Hey, don't be afraid of it.’ That's important for the success of Spark, and it's important for the success of Databricks.”

Ghodsi also highlights that Spark is a platform that fits in with business ecosystems. He cites database management company DataStax as an example.

“They have a dedicated person whose job is to make sure Cassandra works well with Spark, and there's a lot of companies like that; [there] to make sure Spark works in their ecosystem.”

“That makes it extremely powerful, and it also makes it sticky. It's going to be around for a long time to come, because people have invested so much time into integrating it with all these different systems.”


IBM and legacy corporations

One of the biggest companies to jump on the Spark bandwagon is IBM. Big Blue announced it was investing $300 million in Spark as part of its drive for better analytics.

“It was great,” Ghodsi says of the investment. “When we started, people were extremely uninterested in Spark. There was this ‘no way are we actually going to use this ever’ sort of general sentiment.”

“Seeing the IBM endorsement is sort of like, ‘OK, it's here to stay.’”

He also sees Big Blue’s backing as the kind of rubber stamp big corporations need to start properly looking at what Spark can do.

“A lot of big corporations are pretty risk averse and they just don't want to jump on any new technology.”

“Some companies are still on mainframes [IBM announced Spark for Mainframes in March]. They don’t even use Hadoop or things like that, they use traditional technology that they have. So IBM endorsing it gives them the sort of assurance that they need.”



A lot of major companies, most notably Microsoft in recent years, are embracing the idea of Open Source. It’s an idea the Spark team always had in mind for their creation.

“We always wanted Spark to work really well with lots of other tools. Spark could not have been successful if it was this secret sauce we built that was proprietary and it on its own is the best thing out there and you don't need anything else.”

Ghodsi says part of Spark’s success is down to this openness, its standardized nature, and its flexibility with all kinds of frameworks and data types.

“We think it's the way the whole world is going. A lot of traditional proprietary software is being replaced by Open Source. There's no doubt about whether that trend is going to continue. There were actually a lot of frameworks at the same time out there that were proprietary, they never made it big.”

“Companies don't want to be locked in. People are fed up with these proprietary software vendors that are repeatedly just increasing the price, and not giving them much new value which matches that [increase], and they're locked in.”


Dan Swinhoe

Dan is a journalist at CSO Online. Previously he was Senior Staff Writer at IDG Connect.
