Deep Learning has a data problem

At the Re:Work Deep Learning Summit in London, experts warned that the data deluge isn't enough

Despite being told we're now in the age of big data, data explosions, and exabytes, Artificial Intelligence is still suffering from a lack of data.

“There are people saying the same things they were in the 90s,” Professor Neil Lawrence of the University of Sheffield said at the Re:Work Deep Learning Summit in London. “The only difference is scale.”

At the summit, various experts in AI [Deep Learning is a branch of AI in which the learning systems are loosely inspired by the human brain] showed off an impressive array of uses, ranging from robots interacting with the world to intuitive health diagnostics and intelligent chatbots. Currently, these systems are often very good at one specified task: identifying certain images, translating text, playing Mario, and so on. But Google's DeepMind is now working on ways to train systems to transfer learning: not only mastering one Atari classic, but then using those lessons in other Atari games, and then applying the same methods to teach robots how to better interact with the world.
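The core of that transfer idea can be sketched in miniature: instead of learning a new task from scratch, reuse what was learned on a related one as the starting point. The toy Python example below (the tasks and names are invented for illustration; this is not DeepMind's actual method) fits a tiny linear model on one task, then fine-tunes those learned weights on a second, related task with far fewer training steps.

```python
def train(xs, ys, w=0.0, b=0.0, lr=0.05, steps=500):
    """Fit y = w*x + b by gradient descent on mean squared error.
    Passing in non-zero w/b lets us start from previously learned weights."""
    n = len(xs)
    for _ in range(steps):
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * dw
        b -= lr * db
    return w, b

# Task A: y = 2x. Learned from scratch over many steps.
xs_a = [0.0, 1.0, 2.0, 3.0]
w_a, b_a = train(xs_a, [2 * x for x in xs_a])

# Task B: y = 2x + 1, a related task. Rather than starting from zero,
# reuse the task-A weights and fine-tune with a fraction of the steps.
xs_b = [0.0, 1.0, 2.0, 3.0]
w_b, b_b = train(xs_b, [2 * x + 1 for x in xs_b], w=w_a, b=b_a, steps=100)
```

The fine-tuned model converges quickly because most of what task A learned (the slope) carries over; only the offset is new. Scaled up to deep networks, this is the same bet: lessons from one Atari game should shorten the path to the next.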

However, speakers repeatedly highlighted the problems around both gathering the necessary data and training these highly complex systems with that information. SwiftKey CTO Ben Medlock – whose intelligent keyboard app was acquired by Microsoft earlier in the year – warned that current learning systems are still “oceans apart” from the efficiency of the human brain.

“It’s very easy to look at the successes of Machine Learning and believe we’re racing towards human-level intelligence,” he said. “But if you look at how the learning occurs, it’s very different. We still require learning from vast quantities of data, and fundamentally the trained human brain learns from very few data samples.”

Collecting large data sets might be easy for the likes of Google and Facebook, but they rarely share, meaning startups are often left to do the legwork themselves. And even once you have enough properly labelled data – labelling is still often a manual, time-intensive job despite advances in unsupervised learning – it can take a large amount of computing power to train systems.

Simon Edwardsson, co-founder of Computer Vision startup Aipoly, says that while the AI community is a very open one, there’s still trouble with data sharing. While an increasing number of data sets are being made open for training systems (ranging from autonomous vehicle data to pictures of torsos), they are often small, of poor quality, or released under non-commercial licenses.

One way some companies are getting around that lack of data is through simulation. 3D rendering is now at a point where systems can be trained in virtual environments created with something like the Unity engine, and the results are good enough to be applied in real-world situations. Driverless cars are even being trained using Grand Theft Auto V.
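The reason simulation works is that the training loop never touches the world directly: it only sees observations and rewards, so a rendered engine can stand in for reality. A minimal sketch of the pattern, with a hypothetical five-cell "corridor" simulator in place of Unity or GTA V, and tabular Q-learning as a deliberately simple stand-in for the deep networks real systems use:

```python
import random

class CorridorSim:
    """Toy stand-in for a simulator: the agent starts at cell 0 and must
    reach cell 4. A real pipeline swaps this class for a rendered engine;
    the training loop below is unchanged either way."""
    GOAL = 4

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: 0 = move left, 1 = move right
        self.pos = max(0, min(self.GOAL, self.pos + (1 if action else -1)))
        done = self.pos == self.GOAL
        reward = 1.0 if done else -0.1  # small cost per step, prize at goal
        return self.pos, reward, done

def train(episodes=200, lr=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning against whatever environment CorridorSim provides."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(CorridorSim.GOAL + 1)]  # Q[state][action]
    env = CorridorSim()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly exploit the table, sometimes explore
            a = rng.randrange(2) if rng.random() < eps \
                else max((0, 1), key=lambda x: q[s][x])
            s2, r, done = env.step(a)
            q[s][a] += lr * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train()
# The learned policy: best action in each non-goal cell (1 = right).
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(CorridorSim.GOAL)]
```

Because nothing in `train` depends on where the observations come from, the same loop drives an agent through millions of cheap, labelled simulated frames that would be slow or dangerous to collect in the real world.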

More data in every sense of the word – more data sets, at greater scale and higher quality – made open and available, better training methods, and the relentless march of computing efficiency are all required if we’re ever to reach the kind of AI science fiction has promised for over 50 years.