Data catalogues offer boost in the machine learning race

Why making corporate data more accessible and understandable to data scientists ultimately boosts the productivity of machine learning.

Spending on machine learning and artificial intelligence is growing by 37 percent annually and is set to reach $77.6 billion in 2022, according to IDC. But organisations keen to exploit this latest technological trend hit the same problem: data. Only 18 percent say their companies have a clear strategy in place for sourcing the data that enable AI work, McKinsey has found.

To crack the problem, a new class of enterprise data tools is emerging. Data catalogue systems are designed to help organisations manage a common weakness in the data science process, says Gartner research analyst Sanjeev Mohan.

"What happens is a data scientist finds out that their organisation has a data lake. They get excited, jump in and think, ‘Wow, I have access to all the data I need, this is perfect’. But when they get ready to train an algorithm, they don't like the data lake: it's too slow. So, they make a copy of the data onto their laptop. They run until they are happy with the model. Then they go to IT and say, ‘Here are the models, can you operationalise them?’"

Problems getting models into production

But typically, IT teams are not keen to operationalise models built in this way — with good reason, Mohan says. Firstly, the data has been outside the firewall. Then, the data scientists do not understand the provenance of the data: how it was collected, its quality, and what biases it may contain. As a result, there is no way to audit operational models once they are in production, he says.

"Data catalogues are trying to solve all these problems by offering curation of data from one centralised place," he says.

One data catalogue is built by Alation, a start-up based in Redwood City, California. Founded six years ago, its products are used by 100 enterprises around the world, among them eBay, LinkedIn and the French bank Société Générale. The bank's data manager, Julie Lerose, says interest in a solution started two years ago with a discussion between the marketing, finance and risk teams.
