It was a blatant plug. Mandy Chessell, IBM’s chief data officer, had just delivered a keynote talk on open metadata and was petitioning the audience to join her, or at least the ODPi, saying that “adoption is key to standards.” She has a point. For open metadata to become the de facto standard for the big data industry, it will need volume support, and where better to get it than at a big data conference?
Dataworks 2018, held at the Estrel in Berlin, a large hotel and congress center on the edge of the Neukölln district in the south of the city, should have been fertile ground for Chessell. The room was full of data geeks. How Erich Mielke, the head of the East German Stasi, would have loved to mine their collective knowledge on handling large amounts of data. The irony is not lost, given that the Berlin Wall ran just a stone’s throw from the Estrel and that the last person to be shot dead at the wall, Chris Gueffroy, was killed scaling a fence between Treptow Park and Neukölln.
That’s history, of course, but managing big data remains a constant problem for businesses and organizations, especially as it moves between devices, datacenters and the public cloud. As organizations evolve, their adoption of cloud strategies varies widely. Interestingly, a quick interactive poll of the Dataworks audience revealed that 35 percent had no interest in moving data to the cloud. That was unexpected, but then there is still plenty of uncertainty about security, and about how data should be managed once it is in the cloud.
It’s something the event’s main sponsor, Hortonworks, has clearly tried to solve with its DataPlane Service, a management system that gives visibility of all data regardless of its location. Chessell’s push for open metadata dovetails neatly with this idea, offering a sort of filing cabinet, like the card catalogue in a library, which helps users find what they are looking for quickly. The metadata is a catalogue of all available data, but not all filing cabinets are created equal. That’s Chessell’s point: if the catalogue isn’t open, it just slows everything down.
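To make the analogy concrete, a catalogue entry is just structured data about data: where a dataset lives, who owns it, and how it is classified. The sketch below is a toy illustration of that idea; it is not DataPlane’s or any vendor’s actual schema, and every name in it is made up.

```python
# A toy metadata catalogue: illustrative only, not any product's real schema.
catalogue = {
    "sales.customer_orders": {
        "location": "s3://analytics-lake/sales/customer_orders/",  # hypothetical path
        "owner": "sales-engineering",
        "format": "parquet",
        "tags": ["PII", "GDPR-scope"],
        "description": "One row per customer order, updated nightly.",
    },
}

def find_by_tag(tag):
    """Return the names of all catalogued datasets carrying a given tag."""
    return [name for name, meta in catalogue.items() if tag in meta["tags"]]

print(find_by_tag("PII"))  # -> ['sales.customer_orders']
```

The value of an open standard is that entries like these mean the same thing to every tool that reads them, wherever the underlying data actually sits.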
“Organizations will either be locked into single vendor deployments or spend time creating their own integrations to shift and transform metadata between repositories from different vendors - this is what is happening today and what we are trying to change,” she says when asked what happens if the open standard is not widely taken up. “The effect is increased cost in managing data and less effective use of data, since it is harder to find and control it.”
Unsurprisingly, perhaps, Hortonworks’ VP of marketing John Kreisa agrees. Hortonworks was, after all, a founding member of the ODPi.
“Honestly, I think there will be pressure on businesses to adopt a standard,” he says. “The more the tools are connected to a common metadata layer, the more compliance, which will make GDPR that much easier. If companies are not subscribed to the open idea, it’s still possible but it will be a lot of work and will take longer to achieve.”
It sounds valid, but it’s difficult not to be cynical. Is this a ruse to drive interest in DataPlane? Perhaps that’s unfair, but should businesses still pursue open metadata even if they are not subscribed to open source big data management?
“Yes,” says Chessell, adding that there are a number of choices and that Apache Atlas can be used as a standalone metadata repository; it does not require Apache Hadoop to work. “If companies have already bought data tools and catalogues from multiple vendors, they can use the open metadata libraries to integrate their existing purchases. If they do not have a lot of IT skills, or do not want to invest in the integration themselves, they can first adopt the open metadata standards as their corporate standard for metadata and push their vendors to adopt by making it a requirement on RFIs/RFPs.”
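For a sense of what “standalone” means in practice, the snippet below sketches a lookup against the basic-search endpoint of the Apache Atlas v2 REST API. It assumes a stock Atlas deployment on localhost port 21000 with the demo admin/admin credentials; the host, credentials and the example search term are all assumptions you would replace for a real installation.

```python
import requests

# Assumed local Apache Atlas instance; adjust host, port and
# credentials for your own deployment.
ATLAS_URL = "http://localhost:21000/api/atlas/v2"
AUTH = ("admin", "admin")  # default demo credentials, an assumption

def find_datasets(search_text, type_name="hive_table"):
    """Run an Atlas basic search for catalogued entities matching search_text."""
    resp = requests.get(
        f"{ATLAS_URL}/search/basic",
        params={"query": search_text, "typeName": type_name, "limit": 10},
        auth=AUTH,
    )
    resp.raise_for_status()
    # Each returned entity header carries a GUID, a type and its attributes.
    return [
        (e["guid"], e["typeName"], e["attributes"].get("name"))
        for e in resp.json().get("entities", [])
    ]

if __name__ == "__main__":
    for guid, type_name, name in find_datasets("customer"):
        print(guid, type_name, name)
```

The point is less this particular call than the fact that the catalogue is reachable through an open, documented interface rather than a proprietary one, which is exactly what the integration Chessell describes relies on.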
This would of course go some way toward driving adoption, but Chessell suggests that customers could make it mandatory: if enough organizations push vendors through the purchasing process with open metadata as part of the deal, “the vendors will see value in complying,” she says.
That’s a tough call, but something clearly has to be done to improve the speed and accuracy of data retrieval; after all, the amount of data businesses and organizations are collecting is growing rapidly. As Dataworks guest speaker Bernard Marr reminded us, over the next five years the amount of data in the world is expected to increase from around five or six zettabytes to over 20 zettabytes.
Clearly there needs to be more automation. Machine learning will have to take over, and given that we are creating the building blocks for AI, the data needs to be clean, trusted and quickly accessible. A standard is essential, and an open one makes perfect sense, but as we know from history, what makes sense does not always prevail when it comes to standardization.
Effective cataloguing and management of data increases both the speed of innovation and the deployment of new models. Regulations such as GDPR are also making it essential that organizations can articulate how they ensure personal data is used only for purposes to which the data subject has consented.
“Open metadata aims to lower the bar for organizations to get up and running with a catalogue and governance processes,” adds Chessell. “This will help them find and manage their data more effectively, improving their success rate with ML. The increased control should help them avoid inappropriate use of data/ML and prevent brand damage and loss of public confidence.”
As Chessell points out in her whitepaper, developing an open ecosystem would be an ideal way to ensure governance and data integrity. The Estrel conference center was full of people who would no doubt agree, but they have to do more than agree. It is their duty to get this right. As custodians of data and purveyors of data platforms, these businesses are setting the tone for the future management of data. While they cannot all legislate for ethics, they can at least put in place a framework that adheres to the idea of openness and compliance.