This is a contributed article by Szymon Klarman, Knowledge Architect at BlackSwan Technologies.
Over the last 20 years, CIOs, CDOs and analysts making decisions about data stored on their company’s infrastructure have fostered a culture emphasising centralisation. The thought process has been that data is most controlled and useful when processed under the auspices of the IT team.
That logic was hard to argue with throughout this period. It spurred the first wave of data integration, tackled through data warehousing. In the 1990s, this was an effective way of connecting a number of business intelligence platforms in a single solution, by pre-processing and storing data in a fixed, structured form fit for predefined use cases. However, as time passed, it became clear that such an architecture required extensive customisation and data maintenance, and that it struggled to scale. In addition, it was difficult for different departments to take ownership of data projects, which reduced the overall impact data had on the business.
In an attempt to solve this issue, a new type of architecture - a data lake - took over. Data lakes allowed enterprises to store all of their structured and unstructured data at any scale in a central repository.
Like data warehousing before them, data lakes advanced the way data was handled, bringing real-time streaming capabilities and the ability to cope with both structured and unstructured data. Data lakes could also handle the consumption, storage and output of data better than data warehouses, and their use of metadata made them far more flexible.
However, despite these advances, limitations remain: while the data teams within enterprises can do more with the data they have, they are also inundated with ad-hoc queries from different departments across the business, each with very different requirements.
Because of the monolithic nature of the data platform architecture, data teams are not best placed to truly understand the data they administer. Why? Because these centralised platforms, built on a data lake architecture, host and own data belonging to many different domains within an organisation. Both the data lake and the data warehouse approach are about physically moving data to one place where the storage and computing power exist to manage it all, but this is not the best way for enterprises to make the most of the data they have.
Take financial services organisations as an example. Data ingested into a bank’s data platform includes compliance information such as Know Your Customer (KYC), account information such as product and service history, credit ratings, income, debts and engagement with the bank. In addition, the platform ingests operational data about a bank’s infrastructure performance and external data such as adverse news and social media sentiment. Different departments or domains need to use one or more of these datasets.
In a data lake, the data team is tasked with preparing analytical pipelines for various domains, to make the data available to numerous users within those domain teams. The onus on the data team to ingest the data, then cleanse, enrich and transform it into a usable form that addresses the needs of a diverse set of consumers, is too great.
How are they meant to understand the peculiarities of the data within every single domain? Often the data comes from several domains, and the data team needs to learn both the specific use case and the domain requirements. They must match many input domains (sources) to the required output of the target domain, which calls for extensive analysis.
Their difficulty is compounded by the fact that they are organisationally siloed. Because of this, the business teams supplying the data are far less likely to know how to make it insightful or accurate, as they do not have data specialists of their own. It is no wonder that Gartner has suggested that business and IT leaders overestimate the effectiveness and usefulness of data lakes in their data and analytics strategies.
To help businesses use data more effectively, it makes more sense to let enterprises nurture data in the domains where it originates, using precise semantics. In addition, data should be accessible where it originally resides, so that it is up to date and readily available for users from other domains to discover and utilise.
Doing this - and achieving another breakthrough in data accessibility, incorporating further context and enabling unprecedented data monetisation capabilities - requires a change in both the organisation and the architecture of data.
Key to this change is the concept of the data fabric - a design concept which requires multiple data management technologies to work together with the aim of supporting "frictionless access and sharing of data in a distributed network environment", according to Gartner. The data fabric achieves this by means of a unified data management framework, which combines data integration, data virtualisation and data management technologies to create a semantic layer that supports many business processes, such as accelerating data preparation.
As the data fabric becomes more dynamic, it transforms into what is called a ‘data mesh’. This is an evolving, distributed data architecture, which is focused on metadata and supported by machine learning capabilities to enable data discovery and categorisation, as well as to optimise system performance. The data mesh vision is one based on data virtualisation, where data resides at source, across different business units, and can be consumed on a self-serving basis by users across the enterprise, thanks to flexible and intelligent data infrastructure.
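To make the self-serve, virtualised consumption pattern more concrete, here is a minimal Python sketch. Everything in it - the DataProduct interface, the domain stand-ins and the customer_360 function - is hypothetical, intended only to illustrate composing a cross-domain view at query time while each domain continues to own and host its data at source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical "data product" interface: each domain team exposes its own
# data in place, behind a query function it owns and maintains.
@dataclass
class DataProduct:
    domain: str
    query: Callable[[dict], List[dict]]  # filters in, records out

# Illustrative stand-ins for domain-owned services (no real endpoints).
kyc_product = DataProduct(
    domain="compliance",
    query=lambda f: [{"customer_id": f["customer_id"], "kyc_status": "verified"}],
)
risk_product = DataProduct(
    domain="risk",
    query=lambda f: [{"customer_id": f["customer_id"], "credit_rating": "BBB"}],
)

def customer_360(customer_id: str, products: Dict[str, DataProduct]) -> dict:
    """Compose a cross-domain view at read time; nothing is copied or re-hosted."""
    view: dict = {"customer_id": customer_id}
    for product in products.values():
        for record in product.query({"customer_id": customer_id}):
            view.update(record)
    return view

print(customer_360("c-1001", {"kyc": kyc_product, "risk": risk_product}))
```

In this pattern, the domain teams remain responsible for the quality and semantics of what they serve, while consumers compose views on demand rather than waiting for a central team to build a pipeline.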
The data fabric design concept and the data mesh architecture are essential to overcoming the technical and organisational struggles that enterprises have had in using domain data effectively.
However, there are numerous approaches to data mesh architecture, and success depends on the strength of the approach taken. One method is to ensure that businesses - even those with a centralised data environment - can capture data in a decentralised way, keeping domain data intact within its natural habitat.
With this particular data mesh approach, the work of understanding the specifics of each domain, and of matching data sources to the required output of the target domain, is largely automated through the use of a knowledge graph and a semantic metadata layer.
This layer describes the meaning of the data, which helps enterprises with findability and discoverability - both for humans, through easy-to-use data catalogues, and for machines, which can determine which data to pull and when. By using shared metadata models, all the parts can be composed in an automated fashion, providing interoperability. The layer and the resources it describes give the organisation knowledge that would otherwise be impossible to obtain using traditional data platforms and architectures.
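As a rough illustration of how such a semantic metadata layer might work, the sketch below uses the open-source rdflib library and DCAT-style vocabulary to describe two domain datasets and then discover one by its meaning rather than its location. The namespaces, dataset names, endpoints and the ex:servedAt property are invented for the example, and the modelling is deliberately simplified rather than a strict DCAT profile.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/bank/")  # invented namespace for the example

g = Graph()

# Each domain team publishes a small metadata record describing its dataset.
# The data itself stays where it is; only these descriptions are shared.
kyc = EX["kyc-profiles"]
g.add((kyc, RDF.type, DCAT.Dataset))
g.add((kyc, DCT.title, Literal("KYC customer profiles")))
g.add((kyc, DCT.subject, EX.Compliance))
g.add((kyc, EX.servedAt, URIRef("https://compliance.bank.example/api")))  # illustrative property

ratings = EX["credit-ratings"]
g.add((ratings, RDF.type, DCAT.Dataset))
g.add((ratings, DCT.title, Literal("Customer credit ratings")))
g.add((ratings, DCT.subject, EX.Risk))
g.add((ratings, EX.servedAt, URIRef("https://risk.bank.example/api")))

# A consumer (human or machine) asks for data by meaning, not by location:
# "which dataset covers risk, and where do I reach it?"
query = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX ex:   <http://example.org/bank/>

SELECT ?dataset ?title ?endpoint WHERE {
    ?dataset a dcat:Dataset ;
             dct:subject ex:Risk ;
             dct:title ?title ;
             ex:servedAt ?endpoint .
}
"""
for row in g.query(query):
    print(row.title, "->", row.endpoint)
```

Because the descriptions follow a shared model, the same query works regardless of which domain published the dataset, which is what makes automated composition and interoperability possible.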
Data lakes and data warehouses are not obsolete, but they are now nodes in a new generation of data infrastructure.
Szymon Klarman is a Knowledge Architect at BlackSwan Technologies. He holds a PhD in knowledge representation and reasoning and has over 10 years of experience working in the field as an R&D specialist, consultant, and academic researcher.