How to avoid common pitfalls in your data integrity journey

If data represents a new form of doing business, then success in that new climate relies on equipping users with the tools needed to meet the challenges of a changing world. Dave Langton, vice president of product at Matillion, looks at four common pitfalls for data teams as they embark on data integrity initiatives, and advises on how best to establish a comprehensive plan to keep data clean.


This is a contributed article by Dave Langton, VP Product, Matillion.

As the volume of data generated each day increases exponentially, so too does its importance to modern enterprises. Used correctly, data is arguably the world’s most valuable commodity. However, its sheer quantity means that the risk of incomplete, inconsistent datasets – and their subsequent impact on business profitability – is never far away. The DAMA Data Management Body of Knowledge estimates that organisations spend somewhere between 10% and 30% of revenue on resolving data quality issues.

Modern data teams, now recognising the importance of data integrity, are increasingly focusing their efforts on preserving it as they work to prepare data for analysis. If you’re not familiar with the term, ‘data integrity’ encompasses the accuracy, completeness, consistency, and compliance of data within systems. It is both an aspirational state that data teams aim to achieve and a shorthand for the processes used to achieve it. The definition comprises several aspects of data: its physical integrity (how safely it is stored), its logical integrity (its accuracy, completeness and correctness), and compliance (whether it meets necessary standards, such as GDPR). Many modern distributed data systems have actually relaxed built-in support for checking logical integrity in the interests of maximising performance, leaving teams to explore other strategies for ensuring correctness.
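To make logical integrity concrete, it can help to think of it as a set of rules applied to each record: required fields must be present, and values must match an expected format and range. The sketch below is a minimal, hypothetical illustration of that idea – the field names (`customer_id`, `email`, `order_total`) and rules are examples, not part of any particular product or standard:

```python
# Minimal sketch of logical integrity checks: completeness and validity
# rules applied to individual records. Field names are hypothetical.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_record(record: dict) -> list:
    """Return a list of integrity violations for one record."""
    problems = []
    # Completeness: required fields must be present and non-empty.
    for field in ("customer_id", "email", "order_total"):
        if not record.get(field):
            problems.append("missing field: " + field)
    # Validity: values must match the expected format and range.
    if record.get("email") and not EMAIL_RE.match(record["email"]):
        problems.append("invalid email format")
    total = record.get("order_total")
    if isinstance(total, (int, float)) and total < 0:
        problems.append("negative order_total")
    return problems

record = {"customer_id": "C042", "email": "not-an-email", "order_total": -5.0}
print(check_record(record))  # → ['invalid email format', 'negative order_total']
```

In practice these rules live in the data pipeline itself, so every record is checked on the way in rather than audited after the fact.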

Achieving data integrity is ultimately a way to ensure better performance, reliability, and access for an organisation. As teams embark on data integrity initiatives, there are four key risks they should be aware of:

  • Assessing accountability - Without uniform standards for inputting and working with data, inconsistencies can creep in throughout the data system. Accountability is key to any organisation’s success, and is especially important when it comes to managing data. Without it, there will likely be uncertainty about who is ultimately responsible for the integrity of your data.
  • Out-of-date and overlapping systems - Consistency is another tenet of data integrity, most often compromised by overlapping and outdated systems. Are important details stored in a standardised format across the database? Are different groups within your organisation working with the same datasets? Inconsistent data inhibits quality by creating duplicate records, data that is invalid for certain criteria, or data that is inaccessible at a given time.
  • Incorrect or omitted data - Taking on more data can increase the difficulty of spotting incomplete or inaccurate records. Unifying data that was captured from multiple disparate systems at different points in time can also leave blind spots or inaccuracies that become buried deeper and deeper in the growing data pool. Integrity requires data that is not only correct today but able to withstand the demands placed on it further down the line.
  • Losing track of data - The complications of tracking those mistakes down and resolving them weeks, months, or years later can be even more costly than the original errors. Without reliable audit trails, there is uncertainty about who made changes and when. Some organisations establish audit trails but never review them, rendering them equally ineffective.
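Several of these pitfalls can be caught mechanically. As a hedged sketch (the business key and records below are hypothetical, not a specific tool’s API), duplicate or conflicting records from overlapping systems can be surfaced simply by grouping on a shared key and flagging keys that appear more than once:

```python
# Sketch: flagging duplicate or conflicting records that arise when
# overlapping systems feed the same data pool. Key and records are
# hypothetical illustrations.
from collections import defaultdict

def find_conflicts(records, key="customer_id"):
    """Group records by a business key; a key mapped to more than one
    record indicates duplicated or conflicting entries across sources."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

crm = {"customer_id": "C001", "email": "a@example.com"}
billing = {"customer_id": "C001", "email": "alice@example.com"}  # same customer, different email
other = {"customer_id": "C002", "email": "b@example.com"}

conflicts = find_conflicts([crm, billing, other])
print(sorted(conflicts))  # → ['C001']
```

Which copy wins once a conflict is found is a governance decision, not a technical one – which is exactly why the accountability pitfall above matters.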

Once teams are aware of the areas to watch when it comes to maintaining data integrity, implementing a plan to achieve and maintain it is critical. Because data touches every aspect of the organisation — and data teams are under pressure to manage and deliver it properly — establishing a comprehensive plan to keep data clean is essential. There are four key pillars to a data integrity plan that modern data teams should adopt:

  • Invest in integration - Data integration is a long-term investment: the time and resources required now pale in comparison to the money and manpower an organisation can save as datasets grow. Solutions such as data preparation and ETL applications can improve consistency by not only organising data, but cleansing it in the process to help remove inconsistencies. ETL is a critical step as data volumes increase and data types vary more widely.
  • Appoint and train a ‘data steward’ - Give employees a place to turn by appointing a ‘data steward’ to oversee a specific set of data – or the organisation’s data system as a whole. In addition, regular training sessions with employees can minimise errors at the point of entry, and establish a system of accountability and a clear framework for managing data. As data teams grow, a data catalogue can help democratise data usage further by building trust in the datasets that matter.
  • Audit and validate - Stewards can also monitor audit trails and take quick corrective action. Audit trails reveal what changes have been made, and by whom, tracking alterations down to the date they were made. Inaccurate or incomplete data is not only identified, but tracked to its source. Through this process, stewards can also confidently validate the data being relied upon to guide the organisation’s future.
  • Test and iterate - Audit trails are far less effective when they are not reviewed regularly. Avoid guessing at data accuracy by creating a regular testing system that augments a strong validation process. This helps ensure, for example, that data hasn’t been entered into conflicting field types for weeks or months before being discovered. Just like going to the doctor, finding the problem early is often the best way to tackle it.
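The ‘audit and validate’ and ‘test and iterate’ pillars can reinforce each other in code. The minimal sketch below (the record, fields, and job name are hypothetical) logs who changed what and when, and then runs a routine review that flags the field-type drift mentioned above – so checking the trail is a scheduled habit rather than a forensic exercise:

```python
# Sketch: a minimal audit trail plus a recurring review pass.
# Record shape and field names are hypothetical illustrations.
from datetime import datetime, timezone

audit_log = []

def update_field(record, field, value, changed_by):
    """Apply a change and append an audit entry: who, what, when."""
    audit_log.append({
        "field": field,
        "old": record.get(field),
        "new": value,
        "by": changed_by,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    record[field] = value

def review_audit_trail(log):
    """A regular review: surface entries where a value's type changed,
    the kind of drift that can otherwise go unnoticed for months."""
    return [e for e in log
            if e["old"] is not None and type(e["old"]) is not type(e["new"])]

record = {"order_total": 19.99}
update_field(record, "order_total", "19.99", changed_by="import-job")  # type drift
suspicious = review_audit_trail(audit_log)
print(len(suspicious))  # → 1
```

A real deployment would persist the trail (in a database, not a list) and schedule the review, but the principle is the same: record changes at write time, review them on a cadence.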

Building for the future means identifying and addressing issues before they become major problems. The volume and complexity of data will invariably grow. But if data represents a new form of doing business, then success in that new climate relies on equipping users with the tools needed to meet the challenges of a changing world – and on giving your organisation a place in it.

Dave Langton is the vice president of product at Matillion, a leading data integration platform. He is a seasoned software professional with over 20 years of experience creating award-winning technology and products. Prior to his role at Matillion, he worked as a data warehouse manager and contractor in the financial industry.