When it comes to improving data quality, one of the largest gaps is that most data quality plans never result in statistically significant improvements to the data values themselves. This is mainly because they pay too little attention to gathering the necessary metadata at the outset: information that, unfortunately, resides only in individuals’ heads. Instead, such incomplete efforts often lead organizations to believe they are improving their data … despite the lack of empirical evidence of any impact on profit and loss.
Why is this the case? The answer is that most of the tools on the market that promise to improve data quality by scoring data profiling results have three essential shortcomings.
First, the scoring provided by these tools is qualitative in nature, or at minimum contains noise. These tools are generally great for getting descriptive statistics, but when it comes to gaining accurate insights based on the business processes that are causing specific problems in the data, the tools can’t do so automatically. Rather, they require the organization to conduct much more involved research.
Second, these tools typically provide scores at the granular level — i.e., profiling results for a given data element in specific columns or tables. Generally, they do not roll up the results to produce broader (and far more useful) analytics. For example, it’s quite common for a business data element (BDE) such as ‘customer name’ to live in multiple columns, across numerous tables, and in several different systems. A typical data quality tool will dutifully provide individual scores representing all 50 physical occurrences of that business data element across the enterprise; but it’s far more useful to get one aggregated score indicating the overall health of a given business data element with a drill-down that provides insight into what is driving the quality trends for the BDE, wherever it lives.
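To make the roll-up idea concrete, here is a minimal sketch of aggregating column-level profiling results into a single score per BDE while keeping the drill-down. Everything in it is an assumption for illustration: the sample systems, tables, columns, and counts are hypothetical, and weighting each column by the number of rows tested is just one reasonable choice, not a prescribed method.

```python
from collections import defaultdict

# Hypothetical column-level profiling results (illustrative values only):
# (BDE, system, table, column, rows_tested, rows_passed)
COLUMN_SCORES = [
    ("customer name", "crm", "accounts", "cust_nm", 1000, 950),
    ("customer name", "billing", "invoices", "customer_name", 4000, 3200),
    ("customer name", "web", "users", "full_name", 5000, 4900),
    ("postal code", "crm", "accounts", "zip", 1000, 990),
]


def roll_up(column_scores):
    """Aggregate column-level pass rates into one row-weighted score per BDE."""
    tested = defaultdict(int)
    passed = defaultdict(int)
    detail = defaultdict(list)
    for bde, system, table, column, n_tested, n_passed in column_scores:
        tested[bde] += n_tested
        passed[bde] += n_passed
        # Guard against columns with no rows tested.
        rate = n_passed / n_tested if n_tested else 0.0
        detail[bde].append((f"{system}.{table}.{column}", rate))

    summary = {}
    for bde in tested:
        summary[bde] = {
            # Overall health of the BDE, weighted by rows tested.
            "score": passed[bde] / tested[bde] if tested[bde] else 0.0,
            # Drill-down sorted worst-first, so the main drivers come first.
            "drill_down": sorted(detail[bde], key=lambda item: item[1]),
        }
    return summary


BDE_SUMMARY = roll_up(COLUMN_SCORES)
```

With this sample data, the three physical occurrences of ‘customer name’ collapse into one score, and the first drill-down entry points at the column that is pulling that score down.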
Third, if your goal is a data quality plan that includes a single integrated score for each given BDE (as described above), then you should really be using statistical control methods that provide detailed insights into the quality of BDEs and help to identify anomalies that need to be addressed. These methods use standard mathematics and statistical techniques to measure actual metric results versus the limits or expectations established by your organization’s data community for a given BDE. Unfortunately, none of the currently available data quality tools offers this level of sophistication out of the box.
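As a rough illustration of measuring results against limits, the sketch below derives three-sigma limits from a baseline of BDE scores and flags later scores that fall outside them. This is a simplified stand-in for formal statistical process control, not a specific method taken from any tool. The baseline values and function names are hypothetical, and an organization could just as well substitute limits agreed on by its data community for the computed ones.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits as mean +/- sigmas * sample std deviation."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(observations, lower, upper):
    """Return (index, value) pairs for observations outside the limits."""
    return [(i, v) for i, v in enumerate(observations) if v < lower or v > upper]


# Hypothetical history of one BDE's score (illustrative values only).
BASELINE = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.95]
LOWER, UPPER = control_limits(BASELINE)

# Hypothetical new measurements to check against the limits.
NEW_SCORES = [0.95, 0.96, 0.88, 0.94]
FLAGGED = out_of_control(NEW_SCORES, LOWER, UPPER)
```

In this example only the third new measurement falls outside the limits, which marks it as an anomaly worth investigating rather than ordinary variation.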
Coordinating data efforts
Another key to ensuring that your data quality plan produces results is establishing an Office of Data (led by a Chief Data Officer, or CDO), representing a new type of function within your organization. Because data is such a valuable commodity, your organization needs an Office of Data to serve an essential, strategic role as the “broker” of data between its creators and its users. The Office should be tasked with data governance, metadata management, data quality, and data architecture, responsibilities that today might reside across multiple functions within your organization.
On a more nuts-and-bolts level, your CDO and Office of Data can oversee and coordinate your data quality plan with two closely related efforts: data governance and the acquisition of metadata. These activities are interlocked and feed into each other in three major ways.
First, your data governance sets and manages the scope and priority of work for both your metadata and data quality efforts. Data governance and metadata teams should collaborate to ensure the BDEs in your data governance initiative are scoped, understood, and have undergone thorough lineage discovery (that is, the lifecycle of the data from its source to its end users, and all the significant points of transformation along the way). Then your metadata team can use a metadata repository to provide a visualization of the associated lineage for each of the BDEs in scope. This in turn tells the data quality team exactly where the data lives (i.e., in which specific physical columns/tables).
Second, once your data quality team has completed data profiling and assessments, they can locate the patterns that appear to be real anomalies in the data. Data governance then facilitates a collaborative, face-to-face session with members of both the business and technical communities within the organization, who can provide critical insight into these anomalies. These insights are collected and converted into requirements, which are then tested on the data to derive specific results (that is, distinct quality metrics for each test, column, table, data asset, and BDE). Last but not least, the data governance team works with the data community to obtain sign-off on both the rules used to build the tests and the test results.
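One way to picture how requirements become tests is sketched below: each requirement is expressed as a named rule on a column, the rules are run against the data, and each test yields its own metric. The sample rows, rule names, and checks are all hypothetical; real rules would come out of the sessions with the business and technical communities.

```python
import re

# Hypothetical sample rows (illustrative values only).
ROWS = [
    {"customer_name": "Ada Lovelace", "postal_code": "90210"},
    {"customer_name": "", "postal_code": "1234"},
    {"customer_name": "Grace Hopper", "postal_code": "10001"},
    {"customer_name": "N/A", "postal_code": "ABCDE"},
]

# Hypothetical rules: (rule name, column, check that returns True on pass).
RULES = [
    ("name_not_blank", "customer_name", lambda v: bool(v.strip())),
    ("name_not_placeholder", "customer_name",
     lambda v: v.strip().upper() not in {"N/A", "UNKNOWN", "TBD"}),
    ("postal_code_5_digits", "postal_code",
     lambda v: re.fullmatch(r"\d{5}", v) is not None),
]


def run_tests(rows, rules):
    """Apply every rule to every row and report one metric set per test."""
    results = {}
    for rule_name, column, check in rules:
        passed = sum(1 for row in rows if check(row[column]))
        results[rule_name] = {
            "column": column,
            "tested": len(rows),
            "passed": passed,
            "pass_rate": passed / len(rows) if rows else 0.0,
        }
    return results


RESULTS = run_tests(ROWS, RULES)
```

Because each result records its rule and column, both the rules and their outcomes can be put in front of the data community for sign-off, and the per-test metrics can later be rolled up by column, table, data asset, or BDE.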
The third way your data quality, data governance, and metadata activities interconnect is that after the previous two steps are complete, your teams can get their first quantitative look at what is wrong with a given set of data values. Your data governance team must then facilitate the creation of data standards to prevent or limit any anomalies that are considered serious, and set up tests to monitor success. These standards must be published enterprise-wide and, more importantly, enforced throughout the organization’s technology development lifecycles to ensure that the data doesn’t get worse over time.
The ideal data quality plan includes strategies and processes for improving the quality of data, not just tools for measuring quality. It also requires the collection and analysis of data statistics that are more meaningful than simple pass/fail metrics. Last but not least, your data quality plan must carefully identify and coordinate the related disciplines that all play a part in improving the quality of data elements: data quality, metadata, and data governance.
The relationship between an organization and its data is, for most individuals in most organizations, complex, obscure, and seemingly of little relevance. But in fact, the effective use of high-quality data, driven by a focused and resourced Office of Data, can be a major driver of profitability for any organization. It’s no exaggeration to say that data is as ubiquitous as cash and, properly managed, can be just as important to the bottom line.