Data lineage shows where a given data element originates in the firm’s information systems, along with every stopping point, transformation, and consumption point it passes through downstream. Not understanding the lineage and purpose of data elements is, as management thinker Tom Peters would say, a strategic mistake of the ‘first order.’
Even so, we often see companies decide to leapfrog the data lineage part of the process so they can get to work fixing those nagging reporting problems, or deriving new data insights for the business. In doing so, they may assume they only need to implement an MDM (master data management) initiative, create new data warehouses, or implement Hadoop.
Such relatively quick fixes may give the organization a sense that “things are getting done,” but they never answer the fundamental questions: what does our data mean, and where does it come from? In fact, by skipping over the more complicated aspects of data lineage, these companies waste money, jeopardize their reputation and credibility, and make the state of their data worse in the long run.
The valuable insights of data lineage
Data lineage may seem like an obscure, esoteric concept. But actually, it is critical to your organization’s ability to succeed. To understand why, consider this simple analogy.
Suppose you’re the mayor of a town on a river, from which your town gets all its drinking water. Should you and your public safety team be satisfied with having only that insight? Wouldn’t you feel safer knowing who’s using the water upstream — what other communities and companies have used it, how they treat the water before putting it back into the river, and so on? Of course you can take measurements of water quality as you pump it out of the river … but without knowing the water’s lineage, you can’t really know where or when contaminants are getting into the water, or how to prevent them from coming downstream in the first place.
Data is similar as it flows through an organization. It has sources, consumers, and points of transformation (which could include both contamination and purification). That’s what data lineage is all about: understanding and mapping the data’s flow so that you have a better understanding of how to safeguard its quality and value.
After all, data quality is absolutely critical to your success — it’s what you base your reporting on; it’s how you manage operations; and more than likely, it’s how you maintain profitability. Without an assured sense of your data lineage, when discrepancies arise (and they always do), you’ll have no clue where to start looking for the problem, or how to solve it.
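To make the idea of mapping data flow concrete, here is a minimal sketch that treats lineage as a directed graph: each node is a data asset or transformation step, and each edge records that data flows from one to the other. The class name `LineageGraph`, its methods, and the asset names (`crm.customers`, `report.quarterly_revenue`, and so on) are hypothetical, chosen only for illustration; a real lineage map would be far larger and would carry metadata gathered from the people who know the systems.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical, minimal lineage map: nodes are data assets, edges are flows."""

    def __init__(self):
        self._upstream = defaultdict(set)    # asset -> assets it reads from
        self._downstream = defaultdict(set)  # asset -> assets that read from it

    def add_flow(self, source, target):
        """Record that data flows from `source` into `target`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first walk returning every asset reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream_of(self, asset):
        """Everything that feeds `asset`, directly or indirectly."""
        return self._walk(asset, self._upstream)

    def downstream_of(self, asset):
        """Everything that consumes `asset`, directly or indirectly."""
        return self._walk(asset, self._downstream)


# Illustrative flows only; these asset names are assumptions, not real systems.
graph = LineageGraph()
graph.add_flow("crm.customers", "etl.clean_customers")
graph.add_flow("billing.invoices", "etl.join_revenue")
graph.add_flow("etl.clean_customers", "etl.join_revenue")
graph.add_flow("etl.join_revenue", "report.quarterly_revenue")
graph.add_flow("etl.clean_customers", "report.customer_churn")

# A discrepancy in the revenue report: which upstream assets could be at fault?
suspects = graph.upstream_of("report.quarterly_revenue")

# A change to the CRM source: which downstream consumers might be affected?
impacted = graph.downstream_of("crm.customers")
```

With a map like this, a discrepancy in a report narrows to a known set of upstream suspects, and a change at a source comes with a list of the consumers it may affect. The graph is only as good as the flows recorded in it, which is why capturing them accurately matters.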
To ensure that your organization develops and maintains an effective understanding of its data lineage, it’s critical to establish an Office of Data, led by a Chief Data Officer, or CDO. This office represents a new type of function within your organization, and serves an essential, strategic role as the “broker” of data between its creators and its users.
The Office should be tasked with data governance, metadata management, data quality, and data architecture, responsibilities that today might reside across multiple functions within your organization.
Look out for blind spots
Data lineage is also important because, depending on the original source and provenance of the data — that is, how it was created, when the repositories were initially set up, who uses the data, how it is transformed, etc. — you may not have complete visibility into the metadata surrounding the data, even if you use the latest tools.
Indeed, some data analysis tools claim to help organizations determine the lineage of data. But these tools have limitations. For example, relational data is easily understood by modeling and lineage tools, provided the relationships are maintained appropriately. But for newer, unstructured data assets, lineage tools aren’t effective, because the relationships between data storage containers and objects aren’t always known or defined.
In our experience, the best way to gather the metadata needed to establish data lineage is to conduct collaborative, face-to-face workshops with the individuals in an organization who are best positioned to shed light on the source and use of data. Of course, some data architects and consultants claim that all one needs to do is buy a metadata or data lineage tool (conveniently, one they’re selling). But with most data that resides in legacy systems, as well as data created in certain newer technologies, the only way to determine lineage is to extract the information from the heads of humans within the organization — and there’s not a tool on the planet that can do that.
Establishing a clear understanding of data lineage belongs at the top of your list when it comes to improving data quality. It guides every decision you make, from how to create your business data element inventory, to identifying and managing data stewards, to reducing the costs of your data management technology and processes.