In part I of this two-part post, we outlined the promise of the Data Exchange and discussed some of the capabilities, such as data governance and data quality, that are required to make it happen in our era of mixed data storage systems. We left off by underlining the value of metadata management and its function in drastically reducing ‘dark data’ that cannot easily be searched for. This post outlines the remaining capabilities needed to create a fully functional Data Exchange.
Physically Distributed Data & Metadata Management
We believe that the best path forward for any business that has a mix of legacy and more current data storage systems is not to dump all the data into a single big data solution, but to adapt to the world of physically distributed data. The image below illustrates a typical configuration with data coming from multiple types of systems including RDBMS and flat files as well as NoSQL and internet-friendly file formats. The data flows from left to right and into appropriate intermediary systems depending on file type and consumption requirements. This type of configuration is very flexible and can be reconfigured to accommodate future data storage systems of yet-to-be-determined technical specifications.
A distributed data ecosystem includes multiple file types and systems.
Metadata management is the key to making this ecosystem work. Data must be enriched with physical, logical and conceptual metadata in order to facilitate the kind of fast discovery, search and processing on demand that the Data Exchange promises. As mentioned in part I, physical and logical metadata can be collected from machines, while most conceptual metadata must be acquired from the people who work with data. To fully leverage this metadata, it must be stored in the right metadata repository.
Metadata Repository: This repository will house the three types of metadata and make them accessible and useful. It forms the invisible backbone of the system, housing the metadata that governs the entire ecosystem. It also provides easily accessible metadata to the Business Glossary 2.0 (see below), which allows business users to discover and research data and datasets through the shopping cart.
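To make the three metadata types concrete, here is a minimal sketch of what a repository record might look like. The field names, the `MetadataRepository` class, and the example dataset are all illustrative assumptions, not a reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Physical metadata: where and how the data is stored (machine-collected)
    location: str
    file_format: str
    size_bytes: int
    # Logical metadata: schema-level description (machine-collected)
    columns: dict                # column name -> data type
    # Conceptual metadata: business context, usually supplied by people
    business_terms: list = field(default_factory=list)
    owner: str = ""
    description: str = ""

class MetadataRepository:
    """Toy in-memory repository keyed by dataset name."""
    def __init__(self):
        self._records = {}

    def register(self, name, metadata):
        self._records[name] = metadata

    def find_by_term(self, term):
        # Business Glossary lookups hit the conceptual metadata
        return [name for name, md in self._records.items()
                if term.lower() in (t.lower() for t in md.business_terms)]

repo = MetadataRepository()
repo.register("q3_sales", DatasetMetadata(
    location="hdfs://warehouse/sales/q3.parquet",   # hypothetical path
    file_format="parquet",
    size_bytes=52_428_800,
    columns={"order_id": "int", "amount": "decimal", "region": "string"},
    business_terms=["Sales", "Revenue"],
    owner="sales-ops",
    description="Quarterly sales orders, deduplicated",
))
print(repo.find_by_term("revenue"))  # -> ['q3_sales']
```

The point of the sketch is the split: the first two groups of fields can be harvested automatically, while `business_terms`, `owner` and `description` are the human-supplied conceptual layer that the Business Glossary 2.0 draws on.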
BG2.0: In our proposed ecosystem, the Business Glossary 2.0 becomes the central node of data management. While many organizations have a Business Glossary, it tends to be a tacked-on collection of definitions that few are happy with and even fewer rely upon. We envision the Business Glossary as the foundation of Data Governance Operations: nothing can be changed or touched without first clearing it in the Business Glossary. Another key difference is that the BG2.0 is written from the perspective of the business. It is organized according to business needs and includes the conceptual metadata that shows how data lives in the business and in business processes.
Data Lake vs. Data Swamp: With the exponential growth of data, the ‘containers’ for data have also grown. For example, Hadoop clusters or MongoDB promise to hold all of that big data. These solutions are especially appealing to organizations that are looking to offload the costs of maintaining several or even dozens of legacy systems. Though they can be very useful, they are not, by themselves, the magic bullet that will solve data issues. The problem is that data lakes can quickly become data swamps if the data held within is not systematically processed. We might contrast the data lake with the metaphor of the data reservoir which, like a water reservoir, has the end-goal of providing contents ‘fit for consumption’. The alternative to the reservoir is a virtually infinite collection of data that lacks a unified view on what is there, why it is there, who can or should use it and why it might be useful.
Zones: With comprehensive metadata in place, you can securely manage your data by creating zones. Each zone should have its own limits to access and its own monitoring mechanisms. For example, you might have six zones of which the first is a ‘sandbox’ open to everyone in the organization, as the data within represents little in the way of information risk. Zone six would carry the highest level of security: its data would be accessible to the fewest people and protected through close monitoring for volatility.
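A zone-based access check can be as simple as a lookup table mapping each zone to the groups allowed inside it. The zone numbers follow the six-zone example above; the group names are hypothetical:

```python
# Zone 1 is the open sandbox; zone 6 is the most restricted.
# None means the zone is open to everyone in the organization.
ZONE_ACCESS = {
    1: None,
    2: {"analysts", "engineering", "finance"},
    6: {"privacy-officers"},
}

def can_access(user_groups, zone):
    """Return True if any of the user's groups is cleared for the zone."""
    allowed = ZONE_ACCESS.get(zone)
    if allowed is None:          # open sandbox
        return True
    return bool(set(user_groups) & allowed)

print(can_access(["analysts"], 2))         # -> True
print(can_access(["marketing"], 6))        # -> False
```

In practice each zone would also carry its own monitoring hooks, but the access decision itself reduces to this kind of metadata-driven lookup.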
While our stakeholder survey and our metadata would enable an initial automated classification system, we need to acknowledge that data is always changing. Unlike a bureaucratic municipal zoning committee, a firm would have to implement a fast and efficient classification system that relies on a real-time social mechanism. Consider Wikipedia. While the information is crowd-sourced, it is also subject to verification for attribution in order to maintain accuracy and consistency. This solution accommodates special occasions when an emerging ‘hot’ or controversial topic is taken out of circulation until information can be verified. A firm could implement automated ‘circuit breakers’ that can be tripped by unusual activity, and/or, for tightly controlled data, the organization can assign subject matter experts to vet any changes to sensitive or other high-priority information.
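One way such a circuit breaker could work is a sliding-window activity counter: if accesses to a dataset spike past a threshold, the dataset is taken out of circulation until it is reviewed. The thresholds and the class below are an assumed sketch, not a prescribed design:

```python
from collections import deque

class CircuitBreaker:
    """Takes a dataset out of circulation when activity spikes.

    Trips when more than `max_events` accesses occur within a
    `window_seconds`-long sliding window; resetting is assumed to
    require subject-matter-expert review.
    """
    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window = window_seconds
        self.events = deque()
        self.tripped = False

    def record(self, timestamp):
        """Record one access at `timestamp` (seconds); return trip state."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        if len(self.events) > self.max_events:
            self.tripped = True
        return self.tripped

cb = CircuitBreaker(max_events=3, window_seconds=60)
for t in (0, 5, 10):
    cb.record(t)
print(cb.record(12))  # fourth access in the window -> True (tripped)
```

The same pattern extends naturally to the vetting workflow: a tripped breaker simply routes the dataset into the subject-matter-expert review queue described above.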
Finally, an organization can provide incentives for data scientists or other data experts to recommend new zoning classifications. This would be particularly useful when a data scientist creates a new data set by mashing together existing internal and/or external data sets. This exercise might require the elevation of zoning since it is often the combination of BDEs that poses the most serious privacy risks. For example, a last name alone does not entail much risk, but a last name, along with age and zip code, could constitute a ‘weaponized’ combination of Personally Identifiable Information that could be leveraged to conduct hacking activities.
Publishing Up: This is the capability of giving data scientists or other high-level data analysts the authority to re-zone data. There needs to be a mechanism to facilitate the publication of data analysis work, whether in the form of improved or refined datasets, in-progress statistical models, draft reports, or other insights that save time and can help to avoid ‘reinvention of the wheel’. A functional Data Exchange would enable and encourage this type of sharing.
The Data Shopping Experience
Building the portal and the interface for the Data Exchange is fairly straightforward once the data has been processed. A firm could simply slap a nice interface with search capabilities on top without going through the processes above, but that would likely yield fast access to horrible data. As we have seen, this only compounds problems by allowing the firm to make bad decisions faster!
As mentioned at the outset of Part 1, the Data Exchange leverages the value of data by providing an Amazon-like data ‘shopping’ experience. The data consumer will search by entering ordinary, ‘natural language’ into a dialog box, press ‘Search’ and then be presented with recommended results.
The data consumer can put the data into a virtual shopping cart or drill down into the details of the recommended physical data to learn more about it. The exchange will reference the Business Glossary in order to provide more context derived from the conceptual metadata. Advanced search features would include pre-filtering to narrow down results to specific sources, time-frames or other variables.
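Stripped of the natural-language front end, the ranking behind such a search can be sketched as keyword overlap between the query and each dataset's conceptual metadata. The catalog shape and scoring below are illustrative assumptions; a production system would use a proper search engine rather than substring matching:

```python
def search(query, catalog):
    """Rank datasets by overlap between query words and metadata text.

    `catalog` maps dataset name -> {"description": str, "terms": [str]},
    i.e. the conceptual metadata surfaced by the Business Glossary.
    """
    words = set(query.lower().split())
    scored = []
    for name, meta in catalog.items():
        text = (meta["description"] + " " + " ".join(meta["terms"])).lower()
        score = sum(1 for w in words if w in text)
        if score:
            scored.append((score, name))
    # Highest score first; ties broken by name for stable output
    return [name for score, name in sorted(scored, reverse=True)]

catalog = {
    "q3_sales": {"description": "Quarterly sales orders",
                 "terms": ["Sales", "Revenue"]},
    "hr_roster": {"description": "Employee roster", "terms": ["HR"]},
}
print(search("sales revenue", catalog))  # -> ['q3_sales']
```

Pre-filtering by source or time-frame would simply narrow `catalog` before this scoring pass runs.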
Customized & Integrated Data Delivery
One of the key features of the exchange will be customized outputs. Using a menu, data consumers can choose data formatted for use with Excel, Tableau, R or any other business tool. This feature alone can save hours in prep-time for each analysis, not to mention the improvement in job satisfaction for data scientists, who want nothing more than to apply their creative and analytical firepower to quality data rather than spending their time wrangling it.
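Under the hood, customized delivery amounts to a format dispatcher over the same rows. The targets and output choices below are a minimal stdlib sketch (CSV for Excel, JSON for Tableau, column-oriented dicts for R); real delivery would likely use native formats for each tool:

```python
import csv
import io
import json

def export(rows, target):
    """Format the same list of row-dicts for different downstream tools."""
    if target == "excel":
        # CSV opens directly in Excel
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    if target == "tableau":
        # Tableau can consume JSON or CSV sources; JSON shown here
        return json.dumps(rows)
    if target == "r":
        # Column-oriented layout maps cleanly onto an R data.frame
        return {col: [row[col] for row in rows] for col in rows[0]}
    raise ValueError(f"unknown target: {target}")

rows = [{"region": "east", "amount": 10},
        {"region": "west", "amount": 20}]
print(export(rows, "r"))  # -> {'region': ['east', 'west'], 'amount': [10, 20]}
```

The menu the data consumer sees is just a front end over this kind of dispatch, which is why adding a new tool format is cheap once the pipeline exists.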
The Data Exchange can also be configured to draw in external datasets from public repositories or trusted vendors, enabling one-stop data shopping: external datasets can be purchased and charged directly to the user’s account. The true promise of Big Data can only be realized when smart firms mash their proprietary, internal data with external datasets to create a competitive advantage. The Data Exchange makes seamless the entire process of data discovery and delivery in order to empower cutting-edge data analysis.
If you treat your organization’s data as an asset, you will uncover unexpected ways to put it to work. One data monetization example is combining your data with readily available weather data in order to build predictive sales models. Another is integrating stock market, demographic or any number of other datasets with your organization’s data in order to create unprecedented data insight.
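The weather example above reduces to a simple enrichment join between internal and external records on a shared key. The field names and sample data are hypothetical:

```python
def join_on_date(sales, weather):
    """Enrich internal sales records with external weather observations.

    Both inputs are lists of dicts sharing a 'date' key; sales rows
    with no matching observation get temp_c = None.
    """
    by_date = {obs["date"]: obs for obs in weather}
    return [{**row, "temp_c": by_date.get(row["date"], {}).get("temp_c")}
            for row in sales]

sales = [{"date": "2024-07-01", "units": 40},
         {"date": "2024-07-02", "units": 55}]
weather = [{"date": "2024-07-01", "temp_c": 31}]
print(join_on_date(sales, weather))
```

Once both datasets flow through the exchange with shared metadata, this join is the whole trick: the enriched rows feed straight into a predictive model.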
Building the Data Exchange is not simple or easy. If there is one takeaway from these twin posts, let it be that the implementation and maintenance of a powerful Data Exchange requires the expert harmonization of capabilities. Just as a high performance race car engine requires expert tuning, an optimized Data Supply Chain requires expert fine-tuning to be sure that all of the pistons are firing in the right sequence.
While the Data Exchange that we are talking about cannot be implemented overnight with the purchase of a single new technology, it can be done in stages. A practical first step would be to re-direct some of the budget from IT and some from BI/Analytics to support the middle stages of the Data Supply Chain. By doing this, you can build the foundation required to construct the state of the art Data Exchange that will pay dividends for years to come.
The rewards are worth the cost. In the world of physically distributed data, we need solutions that leverage past investments and will accommodate future data, in whatever format. Automated integration of legacy and current, internal and external datasets unleashes Big Data to deliver Big Value.