Online shopping is now an almost seamless experience: go to Amazon.com, type in what you are looking for using natural terms, narrow down the results using filters or choose from the recommended offerings, confirm shipping and billing details, then expect timely delivery. Unfortunately, this is not how data ‘shopping’ (aka search and delivery) functions in most organizations. Too often, analysts, data scientists and other data consumers put in requests for data, then wait… for hours, or days, or until they have almost forgotten why they requested the data in the first place. In addition to the long lag time, data exploration and discovery – the seemingly simple capability of making all appropriate data visible and accessible – is a huge problem in many large organizations. ‘Dark data’ is a growing problem. Together, these limits on the accessibility of data impoverish a firm’s ability to draw the insights from data that could define or improve its competitive advantage. Even if an organization hires the best mathematicians, statisticians and data scientists, their analyses will be hamstrung by slow or incomplete data.
In many organizations, the savvy data scientists and analysts (i.e. those fed up with the aforementioned latency) are running their own ‘shadow’ data pipelines – maintaining duplicate datasets and keeping their improved data, models and insights from the rest of the organization. The curious manager might mutter to themselves, ‘We have good data and we have made major capital expenditures in Big Data and Advanced Analytics, so what is wrong? Why do I get the feeling that our data problems are getting worse rather than better?’ Many organizations will respond to this situation by dumping more money into an expensive new technological solution, or they might bring in a well-known consulting firm to give a detailed, if costly, assessment.
We have found that these solutions make, at best, a limited impact on the ultimate goal – comprehensive, reliable, easily accessible, secure data that is ready to fuel business analytics and data science. In some cases, partial solutions only make things worse. The trend we have noticed is for organizations to spend money on expensive technology to collect and house massive amounts of data and/or on expensive talent on the analytical end (data scientists and analysts) without investing enough in the middle layer – the less obvious capabilities of Data Governance, Data Quality and Metadata Management. It is like assembling an orchestra by starting with a great percussion section and three solo violinists, then throwing in a few woodwinds and brass as an afterthought. The results will be uneven – like an amateur orchestra assigned to play a Mahler symphony without any practice or sheet music: dark, turbulent and not very enjoyable.
Worse yet, many organizations have two or three ‘conductors’ who constantly fight over everything: budget, timelines, credit, etc. Unfortunately, some of these organizations think that they are building a complete strategy that will move them toward becoming data-centric, but they are only building part of it. To build the seamless data ‘search & shopping’ experience described at the outset, you need to build what we call the Data Exchange. In effect, it is a state-of-the-art data delivery and management machine that ensures quality data is delivered quickly, reliably and securely at the point of consumption. This last point deserves comment – in some cases, data will be processed to become ‘fit for use’ only when it is called upon. This is more efficient than trying to ‘fix’ every shred of data, because some data may never be used and the time and effort required would make that approach impractical. Furthermore, the data will be delivered ‘to your desktop’ in ‘packaging’ that will save analysts significant time in re-formatting.
Building the Data Exchange we are describing takes more than slapping an off-the-shelf search and delivery portal with a nice user interface onto your existing systems. It requires an optimized Data Supply Chain that, like an orchestra, needs a fully trained ensemble under the coordination of an expert conductor. Below, we describe the components you need to put into place to harmonize your Data Supply Chain and enable the Data Exchange.
Building the Complete Data Supply Chain
We call the sequence of processes involved in the production and distribution of data the Data Supply Chain. In any organization, data is created and then stored for later use. On the supply side, data is created in a myriad of ways, from manual entry to the Internet of Things. Data is then stored in any number of data stores, databases or data warehouses. At some point, the consumers of data on the demand side will pull data from these sources in order to conduct analyses. In many organizations, raw data is processed in the management and exchange step to make it more reliable. Unfortunately, in many organizations this middle step is neglected or under-resourced. As the volume of data increases rapidly, organizations spend more money to store data and hire more and more data analysts to derive insights from this avalanche of data.
An optimized Data Supply Chain is required for the timely delivery of reliable, secure data throughout an organization. The components are as follows:
Data Governance Operations: This capability oversees the overall management of data. It relies on personnel such as data stewards and data custodians who take accountability for particular data elements. They develop and enforce rules that govern the creation, movement and use of data throughout the data ecosystem. At an advanced level of Data Governance Operations, Business Intelligence is applied to the metadata to support data governance by giving feedback and implementing warnings – we call this Data Intelligence.
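As a minimal sketch of what such Data Intelligence might look like in practice, the snippet below applies simple stewardship rules to catalog metadata and emits warnings. The catalog entries, field names (`owner`, `last_refreshed`, `classification`) and the 90-day staleness threshold are all illustrative assumptions, not part of any specific product.

```python
from datetime import date

# Hypothetical catalog entries: each dataset's governance metadata.
catalog = [
    {"name": "customer_master", "owner": "jdoe",
     "last_refreshed": date(2024, 1, 3), "classification": "PII"},
    {"name": "web_clicks", "owner": None,
     "last_refreshed": date(2023, 6, 1), "classification": None},
]

def governance_warnings(entry, as_of, max_age_days=90):
    """Apply simple stewardship rules to one catalog entry; return warnings."""
    warnings = []
    if not entry["owner"]:
        warnings.append("no data steward assigned")
    if entry["classification"] is None:
        warnings.append("missing sensitivity classification")
    if (as_of - entry["last_refreshed"]).days > max_age_days:
        warnings.append("data may be stale")
    return warnings

for e in catalog:
    print(e["name"], governance_warnings(e, as_of=date(2024, 2, 1)))
```

In a real deployment these checks would run continuously against the metadata repository and feed dashboards or alerts back to the data stewards.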
Metadata: Metadata is data about data. For example, almost every digital camera captures and attaches metadata to each image, including the time and date, resolution, GPS coordinates and other technical camera settings. These ‘metatags’ can make sorting and searching for images much easier.
Data Quality Management: This capability checks data to ensure that it is in the right format, complete and accurate. It requires some technology to profile data and identify anomalies, plus enough personnel to match the speed and scale of the project. Typical quality dimensions include integrity, conformance and accuracy.
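A basic data-quality profile along these dimensions can be sketched as follows; the records, field names and validation rules are illustrative assumptions, not a standard.

```python
import re

# Illustrative customer records with deliberate quality problems.
records = [
    {"customer_id": "C001", "email": "a@example.com", "age": 34},
    {"customer_id": "C002", "email": "not-an-email",  "age": 34},
    {"customer_id": None,   "email": "b@example.com", "age": -5},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile(rows):
    """Count records failing basic quality dimensions."""
    issues = {"incomplete": 0, "nonconforming": 0, "inaccurate": 0}
    for r in rows:
        if any(v is None for v in r.values()):
            issues["incomplete"] += 1        # completeness: no missing values
        if r["email"] and not EMAIL_RE.match(r["email"]):
            issues["nonconforming"] += 1     # conformance: email format
        if r["age"] is not None and not (0 <= r["age"] <= 120):
            issues["inaccurate"] += 1        # accuracy: plausible age range
    return issues
```

Profiling tools do this at scale across millions of rows; the point here is only that each quality dimension reduces to an explicit, checkable rule.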
There are three primary types of metadata: Physical, logical and conceptual. Physical metadata is the data about when and where a data value resides within a data store. Any time a system touches, updates, moves and eliminates data values, the interaction is logged automatically via technical code. Logical metadata is concerned with the schematics that map out how the data systems are organized, and in theory, how data should move through the system or connected systems. Physical and logical metadata can be extracted out of machines or diagrams with relative ease by a competent technician or data architect. The third category of metadata is conceptual. This type of metadata consists of what data creators, consumers and analysts think of the data, and in many cases represents the nuances and realities that are forced on data by the business model. Their knowledge reveals an important aspect of how data actually lives and works in the data ecosystem, and more importantly, how it functions within the business model.
To extend the camera metaphor, conceptual metadata needs to be added by people. As of yet, image management software has a difficult time distinguishing between different cats or different cars, or picking the ‘best’ photo from a set (though it is getting much better at identifying Aunt Jane and Uncle Joe, given the right initial inputs). If you are trying to put together an album of Aunt Jane’s cats, you might be able to use technical metadata to narrow down time and place, but you will still have to manually add metatags to mark the ‘cutest’ five or six cat images. Business often calls for analysts to find and group data by criteria that are not logged in the technical metadata.
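The cat-album scenario can be sketched in a few lines: the `taken` and `gps` fields stand in for technical camera metadata, while the `tags` set holds conceptual metatags a person added by hand. All file names, fields and tags are hypothetical.

```python
# Hypothetical photo metadata combining technical fields and human-added tags.
photos = [
    {"file": "img_001.jpg", "taken": "2023-07-04", "gps": "home",
     "tags": {"cat", "cutest"}},
    {"file": "img_002.jpg", "taken": "2023-07-04", "gps": "home",
     "tags": {"cat"}},
    {"file": "img_003.jpg", "taken": "2023-07-05", "gps": "park",
     "tags": {"dog"}},
]

def find(photos, **criteria):
    """Return files whose technical fields and conceptual tags all match."""
    wanted_tags = set(criteria.pop("tags", []))
    return [
        p["file"] for p in photos
        if all(p.get(k) == v for k, v in criteria.items())
        and wanted_tags <= p["tags"]
    ]
```

Technical metadata narrows the search (`taken`, `gps`); the conceptual tag `cutest` is the part no machine logged automatically.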
The key to data optimization in a world that now includes graph and NoSQL databases alongside the more traditional relational database management systems (RDBMS) and legacy data systems is to effectively gather and leverage metadata. In a traditional relational database, the database provides structure to the data – think rows and columns. While this type of database has been a reliable workhorse for decades, and is still the right solution for some types of data, it does not make sense to jam all other types of data into this artificial structure. Doing so can impose unnecessary constraints, reduce or distort the data, and make it harder to search, since data must be cataloged in pre-existing categories that might not align with current business processes or conceptual frameworks.
In the new world of data, RDBMSs will not be eliminated; rather, newer types of data storage systems will need to be integrated into the existing data ecosystem. While it might be tempting to dump all data into one new container, a more realistic, cost-effective and nimble solution is to design for physically distributed data. This means creating the tools, processes and expertise to manage an ecosystem that combines legacy data systems with new data storage and computational technologies (e.g. Hadoop, Spark and Solr) and everything in between. The data will be housed in separate containers, but you can still create and control a single access point. How can this be done? Once again, the answer is the strategic deployment of metadata.
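One minimal way to sketch such a metadata-driven single access point: a catalog maps each logical dataset name to the physical store where it lives, so consumers never need to know which container holds the data. The store names and locations below are purely illustrative.

```python
# Hypothetical catalog mapping logical dataset names to physical homes.
CATALOG = {
    "orders":      {"store": "rdbms",  "location": "warehouse.sales.orders"},
    "clickstream": {"store": "hadoop", "location": "/data/raw/clicks"},
    "product_txt": {"store": "solr",   "location": "products_core"},
}

def resolve(dataset):
    """Single access point: look up where a dataset physically lives."""
    entry = CATALOG.get(dataset)
    if entry is None:
        raise KeyError(f"{dataset!r} is not registered in the catalog")
    return entry["store"], entry["location"]
```

A real implementation would dispatch the lookup result to the appropriate query engine; the design point is that the catalog, not the consumer, knows the physical layout.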
Systematic and comprehensive metadata management will free you from the confines of the vertical, hierarchical structure of traditional data modeling and design. Instead, you will be able to search horizontally – or along any axis you choose – to find the data you want. You can even set up a recommendation engine for the data in your ecosystem. By way of analogy, think of the way Netflix generates recommendations across genres, directors, cast members, themes, studios, eras and so on. In other words, its recommendations are not confined to any predetermined data structure or grouping.
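As a toy sketch of such a recommendation engine, the snippet below ranks datasets by the overlap (Jaccard similarity) of their metadata tags, cutting across whatever physical structure the data lives in. The dataset names and tags are invented for illustration.

```python
# Hypothetical metadata tags per dataset.
tags = {
    "orders":       {"sales", "customer", "transactions"},
    "returns":      {"sales", "customer", "reverse-logistics"},
    "web_sessions": {"web", "customer", "behavior"},
    "sensor_feed":  {"iot", "telemetry"},
}

def recommend(dataset, tags, top_n=2):
    """Rank other datasets by Jaccard similarity of their metadata tags."""
    base = tags[dataset]
    scored = [
        (len(base & other) / len(base | other), name)
        for name, other in tags.items() if name != dataset
    ]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]
```

Because the ranking depends only on shared tags, an analyst exploring order data gets pointed at related datasets regardless of which store, schema or era they come from.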
If it is not already apparent, the process of creating the Data Exchange requires the implementation and harmonization of several core data capabilities – more than one post’s worth! In Part II, we will cover the remaining capabilities that form the foundation of the Data Exchange as well as the functionality of the exchange itself.