Almost all organizations experience data leaks to one degree or another. Data can leak out in small amounts or rush out in torrents, depending on the nature of the hack or other security compromise. But even a trickle of lost data has the potential to turn into a disaster — so it pays to be able to conduct data leakage detection as a routine part of your business operations.
The term “data leakage” generally refers to the exposure of data that is more focused and limited than a full-scale data breach. A classic example is when an employee’s laptop goes missing. The person who now has possession of the laptop may be able to hack its hard drive and access files that the device’s owner downloaded from the organization’s network. If there is personally identifiable information (PII) in those files, that’s certainly a risk. However, if the culprit doesn’t have authorization to get inside the network, the potential for serious harm is limited to the files on that laptop — which may be a small blessing.
The central role of metadata
The central strategy and key resource involved in data leakage detection is pretty much the same, regardless of the amount of data or files involved: effective management of metadata. Many organizations don’t realize the real value of metadata, especially when it incorporates both conceptual metadata (the kind that comes from people’s heads and is therefore subjective) and logical metadata (the kind gathered with technology tools, which is very black-and-white and objective). If you haven’t acquired both types and resolved any discrepancies, you will not be able to interpret everything needed to manage a given data asset more effectively — because you are looking at only a “half truth.”
If an organization collects and stores the right kind of metadata, it gains insights into a variety of relevant details that can aid in data leakage detection:
– specific locations of sensitive data elements
– which users access sensitive data, what they do with it and which programs they use
– which other sensitive data elements they generally access in combination with a given data element
– what documents or other work products they create with it
In short, gathering metadata provides insights into a data element’s entire lifecycle. Each legitimate use of a given data element is treated as a separate “story” (or contextual use case), and can be quite specific. For example, one contextual use case might describe a process in a derivative valuation department, where, once a month, analysts retrieve the running total of cash reserves from a particular database and feed it into a specific quant model as variables X1, X4 and Z; the results are rolled up and used in the quarterly close process to derive the net present value of the firm’s derivative assets.
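To make the idea concrete, a contextual use case like the one above can be captured as a structured metadata record. The sketch below is illustrative only; the field names and values are assumptions for this example, not a standard metadata schema.

```python
from dataclasses import dataclass

# A minimal sketch of a contextual use case recorded as structured
# metadata. All field names and values are illustrative assumptions.
@dataclass
class ContextualUseCase:
    data_element: str        # the sensitive data element being used
    department: str          # who legitimately uses it
    source_system: str       # where it is retrieved from
    frequency: str           # how often the legitimate use occurs
    programs: list           # applications or models it flows into
    downstream_output: str   # the work product it ultimately feeds

# The derivative-valuation example from the text, expressed as a record.
monthly_valuation = ContextualUseCase(
    data_element="running_total_cash_reserves",
    department="derivative valuation",
    source_system="reserves_db",
    frequency="monthly",
    programs=["quant model (inputs X1, X4, Z)"],
    downstream_output="quarterly close: NPV of derivative assets",
)

print(monthly_valuation.data_element)
```

Records like this give the governance team a queryable description of every approved “story” for a data element, which is exactly what later monitoring compares against.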
Insights at the end of data’s lifecycle
Metadata is also helpful in another important part of monitoring for data leakage: understanding exactly what happens to data at the end of its information lifecycle, meaning how it is consumed or used, as in the contextual use case above. Technical metadata tools are most effective at gathering this information while the data is flowing across enterprise systems. But data consumers, analysts, data scientists and statisticians often pull data assets off the network and work on them offline, using their own custom apps or desktop software. Metadata tools are far less effective at tracking this phase of the data lifecycle. The problem is compounded when the data consumer analyzes the data in applications such as Excel, MS Access, R or Octave. These applications are designed for robust data analysis, not for tracking how data moves around, within, to and from the application.
The organization can use its understanding of the data lifecycle as a baseline against which day-to-day data usage can be compared. This monitoring and comparison can be done in one of two ways. One way is to create machine learning systems or algorithms that monitor and compare the use of data against the baseline and, when they detect inappropriate use, raise red flags with the data governance and information security teams.
A second approach is for the governance group to establish a protocol for conducting proactive audits with groups that are accountable for certain data assets — or even more specifically, certain business data elements. This auditing process, which is sometimes referred to as “eyeball-driven,” uses contextual use cases to determine if any misuse is occurring. Engaging the help and insights of the data user community can be an enormously important part of the strategy for monitoring data usage. Of course, the two approaches can also work side by side for even greater effect.
Whichever tactic you use to monitor for inappropriate data use, the underlying requirement is to have an up-to-date record of metadata about your sensitive data elements (and ideally, about all your data). In addition to improving your leadership’s ability to maintain an accurate view of your most important metrics and make decisions accordingly, having accurate metadata is crucial to your ability to know when sensitive data is being misused and react quickly to limit the damage.