As we have said before in this forum, data should be a driver of the business. The right data can help to improve efficiency and can be monetized to bring in its own revenue. But data is also a liability. There are costs in collecting, storing and securing data. Data breaches and cyber security are mainstays of business life and the daily news. Data breaches can cripple well-run businesses, incur steep fines and result in devastating publicity.
There are many threats to data both mundane and sensational – data can be lost during a routine software update or it can be stolen by a mysterious troop of foreign hackers. Due to the nature of data breaches, it is impossible to predict when and where a breach will occur – there are just too many variables. While the news media latches on to major public breaches, who knows how many small-scale or undiscovered breaches are occurring every day?
At the same time that security is becoming a mainstream issue, big data is getting bigger every day. With the advent of new technologies, many firms feel compelled to store more and more data at low cost in order to, hopefully, someday, figure out how to make a fortune off of it. In light of this trend toward the stockpiling of data, wouldn’t it make sense to conduct an efficient internal assessment of risk? We think so. That’s why we have developed a method to assess risk based on an organization’s internal working knowledge of the data. The people who use the data have an excellent perspective from which to judge how valuable and how vulnerable the data is. This information should be stored as metadata and updated regularly through real-time collaboration. Once this system is running, you can assess the overall risk and implement appropriate security measures for each individual Business Data Element.
Security From the Ground Up
The Business Data Element (BDE) is a fundamental, conceptual unit of data. It encompasses how the average person in the organization thinks about data. Take the BDE ‘Address’: it typically consists of a street name and number, an apartment or suite number, a municipality, a state and a zip code. In terms of data storage, that is four to six related pieces of data that are stored separately. The seemingly simple BDE ‘Address’ illustrates the expanding complexity involved in maintaining data: one BDE results in four to six data entities.
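As a minimal sketch, a single conceptual BDE can be modeled as a name plus the list of physically stored fields it fans out to. The field names below are illustrative assumptions, not a prescribed schema.

```python
# Sketch: one conceptual BDE ('Address') fanning out to several
# physically stored data elements. Field names are illustrative
# assumptions, not a prescribed schema.
bde_address = {
    "name": "Address",
    "stored_fields": [
        "street_number",
        "street_name",
        "unit_number",   # apartment or suite
        "municipality",
        "state",
        "zip_code",
    ],
}

# One conceptual BDE, six separately stored data elements to track and secure.
field_count = len(bde_address["stored_fields"])
```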
This complexity is compounded when internal systems are updated or replaced, or when new systems are integrated, as often occurs in a merger or acquisition. Tracking where and how data is stored becomes a significant challenge and, if this information is not adequately known and documented, it becomes a nightmare.
To keep track of the data, you need a robust metadata solution that can store and publish the three classes of metadata: Physical, Logical and Conceptual. Physical metadata describes where and how the data resides in a given data store (think servers, databases and column names). Logical metadata describes the design of the data systems (think data model schematics). Most Physical and Logical metadata can be captured easily by using a metadata solution to scan a given data store, data model, or ETL (extract, transform, load) process for the metadata stored within technology log files. Most of this technology-native metadata is practically unintelligible to anyone but an IT specialist. Conceptual metadata holds the most value to the business, but it is the least accessible because it is the most difficult to acquire: it resides not in machines, but in people’s minds.
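The three classes can be sketched as a simple record attached to each BDE. The class and attribute names here are assumptions for illustration, not any particular metadata product's model.

```python
from dataclasses import dataclass, field

# Sketch of the three metadata classes attached to one BDE.
# Class and attribute names are assumptions for illustration.
@dataclass
class BdeMetadata:
    name: str
    physical: dict = field(default_factory=dict)   # servers, databases, columns
    logical: dict = field(default_factory=dict)    # data-model schematic details
    conceptual: str = ""                           # business meaning, in plain language

ssn = BdeMetadata(
    name="Social Security Number",
    physical={"server": "db01", "table": "customers", "column": "ssn"},
    logical={"entity": "Customer", "attribute": "SSN", "type": "CHAR(9)"},
    conceptual="Government-issued identifier used to verify a customer's identity.",
)
```

Note that the Physical and Logical parts can be scanned from systems, while the Conceptual string has to come from the people who use the data.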
Conceptual metadata is, in a sentence, information about what data is and how it is used by the business. Our definition of Conceptual metadata does not refer to what a technician guesses that the data means to the business. We are talking about a comprehensive perspective on how businesspeople think about and use data to run the business. To gather this information, you will need to tap the minds of the people who touch data. This can be done in person or via email or other online means, though we have found this process can get bogged down if not done quickly. We use the proprietary Controlled ChaosSM method to acquire this information and to populate the Business Meta-Model – the comprehensive set of metadata that gives a holistic view of how data is used to run the business. In short, Controlled ChaosSM consists of fast-paced collaborative sessions where the people who use the data discuss, argue and define what a BDE is and how it is used in the business context.
Unfortunately, current metadata solutions do a poor job of capturing, updating and displaying Conceptual metadata. With the expanse of big data, we need fast, efficient solutions that leverage the insights of the people who use the data. We need a re-boot of the Business Data Glossary that enables a Wikipedia-like experience – one that evolves with the needs and uses of the business and at the pace of business. Call it the Business Glossary 2.0.
What data to protect?
The need for the protection of Personally Identifiable Information, Protected Health Information and Payment Card Information is well known to every responsible firm that handles data. There are legal requirements and financial penalties for any firm that neglects to protect these classes of data. But what other forms of data are valuable to a firm and vulnerable to breach? Proprietary algorithms or KPIs might be the target of corporate espionage. Future strategic plans may be another target. Any information that yields a competitive advantage could be subject to attack, and its loss could be significant. Any firm would want to track closely any data that falls into these categories.
At the Data Clairvoyance Group, we specialize in making data serve the business. One of the ways we do that is through leveraging information about the data garnered from the businesspeople who use it. In this particular context, what we need to assess is the risk to an individual BDE. What we want to gauge, from the perspective of the businesspeople who use the data, is how valuable and how vulnerable that data is.
We have identified three main categories of bad data outcomes: theft, overwrite/editing, and loss. These ‘bad outcomes’ could in many cases be caused by internal error, or by an external threat that has breached your organization’s security and network firewalls.
The dimensions that affect the value and vulnerability of data include the number of people who can access, copy and edit a BDE as well as how internally or externally valuable the data is. Is there an external legal or illicit market for the data? Would the loss of or the exposure of internal data erode competitive advantage? These questions inherently reflect the organization’s confidence in their internal cybersecurity, their trust in their colleagues, and the general seriousness with which data is handled.
We recommend executing a broad survey of employees to gauge perceived risk to a given BDE on a 0-5 scale – from extremely unlikely to extremely likely.
How likely is it that BDE X will be subject to theft?
How likely is it that BDE X will be subject to overwriting or editing?
How likely is it that BDE X will be subject to loss?
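The three questions above can be captured as a simple survey record, with answers validated against the 0-5 scale. The structure and field names are illustrative assumptions.

```python
# Sketch: capturing one respondent's answers to the three survey
# questions for a single BDE, on the 0-5 scale (0 = extremely
# unlikely, 5 = extremely likely). Structure is an assumption.
QUESTIONS = ("theft", "overwrite_edit", "loss")

def validate_response(response):
    """Raise if any answer falls outside the 0-5 scale."""
    for q in QUESTIONS:
        if not 0 <= response[q] <= 5:
            raise ValueError(f"{q} score must be between 0 and 5")
    return response

answer = validate_response(
    {"bde": "Credit Card Number", "theft": 5, "overwrite_edit": 2, "loss": 3}
)
```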
Theft can occur from external hackers who break into a system or from a devious internal employee. In either case, we can assume their motivation would be to copy valuable files. We want to assess how likely it is for a given BDE to be the target of theft. For example, we would expect, in most contexts, for ‘social security number’ and/or ‘credit card number’ to be a more likely target than something like ‘educational level’ or ‘purchase history’.
Overwrite/Editing could occur in a variety of scenarios where a person maliciously or unwittingly changes data. Again, this could be an external or internal threat and this could also occur due to a software implementation error. Overwrite/editing is particularly harmful if information is not properly backed up. The more people who have editing permission for a BDE, the more likely it is that it will be altered in an unauthorized way.
Loss – Total loss of data could occur in a variety of scenarios if data is not properly backed up. For instance, hackers have employed malware or ‘ransomware’ to encrypt the data on a hard drive or server; unless the ransom is paid for the decryption key, the data is useless. Data loss could also occur due to an internal software update or malicious intent. You need to gauge what the complete loss of the data would cost the business.
If you were to aggregate the scores on these three questions, you would get an indication of the Probability of Occurrence:
Theft + Overwrite/Edit + Loss = Probability of Occurrence (P.O.)
While the P.O. gives a good indication of how vulnerable a given BDE is, a good risk assessment will also include how many vulnerable BDEs there are. This can be calculated manually by sifting through the system schematics to determine where the BDE lives (how many containers) and how many individual values exist in each container. Of course, it would be far better to implement an automated solution that could indicate how many total instances (number of records or data values) there are at any given moment. This can be done through automated queries or through the use of a Data Quality or Profiling solution (learn more about DCG’s Data Quality Product). In any case, once you have determined the total number of instances, you can multiply that by the Probability of Occurrence to arrive at an estimate of total impact.
Total Instances (of ‘X’ BDE) x P.O. = Total Impact
1 million x .23 = 230,000
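The arithmetic above can be sketched as two small functions. One caveat: the article sums three 0-5 scores, which yields a value between 0 and 15, while the worked example uses a P.O. of 0.23. To reconcile the two, this sketch assumes the sum is normalized by its 15-point maximum to produce a 0-1 probability; that normalization is our assumption, not something the text specifies.

```python
def probability_of_occurrence(theft, overwrite_edit, loss):
    """Sum the three 0-5 survey scores and scale to a 0-1 probability.
    Dividing by the 15-point maximum is our assumption, made so the
    result matches the 0.23-style value in the worked example."""
    return (theft + overwrite_edit + loss) / 15

def total_impact(instances, po):
    """Total Impact = total stored instances of the BDE times its P.O."""
    return instances * po

# Worked example from the text: 1 million instances, P.O. of 0.23,
# giving roughly 230,000 records at risk.
impact = total_impact(1_000_000, 0.23)
```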
According to the Ponemon Institute, the average cost associated with the breach of a single record is $158. While the costs associated with a data breach are debatable – Verizon calculates a much lower cost – the losses due to a data breach can go far beyond the initial monetary cost of fixing things and paying fines. If you were liable for the loss of 230,000 records, the costs would certainly add up, no matter which estimate you use.
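Using the Ponemon figure cited above, the dollar exposure for the worked example is straightforward to sketch (the function name is ours):

```python
# Sketch: converting the at-risk record count into a rough dollar
# estimate using the Ponemon Institute's $158 average cost per record.
COST_PER_RECORD = 158  # USD, the Ponemon Institute average

def breach_cost_estimate(records_at_risk, cost_per_record=COST_PER_RECORD):
    return records_at_risk * cost_per_record

# 230,000 at-risk records from the worked example: roughly $36.3 million.
estimate = breach_cost_estimate(230_000)
```

Even at Verizon's much lower per-record figure, the exposure on 230,000 records remains substantial.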
If you aggregate all of these responses throughout an organization, you get a clear picture of how valuable and vulnerable the data is, and you can institute measures to mitigate that risk.
With the metadata in place, you can manage your data by creating security zones. Each zone should have its own access limits and monitoring mechanisms. For example, you might have six zones, of which the first is a ‘sandbox’ open to everyone in the organization because the data within represents little in the way of risk. Zone six would carry the highest level of security: data there is accessible to the fewest people and is protected through close monitoring for volatility.
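A zoning scheme like this can be driven directly by the Probability of Occurrence. This sketch maps a 0-1 P.O. onto the six zones; the cutoff values are illustrative assumptions, since the text does not prescribe thresholds.

```python
# Sketch: mapping a BDE's Probability of Occurrence (0-1) to one of
# six security zones. Zone 1 is the open 'sandbox'; zone 6 is the
# most restricted. Cutoff values are illustrative assumptions.
def assign_zone(po):
    cutoffs = [0.10, 0.25, 0.45, 0.65, 0.85]  # upper bounds for zones 1-5
    for zone, cutoff in enumerate(cutoffs, start=1):
        if po < cutoff:
            return zone
    return 6
```

Under these cutoffs, the worked example's P.O. of 0.23 would land in zone 2.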
While our survey and subsequent metadata would enable an initial automated classification system, we need to acknowledge that data is always changing. Unlike a bureaucratic municipal zoning committee, a firm would have to implement a fast and efficient classification system that relies on a real-time social mechanism. Think of Wikipedia. While the information is crowd-sourced, it is also crowd-vetted for accuracy and consistency, or, in some cases, an emerging ‘hot’ or controversial topic is taken out of circulation until information can be verified. A firm could implement automated ‘circuit breakers’ that can be tripped by unusual activity, and/or, for tightly controlled data, the organization can assign subject matter experts to vet any changes to sensitive information.
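The ‘circuit breaker’ idea can be sketched as a small stateful check: if access volume for a BDE spikes far above its baseline, the breaker trips and the data is taken out of circulation pending review. The spike factor and hourly window are our assumptions.

```python
# Sketch of an automated 'circuit breaker': if hourly access volume
# for a BDE spikes far above its baseline, take it out of circulation
# pending review. Spike factor and window are illustrative assumptions.
class CircuitBreaker:
    def __init__(self, baseline_per_hour, spike_factor=5):
        self.baseline = baseline_per_hour
        self.spike_factor = spike_factor
        self.tripped = False

    def record_hourly_accesses(self, count):
        """Trip the breaker on unusual activity; return the breaker state."""
        if count > self.baseline * self.spike_factor:
            self.tripped = True
        return self.tripped
```

Once tripped, the breaker stays tripped until a subject matter expert vets the activity, mirroring the human-review step described above.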
Finally, an organization can provide incentives for data scientists or other data experts to recommend new zoning classifications. This would be particularly useful when a data scientist creates a new data set by mashing together existing data sets. This exercise might require the elevation of zoning, since it is often the combination of BDEs that poses the most serious privacy risks. For example, a last name alone does not entail much risk, but a last name combined with age and zip code could constitute a “weaponized” combination of Personally Identifiable Information that could be leveraged to conduct hacking activities.
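A simple rule for combined data sets can be sketched as follows: inherit the highest zone of any constituent BDE, then elevate further if the combination matches a known risky pattern. The pattern list and the one-zone elevation are illustrative assumptions.

```python
# Sketch: when data sets are mashed together, inherit the highest zone
# of any constituent BDE, then elevate further if the combination forms
# a known risky PII pattern. The pattern list is an assumption.
RISKY_COMBOS = [{"last_name", "age", "zip_code"}]

def zone_for_combined(bde_zones, max_zone=6):
    """bde_zones maps BDE name -> current security zone (1-6)."""
    zone = max(bde_zones.values())
    present = set(bde_zones)
    if any(combo <= present for combo in RISKY_COMBOS):
        zone = min(zone + 1, max_zone)  # elevate one zone, capped at 6
    return zone
```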
Data breaches are a fact of modern business life. With the explosion of data, we need better, smarter and faster ways of assessing and mitigating risk. We propose a method of assessing risk through harnessing the insights of those who are closest to the data. Leveraging real-time communication technologies, this risk assessment model can be integrated into a smarter data ecosystem that brings human knowledge and understanding to bear on the management of data. Assessing risk is an important part of optimizing data for any organization. It’s time to do it right.