Privacy Lost – Unmasking Masked Data
Privacy is consistently in the news. Retailers, governments, health care providers, employers, and others store large amounts of data, much of it containing personal information. Keeping that data private has proven to be a difficult task.
We have seen numerous examples of unintended data loss (unintended by the company whose systems are stolen or attacked).
We hear about thefts of laptops containing personal information for hundreds of thousands of people. Internet-based attacks that give attackers access to financial transaction data, and even rogue credit card skimmers hidden in gas pumps, have become background noise in a sea of leaked data. This category gets the lion's share of attention from the media and from security professionals.
Worse than these types of personal data loss, because they are completely preventable, are losses caused by a company consciously releasing its customer data. Such companies assume they are introducing no risk, but often they are. In every case, if the owner of the data had simply kept it internal, no privacy loss would have occurred.
There have been cases of personal data loss due to mistakes in judgment.
AOL released a large collection of search data to researchers. The people releasing the data did not consider it a privacy risk. After all, how could search terms entered by anonymous people threaten anyone's privacy?
Of course, we now know that the data included people's Social Security numbers (SSNs), phone numbers, credit card numbers, and so forth. Why? It turns out that some people search for those things, quite possibly to prove to themselves that their data is safe. What better way to check whether your SSN or credit card number is published on the Internet than to type it into a search engine? No matches? Great!
Personal data has even been lost by companies releasing data after attempting to mask or anonymize it.
The intent of masking is to remove the personally identifiable information (PII) so that the data cannot be associated with real people. Of course, this has to be done without losing the important details that allow patterns and relationships in the data to be found.
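A minimal sketch of what this kind of masking looks like, assuming an invented record layout. The field names, salt, and hash truncation below are my illustrative assumptions, not any company's actual scheme; the point is only that direct identifiers are dropped while the analytic fields stay linkable per user.

```python
import hashlib

# Hypothetical rental records; the schema is invented for illustration.
records = [
    {"name": "Alice Smith", "email": "alice@example.com",
     "movie": "Movie A", "rating": 4, "date": "2005-03-14"},
    {"name": "Bob Jones", "email": "bob@example.com",
     "movie": "Movie B", "rating": 2, "date": "2005-03-15"},
]

def mask(record, salt="example-salt"):
    """Drop direct identifiers (name, email) and replace them with a
    salted hash, so ratings remain grouped per user without naming
    the user. Without the salt, anyone could hash a known email and
    look it up."""
    user_id = hashlib.sha256((salt + record["email"]).encode()).hexdigest()[:12]
    return {"user": user_id,
            "movie": record["movie"],
            "rating": record["rating"],
            "date": record["date"]}

masked = [mask(r) for r in records]
for row in masked:
    print(row)
```

Note that this preserves exactly the kind of residual detail (per-user movie/rating/date patterns) that the rest of this article shows can be exploited.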
A recent example where masking did not work as planned is the Netflix competition. Netflix released a data set containing the rental history for a large group of people in support of a competition with a one-million-dollar prize. The winner would be the first person or team to significantly surpass (on the order of 10%) Netflix's own algorithm for predicting movie rental choices based on a person's prior rental history.
Netflix masked the data before it was released and did not believe that it was putting people’s privacy at risk. However, some researchers were interested in whether they could tie the anonymized data back to real people. They began to look for ways to associate the data with other data sources.
This vulnerability of masked data to being "unmasked" is the Achilles' heel of the data masking process. We can remove the PII, but if the remaining data is unique enough and a related, non-masked data source exists somewhere else, it can be used to tie records back to a person.
In the case of Netflix's data, an obvious place to look for similar data was the Internet Movie Database (IMDB). By comparing comments and ratings in the Netflix data set with those posted by users on IMDB, the researchers were able to identify some people in the Netflix data set. This meant that they now knew the rental history for specific individuals.
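The core of such a linkage attack can be illustrated with a toy sketch. The data, names, and matching rule below are entirely invented; real attacks are far more robust, tolerating fuzzy dates and approximate ratings, but the principle of scoring overlap between a masked profile and a public one is the same.

```python
# Masked profiles: pseudonymous users mapped to (movie, rating) pairs,
# as might appear in an anonymized release.
masked_profiles = {
    "u1": {("Movie A", 4), ("Movie B", 2), ("Movie C", 5)},
    "u2": {("Movie A", 1), ("Movie D", 3)},
}

# Public profiles: the same kind of ratings, but posted under real
# names on a public review site (all invented for illustration).
public_profiles = {
    "Alice Smith": {("Movie A", 4), ("Movie B", 2), ("Movie C", 5)},
    "Carol Doe": {("Movie D", 3), ("Movie E", 5)},
}

def best_match(masked, public):
    """Return the public identity whose ratings overlap most with the
    masked profile, plus the overlap size. A large overlap suggests
    the two profiles belong to the same person."""
    scored = [(len(masked & ratings), name) for name, ratings in public.items()]
    score, name = max(scored)
    return name, score

name, score = best_match(masked_profiles["u1"], public_profiles)
print(name, score)  # -> Alice Smith 3
```

If "u1" and "Alice Smith" rated the same obscure movies the same way, the pseudonym is effectively broken, and every other record for "u1" in the masked set is now attributed to her.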
Controlling this process of “unmasking” is difficult. The release of data into the wild removes all controls. So why risk it?
The beauty of the approach from Netflix's perspective is the amazingly cheap labor. Many individuals and teams invested an incredible amount of time over several years to reach that 10% mark. What would Netflix have had to pay a research staff to achieve that result? Far more than one million dollars. And there was no financial risk from Netflix's perspective: if the threshold were never reached, not a dime would be spent.
The costs and risks from this type of arrangement are actually borne by the people whose data is being misused. Companies have no right to increase the risk to a person’s privacy as a side effect of that person doing business with the organization. The company benefits from the customer’s decision to buy goods or services. Abusing that relationship is inappropriate and wrong.
It may be possible to "perfectly mask" data that is being released into the wild. The domain of the data plays a part, as do the breadth and detail within the data set. A company must think long and hard, consulting experts in masking, before releasing any privileged data into the public domain. Personally, I believe this can be accomplished, though probably not with as broad a data set as Netflix was providing.
As someone interested in data mining and machine learning, I have benefited from anonymized data. Having meaningful data sets available to test algorithms is valuable for educational and research purposes. Those releasing the data need to ensure that the data set is constrained to a narrow enough domain and set of values that it is very unlikely someone will find a corroborating non-masked data source.
Note that Blue Slate is rolling out a blogging tool, and I will be blogging over there (http://www.blueslate.net/roller/daveread/). In the near term I'll post my entries in both places. Eventually this blog will evolve into a place for friends-and-family entries.