When is anonymous data not anonymous?

Here is some interesting research by Arvind Narayanan and Vitaly Shmatikov at the University of Texas at Austin on privacy breaches from supposedly “anonymous” databases.

For researchers studying human behavior, getting access to data about human actions and opinions is very valuable. In some cases, such data is made available to researchers, but the data is supposed to be anonymized to protect the identity of the people involved. Netflix recently did this when it released a database of movie ratings as part of a competition. Contestants were offered a prize of $1 million if they could improve the accuracy of predictions about what movies people will like based on their past ratings.

Even though Netflix removed all personal information from the data when it released it, this research demonstrated that the pattern of ratings that the anonymous Netflix users made could be used to identify them. The issue is caused by a parallel, non-anonymous movie rating service at the Internet Movie Database (IMDb). If people rated the same movies at roughly the same time on the private Netflix service and the public IMDb service, then the patterns of ratings could be matched.
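To make the matching idea concrete, here is a minimal sketch in Python. It is not the researchers' actual algorithm, just a simplified illustration of the principle: count the movies that a public profile and an anonymous record rated similarly at roughly the same time, and accept the best-scoring record only if it clearly beats the runner-up. The function names, thresholds, movie titles, and user IDs are all made up for this example.

```python
from datetime import date


def similarity(public_ratings, anon_ratings, max_star_gap=1, max_day_gap=14):
    """Count movies rated similarly, and at roughly the same time, in both records.

    Each record maps a movie title to a (stars, date) pair.
    The thresholds are arbitrary assumptions chosen for illustration.
    """
    score = 0
    for movie, (stars, day) in public_ratings.items():
        if movie not in anon_ratings:
            continue
        anon_stars, anon_day = anon_ratings[movie]
        close_rating = abs(stars - anon_stars) <= max_star_gap
        close_date = abs((day - anon_day).days) <= max_day_gap
        if close_rating and close_date:
            score += 1
    return score


def best_match(public_ratings, anon_records, min_margin=2):
    """Return the best-matching anonymous ID, or None if the match is ambiguous."""
    scores = sorted(
        ((similarity(public_ratings, ratings), anon_id)
         for anon_id, ratings in anon_records.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not scores:
        return None
    top_score, top_id = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else 0
    # Only claim a match when the winner stands out from the next candidate.
    return top_id if top_score - runner_up >= min_margin else None


# Hypothetical data: one public profile and two "anonymized" records.
public_profile = {
    "Movie A": (5, date(2005, 3, 1)),
    "Movie B": (1, date(2005, 3, 4)),
    "Movie C": (4, date(2005, 4, 10)),
}
anonymous_records = {
    "user_17": {
        "Movie A": (5, date(2005, 3, 2)),
        "Movie B": (1, date(2005, 3, 4)),
        "Movie C": (4, date(2005, 4, 12)),
        "Movie D": (2, date(2005, 5, 1)),
    },
    "user_42": {
        "Movie A": (3, date(2005, 6, 1)),
        "Movie C": (4, date(2005, 4, 11)),
    },
}

print(best_match(public_profile, anonymous_records))  # prints: user_17
```

Once a record is matched this way, every other rating in it (such as the made-up "Movie D" above) becomes attached to the public identity, even though the person never rated that movie publicly.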

The privacy issue is that people may have made more sensitive ratings on what they thought was a private Netflix rating service, only to find that Netflix had revealed their personal data. Not only could the person's IMDb account be identified, but the research also demonstrated the kinds of inferences that can be made by examining the patterns of ratings:

First, we can immediately find his political orientation based on his strong opinions about “Power and Terror: Noam Chomsky in Our Times” and “Fahrenheit 9/11.” Strong guesses about his religious views can be made based on his ratings on “Jesus of Nazareth” and “The Gospel of John”. He did not like “Super Size Me” at all; perhaps this implies something about his physical size? Both items that we found with predominantly gay themes, “Bent” and “Queer as folk” were rated one star out of five. He is a cultish follower of “Mystery Science Theater 3000”. This is far from all we found about this one person, but having made our point, we will spare the reader further lurid details.

Researchers often rely on organizations that collect data about human behavior to make the data available for research. Developing good methods of protecting privacy while allowing research is important. There are techniques that can be used to anonymize datasets for research while preserving privacy, and this research illustrates why they matter.

There is some interesting discussion about the research at the physics arXiv blog and Slashdot.
