With so much information being shared online these days, it’s critical that much of it remains private and anonymous. We trust, for example, that social networking sites such as Facebook remove personally identifiable information when they share our preferences and desires with advertisers.
Vitaly Shmatikov, a young, fast-talking associate professor of computer science, studies privacy in ubiquitous data-sharing systems, from Facebook to hospitals to Netflix. “When companies and organizations that have your data say they protect your privacy, what do they actually mean?” Shmatikov asks. “They normally think they can share the data if they just anonymize it by removing the names and any identifying information, such as Social Security numbers and email addresses, and that there is no privacy issue.”
Shmatikov is finding that simply doing that doesn’t quite do the trick.
“What we’ve managed to show is that even if names are removed, any attackers with access to a little bit of public information can often reattach the names,” he says.
For example, he and graduate student Arvind Narayanan recently reported re-identifying users in the anonymized data that social networking sites such as Twitter and Flickr sell to advertisers.
The scientists developed an algorithm that looked at the overlapping information among users of the different sites and analyzed the structure of the individuals’ “network neighborhood.”
In a third of the cases, they identified individuals from completely anonymized data.
“In practice, it’s very easy to re-attach a name to anonymous data,” says Shmatikov.
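The core intuition behind that matching can be sketched in a few lines. The actual algorithm is considerably more elaborate (it works roughly by growing a mapping outward from a few known seed identities), and every name, graph, and score below is invented purely for illustration:

```python
# Toy sketch (invented names and graphs): match an anonymized node to
# a named identity in a public "auxiliary" graph by comparing their
# network neighborhoods with Jaccard similarity.

def jaccard(a, b):
    """Overlap between two neighbor sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def best_match(anon_neighbors, aux_graph):
    """Pick the named user whose neighborhood overlaps most."""
    return max(aux_graph, key=lambda name: jaccard(anon_neighbors, aux_graph[name]))

# The anonymized record: we see only whom the unknown user links to.
anon_neighbors = {"alice", "bob", "carol"}

# The public graph, with real names attached to each neighborhood.
aux_graph = {
    "dave":  {"alice", "bob", "carol", "erin"},
    "frank": {"erin", "grace"},
    "heidi": {"bob"},
}

print(best_match(anon_neighbors, aux_graph))  # dave
```

The point of the sketch is simply that overlapping structure, not names, does the identifying: "dave" wins because three of his four neighbors coincide with the anonymous node's.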
He was also able to find the identities of Netflix users from anonymized data released by the company.
“First off, if you know just a few bits of information about some Netflix subscriber, just a few movies and whether they liked them or not, you can with very high confidence find that subscriber’s record in the dataset,” he says.
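That record-finding step can be illustrated with a toy scorer. The real analysis used a more careful weighted, fuzzy match on ratings and dates; the titles, IDs, and scoring rule here are invented for illustration only:

```python
# Toy sketch: given a few (movie, rating) facts about a subscriber,
# score every record in an "anonymized" dataset and pick the best fit.
# All data and the scoring rule are invented for illustration.

def score(record, observations):
    """Count how many observed (movie, rating) pairs a record matches."""
    return sum(1 for movie, rating in observations
               if record.get(movie) == rating)

def reidentify(dataset, observations):
    """Return the record ID that best matches the observations."""
    return max(dataset, key=lambda rid: score(dataset[rid], observations))

dataset = {
    "user_0412": {"Brazil": 5, "Heat": 3, "Alien": 4},
    "user_0977": {"Brazil": 2, "Heat": 3, "Fargo": 5},
    "user_1650": {"Alien": 4, "Fargo": 1},
}

# The attacker knows just two of the target's ratings.
observations = [("Brazil", 5), ("Alien", 4)]
print(reidentify(dataset, observations))  # user_0412
```

Because each person's viewing history is so distinctive, even two or three observations usually single out one record, which is exactly the "very high confidence" Shmatikov describes.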
Using further mathematical reasoning, statistics and algorithms, Shmatikov re-attached names to the users by connecting their movie preferences and reviews with other breadcrumbs of information they left publicly on blogs and on IMDb (an online movie database).
People might not use their real names on such sites, but they do often use their email address or list their city of residence.
“Then you can just Google search the email address, find their Amazon reviews with their real name, and you’ve made the link,” says Shmatikov. “That’s not difficult.”
He says many people could abuse this lack of protection. Unscrupulous companies could use the information to spam people. Oppressive governments could use the information to monitor people. Cyberstalking—from a mild version where helicopter parents monitor their kids to jilted exes stalking former partners—could also be an issue.
By breaking such systems apart, Shmatikov’s research group is trying to show people that privacy protections should not be ad-hoc. Like Chang, he believes protections need to be built into the systems from the start.
“The ultimate vision is that we’re going to give you some software building blocks, and we’re going to prove to you mathematically that if you build your program using these building blocks,” he says, “then you will be safe against a certain class of privacy attacks.”
But he’s quick to caution: “There are always other avenues of attack. We can’t protect from everything with the software we would like to design.”
A relatively new arm of Shmatikov’s research involves genetic data, work he is embarking on with colleague Emmett Witchel.
“We are well on our way to having all of our genomes sequenced very cheaply,” Shmatikov says. “More and more medical and law enforcement databases are going to come with genetic data, and that information is extremely sensitive.”
In particular, doctors and researchers will be sharing this genetic information so they can search for better understandings of disease and develop new treatments. Data computations of this sort will be on a massive scale and will be extremely powerful in the world of health care.
Witchel and Shmatikov are working to adapt MapReduce, a framework developed by Google for such large-scale computations, so that it provides privacy guarantees to the people whose genetic data are being used while still keeping the process efficient and high-performance.
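One standard way to back such guarantees mathematically is differential privacy, which releases only noisy aggregates so that no single person's record measurably changes the answer. Whether that is precisely the mechanism in their system isn't stated here, but a minimal sketch of a differentially private count (with invented data and parameters) looks like this:

```python
import random

# Toy sketch of a differentially private count using the Laplace
# mechanism. The dataset, predicate, and epsilon are invented; a real
# system would enforce this inside the MapReduce pipeline itself.

def noisy_count(records, predicate, epsilon):
    """Count matching records, then add Laplace(0, 1/epsilon) noise so
    the published value hides any one individual's contribution."""
    true_count = sum(1 for r in records if predicate(r))
    # A Laplace draw is the difference of two exponential draws.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Invented cohort: 100 records, 34 of which carry a genetic marker.
genomes = [{"id": i, "marker": i % 3 == 0} for i in range(100)]

random.seed(0)
result = noisy_count(genomes, lambda r: r["marker"], epsilon=0.5)
print(round(result, 2))  # close to the true count of 34, but not exact
```

The researcher learns a useful aggregate, while the exact count, which could betray whether a specific individual is in the cohort, is never released.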
“As time goes on, we will see more databases with so-called anonymized genetic information being used for medical research,” says Shmatikov. “I think the next big privacy break is going to be in that area.”
They hope to develop tools and building blocks that can be used for these massive epidemiological and genome studies.
“We’ll be able to tell them that they can run computations on sensitive data, but they can be assured that they won’t learn anything about individual identities,” he says. (The good guys don’t need to know all those names either.)
And again, with just a bit more information—like a casual conversation about a person’s ancestry or hair color—cyber attackers could pretty easily re-attach a name to particular genetic sequences if they can get their hands on them. Shmatikov is reluctant to even predict what people might do maliciously with such data.