Why Pete Warden Should Not Release Profile Data on 215 Million Facebook Users

Speaking of the research ethics related to automatically harvesting public social networking data, we are confronted this week with the story of Pete Warden, a former Apple engineer who has spent the last six months harvesting and analyzing data from some 215 million public Facebook profile pages.

According to Warden, he exploited a flaw in Facebook’s architecture to access public profiles without needing to be signed in to a Facebook account, effectively avoiding being bound by Facebook’s Terms of Service preventing such automated harvesting of data. As a result, he amassed a database of names, fan pages, and lists of friends for 215 million public Facebook accounts.

Warden has already done some impressive analysis of this data at an aggregate level, and I know researchers would love to get their hands on it. And like the “Tastes, Ties, and Time” Facebook project, Warden wants to release the dataset to the academic community.

But also like the “Tastes, Ties, and Time” project, Warden would be wrong to do so.

First, similar to our discussion of the ethics of collecting public Twitter streams, just because these Facebook users made their profiles publicly available does not mean they are fair game for scraping for research purposes. Yes, I have limited profile information viewable to the public, and I’ve authorized Facebook to make that information available for search engines to crawl. But the purpose of this public availability is to help people — humans, not bots — find me. The presumption is that my public profile data will only be found and viewed if someone actually searches for “Michael Zimmer” on Facebook or a search engine. In reality, my profile is only “public” if a human being takes specific and conscious action to find me.

Warden’s actions, however, violate the implicit understanding behind making profiles publicly searchable. Rather than trying to find me, Warden systematically sought everyone, letting a script do the work of seeking and harvesting my data. There is no genuine desire to find me, to friend me, and so on. He’s just collecting data. His reasons might be honest and beneficial, but that’s not what’s at issue here. The point is whether the 215 million Facebook users who now have some of their information in Warden’s database contemplated such harvesting and aggregating when they built their profiles and configured their privacy settings. They almost certainly didn’t, which brings into doubt whether this data has been collected with proper consent.

Second, Warden’s release of this dataset, even with the best of intentions, poses a serious privacy threat to the subjects in the dataset, their friends, and perhaps unknown others. Warden claims to be sensitive to the privacy of the subjects in the database, and in response he has removed the identifying URLs that are unique to each profile, but the dataset retains the subjects’ names (really!), locations, Fan page lists and partial Friends lists (I’m not sure what is meant by a “partial” list of friends).

So, obviously, individuals can be easily identified within the dataset. But that’s not the greatest threat posed by the release of this data. What is most dangerous is its potential use to help re-identify other datasets, ones that might contain much more sensitive or potentially damaging data. Recall the research that showed how trivial it was to re-identify the presumed “anonymized” Netflix database, or the ease of identifying individuals within social networks. The ease of re-identifying these datasets came from having ready access to other large sets of data where the subjects were already known. By overlaying social graphs and applying other intricate data-comparison methods, the “anonymous” datasets were quickly re-identified. (See Paul Ohm’s “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization” for excellent coverage of these cases and discussion of consequences for law & policy.)
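To make the overlay idea concrete, here is a minimal sketch of how a named dataset can unlock an “anonymous” one. It is a toy illustration only, not the method used in the research mentioned above: every record, name, and field (`record_id`, `location`, `fan_pages`, `sensitive`) is invented, and the hypothetical `reidentify` helper does nothing more than an exact match on shared attributes.

```python
# Toy linkage sketch. All records, field names, and values are invented
# for illustration; no real dataset or published method is reproduced here.

def reidentify(anonymous_rows, named_rows, keys):
    """Attach a name to each anonymous row that matches exactly one
    named row on every attribute listed in `keys`."""
    # Index the named dataset by its combination of shared attributes.
    index = {}
    for row in named_rows:
        signature = tuple(row[k] for k in keys)
        index.setdefault(signature, []).append(row["name"])

    matches = {}
    for row in anonymous_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # only an unambiguous match counts
            matches[row["record_id"]] = candidates[0]
    return matches


# A "public" dataset that still carries names (hypothetical).
named_profiles = [
    {"name": "Alice Example", "location": "Springfield",
     "fan_pages": frozenset({"Chess", "Jazz"})},
    {"name": "Bob Example", "location": "Springfield",
     "fan_pages": frozenset({"Soccer"})},
    {"name": "Carol Example", "location": "Shelbyville",
     "fan_pages": frozenset({"Chess", "Jazz"})},
]

# A separate dataset with names stripped but sensitive fields kept (hypothetical).
anonymous_records = [
    {"record_id": "r1", "location": "Springfield",
     "fan_pages": frozenset({"Chess", "Jazz"}), "sensitive": "..."},
    {"record_id": "r2", "location": "Shelbyville",
     "fan_pages": frozenset({"Soccer"}), "sensitive": "..."},
]

linked = reidentify(anonymous_records, named_profiles,
                    keys=("location", "fan_pages"))
print(linked)  # {'r1': 'Alice Example'}
```

Record `r1` is linked to a name even though the “anonymous” dataset never contained one, while `r2` stays unlinked because nothing in the named data matches it. The more attributes a named dataset carries, the more often such a match will be unique.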

Warden’s rich dataset of 215 million Facebook users, complete with their names, locations, and social graphs, is just the ammunition needed to fuel a new wave of re-identification of presumed anonymous datasets. It is impossible to predict who might use Warden’s dataset and to what ends, but this threat is real.

It turns out that Facebook has asked Warden to delay releasing this data to the academic community (I’m curious as to what kind of pressure — if any — they exerted to keep him from releasing this week as originally planned). We will need to keep a close eye to see if the data is actually released, in what form, and if any steps will be taken to control and track its usage.

UPDATE: Under a threat of a lawsuit from Facebook, Warden has destroyed the dataset.

5 comments

Heh. That social networks paper took us several months of data analysis and coding. Not trivial 🙂 A better argument would be that the techniques are now already out there and it would be pretty easy for someone else to reproduce it.

Thanks for the comment, Arvind. Sorry, didn’t mean to trivialize your efforts. My meaning was that it was “trivial” in the sense that it didn’t require a supercomputer and the NSA, but rather some hard work by bright computer scientists.

One of your main points seems to be that people are unaware that their data may be collected by bots, and so there is an implicit agreement that the data will be consumed by only humans.

However, once a document is public on the web, one can no longer make presumptions about how it will be used. Even if Pete Warden does not release his dataset, others have already collected this data and are using it for unknown purposes. In fact, it’s probably on sale here. If Pete Warden releases the Facebook data, he will just be bringing attention to the fact that this kind of automated collection is already happening, which in my mind is better than allowing people to remain ignorant of this possibility.

You could argue that this argument is based on the premise that one wrong justifies another, which is in part true. If no bots crawled this data, then your argument against releasing the data would be stronger. However, there’s no way of enforcing this kind of control, so this is a moot point in my mind.

Conrad: Thanks for the comment, but, respectfully, I disagree. While you and I recognize that “once a document is public on the web, one can no longer make presumptions about how it will be used”, many Web users do not fully understand this, and we cannot presume it as a proper starting place for forming norms of information flow online. And, as you predict, I also disagree that Warden’s releasing of the data would be a positive step towards reducing ignorance. There are much better ways to educate Web users and increase digital literacy than the drastic move of releasing data about them. My intent isn’t to control bots, but to educate users and researchers alike regarding the norms of information flow, and, as a result, hopefully adjust their tactics and strategies accordingly.

Michael, I think Conrad really has a point here. I agree with you that the key to reduce ignorance is education, but unfortunately this is a painfully slow process.

We can of course wait and hope that within 5 or 10 years the average Internet user is more educated and most will think twice about unconstrained disclosure of personal details to the global public, companies, or governments, but with the rate at which technology develops and the remaining privacy is being eradicated, there is not much to hope for. Apart from that, studies in psychology and sociology have repeatedly pointed out that we tend to bother less as long as something does not affect our immediate environment or ourselves directly, so opening people’s eyes (before it is really too late and massive datasets of thoroughly profiled individuals are widely available to be used and misused by anybody) is of utmost importance.

Reducing ignorance by education is essential, but unfortunately we too often need things to go terribly wrong and to be personally involved before we become painfully aware of how important the things are that we often take for granted. A combination of education and thoughtful confrontation to increase awareness and a sense of importance would therefore be the most effective. With an open, free, safe society and our civil liberties at stake, there is no time to waste.
