
Open datanarchy

Monday, November 7th, 2011

Open data is increasingly claimed to be “democratizing”. It is not clear to me where the “democracy” part is. If 99% of people decide to keep information private but 1% disagree, that 1% can still make the information publicly available. This is more like anarchy. There is a place for anarchy in the world – it is freedom at its most extreme. I am an overwhelming proponent of open data, but the price of data freedom is data vigilance.

The phrase “democratizing data” came up more than once at the recent “Big Open Data” panel at PeopleBrowsr Labs, on the “Philanthropy Panel” at CrowdConf, where I was a panelist, and at the “Silicon Valley Human Rights Conference”, where I was a regular participant. By “democratizing”, people simply meant “publishing online”, but calling it “democratizing” carries an implication of inherent good (and, more divisively, that anyone opposed to publishing data is somehow non-democratic). I question whether many people calling for open data really have the resources to also support the needed vigilance, or whether they simply use the “democratizing” tag to absolve themselves from the consequences of publishing or republishing information.

Occupy Oakland – this is what democracy looks like? Or is this what it looks like when people are no longer sure they are in a democracy?

Returning to the 1% who make data public when the 99% disagree: that 1% can take part in further open data sharing among themselves, and the 99% can simply opt out of that channel of information. I joined the Occupy Wall Street protests in Oakland following the shooting of a former Marine by Oakland Police. I hadn’t been closely involved with Occupy Wall Street until that point, but the decision to turn weapons on a peaceful protest (especially against someone who had served two tours of duty for their country) was too much to ignore. While I was there, I wondered how much the 99% had opted out of financial channels long ago. Had everything gone too far (at least in part) because the 1% had become so egregious and removed that the 99% had let them operate unchecked for too long?

Perhaps the same is happening with open data online. Cisco predicts that 90% of all web traffic will be video in the next three years. Let’s see who is democratizing it:

“the democratization process of video is ChatRoulette” Radvision.

“ChatRoulette represents a true breakdown and symbolic revolution of the relationship between content producers and consumers” The Faster Times.

ChatRoulette, for those who don’t know it, is a very simple idea: it randomly and anonymously connects you to other users via webcam and instant messaging. Have you seen ChatRoulette lately? This is not what democracy looks like. If you have not seen it before, check out this old(ish) video of Merton in a hooded top, singing and playing piano to random people on ChatRoulette, and take my word for it that the current user base is not about partially concealed pianists – a very specific 1% has taken control of this channel. ChatRoulette, and internet video as a whole, did not ‘democratize’ video – it became ‘voyeur takes all’.

The only people to openly admit to me recently that they used “democratizing data” in a less than noble way were advertisers, confessing that “democratizing data” (to them) mostly meant trying to coerce Facebook into making it easier for their start-up to scrape and sell data. The advertising community has been capitalizing on big data for some time (more than 95% of Google’s revenue is advertising targeted via big-data analytics) and they seem to be ahead of the curve. For them, it is not about democracy but simple capitalism – personal gain through someone else’s data. Respect my privacy, and more power to you.

The more serious problems arise when data can be used to harm an individual. I have lost count of how many “open data” or “information sharing” technologies have been enthusiastically called a “Swiss Army Knife”, followed by a list of many positive use cases. A Swiss Army Knife can be used to harm you in many more ways than it can be used to help you – it is a weapon. With the proliferation of cellphones and information sharing tools like Drupal, Twitter, Ushahidi and WordPress, anybody with a little technical knowledge can share masses of data. But a little knowledge is a dangerous thing, and the ease of use of many of these platforms means that we are sending people off to battle with inadequate weapons training.

I have also lost count of how many people have come to me over the last year asking for help with a real or planned map to document a crisis that they were passionate about. They have come from all over the world, but they are not a democratic mix. People who launch crisis maps are overwhelmingly the same demographic as those on ChatRoulette: excited young men with an internet connection. The deployments might serve and connect some of the least resourced people in the world, but they are not being curated by them. I try to give the same advice in all cases where there is a real element of physical danger: constantly review all your data in light of changing conditions; remove anything that is dangerous or irrelevant; and if you do not have the resources to constantly monitor and reevaluate what you have already published, discontinue your service. The overwhelming majority listen, and most of them decide that they cannot meet this requirement, instead serving their communities in more direct ways.

Left: the 1% of the revolution – an open celebration of victory on a tank. Right: the 99% of the revolution – a family huddled in the dark, trying to determine if the gunshots are coming closer. The only guarantee that the 99% have from open data is that it shines a spotlight on them – would you?

It is common for someone to be taken from their home and killed in a conflict, and for the cause never to be known. The recent killings of bloggers in Mexico, whose bodies were deliberately displayed with their social media handles, are the exception. We have to assume that contributing to social media leads to targeted deaths much more frequently.

The victims in Mexico knew that they were taking calculated risks. Open data means that someone could contribute to an open platform without even realizing it – someone else could take their words/reports and add them to an open platform, making them oblivious collaborators. Connecting with open data is uncertain – it can bring help or it can bring enemies. There is only one guarantee in publishing your information to open, social media in a conflict situation: it shines a spotlight on you. If you choose to publish information from/about someone in a conflict zone, you are shining a spotlight on them too. Republishing simply makes that spotlight brighter. The 1% of the revolution is a celebration on a now-still tank. The 99% of the revolution is huddling in the dark with your family close, trying to determine if the gunshots are coming closer.

Oblivious collaboration exists everywhere. There is no doubt that I am an oblivious collaborator with the advertising agencies mentioned above, looking to increase their market share by scraping information from my Twitter account or this site. I don’t much care in that context.

I am leading the construction of the largest humanitarian open data project to date – EpidemicIQ is currently processing about 1 billion data points per day, almost all of which are from open data. We do not yet republish open data – it is the struggle of coming to terms with the complexities of open humanitarian data at this scale that led me to write this article.

Take one example report: “a young girl from village X was treated for Y”. It is anonymous to me. If it is published openly, but only in medical circles, then she remains anonymous. If we republish this somewhere that people in village X will read, it might not be – perhaps only one young girl from the village was hospitalized at that time, so they will know who she is. Should we republish? What if the people in village X have been known to harm people with disease Y, because of a mixture of fear of disease and traditional beliefs? I have seen all these factors line up more than once. At the recent Strata big data conference in New York, a wealthy CEO insinuated that people were cowards for not republishing aggregated open data for fear of the legal implications. I don’t fear lawyers. I don’t fear billions of data points. I consistently worry about balancing the need to share information with the privacy and well-being of this girl, and many like her who are now oblivious collaborators in a global outbreak monitoring system.

Oblivious collaborators in conflict situations are a greater concern. This is not a fringe problem – 30% of the world (about 2 billion people) live in a conflict zone or a transitional situation. For obvious reasons, these are the most recent people to join the connected world, meaning that the least experienced populations now accessing social media are also the most vulnerable. We saw this with the recent Libya Crisis Map that was commissioned by the United Nations Office for the Coordination of Humanitarian Affairs (UN OCHA) and initially implemented by the Standby Task Force (SBTF), of which I’m a co-founder (full disclosure: I am also co-author of the SBTF Libya report, which I’ll be quoting).

The Libya Crisis Map aggregated information, primarily from traditional and social media, about the (then) mounting crisis in Libya, in order to support intelligence gathering by the UN in the lead-up to their deployment. The feedback was positive:

“If you go back a couple of years, all of this information probably would have been available, but it would have been seen as noise coming at you in multiple formats … Libya Crisis Map has done an extraordinary job to aggregate all of this information.” Brendan McDonald, UN OCHA

But part-way through the deployment, UN OCHA decided to make the map public. This was a case of the 1% making a decision without the 99%. The people who submitted and structured the reports were not asked if they wanted to make the map public. The majority were not even informed. A compromise was reached whereby only partial and/or obfuscated data was published on the public-facing map. Even so, security fears drove away the most important volunteers – those with knowledge of Libya. In their rush to show the world that they were using crowdsourcing technologies, the UN excluded and endangered the crowd.

The UN OCHA response to this in the Libya Crisis Map Report was unrepentant:

“why not allow full text of tweets already available? … if it is already fully available on the web” Information Management Unit, UN OCHA, Libya Crisis Map Report.

(re withholding/obfuscating information) “Bad instruction. All this became available on the web very quickly … belligerents know where camps and exit routes are, there is no security risk from this appearing on one more site on the web.” Information Management Unit, UN OCHA, Libya Crisis Map Report.

I don’t think it is productive to be so absolutist about something we know so little about, especially in big data’s first public use in a conflict setting. There are two clear reasons why publishing all information is dangerous:

1) You are showing your hand. Let’s say the bad guys know all the details that you do, and many more. If you have missed something, they now know that you don’t know: they know where to target.

2) You are creating oblivious collaborators. It is one thing for someone to tweet “there are many gunshots here, I wonder why”, but it is another for someone to aggregate this with reports of violence in the same area, using this tweet as further evidence, and publishing both together next to a logo of an organization considered to be an enemy. (This actually happened, but I’ve deliberately changed the wording). From the analogy above, you are turning the spotlight on that person up to 100. (Unless they don’t know about it, which is more like keeping them in the dark while giving all soldiers night-vision goggles).

There are two more reasons, both of which come from being in a newly connected world:

3) Not all bad guys would otherwise be resourced to collect data from disparate sources. Even if the information is open, if it is spread across dozens or hundreds of information points across the web, it takes a sizable operation to collect it all. Some bad guys belong to large, complex networks that might be able to scrape and parse all this information. Most are just opportunists, but they might now have an internet connection. Publishing aggregated, structured data weaponizes everybody.

4) Information can be open and describe an entire region in fine detail before any one person on the ground knows the full extent. Previously, conditions would change much more quickly on the ground than the reports that made it through to aid agencies and, yes, the bad guys on the ground very often knew about the changes before the aid agencies did. But big open data can be, and often is, ahead of any one individual or any one organization. In the global disease outbreaks that we track, this is the norm, not the exception. You can get ahead of the bad guys on the ground for the first time. This is one of the most positive aspects of big open data (in parsing, if not republishing) – do not give away that advantage so quickly.

The most frequent response to these kinds of arguments made it into the report:

“If we can’t handle the info publicly, it’s off, we lack adequate security to handle confidential info reported” Information Management Unit, UN OCHA, Libya Crisis Map Report.

It is impossible to predict what information will become sensitive. A report that obliquely mentions doctors in a secure refugee camp is harmless, right up until that camp is later raided and the most educated witnesses are deliberately targeted (this has also happened, but again I’ve deliberately changed the details a little). The only way to avoid any possibility of security implications is to collect no information at all. Any information that is held, whether privately or publicly, needs to be constantly reviewed against a changing environment. There is no way around this. You need to collect data. You need to have the resources to continually review it.
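As a sketch of what “constantly reviewed” might mean operationally – and this is entirely hypothetical: the report structure, the risk scoring, and the conditions feed below are illustrative stand-ins, not a description of any real deployment:

```python
# Hypothetical sketch of a continuous-review ("data vigilance") loop.
# Every name here is illustrative; nothing describes a real system.

def assess_risk(report, conditions):
    """Stub: re-score a stored report against *current* conditions,
    e.g. has the camp it mentions since been raided or contested?"""
    return 1.0 if report["region"] in conditions["now_contested"] else 0.0

def review_everything(published_reports, conditions, threshold=0.5):
    """Re-assess every published report; pull anything now dangerous."""
    still_safe = []
    for report in published_reports:
        if assess_risk(report, conditions) > threshold:
            print("unpublishing report", report["id"])  # remove it entirely
        else:
            still_safe.append(report)
    return still_safe

# The corollary argued above: if you cannot afford to run this over ALL
# of your data, continuously, then you are holding too much data.
reports = [{"id": 1, "region": "camp A"}, {"id": 2, "region": "camp B"}]
conditions = {"now_contested": {"camp A"}}
reports = review_everything(reports, conditions)  # report 1 is withdrawn
```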

The second most frequent response to these kinds of arguments also made it into the report:

“the personal responsibility [is] incumbent on the info sender.” Information Management Unit, UN OCHA, Libya Crisis Map Report.

In conflict situations, I think it is rare that someone caught up in the middle has a complete picture of their security situation. If you choose to publish aggregated information (regardless of your organization), then your act of publication asserts a position of power and knowledge. That places at least some responsibility onto you.

If security is wholly on the reporter, then it falls on the reporter to remove/edit any reports that they have contributed. But if they lose communications (or are unreachable for unknown reasons), or never knew that security was their responsibility, then the responsibility must still fall back on the publishers – the exact same situation.

With oblivious collaborators over open data, this also puts limits on how much data you can store, as you will need to maintain the manpower and/or technology to continually review all existing data. So just what are the limits on how much already-open data can be stored? I can give one answer:

29.

29 tweets: your ethical upper limit on the number of tweets to republish from free open data.

That’s the maximum number of tweets that anyone should ever republish from free open data if security is the responsibility of the reporter. It is not exactly “big data”. The math is simple. Let’s assume that the person who tweeted “there are a lot of gunshots here” decides to delete their tweet – if security is their responsibility alone, then the republishers have the responsibility to remove it too. Let’s also say that an acceptable latency in deletion is five minutes, and that you have an OAuth key that allows you the maximum 350 free API calls per hour on Twitter. You will need to check every stored tweet for deletion via the API every five minutes: (350/60)*5 = 29.16 calls available per five-minute window. As soon as you store your 30th tweet, you can no longer check every tweet for deletion every five minutes without hitting the API limit.
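Here is the same arithmetic as a minimal Python sketch. The 350-calls-per-hour figure and the five-minute latency come from the assumptions above; the `tweet_still_exists` stub and the loop structure are purely illustrative, not code for a real deployment against Twitter’s API:

```python
import time

# The arithmetic behind the "29" figure (assumptions as in the text).
RATE_LIMIT_PER_HOUR = 350     # Twitter's free OAuth limit at the time
ACCEPTABLE_LATENCY_MIN = 5    # maximum tolerated delay in honoring a deletion

calls_per_window = (RATE_LIMIT_PER_HOUR / 60.0) * ACCEPTABLE_LATENCY_MIN
MAX_TWEETS = int(calls_per_window)    # 29: a 30th tweet breaks the guarantee
print(MAX_TWEETS)                     # -> 29

def tweet_still_exists(tweet_id):
    """Stub standing in for one API call (e.g. the era's statuses/show
    endpoint, which returned an error for deleted tweets)."""
    return True  # placeholder only

def vigilance_loop(republished):
    """Poll every republished tweet once per window; honor deletions."""
    assert len(republished) <= MAX_TWEETS, "over the rate (and ethical) limit"
    while republished:
        for tweet_id in list(republished):
            if not tweet_still_exists(tweet_id):   # one API call each
                republished.remove(tweet_id)        # mirror the deletion
        time.sleep(ACCEPTABLE_LATENCY_MIN * 60)     # next five-minute window
```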

You could pay to increase the limits on Twitter, but then it is no longer free open data. Or you could simply not honor people’s wish for their tweet to be deleted, possibly endangering them (for reasons that may be apparent only to them), but that is falling short in data vigilance. So if you want to put security in the hands of the reporters, leveraging only free open data, then that is your ethical upper limit for Twitter: 29 data points.

I don’t want to be too harsh on the individuals in UN OCHA (or anyone entering big data for the first time – we are all new), and I greatly appreciate their willingness to discuss these points publicly. But we need to be critical of the idea that there are simple rules for collecting and publishing data that absolve us of responsibility once the data is out there.

Having lived most of my adult life outside my homeland, I have helped monitor elections more times than I have been permitted a vote. I would love to say that being at the forefront of big open data means taking part in democracy, but this simply isn’t the case. In big open data, I am the excited 1% trying to meet my obligations to the 99%. Open data is a form of freedom that can help liberate us from disease and oppression, but it is not a democratic freedom – it is extreme and potentially dangerous – and we must always keep watch.