Kayla Jacobs edited this page Feb 11, 2014 · 6 revisions

Because Ushahidi is a crisis crowdsourcing platform, its data is all about human-submitted reports. This page covers what the data looks like, where we got it from, how we worked with it, and our data ethics.

Data format

In essence, each report has a message, a title, a description, and one or more categories. The three record types have the following fields:

Message
	id = int
	text = string

Report
	id = int
	title = string
	description = string
	message_id = int
	message_text = string
	category_ids = list 

Category
	id = int
	parent_id = int
	name = string
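
As a rough sketch, the three record types above could be modeled in Python like this (the field names follow the lists above; the dataclasses themselves are illustrative and not part of the Ushahidi codebase):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Message:
    id: int
    text: str

@dataclass
class Category:
    id: int
    parent_id: int
    name: str

@dataclass
class Report:
    id: int
    title: str
    description: str
    message_id: int
    message_text: str
    category_ids: List[int] = field(default_factory=list)

# Example report with two (made-up) category ids
r = Report(id=1, title="Flooding", description="Road washed out",
           message_id=10, message_text="road flooded near bridge",
           category_ids=[3, 7])
```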

Data sources

The Ushine Learning API takes in reports from the Ushahidi platform, attempts to detect each report's language/category/etc., and outputs these as suggestions for human annotators to confirm. To train the API's machine learning algorithms to perform this work, we needed plenty of historical data from past Ushahidi deployments. In other words, we needed reports where volunteers had already annotated language/categories/etc. so the computer could learn to make these same judgements during future deployments.

To get these reports, we used data exported from past deployments. There are two types of deployments:

  • Centralized deployments from crowdmap.com, a hosted service provided by the Ushahidi organization.
  • Ushahidi instances on standalone servers. Because Ushahidi is open source software, people around the world have deployed it on their own servers. This made it more challenging to gather data from many deployments, since the Ushahidi nonprofit doesn't have direct access to them. A Ushahidi employee kindly provided report datasets from these community deployments.

In the end, we got report dumps from a number of Ushahidi deployments, crowdmap.com-hosted or otherwise.

For more detailed info, visit our master dataset list.

Ultimately, we used only a few of these datasets. For category suggestion specifically, we cleaned and used data from Kenya, Sudan, Lebanon, the Philippines, Venezuela, Mexico, and India.

Data reformatting

Because our training data came from so many disparate sources, it came to us in diverse formats: .sql, .csv, and .dta.

  • Each of these formats makes slightly different fields available. Most datasets didn't include every field available in a full .sql dump, notably the original message before annotators modified it during processing (i.e., before it became the final report's description).

  • Even when we got the data as a .sql dump, there is currently no log of what content edits annotators made to a report, though the original message (if one exists) is always preserved unchanged.

  • In addition, not all reports have associated messages; this varied across datasets and even within them.

We designed a consistent .json format for the data so that our machine learning algorithms could always read in Ushahidi data in an identical manner, and wrote scripts to convert the raw data into this standard format.

Here's an example of this output json:

Categories

categories = {
	<id> : {
		name : <string>,
		parent_id : <int>
	},
	1 : {
		name : 'text',
		parent_id : 0
	},
	...
}

Reports

reports = {
	id1 : {
	  description : description_text,
	  message_id : id,
	  categories : [category_id1, category_id2, ...],
	  location_name : location_text,
	  location_latlong : [loc_lat, loc_long]
	},
	...
}

Messages

messages = {
	id1 : {
	  id : int,
	  text : message_text
	},
	...
}

These are then combined into a single json. Each dataset has its own id and meta information.

data = {
	"meta" : {
		id : int,
		original_file : string,
		// input_file_path : do we need?
		name : plain text name,
		location : country, lat lng center?
		event_type : <disaster, election, ...>
	},
	"messages" : messages,
	"reports" : reports,
	"categories" : categories
} 
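
The conversion scripts can be sketched roughly as follows. This is an illustrative helper, not the project's actual converter: it takes already-parsed rows (e.g. from a .csv reader or a .sql export) and assembles the standard structure shown above. The field names follow this page; `to_standard_format` and its arguments are hypothetical.

```python
import json

def to_standard_format(meta, category_rows, report_rows, message_rows):
    # Build each top-level section keyed by id, as in the examples above
    categories = {c["id"]: {"name": c["name"], "parent_id": c["parent_id"]}
                  for c in category_rows}
    reports = {r["id"]: {"description": r["description"],
                         "message_id": r.get("message_id"),
                         "categories": r.get("category_ids", []),
                         "location_name": r.get("location_name"),
                         "location_latlong": r.get("location_latlong")}
               for r in report_rows}
    messages = {m["id"]: {"id": m["id"], "text": m["text"]}
                for m in message_rows}
    return {"meta": meta, "messages": messages,
            "reports": reports, "categories": categories}

data = to_standard_format(
    meta={"id": 1, "name": "Example deployment"},
    category_rows=[{"id": 1, "name": "Emergency", "parent_id": 0}],
    report_rows=[{"id": 5, "description": "Bridge out",
                  "message_id": 9, "category_ids": [1]}],
    message_rows=[{"id": 9, "text": "bridge collapsed"}],
)
print(json.dumps(data, indent=1)[:40])
```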

Data cleaning

Once we had gathered and reformatted our datasets, we still needed to clean them before we could use them to train our algorithms.

Language translation When we first started the project, we decided to assume that reports come in English. Although this is a big assumption, we did not have the resources to handle non-English data, so we used Google Translate to manually translate non-English reports. Most mixed-language reports took the form of either English-then-non-English or non-English-then-English, where the English portion was a translation of the rest. In those cases, we manually extracted the English part. As a result, we trained our classifiers on reports that were either (A) originally in English, (B) the English part of a mixed-language report, or (C) translated from a non-English report.
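
The "extract the English part" step was done by hand, but a crude automated version could look like the sketch below: keep only the segments of a report whose characters are mostly ASCII/Latin. This threshold heuristic is purely illustrative; real language identification (a trained detector) would be far more robust.

```python
def english_segments(text, threshold=0.9):
    """Keep line segments that look mostly ASCII (a rough English proxy)."""
    keep = []
    for segment in text.split("\n"):
        segment = segment.strip()
        if not segment:
            continue
        ascii_ratio = sum(ch.isascii() for ch in segment) / len(segment)
        if ascii_ratio >= threshold:
            keep.append(segment)
    return " ".join(keep)

# English-then-non-English report: only the first segment survives
mixed = "Water shortage in the camp\nމި ސަރަހައްދުގައި ފެން ނެތް"
print(english_segments(mixed))  # -> Water shortage in the camp
```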

Removing bad reports The Kenya data had many duplicate reports, and often contained reports consisting only of workflow labels such as "verified" and "to-be-geolocated". We manually went through the reports to remove these, cutting the dataset down to roughly 3,000 reports.
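
We did this cleaning by hand, but the same pass could be sketched in code: drop exact duplicates (after normalizing whitespace and case) and drop reports whose text is nothing but workflow labels. The label list here is a guess at the kind of tags described above, not the actual set used.

```python
# Hypothetical set of workflow-only labels seen in the raw data
WORKFLOW_LABELS = {"verified", "unverified", "to-be-geolocated"}

def clean_reports(reports):
    seen = set()
    cleaned = []
    for text in reports:
        norm = " ".join(text.lower().split())  # normalize case/whitespace
        if not norm or norm in seen:
            continue  # empty or duplicate
        if all(token in WORKFLOW_LABELS for token in norm.split()):
            continue  # workflow labels only, no real content
        seen.add(norm)
        cleaned.append(text)
    return cleaned

raw = ["Bridge collapsed near town",
       "bridge collapsed  near town",   # duplicate after normalization
       "verified to-be-geolocated"]     # workflow labels only
print(clean_reports(raw))  # -> ['Bridge collapsed near town']
```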

Data ethics

So that's how we got the data in shape. But there are also ethical issues to think through before working with Ushahidi reports.

Reports can easily contain sensitive information, especially (though not exclusively) in crisis situations where people who submitted or are mentioned in reports are often especially vulnerable, such as children or civilians in unstable settings.

Carelessness about sensitive data can have real and dangerous consequences! Please always, always ensure you are handling the information entrusted to you with care and responsibility.

Even if someone (a report submitter, or a deployment administrator) agrees to share their data, make sure that they fully understand the potential consequences -- intentional or not -- of sharing the information.

This topic is one of far-reaching importance to the crowdsourced crisis reporting community. Some particular implications for Ushine:

  1. The personally identifiable information (PII) detection tool should be used wisely: it's not 100% accurate, and the definition of "sensitive" information can be subtle and context-specific. It should never replace, but rather support, thoughtful human decision-making. (Of course, humans are not 100% accurate either!)
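
A tiny sketch of why such a tool can only support, never replace, human review: simple patterns catch obvious emails and phone-like numbers but miss names, addresses, and context-dependent identifiers. These patterns are illustrative, not Ushine's actual rules.

```python
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),  # loose international format
}

def flag_pii(text):
    """Return the sorted kinds of obvious PII found in the text."""
    return sorted(kind for kind, pat in PII_PATTERNS.items() if pat.search(text))

hit = flag_pii("Contact Amina at amina@example.org or +254 700 000 000")
print(hit)  # -> ['email', 'phone']  (but the name "Amina" goes undetected)
```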

  2. Some features of Ushine, such as category classification, need access to training data from existing crisis reports (usually from past events) to improve. Who should have access to these training reports? How should they be safely stored and shared? How is informed permission obtained? How are these privacy concerns balanced with the worthy goal of training and improving Ushine's features?