
Tweedr is an API that sifts and classifies tweets to better inform disaster relief efforts.

For extensive details on how to use the API, read our tutorial site.

The problem: a flood

Disasters are chaotic, creating massive changes in the local environment and damaging the affected area unpredictably. They also tend to disrupt communications and logistics: downing telephone lines, cutting power, and wrecking the roads and bridges that transportation depends on.

To be effective, however, disaster recovery efforts need the most up-to-date information available. They need information to make the following decisions:

  • Prioritizing which needs are most important to address.
    • Different areas will have different needs. One neighborhood might need help erecting temporary shelter, while another only needs food and water supplies.
  • Deciding which organizations are best suited to address different groups of needs.
  • Planning recovery missions.
    • If some roads are impassable, the entire trip might need to be adjusted.

This lack of information adds an extra step between relief and the disaster victims: relief workers must perform damage reconnaissance while out in the field, then return to headquarters to share what they found and plan their next outing.

The solution: a flood of text

Social media can fill some of these information gaps. Mobile communication usually outlives landlines and other channels, and in recent disasters the volume of messages sent from disaster sites, both SMSes and tweets, has been enormous. Hurricane Sandy, for example, resulted in at least five million tweets. (We gathered every tweet containing the word "sandy" produced between October 27th and November 7th, 2012.)

Effective use of this information requires processing large amounts of natural-language text quickly. Currently, that means workers from relief organizations must sift through thousands of tweets, SMSes, and calls to extract information that helps them direct or carry out on-the-ground relief efforts.

Disaster relief agencies like the Red Cross, the UN, or FEMA need to quickly assimilate information coming in from many sources so that they know what's damaged and where. This tool aims to expedite on-site efforts by extracting useful information from that flood of text, enhancing relief agencies' situational awareness. That awareness should help them to:

  • prioritize tasks
  • deliver supplies
  • route vehicles

Natural language processing (NLP) is one way to automate this work, and more general machine learning (ML) methods can help geolocate tweets based on their text and the social graph. We've implemented these techniques in Tweedr, an API and a UI that turn floods of tweets into actionable knowledge.

Tweedr consists of:

  1. An API that can process a stream of text (along with metadata) and add useful annotations on top of that text, using machine learning to learn from previous disasters.
  2. A user interface for effectively viewing and consuming these annotations in aggregate.

Data: getting and labeling it

Before we can use machine learning algorithms to process tweets, we need to gather and label thousands of them.

We partnered with Gnip to get every relevant tweet from Hurricane Sandy and the Joplin tornado. The tweets were stored on Amazon Web Services.

Next, we needed to label these tweets so that our algorithms could learn to classify them. We crowdsourced annotations of the Sandy and Joplin tweets, including a number of token-level labels that mark sequences of tokens as useful for a variety of sub-tasks.

In one target annotation domain, for example, each tweet was assigned one of several categories:

  • Casualties and damage
  • Caution and advice
  • Donations of money, goods or services
  • People missing, found or seen
  • Unknown
  • Information source

See Ontology for a more structured listing of the categories we are interested in.

With this labeled data set, our goal is to filter the tweets from a new disaster down to those that refer to actual disaster events.
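As a sketch of how labeled data can drive such a filter, here is a minimal example using scikit-learn. The sample tweets, the bag-of-words features, and the logistic-regression model are illustrative assumptions for this sketch, not a description of Tweedr's actual models.

# Minimal sketch: train a per-category tweet classifier from labeled examples.
# The feature choice (bag of words) and model (logistic regression) are
# illustrative, not Tweedr's exact setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = [
    ("Bridge on Main St washed out, avoid the area", "Caution and advice"),
    ("Donate blankets and canned food at the high school gym", "Donations of money, goods or services"),
    ("Has anyone seen my neighbor since the storm?", "People missing, found or seen"),
    ("Roof collapsed on 5th Ave, two people injured", "Casualties and damage"),
]
texts, categories = zip(*labeled)

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(texts, categories)

# Given tweets from a new disaster, keep only those assigned to a category of interest.
new_tweets = ["Power lines down near the river, stay away"]
print(list(zip(new_tweets, model.predict(new_tweets))))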

Tweedr API: how it works

Once we had gathered and labeled the data, we trained a set of machine learning algorithms and built an API around them. Here's how the API pipeline works.

For extensive details on how to use the API, read our tutorial site.

The pipeline processes incoming lines (generally, tweets), one at a time, primarily enhancing each input item with additional fields produced by analyzing existing fields.

Preprocessing involves dropping empty lines, parsing JSON input into dictionaries, ignoring non-tweet entities, and ensuring that any tweet-like entities fed into the pipeline are consolidated into a common format with certain predictable fields.

  1. EmptyLineFilter
  2. JSONParser
  3. IgnoreMetadata
  4. TweetStandardizer

This ensures that all entities follow the TweetDictProtocol, which means that the subsequent mappers can assume the existence of a few common fields:

TweetDictProtocol:

{
    "id": unicode,
    "text": unicode,
    "author": unicode,
    ...
}
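To make the flow concrete, here is a hedged sketch of how the four preprocessing steps above might be chained as generator stages. The step names come from the list above, but the implementations here are simplified stand-ins rather than Tweedr's actual classes, and the Twitter field names used are only one plausible mapping.

import json
import sys

# Illustrative stand-ins for the preprocessing steps listed above.
def empty_line_filter(lines):
    for line in lines:
        if line.strip():
            yield line

def json_parser(lines):
    for line in lines:
        yield json.loads(line)

def ignore_metadata(items):
    # Drop non-tweet entities (e.g., stream "delete" or "limit" notices).
    for item in items:
        if "text" in item:
            yield item

def tweet_standardizer(items):
    # Consolidate into the common TweetDictProtocol shape.
    for item in items:
        yield {
            "id": item.get("id_str") or str(item.get("id", "")),
            "text": item["text"],
            "author": (item.get("user") or {}).get("screen_name", ""),
        }

# Compose the stages: each consumes the previous stage's output.
pipeline = tweet_standardizer(ignore_metadata(json_parser(empty_line_filter(sys.stdin))))
for tweet in pipeline:
    print(tweet["id"], tweet["text"])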

The other mappers add several new fields:

{
    ...

    # TextCounter uses a Bloom filter to keep a running tally of how many times
    # it's seen the current item's "text". "count" indicates how many times
    # the pipeline has seen this item's text before.
    "count": int,

    # FuzzyTextCounter uses simhash and a configurable threshold to track the
    # number of near-matches seen before. "fuzzy_count" is the number of
    # previously seen tweets that are more similar than the threshold
    "fuzzy_count": int,
    # "fuzzy_votes" is the sum of fuzzy_count's from those similar tweets
    "fuzzy_votes": int,

    # POSTagger uses TweetNLP to tokenize the text and add part-of-speech (POS) tags.
    # TweetNLP produces whitespace-separated strings, so these are both just strings.
    # Use split() or similar to get actual lists of tags. Also:
    #     len(tweet['tokens'].split()) == len(tweet['pos'].split())
    "tokens": str,
    "pos": str,

    # SequenceTagger uses a CRF trained on labeled data to annotate tweets
    # with token-level classifications
    "labels": [{
        "text": unicode,
        "start": int,
        "end": int,
    }, ... ]

    # DBpediaSpotter calls out to a DBpedia Spotlight server to identify entities
    "dbpedia": [{
        "text": str,
        "start": int,
        "end": int,
        "uri": str,
        "types": [str, ...],
    }, ...]
}
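To illustrate the fuzzy counting idea, here is a small from-scratch simhash with a Hamming-distance threshold. This is not Tweedr's FuzzyTextCounter, just a minimal illustration of how near-duplicate tweets can be tallied.

import hashlib

def simhash(text, bits=64):
    # Classic simhash: each token votes on every bit of the fingerprint.
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")

# Count how many previously seen tweets are "near" the current one.
seen = []
def fuzzy_count(text, max_distance=3):
    fp = simhash(text)
    count = sum(1 for prev in seen if hamming_distance(fp, prev) <= max_distance)
    seen.append(fp)
    return count

print(fuzzy_count("Too many earthquake jolts in Kolkata in last few months"))      # 0
print(fuzzy_count("Too many earthquake jolts in Kolkata in the last few months"))  # likely 1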

LineStream handles the "sink" part of the pipeline -- stringifying each item and writing it to STDOUT along with a newline separator.
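A minimal sink in the same spirit (an illustrative stand-in, not Tweedr's actual LineStream) might look like this:

import json
import sys

def line_stream(items, out=sys.stdout):
    # Stringify each enriched item and write it, one JSON object per line.
    for item in items:
        out.write(json.dumps(item) + "\n")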

Example

Here's an example of an enhanced tweet:

{
  "count": 1,
  "fuzzy_count": 0,
  "retweetCount": 0,
  "fuzzy_votes": 0,
  "text": "Too many earthquake jolts in Kolkata in last few months",
  "pos": "R A N N P ^ P A A N",
  "tokens": "Too many earthquake jolts in Kolkata in last few months",
  "dbpedia": [{
    "text": "Kolkata",
    "start": 29,
    "end": 36,
    "uri": "http://dbpedia.org/resource/Kolkata",
    "types": [
      "Schema:Place",
      "DBpedia:Place",
      "DBpedia:PopulatedPlace",
      "DBpedia:Settlement",
      "Schema:City",
      "DBpedia:City"
    ]
  }],
  "sequences": [],
}
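Downstream consumers can treat the enriched stream as newline-delimited JSON. Here is a small hedged example, using the field names from the output above, that keeps only tweets mentioning a recognized place:

import json
import sys

# Keep only enriched tweets whose DBpedia entities include a Place.
for line in sys.stdin:
    tweet = json.loads(line)
    places = [e for e in tweet.get("dbpedia", []) if "DBpedia:Place" in e.get("types", [])]
    if places:
        print(tweet["text"], "->", [p["uri"] for p in places])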

Related work

To date, a decent amount of work has been done in this area. Here, we list existing applications with similar goals and related papers.

Complete applications

  • Used by Red Cross.
  • Closed source, powered by Radian6

[DigiDoc screenshot]

  • Mapping oriented, but with plugins to pull in external data and receive reports through a web application.
  • See related DSSG project.

Mapping

[Geofeedia screenshot]

[Global Incident Map screenshot]

  • It's unclear exactly what this does.
  • Not yet live, but aims to crowd-source volunteers at short notice during a disaster.
  • iRevolution blog post

[MicroMappers screenshot]

CrisisTracker

[Screenshot from repo documentation] [Screenshot from live site]

  • Open source on Github at CrisisTracker
  • Streaming processing, pre-filtering on keywords (configured in source code)
  • Example site: Syria revolution
  • C# backend
  • PHP frontend

Papers