Qiancheng Wu edited this page Jul 1, 2020 · 2 revisions

Synopsis

geo-tag is a module that tags each tweet JSON object with geo-information. The geo-information includes stateID, stateName, countyID, countyName, cityID, cityName, coordinate (longitude, latitude), and the source from which the location was inferred. (Currently only locations within the U.S. are considered.)

Motivation

Cloudberry and other clients need the geo-information in tweets to implement their location-dependent functions.

Implementation Idea

To infer the geo-information related to each tweet, we use the following strategy:

  • Extract the coordinate in two steps. First, check the coordinates field; if it is present, use it and set coordinate_source to coordinates. If it is none, take the second step: read the bounding_box field, pick a random point from its polygon (a rectangle), and set coordinate_source to bounding_box. Three modes are available for this step; the default is UNIFORM_DISTRIBUTION_RANDOM.

  • To infer the city, county, and state, first check the place field in the tweet to get the full cityName, then look up the remaining information in city.json. In this case the source is place.

  • If the place field is none, check whether coordinates are available. If so, use STRTREE and bounding_box to infer the location from the longitude and latitude. In this case the source is coordinate.

  • If there is no coordinate either, check the location field inside the user field, and look up the remaining information, including an inferred coordinate, in city.json. In this case the source is user and coordinate_source is user_location.
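The fallback order above can be sketched in a few lines of Python. This is an illustration only, not the module's actual code: the helper names extract_coordinate and choose_source are hypothetical, the tweet layout is assumed to follow the Twitter API v1.1 format (coordinates as a GeoJSON point in [longitude, latitude] order, bounding_box nested under place), and uniform sampling inside the rectangle is one plausible reading of UNIFORM_DISTRIBUTION_RANDOM. The city.json and STRTREE lookups are left out.

```python
import random


def extract_coordinate(tweet, rng=random):
    """Return ((lon, lat), coordinate_source), or (None, None) if unavailable.

    Hypothetical helper: field layout assumes the Twitter API v1.1 tweet format.
    """
    # Step 1: an exact point in the coordinates field wins.
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]
        return (lon, lat), "coordinates"

    # Step 2: otherwise sample a point uniformly inside the bounding box
    # (assumed here to be nested under the place field).
    place = tweet.get("place") or {}
    box = place.get("bounding_box")
    if box and box.get("coordinates"):
        ring = box["coordinates"][0]  # corner points of the rectangle
        lons = [point[0] for point in ring]
        lats = [point[1] for point in ring]
        lon = rng.uniform(min(lons), max(lons))
        lat = rng.uniform(min(lats), max(lats))
        return (lon, lat), "bounding_box"

    return None, None


def choose_source(tweet):
    """Pick which field drives city/county/state inference (hypothetical helper)."""
    if tweet.get("place"):
        return "place"
    if extract_coordinate(tweet)[0] is not None:
        return "coordinate"
    if (tweet.get("user") or {}).get("location"):
        return "user"
    return None
```

Passing a seeded generator such as random.Random(0) as rng makes the sampled point reproducible, which may help when comparing tagging runs.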

Usage

The module provides a class named TwitterJSONTagger, and its tag_one_tweet function is the interface for using this module. It takes one tweet (in JSON format) as its input parameter.

Performance
