`GEOMDN` readme

Introduction

GEOMDN is an implementation of "Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks" (EMNLP 2017).

The neural network is implemented using Theano/Lasagne, but it shouldn't be difficult to adapt it to other NN frameworks.

The work has 3 main modules:

lang2loc.py implements mixture density networks to predict location from text input.
lang2loc_mdnshared.py implements mixture density networks to predict location from text input, with the difference that the mus, sigmas and corxys of the mixture of Gaussians are shared between all input samples and only the pis are conditioned on the input. This improves the model because the global mixture-of-Gaussians structure exists independently of any single sample and can be learned from all samples rather than predicted for each one.
loc2lang.py implements a lexical dialectology model that, given 2D coordinate inputs, predicts a unigram probability distribution over the vocabulary. The input is a normal 2D input layer, but the hidden layer consists of several Gaussian distributions whose mus, sigmas and corxys are learned; the hidden layer's output is the probability of the input under each Gaussian component.
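
To make the three descriptions above concrete, here is a minimal NumPy sketch of the shared building block, a bivariate Gaussian density parameterised by mus, sigmas and corxys, and of how it could be used in both directions. This is not the repository's Theano/Lasagne code. The function names (`bivariate_gaussian_pdf`, `mdn_nll`, `loc2lang_forward`), the toy coordinates, and the linear-plus-softmax output layer in `loc2lang_forward` are assumptions made for illustration only.

```python
import numpy as np


def bivariate_gaussian_pdf(xy, mus, sigmas, corxys):
    """Density of each 2D point under each of K bivariate Gaussians.

    xy:     (n, 2) coordinates
    mus:    (K, 2) component means
    sigmas: (K, 2) positive standard deviations
    corxys: (K,)   correlations, each strictly between -1 and 1
    returns (n, K) densities
    """
    dx = (xy[:, None, 0] - mus[None, :, 0]) / sigmas[None, :, 0]
    dy = (xy[:, None, 1] - mus[None, :, 1]) / sigmas[None, :, 1]
    one_minus_rho2 = 1.0 - corxys[None, :] ** 2
    z = dx ** 2 + dy ** 2 - 2.0 * corxys[None, :] * dx * dy
    norm = (2.0 * np.pi * sigmas[None, :, 0] * sigmas[None, :, 1]
            * np.sqrt(one_minus_rho2))
    return np.exp(-z / (2.0 * one_minus_rho2)) / norm


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def mdn_nll(pi_logits, xy, mus, sigmas, corxys):
    """Mean negative log-likelihood of the true locations under the mixture.

    In the shared variant, mus/sigmas/corxys are global parameters and only
    pi_logits (one row per sample) would come from the text encoder.
    """
    pis = softmax(pi_logits)                                  # (n, K)
    dens = bivariate_gaussian_pdf(xy, mus, sigmas, corxys)    # (n, K)
    mixture = (pis * dens).sum(axis=1)                        # (n,)
    return float(-np.log(mixture + 1e-12).mean())


def loc2lang_forward(xy, mus, sigmas, corxys, W, b):
    """Gaussian hidden layer followed by a softmax over the vocabulary.

    The linear output layer (W, b) is an assumption of this sketch.
    """
    hidden = bivariate_gaussian_pdf(xy, mus, sigmas, corxys)  # (n, K)
    return softmax(hidden @ W + b)                            # (n, V)


# Toy example: 3 components, 5 vocabulary items, 2 samples (made-up values).
rng = np.random.default_rng(0)
K, V = 3, 5
mus = np.array([[40.0, -75.0], [34.0, -118.0], [41.9, -87.6]])
sigmas = np.ones((K, 2))
corxys = np.zeros(K)
xy = np.array([[40.0, -75.0], [34.0, -118.0]])
pi_logits = np.zeros((2, K))  # uniform mixture weights

nll = mdn_nll(pi_logits, xy, mus, sigmas, corxys)
word_probs = loc2lang_forward(xy, mus, sigmas, corxys,
                              rng.normal(size=(K, V)), np.zeros(V))
```

In a real model the sigmas would need to be kept positive and the corxys inside (-1, 1), for example via softplus and tanh transforms; the sketch takes already-valid values as input.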

See some of the maps, the local words (including named entities) retrieved for several DARE dialect regions, and the city terms (including named entities) for about 100 U.S. cities.

Local words retrieved for the dialect region Delmarva:

    "delmarva": [
        "llsssss", 
        "llssss", 
        "llsss", 
        "downingtown", 
        "ardd", 
        "dickeating", 
        "llss", 
        "brovah", 
        "millersville", 
        "erked", 
        "rehoboth", 
        "suitland", 
        "arddd", 
        "oldhead", 
        "deptford", 
        "exton", 
        "youngbull", 
        "harford", 
        "fraudin", 
        "drawlin", 
        "dfl", 
        "cheltenham", 
        "reisterstown", 
        "ared", 
        "parkville", 
        "nizz", 
        "#ttm", 
        "marlton", 
        "xib", 
        "llls", 
        "norristown", 
        "horsham", 
        "owings", 
        "schuylkill", 
        "ard", 
        "kutztown", 
        "manayunk", 
        "bensalem", 
        "elkridge", 
        "btfu", 
        "fyd", 
        "llab",

Geolocation Datasets

The datasets are GEOTEXT, a.k.a. CMU (a small Twitter geolocation dataset), and TwitterUS, a.k.a. NA (a larger Twitter geolocation dataset). Both cover the continental U.S. and can be downloaded from here

Quick Start

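Download the datasets and place them in `./datasets/cmu` and `./datasets/na` for GEOTEXT and TwitterUS, respectively (contact me for the datasets). Each command below takes the dataset directory through `-d`; some of the examples point to `~/datasets/` instead, so adjust the path to wherever you placed the data.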
For lang2loc geolocation run:

For GEOTEXT a.k.a CMU run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc.py -d ./datasets/cmu/ -enc latin1 -reg 0 -drop 0.5 -mindf 10 -hid 100 -ncomp 100

For TwitterUS a.k.a NA run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc.py -d ./datasets/na/ -enc utf-8 -reg 1e-5 -drop 0.0 -mindf 10 -hid 300 -ncomp 100

For lang2loc_mdnshared geolocation run:

For GEOTEXT a.k.a CMU run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc_mdnshared.py -d ~/datasets/cmu/ -enc latin1 -reg 0.0 -drop 0.0 -mindf 10 -hid 100 -ncomp 300 -batch 200

For TwitterUS a.k.a NA run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc_mdnshared.py -d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 10 -hid 900 -ncomp 900 -batch 2000

For loc2lang lexical dialectology model run:

THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python loc2lang.py -d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 100 -hid 1000 -ncomp 500 -batch 5000

Note that cmu is too small to be used for lexical dialectology.

Citation

@InProceedings{rahimicontinuous2017,
  author    = {Rahimi, Afshin  and  Baldwin, Timothy and Cohn, Trevor},
  title     = {Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks},
  booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP2017)},
  month     = {September},
  year      = {2017},
  address   = {Copenhagen, Denmark},
  publisher = {Association for Computational Linguistics},
  url       = {http://people.eng.unimelb.edu.au/tcohn/papers/emnlp17geomdn.pdf}
}

Contact

Afshin Rahimi [email protected]
