WordRank

WordRank is a word embedding algorithm that estimates vector representations for words via robust ranking. Like GloVe, WordRank trains on an aggregated word-word co-occurrence matrix built from a corpus. Unlike GloVe, which employs a regression loss, WordRank optimizes a ranking-based loss. WordRank distributes computation across multiple machines via MPI to support large-scale word embedding problems.
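
The contrast between the two losses can be written schematically as follows. This is a sketch in our own notation, not a statement of the exact objective: X_{ij} is a co-occurrence count, f is GloVe's weighting function, u_w and v_c are word and context vectors, \Omega_w is the set of contexts observed with word w, r_{w,c} is assumed to be a weight derived from the co-occurrence counts, and \rho is assumed to be a monotone concave function. The paper listed under Reference gives the precise form, including the smooth surrogate used in place of the indicator and the associated hyperparameters.

```latex
% GloVe: weighted least-squares regression on log co-occurrence counts
J_{\text{GloVe}} = \sum_{i,j} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

% Ranking view: count the contexts c' that score at least as high as c for word w
\operatorname{rank}(w, c) = \sum_{c' \neq c}
    \mathbb{I}\left[ \langle u_w, v_c \rangle - \langle u_w, v_{c'} \rangle \le 0 \right]

% Ranking-based loss: penalize observed (w, c) pairs that are ranked poorly
J_{\text{rank}} = \sum_{w} \sum_{c \in \Omega_w} r_{w,c} \,
    \rho\big( \operatorname{rank}(w, c) \big)
```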

License

All source code files in WordRank are licensed under the Apache License 2.0, except ./CMake/FindMKL.cmake, which is adopted from https://github.com/Eyescale/CMake and is under the BSD license.

Prerequisites

WordRank is developed and tested on UNIX-based systems, with the following software dependencies:

C++ compiler with C++11 support (the Intel compiler is preferred; g++ works, but all #pragma simd directives are currently ignored, which leads to a 2x-3x performance loss)
MPI library with multi-threading support (Intel MPI, MPICH2 or MVAPICH2)
OpenMP (no separate installation is needed once the Intel compiler is installed)
CMake (at least 2.6)
Boost library (at least 1.49)
GloVe v.1.0 (for co-occurrence matrix preparation)
HyperWords (for evaluation)
MKL (optional)
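
Before building, it can help to check which of these tools are already on your PATH. The script below is a minimal sketch and is not part of the WordRank repository: the helper names have_tool and version_ge are hypothetical, the command names icpc and mpirun are assumptions about a typical Intel compiler and MPI installation, and the version comparison relies on `sort -V` from GNU coreutils. It does not check Boost, GloVe or HyperWords.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of the WordRank repository.

# Succeeds if the given command is available on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (GNU coreutils version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compilers and launchers named in the prerequisites; adjust to your setup.
for tool in icpc g++ mpirun cmake; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# CMake must be at least 2.6.
if have_tool cmake; then
    # The first line of `cmake --version` reads "cmake version X.Y.Z".
    cmake_version=$(cmake --version | head -n 1 | awk '{print $3}')
    if version_ge "$cmake_version" 2.6; then
        echo "cmake $cmake_version is new enough"
    else
        echo "cmake $cmake_version is older than 2.6"
    fi
fi
```

Only one of icpc and g++ needs to be present; a "missing" line for the other is expected.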

Environment Setup

Install Intel Parallel Studio XE Cluster Edition, which includes the Intel compiler, OpenMP, MPI and MKL (free copies are available for some users)
Enable Intel C++ development environment

source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64 
source /opt/intel/impi_latest/bin64/mpivars.sh (pointing to the path of your installation)

Install Boost library

sudo yum install boost-devel (on RedHat/Centos)
sudo apt-get install libboost-all-dev (on Ubuntu)

The Intel compiler is preferred; g++ works, but all #pragma simd directives are currently ignored, which leads to a 2x-3x performance loss.

Quick Start

Download the code: git clone https://bitbucket.org/shihaoji/wordrank
Run ./install.sh to build the package (it downloads GloVe v.1.0 and HyperWords, applies patches to them, and then compiles the source code. The Intel compiler is used by default; see the switch in ./install.sh to use g++ instead.)
Run the demo script: cd scripts; ./demo.sh (NUM_CORES=16 by default; set this to the number of physical cores on your machine)
Evaluate the models: cd scripts; ./eval.sh N (to evaluate the model after N iterations, e.g., N=200)
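
To pick a value for NUM_CORES, the number of physical cores can be estimated on Linux as sketched below. This helper is not part of the WordRank scripts, and the function name physical_cores is hypothetical. It counts unique (core, socket) pairs reported by lscpu and, if lscpu is unavailable, falls back to the number of online logical processors, which may overcount when hyper-threading is enabled.

```shell
#!/bin/sh
# Hypothetical helper; not part of the WordRank scripts.

# Print an estimate of the number of physical cores.
physical_cores() {
    n=""
    if command -v lscpu >/dev/null 2>&1; then
        # Each unique (core, socket) pair corresponds to one physical core.
        n=$(lscpu -p=core,socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
    fi
    case "$n" in
        # Empty, zero or non-numeric: fall back to the logical processor count.
        ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
    esac
    echo "$n"
}

echo "Suggested NUM_CORES: $(physical_cores)"
```

The printed value is only a suggestion; set NUM_CORES in scripts/demo.sh to it by hand.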

Reference

Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, S. V. N. Vishwanathan. "WordRank: Learning Word Embeddings via Robust Ranking", Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov. 2016.
