WordRank is a word embedding algorithm that estimates vector representations for words via robust ranking. Like GloVe, WordRank trains on an aggregated word-word co-occurrence matrix built from a corpus. Unlike GloVe, which employs a regression loss, WordRank optimizes a ranking-based loss. WordRank distributes computation across multiple machines via MPI to support large-scale word embedding problems.
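To make the contrast between the two losses concrete, here is a schematic comparison. The notation is assumed for illustration: $X_{ij}$ for co-occurrence counts, $u_w$ and $v_c$ for word and context vectors, $\Omega_w$ for the contexts observed with word $w$, $r_{w,c}$ for a weight derived from the co-occurrence count, and $\rho$ for a monotonically increasing concave function. The ranking objective is shown in simplified form; the paper cited at the end gives the exact definition.

```latex
% GloVe: weighted least-squares regression onto log co-occurrence counts
J_{\text{GloVe}} = \sum_{i,j} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

% Ranking-based loss (schematic): penalize contexts c' that score at least as high as an observed context c
J_{\text{rank}} = \sum_{w} \sum_{c \in \Omega_w} r_{w,c} \, \rho\big( \operatorname{rank}(w, c) \big),
\qquad
\operatorname{rank}(w, c) = \sum_{c' \neq c} \mathbb{1}\left[ \langle u_w, v_c \rangle \le \langle u_w, v_{c'} \rangle \right]
```

Because $\rho$ is concave, the penalty grows more slowly as the rank of a pair gets worse, which is one way to read "robust" here. The indicator is not differentiable, so an optimizer would typically work with a smooth surrogate such as a logistic loss.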
All source code files in WordRank are under the Apache License 2.0, except ./CMake/FindMKL.cmake,
which is adopted from https://github.com/Eyescale/CMake and is under the BSD license.
WordRank is developed and tested on UNIX-based systems, with the following software dependencies:
- C++ compiler with C++11 support (the Intel compiler is preferred; g++ works, but all #pragma simd directives are currently ignored, which leads to a 2x-3x performance loss)
- MPI library, with multi-threading support (Intel MPI, MPICH2 or MVAPICH2)
- OpenMP (no separate installation is needed once the Intel compiler is installed)
- CMake (at least 2.6)
- Boost library (at least 1.49)
- GloVe v.1.0 (for co-occurrence matrix preparation)
- HyperWords (for evaluation)
- MKL (optional)
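Before building, it can help to confirm that the command-line parts of this list are actually on your PATH. The following is a minimal sketch, not part of WordRank: the helper names `have_tool` and `check_tools` are hypothetical, and the tool list (`g++`, `mpirun`, `cmake`, `git`) is an assumption that you should adjust to your setup, for example `icpc` when using the Intel compiler. Libraries such as Boost and OpenMP cannot be detected this way.

```shell
#!/bin/sh
# Hypothetical pre-flight check: report which build tools are on PATH.
# The tool names below are assumptions for illustration only.

# have_tool NAME: succeeds if NAME resolves to a command on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# check_tools NAME...: prints one status line per tool and sets MISSING
# to the space-prefixed list of tools that were not found.
check_tools() {
    MISSING=""
    for tool in "$@"; do
        if have_tool "$tool"; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            MISSING="$MISSING $tool"
        fi
    done
}

check_tools g++ mpirun cmake git
if [ -n "$MISSING" ]; then
    echo "Install before building:$MISSING"
fi
```

The script only reports what it finds and never aborts, so it is safe to run on a machine that is only partly set up.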
- Install Intel Parallel Studio XE Cluster Edition (i.e., the Intel compiler, OpenMP, MPI and MKL; free copies are available for some users)
- Enable the Intel C++ development environment
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /opt/intel/impi_latest/bin64/mpivars.sh (pointing to the path of your installation)
- Install Boost library
sudo yum install boost-devel (on RedHat/Centos)
sudo apt-get install libboost-all-dev (on Ubuntu)
- The Intel compiler is preferred; g++ works, but all #pragma simd directives are currently ignored, which leads to a 2x-3x performance loss.
- Download the code:
git clone https://bitbucket.org/shihaoji/wordrank
- Run
./install.sh
to build the package. (The script downloads GloVe v.1.0 and HyperWords, applies patches to them, and then compiles the source code. The Intel compiler is used by default; see the switch in ./install.sh to use g++ instead.)
- Run the demo script:
cd scripts; ./demo.sh
(NUM_CORES=16 by default; set this to the number of physical cores of your machine)
- Evaluate the models:
cd scripts; ./eval.sh N (to evaluate the model after N iterations, e.g., N=200)
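Since NUM_CORES should match the number of physical cores rather than hardware threads, the following sketch shows one way to estimate that number on Linux. It assumes the util-linux `lscpu` tool is installed; the helper name `count_unique_cores` is hypothetical. It only prints a candidate value, and how NUM_CORES is set inside the demo script is not covered here.

```shell
#!/bin/sh
# Sketch: estimate the number of physical cores on Linux.
# Assumes util-linux `lscpu` is available. Hyper-threads share a
# (core, socket) pair, so counting unique pairs counts physical cores.

# count_unique_cores: reads `lscpu -p=CORE,SOCKET` style output on stdin
# and prints the number of distinct non-comment lines.
count_unique_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

if command -v lscpu >/dev/null 2>&1; then
    echo "physical cores: $(lscpu -p=CORE,SOCKET | count_unique_cores)"
else
    echo "lscpu not found; determine the core count manually"
fi
```

On machines without hyper-threading the result equals the logical CPU count; on other platforms a different tool is needed.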
Shihao Ji, Hyokun Yun, Pinar Yanardag, Shin Matsushima, S. V. N. Vishwanathan. "WordRank: Learning Word Embeddings via Robust Ranking", Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov. 2016.