This project provides an enhanced version of the original word2vec code. In addition to the standard functionality (i.e., training word vectors based on their surrounding context), this implementation also makes it possible to train word embeddings tailored to a particular user-defined task (in addition to, or instead of, the standard objective).
To build this project, go to the `build` directory of the checked-out repository and execute the following commands:

```
cmake ../
make
```
This will look for the necessary libraries, adjust the compilation options, and compile the executables. Currently, this project depends on the following third-party utilities:
To test the built program, run the following command:

```
make test
```
Afterwards, you can start using the compiled `word2vec` executable. You can find examples of input data in the `tests/` directory of this project.
To run the standard `word2vec` training, execute the following command (from the `build` directory):

```
./bin/word2vec -min-count 0 -train ../tests/test_1.0.in
```
This will train vanilla `word2vec` embeddings, which may, however, differ slightly from the original results when trained with multiple threads.
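If this build keeps the command-line interface of the upstream word2vec tool (an assumption, not something documented here), the usual options such as `-output` for saving the resulting vectors and `-threads` for controlling parallelism should also be accepted, for instance:

```
# Hypothetical invocation: -output and -threads are upstream word2vec
# options and are assumed, not confirmed, to be supported by this build.
./bin/word2vec -min-count 0 -threads 4 \
    -train ../tests/test_1.0.in -output vectors.txt
```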
If, however, you want to train embeddings with respect to a particular task (e.g., predicting the subjective polarity of a sentence), you can run:

```
./bin/word2vec -ts -min-count 0 -train ../tests/test_2.0.in
```
Then, the resulting word vectors will be trained to best fit your custom task. The labels for each task should be specified as contiguous non-negative integers starting from zero (i.e., if a task has three classes, the labels to use are `0`, `1`, and `2`) and separated from the main text by a tab character, e.g.:
```
Ich fahre morgen nach Hause.\t0
Ich bin sehr froh dich zu sehen.\t1
Schade, dass wir uns nicht getroffen haben.\t2
```
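As a sketch of how such a file could be prepared (the file names `sentences.txt` and `labels.txt` are hypothetical and not part of this project), sentences and labels stored one per line can be joined with a tab, for example:

```
# paste uses a tab as its default delimiter, producing "sentence<TAB>label" lines.
paste sentences.txt labels.txt > train_labeled.txt
```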
If the label for a task is not known, you should put an underscore (`_`) instead of the label. In the same way, you can also specify multiple labels for different objectives, e.g.:
```
Ich fahre morgen nach Hause.\t0\t1
Ich bin sehr froh dich zu sehen.\t1\t_
Schade, dass wir uns nicht getroffen haben.\t2\t0
```
Besides the `-ts` mode, which trains purely task-specific embeddings, we also provide a couple of in-between solutions:

- With the `-ts-w2v` option, you can simultaneously train both the `word2vec` and the task-specific objectives, in which case the word embeddings are shared and updated to match both tasks.
- Alternatively, you can use the `-ts-least-sq` option, in which case the `word2vec` and task-specific embeddings are trained independently. In the final step, however, the task-specific embeddings of words that did not appear in the task-labeled lines are computed from their `word2vec` representations using the linear least-squares method (sketched below).
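For illustration, the two modes can presumably be invoked in the same way as `-ts` above; the exact combination of options below is an assumption based on the earlier examples rather than a documented recipe:

```
# Joint training of the word2vec and task-specific objectives
./bin/word2vec -ts-w2v -min-count 0 -train ../tests/test_2.0.in

# Independent training followed by a least-squares mapping
./bin/word2vec -ts-least-sq -min-count 0 -train ../tests/test_2.0.in
```

One way to picture the final least-squares step (a generic sketch of the linear least-squares idea, not necessarily the exact formulation used in the code): let $V_{w2v}$ hold the `word2vec` vectors of the words that occur in task-labeled lines and $V_{ts}$ their task-specific counterparts; a linear map $M$ is then chosen as

$$M = \arg\min_{M'} \lVert V_{w2v} M' - V_{ts} \rVert_F^2,$$

and the task-specific vector of a word $w$ seen only in unlabeled text is approximated as $v_{ts}(w) \approx v_{w2v}(w)\, M$.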
To build the documentation for the compiled executable, you need to install Doxygen prior to executing `cmake` and then, after the Makefiles have been generated, run:

```
make doc
```
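For example, from the `build` directory (assuming Doxygen is already installed):

```
cmake ../
make doc
```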