Near-lossless Binarization of Word Embeddings
=============================================
PREAMBLE
This work is one of the contributions of my PhD thesis, entitled
"Improving methods to learn word representations for efficient semantic
similarities computations", in which I propose new methods to learn
better word embeddings. The thesis is freely available at
https://github.com/tca19/phd-thesis.
ABOUT
This repository contains source code to binarize any real-valued word
embeddings into binary vectors. It also contains programs to evaluate
the performance of the binary vectors on semantic similarity tasks and
top-k queries. The related paper can be found at
https://aaai.org/ojs/index.php/AAAI/article/view/4692/4570.
If you use this repository, please cite:
    @inproceedings{tissier2019near,
      author    = {Tissier, Julien and Gravier, Christophe and Habrard, Amaury},
      title     = {Near-Lossless Binarization of Word Embeddings},
      booktitle = {Proceedings of the Thirty-Third {AAAI} Conference on
                   Artificial Intelligence, Honolulu, Hawaii, USA,
                   January 27 - February 1, 2019.},
      volume    = {33},
      pages     = {7104--7111},
      year      = {2019},
      url       = {https://aaai.org/ojs/index.php/AAAI/article/view/4692},
      doi       = {10.1609/aaai.v33i01.33017104}
    }
INSTALLATION
To compile the source files of this repository, you need to have on your
system:
- OpenBLAS [1]
- a C compiler (gcc, clang ...)
- make
Then run the command `make` to build the different binary executables.
[1] https://github.com/xianyi/OpenBLAS/wiki/Precompiled-installation-packages
USAGE
1. Binarize word vectors
------------------------
Run the executable `binarize` to transform real-valued embeddings into
binary vectors. The only mandatory command line argument is `-input`,
the name of the file containing the real-valued vectors.
    ./binarize -input vectors.vec
Documentation for all the other flags is available with
`./binarize -h` or `./binarize --help`.
Binary vectors are saved by default into the file `binary_vectors.vec`.
The first line of this file indicates the number of binary word vectors
and the number of bits in each vector. Each following line is formatted
like:
    WORD INTEGER_1 INTEGER_2 [...]
Binary vectors are not saved as strings of zeros (0) and ones (1) but as
groups of unsigned long integers. Each integer represents 64 bits so for
a binary vector of 256 bits, there are 4 integers (4 * 64 = 256). The
binary vector of a word is the concatenation of the binary
representations of all the integers on the rest of its line.
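The decoding described above can be sketched in C as follows. This is
only an illustration, not code from this repository: the function names
`parse_line` and `bits_to_string` are hypothetical, the integers are
assumed to be written in decimal, and the bit order inside each integer
(most significant bit first) is an assumption that the format
description does not confirm.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line "WORD INTEGER_1 INTEGER_2 [...]".
 * Copies the word into `word` and the integers into `ints`.
 * Returns the number of integers read, or -1 if the word is missing
 * or does not fit into `word_cap` bytes.
 * ASSUMPTION: integers are written in base 10. */
static int parse_line(const char *line, char *word, size_t word_cap,
                      uint64_t *ints, size_t max_ints)
{
    size_t len = strcspn(line, " \t\n");  /* length of the word */
    size_t n = 0;
    const char *p = line + len;
    char *end;

    if (len == 0 || len >= word_cap)
        return -1;
    memcpy(word, line, len);
    word[len] = '\0';

    while (n < max_ints) {
        unsigned long long v = strtoull(p, &end, 10);
        if (end == p)  /* nothing left to convert */
            break;
        ints[n++] = (uint64_t)v;
        p = end;
    }
    return (int)n;
}

/* Expand `n_ints` 64-bit integers into a string of '0' and '1'.
 * `out` must have room for n_ints * 64 + 1 characters.
 * ASSUMPTION: most significant bit first within each integer. */
static void bits_to_string(const uint64_t *ints, size_t n_ints, char *out)
{
    size_t i, b;

    assert(ints != NULL && out != NULL);
    for (i = 0; i < n_ints; ++i)
        for (b = 0; b < 64; ++b)
            out[i * 64 + b] = ((ints[i] >> (63 - b)) & 1u) ? '1' : '0';
    out[n_ints * 64] = '\0';
}
```

For a 256-bit vector, `parse_line` would return 4 and `bits_to_string`
would fill a 256-character string. If the bit order turns out to be
least significant bit first, only the shift `63 - b` needs to become `b`.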
2. Evaluate semantic similarity
-------------------------------
Run the executable `similarity_binary` to evaluate the semantic
similarity correlation scores of the produced binary vectors.
    ./similarity_binary binary_vectors.vec
This repository includes some semantic similarity datasets:
- MEN
- Rare Word (RW)
- SimVerb 3500 (SimVerb)
- SimLex 999 (SimLex)
- WordSim 353 (WS353)
To evaluate on other semantic similarity datasets, add them to the
datasets/ folder and run the `./similarity_binary` executable again.
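To give an idea of how two binary vectors can be compared, here is a
minimal C sketch of one common similarity for binary vectors: the
fraction of bit positions on which both vectors agree. Whether
`similarity_binary` computes exactly this measure is an assumption, and
the names `popcount64` and `binary_similarity` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the bits set to 1 in a 64-bit integer (portable version,
 * no compiler builtin required). */
static unsigned popcount64(uint64_t x)
{
    unsigned c = 0;

    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        ++c;
    }
    return c;
}

/* Fraction of bit positions where `a` and `b` agree, in [0, 1].
 * Both vectors are stored as `n_ints` integers of 64 bits each.
 * ASSUMPTION: this agreement ratio is used for illustration only; it
 * may differ from the measure implemented in this repository. */
static double binary_similarity(const uint64_t *a, const uint64_t *b,
                                size_t n_ints)
{
    size_t i;
    unsigned long differing = 0;

    assert(n_ints > 0);
    for (i = 0; i < n_ints; ++i)
        differing += popcount64(a[i] ^ b[i]);  /* XOR marks differences */
    return 1.0 - (double)differing / (double)(n_ints * 64);
}
```

Because the vectors are packed into 64-bit integers, comparing two
256-bit vectors takes only 4 XOR operations and 4 bit counts.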
3. Top-K queries
----------------
Run the executable `topk_binary` to compute the K closest neighbor
words of a QUERY word and their respective similarities to it.
    ./topk_binary binary_vectors.vec K QUERY
The program reports the closest words and their similarities, as well
as the time needed to compute the K closest neighbors. You can also run
multiple top-k queries at the same time: replace the QUERY word with a
list of space-separated words, like:
    ./topk_binary binary_vectors.vec 10 queen automobile man moon computer
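A top-k query over binary vectors can be sketched as a linear scan that
keeps the K smallest Hamming distances seen so far. The C code below is
an illustration under stated assumptions, not the implementation of
`topk_binary`: the names `neighbor_t`, `hamming` and `topk` are
hypothetical, the vectors are assumed to be stored contiguously in
memory, and the query word itself is assumed to be excluded from its
own neighbors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One neighbor: index of the word and its Hamming distance to the query. */
typedef struct {
    size_t   index;
    unsigned distance;
} neighbor_t;

/* Number of differing bits between two vectors of `n_ints` integers. */
static unsigned hamming(const uint64_t *a, const uint64_t *b, size_t n_ints)
{
    unsigned d = 0;
    size_t i;

    for (i = 0; i < n_ints; ++i) {
        uint64_t x = a[i] ^ b[i];
        while (x) {      /* count the set bits of the XOR */
            x &= x - 1;
            ++d;
        }
    }
    return d;
}

/* Find the (at most) k nearest neighbors of word `query`.
 * `vectors` holds n_words rows of n_ints integers each (row-major).
 * `out` must have room for k entries; it is filled in increasing order
 * of distance. Returns the number of neighbors written.
 * ASSUMPTION: the query word is skipped. */
static size_t topk(const uint64_t *vectors, size_t n_words, size_t n_ints,
                   size_t query, size_t k, neighbor_t *out)
{
    size_t found = 0, w, pos;

    assert(query < n_words);
    for (w = 0; w < n_words; ++w) {
        unsigned d;

        if (w == query)
            continue;
        d = hamming(vectors + query * n_ints, vectors + w * n_ints, n_ints);
        /* list is full and this word is not closer than the current worst */
        if (found == k && (k == 0 || d >= out[k - 1].distance))
            continue;
        /* append if there is room, otherwise overwrite the worst entry */
        pos = (found < k) ? found++ : k - 1;
        /* shift farther entries right to keep the list sorted */
        while (pos > 0 && out[pos - 1].distance > d) {
            out[pos] = out[pos - 1];
            --pos;
        }
        out[pos].index = w;
        out[pos].distance = d;
    }
    return found;
}
```

A distance d between vectors of n bits can be turned into a similarity
in [0, 1] with 1 - d / n. For small K, keeping a sorted list of K
entries is simpler than sorting all the distances.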
AUTHOR
Written by Julien Tissier <[email protected]>.
COPYRIGHT
This software is licensed under the GNU GPLv3 license. See the LICENSE
file for more details.