Commit ce7a9b6

full SemEval data and pipelines

Author: garrafao
1 parent 739837c, commit ce7a9b6

9 files changed, with 333 additions and 2 deletions

.gitignore

+2
@@ -3,8 +3,10 @@ matrices
 results
 corpora/durel
 corpora/surel
+corpora/semeval2020*
 corpora/semcor_lsc
 testsets/semcor_lsc
+testsets/semeval2020*
 modules/__pycache__
 modules/*.pyc
 update-git.sh

README.md

+23
@@ -1,4 +1,22 @@
 # LSCDetection
+
+* [General](#general)
+* [Usage](#usage)
+* [Models](#models)
++ [Semantic Representations](#semantic-representations)
++ [Alignment](#alignment)
++ [Measures](#measures)
+* [Parameter Settings](#parameter-settings)
+* [Evaluation](#evaluation)
++ [Metrics](#metrics)
++ [Pipeline](#pipeline)
+* [Important Changes](#important-changes)
+* [Error Sources](#error-sources)
+- [BibTex](#bibtex)
+
+
+### General
+
 Data Sets and Models for Evaluation of Lexical Semantic Change Detection.
 
 If you use this software for academic research, please [cite](#bibtex) this paper:
@@ -9,6 +27,7 @@ Also make sure you give appropriate credit to the below-mentioned software this
 
 Parts of the code rely on [DISSECT](https://github.com/composes-toolkit/dissect), [gensim](https://github.com/rare-technologies/gensim), [numpy](https://pypi.org/project/numpy/), [scikit-learn](https://pypi.org/project/scikit-learn/), [scipy](https://pypi.org/project/scipy/), [VecMap](https://github.com/artetxem/vecmap).
 
+
 ### Usage
 
 The scripts should be run directly from the main directory. If you wish to do otherwise, you may have to change the path you add to the path attribute in `sys.path.append('./modules/')` in the scripts. All scripts can be run directly from the command line:
@@ -104,6 +123,10 @@ The evaluation framework of this repository is based on the comparison of a set
 | DURel | German | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
 | SURel | German | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
 | SemCor LSC | English | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
+| SemEval Eng | English | CCOHA 1810-1860 | CCOHA 1960-2010 | [Dataset](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd), [Corpora](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd) | |
+| SemEval Ger | German | DTA 1800-1899 | BZND 1946-1990 | [Dataset](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd), [Corpora](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd) | |
+| SemEval Lat | Latin | LatinISE -200-0 | LatinISE 0-2000 | [Dataset](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd), [Corpora](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd) | |
+| SemEval Swe | Swedish | Kubhist2 1790-1830 | Kubhist2 1895-1903 | [Dataset](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd), [Corpora](https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd) | |
 
 We provide several evaluation pipelines, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline).
 
scripts/parameters_semeval_eng.sh

+2 -2
@@ -10,8 +10,8 @@ windowSizes=(10) # window sizes for all models
 ks=(5) # values for shifting parameter k
 ts=(None) # values for subsampling parameter t
 iterations=(1) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
-dims=(30 100) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
-eps=(10) # training epochs for SGNS
+dims=(100) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
+eps=(30) # training epochs for SGNS
 targets="testsets/semeval2020_ulscd_eng/testset/targets.tsv" # target words for which change scores should be predicted (one target per line)
 testset="testsets/semeval2020_ulscd_eng/testset/targets_in.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created)
 testsetwi="testsets/semeval2020_ulscd_eng/testset/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created)
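
The arrays in this parameter file read like a parameter grid: each value in `dims` would presumably be combined with each value in `eps`. The loop below is a minimal sketch of that idea under this assumption. It is not the repository's `make_results_*.sh` logic, which is not part of this diff, and the `configs` array and its string format are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical illustration: enumerate the configurations implied by the arrays.
dims=(100)   # value after this commit (previously: 30 100)
eps=(30)     # value after this commit (previously: 10)

configs=()   # assumed helper array, not part of the repository
for dim in "${dims[@]}"; do
    for ep in "${eps[@]}"; do
        configs+=("dim=${dim},ep=${ep}")
    done
done

# Print one configuration per line.
printf '%s\n' "${configs[@]}"
```

Under this reading, the change reduces the English SGNS grid from two configurations (two dimensionalities, one epoch setting) to one.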

scripts/parameters_semeval_ger.sh

+31
@@ -0,0 +1,31 @@
+shopt -s extglob # Enable extended pattern matching (extglob) in bash
+
+### Define parameters ###
+corpDir1="corpora/semeval2020_ulscd_ger/corpus1/" # directory for corpus1 files (all files in directory will be read)
+corpDir2="corpora/semeval2020_ulscd_ger/corpus2/" # directory for corpus2 files (all files in directory will be read)
+wiCorpDir="corpora/semeval2020_ulscd_ger/corpus_wi_full/" # directory for word-injected corpus (only needed for Word Injection)
+freqnorms=(70244495 72397520) # normalization constants for token frequency (total number of tokens in first and second corpus)
+typesnorms=(1072963 2375179) # normalization constants for number of context types (total number of types in first and second corpus)
+windowSizes=(10) # window sizes for all models
+ks=(5) # values for shifting parameter k
+ts=(None) # values for subsampling parameter t
+iterations=(1) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
+dims=(300) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
+eps=(5) # training epochs for SGNS
+targets="testsets/semeval2020_ulscd_ger/testset/targets.tsv" # target words for which change scores should be predicted (one target per line)
+testset="testsets/semeval2020_ulscd_ger/testset/targets_in.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created)
+testsetwi="testsets/semeval2020_ulscd_ger/testset/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created)
+goldrankfile="testsets/semeval2020_ulscd_ger/testset/graded.tsv" # file with gold scores for target words in same order as targets in testsets
+goldclassfile="testsets/semeval2020_ulscd_ger/testset/binary.tsv" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent)
+
+# Get normalization constants for dispersion measures
+freqnorm1=${freqnorms[0]}
+freqnorm2=${freqnorms[1]}
+typesnorm1=${typesnorms[0]}
+typesnorm2=${typesnorms[1]}
+
+### Make folder structure ###
+source scripts/make_folders.sh
+
+### Make target input files ###
+source scripts/make_targets.sh
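
The comments on `testset` and `testsetwi` describe two derived formats, `word\tword` and `word_\tword`, that "will be created" from the target list. Since `scripts/make_targets.sh` is not shown in this commit, the following is only a sketch of one way such files could be produced with `awk`. The temporary directory and the placeholder targets `wordA` and `wordB` are assumptions, not repository data.

```shell
#!/usr/bin/env bash
# Sketch only: an assumed way to derive the two target formats.
tmpdir=$(mktemp -d)
printf 'wordA\nwordB\n' > "$tmpdir/targets.tsv"   # placeholder targets, one per line

# Input format: each target repeated twice, tab-separated ('word\tword').
awk '{ print $0 "\t" $0 }' "$tmpdir/targets.tsv" > "$tmpdir/targets_in.tsv"

# Word injection format: injected form first, plain form second ('word_\tword').
awk '{ print $0 "_\t" $0 }' "$tmpdir/targets.tsv" > "$tmpdir/targets_wi.tsv"

cat "$tmpdir/targets_in.tsv" "$tmpdir/targets_wi.tsv"
```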

scripts/parameters_semeval_lat.sh

+31
@@ -0,0 +1,31 @@
+shopt -s extglob # Enable extended pattern matching (extglob) in bash
+
+### Define parameters ###
+corpDir1="corpora/semeval2020_ulscd_lat/corpus1/" # directory for corpus1 files (all files in directory will be read)
+corpDir2="corpora/semeval2020_ulscd_lat/corpus2/" # directory for corpus2 files (all files in directory will be read)
+wiCorpDir="corpora/semeval2020_ulscd_lat/corpus_wi_full/" # directory for word-injected corpus (only needed for Word Injection)
+freqnorms=(1751405 9417033) # normalization constants for token frequency (total number of tokens in first and second corpus)
+typesnorms=(65702 253970) # normalization constants for number of context types (total number of types in first and second corpus)
+windowSizes=(10) # window sizes for all models
+ks=(5) # values for shifting parameter k
+ts=(None) # values for subsampling parameter t
+iterations=(1) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
+dims=(100) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
+eps=(30) # training epochs for SGNS
+targets="testsets/semeval2020_ulscd_lat/testset/targets.tsv" # target words for which change scores should be predicted (one target per line)
+testset="testsets/semeval2020_ulscd_lat/testset/targets_in.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created)
+testsetwi="testsets/semeval2020_ulscd_lat/testset/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created)
+goldrankfile="testsets/semeval2020_ulscd_lat/testset/graded.tsv" # file with gold scores for target words in same order as targets in testsets
+goldclassfile="testsets/semeval2020_ulscd_lat/testset/binary.tsv" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent)
+
+# Get normalization constants for dispersion measures
+freqnorm1=${freqnorms[0]}
+freqnorm2=${freqnorms[1]}
+typesnorm1=${typesnorms[0]}
+typesnorm2=${typesnorms[1]}
+
+### Make folder structure ###
+source scripts/make_folders.sh
+
+### Make target input files ###
+source scripts/make_targets.sh
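
The `freqnorms` and `typesnorms` values are described as the total number of tokens and types per corpus. How the authors obtained them is not stated in this commit. The sketch below shows one plausible way to count both for a gzipped, whitespace-tokenized corpus. The `tokens` helper and the three-type toy corpus are assumptions for illustration only, and the repository's own tokenization may differ.

```shell
#!/usr/bin/env bash
# Sketch only: count tokens and types in a gzipped, whitespace-tokenized corpus.
tmpdir=$(mktemp -d)
printf 'a b a\nc a b\n' | gzip > "$tmpdir/corpus1_preprocessed.txt.gz"   # toy placeholder corpus

# Hypothetical helper: emit one token per line, dropping empty lines.
tokens() { gzip -dc "$1" | tr -s '[:space:]' '\n' | sed '/^$/d'; }

# Total number of tokens (candidate value for freqnorms[0]).
freqnorm1=$(tokens "$tmpdir/corpus1_preprocessed.txt.gz" | wc -l | tr -d ' ')

# Number of distinct tokens, i.e. types (candidate value for typesnorms[0]).
typesnorm1=$(tokens "$tmpdir/corpus1_preprocessed.txt.gz" | sort -u | wc -l | tr -d ' ')

echo "tokens=$freqnorm1 types=$typesnorm1"
```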

scripts/parameters_semeval_swe.sh

+31
@@ -0,0 +1,31 @@
+shopt -s extglob # Enable extended pattern matching (extglob) in bash
+
+### Define parameters ###
+corpDir1="corpora/semeval2020_ulscd_swe/corpus1/" # directory for corpus1 files (all files in directory will be read)
+corpDir2="corpora/semeval2020_ulscd_swe/corpus2/" # directory for corpus2 files (all files in directory will be read)
+wiCorpDir="corpora/semeval2020_ulscd_swe/corpus_wi_full/" # directory for word-injected corpus (only needed for Word Injection)
+freqnorms=(71091464 110792654) # normalization constants for token frequency (total number of tokens in first and second corpus)
+typesnorms=(3493027 1937358) # normalization constants for number of context types (total number of types in first and second corpus)
+windowSizes=(10) # window sizes for all models
+ks=(5) # values for shifting parameter k
+ts=(None) # values for subsampling parameter t
+iterations=(1) # list of iterations, each item is one iteration, for five iterations define: iterations=(1 2 3 4 5)
+dims=(300) # dimensionality of low-dimensional matrices (SVD/RI/SGNS)
+eps=(5) # training epochs for SGNS
+targets="testsets/semeval2020_ulscd_swe/testset/targets.tsv" # target words for which change scores should be predicted (one target per line)
+testset="testsets/semeval2020_ulscd_swe/testset/targets_in.tsv" # target words in input format (one target per line repeated twice with tab-separation, i.e., 'word\tword', will be created)
+testsetwi="testsets/semeval2020_ulscd_swe/testset/targets_wi.tsv" # target words in word injection format (one target per line, injected version in first column, non-injected version in second column, i.e., 'word_\tword', will be created)
+goldrankfile="testsets/semeval2020_ulscd_swe/testset/graded.tsv" # file with gold scores for target words in same order as targets in testsets
+goldclassfile="testsets/semeval2020_ulscd_swe/testset/binary.tsv" # file with gold classes for target words in same order as targets in testsets (leave undefined if non-existent)
+
+# Get normalization constants for dispersion measures
+freqnorm1=${freqnorms[0]}
+freqnorm2=${freqnorms[1]}
+typesnorm1=${typesnorms[0]}
+typesnorm2=${typesnorms[1]}
+
+### Make folder structure ###
+source scripts/make_folders.sh
+
+### Make target input files ###
+source scripts/make_targets.sh
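
The gold files are expected to list scores "in same order as targets", so the two files must at least have the same number of lines. The check below is a hedged sketch of such a sanity test. It is not part of the repository, and the `status` variable, the temporary files and the placeholder scores are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: verify that targets and gold scores have the same number of lines.
tmpdir=$(mktemp -d)
printf 'wordA\nwordB\n' > "$tmpdir/targets.tsv"   # placeholder targets
printf '0.1\n0.7\n' > "$tmpdir/graded.tsv"        # placeholder scores, not real gold data

targets="$tmpdir/targets.tsv"
goldrankfile="$tmpdir/graded.tsv"

n_targets=$(wc -l < "$targets" | tr -d ' ')
n_gold=$(wc -l < "$goldrankfile" | tr -d ' ')

if [ "$n_targets" -eq "$n_gold" ]; then
    status="ok"
else
    status="mismatch"
fi
echo "targets=$n_targets gold=$n_gold status=$status"
```

A matching line count does not prove that the order agrees, so this only catches truncated or misaligned files.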

scripts/run_semeval_ger.sh

+71
@@ -0,0 +1,71 @@
+### THIS SCRIPT PRODUCES PREDICTIONS AND EVALUATES THEM FOR ALL MODELS WITH SEMEVAL-GER PARAMETERS ###
+
+## Download corpora and testsets ##
+wget https://www2.ims.uni-stuttgart.de/data/sem-eval-ulscd/semeval2020_ulscd_ger.zip -nc -P testsets/
+cd testsets/ && unzip -o semeval2020_ulscd_ger.zip && rm semeval2020_ulscd_ger.zip && cd ..
+if [ ! -d corpora/semeval2020_ulscd_ger ];
+then
+    mkdir -p corpora/semeval2020_ulscd_ger/corpus1/
+    mkdir -p corpora/semeval2020_ulscd_ger/corpus2/
+    scripts/preprocess.sh testsets/semeval2020_ulscd_ger/corpus1/lemma/ corpora/semeval2020_ulscd_ger/corpus1/corpus1_preprocessed.txt 42
+    scripts/preprocess.sh testsets/semeval2020_ulscd_ger/corpus2/lemma/ corpora/semeval2020_ulscd_ger/corpus2/corpus2_preprocessed.txt 43
+    gzip corpora/semeval2020_ulscd_ger/corpus1/*
+    gzip corpora/semeval2020_ulscd_ger/corpus2/*
+fi
+rm -r testsets/semeval2020_ulscd_ger/corpus1
+rm -r testsets/semeval2020_ulscd_ger/corpus2
+
+## Bring testsets in correct format ##
+mkdir -p testsets/semeval2020_ulscd_ger/testset
+cp -u testsets/semeval2020_ulscd_ger/targets.txt testsets/semeval2020_ulscd_ger/testset/targets.tsv
+cut -f 2- testsets/semeval2020_ulscd_ger/truth/graded.txt > testsets/semeval2020_ulscd_ger/testset/graded.tsv
+cut -f 2- testsets/semeval2020_ulscd_ger/truth/binary.txt > testsets/semeval2020_ulscd_ger/testset/binary.tsv
+
+## Define global parameters ##
+parameterfile=scripts/parameters_semeval_ger.sh # corpus- and testset-specific parameter specifications
+
+## Get predictions from models ##
+# All models with similarity measures
+globalmatrixfolderprefix=matrices/semeval_ger_sim # parent folder for matrices
+globalresultfolderprefix=results/semeval_ger_sim # parent folder for results
+source $parameterfile # get corpus- and testset-specific parameters
+source scripts/make_results_sim.sh
+# Evaluate results
+resultfolder=$resultfolder
+outfolder=$globalresultfolder
+source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores
+source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes
+
+# All models with dispersion measures
+globalmatrixfolderprefix=matrices/semeval_ger_disp # parent folder for matrices
+globalresultfolderprefix=results/semeval_ger_disp # parent folder for results
+source $parameterfile # get corpus- and testset-specific parameters
+source scripts/make_results_disp.sh
+# Evaluate results
+resultfolder=$resultfolder
+outfolder=$globalresultfolder
+source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores
+source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes
+
+# All models with word injection
+globalmatrixfolderprefix=matrices/semeval_ger_wi # parent folder for matrices
+globalresultfolderprefix=results/semeval_ger_wi # parent folder for results
+source $parameterfile # get corpus- and testset-specific parameters
+
+## Make word-injected corpus ##
+if [ ! -f $wiCorpDir/corpus_wi.txt.gz ];
+then
+    mkdir -p $wiCorpDir
+    corpDir1=$corpDir1
+    corpDir2=$corpDir2
+    outfile=$wiCorpDir/corpus_wi.txt
+    source scripts/run_WI.sh # Create combined word-injected corpus from corpus1 and corpus2
+fi
+
+source scripts/make_results_wi.sh
+
+# Evaluate results
+resultfolder=$resultfolder
+outfolder=$globalresultfolder
+source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores
+source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes
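
The "Bring testsets in correct format" step uses `cut -f 2-` to drop the first tab-separated column of the truth files, so that only the gold values remain in target order. The snippet below illustrates that behaviour on a toy file. The `word<TAB>score` layout of `truth/graded.txt` is inferred from the `cut` call rather than confirmed by this commit, and the rows are placeholders, not real gold data.

```shell
#!/usr/bin/env bash
# Sketch only: show what 'cut -f 2-' does to an assumed 'word<TAB>score' file.
tmpdir=$(mktemp -d)
printf 'wordA\t0.25\nwordB\t0.75\n' > "$tmpdir/graded.txt"   # placeholder rows

# Keep everything from the second tab-separated field onwards.
cut -f 2- "$tmpdir/graded.txt" > "$tmpdir/graded.tsv"

cat "$tmpdir/graded.tsv"
```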

scripts/run_semeval_lat.sh

+71
@@ -0,0 +1,71 @@
+### THIS SCRIPT PRODUCES PREDICTIONS AND EVALUATES THEM FOR ALL MODELS WITH SEMEVAL-LAT PARAMETERS ###
+
+## Download corpora and testsets ##
+wget https://zenodo.org/record/3734089/files/semeval2020_ulscd_lat.zip -nc -P testsets/
+cd testsets/ && unzip -o semeval2020_ulscd_lat.zip && rm semeval2020_ulscd_lat.zip && cd ..
+if [ ! -d corpora/semeval2020_ulscd_lat ];
+then
+    mkdir -p corpora/semeval2020_ulscd_lat/corpus1/
+    mkdir -p corpora/semeval2020_ulscd_lat/corpus2/
+    scripts/preprocess.sh testsets/semeval2020_ulscd_lat/corpus1/lemma/ corpora/semeval2020_ulscd_lat/corpus1/corpus1_preprocessed.txt 1
+    scripts/preprocess.sh testsets/semeval2020_ulscd_lat/corpus2/lemma/ corpora/semeval2020_ulscd_lat/corpus2/corpus2_preprocessed.txt 6
+    gzip corpora/semeval2020_ulscd_lat/corpus1/*
+    gzip corpora/semeval2020_ulscd_lat/corpus2/*
+fi
+rm -r testsets/semeval2020_ulscd_lat/corpus1
+rm -r testsets/semeval2020_ulscd_lat/corpus2
+
+## Bring testsets in correct format ##
+mkdir -p testsets/semeval2020_ulscd_lat/testset
+cp -u testsets/semeval2020_ulscd_lat/targets.txt testsets/semeval2020_ulscd_lat/testset/targets.tsv
+cut -f 2- testsets/semeval2020_ulscd_lat/truth/graded.txt > testsets/semeval2020_ulscd_lat/testset/graded.tsv
+cut -f 2- testsets/semeval2020_ulscd_lat/truth/binary.txt > testsets/semeval2020_ulscd_lat/testset/binary.tsv
+
+## Define global parameters ##
+parameterfile=scripts/parameters_semeval_lat.sh # corpus- and testset-specific parameter specifications
+
+## Get predictions from models ##
+# All models with similarity measures
+globalmatrixfolderprefix=matrices/semeval_lat_sim # parent folder for matrices
+globalresultfolderprefix=results/semeval_lat_sim # parent folder for results
+source $parameterfile # get corpus- and testset-specific parameters
+source scripts/make_results_sim.sh
+# Evaluate results
+resultfolder=$resultfolder
+outfolder=$globalresultfolder
+source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores
+source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes
+
+# All models with dispersion measures
+globalmatrixfolderprefix=matrices/semeval_lat_disp # parent folder for matrices
+globalresultfolderprefix=results/semeval_lat_disp # parent folder for results
+source $parameterfile # get corpus- and testset-specific parameters
+source scripts/make_results_disp.sh
+# Evaluate results
+resultfolder=$resultfolder
+outfolder=$globalresultfolder
+source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores
+source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes
+
+# All models with word injection
+globalmatrixfolderprefix=matrices/semeval_lat_wi # parent folder for matrices
+globalresultfolderprefix=results/semeval_lat_wi # parent folder for results
+source $parameterfile # get corpus- and testset-specific parameters
+
+## Make word-injected corpus ##
+if [ ! -f $wiCorpDir/corpus_wi.txt.gz ];
+then
+    mkdir -p $wiCorpDir
+    corpDir1=$corpDir1
+    corpDir2=$corpDir2
+    outfile=$wiCorpDir/corpus_wi.txt
+    source scripts/run_WI.sh # Create combined word-injected corpus from corpus1 and corpus2
+fi
+
+source scripts/make_results_wi.sh
+
+# Evaluate results
+resultfolder=$resultfolder
+outfolder=$globalresultfolder
+source scripts/run_SPR.sh # Get Spearman correlation of measure predictions with gold scores
+source scripts/run_AP.sh # Get Average Precision of measure predictions with gold classes
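
The pipeline passes inputs to each step through shell variables and then runs the step with `source`, so the step executes in the same shell and sees those variables. Self-assignments such as `resultfolder=$resultfolder` are therefore no-ops that only document which variables the next step reads. The sketch below illustrates the convention with a hypothetical stand-in script. `step.sh`, the `summary` variable and the example paths are assumptions, not repository files.

```shell
#!/usr/bin/env bash
# Sketch only: pass parameters to a sourced script via shell variables.
tmpdir=$(mktemp -d)

# Hypothetical stand-in for a sourced step such as scripts/run_SPR.sh.
cat > "$tmpdir/step.sh" <<'EOF'
# Reads $resultfolder and $outfolder from the calling shell.
summary="read ${resultfolder} -> write ${outfolder}"
EOF

resultfolder="results/example_sim/part"   # placeholder path
outfolder="results/example_sim"           # placeholder path
source "$tmpdir/step.sh"                  # runs in the current shell, so variables are shared

echo "$summary"
```

One consequence of this design is that variables set by an earlier sourced step stay in scope for later steps, which is why each section re-sources the parameter file after changing the folder prefixes.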
