This folder contains the two types of files released:
In the text-similarity folder there are 3 tsv files:
- RO-STS.train.tsv, containing the train portion of the dataset
- RO-STS.dev.tsv, containing the development portion of the dataset
- RO-STS.test.tsv, containing the test portion of the dataset
The train/dev/test splits are identical to the original English STS corpus splits.
All files have the same tab-separated format:
Similarity score [TAB] Sentence 1 [TAB] Sentence 2
where Similarity score is a floating-point number from 0 to 5, with 5 being the maximum similarity.
For example:
1.5 Un bărbat cântă la harpă. Un bărbat cântă la claviatură.
1.8 O femeie taie cepe. O femeie taie tofu.
3.5 Un bărbat merge pe o bicicletă electrică. Un bărbat merge pe bicicletă.
2.2 Un bărbat cântă la tobe. Un bărbat cântă la chitară.
2.2 Un bărbat cântă la chitară. O doamnă cântă la chitară.
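The tab-separated format described above can be read with a few lines of Python. This is a minimal sketch, not part of the release: the function names `parse_sts_line` and `read_sts` are hypothetical, and it assumes the files are UTF-8 encoded, have no header row, and contain exactly three tab-separated fields per line.

```python
from typing import List, Tuple


def parse_sts_line(line: str) -> Tuple[float, str, str]:
    """Split one 'score<TAB>sentence 1<TAB>sentence 2' line into typed fields."""
    # Assumption: exactly three fields; any other count raises ValueError here.
    score, sentence1, sentence2 = line.rstrip("\n").split("\t")
    return float(score), sentence1, sentence2


def read_sts(path: str) -> List[Tuple[float, str, str]]:
    """Read a whole similarity file, skipping blank lines."""
    # Assumption: the file is UTF-8 encoded and has no header row.
    with open(path, encoding="utf-8") as handle:
        return [parse_sts_line(line) for line in handle if line.strip()]
```

For instance, `read_sts("text-similarity/RO-STS.dev.tsv")` would return a list of `(score, sentence1, sentence2)` tuples, assuming the script runs from this folder.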
The parallel corpus is found in the ro-en folder, and contains 6 files:
- RO-STS.train.ro and RO-STS.train.en, representing the train portion of the dataset
- RO-STS.dev.ro and RO-STS.dev.en, representing the development portion of the dataset
- RO-STS.test.ro and RO-STS.test.en, representing the test portion of the dataset
The files are raw text, one sentence per line, with each file pair containing corresponding ro-en sentences.
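Because the two files of a pair are aligned line by line, they can be read together. The sketch below is one possible way to do this; the function name `read_parallel` is hypothetical, and it assumes UTF-8 encoding and that both files of a pair have the same number of lines (it raises an error otherwise).

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(ro_path: str, en_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (Romanian, English) sentence pairs from two line-aligned files."""
    # Assumption: both files are UTF-8 encoded.
    with open(ro_path, encoding="utf-8") as ro_file, \
            open(en_path, encoding="utf-8") as en_file:
        for ro_line, en_line in zip_longest(ro_file, en_file):
            # zip_longest pads the shorter file with None, which exposes
            # a length mismatch instead of silently truncating.
            if ro_line is None or en_line is None:
                raise ValueError("the two files have different line counts")
            yield ro_line.rstrip("\n"), en_line.rstrip("\n")
```

For instance, `list(read_parallel("ro-en/RO-STS.dev.ro", "ro-en/RO-STS.dev.en"))` would give the development pairs, assuming the script runs from this folder.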