
Cross-Lingual Tagging

The following work is an attempt to create a cross-lingual tagger for two languages, kaz and tel, hereafter referred to as the proposed target languages.

For each proposed target language, a list of the most similar languages was calculated using the languages_most_similar.py file from here. The file uses WALS data to return the WALS codes of the languages most similar to the given language, along with the corresponding similarity measure. From the resultant list, some languages were removed on the basis of the following criteria:

  • Unavailability of UD Treebank for the given language.
  • Unavailability of Watchtower data for the given language.
  • Language Similarity Measure too low.

It is worth noting that the WALS code for a language might differ from the ISO standard code for the language. A list of languages, with their language codes in the WALS, ISO 639-1, ISO 639-2 and ISO 639-3 standards, can be found here.
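As a rough illustration of how such a WALS-based similarity measure can be computed, the sketch below compares the feature values two languages share. This is not the actual languages_most_similar.py implementation; the TSV layout, the wals_code column name, the file name and the choice to normalise by the number of jointly filled-in features are all assumptions.

```python
# Minimal sketch of a WALS-based similarity measure (not the actual
# languages_most_similar.py implementation).  Assumes a TSV dump of WALS in
# which each row is a language, a "wals_code" column holds the language code,
# and every other column holds a feature value (empty = not filled in).
import csv

def load_wals(path):
    """Return {wals_code: {feature: value}} for all filled-in features."""
    table = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            code = row.pop('wals_code')              # assumed column name
            table[code] = {k: v for k, v in row.items() if v}
    return table

def similarity(feats_a, feats_b):
    """Fraction of jointly filled-in features that carry identical values."""
    shared = set(feats_a) & set(feats_b)
    if not shared:
        return 0.0
    same = sum(feats_a[f] == feats_b[f] for f in shared)
    return same / len(shared)

if __name__ == '__main__':
    wals = load_wals('wals_languages.tsv')           # hypothetical file name
    target = wals['tel']                             # WALS code of the target
    ranked = sorted(((similarity(target, feats), code)
                     for code, feats in wals.items() if code != 'tel'),
                    reverse=True)
    for score, code in ranked[:10]:
        print(code, round(score, 4))
```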

For each filtered language, the loss percentage (100 - Accuracy %) of the parallel data generated with the proposed target language was calculated using the parallel_data_accuracy.py file from here. The threshold for the loss characteristic was kept at 40 %. Correspondingly, parallel corpora with a loss characteristic of <= 40 % were kept, and the rest were not used. The following are the alignment loss percentages for each proposed target-source language pair.

  • kaz

    | Pair    | Loss (in %) | Dropped? |
    |---------|-------------|----------|
    | kaz-ja  | 66.942      | Yes      |
    | kaz-ko  | 60.746      | Yes      |
    | kaz-tur | 48.578      | Yes      |

  • tel

    | Pair    | Loss (in %) | Dropped? |
    |---------|-------------|----------|
    | tel-hi  | 40.53       | Yes      |
    | tel-ja  | 72.374      | Yes      |
    | tel-ta  | 35.523      | No       |
    | tel-tur | 35.399      | No       |

The third column in each of the above tables shows whether the pair was dropped from further calculations, so as not to lower the overall accuracy. Having lost all the pairs for kaz, the list of target languages drops to one, i.e. tel.
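The dropping decision itself is just a threshold filter over the loss values; a minimal sketch of that step, using the pair losses from the tables above (the dictionary below is illustrative and not part of the repository):

```python
# Sketch of the 40 % loss-threshold filter described above (illustrative only;
# the loss values are copied from the tables in this section).
LOSS_THRESHOLD = 40.0        # pairs with loss <= 40 % are kept

losses = {
    'kaz-ja': 66.942, 'kaz-ko': 60.746, 'kaz-tur': 48.578,
    'tel-hi': 40.53, 'tel-ja': 72.374, 'tel-ta': 35.523, 'tel-tur': 35.399,
}

kept = sorted(pair for pair, loss in losses.items() if loss <= LOSS_THRESHOLD)
print(kept)                  # ['tel-ta', 'tel-tur'] -> only tel survives
```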

Files Included

  1. makefile

    The makefile can be used to UDPipe-parse the data and generate the alignments. This can be done using the clean_data, align_data and UDpipe targets in the makefile.

    The dummy targets demonstrate how the files can be used to generate pickles (pickle), tag the data (tag), train the UDPipe models on the tagged data (train_models), and finally compute the training and test accuracy of the generated models (train_accuracy and test_accuracy, respectively).

  2. align.py

    This is the main file of the pipeline. At first, the file generates sentence and word-level alignments given input files. The alignments are generated by using the following arguments:

    -i or --input: CONLLU file with the target language data, tagged/un-tagged. Required argument.
    -a or --alignments: mGiza generated file containing the alignments from source to target language. Can take multiple inputs. Required argument.
    -c or --conllu: CONLLU format tagged files for the sources listed in -a argument. Used for generating alignments. Required argument.

    The file reads in the alignment data and the corresponding conllu files, creating sentence- and word-level alignments. Since these alignments are needed for every run, they can be saved to, or loaded from, files with the following mutually exclusive arguments:

    --pickle: Saves the sentence and word-level alignments in two different files as a pickle object. Exits after saving the pickles.
    --already_pickled: Loads the sentence and word-level alignments from the pickles (in that order).

    Once the alignments have been generated, the language scores are computed. These scores differ for each run and are elaborated in a table later. The files specifying the language-based scores can be passed in using the following argument:

    -l or --lang_scores: TSV files with ISO language code, and the score. Can take multiple inputs.

    The alignments resulting from the earlier arguments, combined with the scores, help decide the POS tags for the target from the source(s). If no -l argument is given, all sources are given equal weights. Each aligned token is allotted the POS tag that scores highest across the sources, provided there is a singleton winner (a sketch of this weighted voting is given after the XY table below). After this disambiguation, there remain tokens with more than one possible candidate, as well as tokens in a sentence which have not been aligned at all. We take care of these two problems using two different arguments for the file:

    • -rf or --random_fill: Determines the X part of the XY nomenclature discussed later under the -o argument. Some tokens still have multiple contenders for the most likely tag. If this argument is PRESENT, a tag is selected at random from those contenders as the final tag. If this argument is ABSENT, we look at the POS most often assigned to the lemma and select from there; if no disambiguation is possible even then, we fall back to a random tag from the lemma's contenders. Based on the value-filling here, we create a POS-dict containing all the alignments and the POS values encountered so far.
    • -f or --lemma_based_decision: Determines the Y part of the XY nomenclature discussed later under the -o argument. Due to incomplete alignments, there will be tokens (at sentence level) which don't have any tags to select from. This is the classic cold-start problem. If this argument is ABSENT, the POS-dict from the previous argument is searched for a contender to fill the value from and, if an element is found, refreshed. If not, the value is left empty to be handled later while writing the outputs. If the argument is PRESENT, we look at the POS most often assigned to the lemma, as in the case of -rf above. Based on the value-filling here, we create a lemma-dict containing all the lemmas and all the POS values assigned to them so far.

    After the above process, not all values have been filled in; many remain empty. We take care of those while writing the outputs, using the following argument:

    -o or --output: After all the possible values have been filled, we look at the values that remain unfilled from the -f argument. We refresh the lemma-dict as well as the POS-dict, and start filling in the values which may now be available, repeating the process of selecting the most suitable tag and then tagging by the POS of the token's lemma. Eventually, we are left with words without any analysis whatsoever; we assign the NOUN category to all such tokens. The final analyses are then written into the output file. The file name has XY appended to the end, where X, Y belong to {0, 1}. The XY nomenclature is as follows:

    | -rf argument | -f argument | X | Y |
    |--------------|-------------|---|---|
    | Present      | Absent      | 1 | 0 |
    | Present      | Present     | 1 | 1 |
    | Absent       | Absent      | 0 | 0 |
    | Absent       | Present     | 0 | 1 |
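The core of the projection step described above is a weighted vote over the source tags. The sketch below illustrates that idea for a single target token; the function and data structures are illustrative and do not correspond to the actual internals of align.py.

```python
import random
from collections import defaultdict

def project_tag(aligned_tags, source_weights, random_fill=False):
    """Decide the POS tag of one target token from its aligned source tokens.

    aligned_tags   -- list of (source_language, upos) pairs aligned to the token
    source_weights -- {source_language: score}; equal weights when no -l files
    random_fill    -- mimics the -rf flag: break remaining ties at random
    Returns the winning tag, None for an unaligned token (cold start), or the
    list of tied candidates so that a lemma-based decision can be tried instead.
    """
    scores = defaultdict(float)
    for lang, upos in aligned_tags:
        scores[upos] += source_weights.get(lang, 1.0)
    if not scores:
        return None                                  # unaligned token
    best = max(scores.values())
    winners = [upos for upos, s in scores.items() if s == best]
    if len(winners) == 1:
        return winners[0]                            # singleton winner
    return random.choice(winners) if random_fill else winners

# Example: 'ta' and 'tur' disagree, so the language scores decide the tag.
print(project_tag([('ta', 'NOUN'), ('tur', 'VERB')],
                  {'ta': 0.2178, 'tur': 0.2475}))    # -> 'VERB'
```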
  3. training_accuracy.py

    This file is used to calculate the accuracy of the generated conllu files. For the --true and --generated argument pair, the files are checked line by line for matching UPOS values, keeping the tokenisation constant. Note that the --true argument usually takes the UDPipe-tagged conllu file. The reported scores are out of 100, expressed as a percentage. The file can be used as follows:

    python3 training_accuracy.py --true <true (UDPIPE) tagged conllu file> --generated <program generated conllu file>

    The output result is in the following format:

    <program generated conllu file> <tab> Accuracy_score

    The above-mentioned Accuracy_score is the value reported in the Train_Accuracy column for each model, as given in the Statistics section.
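A minimal sketch of the line-by-line UPOS comparison described above (the actual training_accuracy.py may differ; this version only assumes that both files use identical tokenisation, as stated above, and that UPOS sits in the standard fourth CoNLL-U column):

```python
import sys

def upos_accuracy(true_path, generated_path):
    """Percentage of token lines whose UPOS (4th CoNLL-U column) matches."""
    matched = total = 0
    with open(true_path, encoding='utf-8') as true_file, \
         open(generated_path, encoding='utf-8') as generated_file:
        for true_line, generated_line in zip(true_file, generated_file):
            if not true_line.strip() or true_line.startswith('#'):
                continue                             # skip blanks and comments
            true_cols = true_line.rstrip('\n').split('\t')
            generated_cols = generated_line.rstrip('\n').split('\t')
            if len(true_cols) < 4 or len(generated_cols) < 4:
                continue
            total += 1
            matched += true_cols[3] == generated_cols[3]
    return 100.0 * matched / total if total else 0.0

if __name__ == '__main__':
    true_path, generated_path = sys.argv[1], sys.argv[2]
    print(f'{generated_path}\t{upos_accuracy(true_path, generated_path)}')
```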

  4. clean_conllu.py

    This file is used to clean up values of the form weight(in float)*POS that creep in as a result of improper cleaning while generating the 0x conllu files. Note that although these unclean values affect the training and testing accuracy, they do not affect the second part of the pipeline (the x in the 0x model, as elaborated above). The reason is that while the values are erroneously copied into the output-writing list, they do not affect the values in the stored variables, and thus leave the second part of the pipeline untouched. The file can be used as follows:

    python3 clean_conllu.py <input conllu files>

    The output format is as follows:

    input_conllu_file1: <tab> number_of_patterns_encountered
    input_conllu_file2: <tab> number_of_patterns_encountered

    If the number_of_patterns_encountered value is non-zero, a new file with _final appended to the file name will be created, cleaned of the encountered patterns.
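A minimal sketch of this cleaning step, assuming the stray values look like 0.37*NOUN inside a CoNLL-U column (the exact pattern handled by the repository's clean_conllu.py may differ):

```python
import re
import sys

# Assumed shape of the stray values: <float>*<UPOS>, e.g. "0.37*NOUN".
# The substitution keeps only the POS part.
PATTERN = re.compile(r'\d+(?:\.\d+)?\*([A-Z]+)')

for path in sys.argv[1:]:
    hits = 0
    cleaned = []
    with open(path, encoding='utf-8') as conllu:
        for line in conllu:
            new_line, n = PATTERN.subn(r'\1', line)
            hits += n
            cleaned.append(new_line)
    print(f'{path}:\t{hits}')
    if hits:                         # only then is a *_final file written
        with open(path + '_final', 'w', encoding='utf-8') as out:
            out.writelines(cleaned)
```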

Statistics

  • The values in the Language Similarity Scores table were calculated using wals.py from here, as mentioned above. The maximum similarity of a language can be 1. The table shows similarity scores only for the languages kept after looking at the alignment loss percentages. These values can also be found in the lang_scores file in the language's folder.

  • The values in the Alignment Accuracy table were calculated using the accuracy.py file from here. The file calculates the accuracy of the alignments of the source-target parallel data produced with the mGIZA tool, available from here.

  • The values in the parallel file were generated using the parallel_data_accuracy.py file from here and record the quality of the parallel corpus created in this case. The values are expressed as a factor of 1, rather than out of 100 as in the original file, to maintain the format of the input scores.

  • The values in the Train_Accuracy column were generated by using training_accuracy.py file here.

  • The following are the details of the various runs (a sketch of the score combination used in Run6 and Run7 is given after this list):

    | Run ID | Parameter1 | Parameter2 | Parameter3 | Operation | Languages Used |
    |--------|------------|------------|------------|-----------|----------------|
    | Run1 | - | - | - | - | tur |
    | Run2 | - | - | - | - | ta |
    | Run3 | - | - | - | - | ta, tur |
    | Run4 | Language Similarity Scores (max = 1) | - | - | Normalized | ta, tur |
    | Run5 | Alignment Accuracy (max = 1) | - | - | Normalized | ta, tur |
    | Run6 | Language Similarity Scores (max = 1) | Alignment Accuracy (max = 1) | - | Normalized harmonic mean of Parameter1 and Parameter2 | ta, tur |
    | Run7 | Language Similarity Scores (max = 1) | Alignment Accuracy (max = 1) | 1 - Parallel Data Loss Percentage (max = 1) | Normalized harmonic mean of Parameter1 and Parameter3 * Parameter2 | ta, tur |
  • Time format: Hours:Minutes:Seconds.microseconds

  • All time values were measured by running on a single CPU, single processor on metacentrum, without GPU acceleration.
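The score combinations used in Run6 and Run7 reduce to a normalized harmonic mean of per-language values; a minimal sketch of that arithmetic, plugging in the numbers from the Run Details section below (the helper functions are illustrative, not the repository's code):

```python
def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

def normalize(scores):
    """Scale a {language: score} dict so that the values sum to 1."""
    total = sum(scores.values())
    return {lang: score / total for lang, score in scores.items()}

wals_sim = {'ta': 0.21782178217821782, 'tur': 0.24752475247524752}  # Parameter1
align_acc = {'ta': 0.38063, 'tur': 0.35652}                         # Parameter2
parallel_keep = {'ta': 1 - 0.35523, 'tur': 1 - 0.35399}             # Parameter3

run6 = normalize({lang: harmonic_mean(wals_sim[lang], align_acc[lang])
                  for lang in wals_sim})
run7 = normalize({lang: harmonic_mean(wals_sim[lang],
                                      parallel_keep[lang] * align_acc[lang])
                  for lang in wals_sim})
print('Run6 weights:', run6)
print('Run7 weights:', run7)
```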

Run Details

  • Language Similarity Scores

    | Language Code | Similarity Score |
    |---------------|------------------|
    | ta  | 0.21782178217821782 |
    | tur | 0.24752475247524752 |

  • Alignment Accuracy

    | Alignment File | Accuracy (in %) |
    |----------------|-----------------|
    | ta_final  | 38.063 |
    | tur_final | 35.652 |

  • Parallel Corpora Accuracy Loss

    | Corpus  | Loss (in %) |
    |---------|-------------|
    | tel-ta  | 35.523 |
    | tel-tur | 35.399 |
  • Note that a subscript i in the column headers below (e.g. Train_Accuracy_i) denotes Run[i].

Run0, setting the baseline

  • Setting1: When all the data was tagged as NOUN:

    Train_Accuracy: 21.7758 %
    Test_Accuracy: 20.36 %

  • Setting2: Using English as the source language for tagging. The best entry is marked in bold; all values are in percentages:

    | Model | Train_Accuracy | Test_Accuracy |
    |-------|----------------|---------------|
    | 10 | 21.8113 | 43.49 |
    | 11 | 21.8096 | 43.35 |
    | 00 | 21.9664 | 43.77 |
    | 01 | 24.9635 | 43.91 |
  • The higher of the values for training and testing accuracy, taken individually across both settings, was chosen as the desired baseline.

  • Baseline Scores

    Train_Accuracy: 21.97 %
    Test_Accuracy: 43.91 %

Run [1,2], single-source tagging

  • Parsing accuracies in %, all runs using exactly 1 language. Bold face indicates the best entry across the row (over all runs), whereas the best entry within a run is marked by a superscripted '#' symbol.

    | Model | Train_Accuracy1 | Test_Accuracy1 | Train_Accuracy2 | Test_Accuracy2 |
    |-------|-----------------|----------------|-----------------|----------------|
    | 10 | 38.6307 | 46.81 | 41.3313 | 53.46 |
    | 11 | 38.6532 | 47.78 | 41.4073 | 53.88 |
    | 00 | 38.7707 | 47.51 | 41.3822 | 55.26# |
    | 01 | 39.3176# | 49.45# | 41.5376# | 54.99 |

Run [3, 4, 5, 6, 7], multiple-source tagging, and the effect of parameters

  • Parsing Time Details for Run3

    | Model | Part1 Time | Part2 Time | Outputs Time | Ratio | % Unfilled | Train_Accuracy | Test_Accuracy |
    |-------|------------|------------|--------------|-------|------------|----------------|---------------|
    | 10 | 00:00:02.558072 | 000:00:06.289822 | 05:16:10.029413 | 008802 / 568972 | 01.5470 % | 44.4567 % | 56.65 % |
    | 11 | 00:00:02.206317 | 098:02:39.560497 | 05:13:34.968975 | 568538 / 568972 | 99.9237 % | 44.4318 % | 57.34 % |
    | 00 | 67:21:20.843114 | 000:00:05.366652 | 05:16:28.812640 | 016764 / 568972 | 02.9464 % | 44.2461 % | 57.62 % |
    | 01 | 62:11:16.775110 | 117:28:12.942471 | 05:47:27.038786 | 568589 / 568972 | 99.9327 % | 44.1444 % | 57.20 % |
  • Parsing accuracies in %, all runs using more than 1 language. Bold face indicates the best entry across the row (over all runs), whereas the best entry within a run is marked by a superscripted '#' symbol.

    | Model | Train_Accuracy3 | Test_Accuracy3 | Train_Accuracy4 | Test_Accuracy4 | Train_Accuracy5 | Test_Accuracy5 | Train_Accuracy6 | Test_Accuracy6 | Train_Accuracy7 | Test_Accuracy7 |
    |-------|-----------------|----------------|-----------------|----------------|-----------------|----------------|-----------------|----------------|-----------------|----------------|
    | 10 | 44.4567# | 56.65 | 44.3386# | 56.37 | 44.3478# | 56.51 | 44.3376 | 56.79# | 43.9031# | 55.96 |
    | 11 | 44.4318 | 57.34 | 44.3301 | 56.23 | 44.2890 | 56.65 | 44.3666# | 56.37 | 43.6192 | 55.54 |
    | 00 | 44.2461 | 57.62# | 44.2729 | 56.79 | 44.3206 | 56.79 | 44.2173 | 56.09 | 43.6431 | 56.37# |
    | 01 | 44.1444 | 57.20 | 44.1538 | 57.06# | 44.1204 | 56.93# | 44.2267 | 56.23 | 43.6376 | 55.68 |

Results and Discussions

Note that the Test_Accuracy values mentioned above are based on UDPipe models trained on UDv2.2 treebanks. The values are reported against this fixed version in order to standardize them and make them reproducible.

General Observations: We can see that all the models, irrespective of the number of sources used and the parameters selected, far exceed the set baseline. So, source language selection matters for cross-lingual tagging. If we organize the runs from the multi-source tagging part in decreasing order of their average test scores, we get the following hierarchy:

Run3 > Run5 > Run4 > Run6 > Run7

However, a t-test determines that there is no significant difference in the performance of the different runs. Since we have too few degrees of freedom for a reliable t-test analysis, we additionally look at the average test-score metric.

Single-source vs Multi-source: Looking at the scores, we notice that the test accuracy scores for multi-source tagging in the worst-case scenario (Run7) were still better than the best of the test accuracy scores for single-source tagging (p < 0.05). This shows that multiple source languages, when combined, can result in higher test accuracy scores than when they are used individually.

0x vs 1x: Since it would be careless to deduce anything just by looking at the data, we perform a t-test to determine whether the two groups perform equally well. We see that the p-value is > 0.05, implying that there is no statistically clear winner between the two. Still, looking at the data, the 0x models tend to perform better than the 1x models. This is rather intuitive, since methodical selection should perform better than random sampling.

x0 vs x1: Here as well, we perform a t-test to determine whether the two groups perform equally well. We see that the p-value is > 0.05, and so there is no statistical difference between the performance of the two models.
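For reference, a paired t-test of the kind referred to above can be run with scipy over the test-accuracy columns of Runs 3-7; the grouping of models into 0x and 1x and the pairing scheme below are assumptions, since the exact setup of the author's test is not specified:

```python
# Illustrative paired t-test over the multi-source test accuracies (Runs 3-7),
# comparing the 0x models against the 1x models.  Requires scipy; pairing a
# 0Y model with the 1Y model of the same run is an assumption about the setup.
from scipy.stats import ttest_rel

test_acc = {                 # values copied from the Run [3..7] table above
    '10': [56.65, 56.37, 56.51, 56.79, 55.96],
    '11': [57.34, 56.23, 56.65, 56.37, 55.54],
    '00': [57.62, 56.79, 56.79, 56.09, 56.37],
    '01': [57.20, 57.06, 56.93, 56.23, 55.68],
}

zero_x = test_acc['00'] + test_acc['01']      # models whose X digit is 0
one_x = test_acc['10'] + test_acc['11']       # models whose X digit is 1
statistic, p_value = ttest_rel(zero_x, one_x)
print(f't = {statistic:.3f}, p = {p_value:.3f}')
```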

Parameter Selection: Here are the main takeaways from a quick glance at the general observations:

  • Assigning equal weights to all languages works better than associating similarity-based weights with them.
  • When weights are assigned, the alignment-accuracy-score-based weighting is only very slightly better than the WALS-similarity-score-based weighting.
  • A combination of the two weights performs worse than either individual weighting scheme.
  • Adding the third weight performed the worst.

Let us try to address each of these separately, starting with the third weight, which is very similar to the second weight (used in Run7). We observe that piling on this additional weight lowered the test accuracy scores. A possible explanation is that, since the two weights are very similar, they need to be combined in a better way than a simple harmonic mean. Erroneous weight combination could also be a problem in the other runs, and might be responsible for bringing the scores down (Run6 compared to Run4 and Run5).

The first two takeaways might be related. While it is interesting to see that the test accuracy scores are lower for the language with the higher WALS score (Run1 and Run2), it is also important to note that changing the source language from en to ta and tur did have an improving effect. Due to the rudimentary implementations of both the WALS-similarity-score and the alignment-accuracy-score metrics, there is a good chance that the values could be pushed up if these metrics were improved; we discuss this further in the next section. Also, as previously mentioned, the lower score of the combined weighting compared to the individual weights could again be attributed to a faulty measure of weight combination. This is another issue that needs to be investigated further.

Future Directions

Looking at the results, there are 3 major issues here.

The primary issue is the non-agreement of the computed WALS scores with the tagging results. The WALS data might be misleading in this case, or we might be computing the scores wrongly; either way, this needs to be determined. Intuitively, WALS data should reflect similarity between languages; however, the many null/unfilled values for different columns in the original data make this a loophole that needs to be covered. Also, the different fields in the WALS data (columns in the tsv file) can be grouped by certain features. In future, it would be interesting to see how different fields affect such cross-lingual tasks, and whether some of them can be fixed for the tagging and/or parsing task(s) in particular.

The second issue is alignment accuracy. The current script is rudimentary in the sense that it computes the alignment accuracies based on the numeric data on the source and the target side. Even this is not foolproof, as either side could express the numerals as words (for example, 4 vs. four); the current script doesn't account for such cases. This is an important metric for determining whether the data can be projected successfully from the source to the target side. Consider the case of two alignments with accuracies of 80 % and 20 % respectively: the former should be able to generate more reliable projections than the latter, simply because of the information available. It would be interesting to study the effect of different alignment accuracies of the corpora, and whether the metric for calculating the accuracy score can be improved. The same extends to the calculation of parallel data quality. If the parallel data is generated manually, as in our case here, it is liable to be error-prone, thereby making it less reliable.
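A minimal sketch of such a numeric-overlap heuristic, illustrating the digit-vs-word weakness mentioned above (this is an illustration of the idea, not the repository's accuracy.py):

```python
import re

NUMERIC = re.compile(r'\d+(?:[.,]\d+)?')

def numeric_agreement(source_sentence, target_sentence):
    """1.0 if both sides contain the same set of digit strings, else 0.0.

    Returns None when neither side contains digits, i.e. the pair gives no
    evidence either way.
    """
    source_numbers = set(NUMERIC.findall(source_sentence))
    target_numbers = set(NUMERIC.findall(target_sentence))
    if not source_numbers and not target_numbers:
        return None
    return 1.0 if source_numbers == target_numbers else 0.0

def alignment_accuracy(sentence_pairs):
    """Percentage of evidential sentence pairs whose numeric tokens agree."""
    scores = [numeric_agreement(s, t) for s, t in sentence_pairs]
    scores = [score for score in scores if score is not None]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0

# The weakness described above: a spelled-out numeral counts as a mismatch.
print(numeric_agreement('I bought 4 books', 'I bought four books'))   # -> 0.0
```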

The final issue is deciding which of the WALS similarity score and the alignment-accuracy score should be the determining component. We saw how the run with the alignment-accuracy score gave generally higher test accuracy, but owing to the "faulty" implementation of the WALS scores, this can be misleading. If we know for sure that the system benefits from knowing both the similarity of the languages it is offered and the alignment-accuracy score, a combination of the two should definitely help pick more suitable tags. Thus, there might be a better way to combine the two scores than just their harmonic mean. This could be a future project worth looking into.