
Cross-Lingual Tagging

The following work is an attempt to create a cross-lingual tagger for two languages, kaz and tel, hereafter referred to as the proposed target languages.

For each proposed target language, a list of the most similar languages was calculated using the languages_most_similar.py file from here. The file uses WALS data to return the WALS codes of the languages most similar to the given language, along with the corresponding similarity measure. From the resultant list, some languages were removed on the basis of the following criteria:

  • Unavailability of UD Treebank for the given language.
  • Unavailability of Watchtower data for the given language.
  • Language Similarity Measure too low.

It is worth noting that the WALS code for a language might differ from the ISO standard code for the language. A list of languages, with their language codes in the WALS, ISO 639-1, ISO 639-2 and ISO 639-3 standards, can be found here.
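As a rough illustration of how such a WALS-based similarity measure can be computed, the sketch below compares the feature values two languages share. This is not the actual languages_most_similar.py implementation; the TSV layout, the wals_code column name, the file name and the choice to normalise by the number of jointly filled-in features are all assumptions.

```python
# Minimal sketch of a WALS-based similarity measure (not the actual
# languages_most_similar.py implementation).  Assumes a TSV dump of WALS in
# which each row is a language, a "wals_code" column holds the language code,
# and every other column holds a feature value (empty = not filled in).
import csv

def load_wals(path):
    """Return {wals_code: {feature: value}} for all filled-in features."""
    table = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            code = row.pop('wals_code')              # assumed column name
            table[code] = {k: v for k, v in row.items() if v}
    return table

def similarity(feats_a, feats_b):
    """Fraction of jointly filled-in features that carry identical values."""
    shared = set(feats_a) & set(feats_b)
    if not shared:
        return 0.0
    same = sum(feats_a[f] == feats_b[f] for f in shared)
    return same / len(shared)

if __name__ == '__main__':
    wals = load_wals('wals_languages.tsv')           # hypothetical file name
    target = wals['tel']                             # WALS code of the target
    ranked = sorted(((similarity(target, feats), code)
                     for code, feats in wals.items() if code != 'tel'),
                    reverse=True)
    for score, code in ranked[:10]:
        print(code, round(score, 4))
```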

For each filtered language, the loss percentage (100 - Accuracy %) of the parallel data generated with the proposed target language was calculated using the parallel_data_accuracy.py file from here. The threshold for the loss characteristic was kept at 40 %. Correspondingly, parallel corpora with a loss characteristic of <= 40 % were kept, and the rest were not used. The following are the alignment loss percentages for each proposed target-source language pair.

  • kaz

    | Pair    | Loss (in %) | Dropped? |
    |---------|-------------|----------|
    | kaz-ja  | 66.942      | Yes      |
    | kaz-ko  | 60.746      | Yes      |
    | kaz-tur | 48.578      | Yes      |

  • tel

    | Pair    | Loss (in %) | Dropped? |
    |---------|-------------|----------|
    | tel-hi  | 40.53       | Yes      |
    | tel-ja  | 72.374      | Yes      |
    | tel-ta  | 35.523      | No       |
    | tel-tur | 35.399      | No       |

The third column in each of the above tables shows whether the pair was dropped from further calculations, so as not to lower the overall accuracy. Having lost all the pairs for kaz, the list of target languages drops to one, i.e. tel.
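The dropping decision itself is just a threshold filter over the loss values; a minimal sketch of that step, using the pair losses from the tables above (the dictionary below is illustrative and not part of the repository):

```python
# Sketch of the 40 % loss-threshold filter described above (illustrative only;
# the loss values are copied from the tables in this section).
LOSS_THRESHOLD = 40.0        # pairs with loss <= 40 % are kept

losses = {
    'kaz-ja': 66.942, 'kaz-ko': 60.746, 'kaz-tur': 48.578,
    'tel-hi': 40.53, 'tel-ja': 72.374, 'tel-ta': 35.523, 'tel-tur': 35.399,
}

kept = sorted(pair for pair, loss in losses.items() if loss <= LOSS_THRESHOLD)
print(kept)                  # ['tel-ta', 'tel-tur'] -> only tel survives
```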

Files Included

  1. makefile

    The makefile can be used to UDPipe-parse the data and generate the alignments. This can be done using the clean_data, align_data and UDpipe targets in the makefile.

    The dummy targets demonstrate how the files can be used to generate pickles (pickle), tag the data (tag), train the UDPipe models on the tagged data (train_models), and finally compute the training and test accuracy of the generated models (train_accuracy and test_accuracy, respectively).

  2. align.py

    This is the main file of the pipeline. At first, the file generates sentence and word-level alignments given input files. The alignments are generated by using the following arguments:

    -i or --input: CONLLU file with the target language data, tagged/un-tagged. Required argument.
    -a or --alignments: mGiza generated file containing the alignments from source to target language. Can take multiple inputs. Required argument.
    -c or --conllu: CONLLU format tagged files for the sources listed in -a argument. Used for generating alignments. Required argument.

    The file reads in the alignment data and the corresponding conllu files, creating sentence- and word-level alignments. Since these alignments are needed for every run, they can be saved to, or loaded from, files with the following mutually exclusive arguments:

    --pickle: Saves the sentence and word-level alignments in two different files as a pickle object. Exits after saving the pickles.
    --already_pickled: Loads the sentence and word-level alignments from the pickles (in that order).

    Once the alignments have been generated, the language scores are computed. These scores differ for each run and are elaborated in a table later. The files specifying the language-based scores can be passed in using the following argument:

    -l or --lang_scores: TSV files with ISO language code, and the score. Can take multiple inputs.

    The alignments resulting from the earlier arguments, combined with the scores, help decide the POS tags for the target from the source(s). If no -l argument is given, all sources are given equal weights. Each aligned token is allotted the POS tag that scores highest across the sources, provided there is a singleton winner (a sketch of this weighted voting is given after the XY table below). After this disambiguation, there remain tokens with more than one possible candidate, as well as tokens in a sentence which have not been aligned at all. We take care of these two problems using two different arguments for the file:

    • -rf or --random_fill: Determines the X part of the XY nomenclature discussed later under the -o argument. Some tokens still have multiple contenders for the most likely tag. If this argument is PRESENT, a tag is selected at random from those contenders as the final tag. If this argument is ABSENT, we look at the POS most often assigned to the lemma and select from there; if no disambiguation is possible even then, we fall back to a random tag from the lemma's contenders. Based on the value-filling here, we create a POS-dict containing all the alignments and the POS values encountered so far.
    • -f or --lemma_based_decision: Determines the Y part of the XY nomenclature discussed later under the -o argument. Due to incomplete alignments, there will be tokens (at sentence level) which don't have any tags to select from. This is the classic cold-start problem. If this argument is ABSENT, the POS-dict from the previous argument is searched for a contender to fill the value from and, if an element is found, refreshed. If not, the value is left empty to be handled later while writing the outputs. If the argument is PRESENT, we look at the POS most often assigned to the lemma, as in the case of -rf above. Based on the value-filling here, we create a lemma-dict containing all the lemmas and all the POS values assigned to them so far.

    After the above process, not all values have been filled in; many remain empty. We take care of those while writing the outputs, using the following argument:

    -o or --output: After all the possible values have been filled, we look at the values that remain unfilled from the -f argument. We refresh the lemma-dict as well as the POS-dict, and start filling in the values which may now be available, repeating the process of selecting the most suitable tag and then tagging by the POS of the token's lemma. Eventually, we are left with words without any analysis whatsoever; we assign the NOUN category to all such tokens. The final analyses are then written into the output file. The file name has XY appended to the end, where X, Y belong to {0, 1}. The XY nomenclature is as follows:

    | -rf argument | -f argument | X | Y |
    |--------------|-------------|---|---|
    | Present      | Absent      | 1 | 0 |
    | Present      | Present     | 1 | 1 |
    | Absent       | Absent      | 0 | 0 |
    | Absent       | Present     | 0 | 1 |
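The core of the projection step described above is a weighted vote over the source tags. The sketch below illustrates that idea for a single target token; the function and data structures are illustrative and do not correspond to the actual internals of align.py.

```python
import random
from collections import defaultdict

def project_tag(aligned_tags, source_weights, random_fill=False):
    """Decide the POS tag of one target token from its aligned source tokens.

    aligned_tags   -- list of (source_language, upos) pairs aligned to the token
    source_weights -- {source_language: score}; equal weights when no -l files
    random_fill    -- mimics the -rf flag: break remaining ties at random
    Returns the winning tag, None for an unaligned token (cold start), or the
    list of tied candidates so that a lemma-based decision can be tried instead.
    """
    scores = defaultdict(float)
    for lang, upos in aligned_tags:
        scores[upos] += source_weights.get(lang, 1.0)
    if not scores:
        return None                                  # unaligned token
    best = max(scores.values())
    winners = [upos for upos, s in scores.items() if s == best]
    if len(winners) == 1:
        return winners[0]                            # singleton winner
    return random.choice(winners) if random_fill else winners

# Example: 'ta' and 'tur' disagree, so the language scores decide the tag.
print(project_tag([('ta', 'NOUN'), ('tur', 'VERB')],
                  {'ta': 0.2178, 'tur': 0.2475}))    # -> 'VERB'
```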
  3. training_accuracy.py

    This file is used to calculate the accuracy of the generated conllu files. For the --true and --generated argument pair, the files are checked line by line for matching UPOS values, keeping the tokenisation constant. Note that the --true argument usually takes the UDPipe-tagged conllu file. The reported scores are out of 100, expressed as a percentage. The file can be used as follows:

    python3 training_accuracy.py --true <true (UDPIPE) tagged conllu file> --generated <program generated conllu file>

    The output result is in the following format:

    <program generated conllu file> <tab> Accuracy_score

    The above-mentioned Accuracy_score is the value reported in the Train_Accuracy column for each model, as given in the Statistics section.
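A minimal sketch of the line-by-line UPOS comparison described above (the actual training_accuracy.py may differ; this version only assumes that both files use identical tokenisation, as stated above, and that UPOS sits in the standard fourth CoNLL-U column):

```python
import sys

def upos_accuracy(true_path, generated_path):
    """Percentage of token lines whose UPOS (4th CoNLL-U column) matches."""
    matched = total = 0
    with open(true_path, encoding='utf-8') as true_file, \
         open(generated_path, encoding='utf-8') as generated_file:
        for true_line, generated_line in zip(true_file, generated_file):
            if not true_line.strip() or true_line.startswith('#'):
                continue                             # skip blanks and comments
            true_cols = true_line.rstrip('\n').split('\t')
            generated_cols = generated_line.rstrip('\n').split('\t')
            if len(true_cols) < 4 or len(generated_cols) < 4:
                continue
            total += 1
            matched += true_cols[3] == generated_cols[3]
    return 100.0 * matched / total if total else 0.0

if __name__ == '__main__':
    true_path, generated_path = sys.argv[1], sys.argv[2]
    print(f'{generated_path}\t{upos_accuracy(true_path, generated_path)}')
```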

  4. clean_conllu.py

    This file is used to clean up values of the form weight(in float)*POS that creep in as a result of improper cleaning while generating the 0x conllu files. Note that although these unclean values affect the training and testing accuracy, they do not affect the second part of the pipeline (the x in the 0x model, as elaborated above). The reason is that while the values are erroneously copied into the output-writing list, they do not affect the values in the stored variables, and thus leave the second part of the pipeline untouched. The file can be used as follows:

    python3 clean_conllu.py <input conllu files>

    The output format is as follows:

    input_conllu_file1: <tab> number_of_patterns_encountered
    input_conllu_file2: <tab> number_of_patterns_encountered

    If the number_of_patterns_encountered value is non-zero, a new file with _final appended to the file name will be created, cleaned of the encountered patterns.
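A minimal sketch of this cleaning step, assuming the stray values look like 0.37*NOUN inside a CoNLL-U column (the exact pattern handled by the repository's clean_conllu.py may differ):

```python
import re
import sys

# Assumed shape of the stray values: <float>*<UPOS>, e.g. "0.37*NOUN".
# The substitution keeps only the POS part.
PATTERN = re.compile(r'\d+(?:\.\d+)?\*([A-Z]+)')

for path in sys.argv[1:]:
    hits = 0
    cleaned = []
    with open(path, encoding='utf-8') as conllu:
        for line in conllu:
            new_line, n = PATTERN.subn(r'\1', line)
            hits += n
            cleaned.append(new_line)
    print(f'{path}:\t{hits}')
    if hits:                         # only then is a *_final file written
        with open(path + '_final', 'w', encoding='utf-8') as out:
            out.writelines(cleaned)
```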

Statistics

  • The values in the Language Similarity Scores table were calculated using wals.py from here, as mentioned above. The maximum similarity of a language can be 1. The table shows similarity scores only for the languages kept after looking at the alignment loss percentages. These values can also be found in the lang_scores file in the language's folder.

  • The values in the Alignment Accuracy table were calculated using the accuracy.py file from here. The file calculates the accuracy of the alignments of the source-target parallel data produced with the mGIZA tool, available from here.

  • The values in the parallel file were generated using the parallel_data_accuracy.py file from here and record the quality of the parallel corpus created in this case. The values are expressed as a factor of 1, rather than out of 100 as in the original file, to maintain the format of the input scores.

  • The values in the Train_Accuracy column were generated by using training_accuracy.py file here.

  • The following are the details of the various runs (a sketch of the score combination used in Run6 and Run7 is given after this list):

    | Run ID | Parameter1 | Parameter2 | Parameter3 | Operation | Languages Used |
    |--------|------------|------------|------------|-----------|----------------|
    | Run1 | - | - | - | - | tur |
    | Run2 | - | - | - | - | ta |
    | Run3 | - | - | - | - | ta, tur |
    | Run4 | Language Similarity Scores (max = 1) | - | - | Normalized | ta, tur |
    | Run5 | Alignment Accuracy (max = 1) | - | - | Normalized | ta, tur |
    | Run6 | Language Similarity Scores (max = 1) | Alignment Accuracy (max = 1) | - | Normalized harmonic mean of Parameter1 and Parameter2 | ta, tur |
    | Run7 | Language Similarity Scores (max = 1) | Alignment Accuracy (max = 1) | 1 - Parallel Data Loss Percentage (max = 1) | Normalized harmonic mean of Parameter1 and Parameter3 * Parameter2 | ta, tur |
  • Time format: Hours:Minutes:Seconds.microseconds

  • All time values were measured by running on a single CPU, single processor on metacentrum, without GPU acceleration.
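The score combinations used in Run6 and Run7 reduce to a normalized harmonic mean of per-language values; a minimal sketch of that arithmetic, plugging in the numbers from the Run Details section below (the helper functions are illustrative, not the repository's code):

```python
def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b else 0.0

def normalize(scores):
    """Scale a {language: score} dict so that the values sum to 1."""
    total = sum(scores.values())
    return {lang: score / total for lang, score in scores.items()}

wals_sim = {'ta': 0.21782178217821782, 'tur': 0.24752475247524752}  # Parameter1
align_acc = {'ta': 0.38063, 'tur': 0.35652}                         # Parameter2
parallel_keep = {'ta': 1 - 0.35523, 'tur': 1 - 0.35399}             # Parameter3

run6 = normalize({lang: harmonic_mean(wals_sim[lang], align_acc[lang])
                  for lang in wals_sim})
run7 = normalize({lang: harmonic_mean(wals_sim[lang],
                                      parallel_keep[lang] * align_acc[lang])
                  for lang in wals_sim})
print('Run6 weights:', run6)
print('Run7 weights:', run7)
```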

Run Details

  • Language Similarity Scores

    | Language Code | Similarity Score |
    |---------------|------------------|
    | ta  | 0.21782178217821782 |
    | tur | 0.24752475247524752 |

  • Alignment Accuracy

    | Alignment File | Accuracy (in %) |
    |----------------|-----------------|
    | ta_final  | 38.063 |
    | tur_final | 35.652 |

  • Parallel Corpora Accuracy Loss

    | Corpus  | Loss (in %) |
    |---------|-------------|
    | tel-ta  | 35.523 |
    | tel-tur | 35.399 |
  • Note that a subscript i in the column headers below (e.g. Train_Accuracy_i) denotes Run[i].

Run0, setting the baseline

  • Setting1: When all the data was tagged as NOUN:

    Train_Accuracy: 21.7758 %
    Test_Accuracy: 20.36 %

  • Setting2: Using English as the source language for tagging. The best entry is marked in bold; all values are in percentages:

    | Model | Train_Accuracy | Test_Accuracy |
    |-------|----------------|---------------|
    | 10 | 21.8113 | 43.49 |
    | 11 | 21.8096 | 43.35 |
    | 00 | 21.9664 | 43.77 |
    | 01 | 24.9635 | 43.91 |
  • The higher of the values for training and testing accuracy, taken individually across both settings, was chosen as the desired baseline.

  • Baseline Scores

    Train_Accuracy: 21.97 %
    Test_Accuracy: 43.91 %

Run [1,2], single-source tagging

  • Parsing accuracies in %, all runs using exactly 1 language. Bold face indicates the best entry across the row (over all runs), whereas the best entry within a run is marked by a superscripted '#' symbol.

    | Model | Train_Accuracy1 | Test_Accuracy1 | Train_Accuracy2 | Test_Accuracy2 |
    |-------|-----------------|----------------|-----------------|----------------|
    | 10 | 38.6307 | 46.81 | 41.3313 | 53.46 |
    | 11 | 38.6532 | 47.78 | 41.4073 | 53.88 |
    | 00 | 38.7707 | 47.51 | 41.3822 | 55.26# |
    | 01 | 39.3176# | 49.45# | 41.5376# | 54.99 |

Run [3, 4, 5, 6, 7], multiple-source tagging, and the effect of parameters

  • Parsing Time Details for Run3

    | Model | Part1 Time | Part2 Time | Outputs Time | Ratio | % Unfilled | Train_Accuracy | Test_Accuracy |
    |-------|------------|------------|--------------|-------|------------|----------------|---------------|
    | 10 | 00:00:02.558072 | 000:00:06.289822 | 05:16:10.029413 | 008802 / 568972 | 01.5470 % | 44.4567 % | 56.65 % |
    | 11 | 00:00:02.206317 | 098:02:39.560497 | 05:13:34.968975 | 568538 / 568972 | 99.9237 % | 44.4318 % | 57.34 % |
    | 00 | 67:21:20.843114 | 000:00:05.366652 | 05:16:28.812640 | 016764 / 568972 | 02.9464 % | 44.2461 % | 57.62 % |
    | 01 | 62:11:16.775110 | 117:28:12.942471 | 05:47:27.038786 | 568589 / 568972 | 99.9327 % | 44.1444 % | 57.20 % |
  • Parsing accuracies in %, all runs using more than 1 language. Bold face indicates the best entry across the row (over all runs), whereas the best entry within a run is marked by a superscripted '#' symbol.

    | Model | Train_Accuracy3 | Test_Accuracy3 | Train_Accuracy4 | Test_Accuracy4 | Train_Accuracy5 | Test_Accuracy5 | Train_Accuracy6 | Test_Accuracy6 | Train_Accuracy7 | Test_Accuracy7 |
    |-------|-----------------|----------------|-----------------|----------------|-----------------|----------------|-----------------|----------------|-----------------|----------------|
    | 10 | 44.4567# | 56.65 | 44.3386# | 56.37 | 44.3478# | 56.51 | 44.3376 | 56.79# | 43.9031# | 55.96 |
    | 11 | 44.4318 | 57.34 | 44.3301 | 56.23 | 44.2890 | 56.65 | 44.3666# | 56.37 | 43.6192 | 55.54 |
    | 00 | 44.2461 | 57.62# | 44.2729 | 56.79 | 44.3206 | 56.79 | 44.2173 | 56.09 | 43.6431 | 56.37# |
    | 01 | 44.1444 | 57.20 | 44.1538 | 57.06# | 44.1204 | 56.93# | 44.2267 | 56.23 | 43.6376 | 55.68 |

Results and Discussions

Note that the Test_Accuracy values mentioned above are based on UDPipe models trained on UDv2.2 treebanks. The values are reported against this fixed version in order to standardize them and make them reproducible.

General Observations: We can see that all the models, irrespective of the number of sources used and the parameters selected, far exceed the set baseline. So, source language selection matters for cross-lingual tagging. If we organize the runs from the multi-source tagging part in decreasing order of their average test scores, we get the following hierarchy:

Run3 > Run5 > Run4 > Run6 > Run7

However, a t-test determines that there is no significant difference in the performance of the different runs. Since we have too few degrees of freedom for a reliable t-test analysis, we additionally look at the average test-score metric.

Single-source vs Multi-source: Looking at the scores, we notice that the test accuracy scores for multi-source tagging in the worst-case scenario (Run7) were still better than the best of the test accuracy scores for single-source tagging (p < 0.05). This shows that multiple source languages, when combined, can result in higher test accuracy scores than when they are used individually.

0x vs 1x: Since it would be careless to deduce anything just by looking at the data, we perform a t-test to determine whether the two groups perform equally well. We see that the p-value is > 0.05, implying that there is no statistically clear winner between the two. Still, looking at the data, the 0x models tend to perform better than the 1x models. This is rather intuitive, since methodical selection should perform better than random sampling.

x0 vs x1: Here as well, we perform a t-test to determine whether the two groups perform equally well. We see that the p-value is > 0.05, and so there is no statistical difference between the performance of the two models.
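For reference, a paired t-test of the kind referred to above can be run with scipy over the test-accuracy columns of Runs 3-7; the grouping of models into 0x and 1x and the pairing scheme below are assumptions, since the exact setup of the author's test is not specified:

```python
# Illustrative paired t-test over the multi-source test accuracies (Runs 3-7),
# comparing the 0x models against the 1x models.  Requires scipy; pairing a
# 0Y model with the 1Y model of the same run is an assumption about the setup.
from scipy.stats import ttest_rel

test_acc = {                 # values copied from the Run [3..7] table above
    '10': [56.65, 56.37, 56.51, 56.79, 55.96],
    '11': [57.34, 56.23, 56.65, 56.37, 55.54],
    '00': [57.62, 56.79, 56.79, 56.09, 56.37],
    '01': [57.20, 57.06, 56.93, 56.23, 55.68],
}

zero_x = test_acc['00'] + test_acc['01']      # models whose X digit is 0
one_x = test_acc['10'] + test_acc['11']       # models whose X digit is 1
statistic, p_value = ttest_rel(zero_x, one_x)
print(f't = {statistic:.3f}, p = {p_value:.3f}')
```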

Parameter Selection: Here are the main takeaways from a quick glance at the general observations:

  • Assigning equal weights to all languages works better than associating similarity-based weights with them.
  • When weights are assigned, the alignment-accuracy-score-based weighting is only very slightly better than the WALS-similarity-score-based weighting.
  • A combination of the two weights performs worse than either individual weighting scheme.
  • Adding the third weight performed the worst.

Let us try to address each of these separately, starting with the third weight, which is very similar to the second weight (used in Run7). We observe that piling on this additional weight lowered the test accuracy scores. A possible explanation is that, since the two weights are very similar, they need to be combined in a better way than a simple harmonic mean. Erroneous weight combination could also be a problem in the other runs, and might be responsible for bringing the scores down (Run6 compared to Run4 and Run5).

The first two takeaways might be related. While it is interesting to see that the test accuracy scores are lower for the language with the higher WALS score (Run1 and Run2), it is also important to note that changing the source language from en to ta and tur did have an improving effect. Due to the rudimentary implementations of both the WALS-similarity-score and the alignment-accuracy-score metrics, there is a good chance that the values could be pushed up if these metrics were improved; we discuss this further in the next section. Also, as previously mentioned, the lower score of the combined weighting compared to the individual weights could again be attributed to a faulty measure of weight combination. This is another issue that needs to be investigated further.

Future Directions

Looking at the results, there are 3 major issues here.

The primary issue is the non-agreement of the computed WALS scores with the tagging results. The WALS data might be misleading in this case, or we might be computing the scores wrongly; either way, this needs to be determined. Intuitively, WALS data should reflect similarity between languages; however, the many null/unfilled values for different columns in the original data make this a loophole that needs to be covered. Also, the different fields in the WALS data (columns in the tsv file) can be grouped by certain features. In future, it would be interesting to see how different fields affect such cross-lingual tasks, and whether some of them can be fixed for the tagging and/or parsing task(s) in particular.

The second issue is alignment accuracy. The current script is rudimentary in the sense that it computes the alignment accuracies based on the numeric data on the source and the target side. Even this is not foolproof, as either side could express the numerals as words (for example, 4 vs. four); the current script doesn't account for such cases. This is an important metric for determining whether the data can be projected successfully from the source to the target side. Consider the case of two alignments with accuracies of 80 % and 20 % respectively: the former should be able to generate more reliable projections than the latter, simply because of the information available. It would be interesting to study the effect of different alignment accuracies of the corpora, and whether the metric for calculating the accuracy score can be improved. The same extends to the calculation of parallel data quality. If the parallel data is generated manually, as in our case here, it is liable to be error-prone, thereby making it less reliable.
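A minimal sketch of such a numeric-overlap heuristic, illustrating the digit-vs-word weakness mentioned above (this is an illustration of the idea, not the repository's accuracy.py):

```python
import re

NUMERIC = re.compile(r'\d+(?:[.,]\d+)?')

def numeric_agreement(source_sentence, target_sentence):
    """1.0 if both sides contain the same set of digit strings, else 0.0.

    Returns None when neither side contains digits, i.e. the pair gives no
    evidence either way.
    """
    source_numbers = set(NUMERIC.findall(source_sentence))
    target_numbers = set(NUMERIC.findall(target_sentence))
    if not source_numbers and not target_numbers:
        return None
    return 1.0 if source_numbers == target_numbers else 0.0

def alignment_accuracy(sentence_pairs):
    """Percentage of evidential sentence pairs whose numeric tokens agree."""
    scores = [numeric_agreement(s, t) for s, t in sentence_pairs]
    scores = [score for score in scores if score is not None]
    return 100.0 * sum(scores) / len(scores) if scores else 0.0

# The weakness described above: a spelled-out numeral counts as a mismatch.
print(numeric_agreement('I bought 4 books', 'I bought four books'))   # -> 0.0
```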

The final issue is deciding which of the WALS similarity score and the alignment-accuracy score should be the determining component. We saw how the run with the alignment-accuracy score gave generally higher test accuracy, but owing to the "faulty" implementation of the WALS scores, this can be misleading. If we know for sure that the system benefits from knowing both the similarity of the languages it is offered and the alignment-accuracy score, a combination of the two should definitely help pick more suitable tags. Thus, there might be a better way to combine the two scores than just their harmonic mean. This could be a future project worth looking into.