
# Mining Errors in Low-Resource Languages by Combining LISCA and Cross-Validation

## Contents

  1. Documentation
  2. Problem Statement
  3. Included Files
  4. Using This Module
  5. Conclusion
  6. References

## Documentation

All details of the experiment, from the problems related to the error type to the final evaluation, can be found in Chapter 6 of the thesis document.

## Problem Statement

For low-resource languages without a reference corpus, LISCA cannot be used directly. A common approach in such settings, k-fold cross-validation, is explored in this experiment. However, cross-validation alone is not enough, as the choice of the number of folds can affect the results significantly.

In this experiment, we therefore

  1. evaluate whether k-fold cross-validation is an optimal strategy compared to keeping the test and train data separate, and

  2. try to map the behaviour of the algorithm to the choice of the number of folds in the k-fold cross-validation approach (a minimal sketch of the setup follows).
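The cross-validation setup can be sketched as follows. This is an illustrative outline, not the actual code in scripts/; `score_fold` is a hypothetical stand-in for training LISCA on the k-1 training folds and scoring the arcs of the held-out fold.

```python
# Sketch of the k-fold arc-scoring loop; names are illustrative.
from sklearn.model_selection import KFold

def cross_validated_arc_scores(sentences, k, score_fold):
    """Score every sentence exactly once, as part of some held-out fold."""
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = {}
    for train_idx, test_idx in kf.split(sentences):
        train = [sentences[i] for i in train_idx]
        held_out = [sentences[i] for i in test_idx]
        # score_fold: train LISCA on `train`, return {arc: score} for `held_out`
        scores.update(score_fold(train, held_out))
    return scores
```

Arcs that end up with a score of 0 under this scheme are the candidates inspected in the manual evaluation.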

## Included Files

  1. Annotations/*: Refer to the documentation here.

  2. baseline/*: Contains the results (baseline_all.tsv) and the identified 0-scored arcs (baseline_zero.tsv) from the baseline run of LISCA. The scores are reported on the test set of the UDv2.4 hi-HDTB treebank data.

  3. CV/*: Contains the identified instances used for the manual evaluation from the CV run of LISCA. While the allArcs directory contains the 0-scored arcs from the entire dataset, the testArcs directory contains the 0-scored arcs from the test set of the UDv2.4 hi-HDTB treebank.

  4. scripts/*: Directory of *.py files used in the experiment.

  5. TARs: Directory containing the result files generated when the LISCA algorithm is run on the dataset, for both the CV run and the baseline run, in *_lisca.tar files. Also contains the dataset used for each run in *_conll.tar files.

  6. shuffleKey: The key used as the seed for shuffling in shuf commands (see the example after this list). An excerpt from /dev/zero on a Linux system; it works without problems on Linux but is somewhat unreliable on macOS.

  7. stats.md: A Markdown file summarising the results of the manual evaluation, generated from the data in the different files of the Annotations directory.
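The shuffleKey file is meant to be used with GNU shuf, which can read its randomness from a file so that shuffles become reproducible. A minimal example (input.conll is a placeholder; the exact invocations are in the Makefile and scripts):

```sh
# Reproducible shuffle: random bytes come from shuffleKey, not the system RNG.
shuf --random-source=shuffleKey input.conll > shuffled.conll
```

On macOS, shuf is not part of the base system; the GNU coreutils version (installed as gshuf via Homebrew) is the closest equivalent, which may explain the flakiness noted above.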

## Using This Module

To get started, clone this repository on your system and then run the following commands in the given order:

```sh
make getdata
```

Downloads the required dependencies using the requirements.txt file, fetches the UDv2.4 data (reference [3]), and prepares working copies of the treebanks in the current directory.
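For orientation, a rough hand-written equivalent of this target follows. The Makefile is authoritative; the download URL below is an assumption based on the UDv2.4 record cited in reference [3].

```sh
pip install -r requirements.txt   # Python dependencies
# UDv2.4 record: http://hdl.handle.net/11234/1-2988 (assumed archive URL below)
wget 'https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-2988/ud-treebanks-v2.4.tgz'
tar -xzf ud-treebanks-v2.4.tgz
cp ud-treebanks-v2.4/UD_Hindi-HDTB/*.conllu .   # working copies of hi-HDTB
```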

```sh
make process_baseline
```

Using the stored .tar files in TARs, generates the baseline directory containing the results and the identified 0-scored arcs from the baseline run of LISCA.

```sh
make process_CV
```

Using the stored .tar files in TARs, generates the CV directory containing the 0-scored arcs identified in the CV run of LISCA.

```sh
make stats
```

Runs the process_baseline and process_CV targets described above, then analyses the manually evaluated data in the Annotations directory and reports the comparative statistics in the stats.md file.
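Putting the targets together, a typical session looks like this (make stats already triggers the two process_* targets):

```sh
make getdata   # dependencies + UDv2.4 working copies
make stats     # baseline and CV processing, then writes stats.md
```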

## Conclusion

In this experiment, we narrowed the search scope from the bins used by [2] to the arcs that the algorithm considered improbable. Additionally, we found that training the algorithm with cross-validation yields no significant performance gain.

For low-resource languages with little to no reference corpus data, we tried the cross-validation approach for finding errors. We found that the choice of the number of folds in the cross-validation strategy is determined by the size of the reference corpus, and that when no reference corpus is available, the strategy can be applied to the data itself without a significant loss in the error-detection rate.

## References

  1. Felice Dell’Orletta, Giulia Venturi, and Simonetta Montemagni. Linguistically-driven Selection of Correct Arcs for Dependency Parsing. Computación y Sistemas, 17(2):125–136, 2013.

  2. Chiara Alzetta, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. Dangerous Relations in Dependency Treebanks. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories, pages 201–210, Prague, Czech Republic, 2017. URL https://www.aclweb.org/anthology/W17-7624.

  3. Joakim Nivre, Mitchell Abrams, Željko Agić, et al. Universal Dependencies 2.4. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, 2019. http://hdl.handle.net/11234/1-2988.