Checked on Ubuntu 18.04, 64 bits.

- Create a virtualenv:
  `virtualenv -p python3.6 parsing-as-pretraining`
- Activate the virtualenv:
  `source parsing-as-pretraining/bin/activate`
- Install the required dependencies:
  `pip install -r requirements.txt`
To learn how to transform a constituent or a dependency tree into a sequence of labels, please check the README.md of the tree2labels and dep2labels repositories.
In what follows, we assume the linearized datasets are stored in `PTB-linearized/` and `EN_EWT-linearized/`. Each folder contains three files: `train.tsv`, `dev.tsv`, and `test.tsv`.
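The linearized files are plain tab-separated text: one token per line together with its label(s), with a blank line between sentences. The exact number of columns and the label strings depend on the encoding produced by tree2labels/dep2labels. A minimal reader sketch, assuming a hypothetical two-column `token<TAB>label` layout:

```python
def read_linearized_tsv(lines):
    """Group a tab-separated token/label stream into sentences.

    Assumes a (hypothetical) two-column layout token<TAB>label with blank
    lines separating sentences; the real files produced by the encoding
    scripts may carry additional columns.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split("\t")))
    if current:  # flush a trailing sentence with no final blank line
        sentences.append(current)
    return sentences
```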
Execute:

```bash
cd NCRFpp
python main.py --config $PATH_CONFIG_FILE
```
The folders `NCRFpp/const_confs/` and `NCRFpp/dep_confs/` contain example configuration files.
Parameters used to train different types of models:
- `contextualize`: [True|False] Whether to further contextualize word vectors through the NCRFpp BILSTMs
- `use_elmo`: [True|False] Run ELMo to compute the word vectors, instead of using precomputed or random representations
- `fine_tune_emb`: [True|False] Whether to fine-tune the pretraining encoder during training
- `use_char`: [True|False] Whether to use the character LSTMs supported by NCRFpp (always `False` in our work)
- `use_features`: [True|False] Whether to use features other than words that are present in the linearized dataset (always `False` in our work)
- `word_emb_dim`: Size of the word embeddings. Used when training random representations
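A minimal, hypothetical fragment combining these switches is sketched below; it is not a complete configuration file, and the real examples under `NCRFpp/const_confs/` and `NCRFpp/dep_confs/` take precedence for the exact key names and the remaining required options:

```
contextualize=True
use_elmo=False
fine_tune_emb=True
use_char=False
use_features=False
word_emb_dim=300
```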
Specific parameters to train constituent models:
```
###PathsToAdditionalScripts###
tree2labels=../tree2labels
evaluate=../tree2labels/evaluate.py
evalb=../tree2labels/EVALB/evalb
gold_dev_trees=../data/datasets/PTB/dev.trees
optimize_with_evalb=True
```
Specific parameters to train dependency models:
```
###PathsToAdditionalScripts###
dep2labels=../dep2labels
gold_dev_trees=../data/datasets/en-ewt/en_ewt-ud-dev.conllu
optimize_with_las=True
conll_ud=../dep2labels/conll17_ud_eval.py
```
Adapt the paths accordingly and run `./train_bert_model.sh`. The script assumes that the dataset lives in a single folder split into three files named `train.tsv`, `dev.tsv`, and `test.tsv`.
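Before launching training, it can be worth verifying that the dataset folder actually matches this layout. A small helper (not part of the repository) that reports any missing split file:

```python
import os

def check_dataset_dir(path):
    """Return the split files missing from a linearized-dataset folder.

    The training scripts expect the folder to contain train.tsv, dev.tsv
    and test.tsv; this is just a quick sanity check, not repo code.
    """
    required = {"train.tsv", "dev.tsv", "test.tsv"}
    return sorted(required - set(os.listdir(path)))
```

Running it on the dataset folder returns an empty list when everything is in place, e.g. `check_dataset_dir("./data/datasets/PTB-linearized")`.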
General parameter description:
- `--bert_model`: The base model used during training, e.g. `bert-base-cased`
- `--task_name`: Specifies the format of the input files (always `sl_tsv`)
- `--model_dir`: Path where to save the model
- `--max_seq_length`: Expected maximum sequence length
- `--output_dir`: Path where to store the outputs generated by the model
- `--do_train`: Activate to train the model
- `--do_eval`: Activate to evaluate the model on the dev set
- `--do_test`: Activate to run the model on the test set
- `--do_lower_case`: Lower-case the input when using an uncased model (e.g. `bert-base-uncased`)
Additional options:
- `--parsing_paradigm`: [dependencies|constituency]
- `--not_finetune`: Keep BERT weights frozen during training
- `--use_bilstms`: Flag to indicate whether to use BILSTMs before the output layer
Additional specific options for dependency parsers:
- `--path_gold_conll`: Path to the gold CoNLL file to evaluate against
Additional specific options for constituent parsers:
- `--evalb_param`: [True|False] Whether to use the COLLINS.prm parameter file to compute the bracketing F1 score
- `--path_gold_parenthesized`: Path to the gold parenthesized trees to evaluate against
Example (square brackets mark optional flags):

```bash
python run_token_classifier.py \
  --data_dir ./data/datasets/PTB-linearized/ \
  --bert_model bert-base-cased \
  --task_name sl_tsv \
  --model_dir /tmp/bert.finetune.linear.model \
  --output_dir /tmp/dev.bert.finetune.linear.output \
  --path_gold_parenthesized ../data/datasets/PTB/dev.trees \
  --parsing_paradigm constituency \
  --do_train --do_eval \
  --num_train_epochs 15 \
  --max_seq_length 250 [--use_bilstms] [--not_finetune]
```
Adapt the paths and run the scripts `./run_const_ncrfpp.sh` (constituents) and `./run_dep_ncrfpp.sh` (dependencies).
Adapt the paths and model names accordingly and execute `./run_token_classifier.sh`.
Example for constituency parsing:

```bash
python run_token_classifier.py \
  --data_dir ./data/datasets/PTB-linearized/ \
  --bert_model bert-base-cased \
  --task_name sl_tsv \
  --model_dir ./data/bert_models_const/bert.const.finetune.linear \
  --output_dir ./data/outputs_const/test.bert.finetune.linear.output \
  --evalb_param True \
  --max_seq_length 250 \
  --path_gold_parenthesized ./data/datasets/PTB/test.trees \
  --parsing_paradigm constituency --do_test [--use_bilstms]
```
Example for dependency parsing:

```bash
python run_token_classifier.py \
  --data_dir ./data/datasets/EN_EWT-pred-linearized \
  --bert_model bert-base-cased \
  --task_name sl_tsv \
  --model_dir ./data/bert_models_dep/bert.dep.finetune.linear \
  --output_dir ./data/outputs_dep/test.bert.finetune.linear.output \
  --path_gold_conll ./data/datasets/en-ewt/en_ewt-ud-test.conllu \
  --max_seq_length 350 \
  --parsing_paradigm dependencies --do_test [--use_bilstms]
```
Note: remember to also pass `--do_lower_case` if you trained an uncased model.
Use `python evaluate_spans.py [--predicted] [--gold]` to show some charts referring to the constituent experiments:

- `--predicted`: Path to the directory containing the files (each of them in PTB parenthesized format) for which to plot the charts
- `--gold`: Path to the file containing the gold trees in PTB parenthesized format
Use `python evaluate_dependencies.py [--predicted] [--gold]` to show some charts referring to the dependency experiments:

- `--predicted`: Path to the directory containing the files (each of them a predicted CoNLL-U file) for which to plot the charts
- `--gold`: Path to the corresponding gold CoNLL-U file
Vilares, D., Strzyz, M., Søgaard, A., and Gómez-Rodríguez, C. Parsing as Pretraining. In AAAI 2020.