Checked on Ubuntu 18.04, 64 bits.

- Create a virtualenv:
  `virtualenv -p python3.6 parsing-as-pretraining`
- Activate the virtualenv:
  `source parsing-as-pretraining/bin/activate`
- Install the required dependencies:
  `pip install -r requirements.txt`
To learn how to transform a constituent or a dependency tree into a sequence of labels, please check the README.md of the tree2labels and dep2labels repositories.
In what follows, we assume the linearized datasets are stored in `PTB-linearized/` and `EN_EWT-linearized/`. Each folder contains three files: `train.tsv`, `dev.tsv`, and `test.tsv`.
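The linearized files are plain tab-separated text: one token per line together with its label(s), with a blank line between sentences. The exact number of columns and the label strings depend on the encoding produced by tree2labels/dep2labels. A minimal reader sketch, assuming a hypothetical two-column `token<TAB>label` layout:

```python
def read_linearized_tsv(lines):
    """Group a tab-separated token/label stream into sentences.

    Assumes a (hypothetical) two-column layout token<TAB>label with blank
    lines separating sentences; the real files produced by the encoding
    scripts may carry additional columns.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split("\t")))
    if current:  # flush a trailing sentence with no final blank line
        sentences.append(current)
    return sentences
```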
Execute:

```bash
cd NCRFpp
python main.py --config $PATH_CONFIG_FILE
```
The folders `NCRFpp/const_confs/` and `NCRFpp/dep_confs/` contain example configuration files.
Parameters used to train different types of models:
- `contextualize`: [True|False] Whether to further contextualize word vectors through the NCRFpp BILSTMs
- `use_elmo`: [True|False] Run ELMo to compute the word vectors, instead of using precomputed or random representations
- `fine_tune_emb`: [True|False] Whether to fine-tune the pretraining encoder during training
- `use_char`: [True|False] Whether to use the character LSTMs supported by NCRFpp (always `False` in our work)
- `use_features`: [True|False] Whether to use features other than words that are present in the linearized dataset (always `False` in our work)
- `word_emb_dim`: Size of the word embeddings. Used when training random representations
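A minimal, hypothetical fragment combining these switches is sketched below; it is not a complete configuration file, and the real examples under `NCRFpp/const_confs/` and `NCRFpp/dep_confs/` take precedence for the exact key names and the remaining required options:

```
contextualize=True
use_elmo=False
fine_tune_emb=True
use_char=False
use_features=False
word_emb_dim=300
```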
Specific parameters to train constituent models:
```
###PathsToAdditionalScripts###
tree2labels=../tree2labels
evaluate=../tree2labels/evaluate.py
evalb=../tree2labels/EVALB/evalb
gold_dev_trees=../data/datasets/PTB/dev.trees
optimize_with_evalb=True
```
Specific parameters to train dependency models:
```
###PathsToAdditionalScripts###
dep2labels=../dep2labels
gold_dev_trees=../data/datasets/en-ewt/en_ewt-ud-dev.conllu
optimize_with_las=True
conll_ud=../dep2labels/conll17_ud_eval.py
```
Adapt the paths accordingly and run `./train_bert_model.sh`. The script assumes that the dataset lives in a single folder split into three files named `train.tsv`, `dev.tsv`, and `test.tsv`.
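Before launching training, it can be worth verifying that the dataset folder actually matches this layout. A small helper (not part of the repository) that reports any missing split file:

```python
import os

def check_dataset_dir(path):
    """Return the split files missing from a linearized-dataset folder.

    The training scripts expect the folder to contain train.tsv, dev.tsv
    and test.tsv; this is just a quick sanity check, not repo code.
    """
    required = {"train.tsv", "dev.tsv", "test.tsv"}
    return sorted(required - set(os.listdir(path)))
```

Running it on the dataset folder returns an empty list when everything is in place, e.g. `check_dataset_dir("./data/datasets/PTB-linearized")`.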
General parameter description:
- `--bert_model`: The base model used during training, e.g. `bert-base-cased`
- `--task_name`: Specifies the format of the input files (always `sl_tsv`)
- `--model_dir`: Path where to save the model
- `--max_seq_length`: Expected maximum sequence length
- `--output_dir`: Path where to store the outputs generated by the model
- `--do_train`: Activate to train the model
- `--do_eval`: Activate to evaluate the model on the dev set
- `--do_test`: Activate to run the model on the test set
- `--do_lower_case`: Lower-case the input when using an uncased model (e.g. `bert-base-uncased`)
Additional options:
- `--parsing_paradigm`: [dependencies|constituency]
- `--not_finetune`: Keep BERT weights frozen during training
- `--use_bilstms`: Flag to indicate whether to use BILSTMs before the output layer
Additional specific options for dependency parsers:
- `--path_gold_conll`: Path to the gold CoNLL file to evaluate against
Additional specific options for constituent parsers:
- `--evalb_param`: [True|False] Whether to use the COLLINS.prm parameter file to compute the bracketing F1 score
- `--path_gold_parenthesized`: Path to the gold parenthesized trees to evaluate against
Example (square brackets mark optional flags):

```bash
python run_token_classifier.py \
  --data_dir ./data/datasets/PTB-linearized/ \
  --bert_model bert-base-cased \
  --task_name sl_tsv \
  --model_dir /tmp/bert.finetune.linear.model \
  --output_dir /tmp/dev.bert.finetune.linear.output \
  --path_gold_parenthesized ../data/datasets/PTB/dev.trees \
  --parsing_paradigm constituency \
  --do_train --do_eval \
  --num_train_epochs 15 \
  --max_seq_length 250 [--use_bilstms] [--not_finetune]
```
Adapt the paths and run the scripts `./run_const_ncrfpp.sh` (constituents) and `./run_dep_ncrfpp.sh` (dependencies).
Adapt the paths and model names accordingly and execute `./run_token_classifier.sh`.
Example for constituency parsing:

```bash
python run_token_classifier.py \
  --data_dir ./data/datasets/PTB-linearized/ \
  --bert_model bert-base-cased \
  --task_name sl_tsv \
  --model_dir ./data/bert_models_const/bert.const.finetune.linear \
  --output_dir ./data/outputs_const/test.bert.finetune.linear.output \
  --evalb_param True \
  --max_seq_length 250 \
  --path_gold_parenthesized ./data/datasets/PTB/test.trees \
  --parsing_paradigm constituency --do_test [--use_bilstms]
```
Example for dependency parsing:

```bash
python run_token_classifier.py \
  --data_dir ./data/datasets/EN_EWT-pred-linearized \
  --bert_model bert-base-cased \
  --task_name sl_tsv \
  --model_dir ./data/bert_models_dep/bert.dep.finetune.linear \
  --output_dir ./data/outputs_dep/test.bert.finetune.linear.output \
  --path_gold_conll ./data/datasets/en-ewt/en_ewt-ud-test.conllu \
  --max_seq_length 350 \
  --parsing_paradigm dependencies --do_test [--use_bilstms]
```
Note: remember to also pass `--do_lower_case` if you trained an uncased model.
Use `python evaluate_spans.py [--predicted] [--gold]` to show some charts referring to the constituent experiments:

- `--predicted`: Path to the directory containing the files (each of them in PTB parenthesized format) for which to plot the charts
- `--gold`: Path to the file containing the gold trees in PTB parenthesized format
Use `python evaluate_dependencies.py [--predicted] [--gold]` to show some charts referring to the dependency experiments:

- `--predicted`: Path to the directory containing the files (each of them a predicted CoNLL-U file) for which to plot the charts
- `--gold`: Path to the corresponding gold CoNLL-U file
Vilares, D., Strzyz, M., Søgaard, A., and Gómez-Rodríguez, C. Parsing as Pretraining. In AAAI 2020.