
Setup

This repository is tested on Python 3.8+. First, create and activate a virtual environment:

python3 -m venv .venv/gdsr
source .venv/gdsr/bin/activate

Then, you can install all dependencies:

pip install -r requirements.txt

Preliminaries

Domain-adaptive pretraining

Because the article encoder must handle highly specific legal language, we continue pre-training a CamemBERT checkpoint on the BSARD statutory articles to adapt it to the target legal domain. You can use the following command to perform the domain adaptation:

bash scripts/run_mlm_training.sh
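At the core of this MLM continued pretraining is BERT-style dynamic masking. As a minimal, self-contained sketch (not the repository's script, and not the Hugging Face data collator it presumably relies on; the `MASK` and `VOCAB` constants are illustrative):

```python
import random

MASK = "<mask>"
VOCAB = ["le", "droit", "article", "loi", "juge"]  # illustrative mini-vocabulary

def mask_tokens(tokens, mlm_prob=0.15, rng=None):
    """BERT-style dynamic masking: each selected token is replaced by
    <mask> 80% of the time, by a random token 10%, and kept unchanged 10%.
    Returns the corrupted inputs and the labels the model must recover."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_prob:
            labels.append(tok)  # the loss is computed on this position
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # ignored by the loss
    return inputs, labels
```

Because the masks are re-drawn at every epoch, the encoder sees different corruptions of the same article, which is what makes continued pretraining on a small statutory corpus effective.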

Domain-specific augmentation

We propose to augment BSARD with synthetic domain-targeted queries using an mT5 model fine-tuned on general-domain data from the French mMARCO dataset. You can use the following command to perform the data augmentation:

bash scripts/augment_data.sh $method

where $method is either "back-translation" or "query-generation".

Hard negatives generation

In addition to in-batch negatives, we also use BM25 negatives during training as hard negatives, i.e., the top articles returned by BM25 that are not relevant to the question. To generate a set of five BM25 negatives for each query, you can run the following command:

bash scripts/generate_bm25_negatives.sh
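The selection logic can be sketched with a self-contained BM25 scorer: rank all articles for a query, then keep the top-scoring ones that are not labelled relevant. This is an illustration under simplified assumptions (whitespace-tokenised text, integer article ids), not the repository's implementation:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of every document in `corpus` (lists of tokens)
    against `query` (a list of tokens)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter(t for d in corpus for t in set(d))  # document frequencies
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

def bm25_hard_negatives(query, corpus, relevant_ids, n=5):
    """Top-n BM25 articles that are NOT labelled relevant to the query."""
    scores = bm25_scores(query, corpus)
    ranked = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)
    return [i for i in ranked if i not in relevant_ids][:n]
```

Such negatives are "hard" precisely because BM25 ranks them highly on lexical overlap even though they do not answer the question.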

Legislative graph creation

You can create the legislative graph of BSARD by running the following command:

bash scripts/create_graph.sh
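Conceptually, the legislative graph links each article to the headings it sits under in the hierarchy of Belgian codes. A minimal sketch, assuming each article comes with its heading path (the exact node types used by the script may differ):

```python
def build_legislative_graph(articles):
    """Build an undirected adjacency map linking each article to its
    parent heading, and each heading to its own parent, so that articles
    under the same heading become two hops apart.

    `articles` maps an article id to its heading path,
    e.g. "art. 7" -> ("Code civil", "Livre I", "Titre II")."""
    edges = set()
    for art_id, path in articles.items():
        nodes = list(path) + [art_id]
        for parent, child in zip(nodes, nodes[1:]):
            edges.add((parent, child))
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj
```

The resulting structure is what the graph encoder later traverses to pull contextual information into each article's representation.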

Training

To train G-DSR, you first train the dense statute retriever (DSR), which learns low-dimensional embedding spaces for questions and articles in which relevant question-article pairs lie closer together than irrelevant ones. You can perform the training by running:

bash scripts/run_contrastive_training.sh biencoder

Once the dense retriever is trained, we use it to train the legislative graph encoder (LGE), which enriches the article representations produced by the trained retriever's article encoder by fusing in information from the legislative graph. You can perform the training by running:

bash scripts/run_contrastive_training.sh gnn
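The fusion the LGE performs can be pictured as message passing over the legislative graph. As a toy single-step sketch with mean-neighbour aggregation and a residual mix (the actual GNN architecture is more sophisticated; `alpha` is illustrative):

```python
def fuse_article_embeddings(article_embs, adj, alpha=0.5):
    """One message-passing step: blend each node's embedding with the
    mean of its graph neighbours' embeddings.

    `article_embs` maps node id -> embedding (list of floats);
    `adj` maps node id -> iterable of neighbour ids."""
    fused = {}
    for node, emb in article_embs.items():
        neigh = [article_embs[n] for n in adj.get(node, []) if n in article_embs]
        if not neigh:
            fused[node] = list(emb)  # isolated node: unchanged
            continue
        mean = [sum(vals) / len(neigh) for vals in zip(*neigh)]
        fused[node] = [alpha * e + (1 - alpha) * m for e, m in zip(emb, mean)]
    return fused
```

Stacking such steps lets an article's representation absorb context from its code, book, and sibling articles, which is the intuition behind enriching the frozen retriever embeddings.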

Additionally, you can perform hyperparameter tuning with Weights & Biases on both models by running the following commands:

wandb sweep src/config/<sweep_file>.yaml   #<sweep_file> in ['sweep_biencoder', 'sweep_gnn']
wandb agent <USERNAME/PROJECTNAME/SWEEPID>

Evaluation

To evaluate DSR alone, you can run the following command:

bash scripts/run_evaluation.sh biencoder

To evaluate the full G-DSR model, run:

bash scripts/run_evaluation.sh gnn
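Retrieval quality in this setting is typically reported as (macro-averaged) recall@k: for each question, the fraction of its relevant articles found in the top-k retrieved results. A minimal sketch of that metric, not necessarily the exact set reported by the evaluation script:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of a query's relevant articles present in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def macro_recall_at_k(runs, k):
    """Average recall@k over queries.
    `runs` maps a query id to a (ranked_ids, relevant_ids) pair."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs.values()) / len(runs)
```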

Baselines

We compare our approach against three strong retrieval systems: BM25, docT5query, and DPR. You can evaluate those approaches by running the following commands, respectively:

bash scripts/run_bm25.sh
bash scripts/run_doc2query.sh
bash scripts/run_evaluation.sh  # after changing BIENCODER_CKPT to the corresponding fine-tuned checkpoint

Note that you can also find the optimal BM25 hyperparameters by running:

bash scripts/tune_bm25.sh
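BM25's behaviour is governed by two hyperparameters, k1 (term-frequency saturation) and b (length normalisation), and tuning usually amounts to a grid search over validation performance. A generic sketch, where `evaluate` stands in for whatever validation metric the script optimises:

```python
from itertools import product

def tune_bm25(evaluate, k1_grid=(0.6, 0.9, 1.2), b_grid=(0.4, 0.6, 0.75)):
    """Exhaustive grid search over (k1, b). `evaluate(k1, b)` must return a
    validation score (higher is better); the best pair is returned.
    The grids shown here are illustrative defaults."""
    return max(product(k1_grid, b_grid), key=lambda pair: evaluate(*pair))
```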