
MINERS: Multilingual Language Models as Semantic Retrievers


⚡ Introducing the MINERS benchmark, designed to assess the prowess of multilingual LMs in semantic retrieval tasks, including bitext mining and classification via retrieval-augmented contexts. We developed a comprehensive framework to evaluate how effectively language models retrieve samples across more than 200 diverse languages, including low-resource languages, in challenging cross-lingual (XS) and code-switching (CS) settings. Our results show that, simply by retrieving semantically similar embeddings and without any fine-tuning, it is possible to achieve performance competitive with state-of-the-art methods.


📜 Paper

This is the source code of the paper, available on arXiv: https://arxiv.org/abs/2406.07424

The code is written in PyTorch. If you use any code or datasets from this toolkit in your research, please cite the associated paper:

@article{winata2024miners,
  title={MINERS: Multilingual Language Models as Semantic Retrievers},
  author={Winata, Genta Indra and Zhang, Ruochen and Adelani, David Ifeoluwa},
  journal={arXiv preprint arXiv:2406.07424},
  year={2024}
}

📊 Benchmark

MINERS comprises 11 datasets: 7 multilingual and 4 code-switching, covering more than 200 languages in both parallel and classification formats. The parallel datasets contain aligned multilingual content, making them suited to bitext retrieval and supporting bitext mining and machine translation tasks. The classification datasets cover intent classification, sentiment analysis, and topic classification, which we evaluate in the retrieval-based and in-context learning (ICL) classification settings.

Our benchmark evaluates LMs on three tasks: bitext retrieval, retrieval-based classification, and ICL classification. The settings include monolingual (Mono), cross-lingual (XS), code-switching (CS), and cross-lingual code-switching (XS CS).

⚡ Environment Setup

❱❱❱ pip install -r requirements.txt

If you wish to use the APIs or models from OpenAI, Cohere, or Hugging Face, set OPENAI_TOKEN, COHERE_TOKEN, and HF_TOKEN accordingly. Note that most models on Hugging Face do not require HF_TOKEN; it is only needed for gated models such as Llama and Gemma.
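
For example, assuming the tokens are read from environment variables (if they are instead defined as constants in the scripts, edit them there directly):

❱❱❱ export OPENAI_TOKEN={your_openai_api_key}
❱❱❱ export COHERE_TOKEN={your_cohere_api_key}
❱❱❱ export HF_TOKEN={your_hf_access_token}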

📝 Experiment Logs

If you wish to get all results and prompt examples from our experiments, feel free to download them here (~360 MB).

🧪 Running Experiments

All experiment results will be stored in the logs/ directory. You can execute each experiment using the following commands:

Bitext Retrieval

Cross-lingual setting

❱❱❱ python bitext.py --src_lang {src_lang} --dataset {dataset} --seed {seed} --cuda --model_checkpoint {model_checkpoint}
❱❱❱ python bitext.py --src_lang de --dataset bucc --seed 42 --cuda --model_checkpoint sentence-transformers/LaBSE
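
Under the hood, bitext retrieval reduces to a nearest-neighbor search over sentence embeddings. The snippet below is a minimal sketch of that step with made-up sentences, not the repository's exact implementation:

from sentence_transformers import SentenceTransformer

# Load a multilingual sentence encoder (same checkpoint as the example above).
model = SentenceTransformer("sentence-transformers/LaBSE")

src_sents = ["Guten Morgen!", "Das ist ein Test."]   # source-language pool (made up)
tgt_sents = ["This is a test.", "Good morning!"]     # target-language pool (made up)

# Encode both sides; L2-normalized embeddings make the dot product a cosine similarity.
src_emb = model.encode(src_sents, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sents, normalize_embeddings=True)

# For each source sentence, retrieve the most similar target sentence.
sims = src_emb @ tgt_emb.T                           # (n_src, n_tgt) similarity matrix
for i, j in enumerate(sims.argmax(axis=1)):
    print(f"{src_sents[i]!r} -> {tgt_sents[j]!r} (cos={sims[i, j]:.3f})")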

Ensemble

The arguments are the same as above, except that --model_checkpoints and --weights are used to pass multiple models and their ensemble weights:

❱❱❱ python bitext.py --src_lang {src_lang} --dataset {dataset} --seed {seed} --cuda --model_checkpoints {model_checkpoint1} {model_checkpoint2} {...} --weights {weight1} {weight2} {...}
❱❱❱ python bitext.py --src_lang de --dataset bucc --seed 42 --cuda --model_checkpoints sentence-transformers/LaBSE intfloat/multilingual-e5-large --weights 0.25 0.75
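
The --weights arguments suggest a weighted combination of per-model similarity scores. As a purely illustrative sketch (the repository may combine models differently), one can ensemble the cosine-similarity matrices of several encoders:

from sentence_transformers import SentenceTransformer

checkpoints = ["sentence-transformers/LaBSE", "intfloat/multilingual-e5-large"]
weights = [0.25, 0.75]

def sim_matrix(checkpoint, src, tgt):
    # Cosine-similarity matrix between source and target pools for one encoder.
    # (Note: e5 models normally expect "query: "/"passage: " prefixes; omitted here.)
    model = SentenceTransformer(checkpoint)
    s = model.encode(src, normalize_embeddings=True)
    t = model.encode(tgt, normalize_embeddings=True)
    return s @ t.T

src = ["Guten Morgen!"]                               # made-up source sentence
tgt = ["See you later.", "Good morning!"]             # made-up target pool

# Weighted sum of the per-model similarity matrices.
ensembled = sum(w * sim_matrix(c, src, tgt) for c, w in zip(checkpoints, weights))
print(tgt[ensembled.argmax(axis=1)[0]])               # best match under the ensemble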

Retrieval-based Classification

Monolingual setting

❱❱❱ python classification.py --dataset {dataset} --seed {seed} --cuda --model_checkpoint {model_checkpoint}
❱❱❱ python classification.py --dataset nusax --seed 42 --cuda --model_checkpoint sentence-transformers/LaBSE

Cross-lingual setting

Add --src_lang and --cross to the command.

❱❱❱ python classification.py --src_lang {src_lang} --cross --dataset {dataset} --seed {seed} --cuda --model_checkpoint {model_checkpoint}
❱❱❱ python classification.py --src_lang eng --cross --dataset nusax --seed 42 --cuda --model_checkpoint sentence-transformers/LaBSE

Ensemble

The arguments are the same as above, except that --model_checkpoints and --weights are used to pass multiple models and their ensemble weights:

❱❱❱ python classification.py --dataset {dataset} --seed {seed} --cuda --model_checkpoints {model_checkpoint1} {model_checkpoint2} {...} --weights {weight1} {weight2} {...}
❱❱❱ python classification.py --dataset nusax --seed 42 --cuda --model_checkpoints sentence-transformers/LaBSE intfloat/multilingual-e5-large --weights 0.25 0.75
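
Conceptually, retrieval-based classification assigns a label to a query from the labels of its nearest neighbors in the training set. Below is a minimal sketch assuming a majority vote over the labels of the top-k retrieved examples; the texts are made up, and the repository's scripts may use a different decision rule:

from collections import Counter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

# Made-up labeled pool and query (Indonesian, in the spirit of NusaX).
train_texts = ["saya senang sekali", "filmnya buruk", "biasa saja"]
train_labels = ["positive", "negative", "neutral"]
query = "aku sangat bahagia"

train_emb = model.encode(train_texts, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)

# Majority vote over the labels of the k most similar training examples.
k = 1
sims = (query_emb @ train_emb.T)[0]
top_k = sims.argsort()[::-1][:k]
print(Counter(train_labels[i] for i in top_k).most_common(1)[0][0])  # "positive"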

ICL Classification

Monolingual setting

❱❱❱ python icl.py --dataset {dataset} --seed {seed} --instruction {instruction} --model_checkpoint {model_checkpoint} --gen_model_checkpoint {gen_model_checkpoint} --cuda --load_in_8bit --k {k}
❱❱❱ python icl.py --dataset nusax --seed 42 --instruction "Generate a sentiment label for a given input.\nPlease only output the label." --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3-8B-Instruct --cuda --load_in_8bit --k 1

Cross-lingual setting

Add --src_lang and --cross to the command.

❱❱❱ python icl.py --src_lang {src_lang} --cross --dataset {dataset} --seed {seed} --instruction {instruction} --model_checkpoint {model_checkpoint} --gen_model_checkpoint {gen_model_checkpoint} --cuda --load_in_8bit --k {k}
❱❱❱ python icl.py --src_lang eng --cross --dataset nusax --seed 42 --instruction "Generate a sentiment label for a given input.\nPlease only output the label." --model_checkpoint sentence-transformers/LaBSE --gen_model_checkpoint meta-llama/Meta-Llama-3-8B-Instruct --cuda --load_in_8bit --k 1
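
Conceptually, the k retrieved neighbors are formatted as in-context demonstrations that precede the query in the prompt sent to the generative model. The template below is a hypothetical sketch; the exact format used in icl.py may differ:

# Hypothetical prompt template built from the retrieved neighbors.
instruction = "Generate a sentiment label for a given input.\nPlease only output the label."
retrieved = [("saya senang sekali", "positive")]   # top-k neighbors from the retriever
query = "aku sangat bahagia"

prompt = instruction + "\n\n"
for text, label in retrieved:
    prompt += f"Input: {text}\nLabel: {label}\n\n"
prompt += f"Input: {query}\nLabel:"

# `prompt` is then passed to the generative model
# (e.g., meta-llama/Meta-Llama-3-8B-Instruct) to produce the label.
print(prompt)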

📈 Aggregating Experiment Results

Add --k to modify the number of retrieved samples.

❱❱❱ python script/aggregate/aggregate_bitext_mining.py --k {k}
❱❱❱ python script/aggregate/aggregate_classification.py --k {k}
❱❱❱ python script/aggregate/aggregate_classification_cross.py --k {k}
❱❱❱ python script/aggregate/aggregate_icl.py --k {k}
❱❱❱ python script/aggregate/aggregate_icl_cross.py --k {k}
❱❱❱ python script/aggregate/aggregate_icl_percentile.py --k {k}

🏞️ Visualizing the Embeddings

❱❱❱ python visualize.py --model_checkpoint {model_checkpoint} --dataset {dataset} --seed {seed} --cuda
❱❱❱ python visualize.py --model_checkpoint sentence-transformers/LaBSE --dataset nusax --seed 42 --cuda
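
For reference, the sketch below shows one way to produce such a 2-D plot, using t-SNE from scikit-learn; this is an assumption for illustration only, as visualize.py may use a different projection, and the texts and labels are made up:

import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

model = SentenceTransformer("sentence-transformers/LaBSE")

# Made-up texts with two class labels, just to illustrate the plot.
texts = ["saya senang", "aku bahagia", "filmnya buruk", "sangat jelek"]
labels = [0, 0, 1, 1]

emb = model.encode(texts)
# Project the high-dimensional embeddings down to 2-D for plotting.
xy = TSNE(n_components=2, perplexity=2, random_state=42).fit_transform(emb)
plt.scatter(xy[:, 0], xy[:, 1], c=labels)
plt.savefig("embeddings.png")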

Examples of the visualization by class labels: LaBSE (left) and XLM-R BASE (right)

Examples of the visualization by sample ID: LaBSE (left) and XLM-R BASE (right)

💻 Models Support

Our codebase supports multiple models for the experiments, providing flexibility for customization beyond the list shown below:

Encoder LMs and APIs

Open-source LMs:

Commercial embedding APIs (last tested as of June 2024):

  • Cohere-Embedv3
  • OpenAI-Embedv3

Generative LMs:

🚀 How to Contribute?

Feel free to create an issue if you have any questions, and open a PR to fix bugs or add improvements (e.g., new datasets or models).

If you are interested in creating an extension of this work, feel free to reach out to us!

Support our open source effort ⭐

Work in Progress

We are improving the code to make it more user-friendly and customizable. We have also created a new repository implementing DistFuse, available at https://github.com/gentaiscool/distfuse/; you can install it by running pip install distfuse. It will later be integrated into this repository.