llm-embedding

Overview

This project trains an embedding model for semantic retrieval in the astrophysics domain by repurposing Qwen2.5 (a decoder-only LM, used here as a text encoder without its LM head). The goal is query-to-document retrieval: given a short query (typically a paper title), retrieve the most relevant astrophysics documents from a corpus of arXiv articles. Training uses a contrastive InfoNCE loss, and evaluation compares the result against OpenAI's general-purpose embedding model.

Dataset

Data is pulled from the arXiv API across several categories (see the fetching sketch after this list):

  • Main topic: gr-qc (general relativity and quantum cosmology)
  • Soft negatives: hep-th, astro-ph.CO
  • Hard negatives: math-ph, cond-mat.stat-mech
  • Cross-domain negatives: cs.LG, cs.AI, stat.ML
  • Different topics: q-bio
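
A minimal sketch of pulling these categories with the community arxiv package (the package choice, the max_results value, and the helper name are assumptions, not the notebook's exact code):

```python
import arxiv

CATEGORIES = [
    "gr-qc",                          # main topic
    "hep-th", "astro-ph.CO",          # soft negatives
    "math-ph", "cond-mat.stat-mech",  # hard negatives
    "cs.LG", "cs.AI", "stat.ML",      # cross-domain negatives
    "q-bio",                          # different topics
]

client = arxiv.Client()

def fetch_category(cat: str, n: int = 200) -> list[dict]:
    """Pull title/abstract/topic records for one arXiv category."""
    search = arxiv.Search(query=f"cat:{cat}", max_results=n)
    return [
        {"title": r.title, "abstract": r.summary, "topic": cat}
        for r in client.results(search)
    ]

records_per_category = [fetch_category(cat) for cat in CATEGORIES]
```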

Each item includes title, abstract, and topic. Datasets are concatenated with Hugging Face datasets and tokenized with the Qwen2.5 tokenizer. Abstract token lengths are inspected to inform dynamic batching.
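Continuing the sketch above, the concatenation and length-inspection step might look like this (variable names are illustrative):

```python
from datasets import Dataset, concatenate_datasets
from transformers import AutoTokenizer

# One Dataset per category, merged into a single corpus.
ds = concatenate_datasets(
    [Dataset.from_list(recs) for recs in records_per_category]
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Abstract token lengths inform the dynamic-batching buckets.
lengths = [len(tokenizer(a)["input_ids"]) for a in ds["abstract"]]
print(f"min={min(lengths)} mean={sum(lengths)/len(lengths):.0f} max={max(lengths)}")
```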

Model

  • Backbone: Qwen/Qwen2.5-0.5B-Instruct (AutoModel, no LM head)
  • Pooling: masked average pooling over the last hidden state
  • Fine-tuning: LoRA adapters injected into linear layers
  • Mixed precision: bfloat16 for efficiency (see the sketch after this list)
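
A minimal sketch of the encoder under these choices (the LoRA rank, alpha, dropout, and target-module list are assumptions; the notebook may use different values):

```python
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Backbone without LM head, loaded in bfloat16.
backbone = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
)

# Inject LoRA adapters into the attention projection layers.
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(backbone, lora)

def embed(input_ids, attention_mask):
    """Masked average pooling over the last hidden state."""
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    hidden = out.last_hidden_state                        # (B, T, H)
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
```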

Training

  • Contrastive InfoNCE loss (title ↔ abstract pairs; a step sketch follows this list)
  • Gradient accumulation to reach a larger effective batch size
  • Gradient clipping (norm 1.0)
  • Optimizer: AdamW
  • TensorBoard logging for loss curves
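
A sketch of one training iteration, reusing embed from the model sketch; each title's own abstract is the in-batch positive and the rest are negatives. The temperature, learning rate, accumulation steps, and batch layout are assumptions:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps, temperature = 8, 0.05

# loader is assumed to yield tokenized title/abstract pairs.
for step, batch in enumerate(loader):
    q = F.normalize(embed(**batch["title"]), dim=-1)      # queries
    d = F.normalize(embed(**batch["abstract"]), dim=-1)   # documents
    logits = q @ d.T / temperature                        # (B, B) similarities
    labels = torch.arange(len(q), device=logits.device)   # diagonal = positives
    loss = F.cross_entropy(logits, labels) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```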

Evaluation

Retrieval quality is measured with ID-based metrics using ragas:

  • Recall
  • Precision
  • MRR (mean reciprocal rank)

Evaluation runs two baselines:

  1. OpenAI embeddings (e.g. text-embedding-3-small)
  2. The custom LoRA-tuned Qwen2.5 embedding model

FAISS is used to build an index over abstract embeddings; title embeddings are used as queries.
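
A sketch of the FAISS step using cosine similarity (inner product over L2-normalized vectors); the MRR computation here stands in for the ragas call, and the k value is an assumption:

```python
import faiss
import numpy as np

# abstract_embs, title_embs: (n, d) arrays from the tuned encoder (assumed precomputed).
abstract_embs = np.asarray(abstract_embs, dtype="float32")
title_embs = np.asarray(title_embs, dtype="float32")
faiss.normalize_L2(abstract_embs)
faiss.normalize_L2(title_embs)

index = faiss.IndexFlatIP(abstract_embs.shape[1])
index.add(abstract_embs)
_, retrieved = index.search(title_embs, k=10)  # (n_queries, k) document ids

# MRR, assuming query i's true document has id i.
ranks = [(row == i).nonzero()[0] for i, row in enumerate(retrieved)]
mrr = np.mean([1.0 / (r[0] + 1) if len(r) else 0.0 for r in ranks])
```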

Notebook Flow (experiment.ipynb)

  1. Create and concatenate arXiv datasets
  2. Tokenize and analyze abstract lengths
  3. Build custom category-aware batches
  4. Define encoder + masked pooling + LoRA
  5. Train with contrastive loss
  6. Evaluate with FAISS + ragas metrics

Notes / Limitations

  • The notebook is written for interactive execution and assumes GPU availability for speed.
  • Some cells rely on Colab (e.g. Google Drive checkpointing).
  • Evaluation code for OpenAI embeddings assumes OPENAI_API_KEY is set.
