Mind the Gap: Leveraging AI and LLMs to connect research articles and metadata #11
TineClaeys
announced in
Hackathon proposals
Title
Mind the Gap: Leveraging AI and LLMs to connect research articles and metadata
Abstract
This project aims to automate the extraction of metadata directly from research articles and additional publicly available resources. The output supports the annotation of datasets and their assembly into the Sample and Data Relationship Format (SDRF). We will explore, test, and validate multiple approaches to metadata extraction, including existing annotation tools such as those from EuropePMC, Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and ontology-based disambiguation. We aim to create a robust system that minimizes ambiguity (e.g., distinguishing between "Python" the programming language and "Python" the organism) and improves metadata mapping accuracy.
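To make the "Python" example concrete, here is a minimal sketch of context-based disambiguation. It is only an illustration, not the approach the project will settle on: the `CANDIDATES` table, its context keywords, and the `disambiguate` function are all hypothetical, and a real system would look up candidate senses in an ontology rather than a hand-written dictionary.

```python
import re

# Hypothetical candidate senses for an ambiguous mention. The context keywords
# are illustrative only and are not taken from any real ontology.
CANDIDATES = {
    "python": [
        {"label": "Python (programming language)",
         "context": {"script", "package", "code", "library", "version"}},
        {"label": "Python (snake genus)",
         "context": {"venom", "tissue", "species", "snake", "plasma"}},
    ]
}

def disambiguate(mention, sentence):
    """Pick the sense whose context keywords overlap most with the sentence.

    Returns the label of the best sense, or None when the mention is unknown
    or no keyword of any sense occurs in the sentence.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    best_label, best_score = None, 0
    for sense in CANDIDATES.get(mention.lower(), []):
        score = len(words & sense["context"])
        if score > best_score:  # ties keep the first sense seen
            best_label, best_score = sense["label"], score
    return best_label

print(disambiguate("Python", "Spectra were processed with a custom Python script."))
print(disambiguate("Python", "Plasma was collected from Python bivittatus, a snake species."))
```

Keyword overlap is deliberately simple; it mainly shows where surrounding text enters the decision. An LLM- or embedding-based scorer could replace the overlap count without changing the interface.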
The goal is to develop a solution that allows researchers to upload articles and automatically generate a near-complete metadata annotation file template. This would significantly reduce manual annotation effort, enhance reproducibility, and open the door to large-scale repurposing and reanalysis of the wealth of public data.
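As a rough idea of what such a template could look like, the sketch below writes extracted values into a tab-separated SDRF-style table with one row per data file. The column subset, the `build_sdrf_template` helper, the generated sample and run names, and the use of "not available" for missing values are assumptions made for illustration; the actual template would follow the SDRF specification.

```python
import csv
import io

# Illustrative subset of SDRF-style columns (an assumption, not a full template).
COLUMNS = [
    "source name",
    "characteristics[organism]",
    "characteristics[organism part]",
    "characteristics[disease]",
    "assay name",
    "comment[instrument]",
    "comment[data file]",
]

def build_sdrf_template(data_files, extracted):
    """Return a tab-separated template with one row per data file.

    `extracted` maps column names to values found in the article; columns
    without an extracted value are filled with "not available".
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for index, file_name in enumerate(data_files, start=1):
        # Per-file values; the sample and run names are placeholders.
        per_file = {
            "source name": f"sample {index}",
            "assay name": f"run {index}",
            "comment[data file]": file_name,
        }
        row = []
        for column in COLUMNS:
            if column in per_file:
                row.append(per_file[column])
            else:
                row.append(extracted.get(column, "not available"))
        writer.writerow(row)
    return buffer.getvalue()

print(build_sdrf_template(
    ["a.raw", "b.raw"],
    {"characteristics[organism]": "Homo sapiens"},
))
```

Cells left as "not available" mark exactly where manual annotation is still needed, which is the "near-complete" part of the goal.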
Project Plan
Training data will be assembled before the hackathon. It will consist of previously SDRF-annotated projects with open-access publications, and will include the research articles, filenames, and expected metadata outputs.
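Because the training data pairs each article with its expected metadata output, extracted annotations can be scored against it. The following is a minimal sketch of such a comparison, assuming both sides are flat field-to-value dictionaries and that a case-insensitive exact match counts as correct; the `score_annotation` function and the example fields are hypothetical.

```python
def score_annotation(expected, predicted):
    """Return (precision, recall) of predicted fields against expected ones.

    A predicted field counts as correct when the same field exists in
    `expected` and the values match after trimming and lowercasing.
    """
    def normalize(value):
        return value.strip().lower()

    correct = sum(
        1
        for field, value in predicted.items()
        if field in expected and normalize(expected[field]) == normalize(value)
    )
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall

expected = {"organism": "Homo sapiens", "disease": "breast cancer", "instrument": "Q Exactive"}
predicted = {"organism": "homo sapiens", "disease": "lung cancer"}
precision, recall = score_annotation(expected, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict; comparing ontology accessions instead of free-text labels would be a natural refinement once values are mapped to ontology terms.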
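Explore the suitability of distinct options for automated metadata annotation, including, among others, the approaches named in the abstract: existing annotation tools such as those from EuropePMC, LLMs with RAG, NER, and ontology-based disambiguation.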
Suggestions and alternative approaches are encouraged and greatly appreciated!
After the hackathon, the various approaches will be combined into a hybrid model that will be applied to a large number of research articles to generate metadata. Further integration within lesSDRF and within the environments of other potentially interested parties, together with packaging as a separate Python module, will make this tool widely applicable.
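How the approaches will be combined is not specified here. One simple possibility, shown purely as an assumption, is a per-field majority vote over the outputs of the individual approaches; the `combine_by_vote` function and the example predictions below are hypothetical.

```python
from collections import Counter

def combine_by_vote(predictions):
    """Combine field-to-value dicts from several approaches by majority vote.

    For each field, the most frequently predicted value wins. On a tie, the
    value that was seen first (in the order of `predictions`) is kept.
    """
    fields = []
    for prediction in predictions:
        for field in prediction:
            if field not in fields:  # keep first-seen field order
                fields.append(field)

    combined = {}
    for field in fields:
        votes = Counter(p[field] for p in predictions if field in p)
        combined[field] = votes.most_common(1)[0][0]
    return combined

# Hypothetical outputs of three approaches for the same article.
ner_output = {"organism": "Homo sapiens", "disease": "breast cancer"}
llm_output = {"organism": "Homo sapiens", "instrument": "Q Exactive"}
tool_output = {"organism": "Mus musculus", "disease": "breast cancer"}

print(combine_by_vote([ner_output, llm_output, tool_output]))
```

A weighted vote, with weights taken from each approach's measured accuracy per field, would be an obvious next step if some approaches prove more reliable for certain metadata types.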
We aim for this to substantially change how metadata is handled within the proteomics community and beyond.
Technical Details
Recommended programming languages: Python, R
Datasets will be extracted from PRIDE and the GitHub annotation effort
LLMs will be run online or locally
Contact information
Tine Claeys
Ghent University - VIB Center for Medical Biotechnology
Department for Biomolecular Medicine
[email protected]