Mind the Gap: Leveraging AI and LLMs to connect research articles and metadata #11

TineClaeys · 2024-10-01T13:32:04Z

Title

Mind the Gap: Leveraging AI and LLMs to connect research articles and metadata

Abstract

This project aims to automate the extraction of metadata directly from research articles and additional publicly available resources. The output supports the annotation of datasets and assembly into the Sample and Data Relationship Format (SDRF). We will explore, test, and validate multiple approaches for such metadata extraction, such as using existing annotation tools like those from EuropePMC, Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and ontology-based disambiguation. We aim to create a robust system that minimizes ambiguity (e.g., distinguishing between "Python" the programming language, and "Python" the organism) and improves metadata mapping accuracy.
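To make the "Python" ambiguity concrete, here is a toy sketch of context-based disambiguation. It is not one of the proposed methods, only an illustration of the problem: the candidate senses and their keyword sets are invented for this example, and a real system would draw candidates and context from ontologies or an LLM instead.

```python
# Toy candidate senses for an ambiguous mention. The keyword sets are
# illustrative assumptions and are not taken from any real ontology.
CANDIDATE_SENSES = {
    "python": {
        "organism": {"snake", "venom", "species", "tissue", "reptile", "organism"},
        "programming language": {"script", "package", "code", "software", "library", "version"},
    }
}


def disambiguate(mention, sentence):
    """Return the sense whose keywords overlap most with the sentence, or None."""
    senses = CANDIDATE_SENSES.get(mention.lower())
    if not senses:
        return None
    # Crude tokenisation: split on whitespace and strip common punctuation.
    words = {w.strip(".,;:()").lower() for w in sentence.split()}
    scores = {sense: len(words & keywords) for sense, keywords in senses.items()}
    best = max(scores, key=scores.get)
    # Without any supporting context word, refuse to guess.
    return best if scores[best] > 0 else None
```

Returning `None` when no context word matches is a deliberate choice here: for annotation it is usually safer to leave a field empty for manual review than to fill it with a guess.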

The goal is to develop a solution that lets researchers upload an article and automatically receive a near-complete metadata annotation file template. This would significantly reduce manual annotation effort, enhance reproducibility, and open the door to large-scale repurposing and reanalysis of the wealth of public data.

Project Plan

Training data will be assembled prior to the hackathon. It will consist of previously SDRF-annotated projects with open-access articles; each entry will contain the research article, the file names, and the expected metadata output.

Define validation standard for annotation output

Simple metrics for validating the annotation, for example: are the sample attributes correct?
Integrate with existing SDRF validator for automatic assessment
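As a starting point for the "are the sample attributes correct?" question, one possible simple metric is per-column accuracy against the expected SDRF. The sketch below is an assumption about how such a metric could look, not an agreed standard: the function name is hypothetical, rows are plain dictionaries keyed by SDRF column name, rows are matched on `comment[data file]`, and values are compared case-insensitively.

```python
def attribute_accuracy(expected_rows, predicted_rows, key="comment[data file]"):
    """Per-column fraction of expected values reproduced by the prediction.

    Rows are dicts mapping SDRF column names to values and are matched on `key`.
    """
    predicted_by_key = {row[key]: row for row in predicted_rows if key in row}
    correct, total = {}, {}
    for expected in expected_rows:
        # A file missing from the prediction counts as wrong for every column.
        predicted = predicted_by_key.get(expected[key], {})
        for column, value in expected.items():
            if column == key:
                continue
            total[column] = total.get(column, 0) + 1
            same = predicted.get(column, "").strip().lower() == value.strip().lower()
            correct[column] = correct.get(column, 0) + int(same)
    return {column: correct[column] / total[column] for column in total}
```

Exact string matching is strict: it would score "liver" against "hepatic tissue" as wrong. Comparing ontology accessions instead of labels would be a natural refinement, and structural checks can be left to the existing SDRF validator.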

Compare various approaches for metadata extraction
Explore the suitability of different options for automated metadata annotation, including:

EuropePMC annotation: Evaluate how well EuropePMC’s existing annotation methods map metadata.
LLMs + Retrieval-Augmented Generation (RAG): Implement LLMs for context-aware metadata extraction and disambiguation.
NER + Ontology-Based Entity Disambiguation: Use Named Entity Recognition paired with ontologies to resolve ambiguous terms.
Other approaches, including regex-based extraction of metadata from file names (e.g., identifying fractions, replicates) and extracting technical metadata from raw/mzML files, will also be explored.
Suggestions and alternative approaches are encouraged and greatly appreciated!
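The regex-based option above can be sketched as follows. The naming conventions are assumptions made for illustration (for example `F03`, `frac_3`, `fraction_12` for fractions and `rep2`, `R1` for replicates); real file names in PRIDE vary widely, so the patterns would need tuning per dataset. A file name alone also cannot tell whether a replicate is technical or biological, so mapping it to `comment[technical replicate]` is likewise an assumption.

```python
import re

# Assumed (hypothetical) naming conventions: a token such as "F03", "frac_3"
# or "fraction_12" marks a fraction; "rep2", "replicate_2" or "R1" marks a
# replicate. Tokens must be delimited by "_", "-", "." or the string boundary.
FRACTION_RE = re.compile(
    r"(?:^|[_\-.])(?:fraction|frac|fr|f)[_\-]?(\d+)(?=[_\-.]|$)", re.IGNORECASE
)
REPLICATE_RE = re.compile(
    r"(?:^|[_\-.])(?:replicate|rep|r)[_\-]?(\d+)(?=[_\-.]|$)", re.IGNORECASE
)


def parse_filename(filename):
    """Extract fraction and replicate numbers from a raw file name, if present."""
    fraction = FRACTION_RE.search(filename)
    replicate = REPLICATE_RE.search(filename)
    return {
        "comment[data file]": filename,
        "comment[fraction identifier]": int(fraction.group(1)) if fraction else None,
        "comment[technical replicate]": int(replicate.group(1)) if replicate else None,
    }
```

For instance, `parse_filename("HeLa_F03_rep2.raw")` yields fraction 3 and replicate 2, while a name such as `control.raw` leaves both fields as `None` for manual annotation.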

Each method will be assessed for completeness, efficiency, and scalability for large-scale annotation tasks.

After the hackathon, the various approaches will be combined into a hybrid model that will be applied to a large number of research articles to generate metadata. Further integration into lesSDRF and into the environments of other interested parties, together with packaging as a standalone Python module, will make this tool widely applicable.
We aim for this work to substantially change how metadata is handled within the proteomics community and beyond.

Technical Details

Recommended programming languages: Python, R
Datasets will be extracted from PRIDE and the GitHub annotation effort
LLMs will be run online or locally

Contact information

Tine Claeys
Ghent University - VIB Center for Medical Biotechnology
Department for Biomolecular Medicine
[email protected]
