An open-source system for extracting and structuring data from scientific literature using Large Language Models (LLMs). It integrates a computational backend with an interactive user interface to facilitate efficient data extraction, structuring, and refinement for evidence synthesis in scientific research.
Note: This repository contains two main branches:
- `main`: The latest version, optimized for customization and extension
- `scidasynth`: The original version, as described in our research paper
- 🔍 Automated data extraction from scientific papers (text, tables, and figures)
- 📊 Structured data table output in standardized formats
- 🖥️ Interactive user interface for data validation and refinement
- 🚀 Retrieval-augmented generation (RAG) for enhanced accuracy and speed
- 📈 Quality evaluation metrics for extracted data
- 👥 Support for both technical and non-technical users
# Clone the repository
git clone https://github.com/xingbow/SciDaEx.git
cd SciDaEx
# Set up a virtual environment
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
# Install backend dependencies (Python 3.10)
pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"
# Install frontend dependencies
cd frontend
npm install
- Backend configuration
  - Create a `.env` file in the `backend/app/dataService` directory by copying from `.env.example`:
    cp backend/app/dataService/.env.example backend/app/dataService/.env
  - Update the `.env` file with the required configurations:
    ```
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key_here
    # Adobe Credentials
    ADOBE_CLIENT_ID=your_adobe_client_id_here
    ADOBE_CLIENT_SECRET=your_adobe_client_secret_here
    ```
- Place your PDF documents in the `backend/app/dataService/data` directory.
- Run the preprocessing script:
This script will extract tables, figures, and metadata from the PDFs and store them in the respective directories.
cd backend/app/dataService
python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta
For details, please refer to the preprocessing documentation.
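
The preprocessing invocation can also be assembled from Python, for example when the step is scripted over several PDF folders. The following is a minimal sketch and not part of the repository: the helper names `build_preprocess_command` and `list_pdfs` are hypothetical, and it assumes the `table`, `figure`, and `meta` directories sit under the PDF directory, as in the command shown above. Only the four flags from that command are used.

```python
import sys
from pathlib import Path


def build_preprocess_command(pdf_dir: str = "data") -> list[str]:
    """Return the argv list for preprocess.py with the four documented flags.

    Hypothetical helper: the output directories are assumed to live under
    pdf_dir (data/table, data/figure, data/meta), mirroring the README command.
    """
    base = Path(pdf_dir)
    return [
        sys.executable,  # the interpreter of the active virtual environment
        "preprocess.py",
        "--pdf_dir", base.as_posix(),
        "--table_dir", (base / "table").as_posix(),
        "--figure_dir", (base / "figure").as_posix(),
        "--meta_dir", (base / "meta").as_posix(),
    ]


def list_pdfs(pdf_dir: str) -> list[str]:
    """Return the sorted names of the .pdf files directly inside pdf_dir."""
    return sorted(p.name for p in Path(pdf_dir).glob("*.pdf"))


if __name__ == "__main__":
    # Only prints what would be run; nothing is executed here.
    print("PDFs found:", list_pdfs("data"))
    print("Command:", " ".join(build_preprocess_command("data")))
```

To actually run the step, the returned list can be passed to `subprocess.run(..., check=True)` from within `backend/app/dataService`, so that the relative `data` paths resolve as they do in the shell command.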
- Start the backend server:
  cd backend
  python run-data-backend.py
- Start the frontend server:
  cd frontend
  npm run serve
- Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.
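
If the interface does not load, two things are worth checking first: that the `.env` file still contains real values rather than the template placeholders, and that something is answering on port 8080. The sketch below illustrates both checks using only the standard library. It is an assumption-laden example, not code from the repository: the function names are hypothetical, the `.env` parsing is deliberately simplistic (plain `KEY=value` lines only), and the three key names are taken from the configuration template above.

```python
from http.client import HTTPException
from urllib.request import urlopen

# Key names as listed in the .env template above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ADOBE_CLIENT_ID", "ADOBE_CLIENT_SECRET")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text: str) -> list[str]:
    """Return required keys that are absent, empty, or still a placeholder."""
    values = parse_env(text)
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        # Template placeholders look like "your_openai_api_key_here".
        if not value or (value.startswith("your_") and value.endswith("_here")):
            missing.append(key)
    return missing


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a successful HTTP response."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (OSError, HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    sample = "OPENAI_API_KEY=your_openai_api_key_here\nADOBE_CLIENT_ID=abc123\n"
    print("Keys still to fill in:", missing_keys(sample))
```

In practice, `missing_keys` would be given the contents of `backend/app/dataService/.env`, and `is_reachable("http://localhost:8080")` would be called once the frontend server has been started.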
| Period | Role | Contributor | Details |
|---|---|---|---|
| 2024-08-06 to present | Project Maintainer | Xingbo Wang | - |
| Until 2024-08-06 | Lead Developer | Xingbo Wang | 63 commits, +20,575 lines |
| Until 2024-08-06 | Contributor | Rui Sheng | 14 commits, +166 lines |
| Until 2024-08-06 | Contributor | Winston Tsui | 2 commits, +106 lines |
If you use the repository, please cite the following paper:
@article{wang2024scidasynth,
title={SciDaSynth: Interactive Structured Knowledge Extraction and Synthesis from Scientific Literature with Large Language Model},
author={Wang, Xingbo and Huey, Samantha L and Sheng, Rui and Mehta, Saurabh and Wang, Fei},
journal={arXiv preprint arXiv:2404.13765},
year={2024}
}