SciDaEx: Scientific Data Extraction and Structuring System

An open-source system for extracting and structuring data from scientific literature using Large Language Models (LLMs). It integrates a computational backend with an interactive user interface to facilitate efficient data extraction, structuring, and refinement for evidence synthesis in scientific research.

Note: This repository contains two main branches:

  • main: The latest version optimized for customization and extension
  • scidasynth: The original version as described in our research paper

📞 Contact

Xingbo Wang - Website | Email

✨ Features

  • 🔍 Automated data extraction from scientific papers (text, tables, and figures)
  • 📊 Structured data table output in standardized formats
  • 🖥️ Interactive user interface for data validation and refinement
  • 🚀 Retrieval-augmented generation (RAG) for enhanced accuracy and speed
  • 📈 Quality evaluation metrics for extracted data
  • 👥 Support for both technical and non-technical users

🚀 Installation

# Clone the repository
git clone https://github.com/xingbow/SciDaEx.git
cd SciDaEx

# Set up a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install backend dependencies (Python 3.10)
pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"

# Install frontend dependencies
cd frontend
npm install

Configuration

  1. Backend configuration
    • Create a .env file in the backend/app/dataService directory by copying from .env.example:
      cp backend/app/dataService/.env.example backend/app/dataService/.env
    • Update the .env file with the required values:
      • Adobe service API credentials (client ID and client secret)
      • OpenAI API key
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key_here
    
    # Adobe Credentials
    ADOBE_CLIENT_ID=your_adobe_client_id_here
    ADOBE_CLIENT_SECRET=your_adobe_client_secret_here
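
Before starting the backend, it can help to confirm that the .env file defines all three variables and that none still holds its placeholder value. The following is a minimal sketch, not part of SciDaEx: the helper names `parse_env` and `missing_env_keys` are hypothetical, and the parser assumes the simple KEY=VALUE syntax shown above.

```python
from pathlib import Path

# Variable names taken from the .env example above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ADOBE_CLIENT_ID", "ADOBE_CLIENT_SECRET")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_env_keys(text: str) -> list[str]:
    """Return required keys that are absent, empty, or still placeholders."""
    values = parse_env(text)
    return [
        key for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("your_")
    ]


if __name__ == "__main__":
    # Path from the configuration step; run from the repository root.
    env_path = Path("backend/app/dataService/.env")
    if not env_path.exists():
        print(f"{env_path} not found; copy it from .env.example first")
    else:
        missing = missing_env_keys(env_path.read_text(encoding="utf-8"))
        print("Missing or placeholder keys:", missing or "none")
```

The check treats any value beginning with `your_` as an unfilled placeholder, which matches the example values above but is only a heuristic.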

💻 Usage

Preprocess Documents

  1. Place your PDF documents in the backend/app/dataService/data directory.
  2. Run the preprocessing script:
    cd backend/app/dataService
    python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta
    This script will extract tables, figures, and metadata from the PDFs and store them in the respective directories.

For details, please refer to the preprocessing documentation.
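
As a rough illustration of the preprocessing step, the sketch below lists the PDFs that would be processed and assembles the same command line shown above. It is an assumption-laden helper, not part of the repository: `list_pdfs` and `build_preprocess_command` are hypothetical names, and only the four flags documented above are used.

```python
from pathlib import Path


def list_pdfs(pdf_dir: str) -> list[str]:
    """Return the sorted file names of PDFs directly inside pdf_dir."""
    base = Path(pdf_dir)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.glob("*.pdf"))


def build_preprocess_command(pdf_dir: str = "data") -> list[str]:
    """Build the preprocess.py invocation with the flags shown in the README."""
    base = Path(pdf_dir)
    return [
        "python", "preprocess.py",
        "--pdf_dir", base.as_posix(),
        "--table_dir", (base / "table").as_posix(),
        "--figure_dir", (base / "figure").as_posix(),
        "--meta_dir", (base / "meta").as_posix(),
    ]


if __name__ == "__main__":
    # Run from backend/app/dataService, where the data directory lives.
    pdfs = list_pdfs("data")
    print(f"Found {len(pdfs)} PDF(s) in data/")
    print("Command:", " ".join(build_preprocess_command("data")))
```

The command is only printed here; passing the returned list to `subprocess.run` would execute it.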

Running the Web Application

  1. Start the backend server

    cd backend
    python run-data-backend.py
  2. Start the frontend server

    cd frontend
    npm run serve
  3. Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.
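
If the page does not load, a quick way to tell whether the frontend server is listening is to try a TCP connection to the port. This is a minimal sketch under the assumption that the frontend serves on localhost:8080 as stated above; `port_is_open` is a hypothetical helper, and the backend port is not checked because this README does not state it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 8080):
        print("Frontend is reachable at http://localhost:8080")
    else:
        print("Nothing is listening on port 8080; is `npm run serve` running?")
```

A successful connection only shows that some process is listening on the port, not that it is the SciDaEx frontend.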

👥 Contributors

Project Timeline

Period                | Role               | Contributor  | Details
2024-08-06 to present | Project Maintainer | Xingbo Wang  | -
Until 2024-08-06      | Lead Developer     | Xingbo Wang  | 63 commits, +20,575 lines
Until 2024-08-06      | Contributor        | Rui Sheng    | 14 commits, +166 lines
Until 2024-08-06      | Contributor        | Winston Tsui | 2 commits, +106 lines

📚 Citation

If you use the repository, please cite the following paper:

@article{wang2024scidasynth,
  title={SciDaSynth: Interactive Structured Knowledge Extraction and Synthesis from Scientific Literature with Large Language Model},
  author={Wang, Xingbo and Huey, Samantha L and Sheng, Rui and Mehta, Saurabh and Wang, Fei},
  journal={arXiv preprint arXiv:2404.13765},
  year={2024}
}
