SciDaEx: Scientific Data Extraction and Structuring System

An open-source system for extracting and structuring data from scientific literature using Large Language Models (LLMs). It integrates a computational backend with an interactive user interface to facilitate efficient data extraction, structuring, and refinement for evidence synthesis in scientific research.

Note: This repository contains two main branches:

- main: The latest version, optimized for customization and extension.
- scidasynth: The original version, as described in our research paper.

📞 Contact

Xingbo Wang - Website | Email

✨ Features

🔍 Automated data extraction from scientific papers (text, tables, and figures)
📊 Structured data table output in standardized formats
🖥️ Interactive user interface for data validation and refinement
🚀 Retrieval-augmented generation (RAG) for enhanced accuracy and speed
📈 Quality evaluation metrics for extracted data
👥 Support for both technical and non-technical users

🚀 Installation

# Clone the repository
git clone https://github.com/xingbow/SciDaEx.git
cd SciDaEx

# Set up a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install backend dependencies (Python 3.10)
pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"

# Install frontend dependencies
cd frontend
npm install
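
The install step above mentions Python 3.10 and pins pdfservices-sdk to 2.3.0. As a quick sanity check after installing, a minimal sketch like the following could confirm both from inside the virtual environment. The helper names `python_ok` and `installed_version` are hypothetical and not part of SciDaEx, and this README does not say whether other Python versions also work.

```python
import sys
from importlib import metadata


def python_ok(version=sys.version_info) -> bool:
    """Return True if the interpreter is Python 3.10, the version named in the install step."""
    return tuple(version[:2]) == (3, 10)


def installed_version(dist: str):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python 3.10:", python_ok())
    # The install command pins this package to 2.3.0.
    print("pdfservices-sdk:", installed_version("pdfservices-sdk"))
```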

Configuration

Backend configuration

Create a .env file in the backend/app/dataService directory by copying from .env.example:
```
cp backend/app/dataService/.env.example backend/app/dataService/.env
```
Update the .env file with your own credentials:
- Adobe service API credentials (client ID and client secret)
- OpenAI API key

```
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Adobe Credentials
ADOBE_CLIENT_ID=your_adobe_client_id_here
ADOBE_CLIENT_SECRET=your_adobe_client_secret_here
```
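
Before starting the backend, it may help to confirm that the three keys shown above have been filled in. The following is a minimal sketch using only the standard library. It is not how SciDaEx itself loads the file (that is not described here), and the function names `parse_env` and `missing_keys` are hypothetical.

```python
from pathlib import Path

# Keys listed in the .env template above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ADOBE_CLIENT_ID", "ADOBE_CLIENT_SECRET")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still template placeholders."""
    return [
        key
        for key in REQUIRED_KEYS
        if not values.get(key) or values[key].startswith("your_")
    ]


if __name__ == "__main__":
    # Path relative to the repository root, as in the cp command above.
    env_path = Path("backend/app/dataService/.env")
    if not env_path.exists():
        print(f"{env_path} not found; copy it from .env.example first")
    else:
        problems = missing_keys(parse_env(env_path.read_text()))
        print("OK" if not problems else f"Missing or placeholder values: {problems}")
```

The placeholder check assumes the template values all start with `your_`, as they do in the example above.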

💻 Usage

Preprocess Documents

Place your PDF documents in the backend/app/dataService/data directory.
Run the preprocessing script:
```
cd backend/app/dataService
python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta
```
The script extracts tables, figures, and metadata from the PDFs and stores them in the directories given by --table_dir, --figure_dir, and --meta_dir.
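
If you prefer to drive preprocessing from Python, the sketch below lists the PDFs that would be picked up and assembles the same command line shown above. It only uses the four flags from that command; the helper names `find_pdfs` and `build_command` are hypothetical, and the sketch assumes it is run from backend/app/dataService.

```python
import sys
from pathlib import Path


def find_pdfs(pdf_dir: Path) -> list[Path]:
    """Return the PDF files directly inside pdf_dir, sorted by path."""
    return sorted(
        p for p in pdf_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_command(pdf_dir: str = "data") -> list[str]:
    """Build the preprocess.py invocation with the flags from the command above."""
    return [
        sys.executable, "preprocess.py",
        "--pdf_dir", pdf_dir,
        "--table_dir", f"{pdf_dir}/table",
        "--figure_dir", f"{pdf_dir}/figure",
        "--meta_dir", f"{pdf_dir}/meta",
    ]


if __name__ == "__main__":
    data_dir = Path("data")
    if not data_dir.is_dir():
        print("data/ not found; run this from backend/app/dataService")
    else:
        print(f"{len(find_pdfs(data_dir))} PDF(s) found in {data_dir}/")
        # Printed rather than executed; pass the list to subprocess.run to run it.
        print(" ".join(build_command()))
```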

For details, please refer to the preprocessing documentation.

Running the Web Application

Start the backend server:
```
cd backend
python run-data-backend.py
```
Start the frontend server (in a separate terminal):
```
cd frontend
npm run serve
```
Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.
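
If the page does not load, a small check like the one below can tell you whether anything is answering at that address. This is a minimal sketch using only the standard library; `is_reachable` is a hypothetical helper, and only the frontend URL given above is checked because the backend port is not stated here.

```python
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at url, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a success status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    url = "http://localhost:8080"
    status = "up" if is_reachable(url) else "not reachable"
    print(f"Frontend at {url}: {status}")
```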

👥 Contributors

Project Timeline

| Period | Role | Contributor | Details |
| --- | --- | --- | --- |
| 2024-08-06 to present | Project Maintainer | Xingbo Wang | - |
| Until 2024-08-06 | Lead Developer | Xingbo Wang | 63 commits, +20,575 lines |
| Until 2024-08-06 | Contributor | Rui Sheng | 14 commits, +166 lines |
| Until 2024-08-06 | Contributor | Winston Tsui | 2 commits, +106 lines |

📚 Citation

If you use the repository, please cite the following paper:

@article{wang2024scidasynth,
  title={SciDaSynth: Interactive Structured Knowledge Extraction and Synthesis from Scientific Literature with Large Language Model},
  author={Wang, Xingbo and Huey, Samantha L and Sheng, Rui and Mehta, Saurabh and Wang, Fei},
  journal={arXiv preprint arXiv:2404.13765},
  year={2024}
}
