GitHub - sciknoworg/schema-miner: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

This is an open-source implementation of Schema-miner (LLMs4SchemaDiscovery).

📋 Schema-miner Overview

Schema-miner (LLMs4SchemaDiscovery) is a novel framework that leverages Large Language Models (LLMs) and continuous human feedback to automate and enhance the schema mining task. Through an iterative process, the framework uses LLMs to extract and organize properties from unstructured text, refine schemas with expert input, and incorporate domain-specific ontologies to add semantic knowledge. Comprehensive documentation for schema-miner, including detailed guides and examples, is available at schema-miner.readthedocs.io.

Figure 1: Overview of the LLMs4SchemaDiscovery workflow.

⚙️ System Requirements

The computational requirements for running this project vary depending on the model being used. If you use OpenAI models such as GPT-4o and GPT-4-turbo, no specialized hardware is needed because inference is performed via API calls. A basic system with a stable internet connection is sufficient for executing the API-based workflow.

Running open-source models such as Llama 3.1 8B or other large transformer-based models locally demands significantly more computational resources. These models can be executed on a CPU, but inference times will be considerably longer. For efficient execution, a dedicated GPU with enough VRAM for the chosen model (as specified in the model's documentation) is strongly recommended.

The hardware configuration can be adjusted to the model size and performance needs, but using a GPU significantly reduces inference time compared to CPU-only setups.
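The distinction above between API-based and locally run models can be sketched as a small pre-flight check. This is only an illustration, not part of schema-miner: the function names are hypothetical, the model names are the ones listed in this README, and looking for the `nvidia-smi` binary on the PATH is a rough heuristic that says nothing about available VRAM.

```python
import shutil

# Model names taken from this README; the routing rule is an assumption.
OPENAI_MODELS = {"gpt-4o", "gpt-4-turbo"}


def needs_local_hardware(model_name: str) -> bool:
    """Return True when the model is not one of the API-based OpenAI models."""
    return model_name.strip().lower() not in OPENAI_MODELS


def nvidia_tools_visible() -> bool:
    """Heuristic only: check whether the nvidia-smi binary is on the PATH."""
    return shutil.which("nvidia-smi") is not None


def hardware_advice(model_name: str) -> str:
    """Summarize what kind of hardware the chosen model is likely to need."""
    if not needs_local_hardware(model_name):
        return "API-based model: no specialized hardware needed."
    if nvidia_tools_visible():
        return "Local model: NVIDIA driver tools found, GPU inference may be possible."
    return "Local model: no NVIDIA driver tools found, expect slow CPU inference."


if __name__ == "__main__":
    for name in ("gpt-4o", "meta-llama-3.1-8b-instruct"):
        print(name, "->", hardware_advice(name))
```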

Experimental Configuration

For our experiments, we used the following hardware setup:

Processor: 64-core CPU
Memory: 500 GB RAM
GPU: NVIDIA RTX 3090
Models Used:
- Cloud-based: GPT-4o and GPT-4-turbo (via OpenAI API)
- Locally run: Llama 3.1 8B

🧪 Installation

Install all the necessary Python packages listed in the requirements.txt file.

pip install -r requirements.txt

🚀 Quick Start

For a quick start, see the provided example notebooks, which walk through the overall schema-miner workflow.

	Notebook
1	Schema Mining With LLMs and expert Example

🧑‍💻 Schema-miner Tool Usage

🛠️ Configuration

Parameters such as API keys, base URLs, and model settings are managed in a dedicated environment file. The .env file must be saved in the root directory of this project.

Setting Up the Environment File

Before running the project, you need to set up the environment variables.

Copy the example file:
```
cp .env.example .env
```
Open the .env file and replace the placeholder values with your actual configuration.
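To check that no value was left empty after editing, a small script along the following lines could be used. It is a minimal sketch that assumes the file contains plain `KEY=VALUE` lines. The key names in the usage example (`EXAMPLE_API_KEY`, `EXAMPLE_BASE_URL`) are hypothetical; the real names are the ones listed in .env.example.

```python
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    # Hypothetical key names; replace them with those from .env.example.
    required = ["EXAMPLE_API_KEY", "EXAMPLE_BASE_URL"]
    problems = missing_keys(load_env_file(".env"), required)
    print("Missing or empty:", problems if problems else "none")
```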

🗂️ Preparing Knowledge Base

The knowledge base includes an initial domain specification document and the relevant research papers for stages 2 and 3. For machine processing, the input documents need to be in plain text format. If the research papers are in PDF format, the PDF-to-text conversion script can be used. The script takes the directory containing the PDFs and the directory in which to save the converted text files.

$ python file_text_extractor.py

LLMs4SchemaDiscovery Framework -- A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models 
Formatting Knowledge Base - Converting PDF Documents to Text Files

Please input the directory location containing the PDF documents  
Directory Path> data/research-papers

Please input the directory Location to store the converted text files
Directory Path> data/research-papers

Extracting text from all PDFs from the directory: data/research-papers

Extracting text from the PDF: ALD-E_Simulation-Parameters-Observables-List.pdf
Text file saved successfully: True

PDF documents successfully convert to text format!
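The directory-to-directory conversion shown in the run above can be sketched as follows. This is not the code of file_text_extractor.py. The function name `convert_directory` is hypothetical, and the actual text extraction is passed in as a callable so that any PDF library can be plugged in; the sketch only shows how each PDF maps to a text file with the same base name.

```python
from pathlib import Path
from typing import Callable


def convert_directory(
    pdf_dir: str,
    out_dir: str,
    extract_text: Callable[[Path], str],
) -> list[Path]:
    """Write one .txt file per PDF found in pdf_dir and return the paths written."""
    source = Path(pdf_dir)
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    written: list[Path] = []
    for pdf_path in sorted(source.glob("*.pdf")):
        # Keep the base name and swap the extension: paper.pdf -> paper.txt
        text_path = target / (pdf_path.stem + ".txt")
        text_path.write_text(extract_text(pdf_path), encoding="utf-8")
        written.append(text_path)
    return written
```

The `extract_text` argument would wrap whichever PDF library is installed. As in the example run, the input and output directories may be the same.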

📊 Generating Initial Schema from Stage 1

An initial JSON schema can be generated from the domain specification document with the preferred LLM. Stage 1 can be executed using this script. An example run is shown below:

$ python schema_extraction_stage1.py

LLMs4SchemaDiscovery Framework -- A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models
Stage 1: Initial Schema Mining

Please specify the LLM name to perform schema mining...
List of possible LLMs:
    1. OPENAI Models:
        - gpt-4o
        - gpt-4-turbo
    2. LLAMA Models:
        - meta-llama-3.1-8b-instruct
    3. Any other local OLLAMA Model

LLM> gpt-4o

Please specify the location of the process specification document
Document location> data/stage-1/ALD-Process-Development.txt

Please specify the location to save the schema
Schema location> results/stage-1

Performing LLM (gpt-4o) Inference to extract schema...
Using OPENAI - LLM Inference with Model: gpt-4o

Writing the models response to the file at the specified location: results/stage-1

Extracting the JSON object from the models output...
JSON schema Saved at location: results/stage-1
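The final step of the run above, extracting the JSON object from the model's output, can be approximated with the standard library alone. The sketch below is an assumption about how such a step might work, not the project's implementation, and `extract_json_object` is a hypothetical name. It scans the response for the first position from which a complete JSON object can be decoded, so surrounding explanatory text is ignored.

```python
import json


def extract_json_object(model_output: str) -> dict:
    """Return the first decodable JSON object embedded in the text."""
    decoder = json.JSONDecoder()
    start = model_output.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(model_output, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = model_output.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


if __name__ == "__main__":
    reply = 'Here is the schema: {"title": "Example", "type": "object"} Done.'
    print(extract_json_object(reply))
```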

📊🔄 Refining Stage 1 Schemas with scientific literature and human feedback

The schema from stage 1 can be refined iteratively by the LLM, which analyzes a curated set of research papers and incorporates expert feedback. Stage 2 can be executed using this script. An example run is shown below:

$ python schema_extraction_stage2.py

LLMs4SchemaDiscovery Framework -- A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models 
Stage 2: Preliminary Schema Refinement

Please specify the LLM name to perform schema mining...
List of possible LLMs:
    1. OPENAI Models:
        - gpt-4o
        - gpt-4-turbo
    2. LLAMA Models:
        - meta-llama-3.1-8b-instruct
    3. Any other local OLLAMA Model

LLM> gpt-4o

Please specify the location to the schema from stage 1
Stage 1 - schema location> results/stage-1

Please specify the location to the domain-expert feedbacks on stage 1 schema
Expert feedback location> data/stage-2/domain-expert-reviews/experimental-usecase/method-1

Please specify the location to the small domain-expert curated collection of research papers
Research papers location> data/stage-2/research-papers/experimental-usecase

Please specify the location to save the schema
Schema location> results/stage-2

Performing LLM (gpt-4o) Inference to extract updated experimental schema...
Using OPENAI - LLM Inference with Model: gpt-4o

Reading the schema from stage-1: results/stage-1/gpt-4o.json

Reading the domain expert review on the initial schema: data/stage-2/domain-expert-reviews/experimental-usecase/method-1/gpt-4o.txt

========================== Iteration 1 =================================

Reading the scientific paper: data/stage-2/research-papers/experimental-usecase/1 Groner et al.txt

Calling the completion API of the model
Writing the models response to the file at the specified location: results/stage-2

Do you want to provide feedback?
Feedback (yes/no)> yes

Please input your feedback
Feedback> The reactivity should be mentioned either at the process conditions or film properties. It’s the result of deposition under specific conditions not a standard property of the precursor or co-reactant.

Do you want to continue with the next paper?
Continue (yes)/Stop (no)> yes

========================== Iteration 2 =================================
Reading the scientific paper: data/stage-2/research-papers/experimental-usecase/2 Aaltonen et al.txt

Calling the completion API of the model
Writing the models response to the file at the specified location: results/stage-2

Do you want to provide feedback?
Feedback (yes/no)> no

Do you want to continue with the next paper?
Continue (yes)/Stop (no)> no
JSON schema updated with LLM: gpt-4o
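The control flow of the run above (one paper per iteration, optional feedback, optional early stop) can be sketched as a small loop. This is a simplified illustration under stated assumptions, not the code of schema_extraction_stage2.py. All names are hypothetical, the LLM call and the prompts are injected as callables, and passing the feedback into the next iteration is an assumption about how the tool uses it.

```python
from typing import Callable, Iterable, Optional


def refine_schema(
    schema: str,
    papers: Iterable[str],
    expert_review: Optional[str],
    call_llm: Callable[[str, str, Optional[str]], str],
    ask_feedback: Callable[[int, str], Optional[str]],
    should_continue: Callable[[int], bool],
) -> str:
    """Refine a schema paper by paper, carrying feedback into the next call."""
    feedback = expert_review  # the stage 1 review seeds the first iteration
    for iteration, paper in enumerate(papers, start=1):
        # The LLM sees the current schema, one paper, and the latest feedback.
        schema = call_llm(schema, paper, feedback)
        # The expert may comment on the updated schema (None means no feedback).
        feedback = ask_feedback(iteration, schema)
        if not should_continue(iteration):
            break
    return schema
```

In an interactive session, `ask_feedback` and `should_continue` would wrap `input()` prompts like the ones shown in the example run, and `call_llm` would wrap the chosen model's completion API.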

Stage 3 validates and finalizes the schema using a larger, uncurated corpus of research papers, ensuring generalizability and semantic robustness. Stage 3 can be executed using this script. Its execution is similar to stage 2; only the scientific corpus differs.

📃 License

This work is licensed under the MIT License.

