scrapontologies

The schemas are inferred from documents and can be used to define tables in a database or to build a knowledge graph.
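As a rough illustration of the database use case, the sketch below turns the top-level properties of a JSON-schema object into a CREATE TABLE statement. The helper `schema_to_create_table` and the `TYPE_MAP` table are hypothetical and not part of this library; nested objects and arrays are simply stored as text here, which is an assumption rather than a recommendation.

```python
# Hypothetical helper, not part of the library: maps the top-level
# properties of a JSON-schema object to the columns of one SQL table.
TYPE_MAP = {
    "string": "TEXT",
    "integer": "INTEGER",
    "number": "REAL",
    "boolean": "BOOLEAN",
}


def schema_to_create_table(table_name, schema):
    """Return a CREATE TABLE statement for a JSON-schema object."""
    columns = []
    for field, spec in schema.get("properties", {}).items():
        # Objects and arrays have no entry in TYPE_MAP, so they fall back
        # to TEXT (for example, to hold serialized JSON) in this sketch.
        sql_type = TYPE_MAP.get(spec.get("type"), "TEXT")
        columns.append(f"    {field} {sql_type}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(columns) + "\n);"


example = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "series": {"type": "string"},
        "fees": {"type": "object", "properties": {}},
    },
}
ddl = schema_to_create_table("portfolio", example)
print(ddl)
```

A real mapping would likely split nested objects into their own tables with foreign keys; that choice depends on the target database and is left out of this sketch.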

Features

Entity Extraction: Automatically identifies and extracts entities from PDF files.
Schema Generation: Constructs a schema based on the structure of the extracted entities.
Visualization: Renders the generated schema as a dynamic visualization.

Quick Start

Prerequisites

Before you begin, ensure you have the following installed on your system:

Python: Make sure Python 3.9+ is installed.
Poppler: This tool is required for converting PDF pages to images.

MacOS Installation

To install Poppler on MacOS, use the following command:

brew install poppler

Linux Installation

To install Poppler on Linux, use the following command:

sudo apt-get install poppler-utils

Windows

Download the latest Poppler release for Windows from poppler releases.
Extract the downloaded zip file to a location on your computer (e.g., C:\Program Files\poppler).
Add the bin directory of the extracted folder to your system's PATH environment variable.

To add to PATH:

Search for "Environment Variables" in the Start menu and open it.
Under "System variables", find and select "Path", then click "Edit".
Click "New" and add the path to the Poppler bin directory (e.g., C:\Program Files\poppler\bin).
Click "OK" to save the changes.

After installation, restart your terminal or command prompt for the changes to take effect. If Poppler is still not found, try restarting your computer.
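To confirm that Poppler is reachable from your PATH on any of the platforms above, a small check like the following may help. It relies only on the standard library and looks for `pdftoppm` and `pdfinfo`, two command-line tools that ship with Poppler; the function name `find_poppler` is made up for this example.

```python
import shutil


def find_poppler(tools=("pdftoppm", "pdfinfo")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


for tool, location in find_poppler().items():
    status = location if location else "NOT FOUND (check your PATH)"
    print(f"{tool}: {status}")
```

If either tool is reported as not found, the Poppler bin directory is probably missing from PATH in the current terminal session.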

Installation

Once the prerequisites are in place, clone the repository and install its Python dependencies. You can then use scrape_schema to extract entities and their schema from PDFs.

git clone https://github.com/ScrapeGraphAI/scrape_schema
cd scrape_schema
pip install -r requirements.txt

Usage

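from scrape_schema import FileExtractor, PDFParser
# LLMClient is used below; this module path is an assumption, adjust it to your install
from scrape_schema.llm_client import LLMClient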
import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env file
api_key = os.getenv("OPENAI_API_KEY")

# Path to your PDF file
pdf_path = "./test.pdf"

# Create an LLMClient instance
llm_client = LLMClient(api_key)

# Create a PDFParser instance with the LLMClient
pdf_parser = PDFParser(llm_client)

# Create a FileExtractor instance with the PDF parser
pdf_extractor = FileExtractor(pdf_path, pdf_parser)

# Generate a JSON schema from the entities found in the PDF
json_schema = pdf_extractor.generate_json_schema()

print(json_schema)
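If you want to keep the result instead of only printing it, something along these lines could write it to disk. This is a sketch under an assumption: the source does not state whether generate_json_schema() returns a dict or a JSON string, so the hypothetical helper `save_schema` accepts both. The `sample` value is a stand-in, not real library output.

```python
import json
import tempfile
from pathlib import Path


def save_schema(schema, path):
    """Write a schema (dict or JSON string) to `path` as indented JSON."""
    if isinstance(schema, str):
        # Normalize a JSON string to a Python object before re-serializing.
        schema = json.loads(schema)
    path = Path(path)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return path


# Stand-in for the value returned by generate_json_schema() (assumed shape).
sample = {"ROOT": {"portfolio": {"type": "object", "properties": {}}}}

out_dir = Path(tempfile.mkdtemp())
saved = save_schema(sample, out_dir / "schema.json")
reloaded = json.loads(saved.read_text(encoding="utf-8"))
print(saved, reloaded == sample)
```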

Output

{
  "ROOT": {
    "portfolio": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string"
        },
        "series": {
          "type": "string"
        },
        "fees": {
          "type": "object",
          "properties": {
            "salesCharges": {
              "type": "string"
            },
            "fundExpenses": {
              "type": "object",
              "properties": {
                "managementExpenseRatio": {
                  "type": "string"
                },
                "tradingExpenseRatio": {
                  "type": "string"
                },
                "totalExpenses": {
                  "type": "string"
                }
              }
            },
            "trailingCommissions": {
              "type": "string"
            }
          }
        },
        "withdrawalRights": {
          "type": "object",
          "properties": {
            "timeLimit": {
              "type": "string"
            },
            "conditions": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          }
        },
        "contactInformation": {
          "type": "object",
          "properties": {
            "companyName": {
              "type": "string"
            },
            "address": {
              "type": "string"
            },
            "phone": {
              "type": "string"
            },
            "email": {
              "type": "string"
            },
            "website": {
              "type": "string"
            }
          }
        },
        "yearByYearReturns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "year": {
                "type": "string"
              },
              "return": {
                "type": "string"
              }
            }
          }
        },
        "bestWorstReturns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "type": {
                "type": "string"
              },
              "return": {
                "type": "string"
              },
              "date": {
                "type": "string"
              },
              "investmentValue": {
                "type": "string"
              }
            }
          }
        },
        "averageReturn": {
          "type": "string"
        },
        "targetInvestors": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "taxInformation": {
          "type": "string"
        }
      }
    }
  }
}
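For the knowledge-graph and visualization use cases, a schema shaped like the one above can be walked to list its parent/child relationships. The sketch below is illustrative only: `schema_edges` is a hypothetical helper, not a library function, and `sample` is a shortened copy of the output above. For arrays it descends into the "items" entry so that fields of array elements are included.

```python
def schema_edges(name, spec):
    """Yield (parent, child) pairs for every nested property of a schema node."""
    # Arrays describe their element type under "items"; descend into it.
    if spec.get("type") == "array":
        spec = spec.get("items", {})
    for child, child_spec in spec.get("properties", {}).items():
        yield (name, child)
        yield from schema_edges(child, child_spec)


# Shortened version of the "portfolio" node from the output above.
sample = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "fees": {
            "type": "object",
            "properties": {"salesCharges": {"type": "string"}},
        },
        "yearByYearReturns": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"year": {"type": "string"}},
            },
        },
    },
}

edges = list(schema_edges("portfolio", sample))
for parent, child in edges:
    print(f"{parent} -> {child}")
```

With the full output, the same function would be called on the node stored under "ROOT" and "portfolio"; the resulting pairs could then be loaded into a graph store or a drawing tool of your choice.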

🤝 Contributing

Feel free to contribute, and join our Discord server to discuss improvements and share suggestions!

Please see the contributing guidelines.

Created by Scrapegraphai
