This is a tool for scraping and processing data from various sources for use in a RAG (retrieval-augmented generation) project. See our app.
Ensure you have Python 3.12 installed. Remember to add OPEN_AI_KEY and MONGO_KEY to the .env file.
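Before running the tool, it can help to confirm that both keys are actually present. The following is a minimal sketch, not part of the project: it assumes the .env file sits in the current directory and uses plain KEY=VALUE lines, and the helper name `missing_env_keys` is hypothetical.

```python
from pathlib import Path

# Key names taken from the README; everything else here is an assumption.
REQUIRED_KEYS = ("OPEN_AI_KEY", "MONGO_KEY")


def missing_env_keys(env_path=".env", required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in a KEY=VALUE file."""
    found = {}
    path = Path(env_path)
    if path.exists():
        for raw in path.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            # Drop surrounding whitespace and optional quotes around the value.
            found[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not found.get(key)]


if __name__ == "__main__":
    missing = missing_env_keys()
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

A missing file is treated the same as a file with no keys, so the check reports both names in that case.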
git clone https://github.com/GHOST-Science-Club/statutscan-data-scraping.git
cd statutscan-data-scraping
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install setuptools
pip install .
Add the URLs you want to scrape to to_scrape/urls_to_scrape.csv and run the app:
python3 run.py --param
Parameters (use one of these in place of --param):
- --scrape
- --crawl
- --crawl_and_scrape
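As an illustration of how three such flags can be handled, here is a hedged sketch using Python's standard argparse module. It is not the project's actual run.py: the function names `build_parser` and `selected_mode` are hypothetical, and treating the flags as mutually exclusive (exactly one per run) is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical parser that accepts exactly one of the README's flags."""
    # allow_abbrev=False keeps a shortened flag from matching a longer one.
    parser = argparse.ArgumentParser(prog="run.py", allow_abbrev=False)
    modes = parser.add_mutually_exclusive_group(required=True)
    modes.add_argument("--scrape", action="store_true")
    modes.add_argument("--crawl", action="store_true")
    modes.add_argument("--crawl_and_scrape", action="store_true")
    return parser


def selected_mode(argv):
    """Return the name of the flag that was passed, without the dashes."""
    args = build_parser().parse_args(argv)
    for name in ("scrape", "crawl", "crawl_and_scrape"):
        if getattr(args, name):
            return name
    return None


# Explicit argument list so the example does not depend on sys.argv.
print(selected_mode(["--scrape"]))  # prints "scrape"
```

With this layout, passing no flag or two flags at once makes argparse print a usage message and exit, which is one simple way to enforce a single mode per run.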
statutscan-data-scraping/
├── uniscrape/            # Application source code
├── to_scrape/            # Folder for files to be scraped
│   ├── urls_to_scrape.csv
│   └── pdfs/
├── logs/                 # Application logs
│   └── app_log.log
├── visited/              # Visited documents
├── setup.py              # Installation script
├── requirements.txt      # List of dependencies
└── README.md             # Documentation
pip uninstall statutscan-data-scraping
rm -rf venv
Please report any issues in the Issues section on GitHub.