GitHub - GHOST-Science-Club/statutscan-data-scraping

This is a tool for scraping and processing data from various sources, used in a RAG project. See our app.

Installation (for Linux/macOS)

Ensure you have Python 3.12 installed. Remember to add OPEN_AI_KEY and MONGO_KEY to the .env file.
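Before running the app, it can help to confirm that both keys are actually present in .env. The following is a minimal sketch, not part of the project: it assumes the .env file sits in the repository root and uses plain KEY=VALUE lines, and the helper names read_env_file and missing_keys are hypothetical.

```python
from pathlib import Path

# Key names taken from the README; everything else here is an assumption.
REQUIRED_KEYS = ("OPEN_AI_KEY", "MONGO_KEY")


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(path=".env"):
    """Return the required keys that are absent or empty in the given file."""
    env_path = Path(path)
    if not env_path.is_file():
        return list(REQUIRED_KEYS)
    values = read_env_file(env_path)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("All required keys are set.")
```

An empty list from missing_keys() means both keys have a non-empty value.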

Clone the repo and cd into it

git clone https://github.com/GHOST-Science-Club/statutscan-data-scraping.git
cd statutscan-data-scraping

Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate

Install dependencies and create project structure

pip install --upgrade pip
pip install setuptools
pip install .

Run application

Add the URLs you want to scrape to to_scrape/urls_to_scrape.csv and run the app:

python3 run.py --param

Parameters (use one in place of --param):

--scrape
--crawl
--crawl_and_scrape
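The README does not show how run.py parses these flags. As an illustration only, three mode flags of this kind could be handled with argparse as sketched below, assuming exactly one flag is passed per run. The functions build_parser and selected_mode are hypothetical and may not match the real run.py.

```python
import argparse

# Flag names come from the README; the parsing logic is an assumption.
MODES = ("scrape", "crawl", "crawl_and_scrape")


def build_parser():
    """Build a parser that requires exactly one of the three mode flags."""
    parser = argparse.ArgumentParser(prog="run.py")
    group = parser.add_mutually_exclusive_group(required=True)
    for mode in MODES:
        group.add_argument(
            f"--{mode}",
            action="store_true",
            help=f"run in {mode} mode",
        )
    return parser


def selected_mode(argv):
    """Return the name of the mode flag present in argv."""
    args = build_parser().parse_args(argv)
    for mode in MODES:
        if getattr(args, mode):
            return mode
    return None  # unreachable while the group is required


if __name__ == "__main__":
    import sys

    print("Selected mode:", selected_mode(sys.argv[1:]))
```

With this layout, passing no flag or two flags at once makes argparse print a usage message and exit.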

Structure

statutscan-data-scraping/
├── uniscrape/              # Application source code
├── to_scrape/              # Folder for files to be scraped
│   ├── urls_to_scrape.csv
│   └── pdfs/
├── logs/                   # Application logs
│   └── app_log.log
├── visited/                # Visited documents
├── setup.py                # Installation script
├── requirements.txt        # List of dependencies
└── README.md               # Documentation
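The to_scrape/urls_to_scrape.csv file can be edited by hand, but a small helper may be convenient when there are many URLs. The sketch below assumes the file holds one URL per row with no header; the README does not specify the format, so check an existing file first. The function add_urls is hypothetical and the example URL is a placeholder.

```python
import csv
from pathlib import Path


def add_urls(urls, csv_path="to_scrape/urls_to_scrape.csv"):
    """Append URLs not already listed; returns the URLs actually added.

    Assumes one URL per row and no header row (not confirmed by the README).
    """
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)

    # Collect URLs already in the file so duplicates are skipped.
    existing = set()
    if path.is_file():
        with path.open(newline="", encoding="utf-8") as handle:
            existing = {row[0] for row in csv.reader(handle) if row}

    new_urls = []
    for url in urls:
        if url not in existing:
            existing.add(url)
            new_urls.append(url)

    with path.open("a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([url] for url in new_urls)
    return new_urls


if __name__ == "__main__":
    added = add_urls(["https://example.com/regulations"])
    print(f"Added {len(added)} URL(s)")
```

Skipping duplicates keeps repeated runs from growing the file with the same entries.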

Uninstallation

pip uninstall statutscan-data-scraping
rm -rf venv

Issues

Please report all issues in the Issues section on GitHub.
