AI Web Scraper

This project is an AI-powered web scraper that allows you to extract information from HTML sources based on user-defined requirements. It generates scraping code and executes it to retrieve the desired data.

Prerequisites

Before running the AI Web Scraper, make sure you have the following:

- Python 3.x
- The Python packages listed in requirements.txt
- An OpenAI API key with access to GPT-4

Installation

Clone the project repository:

git clone https://github.com/dirkjbreeuwer/gpt-automated-web-scraper

Navigate to the project directory:
```
cd gpt-automated-web-scraper
```
Install the required Python packages:
```
pip install -r requirements.txt
```
Set up the OpenAI GPT-4 API key:
- Obtain an API key from OpenAI by following their documentation.
- Rename the file called .env.example to .env in the project directory.
- Add the following line to the .env file, replacing YOUR_API_KEY with your actual API key:
```
OPENAI_API_KEY=YOUR_API_KEY
```
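To check that the key is readable before running the scraper, a small script along these lines may help. It is a minimal sketch that uses only the standard library; `load_env_file` is a hypothetical helper written for this example, and the project itself may load the .env file differently (for instance through a dedicated package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper for illustration; not part of the project.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Keep any value that is already set in the real environment.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


load_env_file()
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key or api_key == "YOUR_API_KEY":
    print("OPENAI_API_KEY is missing or still set to the placeholder value.")
else:
    print("OPENAI_API_KEY loaded.")
```

Run it from the project directory so that the relative path .env resolves to the file you just edited.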

Usage

To use the AI Web Scraper, run the gpt-scraper.py script with the desired command-line arguments.

Command-line Arguments

The following command-line arguments are available:

- --source: The URL or local path of the HTML source to scrape.
- --source-type: The type of the source. Specify either "url" or "file".
- --requirements: The user-defined requirements for scraping, written in plain language.
- --target-string: An example string that appears in the page you want to scrape. Because of the maximum token limit of GPT-4 (4k tokens), the AI model processes only a smaller subset of the HTML, located where the desired data is.
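The idea behind --target-string can be illustrated as follows. This is a sketch under stated assumptions, not the project's actual implementation: `extract_html_window` and `context_chars` are hypothetical names, and the window is measured in characters rather than tokens for simplicity.

```python
def extract_html_window(html, target_string, context_chars=2000):
    """Return the slice of html surrounding the first match of target_string.

    Hypothetical helper: the window size is counted in characters, which is
    only a rough stand-in for the model's token limit.
    """
    position = html.find(target_string)
    if position == -1:
        raise ValueError(f"Target string {target_string!r} not found in the HTML source.")
    start = max(0, position - context_chars)
    end = min(len(html), position + len(target_string) + context_chars)
    return html[start:end]


# Build a long dummy page with the target buried in the middle.
page = (
    "<html><body>"
    + "<p>filler</p>" * 500
    + "<td>Chicago Blackhawks</td>"
    + "<p>filler</p>" * 500
    + "</body></html>"
)
snippet = extract_html_window(page, "Chicago Blackhawks", context_chars=100)
print(f"full page: {len(page)} characters, window: {len(snippet)} characters")
```

A string that does not occur in the page raises an error here, which is why the target string should be copied exactly as it appears in the HTML source.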

Example Usage

Here is an example command for using the AI Web Scraper:

python3 gpt-scraper.py --source-type "url" --source "https://www.scrapethissite.com/pages/forms/" --requirements "Print a JSON file with all the information available for the Chicago Blackhawks" --target-string "Chicago Blackhawks"

Replace the values for --source, --requirements, and --target-string with your specific values.
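For reference, the four documented flags could be parsed with argparse roughly as shown here. This is an illustrative sketch only: `build_parser` is a hypothetical function, and marking every flag as required is an assumption, since the actual gpt-scraper.py may define defaults or additional options.

```python
import argparse


def build_parser():
    """Build a parser for the four documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="gpt-scraper.py",
        description="AI Web Scraper (illustrative argument parser)",
    )
    parser.add_argument("--source", required=True,
                        help="URL or local path of the HTML source to scrape")
    parser.add_argument("--source-type", required=True, choices=["url", "file"],
                        help='type of the source: "url" or "file"')
    parser.add_argument("--requirements", required=True,
                        help="user-defined requirements for scraping")
    parser.add_argument("--target-string", required=True,
                        help="example string found in the page to scrape")
    return parser


# Parse the same values used in the example command above.
args = build_parser().parse_args([
    "--source-type", "url",
    "--source", "https://www.scrapethissite.com/pages/forms/",
    "--requirements",
    "Print a JSON file with all the information available for the Chicago Blackhawks",
    "--target-string", "Chicago Blackhawks",
])
print(args.source_type, args.source, args.target_string)
```

Note that argparse converts the hyphen in --source-type and --target-string to an underscore in the attribute names (`args.source_type`, `args.target_string`).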

License

This project is licensed under the MIT License. Feel free to modify and use it according to your needs.
