
WebScrape

WebScrape is a powerful Python-based web scraping tool that combines traditional web scraping with AI-powered content parsing. It allows users to extract, analyze, and visualize website data with advanced features for data processing and storage.

(Screenshot: web_scraper notebook running in Google Colab)

🌟 Features

Core Scraping Features

  • Extract comprehensive website data:
    • Title and metadata
    • Anchor tags and links
    • Images and their sources
    • Headings (H1, H2, H3)
    • Paragraphs and text content
    • Custom content based on user queries
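Extraction of this kind is typically done with BeautifulSoup, which the project acknowledges below. A minimal sketch of pulling the listed elements from a page, assuming the standard bs4 API (the function name and return shape are illustrative, not WebScrape's exact code):

```python
from bs4 import BeautifulSoup

def extract_basics(html: str) -> dict:
    """Pull title, links, image sources, headings, and paragraphs from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.string if soup.title else None,
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
        "images": [img.get("src") for img in soup.find_all("img", src=True)],
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
    }
```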

AI Integration

  • AI-powered content parsing using Hugging Face models (OPT-1.3B/350M)
  • Customizable queries for targeted extraction
  • Intelligent content analysis and structuring
  • Context-aware information extraction
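Query-driven parsing generally works by combining the scraped text and the user's query into a single prompt for the language model. A hedged sketch of that step; the prompt format and helper name are my assumptions, not the project's exact code, and the model call is commented out because it downloads weights:

```python
# from transformers import pipeline  # heavy dependency; uncomment to run the model

def build_prompt(content: str, query: str) -> str:
    """Combine scraped page text and a user query into one extraction prompt."""
    return (
        "Extract the requested information from the text below.\n"
        f"Text: {content}\n"
        f"Query: {query}\n"
        "Answer:"
    )

# Example usage with a Hugging Face OPT model (model id is a real HF checkpoint):
# generator = pipeline("text-generation", model="facebook/opt-350m")
# result = generator(build_prompt(page_text, "List the article headings"),
#                    max_new_tokens=100)
```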

Data Management

  • Structured data storage in JSON only (CSV/Excel export was removed for simplicity and reliability)
  • All scraped data lives in a single scraped_data.json file, with each website's data stored under a unique alias key
  • Caching system for improved performance
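The alias-keyed layout described above can be sketched as a small save helper: read the existing scraped_data.json if present, set the website's data under its alias key, and write the file back. The function name is mine; the file name and key layout follow this README:

```python
import json
from pathlib import Path

def save_scrape(alias: str, data: dict, path: str = "scraped_data.json") -> None:
    """Store one website's scraped data under its alias key in the shared JSON file."""
    p = Path(path)
    store = json.loads(p.read_text()) if p.exists() else {}
    store[alias] = data  # one alias key per website
    p.write_text(json.dumps(store, indent=2))
```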

User Interface

  • Command-line interface (CLI)
  • Interactive data visualization
  • Progress tracking and feedback

🚀 Getting Started

Prerequisites

  • Python 3.7+
  • Git
  • (Optional) Hugging Face token for AI features

Installation

  1. Clone the repository:
     git clone https://github.com/aa-sikkkk/WebScrape.git
     cd WebScrape
  2. Install dependencies:
     pip install -r requirements.txt
  3. Run the scraper:
     python web_scraper_notebook.py

Using Google Colab

You can use WebScrape on Google Colab for free. The project uses Hugging Face models for data parsing.

Note: Please use the free resources responsibly:

  • Maximum 12 hours per session
  • Avoid creating back-to-back sessions
  • Consider Colab Pro/Pro+ for better GPUs and longer runtimes

📊 Data Structure


  • All data is stored in scraped_data.json.
  • Each website's data is accessible by its alias (e.g., medium, wikipedia, etc.).
  • All data management is via JSON for reliability and clarity.
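Looking up one site's data is then a plain dictionary access on the parsed JSON. A minimal sketch, assuming the alias-keyed layout above (the helper name is mine; the alias values are the README's examples):

```python
import json

def load_site(alias: str, path: str = "scraped_data.json") -> dict:
    """Return the stored data for one website, looked up by its alias."""
    with open(path) as f:
        return json.load(f)[alias]

# Example: load_site("medium") or load_site("wikipedia")
```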

🤝 Contributing

We welcome contributions! Please feel free to:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request
  4. Open issues for bugs or feature requests

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgements

  • BeautifulSoup - HTML parsing
  • BeautifulTable - Data display
  • Hugging Face - AI models
  • Django - Web framework
  • All contributors and users of the project