
WebScrape

WebScrape is a powerful Python-based web scraping tool that combines traditional web scraping with AI-powered content parsing. It allows users to extract, analyze, and visualize website data with advanced features for data processing and storage.

(Screenshot: web_scraper notebook running in Google Colab)

🌟 Features

Core Scraping Features

  • Extract comprehensive website data:
    • Title and metadata
    • Anchor tags and links
    • Images and their sources
    • Headings (H1, H2, H3)
    • Paragraphs and text content
    • Custom content based on user queries
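Extraction of this kind is typically done with BeautifulSoup, which the project acknowledges below. A minimal sketch of pulling the listed elements from a page, assuming the standard bs4 API (the function name and return shape are illustrative, not WebScrape's exact code):

```python
from bs4 import BeautifulSoup

def extract_basics(html: str) -> dict:
    """Pull title, links, image sources, headings, and paragraphs from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.string if soup.title else None,
        "links": [a.get("href") for a in soup.find_all("a", href=True)],
        "images": [img.get("src") for img in soup.find_all("img", src=True)],
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h1", "h2", "h3"])],
        "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")],
    }
```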

AI Integration

  • AI-powered content parsing using Hugging Face models (OPT-1.3B/350M)
  • Customizable queries for targeted extraction
  • Intelligent content analysis and structuring
  • Context-aware information extraction
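Query-driven parsing generally works by combining the scraped text and the user's query into a single prompt for the language model. A hedged sketch of that step; the prompt format and helper name are my assumptions, not the project's exact code, and the model call is commented out because it downloads weights:

```python
# from transformers import pipeline  # heavy dependency; uncomment to run the model

def build_prompt(content: str, query: str) -> str:
    """Combine scraped page text and a user query into one extraction prompt."""
    return (
        "Extract the requested information from the text below.\n"
        f"Text: {content}\n"
        f"Query: {query}\n"
        "Answer:"
    )

# Example usage with a Hugging Face OPT model (model id is a real HF checkpoint):
# generator = pipeline("text-generation", model="facebook/opt-350m")
# result = generator(build_prompt(page_text, "List the article headings"),
#                    max_new_tokens=100)
```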

Data Management

  • Structured data storage in JSON only (CSV/Excel export was removed for simplicity and reliability)
  • All scraped data lives in a single scraped_data.json file, with each website's data stored under a unique alias key
  • Caching system for improved performance
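The alias-keyed layout described above can be sketched as a small save helper: read the existing scraped_data.json if present, set the website's data under its alias key, and write the file back. The function name is mine; the file name and key layout follow this README:

```python
import json
from pathlib import Path

def save_scrape(alias: str, data: dict, path: str = "scraped_data.json") -> None:
    """Store one website's scraped data under its alias key in the shared JSON file."""
    p = Path(path)
    store = json.loads(p.read_text()) if p.exists() else {}
    store[alias] = data  # one alias key per website
    p.write_text(json.dumps(store, indent=2))
```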

User Interface

  • Command-line interface (CLI)
  • Interactive data visualization
  • Progress tracking and feedback

🚀 Getting Started

Prerequisites

  • Python 3.7+
  • Git
  • (Optional) Hugging Face token for AI features

Installation

  1. Clone the repository:
     git clone https://github.com/aa-sikkkk/WebScrape.git
     cd WebScrape
  2. Install dependencies:
     pip install -r requirements.txt
  3. Run the scraper:
     python web_scraper_notebook.py

Using Google Colab

You can use WebScrape on Google Colab for free. The project uses Hugging Face models for data parsing.

Note: Please use the free resources responsibly:

  • Maximum 12 hours per session
  • Avoid creating back-to-back sessions
  • Consider Colab Pro/Pro+ for better GPUs and longer runtimes

📊 Data Structure


  • All data is stored in scraped_data.json.
  • Each website's data is accessible by its alias (e.g., medium, wikipedia, etc.).
  • All data management is via JSON for reliability and clarity.
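Looking up one site's data is then a plain dictionary access on the parsed JSON. A minimal sketch, assuming the alias-keyed layout above (the helper name is mine; the alias values are the README's examples):

```python
import json

def load_site(alias: str, path: str = "scraped_data.json") -> dict:
    """Return the stored data for one website, looked up by its alias."""
    with open(path) as f:
        return json.load(f)[alias]

# Example: load_site("medium") or load_site("wikipedia")
```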

🤝 Contributing

We welcome contributions! Please feel free to:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request
  4. Open issues for bugs or feature requests

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.

🙏 Acknowledgements

  • BeautifulSoup - HTML parsing
  • BeautifulTable - Data display
  • Hugging Face - AI models
  • Django - Web framework
  • All contributors and users of the project