WebScrape is a powerful Python-based web scraping tool that combines traditional web scraping with AI-powered content parsing. It allows users to extract, analyze, and visualize website data with advanced features for data processing and storage.
- Extract comprehensive website data:
  - Title and metadata
  - Anchor tags and links
  - Images and their sources
  - Headings (H1, H2, H3)
  - Paragraphs and text content
  - Custom content based on user queries
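
To make the fields listed above concrete, here is a minimal sketch of that kind of extraction. It is not the project's code: the project credits BeautifulSoup for HTML parsing, while this example uses the standard library's `html.parser` so it runs without extra dependencies. The names `PageSummaryParser` and `summarize_html`, and the shape of the returned dictionary, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Hypothetical helper: collects title, links, images, headings, paragraphs."""

    # Tags whose text content is captured
    TEXT_TAGS = {"title", "h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.data = {
            "title": "",
            "links": [],
            "images": [],
            "headings": {"h1": [], "h2": [], "h3": []},
            "paragraphs": [],
        }
        self._current = None  # text tag currently open, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.data["links"].append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.data["images"].append(attrs["src"])
        elif tag in self.TEXT_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace into single spaces
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.data["title"] = text
        elif tag == "p":
            self.data["paragraphs"].append(text)
        else:
            self.data["headings"][tag].append(text)
        self._current = None


def summarize_html(html):
    """Parse an HTML string and return the collected fields as a dict."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    return parser.data
```

Calling `summarize_html(page_source)` on an HTML string returns a plain dictionary, which is convenient for the JSON storage described further down.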
 
 
- AI-powered content parsing using:
  - Hugging Face models (OPT-1.3B/350M)
  - Customizable queries for targeted extraction
  - Intelligent content analysis and structuring
  - Context-aware information extraction
 
 
- Structured data storage in JSON format only (CSV/Excel export removed for simplicity and reliability)
- All scraped data is stored in a single `scraped_data.json` file
- Each website's data is stored under a unique alias key inside the JSON file
- Caching system for improved performance
- Unique alias system for data organization
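
The storage model described above (one JSON file, one alias key per website, with stored entries doubling as a cache) could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the function names `load_store`, `save_under_alias`, and `get_cached` are hypothetical, and only the file name `scraped_data.json` comes from this README.

```python
import json
from pathlib import Path

DEFAULT_PATH = "scraped_data.json"  # file name taken from this README


def load_store(path=DEFAULT_PATH):
    """Return the whole alias -> data mapping, or {} if the file is missing."""
    file = Path(path)
    if not file.exists():
        return {}
    with file.open(encoding="utf-8") as handle:
        return json.load(handle)


def save_under_alias(alias, data, path=DEFAULT_PATH):
    """Store one website's data under its alias, keeping other entries."""
    store = load_store(path)
    store[alias] = data
    with Path(path).open("w", encoding="utf-8") as handle:
        json.dump(store, handle, indent=2, ensure_ascii=False)
    return store


def get_cached(alias, path=DEFAULT_PATH):
    """Return previously stored data for an alias, or None if not stored."""
    return load_store(path).get(alias)
```

A scraper built this way can call `get_cached(alias)` first and only fetch the page again when the result is `None`.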
 
- Command-line interface (CLI)
 - Interactive data visualization
 - Progress tracking and feedback
 
- Python 3.7+
 - Git
 - (Optional) Hugging Face token for AI features
 
- Clone the repository:

      git clone https://github.com/aa-sikkkk/WebScrape.git
      cd WebScrape

- Install dependencies:

      pip install -r requirements.txt

- Run the scraper:

      python web_scraper_notebook.py

You can use WebScrape on Google Colab for free. The project uses Hugging Face models for data parsing.

Note: Please use the free resources responsibly:
- Maximum 12 hours per session
 - Avoid creating back-to-back sessions
 - Consider Colab Pro/Pro+ for better GPUs and longer runtimes
 
- All data is stored in `scraped_data.json`.
- Each website's data is accessible by its alias (e.g., `medium`, `wikipedia`, etc.).
- All data management is via JSON for reliability and clarity.
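
Because everything lives in one JSON file keyed by alias, the data can be inspected with a few lines of standard Python. The sketch below assumes the top level of `scraped_data.json` is an object mapping each alias to that website's data, as described above; `list_aliases` and `get_site` are hypothetical helper names, not part of the project.

```python
import json


def list_aliases(path="scraped_data.json"):
    """Return the aliases present in the data file, sorted alphabetically."""
    with open(path, encoding="utf-8") as handle:
        return sorted(json.load(handle))


def get_site(alias, path="scraped_data.json"):
    """Return the data stored for one alias, with a readable error if absent."""
    with open(path, encoding="utf-8") as handle:
        store = json.load(handle)
    if alias not in store:
        raise KeyError(f"No data stored under alias {alias!r}")
    return store[alias]
```

For example, `get_site("wikipedia")` would return whatever was scraped and saved under the `wikipedia` alias.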
 
We welcome contributions! Please feel free to:
- Fork the repository
 - Create a feature branch
 - Submit a pull request
 - Open issues for bugs or feature requests
 
This project is licensed under the MIT License. See the LICENSE file for details.
- BeautifulSoup - HTML parsing
 - BeautifulTable - Data display
 - Hugging Face - AI models
 - Django - Web framework
 - All contributors and users of the project
 


