# GroqCrawl

## Table of Contents

- Introduction
- Features
- Installation
- Usage
- Advanced Options
- Output Formats
- Examples
- Troubleshooting
- Contributing
- License
## Introduction

GroqCrawl is a powerful and user-friendly web crawling and scraping application built with Streamlit and powered by PocketGroq. It provides an intuitive interface for extracting LLM-friendly, AI-consumable content from websites, with support for single-page scraping, multi-page crawling, and site mapping.

Whether you're a data scientist, researcher, or web developer, GroqCrawl offers a seamless experience for gathering web data in various formats, including Markdown, HTML, and structured data.
## Features

- Single URL Scraping: Extract content from individual web pages.
- Website Crawling: Traverse multiple pages of a website, respecting depth and page limits.
- Site Mapping: Generate a list of all accessible URLs within a website.
- Multiple Output Formats: Choose from Markdown, HTML, and structured data representations.
- Advanced Crawling Options: Customize your crawl with exclude/include paths, depth limits, and more.
- Interactive Results Display: View scraped content directly in the Streamlit interface.
- Download Options: Save your results as JSON files for further processing.
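The crawling behavior above (traversing pages while respecting depth and page limits) can be sketched as a breadth-first traversal. This is an illustrative sketch, not GroqCrawl's actual implementation; `get_links` is a hypothetical callback that returns the links found on a page:

```python
from collections import deque

def crawl(start_url, get_links, max_depth=2, max_pages=10):
    """Breadth-first crawl that respects depth and page limits (illustrative only)."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages
site = {"/": ["/a", "/b"], "/a": ["/c"]}
print(crawl("/", lambda u: site.get(u, []), max_depth=1, max_pages=10))
# → ['/', '/a', '/b']
```

With `max_depth=1`, links found on `/a` are never followed; with `max_pages=2`, the crawl stops after visiting two pages regardless of depth.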
## Installation

1. Ensure you have Python 3.7 or later installed on your system.

2. Clone the GroqCrawl repository:

   ```bash
   git clone https://github.com/yourusername/groqcrawl.git
   cd groqcrawl
   ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up your PocketGroq API key:
   - Create a `.env` file in the project root directory.
   - Add your API key to the file:

     ```
     GROQ_API_KEY=your_api_key_here
     ```
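`.env` files like the one above are plain `KEY=VALUE` text, typically loaded at startup by a library such as python-dotenv. As a rough illustration of what such a loader does (not the library's actual code), a minimal parser looks like this:

```python
def parse_env(text):
    """Parse simple KEY=VALUE lines from .env-style text; blanks and comments are ignored."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

print(parse_env("# PocketGroq credentials\nGROQ_API_KEY=your_api_key_here"))
# → {'GROQ_API_KEY': 'your_api_key_here'}
```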
## Usage

To run GroqCrawl:

1. Navigate to the project directory:

   ```bash
   cd path/to/groqcrawl
   ```

2. Launch the Streamlit app:

   ```bash
   streamlit run groqcrawl.py
   ```

3. Open your web browser and go to the URL displayed in the terminal (usually `http://localhost:8501`).

4. Use the interface to select your scraping type, enter a URL, and configure options.

5. Click "Run" to start the scraping/crawling process.
## Advanced Options

- Max Depth: Set the maximum depth for crawling (Crawl mode only).
- Max Pages: Limit the total number of pages to crawl (Crawl mode only).
- Exclude Paths: Specify URL patterns to exclude from crawling.
- Include Only Paths: Limit crawling to specific URL patterns.
- Ignore Sitemap: Skip using the sitemap.xml for crawling.
- Allow Backwards Links: Enable crawling of links that point to previously visited pages.
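To illustrate how exclude/include path options of this kind typically behave, here is a minimal sketch using shell-style glob patterns via Python's standard `fnmatch` module. The function name and parameters are hypothetical, not GroqCrawl's API:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def should_crawl(url, include_paths=None, exclude_paths=None):
    """Return True if the URL's path passes the include/exclude glob filters."""
    path = urlparse(url).path
    # If include patterns are given, the path must match at least one of them.
    if include_paths and not any(fnmatch(path, pat) for pat in include_paths):
        return False
    # Exclude patterns always win over includes.
    if exclude_paths and any(fnmatch(path, pat) for pat in exclude_paths):
        return False
    return True

print(should_crawl("https://example.com/blog/post-1", exclude_paths=["/admin/*"]))  # True
print(should_crawl("https://example.com/admin/users", exclude_paths=["/admin/*"]))  # False
```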
## Output Formats

- Markdown:
  - Human-readable text format.
  - Ideal for content analysis and easy viewing.
- HTML:
  - Raw HTML content of the page.
  - Useful for detailed structure analysis or further processing.
- Structured Data:
  - JSON format containing:
    - Full text content
    - Headings (h1 to h6)
    - Links (text and href)
    - Images (src and alt attributes)
    - JSON-LD data (if available)
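As a rough illustration, a structured-data result for a page could look like the following. The exact field names here are assumptions for this sketch, not GroqCrawl's documented schema:

```python
import json

# Hypothetical structured-data result; field names are illustrative only
result = {
    "content": "Example Domain. This domain is for use in illustrative examples in documents.",
    "headings": {"h1": ["Example Domain"], "h2": []},
    "links": [{"text": "More information...", "href": "https://www.iana.org/domains/example"}],
    "images": [{"src": "/logo.png", "alt": "Site logo"}],
    "json_ld": [],
}

print(json.dumps(result, indent=2))
```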
## Examples

### Single URL Scraping

1. Select "Single URL (/scrape)" from the radio buttons.
2. Enter a URL, e.g., `https://example.com`.
3. Choose desired output formats.
4. Click "Run".

### Website Crawling

1. Select "Crawl (/crawl)" from the radio buttons.
2. Enter the starting URL, e.g., `https://example.com`.
3. Set Max Depth and Max Pages in the Options section.
4. Choose desired output formats.
5. Click "Run".

### Site Mapping

1. Select "Map (/map)" from the radio buttons.
2. Enter the website URL, e.g., `https://example.com`.
3. Click "Run".
## Troubleshooting

- API Key Issues: Ensure your PocketGroq API key is correctly set in the `.env` file.
- Connection Errors: Check your internet connection and verify the URL is accessible.
- Slow Performance: For large websites, try reducing Max Depth or Max Pages.
- Missing Content: Some websites may block scraping. Check the site's robots.txt file and respect its scraping policies.
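Python's standard library can check robots.txt rules for you via `urllib.robotparser`. A minimal, offline example — the rules below are made up for illustration; in practice you would call `set_url(...)` and `read()` to fetch the site's real file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed directly instead of fetched over the network
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```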
## Contributing

We welcome contributions to GroqCrawl! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them with clear, descriptive messages.
- Push your changes to your fork.
- Submit a pull request with a detailed description of your changes.
## License

GroqCrawl is released under the MIT License. See the LICENSE file for details.
For more information or support, please open an issue on the GitHub repository.
Happy crawling!