Scraper uc #294

aravindkarnam · 2024-11-26T11:37:10Z

No description provided.

- Implement playwright_stealth for better bot detection avoidance
- Add user simulation and navigator override options
- Improve iframe processing and browser selection
- Enhance error reporting and debugging capabilities
- Optimize image processing and parallel crawling
- Add new example for user simulation feature
- Added support for including links in Markdown content by defining a new flag `include_links_on_markdown` in the `crawl` method.

- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation

- Add OverlappingWindowChunking and improve SlidingWindowChunking
- Update CHUNK_TOKEN_THRESHOLD to 2048 tokens
- Optimize AsyncPlaywrightCrawlerStrategy close method
- Enhance flexibility in CosineStrategy with generic embedding model loading
- Improve JSON-based extraction strategies
- Add knowledge graph generation example
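The window-based chunking mentioned above can be illustrated with a minimal sketch. This is not the code of `SlidingWindowChunking` or `OverlappingWindowChunking`; the function names, the word-level splitting, and the parameters `window_size`, `step`, and `overlap` are assumptions made for illustration only.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int, step: int) -> List[str]:
    """Split text into windows of `window_size` words, advancing `step` words each time.

    Hypothetical helper: the real chunking strategies may count tokens
    rather than whitespace-separated words.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    words = text.split()
    if not words:
        return []
    if len(words) <= window_size:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


def overlapping_window_chunks(text: str, window_size: int, overlap: int) -> List[str]:
    """Same idea, expressed as how many words consecutive windows share."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    return sliding_window_chunks(text, window_size, window_size - overlap)
```

Expressing the second function in terms of the first keeps a single loop to maintain: an overlap of `k` words is the same as a step of `window_size - k`.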

- Integrate customized html2text library for flexible Markdown output
- Add options to exclude external links and images
- Improve content scraping efficiency and error handling
- Update AsyncPlaywrightCrawlerStrategy for faster closing
- Enhance CosineStrategy with generic embedding model loading


- Add ContentCleaningStrategy for improved content extraction
- Implement advanced proxy configuration with authentication
- Enhance image source detection and handling
- Add fit_markdown and fit_html for refined content output
- Improve external link and image handling flexibility


…pabilities

• Add smart overlay removal system for handling popups and modals
• Improve screenshot functionality with configurable timing controls
• Implement URL normalization and enhanced link processing
• Add custom base directory support for cache storage
• Refine external content filtering and social media domain handling

This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support.

Breaking changes: None
Issue numbers: None
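As an illustration of URL normalization with duplicate detection, here is a minimal sketch built on the standard library. It is not the commit's implementation: the function names and the specific rules (resolve against a base URL, lowercase scheme and host, drop default ports and fragments) are assumptions. It also ignores userinfo and IPv6 hosts for brevity.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_url(href: str, base_url: str) -> str:
    """Resolve `href` against `base_url` and return a canonical form (hypothetical rules)."""
    absolute = urljoin(base_url, href.strip())
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it does not change the fetched resource.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe_links(hrefs: Iterable[str], base_url: str) -> List[str]:
    """Return normalized links in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for href in hrefs:
        url = normalize_url(href, base_url)
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```

With these rules, `/docs`, `/docs#top`, and `https://EXAMPLE.com/docs` all collapse to one entry when the base URL is on `example.com`.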

According to unclecode#102, the specified requirements are minimum versions. Currently they are pinned to fixed versions in requirements.txt and setup.py, so projects that consume this package are limited to exactly those versions instead of a more flexible range. This PR addresses this.
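The change from fixed pins to minimum versions can be sketched as a small transformation over requirement lines. The helper below is hypothetical and not part of this PR, and the package names in the usage note are placeholders rather than the project's actual dependencies.

```python
import re
from typing import List

# Matches "name==version" with an optional trailing part (environment marker or comment).
_PIN = re.compile(r"^\s*([A-Za-z0-9_.\-\[\],]+)\s*==\s*([^\s;#]+)(.*)$")


def relax_pins(lines: List[str]) -> List[str]:
    """Rewrite exact pins (==) as minimum versions (>=); leave every other line untouched.

    Expects lines without trailing newlines.
    """
    relaxed = []
    for line in lines:
        match = _PIN.match(line)
        if match:
            name, version, rest = match.groups()
            relaxed.append(f"{name}>={version}{rest}")
        else:
            relaxed.append(line)
    return relaxed
```

For example, `relax_pins(["examplepkg==1.2.3"])` yields `["examplepkg>=1.2.3"]`, which lets a consuming project install any compatible newer release.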

- Introduced the PruningContentFilter for better content relevance.
- Implemented comprehensive unit tests for verification of functionality.
- Enhanced existing BM25ContentFilter tests for edge case coverage.
- Updated documentation to include usage examples for new filter.


- Enhanced the web scraping strategy with new methods for optimized media handling.
- Added new utility functions for better content processing.
- Refined existing features for improved accuracy and efficiency in scraping tasks.
- Introduced more robust filtering criteria for media elements.

… modes, dynamic viewport adjustment, and session management

### New Features:
- **Text-Only Mode**: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- **Light Mode**: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- **Dynamic Viewport Adjustment**: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- **Full Page Scanning**: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- **Session Management**: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.

This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.
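The session management idea, a `create_session` call that hands back a unique ID while viewport dimensions come from instance attributes instead of hardcoded values, might look roughly like the sketch below. Only the names `create_session`, `viewport_width`, and `viewport_height` come from the commit message; the `Session` and `SessionRegistry` classes, the default dimensions, and the use of `uuid` are assumptions, and no real browser is involved.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    """Stand-in for a browser context; a real one would hold a page or context object."""
    session_id: str
    viewport_width: int
    viewport_height: int
    cookies: Dict[str, str] = field(default_factory=dict)


class SessionRegistry:
    """Hypothetical registry that tracks sessions by unique ID."""

    def __init__(self, viewport_width: int = 1280, viewport_height: int = 720) -> None:
        # Assumed defaults; the point is that they are configurable, not hardcoded per call.
        self.viewport_width = viewport_width
        self.viewport_height = viewport_height
        self._sessions: Dict[str, Session] = {}

    def create_session(self, cookies: Optional[Dict[str, str]] = None) -> str:
        session_id = uuid.uuid4().hex  # unique ID for later lookups
        self._sessions[session_id] = Session(
            session_id=session_id,
            viewport_width=self.viewport_width,
            viewport_height=self.viewport_height,
            cookies=dict(cookies or {}),
        )
        return session_id

    def get(self, session_id: str) -> Session:
        return self._sessions[session_id]

    def close_session(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)
```

Returning the ID rather than the session object keeps callers decoupled from the underlying browser state, which is one plausible reason for an ID-based design.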


Enhance Async Crawler with storage state handling

- Updated Async Crawler to support storage state management.
- Added error handling for URL validation in Async Web Crawler.
- Modified README logo and improved .gitignore entries.
- Fixed issues in multiple files for better code robustness.

- Introduced new async crawl strategy with session management.
- Added BrowserManager for improved browser management.
- Enhanced documentation, focusing on storage state and usage examples.
- Improved error handling and logging for sessions.
- Added JavaScript snippets for customizing navigator properties.


- Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance.
- Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`.
- Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters.
- Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.

- Introduced new configuration classes: BrowserConfig and CrawlerRunConfig.
- Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management.
- Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters.
- Improved error handling with detailed context extraction during exceptions.
- Enhanced overall maintainability and usability of the web crawler.
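The general pattern of replacing many loose keyword arguments with two configuration objects, one for the browser and one for a single run, can be sketched with dataclasses. The classes below are deliberately named `ExampleBrowserSettings` and `ExampleRunSettings` because their fields are assumptions for illustration; they do not reproduce the fields of `BrowserConfig` or `CrawlerRunConfig`.

```python
from dataclasses import asdict, dataclass, replace
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ExampleBrowserSettings:
    """Settings that live as long as the browser (hypothetical fields)."""
    headless: bool = True
    viewport_width: int = 1280
    viewport_height: int = 720
    proxy: Optional[str] = None


@dataclass(frozen=True)
class ExampleRunSettings:
    """Settings that may differ on every crawl call (hypothetical fields)."""
    screenshot: bool = False
    pdf: bool = False
    delay_before_return_html: float = 0.0


def plan_crawl(
    url: str,
    browser: Optional[ExampleBrowserSettings] = None,
    run: Optional[ExampleRunSettings] = None,
) -> Dict[str, Any]:
    """Merge both settings objects into one flat plan instead of passing many kwargs."""
    browser = browser or ExampleBrowserSettings()
    run = run or ExampleRunSettings()
    return {"url": url, **asdict(browser), **asdict(run)}


# Derive a variant for one run without mutating the shared defaults.
pdf_run = replace(ExampleRunSettings(), pdf=True)
plan = plan_crawl("https://example.com", run=pdf_run)
```

Frozen dataclasses make the settings safe to share between crawls, and `replace` gives a cheap way to override a single field for one call.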


unclecode and others added 30 commits October 17, 2024 21:37

Update gitignore

dbb587d

Rename some flags name, introducing magic flag.

dd17ed0

Update requirements and switch to 0.3.8

aab6ea0

Fix the model name in quick start example

b309bc3

Update Changelog

e7cd8a1

Refactor content scraping strategy and improve error handling

1dd36f9

Fix Base64 image parsing in WebScrappingStrategy (issue 182)

04d16e6

- Add support for extracting Base64 encoded images
- Improve image format detection to include Base64 images
- Enhance compatibility with locally saved HTML files using Base64 image encoding
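Detecting a Base64-encoded image and its format from an `src` attribute can be sketched as follows. This is an illustrative helper, not the parsing code from this commit: the function name and the regular expression are assumptions, and data URIs with extra parameters before `;base64` are not handled.

```python
import base64
import binascii
import re
from typing import Optional, Tuple

# Simplified pattern for "data:image/<format>;base64,<payload>".
_DATA_URI = re.compile(r"^data:image/([a-zA-Z0-9.+-]+);base64,(.*)$", re.DOTALL)


def parse_base64_image(src: str) -> Optional[Tuple[str, bytes]]:
    """Return (format, decoded bytes) for a Base64 image data URI, else None."""
    match = _DATA_URI.match(src.strip())
    if match is None:
        return None  # a regular URL or an unsupported data URI
    image_format, payload = match.groups()
    try:
        data = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError):
        return None  # payload is not valid Base64
    return image_format.lower(), data
```

A scraper could call this on every image `src` and fall back to treating the value as an ordinary URL whenever it returns `None`, which covers locally saved HTML files that inline their images.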

feat: customize crawl base directory

a5f627b

Merge pull request unclecode#194 from IdrisHanafi/feat/customize-craw…

32f57c4

…l-base-directory Support for custom crawl base directory

Update version

38474bd

Update Documentation

4239654

Merge branch 'main' of https://github.com/unclecode/crawl4ai

ff9149b

Update gitignore

ac9d83c

Merge branch '0.3.72'

d61615e

Update Docs folder, prepare branch for new version 0.3.73

c2a71a5

Update Readme

d913e20

Add badges to README

b2800fe

Fix README badge

d9e0b7a

Update new tutorial documents and added to the docs folder.

3529c2e

Merge branch '0.3.73'

e9f7d5e

fix dev requirements and lock playwright due to failing tests

605a827

Update documents, upload new version of quickstart.

9307c19

Merge branch '0.3.73'

982d203

unclecode and others added 30 commits December 1, 2024 19:17

fix: pass logger to WebScrapingStrategy and update score computation …

95a4f74

…in PruningContentFilter

refactor: improve error handling in DataProcessor and optimize data p…

e9639ad

…arsing logic

docs: update README and blog for version 0.4.0 release, highlighting …

b02544b

…new features and improvements

Updated to version 0.4.0 with new features

486db3a

- Enhanced error handling in async crawler.
- Added flexible options in Markdown generation.
- Updated user agent settings for improved reliability.
- Reflected changes in documentation and examples.

Merge branch 'next'

56f82f3

Merge issues with 0.4.0 is over

a45b8b1

Merge branch 'next'

740214e

fixing Readmen tap (unclecode#313)

e3488da

fix: The extract method logs output only when self.verbose is set to …

ba3e808

…True. (unclecode#314) Co-authored-by: lu4nx <lu4nx@lx-pc>

Fixed typo (unclecode#324)

ded554d

Add PDF & screenshot functionality, new tutorial

5431fa2

- Added support for exporting pages as PDFs
- Enhanced screenshot functionality for long pages
- Created a tutorial on dynamic content loading with 'Load More' buttons.
- Updated web crawler to handle PDF data in responses.

Update async_webcrawler.py (unclecode#337)

7591648

add @asynccontextmanager
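For readers unfamiliar with the decorator, here is a minimal sketch of what `contextlib.asynccontextmanager` provides. The commit does not say where in `async_webcrawler.py` it is applied, so `FakeCrawler` and `open_crawler` are hypothetical stand-ins rather than the project's code.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List


class FakeCrawler:
    """Stand-in resource that records its lifecycle events."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def close(self) -> None:
        self.events.append("close")


@asynccontextmanager
async def open_crawler() -> AsyncIterator[FakeCrawler]:
    crawler = FakeCrawler()
    await crawler.start()
    try:
        yield crawler  # the body of the `async with` block runs here
    finally:
        await crawler.close()  # runs even if the body raises


async def demo() -> List[str]:
    async with open_crawler() as crawler:
        crawler.events.append("crawl")
    return crawler.events
```

Running `asyncio.run(demo())` shows the resource being started before the block and closed after it, without the caller having to remember the cleanup call.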

Bump version to 0.4.2

de1766d

chore: Update .gitignore to include new files and directories

3d69715

Merge branch 'main' of https://github.com/unclecode/crawl4ai

20d6f5f

Add release notes and documentation for version 0.4.2: Configurable C…

4a72c5e

…rawlers, Session Management, and Enhanced Screenshot/PDF features

Merge branch 'next'

399af80

Update README for version 0.4.2: Reflect new features and enhancements

7af1d32

Feature: Add Markdown generation to CrawlerRunConfig

7524aa7

- Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`.
- Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`.
- Updated version number to 0.4.21 in `__version__.py`.

Fix js_snipprt issue 0.4.21

e9e5b56

bump to 0.4.22

Bump version to 0.4.22

ed7bc19

Merge pull request #9 from aravindkarnam/main

7c0fa26

Pulling version 0.4.22 from main into scraper

fix: Added browser config and crawler run config from 0.4.22

7a5f83b
