Merge remote-tracking branch 'upstream/main' into push-to-hf

unclecode · Dec 5, 2024 · 34724d9
2 parents ce11df9 + a45b8b1
commit 34724d9
Showing 18 changed files with 1,147 additions and 335 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,55 @@
 # Changelog
 
+## [0.3.75] December 1, 2024
+
+### PruningContentFilter
+
+#### 1. Introduced PruningContentFilter (Dec 01, 2024)
+A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.
+
+**Affected Files:**
+- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
+```diff
+Implemented effective pruning algorithm with comprehensive scoring.
+```
+- `README.md`: Improved documentation regarding new features.
+```diff
+Updated to include usage and explanation for the PruningContentFilter.
+```
+- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
+```diff
+Added detailed section explaining the PruningContentFilter.
+```
+
+#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024)
+Comprehensive tests added to ensure correct functionality of PruningContentFilter.
+
+**Affected Files:**
+- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
+```diff
+Created test cases for various scenarios using the PruningContentFilter.
+```
+
+### Development Updates
+
+#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024)
+Extended testing to cover additional edge cases and performance metrics.
+
+**Affected Files:**
+- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
+```diff
+Added tests for new extraction scenarios including malformed HTML.
+```
+
+### Infrastructure & Documentation
+
+#### 4. Updated Examples (Dec 01, 2024)
+Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.
+
+**Affected Files:**
+- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
+- Revised example to illustrate usage of PruningContentFilter.
+
 ## [0.3.746] November 29, 2024
 
 ### Major Features
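
As a rough illustration of the pruning idea described in the 0.3.75 changelog entry (dropping nodes by text and link density), the sketch below scores hypothetical nodes and removes those under a fixed threshold. The `Node` class, the 0.6/0.4 weights, and the scoring formula are assumptions made for illustration; they are not taken from `crawl4ai/content_filter_strategy.py`. Only the `0.48` threshold value is borrowed from the README example in this commit.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node (not a crawl4ai type)."""
    tag: str
    text_len: int        # characters of visible text in this subtree
    link_text_len: int   # characters of that text that sit inside links
    markup_len: int      # characters of raw HTML in this subtree
    children: List["Node"] = field(default_factory=list)


def score(node: Node) -> float:
    """Blend text density and inverse link density into a 0..1 score."""
    if node.markup_len == 0 or node.text_len == 0:
        return 0.0
    text_density = min(node.text_len / node.markup_len, 1.0)
    link_density = min(node.link_text_len / node.text_len, 1.0)
    # Assumed weights: favor dense text, penalize link-heavy blocks.
    return 0.6 * text_density + 0.4 * (1.0 - link_density)


def prune(node: Node, threshold: float = 0.48) -> Optional[Node]:
    """Return a copy of the tree without subtrees scoring below threshold."""
    if score(node) < threshold:
        return None
    kept = [c for c in (prune(ch, threshold) for ch in node.children) if c is not None]
    return Node(node.tag, node.text_len, node.link_text_len, node.markup_len, kept)


# Made-up numbers: a link-only nav bar and a text-heavy article.
nav = Node("nav", text_len=120, link_text_len=120, markup_len=600)
article = Node("article", text_len=800, link_text_len=40, markup_len=1000)
body = Node("body", text_len=920, link_text_len=160, markup_len=1700,
            children=[nav, article])

pruned = prune(body)
print([child.tag for child in pruned.children])
```

With these numbers the navigation block falls under the threshold while the article and the enclosing body stay above it. A real filter would derive the lengths from parsed HTML rather than take them as inputs.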

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
@@ -18,6 +18,7 @@ We would like to thank the following people for their contributions to Crawl4AI:
 
 ## Pull Requests
 
+- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
 - [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
 - [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
 - [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)

diff --git a/README.md b/README.md
@@ -11,7 +11,10 @@
 
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  
 
-[✨ Check out latest update v0.3.745](#-recent-updates)
+
+🎉 **Version 0.4.0 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md)
+
+[✨ Check out latest update v0.4.0](#-recent-updates)
 
 ## 🧐 Why Crawl4AI?
 
@@ -422,7 +425,7 @@ You can check the project structure in the directory [https://github.com/uncleco
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.content_filter_strategy import BM25ContentFilter
+from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 
 async def main():
@@ -434,8 +437,11 @@ async def main():
             url="https://docs.micronaut.io/4.7.6/guide/",
             cache_mode=CacheMode.ENABLED,
             markdown_generator=DefaultMarkdownGenerator(
-                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
+                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
             ),
+            # markdown_generator=DefaultMarkdownGenerator(
+            #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
+            # ),
         )
         print(len(result.markdown))
         print(len(result.fit_markdown))
@@ -620,18 +626,21 @@ async def test_news_crawl():
 
 ## ✨ Recent Updates   
 
-- 🚀 **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management.  
-- 📝 **Enhanced Markdown Generation**: New generator class for better formatting and customization.  
-- ⚡ **Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler.  
-- 🛠️ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows.  
-- 👥 **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency.  
+- 🔬 **PruningContentFilter**: New unsupervised filtering strategy for intelligent content extraction based on text density and relevance scoring.
+- 🧵 **Enhanced Thread Safety**: Improved multi-threaded environment handling with better locks and parallel processing support.
+- 🤖 **Smart User-Agent Generation**: Advanced user-agent generator with customization options and randomization capabilities.
+- 📝 **New Blog Launch**: Stay updated with our detailed release notes and technical deep dives at [crawl4ai.com/blog](https://crawl4ai.com/blog).
+- 🧪 **Expanded Test Coverage**: Comprehensive test suite for both PruningContentFilter and BM25ContentFilter with edge case handling.
 
+Read the full details of this release in our [0.4.0 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md).
 
 ## 📖 Documentation & Roadmap 
 
-For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
+
+For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
 
-Moreover to check our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
+To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
 
 <details>
 <summary>📈 <strong>Development TODOs</strong></summary>

diff --git a/crawl4ai/__version__.py b/crawl4ai/__version__.py
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.3.746"
+__version__ = "0.4.0"
diff --git a/crawl4ai/async_crawler_strategy.py b/crawl4ai/async_crawler_strategy.py
@@ -6,6 +6,7 @@
 import os, sys, shutil
 import tempfile, subprocess
 from playwright.async_api import async_playwright, Page, Browser, Error
+from playwright.async_api import TimeoutError as PlaywrightTimeoutError
 from io import BytesIO
 from PIL import Image, ImageDraw, ImageFont
 from pathlib import Path
@@ -16,6 +17,7 @@
 import uuid
 from .models import AsyncCrawlResponse
 from .utils import create_box_message
+from .user_agent_generator import UserAgentGenerator
 from playwright_stealth import StealthConfig, stealth_async
 
 stealth_config = StealthConfig(
@@ -222,14 +224,21 @@ def __init__(self, use_cached_html=False, js_code=None, logger = None, **kwargs)
         self.use_cached_html = use_cached_html
         self.user_agent = kwargs.get(
             "user_agent",
-            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
-            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+            # "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
+            "Mozilla/5.0 (Linux; Android 11; SM-G973F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36"
         )
+        user_agent_generator = UserAgentGenerator()
+        if kwargs.get("user_agent_mode") == "random":
+            self.user_agent = user_agent_generator.generate(
+                 **kwargs.get("user_agent_generator_config", {})
+            )
         self.proxy = kwargs.get("proxy")
         self.proxy_config = kwargs.get("proxy_config")
         self.headless = kwargs.get("headless", True)
         self.browser_type = kwargs.get("browser_type", "chromium")
         self.headers = kwargs.get("headers", {})
+        self.browser_hint = user_agent_generator.generate_client_hints(self.user_agent)
+        self.headers.setdefault("sec-ch-ua", self.browser_hint)
         self.cookies = kwargs.get("cookies", [])
         self.sessions = {}
         self.session_ttl = 1800 
@@ -307,7 +316,9 @@ async def start(self):
 
                     if self.user_agent:
                         await self.default_context.set_extra_http_headers({
-                            "User-Agent": self.user_agent
+                            "User-Agent": self.user_agent,
+                            "sec-ch-ua": self.browser_hint,
+                            # **self.headers
                         })
             else:
                 # Base browser arguments
@@ -321,7 +332,9 @@ async def start(self):
                         "--disable-infobars",
                         "--window-position=0,0",
                         "--ignore-certificate-errors",
-                        "--ignore-certificate-errors-spki-list"
+                        "--ignore-certificate-errors-spki-list",
+                        "--disable-blink-features=AutomationControlled",
+
                     ]
                 }
 
@@ -642,6 +655,15 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
         self._cleanup_expired_sessions()
         session_id = kwargs.get("session_id")
 
+        # A user_agent passed in kwargs overrides the instance default
+        user_agent = kwargs.get("user_agent", self.user_agent)
+
+        # Generate random user agent if magic mode is enabled and user_agent_mode is not random
+        if kwargs.get("user_agent_mode") != "random" and kwargs.get("magic", False):
+            user_agent = UserAgentGenerator().generate(
+                **kwargs.get("user_agent_generator_config", {})
+            )
+
         # Handle page creation differently for managed browser
         context = None
         if self.use_managed_browser:
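
The precedence implied by the hunk above can be summarized in a small standalone function. This is a sketch only: `resolve_user_agent` and `fake_generate` are hypothetical names, and `device_type` is a made-up config key. The real code calls `UserAgentGenerator().generate(...)`, whose options and output are not reproduced here.

```python
from typing import Any, Callable


def resolve_user_agent(default_user_agent: str,
                       generate: Callable[..., str],
                       **kwargs: Any) -> str:
    """Mirror the per-crawl user-agent selection shown in the diff."""
    # An explicit per-call user_agent overrides the instance default.
    user_agent = kwargs.get("user_agent", default_user_agent)
    # In magic mode, unless user_agent_mode is already "random", a fresh
    # user agent is generated and replaces whatever was chosen above.
    if kwargs.get("user_agent_mode") != "random" and kwargs.get("magic", False):
        user_agent = generate(**kwargs.get("user_agent_generator_config", {}))
    return user_agent


def fake_generate(**config: Any) -> str:
    """Hypothetical generator standing in for UserAgentGenerator.generate."""
    return f"generated-{config.get('device_type', 'any')}"


print(resolve_user_agent("default-ua", fake_generate))
print(resolve_user_agent("default-ua", fake_generate, user_agent="custom-ua"))
print(resolve_user_agent("default-ua", fake_generate, user_agent="custom-ua", magic=True))
```

One consequence worth noting: as written in the hunk, `magic=True` replaces even an explicitly supplied `user_agent`, unless `user_agent_mode` is `"random"`.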
@@ -666,7 +688,7 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
                     else:
                         # Normal context creation for non-persistent or non-Chrome browsers
                         context = await self.browser.new_context(
-                            user_agent=self.user_agent,
+                            user_agent=user_agent,
                             viewport={"width": 1200, "height": 800},
                             proxy={"server": self.proxy} if self.proxy else None,
                             java_script_enabled=True,
@@ -686,10 +708,11 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
                 else:
                     # Normal context creation
                     context = await self.browser.new_context(
-                        user_agent=self.user_agent,
+                        user_agent=user_agent,
                         viewport={"width": 1920, "height": 1080},
                         proxy={"server": self.proxy} if self.proxy else None,
                         accept_downloads=self.accept_downloads,
+                        ignore_https_errors=True  # Skip TLS certificate validation for this context
                     )
                     if self.cookies:
                             await context.add_cookies(self.cookies)
@@ -920,8 +943,24 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
                 });
             }
             """
-            await page.wait_for_load_state()
-            await page.evaluate(update_image_dimensions_js)
+
+            try:
+                try:
+                    await page.wait_for_load_state(
+                        # state="load",
+                        state="domcontentloaded",
+                        timeout=5  # Playwright timeouts are in milliseconds
+                    )
+                except PlaywrightTimeoutError:
+                    pass
+                await page.evaluate(update_image_dimensions_js)
+            except Exception as e:
+                self.logger.error(
+                    message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
+                    tag="ERROR",
+                    params={"error": str(e)}
+                )
+                # raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")
 
             # Wait a bit for any onload events to complete
             await page.wait_for_timeout(100)
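
The nested `try` in the hunk above follows a "wait briefly, then carry on regardless" pattern: a timeout while waiting is swallowed, and only failures in the follow-up step are logged. A minimal sketch of the same idea using only `asyncio` is shown below. The `best_effort` helper is a hypothetical name and no Playwright calls are involved. Note that `asyncio.wait_for` takes seconds, whereas Playwright's `timeout` arguments are in milliseconds.

```python
import asyncio


async def best_effort(awaitable, timeout_s: float) -> bool:
    """Await with a deadline; return False instead of raising on timeout."""
    try:
        await asyncio.wait_for(awaitable, timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        # Give up waiting, but let the caller continue with the next step.
        return False


async def demo():
    # The first wait cannot finish within its deadline; the second can.
    slow = await best_effort(asyncio.sleep(1.0), timeout_s=0.01)
    fast = await best_effort(asyncio.sleep(0), timeout_s=1.0)
    return slow, fast


result = asyncio.run(demo())
print(result)
```

Returning a flag rather than raising keeps the caller's control flow linear, which matches how the diff proceeds to `page.evaluate(...)` whether or not the load state was reached in time.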