Skip to content

Commit

Permalink
Merge remote-tracking branch 'upstream/main' into push-to-hf
Browse files Browse the repository at this point in the history
  • Loading branch information
AndreaFrancis committed Dec 5, 2024
2 parents ce11df9 + a45b8b1 commit 34724d9
Show file tree
Hide file tree
Showing 18 changed files with 1,147 additions and 335 deletions.
50 changes: 50 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,55 @@
# Changelog

## [0.3.75] December 1, 2024

### PruningContentFilter

#### 1. Introduced PruningContentFilter (Dec 01, 2024) (Dec 01, 2024)
A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.

**Affected Files:**
- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
```diff
Implemented effective pruning algorithm with comprehensive scoring.
```
- `README.md`: Improved documentation regarding new features.
```diff
Updated to include usage and explanation for the PruningContentFilter.
```
- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
```diff
Added detailed section explaining the PruningContentFilter.
```

#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024) (Dec 01, 2024)
Comprehensive tests added to ensure correct functionality of PruningContentFilter

**Affected Files:**
- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
```diff
Created test cases for various scenarios using the PruningContentFilter.
```

### Development Updates

#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024) (Dec 01, 2024)
Extended testing to cover additional edge cases and performance metrics.

**Affected Files:**
- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
```diff
Added tests for new extraction scenarios including malformed HTML.
```

### Infrastructure & Documentation

#### 4. Updated Examples (Dec 01, 2024) (Dec 01, 2024)
Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.

**Affected Files:**
- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
- Revised example to illustrate usage of PruningContentFilter.

## [0.3.746] November 29, 2024

### Major Features
Expand Down
1 change: 1 addition & 0 deletions CONTRIBUTORS.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ We would like to thank the following people for their contributions to Crawl4AI:

## Pull Requests

- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
Expand Down
29 changes: 19 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,10 @@

Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

[✨ Check out latest update v0.3.745](#-recent-updates)

🎉 **Version 0.4.0 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md)

[✨ Check out latest update v0.4.0](#-recent-updates)

## 🧐 Why Crawl4AI?

Expand Down Expand Up @@ -422,7 +425,7 @@ You can check the project structure in the directory [https://github.com/uncleco
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
Expand All @@ -434,8 +437,11 @@ async def main():
url="https://docs.micronaut.io/4.7.6/guide/",
cache_mode=CacheMode.ENABLED,
markdown_generator=DefaultMarkdownGenerator(
content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
),
# markdown_generator=DefaultMarkdownGenerator(
# content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
# ),
)
print(len(result.markdown))
print(len(result.fit_markdown))
Expand Down Expand Up @@ -620,18 +626,21 @@ async def test_news_crawl():

## ✨ Recent Updates

- 🚀 **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management.
- 📝 **Enhanced Markdown Generation**: New generator class for better formatting and customization.
- **Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler.
- 🛠️ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows.
- 👥 **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency.
- 🔬 **PruningContentFilter**: New unsupervised filtering strategy for intelligent content extraction based on text density and relevance scoring.
- 🧵 **Enhanced Thread Safety**: Improved multi-threaded environment handling with better locks and parallel processing support.
- 🤖 **Smart User-Agent Generation**: Advanced user-agent generator with customization options and randomization capabilities.
- 📝 **New Blog Launch**: Stay updated with our detailed release notes and technical deep dives at [crawl4ai.com/blog](https://crawl4ai.com/blog).
- 🧪 **Expanded Test Coverage**: Comprehensive test suite for both PruningContentFilter and BM25ContentFilter with edge case handling.

Read the full details of this release in our [0.4.0 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.0.md).

## 📖 Documentation & Roadmap

For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).

Moreover to check our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).

<details>
<summary>📈 <strong>Development TODOs</strong></summary>
Expand Down
2 changes: 1 addition & 1 deletion crawl4ai/__version__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
# crawl4ai/_version.py
__version__ = "0.3.746"
__version__ = "0.4.0"
55 changes: 47 additions & 8 deletions crawl4ai/async_crawler_strategy.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import os, sys, shutil
import tempfile, subprocess
from playwright.async_api import async_playwright, Page, Browser, Error
from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from io import BytesIO
from PIL import Image, ImageDraw, ImageFont
from pathlib import Path
Expand All @@ -16,6 +17,7 @@
import uuid
from .models import AsyncCrawlResponse
from .utils import create_box_message
from .user_agent_generator import UserAgentGenerator
from playwright_stealth import StealthConfig, stealth_async

stealth_config = StealthConfig(
Expand Down Expand Up @@ -222,14 +224,21 @@ def __init__(self, use_cached_html=False, js_code=None, logger = None, **kwargs)
self.use_cached_html = use_cached_html
self.user_agent = kwargs.get(
"user_agent",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
"Mozilla/5.0 (Linux; Android 11; SM-G973F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36"
)
user_agenr_generator = UserAgentGenerator()
if kwargs.get("user_agent_mode") == "random":
self.user_agent = user_agenr_generator.generate(
**kwargs.get("user_agent_generator_config", {})
)
self.proxy = kwargs.get("proxy")
self.proxy_config = kwargs.get("proxy_config")
self.headless = kwargs.get("headless", True)
self.browser_type = kwargs.get("browser_type", "chromium")
self.headers = kwargs.get("headers", {})
self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
self.headers.setdefault("sec-ch-ua", self.browser_hint)
self.cookies = kwargs.get("cookies", [])
self.sessions = {}
self.session_ttl = 1800
Expand Down Expand Up @@ -307,7 +316,9 @@ async def start(self):

if self.user_agent:
await self.default_context.set_extra_http_headers({
"User-Agent": self.user_agent
"User-Agent": self.user_agent,
"sec-ch-ua": self.browser_hint,
# **self.headers
})
else:
# Base browser arguments
Expand All @@ -321,7 +332,9 @@ async def start(self):
"--disable-infobars",
"--window-position=0,0",
"--ignore-certificate-errors",
"--ignore-certificate-errors-spki-list"
"--ignore-certificate-errors-spki-list",
"--disable-blink-features=AutomationControlled",

]
}

Expand Down Expand Up @@ -642,6 +655,15 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
self._cleanup_expired_sessions()
session_id = kwargs.get("session_id")

# Check if in kwargs we have user_agent that will override the default user_agent
user_agent = kwargs.get("user_agent", self.user_agent)

# Generate random user agent if magic mode is enabled and user_agent_mode is not random
if kwargs.get("user_agent_mode") != "random" and kwargs.get("magic", False):
user_agent = UserAgentGenerator().generate(
**kwargs.get("user_agent_generator_config", {})
)

# Handle page creation differently for managed browser
context = None
if self.use_managed_browser:
Expand All @@ -666,7 +688,7 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
else:
# Normal context creation for non-persistent or non-Chrome browsers
context = await self.browser.new_context(
user_agent=self.user_agent,
user_agent=user_agent,
viewport={"width": 1200, "height": 800},
proxy={"server": self.proxy} if self.proxy else None,
java_script_enabled=True,
Expand All @@ -686,10 +708,11 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
else:
# Normal context creation
context = await self.browser.new_context(
user_agent=self.user_agent,
user_agent=user_agent,
viewport={"width": 1920, "height": 1080},
proxy={"server": self.proxy} if self.proxy else None,
accept_downloads=self.accept_downloads,
ignore_https_errors=True # Add this line
)
if self.cookies:
await context.add_cookies(self.cookies)
Expand Down Expand Up @@ -920,8 +943,24 @@ async def _crawl_web(self, url: str, **kwargs) -> AsyncCrawlResponse:
});
}
"""
await page.wait_for_load_state()
await page.evaluate(update_image_dimensions_js)

try:
try:
await page.wait_for_load_state(
# state="load",
state="domcontentloaded",
timeout=5
)
except PlaywrightTimeoutError:
pass
await page.evaluate(update_image_dimensions_js)
except Exception as e:
self.logger.error(
message="Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {error}",
tag="ERROR",
params={"error": str(e)}
)
# raise RuntimeError(f"Error updating image dimensions ACS-UPDATE_IMAGE_DIMENSIONS_JS: {str(e)}")

# Wait a bit for any onload events to complete
await page.wait_for_timeout(100)
Expand Down
Loading

0 comments on commit 34724d9

Please sign in to comment.