Aman071106 (Contributor) commented Dec 31, 2025

feat(core): add more file extensions to ignore in HTML link extraction

Description

This PR enhances the HTML link extraction utility in libs/core/langchain_core/utils/html.py by expanding the SUFFIXES_TO_IGNORE list to include additional common binary file extensions:

  • .webp
  • .pdf
  • .docx
  • .xlsx
  • .pptx
  • .pptm

These file types are non-HTML, non-crawlable resources. Ignoring them prevents find_all_links and extract_sub_links from mistakenly treating such binary assets as navigable links. This improves link filtering, reduces unnecessary crawling, and aligns behavior with typical web scraping expectations.
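As a rough illustration of what suffix-based filtering does, here is a minimal sketch. It is not the code in html.py: the names ADDED_SUFFIXES, is_ignored, and keep_crawlable are hypothetical, and the tuple holds only the six extensions added by this PR, not the full SUFFIXES_TO_IGNORE list.

```python
from urllib.parse import urlparse

# Hypothetical stand-in: only the extensions added by this PR.
# The real SUFFIXES_TO_IGNORE in langchain_core has further entries.
ADDED_SUFFIXES = (".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm")


def is_ignored(url: str, suffixes: tuple[str, ...] = ADDED_SUFFIXES) -> bool:
    """Return True when the URL's path ends with an ignored suffix."""
    # Compare against the path only, so query strings and fragments
    # (e.g. "?download=1") do not hide the file extension.
    path = urlparse(url).path.lower()
    return path.endswith(suffixes)


def keep_crawlable(links: list[str]) -> list[str]:
    """Drop links that point at ignored binary assets."""
    return [link for link in links if not is_ignored(link)]


links = [
    "https://example.com/docs/",
    "https://example.com/report.pdf?download=1",
    "https://example.com/slides.PPTX",
    "https://example.com/pdf-guide",
]
print(keep_crawlable(links))
```

With these inputs, only the /docs/ and /pdf-guide links survive; the .pdf and .PPTX links are dropped because the check is case-insensitive and looks only at the path.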

Summary of Changes

  • Updated libs/core/langchain_core/utils/html.py: Added .webp, .pdf, .docx, .xlsx, .pptx, .pptm to SUFFIXES_TO_IGNORE.

Related Issues

N/A

Verification

  • ruff check libs/core/langchain_core/utils/html.py: Passed
  • mypy libs/core/langchain_core/utils/html.py: Passed
  • pytest libs/core/tests/unit_tests/utils/test_html.py: Passed (11 tests)

Aman071106 requested a review from eyurtsev as a code owner December 31, 2025 13:26
github-actions bot added the feature and core labels Dec 31, 2025
codspeed-hq bot commented Dec 31, 2025

CodSpeed Performance Report

Merging this PR will improve performance by 23.98%

Comparing Aman071106:enhancement/ignore-more-suffixes (df8d0e8) with master (50c5bb5) [1]

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 12 improved benchmarks
✅ 1 untouched benchmark
⏩ 21 skipped benchmarks [2]

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| WallTime | test_import_time[LangChainTracer] | 543.6 ms | 449.3 ms | +20.99% |
| WallTime | test_import_time[RunnableLambda] | 616.6 ms | 497.4 ms | +23.98% |
| WallTime | test_import_time[BaseChatModel] | 622.6 ms | 538.8 ms | +15.56% |
| WallTime | test_import_time[HumanMessage] | 299.5 ms | 260.6 ms | +14.91% |
| WallTime | test_import_time[Runnable] | 584.6 ms | 497.1 ms | +17.59% |
| WallTime | test_import_time[InMemoryRateLimiter] | 204.1 ms | 172.9 ms | +18.02% |
| WallTime | test_import_time[tool] | 599.7 ms | 520 ms | +15.32% |
| WallTime | test_import_time[InMemoryVectorStore] | 739.7 ms | 600.3 ms | +23.22% |
| WallTime | test_import_time[CallbackManager] | 539.3 ms | 468.8 ms | +15.02% |
| WallTime | test_import_time[Document] | 217 ms | 184.4 ms | +17.71% |
| WallTime | test_import_time[ChatPromptTemplate] | 766.6 ms | 629.5 ms | +21.78% |
| WallTime | test_import_time[PydanticOutputParser] | 648.4 ms | 540 ms | +20.07% |
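
The Efficiency figures appear consistent with BASE / HEAD - 1, expressed as a percentage. This formula is inferred from the numbers in the report, not taken from CodSpeed documentation, and the helper name efficiency_pct is hypothetical. A quick check against the LangChainTracer row:

```python
def efficiency_pct(base_ms: float, head_ms: float) -> float:
    """Relative speedup of HEAD over BASE, in percent (inferred formula)."""
    return (base_ms / head_ms - 1.0) * 100.0


# LangChainTracer row: BASE 543.6 ms, HEAD 449.3 ms
print(f"{efficiency_pct(543.6, 449.3):.2f}%")  # prints "20.99%"
```

Other rows spot-checked this way agree with the report to within a few hundredths of a percentage point, presumably because the displayed timings are rounded.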

Footnotes

  1. No successful run was found on master (d383f00) during the generation of this report, so 50c5bb5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

  2. 21 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived in CodSpeed to remove them from the performance reports.

Aman071106 force-pushed the enhancement/ignore-more-suffixes branch from 81cf17a to 050806b January 8, 2026 18:30
github-actions bot added and removed the feature label Jan 8, 2026
mdrxy merged commit 2847814 into langchain-ai:master Jan 8, 2026
88 checks passed
Aman071106 deleted the enhancement/ignore-more-suffixes branch January 8, 2026 20:19