feat(core): add more file extensions to ignore in HTML link extraction #34552

Aman071106 · 2025-12-31T13:26:04Z

Description

This PR enhances the HTML link extraction utility in `libs/core/langchain_core/utils/html.py` by expanding the `SUFFIXES_TO_IGNORE` list to include additional common binary file extensions:

.webp
.pdf
.docx
.xlsx
.pptx
.pptm

These file types are non-HTML, non-crawlable resources. Ignoring them prevents find_all_links and extract_sub_links from mistakenly treating such binary assets as navigable links. This improves link filtering, reduces unnecessary crawling, and aligns behavior with typical web scraping expectations.
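
As a rough illustration of the filtering idea described above, here is a minimal, self-contained sketch. It is not the code in `html.py`: the names `IGNORED_SUFFIXES`, `HREF_PATTERN`, and `find_crawlable_links` are hypothetical stand-ins, the tuple holds only the six suffixes named in this PR, and the real utility may match suffixes differently.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-in for the library's ignore list; it contains only
# the six suffixes added in this PR, not the full existing list.
IGNORED_SUFFIXES = (".webp", ".pdf", ".docx", ".xlsx", ".pptx", ".pptm")

# Simplified href matcher (an assumption): captures the attribute value
# up to the closing quote or a '#' fragment marker.
HREF_PATTERN = re.compile(r"""href=["']([^"'#]+)""")


def find_crawlable_links(raw_html: str) -> list[str]:
    """Return unique hrefs whose path does not end in an ignored suffix."""
    links: list[str] = []
    for href in HREF_PATTERN.findall(raw_html):
        # Compare against the URL path only, case-insensitively,
        # so query strings do not hide the file extension.
        path = urlparse(href).path.lower()
        if path.endswith(IGNORED_SUFFIXES):
            continue  # binary asset, not a navigable page
        if href not in links:
            links.append(href)
    return links


html = (
    '<a href="/docs/intro">Intro</a> '
    '<a href="/files/report.pdf">Report</a> '
    '<a href="/img/logo.webp">Logo</a> '
    '<a href="/slides/deck.pptx#slide-2">Deck</a>'
)
print(find_crawlable_links(html))  # ['/docs/intro']
```

Only the HTML page survives the filter; the PDF, WebP image, and PowerPoint deck are dropped before any crawl is attempted.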

Summary of Changes

Updated libs/core/langchain_core/utils/html.py: Added .webp, .pdf, .docx, .xlsx, .pptx, .pptm to SUFFIXES_TO_IGNORE.

Related Issues

N/A

Verification

ruff check libs/core/langchain_core/utils/html.py: Passed
mypy libs/core/langchain_core/utils/html.py: Passed
pytest libs/core/tests/unit_tests/utils/test_html.py: Passed (11 tests)

codspeed-hq · 2025-12-31T13:28:45Z

CodSpeed Performance Report

Merging this PR will improve performance by 23.98%

Comparing `Aman071106:enhancement/ignore-more-suffixes` (df8d0e8) with `master` (50c5bb5)¹

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

⚡ 12 improved benchmarks
✅ 1 untouched benchmark
⏩ 21 skipped benchmarks²

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | `test_import_time[LangChainTracer]` | 543.6 ms | 449.3 ms | +20.99% |
| ⚡ | WallTime | `test_import_time[RunnableLambda]` | 616.6 ms | 497.4 ms | +23.98% |
| ⚡ | WallTime | `test_import_time[BaseChatModel]` | 622.6 ms | 538.8 ms | +15.56% |
| ⚡ | WallTime | `test_import_time[HumanMessage]` | 299.5 ms | 260.6 ms | +14.91% |
| ⚡ | WallTime | `test_import_time[Runnable]` | 584.6 ms | 497.1 ms | +17.59% |
| ⚡ | WallTime | `test_import_time[InMemoryRateLimiter]` | 204.1 ms | 172.9 ms | +18.02% |
| ⚡ | WallTime | `test_import_time[tool]` | 599.7 ms | 520 ms | +15.32% |
| ⚡ | WallTime | `test_import_time[InMemoryVectorStore]` | 739.7 ms | 600.3 ms | +23.22% |
| ⚡ | WallTime | `test_import_time[CallbackManager]` | 539.3 ms | 468.8 ms | +15.02% |
| ⚡ | WallTime | `test_import_time[Document]` | 217 ms | 184.4 ms | +17.71% |
| ⚡ | WallTime | `test_import_time[ChatPromptTemplate]` | 766.6 ms | 629.5 ms | +21.78% |
| ⚡ | WallTime | `test_import_time[PydanticOutputParser]` | 648.4 ms | 540 ms | +20.07% |

¹ No successful run was found on master (d383f00) during the generation of this report, so 50c5bb5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
² 21 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

Aman071106 requested a review from eyurtsev as a code owner December 31, 2025 13:26

github-actions bot added the `feature` and `core` labels Dec 31, 2025

mdrxy requested changes Jan 8, 2026

Review comment on libs/core/tests/unit_tests/utils/test_html.py (outdated, resolved)

Aman071106 added 3 commits January 8, 2026 23:50

updated html link extraction ignore suffixes

fd91f76

fix linting

09b11fd

removed redundant test and added new suffixes to existing test

050806b

Aman071106 force-pushed the enhancement/ignore-more-suffixes branch from 81cf17a to 050806b Compare January 8, 2026 18:30

github-actions bot added and removed the `feature` label Jan 8, 2026

Aman071106 added 2 commits January 9, 2026 00:05

fixing linting

b173b41

added new line

1925f42

github-actions bot added and removed the `feature` label Jan 8, 2026

Merge branch 'master' into enhancement/ignore-more-suffixes

df8d0e8

mdrxy merged commit 2847814 into langchain-ai:master Jan 8, 2026
88 checks passed

Aman071106 deleted the enhancement/ignore-more-suffixes branch January 8, 2026 20:19
