Force strict if URL matching regex. #624

Merged (1 commit), Nov 20, 2024


kris-sigur
Collaborator

Adds a list of regular expressions that URLs being processed by the ConfigurableExtractorJS are evaluated against. If a URL matches any of them, the extraction is performed in strict mode, even if strict mode is not set.

This requires a minor modification to ExtractorJS so that the CrawlURI is passed to the shouldAddUri method that ConfigurableExtractorJS overrides.
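
To illustrate the shape of that change, here is a minimal, self-contained sketch. It does not use the real Heritrix classes: `SourceUri`, `BaseExtractor` and `StrictForJsonExtractor` are hypothetical stand-ins, and treating "strict" as "keep only absolute http(s) URLs" is an assumption made for the example, not something stated in this PR. The point is only that the hook receives the URI being processed, so an overriding class can decide per source document.

```java
import java.util.ArrayList;
import java.util.List;

public class HookSketch {
    /** Hypothetical stand-in for the URI being processed (not Heritrix's CrawlURI). */
    static final class SourceUri {
        final String uri;
        SourceUri(String uri) { this.uri = uri; }
    }

    /** Hypothetical base extractor: hands the source URI to the filter hook. */
    static class BaseExtractor {
        List<String> extract(SourceUri source, List<String> candidates) {
            List<String> accepted = new ArrayList<>();
            for (String candidate : candidates) {
                if (shouldAddUri(source, candidate)) {
                    accepted.add(candidate);
                }
            }
            return accepted;
        }

        /** Default hook: accept every candidate. */
        protected boolean shouldAddUri(SourceUri source, String candidate) {
            return true;
        }
    }

    /** Hypothetical subclass: applies a stricter rule depending on the source URI. */
    static class StrictForJsonExtractor extends BaseExtractor {
        @Override
        protected boolean shouldAddUri(SourceUri source, String candidate) {
            // Assumption: "strict" is approximated as "absolute http(s) URLs only".
            boolean strict = source.uri.contains("/wp-json/");
            boolean absolute = candidate.startsWith("http://") || candidate.startsWith("https://");
            return !strict || absolute;
        }
    }

    public static void main(String[] args) {
        SourceUri json = new SourceUri("https://frettatiminn.is/wp-json/wp/v2/media/52942");
        List<String> candidates = List.of("https://example.org/a.jpg", "a.jpg");
        System.out.println(new StrictForJsonExtractor().extract(json, candidates));
    }
}
```

Without the source URI as a parameter, the overriding method would only see the candidate string and could not vary its behaviour by the document it came from.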

This PR was motivated by common WordPress JSON files such as: https://frettatiminn.is/wp-json/wp/v2/media/52942

They contain many absolute URLs, so excluding these files entirely may well cause missed content. However, they also list filenames and similar strings that cannot be resolved relative to the JSON file itself and should simply be ignored.

Adding a regex of ^.*/wp-json/.*$ to the new setting will address this.
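
As a quick check of how that regex behaves against the example URL above, here is a small sketch using only `java.util.regex`. The class and method names (`StrictRegexSketch`, `forceStrict`) are hypothetical and not part of Heritrix, and whether the real implementation uses full matching or substring search is an assumption; with this fully anchored pattern both give the same result.

```java
import java.util.List;
import java.util.regex.Pattern;

public class StrictRegexSketch {
    /** Returns true if the URL fully matches any of the configured patterns. */
    static boolean forceStrict(List<Pattern> patterns, String url) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The regex suggested in this PR for WordPress JSON endpoints.
        List<Pattern> patterns = List.of(Pattern.compile("^.*/wp-json/.*$"));

        // The WordPress JSON URL from the description: contains /wp-json/.
        System.out.println(forceStrict(patterns,
                "https://frettatiminn.is/wp-json/wp/v2/media/52942"));

        // A made-up script URL without /wp-json/, for contrast.
        System.out.println(forceStrict(patterns,
                "https://frettatiminn.is/wp-content/themes/x/app.js"));
    }
}
```

The first call reports a match and the second does not, so only the JSON endpoint would be forced into strict mode under this setting.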

It may also be worthwhile to be able to apply this rule based on content type.

@ato ato merged commit 13075ec into internetarchive:master Nov 20, 2024
5 checks passed