@tw4l
Member

@tw4l tw4l commented Nov 5, 2025

Fixes #897

  • After all seeds from a seed file have been parsed successfully, store them in Redis and do not re-download the seed file on subsequent runs (e.g. after a crawl is paused, resumed from serialized state, or otherwise restarted)
  • Refactor parseSeeds into its own module to avoid circular imports

Manually tested by pausing crawls in Browsertrix and by resuming interrupted crawls from serialized state YAML files with the crawler, in addition to the tests added.
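
For readers following along, the approach described in this PR can be sketched roughly as below. This is a minimal illustration, not the code from the PR: `SeedStore`, `InMemorySeedStore`, `parseSeedLines`, `loadSeedsOnce`, and the `fetchText` callback are all hypothetical names, and the in-memory store stands in for the Redis-backed crawl state. It assumes a seed file is plain text with one URL per line, and it stores only the URLs.

```typescript
// Hypothetical store interface; in the crawler this role would be played
// by the Redis-backed crawl state (an assumption, not the PR's actual API).
interface SeedStore {
  getSeeds(): Promise<string[]>;
  addSeeds(urls: string[]): Promise<void>;
  isSeedFileDone(): Promise<boolean>;
  markSeedFileDone(): Promise<void>;
}

// In-memory stand-in so the sketch runs without a Redis server.
class InMemorySeedStore implements SeedStore {
  private seeds: string[] = [];
  private done = false;

  async getSeeds(): Promise<string[]> {
    return this.seeds.slice();
  }
  async addSeeds(urls: string[]): Promise<void> {
    this.seeds = this.seeds.concat(urls);
  }
  async isSeedFileDone(): Promise<boolean> {
    return this.done;
  }
  async markSeedFileDone(): Promise<void> {
    this.done = true;
  }
}

// Callback that downloads the seed file body; injected so it can be faked.
type Fetcher = (url: string) => Promise<string>;

// Assumes one URL per line; blank lines are skipped.
function parseSeedLines(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Download and parse the seed file only if no earlier run completed it.
async function loadSeedsOnce(
  seedFileUrl: string,
  store: SeedStore,
  fetchText: Fetcher,
): Promise<string[]> {
  if (await store.isSeedFileDone()) {
    // Restarted or resumed crawl: reuse stored seeds, no network request.
    return store.getSeeds();
  }
  const text = await fetchText(seedFileUrl);
  const urls = parseSeedLines(text);
  await store.addSeeds(urls);
  // Set the flag only after every seed was parsed and stored.
  await store.markSeedFileDone();
  return urls;
}
```

Because the done flag is written last, a run that fails partway through parsing would download the file again on the next attempt, while a run that finished never touches the (possibly expired) seed file URL again.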

@tw4l tw4l force-pushed the issue-897-seedfile-expiration branch 3 times, most recently from dee29ca to 7c409f6 Compare November 6, 2025 17:57
@tw4l tw4l force-pushed the issue-897-seedfile-expiration branch from f970dd8 to 610b774 Compare November 18, 2025 20:16
tw4l added 4 commits November 19, 2025 10:05
Use hacky "any" to avoid circular import; will fix properly in a later commit
Also move parseSeeds to separate module to avoid circular import
Allow crawlState to be undefined in parseSeeds for use in scope tests
@tw4l tw4l force-pushed the issue-897-seedfile-expiration branch 3 times, most recently from a603ec9 to 2899593 Compare November 19, 2025 22:27
The only information in seed files is the URL, so no need to
complicate things by storing more than that.
@tw4l tw4l force-pushed the issue-897-seedfile-expiration branch from ae4321b to df395a7 Compare November 19, 2025 22:43
@tw4l tw4l changed the title from "WIP: Only parse seed files once" to "Only parse seed files once" Nov 19, 2025
@tw4l tw4l changed the title from "Only parse seed files once" to "Only download and parse seed files once" Nov 19, 2025
@tw4l tw4l marked this pull request as ready for review November 19, 2025 23:37
@tw4l tw4l requested a review from ikreymer November 19, 2025 23:37
@tw4l
Member Author

tw4l commented Nov 19, 2025

@ikreymer This is now ready for review!

@tw4l
Member Author

tw4l commented Nov 26, 2025

Closing in favor of #921

@tw4l tw4l closed this Nov 26, 2025
Development

Successfully merging this pull request may close these issues.

[Bug]: Request expired on seedfile

2 participants