Lazy enqueue links #1371

@Pijukatel

Description

Crawlers are used in different scenarios. When a maximum number of requests is defined, there should be a reasonable enqueue strategy that does not "over-enqueue" links.

What is the benefit? Mainly, a more reasonable use of the RequestQueue and the related API calls. (Maybe this could be implemented in a specialized RequestQueue storage client in https://github.com/apify/apify-sdk-python ?)

Example: links passed to enqueue_links could be cached locally and only pushed to the queue when there are not enough links in the queue, or when we run out of memory for the cache.
Use case 1
Let's say we start a crawler with
max_requests_per_crawl = 1000
then enqueue_links should never enqueue more than 1000 - handled_requests. Anything above this number should go into an enqueue cache, which would be used only if the limits somehow changed.
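A minimal sketch of how such a budget-aware enqueue buffer could look. All names here (LazyEnqueueBuffer, push_to_queue, release_cached) are hypothetical and not part of the crawlee API; the real logic would more likely live inside the crawler or a specialized RequestQueue client:

```python
from collections import deque
from typing import Callable


class LazyEnqueueBuffer:
    """Hypothetical helper that parks links exceeding the crawl budget in a local cache."""

    def __init__(self, max_requests_per_crawl: int) -> None:
        self.max_requests_per_crawl = max_requests_per_crawl
        self.handled_requests = 0   # requests already processed
        self.pending_requests = 0   # requests pushed to the queue but not yet handled
        self._cache: deque[str] = deque()

    def remaining_budget(self) -> int:
        # Never push more than max_requests_per_crawl - handled_requests,
        # also accounting for what is already waiting in the queue.
        return max(self.max_requests_per_crawl - self.handled_requests - self.pending_requests, 0)

    def enqueue(self, urls: list[str], push_to_queue: Callable[[str], None]) -> None:
        budget = self.remaining_budget()
        to_push, to_cache = urls[:budget], urls[budget:]
        for url in to_push:
            push_to_queue(url)        # the only place that spends RequestQueue API calls
            self.pending_requests += 1
        self._cache.extend(to_cache)  # kept locally, no API calls spent

    def release_cached(self, push_to_queue: Callable[[str], None]) -> None:
        # Used only if the limits change (e.g. max_requests_per_crawl is raised).
        while self._cache and self.remaining_budget() > 0:
            push_to_queue(self._cache.popleft())
            self.pending_requests += 1
```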

Use case 2
This could also be tied to a callback, even for a crawler without max_requests_per_crawl. For example, it could be based on the number of results in the dataset. Let's say we are looking for 1000 results that will be stored in the dataset. We can use that count the same way we used max_requests_per_crawl in the previous example.
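A possible shape for the callback variant, again purely illustrative: the remaining budget is computed from the current dataset size instead of max_requests_per_crawl, and could be plugged into the same buffering logic as above. DESIRED_RESULTS and get_dataset_item_count are assumptions for the sketch, not existing API.

```python
from typing import Callable

DESIRED_RESULTS = 1000  # assumption: we want 1000 items in the dataset


def make_dataset_budget(get_dataset_item_count: Callable[[], int]) -> Callable[[], int]:
    """Hypothetical factory: returns a callback the crawler could consult before enqueueing."""

    def remaining_budget() -> int:
        # Stop pushing new requests once the dataset already holds enough results;
        # anything above the budget would go to the local enqueue cache instead.
        return max(DESIRED_RESULTS - get_dataset_item_count(), 0)

    return remaining_budget
```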

Labels

enhancement: New feature or request.
t-tooling: Issues with this label are in the ownership of the tooling team.
