Description
Crawlers can be used in different scenarios. In cases where a maximum number of requests is defined, there should be a reasonable enqueue strategy that does not "over-enqueue" links far beyond that limit.
What is the benefit? Mainly, a more reasonable use of the RequestQueue and the related API calls. (Maybe this could be implemented in a specialized RequestQueue storage client in https://github.com/apify/apify-sdk-python ?)
Example: links to enqueue could be cached locally and only actually enqueued when there are not enough links in the queue, or when we run out of memory.
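As a rough illustration of this caching idea, here is a minimal, crawler-agnostic sketch. The queue interface (add_request) and all class and method names are hypothetical placeholders, not an existing crawlee or apify-sdk API.

```python
# A crawler-agnostic sketch of the caching idea. The queue interface (add_request)
# and all names here are hypothetical placeholders, not an existing crawlee API.
from collections import deque


class CappedEnqueuer:
    """Buffers discovered links locally and pushes to the real request queue
    only as many requests as the remaining crawl budget allows."""

    def __init__(self, queue, max_requests: int) -> None:
        self._queue = queue                    # real request queue (hypothetical interface)
        self._max_requests = max_requests      # overall crawl budget
        self._enqueued = 0                     # how many requests were pushed so far
        self._overflow: deque[str] = deque()   # local cache for links above the budget

    def enqueue(self, urls: list[str]) -> None:
        budget = max(self._max_requests - self._enqueued, 0)
        to_push, to_cache = urls[:budget], urls[budget:]
        for url in to_push:
            self._queue.add_request(url)       # hypothetical queue call
        self._enqueued += len(to_push)
        self._overflow.extend(to_cache)        # everything above the budget is cached

    def set_limit(self, new_max: int) -> None:
        """If the limit changes, drain the overflow cache up to the new budget."""
        self._max_requests = new_max
        cached = list(self._overflow)
        self._overflow.clear()
        self.enqueue(cached)
```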
Use case 1
Let's say we start a crawler with max_requests_per_crawl = 1000. Then enqueue_links should never enqueue more than 1000 - handled_requests requests. Anything above this number should go to some enqueue cache, which would be used only if the limits somehow change (for example, if the maximum is raised).
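A minimal sketch of this rule, assuming a hypothetical helper that splits newly found links into the part that fits the remaining budget and the part that goes to the local cache; none of these names exist in the SDK.

```python
# A minimal sketch of the rule for use case 1; the function name and parameters
# are illustrative, not an existing crawlee/apify-sdk API.
def split_links_by_budget(
    found_links: list[str],
    max_requests_per_crawl: int,
    handled_requests_count: int,
) -> tuple[list[str], list[str]]:
    """Return (links to enqueue now, links to keep in the local enqueue cache)."""
    # Never enqueue more than max_requests_per_crawl - handled_requests.
    # (Requests already waiting in the queue could also be subtracted here.)
    budget = max(max_requests_per_crawl - handled_requests_count, 0)
    return found_links[:budget], found_links[budget:]


# Example: a 1000-request limit with 950 requests already handled leaves a budget
# of 50, so out of 120 newly found links only 50 are enqueued and 70 are cached.
links = [f'https://example.com/page/{i}' for i in range(120)]
enqueue_now, cache_for_later = split_links_by_budget(links, 1000, 950)
assert len(enqueue_now) == 50 and len(cache_for_later) == 70
```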
Use case 2
This could also be tied to some callback, even for a crawler without max_requests_per_crawl. For example, the callback could be based on the number of results in the dataset. Let's say we are looking for 1000 results that will be stored in the dataset; we can then use this count the same way we used max_requests_per_crawl in the previous example.
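A sketch of how the same budget could be driven by a callback instead of max_requests_per_crawl, here using the number of items already stored in the dataset. The callback shape, all names, and the assumption of roughly one dataset item per handled request are illustrative only.

```python
# A sketch of the callback variant (use case 2). The crawl budget comes from a
# user-supplied callback instead of max_requests_per_crawl; all names are illustrative.
from typing import Callable


def make_dataset_budget_callback(
    desired_results: int,
    get_dataset_item_count: Callable[[], int],
) -> Callable[[], int]:
    """Build a callback that reports how many more requests are worth enqueueing."""
    def remaining_budget() -> int:
        # Assumes roughly one dataset item per handled request; once the dataset
        # holds the desired number of results, no further links need to be enqueued.
        return max(desired_results - get_dataset_item_count(), 0)
    return remaining_budget


# The enqueue helper would then use this callback exactly like the
# max_requests_per_crawl budget in use case 1:
budget_callback = make_dataset_budget_callback(1000, lambda: 640)
enqueue_now_count = budget_callback()  # -> 360 requests may still be enqueued
```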