
feat: Add keep_alive flag to crawler.__init__ #921

Open · wants to merge 5 commits into master

Conversation

Pijukatel (Contributor)

Description

Add keep_alive flag to crawler.__init__

If True, this flag keeps the crawler alive even when there are no more requests in the queue. The crawler then waits either for more requests to be added or for an explicit stop via crawler.stop().

Add a test.
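
For illustration, a minimal usage sketch of the proposed flag. Only keep_alive and crawler.stop() come from this PR; the import path, handler wiring, and add_requests call are assumptions based on the surrounding crawlee API and may differ from the actual code.

import asyncio

from crawlee.basic_crawler import BasicCrawler, BasicCrawlingContext  # import path is an assumption


async def main() -> None:
    # The new flag: the crawler will not shut down on an empty queue.
    crawler = BasicCrawler(keep_alive=True)

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # With keep_alive=True, run() does not return when the queue drains.
    run_task = asyncio.create_task(crawler.run())

    # Requests can be added while the crawler is already running.
    await crawler.add_requests(['https://example.com'])

    # The crawler must be stopped explicitly.
    await asyncio.sleep(5)
    crawler.stop()
    await run_task


asyncio.run(main())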

Issues

@Pijukatel added the enhancement and t-tooling labels on Jan 20, 2025
@github-actions added this to the 106th sprint - Tooling team milestone on Jan 20, 2025
@github-actions added the tested label on Jan 20, 2025
@Pijukatel Pijukatel marked this pull request as ready for review January 20, 2025 09:22
@Pijukatel Pijukatel requested review from vdusek and janbuchar January 20, 2025 09:26
Comment on lines 1129 to 1136
@pytest.mark.parametrize(
('keep_alive', 'max_requests_per_crawl', 'should_process_added_request'),
[
pytest.param(True, 1, True, id='keep_alive'),
pytest.param(True, 0, False, id='keep_alive, but max_requests_per_crawl achieved'),
pytest.param(False, 1, False, id='Crawler without keep_alive (default)'),
],
)
Collaborator

Could you add a test case with max_requests_per_crawl > 1?

Contributor (Author)

Done
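
(For reference, the added case might look like the following row in the parametrization above; the exact values and id are assumptions, not copied from the diff.)

pytest.param(True, 2, True, id='keep_alive with max_requests_per_crawl > 1'),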

crawler_run_task = asyncio.create_task(crawler.run())

# Give the crawler some time to finish (or settle into the keep_alive state) before adding a new request.
await asyncio.sleep(1)
Collaborator

Isn't 1 second like... a lot?

Contributor (Author)

Well, any time-related test is tricky when there is no event to wait for. How do I make sure the crawler stayed alive because keep_alive=True, and not just because it was randomly slow and took time to shut down?
I could wrap basic_crawler.__is_finished_function in a mock and wait until it is called at least once instead of waiting a fixed time. That would make the test faster, but it would leak implementation details. Do you prefer that, or some other option?
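
For illustration, a sketch of that mock-wrapping idea, assuming __is_finished_function is a private async method on BasicCrawler that takes no arguments (hence the name-mangled attribute below); the real hook may differ.

import asyncio
from unittest.mock import patch


async def wait_until_finished_check_runs(crawler, timeout: float = 5.0) -> None:
    # Access the name-mangled private method; assumes BasicCrawler defines
    # __is_finished_function and that it is awaitable with no arguments.
    original = crawler._BasicCrawler__is_finished_function
    called = asyncio.Event()

    async def wrapper() -> bool:
        result = await original()
        called.set()  # signal that the finished-check ran at least once
        return result

    with patch.object(crawler, '_BasicCrawler__is_finished_function', wrapper):
        # Wait for the signal instead of sleeping for a fixed second.
        await asyncio.wait_for(called.wait(), timeout=timeout)

The test would then await wait_until_finished_check_runs(crawler) in place of the fixed await asyncio.sleep(1).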

Development

Successfully merging this pull request may close these issues: Add a keep_alive flag to BasicCrawler