
feat: Add keep_alive flag to crawler.__init__ #921

Open · wants to merge 5 commits into master

Conversation

Pijukatel (Contributor)

Description

Add keep_alive flag to crawler.__init__

If True, this flag keeps the crawler alive even when there are no more requests in the queue. The crawler then waits either for more requests to be added or for an explicit stop via crawler.stop().

Add a test.
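
For illustration, a minimal usage sketch of the proposed flag. Only keep_alive and crawler.stop() come from this PR; the import path, handler wiring, and add_requests call are assumptions based on the surrounding crawlee API and may differ from the actual code.

import asyncio

from crawlee.basic_crawler import BasicCrawler, BasicCrawlingContext  # import path is an assumption


async def main() -> None:
    # The new flag: the crawler will not shut down on an empty queue.
    crawler = BasicCrawler(keep_alive=True)

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    # With keep_alive=True, run() does not return when the queue drains.
    run_task = asyncio.create_task(crawler.run())

    # Requests can be added while the crawler is already running.
    await crawler.add_requests(['https://example.com'])

    # The crawler must be stopped explicitly.
    await asyncio.sleep(5)
    crawler.stop()
    await run_task


asyncio.run(main())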

Issues

@Pijukatel added the enhancement and t-tooling labels on Jan 20, 2025
@github-actions added this to the 106th sprint - Tooling team milestone on Jan 20, 2025
@github-actions added the tested label on Jan 20, 2025
@Pijukatel Pijukatel marked this pull request as ready for review January 20, 2025 09:22
@Pijukatel Pijukatel requested review from vdusek and janbuchar January 20, 2025 09:26
Comment on lines 1129 to 1136
@pytest.mark.parametrize(
('keep_alive', 'max_requests_per_crawl', 'should_process_added_request'),
[
pytest.param(True, 1, True, id='keep_alive'),
pytest.param(True, 0, False, id='keep_alive, but max_requests_per_crawl achieved'),
pytest.param(False, 1, False, id='Crawler without keep_alive (default)'),
],
)
Collaborator

Could you add a test case with max_requests_per_crawl > 1?

Contributor (Author)

Done
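
(For reference, the added case might look like the following row in the parametrization above; the exact values and id are assumptions, not copied from the diff.)

pytest.param(True, 2, True, id='keep_alive with max_requests_per_crawl > 1'),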

crawler_run_task = asyncio.create_task(crawler.run())

# Give the crawler some time to finish (or settle into the keep_alive state) before adding a new request.
await asyncio.sleep(1)
Collaborator

Isn't 1 second like... a lot?

Contributor (Author)

Well, any time-related test is tricky when there is no event to wait for. How do I make sure the crawler stayed alive because keep_alive=True, and not just because it was randomly slow and took time to shut down?
I could wrap basic_crawler.__is_finished_function in a mock and wait until it is called at least once instead of waiting a fixed time. That would make the test faster, but it would leak implementation details. Do you prefer that, or some other option?
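
For illustration, a sketch of that mock-wrapping idea, assuming __is_finished_function is a private async method on BasicCrawler that takes no arguments (hence the name-mangled attribute below); the real hook may differ.

import asyncio
from unittest.mock import patch


async def wait_until_finished_check_runs(crawler, timeout: float = 5.0) -> None:
    # Access the name-mangled private method; assumes BasicCrawler defines
    # __is_finished_function and that it is awaitable with no arguments.
    original = crawler._BasicCrawler__is_finished_function
    called = asyncio.Event()

    async def wrapper() -> bool:
        result = await original()
        called.set()  # signal that the finished-check ran at least once
        return result

    with patch.object(crawler, '_BasicCrawler__is_finished_function', wrapper):
        # Wait for the signal instead of sleeping for a fixed second.
        await asyncio.wait_for(called.wait(), timeout=timeout)

The test would then await wait_until_finished_check_runs(crawler) in place of the fixed await asyncio.sleep(1).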

Development

Successfully merging this pull request may close these issues: Add a keep_alive flag to BasicCrawler