Commit b6353a9

docs: Improve docs of Apify storage clients and export SQL storage client (#639)
- Export `SqlStorageClient` from Crawlee.
- Improve docs of `ApifyStorageClient` and `SmartApifyStorageClient`.
1 parent 3ee7896 commit b6353a9

3 files changed: +60 -21 lines changed

src/apify/storage_clients/_apify/_storage_client.py

Lines changed: 40 additions & 13 deletions
@@ -21,23 +21,50 @@
 
 @docs_group('Storage clients')
 class ApifyStorageClient(StorageClient):
-    """Apify storage client."""
+    """Apify platform implementation of the storage client.
+
+    This storage client provides access to datasets, key-value stores, and request queues that persist data
+    to the Apify platform. Each storage type is implemented with its own specific Apify client that stores data
+    in the cloud, making it accessible from anywhere.
+
+    The communication with the Apify platform is handled via the Apify API client for Python, which is an HTTP API
+    wrapper. For maximum efficiency and performance of the storage clients, various caching mechanisms are used to
+    minimize the number of API calls made to the Apify platform. Data can be inspected and manipulated through
+    the Apify console web interface or via the Apify API.
+
+    The request queue client supports two access modes controlled by the `request_queue_access` parameter:
+
+    ### Single mode
+
+    The `single` mode is optimized for scenarios with only one consumer. It minimizes API calls, making it faster
+    and more cost-efficient compared to the `shared` mode. This option is ideal when a single Actor is responsible
+    for consuming the entire request queue. Using multiple consumers simultaneously may lead to inconsistencies
+    or unexpected behavior.
+
+    In this mode, multiple producers can safely add new requests, but forefront requests may not be processed
+    immediately, as the client relies on local head estimation instead of frequent forefront fetching. Requests can
+    also be added or marked as handled by other clients, but they must not be deleted or modified, since such changes
+    would not be reflected in the local cache. If a request is already fully cached locally, marking it as handled
+    by another client will be ignored by this client. This does not cause errors but can occasionally result in
+    reprocessing a request that was already handled elsewhere. If the request was not yet cached locally, marking
+    it as handled poses no issue.
+
+    ### Shared mode
+
+    The `shared` mode is designed for scenarios with multiple concurrent consumers. It ensures proper synchronization
+    and consistency across clients, at the cost of higher API usage and slightly worse performance. This mode is safe
+    for concurrent access from multiple processes, including Actors running in parallel on the Apify platform. It
+    should be used when multiple consumers need to process requests from the same queue simultaneously.
+    """
 
     def __init__(self, *, request_queue_access: Literal['single', 'shared'] = 'single') -> None:
-        """Initialize the Apify storage client.
+        """Initialize a new instance.
 
         Args:
-            request_queue_access: Controls the implementation of the request queue client based on expected scenario:
-                - 'single' is suitable for single consumer scenarios. It makes less API calls, is cheaper and faster.
-                - 'shared' is suitable for multiple consumers scenarios at the cost of higher API usage.
-                Detailed constraints for the 'single' access type:
-                - Only one client is consuming the request queue at the time.
-                - Multiple producers can put requests to the queue, but their forefront requests are not guaranteed to
-                  be handled so quickly as this client does not aggressively fetch the forefront and relies on local
-                  head estimation.
-                - Requests are only added to the queue, never deleted by other clients. (Marking as handled is ok.)
-                - Other producers can add new requests, but not modify existing ones.
-                  (Modifications would not be included in local cache)
+            request_queue_access: Defines how the request queue client behaves. Use `single` mode for a single
+                consumer. It has fewer API calls, meaning better performance and lower costs. If you need multiple
+                concurrent consumers use `shared` mode, but expect worse performance and higher costs due to
+                the additional overhead.
         """
         self._request_queue_access = request_queue_access

src/apify/storage_clients/_smart_apify/_storage_client.py

Lines changed: 16 additions & 8 deletions
@@ -19,10 +19,18 @@
 
 @docs_group('Storage clients')
 class SmartApifyStorageClient(StorageClient):
-    """SmartApifyStorageClient that delegates to cloud_storage_client or local_storage_client.
+    """Storage client that automatically selects cloud or local storage client based on the environment.
 
-    When running on Apify platform use cloud_storage_client, else use local_storage_client. This storage client is
-    designed to work specifically in Actor context.
+    This storage client provides access to datasets, key-value stores, and request queues by intelligently
+    delegating to either the cloud or local storage client based on the execution environment and configuration.
+
+    When running on the Apify platform (which is detected via environment variables), this client automatically
+    uses the `cloud_storage_client` to store storage data there. When running locally, it uses the
+    `local_storage_client` to store storage data there. You can also force cloud storage usage from your
+    local machine by using the `force_cloud` argument.
+
+    This storage client is designed to work specifically in `Actor` context and provides a seamless development
+    experience where the same code works both locally and on the Apify platform without any changes.
     """
 
     def __init__(
@@ -31,13 +39,13 @@ def __init__(
         cloud_storage_client: ApifyStorageClient | None = None,
         local_storage_client: StorageClient | None = None,
     ) -> None:
-        """Initialize the Apify storage client.
+        """Initialize a new instance.
 
         Args:
-            cloud_storage_client: Client used to communicate with the Apify platform storage. Either through
-                `force_cloud` argument when opening storages or automatically when running on the Apify platform.
-            local_storage_client: Client used to communicate with the storage when not running on the Apify
-                platform and not using `force_cloud` argument when opening storages.
+            cloud_storage_client: Storage client used when an Actor is running on the Apify platform, or when
+                explicitly enabled via the `force_cloud` argument. Defaults to `ApifyStorageClient`.
+            local_storage_client: Storage client used when an Actor is not running on the Apify platform and when
+                `force_cloud` flag is not set. Defaults to `FileSystemStorageClient`.
         """
         self._cloud_storage_client = cloud_storage_client or ApifyStorageClient(request_queue_access='single')
         self._local_storage_client = local_storage_client or ApifyFileSystemStorageClient()

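The selection rule described in the `SmartApifyStorageClient` docstring (cloud client on the platform or when `force_cloud` is set, local client otherwise) can be sketched as a small pure function. This is an illustration only, not the SDK's code: `select_storage_client` is a hypothetical helper, the environment variable name `APIFY_IS_AT_HOME` is an assumption (the diff only says the platform is detected via environment variables), and the string return values stand in for the real client objects.

```python
from collections.abc import Mapping


def select_storage_client(env: Mapping[str, str], *, force_cloud: bool = False) -> str:
    """Return 'cloud' or 'local' following the rule from the docstring.

    `env` is passed in explicitly (instead of reading os.environ) so the
    behavior is easy to test. The variable name below is an assumption.
    """
    is_at_home = env.get('APIFY_IS_AT_HOME', '').lower() in {'1', 'true'}
    return 'cloud' if (is_at_home or force_cloud) else 'local'


on_platform = select_storage_client({'APIFY_IS_AT_HOME': '1'})
on_laptop = select_storage_client({})
forced = select_storage_client({}, force_cloud=True)
```

Here `on_platform` and `forced` both resolve to the cloud client, while `on_laptop` resolves to the local one, which matches the behavior the docstring describes for code that runs unchanged in both environments.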
website/docusaurus.config.js

Lines changed: 4 additions & 0 deletions
@@ -239,6 +239,10 @@ module.exports = {
                 url: 'https://crawlee.dev/python/api/class/FileSystemStorageClient',
                 group: 'Storage clients',
             },
+            {
+                url: 'https://crawlee.dev/python/api/class/SqlStorageClient',
+                group: 'Storage clients',
+            },
             // Request loaders
             {
                 url: 'https://crawlee.dev/python/api/class/RequestLoader',
