You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
While it is possible to run the same application on multiple nodes and filter out events manually (for example, on a per-partition basis), this has the following disadvantages:
All instances of the application will query all streams from the current generation. This is unnecessary because workers could agree to divide the polling work and each of them could poll a disjoint subset of streams from the current generation.
When working in a model where each worker polls only a subset of the streams, it is important for all workers to synchronize when the current generation is changing - i.e. workers should switch to polling streams from the new generation only after all workers read all the data from the old generation.
This is necessary because events about some partition P will be reported in stream S1 of the old generation and then in stream S2 of the new generation, and it might happen that S2 will be handled on a different worker than S1 was - if worker for S2 is quicker than S1, then some events for partition P may be processed out of order.
Please note that currently using multiple workers and filtering on a per-partition basis doesn't have this problem, although it requires every worker to poll every stream, which can be wasteful.
The scylla-cdc-java library already supports a model which divides streams across multiple workers and synchronizes them properly - something similar could be implemented here as well.
The text was updated successfully, but these errors were encountered:
While it is possible to run the same application on multiple nodes and filter out events manually (for example, on a per-partition basis), this has the following disadvantages:
All instances of the application will query all streams from the current generation. This is unnecessary because workers could agree to divide the polling work and each of them could poll a disjoint subset of streams from the current generation.
When working in a model where each worker polls only a subset of the streams, it is important for all workers to synchronize when the current generation is changing - i.e. workers should switch to polling streams from the new generation only after all workers read all the data from the old generation.
This is necessary because events about some partition P will be reported in stream S1 of the old generation and then in stream S2 of the new generation, and it might happen that S2 will be handled on a different worker than S1 was - if worker for S2 is quicker than S1, then some events for partition P may be processed out of order.
Please note that currently using multiple workers and filtering on a per-partition basis doesn't have this problem, although it requires every worker to poll every stream, which can be wasteful.
The scylla-cdc-java library already supports a model which divides streams across multiple workers and synchronizes them properly - something similar could be implemented here as well.
The text was updated successfully, but these errors were encountered: