Support distributing work across multiple machines #4

Open
piodul opened this issue Apr 14, 2021 · 0 comments
Labels
enhancement New feature or request

Comments

@piodul
Collaborator

piodul commented Apr 14, 2021

While it is possible to run the same application on multiple nodes and filter out events manually (for example, on a per-partition basis), this has the following disadvantages:

  • All instances of the application will query all streams from the current generation. This is unnecessary because workers could agree to divide the polling work and each of them could poll a disjoint subset of streams from the current generation.

  • When working in a model where each worker polls only a subset of the streams, all workers must synchronize when the current generation changes - i.e. workers should switch to polling streams from the new generation only after all workers have read all the data from the old generation.

    This is necessary because events for a given partition P will be reported in stream S1 of the old generation and then in stream S2 of the new generation, and S2 may be handled by a different worker than S1 was. If the worker handling S2 runs ahead of the worker handling S1, then some events for partition P may be processed out of order.

    Please note that the current approach of running multiple workers and filtering on a per-partition basis doesn't have this problem, because all events for a given partition are processed by the same worker. However, it requires every worker to poll every stream, which can be wasteful.
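
To illustrate the first point, here is a minimal sketch of how workers could agree on disjoint subsets of streams without talking to each other. It is only an illustration, not an existing API of this library: the function name `assign_streams`, the round-robin scheme, and the assumption that every worker knows the total worker count and sees the same list of stream IDs are all hypothetical.

```python
from typing import Dict, List, Sequence


def assign_streams(stream_ids: Sequence[bytes], worker_count: int) -> Dict[int, List[bytes]]:
    """Split the stream IDs of one generation into disjoint per-worker subsets.

    Every worker that calls this with the same inputs computes the same
    assignment, so the split itself needs no extra coordination.
    (Hypothetical helper, not part of any existing library.)
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    # Canonical order with duplicates dropped, so input order does not matter.
    ordered = sorted(set(stream_ids))
    # Round-robin over the sorted list: worker i gets items i, i+n, i+2n, ...
    return {i: ordered[i::worker_count] for i in range(worker_count)}


if __name__ == "__main__":
    # Made-up 16-byte stream IDs, used only for this demonstration.
    streams = [bytes([n]) * 16 for n in range(10)]
    for worker, subset in assign_streams(streams, 3).items():
        print(f"worker {worker} polls {len(subset)} streams")
```

Each worker would then poll only the subset stored under its own index. A split like this only removes the redundant polling; it does not by itself address the ordering problem at generation changes described in the second point.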

The scylla-cdc-java library already supports a model which divides streams across multiple workers and synchronizes them properly - something similar could be implemented here as well.
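
The synchronization requirement described above (no worker starts on the new generation until every worker has finished the old one) could be sketched roughly as follows. This is not how scylla-cdc-java implements it, and it is not a proposed API for this library: the class `GenerationBarrier` and its methods are hypothetical, and the state is kept in memory here, whereas a real multi-machine setup would have to keep it in some shared store.

```python
from typing import Dict, Hashable, Iterable, Set


class GenerationBarrier:
    """Tracks which workers have finished reading a given generation.

    Hypothetical in-memory stand-in for shared coordination state.
    """

    def __init__(self, worker_ids: Iterable[str]) -> None:
        self._workers = frozenset(worker_ids)
        # generation identifier -> set of workers that finished it
        self._finished: Dict[Hashable, Set[str]] = {}

    def mark_finished(self, worker_id: str, generation: Hashable) -> None:
        """Record that a worker has read all data from `generation`."""
        if worker_id not in self._workers:
            raise KeyError(f"unknown worker: {worker_id}")
        self._finished.setdefault(generation, set()).add(worker_id)

    def may_advance(self, generation: Hashable) -> bool:
        """True only once every worker has finished `generation`."""
        return self._finished.get(generation, set()) == self._workers


if __name__ == "__main__":
    barrier = GenerationBarrier(["w1", "w2"])
    barrier.mark_finished("w1", "gen-old")
    print(barrier.may_advance("gen-old"))  # w2 is still reading the old generation
    barrier.mark_finished("w2", "gen-old")
    print(barrier.may_advance("gen-old"))  # all workers are done
```

A worker would call `mark_finished` after draining its streams from the old generation and would start polling the new generation only once `may_advance` returns true for the old one.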

dkropachev added the enhancement label on Aug 17, 2024