Support distributing work across multiple machines #4

piodul · 2021-04-14T11:36:16Z

While it is possible to run the same application on multiple nodes and filter out events manually (for example, on a per-partition basis), this has the following disadvantages:

All instances of the application will query all streams from the current generation. This is unnecessary because workers could agree to divide the polling work and each of them could poll a disjoint subset of streams from the current generation.
When working in a model where each worker polls only a subset of the streams, it is important for all workers to synchronize when the current generation is changing - i.e. workers should switch to polling streams from the new generation only after all workers read all the data from the old generation.

This is necessary because events about some partition P will be reported in stream S1 of the old generation and then in stream S2 of the new generation, and it might happen that S2 will be handled on a different worker than S1 was - if worker for S2 is quicker than S1, then some events for partition P may be processed out of order.

Please note that currently using multiple workers and filtering on a per-partition basis doesn't have this problem, although it requires every worker to poll every stream, which can be wasteful.

The scylla-cdc-java library already supports a model which divides streams across multiple workers and synchronizes them properly - something similar could be implemented here as well.

dkropachev added the enhancement New feature or request label Aug 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support distributing work across multiple machines #4

Support distributing work across multiple machines #4

piodul commented Apr 14, 2021

Support distributing work across multiple machines #4

Support distributing work across multiple machines #4

Comments

piodul commented Apr 14, 2021