Side Input improvements#38363
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enables the use of side inputs within timer-related callbacks, specifically @ontimer and @OnWindowExpiration. It addresses limitations in the current runner implementations to ensure that side inputs are correctly fetched and cached when these callbacks are triggered. Additionally, it includes robust fixes for streaming side input handling, ensuring that blocked elements and timers are properly managed and persisted across bundle boundaries. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces support for side inputs within @ontimer and @OnWindowExpiration methods in DoFn. The implementation refactors SimpleParDoFn to handle streaming side inputs directly via a StreamingSideInputFetcher, replacing the previous StreamingSideInputDoFnRunner wrapper. It also updates the Beam SDK to permit side input parameters in these callbacks and provides necessary implementations for the Dataflow worker and Fn API runner. Feedback includes a high-priority suggestion to nullify the sideInputFetcher in finishBundle to prevent memory leaks, an optimization to move redundant state cleanup registration out of a loop, the removal of a side-effect-free line in StreamingSideInputFetcher, and a recommendation to replace long Thread.sleep calls in tests with more deterministic approaches like TestStream.
| if (sideInputFetcher != null) { | ||
| sideInputFetcher.persist(); | ||
| } |
There was a problem hiding this comment.
The sideInputFetcher field should be cleared at the end of the bundle. Since SimpleParDoFn instances are typically reused across bundles in the Dataflow worker, failing to null out this field can lead to stale state being carried over to subsequent bundles or potential memory leaks. Clearing it here ensures that each bundle starts with a clean state.
| if (sideInputFetcher != null) { | |
| sideInputFetcher.persist(); | |
| } | |
| if (sideInputFetcher != null) { | |
| sideInputFetcher.persist(); | |
| sideInputFetcher = null; | |
| } |
| boolean hasState = fnSignature != null && !fnSignature.stateDeclarations().isEmpty(); | ||
| if (hasState) { | ||
| // These elements are now processed. Register cleanup timers for all the unblocked | ||
| // windows. | ||
| registerStateCleanup( | ||
| (WindowingStrategy<?, W>) getDoFnInfo().getWindowingStrategy(), | ||
| (Collection<W>) readyWindows); | ||
| } |
There was a problem hiding this comment.
The call to registerStateCleanup is located inside the loop that iterates over elementsBags. Since it is passed the entire readyWindows collection, it will redundantly register cleanup timers for all unblocked windows in every iteration of the loop. This should be moved outside the loop to improve performance, especially when multiple windows are unblocked simultaneously.
| Set<Windmill.GlobalDataRequest> blocked = checkIfBlocked(window); | ||
| blockedMap().get(window); |
There was a problem hiding this comment.
The call to blockedMap().get(window) on line 213 is a side-effect-free operation whose result is ignored. It appears to be a leftover from refactoring and should be removed to keep the code clean.
| Set<Windmill.GlobalDataRequest> blocked = checkIfBlocked(window); | |
| blockedMap().get(window); | |
| Set<Windmill.GlobalDataRequest> blocked = checkIfBlocked(window); |
| new DoFn<KV<String, String>, String>() { | ||
| @ProcessElement | ||
| public void process(OutputReceiver<String> o) throws InterruptedException { | ||
| Thread.sleep(java.time.Duration.ofSeconds(15).toMillis()); |
There was a problem hiding this comment.
Using Thread.sleep(15000) in a test is highly discouraged as it significantly increases the execution time of the test suite and can lead to flakiness. Consider using a much shorter duration or a more deterministic approach using TestStream to simulate the side input availability delay if the test environment allows.
|
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #38363 +/- ##
=========================================
Coverage ? 72.59%
Complexity ? 3629
=========================================
Files ? 335
Lines ? 30806
Branches ? 3750
=========================================
Hits ? 22365
Misses ? 6736
Partials ? 1705
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request enables side input access within OnTimer and OnWindowExpiration contexts for the Beam Java SDK and Dataflow runner. It refactors streaming side input logic into a StreamingSideInputProcessor class, integrated into SimpleParDoFn and StreamingSideInputDoFnRunner. Feedback identifies a missing progress tracker notification for unblocked elements, a potential memory risk from eager list accumulation in the processor, and opportunities to optimize list allocations and test execution times.
| for (WindowedValue<InputT> unblockedElement : unblockedElements) { | ||
| fnRunner.processElement(unblockedElement); |
There was a problem hiding this comment.
The outputsPerElementTracker is not notified when processing unblocked elements in startBundle. This can lead to incorrect progress tracking and metrics in Dataflow, as the tracker expects onProcessElement() to be called before processElement(). You should wrap the processElement call with the appropriate tracker notifications.
for (WindowedValue<InputT> unblockedElement : unblockedElements) {
outputsPerElementTracker.onProcessElement();
fnRunner.processElement(unblockedElement);
outputsPerElementTracker.onProcessElementSuccess();| List<WindowedValue<InputT>> releaseElements = Lists.newArrayList(); | ||
| for (BagState<WindowedValue<InputT>> elementsBag : elementsBags) { | ||
| Iterable<WindowedValue<InputT>> elements = elementsBag.read(); | ||
| for (WindowedValue<InputT> elem : elements) { | ||
| releaseElements.add(elem); | ||
| } | ||
| elementsBag.clear(); | ||
| } |
There was a problem hiding this comment.
Flattening all unblocked elements from all ready windows into a single ArrayList poses a significant memory risk and could lead to OutOfMemoryError if a large number of elements were buffered. The previous implementation in StreamingSideInputDoFnRunner processed elements bag-by-bag, which is much more memory-efficient. Consider refactoring this to process elements lazily or bag-by-bag to avoid accumulating everything in memory at once.
| windowsProcessed = Lists.newArrayList(); | ||
| Iterable<WindowedValue<InputT>> elementsToProcess = | ||
| sideInputProcessor.handleProcessElement(elem); | ||
| for (WindowedValue<InputT> toProcess : elementsToProcess) { | ||
| fnRunner.processElement(toProcess); | ||
| windowsProcessed.addAll((Collection<W>) toProcess.getWindows()); |
There was a problem hiding this comment.
The windowsProcessed list is allocated and populated even if hasState is false, in which case it is never used. Deferring the allocation and population of this list until it's known that state cleanup is required would be more efficient.
| windowsProcessed = Lists.newArrayList(); | |
| Iterable<WindowedValue<InputT>> elementsToProcess = | |
| sideInputProcessor.handleProcessElement(elem); | |
| for (WindowedValue<InputT> toProcess : elementsToProcess) { | |
| fnRunner.processElement(toProcess); | |
| windowsProcessed.addAll((Collection<W>) toProcess.getWindows()); | |
| windowsProcessed = hasState ? Lists.newArrayList() : ImmutableList.of(); | |
| Iterable<WindowedValue<InputT>> elementsToProcess = | |
| sideInputProcessor.handleProcessElement(elem); | |
| for (WindowedValue<InputT> toProcess : elementsToProcess) { | |
| fnRunner.processElement(toProcess); | |
| if (hasState) { | |
| windowsProcessed.addAll((Collection<W>) toProcess.getWindows()); | |
| } |
| new DoFn<KV<String, String>, String>() { | ||
| @ProcessElement | ||
| public void process(OutputReceiver<String> o) throws InterruptedException { | ||
| Thread.sleep(java.time.Duration.ofSeconds(15).toMillis()); |
There was a problem hiding this comment.
Using Thread.sleep(15000) makes the test suite significantly slower. For simulating a delayed side input in these tests, a shorter duration (e.g., 5 seconds) is usually sufficient and helps maintain a faster CI pipeline.
| Thread.sleep(java.time.Duration.ofSeconds(15).toMillis()); | |
| Thread.sleep(java.time.Duration.ofSeconds(5).toMillis()); |
|
R: @ahmedabu98 |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for accessing side inputs within OnTimer and OnWindowExpiration methods in Beam's DoFn. It includes the necessary API changes in the SDK, implementation updates in the Fn API runner and Dataflow runner, and a new helper class StreamingSideInputProcessor to manage side input blocking and unblocking. The reviewer raised concerns regarding the use of a strict state assertion in the new processor and the use of Thread.sleep in the new test case, which could impact test stability and performance.
| // be set via element processing (or another timer) in the same window, the window should be | ||
| // unblocked once | ||
| // we get here. | ||
| Preconditions.checkState(!sideInputFetcher.storeIfBlocked(timer)); |
There was a problem hiding this comment.
The use of Preconditions.checkState(!sideInputFetcher.storeIfBlocked(timer)) assumes that side inputs are always ready when a timer fires. While this is generally true for user timers (which are set after an element is processed), it relies on the runner's guarantee that cleanup timers are only set after unblocking. If this assumption is ever violated, it will lead to an IllegalStateException rather than buffering the timer. Consider if it's safer to handle the blocked case by returning a boolean and letting the caller decide, or if this strict assertion is the intended behavior for the Dataflow runner.
| new DoFn<KV<String, String>, String>() { | ||
| @ProcessElement | ||
| public void process(OutputReceiver<String> o) throws InterruptedException { | ||
| Thread.sleep(java.time.Duration.ofSeconds(15).toMillis()); |
There was a problem hiding this comment.
Using Thread.sleep with a duration of 15 seconds in a test can lead to slow build times and potential flakiness in CI environments. While it's used here to simulate side input delay, consider if there's a more deterministic way to test this behavior, perhaps by using TestStream with multiple bundles or a custom PCollectionView that allows controlling readiness without blocking the thread.
Allow side inputs to be accessed in OnTimer and OnWindowExpiration.
Fix latent bug in Dataflow runner. Previously the cleanup timer was set when an element was seen, even if that element was buffered. This could cause the cleanup timer to fire early, while those elements are still buffered. The new code doesn't set a cleanup timer for a window until an element is actually processed for that window (i.e. the window is unblocked).