Oneshot channel deadlock on async runtime #21916

Draft: wants to merge 3 commits into main
@nameexhaustion (Collaborator) commented on Mar 25, 2025

Description

Using tokio's oneshot channel on our async runtime can lead to an intermittent deadlock. Tested on macOS 14.6.1 / Apple M3 Pro.

As a workaround, we can currently use the connector instead.
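
For context, the shape in question looks roughly like the sketch below. This is a minimal reconstruction from the dbg! locations listed under the reproduction steps, not the actual polars-stream code, and tokio::spawn stands in for async_executor::spawn only to keep the sketch self-contained:

```rust
use tokio::sync::oneshot;

// Hedged sketch only: in polars-stream the sender side runs on a task
// spawned with async_executor::spawn, not tokio::spawn.
async fn oneshot_shape() {
    let (tx, rx) = oneshot::channel::<u32>();

    tokio::spawn(async move {
        // dbg! fires after this send in the instrumented branch.
        let _ = tx.send(42);
    });

    // dbg! fires before this await in the instrumented branch; on our
    // runtime this await intermittently never completes even though the
    // value was sent.
    let v = rx.await.unwrap();
    // dbg! fires after the receive in the instrumented branch.
    assert_eq!(v, 42);
}
```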

Steps to reproduce

  • Check out this branch and build it with make build-release
    • Confirm that there are 3 dbg! lines in this branch:
      • Before the oneshot receiver is awaited: crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:505:17
      • After a value is sent from the sender: crates/polars-stream/src/nodes/io_sources/parquet_reader/mod.rs:166:13
        • Note that this send happens from a task spawned with async_executor::spawn
      • After a value is received from the oneshot receiver: crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:509:17
  • Run the following script:
```python
import os

os.environ["POLARS_MAX_THREADS"] = "2"
os.environ["POLARS_FORCE_NEW_STREAMING"] = "1"

import polars as pl
import io

write = pl.DataFrame.write_parquet

ref = io.BytesIO()
write(pl.Series("c1", [i % 7 for i in range(13 * 7)]).to_frame(), ref)
ref.seek(0)

fs = [io.BytesIO() for _ in range(13)]
for f in fs:
    write(pl.Series("c1", range(7)).to_frame(), f)
    f.seek(0)

scan = pl.scan_parquet

q = scan(fs).slice(99, 89)

while True:
    q.collect()
```
The script runs an infinite loop; however, the output should eventually stop and look something like this:

```
[crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:504:17] scan_source_idx = 1
[crates/polars-stream/src/nodes/io_sources/parquet_reader/mod.rs:166:13] scan_source_idx = 1
[crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:508:17] scan_source_idx = 1
[crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:504:17] scan_source_idx = 2
[crates/polars-stream/src/nodes/io_sources/parquet_reader/mod.rs:166:13] scan_source_idx = 2
[crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:508:17] scan_source_idx = 2
[crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:504:17] scan_source_idx = 3
[crates/polars-stream/src/nodes/io_sources/parquet_reader/mod.rs:166:13] scan_source_idx = 3
await timed out crates/polars-stream/src/nodes/io_sources/multi_file_reader/reader_pipelines/generic.rs:506
```

It should show that for the latest scan source (3 in the above case), only 2 of the 3 dbg! lines are printed: the one from before the receiver is awaited and the one from after a value is sent. But somehow the await on the receiver never completes, even though a value was sent.

This branch also uses an updated TracedAwait: if we set POLARS_AWAIT_TIMEOUT_REPOLL=1, the output from the script resumes after it prints repolling in 2 seconds. It is as though we have missed a task wakeup notification.
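
As an illustration of why a forced repoll can unstick the script, here is a hypothetical sketch of a repoll-on-timeout wrapper. The name RepollOnTimeout and all details are assumptions for illustration, not the actual TracedAwait implementation: if the inner future stays pending past the timeout, a watchdog wakes the task by hand, compensating for a lost wakeup notification.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;

// Hypothetical sketch, not the real TracedAwait. A real implementation
// would gate this on POLARS_AWAIT_TIMEOUT_REPOLL and debounce the
// watchdog instead of spawning a thread on every pending poll.
struct RepollOnTimeout<F> {
    inner: F,
    timeout: Duration,
}

impl<F: Future> Future for RepollOnTimeout<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        // SAFETY: `inner` is never moved out of the pinned struct.
        let this = unsafe { self.get_unchecked_mut() };
        match unsafe { Pin::new_unchecked(&mut this.inner) }.poll(cx) {
            Poll::Ready(v) => Poll::Ready(v),
            Poll::Pending => {
                eprintln!("repolling in {:?}", this.timeout);
                let waker = cx.waker().clone();
                let timeout = this.timeout;
                std::thread::spawn(move || {
                    std::thread::sleep(timeout);
                    // Manually wake the task so the runtime re-polls it
                    // even if the original notification was lost.
                    waker.wake();
                });
                Poll::Pending
            }
        }
    }
}
```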

Notes

  • This occurs intermittently
  • It does not occur when running on the tokio async runtime
  • It only occurs with a release build (make build-release)
  • It does not occur with POLARS_MAX_THREADS=1

Waker oddness

Tokio's oneshot channel has a mechanism by which the receiver avoids cloning the context waker if it determines, via will_wake, that it already has a waker from a previous poll that would wake the same task. If we patch this to always replace the waker instead, the deadlock does not occur: tokio-rs/tokio@master...nameexhaustion:tokio:tokio-1.43-will-wake.
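
To make that concrete, here is a paraphrased sketch of the caching pattern (an assumed shape, not tokio's actual source): the receiver reuses its stored waker whenever will_wake reports it would wake the same task, and the linked patch effectively takes the unconditional-replace path on every poll.

```rust
use std::task::{Context, Waker};

// Paraphrased sketch of the waker-caching pattern, not tokio's code.
struct RecvState {
    waker: Option<Waker>,
}

impl RecvState {
    // Current behaviour: keep the cached waker when will_wake says it
    // would wake the same task. If that judgement is ever wrong, or the
    // cached waker has gone stale, a wakeup can be lost.
    fn register(&mut self, cx: &Context<'_>) {
        match &self.waker {
            Some(w) if w.will_wake(cx.waker()) => {}
            _ => self.waker = Some(cx.waker().clone()),
        }
    }

    // Patched behaviour: unconditionally store the latest waker.
    fn register_always(&mut self, cx: &Context<'_>) {
        self.waker = Some(cx.waker().clone());
    }
}
```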
