seberg
Contributor

@seberg seberg commented Sep 3, 2025

This avoids one unnecessary copy, and also avoids `get_physical_array()` in general, which appears to be a pattern that is typically not ideal (and specifically not with streaming).

@seberg seberg requested a review from RAMitchell September 3, 2025 17:22
This avoids one unnecessary copy, but also avoids `get_physical_array()`
in general, which appears to be a pattern that is not ideal typically
(and specifically with streaming).

Signed-off-by: Sebastian Berg <[email protected]>
@seberg seberg force-pushed the parquet-bind-rowgroups branch from db5ea0b to 569cc47 Compare September 3, 2025 19:20
@seberg
Contributor Author

seberg commented Sep 4, 2025

@reazulhoque did this work out? It would be nice to confirm it's right, although, as long as we trust the trick of keeping the shared pointer alive in the closure, I think this is an improvement either way.

Contributor

@RAMitchell RAMitchell left a comment

The code for creating the row_group_ranges logical array is a bit odd, but it's fine. Maybe you can make the vector backing the external allocation const or something to make it safer.

.write_accessor<legate::Rect<1>, 1, false>()
.ptr(0);
std::copy(row_group_ranges.begin(), row_group_ranges.end(), ptr);
auto alloc = legate::ExternalAllocation::create_sysmem(
Contributor

This is a bit weird. Normally, wouldn't you write to a store inside the task? I think it's even OK now to write to a broadcast store.

Also, if I did row_group_ranges->push_back() in a later part of the code, the vector could resize and delete the allocation used in ExternalAllocation.

Contributor Author

Good point, will at least try to make it const.

> Normally wouldn't you write to a store inside the task

Yeah, I think this should be refactored to a small task (ideally one worker per file maybe). This is more of a hot-fix to help the streaming, and for those tasks, it would be nice to use unbound results, but that would just break the streaming again right now...
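
For reference, the const keep-alive idea from this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: `ExternalBuffer`, `Range`, and `wrap_ranges` are hypothetical names made up here and are not the legate API; `ExternalBuffer` merely stands in for an external allocation that takes a pointer plus a deleter callback. The point is that the vector is moved into a `shared_ptr<const std::vector<...>>`, so nobody can `push_back()` and reallocate it, and the deleter closure holds the last reference.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Illustrative 1-D inclusive range; a stand-in for legate::Rect<1>.
struct Range {
  long lo;
  long hi;
};

// Hypothetical stand-in for an external allocation (NOT the legate API):
// a read-only pointer, a byte count, and a callback run on release.
class ExternalBuffer {
 public:
  ExternalBuffer(const void* ptr, std::size_t bytes, std::function<void()> deleter)
      : ptr_(ptr), bytes_(bytes), deleter_(std::move(deleter)) {}
  ExternalBuffer(const ExternalBuffer&) = delete;
  ExternalBuffer& operator=(const ExternalBuffer&) = delete;
  ~ExternalBuffer() {
    if (deleter_) deleter_();
  }

  const void* ptr() const { return ptr_; }
  std::size_t bytes() const { return bytes_; }

 private:
  const void* ptr_;
  std::size_t bytes_;
  std::function<void()> deleter_;
};

// Move the ranges into a shared, immutable vector and capture that owner in
// the deleter closure, so the storage lives exactly as long as the buffer.
std::unique_ptr<ExternalBuffer> wrap_ranges(std::vector<Range> ranges) {
  auto owner = std::make_shared<const std::vector<Range>>(std::move(ranges));
  // owner->push_back(...) would not compile: the vector is const, so it
  // cannot reallocate and invalidate the pointer handed out below.
  return std::make_unique<ExternalBuffer>(
      owner->data(), owner->size() * sizeof(Range),
      [owner]() mutable { owner.reset(); });  // drops the last reference
}
```

With this shape the caller no longer holds a mutable handle to the vector at all, which is a slightly stronger guarantee than marking a local variable const.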

@reazulhoque

@seberg unfortunately it seems like to make this work we'll need to force the submit of previous tasks as well.

@seberg
Contributor Author

seberg commented Sep 5, 2025

Superseded by other PR.

@seberg seberg closed this Sep 5, 2025
@seberg seberg deleted the parquet-bind-rowgroups branch September 5, 2025 15:00

3 participants