Simplify the iterator adaptive splitting strategy #857
cuviper wants to merge 1 commit into rayon-rs:main
Conversation
Here are my benchmark results with a 5% threshold. It's rather mixed at extremes, so I'm not sure how to evaluate this...

cc @wagnerf42
hi, i'd like to take some time looking at the benches. off the top of my head here are raw comments:

if you can give me a bit more time (1 week ?) i will take a closer look.
Yes, I agree that needs will vary, but I hope this change makes a better default.
Yeah, it's hard. Even with the larger benchmarks (
I'm not in a rush, so I'm happy to let you dig further -- thanks! |
hi, so here is one example i had in mind:

```rust
let r = (0..*size)
    .into_par_iter()
    .map(|e| (0..e).map(|x| x % 2).sum::<u64>())
    .sum::<u64>();
```

so, why that? if everything is nicely balanced then i don't see how the proposed modification could be bad. in this example, when the parallel range gets divided in two, the left part contains 1/4 of the total work and the right part 3/4. for performance the scheduler will need to divide regularly throughout the execution (log(n) times).

i wonder about the following: if you need to keep generating tasks on one branch during the execution, then i think at one point it should be possible that only one single stealable task remains with the new mechanism. what does it mean? well, it is still enough to distribute work to all threads, but now all stealers must try around p times unsuccessfully before achieving a successful steal.

i did a run of this code on a 32 cores machine using 64 threads and the new code was slower (40%) there (sizes 100k to 300k). i'd still need to take a look inside the run to see what is really going on. i'll also try some more benches next week.
Thank you for your investigation! The slowdown is surprising and disappointing, but I hope you'll be able to get a complete picture of why that is. We should also encode that knowledge in source comments at least, because this is all currently very opaque. |
To add another data point to the discussion, I have an example at GeoStat-Framework/GSTools-Core#6 where this makes the difference between a slow down (serial versus parallel) and a speed-up, because without it around 500 large temporary arrays are needed for a
Before, when an iterator job was stolen, we would reset the split count all the way back to `current_num_threads` to adaptively split jobs more aggressively when threads seem to need more work. This ends up splitting much farther than many people expect, especially in the tail end of a computation when threads are fighting over what's left. Excess splitting can also be harmful for things like `fold` or `map_with` that want to share state as much as possible. We can get a much lazier "adaptive" effect by just not updating the split count when we split a stolen job, effectively giving it only _one_ extra boost of splitting.
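The strategy change can be illustrated with a minimal sketch. The names here (`Splitter`, `try_split`, `NUM_THREADS`) are illustrative stand-ins, not rayon's exact internals: each job carries a budget of remaining splits, halved on every ordinary split. The old rule reset the budget to the thread count whenever the job was stolen; the new rule splits a stolen job once without touching the budget.

```rust
const NUM_THREADS: usize = 8; // stand-in for current_num_threads()

#[derive(Clone, Copy)]
struct Splitter {
    splits: usize, // remaining split budget for this job
}

impl Splitter {
    /// Old behavior: a steal resets the budget, so splitting starts over.
    fn try_split_old(&mut self, stolen: bool) -> bool {
        if stolen {
            self.splits = NUM_THREADS;
        }
        if self.splits > 0 {
            self.splits /= 2;
            true
        } else {
            false
        }
    }

    /// New behavior: a stolen job splits once, but its budget is left
    /// untouched -- the steal itself is the one extra "boost".
    fn try_split_new(&mut self, stolen: bool) -> bool {
        if stolen {
            true
        } else if self.splits > 0 {
            self.splits /= 2;
            true
        } else {
            false
        }
    }
}

fn main() {
    // Suppose a job has nearly exhausted its budget (1 split left)
    // and is then stolen.
    let mut old = Splitter { splits: 1 };
    let mut new = Splitter { splits: 1 };
    old.try_split_old(true);
    new.try_split_new(true);
    // Count how many further (non-stolen) splits each rule still allows.
    let old_more = (0..20).filter(|_| old.try_split_old(false)).count();
    let new_more = (0..20).filter(|_| new.try_split_new(false)).count();
    println!("after a steal: old allows {old_more} more splits, new allows {new_more}");
}
```

After the steal, the old rule hands out three more splits from a freshly reset budget of 8, while the new rule allows only the single split that was already left, which is the lazier tail-end behavior the description above is after.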