Overlapping collectives because of overwriting timestamps #181

Open
theodorbadea opened this issue Feb 7, 2025 · 0 comments

Describe the Bug

I have noticed that some overlapping collectives occur because the timestamp of the GPU operation is overwritten with the timestamp of the runtime operation. The line in question is in trace_linker.py, in the find_parent_cpu_op method:
kineto_gpu_op.timestamp = kineto_runtime_op.timestamp

Is this the intended behavior?
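
To illustrate the concern, here is a minimal sketch of how such an overwrite can turn serial kernels into overlapping ones. It assumes that the runtime op's timestamp is the time of the CPU-side launch call and that the kernels actually run back to back on the device. The `Op` class, the `link_with_overwrite` helper, and all numbers are hypothetical; they only mimic the assignment quoted above and are not the Chakra implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Op:
    name: str
    timestamp: float  # start time, microseconds (made-up values)
    duration: float   # microseconds

    @property
    def end(self) -> float:
        return self.timestamp + self.duration


def link_with_overwrite(gpu_op: Op, runtime_op: Op) -> Op:
    # Hypothetical stand-in for:
    #   kineto_gpu_op.timestamp = kineto_runtime_op.timestamp
    # The duration is kept, only the start time is replaced.
    return replace(gpu_op, timestamp=runtime_op.timestamp)


# Two kernels that execute back to back on the device...
gpu_ops = [Op("kernel_a", 100.0, 50.0), Op("kernel_b", 150.0, 30.0)]
# ...whose launch calls were issued earlier, in quick succession, on the CPU.
runtime_ops = [Op("launch_a", 90.0, 2.0), Op("launch_b", 95.0, 2.0)]

linked = [link_with_overwrite(g, r) for g, r in zip(gpu_ops, runtime_ops)]

# An op overlaps its predecessor if it starts before the predecessor ends.
device_overlap = gpu_ops[1].timestamp < gpu_ops[0].end
linked_overlap = linked[1].timestamp < linked[0].end
print(device_overlap, linked_overlap)
```

In this toy case the device timeline has no overlap, while the linked timeline does, because the second kernel inherits a launch time that falls inside the first kernel's (unchanged) duration.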

Below is a dump of the relevant entries from the device (Kineto) trace and from the resulting linked trace. The linked trace shows not only an overlap but also start and end timestamps that no longer match the device trace:
Kineto:
start:23:25:28.743436 duration:8932.93 end:23:25:28.752369 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.755013 duration:31272.55 end:23:25:28.786285 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.786286 duration:22644.126 end:23:25:28.808930 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.809750 duration:42481.079 end:23:25:28.852232 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.871565 duration:44283.024 end:23:25:28.915848 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
Linked:
start:23:25:28.743424 duration:8932.93 end:23:25:28.752357 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.754414 duration:31272.55 end:23:25:28.785686 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.759648 duration:22644.126 end:23:25:28.782292 overlapped:True ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.807910 duration:42481.079 end:23:25:28.850391 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.858260 duration:44283.024 end:23:25:28.902543 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
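
The overlap flag and the timestamp mismatch in the two dumps above can be re-derived with a short check. This is only a sketch: `flag_overlaps` is a hypothetical helper, not the script that produced the dumps. Start times are written as microsecond offsets within second 23:25:28, and durations are assumed to be in microseconds, which is consistent with the printed start and end values.

```python
def flag_overlaps(ops):
    """ops: list of (start_us, duration_us) tuples.

    Returns one bool per op, in start order: True if the op starts
    before the latest end time among all earlier ops.
    """
    flags = []
    latest_end = float("-inf")
    for start, duration in sorted(ops):
        flags.append(start < latest_end)
        latest_end = max(latest_end, start + duration)
    return flags


# Values copied from the AllReduce entries in the dumps above.
durations = [8932.93, 31272.55, 22644.126, 42481.079, 44283.024]
kineto_starts = [743436, 755013, 786286, 809750, 871565]
linked_starts = [743424, 754414, 759648, 807910, 858260]

kineto_flags = flag_overlaps(list(zip(kineto_starts, durations)))
linked_flags = flag_overlaps(list(zip(linked_starts, durations)))

# How far each linked start was moved earlier relative to the device trace.
shifts = [k - l for k, l in zip(kineto_starts, linked_starts)]

print(kineto_flags)  # no overlaps in the device trace
print(linked_flags)  # third op overlaps the second one
print(shifts)        # per-op shift in microseconds
```

The shifts are not constant (from 12 µs up to 26638 µs), so the linked timeline is not just a uniformly offset copy of the device timeline; the third op is moved far enough to land inside the second one.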

Steps to Reproduce

Create a Chakra trace and convert it to JSON for easier inspection.
Version: v0.0.4 (main branch)

Expected Behavior

We fixed the suspected issue on our side and reprocessed the same host and device traces as above. No overlaps occurred, and the linked timestamps match the device timestamps:
Kineto:
start:23:25:28.675025 duration:47.807 end:23:25:28.675073 overlapped:False ncclDevKernel_Broadcast_RING_LL
start:23:25:28.676516 duration:5.024 end:23:25:28.676521 overlapped:False ncclDevKernel_Broadcast_RING_LL
start:23:25:28.743436 duration:8932.93 end:23:25:28.752369 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.755013 duration:31272.55 end:23:25:28.786285 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.786286 duration:22644.126 end:23:25:28.808930 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.809750 duration:42481.079 end:23:25:28.852232 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.871565 duration:44283.024 end:23:25:28.915848 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
Linked:
start:23:25:28.743436 duration:8932.93 end:23:25:28.752369 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.755013 duration:31272.55 end:23:25:28.786285 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.786286 duration:22644.126 end:23:25:28.808930 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.809750 duration:42481.079 end:23:25:28.852232 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL
start:23:25:28.871565 duration:44283.024 end:23:25:28.915848 overlapped:False ncclDevKernel_AllReduce_Sum_f32_RING_LL

Screenshots

N/A
