Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CL_DOCA_UROM #978

Merged
merged 18 commits into from
Jul 1, 2024
Merged

CL_DOCA_UROM #978

merged 18 commits into from
Jul 1, 2024

Conversation

nsarka
Copy link
Collaborator

@nsarka nsarka commented May 22, 2024

This PR adds CL_DOCA_UROM, responsible for reading coll args and building a UROM command to send to the DPU for offload.

The PR will be 1/n PRs for getting the complete code in. This first one is just the CL boilerplate code and empty functions.

@janjust

@janjust
Copy link
Collaborator

janjust commented May 22, 2024

@Sergei-Lebedev @samnordmann
We're opening up the first of probably 2-3 additional PRs.

The first one is a bit larger; however, we trid to strip away as much as possible.
This is literally just a template code. If you were to copy/paste any other CL.
Please take a look. I'm already going to approve it because Nick and I went over this code multiple times. We also did a code review with @wfaderhold21

@janjust
Copy link
Collaborator

janjust commented May 24, 2024

bot:retest

@janjust
Copy link
Collaborator

janjust commented May 28, 2024

@B-a-S can you issue a re-test command? I think this is a clear CI issue

[2024-05-24T17:30:56.579Z] + docker pull harbor.mellanox.com/torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312
[2024-05-24T17:30:56.579Z] Error response from daemon: unknown: artifact torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312 not found
[Pipeline] echo
[2024-05-24T17:30:56.601Z] matrix_job INFO: Image NOT found - harbor.mellanox.com/torch-ucc/ucc/1.0.0/x86_64/centos8/cuda12.1.1:3312 - will build .ci/Dockerfile.ngc_pytorch ...

I can't re-run tests

config/m4/doca_urom.m4 Outdated Show resolved Hide resolved
config/m4/doca_urom_ucc.m4 Outdated Show resolved Hide resolved
config/m4/doca_urom_ucc.m4 Outdated Show resolved Hide resolved
config/m4/doca_urom_ucc.m4 Outdated Show resolved Hide resolved
configure.ac Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_lib.c Outdated Show resolved Hide resolved
src/components/cl/ucc_cl.c Show resolved Hide resolved
src/components/tl/ucp/tl_ucp_context.c Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_team.c Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_lib.c Outdated Show resolved Hide resolved
@nsarka nsarka force-pushed the nsarka/doca-cl branch 6 times, most recently from 98f69c2 to 7841e58 Compare May 31, 2024 20:50
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I only left some minor comments

src/components/tl/ucp/tl_ucp_context.c Outdated Show resolved Hide resolved
.github/workflows/codestyle.yaml Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom.h Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom.h Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom.h Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom.h Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_coll.h Outdated Show resolved Hide resolved
@nsarka nsarka force-pushed the nsarka/doca-cl branch 4 times, most recently from ee5a31c to 5d0397e Compare June 4, 2024 20:08
@nsarka
Copy link
Collaborator Author

nsarka commented Jun 4, 2024

Hey Sam, I updated the PR with all of your comments. I only replied to one though to ask your opinion

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments!

@nsarka nsarka force-pushed the nsarka/doca-cl branch 4 times, most recently from 34e8717 to c552e41 Compare June 11, 2024 13:14
@janjust
Copy link
Collaborator

janjust commented Jun 25, 2024

@wfaderhold21 Please review
@Sergei-Lebedev this is ready now

@wfaderhold21
Copy link
Collaborator

@janjust I have started my review a few weeks ago now, but still have no answers to my questions. Waiting for @nsarka

@janjust
Copy link
Collaborator

janjust commented Jun 25, 2024

@wfaderhold21 I'm sorry we missed it - can you point me to the questions? I looked over the code and I'm not seeing it - what are we missing?

@wfaderhold21
Copy link
Collaborator

@wfaderhold21 I'm sorry we missed it - can you point me to the questions? I looked over the code and I'm not seeing it - what are we missing?

@janjust I think I'm just not good at github... I assume you can see them now.

@nsarka
Copy link
Collaborator Author

nsarka commented Jun 26, 2024

The failing CI test is in the IMB code:

/tmp/ompi/install/bin/mpicc -g -O0 -Wall -Werror -DCHECK=1 -Ihelpers -I../src_c -DMPI1 -I. -DMPI1 -c -o MPI1/CPU/IMB_scatter.o ../src_c/IMB_scatter.c
../src_c/IMB_chk_diff.c: In function ‘IMB_chk_distr’:
../src_c/IMB_chk_diff.c:1065:13: error: this ‘if’ clause does not guard... [-Werror=misleading-indentation]
 1065 |             if (ranks[i + 1] == rank && pos2 - pos1 + 1 == (lengths[i] + (lengths[i + 1])))
      |             ^~
../src_c/IMB_chk_diff.c:1066:41: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the ‘if’
 1066 |                 c_info->reccnt[rank]++; i++;

@nsarka nsarka force-pushed the nsarka/doca-cl branch 3 times, most recently from a94e9c1 to 8f611b5 Compare June 26, 2024 21:12
config/m4/doca_urom.m4 Outdated Show resolved Hide resolved
config/m4/doca_urom.m4 Outdated Show resolved Hide resolved
contrib/Makefile.am Outdated Show resolved Hide resolved
Makefile.am Show resolved Hide resolved
contrib/Makefile.am Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_team.c Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_team.c Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_team.c Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_team.c Outdated Show resolved Hide resolved
src/components/cl/doca_urom/cl_doca_urom_team.c Outdated Show resolved Hide resolved
@Sergei-Lebedev Sergei-Lebedev merged commit 37adf47 into openucx:master Jul 1, 2024
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants