
github-action: add a MPI4PY sanity check #12217

Merged: 2 commits merged into open-mpi:main from the add_mpi4py_gaction branch on Jan 31, 2024

Conversation

hppritcha (Member)

It seems like mpi4py finds a problem almost every time we advance the openpmix/prrte SHAs, so catch it early here.

related to #12195

@hppritcha (Member, Author)

@dalcinl, if you have suggestions here, particularly about improving test coverage, they would be much appreciated.

@dalcinl (Contributor) commented Jan 5, 2024

@hppritcha I would suggest using the master branch for now, at least until I release mpi4py 4.0.0 near the end of January. Otherwise, using 3.1.5 will miss all of MPI-4. I'm not even sure whether 3.1.5 runs cleanly against ompi@main. I understand you are worried about being disturbed by bugs introduced in mpi4py development, but that should be very rare, as mpi4py itself is heavily tested in various CI environments, OSs, and MPI implementations. PRs are not merged unless all tests pass, with 100% code coverage, etc. Therefore, I insist you use master for now and for a few weeks. Afterwards, once I release 4.0.0, you can switch to the maintenance branch I use to cut patch releases. Nonetheless, I would argue that testing mpi4py@master is still the best. For example, at some point you guys will start adding the large-count MPI-4 routines, and mpi4py is ready to contribute preliminary testing for all of that, although mpi4py may still need commits in master to enable the new APIs one by one as they are added.

@dalcinl (Contributor) commented Jan 5, 2024

Of course, using mpi4py from the master or maintenance branch would require checking out mpi4py from its git repository, as done here: https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml

If you really want to check out the 3.1.5 tag, then you can just use:

 - name: Checkout mpi4py
   uses: actions/checkout@v4
   with:
     repository: "mpi4py/mpi4py"
     ref: "3.1.5"
     path: mpi4py.git
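
For the master branch instead, a minimal sketch of the equivalent step (the explicit ref: master is an assumption for clarity; omitting ref would also check out the repository's default branch):

 - name: Checkout mpi4py
   uses: actions/checkout@v4
   with:
     repository: "mpi4py/mpi4py"
     ref: "master"
     path: mpi4py.git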

@hppritcha (Member, Author)

main.py doesn't exist on the 3.1.5 tag.

@hppritcha (Member, Author)

Okay, we will go with the master branch for now.

@hppritcha (Member, Author)

Note: the mpi4py failure is a good thing, as it's catching the problem reported in #12195.

Comment on lines 29 to 40
- name: Build MPI4py
  run: |
    echo "PYTHONPATH=${GITHUB_WORKSPACE}/mpi4py/" >> ${GITHUB_ENV}
    export PATH=${GITHUB_WORKSPACE}/install/bin:$PATH
    cd mpi4py.git
    python3 setup.py build --mpicc="mpicc -shared"
    python3 setup.py install --user
- name: Run MPI4py
  run: |
    export PATH=${GITHUB_WORKSPACE}/install/bin:$PATH
    cd mpi4py.git/test
    mpirun --map-by :OVERSUBSCRIBE -np 3 python3 ./main.py -v -f
Contributor

Suggested change:

- name: Extend PATH
  run: echo ${GITHUB_WORKSPACE}/install/bin >> $GITHUB_PATH
- name: Build MPI4py
  run: python3 -m pip install -vvv mpi4py.git
- name: Run MPI4py
  run: mpirun --map-by :OVERSUBSCRIBE -np 3 python3 mpi4py.git/test/main.py -v -f

Comment on lines 11 to 15
- name: Install dependencies
  run: |
    sudo apt update
    sudo apt install -y --no-install-recommends wget lsb-core software-properties-common curl python3-pip
    pip3 install cython
@dalcinl (Contributor) commented Jan 8, 2024

Suggested change:

- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: 3

Comment on lines 16 to 18
- uses: actions/checkout@v3
  with:
    submodules: recursive
Contributor

Suggested change:

- uses: actions/checkout@v4
  with:
    submodules: recursive

@dalcinl (Contributor) commented Jan 8, 2024

IMHO, this workflow is overly complicated and does not cover as much testing (singleton init, more MPI processes) as the one mpi4py uses: https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml.
I left a bunch of suggestions to simplify things, but I still do not like it.

I would recommend using the workflow linked above as a template and tweaking things as needed, maybe removing everything related to mpi4py.futures, as it can be occasionally flaky in CI environments. Just my two cents.
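
For illustration, a rough sketch of the kind of extra coverage being described, i.e. a singleton run plus runs with more processes; the step names, process counts, and working-directory layout are assumptions loosely modeled on the linked workflow, not anything from this PR:

# assumes an earlier step has already put the Open MPI install bin directory on PATH
- name: Test mpi4py (singleton)
  run: python3 test/main.py -v -f
  working-directory: mpi4py.git
- name: Test mpi4py (np=2)
  run: mpirun -np 2 python3 test/main.py -v -f
  working-directory: mpi4py.git
- name: Test mpi4py (np=3, oversubscribed)
  run: mpirun --map-by :OVERSUBSCRIBE -np 3 python3 test/main.py -v -f
  working-directory: mpi4py.git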

@hppritcha (Member, Author)

@dalcinl check now.

@dalcinl (Contributor) commented Jan 13, 2024

@hppritcha I'm wondering whether it would make sense to add an np=4 run. That case is not strictly needed for mpi4py code coverage, which is why I didn't originally have it. However, from the Open MPI point of view, np=4 may trigger different ompi code paths, like power-of-two optimizations. Note, however, that np=4 may add about 3 extra minutes to the workflow run.
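
A sketch of what such an extra run might look like, simply mirroring the existing np=3 step (the step name is illustrative):

- name: Run MPI4py (np=4)
  run: mpirun --map-by :OVERSUBSCRIBE -np 4 python3 mpi4py.git/test/main.py -v -f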

Other than that, all looks good.

@dalcinl (Contributor) commented Jan 13, 2024

BTW, perhaps it would be good to update the submodule pointer first before merging this PR?

@hppritcha (Member, Author)

Yes, the SHAs should be advanced; then I'd rebase this to check that the SHA advance did fix things.

@hppritcha force-pushed the add_mpi4py_gaction branch 2 times, most recently from a3b5179 to 5b0c99f on January 16, 2024 20:52
@hppritcha (Member, Author)

Great, now we're hitting a new problem.

@hppritcha (Member, Author)

@dalcinl, have you seen this error before?

testIMProbe (test_util_pkl5.TestMPISelf.testIMProbe) ... [fv-az842-341:00000] *** An error occurred in PMIx Event Notification

@dalcinl (Contributor) commented Jan 17, 2024

@hppritcha No, this looks like a new regression in ompi@main. It obviously failed in my scheduled CI runs, too.

testIMProbe (test_util_pkl5.TestMPISelf.testIMProbe) ... [fv-az802-515:00000] *** An error occurred in PMIx Event Notification
[fv-az802-515:00000] *** reported by process [4010541056,0]
[fv-az802-515:00000] *** on a NULL communicator
[fv-az802-515:00000] *** Unknown error (this should not happen!)
[fv-az802-515:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az802-515:00000] ***    and MPI will try to terminate your MPI job as well)

I bet that if we run the test in isolation from the others, it will work. I'll try and report back.

@dalcinl (Contributor) commented Jan 17, 2024

@hppritcha https://github.com/open-mpi/ompi/actions/runs/7547383955/job/20550911757?pr=12217#step:17:1598
From there on, I guess there may be some internal state corruption making subsequent tests fail.

I can reproduce locally. Running the originally failing test script in isolation works just fine:

$ python test/main.py -q test_util_pkl5
----------------------------------------------------------------------
Ran 126 tests in 0.128s

OK (skipped=24)

@rhc54 (Contributor) commented Jan 17, 2024

FWIW: I tried to reproduce this on my Mac, but cannot run any MPI procs using current HEAD of OMPI main:

dyld[20638]: symbol not found in flat namespace '_mca_common_monitoring_output_stream_id'

@hppritcha (Member, Author)

Uh oh, looks like the NVIDIA CI disk is full.

--disable-dependency-tracking
--disable-sphinx
--disable-man-pages
--disable-mpi-fortran
@dalcinl (Contributor) commented Jan 17, 2024

Perhaps you should enable Fortran? I had it disabled to save CI build minutes. If you enable it, then mpi4py can run some tests on predefined Fortran MPI datatypes. Up to you guys to decide whether it is worth it.

Suggested change (remove this line):

--disable-mpi-fortran

@hppritcha (Member, Author)

I'll do that in a subsequent PR.

Member

FWIW: I agree with @dalcinl -- it would probably be good to enable Fortran.

Also -- should we try to leverage multiple GitHub Actions together? E.g., I think that adding this GitHub Action means we're going to have (at least?) two builds occurring in GitHub Actions (i.e., this one plus @edgargabriel's ROCm build). Instead, shouldn't we:

  1. Have a GitHub Action that does a build including everything we want to test in GitHub Actions (e.g., ROCm, whatever we want/need for mpi4py, etc.).
  2. Have subsequent pipelined GitHub Actions that use the build from step 1 to run tests on that build: e.g., a 2nd GitHub Action to run @edgargabriel's ROCm tests, and a 3rd GitHub Action to run the mpi4py tests.

That way, we only build once and save GitHub Actions minutes. The overall runtime might end up being lower, too.
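
A minimal sketch of that build-once/test-many layout, assuming the build tree is passed between jobs as a workflow artifact (the job, step, and artifact names are illustrative, not from this PR):

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # ... check out, configure, and build Open MPI into ${GITHUB_WORKSPACE}/install ...
      - name: Upload Open MPI install tree
        uses: actions/upload-artifact@v4
        with:
          name: ompi-install
          path: install
  test-mpi4py:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Download Open MPI install tree
        uses: actions/download-artifact@v4
        with:
          name: ompi-install
          path: install
      # ... add install/bin to PATH and run the mpi4py tests against the downloaded install ...

One caveat with this approach: artifact upload/download does not preserve file permissions, so a real pipeline would need to tar the install tree (or restore executable bits) before using it.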

@edgargabriel (Member) commented Jan 19, 2024

I don't necessarily have strong feelings for one solution over another. I am not convinced that a single build is always feasible, since packages sometimes conflict and cannot be installed simultaneously (e.g., I would not be surprised if a distro package manager doesn't let you install the GPU software stacks of multiple vendors at once, or if there are dependency conflicts between those packages).

Regarding running jobs, in the past that was not an option on GitHub since they didn't have any instances with AMD GPUs. That might have changed recently (i.e., I know that Azure now has instances with AMD GPUs), but I haven't investigated whether that would be an option for GitHub workflows. That was the reason we stuck to compilation only.

Contributor

@edgargabriel If you manage to install CuPy within the runner, then you should definitely try to run mpi4py tests with mpiexec -n <np> python test/main.py --cupy. Every MPI routine will be tested for both CPU and GPU arrays.
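
As a sketch of such a step (assuming a runner with an NVIDIA GPU and a CUDA 12.x toolchain matching the cupy-cuda12x wheel; the process count and paths are illustrative):

- name: Install CuPy
  run: python3 -m pip install cupy-cuda12x
- name: Run MPI4py (CuPy)
  run: mpirun -np 2 python3 mpi4py.git/test/main.py -v --cupy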

@wenduwan (Contributor)

I will be watching this change. We are also bumping submodules in v5.0.x. I want to make sure the same issue does not affect the release branch.

hppritcha added a commit to hppritcha/pmix that referenced this pull request Jan 29, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy, but soon thereafter it was re-released in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jan 29, 2024 (same fix as above)
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Jan 29, 2024 (same fix, cherry picked from commit 4baeb9f)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jan 29, 2024 (same fix, cherry picked from commit 4baeb9f)
@hppritcha (Member, Author)

@dalcinl, could you change your test_exceptions.py so that, for the singleton case, it checks for the error being an unsupported operation rather than err arg?

FAIL: testCreateGroup (test_exceptions.TestExcSession.testCreateGroup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/users/hpritchard/mpi4py_sandbox/mpi4py/test/./test_exceptions.py", line 39, in assertRaisesMPI
    callableObj(*args, **kwargs)
  File "src/mpi4py/MPI/Session.pyx", line 103, in mpi4py.MPI.Session.Create_group
mpi4py.MPI.Exception: MPI_ERR_UNSUPPORTED_OPERATION: operation not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/hpritchard/mpi4py_sandbox/mpi4py/test/./test_exceptions.py", line 391, in testCreateGroup
    self.assertRaisesMPI(
  File "/users/hpritchard/mpi4py_sandbox/mpi4py/test/./test_exceptions.py", line 56, in assertRaisesMPI
    raise self.failureException(
AssertionError: generated error class is 'ERR_UNSUPPORTED_OPERATION' (52), but expected 'ERR_ARG' (13)

hppritcha added a commit to hppritcha/ompi that referenced this pull request Jan 29, 2024
Improve error handling when querying PMIx for available process sets.

Related to open-mpi#12217

also related to openpmix/openpmix#3267

Signed-off-by: Howard Pritchard <[email protected]>
@dalcinl (Contributor) commented Jan 30, 2024

@dalcinl, could you change your test_exceptions.py so that, for the singleton case, it checks for the error being an unsupported operation rather than err arg?

Yes, I can, of course. However, wouldn't that be misleading? I'm just sending an invalid, garbage argument value to an MPI call.

Fixed in mpi4py/mpi4py@3b2fe1c.

@dalcinl (Contributor) commented Jan 30, 2024

However, wouldn't that be misleading?

Oh, I understand that things may be a little different in singleton init mode. As there is no server, you fail with an unsupported operation. All good now. The fix you requested to the mpi4py test suite is already pushed to master.

hppritcha added a commit to hppritcha/ompi that referenced this pull request Jan 30, 2024
related to open-mpi#12217

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Jan 30, 2024 (same error-handling fix as above)
Seems like mpi4py finds a problem almost every time we advance openpmix/prrte SHAs, so catch it early here.

We test mpi4py master as it contains the MPI-4 stuff that so often breaks.

related to open-mpi#12195

Signed-off-by: Howard Pritchard <[email protected]>
Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha (Member, Author)

shmem: mmap: an error occurred while determining whether or not /tmp/ompi.st-master.5462/jf.0/2130837504/shared_mem_cuda_pool.st-master

@hppritcha Who can we get to fix the name of the session directory? I explained elsewhere what needs to be done - basically, retrieve the session directory using PMIx_Get of the PMIX_NSDIR key.

I'll see about fixing this in another PR.

@dalcinl (Contributor) commented Jan 31, 2024

My automated CI ran to completion 4 hours ago [link], with up to 5 MPI processes.

I'm not sure whether the deadlock in the last run here was a hiccup, or some actual issue my testing is not catching consistently.

@hppritcha requested review from wenduwan and removed the request for ggouaillardet on January 31, 2024 16:02
@hppritcha (Member, Author)

@wenduwan, could you review? We can add other suggestions from @dalcinl in subsequent PRs.

@wenduwan (Contributor) left a comment

Thank you very much! This is a much needed change.

@hppritcha merged commit 3c10787 into open-mpi:main on Jan 31, 2024. 11 checks passed.
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Feb 5, 2024 (same fix, cherry picked from commit 4baeb9f)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Feb 5, 2024 (same fix, cherry picked from commit 4baeb9f)