
github-action: add a MPI4PY sanity check #12217

Merged: 2 commits merged into open-mpi:main from the add_mpi4py_gaction branch on Jan 31, 2024

Conversation

hppritcha (Member)

It seems like mpi4py finds a problem almost every time we advance the openpmix/prrte SHAs, so catch it early here.

related to #12195

@hppritcha (Member, Author)

@dalcinl, if you have suggestions here, particularly about improving test coverage, they would be much appreciated.

@dalcinl (Contributor) commented Jan 5, 2024

@hppritcha I would suggest using the master branch for now, at least until I release mpi4py 4.0.0 near the end of January. Otherwise, using 3.1.5 will miss all of MPI-4. I'm not even sure whether 3.1.5 runs cleanly against ompi@main. I understand you are worried about being disturbed by bugs introduced in mpi4py development, but that should be very rare, as mpi4py itself is heavily tested in various CI environments, OSs, and MPI implementations. PRs are not merged unless all tests pass, with 100% code coverage, etc. Therefore, I insist you use master for now and for a few weeks. Afterwards, once I release 4.0.0, you can switch to the maintenance branch I use to cut patch releases. Nonetheless, I would argue that testing mpi4py@master is still the best. For example, at some point you guys will start adding the large-count MPI-4 routines, and mpi4py is ready to contribute preliminary testing for all of that, although mpi4py may still need commits in master to enable the new APIs one by one as they are added.

@dalcinl (Contributor) commented Jan 5, 2024

Of course, using mpi4py from the master or maintenance branch would require checking out mpi4py from its git repository, as done here: https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml

If you really want to check out the 3.1.5 tag, then you can just use:

 - name: Checkout mpi4py
   uses: actions/checkout@v4
   with:
     repository: "mpi4py/mpi4py"
     ref: "3.1.5"
     path: mpi4py.git
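
For the master branch instead, a minimal sketch of the equivalent step (the explicit ref: master is an assumption for clarity; omitting ref would also check out the repository's default branch):

 - name: Checkout mpi4py
   uses: actions/checkout@v4
   with:
     repository: "mpi4py/mpi4py"
     ref: "master"
     path: mpi4py.git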

@hppritcha (Member, Author)

main.py doesn't exist on the 3.1.5 tag.

@hppritcha (Member, Author)

Okay, we will go with the master branch for now.

@hppritcha (Member, Author)

Note: the mpi4py failure is a good thing, as it's catching the problem reported in #12195.

Comment on lines 29 to 40
- name: Build MPI4py
  run: |
    echo "PYTHONPATH=${GITHUB_WORKSPACE}/mpi4py/" >> ${GITHUB_ENV}
    export PATH=${GITHUB_WORKSPACE}/install/bin:$PATH
    cd mpi4py.git
    python3 setup.py build --mpicc="mpicc -shared"
    python3 setup.py install --user
- name: Run MPI4py
  run: |
    export PATH=${GITHUB_WORKSPACE}/install/bin:$PATH
    cd mpi4py.git/test
    mpirun --map-by :OVERSUBSCRIBE -np 3 python3 ./main.py -v -f
Contributor

Suggested change:

- name: Extend PATH
  run: echo ${GITHUB_WORKSPACE}/install/bin >> $GITHUB_PATH
- name: Build MPI4py
  run: python3 -m pip install -vvv mpi4py.git
- name: Run MPI4py
  run: mpirun --map-by :OVERSUBSCRIBE -np 3 python3 mpi4py.git/test/main.py -v -f

Comment on lines 11 to 15
- name: Install dependencies
  run: |
    sudo apt update
    sudo apt install -y --no-install-recommends wget lsb-core software-properties-common curl python3-pip
    pip3 install cython
@dalcinl (Contributor) commented Jan 8, 2024

Suggested change:

- name: Setup Python
  uses: actions/setup-python@v5
  with:
    python-version: 3

Comment on lines 16 to 18
- uses: actions/checkout@v3
  with:
    submodules: recursive
Contributor

Suggested change:

- uses: actions/checkout@v4
  with:
    submodules: recursive

@dalcinl (Contributor) commented Jan 8, 2024

IMHO, this workflow is overly complicated and does not cover as much testing (singleton init, more MPI processes) as the one mpi4py uses: https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml.
I left a bunch of suggestions to simplify things, but I still do not like it.

I would recommend using the workflow linked above as a template and tweaking things as needed, maybe removing everything related to mpi4py.futures, as it can be occasionally flaky in CI environments. Just my two cents.
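
For illustration, a rough sketch of the kind of extra coverage being described, i.e. a singleton run plus runs with more processes; the step names, process counts, and working-directory layout are assumptions loosely modeled on the linked workflow, not anything from this PR:

# assumes an earlier step has already put the Open MPI install bin directory on PATH
- name: Test mpi4py (singleton)
  run: python3 test/main.py -v -f
  working-directory: mpi4py.git
- name: Test mpi4py (np=2)
  run: mpirun -np 2 python3 test/main.py -v -f
  working-directory: mpi4py.git
- name: Test mpi4py (np=3, oversubscribed)
  run: mpirun --map-by :OVERSUBSCRIBE -np 3 python3 test/main.py -v -f
  working-directory: mpi4py.git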

@hppritcha (Member, Author)

@dalcinl check now.

@dalcinl (Contributor) commented Jan 13, 2024

@hppritcha I'm wondering whether it would make sense to add an np=4 run. That case is not strictly needed for mpi4py code coverage, which is why I didn't originally have it. However, from the Open MPI point of view, np=4 may trigger different ompi code paths, like power-of-two optimizations. Note, however, that np=4 may add about 3 extra minutes to the workflow run.
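
A sketch of what such an extra run might look like, simply mirroring the existing np=3 step (the step name is illustrative):

- name: Run MPI4py (np=4)
  run: mpirun --map-by :OVERSUBSCRIBE -np 4 python3 mpi4py.git/test/main.py -v -f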

Other than that, all looks good.

@dalcinl (Contributor) commented Jan 13, 2024

BTW, perhaps it would be good to update the submodule pointer first before merging this PR?

@hppritcha (Member, Author)

Yes, the SHAs should be advanced; then I'd rebase this to check that the SHA advance did fix things.

@hppritcha force-pushed the add_mpi4py_gaction branch 2 times, most recently from a3b5179 to 5b0c99f on January 16, 2024 20:52
@hppritcha (Member, Author)

Great, now we're hitting a new problem.

@hppritcha (Member, Author)

@dalcinl, have you seen this error before?

testIMProbe (test_util_pkl5.TestMPISelf.testIMProbe) ... [fv-az842-341:00000] *** An error occurred in PMIx Event Notification

@dalcinl (Contributor) commented Jan 17, 2024

@hppritcha No, this looks like a new regression in ompi@main. It obviously failed in my scheduled CI runs, too.

testIMProbe (test_util_pkl5.TestMPISelf.testIMProbe) ... [fv-az802-515:00000] *** An error occurred in PMIx Event Notification
[fv-az802-515:00000] *** reported by process [4010541056,0]
[fv-az802-515:00000] *** on a NULL communicator
[fv-az802-515:00000] *** Unknown error (this should not happen!)
[fv-az802-515:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[fv-az802-515:00000] ***    and MPI will try to terminate your MPI job as well)

I bet that if we run the test in isolation from the others, it will work. I'll try and report back.

@dalcinl (Contributor) commented Jan 17, 2024

@hppritcha https://github.com/open-mpi/ompi/actions/runs/7547383955/job/20550911757?pr=12217#step:17:1598
From there on, I guess there may be some internal state corruption making subsequent tests fail.

I can reproduce locally. Running the originally failing test script in isolation works just fine:

$ python test/main.py -q test_util_pkl5
----------------------------------------------------------------------
Ran 126 tests in 0.128s

OK (skipped=24)

@rhc54 (Contributor) commented Jan 17, 2024

FWIW: I tried to reproduce this on my Mac, but cannot run any MPI procs using current HEAD of OMPI main:

dyld[20638]: symbol not found in flat namespace '_mca_common_monitoring_output_stream_id'

@hppritcha (Member, Author)

Uh oh, looks like the NVIDIA CI disk is full.

--disable-dependency-tracking
--disable-sphinx
--disable-man-pages
--disable-mpi-fortran
@dalcinl (Contributor) commented Jan 17, 2024

Perhaps you should enable Fortran? I had it disabled to save CI build minutes. If you enable it, then mpi4py can run some tests on predefined Fortran MPI datatypes. Up to you guys to decide whether it is worth it.

Suggested change (remove this line):

--disable-mpi-fortran

@hppritcha (Member, Author)

I'll do that in a subsequent PR.

Member

FWIW: I agree with @dalcinl -- it would probably be good to enable Fortran.

Also -- should we try to leverage multiple GitHub Actions together? E.g., I think that adding this GitHub Action means we're going to have (at least?) two builds occurring in GitHub Actions (i.e., this one plus @edgargabriel's ROCm build). Instead, shouldn't we:

  1. Have a GitHub Action that does a build including everything we want to test in GitHub Actions (e.g., ROCm, whatever we want/need for mpi4py, etc.).
  2. Have subsequent pipelined GitHub Actions that use the build from step 1 to run tests on that build: e.g., a 2nd GitHub Action to run @edgargabriel's ROCm tests, and a 3rd GitHub Action to run the mpi4py tests.

That way, we only build once and save GitHub Actions minutes. The overall runtime might end up being lower, too.
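
A minimal sketch of that build-once/test-many layout, assuming the build tree is passed between jobs as a workflow artifact (the job, step, and artifact names are illustrative, not from this PR):

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # ... check out, configure, and build Open MPI into ${GITHUB_WORKSPACE}/install ...
      - name: Upload Open MPI install tree
        uses: actions/upload-artifact@v4
        with:
          name: ompi-install
          path: install
  test-mpi4py:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Download Open MPI install tree
        uses: actions/download-artifact@v4
        with:
          name: ompi-install
          path: install
      # ... add install/bin to PATH and run the mpi4py tests against the downloaded install ...

One caveat with this approach: artifact upload/download does not preserve file permissions, so a real pipeline would need to tar the install tree (or restore executable bits) before using it.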

@edgargabriel (Member) commented Jan 19, 2024

I don't necessarily have strong feelings for one solution over another. I am not convinced that a single build is always feasible, since packages sometimes conflict and cannot be installed simultaneously (e.g., I would not be surprised if a distro package manager doesn't let you install the GPU software stacks of multiple vendors at once, or if there are dependency conflicts between those packages).

Regarding running jobs, in the past that was not an option on GitHub since they didn't have any instances with AMD GPUs. That might have changed recently (i.e., I know that Azure now has instances with AMD GPUs), but I haven't investigated whether that would be an option for GitHub workflows. That was the reason we stuck to compilation only.

Contributor

@edgargabriel If you manage to install CuPy within the runner, then you should definitely try to run mpi4py tests with mpiexec -n <np> python test/main.py --cupy. Every MPI routine will be tested for both CPU and GPU arrays.
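
As a sketch of such a step (assuming a runner with an NVIDIA GPU and a CUDA 12.x toolchain matching the cupy-cuda12x wheel; the process count and paths are illustrative):

- name: Install CuPy
  run: python3 -m pip install cupy-cuda12x
- name: Run MPI4py (CuPy)
  run: mpirun -np 2 python3 mpi4py.git/test/main.py -v --cupy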

@wenduwan (Contributor)

I will be watching this change. We are also bumping submodules in v5.0.x. I want to make sure the same issue does not affect the release branch.

hppritcha added a commit to hppritcha/pmix that referenced this pull request Jan 29, 2024
there was a path through pmix_parse_localquery that ended up doing a PMIX_RELEASE on the caddy, but soon thereafter it was re-released in PMIx_Query_info, causing a

PMIx_Query_info: Assertion `PMIX_OBJ_MAGIC_ID == _obj->obj_magic_id' failed.

for this case.

Related to open-mpi/ompi#12217
Related to open-mpi/ompi#10886

Signed-off-by: Howard Pritchard <[email protected]>
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jan 29, 2024 (same fix as above)
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Jan 29, 2024 (same fix, cherry picked from commit 4baeb9f)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Jan 29, 2024 (same fix, cherry picked from commit 4baeb9f)
@hppritcha (Member, Author)

@dalcinl, could you change your test_exceptions.py so that, for the singleton case, it checks for the error being an unsupported operation rather than err arg?

FAIL: testCreateGroup (test_exceptions.TestExcSession.testCreateGroup)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/users/hpritchard/mpi4py_sandbox/mpi4py/test/./test_exceptions.py", line 39, in assertRaisesMPI
    callableObj(*args, **kwargs)
  File "src/mpi4py/MPI/Session.pyx", line 103, in mpi4py.MPI.Session.Create_group
mpi4py.MPI.Exception: MPI_ERR_UNSUPPORTED_OPERATION: operation not supported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/hpritchard/mpi4py_sandbox/mpi4py/test/./test_exceptions.py", line 391, in testCreateGroup
    self.assertRaisesMPI(
  File "/users/hpritchard/mpi4py_sandbox/mpi4py/test/./test_exceptions.py", line 56, in assertRaisesMPI
    raise self.failureException(
AssertionError: generated error class is 'ERR_UNSUPPORTED_OPERATION' (52), but expected 'ERR_ARG' (13)

hppritcha added a commit to hppritcha/ompi that referenced this pull request Jan 29, 2024
Improve error handling when querying PMIx for available process sets.

Related to open-mpi#12217

also related to openpmix/openpmix#3267

Signed-off-by: Howard Pritchard <[email protected]>
@dalcinl (Contributor) commented Jan 30, 2024

@dalcinl, could you change your test_exceptions.py so that, for the singleton case, it checks for the error being an unsupported operation rather than err arg?

Yes, I can, of course. However, wouldn't that be misleading? I'm just sending an invalid, garbage argument value to an MPI call.

Fixed in mpi4py/mpi4py@3b2fe1c.

@dalcinl (Contributor) commented Jan 30, 2024

However, wouldn't that be misleading?

Oh, I understand that things may be a little different in singleton init mode. As there is no server, you fail with an unsupported operation. All good now. The fix you requested to the mpi4py test suite is already pushed to master.

hppritcha added a commit to hppritcha/ompi that referenced this pull request Jan 30, 2024
related to open-mpi#12217

Signed-off-by: Howard Pritchard <[email protected]>
hppritcha added a commit to hppritcha/ompi that referenced this pull request Jan 30, 2024 (same error-handling fix as above)
Seems like mpi4py finds a problem almost every time we advance openpmix/prrte SHAs, so catch it early here.

We test mpi4py master as it contains the MPI-4 stuff that so often breaks.

related to open-mpi#12195

Signed-off-by: Howard Pritchard <[email protected]>
Signed-off-by: Howard Pritchard <[email protected]>
@hppritcha (Member, Author)

shmem: mmap: an error occurred while determining whether or not /tmp/ompi.st-master.5462/jf.0/2130837504/shared_mem_cuda_pool.st-master

@hppritcha Who can we get to fix the name of the session directory? I explained elsewhere what needs to be done - basically, retrieve the session directory using PMIx_Get of the PMIX_NSDIR key.

I'll see about fixing this in another PR.

@dalcinl (Contributor) commented Jan 31, 2024

My automated CI ran to completion 4 hours ago [link], with up to 5 MPI processes.

I'm not sure whether the deadlock in the last run here was a hiccup, or some actual issue my testing is not catching consistently.

@hppritcha requested review from wenduwan and removed the request for ggouaillardet on January 31, 2024 16:02
@hppritcha (Member, Author)

@wenduwan, could you review? We can add other suggestions from @dalcinl in subsequent PRs.

@wenduwan (Contributor) left a comment

Thank you very much! This is a much needed change.

@hppritcha merged commit 3c10787 into open-mpi:main on Jan 31, 2024. 11 checks passed.
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Feb 5, 2024 (same fix, cherry picked from commit 4baeb9f)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Feb 5, 2024 (same fix, cherry picked from commit 4baeb9f)