Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

configury: harden IGNORE_TKR check #12681

Merged

Conversation

ggouaillardet
Copy link
Contributor

NVIDIA HPC Compiler (e.g. nvfortran) incorrectly selected the GCC style pragmas to support IGNORE_TKR. Harden the test by mimicking exactly the mpi f08 bindings in order to fix that false positive.

Thanks Chris Parrot for the report.

Refs. #11582

NVIDIA HPC Compiler (e.g. nvfortran) incorrectly selected the GCC
style pragmas to support IGNORE_TKR. Harden the test by mimicking
exactly the mpi f08 bindings in order to fix that false positive.

Thanks Chris Parrot for the report.

Refs. open-mpi#11582

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Copy link
Contributor Author

@cparrott73 sorry it took so long for me to find the right way to approach this problem.

@jsquyres
Copy link
Member

@cparrott73 Can you confirm that this fixes the issue for you?

@cparrott73
Copy link

@cparrott73 Can you confirm that this fixes the issue for you?

I'll try to give this a go today or Monday and verify. Thanks for doing this!

@cparrott73
Copy link

I pulled top-of-tree from git and tested it with our compilers. I confirm that it now works correctly.

How easy would it be to patch Open MPI 4.1.x with this change? Our product management team is not ready to move forward to 5.x, but I'd like to see if it would be possible to backport this to the version we are currently using. If not, not the end of the world.

@ggouaillardet
Copy link
Contributor Author

thanks for testing it!

the backport was almost straightforward.
here is a diff you can apply on top of the v4.1.x branch:

diff --git a/config/ompi_fortran_check_ignore_tkr.m4 b/config/ompi_fortran_check_ignore_tkr.m4
index d996537609..766e697478 100644
--- a/config/ompi_fortran_check_ignore_tkr.m4
+++ b/config/ompi_fortran_check_ignore_tkr.m4
@@ -189,8 +189,23 @@ AC_DEFUN([OMPI_FORTRAN_CHECK_IGNORE_TKR_SUB], [
     call foo(a, count)
   end subroutine force_assumed_shape
 
+  module check_ignore_tkr
+  interface foobar
+     subroutine foobar_x(buffer, count)
+       $1 buffer
+       $2, intent(in) :: buffer
+       integer, intent(in) :: count
+     end subroutine foobar_x
+  end interface
+  end module
+
+  subroutine bar(var)
+    use check_ignore_tkr
+    implicit none
+    real, intent(inout) :: var(:, :, :)
+
+    call foobar(var(1,1,1), 1)
 ! Autoconf puts "end" after the last line
-  subroutine bogus
 ]]),
                     [msg=yes
                      ompi_fortran_ignore_tkr_predecl="$1"

@cparrott73
Copy link

Thanks! I will use this in our builds. We really appreciate the fix.

@jsquyres
Copy link
Member

@ggouaillardet Want to merge / cherry-pick to v5.0.x / cherry-pick/port to v4.1.x?

@ggouaillardet
Copy link
Contributor Author

@jsquyres yes, I will do that from now!

@ggouaillardet ggouaillardet merged commit ce2310a into open-mpi:main Jul 24, 2024
13 checks passed
@cparrott73
Copy link

cparrott73 commented Jul 25, 2024

foo2.f90.txt

Apologies, I may have spoken a little too soon here.

I tried applying the patch, regenerating ./configure, and running a build. I did not see any errors, and compiling a test program worked.

When I went to integrate this into our builds, our scripts complained about some missing F08 files in the lib directory. I took a look, and none of the F08 libraries had built. So I did some deeper investigation. I think the missing F08 libraries were due to a bug in the version of autoconf we were using, which caused ./configure to be regenerated incorrectly, so I tried on another system. However, when I did that, I still did not see any difference in behavior from before.

It turns out our compiler is complaining about the source file in question while running ./configure:

$ nvfortran -c foo2.f90
NVFORTRAN-S-0446-Argument number 1 to foo: rank mismatch (foo1.f90: 39)
  0 inform,   0 warnings,   1 severes, 0 fatal for main

So I tried the previous version (as shipped with Open MPI 4.1.5) - and nvfortran throws the same error. I'm not sure if this is expected.

gcc-14.1.0 also complains about this file for the same reasons.

So I'm not sure this is actually working. Very sorry for not catching this sooner.

Just in case - am I missing anything here? Let me know if you need more information from me. Thanks!

I have attached the updated configure test as foo2.f90 to the top of this comment.

@ggouaillardet
Copy link
Contributor Author

@cparrott73 something does not check out here. is there any chance you applied the patch on top of a previous one?
This does not pass and it should not.

Anyway, I issued #12707 for the v4.1.x branch and I suggest you build from there.

I can build the ring_usempif08.f90 example with both GCC 14.1.0 and NVHPC 24.5

FWIW:

  • NVHPC 24.5
$ grep OMPI_FORTRAN_IGNORE_TKR config.status
S["OMPI_FORTRAN_IGNORE_TKR_TYPE"]="real, dimension(*)"
S["OMPI_FORTRAN_IGNORE_TKR_PREDECL"]="!DIR$ IGNORE_TKR"
D["OMPI_FORTRAN_IGNORE_TKR_PREDECL"]=" \"!DIR$ IGNORE_TKR\""
D["OMPI_FORTRAN_IGNORE_TKR_TYPE"]=" "
  • GCC 14.1.0
$ grep OMPI_FORTRAN_IGNORE_TKR config.status 
S["OMPI_FORTRAN_IGNORE_TKR_TYPE"]="type(*), dimension(*)"
S["OMPI_FORTRAN_IGNORE_TKR_PREDECL"]="!GCC$ ATTRIBUTES NO_ARG_CHECK ::"
D["OMPI_FORTRAN_IGNORE_TKR_PREDECL"]=" \"!GCC$ ATTRIBUTES NO_ARG_CHECK ::\""
D["OMPI_FORTRAN_IGNORE_TKR_TYPE"]=" "

@cparrott73
Copy link

@ggouaillardet - thanks for the response. I do see that 4.1.x branch patch referenced in #12707 is different than the one I was using, which was a cut and paste from the issue page. I'll pull this file from #12707 and try again, and will let you know. Thanks!

@cparrott73
Copy link

@ggouaillardet - I pulled top-of-tree from git, built with our compilers, and then ran some tests that were previously known to fail. I can confirm they pass now. I'm not sure why the ompi-4.x version isn't working here yet, but I will continue to run a few experiments on that front.

@cparrott73
Copy link

@ggouaillardet - I pulled the diff from the pull request this time and applied it to our openmpi-4.x tree here. I can confirm it worked this time. I'm going to try again on our older build machine and see if I can reproduce the success there. I think my earlier failure was from when I tried to cut and paste the diff above into a patch. I think the actual commit is different? Anyway, I think this is good now. Thanks!

@cparrott73
Copy link

@ggouaillardet - I hate to keep bumping this, but we have found an issue with this change and wanted to get your thoughts on it.

The change works as expected for nvfortran. However, we have some developers internally who are also contributing to flang (specifically, flang-new present in LLVM 17 and later). We ran this through flang-new, and it is throwing the following error on the updated version of this code:

./fs33302.f90:76:8: error: No specific subroutine of generic 'foobar' matches the actual arguments
         call foobar(var(1, 1, 1), 1)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I asked our flang developers about this, and here is what they told me:

the fortran code for auto-detecting ignore_tkr is buggy.
the bug is near the end; it passes var(1,1,1) instead of passing var.
storage sequence association doesn’t/can’t work with a not-known-to-be-contiguous assumed-shape actual argument.

passing the whole var on the other hand will work, the compiler makes it contiguous

if the actual argument were known to be contiguous, that code would work even without ignore_tkr.

passing the whole array works with f18 and is probably closer to what they want to be testing in this autoconf code; haven’t tried the code snippet above with nvfortran

Thoughts?

If I change var(1, 1, 1) in the foobar call to just var - both nvfortran and flang accept it.

Does this change what you are trying to test here in any way? If not, I would propose we replace var(1, 1, 1) with var.

Thanks in advance.

@jsquyres
Copy link
Member

@ggouaillardet I see you added this check in 8c601e1; I'm not enough of a Fortran expert to know why it should be var instead of var(1, 1, 1). Can you double check?

@ggouaillardet
Copy link
Contributor Author

I am not a Fortran expert, I know less fortran than @jsquyres , I have not read the Fortran standard, and if I did, I am not even sure I would understand it.

That being said, the mentioned commit addresses the legit issue reported in #12506.
TL;DR MPI_Send(buffer, ...) must work both when buffer is an array an when it is a scalar.

The test does fail indeed with LLVM17, and this is what I want otherwise MPI_Send(buffer, ...) will fail to compile when buffer is a scalar. IIRC I tried upcoming LLVM19 and it passed (so as far as use mpi is concerned, the IGNORE_TKR directive is good to go).

ggouaillardet added a commit to ggouaillardet/ompi that referenced this pull request Aug 12, 2024
As reported by Chris Parrott in open-mpi#12681, the enhanced test was buggy.
Pass the expected argument to make it Fortran compliant and
LLVM 17 and above happy compilers

Signed-off-by: Gilles Gouaillardet <[email protected]>
@ggouaillardet
Copy link
Contributor Author

Never mind, it seems this PR indeed broke LLVM < 19.
I made the suggested changes and not only LLVM 17 and 18 pass, but if also fixes the false positive initially reported with LLVM 19.

@ggouaillardet
Copy link
Contributor Author

And I have now to take it back ...
if we pass var instead of var(1,1,1) we got a false positive with nvfortran 24.7 (it picks GCC style directive, and use mpi issues warnings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants