Enable distributed parallelism #1048

@jalvesz

Description

Motivation

As of today, working on distributed parallelism in Fortran mostly means using either MPI or coarrays, and one has to decide early on which one to rely on. I would like to propose that stdlib wrap certain basic reduction operators so that, through C preprocessing, they can rely on either backend; other procedures from stdlib could then profit from such a wrapper.

I'll try to give a picture with a very simple example: computing the norm2 of a 1D array distributed across processes/images. This operation requires a parallel sum reduction before taking the square root:

<kind> :: x(:) !> in SPMD, each process/image holds a partial portion of the array
<kind> :: local_sum, global_sum
...
local_sum = dot_product( x, x ) !> this sum is incomplete with respect to the distributed data
#if defined(STDLIB_WITH_MPI)
call MPI_Allreduce(local_sum, global_sum, 1, MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
global_sum = local_sum
call co_sum( global_sum )
#endif
...
norm2 = sqrt( global_sum )
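
With such a wrapper in place, this example would collapse to a single backend-agnostic call (stdlib_co_sum being the hypothetical wrapper proposed below):

local_sum = dot_product( x, x )
call stdlib_co_sum( local_sum ) !> MPI_Allreduce, co_sum, or a no-op, depending on the build
norm2 = sqrt( local_sum )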

If stdlib offered wrappers for these reduction operators, it would be possible to make some of its functionalities also work transparently on distributed frameworks. The idea could consist of having a module stdlib_distributed or stdlib_coarray (to promote coarray-like syntax?) and then:

module stdlib_<name_to_choose>
#if defined(STDLIB_WITH_MPI)
use mpi_f08 !> modern MPI bindings; buffers are assumed-type/assumed-rank
#endif
implicit none

interface stdlib_co_sum
    module procedure :: stdlib_co_sum_<kind>
    ...
end interface

contains

subroutine stdlib_co_sum_<kind>( A, result_image, stat, errmsg )
    <kind>, intent(inout) :: A(..)
    integer, intent(in), optional :: result_image
    integer, intent(out), optional :: stat
    character(*), intent(inout), optional :: errmsg
#if defined(STDLIB_WITH_MPI)
    integer :: ierr
#endif
    ...
    select rank(A)
    rank(0)
#if defined(STDLIB_WITH_MPI)
        !> in-place reduction: every rank ends up holding the global sum
        call MPI_Allreduce(MPI_IN_PLACE, A, 1, MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
        call co_sum(A, result_image, stat, errmsg)
#endif
    rank(1)
#if defined(STDLIB_WITH_MPI)
        call MPI_Allreduce(MPI_IN_PLACE, A, size( A ), MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
        call co_sum(A, result_image, stat, errmsg)
#endif
    rank(2)
#if defined(STDLIB_WITH_MPI)
        call MPI_Allreduce(MPI_IN_PLACE, A, size( A ), MPI_<kind>, MPI_SUM, MPI_COMM_WORLD, ierr)
#elif defined(STDLIB_WITH_COARRAY)
        call co_sum(A, result_image, stat, errmsg)
#endif
    ...
    end select
end subroutine

end module
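
As a side note, the <kind> placeholders map naturally onto stdlib's existing fypp templating; a minimal sketch of the expansion, assuming the REAL_KINDS_TYPES list from stdlib's common.fypp:

#:for k1, t1 in REAL_KINDS_TYPES
subroutine stdlib_co_sum_${k1}$( A, result_image, stat, errmsg )
    ${t1}$, intent(inout) :: A(..)
    ...
end subroutine
#:endfor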

This way, if one doesn't link against either backend, the kernels do nothing and the data is returned unchanged. If one of them is linked, then one can rely on stdlib as an intermediate wrapper.
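
To illustrate, downstream code would then read the same regardless of the backend chosen at build time (a hedged sketch reusing the placeholder names from above):

program demo_norm2
    use stdlib_<name_to_choose>
    implicit none
    real :: x(1000), s
    call random_number(x)   !> in an SPMD run, each process/image fills only its own slice
    s = dot_product( x, x )
    call stdlib_co_sum( s ) !> MPI_Allreduce, co_sum, or a serial no-op, depending on the build
    print *, 'norm2 = ', sqrt(s)
end program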

I haven't fully thought this through, but I would like to open it for discussion.

Prior Art

No response

Additional Information

No response
