Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare osc framework for bigcount #12379

Merged
merged 1 commit into from
Mar 12, 2024

Conversation

jtronge
Copy link
Contributor

@jtronge jtronge commented Feb 26, 2024

This updates the osc framework to use size_t for counts and ptrdiff_t for displacements.

Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hppritcha! It mostly looks good, just two points:

@wenduwan
Copy link
Contributor

Running AWS CI

@wenduwan
Copy link
Contributor

AWS CI failed due to OMB onesided suite. Will provide more info.

@hppritcha
Copy link
Member

You mean IMB right?

@wenduwan
Copy link
Contributor

Example failure

mpirun --wdir . -n 2  osu-micro-benchmarks/mpi/one-sided/osu_put_bw
# OSU MPI_Put Bandwidth Test v7.0-lrbison3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size      Bandwidth (MB/s)
1                      11.14
2                       8.96
segfault

Backtrace

#0  0x00007f94d52edc1a in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x00007f94d4f159bc in opal_datatype_accelerator_memcpy (dest=0x7f94d6140100, src=0x187c050, size=4) at opal_datatype_copy.c:74
#2  0x00007f94d4f1648c in non_overlap_accelerator_copy_content_same_ddt (datatype=0x6176e0 <ompi_mpi_char>, count=4,
    destination_base=0x7f94d6140100 "\206\006A\016\070\203\aG\016@\002L\n\016\070D\016\060A\016(B\016 B\016\030B\016\020B\016\bA\v",
    source_base=0x187c050 'a' <repeats 176 times>) at opal_datatype_copy.h:151
#3  0x00007f94d4f1805b in opal_datatype_copy_content_same_ddt (datatype=0x6176e0 <ompi_mpi_char>, count=4,
    destination_base=0x7f94d6140100 "\206\006A\016\070\203\aG\016@\002L\n\016\070D\016\060A\016(B\016 B\016\030B\016\020B\016\bA\v",
    source_base=0x187c050 'a' <repeats 176 times>) at opal_datatype_copy.c:176
#4  0x00007f94d5803bc8 in ompi_datatype_sndrcv (sbuf=0x187c050, scount=4, sdtype=0x6176e0 <ompi_mpi_char>, rbuf=0x7f94d6140100, rcount=4,
    rdtype=0x6176e0 <ompi_mpi_char>) at ompi_datatype_sndrcv.c:62
#5  0x00007f94c40250fa in ompi_osc_sm_put (origin_addr=0x187c050, origin_count=4, origin_dt=0x6176e0 <ompi_mpi_char>, target=1,
    target_disp=80, target_count=4, target_dt=0x6176e0 <ompi_mpi_char>, win=0x1863050) at osc_sm_comm.c:232
#6  0x00007f94d585f97b in PMPI_Put (origin_addr=0x187c050, origin_count=4, origin_datatype=0x6176e0 <ompi_mpi_char>, target_rank=1,
    target_disp=80, target_count=4, target_datatype=0x6176e0 <ompi_mpi_char>, win=0x1863050) at put.c:82
#7  0x0000000000403150 in run_put_with_flush (rank=0, type=WIN_ALLOCATE) at osu_put_bw.c:268
#8  0x00000000004029c3 in main (argc=<optimized out>, argv=<optimized out>) at osu_put_bw.c:129

@hppritcha
Copy link
Member

having problems reproducing your findings @wenduwan . what are you config options and compiler used? and also processor type?

@wenduwan
Copy link
Contributor

./configure --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --enable-ipv6 LDFLAGS=-Wl,--as-needed --enable-prte-prefix-by-default --enable-mca-dso=all --with-libevent=external --with-hwloc=external --enable-debug

Note that I don't imagine libfabric to be an issue since the failure happens in sm

The segfault happens on:

  • Amazon Linux 2 (GCC7) + x86
  • Amazon LInux 2 (GCC7) + arm
  • Ubuntu 20.04 + x86
  • Ubuntu 20.04 + arm

@hppritcha
Copy link
Member

@wenduwan could you please rerun AWS CI. @jtronge pushed something to hopefully fix what you're seeing.

@wenduwan
Copy link
Contributor

Thanks. The test passed.

@hppritcha
Copy link
Member

@devreal could you take at this PR again to see if your questions were addressed?

@hppritcha hppritcha requested a review from devreal March 12, 2024 14:44
Copy link
Contributor

@devreal devreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few places where the format strings need to be adjusted

@@ -923,7 +923,7 @@ int ompi_osc_rdma_rget (void *origin_addr, int origin_count, ompi_datatype_t *or
ompi_osc_rdma_sync_t *sync;
int ret;

OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rget: 0x%lx, %d, %s, %d, %d, %d, %s, %s", (unsigned long) origin_addr,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rget: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

origin_count is now size_t:

Suggested change
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rget: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rget: 0x%lx, %zu, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,

ompi_datatype_t *source_datatype, ompi_win_t *win)
{
ompi_osc_rdma_module_t *module = GET_MODULE(win);
ompi_osc_rdma_peer_t *peer;
ompi_osc_rdma_sync_t *sync;

OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "get: 0x%lx, %d, %s, %d, %d, %d, %s, %s", (unsigned long) origin_addr,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "get: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

origin_count is now size_t

Suggested change
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "get: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "get: 0x%lx, %zu, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,

@@ -867,7 +867,7 @@ int ompi_osc_rdma_rput (const void *origin_addr, int origin_count, ompi_datatype
ompi_osc_rdma_sync_t *sync;
int ret;

OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rput: 0x%lx, %d, %s, %d, %d, %d, %s, %s", (unsigned long) origin_addr, origin_count,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rput: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr, origin_count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

origin_count is now size_t

Suggested change
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rput: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr, origin_count,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "rput: 0x%lx, %zu, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr, origin_count,

ompi_datatype_t *target_datatype, ompi_win_t *win)
{
ompi_osc_rdma_module_t *module = GET_MODULE(win);
ompi_osc_rdma_peer_t *peer;
ompi_osc_rdma_sync_t *sync;

OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "put: 0x%lx, %d, %s, %d, %d, %d, %s, %s", (unsigned long) origin_addr,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "put: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

origin_count is now size_t

Suggested change
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "put: 0x%lx, %d, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,
OSC_RDMA_VERBOSE(MCA_BASE_VERBOSE_TRACE, "put: 0x%lx, %zu, %s, %d, %d, %zu, %s, %s", (unsigned long) origin_addr,

@@ -3194,7 +3194,7 @@ ompi_osc_portals4_get_accumulate(const void *origin_addr,
ptrdiff_t length, origin_lb, target_lb, result_lb, extent;

OPAL_OUTPUT_VERBOSE((50, ompi_osc_base_framework.framework_output,
"get_accumulate: 0x%lx, %d, %s, 0x%lx, %d, %s, %d, %lu, %d, %s, %s, 0x%lx",
"get_accumulate: 0x%lx, %zu, %s, 0x%lx, %d, %s, %d, %lu, %zu, %s, %s, 0x%lx",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"get_accumulate: 0x%lx, %zu, %s, 0x%lx, %d, %s, %d, %lu, %zu, %s, %s, 0x%lx",
"get_accumulate: 0x%lx, %zu, %s, 0x%lx, %zu, %s, %d, %lu, %zu, %s, %s, 0x%lx",

Update the osc framework to use size_t for counts and ptrdiff_t for
displacements.

Signed-off-by: Jake Tronge <[email protected]>
@jtronge
Copy link
Contributor Author

jtronge commented Mar 12, 2024

Thanks @devreal. I think all the format strings should be fixed.

@hppritcha hppritcha merged commit 44880f6 into open-mpi:main Mar 12, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants