Import several uct btl updates #12839

Merged

Commits on Oct 1, 2024

  1. pml/ob1: fix potential double return of RDMA fragment on get operation failure
    
    The mca_pml_ob1_recv_request_get_frag_failed method is responsible for returning
    or queueing the fragment but mca_pml_ob1_rget_completion was freeing it
    unconditionally. This will lead to a double return of the fragment to the free
    list and may lead to other errors if the fragment was queued for retry. This
    commit fixes the issue by only returning the fragment if it did not fail.
    
    Signed-off-by: Nathan Hjelm <[email protected]>
    (cherry picked from commit b7f8cae)
    hjelmn authored and bosilca committed Oct 1, 2024
    1f5ec3e
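The ownership rule described in this commit can be illustrated with a minimal sketch. It assumes a simplified fragment type and free list; `frag_t`, `frag_return`, `get_frag_failed`, and `rget_completion` here are hypothetical stand-ins, not the actual ob1 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an RDMA fragment; not the real ob1 type. */
typedef struct frag {
    bool on_free_list;   /* true once returned to the free list */
    bool queued;         /* true if parked for a later retry */
} frag_t;

static int free_list_returns = 0;

/* Return a fragment to the (imaginary) free list exactly once. */
static void frag_return(frag_t *frag)
{
    assert(!frag->on_free_list);  /* a second return would be a bug */
    frag->on_free_list = true;
    ++free_list_returns;
}

/* Failure handler: takes over the fragment and either queues it for
 * retry or returns it to the free list. */
static void get_frag_failed(frag_t *frag, bool can_retry)
{
    if (can_retry) {
        frag->queued = true;
    } else {
        frag_return(frag);
    }
}

/* Completion callback: returns the fragment only on success, because
 * the failure handler has already disposed of it otherwise. */
static void rget_completion(frag_t *frag, int status, bool can_retry)
{
    if (0 != status) {
        get_frag_failed(frag, can_retry);
        return;
    }
    frag_return(frag);
}
```

With an unconditional `frag_return` at the end of the callback, the failure path would trip the assertion or hand a still-queued fragment back to the free list.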
  2. pml/ob1: fix double increment of the RDMA frag retry counter

    If a put or get operation fails it may later be retried by
    mca_pml_ob1_process_pending_rdma which increments retries on each new attempt.
    There is a flaw in the code where both the put and get failures also increment
    this counter leading to it giving up twice as fast. This commit removes the
    increments on the put and get failures.
    
    Signed-off-by: Nathan Hjelm <[email protected]>
    (cherry picked from commit 27efeb9)
    hjelmn authored and bosilca committed Oct 1, 2024
    40c290e
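The "twice as fast" effect can be reproduced with a small model. This is a sketch under stated assumptions: `MAX_RETRIES`, `rdma_frag_t`, `process_pending`, and `attempts_until_give_up` are hypothetical names, and the limit of 4 is chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RETRIES 4   /* hypothetical limit, for illustration only */

typedef struct rdma_frag {
    int retries;
} rdma_frag_t;

/* Retry path: the only place that should count attempts.
 * Returns false once the fragment has exhausted its retries. */
static bool process_pending(rdma_frag_t *frag)
{
    assert(NULL != frag);
    if (frag->retries >= MAX_RETRIES) {
        return false;
    }
    ++frag->retries;
    return true;
}

/* Count how many retries are attempted when every attempt fails.
 * failure_also_increments models the flaw this commit removes. */
static int attempts_until_give_up(bool failure_also_increments)
{
    rdma_frag_t frag = { 0 };
    int attempts = 0;

    for (;;) {
        /* the put or get operation fails here */
        if (failure_also_increments) {
            ++frag.retries;   /* the extra, unwanted increment */
        }
        if (!process_pending(&frag)) {
            break;
        }
        ++attempts;
    }
    return attempts;
}
```

In this model, counting only in the retry path allows 4 retries, while counting in both places allows only 2.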
  3. pml/ob1: ensure RDMA fragments are released in the get -> send/recv fallback
    
    Under a number of circumstances it may be necessary to abandon an RDMA get in
    ob1. In some cases it falls back to put but it may fall back to using send/recv.
    If that happens then we may either crash or leak RDMA fragments because they
    are still attached to the send request. Debug builds will crash due to a check
    on rdma_frag when they are returned. This CL fixes the flaw by releasing any
    RDMA fragment when scheduling sends.
    
    Signed-off-by: Nathan Hjelm <[email protected]>
    (cherry picked from commit 020e83f)
    hjelmn authored and bosilca committed Oct 1, 2024
    25862a9
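A rough sketch of the release-on-fallback idea follows. The types and functions (`send_request_t`, `frag_alloc`, `frag_release`, `schedule_send_fallback`, `request_return`) are hypothetical simplifications; the real ob1 request and fragment structures carry far more state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real ob1 structures. */
typedef struct rdma_frag {
    int unused;
} rdma_frag_t;

typedef struct send_request {
    rdma_frag_t *rdma_frag;   /* fragment left over from an abandoned get */
} send_request_t;

static int frags_outstanding = 0;

static rdma_frag_t *frag_alloc(void)
{
    rdma_frag_t *frag = calloc(1, sizeof(*frag));
    if (NULL != frag) {
        ++frags_outstanding;
    }
    return frag;
}

static void frag_release(rdma_frag_t *frag)
{
    free(frag);
    --frags_outstanding;
}

/* Falling back to send/recv: drop any fragment still attached to the
 * request before scheduling the sends. */
static void schedule_send_fallback(send_request_t *req)
{
    if (NULL != req->rdma_frag) {
        frag_release(req->rdma_frag);
        req->rdma_frag = NULL;
    }
    /* ... scheduling of the send/recv protocol would go here ... */
}

/* Models the debug-build check performed when a request is returned. */
static void request_return(send_request_t *req)
{
    assert(NULL == req->rdma_frag);
}
```

Without the release in `schedule_send_fallback`, the fragment would leak in an optimized build and trigger the assertion in `request_return` in a debug build.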
  4. btl/uct: fix fetching atomics support for osc/rdma

    The check that requires 32 and 64 bit atomic support is way outdated at this
    point. This commit takes the union of the two instead of the intersection
    because technically it is up to the btl user to determine which of each type
    are supported. This commit also ensures that MCA_BTL_ATOMIC_SUPPORTS_32BIT is
    set when 32-bit atomics are available and ensures that the required
    MCA_BTL_FLAGS_RDMA_REMOTE_COMPLETION flag is set on the btl.
    
    Signed-off-by: Nathan Hjelm <[email protected]>
    (cherry picked from commit 8adf9f5)
    hjelmn authored and bosilca committed Oct 1, 2024
    18abef5
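The difference between the intersection and the union of the two capability sets can be shown with plain bitmasks. The flag names and values below are made up for this sketch and do not match the real `MCA_BTL_ATOMIC_*` definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bits; values are illustrative only. */
enum {
    OP_ADD         = 1u << 0,
    OP_AND         = 1u << 1,
    OP_OR          = 1u << 2,
    OP_XOR         = 1u << 3,
    OP_CSWAP       = 1u << 4,
    SUPPORTS_32BIT = 1u << 8   /* stand-in for a "32-bit atomics" flag */
};

/* Old behavior in this model: advertise only operations available at
 * both widths. */
static uint32_t atomic_flags_intersection(uint32_t ops32, uint32_t ops64)
{
    return ops32 & ops64;
}

/* New behavior in this model: advertise everything available at either
 * width and flag 32-bit support when any 32-bit operation exists. */
static uint32_t atomic_flags_union(uint32_t ops32, uint32_t ops64)
{
    /* operation masks must not collide with the width flag */
    assert(0 == ((ops32 | ops64) & SUPPORTS_32BIT));

    uint32_t flags = ops32 | ops64;
    if (0 != ops32) {
        flags |= SUPPORTS_32BIT;
    }
    return flags;
}
```

With 32-bit support limited to add and compare-and-swap and 64-bit support covering all five operations, the intersection hides and/or/xor entirely, whereas the union advertises them and leaves the per-width decision to the BTL user.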
  5. Cleanup unused code.

    Signed-off-by: George Bosilca <[email protected]>
    (cherry picked from commit 58400ad)
    bosilca committed Oct 1, 2024
    3d05ccc
  6. Allow for packing less data than expected.

    Return the updated amount to allow the upper level to gracefully handle
    the case.
    
    Signed-off-by: George Bosilca <[email protected]>
    (cherry picked from commit 27514c2)
    bosilca committed Oct 1, 2024
    8689370
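The commit message does not say which function is affected, so the following is only a generic sketch of the in/out size pattern it describes. `pack_bytes` and its parameters are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pack routine: copies at most *size bytes into dst and
 * writes the number of bytes actually packed back through size. */
static int pack_bytes(const char *src, size_t src_remaining,
                      char *dst, size_t *size)
{
    assert(NULL != size);

    size_t packed = (*size < src_remaining) ? *size : src_remaining;
    memcpy(dst, src, packed);

    *size = packed;   /* report the updated amount to the caller */
    return 0;
}
```

A caller that asked for 8 bytes but finds `*size` set to 3 afterwards can shrink its fragment length accordingly instead of treating the short pack as an error.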
  7. Return faster if nothing to do.

    Signed-off-by: George Bosilca <[email protected]>
    (cherry picked from commit f29109a)
    bosilca committed Oct 1, 2024
    15cd497