Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/IB/MLX5/RC: perf tuning - increase dc latency estimation when AR is enabled #10415

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

roiedanino
Copy link
Contributor

What?

Decrease latency estimation for RC

Why?

Currently, DC is selected too early, for a relatively small number of eps (~60), after some research it seems DC should be selected only when reaching 180 ~ 190 eps.

@roiedanino roiedanino requested a review from yosefe January 12, 2025 13:42
@roiedanino roiedanino self-assigned this Jan 12, 2025
@@ -188,7 +188,7 @@ static ucs_status_t uct_rc_mlx5_iface_query(uct_iface_h tl_iface, uct_iface_attr
uct_rc_mlx5_iface_common_query(&rc_iface->super, iface_attr, max_am_inline,
UCT_RC_MLX5_TM_EAGER_ZCOPY_MAX_IOV(0));
iface_attr->cap.flags |= UCT_IFACE_FLAG_EP_CHECK;
iface_attr->latency.m += 1e-9; /* 1 ns per each extra QP */
iface_attr->latency.m += 32e-11; /* 0.32 ns per each extra QP */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make it configurable?

@roiedanino roiedanino requested a review from brminich January 13, 2025 15:46
@@ -140,6 +140,12 @@ ucs_config_field_t uct_dc_mlx5_iface_config_sub_table[] = {
ucs_offsetof(uct_dc_mlx5_iface_config_t, dcis_initial_capacity),
UCS_CONFIG_TYPE_UINT},

{"FULL_HANDSHAKE_ADDED_LATENCY", "110ns",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we double (or even 3x) the latency since there is another round trip?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it adds 2.7us (which is ~x30) but I tried to set a number that makes UCX select DC only when it improves latency and bandwidth compared to RC (BTW setting bigger values eventually makes UCX select UD over DC unless we increase estimated UD latency)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean double the current latency estimation. Since FHS adds a round trip for connection establishment.
Does UD really have a better performance than DC FHS?

Copy link
Contributor Author

@roiedanino roiedanino Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, UD Latency is much better than DC FHS, 1.62us vs 4.27us in small sizes

@@ -242,6 +248,11 @@ static ucs_status_t uct_dc_mlx5_iface_query(uct_iface_h tl_iface, uct_iface_attr
sizeof(uct_dc_mlx5_iface_addr_t);
iface_attr->latency.c += 60e-9; /* connect packet + cqe */

/* Full handshake is used when AR is enabled */
if (iface->super.config.dp_ordering == UCT_IB_MLX5_DP_ORDERING_OOO_RW) {
iface_attr->latency.c += ucs_time_to_sec(iface->tx.fhs_added_latency);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do it also when user forces full handshake or using SL with adaptive routing.
Maybe we could also query if FHS is enabled in HW on a given DCI, from DevX?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that mean creating a dummy dci?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm right, so better avoid it

@brminich
Copy link
Contributor

looks like the content of this PR does not correpond to the title. Can you pls update it?

@roiedanino roiedanino changed the title UCT/IB/MLX5/RC: perf tuning - decrease rc latency estimation UCT/IB/MLX5/RC: perf tuning - increase dc latency estimation when AR is enabled Jan 14, 2025
@roiedanino roiedanino requested a review from yosefe January 14, 2025 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants