Optimize euclidean distance in host refine phase #689

Open
wants to merge 5 commits into
base: branch-25.04


@anstellaire anstellaire commented Feb 13, 2025

Issue

The original code (below) compiled to serial assembly and used the strictly-ordered fadda instruction on ARM with both gcc and clang, which resulted in suboptimal performance.

for (size_t k = 0; k < dim; k++) {
  distance += DC::template eval<DistanceT>(query[k], row[k]);
}
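
For context on why the compiler keeps this loop serial: floating-point addition is not associative, so unless reordering is explicitly allowed (for example with -ffast-math or -fassociative-math in gcc and clang) the terms must be added in source order. A minimal sketch of the effect follows; the helper names are hypothetical and not part of the PR.

```cpp
#include <cstddef>

// Hypothetical helper: adds in strict source order, like the original loop.
inline float sum_left_to_right(float const* x, std::size_t n) {
  float s = 0.0f;
  for (std::size_t k = 0; k < n; ++k) {
    s += x[k];
  }
  return s;
}

// Hypothetical helper: even and odd elements go to separate accumulators,
// which changes the order in which the additions are performed.
inline float sum_two_accumulators(float const* x, std::size_t n) {
  float acc[2] = {0.0f, 0.0f};
  for (std::size_t k = 0; k < n; ++k) {
    acc[k % 2] += x[k];
  }
  return acc[0] + acc[1];
}
```

With IEEE-754 single precision and no fast-math flags, the input {1e8f, 1.0f, -1e8f, 1.0f} sums to 1 in strict order and to 2 with two accumulators, so the compiler is not allowed to make this transformation on its own.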

Proposed solution

This PR provides a Euclidean distance implementation optimized with partial vector sums (below), which helps vectorization but gives up strictly-ordered summation.

template <typename DC, typename DistanceT, typename DataT>
DistanceT euclidean_distance_squared_generic(DataT const* a, DataT const* b, size_t n) {
  size_t constexpr max_vreg_len = 512 / (8 * sizeof(DistanceT));

  // max_vreg_len is a power of two, so clearing the low bits rounds n down to a multiple of it
  size_t n_rounded = n & ~(max_vreg_len - 1);
  DistanceT distance[max_vreg_len] = {0};

  for (size_t i = 0; i < n_rounded; i += max_vreg_len) {
    for (size_t j = 0; j < max_vreg_len; ++j) {
      distance[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }

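  // Tail: fewer than max_vreg_len elements remain, so index the accumulators relative to n_rounded
  for (size_t i = n_rounded; i < n; ++i) {
    distance[i - n_rounded] += DC::template eval<DistanceT>(a[i], b[i]);
  }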

  for (size_t i = 1; i < max_vreg_len; ++i) {
    distance[0] += distance[i];
  }

  return distance[0];
}
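
To try the partial-sum idea outside the library, here is a self-contained sketch. The functor squared_diff_sketch is a hypothetical stand-in for the DC policy (its real definition is not shown in this PR description), and all function names are made up for illustration. The sketch indexes the tail accumulators relative to n_rounded so that every write stays inside the accumulator array.

```cpp
#include <cstddef>

// Hypothetical stand-in for the DC policy: squared difference of two elements.
struct squared_diff_sketch {
  template <typename DistanceT, typename DataT>
  static DistanceT eval(DataT a, DataT b) {
    DistanceT d = static_cast<DistanceT>(a) - static_cast<DistanceT>(b);
    return d * d;
  }
};

// Serial reference, mirroring the original strictly-ordered loop.
template <typename DC, typename DistanceT, typename DataT>
DistanceT distance_serial_sketch(DataT const* a, DataT const* b, std::size_t n) {
  DistanceT distance = 0;
  for (std::size_t k = 0; k < n; ++k) {
    distance += DC::template eval<DistanceT>(a[k], b[k]);
  }
  return distance;
}

// Partial-sum variant: one independent accumulator per lane of a 512-bit register.
template <typename DC, typename DistanceT, typename DataT>
DistanceT distance_partial_sums_sketch(DataT const* a, DataT const* b, std::size_t n) {
  constexpr std::size_t lanes = 512 / (8 * sizeof(DistanceT));  // 16 for float
  std::size_t const n_rounded = n & ~(lanes - 1);  // lanes is a power of two
  DistanceT acc[lanes] = {0};

  // Main loop: the inner loop has no cross-iteration dependency, so it can vectorize.
  for (std::size_t i = 0; i < n_rounded; i += lanes) {
    for (std::size_t j = 0; j < lanes; ++j) {
      acc[j] += DC::template eval<DistanceT>(a[i + j], b[i + j]);
    }
  }
  // Tail: fewer than `lanes` elements remain; index relative to n_rounded.
  for (std::size_t i = n_rounded; i < n; ++i) {
    acc[i - n_rounded] += DC::template eval<DistanceT>(a[i], b[i]);
  }
  // Final reduction of the per-lane partial sums.
  DistanceT total = 0;
  for (std::size_t j = 0; j < lanes; ++j) {
    total += acc[j];
  }
  return total;
}
```

Both functions can be called as distance_partial_sums_sketch<squared_diff_sketch, float>(a, b, n). On inputs whose partial sums are exactly representable (small integers, for example) the two variants agree; in general they may differ in the last bits because the summation order differs.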

In addition, the PR includes an implementation with NEON intrinsics, which provides a further speedup on certain test cases (it can be removed if arch-specific code is undesired).
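
The NEON code itself is not shown in this description, so the following is only an illustration of the general technique for float inputs, not the PR's implementation. The function name is hypothetical; the intrinsics used (vdupq_n_f32, vld1q_f32, vsubq_f32, vfmaq_f32, vaddvq_f32) come from arm_neon.h on AArch64, and the sketch falls back to a scalar loop on other targets.

```cpp
#include <cstddef>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical name: squared L2 distance between two float arrays of length n.
inline float l2_squared_neon_sketch(float const* a, float const* b, std::size_t n) {
  std::size_t i = 0;
  float total = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
  float32x4_t acc = vdupq_n_f32(0.0f);  // four independent partial sums
  for (; i + 4 <= n; i += 4) {
    float32x4_t d = vsubq_f32(vld1q_f32(a + i), vld1q_f32(b + i));
    acc = vfmaq_f32(acc, d, d);  // acc += d * d, per lane
  }
  total = vaddvq_f32(acc);  // horizontal add of the four lanes
#endif
  // Scalar tail (and the whole loop on non-AArch64 targets).
  for (; i < n; ++i) {
    float d = a[i] - b[i];
    total += d * d;
  }
  return total;
}
```

A production version would likely use several vector accumulators to hide FMA latency; a single accumulator keeps the sketch short.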

Results

image

@anstellaire anstellaire requested a review from a team as a code owner February 13, 2025 13:02

copy-pr-bot bot commented Feb 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp label Feb 13, 2025
@cjnolet cjnolet added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Feb 13, 2025
@cjnolet
Member

cjnolet commented Feb 13, 2025

/ok to test

@anstellaire
Author

anstellaire commented Feb 14, 2025

/ok to test

UPD:
@cjnolet, it seems CI can only be triggered by repository members. Could you please trigger it once more?
I fixed the formatting with clang-format.

@lowener
Contributor

lowener commented Feb 19, 2025

/ok to test

Labels
cpp, improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
Projects
Status: In Progress
3 participants