
cuda: update interface to take 64-bit M #411

Merged (6 commits) on Jun 25, 2024

Conversation

janden
Collaborator

@janden janden commented Dec 27, 2023

For compatibility with FINUFFT.

This is quite an invasive patch and I'm not sure it's doing the right thing in several places (in particular, I'm not certain that CUDA is happy taking int64_t for block dimensions, etc.). That being said, it is compiling and tests are passing, so that's something.

@ahbarnett
Collaborator

This breaks the cufinufft interface, so is the plan to go into 2.2.1? 2.3?

@janden
Collaborator Author

janden commented Dec 27, 2023

> This breaks the cufinufft interface, so is the plan to go into 2.2.1? 2.3?

Right. The plan was to get this in before the 2.2.0 release (see discussion in #255). But perhaps that is too late now?

In terms of breaking the interface, it's quite a benign breakage (so we may not need to wait until version 3.0). From what I understand, it will break the ABI (since we have int64_t where we previously had int), but code will still compile as expected (compilers will silently widen int to int64_t where necessary). That being said, I'm no C guru, so someone tell me if I'm wrong here.

@ahbarnett
Collaborator

ahbarnett commented Dec 27, 2023

Thanks. Ah, I just read your response - hoping you're in USA not Sweden :).
Could you include the docs and CHANGELOG changes that go along with this?
Maybe this could be a 2.2.1, but there's no need for it pre-Lanczos. Stability, not matching the FINUFFT interface, is key for that...

@janden
Collaborator Author

janden commented Dec 27, 2023

> Thanks. Ah, I just read your response - hoping you're in USA not Sweden :).

Heh. In Sweden actually… This is prime work time!

> Could you include the docs and CHANGELOG changes that go along with this?

Done.

> Maybe this could be a 2.2.1, but there's no need for it pre-Lanczos. Stability, not matching the FINUFFT interface, is key for that...

Yeah I'm thinking we wait on this. Would like to get @blackwer's eyes on this before I do anything.

@blackwer
Member

I don't see how this could cause any problems, so it looks good to me. Since it's not a pointer type and we're increasing the size, it won't cause overflows. I falsely said in another post that this had to do with cufft having an int32 size limit, but this is the number of non-uniform points, so it's not constrained by the FFT like the number of modes. Unless I'm missing something...

It is technically an API change, so semantic versioning would probably tell us to wait until 2.3 for this if it doesn't make it into 2.2. That said, given the automatic type casting, I'm inclined to ignore that since it really shouldn't ever break anything (unless someone is linking 2.2 while using 2.2.1 headers or decided to make their own bindings). I personally vote we just go ahead and merge it, but I understand the reticence there.

@janden
Collaborator Author

janden commented Dec 28, 2023

> this is the number of non-uniform points, so it's not constrained by the fft like the number of modes.

Right. That was my thinking too.

Still, I'm a bit concerned about how this interacts with dim3 and kernel launches where we assign M to variables that should be 32-bit, like here:

threadsPerBlock.x = 16;
threadsPerBlock.y = 1;
blocks.x = (M + threadsPerBlock.x - 1) / threadsPerBlock.x;
blocks.y = 1;
if (d_plan->opts.gpu_kerevalmeth) {
    for (int t = 0; t < blksize; t++) {
        spread_2d_nupts_driven<T, 1>
            <<<blocks, threadsPerBlock, 0, stream>>>(d_kx, d_ky, d_c + t * M, d_fw + t * nf1 * nf2, M, ns,
                                                     nf1, nf2, es_c, es_beta, sigma, d_idxnupts, pirange);
        RETURN_IF_CUDA_ERROR
    }
}

From what I understand, dim3 is a tuple of three unsigned ints, so the compiler should complain here, but it doesn't. Is it silently truncating? If so, we'd expect problems once M exceeds about 16 * 2^32 (at which point blocks.x overflows). Will run some tests to find out (we should be unit testing these large-M problems anyhow).

It is technically an API change so semantic versioning would probably tell us to wait until 2.3 for this if it doesn't make it into 2.2. That said, given the automatic type casting I'm inclined to ignore that since it really shouldn't ever break anything (unless someone is linking 2.2 while using 2.2.1 headers or decided to make their own bindings). I personally vote we just go ahead and merge it but I understand the reticence there

Would tend to agree, but we can discuss later. As I said elsewhere, no need to rush this.

@ahbarnett
Collaborator

ahbarnett commented Apr 23, 2024

Can you add a 1d test with M>2.2e9, which becomes part of the CI for cufinufft? (at 1e9 NU pt/sec, it should take only 3 secs!).

Thanks! Alex

(Tip: keep the cufft size below the int32 limit.)
For the analogous CPU tester, see: perftest/big2d2f.cpp

@janden
Collaborator Author

janden commented May 9, 2024

I've been working on testing this yesterday and today. Some notes:

  • The code as written doesn't work due to the (silent) overflows when assigning int64_t to int in a few places. This can be fixed, but requires more work.
  • The set of possible problem sizes is actually quite limited on current hardware due to the limit on GPU memory. So far, I'm able to run 4e9 points in single precision without sorting. On the GPU, this takes up approximately 4e9 * 4 bytes for the locations, 4e9 * 8 bytes for the strengths and 4e9 * 4 bytes for idxnupts, which comes out to 60 GB. (Actually, the last one should be 4e9 * 8 so it can properly index the points, so the total comes out to 75 GB.)
  • Running the tests is actually very slow (more than 3 seconds). The vast majority of the time is taken up not by the cufinufft call but by generating the data and checking the accuracy of the output (on the CPU). This can of course be sped up using multithreading.
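The memory-footprint arithmetic in the second bullet can be checked directly; a sketch, assuming float32 coordinates (1D), complex64 strengths, and int32 vs. int64 indices:

```python
M = 4e9            # number of nonuniform points
GiB = 2 ** 30

coords    = M * 4  # x locations, float32 (1D test)
strengths = M * 8  # complex64 strengths
idx32     = M * 4  # idxnupts as int32
idx64     = M * 8  # idxnupts widened to int64 to index all 4e9 points

print(f"{(coords + strengths + idx32) / GiB:.1f} GiB")  # 59.6 GiB, i.e. the ~60 GB figure
print(f"{(coords + strengths + idx64) / GiB:.1f} GiB")  # 74.5 GiB, i.e. the ~75 GB figure
```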

So I guess the question here is whether it's worth the effort (do we foresee GPU memory going into triple or quadruple digits of GB anytime soon?) to finish the conversion. It is doable, of course, but would require some time and increase the complexity of the code.

@ahbarnett
Collaborator

ahbarnett commented May 9, 2024 via email

@janden
Collaborator Author

janden commented May 31, 2024

Following discussion on Tuesday, we're scrapping the effort to transition the internals to handle 64-bit integers for the number of points. This is such a limited use case that it's not worth it at this point (may be relevant once we start having GPUs with more than 80 GB of memory).

As a result, we're going to do the same thing as for the number of modes. That is, we allow 64-bit integers but check whether they fit into 32-bit integers; if so, we cast (otherwise, we error).

janden and others added 3 commits June 11, 2024 08:51
While the interface changes, we still won't allow more than 2e9
points (for now), since this would complicate the code quite a bit with
little tangible benefit (transforms of this size are currently out of
range for most GPUs in terms of memory consumption).
@janden janden marked this pull request as ready for review June 11, 2024 07:26
@janden
Collaborator Author

janden commented Jun 11, 2024

Had to update Jenkins slightly to install dependencies without cache (pip install --no-cache-dir) to get rid of this error:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 a884af6cb594e89b0669964b47e1ae6fb13ae6d8c580db3426d08ee2ca84ecd4
             Got        d738f15fb0b4bacd71f10fe8ccbb537cdd1c05c0c018b0dc4dcbd41afbc41fcb

Not sure why this worked (a fresh Docker image should have no cache, correct?), but it got rid of the error message.

@ahbarnett
Collaborator

@janden do you need reviews from people on this before we bring it in? You'll have to remind them re the final decision:
#411 (comment)

@janden
Collaborator Author

janden commented Jun 25, 2024

> @janden do you need reviews from people on this before we bring it in? You'll have to remind them re the final decision: #411 (comment)

Yes I'll tag @blackwer.

@janden janden requested a review from blackwer June 25, 2024 16:53
Member

@blackwer blackwer left a comment


Looks good to me once the typo is fixed.

@blackwer blackwer merged commit 000f8dc into flatironinstitute:master Jun 25, 2024
9 checks passed