Inconsistent gdr read latency #243

Open
anaanimous opened this issue Jan 18, 2023 · 3 comments

@anaanimous

I have a memory allocator based on gdr which initially allocates a large chunk of gdr-mapped memory (e.g. 16 MB) and then hands out pieces of this chunk to subsequent memory requests. During performance benchmarking, I noticed that the read latency for the same memory size fluctuates quite significantly, and I can't understand why. For example, if I allocate 3 KB, read it 100 times, and repeat the same procedure over and over, the average read time alternates between 4.5 us and 70 us (i.e. 4.5 -> 70 -> 4.5 -> 70 ...), even though the same piece of memory is allocated for every batch of 100 reads.
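A minimal sketch of this allocation pattern (using gdrcopy's gdr_pin_buffer/gdr_map API; my actual allocator is more involved, and error handling is omitted):

```cpp
#include <cuda.h>
#include <gdrapi.h>

// Sketch only: pin and map one large chunk once, then hand out offsets into
// it for later allocation requests. Error checking omitted for brevity.
#define CHUNK_SIZE (16UL * 1024 * 1024)   // 16 MB chunk, as described above

int main()
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr d_chunk;
    cuMemAlloc(&d_chunk, CHUNK_SIZE);

    gdr_t g = gdr_open();
    gdr_mh_t mh;
    gdr_pin_buffer(g, d_chunk, CHUNK_SIZE, 0, 0, &mh);

    void *bar_ptr = nullptr;
    gdr_map(g, mh, &bar_ptr, CHUNK_SIZE);

    // A sub-allocation is just an offset into the mapped chunk, e.g. a 3 KB
    // piece handed out by the allocator.
    size_t offset = 0;                          // chosen by the allocator
    void *sub_map = (char *)bar_ptr + offset;   // CPU-visible BAR pointer
    CUdeviceptr sub_dev = d_chunk + offset;     // device address of the same piece
    (void)sub_map; (void)sub_dev;

    gdr_unmap(g, mh, bar_ptr, CHUNK_SIZE);
    gdr_unpin_buffer(g, mh);
    gdr_close(g);
    cuMemFree(d_chunk);
    cuCtxDestroy(ctx);
    return 0;
}
```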

Here are some details regarding my settings:

  • I have set CU_POINTER_ATTRIBUTE_SYNC_MEMOPS to 1 for the entire chunk using cuPointerSetAttribute (see the timing sketch after this list).
  • The API reports "using SSE4_1 implementation of gdr_copy_from_bar" for the read operation.
  • The data is read into page-locked host memory allocated using cudaMallocHost.
  • Both the gdr-mapped source pointer and the destination pointer are 128-bit aligned.
  • The write latency is quite consistent.
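
The per-allocation timing loop is essentially the following sketch (names follow the sketch above; the SYNC_MEMOPS setting from the list is shown only as a comment because it is applied once to the whole chunk):

```cpp
#include <cuda_runtime.h>
#include <gdrapi.h>
#include <chrono>
#include <cstdio>

// Sketch only: read the same sub-allocation 100 times from the BAR mapping
// into page-locked host memory and report the average latency.
void time_reads(gdr_mh_t mh, const void *sub_map, size_t size /* e.g. 3 * 1024 */)
{
    // SYNC_MEMOPS was set once for the entire chunk, e.g.:
    //   unsigned int one = 1;
    //   cuPointerSetAttribute(&one, CU_POINTER_ATTRIBUTE_SYNC_MEMOPS, d_chunk);

    void *h_dst = nullptr;
    cudaMallocHost(&h_dst, size);                 // page-locked destination

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i)
        gdr_copy_from_mapping(mh, h_dst, sub_map, size);
    auto t1 = std::chrono::steady_clock::now();

    double avg_us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / 100.0;
    std::printf("avg read latency for %zu bytes: %.2f us\n", size, avg_us);

    cudaFreeHost(h_dst);
}
```
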
@pakmarkthub
Collaborator

Hi @anaanimous,

CPU and GPU clocks are usually (but not always) the main cause of performance fluctuation. Can you try the items below and rerun your test?

  1. Fix the CPU clock, or at least set the CPU frequency governor to "performance": `sudo cpupower frequency-set -g performance`.
  2. Please also set the GPU clocks to max.
# To view the max clock values of GPU 0
$ nvidia-smi -i 0 -q
...
    Max Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1593 MHz
        Video                             : 1290 MHz
...

# To set the clocks of GPU 0 to max
$ sudo nvidia-smi -i 0 -ac 1593,1410

@anaanimous
Author

Thank you for the quick response.

I have been running the benchmark on an AWS instance. However, when I ran the same program on a local server today, the performance was stable. I don't know what's causing the fluctuation on AWS. It may be due to frequency scaling, but I doubt it. Here is why:

I have a benchmark program where I measure the read latency for different sizes (1, 2, 4, ..., 1 MB), similar to the copylat program, except that I use my allocator and its APIs to allocate memory and perform the reads. If I run this program on a local server, the performance nicely matches that of copylat. But on AWS, the read latency for 512 bytes and above suddenly increases significantly (e.g. the latency of reading 512 bytes goes from 1.5 us to 12 us). Strangely enough, if I skip only the one-byte read (i.e. if I perform the test for 2, 4, ..., 1 MB instead of 1, 2, ..., 1 MB), the numbers match the copylat output.
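
A simplified sketch of the size sweep (my real code goes through my allocator's APIs; it assumes an already gdr-mapped region `bar_ptr` and a pinned host buffer `h_dst`, each at least 1 MB):

```cpp
#include <gdrapi.h>
#include <chrono>
#include <cstdio>

// Sketch only: time reads of 1 B, 2 B, 4 B, ..., 1 MB from the BAR mapping
// into a page-locked host buffer and print the average per-copy latency.
void sweep(gdr_mh_t mh, const void *bar_ptr, void *h_dst, int iters = 100)
{
    for (size_t sz = 1; sz <= (1UL << 20); sz *= 2) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i)
            gdr_copy_from_mapping(mh, h_dst, bar_ptr, sz);
        auto t1 = std::chrono::steady_clock::now();

        double us =
            std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
        std::printf("%8zu B : %7.2f us\n", sz, us);
    }
}
```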

@pakmarkthub
Collaborator

Let's split this into two topics. The first is the performance fluctuation, which seems to be resolved now. Depending on how your instance is allocated, you might be sharing the host with other instances. I cannot say much about performance predictability if you are not in full control of the entire system; there are many external factors that can affect performance.

The second topic is the read latency jumping to 12 us when reading 512 bytes. Can you share the code? I will try to reproduce this behavior on our system.
