[Question] Amount of lag expected for metrics #170

Open
jaywonchung opened this issue May 13, 2024 · 2 comments

@jaywonchung

jaywonchung commented May 13, 2024

I'm drawing a timeline of DCGM metrics (gathered with DcgmReader at a 10 ms update interval) alongside Python application-level metrics, such as the number of running requests at each moment. DCGM metrics carry their own microsecond timestamps, and I timestamped the application-level metrics with time.time_ns() // 1000.

At the start of the application, the SM activity metric (which I'm reading as "at least one kernel is running on the GPU") becomes non-zero roughly 300 ms after the first batch of computations is dispatched to the GPU. That delay is far too long to be kernel launch overhead or Python interpreter overhead. I don't think it is cache misses either, since the DRAM activity metric rises at the same moment SM activity does.
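For reference, the lag described above can be computed along these lines. This is only a sketch: the function name `first_nonzero_lag_us`, the `(timestamp_us, value)` sample layout, and the synthetic numbers are illustrative assumptions, not DcgmReader's actual output format. It also assumes each sample is a point-in-time reading, so sampling granularity alone can explain at most one update interval of delay.

```python
from bisect import bisect_left


def first_nonzero_lag_us(samples, dispatch_ts_us, sampling_interval_us):
    """Return (observed_lag_us, unexplained_lag_us), or None if no activity.

    samples: list of (timestamp_us, value) tuples sorted by timestamp.
             This layout is hypothetical; adapt it to however the DCGM
             samples were actually recorded.
    dispatch_ts_us: application-side timestamp, e.g. time.time_ns() // 1000.
    sampling_interval_us: metric update interval (10 ms -> 10_000 us).
    """
    timestamps = [ts for ts, _ in samples]
    # Skip samples taken before the work was dispatched.
    start = bisect_left(timestamps, dispatch_ts_us)
    for ts, value in samples[start:]:
        if value > 0:
            observed = ts - dispatch_ts_us
            # Activity could have begun any time since the previous sample,
            # so up to one interval of the lag is a sampling artifact.
            unexplained = max(0, observed - sampling_interval_us)
            return observed, unexplained
    return None


# Synthetic data for illustration only (not measured values).
samples = [(0, 0.0), (10_000, 0.0), (20_000, 0.0), (310_000, 0.4)]
print(first_nonzero_lag_us(samples, dispatch_ts_us=5_000,
                           sampling_interval_us=10_000))
```

Both clocks have to share the same epoch for the subtraction to be meaningful; whether DCGM timestamps and `time.time_ns()` do so on a given system is worth checking separately.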

  1. In general, how long of a lag should I expect for DCGM metrics?
  2. Can different metrics have different amounts of lag, if any? In particular, can the lag differ between the DCGM_FI_DEV_* group and the DCGM_FI_PROF_* group?
@george-kuanli-peng

I also observe a lag of about 40 seconds in the metrics reported by DCGM-Exporter. I do not see such a long lag in the metrics reported by cAdvisor on the same platform, though.

@george-kuanli-peng

I found that the lag can be reduced by lowering the metric collection interval ($DCGM_EXPORTER_INTERVAL, or -c). The default is 30000 ms.

https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html
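To see why the collection interval matters so much, here is a rough back-of-the-envelope model. It assumes the pipeline is a chain of independent pollers (the exporter polling DCGM, then a scraper polling the exporter) and ignores processing time. The helper `worst_case_staleness_ms` and the 15000 ms scrape interval are hypothetical and used only for illustration; only the 30000 ms default comes from this thread.

```python
def worst_case_staleness_ms(*polling_intervals_ms):
    """Upper bound on how old a value can be when the last stage reads it.

    Each polling stage may hold data that is up to one full interval old
    when the next stage reads it, so the worst cases add up.
    """
    return sum(polling_intervals_ms)


exporter_interval_ms = 30_000  # default collection interval cited above
scrape_interval_ms = 15_000    # hypothetical scraper setting, not from the thread

print(worst_case_staleness_ms(exporter_interval_ms, scrape_interval_ms))
# Lowering the exporter interval shrinks the bound accordingly.
print(worst_case_staleness_ms(1_000, scrape_interval_ms))
```

Under this model, a lag of tens of seconds is about what the default interval would predict, and reducing the interval should cut it roughly in proportion.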
