[Question] Amount of lag expected for metrics #170

jaywonchung · 2024-05-13T00:24:41Z

I'm drawing a timeline of DCGM metrics (gathered with DcgmReader at a 10 ms update interval) alongside Python application-level metrics, such as the number of running requests at each moment. DCGM metrics carry their own microsecond timestamps, and I collected the timestamps of the application-level metrics with time.time_ns() // 1000.

At the start of the application, the SM activity metric (which I'm taking to mean "at least one kernel is running on the GPU") becomes non-zero roughly 300 ms after the first batch of computations is dispatched to the GPU. That delay is far too long to be explained by kernel launch overhead or Python interpreter overhead. I don't think it is cache misses either, since the DRAM activity metric rises at the same moment as SM activity.
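For concreteness, here is a minimal sketch of how a lag like this can be computed once both timelines are in microseconds. It is not the actual measurement script: `now_us`, `first_nonzero_lag_us`, and the sample data are hypothetical, and the samples are assumed to be `(timestamp_us, value)` pairs sorted by timestamp.

```python
import time


def now_us() -> int:
    """Wall-clock time in microseconds, the same unit as the DCGM timestamps."""
    return time.time_ns() // 1000


def first_nonzero_lag_us(dispatch_ts_us, samples):
    """Delay between a dispatch and the first non-zero sample at or after it.

    samples: iterable of (timestamp_us, value) pairs, sorted by timestamp.
    Returns None if no non-zero sample follows the dispatch.
    """
    for ts_us, value in samples:
        if ts_us >= dispatch_ts_us and value > 0:
            return ts_us - dispatch_ts_us
    return None


# Hypothetical data: one sample every 10 ms, activity turning non-zero
# at the 30th sample after the dispatch.
dispatch = 1_000_000
samples = [(dispatch + i * 10_000, 0.0 if i < 30 else 0.8) for i in range(50)]

lag_us = first_nonzero_lag_us(dispatch, samples)
print(f"lag: {lag_us / 1000:.1f} ms")
```

With a 10 ms update interval, a lag measured this way is only resolved to about one interval, so a difference of a few milliseconds between metrics would not be distinguishable.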

In general, how much lag should I expect for DCGM metrics?
Can different metrics have different amounts of lag, if any? More broadly, can the lag differ between the DCGM_FI_DEV_* group and the DCGM_FI_PROF_* group?


george-kuanli-peng · 2024-06-12T09:41:02Z

I also observe a lag of about 40 seconds in the metrics reported by DCGM-Exporter. I do not observe such a long lag in the metrics reported by cAdvisor on the same platform, though.

george-kuanli-peng · 2024-07-03T07:25:06Z

I found that the lag can be reduced by lowering the metric collection interval ($DCGM_EXPORTER_INTERVAL, or -c). The default is 30000 ms.

https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html
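A rough way to reason about why the interval matters, as a back-of-the-envelope sketch rather than documented DCGM-Exporter behavior: assume the exporter refreshes its values once per collection interval and a separate scraper polls on its own fixed schedule. Under that assumption, the newest value visible to a dashboard can be up to roughly the sum of the two intervals old. The function name and the 15 s scrape interval below are hypothetical.

```python
def worst_case_staleness_ms(collect_interval_ms: int, scrape_interval_ms: int) -> int:
    """Upper bound on the age of the latest scraped value, in milliseconds.

    Assumes a value may already be one full collection interval old when it
    is scraped, and then stays the latest value until the next scrape.
    """
    return collect_interval_ms + scrape_interval_ms


# Default collection interval (30000 ms) with a hypothetical 15 s scrape interval.
print(worst_case_staleness_ms(30_000, 15_000))

# A shorter collection interval with the same scrape interval.
print(worst_case_staleness_ms(1_000, 15_000))
```

Under these assumptions, the scrape interval becomes the dominant term once the collection interval is lowered.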
