I'm drawing a timeline of DCGM metrics (gathered with DcgmReader at a 10 ms update interval) together with Python application-level metrics, such as the number of running requests at each moment. DCGM metrics carry their own microsecond timestamps, and I recorded the timestamps of the application-level metrics with time.time_ns() // 1000.
At the start of the application, the SM activity metric (which I interpret as "at least one kernel is running on the GPU") becomes non-zero roughly 300 ms after the first batch of computations is dispatched to the GPU. That delay is far too long to be explained by kernel launch overhead or Python interpreter overhead. I don't think it is caused by cache misses either, since the DRAM activity metric rises at the same moment as SM activity.
In general, how long of a lag should I expect for DCGM metrics?
Can different metrics have different amounts of lag, if any? In particular, can the lag differ between the DCGM_FI_DEV_* group and the DCGM_FI_PROF_* group?
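To make the measurement concrete, here is a minimal sketch of how such a lag could be computed from the two timestamp streams. It assumes the metric samples have already been collected into a list of `(timestamp_us, value)` pairs sorted by time; that layout, the function name `first_nonzero_lag_us`, and the synthetic data are illustrative assumptions, not the DcgmReader output format.

```python
import time
from bisect import bisect_left


def now_us() -> int:
    """Wall-clock time in microseconds, as used for the application-level metrics."""
    return time.time_ns() // 1000


def first_nonzero_lag_us(samples, dispatch_us):
    """Return the delay (in microseconds) between dispatch_us and the first
    sample at or after it whose value is non-zero, or None if there is none.

    samples: list of (timestamp_us, value) pairs sorted by timestamp
             (hypothetical layout, not a DCGM data structure).
    """
    timestamps = [ts for ts, _ in samples]
    start = bisect_left(timestamps, dispatch_us)  # skip samples before dispatch
    for ts, value in samples[start:]:
        if value > 0:
            return ts - dispatch_us
    return None


# Synthetic data: one sample every 10 ms, metric stays at zero for 300 ms.
dispatch_us = 1_000_000
samples = [
    (dispatch_us + i * 10_000, 0.0 if i < 30 else 0.5)
    for i in range(50)
]
print(first_nonzero_lag_us(samples, dispatch_us))
```

With a 10 ms update interval, a lag measured this way can only be resolved to within about one interval, so a 300 ms gap is well outside the sampling granularity.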
I also observe a lag of about 40 seconds in the metrics reported by DCGM-Exporter. I do not observe such a long lag in the metrics reported by cAdvisor on the same platform, though.