[BUG] Windows Server Unexpectedly Shuts Down When Using Nvitop to Monitor GPU Usage #136

@NI-MingCheng

Description

Required prerequisites

  • I have read the documentation https://nvitop.readthedocs.io.
  • I have searched the Issue Tracker that this hasn't already been reported. (comment there if it has.)
  • I have tried the latest version of nvitop in a new isolated virtual environment.

What version of nvitop are you using?

1.3.2

Operating system and version

Windows Server 2022 Datacenter

NVIDIA driver version

516.01

NVIDIA-SMI

Wed Oct 23 15:33:05 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.01       Driver Version: 516.01       CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000   WDDM  | 00000000:02:00.0 Off |                  Off |
| 30%   38C    P8    17W / 230W |    464MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000   WDDM  | 00000000:21:00.0 Off |                  Off |
| 30%   34C    P8     6W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000   WDDM  | 00000000:49:00.0 Off |                  Off |
| 30%   30C    P8     5W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000   WDDM  | 00000000:4A:00.0 Off |                  Off |
| 30%   33C    P8     4W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9436    C+G   ...lPanel\SystemSettings.exe    N/A      |
|    0   N/A  N/A     12172    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     15244    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     17912    C+G   ...2txyewy\TextInputHost.exe    N/A      |
|    0   N/A  N/A     18180    C+G   ...y\ShellExperienceHost.exe    N/A      |
+-----------------------------------------------------------------------------+

Python environment

python -m pip freeze
(base) C:\Users\Administrator>python -m pip freeze
absl-py==2.1.0
accelerate==0.24.1
aiofiles==23.2.0
aiohttp==3.8.5
aiosignal==1.3.1
altair==5.1.2
anaconda-anon-usage @ file:///C:/b/abs_95v3x0wy8p/croot/anaconda-anon-usage_1697038984188/work
anaconda-client==1.12.0
anaconda-cloud-auth @ file:///C:/b/abs_410afndtyf/croot/anaconda-cloud-auth_1697462767853/work
anaconda-navigator @ file:///C:/b/abs_cfvv8k_j21/croot/anaconda-navigator_1704813334508/work
anaconda-project @ file:///C:/ci_311/anaconda-project_1676458365912/work
annotated-types==0.6.0
ansicon==1.89.0
antlr4-python3-runtime==4.9.3
anyio==3.7.1
archspec @ file:///croot/archspec_1709217642129/work
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==24.2.0
Babel==2.14.0
backports.functools-lru-cache @ file:///tmp/build/80754af9/backports.functools_lru_cache_1618170165463/work
backports.tempfile @ file:///home/linux1/recipes/ci/backports.tempfile_1610991236607/work
backports.weakref==1.0.post1
beautifulsoup4 @ file:///C:/b/abs_0agyz1wsr4/croot/beautifulsoup4-split_1681493048687/work
bleach==6.1.0
blessed==1.20.0
blinker==1.7.0
boltons @ file:///C:/ci_311/boltons_1677729932371/work
Brotli @ file:///C:/ci_311/brotli-split_1676435766766/work
cachetools==5.3.2
certifi @ file:///C:/b/abs_1fw_exq1si/croot/certifi_1725551736618/work/certifi
cffi @ file:///C:/b/abs_924gv1kxzj/croot/cffi_1700254355075/work
chardet @ file:///C:/ci_311/chardet_1676436134885/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click @ file:///C:/b/abs_f9ihnt72pu/croot/click_1698129847492/work
clip==0.2.0
clyent==1.2.1
colorama @ file:///C:/ci_311/colorama_1676422310965/work
coloredlogs==15.0.1
comm==0.2.1
conda @ file:///C:/b/abs_85jnuwc__u/croot/conda_1729193917673/work
conda-build @ file:///C:/b/abs_3ed9gavxgz/croot/conda-build_1708025907525/work
conda-content-trust @ file:///tmp/build/80754af9/conda-content-trust_1617045594566/work
conda-libmamba-solver @ file:///croot/conda-libmamba-solver_1727775630457/work/src
conda-pack @ file:///tmp/build/80754af9/conda-pack_1611163042455/work
conda-package-handling @ file:///C:/b/abs_b9wp3lr1gn/croot/conda-package-handling_1691008700066/work
conda-repo-cli==1.0.75
conda-token @ file:///Users/paulyim/miniconda3/envs/c3i/conda-bld/conda-token_1662660369760/work
conda-verify @ file:///D:/bld/conda-verify_1667049856137/work
conda_index @ file:///croot/conda-index_1706633791028/work
conda_package_streaming @ file:///C:/b/abs_6c28n38aaj/croot/conda-package-streaming_1690988019210/work
contourpy==1.2.0
cpm-kernels==1.0.11
cryptography @ file:///C:/b/abs_531eqmhgsd/croot/cryptography_1707523768330/work
cycler==0.12.1
ddddocr==1.5.5
debugpy==1.8.1
decorator==5.1.1
defusedxml @ file:///tmp/build/80754af9/defusedxml_1615228127516/work
distro @ file:///C:/b/abs_a3uni_yez3/croot/distro_1701455052240/work
easydict==1.12
einops==0.7.0
executing==2.0.1
fastapi==0.104.1
fastjsonschema @ file:///C:/ci_311/python-fastjsonschema_1679500568724/work
ffmpy==0.3.1
filelock @ file:///C:/b/abs_f2gie28u58/croot/filelock_1700591233643/work
flatbuffers==24.3.25
fonttools==4.49.0
fqdn==1.5.1
frozendict @ file:///C:/b/abs_2alamqss6p/croot/frozendict_1713194885124/work
frozenlist==1.4.1
fsspec==2023.10.0
ftfy==6.1.3
future @ file:///C:/ci_311_rebuilds/future_1678998246262/work
gitdb==4.0.11
GitPython==3.1.40
gmpy2 @ file:///C:/ci_311/gmpy2_1677743390134/work
gpustat==1.1.1
gradio==3.39.0
gradio_client==0.7.0
grpcio==1.60.1
h11==0.14.0
httpcore==1.0.1
httpx==0.25.1
huggingface-hub==0.19.0
humanfriendly==10.0
idna @ file:///C:/ci_311/idna_1676424932545/work
importlib-metadata==6.8.0
ipykernel==6.29.2
ipython==8.21.0
ipywidgets==8.1.2
isoduration==20.11.0
jaraco.classes @ file:///tmp/build/80754af9/jaraco.classes_1620983179379/work
jedi==0.19.1
Jinja2 @ file:///C:/b/abs_f7x5a8op2h/croot/jinja2_1706733672594/work
jinxed==1.2.1
json5==0.9.14
jsonpatch @ file:///tmp/build/80754af9/jsonpatch_1615747632069/work
jsonpointer==2.1
jsonschema @ file:///C:/b/abs_d1c4sm8drk/croot/jsonschema_1699041668863/work
jsonschema-specifications @ file:///C:/b/abs_0brvm6vryw/croot/jsonschema-specifications_1699032417323/work
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.9.0
jupyter-lsp==2.2.2
jupyter_client==8.6.0
jupyter_core @ file:///C:/b/abs_c769pbqg9b/croot/jupyter_core_1698937367513/work
jupyter_server==2.12.5
jupyter_server_terminals==0.5.2
jupyterlab==4.1.1
jupyterlab-language-pack-zh-CN==4.0.post3
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.3
jupyterlab_widgets==3.0.10
keyring @ file:///C:/b/abs_dbjc7g0dh2/croot/keyring_1678999228878/work
kiwisolver==1.4.5
latex2mathml==3.76.0
libarchive-c @ file:///tmp/build/80754af9/python-libarchive-c_1617780486945/work
libmambapy @ file:///C:/b/abs_2euls_1a38/croot/mamba-split_1704219444888/work/libmambapy
linkify-it-py==2.0.3
Markdown==3.7
markdown-it-py==2.2.0
MarkupSafe @ file:///C:/b/abs_ecfdqh67b_/croot/markupsafe_1704206030535/work
matplotlib==3.8.3
matplotlib-inline==0.1.6
mdit-py-plugins==0.3.3
mdtex2html==1.2.0
mdurl==0.1.2
menuinst @ file:///C:/b/abs_099kybla52/croot/menuinst_1706732987063/work
mistune==3.0.2
mkl-fft @ file:///C:/b/abs_19i1y8ykas/croot/mkl_fft_1695058226480/work
mkl-random @ file:///C:/b/abs_edwkj1_o69/croot/mkl_random_1695059866750/work
mkl-service==2.4.0
more-itertools @ file:///C:/b/abs_36p38zj5jx/croot/more-itertools_1700662194485/work
mpmath @ file:///C:/b/abs_7833jrbiox/croot/mpmath_1690848321154/work
multidict==6.1.0
navigator-updater @ file:///C:/b/abs_895otdwmo9/croot/navigator-updater_1695210220239/work
nbclient==0.9.0
nbconvert==7.16.1
nbformat @ file:///C:/b/abs_5a2nea1iu2/croot/nbformat_1694616866197/work
nest-asyncio==1.6.0
networkx @ file:///C:/b/abs_e6gi1go5op/croot/networkx_1690562046966/work
notebook==7.1.0
notebook_shim==0.2.4
numpy @ file:///C:/b/abs_16b2j7ad8n/croot/numpy_and_numpy_base_1704311752418/work/dist/numpy-1.26.3-cp311-cp311-win_amd64.whl#sha256=5f2c4b54fd5d52b9fb18e32607c79b03cf14665cecce8a5a10e2950559df4651
nvidia-ml-py==12.535.161
nvitop==1.3.2
omegaconf==2.3.0
onnxruntime==1.19.2
opencv-python-headless==4.10.0.84
orjson==3.9.10
outcome==1.3.0.post0
overrides==7.7.0
packaging @ file:///C:/b/abs_28t5mcoltc/croot/packaging_1693575224052/work
pandas==2.0.3
pandocfilters==1.5.1
parso==0.8.3
pathlib @ file:///Users/ktietz/demo/mc3/conda-bld/pathlib_1629713961906/work
pillow @ file:///C:/b/abs_e22m71t0cb/croot/pillow_1707233126420/work
pkce @ file:///C:/b/abs_d0z4444tb0/croot/pkce_1690384879799/work
pkginfo @ file:///C:/b/abs_d18srtr68x/croot/pkginfo_1679431192239/work
platformdirs @ file:///C:/b/abs_b6z_yqw_ii/croot/platformdirs_1692205479426/work
pluggy @ file:///C:/ci_311/pluggy_1676422178143/work
ply==3.11
prettytable==3.9.0
prometheus_client==0.20.0
prompt-toolkit==3.0.43
propcache==0.2.0
protobuf==4.25.0
psutil @ file:///C:/ci_311_rebuilds/psutil_1679005906571/work
pure-eval==0.2.2
pyarrow==12.0.1
pycosat @ file:///C:/b/abs_31zywn1be3/croot/pycosat_1696537126223/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic @ file:///C:/b/abs_9byjrk31gl/croot/pydantic_1695798904828/work
pydantic_core==2.10.1
pydeck==0.8.1b0
pydub==0.25.1
Pygments==2.17.2
PyJWT @ file:///C:/ci_311/pyjwt_1676438890509/work
pyparsing==3.1.1
PyQt5==5.15.10
PyQt5-sip @ file:///C:/b/abs_c0pi2mimq3/croot/pyqt-split_1698769125270/work/pyqt_sip
pyreadline3==3.5.4
PySocks @ file:///C:/ci_311/pysocks_1676425991111/work
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
python-dotenv @ file:///C:/ci_311/python-dotenv_1676455170580/work
python-json-logger==2.0.7
python-multipart==0.0.6
pytz @ file:///C:/b/abs_19q3ljkez4/croot/pytz_1695131651401/work
pywin32==305.1
pywin32-ctypes @ file:///C:/ci_311/pywin32-ctypes_1676427747089/work
pywinpty==2.0.12
PyYAML @ file:///C:/b/abs_782o3mbw7z/croot/pyyaml_1698096085010/work
pyzmq==25.1.2
qtconsole==5.5.1
QtPy @ file:///C:/b/abs_derqu__3p8/croot/qtpy_1700144907661/work
referencing @ file:///C:/b/abs_09f4hj6adf/croot/referencing_1699012097448/work
regex==2023.12.25
requests @ file:///C:/b/abs_474vaa3x9e/croot/requests_1707355619957/work
requests-mock==1.12.1
requests-toolbelt @ file:///C:/b/abs_2fsmts66wp/croot/requests-toolbelt_1690874051210/work
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.6.0
rpds-py @ file:///C:/b/abs_76j4g4la23/croot/rpds-py_1698947348047/work
ruamel-yaml-conda @ file:///C:/ci_311/ruamel_yaml_1676455799258/work
ruamel.yaml @ file:///C:/ci_311/ruamel.yaml_1676439214109/work
safetensors==0.3.3
selenium==4.25.0
semantic-version==2.10.0
semver @ file:///tmp/build/80754af9/semver_1603822362442/work
Send2Trash==1.8.2
sentencepiece==0.1.99
sip @ file:///C:/b/abs_edevan3fce/croot/sip_1698675983372/work
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve @ file:///C:/b/abs_bbsvy9t4pl/croot/soupsieve_1696347611357/work
sse-starlette==1.6.5
stack-data==0.6.3
starlette==0.27.0
streamlit==1.28.1
sympy @ file:///C:/b/abs_82njkonm7f/croot/sympy_1701397685028/work
tenacity==8.2.3
tensorboard==2.16.2
tensorboard-data-server==0.7.2
termcolor==2.4.0
terminado==0.18.0
timm==0.9.12
tinycss2==1.2.1
tokenizers==0.13.3
toml==0.10.2
toolz==1.0.0
torch==2.2.0
torchaudio==2.2.0
torchvision==0.17.0
tornado @ file:///C:/b/abs_0cbrstidzg/croot/tornado_1696937003724/work
tqdm @ file:///C:/b/abs_f76j9hg7pv/croot/tqdm_1679561871187/work
traitlets @ file:///C:/ci_311/traitlets_1676423290727/work
transformers==4.30.2
trio==0.26.2
trio-websocket==0.11.1
truststore @ file:///C:/b/abs_55z7b3r045/croot/truststore_1695245455435/work
types-python-dateutil==2.8.19.20240106
typing_extensions==4.12.2
tzdata==2023.3
tzlocal==5.2
uc-micro-py==1.0.3
ujson @ file:///C:/ci_311/ujson_1676434714224/work
uri-template==1.3.0
urllib3 @ file:///C:/b/abs_4etpfrkumr/croot/urllib3_1707770616184/work
uvicorn==0.24.0.post1
validators==0.22.0
watchdog==3.0.0
wcwidth==0.2.13
webcolors==1.13
webdriver-manager==4.0.2
webencodings==0.5.1
websocket-client==1.8.0
websockets==11.0.3
Werkzeug==3.0.4
widgetsnbextension==4.0.10
win-inet-pton @ file:///C:/ci_311/win_inet_pton_1676425458225/work
windows-curses==2.3.3
wsproto==1.2.0
yarl==1.14.0
zipp @ file:///C:/b/abs_b0beoc27oa/croot/zipp_1704206963359/work
zstandard==0.19.0

Problem description

When nvitop is used to monitor GPU usage on Windows Server, the system shuts down unexpectedly. The cause appears to be a compatibility conflict between nvitop and Windows Server's hardware monitoring system, though this is unconfirmed. The System event log records the following Kernel-Power event (ID 41) for the shutdown:

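Log Name:        System
Source:          Microsoft-Windows-Kernel-Power
Date:            2024/10/23 15:18:48
Event ID:        41
Task Category:   (63)
Level:           Critical
Keywords:        (70368744177664),(2)
User:            SYSTEM
Computer:        WIN-3I9RKHAQAH5
Description:
The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.
Event Xml: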
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" Guid="{331c3b3a-2005-44c2-ac5e-77220c37d6b4}" />
    <EventID>41</EventID>
    <Version>8</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000400000000002</Keywords>
    <TimeCreated SystemTime="2024-10-23T07:18:48.0040790Z" />
    <EventRecordID>210129</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>WIN-3I9RKHAQAH5</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">80</Data>
    <Data Name="BugcheckParameter1">0xffffdc8d2fdfa000</Data>
    <Data Name="BugcheckParameter2">0x2</Data>
    <Data Name="BugcheckParameter3">0xfffff8029b2505e6</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">0</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
    <Data Name="BootAppStatus">0</Data>
    <Data Name="Checkpoint">0</Data>
    <Data Name="ConnectedStandbyInProgress">true</Data>
    <Data Name="SystemSleepTransitionsToOn">0</Data>
    <Data Name="CsEntryScenarioInstanceId">136</Data>
    <Data Name="BugcheckInfoFromEFI">false</Data>
    <Data Name="CheckpointStatus">0</Data>
    <Data Name="CsEntryScenarioInstanceIdV2">136</Data>
    <Data Name="LongPowerButtonPressDetected">false</Data>
  </EventData>
</Event>
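
For triage, the bug check fields in the event above can be read programmatically. The sketch below parses a trimmed copy of that XML and converts `BugcheckCode` to hexadecimal, since Event ID 41 stores the code in decimal while stop codes are normally documented in hex. The helper names (`read_event_data`, `describe_bugcheck`) and the small lookup table are illustrative assumptions, not part of nvitop or any Windows tool, and the table is deliberately incomplete.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the <Event> element in the report.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Hypothetical, deliberately incomplete lookup table of common stop codes.
KNOWN_STOP_CODES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x3B: "SYSTEM_SERVICE_EXCEPTION",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0xD1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
}

# Trimmed copy of the event XML from the report (bug check fields only).
SAMPLE_EVENT = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="BugcheckCode">80</Data>
    <Data Name="BugcheckParameter1">0xffffdc8d2fdfa000</Data>
    <Data Name="BugcheckParameter2">0x2</Data>
    <Data Name="BugcheckParameter3">0xfffff8029b2505e6</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
  </EventData>
</Event>"""


def read_event_data(event_xml):
    """Return the <EventData> name/value pairs as a dict of strings."""
    root = ET.fromstring(event_xml)
    return {
        item.get("Name"): (item.text or "")
        for item in root.findall("e:EventData/e:Data", NS)
    }


def describe_bugcheck(fields):
    """Convert the decimal BugcheckCode to hex and look up its name."""
    code = int(fields["BugcheckCode"])  # decimal in Kernel-Power event 41
    name = KNOWN_STOP_CODES.get(code, "not in this table")
    return f"0x{code:02X} {name}"


fields = read_event_data(SAMPLE_EVENT)
print(describe_bugcheck(fields))  # prints "0x50 PAGE_FAULT_IN_NONPAGED_AREA"
```

Decimal 80 is 0x50, so the shutdown was a kernel bug check rather than a power loss. Identifying the driver behind the faulting address in `BugcheckParameter3` would need a memory dump opened in a debugger, which this report does not include.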

Steps to Reproduce

1. Start a deep learning training job that uses the GPU.
2. Run nvitop to view GPU usage.
3. The system shuts down unexpectedly.
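
To narrow the steps above down, it may help to check whether plain NVML polling, without nvitop's terminal UI, also triggers the shutdown. The sketch below is one possible way to do that, assuming the crash is tied to NVML queries (an untested assumption). `poll_gpus` and `main` are hypothetical names; the `nvml*` calls come from the `pynvml` module shipped in the `nvidia-ml-py` package listed in the environment above, which nvitop also depends on.

```python
import time


def poll_gpus(nvml, iterations=10, interval=1.0):
    """Query utilization and memory of every GPU `iterations` times.

    `nvml` is the pynvml module (or any object exposing the same functions).
    Returns a list of (gpu_index, gpu_utilization_percent, memory_used_bytes).
    """
    samples = []
    nvml.nvmlInit()
    try:
        count = nvml.nvmlDeviceGetCount()
        for _ in range(iterations):
            for index in range(count):
                handle = nvml.nvmlDeviceGetHandleByIndex(index)
                util = nvml.nvmlDeviceGetUtilizationRates(handle)
                mem = nvml.nvmlDeviceGetMemoryInfo(handle)
                samples.append((index, util.gpu, mem.used))
                # Flush so the last query is visible if the machine goes down.
                print(
                    f"GPU {index}: util={util.gpu}% "
                    f"mem_used={mem.used // 2**20}MiB",
                    flush=True,
                )
            time.sleep(interval)
    finally:
        # Always release NVML, even if a query raises.
        nvml.nvmlShutdown()
    return samples


def main():
    # Imported here so the helper above can be exercised without a GPU.
    import pynvml  # provided by the nvidia-ml-py package

    poll_gpus(pynvml, iterations=60, interval=1.0)
```

Calling `main()` while the training job is running repeats step 2 without nvitop itself. If the machine still goes down, the problem more likely sits in the driver or NVML layer than in nvitop; if it stays up, nvitop's own queries or UI are the better suspect. Either run can crash the server again, so save work first.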

Traceback

None

Logs

None

Expected behavior

None

Additional context

None

Metadata

Labels

bug: Something isn't working
