[Bug] Training fails with an error after installing mmcv on Ascend910b #3203

BoomSky0416 · 2024-11-25T08:10:47Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmcv).

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.7.5 (default, Mar 20 2023, 04:32:29) [GCC 7.5.0]'), ('CUDA available', False), ('numpy_random_seed', 2147483648), ('GCC', 'gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0'), ('PyTorch', '1.8.0a0+56b43f4'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 7.3\n - C++ Version: 201402\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - NNPACK is enabled\n - CPU capability usage: NO AVX\n - Build settings: BLAS_INFO=generic, BUILD_TYPE=Release, CXX_COMPILER=/opt/buildtools/gcc-7.3.0/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -DMISSING_ARM_VST1 -DMISSING_ARM_VLD1 -Wno-stringop-overflow, LAPACK_INFO=generic, TORCH_VERSION=1.8.0, USE_CUDA=OFF, USE_CUDNN=OFF, USE_EIGEN_FOR_BLAS=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=OFF, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, \n'), ('TorchVision', '0.9.1'), ('OpenCV', '4.10.0'), ('MMEngine', '0.7.3'), ('MMCV', '2.0.1'), ('MMCV Compiler', 'GCC 7.5'), ('MMCV CUDA Compiler', 'not available')])

absl-py 2.1.0
addict 2.4.0
albumentations 1.3.1
apex 0.1+ascend
attrs 22.2.0
auto-tune 0.1.0
cachetools 5.5.0
certifi 2022.12.7
cffi 1.12.3
charset-normalizer 3.1.0
chumpy 0.70
click 8.1.7
cycler 0.11.0
Cython 3.0.11
decorator 5.1.1
DLLogger 1.0.0
easydict 1.9
einops 0.6.1
exceptiongroup 1.1.1
fonttools 4.38.0
google-auth 2.36.0
google-auth-oauthlib 0.4.6
grpcio 1.51.3
grpcio-tools 1.51.3
hccl 0.1.0
idna 3.4
imageio 2.31.2
imgaug 0.4.0
importlib-metadata 6.0.0
iniconfig 2.0.0
joblib 1.2.0
json-tricks 3.17.3
kiwisolver 1.4.4
lmdb 1.5.1
lxml 4.5.2
Markdown 3.4.4
markdown-it-py 2.2.0
MarkupSafe 2.1.5
mat4py 0.6.0
matplotlib 3.5.3
mdurl 0.1.2
mmcv 2.0.1
mmdet 3.1.0 /workspace/open-mmlab-2.0/mmdetection
mmengine 0.7.3 /workspace/open-mmlab-2.0/mmengine
mmocr 1.0.1 /workspace/open-mmlab-2.0/mmocr
mmpose 1.2.0 /workspace/open-mmlab-2.0/mmpose
mmpretrain 1.0.0rc8 /workspace/open-mmlab-2.0/mmpretrain
mmrazor 1.0.0 /workspace/open-mmlab-2.0/mmrazor
mmsegmentation 1.1.0 /workspace/open-mmlab-2.0/mmsegmentation
model-index 0.1.11
modelindex 0.0.2
mpmath 1.3.0
munkres 1.1.4
networkx 2.6.3
numexpr 2.8.4
numpy 1.21.6
oauthlib 3.2.2
opc-tool 0.1.0
opencv-python 4.10.0.84
ordered-set 4.1.0
packaging 23.0
pandas 1.3.5
pathlib2 2.3.7.post1
Pillow 9.1.0
pip 23.0.1
pluggy 1.0.0
prettytable 3.7.0
protobuf 3.20.3
pyasn1 0.5.1
pyasn1-modules 0.3.0
pyclipper 1.3.0.post6
pycocotools 2.0.6
pycparser 2.21
Pygments 2.17.2
pyparsing 3.0.9
pytest 7.2.2
python-dateutil 2.8.2
pytz 2022.7.1
PyWavelets 1.3.0
PyYAML 6.0
qudida 0.0.4
rapidfuzz 3.4.0
requests 2.28.2
requests-oauthlib 2.0.0
rich 13.8.1
rsa 4.9
schedule-search 0.0.1
scikit-image 0.19.3
scikit-learn 1.0.2
scipy 1.7.3
setuptools 41.2.0
shapely 2.0.6
six 1.16.0
sklearn 0.0
sympy 1.4
tables 3.6.1
te 0.4.0
tensorboard 2.11.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
termcolor 2.3.0
terminaltables 3.1.10
threadpoolctl 3.1.0
tifffile 2021.11.2
tomli 2.0.1
topi 0.4.0
torch 1.8.0a0+56b43f4
torch-npu 1.8.1
torchvision 0.9.1
tqdm 4.67.0
typing_extensions 4.5.0
urllib3 1.26.15
wcwidth 0.2.13
Werkzeug 2.2.3
wheel 0.40.0
xdoctest 1.1.0
xtcocotools 1.14.3
yapf 0.32.0
zipp 3.15.0

Reproduces the problem - code sample

Uses the mmengine Runner.

Reproduces the problem - command or script

Training is started through the mmengine Runner (tools/caip_train.py calls runner.train(), as shown in the traceback).

Reproduces the problem - error message

Traceback (most recent call last):
  File "tools/caip_train.py", line 725, in <module>
    main()
  File "tools/caip_train.py", line 721, in main
    runner.train()
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/runner/runner.py", line 1707, in train
    self._init_model_weights()
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/runner/runner.py", line 899, in _init_model_weights
    model.init_weights()
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/base_module.py", line 130, in init_weights
    m.init_weights()
  File "/workspace/open-mmlab-2.0/mmpretrain/mmpretrain/models/backbones/resnet.py", line 638, in init_weights
    super(ResNet, self).init_weights()
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/base_module.py", line 124, in init_weights
    initialize(self, other_cfgs)
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/weight_init.py", line 610, in initialize
    initialize(module, cp_cfg)
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/weight_init.py", line 518, in initialize
    func(module)
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/weight_init.py", line 437, in __call__
    module.apply(init)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 473, in apply
    module.apply(fn)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 474, in apply
    fn(self)
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/weight_init.py", line 435, in init
    self.bias, self.distribution)
  File "/workspace/open-mmlab-2.0/mmengine/mmengine/model/weight_init.py", line 104, in kaiming_init
    module.weight, a=a, mode=mode, nonlinearity=nonlinearity)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/init.py", line 413, in kaiming_normal_
    return tensor.normal_(0, std)
RuntimeError: Run:/usr1/workspace/FPTA_Daily_Plugin_open_date/Plugin/torch_npu/csrc/framework/OpParamMaker.cpp:128 NPU error,NPU error code is:100000
EZ9999: Inner Error, Please contact support engineer!
EZ9999 Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:task.cc][LINE:1068]
TraceBack (most recent call last):
Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=0.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:774]
AICPU Kernel task happen error, retCode=0x2a.[FUNC:GetError][FILE:stream.cc][LINE:1044]
Aicpu kernel execute failed, device_id=0, stream_id=3, task_id=0, flip_num=0, fault so_name=, fault kernel_name=, fault op_name=, extend_info=.[FUNC:GetError][FILE:stream.cc][LINE:1044]
rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:49]
Call rtStreamSynchronize(stream) fail, ret: 0x7BC8A[FUNC:KernelLaunchEx][FILE:model_manager.cc][LINE:145]
Failed to execute init graph[FUNC:Load][FILE:model_v2_executor.cc][LINE:119]
Assert ((executor->Load(arg)) == ge::SUCCESS) failed[FUNC:CreateAndLoad][FILE:stream_executor.cc][LINE:38]
Aicpu kernel execute failed, device_id=0, stream_id=4, task_id=0.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:774]
Aicpu kernel execute failed, device_id=0, stream_id=4, task_id=0, flip_num=0, fault so_name=, fault kernel_name=, fault op_name=, extend_info=.[FUNC:GetError][FILE:stream.cc][LINE:1044]
Aicpu kernel execute failed, device_id=0, stream_id=5, task_id=0.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:774]
Aicpu kernel execute failed, device_id=0, stream_id=5, task_id=0, flip_num=0, fault so_name=, fault kernel_name=, fault op_name=, extend_info=.[FUNC:GetError][FILE:stream.cc][LINE:1044]

THPModule_npu_shutdown success.
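
The traceback ends inside Kaiming weight initialization, so a smaller script that runs only that step may help separate an operator-level failure from anything specific to the Runner or the model config. The following is a minimal sketch, not a confirmed reproduction: the helper name `try_kaiming_init`, the tensor shape, and the `fan_out`/`relu` arguments are illustrative choices, and the use of `torch.npu.is_available()` after importing `torch_npu` is an assumption about the installed plugin. It falls back to CPU when no NPU backend is usable.

```python
import importlib


def try_kaiming_init(shape=(8, 3, 3, 3)):
    """Run kaiming_normal_ on one conv-shaped weight and report the outcome."""
    try:
        import torch
    except ImportError:
        return {"status": "torch-missing", "device": None}

    device = "cpu"
    try:
        # Importing torch_npu registers the NPU backend when the plugin is installed.
        importlib.import_module("torch_npu")
        # Assumption: torch.npu.is_available() exists once torch_npu is imported.
        if torch.npu.is_available():
            device = "npu:0"
    except Exception:
        pass  # no usable NPU plugin: stay on CPU

    try:
        weight = torch.empty(*shape, device=device)
        torch.nn.init.kaiming_normal_(weight, mode="fan_out", nonlinearity="relu")
        # Copy back to host so a failure in queued device work surfaces here.
        std = float(weight.float().cpu().std())
    except RuntimeError as exc:
        return {"status": "failed", "device": device, "error": str(exc)}
    return {"status": "ok", "device": device, "std": std}


if __name__ == "__main__":
    print(try_kaiming_init())
```

If this reports `failed` on the NPU but `ok` on CPU, the problem is likely below mmcv and mmengine (torch_npu or the CANN runtime); if it reports `ok` on the NPU, the cause is more likely elsewhere in the training setup.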

Additional information

Could you please provide documentation on which Ascend torch_npu versions are compatible with which mmcv versions?
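
For a compatibility question like this, it can help to report the relevant package versions in one short block rather than a full package listing. A minimal sketch using only package metadata follows; the helper name `collect_versions` and the selection in `PACKAGES` are illustrative, and the distribution names are spelled as they appear in the package list above.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7, as in this report
    import importlib_metadata as metadata  # backport package

# Illustrative selection of the packages relevant to this report.
PACKAGES = ("torch", "torch-npu", "torchvision", "mmcv", "mmengine")


def collect_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```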

BoomSky0416 · 2024-11-25T11:30:39Z

@momo609
