-
Description
An error occurs in the logging stage:
Error Message
Environment
Python 3.7

So could you please help me with these questions?
-
Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
-
Hi @seekFire. Because MXNet execution is asynchronous, the out-of-memory (OOM) error most likely happened earlier than the logging stage where it was reported. I'd suggest reducing the model size or the batch size so that training fits in your current GPU. If you have reason to believe that the current settings should fit in your GPU memory, please elaborate so that I can take a closer look.
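As a rough illustration of why the failure shows up late, here is a stdlib-only analogy using `concurrent.futures`; it does not use MXNet. Queued work returns immediately, and an exception raised by that work only surfaces at the synchronization point (`future.result()`), much as an MXNet error surfaces at a later blocking call such as `asnumpy()` or `mx.nd.waitall()`. The names `failing_op` and `run` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def failing_op():
    # Stands in for a GPU operation that runs out of memory
    raise MemoryError("simulated out-of-memory")


def run():
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(failing_op)  # returns immediately, no error yet
        log.append("op queued")
        log.append("logging stage reached")  # code keeps running past the failure
        try:
            future.result()  # synchronization point: the stored error is re-raised here
        except MemoryError as err:
            log.append(f"error surfaced at sync: {err}")
    return log


print(run())
```

In real MXNet code, calling `mx.nd.waitall()` right after the forward/backward pass is a common way to make such an error appear closer to the operation that caused it.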
-
@szha Thank you for your suggestion! When I reduce the batch size to 2 on one GPU, it works. I'm just surprised that the batch size has to be so low when training HRNet-W18 for segmentation...
-
Is this a per-GPU batch size? I suspect it has to do with the input image size.
-
@szha Yes, you're right. The input image size is 512×512 and the GPU has 12 GB of memory.
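To see why a 512×512 input can limit the batch size, a back-of-the-envelope helper for the size of a single activation tensor may help. The 64-channel, 256×256 feature map below is a hypothetical example, not HRNet-W18's actual layer configuration. During training, many such tensors are kept for the backward pass together with their gradients, so the total footprint is a large multiple of any single tensor.

```python
def activation_mib(batch, channels, height, width, bytes_per_elem=4):
    """Size in MiB of one NCHW activation tensor (float32 by default)."""
    return batch * channels * height * width * bytes_per_elem / 2**20


# Hypothetical feature map: 64 channels at half the 512x512 input resolution
print(activation_mib(2, 64, 256, 256))   # 32.0
# Memory grows linearly with batch size ...
print(activation_mib(16, 64, 256, 256))  # 256.0
# ... and quadratically with the spatial side (a 1024x1024 input gives a 512x512 map)
print(activation_mib(2, 64, 512, 512))   # 128.0
```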
-
@szha
But the reshape operation in my custom metric function for the segmentation task generates an error during evaluation. The error message is shown below: And I don't think When I rectify So what do you think the cause is?
-
Have you enabled the NumPy-compatible mode?
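One concrete way a reshape written for MXNet's legacy `NDArray` can break when the arrays are actually NumPy arrays: `NDArray.reshape` accepts special codes such as `0` (copy this dimension from the input), while NumPy treats `0` as a literal zero-length axis. The original reshape call is not shown, so whether this is the exact cause here is an assumption; the sketch only demonstrates the difference on the NumPy side. `try_reshape` is a hypothetical helper.

```python
import numpy as np


def try_reshape(arr, shape):
    """Return the reshaped array, or the error message if NumPy rejects the shape."""
    try:
        return arr.reshape(shape)
    except ValueError as err:
        return str(err)


pred = np.zeros((2, 3, 4))  # e.g. (batch, classes, pixels)

# Legacy MXNet NDArray reads 0 as "keep this dimension" and would give (2, 12);
# NumPy reads 0 literally and raises a ValueError instead.
print(try_reshape(pred, (0, -1)))
# Portable alternative: spell out the kept dimension explicitly.
print(try_reshape(pred, (pred.shape[0], -1)).shape)  # (2, 12)
```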
-
@leezu
-
@leezu
The error is the same as the one mentioned above. I think the class mx.metric.CustomMetric in the new version (1.7.0) differs from the one in the older version, because I used this class to wrap the same custom metric function before and it ran OK.
@leezu
I think I may have found the cause of the error: when I use the class mx.metric.CustomMetric to wrap my custom metric function, the input tensors (label & pred) passed to the function are automatically converted from mxnet.ndarray.ndarray.NDArray to numpy.ndarray, which triggers this error. The validation process is as follows; it is the same script as above except that numpy replaces mxnet.ndarray:
The error is the same as the one mentioned above. I think the class mx.me…
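Given that diagnosis, a metric function passed to `mx.metric.CustomMetric` should be written against NumPy arrays. Below is a minimal sketch of a hypothetical pixel-accuracy metric. It assumes `label` has shape (batch, H, W) with integer class ids and `pred` has shape (batch, num_classes, H, W) with per-class scores; these shapes are assumptions, not taken from the original script.

```python
import numpy as np


def pixel_accuracy(label, pred):
    """Hypothetical segmentation metric that expects numpy.ndarray inputs."""
    pred_ids = np.argmax(pred, axis=1)        # (batch, H, W) predicted class ids
    label = label.astype(pred_ids.dtype)      # labels often arrive as float32
    # Flatten with -1 only; avoid NDArray-specific reshape codes such as 0
    correct = (pred_ids.reshape(-1) == label.reshape(-1)).sum()
    return float(correct) / label.size


# Tiny example: one 2x2 image, two classes
label = np.array([[[0, 1], [1, 0]]])
pred = np.zeros((1, 2, 2, 2))
pred[0, 1, 0, 1] = 1.0  # predict class 1 at pixel (0, 1)
pred[0, 1, 1, 1] = 1.0  # predict class 1 at pixel (1, 1)
print(pixel_accuracy(label, pred))  # 0.5
```

Assuming the MXNet 1.x signature `CustomMetric(feval, name=...)`, the function would then be wrapped as `mx.metric.CustomMetric(pixel_accuracy, name='pixel_acc')`.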