test_shutdown_closed_peer fails locally #643
Comments
On CI, we see the following:
tests/test_disconnect.py::test_shutdown_closed_peer [1601917865.934649] [11b931505f43:32919:0] parser.c:1600 UCX WARN unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1601917865.912238] [11b931505f43:32918:0] parser.c:1600 UCX WARN unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1601917865.992168] [11b931505f43:32918:0] sock.c:344 UCX ERROR send(fd=-1) failed: Bad file descriptor
PASSED |
I can confirm I see this on my side too. However, I've been ignoring those errors for a long time, given I've seen them come and go several times over the past year, so I can't say since when they've been there. |
Thanks for confirming. Wasn't sure whether it was just something funky in my environment since CI seems unable to reproduce it. |
CI also reproduces it, but the Python test still passes because the error is not fatal for it: https://gpuci.gpuopenanalytics.com/blue/organizations/jenkins/rapidsai%2Fgpuci%2Fucx-py%2Fprb%2Fucx-py-gpu-build/detail/ucx-py-gpu-build/2617/pipeline#log-984 . The linked run is just one of the test runs for the latest open PR: #626 . |
Damn, sorry @jakirkham , I just noticed the error you're asking about is #643 (comment) . I can reproduce #643 (comment) locally, but not #643 (comment) , this is on an environment I created earlier today, maybe 10h ago or so. |
Oops, thought I had set the environment variables, but didn't. Setting them fixes the Python test failure 🤦‍♂️ Interestingly, I now don't see the warning messages that we see on CI 🤔 |
You don't? Can you share exactly what variables you're setting? I see it when I run the following on a DGX-1:
|
Same as above plus setting Interesting when I run the command as you show I now see the warning messages. Not sure why running |
Can you paste the exact command you're running? I still am able to see the error with:
$ UCXPY_IFNAME=enp1s0f0 UCX_TLS=tcp,cuda_copy,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm python -m pytest test_disconnect.py -vs
================================================= test session starts ==================================================
platform linux -- Python 3.7.8, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /datasets/pentschev/miniconda3/envs/rn-110-0.16.201005/bin/python
cachedir: .pytest_cache
rootdir: /datasets/pentschev/src/ucx-py
plugins: asyncio-0.12.0
collected 1 item
test_disconnect.py::test_shutdown_closed_peer libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
[1601927571.966144] [dgx13:42334:0] sock.c:344 UCX ERROR send(fd=-1) failed: Bad file descriptor
PASSED
================================================== 1 passed in 0.90s ===================================================
Ignore the ibverbs warnings; they're unrelated to this, but something I have to check on that machine. |
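For anyone who would rather drive that run from Python than from the shell, here is a rough sketch of the same invocation. The environment variable names and values are copied from the command above; the helper name `build_pytest_invocation` is made up for illustration, and `enp1s0f0` is just the interface on that particular machine, so substitute your own.

```python
import os
import sys


def build_pytest_invocation(ifname, test_file="test_disconnect.py"):
    """Return (cmd, env) mirroring the shell command quoted in this thread.

    Hypothetical helper, not part of ucx-py. `ifname` is machine-specific.
    """
    env = dict(os.environ)  # copy, so the current process is left untouched
    env.update(
        {
            "UCXPY_IFNAME": ifname,
            "UCX_TLS": "tcp,cuda_copy,sockcm",
            "UCX_SOCKADDR_TLS_PRIORITY": "sockcm",
        }
    )
    cmd = [sys.executable, "-m", "pytest", test_file, "-vs"]
    return cmd, env


cmd, env = build_pytest_invocation("enp1s0f0")
# To actually launch it: subprocess.run(cmd, env=env, check=False)
```

Building the environment as a copy keeps the variables scoped to the child process, which makes it easier to compare runs with and without them.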
Yeah I'm just not including the Adding Adding
There are some more details on how |
Omitting |
FWIW I think this is the line that emits this error. |
The reason this error happens is that the client process terminating is the trigger ucx-py/tests/test_disconnect.py Lines 67 to 68 in c3dd8f9
ucx-py/tests/test_disconnect.py Lines 29 to 30 in c3dd8f9
It seems like we have the wrong order for doing things here, or maybe the meaning of "disconnect" is indeed to test that things still don't crash after the client has terminated. If we're really testing that the other side of the endpoint has already been terminated, then I think the error message coming from UCX is expected. Could you confirm what's the correct context for this test @madsbk ? |
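As a side note on the message itself: `send(fd=-1) failed: Bad file descriptor` is what the OS reports when a send is attempted on a descriptor that has already been invalidated. The sketch below reproduces that class of error with plain Python sockets on Linux. It is only an analogy under that assumption, not UCX's actual code path, and the function name is invented for illustration.

```python
import errno
import socket


def send_after_local_close():
    """Close one end of a connected pair, then try to send on it.

    Returns (fileno_after_close, errno_from_send). Illustration only;
    this does not involve UCX or ucx-py.
    """
    a, b = socket.socketpair()
    try:
        a.close()                    # descriptor is released here
        fd_after_close = a.fileno()  # Python reports -1 for a closed socket
        try:
            a.send(b"ping")
        except OSError as exc:
            return fd_after_close, exc.errno
        return fd_after_close, None
    finally:
        b.close()


fd, err = send_after_local_close()
print(fd, errno.errorcode.get(err))
```

On Linux this should report a descriptor of -1 and `EBADF`, which is the same pair of facts the UCX log line shows. If the test is deliberately shutting down after the peer is gone, seeing that message logged while the test still passes would be consistent with this.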
Yes, the test is part of #494, which tests shutdown of an already closed peer. |
Would it make sense to rename the test to "test_terminate"/"test_unexpected_disconnect" or something that more clearly identifies that the client hasn't disconnected after |
I think both renaming the test and adding a comment are good ideas. |
Thanks for the context Mads! Had forgotten about this case. Tried to clarify things a bit in PR ( #645 ). Suggestions would be very welcome 😄 |
As of #693 , all tests are confirmed passing in UCX >= 1.9 for various combinations of transports. Therefore, I believe this is now resolved and I'm closing, but please reopen if you still see this. |
When running the following locally...
...am getting the following test failure