Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_shutdown_closed_peer fails locally #643

Closed
jakirkham opened this issue Oct 5, 2020 · 18 comments
Closed

test_shutdown_closed_peer fails locally #643

jakirkham opened this issue Oct 5, 2020 · 18 comments

Comments

@jakirkham
Copy link
Member

When running the following locally...

$ python -m pytest tests/test_disconnect.py 

...am getting the following test failure

=================================== FAILURES ===================================
__________________________ test_shutdown_closed_peer ___________________________

caplog = <_pytest.logging.LogCaptureFixture object at 0x7fe8339ca4f0>

    def test_shutdown_closed_peer(caplog):
        client_queue = mp.Queue()
        server_queue = mp.Queue()
        p1 = mp.Process(
            target=_test_shutdown_closed_peer_server, args=(client_queue, server_queue)
        )
        p1.start()
        p2 = mp.Process(
            target=_test_shutdown_closed_peer_client, args=(client_queue, server_queue)
        )
        p2.start()
        p2.join()
        server_queue.put("client is down")
        p1.join()
    
>       assert not p1.exitcode
E       AssertionError: assert not 1
E        +  where 1 = <SpawnProcess name='SpawnProcess-1' pid=54486 parent=54399 stopped exitcode=1>.exitcode

tests/test_disconnect.py:71: AssertionError
----------------------------- Captured stdout call -----------------------------
[1601921102.562273] [dgx15:54486:0]          rc_ep.c:321  UCX  WARN  destroying rc ep 0x55cdea70bcf8 with uncompleted operation 0x55cdea91e6c0
[1601921102.588147] [dgx15:54486:0]          mpool.c:43   UCX  WARN  object 0x55cdea91a380 was not returned to mpool ucp_requests
[1601921102.588153] [dgx15:54486:0]      callbackq.c:450  UCX  WARN  0 fast-path and 1 slow-path callbacks remain in the queue
----------------------------- Captured stderr call -----------------------------
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/datasets/jkirkham/miniconda/envs/rapids16dev/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/jkirkham/devel/ucx-py/tests/test_disconnect.py", line 43, in _test_shutdown_closed_peer_server
    assert log.find("""UCXError('<[Send shutdown]""") != -1
AssertionError
=========================== short test summary info ============================
FAILED tests/test_disconnect.py::test_shutdown_closed_peer - AssertionError: ...
============================== 1 failed in 1.94s ===============================
@jakirkham
Copy link
Member Author

On CI, we see the following

tests/test_disconnect.py::test_shutdown_closed_peer [1601917865.934649] [11b931505f43:32919:0]         parser.c:1600 UCX  WARN  unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1601917865.912238] [11b931505f43:32918:0]         parser.c:1600 UCX  WARN  unused env variable: UCX_PATH (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1601917865.992168] [11b931505f43:32918:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
PASSED

@pentschev
Copy link
Member

I can confirm I see this on my side too. However, I've been ignoring those errors for a long time, given I've seen them come and go several times over the past year, so I can't say since when they've been there.

@jakirkham
Copy link
Member Author

Thanks for confirming. Wasn't sure whether it was just something funky in my environment since CI seems unable to reproduce it.

@pentschev
Copy link
Member

CI also reproduces it, but the Python test passes, the error is not fatal for it: https://gpuci.gpuopenanalytics.com/blue/organizations/jenkins/rapidsai%2Fgpuci%2Fucx-py%2Fprb%2Fucx-py-gpu-build/detail/ucx-py-gpu-build/2617/pipeline#log-984 . The link for the test is just one of the tests for the latest open PR: #626 .

@pentschev
Copy link
Member

Damn, sorry @jakirkham , I just noticed the error you're asking about is #643 (comment) . I can reproduce #643 (comment) locally, but not #643 (comment) , this is on an environment I created earlier today, maybe 10h ago or so.

@jakirkham
Copy link
Member Author

Oops, thought I had set the environment variables, but didn't. Setting them fixes the Python test failure 🤦‍♂️

Interestingly I now don't see the warning messages that we see on CI 🤔

@pentschev
Copy link
Member

Interestingly I now don't see the warning messages that we see on CI 🤔

You don't? Can you share exactly what variables you're setting? I see it when I run the following on a DGX-1:

UCXPY_IFNAME=enp1s0f0 UCX_TLS=tcp,cuda_copy,sockcm py.test --cache-clear -vs --ignore-glob tests/test_send_recv_two_workers.py tests/test_disconnect.py

@jakirkham
Copy link
Member Author

Same as above plus setting UCX_MEMTYPE_CACHE=n and UCX_SOCKADDR_TLS_PRIORITY=sockcm.

Interesting when I run the command as you show I now see the warning messages. Not sure why running python -m pytest tests/ doesn't also show this warning. Is it possible pytest is interacting with UCX in some unfriendly way? Or are the warning messages just getting buried?

@pentschev
Copy link
Member

Can you paste the exact command you're running? I still am able to see the error with python -m pytest, although I can't run it from ucx-py's root directory anymore as it tries to import ucp directly from there.

$ UCXPY_IFNAME=enp1s0f0 UCX_TLS=tcp,cuda_copy,sockcm UCX_SOCKADDR_TLS_PRIORITY=sockcm python -m pytest test_disconnect.py -vs
================================================= test session starts ==================================================
platform linux -- Python 3.7.8, pytest-6.1.1, py-1.9.0, pluggy-0.13.1 -- /datasets/pentschev/miniconda3/envs/rn-110-0.16.201005/bin/python
cachedir: .pytest_cache
rootdir: /datasets/pentschev/src/ucx-py
plugins: asyncio-0.12.0
collected 1 item

test_disconnect.py::test_shutdown_closed_peer libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs3
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs2
[1601927571.966144] [dgx13:42334:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
PASSED

================================================== 1 passed in 0.90s ===================================================

Ignore the ibverbs' warnings, they're unrelated to this, but something I have to check in that machine.

@jakirkham
Copy link
Member Author

Yeah I'm just not including the -vs options.

Adding -v has no effect.

Adding -s would enable capturing stdout & stderr. There are some subtle differences between Python's stdout& stderr and C's stdout & stderr. Here's a good blogpost from a Python core dev explaining how this works. So I'm not sure exactly how pytest capturing affects the latter, but that may be causing issues for UCX. Once I add this option, the warning crops up.

tests/test_disconnect.py::test_shutdown_closed_peer [1601928631.455811] [dgx15:10438:0]           sock.c:344  UCX  ERROR send(fd=-1) failed: Bad file descriptor
PASSED

There are some more details on how pytest captures stdout & stderr. Only using --capture=fd does not present the bad file descriptor warning.

@pentschev
Copy link
Member

Omitting -s usually doesn't print any stdout/stderr, only the Python traceback at the end, so I'm not sure you're not seeing that error when you don't have -s. Seems like a normal behavior of pytest to me.

@jakirkham
Copy link
Member Author

FWIW I think this is the line that emits this error.

@pentschev
Copy link
Member

The reason this error happens is that the client process terminating is the trigger

p2.join()
server_queue.put("client is down")
to close the endpoint to the client
await mp_queue_get_nowait(server_queue)
await ep.close()

It seems like we have the wrong order for doing things here, or maybe the meaning of "disconnect" is indeed to test that things still don't crash after the client has terminated. If we're really testing that the other side of the endpoint has already been terminated, then I think the error message coming from UCX is expected. Could you confirm what's the correct context for this test @madsbk ?

@madsbk
Copy link
Member

madsbk commented Oct 6, 2020

It seems like we have the wrong order for doing things here, or maybe the meaning of "disconnect" is indeed to test that things still don't crash after the client has terminated. If we're really testing that the other side of the endpoint has already been terminated, then I think the error message coming from UCX is expected. Could you confirm what's the correct context for this test @madsbk ?

Yes, the test is part of #494 that test shutdown of an already closed peer.

@pentschev
Copy link
Member

Yes, the test is part of #494 that test shutdown of an already closed peer.

Would it make sense to rename the test to "test_terminate"/"test_unexpected_disconnect" or something that more clearly identifies that the client hasn't disconnected after ep.close? Alternatively, we should probably add a comment that we expect to see UCX errors because it's testing for an unexpected disconnect.

@madsbk
Copy link
Member

madsbk commented Oct 6, 2020

Would it make sense to rename the test to "test_terminate"/"test_unexpected_disconnect" or something that more clearly identifies that the client hasn't disconnected after ep.close? Alternatively, we should probably add a comment that we expect to see UCX errors because it's testing for an unexpected disconnect.

I think both the renaming and adding a comment is a good idea.

@jakirkham
Copy link
Member Author

Thanks for the context Mads! Had forgotten about this case. Tried to clarify things a bit in PR ( #645 ). Suggestions would be very welcome 😄

@pentschev
Copy link
Member

As of #693 , all tests are confirmed passing in UCX >= 1.9 for various combinations of transports. Therefore, I believe this is now resolved and I'm closing, but please reopen if you see this still.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants