[Bug]: Distribute Tests PR test fails #5544

Closed · bong-furiosa opened this issue Jun 14, 2024 · 0 comments · Fixed by #5546
Labels: bug (Something isn't working)

bong-furiosa (Contributor)
Your current environment

vLLM version 0.5.0.post1

🐛 Describe the bug

Hello!

I would like to know whether the tests/distributed/test_utils.py file (merged in #5473) might be causing errors during the Distribute Tests stage on Buildkite.

When I checked #5422 and #5412, I found that both PRs failed during the Distribute Tests stage. The failure is as follows:

[2024-06-14T00:24:15Z] Running 1 items in this shard: tests/distributed/test_utils.py::test_cuda_device_count_stateless
[2024-06-14T00:24:15Z]
[2024-06-14T00:24:30Z] distributed/test_utils.py::test_cuda_device_count_stateless 2024-06-14 00:24:30,636	INFO worker.py:1753 -- Started a local Ray instance.
[2024-06-14T00:24:33Z] FAILED
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] =================================== FAILURES ===================================
[2024-06-14T00:24:33Z] _______________________ test_cuda_device_count_stateless _______________________
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z]     def test_cuda_device_count_stateless():
[2024-06-14T00:24:33Z]         """Test that cuda_device_count_stateless changes return value if
[2024-06-14T00:24:33Z]         CUDA_VISIBLE_DEVICES is changed."""
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z]         actor = _CUDADeviceCountStatelessTestActor.options(num_gpus=2).remote()
[2024-06-14T00:24:33Z] >       assert ray.get(actor.get_cuda_visible_devices.remote()) == "0,1"
[2024-06-14T00:24:33Z] E       AssertionError: assert '1,0' == '0,1'
[2024-06-14T00:24:33Z] E         
[2024-06-14T00:24:33Z] E         - 0,1
[2024-06-14T00:24:33Z] E         + 1,0
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z] distributed/test_utils.py:26: AssertionError
[2024-06-14T00:24:33Z] =========================== short test summary info ============================
[2024-06-14T00:24:33Z] FAILED distributed/test_utils.py::test_cuda_device_count_stateless - AssertionError: assert '1,0' == '0,1'
[2024-06-14T00:24:33Z]
[2024-06-14T00:24:33Z]   - 0,1
[2024-06-14T00:24:33Z]   + 1,0
[2024-06-14T00:24:33Z] ============================== 1 failed in 17.55s ==============================
[2024-06-14T00:24:36Z] 🚨 Error: The command exited with status 1
[2024-06-14T00:24:36Z] user command error: The plugin docker command hook exited with status 1

Since I am not an expert in the Ray framework, I am not sure how critical the difference between "0,1" and "1,0" is.
The fact that "1,0" was output by a simple test using Ray suggests that, under certain conditions, the result can be "1,0". Therefore, it might be reasonable for the assert line to also allow "1,0".
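
For reference, here is a minimal standalone sketch of such a check (illustrative only; the actor class name is invented and this is not the vLLM test code). It assumes a machine with at least two GPUs:

import os

import ray

ray.init()

# Ray assigns the GPUs and sets CUDA_VISIBLE_DEVICES for the actor;
# the order of the IDs in that string is not necessarily "0,1".
@ray.remote(num_gpus=2)
class _VisibleDevicesActor:
    def get_cuda_visible_devices(self) -> str:
        return os.environ["CUDA_VISIBLE_DEVICES"]

actor = _VisibleDevicesActor.remote()
print(ray.get(actor.get_cuda_visible_devices.remote()))  # e.g. "0,1" or "1,0"

ray.shutdown()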

Would there be any issues if the assert line were modified as shown below?

# assert ray.get(actor.get_cuda_visible_devices.remote()) == "0,1"
assert ray.get(actor.get_cuda_visible_devices.remote()) in ["0,1", "1,0"] 
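
Alternatively, if the device order should be ignored entirely, a set-style comparison could be used. This is only a sketch, reusing the actor from the existing test and assuming get_cuda_visible_devices returns the raw CUDA_VISIBLE_DEVICES string:

# Order-insensitive variant: compare the sorted device IDs instead of the raw string
visible = ray.get(actor.get_cuda_visible_devices.remote())
assert sorted(visible.split(",")) == ["0", "1"]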

I am concerned that this issue might be affecting the correctness of the Distribute Tests checks, and would like to ask about it.
If this issue is not the cause of the test failures, I would greatly appreciate it if you could check the Distribute Tests logs and provide some hints on what might be causing the errors. 🙇
