Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test test_delete_after_create - failed to remove cgroup #422 #423

Open
jsturtevant opened this issue Dec 11, 2023 · 7 comments
Open
Labels
bug Something isn't working

Comments

@jsturtevant
Copy link
Contributor

          same failure as on #422 
2023-12-11T16:24:37.6263250Z �[1m�[32m    Finished�[0m test [unoptimized] target(s) in 28.73s
2023-12-11T16:24:38.0688247Z ././target/build/x86_64-unknown-linux-musl//x86_64-unknown-linux-musl/debug/deps/containerd_shim_wasmer-4cf9e40c15afbde7 --nocapture --test-threads=1
2023-12-11T16:24:38.1117914Z 
2023-12-11T16:24:38.1118414Z running 7 tests
2023-12-11T16:24:38.6147511Z test wasmer_tests::test_custom_entrypoint ... ok
2023-12-11T16:24:38.7404433Z [2023-12-11T16:24:38Z ERROR libcontainer::container::container_delete] failed to remove cgroup due to: V2(WrappedIo(Other { err: Custom { kind: TimedOut, error: "could not delete" }, path: "/sys/fs/cgroup/test" })) cgroup_path="test"
2023-12-11T16:24:38.7408303Z Error: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7409256Z 
2023-12-11T16:24:38.7409508Z Caused by:
2023-12-11T16:24:38.7409998Z     0: io error: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7411140Z     1: at /sys/fs/cgroup/test: could not delete
2023-12-11T16:24:38.7412084Z test wasmer_tests::test_delete_after_create ... FAILED
2023-12-11T16:24:39.6997094Z test wasmer_tests::test_exit_code ... ok
2023-12-11T16:24:40.4872602Z test wasmer_tests::test_has_default_devices ... ok
2023-12-11T16:24:40.9817523Z test wasmer_tests::test_hello_world ... ok
2023-12-11T16:24:41.9711765Z test wasmer_tests::test_seccomp ... ok
2023-12-11T16:24:42.3465552Z [2023-12-11T16:24:42Z ERROR wasmer_wasix::state::env] wasi[1]::wasi initialize error (Missing export memory)
2023-12-11T16:24:42.8445822Z test wasmer_tests::test_unreachable ... ok
2023-12-11T16:24:42.8446223Z 
2023-12-11T16:24:42.8446548Z failures:
2023-12-11T16:24:42.8446734Z 
2023-12-11T16:24:42.8446917Z failures:
2023-12-11T16:24:42.8447238Z     wasmer_tests::test_delete_after_create
2023-12-11T16:24:42.8447557Z 
2023-12-11T16:24:42.8448359Z test result: FAILED. 6 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 4.73s
2023-12-11T16:24:42.8449129Z 
2023-12-11T16:24:42.8470890Z make: *** [Makefile:126: test-wasmer] Error 123
2023-12-11T16:24:42.8487753Z ##[error]Process completed with exit code 2.

Originally posted by @jsturtevant in #420 (comment)

@Mossaka
Copy link
Member

Mossaka commented Jan 4, 2024

@Mossaka
Copy link
Member

Mossaka commented Jan 8, 2024

@jprendes
Copy link
Collaborator

AFAICS the failing test point is always test_delete_after_create.
Unlike the other test points, this one never calls start() nor wait().
I can imagine two options:

  • Some setup is happening in start that this tespoint doesn't hit. But then I would expect this test point to always fail.
  • There's some race condition between delete and the cgroups path being created. As we usually wait for the child process to finish, this is not a proble, but in this test point we might be calling delete too quickly? This seems the most likely to me.

I don't have the bandwidth to validate any of this at the moment, but I can take a look next week.

@jprendes jprendes changed the title Flaky test - failed to remove cgroup (wasmer) #422 Flaky test - failed to remove cgroup #422 Feb 6, 2024
@jprendes
Copy link
Collaborator

jprendes commented Feb 6, 2024

The failure is not specific to wasmer. I can reproduce with any runtime.

I'm running

while RUST_LOG=trace cargo test --quiet --package containerd-shim-wasmtime -- --nocapture --test-threads=1 test_delete_after_create; do echo .; done

That sporadically fails to remove the cgroup folder (that's why it's in a loop).

It seems that youki's libcgroups is already trying to work around this issue (see youki-dev/youki#63 and youki-dev/youki#333), where youki tries to delete the folder a few times with a small delay between attempts.

The underlying cause is that trying to remove the folder results in Err(Os { code: 16, kind: ResourceBusy, message: "Device or resource busy" }).
Locally, I usually see that the first attempt fails, and the second succeeds, but some times all 4 attempts fail.

@jprendes
Copy link
Collaborator

jprendes commented Feb 6, 2024

When we delete the container, we send SIGKILL to the container init process.
IIUC, it takes some time from when we send SIGKILL to the process, until the kernel allows us to delete the cgroup.
My best guess is that we could increase the number of attempts that youki makes to remove the folder.
But that doesn't explain why we only see the failure in this test and not others.
I should try reproduce this with youki alone.

ping @utam0k

@jprendes jprendes changed the title Flaky test - failed to remove cgroup #422 Flaky test test_delete_after_create - failed to remove cgroup #422 Feb 6, 2024
@utam0k
Copy link
Member

utam0k commented Feb 7, 2024

Other test cases in this file take 10 seconds. It may be related... I'm not sure...

fn test_delete_after_create() -> anyhow::Result<()> {

It seems runc attempts 100 retries. We need to implement the same logic... 😭
https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/libcontainer/state_linux.go#L49-L51
https://github.com/opencontainers/runc/blob/02120488a4c0fc487d1ed2867e901eeed7ce8ecf/delete.go#L16-L25

@Mossaka
Copy link
Member

Mossaka commented Feb 7, 2024

That 100 retries is wild.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants