@Alexey-Rivkin
Contributor

@Alexey-Rivkin Alexey-Rivkin commented Sep 30, 2025

What?

Mount a /dev/shm volume for CI build pods to fix intermittent UCX test failures. This replaces the earlier approach of enabling host IPC in build-matrix.

Why?

CI tests randomly fail with UCX memory errors (exit codes 134, 135, 141).
When multiple pods are scheduled on the same node, the shared memory available to the UCX transport might be insufficient.
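
For readers triaging similar failures, the exit codes above can be decoded with the common shell convention that a status above 128 means "terminated by signal (status - 128)". The sketch below is illustrative only and is not part of this PR: the helper name `describe_exit_code` and the small signal table are assumptions, and the table hard-codes Linux signal numbers for x86-64 and arm64.

```python
# Hypothetical helper (not from this PR): decode CI exit codes using the
# shell convention "status > 128 means killed by signal (status - 128)".
# Signal numbers are hard-coded for Linux on x86-64/arm64 so the result
# does not depend on the platform running this script.
LINUX_SIGNALS = {
    6: "SIGABRT",   # abort(), e.g. a failed assertion or fatal library error
    7: "SIGBUS",    # bus error, e.g. touching a mapping that cannot be backed
    13: "SIGPIPE",  # write to a pipe or socket whose reader has gone away
}


def describe_exit_code(code: int) -> str:
    """Return a short human-readable description of a process exit code."""
    if code > 128:
        signum = code - 128
        name = LINUX_SIGNALS.get(signum, f"signal {signum}")
        return f"exit {code}: killed by {name}"
    return f"exit {code}: exited with status {code}"


if __name__ == "__main__":
    for code in (134, 135, 141):
        print(describe_exit_code(code))
```

Under this convention the three codes correspond to SIGABRT, SIGBUS, and SIGPIPE. A SIGBUS would be consistent with a process touching shared-memory pages that /dev/shm can no longer back, though the PR itself only reports them as UCX memory errors.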

How?

Create an isolated /dev/shm emptyDir volume per pod using the empty_volumes config. Each pod gets dedicated shared memory for UCX, isolated from other pods. Also add debug output to verify the configuration.
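
As a rough illustration of what a per-pod /dev/shm volume looks like at the Kubernetes level, the sketch below builds the standard pod-spec pieces for a memory-backed emptyDir and prints the size of the mount as a debug check. It does not show the empty_volumes syntax used by this CI or the debug output added in this PR. The function names, the volume name `dshm`, and the `2Gi` size limit are assumptions chosen for the example.

```python
import json
import os
import shutil


def shm_volume_fragment(name="dshm", size_limit=None):
    """Build pod-spec pieces for a memory-backed emptyDir mounted at /dev/shm.

    'volumes' belongs under the pod spec and 'volumeMounts' under a container.
    The volume name and size limit are illustrative assumptions.
    """
    empty_dir = {"medium": "Memory"}  # tmpfs-backed emptyDir
    if size_limit is not None:
        empty_dir["sizeLimit"] = size_limit
    return {
        "volumes": [{"name": name, "emptyDir": empty_dir}],
        "volumeMounts": [{"name": name, "mountPath": "/dev/shm"}],
    }


def shm_report(path="/dev/shm"):
    """Return a one-line summary of the shared-memory mount for debug logs."""
    if not os.path.isdir(path):
        return f"{path}: not present"
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return f"{path}: total={usage.total // mib} MiB, free={usage.free // mib} MiB"


if __name__ == "__main__":
    print(json.dumps(shm_volume_fragment(size_limit="2Gi"), indent=2))
    print(shm_report())
```

Because an emptyDir is created per pod, each pod sees its own /dev/shm rather than competing with other pods on the node. Running `shm_report()` at the start of a job is one simple way to confirm the mount is in place.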

@github-actions

👋 Hi Alexey-Rivkin! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@Alexey-Rivkin
Contributor Author

/build

@Alexey-Rivkin
Contributor Author

/build

@Alexey-Rivkin Alexey-Rivkin changed the title from "CI: Enable hostIPC to debug UCX shared mem failures" to "CI: Mount /dev/shm volume to fix intermittent test failures" Sep 30, 2025