add support for hpc7g family of images #1

Merged on Dec 10, 2023 (1 commit)

@vsoch commented Jul 1, 2023

This is a follow-up to eksctl-io/eksctl#6743 that updates the config here to support the new hpc7g family of ARM instances.

Note that this should not be merged yet, as the instance family does not appear to work with EFA. I cannot see the source code of the build, but I can show a working and a non-working example. Here is the result of using this branch with the im4gn.16xlarge instance (also ARM), where it works:

2023/07/01 20:11:59 Fetching EFA devices.
2023/07/01 20:11:59 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6
2023/07/01 20:11:59 EFA Device list: [{rdmap0s6 uverbs0 /sys/class/infiniband_verbs/uverbs0 /sys/class/infiniband/rdmap0s6}]
2023/07/01 20:11:59 Starting FS watcher.
2023/07/01 20:11:59 Starting OS watcher.
2023/07/01 20:11:59 device: rdmap0s6,uverbs0,/sys/class/infiniband_verbs/uverbs0,/sys/class/infiniband/rdmap0s6
2023/07/01 20:11:59 Starting to serve on /var/lib/kubelet/device-plugins/aws-efa-device-plugin.sock
2023/07/01 20:11:59 Registered device plugin with Kubelet

For comparison, with the same setup (aside from the instance type and resources), the plugin finds no EFA devices on hpc7g.16xlarge:

2023/07/01 19:35:42 Fetching EFA devices.
2023/07/01 19:35:42 No devices found.

The containers then go into CrashLoopBackOff. I'm including this here in case anyone who reads this repository has access to the private repository where the plugin code is hosted. The above gives us a good debugging setup: two ARM instance types, where the plugin works on one but not on the other. My suggestion is to find the plugin code that checks whether the devices exist and assess that logic against both instance types. Whatever differs between them is likely the cause of this result.
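
To compare the two nodes independently of the plugin, the device lookup can be approximated with a short script. This is only a sketch: the plugin source is not visible, so the discovery logic here is an assumption based on the sysfs paths in the working log (`/sys/class/infiniband_verbs/uverbs0` and `/sys/class/infiniband/rdmap0s6`). The function name `list_rdma_devices` and its `sysfs_class` parameter are hypothetical.

```python
from pathlib import Path


def list_rdma_devices(sysfs_class="/sys/class"):
    """List (ib_name, uverbs_name, uverbs_path, ib_path) for each uverbs entry.

    Assumption: devices are discoverable under <sysfs_class>/infiniband_verbs,
    matching the paths printed in the working log. The real plugin may differ.
    """
    root = Path(sysfs_class)
    verbs_dir = root / "infiniband_verbs"
    if not verbs_dir.is_dir():
        return []  # verbs subsystem not exposed on this node
    devices = []
    for entry in sorted(verbs_dir.iterdir()):
        ibdev = entry / "ibdev"  # file holding the InfiniBand device name
        if not ibdev.is_file():
            continue  # skips non-device entries such as abi_version
        ib_name = ibdev.read_text().strip()
        ib_path = root / "infiniband" / ib_name
        devices.append((ib_name, entry.name, str(entry), str(ib_path)))
    return devices


if __name__ == "__main__":
    found = list_rdma_devices()
    if not found:
        print("No devices found.")
    for dev in found:
        print("device: " + ",".join(dev))
```

Running this on both an im4gn.16xlarge node and an hpc7g.16xlarge node would show whether sysfs itself differs between them. If the hpc7g node lists nothing here either, the difference may lie in the node setup (for example the driver or image) rather than in the plugin's check, though that is a guess until someone can read the plugin code.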

Thanks for your help!

@lipovsek-aws

Sorry for the late response. Can you please rebase? Then I'm happy to merge.

@vsoch commented Dec 10, 2023

Yes, of course! All set. I also ensured the entries were in alphabetical order.

@lipovsek-aws merged commit 65f8d5a into aws-samples:main on Dec 10, 2023