Skip to content

Conversation

@eriknordmark
Copy link
Contributor

@eriknordmark eriknordmark commented Oct 28, 2025

Description

After a device has been reinstalled but reusing the the device certificate the attempts to publish EdgeNodeCerts will be rejected by the controller for security reasons since they have changed. (They are stored in /persist/certs hence are lost and recreated on a device reinstall.) This makes that unusual condition visible to the user by putting the device in maintenance mode.

PR dependencies

lf-edge/eve-api#123

How to test and validate this PR

This can be tested using either the qemu setup (make live run) or using a physical device.
For a physical device reasonable steps are:

  1. Onboard the device to zedcloud
  2. ssh into the device and mv /persist/certs/*.cert.pem /persist/; sync; reboot

Without the fix the device will show as Online after it has rebooted (but some subtle functionality will not work).

With the fix in place the device will appear in Maintenance mode in the controller (but with an unknown reason due to controller not knowing about the new enum value).

Once the testing has completed you can ssh into EVE and restore using
mv /persist/*.cert.pem /persist/certs/; sync; reboot

Changelog notes

If a device looses is /persist partition it will automatically create that partition on the next boot. However, that results in recreating some certificates in /persist/certs, and the controller might not accept the new certificates). That typically results in the device being marked as SUSPECT in the controller, which is not very informative.

This fix ensures that this results in a visible maintenance mode instead of some subtle failures to deploy applications with cloud-init user data.

PR Backports

N/A

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@eriknordmark eriknordmark requested a review from shjala October 28, 2025 16:27
@eriknordmark eriknordmark marked this pull request as draft October 28, 2025 16:34
@eriknordmark eriknordmark force-pushed the fetchcerts branch 2 times, most recently from ce38931 to c8ec63a Compare October 31, 2025 18:34
@eriknordmark eriknordmark marked this pull request as ready for review October 31, 2025 18:57
@shjala
Copy link
Member

shjala commented Nov 3, 2025

that results in recreating some certificates in /persist/certs

@eriknordmark shouldn't we restore the certs from TPM instead of recreating them? or this PR is only for devices with no TPM?

@eriknordmark
Copy link
Contributor Author

that results in recreating some certificates in /persist/certs

@eriknordmark shouldn't we restore the certs from TPM instead of recreating them? or this PR is only for devices with no TPM?

These three certs are not stored in the TPM nvram. I don't know if we have space to add them across all possible TPMs.

In any case, the supported way to reinstall EVE is to do a TPM clear. This fix is to make it obvious when that wasn't done by flagging the issue.

@rene
Copy link
Contributor

rene commented Nov 5, 2025

@eriknordmark , please, rebase on top of master in order to get Yetus fixes...

@eriknordmark
Copy link
Contributor Author

@eriknordmark , please, rebase on top of master in order to get Yetus fixes...

Done

@codecov
Copy link

codecov bot commented Nov 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 20.39%. Comparing base (2281599) to head (62b5d9a).
⚠️ Report is 60 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5328      +/-   ##
==========================================
+ Coverage   19.52%   20.39%   +0.86%     
==========================================
  Files          19       19              
  Lines        3021     2314     -707     
==========================================
- Hits          590      472     -118     
+ Misses       2310     1721     -589     
  Partials      121      121              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@eriknordmark
Copy link
Contributor Author

@rene For some reason yetus/golangcilint complains about files which this PR does not touch. Is that due to the new version of golangcilint?

@rene
Copy link
Contributor

rene commented Nov 6, 2025

@rene For some reason yetus/golangcilint complains about files which this PR does not touch. Is that due to the new version of golangcilint?

Yes @eriknordmark , I'm still not sure why unchanged files were checked, but for now we can ignore these....

Copy link
Contributor

@rene rene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@shjala
Copy link
Member

shjala commented Nov 10, 2025

We got job failure with "No space left on device", I thought this was fixed...

@github-actions github-actions bot requested a review from shjala November 10, 2025 23:25
@eriknordmark eriknordmark force-pushed the fetchcerts branch 2 times, most recently from 88abdca to e68fefa Compare November 12, 2025 05:46
@rene
Copy link
Contributor

rene commented Nov 12, 2025

We got job failure with "No space left on device", I thought this was fixed...

The Go Tests action doesn't run on our runners yet, that's why we cannot fix out of disk space....

To get MaintenanceModeReason_MAINTENANCE_MODE_REASON_EDGE_NODE_CERTS_REFUSED

Signed-off-by: eriknordmark <[email protected]>
After a device has been reinstalled but reusing the the device
certificate the attempts to publish EdgeNodeCerts will be rejected
by the controller for security reasons since they have changed.
(They are stored in /persist/certs hence are lost and recreated on
a device reinstall.) This makes that unusual condition visible to the
user by putting the device in maintenance mode.

Signed-off-by: eriknordmark <[email protected]>
For debug->recovertpm dependency

Signed-off-by: eriknordmark <[email protected]>
@eriknordmark eriknordmark merged commit 305712a into lf-edge:master Dec 4, 2025
44 of 47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants