Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auto-provisioning: improve behavior with power cuts #217

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

jsrc27
Copy link
Collaborator

@jsrc27 jsrc27 commented Jan 30, 2025

Some users when auto-provisioning, power down the system once they see the auto-provisioning service exit successfully. However, in some cases this can cause the files created from the provisioning process to not be fully written to disk, resulting in 0-byte files upon next boot.

Therefore, add an explicit sync after the files are written to improve the behavior of the auto-provisioning in these cases.

Related-to: TOR-3743

@EdTheBearded
Copy link
Collaborator

Hey @jsrc27!
It seems that this sync would help on this scenario where the power cut happens because the operator is waiting for a prompt to power off the device.
But thinking broadly on improving this for power cuts, couldn't we add some hash verification on those files on check_provisioning_status? So that we know that the files are there and are correct? Maybe even add a 'flag' file after we confirm this (on the next boot), just so we're not comparing hashes everytime?

Something like:

for i in $(unzip -Z1 credentials.zip); do
    orig=$(unzip -p credentials.zip $i | md5sum | cut -d' ' -f1)
    [ -f "$i" ] && file=$(md5sum $i | cut -d' ' -f1) || file=""
    [ "$orig" != "$file" ] && return
done

@rborn-tx
Copy link
Contributor

Related to @EdTheBearded's comment on making the whole operation generally more robust, another approach we could consider is relying on the atomicity of the rename operation. Basically, we could simply create a new version of the /var/sota/import directory, say a /var/sota/import.tmp, do a sync and then (forcefully) rename /var/sota/import.tmp -> /var/sota/import maybe followed by another sync.

In practice this would probably mean just rewriting write_credentials() to something like this:

write_credentials() {
    log "Updating device credentials"

    rm -Rf ${SOTA_CRED_DIR}.tmp && mkdir -p ${SOTA_CRED_DIR}.tmp
    if ! unzip device.zip -d ${SOTA_CRED_DIR}.tmp >/dev/null; then
        exit_error "Failed extracting credentials file"
    fi
    sync
    mv -f ${SOTA_CRED_DIR}.tmp ${SOTA_CRED_DIR}

    rm -rf $SOTA_BASE_DIR/sql.db
    rm -rf $CONFIG_FILE
    sync
}

An even more robust method would be to do the same with the whole /var/sota directory (I mean create a new version of the directory and do the rename as the last step), but I don't think we need to go that far.

@jsrc27
Copy link
Collaborator Author

jsrc27 commented Jan 30, 2025

@EdTheBearded Not sure if I fully understood your feedback. I have some questions for clarification.

First of all did you mean device.zip instead of credentials.zip? We don't use any credentials.zip anywhere in this script.

Also did you mean to suggest adding this logic to write_credentials instead of check_provisioning_status? Because in check_provisioning_status the device.zip hasn't been downloaded yet from the server. Or did I misunderstand what you meant?

@EdTheBearded
Copy link
Collaborator

@EdTheBearded Not sure if I fully understood your feedback. I have some questions for clarification.

First of all did you mean device.zip instead of credentials.zip? We don't use any credentials.zip anywhere in this script.

Also did you mean to suggest adding this logic to write_credentials instead of check_provisioning_status? Because in check_provisioning_status the device.zip hasn't been downloaded yet from the server. Or did I misunderstand what you meant?

Yeah, I meant device.zip. I wrote with credentials.zip because that was a zip file I had in hand to test the output.

No, I meant on check_provisioning_status, since this is what decides whether the script will run or not. In the case that we have the zip file and the hashes match, we exit early, if not we just keep running and (re)do the provisioning.

But as @rborn-tx suggested, maybe doing all this in a .tmp folder and as a last step renaming it is a bit cleaner.

@jsrc27
Copy link
Collaborator Author

jsrc27 commented Jan 30, 2025

No, I meant on check_provisioning_status, since this is what decides whether the script will run or not. In the case that we have the zip file and the hashes match, we exit early, if not we just keep running and (re)do the provisioning.

I don't think that would ever realistically be the case no? The device.zip gets downloaded during register_device into a temp directory in /tmp. If the device rebooted then /tmp would have been cleared and the zip file would be lost.

For device.zip to exist during check_provisioning_status, the service would need to have been manually started/restarted while device.zip is still in /tmp somehow. Which probably means someone is just messing with the process while it's occurring, which is difficult to account for.

So we'd be checking for the case where device.zip somehow exists and auto-provisioning.json still exists and was not removed from a prior successful run of auto-provisioning. Which would be weird cause how do we have existing certs and auto-provisioning.json still exists, but we have cert files installed on the device that do not match what we have in device.zip. I just have a hard time imagining the scenario this would be checking for.

It would make sense if we want to do the hash verification after we unzip just for extra robustness I guess.

@EdTheBearded
Copy link
Collaborator

No, I meant on check_provisioning_status, since this is what decides whether the script will run or not. In the case that we have the zip file and the hashes match, we exit early, if not we just keep running and (re)do the provisioning.

I don't think that would ever realistically be the case no? The device.zip gets downloaded during register_device into a temp directory in /tmp. If the device rebooted then /tmp would have been cleared and the zip file would be lost.

For device.zip to exist during check_provisioning_status, the service would need to have been manually started/restarted while device.zip is still in /tmp somehow. Which probably means someone is just messing with the process while it's occurring, which is difficult to account for.

So we'd be checking for the case where device.zip somehow exists and auto-provisioning.json still exists and was not removed from a prior successful run of auto-provisioning. Which would be weird cause how do we have existing certs and auto-provisioning.json still exists, but we have cert files installed on the device that do not match what we have in device.zip. I just have a hard time imagining the scenario this would be checking for.

It would make sense if we want to do the hash verification after we unzip just for extra robustness I guess.

Or we download device.zip somewhere else. I did not provide a full solution on my comment, just an idea and an example code of hash verification.
I'm certain a bunch of other details would need to be sorted as well.

But either way, @rborn-tx suggestion makes more sense.

@jsrc27 jsrc27 force-pushed the TOR-3743 branch 2 times, most recently from b7e59ab to 46ceffb Compare January 31, 2025 01:26
@jsrc27
Copy link
Collaborator Author

jsrc27 commented Jan 31, 2025

V2 submitted

EdTheBearded
EdTheBearded previously approved these changes Jan 31, 2025
Copy link
Contributor

@rborn-tx rborn-tx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising some other points.

@jsrc27
Copy link
Collaborator Author

jsrc27 commented Jan 31, 2025

V3 submitted

@rborn-tx
Copy link
Contributor

rborn-tx commented Feb 3, 2025

@jsrc27 Would you like to update the commit message before merging this?

Some users when auto-provisioning, power down the system once they see
the auto-provisioning service exit successfully. However, in some cases
this can cause the files created from the provisioning process to not be
fully written to disk, resulting in 0-byte files upon next boot.

Therefore, add some explicit syncs to improve this behavior. Also make use
of a temp directory to improve the overall atomicity of the process.

Related-to: TOR-3743

Signed-off-by: Jeremias Cordoba <[email protected]>
@jsrc27
Copy link
Collaborator Author

jsrc27 commented Feb 4, 2025

Commit message updated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants