Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add instance update and reactivation #38

Merged
merged 1 commit into from
May 3, 2021
Merged

Conversation

WilboMo
Copy link
Contributor

@WilboMo WilboMo commented Apr 13, 2021

Issue number:
Addresses #9

Description of changes:
Adds functionality to update drained Bottlerocket instances, reboot updated instances and change state to "ACTIVE" after successful upgrade.

Testing done:
Executed changes against ECS test cluster containing a mix of AL2 instances and 4 Bottlerocket version 1.0.5 hosts with service and non-service tasks present in the cluster:

Output (truncated for brevity):

---[ RESPONSE ]--------------------------------------
HTTP/1.1 200 OK
Connection: keep-alive
Content-Type: application/x-amz-json-1.1
Date: Thu, 15 Apr 2021 16:48:30 GMT
Server: Server
X-Amzn-Requestid: c3f71a22-c586-4c1f-b62b-d9724aefd926


-----------------------------------------------------
2021/04/15 09:48:30 {"CloudWatchOutputConfig":{"CloudWatchLogGroupName":"","CloudWatchOutputEnabled":false},"CommandId":"8865c81a-83ae-4c2e-958d-05dc107ef47e","Comment":"","DocumentName":"AWS-RunShellScript","DocumentVersion":"$DEFAULT","ExecutionElapsedTime":"PT0.484S","ExecutionEndDateTime":"2021-04-15T16:48:25.429Z","ExecutionStartDateTime":"2021-04-15T16:48:25.429Z","InstanceId":"i-##########","PluginName":"aws:runShellScript","ResponseCode":0,"StandardErrorContent":"16:48:25 \u001B[0m\u001B[34m[INFO] \u001B[0mRefreshing updates...\n","StandardErrorUrl":"","StandardOutputContent":"{\n  \"active_partition\": {\n    \"image\": {\n      \"arch\": \"x86_64\",\n      \"variant\": \"aws-ecs-1\",\n      \"version\": \"1.0.7\"\n    },\n    \"next_to_boot\": true\n  },\n  \"available_updates\": [\n    \"1.0.7\",\n    \"1.0.6\",\n    \"1.0.5\",\n    \"1.0.4\",\n    \"1.0.3\",\n    \"1.0.2\",\n    \"1.0.1\",\n    \"1.0.0\"\n  ],\n  \"chosen_update\": null,\n  \"most_recent_command\": {\n    \"cmd_status\": \"Success\",\n    \"cmd_type\": \"refresh\",\n    \"exit_status\": 0,\n    \"stderr\": \"\",\n    \"timestamp\": \"2021-04-15T16:48:25.835131261Z\"\n  },\n  \"staging_partition\": null,\n  \"update_state\": \"Idle\"\n}\n","StandardOutputUrl":"","Status":"Success","StatusDetails":"Success"}
2021/04/15 09:48:30 Instance i-########## updated to Bottlerocket 1.0.7

3 Bottlerocket instances running a Service and 1 with no tasks:

BR - 1
$ apiclient update check
04:27:15 [INFO] Refreshing updates...
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.7"
    },

BR - 2
$ apiclient update check
04:27:15 [INFO] Refreshing updates...
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.7"
    },

BR - 3

$ apiclient update check
04:27:59 [INFO] Refreshing updates...
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.7"
    },
 
BR - 4

$ apiclient update check
04:28:31 [INFO] Refreshing updates...
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.7"
    },

The previous, updated, instances were then replaced with 1.0.5 instances and a non-service tasks placed into the cluster, running on BR-1:

BR-1 
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.5"
    },

BR-2
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.8"
    },
    "next_to_boot": true
  },

 BR-3
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.8"
    },

BR-4
{
  "active_partition": {
    "image": {
      "arch": "x86_64",
      "variant": "aws-ecs-1",
      "version": "1.0.7"
    },

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@WilboMo WilboMo force-pushed the DrainInstances branch 2 times, most recently from e44ab4b to 430af79 Compare April 13, 2021 17:08
@WilboMo WilboMo force-pushed the UpgradeBRInstances branch 3 times, most recently from 7880e70 to 75bf459 Compare April 14, 2021 17:02
@WilboMo WilboMo force-pushed the DrainInstances branch 2 times, most recently from eb94b5e to 1f7371c Compare April 15, 2021 16:20
@WilboMo WilboMo force-pushed the DrainInstances branch 2 times, most recently from af45757 to e779565 Compare April 15, 2021 18:47
@WilboMo WilboMo marked this pull request as ready for review April 15, 2021 19:31
Copy link
Contributor

@webern webern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some stream of consciousness while reading.

updater/aws.go Outdated
}

err := changeInstanceState(u.ecs, aws.String(containerInstance), &u.cluster, "DRAINING")
if err != nil {
err2 := changeInstanceState(u.ecs, aws.String(containerInstance), &u.cluster, "ACTIVE")
if err2 != nil {
return fmt.Errorf("failed to undrain: %w", err2.Error())
return "", fmt.Errorf("failed to undrain: %w", err2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The saddest outcome.

updater/aws.go Outdated
@@ -62,20 +62,20 @@ func containsAttribute(attrs []*ecs.Attribute, searchString string) bool {

// checkAndDrain drains eligible container instances. Container instances are eligible if all running
// tasks were started by a service, or if there are no running tasks.
func (u *updater) checkAndDrain(containerInstance string) error {
func (u *updater) checkAndDrain(ec2ID string, containerInstance string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this function is taking an ec2ID and then returning it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EC2ID goes through some filtering to make sure its eligible for update. So the EC2ID It returns is one thats been vetted and cleared to proceed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, I don't see ec2ID being used anywhere in the method ? If the function does not return an error doesn't that mean that ec2ID is ready for update ?

updater/aws.go Outdated
}
log.Printf("Response: %s", *resp.Status)
if *resp.Status == "Success" {
log.Printf("Updated %s succeessfully", instanceID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? Or does resp.Status only mean that the SSM Command invocation succeeded? (Typo succeessfully)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my mistake. I have another idea that may do the trick.

updater/aws.go Outdated
@@ -218,3 +218,53 @@ func (u *updater) checkCommandOutput(commandID string, instanceIDs []string) ([]
}
return updateCandidates, nil
}

func (u *updater) checkUpdateResult(commandID string, instanceID string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is checkSsmCommandResult a better name, or will this function eventually check more deeply that the update actually occurred?

updater/aws.go Outdated
return nil
}

func (u *updater) getVersion(commandID string, instanceIDs []string) (map[string]string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe getActiveVersions is a better name? Also maybe explain these functions with doc comments.

updater/main.go Outdated
@@ -78,7 +78,7 @@ func _main() error {
return err
}

instancesToUpdate, err := up.checkCommandOutput(commandID, instances)
instancesToUpdate, err := up.checkUpdateStatus(commandID, instances)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is still a little weird. This might read better:

instancesToUpdate, err := up.listInstancesToUpdate(commandID, instances)

updater/main.go Outdated
Comment on lines 99 to 111
if checkedEC2ID == "" {
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using this empty string to carry some meaning is a little strange. You could use the ok bool pattern instead.

updater/main.go Outdated
}

updatedVersions, err := up.getVersion(updateStatus, checkedEC2IDAsSlice)
for _ec2ID, version := range updatedVersions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ec2ID: the underscore isn't needed since you are using the variable.

updater/aws.go Outdated
@@ -62,20 +62,20 @@ func containsAttribute(attrs []*ecs.Attribute, searchString string) bool {

// checkAndDrain drains eligible container instances. Container instances are eligible if all running
// tasks were started by a service, or if there are no running tasks.
func (u *updater) checkAndDrain(containerInstance string) error {
func (u *updater) checkAndDrain(ec2ID string, containerInstance string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, I don't see ec2ID being used anywhere in the method ? If the function does not return an error doesn't that mean that ec2ID is ready for update ?

updater/main.go Outdated
Comment on lines 106 to 134
if err != nil {
return err
}

err = up.checkUpdateResult(updateCommandID, checkedEC2ID)
if err != nil {
return err
}

err = changeInstanceState(up.ecs, aws.String(containerARN), flagCluster, "ACTIVE")
if err != nil {
return err
}

updateStatus, err := up.sendCommand(checkedEC2IDAsSlice, "apiclient update check")
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just worried about all these return on err. Do we really like to return if anything failed in between ?
I think we should just:
1 . make sure to mark Instance "Active".
2. continue updating next instance and worry about updating this instance in next program iteration.
Similar things for lot many errors, I think we need to scan all the errors and proceed for non fatal error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right (regarding checkAndDrain's handling of the EC2ID)!

On your notes regarding returning err

  1. Is handled by the functions/methods themselves.
  2. I like the idea of scanning for fatals, I'll create a new issue to add that in. For now, what about capturing the error output and just printing it to log but not returning an error? This way the information about problems is still provided but it doesn't kill the operation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll create a new issue to add that in.

I will do this, because I would not like to rush on how each error is handled. We would have to think of all the cases that can cause an error and then make a decision accordingly. For now happy path works, so we can get this merged and not block on this.

updater/aws.go Outdated Show resolved Hide resolved
updater/main.go Outdated
return nil
}

checkedEC2IDAsSlice := []string{checkedEC2ID}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we are adding checkedEC2ID to a list ? Due this we had to run a loop inside getVersion. We are updating only one instance at a time why do we need a list ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because some functions like sendCommand expect a slice as an argument. This ensures those functions can use the EC2ID we're operating on throughout the process.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we can just pass checkedEC2IDAsSlice only to sendCommand and we can directly use ec2ID for getVersion. So that we don't have to unnecessarily run a loop when we know there is only one item.

updater/aws.go Outdated
Comment on lines 122 to 140
if desiredState == "DRAINING" {
err := waitUntilDrained(ecsClient, *containerInstance, cluster)
if err != nil {
return err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting issue ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run format again, but I have it on save now and I did a go fmt before pushing 🤔 .

updater/main.go Outdated
return nil
}

checkedEC2IDAsSlice := []string{checkedEC2ID}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case we can just pass checkedEC2IDAsSlice only to sendCommand and we can directly use ec2ID for getVersion. So that we don't have to unnecessarily run a loop when we know there is only one item.

Copy link
Contributor

@srgothi92 srgothi92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good.

updater/main.go Outdated Show resolved Hide resolved
@WilboMo WilboMo force-pushed the DrainInstances branch 3 times, most recently from c13d549 to b2d845c Compare April 16, 2021 20:34
@WilboMo WilboMo force-pushed the UpgradeBRInstances branch 2 times, most recently from b5e7a25 to 563c69f Compare April 16, 2021 21:05
updater/aws.go Outdated
@@ -236,3 +237,49 @@ func (u *updater) checkSSMCommandOutput(commandID string, instanceIDs []string)
}
return updateCandidates, nil
}

// getActiveVersion queries a Bottlerocket instnace to determine the active version in use by the instance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// getActiveVersion queries a Bottlerocket instnace to determine the active version in use by the instance.
// getActiveVersion queries a Bottlerocket instance to determine the active version in use by the instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that!

updater/main.go Outdated
Comment on lines 103 to 104
ec2IDAsSlice := []string{ec2ID}
if len(ec2IDAsSlice) != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic here is a bit weird. Since line 103 constructs a slice with exactly one element, line 104 will always evaluate to true.

Aside from removing the if, it's also not necessary to say "AsSlice" to record the type of the variable in the name.

Suggested change
ec2IDAsSlice := []string{ec2ID}
if len(ec2IDAsSlice) != 0 {
ec2IDs := []string{ec2ID}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I had considered doing away with the if len(ec2IDAsSlice) for the reasons you pointed out, but was concerned I may not have been thinking of a situation where it would evaluate false and left it for safety. I'll make the suggested change!

updater/main.go Outdated
Comment on lines 106 to 123
if err != nil {
log.Printf("%#v", err)
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a pattern here of continue-on-error, which I think is probably not the right pattern. If there are errors in updating, we risk taking out the entire cluster and leaving no remaining capacity.

  • What can we do in terms of recovery if an error occurs?
  • If no recovery is possible, should we abort entirely?
  • If no recovery is possible, how do we notify the cluster administrator that some action needs to be taken, and what that action is?

Copy link
Contributor Author

@WilboMo WilboMo Apr 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • What can we do in terms of recovery if an error occurs?

I'm not yet sure on what recovery actions we can take for some of these failures. The apiclient update apply -r could be re-attempted on the next pass and it may work just fine. But things like reboot failures or an instance entering a non Ok state may be beyond the scope of what Updater should act on. We could add in logic to attempt restarting it again, describe the instance or take some other action but that would require an escalation in IAM permissions to EC2 that I don't think, off the cuff, is a good idea.

  • If no recovery is possible, should we abort entirely?

In my first iteration of this bit I had it returning err rather than logging it so that it would abort the entire operation. However, it was pointed out by another commenter that we wanted it to proceed with the other instances to be updated as one failure doesn't necessarily disqualify the rest and that we could just let Updater try again on the next pass.

However, if they were all to fail taking down an entire cluster would be horrible as you rightly point out. I'll think on this but I think it may be the safest option, until better error handling is added later on, that we abort on a failure to preserve the integrity of the cluster as a whole.

  • If no recovery is possible, how do we notify the cluster administrator that some action needs to be taken, and what that action is?

I had debated adding in additional log statements here in tandem with the functions already defined error message reporting to provide more context on the problem. However, I ended up removing that before posting the PR because I was concerned the messages may be too verbose.

Would it better to remove the log messages from their functions and allow them to be defined in main.go when the functions are called and go through the if err != nil checks? This would allow the error message to be tailor made for the operation and give more clear indication to users where things failed and why.

Copy link
Contributor

@samuelkarp samuelkarp Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not yet sure on what recovery actions we can take for some of these failures.

Depending on the type of failure, it may be reasonable to put the container instance back to ACTIVE and continue on, or to continue on directly. For example:

  • If our call to drain the container instance fails due to something like a permission issue and the instance stays ACTIVE: it's safe to continue forward since no capacity was removed
  • If our call to check update status fails: it's safe to undrain the instance and continue forward, unless the undrain fails

However, if they were all to fail taking down an entire cluster would be horrible as you rightly point out. I'll think on this but I think it may be the safest option, until better error handling is added later on, that we abort on a failure to preserve the integrity of the cluster as a whole.

I think it's worth thinking about it this way: any failure which results in a reduction in cluster capacity that cannot be successfully restored (undrained) should probably result in us aborting with a clear error message to the administrator.

Would it better to remove the log messages from their functions and allow them to be defined in main.go when the functions are called and go through the if err != nil checks? This would allow the error message to be tailor made for the operation and give more clear indication to users where things failed and why.

It's definitely worth revisiting the logging that we have before the updater is really ready to be used, but it doesn't have to be in the context of this PR.

In general: it's good to embed context into errors. Newer versions of Go (since 1.13 if I'm remembering correctly) have a native error-wrapping feature that allows you to add context. So you can do something like this:

if err != nil {
	return fmt.Errorf("failed to undrain: %w", err)
}

and then have that error bubble all the way up to the top.

However, it can also be useful to add local logs with additional local context. Structured logging implementations (like logrus) enable contextual fields to be added and automatically included:

myLog := logrus.WithField("cluster", cluster)
// ... omitted ...
if err != nil {
	myLog.WithError(err).Error("failed to undrain")
	return fmt.Errorf("failed to undrain: %w", err)
}

updater/main.go Outdated
Comment on lines 132 to 134
if len(updateResult) == 0 {
log.Printf("Update successfully applied")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if len(updateResult) != 0? What should happen then?

updater/main.go Outdated
Comment on lines 128 to 129
updateResult, _ := u.checkSSMCommandResult(updateStatus, ec2IDAsSlice)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the error is unintentionally ignored here.

Suggested change
updateResult, _ := u.checkSSMCommandResult(updateStatus, ec2IDAsSlice)
if err != nil {
updateResult, err := u.checkSSMCommandResult(updateStatus, ec2IDAsSlice)
if err != nil {

updater/main.go Outdated
Comment on lines 140 to 142
if len(updatedVersion) != 0 {
log.Printf("Instance %s updated to Bottlerocket: %s", ec2ID, updatedVersion)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should happen if len(updatedVersion) == 0?

updater/aws.go Outdated
Comment on lines 241 to 217
func (u *updater) getActiveVersion(commandID string, instanceID string) (string, error) {
resp, err := u.ssm.GetCommandInvocation(&ssm.GetCommandInvocationInput{
CommandId: aws.String(commandID),
InstanceId: aws.String(instanceID),
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like GetCommandInvocation is called twice for the same command ID; once here and once in checkSSMCommandResult. In checkSSMCommandResult, there is further logic applied to map the result into a set of candidates. That looks like a bit of an anti-pattern, since we can instead make a single remote call and have two separate ways of interpreting the results.

There are a couple ways to restructure the code to achieve that, for example:

  • Instead of having two calls to GetCommandInvocation, have one single place where that is called and have the raw bytes returned such that they can be unmarshalled in two separate places with different interpretation
  • Instead of having two functions that unmarshal the result differently, have a single function that calls GetCommandInvocation, unmarshalls the results into a struct, and then returns that struct. Then you can interpret the struct appropriately in based on your needs.

This would be a bit of refactoring, but is appropriate to do.

@WilboMo WilboMo force-pushed the UpgradeBRInstances branch 3 times, most recently from 02929ef to bd290af Compare April 28, 2021 19:32
updater/main.go Outdated
err = u.activateInstance(aws.String(containerInstance))
if err != nil {
log.Printf("instance %s failed to return to ACTIVE after reboot. Aborting update operations.", ec2ID)
return fmt.Errorf("%#v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there's no additional error message so there's no need for fmt.Errorf

Suggested change
return fmt.Errorf("%#v", err)
return err

updater/main.go Outdated
Comment on lines 132 to 148
updateStatus, err := u.sendCommand(ec2IDs, "apiclient update check")
if err != nil {
log.Printf("%#v", err)
}

updateResult, err := u.getCommandResult(updateStatus, ec2ID)
if err != nil {
log.Printf("%#v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If sending the command "apiclient update check" fails, getCommandResult will probably also fail. Since we've already returned the instance to "ACTIVE", we can continue here.

Suggested change
updateStatus, err := u.sendCommand(ec2IDs, "apiclient update check")
if err != nil {
log.Printf("%#v", err)
}
updateResult, err := u.getCommandResult(updateStatus, ec2ID)
if err != nil {
log.Printf("%#v", err)
}
updateStatus, err := u.sendCommand(ec2IDs, "apiclient update check")
if err != nil {
log.Printf("%#v", err)
continue
}
updateResult, err := u.getCommandResult(updateStatus, ec2ID)
if err != nil {
log.Printf("%#v", err)
continue
}

updater/aws.go Outdated
}
// getActiveVersion unmarshals GetCommandInvocation output to determine the active version of a Bottlerocket instance.
// Takes an SSM Command ID and EC2ID as parameters and returns the active version in use.
func getActiveVersion(commandOutput []byte, instanceID string) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like the instanceID parameter is used

Suggested change
func getActiveVersion(commandOutput []byte, instanceID string) (string, error) {
func getActiveVersion(commandOutput []byte) (string, error) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch. Thanks!

updater/aws.go Outdated
type updateCheckResult struct {
UpdateState string `json:"update_state"`
}
func getUpdateState(commandOutput []byte) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: getUpdateState implies that the function would return some sort of state indicator, but instead it's just returning whether an update is available

Suggested change
func getUpdateState(commandOutput []byte) bool {
func isUpdateAvailable(commandOutput []byte) bool {

updater/main.go Outdated
@@ -96,6 +104,56 @@ func _main() error {
continue
}
log.Printf("Instance %s drained", ec2ID)

ec2IDs := []string{ec2ID}
_, err = u.sendCommand(ec2IDs, "apiclient update apply -r")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: short flags are present for convenience when typing into a terminal, but when scripting it's more clear to use long flags

Suggested change
_, err = u.sendCommand(ec2IDs, "apiclient update apply -r")
_, err = u.sendCommand(ec2IDs, "apiclient update apply --reboot")

updater/main.go Show resolved Hide resolved
updater/main.go Outdated
Comment on lines 144 to 172
if isUpdateAvailable(updateResult) {
log.Printf("Instance %s did not update. Manual update advised.", ec2ID)
} else {
log.Printf("Instance %s updated successfully", ec2ID)
}

updatedVersion, err := getActiveVersion(updateResult)
if err != nil {
log.Printf("%#v", err)
}
if len(updatedVersion) != 0 {
log.Printf("Instance %s updated to Bottlerocket: %s", ec2ID, updatedVersion)
} else {
log.Printf("Unable to verify updated version. Manual verification of %s required.", ec2ID)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic enables the following scenario to occur:

  • Instance updates
  • Update is still available (perhaps because another update's wave is now visible to the instance)
  • Log "Instance %s did not update. Manual update advised."
  • Read the active version
  • Log "Instance %s updated to Bottlerocket: %s"

Or this scenario:

  • Instance fails to update
  • Update is still available
  • Log "Instance %s did not update. Manual update advised."
  • Read the active version
  • Log "Instance %s updated to Bottlerocket: %s"

In the first case, even though the instance did update the log claims it didn't. In the second case, even though the instance didn't update, the last log line claims it did.

These should probably not happen. Either the instance updated or it didn't; we shouldn't be able to claim both at the same time. There are a couple ways this could be changed:

  1. Remove the check for isUpdateAvailable, since the question is really "did the version change". Instead, store the original version and compare to the new version.
  2. Adjust the logic such that if a "manual update advised" we skip the second check about the version.
  3. Adjust the language in the second change so it doesn't say "updated to" and instead says something more like "Instance %s is running Bottlerocket %s".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment! I think PR #47 will enable option 1 nicely (which I think is the right option long term) as we could cast the version to the struct introduced by @srgothi92 in that pull request and use that to compare with during the update process. I could implement the version check in a different way, but unless I'm mistaken, it would make sense to move it into that struct when its merged anyways.

I considered waiting until that PR was merged and revising my PR off that, but in the interest of moving forward I'm going to implement the second and third options you lay out, add a TODO and open an issue to put the version comparison in a subsequent PR.

updater/aws.go Outdated
Comment on lines 260 to 258
if err != nil {
log.Printf("failed to unmarshal command invocation output: %#v", err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the error is ignored here.

Suggested change
if err != nil {
log.Printf("failed to unmarshal command invocation output: %#v", err)
}
if err != nil {
log.Printf("failed to unmarshal command invocation output: %#v", err)
return "", err
}

updater/aws.go Outdated
Comment on lines 236 to 240
switch updateState.UpdateState {
case "Available":
return true
}
return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This can be simplified

Suggested change
switch updateState.UpdateState {
case "Available":
return true
}
return false
return updateState.UpdateState == "Available"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats awesome, thanks for pointing that out. Didn't know that'd return bool.

@WilboMo WilboMo force-pushed the UpgradeBRInstances branch 2 times, most recently from a0a5cd9 to e755b66 Compare April 30, 2021 01:48
updater/main.go Outdated
Comment on lines 146 to 162
updateState, empty := isUpdateAvailable(updateResult)
if updateState {
log.Printf("Instance %s did not update. Manual update advised.", ec2ID)
continue
} else if empty != "" {
log.Printf("Unable to determine update result. Manual verification of %s required", ec2ID)
continue
} else {
log.Printf("Instance %s updated successfully", ec2ID)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While testing my changes to address @samuelkarp's comments I encountered a scenario where an empty Json would trigger a success result. I added this in to prevent that scenario from giving false positives.

updater/aws.go Outdated
type updateCheckResult struct {
UpdateState string `json:"update_state"`
}
func isUpdateAvailable(commandOutput []byte) (bool, string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A string result here is very strange. An error is a common mechanism for communicating failure.

Alternately, success and failure can be communicated with a bool (conventionally, an ok variable).

Suggested change
func isUpdateAvailable(commandOutput []byte) (bool, string) {
func isUpdateAvailable(commandOutput []byte) (bool, error) {

Copy link
Contributor Author

@WilboMo WilboMo Apr 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally used error but was concerned using it in an elif statement as I had intended would be anti-pattern. I reckon I overthought that a bit as I don't think I even need to use it in an elif for that to work. Thanks for the comment!

updater/aws.go Outdated
updateCandidates = append(updateCandidates, v)
}
// getActiveVersion unmarshals GetCommandInvocation output to determine the active version of a Bottlerocket instance.
// Takes an SSM Command ID and EC2ID as parameters and returns the active version in use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this comment is out of date

Adds functionality to perform Bottlerocket version upgrades and
reactivates drained instances after successful upgrade.

[Issue: #9]
Copy link
Contributor

@samuelkarp samuelkarp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@WilboMo WilboMo merged commit b462c32 into develop May 3, 2021
@WilboMo WilboMo deleted the UpgradeBRInstances branch May 3, 2021 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants