Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add create-recovery-plan command #622

Merged
merged 5 commits into from
Jul 10, 2023
Merged

Conversation

selzoc
Copy link
Member

@selzoc selzoc commented Jul 3, 2023

This command will allow users to scan a deployment for problems, then write out a YAML-formatted recovery plan, to be used later in the recover command. The CLI prompts per instance group and then per problem type.

Example recovery plan:

instance_groups_plan:
- name: cloud_controller_ng
  max_in_flight_override: 66%
  planned_resolutions:
    missing_vm: recreate_vm_without_wait
- name: diego_cell
  planned_resolutions:
    missing_vm: delete_vm_reference
- name: router
  max_in_flight_override: "2"
  planned_resolutions:
    unresponsive_agent: delete_vm

Note Relies on a director with this PR merged. The command fails with "Director does not support this command. Try 'bosh cloud-check' instead" if the director it is targeting does not return instance groups for problems

This command will allow users to scan a deployment for problems, then
write out a YAML-formatted recovery plan, to be used later in the
`recover` command.

[#185483613]

Authored-by: Chris Selzo <[email protected]>
The "plan" is not necessary for resolving problems via the director

[#185483613]

Authored-by: Chris Selzo <[email protected]>
The current value of `max_in_flight` is fetched from the director and
used as a default.  If the value does not change from the default, no
`max_in_flight_override` is written to the recovery plan.

[#185483613]

Authored-by: Chris Selzo <[email protected]>
@klakin-pivotal
Copy link
Contributor

So, I ran the unit tests, and got one failure in create_recovery_plan_test.go. I haven't spent much time actually reviewing the code, so I currently have no opinion about it.

  • I checked out create-recovery-plan-185483613
  • Started up Docker, then started up the bosh/cli image like so: docker run -it --volume=$HOME/workspace/bosh-cli:/bosh-cli bosh/cli:latest /bin/bash
  • And (after commenting out the lint stuff in the script) ran bin/test-unit, which emitted the following failure
root@b9e0db7204a6:/bosh-cli# ./bin/test-unit
<unimportant spew elided>
Will skip:
  ./acceptance
  ./integration
[1688428279] blobstore - 11/11 specs ••••••••••• SUCCESS! 20.861474ms PASS
[1688428279] cloud - 63/63 specs ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• SUCCESS! 80.240399ms PASS
[1688428279] cmd - 908/908 specs <success dots elided>
------------------------------
• [FAILED] [0.001 seconds]
CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers
/bosh-cli/cmd/create_recovery_plan_test.go:230

  [FAILED] Expected
      <map[string]string | len:2>: {
          "missing_vm": "reboot_vm",
          "mount_info_mismatch": "reattach_disk",
      }
  to have {key: value}
      <map[interface {}]interface {} | len:1>: {
          <string>"missing_vm": <string>"recreate_vm",
      }
  In [It] at: /bosh-cli/cmd/create_recovery_plan_test.go:254 @ 07/03/23 23:53:59.955
------------------------------
<success dots elided>

Summarizing 1 Failure:
  [FAIL] CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers
  /bosh-cli/cmd/create_recovery_plan_test.go:254

Ran 908 of 908 Specs in 27.500 seconds
FAIL! -- 907 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestReg (27.72s)
FAIL

Copy link
Contributor

@nouseforaname nouseforaname left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if removing the switch case is really necessary but figured I'd at least leave a comment about it. Rest, lgtm.

@jpalermo
Copy link
Member

jpalermo commented Jul 4, 2023

Given this is a set of work that is under fairly active development. Is this something we should mark as experimental?

I say that without any idea how we'd go about doing that...

@selzoc
Copy link
Member Author

selzoc commented Jul 5, 2023

Given this is a set of work that is under fairly active development. Is this something we should mark as experimental?

I say that without any idea how we'd go about doing that...

The command does print a warning and exits if the director doesn't support it. I don't think it's particularly experimental, but we also haven't done the other half of the command yet. Perhaps we could hold off on merging until bosh recover is also complete?

The `%v` format string will do the right thing when the `max_in_flight`
value is either a string or an integer.

(of course, this just moves the type switch into the standard library)

[#185483613]

Authored-by: Chris Selzo <[email protected]>
Go ranges over map keys in a random order by design.  The `FakeUI` test
framework requires you to give answers to ui prompts in a specific
order, and does not allow you to react to the prompts with a given
input.

[#185483613]

Authored-by: Chris Selzo <[email protected]>
@selzoc
Copy link
Member Author

selzoc commented Jul 5, 2023

So, I ran the unit tests, and got one failure in create_recovery_plan_test.go. I haven't spent much time actually reviewing the code, so I currently have no opinion about it.

  • I checked out create-recovery-plan-185483613
  • Started up Docker, then started up the bosh/cli image like so: docker run -it --volume=$HOME/workspace/bosh-cli:/bosh-cli bosh/cli:latest /bin/bash
  • And (after commenting out the lint stuff in the script) ran bin/test-unit, which emitted the following failure
root@b9e0db7204a6:/bosh-cli# ./bin/test-unit
<unimportant spew elided>
Will skip:
  ./acceptance
  ./integration
[1688428279] blobstore - 11/11 specs ••••••••••• SUCCESS! 20.861474ms PASS
[1688428279] cloud - 63/63 specs ••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• SUCCESS! 80.240399ms PASS
[1688428279] cmd - 908/908 specs <success dots elided>
------------------------------
• [FAILED] [0.001 seconds]
CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers
/bosh-cli/cmd/create_recovery_plan_test.go:230

  [FAILED] Expected
      <map[string]string | len:2>: {
          "missing_vm": "reboot_vm",
          "mount_info_mismatch": "reattach_disk",
      }
  to have {key: value}
      <map[interface {}]interface {} | len:1>: {
          <string>"missing_vm": <string>"recreate_vm",
      }
  In [It] at: /bosh-cli/cmd/create_recovery_plan_test.go:254 @ 07/03/23 23:53:59.955
------------------------------
<success dots elided>

Summarizing 1 Failure:
  [FAIL] CreateRecoveryPlanCmd Run problems are found [It] writes a recovery plan based on answers
  /bosh-cli/cmd/create_recovery_plan_test.go:254

Ran 908 of 908 Specs in 27.500 seconds
FAIL! -- 907 Passed | 1 Failed | 0 Pending | 0 Skipped
--- FAIL: TestReg (27.72s)
FAIL

Go returning map keys in random order when ranging strikes again! The FakeUI test framework does not allow for very much flexibility in responding to choices for a given input, you just have to give a slice of choices in sequence. This does not work out if the choices are presented in different order each time.

Fixed by sorting the map keys for the problems by type, then ranging over the sorted slice to access the map in a consistent way. Ran the test 150 times over lunch and never had a failure with the fix.

Note that for the user experience, it doesn't really matter, this is purely a testing problem. There are only 6 problem types, so sorting a list of at most 6 does not seem like an expensive operation vs. refactoring how the FakeUI works.

Resolved in dd7ff53

@selzoc
Copy link
Member Author

selzoc commented Jul 5, 2023

@klakin-pivotal do you mind testing the fix? Want to make sure it wasn't just fixed on my machine :).

@klakin-pivotal
Copy link
Contributor

@selzoc I checked out dd7ff53 (which is currently the latest commit on this branch), ran the tests and they passed. So, it looks like you successfully fixed the busted test.

@jpalermo
Copy link
Member

jpalermo commented Jul 6, 2023

Marking it "experimental" is more of a warning that the behavior might change before we're done with this track of work. For example, maybe we need to change the structure of the recovery plan before we're done.

@beyhan
Copy link
Member

beyhan commented Jul 6, 2023

Another question I have here is how this relate to the existing cloud-check command? Should we deprecate it in favour of this one at some point? Maybe we can describe the plans around the recover work in https://github.com/cloudfoundry/bosh/discussions like we did for other topics.

@selzoc
Copy link
Member Author

selzoc commented Jul 6, 2023

Greate point @beyhan - I've started a discussion here

@cunnie cunnie self-requested a review July 7, 2023 17:47
@cunnie
Copy link
Member

cunnie commented Jul 7, 2023

✅ CLI vs. old Director

Verified that it emits a proper error message when used with an older Director:

out/bosh create-recovery-plan /tmp/junk3.yml
...

Director does not support this command.  Try 'bosh cloud-check' instead

Copy link
Member

@cunnie cunnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (though were I to do this I'd prefer 1-2 commits rather than 5).

@selzoc
Copy link
Member Author

selzoc commented Jul 7, 2023

Looks good to me (though were I to do this I'd prefer 1-2 commits rather than 5).

Would you like to click the "squash and merge" button? That'll guarantee 1 commit :)

Copy link
Member

@lnguyen lnguyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lnguyen lnguyen merged commit 27cd7c0 into main Jul 10, 2023
2 checks passed
@lnguyen lnguyen deleted the create-recovery-plan-185483613 branch July 10, 2023 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

Successfully merging this pull request may close these issues.

7 participants