forrestj edited this page Mar 23, 2016 · 8 revisions

Sometimes we need to roll out new docks for all customers. This short guide should help you through the process.

Preparation

  1. Coordinate timing with Product and Support teams.
  2. Confirm that customer messaging has been sent.
  3. Make yourself primary on PagerDuty from before you start the process until the next morning.
  4. Wait until usage is low (as late as you are willing to work).
  5. Open the Datadog Container Mayhem Dashboard and write down the "Users Container Running" number.
  6. In the #general Slack channel, notify @here that you are starting to roll out the docks.

Rolling out

The process of rolling out docks is not ideal right now. Many orgs have only 1 dock. If we tear down that dock, we will receive a lot of Rollbar errors like: "no resources available to schedule a container for an org=233238".

These Rollbar errors currently trigger PagerDuty alerts, which you should acknowledge and then resolve after 15 minutes (by then the replacement docks should be up and the errors will have resolved themselves). In the future we hope to improve this process by scaling out each org before the rollout and scaling it back down afterwards.

  1. Mark 1 dock per org as unhealthy at a time:

       docks aws list -e delta | grep large | grep -v "$DATE" | sort -u -n -k 4 | awk '{print $6}' | xargs -I % bash -c "echo y|docks unhealthy % -e delta"

     where $DATE is today's date in a format like "Thu Mar 17".

  2. Wait 10-15 minutes and run

       docks aws list -d delta

     If any docks are listed, figure out why they did not come up:

  • ssh into the dock
  • cat /var/log/user-script-dock-init.log
  • if you see no errors, run docker ps
  • if the weave container is up, just add the dock to mavis
  • if not, we missed the up event; just kill the dock with docks kill -e delta <ip>
  3. Repeat steps 1 and 2 every 15 to 30 minutes until all docks are rolled.
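
The steps above can be sketched as a few shell helpers. This is only a sketch under stated assumptions, not an official part of the tooling: the function names rollout_date, preview_docks and triage_dock are hypothetical, and the date format is guessed from the "Thu Mar 17" example (a zero-padded day is assumed). The preview helper runs the step 1 pipeline without its final xargs stage, so it only prints what would be marked unhealthy. Reading the pipeline, grep -v presumably skips docks already launched today, and sort -u on the fourth column presumably keeps one dock per org.

```shell
#!/usr/bin/env bash
# Sketch only: the helper names below are hypothetical, not part of the docks CLI.

# Today's date in the "Thu Mar 17" style expected by the step 1 filter.
# Assumption: the listing zero-pads the day (%d); switch to %e if it pads with a space.
rollout_date() {
  LC_ALL=C date '+%a %b %d'
}

# Dry run of step 1: the same pipeline without the final xargs stage,
# so it only prints the docks that would be marked unhealthy.
preview_docks() {
  docks aws list -e delta | grep large | grep -v "$(rollout_date)" \
    | sort -u -n -k 4 | awk '{print $6}'
}

# Step 2 triage for one dock that did not come up.
triage_dock() {
  ip=$1
  # Show the init log so you can check it for errors by eye.
  ssh "$ip" 'cat /var/log/user-script-dock-init.log'
  # Assumption: the weave container's name contains "weave".
  if ssh "$ip" 'docker ps --format "{{.Names}}"' | grep -q weave; then
    echo "weave is up on $ip: add the dock to mavis"
  else
    echo "weave is not up on $ip: docks kill -e delta $ip"
  fi
}
```

Run preview_docks before the real command to check the list looks sane. triage_dock only prints a suggestion; you still read the log and decide what to do yourself.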

Post rollout checks

  1. Make sure there are no PagerDuty alerts. Wait at least 30 minutes after you roll out the docks. This wait is important because we have canary builds that might trigger PagerDuty later.
  2. Open the Datadog Container Mayhem Dashboard and compare the "Users Container Running" number with the one you wrote down earlier. They should be roughly the same.
  3. Go to the Runnable and CodeNow orgs and check whether any containers are still migrating. Try rebuilding containers manually. Run the integration test.
  4. Notify @here in the #general Slack channel that you are finished.
  5. Ask someone from Product to verify customers' containers.
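
For check 2, "roughly the same" can be turned into a number. The helper below is a minimal sketch: the name within_tolerance and the default 5% threshold are assumptions, not values the team has agreed on. It takes the two readings you copied from the dashboard by hand and succeeds when they differ by no more than the given percentage.

```shell
# Hypothetical helper: succeeds when AFTER is within PERCENT of BEFORE.
# Usage: within_tolerance BEFORE AFTER [PERCENT]
# PERCENT defaults to 5, which is an assumed threshold.
within_tolerance() {
  before=$1
  after=$2
  pct=${3:-5}
  # Absolute difference between the two readings.
  if [ "$before" -ge "$after" ]; then
    diff=$(( before - after ))
  else
    diff=$(( after - before ))
  fi
  # Integer form of: diff / before <= pct / 100.
  [ $(( diff * 100 )) -le $(( before * pct )) ]
}
```

For example, with made-up readings of 412 before and 405 after, within_tolerance 412 405 succeeds. A failure is a prompt to look closer, not proof that something broke.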