This repository has been archived by the owner on Sep 21, 2020. It is now read-only.

GPII-2996 GPII-2995: CI for GCP #80

Closed
wants to merge 17 commits

Conversation

@mrtyler (Contributor) commented Aug 1, 2018

This PR adds:

  1. Pipeline steps for gcp, parallel to steps for aws
  2. A guard rail to prevent accidental operation in prd, a la GPII-3199
  3. Documentation on how to download credentials for use by CI
  4. A helper task to copy previously-downloaded credentials to the place where exekube expects to find them
  5. An attempt to normalize names of rake tasks, and some re-ordering of tasks in entrypoint.rake

I think 1 and 2 should be mostly uncontroversial. See https://issues.gpii.net/browse/GPII-2996?focusedCommentId=33804&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-33804 for a little more about the "tagged runner" strategy.
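
For illustration, the side-by-side, tag-routed jobs look roughly like this (job and stage names below are placeholders, not necessarily the ones used in this PR):

```yaml
# Sketch only: each cloud gets its own job, routed to the matching
# Specific Runner via tags, so the aws and gcp steps can run in parallel.
deploy-aws-stg:
  stage: deploy
  tags:
    - aws
  script:
    - rake deploy   # placeholder task name

deploy-gcp-stg:
  stage: deploy
  tags:
    - gcp
  script:
    - rake deploy   # placeholder task name
```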

2 and 3 may warrant further discussion, particularly in light of recent conversations about secrets handling.

For now, I'm letting CI use the credentials for projectowner@ in each Project. A dedicated IAM would be better. I'd prefer to wait for @amatas's work in #60 and/or for https://issues.gpii.net/browse/GPII-2947 so that we have a place to put Terraform code to manage IAMs, but I can whip up something by hand if the team thinks it's worth doing.

BTW @amatas I was unable to create credentials for stg and prd because:

  • There is no Project gpii-gcp-stg. Instead there is a gpii-stg.
  • There is a gpii-gcp-prd, but it doesn't have a projectowner@ IAM.

Perhaps these are expected until your work in #60 is complete?

4 may cause merge conflicts for in-flight branches (mostly @amatas I think). Sorry about that, but it helped me reason about the changes I was making.

Mostly I tried to reduce the number of verbs in task names. Let me know if I made the names better or worse :).

**The next problem**

Directories created inside the exekube container (even those created implicitly, like volume mounts for .config/<env>/gcloud) are created with ownership root.root. This prevents rake clobber from cleaning up these directories (https://gitlab.com/gpii-ops/gpii-infra/-/jobs/85893841), and prevents secrets.rb from writing secrets files (https://gitlab.com/gpii-ops/gpii-infra/-/jobs/85889990).

I do not see this behavior on my machine (macOS). The CI worker runs CentOS 7.

My guess is this is because commands like gcloud and secrets-fetch run as root inside the exekube container, so files created on mounted volumes inside the container "leak" back to the host with ownership root. This may be fixable by adding and then using a role user inside the container instead of defaulting to USER root, but that approach can get complicated so I'm stopping here to ask for advice.
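
To make the failure mode concrete, here is a rough docker-compose sketch of the mounts involved (mount paths are illustrative, not the exact exekube configuration):

```yaml
# Illustrative only: if ./.config/dev/gcloud does not already exist on the
# host, the Docker daemon creates it owned by root.root, and anything the
# container (which runs as USER root) writes into these mounts lands on
# the host owned by root as well.
services:
  exekube:
    image: gpii/exekube:0.4.0-google   # tag per this PR; mount paths are assumed
    volumes:
      - ./.config/dev/gcloud:/root/.config/gcloud
      - ./secrets:/project/secrets
```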

@amatas (Contributor) left a comment

Everything looks fine to me; I've added some minor comments.

Service account creation is something that will be handled by the common part. However, you can create one with the same name and features, and it will be imported into the TF state before applying the common part in order to avoid conflicts. (There are many already-created resources that must be imported as a preliminary step before the first execution of the common part.)

As for merging the entrypoint.rake file, that looks like it will be painful regardless.

stage: setup
tags:
- aws
Contributor

It's only a cosmetic change, but is there a particular reason why the indentation of the tags section is different from the script section?

Contributor Author

¯(°_o)/¯

(This is fixed in my new PR, #92.)

- git tag "deploy-aws-stg-$DATESTAMP"
# gitlab is not clever enough to clean up an added remote and git complains
# if we add a remote that already exists.
- git remote | grep -q "^origin-rw" || git remote add origin-rw git@gitlab.com:gpii-ops/gpii-infra
Contributor

Although this works, another option could be setting the origin URL using:
`git remote set-url origin git@gitlab.com:gpii-ops/gpii-infra`
It's just an idea.

Contributor Author

I couldn't find it in my notes, but I think I considered this idea when I wrote this task originally. I decided it was safer to create a separate origin rather than fight for control of origin with Gitlab, which may have certain expectations about the origin URL that we might break.

I checked and the origin url Gitlab uses is https://gitlab-ci-token:<PASSWORD>@gitlab.com/gpii-ops/gpii-infra.git. So I feel like there might be consequences if we change this to SSH-based authentication. :)

I propose we keep this as-is.

.gitlab-ci.yml Outdated
script:
- docker version
- docker-compose version
- docker pull gpii/exekube:0.3.1-google
Contributor

if #81 is merged first, perhaps this should be updated.

Contributor Author

It is merged, and I updated to 0.4.0.

@@ -46,6 +46,18 @@ Initial instructions based on [exekube's Getting Started](https://exekube.github
* @mrtyler requested a quota bump to 100 Projects.
* He only authorized his own email for now, to see what it did. But it's possible other Ops team members will need to go through this step.

## One-time CI Setup
Contributor

Don't we have anything to manage the CI runner box? (not a big fan of any manual steps :))

Contributor Author

Indeed! We have https://github.com/gpii-ops/ansible-gpii-ci-worker.

I share your preference for automation. There are a few reasons I have this as a manual step:

  • It was simple :). Also it was a record of what I had done to facilitate my manual testing while I figured things out.
  • Laziness :). Installing credentials on the CI machine is currently a rare event and there are other things to do.
  • I forgot that I had already solved a similar problem (docker hub creds) with ansible-gpii-ci-worker.
  • I knew that some details about auth/creds would change with Alfredo's work in GPII-3125 Init GCP organization #60, so I postponed a more robust solution.
  • This step currently has a manual component regardless, since a human must use their credentials to obtain owner.json.

All of that said, you are certainly wise to raise the question! I was going to respond by adding owner.json to the ansible vault and deploying it automatically with ansible-gpii-ci-worker.

However, now that I'm pursuing your suggestion of using Volumes instead of Bind Mounts, I'm not sure exactly how I'm going to provide owner.json. Let's talk about it in my new PR, #92.
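
(For reference, the vault-backed deployment I had in mind would have been an ansible task along these lines; the variable name and destination path are placeholders:)

```yaml
# Hypothetical ansible-gpii-ci-worker task: install the CI service account
# key from the Ansible Vault onto the runner host, readable only by the
# gitlab-runner user. Variable name and destination path are placeholders.
- name: Install GCP owner.json for CI
  copy:
    content: "{{ vault_gcp_owner_json }}"
    dest: /home/gitlab-runner/owner.json
    owner: gitlab-runner
    group: gitlab-runner
    mode: "0600"
  no_log: true
```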

@stepanstipl (Contributor)

Looks good, I'm slightly unhappy about secrets on the host - effectively anyone with access to that machine, or with permission to run jobs there, can get complete access to our infra.

Also, we're using one big fat account for all the environments (not sure what a good mitigation strategy would be, but ideally, when you run under the stage env, you wouldn't be able to impact the prod one).

Re. permission issues - this might help (not convinced it's a good one, just an option :)) https://docs.docker.com/engine/security/userns-remap/

@mrtyler (Contributor Author) commented Aug 13, 2018

The docker Volumes/permissions thing is kind of a mess[1]. There are ~3 problems:

  1. If a Volume mount point doesn't exist on the host side, it is created with ownership root.root.
  2. Files written to a Volume inside the container are created with ownership of the user inside the container. I tried adding a static user exekube as part of the Dockerfile, but that only creates a different ownership mismatch.
  3. I worked around that by using the host user's uid and gid inside the container. But since we don't know the user at image build time, we can't put the .config directories into the user's home directory.

I don't have everything working yet, but this at least produces files with the correct permissions:
`MY_UID=$(id -u) MY_GID=$(id -g) rake apply_secret_mgmt`

(I'll add these to set_vars() next, assuming we go with this strategy.)

Note that this PR depends on a companion PR in (our fork of) exekube, gpii-ops/exekube#8.
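
(For context, the change on the exekube side amounts to something like the following in its docker-compose file; this is a sketch, not the actual diff in that PR:)

```yaml
# Sketch: run the exekube container as the invoking host user so that
# files written to mounted volumes come back with matching ownership.
services:
  exekube:
    user: "${MY_UID}:${MY_GID}"
```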

[1] Some issues I read while researching this issue:

@mrtyler (Contributor Author) commented Aug 13, 2018

@stepanstipl:

> Re. permission issues - this might help (not convinced it's a good one, just an option :)) https://docs.docker.com/engine/security/userns-remap/

Aha, this is where you left that link! I forgot to read this before.

It isn't clear to me whether userns remapping helps with our Volume ownership problems or not. It looks like using it requires some special handling (especially on CentOS 7). Let's talk about how this solution compares with the problems I've identified (my previous comment).
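
For reference, enabling it appears to come down to a daemon-level setting in /etc/docker/daemon.json (plus subordinate uid/gid ranges for the remapped user); I have not tested this on the CI worker:

```yaml
# Untested sketch of /etc/docker/daemon.json content (JSON, shown here in a
# YAML fence since JSON is a YAML subset). With "default", dockerd creates
# and uses a "dockremap" user/group whose /etc/subuid and /etc/subgid
# ranges back the remapping; the daemon must be restarted afterwards.
{
  "userns-remap": "default"
}
```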

@mrtyler mentioned this pull request Aug 17, 2018
@mrtyler (Contributor Author) commented Aug 17, 2018

@stepanstipl:

> Looks good, I'm slightly unhappy about secrets on the host - effectively anyone with access to that machine, or with permission to run jobs there, can get complete access to our infra.
>
> Also, we're using one big fat account for all the environments (not sure what a good mitigation strategy would be, but ideally, when you run under the stage env, you wouldn't be able to impact the prod one).

I definitely agree.

To some extent this is unavoidable -- something has to both execute code and have administrative production credentials. But there are mitigations, some existing, some planned, and some possible in the future, that can help:

  • Currently, gitlab-runner on h5 only runs code from a few "trusted" repos (mostly in gpii-ops). [DONE]
  • Alfredo's work in GPII-3125 Init GCP organization #60 adds some granularity. [IN PROGRESS]
  • https://issues.gpii.net/browse/GPII-2947 will give us better automation for managing permissions. [PLANNED]
  • One of the ideas behind having the promote-to-* steps in the CI/CD pipeline create deploy-* tags in the gpii-infra repo was to enable asynchrony between (in particular) the stg step and the prd steps. Perhaps the non-production and production steps should run on different Specific Runners/gitlab-runner processes, or on different machines. [FUTURE]
  • https://issues.gpii.net/browse/GPII-3057 through GPII-3061 are about moving the CI worker to the cloud. Perhaps that will open up other solutions (separate VMs/Nodes for production and non-production might be easier in the cloud). [PLANNED/FUTURE]
  • Vault and some of the techniques it affords (automatic key rotation -> short key lifetimes) would reduce the consequences of an attacker accessing credentials. [FUTURE]

@mrtyler (Contributor Author) commented Aug 17, 2018

I am closing this PR (and gpii-ops/exekube#8) in favor of #92.
