This repository has been archived by the owner on Sep 21, 2020. It is now read-only.

GPII-2996 GPII-2995: CI for GCP #80

Closed
wants to merge 17 commits

Conversation

@mrtyler (Contributor) commented Aug 1, 2018

This PR adds:

  1. Pipeline steps for gcp, parallel to steps for aws
  2. A guard rail to prevent accidental operation in prd, a la GPII-3199
  3. Documentation on how to download credentials for use by CI
  4. A helper task to copy previously-downloaded credentials to the place where exekube expects to find them
  5. An attempt to normalize names of rake tasks, and some re-ordering of tasks in entrypoint.rake

I think 1 and 2 should be mostly uncontroversial. See https://issues.gpii.net/browse/GPII-2996?focusedCommentId=33804&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-33804 for a little more about the "tagged runner" strategy.
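
For illustration, the side-by-side, tag-routed jobs look roughly like this (job and stage names below are placeholders, not necessarily the ones used in this PR):

```yaml
# Sketch only: each cloud gets its own job, routed to the matching
# Specific Runner via tags, so the aws and gcp steps can run in parallel.
deploy-aws-stg:
  stage: deploy
  tags:
    - aws
  script:
    - rake deploy   # placeholder task name

deploy-gcp-stg:
  stage: deploy
  tags:
    - gcp
  script:
    - rake deploy   # placeholder task name
```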

2 and 3 may warrant further discussion, particularly in light of recent conversations about secrets handling.

For now, I'm letting CI use the credentials for projectowner@ in each Project. A dedicated IAM would be better. I'd prefer to wait for @amatas's work in #60 and/or for https://issues.gpii.net/browse/GPII-2947 so that we have a place to put Terraform code to manage IAMs, but I can whip up something by hand if the team thinks it's worth doing.

BTW @amatas I was unable to create credentials for stg and prd because:

  • There is no Project gpii-gcp-stg. Instead there is a gpii-stg.
  • There is a gpii-gcp-prd, but it doesn't have a projectowner@ IAM.

Perhaps these are expected until your work in #60 is complete?

4 may cause merge conflicts for in-flight branches (mostly @amatas I think). Sorry about that, but it helped me reason about the changes I was making.

Mostly I tried to reduce the number of verbs in task names. Let me know if I made the names better or worse :).

**The next problem**

Directories created inside the exekube container (even those created implicitly, like volume mounts for .config/<env>/gcloud) are created with ownership root.root. This prevents rake clobber from cleaning up these directories (https://gitlab.com/gpii-ops/gpii-infra/-/jobs/85893841), and prevents secrets.rb from writing secrets files (https://gitlab.com/gpii-ops/gpii-infra/-/jobs/85889990).

I do not see this behavior on my machine (macOS). The CI worker runs CentOS 7.

My guess is this is because commands like gcloud and secrets-fetch run as root inside the exekube container, so files created on mounted volumes inside the container "leak" back to the host with ownership root. This may be fixable by adding and then using a role user inside the container instead of defaulting to USER root, but that approach can get complicated so I'm stopping here to ask for advice.
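
To make the failure mode concrete, here is a rough docker-compose sketch of the mounts involved (mount paths are illustrative, not the exact exekube configuration):

```yaml
# Illustrative only: if ./.config/dev/gcloud does not already exist on the
# host, the Docker daemon creates it owned by root.root, and anything the
# container (which runs as USER root) writes into these mounts lands on
# the host owned by root as well.
services:
  exekube:
    image: gpii/exekube:0.4.0-google   # tag per this PR; mount paths are assumed
    volumes:
      - ./.config/dev/gcloud:/root/.config/gcloud
      - ./secrets:/project/secrets
```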

@amatas (Contributor) left a comment

Everything looks fine to me; I've added some minor comments.

Service account creation is something that will be handled by the common part. However, you can create one with the same name and features, and it will be imported into the TF state before applying the common part in order to avoid conflicts. (There are many already-created resources that must be imported as a preliminary step before the first execution of the common part.)

As for merging the entrypoint.rake file, that looks like it will be painful regardless.

stage: setup
tags:
- aws
Contributor

It's only a cosmetic change, but is there a particular reason why the indentation of the tags section is different from the script section?

Contributor Author

¯(°_o)/¯

(This is fixed in my new PR, #92.)

- git tag "deploy-aws-stg-$DATESTAMP"
# gitlab is not clever enough to clean up an added remote and git complains
# if we add a remote that already exists.
- git remote | grep -q "^origin-rw" || git remote add origin-rw git@gitlab.com:gpii-ops/gpii-infra
Contributor

Although this works, another option could be setting the origin URL using:
`git remote set-url origin git@gitlab.com:gpii-ops/gpii-infra`
It's just an idea.

Contributor Author

I couldn't find it in my notes, but I think I considered this idea when I wrote this task originally. I decided it was safer to create a separate origin rather than fight for control of origin with Gitlab, which may have certain expectations about the origin URL that we might break.

I checked and the origin url Gitlab uses is https://gitlab-ci-token:<PASSWORD>@gitlab.com/gpii-ops/gpii-infra.git. So I feel like there might be consequences if we change this to SSH-based authentication. :)

I propose we keep this as-is.

.gitlab-ci.yml Outdated
script:
- docker version
- docker-compose version
- docker pull gpii/exekube:0.3.1-google
Contributor

if #81 is merged first, perhaps this should be updated.

Contributor Author

It is merged, and I updated to 0.4.0.

@@ -46,6 +46,18 @@ Initial instructions based on [exekube's Getting Started](https://exekube.github
* @mrtyler requested a quota bump to 100 Projects.
* He only authorized his own email for now, to see what it did. But it's possible other Ops team members will need to go through this step.

## One-time CI Setup
Contributor

Don't we have anything to manage the CI runner box? (not a big fan of any manual steps :))

Contributor Author

Indeed! We have https://github.com/gpii-ops/ansible-gpii-ci-worker.

I share your preference for automation. There are a few reasons I have this as a manual step:

  • It was simple :). Also it was a record of what I had done to facilitate my manual testing while I figured things out.
  • Laziness :). Installing credentials on the CI machine is currently a rare event and there are other things to do.
  • I forgot that I had already solved a similar problem (docker hub creds) with ansible-gpii-ci-worker.
  • I knew that some details about auth/creds would change with Alfredo's work in GPII-3125 Init GCP organization #60, so I postponed a more robust solution.
  • This step currently has a manual component regardless, since a human must use their credentials to obtain owner.json.

All of that said, you are certainly wise to raise the question! I was going to respond by adding owner.json to the ansible vault and deploying it automatically with ansible-gpii-ci-worker.

However, now that I'm pursuing your suggestion of using Volumes instead of Bind Mounts, I'm not sure exactly how I'm going to provide owner.json. Let's talk about it in my new PR, #92.
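
(For reference, the vault-backed deployment I had in mind would have been an ansible task along these lines; the variable name and destination path are placeholders:)

```yaml
# Hypothetical ansible-gpii-ci-worker task: install the CI service account
# key from the Ansible Vault onto the runner host, readable only by the
# gitlab-runner user. Variable name and destination path are placeholders.
- name: Install GCP owner.json for CI
  copy:
    content: "{{ vault_gcp_owner_json }}"
    dest: /home/gitlab-runner/owner.json
    owner: gitlab-runner
    group: gitlab-runner
    mode: "0600"
  no_log: true
```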

@stepanstipl (Contributor)

Looks good, I'm slightly unhappy about secrets on the host - effectively anyone with access to that machine, or with permission to run jobs there, can get complete access to our infra.

Also, we're using one big fat account for all the environments (not sure what a good mitigation strategy would be, but ideally, when you run under the stage env, you wouldn't be able to impact the prod one).

Re. permission issues - this might help (not convinced it's a good one, just an option :)) https://docs.docker.com/engine/security/userns-remap/

@mrtyler (Contributor Author) commented Aug 13, 2018

The docker Volumes/permissions thing is kind of a mess[1]. There are ~3 problems:

  1. If a Volume mount point doesn't exist on the host side, it is created with ownership root.root.
  2. Files written to a Volume inside the container are created with ownership of the user inside the container. I tried adding a static user exekube as part of the Dockerfile, but that only creates a different ownership mismatch.
  3. I worked around that by using the host user's uid and gid inside the container. But since we don't know the user at image build time, we can't put the .config directories into the user's home directory.

I don't have everything working yet, but this at least produces files with the correct permissions:
`MY_UID=$(id -u) MY_GID=$(id -g) rake apply_secret_mgmt`

(I'll add these to set_vars() next, assuming we go with this strategy.)

Note that this PR depends on a companion PR in (our fork of) exekube, gpii-ops/exekube#8.
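
(For context, the change on the exekube side amounts to something like the following in its docker-compose file; this is a sketch, not the actual diff in that PR:)

```yaml
# Sketch: run the exekube container as the invoking host user so that
# files written to mounted volumes come back with matching ownership.
services:
  exekube:
    user: "${MY_UID}:${MY_GID}"
```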

[1] Some issues I read while researching this issue:

@mrtyler (Contributor Author) commented Aug 13, 2018

@stepanstipl:

> Re. permission issues - this might help (not convinced it's a good one, just an option :)) https://docs.docker.com/engine/security/userns-remap/

Aha, this is where you left that link! I forgot to read this before.

It isn't clear to me whether userns remapping helps with our Volume ownership problems or not. It looks like using it requires some special handling (especially on CentOS 7). Let's talk about how this solution compares with the problems I've identified (my previous comment).
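
For reference, enabling it appears to come down to a daemon-level setting in /etc/docker/daemon.json (plus subordinate uid/gid ranges for the remapped user); I have not tested this on the CI worker:

```yaml
# Untested sketch of /etc/docker/daemon.json content (JSON, shown here in a
# YAML fence since JSON is a YAML subset). With "default", dockerd creates
# and uses a "dockremap" user/group whose /etc/subuid and /etc/subgid
# ranges back the remapping; the daemon must be restarted afterwards.
{
  "userns-remap": "default"
}
```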

@mrtyler mentioned this pull request Aug 17, 2018
@mrtyler (Contributor Author) commented Aug 17, 2018

@stepanstipl:

> Looks good, I'm slightly unhappy about secrets on the host - effectively anyone with access to that machine, or with permission to run jobs there, can get complete access to our infra.
>
> Also, we're using one big fat account for all the environments (not sure what a good mitigation strategy would be, but ideally, when you run under the stage env, you wouldn't be able to impact the prod one).

I definitely agree.

To some extent this is unavoidable -- something has to both execute code and have administrative production credentials. But there are mitigations, some existing, some planned, and some possible in the future, that can help:

  • Currently, gitlab-runner on h5 only runs code from a few "trusted" repos (mostly in gpii-ops). [DONE]
  • Alfredo's work in GPII-3125 Init GCP organization #60 adds some granularity. [IN PROGRESS]
  • https://issues.gpii.net/browse/GPII-2947 will give us better automation for managing permissions. [PLANNED]
  • One of the ideas behind having the promote-to-* steps in the CI/CD pipeline create deploy-* tags in the gpii-infra repo was to enable asynchrony between (in particular) the stg step and the prd steps. Perhaps the non-production and production steps should run on different Specific Runners/gitlab-runner processes, or on different machines. [FUTURE]
  • https://issues.gpii.net/browse/GPII-3057 through GPII-3061 are about moving the CI worker to the cloud. Perhaps that will open up other solutions (separate VMs/Nodes for production and non-production might be easier in the cloud). [PLANNED/FUTURE]
  • Vault and some of the techniques it affords (automatic key rotation -> short key lifetimes) would reduce the consequences of an attacker accessing credentials. [FUTURE]

@mrtyler (Contributor Author) commented Aug 17, 2018

I am closing this PR (and gpii-ops/exekube#8) in favor of #92.
