Merge pull request #3 from theodoresiu/dlp_api_example
Create Terraform script to run Dataflow template for DLP API
Tfmenard authored May 31, 2019
2 parents a9d21a8 + 99f7641 commit 7526b16
Showing 4 changed files with 292 additions and 0 deletions.
72 changes: 72 additions & 0 deletions examples/dlp_api_example/README.md
@@ -0,0 +1,72 @@
# DLP API Example

This Dataflow example runs the DLP Dataflow template at `gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery`. It downloads a fake credit card [zipfile](http://eforexcel.com/wp/wp-content/uploads/2017/07/1500000%20CC%20Records.zip), unzips it to a CSV, de-identifies the credit card number and PIN columns using the DLP API, and loads the data into a BigQuery dataset.

This Terraform script lets users either supply their own pre-created KMS key ring, key, and wrapped key by setting the variable `create_key_ring=false`, or have Terraform create all of these resources by setting `create_key_ring=true`.
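
With `create_key_ring=false`, the wrapped key has to exist before Terraform runs. The following is a minimal sketch of one way to produce it, assuming `head`, `base64`, and `gcloud` are available and that the key ring and key already exist in the `global` location. The function names are hypothetical. It follows the same idea as this example's `main.tf`, which calls the KMS REST API instead.

```shell
#!/bin/sh
# Generate a random 256-bit key and print it base64-encoded (44 characters).
generate_key() {
  head -c 32 /dev/urandom | base64
}

# Wrap a base64-encoded key with an existing KMS key and print the
# base64-encoded ciphertext on a single line.
# Arguments: key ring name, key name, base64-encoded key.
wrap_key() {
  printf '%s' "$3" | base64 --decode |
    gcloud kms encrypt --location global --keyring "$1" --key "$2" \
      --plaintext-file - --ciphertext-file - |
    base64 | tr -d '\n'
}
```

For example, `wrap_key my-key-ring my-key "$(generate_key)"` prints a value that could be passed as the `wrapped_key` variable; `my-key-ring` and `my-key` are placeholders.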


## Best practices

### Cost and Performance
As featured in this example, using a single regional bucket for storing your jobs' temporary data is recommended to optimize cost.
Also, to optimize your jobs' performance, this bucket should always be in the same region as the zones in which your jobs are running.
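
As a rough illustration of the region rule, the sketch below derives a region from a zone name and compares it with a bucket location. The function names are hypothetical, and the comparison only makes sense for single-region buckets; a multi-region location such as `US` will not match.

```shell
#!/bin/sh
# Derive the region from a zone name, e.g. "us-central1-a" -> "us-central1".
zone_region() {
  printf '%s\n' "${1%-*}"
}

# Succeed if a bucket location matches the region of a zone, ignoring case
# (bucket locations are typically reported in upper case, e.g. "US-CENTRAL1").
# Arguments: zone name, bucket location.
same_region() {
  want=$(zone_region "$1" | tr '[:upper:]' '[:lower:]')
  have=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ "$want" = "$have" ]
}
```

The bucket location can be read from the `Location constraint` line printed by `gsutil ls -L -b gs://my-bucket` (placeholder bucket name) and passed as the second argument, for example `same_region us-central1-a US-CENTRAL1`.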
### Terraform Service Account
Make sure the Terraform service account used to execute this example has the basic permissions needed for the module, listed [here](../../README#configure-a-service-account-to-execute-the-module).
Grant the service account these additional roles, which are needed to run the example:
- roles/bigquery.admin
- roles/iam.serviceAccountUser
- roles/storage.admin
- roles/cloudkms.admin
- roles/dlp.admin
- roles/cloudkms.cryptoKeyEncrypterDecrypter
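
One way to grant the roles listed above is a small loop around `gcloud projects add-iam-policy-binding`. This is a sketch, not part of the example itself; the function name is hypothetical and the bindings are made at the project level.

```shell
#!/bin/sh
# Roles listed above for the Terraform service account.
TERRAFORM_SA_ROLES="roles/bigquery.admin
roles/iam.serviceAccountUser
roles/storage.admin
roles/cloudkms.admin
roles/dlp.admin
roles/cloudkms.cryptoKeyEncrypterDecrypter"

# Grant every role in the list to a service account at the project level.
# Arguments: project ID, service account email.
grant_roles() {
  for role in $TERRAFORM_SA_ROLES; do
    gcloud projects add-iam-policy-binding "$1" \
      --member "serviceAccount:$2" --role "$role" || return 1
  done
}
```

A call would look like `grant_roles my-project terraform@my-project.iam.gserviceaccount.com`, where both values are placeholders.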

### Controller Service Account
This example uses a controller service account, which is specified with the `service_account_email` input variable.
We recommend using a custom service account with fine-grained access control to mitigate security risks. See more about controller service accounts [here](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#controller_service_account).

In order to execute this module, your Controller Service Account needs the following project roles:
- roles/dataflow.worker
- roles/storage.admin
- roles/bigquery.admin
- roles/cloudkms.admin
- roles/dlp.admin
- roles/cloudkms.cryptoKeyEncrypterDecrypter
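
Before applying, it can help to check which of these roles the controller service account is still missing. The sketch below assumes the roles are bound at the project level (bindings inherited from a folder or organization are not seen); the function names are hypothetical.

```shell
#!/bin/sh
# Roles listed above for the controller service account.
CONTROLLER_SA_ROLES="roles/dataflow.worker roles/storage.admin roles/bigquery.admin roles/cloudkms.admin roles/dlp.admin roles/cloudkms.cryptoKeyEncrypterDecrypter"

# Print the project-level roles currently bound to a service account.
# Arguments: project ID, service account email.
granted_roles() {
  gcloud projects get-iam-policy "$1" \
    --flatten "bindings[].members" \
    --filter "bindings.members:serviceAccount:$2" \
    --format "value(bindings.role)"
}

# Read granted roles on stdin (one per line) and print each required role
# that is not among them.
missing_roles() {
  granted=$(cat)
  for role in $CONTROLLER_SA_ROLES; do
    printf '%s\n' "$granted" | grep -qxF "$role" || echo "$role"
  done
}
```

The two functions are meant to be chained, for example `granted_roles my-project dataflow@my-project.iam.gserviceaccount.com | missing_roles` with placeholder values; empty output means nothing is missing.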

### GCloud
This example uses gcloud shell commands to create a wrapped key and download the sample credit card data. Please ensure that you have gcloud [installed](https://cloud.google.com/sdk/install), are authenticated using `gcloud init`, and have set the project with `gcloud config set project my-project`. You may need to enable the following APIs (see [here](https://cloud.google.com/apis/docs/enable-disable-apis)):
- Cloud Key Management Service (KMS) API: `cloudkms.googleapis.com`
- Cloud Storage API: `storage-component.googleapis.com`
- DLP API: `dlp.googleapis.com`
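
The three APIs above can also be enabled from the command line. A minimal sketch, assuming the caller is allowed to enable services in the project; the function name is hypothetical.

```shell
#!/bin/sh
# Enable the three APIs listed above in the given project.
# Argument: project ID.
enable_apis() {
  gcloud services enable \
    cloudkms.googleapis.com \
    storage-component.googleapis.com \
    dlp.googleapis.com \
    --project "$1"
}
```

Calling `enable_apis my-project` (placeholder project ID) enables all three in one request.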


[^]: (autogen_docs_start)

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|:----:|:-----:|:-----:|
| project\_id | The project ID to deploy to | string | n/a | yes |
| region | The region in which the bucket and the dataflow job will be deployed | string | "us-central1" | no |
| service\_account\_email | The Service Account email used to create the job. | string | n/a | yes |
| terraform\_service\_account\_email | The Service Account email used by Terraform to create resources (the one referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable) | string | n/a | yes |
| key\_ring | The KMS key ring used to create a wrapped key (can be existing or created) | string | n/a | yes |
| kms\_key\_name | The KMS key within the key ring used to create a wrapped key (can be existing or created) | string | n/a | yes |
| wrapped\_key | The wrapped key generated from KMS used to encrypt sensitive information (leave blank if generating from terraform) | string | "" | no |
| create\_key\_ring | Boolean for creating own KMS key ring/key or using pre-created resource | string | "true" | no |

## Outputs

| Name | Description |
|------|-------------|
| bucket\_name | The name of the bucket |
| df\_job\_id | The unique Id of the newly created Dataflow job |
| df\_job\_name | The name of the newly created Dataflow job |
| df\_job\_state | The state of the newly created Dataflow job |
| project\_id | The project's ID |

[^]: (autogen_docs_end)
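
The inputs can be supplied through a `terraform.tfvars` file. The sketch below writes one for the case where Terraform creates the key ring and key; every value is a placeholder, and `terraform_service_account_email` is included because `variables.tf` declares it without a default. The function name is hypothetical.

```shell
#!/bin/sh
# Write a variables file for this example; all values are placeholders.
# Argument: output path (defaults to terraform.tfvars).
write_tfvars() {
  cat > "${1:-terraform.tfvars}" <<'EOF'
project_id                      = "my-project"
region                          = "us-central1"
service_account_email           = "dataflow-controller@my-project.iam.gserviceaccount.com"
terraform_service_account_email = "terraform@my-project.iam.gserviceaccount.com"
key_ring                        = "my-key-ring"
kms_key_name                    = "my-key"
create_key_ring                 = "true"
EOF
}
```

To reuse existing KMS resources instead, set `create_key_ring = "false"` and add a `wrapped_key` line with the wrapped key value.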

To provision this example, run the following from within this directory:
- `terraform init` to get the plugins
- `terraform plan` to see the infrastructure plan
- `terraform apply` to apply the infrastructure build
- `terraform destroy` to destroy the built infrastructure. (Note that KMS key rings and crypto keys cannot be deleted from a Google Cloud project, so they remain after the destroy.)
130 changes: 130 additions & 0 deletions examples/dlp_api_example/main.tf
@@ -0,0 +1,130 @@
/**
* Copyright 2019 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

provider "google" {
  version = "~> 2.4.0"
  region  = "${var.region}"
}

resource "random_id" "random_suffix" {
  byte_length = 4
}

locals {
  gcs_bucket_name = "tmp-dir-bucket-${random_id.random_suffix.hex}"
}

module "dataflow-bucket" {
  source     = "../../modules/dataflow_bucket"
  name       = "${local.gcs_bucket_name}"
  region     = "${var.region}"
  project_id = "${var.project_id}"
}

# Download the sample credit card records, unzip them, and copy the CSV into the bucket.
resource "null_resource" "download_sample_cc_into_gcs" {
  provisioner "local-exec" {
    command = <<EOF
curl http://eforexcel.com/wp/wp-content/uploads/2017/07/1500000%20CC%20Records.zip > cc_records.zip
unzip cc_records.zip
rm cc_records.zip
mv 1500000\ CC\ Records.csv cc_records.csv
gsutil cp cc_records.csv gs://${module.dataflow-bucket.name}
rm cc_records.csv
EOF
  }
}

# Create the DLP de-identify template (ID "15") that the Dataflow job references.
resource "null_resource" "deinspection_template_setup" {
  provisioner "local-exec" {
    command = <<EOF
# Use the generated wrapped key if it exists; otherwise fall back to the wrapped_key variable.
if [ -f wrapped_key.txt ] && [ "${null_resource.create_kms_wrapped_key.count}" = "1" ]; then
wrapped_key=$(cat wrapped_key.txt)
else
wrapped_key="${var.wrapped_key}"
fi
echo "$wrapped_key"
curl https://dlp.googleapis.com/v2/projects/${var.project_id}/deidentifyTemplates -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json" \
-d '{"deidentifyTemplate": {"deidentifyConfig": {"recordTransformations": {"fieldTransformations": [{"fields": [{"name": "Card Number"}, {"name": "Card PIN"}], "primitiveTransformation": {"cryptoReplaceFfxFpeConfig": {"cryptoKey": {"kmsWrapped": {"cryptoKeyName": "projects/${var.project_id}/locations/global/keyRings/${var.key_ring}/cryptoKeys/${var.kms_key_name}", "wrappedKey": "'$wrapped_key'"}}, "commonAlphabet": "ALPHA_NUMERIC"}}}]}}}, "templateId": "15"}'
EOF
  }
}

resource "google_bigquery_dataset" "default" {
  project                     = "${var.project_id}"
  dataset_id                  = "dlp_demo"
  friendly_name               = "dlp_demo"
  description                 = "This is the BQ dataset for running the dlp demo"
  location                    = "US"
  default_table_expiration_ms = 3600000
}

resource "google_kms_key_ring" "create_kms_ring" {
  project  = "${var.project_id}"
  count    = "${var.create_key_ring == "true" ? 1 : 0}"
  name     = "${var.key_ring}"
  location = "global"
}

resource "google_kms_crypto_key" "create_kms_key" {
  count    = "${google_kms_key_ring.create_kms_ring.count}"
  name     = "${var.kms_key_name}"
  key_ring = "${google_kms_key_ring.create_kms_ring.self_link}"
}

# Generate a random 256-bit key and wrap it with the KMS key; the result is written to wrapped_key.txt.
resource "null_resource" "create_kms_wrapped_key" {
  count = "${google_kms_crypto_key.create_kms_key.count}"

  provisioner "local-exec" {
    command = <<EOF
rm -f original_key.txt wrapped_key.txt
python -c "import os,base64; key=os.urandom(32); encoded_key = base64.b64encode(key).decode('utf-8'); print(encoded_key)" >> original_key.txt
original_key="$(cat original_key.txt)"
gcloud kms keys add-iam-policy-binding ${var.kms_key_name} --project ${var.project_id} --location global --keyring ${var.key_ring} --member serviceAccount:${var.terraform_service_account_email} --role roles/cloudkms.cryptoKeyEncrypterDecrypter
curl -s -X POST "https://cloudkms.googleapis.com/v1/projects/${var.project_id}/locations/global/keyRings/${var.key_ring}/cryptoKeys/${var.kms_key_name}:encrypt" -d '{"plaintext":"'$original_key'"}' -H "Authorization:Bearer $(gcloud auth application-default print-access-token)" -H "Content-Type:application/json" | python -c "import sys, json; print(json.load(sys.stdin)['ciphertext'])" >> wrapped_key.txt
EOF
  }
}

module "dataflow-job" {
  source                = "../../"
  project_id            = "${var.project_id}"
  name                  = "dlp_example_${null_resource.download_sample_cc_into_gcs.id}_${null_resource.deinspection_template_setup.id}"
  on_delete             = "cancel"
  zone                  = "${var.region}-a"
  template_gcs_path     = "gs://dataflow-templates/latest/Stream_DLP_GCS_Text_to_BigQuery"
  temp_gcs_location     = "${module.dataflow-bucket.name}"
  service_account_email = "${var.service_account_email}"
  max_workers           = 5

  parameters = {
    inputFilePattern       = "gs://${module.dataflow-bucket.name}/cc_records.csv"
    datasetName            = "${google_bigquery_dataset.default.dataset_id}"
    batchSize              = 1000
    dlpProjectId           = "${var.project_id}"
    deidentifyTemplateName = "projects/${var.project_id}/deidentifyTemplates/15"
  }
}

resource "null_resource" "destroy_deidentify_template" {
  provisioner "local-exec" {
    when    = "destroy"
    command = <<EOF
curl -s -X DELETE "https://dlp.googleapis.com/v2/projects/${var.project_id}/deidentifyTemplates/15" -H "Authorization:Bearer $(gcloud auth application-default print-access-token)"
EOF
  }
}
40 changes: 40 additions & 0 deletions examples/dlp_api_example/outputs.tf
@@ -0,0 +1,40 @@
/**
* Copyright 2019 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

output "project_id" {
  value       = "${var.project_id}"
  description = "The project's ID"
}

output "df_job_state" {
  description = "The state of the newly created Dataflow job"
  value       = "${module.dataflow-job.state}"
}

output "df_job_id" {
  description = "The unique Id of the newly created Dataflow job"
  value       = "${module.dataflow-job.id}"
}

output "df_job_name" {
  description = "The name of the newly created Dataflow job"
  value       = "${module.dataflow-job.name}"
}

output "bucket_name" {
  description = "The name of the bucket"
  value       = "${module.dataflow-bucket.name}"
}
50 changes: 50 additions & 0 deletions examples/dlp_api_example/variables.tf
@@ -0,0 +1,50 @@
/**
* Copyright 2019 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

variable "project_id" {
  description = "The project ID to deploy to"
}

variable "region" {
  description = "The region in which the bucket and the dataflow job will be deployed"
  default     = "us-central1"
}

variable "service_account_email" {
  description = "The Service Account email used to create the job."
}

variable "terraform_service_account_email" {
  description = "The Service Account email used by Terraform to spin up resources (the one referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable)"
}

variable "key_ring" {
  description = "The GCP KMS key ring to be created"
}

variable "kms_key_name" {
  description = "The GCP KMS key to be created under the key ring"
}

variable "wrapped_key" {
  description = "Wrapped key from KMS; leave blank if create_key_ring=true"
  default     = ""
}

variable "create_key_ring" {
  description = "Boolean for determining whether to create the key ring and key (true or false)"
  default     = "true"
}
