feat(cloudquery): Run on ECS #189


Merged
merged 20 commits into main from aa/cq-ecs
May 23, 2023
Conversation


@akash1810 akash1810 commented May 17, 2023

What does this change? And why?

Inspired by our experiments during this quarter's R&D sprint, this change moves away from using EC2 to run CloudQuery and towards ECS, on Fargate.

This brings some benefits:

  • No idle time

    The EC2 instance sits idle for ~23 hours a day.

  • Implementing different schedules is easier

    Though it's possible to implement different schedules using systemd timers on EC2, we'd likely need to manage multiple config files, which isn't fun.

    The changes in this PR allow us to schedule a job as follows:

    const customRateTables: CustomRateTable[] = [
      {
        schedule: Schedule.rate(Duration.hours(2)),
        tables: ['aws_s3_buckets'],
      },
      {
        schedule: Schedule.rate(Duration.minutes(30)),
        tables: ['aws_lambda_functions'],
      },
    ];

    Being able to collect data from different tables at different rates is going to be crucial to adding GitHub to CloudQuery due to the rate limiting of the GitHub API. If we collect data for github_teams at a slower rate than github_repositories, we reduce the chance of getting rate limited.

  • Removal of known errors

    The EC2 collector currently produces error logs when collecting data for the organization tables. This is because we're making the request from all accounts. We know that only the DeployTools account has this permission, so we now make the request from just that account.

    Similar to the above, this would be possible when using EC2; however, we'd need to manage multiple config files, which isn't fun.

  • Reduced infrastructure

    With the EC2 version, we have an ASG purely to facilitate deployments via Riff-Raff. This is not needed in the ECS version.

    In the EC2 version, we also place configuration in S3. The implementation of the ECS version removes this, in favour of inlining the config.
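The inlined-config approach can be sketched as a plain function that renders the source config as a string at synth time, so nothing needs fetching from S3 at runtime. This is a minimal sketch: the names and the spec shape are hypothetical simplifications of the real CloudQuery spec.

```typescript
// Hypothetical sketch: render a CloudQuery source config as a string at
// synth time, so it can be baked into the task definition instead of
// being fetched from S3 at runtime. The spec shape is simplified.
interface SourceSpec {
  name: string;
  tables: string[];
  destinations: string[];
}

function renderSourceConfig(spec: SourceSpec): string {
  return [
    'kind: source',
    'spec:',
    `  name: ${spec.name}`,
    `  tables: [${spec.tables.join(', ')}]`,
    `  destinations: [${spec.destinations.join(', ')}]`,
  ].join('\n');
}

const config = renderSourceConfig({
  name: 'aws',
  tables: ['aws_s3_buckets'],
  destinations: ['postgresql'],
});
```

The resulting string can then be passed straight into the container definition, removing the S3 round-trip entirely.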

How has it been verified?

The stack has been deployed. There are a couple of ways to verify it works:

Notes

Sadly, I don't think it will be possible to use a centralised logging container on Fargate, as:

Fargate tasks do not support the DAEMON scheduling strategy.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html#service_scheduler

This means each task has three containers:

  1. One to vend temporary AWS RDS IAM tokens
  2. One to perform the CloudQuery work
  3. One to ship logs

We should keep an eye on the cost of this, as it might become cheaper to ship the logs to CloudWatch logs, and then use cloudwatch-logs-management to route them to Central ELK.
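As a rough sketch of the resulting task layout (container names here are hypothetical; only the `essential` flag is a real ECS concept), each task might be modelled as:

```typescript
// Sketch of the per-task container layout. Container names are
// hypothetical. In ECS, a task stops when an essential container exits,
// so (as an assumption) only the CloudQuery container is marked
// essential: the token vendor runs to completion first, and the log
// shipper is a sidecar.
interface TaskContainer {
  name: string;
  essential: boolean;
}

const taskContainers: TaskContainer[] = [
  { name: 'rds-iam-token-vendor', essential: false },
  { name: 'cloudquery', essential: true },
  { name: 'log-shipper', essential: false },
];

const essentialCount = taskContainers.filter((c) => c.essential).length;
```

Marking only the CloudQuery container as essential means the task ends naturally when the collection run finishes, rather than being held open by the sidecars.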

TODO (non-blocking)

@akash1810 akash1810 force-pushed the aa/cq-ecs branch 3 times, most recently from 599bfe4 to 430e765 Compare May 19, 2023 05:40
@akash1810 akash1810 changed the title wip feat(cloudquery): Run CloudQuery on ECS May 19, 2023
Comment on lines +130 to +168
logging: LogDrivers.awsLogs({
  streamPrefix: [stack, stage, app].join('/'),
  logRetention: RetentionDays.ONE_DAY,
}),
Member Author

Send the log shipper's logs to CloudWatch.

Comment on lines +134 to +174
environment: {
  STACK: stack,
  STAGE: stage,
  APP: app,
  GU_REPO: thisRepo,
},
Member Author

These are read by the custom firelens config.

Comment on lines +27 to +33
const firelensImage = ContainerImage.fromRegistry(
  'ghcr.io/guardian/hackday-firelens:main',
);
Member Author

TODO rename this repository.

@akash1810 akash1810 changed the title feat(cloudquery): Run CloudQuery on ECS feat(cloudquery): Run on ECS May 19, 2023
The data doesn't change very often, so no need to collect it each day.
Comment on lines +23 to +25
const awsCliImage = ContainerImage.fromRegistry(
  'public.ecr.aws/aws-cli/aws-cli',
);
Member Author

This is the official AWS CLI image.

@akash1810 akash1810 marked this pull request as ready for review May 20, 2023 08:25
@akash1810 akash1810 requested review from a team as code owners May 20, 2023 08:25
@nicl nicl force-pushed the aa/cq-ecs branch 2 times, most recently from 50773f0 to 6fe5208 Compare May 22, 2023 14:35
...in order to:

* adhere to least privilege (by only passing policies to the tasks
  that need them)
* pass the Cloudquery specs in as props at the task creation level
  (i.e. in cloudquery.ts rather than lower down) so that the same
  model will work for e.g. Github and other sources going forward

The task schedules have also been modified: at the moment, once
a day is a good common baseline for tasks that are not hitting
rate limits, so we just standardise on that.

Finally, task names have been improved to aid discovery in the AWS
Console and API.
@nicl
Contributor

nicl commented May 22, 2023

Update: @akash1810 I have added a commit with some changes. Obviously feel free to drop it if you'd rather get this live without it!

The main difference is to change the scheduled task interface from:

interface CustomRateTable {
    schedule: Schedule;
    tables: string[];
}

to something like:

interface CloudquerySource {
	name: string;
	description: string;
	schedule: Schedule;
	config: CloudqueryConfig;
	policies?: PolicyStatement[];
	managedPolicies?: IManagedPolicy[];
}

This looks more complicated but allows:

  • least privilege, as tasks can be passed just the policies they require
  • support for non-AWS tasks, such as the upcoming Github, Fastly etc.
  • better task identification (via the name/description)

The idea is that everything becomes a scheduled task here, including the standard fetch-everything-from-all-accounts Cloudquery task. So the code overall becomes simpler with fewer ifs/branches.
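A minimal sketch of that model (simplified types: `Schedule` is replaced with a string stand-in, and the names are illustrative rather than the real task names): every source, including the fetch-everything one, is just an entry in a list, and one loop creates a scheduled task per entry.

```typescript
// Simplified sketch of the "everything is a scheduled task" model.
// schedule is a string stand-in for the CDK Schedule type; names are
// illustrative, not the real task names.
interface CloudquerySourceSketch {
  name: string;
  description: string;
  schedule: string;
  tables: string[];
}

const sources: CloudquerySourceSketch[] = [
  {
    name: 'All',
    description: 'Data fetched across all accounts in the organisation.',
    schedule: 'rate(1 day)',
    tables: ['*'],
  },
  {
    name: 'OrgWide',
    description: 'Organisation-level data, fetched from one account only.',
    schedule: 'rate(1 day)',
    tables: ['aws_organizations'],
  },
];

// One loop, no special cases: each source becomes one scheduled task.
const scheduledTasks = sources.map((s) => `${s.name}: ${s.schedule}`);
```
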

Task naming turns out to be especially helpful in the AWS Console and also because it shows up as a field in the logs:

[Screenshot 2023-05-22 at 16:41:29]

Warning

The logging to ELK seems to be broken and I'm not sure why. Seeing e.g.

[error] [parser] cannot parse '2023-05-22T16:36:45Z'
[ warn] [parser:json] invalid time format %d/%b/%Y:%H:%M:%S %z for '2023-05-22T16:36:45Z'

I didn't think I'd changed behaviour here.

@akash1810
Member Author

This looks more complicated but allows:

  • least privilege, as tasks can be passed just the policies they require
  • support for non-AWS tasks, such as the upcoming Github, Fastly etc.
  • better task identification (via the name/description)

Love every one of these! I was struggling to add GH to this pattern; I think these changes get us closer, but I think we'd need a way to set secrets too. I suggest this refactor be its own PR though.

Regarding logging, that was happening prior to your change too. We should be able to reproduce this behaviour locally to help debugging (woop containers!). WDYT: is the current version of logging shippable?

import { postgresDestinationConfig } from './config';
import { Versions } from './versions';

const awsCliImage = ContainerImage.fromRegistry(
Contributor

For a potential follow-up: could we put the AWS CLI/firelens logging in a container that has cloudquery as the base, self-publish, and auth that way? It would save a couple of containers. @Mark-McCracken

Contributor

Something to note (if my understanding is correct) is that containers share the resources of the Task, so we don't actually pay more for this. The auth container should also exit immediately, so it shouldn't impact the resources (CPU/memory) available to the main Cloudquery container. I agree though that generally fewer containers is better where possible!

Contributor

@NovemberTang NovemberTang left a comment

Some notes that can be talked about later, not blocking.
LGTM.
fargate 4eva.

  skipTables: skipTables,
}),
managedPolicies: [
  readonlyAccessManagedPolicy(this, 'fetch-all-managed-policy'),
Member Author

We're creating an instance of this managed policy for each task, which increases the chances of bumping against the CFN template size limit. If we do see this in future, we can instantiate just one and pass it across.
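One way to do that, sketched as a generic helper (hypothetical, not in the PR): create the managed policy lazily on first use and return the same instance thereafter.

```typescript
// Hypothetical sketch: lazily create a shared resource once and reuse
// it, e.g. a single readonly managed policy passed to every task.
function memoize<T>(factory: () => T): () => T {
  let cached: T | undefined;
  return () => {
    if (cached === undefined) {
      cached = factory();
    }
    return cached;
  };
}

let creations = 0;
const getReadonlyPolicy = memoize(() => {
  creations += 1;
  return 'readonly-managed-policy'; // stand-in for the IManagedPolicy instance
});
```

However many tasks call `getReadonlyPolicy()`, only one policy resource ends up in the template.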

Comment on lines +90 to +103
const essentialPolicies = [
  // Log shipping
  new PolicyStatement({
    actions: ['kinesis:Describe*', 'kinesis:Put*'],
    effect: Effect.ALLOW,
    resources: [loggingStreamArn],
  }),

  // Assume the cloudquery-access role in member accounts
  new PolicyStatement({
    effect: Effect.ALLOW,
    resources: ['arn:aws:iam::*:role/cloudquery-access'],
    actions: ['sts:AssumeRole'],
  }),
];
Member Author

Worth keeping an eye on these, for similar reasons to above.

@akash1810 akash1810 enabled auto-merge May 23, 2023 11:53
@akash1810 akash1810 merged commit baff2dc into main May 23, 2023
@akash1810 akash1810 deleted the aa/cq-ecs branch May 23, 2023 11:55
description: 'Data fetched across all accounts in the organisation.',
schedule: Schedule.rate(Duration.days(1)),
config: awsSourceConfigForOrganisation({
  tables: ['*'],
Member Author
@akash1810 akash1810 May 23, 2023

Doing '*' here with basic skipTables is "dangerous", as it doubles the data for tables that are also collected individually, and the task might not have permission for everything it collects. For example, this task will collect data for the aws_organizations table in all accounts. Its policy doesn't allow this, and we know only the DeployTools account can make this request. That is, there will be some log noise with this.

I think a better version of the "all" task would be to add all the individual tables below to the skipTables prop. cc @nicl as this was changed in c67fa6c.

Using '*' also seems to be discouraged by CloudQuery since v3.0.0.
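That suggestion can be sketched as a small helper (hypothetical name, not in the PR): compute the "all" task's skipTables from the tables that the individually scheduled sources already collect.

```typescript
// Hypothetical helper: the "all" task should skip every table that an
// individually scheduled source already collects, so nothing is fetched
// twice and unauthorised org-level tables can be excluded.
function skipTablesForAllTask(individuallyScheduled: string[][]): string[] {
  const all = ([] as string[]).concat(...individuallyScheduled);
  return Array.from(new Set(all)).sort();
}

const skip = skipTablesForAllTask([
  ['aws_organizations'],
  ['aws_s3_buckets', 'aws_lambda_functions'],
]);
```

Deriving the skip list from the source definitions themselves would keep the two in sync automatically as new scheduled sources are added.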

akash1810 added a commit that referenced this pull request May 24, 2023
As of #189 we're now running on ECS, so we can remove the EC2 based infrastructure.