feat(cloudquery): Run on ECS #189


Merged
merged 20 commits into main from aa/cq-ecs
May 23, 2023
Conversation


@akash1810 akash1810 commented May 17, 2023

What does this change? And why?

Inspired by our experiments during this quarter's R&D sprint, this change moves away from using EC2 to run CloudQuery and towards ECS, on Fargate.

This brings some benefits:

  • No idle time

    The EC2 instance sits idle for ~23 hours a day.

  • Implementing different schedules is easier

    Though it's possible to implement different schedules using systemd timers on EC2, we'd likely need to manage multiple config files, which isn't fun.

    The changes in this PR allow us to schedule a job as follows:

    const customRateTables: CustomRateTable[] = [
      {
        schedule: Schedule.rate(Duration.hours(2)),
        tables: ['aws_s3_buckets'],
      },
      {
        schedule: Schedule.rate(Duration.minutes(30)),
        tables: ['aws_lambda_functions'],
      },
    ];

    Being able to collect data from different tables at different rates is going to be crucial to adding GitHub to CloudQuery due to the rate limiting of the GitHub API. If we collect data for github_teams at a slower rate than github_repositories, we reduce the chance of getting rate limited.

  • Removal of known errors

    The EC2 collector currently produces error logs when collecting data for the organization tables. This is because we're making the request from all accounts. We know that only the DeployTools account has this permission, so we now make the request from just that account.

    Similar to the above, this would be possible when using EC2; however, we'd need to manage multiple config files, which isn't fun.

  • Reduced infrastructure

    With the EC2 version, we have an ASG purely to facilitate deployments via Riff-Raff. This is not needed in the ECS version.

    In the EC2 version, we also place configuration in S3. The implementation of the ECS version removes this, in favour of inlining the config.
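The inlined-config approach can be sketched as a plain function that renders the source config as a string at synth time, so nothing needs fetching from S3 at runtime. This is a minimal sketch: the names and the spec shape are hypothetical simplifications of the real CloudQuery spec.

```typescript
// Hypothetical sketch: render a CloudQuery source config as a string at
// synth time, so it can be baked into the task definition instead of
// being fetched from S3 at runtime. The spec shape is simplified.
interface SourceSpec {
  name: string;
  tables: string[];
  destinations: string[];
}

function renderSourceConfig(spec: SourceSpec): string {
  return [
    'kind: source',
    'spec:',
    `  name: ${spec.name}`,
    `  tables: [${spec.tables.join(', ')}]`,
    `  destinations: [${spec.destinations.join(', ')}]`,
  ].join('\n');
}

const config = renderSourceConfig({
  name: 'aws',
  tables: ['aws_s3_buckets'],
  destinations: ['postgresql'],
});
```

The resulting string can then be passed straight into the container definition, removing the S3 round-trip entirely.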

How has it been verified?

The stack has been deployed. There are a couple of ways to verify it works:

Notes

Sadly, I don't think it will be possible to use a centralised logging container on Fargate, as:

Fargate tasks do not support the DAEMON scheduling strategy.
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_services.html#service_scheduler

This means each task has three containers:

  1. One to vend temporary AWS RDS IAM tokens
  2. One to perform the CloudQuery work
  3. One to ship logs

We should keep an eye on the cost of this, as it might become cheaper to ship the logs to CloudWatch logs, and then use cloudwatch-logs-management to route them to Central ELK.
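As a rough sketch of the resulting task layout (container names here are hypothetical; only the `essential` flag is a real ECS concept), each task might be modelled as:

```typescript
// Sketch of the per-task container layout. Container names are
// hypothetical. In ECS, a task stops when an essential container exits,
// so (as an assumption) only the CloudQuery container is marked
// essential: the token vendor runs to completion first, and the log
// shipper is a sidecar.
interface TaskContainer {
  name: string;
  essential: boolean;
}

const taskContainers: TaskContainer[] = [
  { name: 'rds-iam-token-vendor', essential: false },
  { name: 'cloudquery', essential: true },
  { name: 'log-shipper', essential: false },
];

const essentialCount = taskContainers.filter((c) => c.essential).length;
```

Marking only the CloudQuery container as essential means the task ends naturally when the collection run finishes, rather than being held open by the sidecars.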

TODO (non-blocking)

@akash1810 akash1810 force-pushed the aa/cq-ecs branch 3 times, most recently from 599bfe4 to 430e765 Compare May 19, 2023 05:40
@akash1810 akash1810 changed the title wip feat(cloudquery): Run CloudQuery on ECS May 19, 2023
Comment on lines +130 to +168
logging: LogDrivers.awsLogs({
  streamPrefix: [stack, stage, app].join('/'),
  logRetention: RetentionDays.ONE_DAY,
}),
Member Author

Send the log shipper's logs to CloudWatch.

Comment on lines +134 to +174
environment: {
  STACK: stack,
  STAGE: stage,
  APP: app,
  GU_REPO: thisRepo,
},
Member Author

These are read by the custom firelens config.

Comment on lines +27 to +33
const firelensImage = ContainerImage.fromRegistry(
  'ghcr.io/guardian/hackday-firelens:main',
);
Member Author

TODO rename this repository.

@akash1810 akash1810 changed the title feat(cloudquery): Run CloudQuery on ECS feat(cloudquery): Run on ECS May 19, 2023
The data doesn't change very often, so no need to collect it each day.
Comment on lines +23 to +25
const awsCliImage = ContainerImage.fromRegistry(
  'public.ecr.aws/aws-cli/aws-cli',
);
Member Author

This is the official AWS CLI image.

@akash1810 akash1810 marked this pull request as ready for review May 20, 2023 08:25
@akash1810 akash1810 requested review from a team as code owners May 20, 2023 08:25
@nicl nicl force-pushed the aa/cq-ecs branch 2 times, most recently from 50773f0 to 6fe5208 Compare May 22, 2023 14:35
...in order to:

* adhere to least privilege (by only passing policies to the tasks
  that need them)
* pass the Cloudquery specs in as props at the task creation level
  (i.e. in cloudquery.ts rather than lower down) so that the same
  model will work for e.g. Github and other sources going forward

The task schedules have also been modified: at the moment, once
a day is a good common baseline for tasks that are not hitting
rate limits, so we just standardise on that.

Finally, task names have been improved to aid discovery in the AWS
Console and API.
@nicl
Contributor

nicl commented May 22, 2023

Update: @akash1810 I have added a commit with some changes. Obviously feel free to drop it if you'd rather get this live without it!

The main difference is to change the scheduled task interface from:

interface CustomRateTable {
    schedule: Schedule;
    tables: string[];
}

to something like:

interface CloudquerySource {
	name: string;
	description: string;
	schedule: Schedule;
	config: CloudqueryConfig;
	policies?: PolicyStatement[];
	managedPolicies?: IManagedPolicy[];
}

This looks more complicated but allows:

  • least privilege, as tasks can be passed just the policies they require
  • support for non-AWS tasks, such as the upcoming Github, Fastly etc.
  • better task identification (via the name/description)

The idea is that everything becomes a scheduled task here, including the standard fetch-everything-from-all-accounts Cloudquery task. So the code overall becomes simpler with fewer ifs/branches.
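A minimal sketch of that model (simplified types: `Schedule` is replaced with a string stand-in, and the names are illustrative rather than the real task names): every source, including the fetch-everything one, is just an entry in a list, and one loop creates a scheduled task per entry.

```typescript
// Simplified sketch of the "everything is a scheduled task" model.
// schedule is a string stand-in for the CDK Schedule type; names are
// illustrative, not the real task names.
interface CloudquerySourceSketch {
  name: string;
  description: string;
  schedule: string;
  tables: string[];
}

const sources: CloudquerySourceSketch[] = [
  {
    name: 'All',
    description: 'Data fetched across all accounts in the organisation.',
    schedule: 'rate(1 day)',
    tables: ['*'],
  },
  {
    name: 'OrgWide',
    description: 'Organisation-level data, fetched from one account only.',
    schedule: 'rate(1 day)',
    tables: ['aws_organizations'],
  },
];

// One loop, no special cases: each source becomes one scheduled task.
const scheduledTasks = sources.map((s) => `${s.name}: ${s.schedule}`);
```
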

Task naming turns out to be especially helpful in the AWS Console and also because it shows up as a field in the logs:

[Screenshot 2023-05-22 at 16:41:29]

Warning

The logging to ELK seems to be broken and I'm not sure why. Seeing e.g.

[error] [parser] cannot parse '2023-05-22T16:36:45Z'
[ warn] [parser:json] invalid time format %d/%b/%Y:%H:%M:%S %z for '2023-05-22T16:36:45Z'

I didn't think I'd changed behaviour here.

@akash1810
Member Author

This looks more complicated but allows:

  • least privilege, as tasks can be passed just the policies they require
  • support for non-AWS tasks, such as the upcoming Github, Fastly etc.
  • better task identification (via the name/description)

Love every one of these! I was struggling to add GH to this pattern; I think these changes get us closer, but I think we'd need a way to set secrets too. I suggest this refactor be its own PR though.

Regarding logging, that was happening prior to your change too. We should be able to reproduce this behaviour locally to help debugging (woop containers!). WDYT: is the current version of logging shippable?

import { postgresDestinationConfig } from './config';
import { Versions } from './versions';

const awsCliImage = ContainerImage.fromRegistry(
Contributor

For a potential follow-up: could we put the AWS CLI/firelens logging in a container that has cloudquery as the base, self-publish, and auth that way? It would save a couple of containers. @Mark-McCracken

Contributor

Something to note (if my understanding is correct) is that containers share the resources of the Task, so we don't actually pay more for this. The auth container should also exit immediately, so it shouldn't impact the resources (CPU/memory) available to the main Cloudquery container. I agree though that generally fewer containers is better where possible!

Contributor

@NovemberTang NovemberTang left a comment

Some notes that can be talked about later, not blocking.
LGTM.
fargate 4eva.

  skipTables: skipTables,
}),
managedPolicies: [
  readonlyAccessManagedPolicy(this, 'fetch-all-managed-policy'),
Member Author

We're creating an instance of this managed policy for each task, which increases the chances of bumping against the CFN template size limit. If we do see this in future, we can instantiate just one and pass it across.
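One way to do that, sketched as a generic helper (hypothetical, not in the PR): create the managed policy lazily on first use and return the same instance thereafter.

```typescript
// Hypothetical sketch: lazily create a shared resource once and reuse
// it, e.g. a single readonly managed policy passed to every task.
function memoize<T>(factory: () => T): () => T {
  let cached: T | undefined;
  return () => {
    if (cached === undefined) {
      cached = factory();
    }
    return cached;
  };
}

let creations = 0;
const getReadonlyPolicy = memoize(() => {
  creations += 1;
  return 'readonly-managed-policy'; // stand-in for the IManagedPolicy instance
});
```

However many tasks call `getReadonlyPolicy()`, only one policy resource ends up in the template.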

Comment on lines +90 to +103
const essentialPolicies = [
  // Log shipping
  new PolicyStatement({
    actions: ['kinesis:Describe*', 'kinesis:Put*'],
    effect: Effect.ALLOW,
    resources: [loggingStreamArn],
  }),

  // Assume the cloudquery-access role in member accounts
  new PolicyStatement({
    effect: Effect.ALLOW,
    resources: ['arn:aws:iam::*:role/cloudquery-access'],
    actions: ['sts:AssumeRole'],
  }),
];
Member Author

Worth keeping an eye on these, for similar reasons to above.

@akash1810 akash1810 enabled auto-merge May 23, 2023 11:53
@akash1810 akash1810 merged commit baff2dc into main May 23, 2023
@akash1810 akash1810 deleted the aa/cq-ecs branch May 23, 2023 11:55
description: 'Data fetched across all accounts in the organisation.',
schedule: Schedule.rate(Duration.days(1)),
config: awsSourceConfigForOrganisation({
  tables: ['*'],
Member Author
@akash1810 akash1810 May 23, 2023

Doing '*' here with basic skipTables is "dangerous", as it doubles the data for tables that are also collected individually, and the task might not have permission for everything it collects. For example, this task will collect data for the aws_organizations table in all accounts. Its policy doesn't allow this, and we know only the DeployTools account can make this request. That is, there will be some log noise with this.

I think a better version of the "all" task would be to add all the individual tables below to the skipTables prop. cc @nicl as this was changed in c67fa6c.

Using '*' also seems to be discouraged by CloudQuery since v3.0.0.
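That suggestion can be sketched as a small helper (hypothetical name, not in the PR): compute the "all" task's skipTables from the tables that the individually scheduled sources already collect.

```typescript
// Hypothetical helper: the "all" task should skip every table that an
// individually scheduled source already collects, so nothing is fetched
// twice and unauthorised org-level tables can be excluded.
function skipTablesForAllTask(individuallyScheduled: string[][]): string[] {
  const all = ([] as string[]).concat(...individuallyScheduled);
  return Array.from(new Set(all)).sort();
}

const skip = skipTablesForAllTask([
  ['aws_organizations'],
  ['aws_s3_buckets', 'aws_lambda_functions'],
]);
```

Deriving the skip list from the source definitions themselves would keep the two in sync automatically as new scheduled sources are added.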

akash1810 added a commit that referenced this pull request May 24, 2023
As of #189 we're now running on ECS, so we can remove the EC2 based infrastructure.