feat(cloudquery): Run on ECS #189
Conversation
logging: LogDrivers.awsLogs({
  streamPrefix: [stack, stage, app].join('/'),
  logRetention: RetentionDays.ONE_DAY,
}),
Send the log shipper's logs to CloudWatch.
environment: {
  STACK: stack,
  STAGE: stage,
  APP: app,
  GU_REPO: thisRepo,
},
These are read by the custom firelens config.
const firelensImage = ContainerImage.fromRegistry(
  'ghcr.io/guardian/hackday-firelens:main',
);
TODO rename this repository.
This should make it easier to update in future.
To avoid the use of the root credentials, implement RDS IAM auth. The flow is:
1. Create an AWS CLI container, generate a token, and write it to disk.
2. Configure the worker's Postgres connection string using the above file.
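A minimal sketch of step 1, assuming a task definition with a shared volume mounted at /token, a dbHost value, and the awsCliImage defined elsewhere in this PR; none of this is in the change itself:

// Hypothetical sidecar: generate an RDS IAM auth token and write it to the shared volume.
taskDefinition.addContainer('generate-db-auth-token', {
  image: awsCliImage,
  entryPoint: ['/bin/sh', '-c'],
  command: [
    // `aws rds generate-db-auth-token` is a real CLI command; the file path is an assumption.
    'aws rds generate-db-auth-token --hostname "$DB_HOST" --port 5432 --username cloudquery > /token/rds-token',
  ],
  environment: { DB_HOST: dbHost },
  essential: false, // exits once the token has been written
});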
This'll make it easier to add to the existing EC2 stack.
The data doesn't change very often, so no need to collect it each day.
The EC2 infrastructure is doing this, so no need to duplicate it.
const awsCliImage = ContainerImage.fromRegistry(
  'public.ecr.aws/aws-cli/aws-cli',
);
This is the official AWS CLI image.
...in order to:
* adhere to least privilege (by only passing policies to the tasks that need them)
* pass the Cloudquery specs in as props at the task creation level (i.e. in cloudquery.ts rather than lower down), so that the same model will work for e.g. GitHub and other sources going forward

The task schedules have also been modified: at the moment, once a day is a good common baseline for tasks that are not hitting rate limits, so we just standardise on that. Finally, task names have been improved to aid discovery in the AWS Console and API.
Update: @akash1810 I have added a commit with some changes. Obviously feel free to drop it if you'd rather get this live without it! The main difference is to change the scheduled task interface from:

interface CustomRateTable {
  schedule: Schedule;
  tables: string[];
}

to something like:

interface CloudquerySource {
  name: string;
  description: string;
  schedule: Schedule;
  config: CloudqueryConfig;
  policies?: PolicyStatement[];
  managedPolicies?: IManagedPolicy[];
}

This looks more complicated but allows:
The idea is that everything becomes a scheduled task here, including the standard fetch-everything-from-all-accounts Cloudquery task. So the code overall becomes simpler with fewer ifs/branches. Task naming turns out to be especially helpful in the AWS Console, and also because it shows up as a field in the logs.

Warning: The logging to ELK seems to be broken and I'm not sure why. Seeing e.g.

and not sure why as I didn't think I'd changed behaviour here.
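For concreteness, a hedged sketch of how a source might be declared under the new interface, extrapolated from the diff further down; the exact surrounding code is an assumption:

// Illustrative only: one entry in a hypothetical list of sources.
const sources: CloudquerySource[] = [
  {
    name: 'FetchAll',
    description: 'Data fetched across all accounts in the organisation.',
    schedule: Schedule.rate(Duration.days(1)),
    config: awsSourceConfigForOrganisation({ tables: ['*'], skipTables }),
    managedPolicies: [readonlyAccessManagedPolicy(this, 'fetch-all-managed-policy')],
  },
  // ...further sources (e.g. a future GitHub source) slot in alongside.
];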
Love every one of these! I was struggling to add GH to this pattern; I think these changes get us closer, but I think we'd need a way to set secrets too; suggest this refactor be its own PR though. Regarding logging, that was happening prior to your change too. We should be able to reproduce this behaviour locally to help debugging (woop containers!). WDYT - is this shippable with the current version of logging?
import { postgresDestinationConfig } from './config';
import { Versions } from './versions';

const awsCliImage = ContainerImage.fromRegistry(
For a potential follow-up: could we put the AWS CLI/firelens logging in a container that has cloudquery as the base, self publish, and auth that way? Would save a couple of containers? @Mark-McCracken
Something to note (if my understanding is correct) is that containers share the resources of the Task, so we don't actually pay more for this. The auth container should also exit immediately, so it shouldn't impact the resources (CPU/mem) available to the main Cloudquery container. Agree though that generally fewer containers is better if possible!
Some notes that can be talked about later, not blocking.
LGTM.
fargate 4eva.
  skipTables: skipTables,
}),
managedPolicies: [
  readonlyAccessManagedPolicy(this, 'fetch-all-managed-policy'),
We're creating an instance of this managed policy for each task, which increases the chances of bumping against the CFN template limit. If we do see this in future, we can instantiate just one, and pass it across.
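A sketch of that future change; none of this is in the PR:

// Hypothetical: instantiate the managed policy once, rather than once per task...
const readonlyPolicy = readonlyAccessManagedPolicy(this, 'readonly-managed-policy');
// ...and pass the same instance to every source that needs it, e.g.
// managedPolicies: [readonlyPolicy]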
const essentialPolicies = [
  // Log shipping
  new PolicyStatement({
    actions: ['kinesis:Describe*', 'kinesis:Put*'],
    effect: Effect.ALLOW,
    resources: [loggingStreamArn],
  }),

  new PolicyStatement({
    effect: Effect.ALLOW,
    resources: ['arn:aws:iam::*:role/cloudquery-access'],
    actions: ['sts:AssumeRole'],
  }),
];
Worth keeping an eye on these, for similar reasons to above.
We're always writing to Postgres, so no need for the prop.
description: 'Data fetched across all accounts in the organisation.',
schedule: Schedule.rate(Duration.days(1)),
config: awsSourceConfigForOrganisation({
  tables: ['*'],
Doing '*' here with basic skipTables is "dangerous", as it doubles the data, and it might not have permission to collect it. For example, this task will collect data for the aws_organizations table in all accounts. Its policy doesn't allow this, and we know only the deployTools account can make this request. That is, there will be an amount of log noise with this.

I think a better version of the "all" task would be to add all the individual tables below to the skipTables prop. cc @nicl as this was changed in c67fa6c.

Using '*' also seems to be discouraged by CloudQuery since v3.0.0.
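A rough sketch of that suggestion; the listed tables are illustrative, not exhaustive:

// Hypothetical: keep '*' but skip tables that already have a dedicated task,
// or that only the deployTools account may query.
config: awsSourceConfigForOrganisation({
  tables: ['*'],
  skipTables: [
    ...skipTables,
    'aws_organizations', // only collectable from the deployTools account
    // ...any other table with its own scheduled task below
  ],
}),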
As of #189 we're now running on ECS, so we can remove the EC2 based infrastructure.
What does this change? And why?
Inspired by our experiments during this quarter's R&D sprint, this change moves away from using EC2 to run CloudQuery and towards ECS, on Fargate.
This brings some benefits:
No idle time
The EC2 instance sits idle for ~23 hours a day.
Implementing different schedules is easier
Though it's possible to implement different schedules using systemd timers on EC2, we'd likely need to manage multiple config files, which isn't fun.
The changes in this PR allow us to schedule a job as follows:
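A hedged sketch of such a scheduled job, based on the CloudquerySource interface discussed earlier; the GitHub config helper is hypothetical:

// Illustrative only: collect github_teams weekly rather than daily.
{
  name: 'GithubTeams',
  description: 'GitHub team data, collected weekly to reduce rate limiting.',
  schedule: Schedule.rate(Duration.days(7)),
  config: githubSourceConfig({ tables: ['github_teams'] }), // hypothetical helper
},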
Being able to collect data from different tables at different rates is going to be crucial to adding GitHub to CloudQuery due to the rate limiting of the GitHub API. If we collect data for github_teams at a slower rate than github_repositories, we reduce the chance of getting rate limited.

Removal of known errors
The EC2 collector currently produces error logs when collecting data for the organization tables. This is because we're making the request from all accounts. We know that only the DeployTools account has this permission, so we now make the request from just that account.
Similar to the above, this is possible when using EC2, however we'd need to manage multiple config files, which isn't fun.
Reduced infrastructure
With the EC2 version, we have an ASG purely to facilitate deployments via Riff-Raff. This is not needed in the ECS version.
In the EC2 version, we also place configuration in S3. The implementation of the ECS version removes this, in favour of inlining the config.
How has it been verified?
The stack has been deployed. There are a couple of ways to verify it works:
Notes
Sadly, I don't think it will be possible to use a centralised logging container on Fargate as:
This means each task has three containers: the Cloudquery container itself, a firelens container for log routing, and an AWS CLI container for authentication.
We should keep an eye on the cost of this, as it might become cheaper to ship the logs to CloudWatch logs, and then use cloudwatch-logs-management to route them to Central ELK.
TODO (non-blocking)