Trace S3 GET requests back to Athena queries. #552
Closed
dbt-labs/dbt-athena#686
Context: if you are spending a large amount of money on S3 GET requests, it can be quite difficult to track down exactly which query is originating them. Yes, you can get pretty close with:
But these are all quite indirect. Instead, I propose injecting a unique identifier into the User-Agent header of each StartQueryExecution call. The User-Agent string then uniquely identifies that StartQueryExecution request and is passed along to the resulting GetObject requests, allowing us to associate GetObject requests with specific query executions.
Yes, this is janky as shit, not really a smart way of doing it, and probably counts as _ab_use of the User-Agent header... but there doesn't seem to be another alternative, so...
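As a rough illustration of the idea (not necessarily how this PR implements it), here is a minimal sketch using boto3. It assumes botocore's `Config(user_agent_extra=...)` is an acceptable way to append to the User-Agent. The `athena-trace/` prefix, `make_trace_tag`, and `start_traced_query` are hypothetical names invented for this example.

```python
import uuid


def make_trace_tag(prefix: str = "athena-trace") -> str:
    """Return a unique token to append to the User-Agent string.

    The prefix is a hypothetical convention, not something Athena defines.
    """
    return f"{prefix}/{uuid.uuid4().hex}"


def start_traced_query(sql: str, output_location: str, region: str = "us-east-1"):
    """Start an Athena query whose request carries a unique User-Agent tag.

    Returns (trace_tag, query_execution_id).
    """
    # Imported lazily so make_trace_tag() works without boto3 installed.
    import boto3
    from botocore.config import Config

    tag = make_trace_tag()
    # user_agent_extra is appended to botocore's default User-Agent string.
    # A fresh client per query keeps the tag unique to one StartQueryExecution.
    client = boto3.client(
        "athena",
        region_name=region,
        config=Config(user_agent_extra=tag),
    )
    response = client.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return tag, response["QueryExecutionId"]
```

Creating a client per query is the simplest way to get one tag per call. A long-lived client with a request hook would likely be cheaper, but that is beyond this sketch.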
You can use the User-Agent and responseElements of the CloudTrail log line for the StartQueryExecution call to associate the GETs with an Athena QueryExecutionId, which can in turn be used to look up the QueryText (not included in CloudTrail logs, but available from athena:GetQueryExecution).
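The correlation described above could be sketched roughly as follows, given CloudTrail records already parsed into dicts. This assumes the injected identifier looks like `athena-trace/<32 hex chars>` (a hypothetical format), that S3 data events are enabled, and that the StartQueryExecution record exposes the ID as `responseElements.queryExecutionId`. Check the field names against your own logs.

```python
import re
from collections import Counter

# Hypothetical tag format; adjust to whatever identifier is actually injected.
TAG_PATTERN = re.compile(r"athena-trace/[0-9a-f]{32}")


def extract_tag(user_agent):
    """Pull the injected tag out of a User-Agent string, or return None."""
    match = TAG_PATTERN.search(user_agent or "")
    return match.group(0) if match else None


def count_gets_per_query(records):
    """Count S3 GetObject events per Athena query execution ID."""
    # Pass 1: map each tag to the query execution it started.
    tag_to_query = {}
    for rec in records:
        if (rec.get("eventSource") == "athena.amazonaws.com"
                and rec.get("eventName") == "StartQueryExecution"):
            tag = extract_tag(rec.get("userAgent"))
            query_id = (rec.get("responseElements") or {}).get("queryExecutionId")
            if tag and query_id:
                tag_to_query[tag] = query_id

    # Pass 2: attribute each GetObject to a query via the shared tag.
    counts = Counter()
    for rec in records:
        if (rec.get("eventSource") == "s3.amazonaws.com"
                and rec.get("eventName") == "GetObject"):
            query_id = tag_to_query.get(extract_tag(rec.get("userAgent")))
            if query_id:
                counts[query_id] += 1
    return counts
```

The query text for the most expensive IDs can then be fetched with boto3's `get_query_execution(QueryExecutionId=...)`, reading `["QueryExecution"]["Query"]` from the response.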
TODO: