Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace S3 GET requests back to Athena queries. #552

Conversation

antonysouthworth-halter
Copy link

@antonysouthworth-halter antonysouthworth-halter commented Jul 11, 2024

dbt-labs/dbt-athena#686

Context here is if you are spending a large amount of money on S3 GET requests it can be quite difficult to track down exactly which query is originating the requests. Yes, you can get pretty close with:

  • Checking IAM identity in S3 GET CloudTrail log.
  • Checking User-Agent in S3 GET CloudTrail log.
  • S3 bucket metric filters.

But these are all quite indirect. Instead I propose injecting some unique identifier into the User-Agent header for each StartQueryExecution call. That way, the User-Agent string now uniquely identifies a StartQueryExecution request and will be passed along to the GetObject requests, allowing us to associate GetObject requests with specific query executions.

Yes, this is janky as shit and not really a smart way of doing it and just generally probably considered _ab_use of the User-Agent header.... but there doesn't seem to really be another alternative so....

You can use the User-Agent and responseElements of the CloudTrail log line for the StartQueryExecution call to associate the GETs with an Athena QueryExecutionId, which can in turn be used to look up the QueryText (not incuded in CloudTrail logs, instead available from athena:GetQueryExecution).

TODO:

  • test?

@antonysouthworth-halter
Copy link
Author

Just tested this locally in our dbt setup, here's an example of what the CloudTrail log would look like:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "REDACTED",
        "arn": "REDACTED",
        "accountId": "REDACTED",
        "accessKeyId": "REDACTED",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "REDACTED",
                "arn": "REDACTED",
                "accountId": "REDACTED",
                "userName": "REDACTED"
            },
            "attributes": {
                "creationDate": "2024-07-11T22:21:56Z",
                "mfaAuthenticated": "REDACTED"
            }
        }
    },
    "eventTime": "2024-07-11T22:24:44Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartQueryExecution",
    "awsRegion": "ap-southeast-2",
    "sourceIPAddress": "REDACTED",
    "userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
    "requestParameters": {
        "queryString": "***OMITTED***",
        "clientRequestToken": "REDACTED",
        "queryExecutionContext": {
            "database": "REDACTED",
            "catalog": "awsdatacatalog"
        },
        "resultConfiguration": {
            "outputLocation": "REDACTED"
        },
        "workGroup": "REDACTED"
    },
    "responseElements": {
        "queryExecutionId": "REDACTED"
    },
    "requestID": "REDACTED",
    "eventID": "REDACTED",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "REDACTED",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "athena.ap-southeast-2.amazonaws.com"
    }
}

Lots there of course, here's the important bits:

{
    "eventVersion": "1.09",
    "eventTime": "2024-07-11T22:24:44Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartQueryExecution",
    "userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
    "requestParameters": {
        "queryString": "***OMITTED***",
        "workGroup": "REDACTED"
    },
    "responseElements": {
        "queryExecutionId": "REDACTED"
    },
    "requestID": "REDACTED",
    "eventID": "REDACTED"
}

QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1 injected into the User-Agent.

@antonysouthworth-halter
Copy link
Author

It took me a while to get started trying to contribute to this project since there's no clear instructions I could find and I have never used hatch or even really much pyproject.toml stuff before; so adding a basic CONTRIBUTING.md doc too to hopefully help others in future.

@antonysouthworth-halter
Copy link
Author

So...... I don't know why I thought S3 GETs have this User-Agent string in them, must have got my wires crossed somewhere. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant