Trace S3 GET requests back to Athena queries. #552

antonysouthworth-halter · 2024-07-11T22:14:25Z

https://github.com/dbt-athena/dbt-athena/issues/686

Context here is if you are spending a large amount of money on S3 GET requests it can be quite difficult to track down exactly which query is originating the requests. Yes, you can get pretty close with:

Checking IAM identity in S3 GET CloudTrail log.
Checking User-Agent in S3 GET CloudTrail log.
S3 bucket metric filters.

But these are all quite indirect. Instead I propose injecting some unique identifier into the User-Agent header for each StartQueryExecution call. That way, the User-Agent string now uniquely identifies a StartQueryExecution request and will be passed along to the GetObject requests, allowing us to associate GetObject requests with specific query executions.

Yes, this is janky as shit and not really a smart way of doing it and just generally probably considered _ab_use of the User-Agent header.... but there doesn't seem to really be another alternative so....

You can use the User-Agent and responseElements of the CloudTrail log line for the StartQueryExecution call to associate the GETs with an Athena QueryExecutionId, which can in turn be used to look up the QueryText (not incuded in CloudTrail logs, instead available from athena:GetQueryExecution).

TODO:

test?

antonysouthworth-halter · 2024-07-11T22:30:19Z

Just tested this locally in our dbt setup, here's an example of what the CloudTrail log would look like:

{
    "eventVersion": "1.09",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "REDACTED",
        "arn": "REDACTED",
        "accountId": "REDACTED",
        "accessKeyId": "REDACTED",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "REDACTED",
                "arn": "REDACTED",
                "accountId": "REDACTED",
                "userName": "REDACTED"
            },
            "attributes": {
                "creationDate": "2024-07-11T22:21:56Z",
                "mfaAuthenticated": "REDACTED"
            }
        }
    },
    "eventTime": "2024-07-11T22:24:44Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartQueryExecution",
    "awsRegion": "ap-southeast-2",
    "sourceIPAddress": "REDACTED",
    "userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
    "requestParameters": {
        "queryString": "***OMITTED***",
        "clientRequestToken": "REDACTED",
        "queryExecutionContext": {
            "database": "REDACTED",
            "catalog": "awsdatacatalog"
        },
        "resultConfiguration": {
            "outputLocation": "REDACTED"
        },
        "workGroup": "REDACTED"
    },
    "responseElements": {
        "queryExecutionId": "REDACTED"
    },
    "requestID": "REDACTED",
    "eventID": "REDACTED",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "REDACTED",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.3",
        "cipherSuite": "TLS_AES_128_GCM_SHA256",
        "clientProvidedHostHeader": "athena.ap-southeast-2.amazonaws.com"
    }
}

Lots there of course, here's the important bits:

{
    "eventVersion": "1.09",
    "eventTime": "2024-07-11T22:24:44Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartQueryExecution",
    "userAgent": "Boto3/1.34.69 md/Botocore#1.34.69 ua/2.0 os/macos#22.6.0 md/arch#x86_64 lang/python#3.9.19 md/pyimpl#CPython cfg/retry-mode#standard Botocore/1.34.69 PyAthena/3.8.2 dbt-athena-community/1.8.0rc1 QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1",
    "requestParameters": {
        "queryString": "***OMITTED***",
        "workGroup": "REDACTED"
    },
    "responseElements": {
        "queryExecutionId": "REDACTED"
    },
    "requestID": "REDACTED",
    "eventID": "REDACTED"
}

QueryTraceId=9ed7073d-2b8d-430c-a7f2-981e1814f1f1 injected into the User-Agent.

antonysouthworth-halter · 2024-07-11T23:10:04Z

It took me a while to get started trying to contribute to this project since there's no clear instructions I could find and I have never used hatch or even really much pyproject.toml stuff before; so adding a basic CONTRIBUTING.md doc too to hopefully help others in future.

antonysouthworth-halter · 2024-07-12T01:28:03Z

So...... I don't know why I thought S3 GETs have this User-Agent string in them, must have got my wires crossed somewhere. Closing.

antonysouthworth-halter added 2 commits July 12, 2024 09:58

Inject ClientRequestToken into User-Agent for StartQueryExecution

ed379a5

Actually, it doesn't need to be the ClientRequestToken

95b46c5

Add basic CONTRIBUTING.

9ddd20b

antonysouthworth-halter force-pushed the feat/tracing branch from 3f81c8e to 9ddd20b Compare July 11, 2024 23:06

antonysouthworth-halter closed this Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Trace S3 GET requests back to Athena queries. #552

Trace S3 GET requests back to Athena queries. #552

antonysouthworth-halter commented Jul 11, 2024 •

edited

Loading

antonysouthworth-halter commented Jul 11, 2024

antonysouthworth-halter commented Jul 11, 2024

antonysouthworth-halter commented Jul 12, 2024

Trace S3 GET requests back to Athena queries. #552

Trace S3 GET requests back to Athena queries. #552

Conversation

antonysouthworth-halter commented Jul 11, 2024 • edited Loading

antonysouthworth-halter commented Jul 11, 2024

antonysouthworth-halter commented Jul 11, 2024

antonysouthworth-halter commented Jul 12, 2024

antonysouthworth-halter commented Jul 11, 2024 •

edited

Loading