Skip to content

Conversation

kingcrimsontianyu
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu commented Sep 18, 2025

kvikio::RemoteHandle::open() has started to support public S3 since #820. When open() sees an S3 URL, it first assumes a private S3 object and queries its size. If the query fails, it proceeds to assume that the file is a public S3.

During the construction of a private S3 object, the constructor scans the environment variables for AWS credentials. Manual testing of #820 accidentally includes the env vars all the time and hides a bug: in absence of env vars, the constructor of the private S3 object will throw an exception, which is unhandled, and KvikIO never gets a chance to try with public S3.

This PR fixes this bug.

@kingcrimsontianyu kingcrimsontianyu added bug Something isn't working non-breaking Introduces a non-breaking change c++ Affects the C++ API of KvikIO labels Sep 18, 2025
Copy link

copy-pr-bot bot commented Sep 18, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review September 18, 2025 15:44
@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner September 18, 2025 15:44
if (endpoint->remote_endpoint_type() == RemoteEndpointType::S3) {
try {
try {
endpoint = create_endpoint(type);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_endpoint may throw if the private S3 constructor does not find the env vars for AWS credentials. So this line must be placed in the try block too, in addition to the file size query (endpoint->get_file_size())

allow_list->end()) {
endpoint = std::make_unique<S3PublicEndpoint>(url);
} else {
throw;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this changed to avoid losing the error info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In all other cases where we are not reassigning the endpoint (S3 private -> S3 public), if there is any exception, we should just catch and rethrow it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to unit test this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible. I'll learn boto3 and moto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently it is not possible to use moto to mock the access to the public and private S3 buckets for the open function because of some limitations in KvikIO. For public S3, we only support HTTP/HTTPS scheme, and syntax validation for HTTP/HTTPS URL currently only supports the actual AWS S3 format (e.g. https://s3.<region-code>.amazonaws.com/<bucket-name>/<key-name>). This doesn't work with moto where we need to use a local address. I've filed an issue to address this limit: #834 .

@kingcrimsontianyu
Copy link
Contributor Author

kingcrimsontianyu commented Sep 22, 2025

Verified that with this branch, pylibcudf can open S3 public object without the presence of dummy credentials. Full manual test script as follows.

import pylibcudf as plc
import cudf
import cupy as cp
import os


class EnvVarCtx:
    def __init__(self, env_vars={}):
        self.env_vars = env_vars
        self.original_values = {}

    def __enter__(self):
        for var_name, new_value in self.env_vars.items():
            # None if non-existent
            self.original_values[var_name] = os.environ.get(var_name)
            # Ensure value is a string
            os.environ[var_name] = str(new_value)

    def __exit__(self, exc_type, exc_val, exc_tb):
        for var_name, original_value in self.original_values.items():
            if original_value is None:
                if var_name in os.environ:
                    del os.environ[var_name]
            else:
                os.environ[var_name] = original_value


class TestManager:
    def __init__(self, url):
        self.url = url

    def test_pylibcudf(self):
        try:
            source = plc.io.SourceInfo([self.url])
            options = plc.io.parquet.ParquetReaderOptions.builder(
                source).build()
            table = plc.io.parquet.read_parquet(options)
            df = cudf.DataFrame.from_pylibcudf(table)
            print(df)
        except Exception as ex:
            print(ex)

if __name__ == "__main__":
    aws_credentials = {
        "AWS_ACCESS_KEY_ID": "",
        "AWS_SECRET_ACCESS_KEY": "",
        # "AWS_SESSION_TOKEN" : "",
        "AWS_DEFAULT_REGION": "",
    }

    # S3 private
    print("--> S3 private")
    with EnvVarCtx(aws_credentials):
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

    # S3 public, but with unused credentials
    print("--> S3 public with credentials")
    with EnvVarCtx(aws_credentials):
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

    # S3 public without credentials
    # Without this branch, libcudf encounters an error
    print("--> S3 public without credentials")
    with EnvVarCtx():
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

    # S3 presigned URL
    print("--> S3 presigned URL")
    with EnvVarCtx():
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

@kingcrimsontianyu
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit d4b7773 into rapidsai:branch-25.10 Sep 22, 2025
77 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c++ Affects the C++ API of KvikIO non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants