Fix a bug in public S3 inference #831
Conversation
```cpp
if (endpoint->remote_endpoint_type() == RemoteEndpointType::S3) {
  try {
    try {
      endpoint = create_endpoint(type);
```
`create_endpoint` may throw if the private S3 constructor does not find the env vars for AWS credentials. So this line must be placed in the `try` block too, in addition to the file size query (`endpoint->get_file_size()`).
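The point of the comment above can be sketched in Python (hypothetical names, mirroring the C++ control flow; `CredentialError`, `create_private_endpoint`, and `infer_endpoint` are illustrative, not KvikIO APIs): if the endpoint construction sits outside the `try` block, a missing-credentials exception escapes before the public-S3 fallback ever runs.

```python
class CredentialError(Exception):
    pass


def create_private_endpoint(env):
    # Mirrors the private-S3 constructor: raises when AWS credentials are absent.
    if "AWS_ACCESS_KEY_ID" not in env:
        raise CredentialError("no AWS credentials in environment")
    return "private-endpoint"


def infer_endpoint(env):
    # Both the construction and the size query live inside the try block,
    # so a missing-credentials error also falls through to the public path.
    try:
        endpoint = create_private_endpoint(env)
        # ... the file size query (get_file_size) would also go here ...
        return endpoint
    except CredentialError:
        return "public-endpoint"


print(infer_endpoint({}))                          # falls back to public-endpoint
print(infer_endpoint({"AWS_ACCESS_KEY_ID": "x"}))  # private-endpoint succeeds
```

Had `create_private_endpoint(env)` been called before the `try`, the first call would raise instead of falling back.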
```cpp
allow_list->end()) {
  endpoint = std::make_unique<S3PublicEndpoint>(url);
} else {
  throw;
```
Was this changed to avoid losing the error info?
Yes. In all other cases where we are not reassigning the endpoint (S3 private -> S3 public), if there is any exception, we should just catch and rethrow it.
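In Python, the analogue of C++'s bare `throw;` is a bare `raise`, which rethrows the active exception without discarding its message or traceback. A small sketch (hypothetical `handle` function, not KvikIO code) of the catch-and-rethrow pattern described above:

```python
def handle(url, allow_list):
    try:
        # Stand-in for the failing size query on a private S3 endpoint.
        raise ValueError("size query failed: 403")
    except ValueError:
        if url in allow_list:
            # The one case where we reassign: fall back to a public endpoint.
            return "fall back to public endpoint"
        # All other cases: rethrow without losing the original error info.
        raise


print(handle("s3-url", {"s3-url"}))  # fall back to public endpoint
try:
    handle("other", set())
except ValueError as ex:
    print(ex)  # original message preserved: size query failed: 403
```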
Is it possible to unit test this?
It should be possible. I'll learn `boto3` and `moto`.
I think it is currently not possible to use `moto` to mock the access to the public and private S3 buckets for the `open` function because of some limitations in KvikIO. For public S3, we only support the HTTP/HTTPS scheme, and the syntax validation for HTTP/HTTPS URLs currently only supports the actual AWS S3 format (e.g. `https://s3.<region-code>.amazonaws.com/<bucket-name>/<key-name>`). This doesn't work with `moto`, where we need to use a local address. I've filed an issue to address this limitation: #834.
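The limitation can be illustrated with a hypothetical validator (the regex below is illustrative, not KvikIO's actual check): a strict AWS-format URL pattern accepts the real path-style S3 URL but rejects the localhost address that `moto`'s mock server would require.

```python
import re

# Hypothetical validator: accepts only the AWS path-style S3 format,
# https://s3.<region-code>.amazonaws.com/<bucket-name>/<key-name>
AWS_S3_URL = re.compile(
    r"^https?://s3[.-](?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<bucket>[^/]+)/(?P<key>.+)$"
)


def looks_like_aws_s3(url):
    return AWS_S3_URL.match(url) is not None


print(looks_like_aws_s3("https://s3.us-east-1.amazonaws.com/bucket/key.parquet"))  # True
print(looks_like_aws_s3("http://127.0.0.1:5000/bucket/key.parquet"))               # False: moto's local server
```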
Verified that with this branch, pylibcudf can open an S3 public object without the presence of dummy credentials. Full manual test script as follows.

```python
import pylibcudf as plc
import cudf
import cupy as cp
import os


class EnvVarCtx:
    def __init__(self, env_vars={}):
        self.env_vars = env_vars
        self.original_values = {}

    def __enter__(self):
        for var_name, new_value in self.env_vars.items():
            # None if non-existent
            self.original_values[var_name] = os.environ.get(var_name)
            # Ensure value is a string
            os.environ[var_name] = str(new_value)

    def __exit__(self, exc_type, exc_val, exc_tb):
        for var_name, original_value in self.original_values.items():
            if original_value is None:
                if var_name in os.environ:
                    del os.environ[var_name]
            else:
                os.environ[var_name] = original_value


class TestManager:
    def __init__(self, url):
        self.url = url

    def test_pylibcudf(self):
        try:
            source = plc.io.SourceInfo([self.url])
            options = plc.io.parquet.ParquetReaderOptions.builder(source).build()
            table = plc.io.parquet.read_parquet(options)
            df = cudf.DataFrame.from_pylibcudf(table)
            print(df)
        except Exception as ex:
            print(ex)


if __name__ == "__main__":
    aws_credentials = {
        "AWS_ACCESS_KEY_ID": "",
        "AWS_SECRET_ACCESS_KEY": "",
        # "AWS_SESSION_TOKEN": "",
        "AWS_DEFAULT_REGION": "",
    }

    # S3 private
    print("--> S3 private")
    with EnvVarCtx(aws_credentials):
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

    # S3 public, but with unused credentials
    print("--> S3 public with credentials")
    with EnvVarCtx(aws_credentials):
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

    # S3 public without credentials
    # Without this branch, libcudf encounters an error
    print("--> S3 public without credentials")
    with EnvVarCtx():
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()

    # S3 presigned URL
    print("--> S3 presigned URL")
    with EnvVarCtx():
        url = "<parquet_file_url>"
        tm = TestManager(url)
        tm.test_pylibcudf()
```
/merge
`kvikio::RemoteHandle::open()` has supported public S3 since #820. When `open()` sees an S3 URL, it first assumes a private S3 object and queries its size. If the query fails, it proceeds to assume that the file is a public S3 object.

During the construction of a private S3 object, the constructor scans the environment variables for AWS credentials. The manual testing of #820 accidentally included the env vars all the time, which hid a bug: in the absence of the env vars, the constructor of the private S3 object throws an exception, which goes unhandled, and KvikIO never gets a chance to try public S3.

This PR fixes this bug.
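The bug and the fix can be contrasted in a Python sketch (hypothetical names mirroring the C++ flow described above; `make_private`, `query_size`, and the exception types are illustrative, not KvikIO APIs):

```python
class CredentialError(Exception):
    pass


class SizeQueryError(Exception):
    pass


def make_private(has_creds):
    # Stand-in for the private-S3 constructor: scans for AWS credentials.
    if not has_creds:
        raise CredentialError("AWS credentials not found")
    return "private"


def query_size(endpoint, is_public):
    # Stand-in for the file size query, which fails on a public object.
    if is_public:
        raise SizeQueryError("403 on size query")


def open_buggy(has_creds, is_public):
    # Bug: only the size query is guarded, so the constructor's
    # missing-credentials exception escapes before any fallback.
    endpoint = make_private(has_creds)  # may raise CredentialError, unhandled
    try:
        query_size(endpoint, is_public)
    except SizeQueryError:
        endpoint = "public"
    return endpoint


def open_fixed(has_creds, is_public):
    # Fix: both the construction and the size query are guarded,
    # so either failure falls through to the public-S3 path.
    try:
        endpoint = make_private(has_creds)
        query_size(endpoint, is_public)
    except (CredentialError, SizeQueryError):
        endpoint = "public"
    return endpoint


print(open_fixed(False, True))  # public: no credentials, falls back cleanly
print(open_fixed(True, False))  # private: credentials present, size query succeeds
# open_buggy(False, True) would raise CredentialError before any fallback runs.
```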