[C++] Use custom url for s3 using AWS_ENDPOINT_URL #36770

Closed
Kotwic4 opened this issue Jul 19, 2023 · 8 comments · Fixed by #36791

Comments

@Kotwic4

Kotwic4 commented Jul 19, 2023

Describe the enhancement requested

AWS now supports the AWS_ENDPOINT_URL environment variable for pointing clients at a custom URL (for example, localhost).
More information is available in the docs and the original GitHub issue. Support was merged into botocore in this PR.

What I can do with boto:

import os
import boto3

os.environ["AWS_ACCESS_KEY_ID"] = "ACCESS_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "SECRET_KEY"
os.environ["AWS_ENDPOINT_URL"] = "http://localhost:9000"

session = boto3.session.Session()
s3_client = session.client(
    service_name="s3",
)
print(s3_client.list_buckets()["Buckets"])

What I have to do in pyarrow:

import os
from pyarrow import fs

os.environ["AWS_ACCESS_KEY_ID"] = "ACCESS_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "SECRET_KEY"
os.environ["AWS_ENDPOINT_URL"] = "http://localhost:9000"

s3 = fs.S3FileSystem(endpoint_override=os.environ["AWS_ENDPOINT_URL"])
print(s3.get_file_info(fs.FileSelector("", recursive=False)))

What I would like to do in pyarrow:

import os
from pyarrow import fs

os.environ["AWS_ACCESS_KEY_ID"] = "ACCESS_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "SECRET_KEY"
os.environ["AWS_ENDPOINT_URL"] = "http://localhost:9000"

s3 = fs.S3FileSystem()
print(s3.get_file_info(fs.FileSelector("", recursive=False)))

This would let me use s3:// URIs directly instead of creating a filesystem object:

import os
import pyarrow.parquet as pq
from pyarrow import fs

# current way
s3 = fs.S3FileSystem(endpoint_override=os.environ["AWS_ENDPOINT_URL"])
file = pq.ParquetFile('mybucket/my_file.parquet', filesystem=s3)

# possible future
file = pq.ParquetFile('s3://mybucket/my_file.parquet')

Component(s)

Python

@westonpace westonpace changed the title Use custom url for s3 using AWS_ENDPOINT_URL [C++] Use custom url for s3 using AWS_ENDPOINT_URL Jul 19, 2023
@westonpace
Member

Arrow uses AWS' C++ SDK. From this table it appears that the C++ SDK does not yet support this feature:

[image: table of AWS SDK support for this feature]

Perhaps the best place to advocate for this feature would be the AWS C++ SDK repo. If it is added there, we would pick up support for it automatically once we upgrade to the latest SDK version.

That being said, it should be possible for us to provide support for this, even if the SDK does not, in case someone wanted to create a PR.

@kou
Member

kou commented Jul 19, 2023

> That being said, it should be possible for us to provide support for this, even if the SDK does not, in case someone wanted to create a PR.

This may work:

diff --git a/cpp/src/arrow/filesystem/s3fs.cc b/cpp/src/arrow/filesystem/s3fs.cc
index c57fc7f29..b0c2d973e 100644
--- a/cpp/src/arrow/filesystem/s3fs.cc
+++ b/cpp/src/arrow/filesystem/s3fs.cc
@@ -339,6 +339,7 @@ Result<S3Options> S3Options::FromUri(const Uri& uri, std::string* out_path) {
   }
 
   bool region_set = false;
+  bool endpoint_override_set = false;
   for (const auto& kv : options_map) {
     if (kv.first == "region") {
       options.region = kv.second;
@@ -347,6 +348,7 @@ Result<S3Options> S3Options::FromUri(const Uri& uri, std::string* out_path) {
       options.scheme = kv.second;
     } else if (kv.first == "endpoint_override") {
       options.endpoint_override = kv.second;
+      endpoint_override_set = true;
     } else if (kv.first == "allow_bucket_creation") {
       ARROW_ASSIGN_OR_RAISE(options.allow_bucket_creation,
                             ::arrow::internal::ParseBoolean(kv.second));
@@ -357,6 +359,12 @@ Result<S3Options> S3Options::FromUri(const Uri& uri, std::string* out_path) {
       return Status::Invalid("Unexpected query parameter in S3 URI: '", kv.first, "'");
     }
   }
+  if (!endpoint_override_set) {
+    auto endpoint = std::getenv("AWS_ENDPOINT_URL");
+    if (endpoint) {
+      options.endpoint_override = endpoint;
+    }
+  }
 
   if (!region_set && !bucket.empty() && options.endpoint_override.empty()) {
     // XXX Should we use a dedicated resolver with the given credentials?
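
The precedence this patch aims for can be sketched in Python. This is only an illustration of the lookup order, not Arrow code: `resolve_endpoint_override` is a hypothetical helper name, and the empty-string "no override" result is an assumption modelled on the `options.endpoint_override.empty()` check in the diff.

```python
import os
from typing import Mapping, Optional


def resolve_endpoint_override(
    explicit: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Hypothetical helper mirroring the lookup order in the patch above."""
    # An endpoint given explicitly (e.g. in the URI query string) always wins,
    # just as endpoint_override_set short-circuits the getenv() call.
    if explicit is not None:
        return explicit
    # Otherwise fall back to AWS_ENDPOINT_URL; "" means "no override".
    return environ.get("AWS_ENDPOINT_URL", "")


if __name__ == "__main__":
    env = {"AWS_ENDPOINT_URL": "http://localhost:9000"}
    print(resolve_endpoint_override("minio:9000", env))
    print(resolve_endpoint_override(None, env))
```

Passing the environment as a parameter keeps the sketch easy to exercise without mutating `os.environ`.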

BTW, the following will work with the current implementation:

file = pq.ParquetFile(f's3://mybucket/my_file.parquet?endpoint_override={os.environ["AWS_ENDPOINT_URL"]}')
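
If the endpoint has to be passed through the URI, a small helper can build the query string. The sketch below is an assumption-laden illustration: `s3_uri_with_endpoint` is a hypothetical name, and it splits a full URL into the separate `scheme` and `endpoint_override` query parameters that the diff above shows `S3Options::FromUri` accepting. Whether a full URL including the scheme is also accepted in `endpoint_override` is not confirmed in this thread.

```python
from urllib.parse import urlencode, urlsplit


def s3_uri_with_endpoint(path: str, endpoint_url: str) -> str:
    """Hypothetical helper: append endpoint query parameters to an s3:// URI."""
    if "://" in endpoint_url:
        # Full URL such as "http://localhost:9000": pass scheme and host:port
        # as two separate query parameters.
        parts = urlsplit(endpoint_url)
        query = {"scheme": parts.scheme, "endpoint_override": parts.netloc}
    else:
        # Bare "host:port" value: pass it through unchanged.
        query = {"endpoint_override": endpoint_url}
    # Keep ':' unescaped so host:port stays readable in the URI.
    return f"s3://{path}?{urlencode(query, safe=':')}"


if __name__ == "__main__":
    print(s3_uri_with_endpoint("mybucket/my_file.parquet", "http://localhost:9000"))
```

The resulting string could then be handed to `pq.ParquetFile` in place of the f-string above.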

@Kotwic4
Author

Kotwic4 commented Jul 20, 2023

I think it should just be in the C++ SDK then. I raised an issue there.

@kou kou assigned kou and unassigned kou Jul 21, 2023
kou pushed a commit that referenced this issue Jul 21, 2023
… AWS_ENDPOINT_URL (#36791)

### Rationale for this change
We need a way to read from custom object storage (such as a MinIO host or other S3-like storage)
using the environment variable `AWS_ENDPOINT_URL`.

### What changes are included in this PR?
Set the `endpoint_override` option according to the environment variable.

### Are these changes tested?
Yes, with a unit test and by testing on pyarrow.

### Are there any user-facing changes?
No

* Closes: #36770

Authored-by: yiwei.wang <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 14.0.0 milestone Jul 21, 2023
@kou
Member

kou commented Jul 21, 2023

@adbmal Could you add a comment containing only "take" here to assign this issue to yourself?

@adbmal
Contributor

adbmal commented Jul 21, 2023

take

@amoeba
Member

amoeba commented Jul 21, 2023

I think it would make sense to document this in https://arrow.apache.org/docs/cpp/env_vars.html. Happy to file an issue and submit a patch for that.

@kou
Member

kou commented Jul 22, 2023

It's a good idea! Please do it!

@adbmal
Contributor

adbmal commented Jul 23, 2023

> I think it would make sense to document this in https://arrow.apache.org/docs/cpp/env_vars.html. Happy to file an issue and submit a patch for that.

@kou I think it is a minor change, so there is no need to file an issue. Here is the PR; please review:
MINOR: [Docs] update document for AWS_ENDPOINT_URL environment variable #36826

R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…riable AWS_ENDPOINT_URL (apache#36791)

loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…riable AWS_ENDPOINT_URL (apache#36791)
