S3 Support #91
Conversation
Thanks for starting work on this!
## S3 Key-Value Store

This describes the implementation of an S3 Key-Value Store.

### S3 Builder Class

As suggested in #91 (review), an S3 Builder Class should be provided:

```cpp
class S3RequestBuilder {
public:
S3RequestBuilder(std::string_view method);
// Replicate HttpRequestBuilder methods here
// ...
S3RequestBuilder & AddAwsRegion(std::string_view aws_region);
S3RequestBuilder & AddAwsAccessKey(std::string_view aws_access_key_id);
S3RequestBuilder & AddAwsSecretKey(std::string_view aws_secret_key);
// Adds a X-Amz-Security-Token to Headers
  S3RequestBuilder & AddAwsSessionToken(std::string_view aws_session_token);
S3RequestBuilder & AddEndpointUrl(std::string_view endpoint_url);
HttpRequest BuildRequest();
private:
// Based on https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-header-based-auth.html
// derives
// 1. Canonical Request String
// 2. Signing String
// 3. Signature
// and returns the AuthorizationHeader
std::string AuthorizationHeader();
};
```
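For reference, a minimal Python sketch of those three steps (parameter names and the header-handling details are illustrative assumptions, not the final C++ API):

```python
# Sketch of the SigV4 derivation behind AuthorizationHeader(); assumes the
# payload hash and headers (including x-amz-date) are supplied by the caller.
import hashlib
import hmac
from datetime import datetime, timezone

def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def authorization_header(method, canonical_uri, canonical_query, headers,
                         payload_sha256, access_key, secret_key, region):
    now = datetime.now(timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    lower = sorted((k.lower(), v.strip()) for k, v in headers.items())
    signed_headers = ";".join(k for k, _ in lower)
    canonical_headers = "".join(f"{k}:{v}\n" for k, v in lower)
    # 1. Canonical Request String
    canonical_request = "\n".join([method, canonical_uri, canonical_query,
                                   canonical_headers, signed_headers,
                                   payload_sha256])
    # 2. Signing String, scoped to date/region/service
    scope = f"{date_stamp}/{region}/s3/aws4_request"
    string_to_sign = "\n".join(
        ["AWS4-HMAC-SHA256", amz_date, scope,
         hashlib.sha256(canonical_request.encode("utf-8")).hexdigest()])
    # 3. Signature, from the chained HMAC signing key
    key = _sign(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, "s3", "aws4_request"):
        key = _sign(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()
    return (f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}")
```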
### Obtaining configuration variables

AWS configuration data can be obtained from multiple locations, primarily:

- environment variables,
- the shared credentials file (by default `~/.aws/credentials`),
- the config file (by default `~/.aws/config`).

Environment variables take precedence over configuration files.

Python pseudo-code for deriving AWS variables:

```python
import os
from configparser import ConfigParser

aws_profile = os.environ.get("AWS_PROFILE", "default")
aws_credentials_file = os.environ.get("AWS_SHARED_CREDENTIALS_FILE", "~/.aws/credentials")
aws_config_file = os.environ.get("AWS_CONFIG_FILE", "~/.aws/config")
# Read credentials file
credentials = ConfigParser()
credentials.read(os.path.expanduser(aws_credentials_file))
aws_access_key = credentials.get(aws_profile, "aws_access_key_id")
aws_secret_key = credentials.get(aws_profile, "aws_secret_access_key")
aws_session_token = credentials.get(aws_profile, "aws_session_token")
config = ConfigParser()
config.read(os.path.expanduser(aws_config_file))
aws_region = config.get(aws_profile, "aws_region")
aws_region = os.environ.get("AWS_REGION", aws_region) # AWS_DEFAULT_REGION?
aws_access_key = os.environ.get("AWS_ACCESS_KEY_ID", aws_access_key)
aws_secret_key = os.environ.get("AWS_SECRET_ACCESS_KEY", aws_secret_key)
aws_session_token = os.environ.get("AWS_SESSION_TOKEN", aws_session_token)
```

### Constructing S3 Endpoint URLs

As per the AWS S3 REST API, buckets can be addressed with one of the following patterns:

- Virtual hosted-style: `https://<bucket>.s3.<region>.amazonaws.com/<key>`
- Path-style: `https://s3.<region>.amazonaws.com/<bucket>/<key>`
The above patterns are specific to AWS, but custom endpoints are used to access other S3-compatible stores. Currently, there is no agreed-upon environment variable for overriding the endpoint URL, so something like an endpoint option in the spec may be needed.
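A minimal sketch of how the kvstore might assemble object URLs under these assumptions (the path-style fallback for custom endpoints is an assumption, not a settled decision):

```python
# Sketch: build an object URL, preferring a custom endpoint when configured,
# otherwise the AWS virtual hosted-style pattern. Illustrative only.
def object_url(bucket: str, key: str, region: str,
               endpoint_url: str | None = None) -> str:
    if endpoint_url is not None:
        # Custom endpoints (localstack, minio, ...) commonly use path-style.
        return f"{endpoint_url.rstrip('/')}/{bucket}/{key}"
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"
```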
### S3 KVStore

TBD.

### TODO

### Questions
Looks good.
Looks good for Amazon S3 endpoints. For other endpoints we may want to rely on a ContextResource to specify where to find the credentials, e.g. name of environment variables to read or path to configuration file. Another important source for credentials is the EC2 metadata server.
I think we want to support the following:
One challenge with supporting multiple endpoints is that each endpoint requires its own credentials.
Probably it would make sense to keep at least most of the logic for generating URLs in the kvstore driver, rather than in the request builder.
If there are some common elements that can be factored out, that would probably be good. However, note that the GCS kvstore driver uses the GCS "JSON" API rather than the S3-compatible API, so it won't be that similar. The largest difference will be in supporting

As far as an INI library --- you might have to be aware that there are variations in the format.
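One variation worth noting (this example is based on the AWS CLI's documented behaviour, not on anything settled in this thread): `~/.aws/config` prefixes non-default profile sections with `profile`, while `~/.aws/credentials` does not:

```python
# Sketch: resolve a profile's section name, assuming the AWS CLI convention
# that ~/.aws/config uses "[profile myprofile]" sections while
# ~/.aws/credentials uses plain "[myprofile]" sections.
from configparser import ConfigParser

def read_profile_value(path, profile, key, is_config_file):
    parser = ConfigParser()
    parser.read(path)  # silently skips missing files
    section = (f"profile {profile}"
               if is_config_file and profile != "default" else profile)
    if not parser.has_section(section):
        return None
    return parser.get(section, key, fallback=None)
```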
By inspection, the following seems to work to get the bucket region for both public and private buckets:

```shell
$ curl -sI https://<bucket_name>.s3.amazonaws.com | grep bucket-region
```

This then seems to be the region to use in signature calculations, quoting from the example here:
I interpret this to mean that the signature depends on the actual bucket location and that the AWS_REGION and AWS_DEFAULT_REGION in config files/environment variables should probably be ignored.
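A minimal sketch of that lookup (assuming the header is `x-amz-bucket-region`, which is what the `grep` above matches):

```python
# Sketch: read a bucket's region from the x-amz-bucket-region header of a
# HEAD request against the global endpoint. Error handling kept minimal.
import urllib.error
import urllib.request

def bucket_region(bucket: str):
    request = urllib.request.Request(
        f"https://{bucket}.s3.amazonaws.com", method="HEAD")
    try:
        with urllib.request.urlopen(request) as response:
            return response.headers.get("x-amz-bucket-region")
    except urllib.error.HTTPError as error:
        # 403 responses for private buckets still carry the header.
        return error.headers.get("x-amz-bucket-region")
```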
Assume this refers to the following idea:

```shell
role_name=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/${role_name}
```
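For what it's worth, the second response is a JSON document; a sketch of consuming it (field names per the EC2 instance metadata documentation; IMDSv2 token handling and retries are deliberately omitted):

```python
# Sketch: fetch temporary credentials from the EC2 instance metadata server.
import json
import urllib.request

METADATA_URL = ("http://169.254.169.254/latest/meta-data/iam/"
                "security-credentials/")

def metadata_credentials(timeout: float = 0.2) -> dict:
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as response:
        role_name = response.read().decode("utf-8").strip()
    with urllib.request.urlopen(METADATA_URL + role_name,
                                timeout=timeout) as response:
        doc = json.loads(response.read())
    return {"aws_access_key_id": doc["AccessKeyId"],
            "aws_secret_access_key": doc["SecretAccessKey"],
            "aws_session_token": doc["Token"],
            "expiration": doc["Expiration"]}
```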
I assume the precedence from highest to lowest would be:

Would, for example, a 200 ms connect timeout on the metadata server connection be sufficient for a decent user experience w.r.t. timeouts outside AWS?
True, and I suspect most software just assumes a single endpoint. One way I've dealt with this in the past (in combination with fsspec) is to have the following sort of configuration in YAML:

```yaml
"s3://bucket-one":
  aws_access_key_id: ...
  aws_secret_access_key: ...
"s3+https://bucket-two":
  aws_access_key_id: ...
  aws_secret_access_key: ...
```

and then apply the following pseudocode when handling access keys:

```python
s3url = "s3://bucket-two/path/to/data.bin"
aws_access_key_id = "ABCDEF..."
for url_prefix, config in yaml.load("s3config.yaml").items():
if s3url.startswith(url_prefix):
endpoint_url = config["endpoint_url"]
aws_access_key = config["aws_access_key_id"]
break It's a bit messy when combined with {"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "profile": "myprofile"}
{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "aws_access_key_id": "...", "aws_secret_access_key": "...", "aws_session_token": "..."}
Yes, INI is simple enough for regex. Let's hold off on the XML library decision until the XML complexity is clearer.

EDIT: Suggest placing keys for specific endpoints in the JSON spec.
I think the order should be:
You might check the AWS SDK to see what it does -- I expect it uses something like that order. That is also how we handle Google Cloud Storage credentials.

As far as the timeout, you might check to see what the AWS SDK does by default. In most cases I expect the connection will be refused immediately if not on EC2, and the timeout doesn't matter. Essentially the only time the timeout actually comes into play is if you have no credentials and wish to use anonymous access, and you have a network configuration such that requests to the EC2 metadata server are silently dropped rather than refused immediately.
Yes I believe that is correct.
Yes, looks right. I think there is one extra complication: buckets created in us-east-1 prior to Mar 1, 2018 could have names that are not DNS-compatible. For those, we cannot use the method you described to determine the region, since they cannot be used with virtual hosted-style requests. However, we can instead detect that the bucket name does not conform to the new rules, and infer that it must be a us-east-1 bucket.
Yes, that would probably make sense. In any case it probably makes sense to first implement a basic version before worrying about all of the possible extra options.
For other drivers we've avoided setting any secrets in the spec, as they become easier to leak that way. It does make sense to allow a region setting in the verbose spec, as well as profile-related directives, IMO:

```json
{"driver": "s3", "bucket": "my-bucket", "endpoint": "https://my-server", "path": "my-path", "region": "us-east-1"}
```
Some progress on:
A review might not be worth it at this point, but one thing that may be controversial is moving the following:

Other than that, I broadly plan to implement in the following order:
These might be logical points at which reviews may be worthwhile.
Moving these is fine.
I'd suggest that you just implement the minimum here initially, e.g. getting the access key from the env var, and then work on the S3 kvstore, before worrying about more complicated methods of getting the credentials. That way you will have a workable driver sooner.
Any ETA for this PR to land? This is a very exciting development. Really looking forward to trying it out when it lands.
Hey @sjperkins, could you share a snippet of the Python API?
Hi @tchaton. I'm not developing this full time, so I can't estimate an ETA here.
I haven't touched any Python code yet, but I'd imagine the Python interface would be very similar to the GCS key-value driver https://google.github.io/tensorstore/kvstore/gcs/index.html |
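Purely speculative, but assuming the driver plugs into the existing `tensorstore.KvStore` interface the way the GCS driver does, usage might look roughly like this (the spec keys are assumptions, not a final schema):

```python
# Hypothetical usage of an "s3" kvstore driver via the tensorstore Python API.
import tensorstore as ts

store = ts.KvStore.open({
    "driver": "s3",
    "bucket": "my-bucket",
    "path": "path/within/bucket/",
}).result()

store.write("data.bin", b"hello").result()
print(store.read("data.bin").result().value)
```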
A localstack fixture has been added.
```cpp
ABSL_FLAG(std::string, localstack_binary, "",
          "Path to the localstack");
```
@laramiel I adapted this from the gcs test bench, but didn't have enough time to understand and investigate how bazel linked (and installed?) the test bench utility. Might this be possible with localstack?
Yes, see here:
https://github.com/google/tensorstore/tree/master/third_party/pypa
Add to `test_requirements.txt`, then run `generate_workspace.py`.
I'd probably name this `s3_localstack_test`, then add `tags = ["manual"]` to the BUILD file so it's not triggered on every `bazel test ...` invocation.
... also, split the tests which don't require a running localstack into a separate test.
```cpp
void SetUp() override {
  for (auto& pair : saved_vars) {
```
Saving the env vars shouldn't be necessary, since they are process scoped and should not leak past test execution.
```cpp
static constexpr char kAwsRegion[] = "af-south-1";
static constexpr char kUriScheme[] = "s3";
static constexpr char kDriver[] = "s3";
static constexpr char kLocalStackEndpoint[] = "http://localhost:4566";
```
The endpoint should be a flag. Otherwise, if localstack were spawned from the test itself, then ideally there would be a `--port` argument that you could pass, and use `PickUnusedPortOrDie()` from `tensorstore/internal/http/transport_test_utils.h`.
```cpp
namespace {

/// Exemplar ListObjects v2 Response
/// https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html#API_ListObjectsV2_ResponseSyntax
```
Is it possible to get the response in JSON format? If not, maybe we should use a small XML parser such as ... (looking around) ... tinyxml2? expat?
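Whichever library is chosen, the parse itself is shallow; a Python sketch of the shape of it, just to show what the C++ code will have to walk (not a proposal for the implementation):

```python
# Sketch: extract keys and sizes from a ListObjectsV2 XML response.
# The namespace URI is the one used in the documented response syntax.
import xml.etree.ElementTree as ET

NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

def parse_list_objects_v2(xml_text: str):
    root = ET.fromstring(xml_text)  # <ListBucketResult>
    for contents in root.findall("s3:Contents", NS):
        yield (contents.findtext("s3:Key", namespaces=NS),
               int(contents.findtext("s3:Size", default="0", namespaces=NS)))
```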
```cpp
if (absl::EndsWith(bucket, "--ol-s3")) return BucketNameType::Invalid;

// Bucket name is a series of labels split by .
// https://web.archive.org/web/20170121163958/http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
```
Once we're using regexes, just write the full regex. Something like:

```cpp
static LazyRE2 kValidBucket = {
    "^(([a-z0-9][a-z0-9-]*[a-z0-9])|([a-z0-9]))"
    "(\\.(([a-z0-9][a-z0-9-]*[a-z0-9])|([a-z0-9])))*$"};

static LazyRE2 kOldValidBucket = {
    "^(([a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9])|([a-zA-Z0-9]))"
    "(\\.(([a-zA-Z0-9][a-zA-Z0-9_-]*[a-zA-Z0-9])|([a-zA-Z0-9])))*$"};

if (!RE2::FullMatch(bucket, *kOldValidBucket)) {
  return BucketNameType::kInvalid;
}
if (bucket.size() <= 63 && RE2::FullMatch(bucket, *kValidBucket)) {
  return BucketNameType::kStandard;
}
return BucketNameType::kOldUsEast;
```
I'm preparing to submit most of this in the next day or so, with some minor changes.
Includes changes for includes, style, and some fixes from the following pull request: #91

This is still a work in progress, so it is not yet available by default. To use, add a bazel dependency on //tensorstore/kvstore/s3. #46

PiperOrigin-RevId: 562068999
Change-Id: Id660a72224ef865c858484c04985f9fc4f0e3bf5
I think this makes sense -- the PR contains core S3 functionality, and further additions can be split into smaller PRs, which will be easier to review.
Closes #46

TODO:

- General KVstore
- Testing
  - TestKeyValueStoreBasicFunctionality passes
- Authorisation
  - ~/.aws/credentials
- Documentation
- Polishing

Python AWS4 Calculation: https://gist.github.com/sjperkins/15d3d196387cff8b6dc0dcca7f756fe8