3 changes: 3 additions & 0 deletions src/httpfs_extension.cpp
@@ -128,6 +128,9 @@ static void LoadInternal(ExtensionLoader &loader) {
};
config.AddExtensionOption("httpfs_client_implementation", "Select which is the HTTPUtil implementation to be used",
LogicalType::VARCHAR, "default", callback_httpfs_client_implementation);
config.AddExtensionOption("auto_fetch_secret_info_from_env",
"Automatically fetch AWS credentials from environment variables.", LogicalType::BOOLEAN,
Value::BOOLEAN(true));

if (config.http_util && config.http_util->GetName() == "WasmHTTPUtils") {
// Already handled, do not override
5 changes: 5 additions & 0 deletions src/include/s3fs.hpp
@@ -34,6 +34,11 @@ struct S3AuthParams {
string oauth2_bearer_token; // OAuth2 bearer token for GCS

static S3AuthParams ReadFrom(optional_ptr<FileOpener> opener, FileOpenerInfo &info);
//! Helper for creating secrets that should/should not inherit environment variable settings
static SettingLookupResult SetSecretOption(KeyValueSecretReader &secret_reader, string secret_option,
string setting_name, string &result);
static SettingLookupResult SetSecretOption(KeyValueSecretReader &secret_reader, string secret_option,
string setting_name, bool &result);
};

struct AWSEnvironmentCredentialsProvider {
50 changes: 37 additions & 13 deletions src/s3fs.cpp
@@ -183,6 +183,32 @@ S3AuthParams AWSEnvironmentCredentialsProvider::CreateParams() {
return params;
}

SettingLookupResult S3AuthParams::SetSecretOption(KeyValueSecretReader &secret_reader, string secret_option,
string setting_name, string &result) {
Value use_env_vars_for_secret_info_setting;
secret_reader.TryGetSecretKeyOrSetting("auto_fetch_secret_info_from_env", "auto_fetch_secret_info_from_env",
use_env_vars_for_secret_info_setting);
auto use_env_vars_for_secrets = use_env_vars_for_secret_info_setting.GetValue<bool>();

auto option_scope = secret_reader.TryGetSecretKeyOrSetting(secret_option, setting_name, result);
// if option scope is global, that means it was set in the environment
if (!result.empty() && option_scope.GetScope() == SettingScope::GLOBAL && !use_env_vars_for_secrets) {
result = "";
}
return option_scope;
}

SettingLookupResult S3AuthParams::SetSecretOption(KeyValueSecretReader &secret_reader, string secret_option,
Collaborator

I think these functions are a little hard to understand right now. We should make the order in which settings are checked clearer.

Perhaps this can be rewritten to match the API that KeyValueSecretReader uses to read settings in a cascading way? I would propose creating a new class, CustomKeyValueReader, which wraps the base KeyValueSecretReader and injects the env variables.

// Custom KeyValue reader that will look for settings in a cascading way:
//     1. secret
//     2. setting
//     3. env variable
class CustomKeyValueReader {
public:
	CustomKeyValueReader(FileOpener &opener_p, optional_ptr<FileOpenerInfo> info, const char **secret_types,
	                     idx_t secret_types_len);

	SettingLookupResult TryGetSecretKeyOrSettingOrEnv(const string &secret_key, const string &setting_name,
	                                                  const string &env_var_name, Value &result);

protected:
	KeyValueSecretReader base_reader;
};
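
To make the proposed lookup order concrete, here is a minimal, self-contained sketch of a secret, then setting, then env cascade. It is only an illustration under stated assumptions: plain `std::map` stand-ins replace DuckDB's secret manager, settings and process environment, and the names `CascadingReader`, `LookupSource` and `TryGet` are hypothetical and not part of httpfs.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for "where was the value found"; not DuckDB's SettingScope.
enum class LookupSource { SECRET, SETTING, ENV, NOT_FOUND };

using KeyValueMap = std::map<std::string, std::string>;

// Hypothetical reader: plain maps replace the secret, the settings and the environment.
struct CascadingReader {
	KeyValueMap secret;
	KeyValueMap settings;
	KeyValueMap env;

	// Checks secret, then setting, then env variable; the first hit wins.
	LookupSource TryGet(const std::string &secret_key, const std::string &setting_name,
	                    const std::string &env_var_name, std::string &result) const {
		if (Find(secret, secret_key, result)) {
			return LookupSource::SECRET;
		}
		if (Find(settings, setting_name, result)) {
			return LookupSource::SETTING;
		}
		if (Find(env, env_var_name, result)) {
			return LookupSource::ENV;
		}
		return LookupSource::NOT_FOUND;
	}

private:
	// Copies the value into `result` only when `key` is present.
	static bool Find(const KeyValueMap &map, const std::string &key, std::string &result) {
		auto it = map.find(key);
		if (it == map.end()) {
			return false;
		}
		result = it->second;
		return true;
	}
};

// Usage: with no secret and no explicit setting, the env value is what gets returned.
void ExampleUsage() {
	CascadingReader reader;
	reader.env["AWS_REGION"] = "eu-west-1";
	std::string region;
	LookupSource source = reader.TryGet("region", "s3_region", "AWS_REGION", region);
	assert(source == LookupSource::ENV);
	assert(region == "eu-west-1");
}
```

Returning the source alongside the value is what would let a caller decide afterwards whether an env-derived value is acceptable.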

Contributor Author

This PR does not look up a separate env_var_name for secret variables. We can still add the option, but I believe it would make secret initialization more confusing and add unnecessary logic to the PR. This PR only checks a setting that decides whether env vars (or GLOBAL-scope settings) should be used when creating secrets.

The environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are read in AWSEnvironmentCredentialsProvider::SetExtensionOptionValue on extension load and then placed in db_config.options.set_options. We cannot disable loading them, because that code runs only once, on extension load, before the auto_fetch_secret_info_from_env setting can be set or unset.

Also, I wonder whether wrapping the KeyValueSecretReader is necessary. Happy to do it for the bug fix PR because it touches less code. I think a cleaner idea for v1.6 would be to add a max_scope parameter that defaults to SettingScope::GLOBAL. For secret initialization you can pass max_scope = SettingScope::LOCAL or SettingScope::SECRET, depending on how strict secret initialization should be, and for other settings you can leave it as the default. If more cases like this come along, it will be easier to control whether the global or local setting is used.
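
As a rough illustration of the max_scope idea, the sketch below caps how wide a lookup may go. Everything in it is an assumption made for the example: `Scope` and `ScopedStore` are hypothetical stand-ins (not DuckDB's `SettingScope` or `KeyValueSecretReader`), and a single key is used for both the secret key and the setting name to keep it short.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical scope enum, ordered from narrowest to widest.
enum class Scope { SECRET = 0, LOCAL = 1, GLOBAL = 2, NOT_FOUND = 3 };

using KeyValueMap = std::map<std::string, std::string>;

// Hypothetical store with one map per scope.
struct ScopedStore {
	KeyValueMap secret_values;
	KeyValueMap local_values;
	KeyValueMap global_values; // would hold values loaded from env vars at extension load

	// Returns the scope where `key` was found, ignoring any scope wider than `max_scope`.
	Scope TryGet(const std::string &key, std::string &result, Scope max_scope = Scope::GLOBAL) const {
		assert(max_scope != Scope::NOT_FOUND);
		const KeyValueMap *layers[] = {&secret_values, &local_values, &global_values};
		for (int i = 0; i <= static_cast<int>(max_scope); i++) {
			auto it = layers[i]->find(key);
			if (it != layers[i]->end()) {
				result = it->second;
				return static_cast<Scope>(i);
			}
		}
		return Scope::NOT_FOUND;
	}
};

// Usage: a strict lookup (max_scope = LOCAL) never sees the env-derived global value.
void ExampleUsage() {
	ScopedStore store;
	store.global_values["s3_access_key_id"] = "stale-env-key";
	std::string key_id;
	assert(store.TryGet("s3_access_key_id", key_id, Scope::LOCAL) == Scope::NOT_FOUND);
	assert(key_id.empty());
}
```

With the default argument, existing callers would keep today's behaviour, and only secret initialization would opt into the stricter cap.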

Collaborator

huh maybe i'm misunderstanding things here, but i'm confused now. The issue states that:

DuckDB automatically uses the environment AWS credentials even when no S3 secret has been created.

I feel like this does not align with the statement you make in this PR:

DuckDB silently uses env variables for secrets

Are we sure this PR is solving the right problem?

Contributor Author

DuckDB automatically uses the environment AWS credentials even when no S3 secret has been created.

-> environment variables like AWS_REGION, AWS_ACCESS_KEY_ID etc.

DuckDB silently uses env variables for secrets.

-> Environment variables like AWS_REGION, AWS_ACCESS_KEY_ID etc. These variables are used when no secret has been created or none matches the scope of the requested URL.

I think the confusion comes from when the environment variables are loaded versus when they get used. I can expand:

On httpfs extension load, the environment variables AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY etc. are loaded into the httpfs options s3_region, s3_access_key_id, s3_secret_access_key. See here for the whole list of loaded environment variables. Here is where they are actually loaded and stored in the extension configuration options.

Environment variables are never checked or read after extension load. After extension load, only the extension configuration options are used.

After httpfs is loaded, when a user attempts to read a remote url and the S3KeyValueReader cannot find a secret that matches scope etc., the extension config options "s3_access_key_id", "s3_secret_access_key" etc. are used. If these settings are not set by the user after the httpfs extension loads, the environment variable values that were loaded into the extension options are used.

This is how DuckDB ends up "silently" using env variables for secrets, or automatically uses AWS environment credentials when no secret has explicitly been created.
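
The load-once behaviour described above can be modelled in a few lines. This is a simplified sketch, not the extension's code: `LoadExtensionOptions` and `ResolveOption` are hypothetical helpers, the environment is passed in as a map instead of being read with `getenv`, and only the three variables named above are mapped.

```cpp
#include <cassert>
#include <map>
#include <string>

using KeyValueMap = std::map<std::string, std::string>;

// Hypothetical model of extension load: env values are copied into options exactly once.
KeyValueMap LoadExtensionOptions(const KeyValueMap &env) {
	// Env variable name -> httpfs option name.
	const KeyValueMap env_to_option = {{"AWS_REGION", "s3_region"},
	                                   {"AWS_ACCESS_KEY_ID", "s3_access_key_id"},
	                                   {"AWS_SECRET_ACCESS_KEY", "s3_secret_access_key"}};
	KeyValueMap options;
	for (const auto &entry : env_to_option) {
		auto it = env.find(entry.first);
		if (it != env.end()) {
			options[entry.second] = it->second;
		}
	}
	return options;
}

// Hypothetical read-time lookup: a matching secret wins, otherwise the option is used.
// The environment itself is never consulted here.
std::string ResolveOption(const KeyValueMap &secret, const std::string &secret_key,
                          const KeyValueMap &options, const std::string &option_name) {
	auto secret_it = secret.find(secret_key);
	if (secret_it != secret.end()) {
		return secret_it->second;
	}
	auto option_it = options.find(option_name);
	return option_it == options.end() ? std::string() : option_it->second;
}

// Usage: the value captured at load time is still used after the environment changes.
void ExampleUsage() {
	KeyValueMap env = {{"AWS_ACCESS_KEY_ID", "invalid-key"}};
	KeyValueMap options = LoadExtensionOptions(env);
	env.clear(); // changing the environment after load has no effect
	KeyValueMap no_secret;
	assert(ResolveOption(no_secret, "key_id", options, "s3_access_key_id") == "invalid-key");
}
```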

The issue our client is experiencing stems from this:

  • The env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set, but they are not valid (how they get into this state I don't know).
  • They initiate a read from a public bucket.
    -> Here the invalid AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY values are used when creating the auth params.
  • A 403 (Forbidden) is returned.

This PR allows a user to set the auto_fetch_secret_info_from_env option to false, so that when secret options are initialized in ReadFrom(S3KeyValueReader &secret_reader, const std::string &file_path), string settings returned with SettingScope::GLOBAL are not used. GLOBAL scope is returned for default values and for env values. Default values for all secret config options are empty, and I don't see that ever changing, so I think this approach is relatively safe. Otherwise we would still have to check for global scope and also check the current env variable value.
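
The filtering rule can be reduced to a small standalone function. The sketch below is a simplified model of what the string overload of SetSecretOption does, with hypothetical stand-ins (`Scope`, `LookupResult`, `ApplyEnvPolicy`) in place of DuckDB's `SettingScope` and `SettingLookupResult`:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for DuckDB's SettingScope and SettingLookupResult.
enum class Scope { SECRET, LOCAL, GLOBAL, NOT_FOUND };

struct LookupResult {
	Scope scope;
	std::string value;
};

// Drops a value that came from GLOBAL scope (a default or an env-derived option)
// unless env-derived values are allowed. Values from any other scope always pass.
std::string ApplyEnvPolicy(const LookupResult &lookup, bool auto_fetch_secret_info_from_env) {
	if (lookup.scope == Scope::GLOBAL && !auto_fetch_secret_info_from_env) {
		return std::string();
	}
	return lookup.value;
}

// Usage: an invalid env-derived key is ignored once the option is switched off.
void ExampleUsage() {
	LookupResult from_env = {Scope::GLOBAL, "invalid-key"};
	assert(ApplyEnvPolicy(from_env, true) == "invalid-key");
	assert(ApplyEnvPolicy(from_env, false).empty());
}
```

Because defaults are empty strings, clearing a GLOBAL-scoped default is a no-op in this model, which is why the rule only has an effect on env-derived values.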

string setting_name, bool &result) {
Value use_env_vars_for_secret_info_setting;
secret_reader.TryGetSecretKeyOrSetting("auto_fetch_secret_info_from_env", "auto_fetch_secret_info_from_env",
use_env_vars_for_secret_info_setting);
auto use_env_vars_for_secrets = use_env_vars_for_secret_info_setting.GetValue<bool>();

auto option_scope = secret_reader.TryGetSecretKeyOrSetting(secret_option, setting_name, result);
return option_scope;
}

S3AuthParams S3AuthParams::ReadFrom(optional_ptr<FileOpener> opener, FileOpenerInfo &info) {
auto result = S3AuthParams();

@@ -195,20 +221,18 @@ S3AuthParams S3AuthParams::ReadFrom(optional_ptr<FileOpener> opener, FileOpenerI
 KeyValueSecretReader secret_reader(*opener, info, secret_types, 3);
 
 // These settings we just set or leave to their S3AuthParams default value
-secret_reader.TryGetSecretKeyOrSetting("region", "s3_region", result.region);
-secret_reader.TryGetSecretKeyOrSetting("key_id", "s3_access_key_id", result.access_key_id);
-secret_reader.TryGetSecretKeyOrSetting("secret", "s3_secret_access_key", result.secret_access_key);
-secret_reader.TryGetSecretKeyOrSetting("session_token", "s3_session_token", result.session_token);
-secret_reader.TryGetSecretKeyOrSetting("region", "s3_region", result.region);
-secret_reader.TryGetSecretKeyOrSetting("use_ssl", "s3_use_ssl", result.use_ssl);
-secret_reader.TryGetSecretKeyOrSetting("kms_key_id", "s3_kms_key_id", result.kms_key_id);
-secret_reader.TryGetSecretKeyOrSetting("s3_url_compatibility_mode", "s3_url_compatibility_mode",
-result.s3_url_compatibility_mode);
-secret_reader.TryGetSecretKeyOrSetting("requester_pays", "s3_requester_pays", result.requester_pays);
-
+S3AuthParams::SetSecretOption(secret_reader, "key_id", "s3_access_key_id", result.access_key_id);
+S3AuthParams::SetSecretOption(secret_reader, "secret", "s3_secret_access_key", result.secret_access_key);
+S3AuthParams::SetSecretOption(secret_reader, "session_token", "s3_session_token", result.session_token);
+S3AuthParams::SetSecretOption(secret_reader, "region", "s3_region", result.region);
+S3AuthParams::SetSecretOption(secret_reader, "use_ssl", "s3_use_ssl", result.use_ssl);
+S3AuthParams::SetSecretOption(secret_reader, "kms_key_id", "s3_kms_key_id", result.kms_key_id);
+S3AuthParams::SetSecretOption(secret_reader, "s3_url_compatibility_mode", "s3_url_compatibility_mode",
+result.s3_url_compatibility_mode);
+S3AuthParams::SetSecretOption(secret_reader, "requester_pays", "s3_requester_pays", result.requester_pays);
 // Endpoint and url style are slightly more complex and require special handling for gcs and r2
-auto endpoint_result = secret_reader.TryGetSecretKeyOrSetting("endpoint", "s3_endpoint", result.endpoint);
-auto url_style_result = secret_reader.TryGetSecretKeyOrSetting("url_style", "s3_url_style", result.url_style);
+auto endpoint_result = SetSecretOption(secret_reader, "endpoint", "s3_endpoint", result.endpoint);
+auto url_style_result = SetSecretOption(secret_reader, "url_style", "s3_url_style", result.url_style);
 
 if (StringUtil::StartsWith(info.file_path, "gcs://") || StringUtil::StartsWith(info.file_path, "gs://")) {
 // For GCS urls we force the endpoint and vhost path style, allowing only to be overridden by secrets
36 changes: 36 additions & 0 deletions test/sql/test_read_public_bucket.test
@@ -0,0 +1,36 @@
# name: test/sql/test_read_public_bucket.test
# description: test aws extension with different chain configs
# group: [sql]

require parquet

require httpfs

require-env AWS_ACCESS_KEY_ID

require-env AWS_SECRET_ACCESS_KEY

# override the default behaviour of skipping HTTP errors and connection failures: this test fails on connection issues
set ignore_error_messages

statement ok
set s3_region='us-east-2';

# set endpoint to the correct default, otherwise it will pick up the env variable
statement ok
set s3_endpoint='s3.amazonaws.com';

# see duckdb-internal/issues/6620
# env vars for access_key_id and secret_key_id are used
# which results in 403
statement error
SELECT * FROM read_parquet('s3://coiled-datasets/timeseries/20-years/parquet/part.0.parquet') LIMIT 5;
----
<REGEX>:.*HTTP Error:.*403.*Authentication Failure.*

# default to not using globally scoped settings for secrets
statement ok
set auto_fetch_secret_info_from_env=false;

statement ok
SELECT * FROM read_parquet('s3://coiled-datasets/timeseries/20-years/parquet/part.0.parquet') LIMIT 5;