cloud to cloud file transfers

Index

  1. S3 to S3
  2. GCP to S3

S3 to S3

Here I describe how to transfer files from one S3 bucket to another, where each bucket is in a different AWS account. Although I will discuss this in the context of submitting files to the S3 buckets that belong to the ENCODE DCC team, the approach generalizes to other use cases.

The goal is to get files from your S3 bucket to the DCC's S3 bucket. When a new file object is created on the ENCODE Portal (which initially just contains the metadata and not the file itself), the resulting JSON object that is created contains AWS credentials for an AWS federated user. Think of an AWS federated user as a normal AWS IAM user that is only in existence for a specified amount of time - anywhere from 15 minutes to 36 hours, where the default is 12 hours.

Your bucket - the source bucket - will need to have a bucket policy that grants the DCC AWS account (the destination account) the GetObject privilege on your bucket objects. The bucket policy will need to know the 12-digit AWS account ID of the destination account, which for the ENCODE DCC is 618537831167.

The bucket policy to add to the source bucket looks like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DelegateS3Access",
            "Effect": "Allow",
            "Principal": {"AWS": "618537831167"},
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::sourcebucket/*",
                "arn:aws:s3:::sourcebucket"
            ]
        }
    ]
}

Replace the example bucket name (sourcebucket) with your bucket's name. You don't strictly need to include the "s3:ListBucket" action, but I've included it here to make it easy to test that a user from the destination account can access your bucket with a command such as aws s3 ls s3://sourcebucket. Once you have that confirmation, you can remove the action if you don't need it.
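If you prefer to verify from Python instead of the CLI, a minimal boto3 check along the same lines could look like the sketch below (run it with credentials for a user in the destination account; sourcebucket is the placeholder bucket name from above).

import boto3

# Run with credentials for an IAM user in the destination (DCC) account.
s3 = boto3.client('s3')

# The s3:ListBucket permission lets us enumerate a few keys in the source bucket.
resp = s3.list_objects_v2(Bucket="sourcebucket", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])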

You may be wondering, why give access to the DCC account rather than to a particular user? More restricted access is generally recommended when it gets the job done. However, since we are working with federated users that are created on the fly by the DCC, it's not straightforward to grant access to new individual users in your bucket policy each time a file needs to be transferred from your bucket. That would require frequent updates to the policy, and soon enough it would grow quite large and contain permissions for lots of expired federated users. The simplest approach is to give the DCC account access to your bucket, and then leave it up to the DCC account admins to delegate that authority to the federated users they create dynamically as needed.

Before a regular, non-admin IAM user of the DCC account can access your bucket, however, an admin IAM user of that account must delegate these new powers to one or more of the non-admin IAM users - specifically, the IAM user(s) responsible for creating the federated users. Those IAM users can then delegate the same authority they received to the federated users they create.

So, how do you delegate permissions from the account to an IAM user and then to a federated user? First, a DCC admin IAM user needs to attach a policy to the IAM user that ends up creating federated users. This policy should give that user permission to fetch objects from your bucket (just like you gave this permission to the DCC account at the beginning). If multiple IAM users should be delegated the same authority, it's best to put them all in the same group and attach a single group policy to it, and that's the approach demonstrated going forward. Let's pretend that this group is named fed-makers. The group policy to attach to this group, granting its members access to your bucket, would look something like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "yourbucket-access",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::yourbucket/*"
        }
    ]
}

Here's how to create that group policy:

  1. In the AWS Console, navigate to your group of interest
  2. Click on the "Create Group Policy" blue button
  3. Select "Custom Policy"
  4. Click on "Select"
  5. Copy and paste in the above policy.
  6. Give it a meaningful name (in this example, call it submitterBuckets).

This policy can be updated as needed when more submitter buckets need to be added.
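
If you prefer to script this step rather than use the console, the same inline policy can be attached with boto3's put_group_policy call. This is just a sketch: the group name fed-makers and the policy name submitterBuckets come from the example above, and yourbucket is a placeholder for the submitter's bucket.

import json
import boto3

# The inline group policy from above, granting GetObject on the submitter's bucket.
policy_doc = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::yourbucket/*"
        }
    ]
}

iam = boto3.client('iam')

# Create or update the inline policy on the fed-makers group.
iam.put_group_policy(
    GroupName="fed-makers",
    PolicyName="submitterBuckets",
    PolicyDocument=json.dumps(policy_doc),
)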

Now, when an IAM user from that group needs to make a federated user, it must also pass the same policy in the API call to grant the federated user this privilege - otherwise it won't be able to do anything. Federated users can only have a subset of the powers of the IAM user making the call.

In boto3 - the AWS SDK for Python - federated users are created when an IAM user (using its own credentials) makes a call to a method named get_federation_token, which you'll see in action momentarily. Before an IAM user can make such a call, however, the IAM user will need to be specifically granted that privilege. The destination account will need to attach another custom group policy to the fed-makers group, call it myGetFederationToken, with the following contents:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:GetFederationToken",
            "Resource": "*"
        }
    ]
}

Below is an example of creating a federated user using boto3. The IAM user (on the destination account) must use their own credentials when running the code.

>>> import json
>>> import boto3
>>> sts = boto3.client('sts') # STS is the AWS Security Token Service
>>> fed = sts.get_federation_token(Name="test-fed")
>>> print(list(fed.keys()))
['Credentials', 'FederatedUser', 'PackedPolicySize']
 
>>> print(json.dumps((fed["Credentials"]), indent=4))
{
    "AccessKeyId": "ASIAIBWNZWOEWFGNAMCA",
    "SecretAccessKey": "SvVxHATTlUuixrTBDlSh0GIZaYOzyQjD59L9i+g4",
    "SessionToken": "FQoDYXdzEEwaDHFjEFZoUUL/IVGUxSLVAZ4lkKdfYMfc9Nv6WzpmCf2y4yqI0ZK2FNelcXFdY9gfIDulR76uopRXp6kj/2I1lhMUhAuHA69metlUI/1EM01UkPZ1Aw1VryQ2ZMmlzKuBKLOxpJXMUBwp92fowB4AxvdkmV9jhOqqPL004YW9CDFbFkZ3qFM02bkfmThZz7esyhwpF3/RAYmZE2QauIntwiddK7057flApveiUQJTifiqswYEmikjQH788uaO2Jef9wziC2uMemFB8IU6ft4msm8iKIERuWiw5u9vJhzDLvFMfGnSqyitqoDZBQ==",
    "Expiration": "2018-06-13 06:47:09+00:00"
}

>>> print(json.dumps((fed["FederatedUser"]), indent=4))
{
    "FederatedUserId": "167194893449:test-fed",
    "Arn": "arn:aws:sts::167194893449:federated-user/test-fed"
}

This creates a federated user whose existence will only last 12 hours - the default time limit, since the DurationSeconds parameter wasn't specified. But since a policy wasn't sent in the request, that temporary user can't do anything useful. The IAM user needs to pass in the submitterBuckets policy in the API call. Here's the updated code:

>>> import json
>>> import boto3
>>> iam = boto3.client('iam')
>>> policy = iam.get_group_policy(GroupName="fed-makers", PolicyName="submitterBuckets")
>>> policy_doc = policy["PolicyDocument"]
>>> sts = boto3.client('sts') # STS is the AWS Security Token Service
>>> fed = sts.get_federation_token(Name="test-fed", Policy=json.dumps(policy_doc))

But this doesn't work just yet, unless the IAM user running the code has permission to perform the iam:GetGroupPolicy action on the group (the resource in this instance). To fix this, attach the following policy to the fed-makers group and name it myGetGroupPolicy, to allow all users in the group to perform this action on any group in the account:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "my-get-policy",
            "Effect": "Allow",
            "Action": "iam:GetGroupPolicy",
            "Resource": "*"
        }
    ]
}

Now the call to download the policy will work. In the DCC production workflow, however, the policy supplied in the call to get_federation_token should be extended locally (after it is downloaded) to include a statement that gives the new federated user additional access to put a file at a specific location in a DCC S3 bucket. At present, the federated user only has access to get an object from the source S3 bucket. This can simply be a matter of manipulating the policy document (a dictionary) directly after it has been downloaded, e.g. by appending the statements from another, possibly dynamically generated, policy.
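
As a rough sketch of that idea - the destination bucket name dcc-destination-bucket and the object key here are hypothetical placeholders, not the DCC's actual values:

import json
import boto3

iam = boto3.client('iam')
sts = boto3.client('sts')

# Download the group policy that grants read access to the submitter's bucket.
policy_doc = iam.get_group_policy(
    GroupName="fed-makers", PolicyName="submitterBuckets"
)["PolicyDocument"]

# Append a statement allowing the federated user to upload one specific object.
policy_doc["Statement"].append({
    "Effect": "Allow",
    "Action": "s3:PutObject",
    "Resource": "arn:aws:s3:::dcc-destination-bucket/path/to/file.fastq.gz"
})

# Create the federated user with the combined, temporary policy.
fed = sts.get_federation_token(Name="test-fed", Policy=json.dumps(policy_doc))
creds = fed["Credentials"]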

See the boto3 documentation for more details on the get_federation_token method call, and the AWS documentation regarding S3 to S3 file transfers between AWS accounts.

GCP to S3

You must have gsutil installed and configured with your credentials. If your environment is configured with credentials for both your AWS user account and your GCP user account, then you can use gsutil to copy a file from GCP to S3. The command will be of the form:

gsutil cp gs://gsbucket/myobj s3://s3bucket

However, this does not yet work with the credentials of an AWS federated user. That is because temporary credentials from AWS rely on a third configuration setting called AWS_SESSION_TOKEN, and gsutil has not been made aware of it. I created a ticket in the gsutil GitHub repository on June 11, 2018 to address this.

Tips on configuring gsutil

I recommend installing gsutil as part of the Google Cloud SDK. You can instead install it as a stand-alone tool, but the way you set up credentials is different and I find it easier with the former. If you have the stand-alone tool, set up your credentials with gsutil config. If instead you installed it via the Google Cloud SDK, use gcloud auth login, which will start an OAuth2 flow and create a .boto configuration file for you. See the gsutil documentation for more details. To see where your configuration file is stored, run the command gsutil version -l.

Note: according to the GCP documentation, the .boto configuration file for gsutil can be updated with AWS credentials in order to use gsutil to copy from GCP to S3. But a better (seemingly undocumented) way I have found is that setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables also works (imitating how things are done with the AWS CLI). You can even use your standard AWS credentials file at ~/.aws/credentials instead, and that'll work as well. Since encode_utils already sets the AWS environment variables, you don't have to worry about the AWS side of things.
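
For example, a minimal sketch of that environment-variable approach in Python (the credential values are placeholders, the bucket names are the example names from above, and this assumes a regular, non-federated AWS user):

import os
import subprocess

# Hypothetical long-lived (non-federated) AWS user credentials.
os.environ['AWS_ACCESS_KEY_ID'] = 'AKIA...EXAMPLE'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'EXAMPLE-SECRET-KEY'

# gsutil picks up the AWS credentials from the environment when copying to S3.
subprocess.run(
    ['gsutil', 'cp', 'gs://gsbucket/myobj', 's3://s3bucket'],
    check=True,
)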
