aws s3 sync downloading unchanged files. #7228

Closed
MrJoy opened this issue Aug 29, 2022 · 17 comments
Labels
p2 This is a standard priority issue s3sync s3

Comments

@MrJoy

MrJoy commented Aug 29, 2022

Describe the bug

I have a maintenance script I run to keep a local copy of billing & usage data for my personal AWS account. It identifies almost every file as changed on every run, even though most of the files haven't been modified in years.

Expected Behavior

Only changed files -- in this case, files representing the current billing period -- should be downloaded.

Current Behavior

Of 6,279 files that do not represent the current billing period, it's consistently re-downloading 5,831 of them. The files it downloads are byte-for-byte identical to the existing ones. I spot-checked one of the files, and aws s3 ls reports the exact same size and timestamp as ls does.

Reported by aws s3 sync:

download: s3://mrjoy-billing-data//cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz to ../personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz

Reported by aws s3 ls:

% aws-vault exec mrjoy -- aws s3 ls s3://mrjoy-billing-data//cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
2021-01-22 15:53:24     296522 billing_and_usage-00001.csv.gz

Reported by ls:

% ls -laD "%Y-%m-%d %H:%M:%S" ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
-rw-r--r--  1 jonathonfrisby  staff  296522 2021-01-22 15:53:24 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz

The post-fetch commit in all cases shows diffs for the files in the current billing period (as would be expected), and no changes to any of the other files that aws s3 sync reports as being downloaded.

All told, aws s3 sync appears to be downloading around 700MB of files on each run that it shouldn't be.
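
A rough way to quantify this is to count the download lines a run reports. The sketch below is a hypothetical helper (count_downloads is not part of aws-cli). It assumes each transferred object produces a line containing "download: s3://", as in the output above, and that a --dryrun run prints the same text with a "(dryrun)" prefix.

```shell
#!/bin/bash
# Hypothetical helper, not part of aws-cli.
# Reads "aws s3 sync" output on stdin and prints how many lines
# report an object being downloaded.
count_downloads() {
  # grep -c prints the match count; it exits non-zero when the count is 0,
  # so "|| true" keeps an empty result from looking like a failure.
  grep -c 'download: s3://' || true
}

# Demo on sample text shaped like the output above (bucket and paths are placeholders).
printf '%s\n' \
  'download: s3://example-bucket//cur//report-00001.csv.gz to ./cur/report-00001.csv.gz' \
  '(dryrun) download: s3://example-bucket//cur//report-00002.csv.gz to ./cur/report-00002.csv.gz' \
  'upload: ./notes.txt to s3://example-bucket/notes.txt' \
  | count_downloads
```

In practice the function would sit at the end of a pipeline such as `aws s3 sync --dryrun <source> <destination> | count_downloads`, so two consecutive runs can be compared without moving any data.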

Reproduction Steps

The relevant portion of my script is:

#!/bin/bash
IFS=$'\n\t'
set -euo pipefail

(
  cd ~/personal
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)

The data in the bucket is written by AWS itself.

Possible Solution

No response

Additional Information/Context

No response

CLI version used

2.7.26

Environment details (OS name and version, etc.)

macOS 12.5.1

@MrJoy MrJoy added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 29, 2022
@tim-finnigan tim-finnigan added s3sync s3 and removed needs-triage This issue or PR still needs to be triaged. labels Aug 30, 2022
@tim-finnigan
Contributor

Hi @MrJoy thanks for reaching out. Have you tried using the --size-only parameter documented here? This parameter makes the size of each key the only criteria used to decide whether to sync from source to destination. So it should ignore all of those files that are the same size.

@tim-finnigan tim-finnigan added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Aug 30, 2022
@MrJoy
Author

MrJoy commented Aug 31, 2022

I attempted it just now, and it did not change anything. It's still attempting to download ~everything. Note that after submitting this ticket, I updated to 2.7.27 -- so this test was done on 2.7.27 not 2.7.26.

@tim-finnigan
Contributor

Thanks for the update. There is an older issue tracking problems with S3 sync here: #599. Some users have reported anomalies when certain files sync that should not, but I wouldn't expect the problem at the scale you're describing where it's happening with hundreds of files. I don't know if I'd be able to reproduce the issue as described but could try. If you can get the debug logs by adding --debug to the command that might also give more insight into the problem. Some have said that using --size-only or --exact-timestamps has helped produce the expected results. There are other S3 sync-related feature requests like #6750 that relate to using new checksum algorithms for improving the accuracy.

@MrJoy
Author

MrJoy commented Sep 3, 2022

@tim-finnigan I'm sorry, I was unclear in my last message: When I said "I attempted it just now", I meant "I attempted to use --size-only just now". Accidental pronoun game, FTL.

If you'd like, I can temporarily give you read-only credentials for this bucket and you can see if you are able to recreate the problem from the same source. My personal AWS bill is... not data I'm terribly worried about sharing.

I'll get --debug output and add it here tomorrow. I'm about to be AFK for a while.

@MrJoy
Author

MrJoy commented Sep 4, 2022

debug.log.cleansed.zip

I've stripped tokens/signatures/key IDs from the file, but it's otherwise as produced from running:

aws-vault exec mrjoy -- aws s3 sync --debug --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/

@tim-finnigan
Contributor

Thanks @MrJoy for following up and sharing your logs. I couldn't identify any anomalies after scanning through the logs. I think attempting to recreate the issue is a good idea, but for that I recommend reaching out through AWS Support to open a private communication channel. I'd also recommend trying to use --exact-timestamps when running the sync command to see if that addresses the issue you're seeing.

@MrJoy
Author

MrJoy commented Sep 6, 2022

I went ahead and tried --exact-timestamps by itself and in combination with --size-only, and the behavior seems to be the same in all cases.

Going through AWS Support is not an option, as this is my personal account and I'm on the Basic plan.

@tim-finnigan
Contributor

Checking in on this issue again - thanks for your patience. I think this issue might actually overlap with #5730, #648 and/or #5369. Have you looked through any of those issues? Based on some of the comments it sounds like this could be due to how S3 handles timestamps.

@tim-finnigan tim-finnigan added the p2 This is a standard priority issue label Nov 16, 2022
@tim-finnigan tim-finnigan self-assigned this Nov 16, 2022
@MrJoy
Author

MrJoy commented Nov 16, 2022

Using --size-only without --exact-timestamps does not alleviate the problem. Doesn't --size-only cause aws-cli to disregard timestamps?

@tim-finnigan
Contributor

Hi again, thanks for your patience, I lost track of this issue. Per the s3 sync documentation --size-only does the following:

--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.

And --exact-timestamps does the following:

--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.
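
The two descriptions above can be read as a small decision rule. The sketch below models that reading for the S3-to-local direction only. It is an illustration of the documentation text, not aws-cli's actual code, and the function name, argument order, and mode names are made up for the example.

```shell
#!/bin/bash
# Hypothetical model of the documented S3-to-local comparison rules.
# NOT aws-cli's implementation; names and argument order are illustrative.
#
# Usage: should_download MODE S3_SIZE LOCAL_SIZE S3_MTIME LOCAL_MTIME
#   MODE is one of: default, size-only, exact-timestamps
#   Sizes are in bytes; mtimes are Unix epoch seconds.
# Returns 0 if the object would be downloaded, 1 if it would be skipped.
should_download() {
  local mode=$1 s3_size=$2 local_size=$3 s3_mtime=$4 local_mtime=$5

  # A size mismatch means a download in every mode.
  if [ "$s3_size" -ne "$local_size" ]; then
    return 0
  fi

  case "$mode" in
    size-only)
      # Sizes match and size is the only criterion: skip.
      return 1
      ;;
    exact-timestamps)
      # Same-sized items are skipped only when the timestamps match exactly.
      [ "$s3_mtime" -ne "$local_mtime" ]
      ;;
    default)
      # Same-sized items are skipped unless the local copy is newer.
      [ "$local_mtime" -gt "$s3_mtime" ]
      ;;
    *)
      echo "unknown mode: $mode" >&2
      return 2
      ;;
  esac
}

# Demo: the spot-checked file from this issue (296522 bytes on both sides,
# identical timestamps; the epoch value is illustrative).
for mode in default size-only exact-timestamps; do
  if should_download "$mode" 296522 296522 1611359604 1611359604; then
    echo "$mode: download"
  else
    echo "$mode: skip"
  fi
done
```

Under this reading, an object whose size and timestamp both match its local copy is skipped in all three modes, so re-downloads of such files would have to come from somewhere other than the size and timestamp comparison.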

Do you have any updates on your end as far as what you've tried? I still can't reproduce the issue but invite others to share their insights here if they know what the problem could be.

@tim-finnigan tim-finnigan removed their assignment Mar 6, 2023
@tim-finnigan tim-finnigan added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed bug This issue is a bug. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Mar 6, 2023
@MrJoy
Author

MrJoy commented Mar 7, 2023

The totality of my script is, at present, this:

#!/bin/bash
IFS=$'\n\t'
set -euo pipefail

(
  cd ~/personal
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)

(
  cd ~/mjbackup/aws
  git add .
  git commit --all --allow-empty -m "AWS log snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-logs/ ~/mjbackup/aws/access/
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-api-logs/ ~/mjbackup/aws/api/
  git add .
  git commit --all --allow-empty -m "AWS log snapshot, post-fetch..."
)

echo 'Done.'

As of today, that first sync job has an issue and the other two do not. So the problem is clearly dependent upon the data in S3 and/or my local filesystem.

In the case of the first sync job, it's notable that only the cur sub-directory is affected -- and every single object under there is affected. There's about 58MB of files that sit parallel to the cur folder of the bucket and they do not get re-synced on every run. The 744.4MB of files under cur are re-synced every single time, with no changes resulting.

Currently, I'm using aws-cli version:

aws-cli/2.11.0 Python/3.11.2 Darwin/21.6.0 source/arm64 prompt/off

I'm doing a test real quick to have the first sync happen to a different folder, so I can see if it's something to do with the local FS side of things. Will post results momentarily.

I'm happy to give you temporary access to that bucket so you can see if that's helpful in reproducing the issue.

@MrJoy
Author

MrJoy commented Mar 7, 2023

(Just to clarify: When I say no changes result, I mean I wind up with an empty commit despite aws-cli downloading 744.4MB of data.)

@MrJoy
Author

MrJoy commented Mar 7, 2023

Ok. Re-running (twice) against a clean sub-folder produces the same behavior of the data being re-synced. So it seems to be an issue on the S3 side, not something related to how the data (originally) got stored on disk locally.

@MrJoy
Author

MrJoy commented Mar 7, 2023

% ls -la ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json
-rw-r--r--  1 jonathonfrisby  staff  6458 Jan 23  2021 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json

An example of the details of one object that's getting re-synced.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Mar 7, 2023
@MrJoy
Author

MrJoy commented Apr 15, 2023

@tim-finnigan Would it be helpful if I gave you access to the relevant S3 bucket?

@MrJoy
Author

MrJoy commented Aug 18, 2023

I've identified the problem. I had AWS configured to put billing and usage reports under the prefix "/cur/". That got interpreted as a directory entry named "/" holding a directory entry named "cur" holding a directory entry named "/".

After I corrected the prefix, and moved things out of the "/" folders, the sync process shows no changes.
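
For anyone checking their own bucket for the same pattern, here is a minimal sketch that flags keys with empty path segments (a leading "/" or a "//" anywhere in the key). Both helper names are hypothetical and not part of aws-cli. The idea that such a key no longer compares equal to the collapsed path on disk is an assumption about why the objects keep being re-downloaded; it is not something confirmed in this thread.

```shell
#!/bin/bash
# Hypothetical helpers, not part of aws-cli.

# Succeeds (exit 0) if an S3 key starts with "/" or contains "//",
# i.e. has at least one empty path segment.
key_has_empty_segment() {
  case "$1" in
    /*|*//*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Prints the relative path the key collapses to on a POSIX filesystem,
# where runs of slashes are treated as a single separator.
collapse_key() {
  printf '%s\n' "$1" | sed -e 's|//*|/|g' -e 's|^/||'
}

# Demo with the key implied by the sync output earlier in this issue.
key='/cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz'
if key_has_empty_segment "$key"; then
  echo "suspicious key: $key"
  echo "collapses to:   $(collapse_key "$key")"
fi
```

The key string and the collapsed path differ even though they land on the same file locally, which is consistent with the objects looking new on every run.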

[Screenshot attached: Screen Shot 2023-08-17 at 20 28 33]

@MrJoy MrJoy closed this as completed Aug 18, 2023
@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
