`aws s3 sync` downloading unchanged files. #7228

MrJoy · 2022-08-29T17:22:05Z

Describe the bug

I have a maintenance script I run to keep a local copy of billing & usage data for my personal AWS account. It identifies almost every file as changed on every run, even though most of the files haven't been modified in years.

Expected Behavior

Only changed files -- in this case, files representing the current billing period -- should be downloaded.

Current Behavior

Of 6,279 files that do not represent the current billing period, it's consistently re-downloading 5,831 of them. The files it downloads are byte-for-byte identical to the existing ones. I spot-checked one of the files, and aws s3 ls reports the exact same size and timestamp as ls does.

Reported by aws s3 sync:

download: s3://mrjoy-billing-data//cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz to ../personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz

Reported by aws s3 ls:

% aws-vault exec mrjoy -- aws s3 ls s3://mrjoy-billing-data//cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
2021-01-22 15:53:24     296522 billing_and_usage-00001.csv.gz

Reported by ls:

% ls -laD "%Y-%m-%d %H:%M:%S" ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz
-rw-r--r--  1 jonathonfrisby  staff  296522 2021-01-22 15:53:24 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz

The post-fetch commit in all cases shows diffs for the files in the current billing period (as would be expected), and no changes to any of the other files that aws s3 sync reports as being downloaded.

All told, aws s3 sync appears to be downloading around 700MB of files on each run that it shouldn't be.

Reproduction Steps

The relevant portion of my script is:

#!/bin/bash
IFS=$'\n\t'
set -euo pipefail

(
  cd ~/personal
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)

The data in the bucket is written by AWS itself.

Possible Solution

No response

Additional Information/Context

No response

CLI version used

2.7.26

Environment details (OS name and version, etc.)

macOS 12.5.1

tim-finnigan · 2022-08-30T15:56:50Z

Hi @MrJoy thanks for reaching out. Have you tried using the --size-only parameter documented here? This parameter makes the size of each key the only criteria used to decide whether to sync from source to destination. So it should ignore all of those files that are the same size.

MrJoy · 2022-08-31T22:49:40Z

I attempted it just now, and it did not change anything. It's still attempting to download ~everything. Note that after submitting this ticket, I updated to 2.7.27 -- so this test was done on 2.7.27 not 2.7.26.

tim-finnigan · 2022-09-01T21:53:00Z

Thanks for the update. There is an older issue tracking problems with S3 sync here: #599. Some users have reported anomalies when certain files sync that should not, but I wouldn't expect the problem at the scale you're describing where it's happening with hundreds of files. I don't know if I'd be able to reproduce the issue as described but could try. If you can get the debug logs by adding --debug to the command that might also give more insight into the problem. Some have said that using --size-only or --exact-timestamps has helped produce the expected results. There are other S3 sync-related feature requests like #6750 that relate to using new checksum algorithms for improving the accuracy.

MrJoy · 2022-09-03T00:31:35Z

@tim-finnigan I'm sorry, I was unclear in my last message: When I said "I attempted it just now", I meant "I attempted to use --size-only just now". Accidental pronoun game, FTL.

If you'd like, I can temporarily give you read-only credentials for this bucket and you can see if you are able to recreate the problem from the same source. My personal AWS bill is... not data I'm terribly worried about sharing.

I'll get --debug output and add it here tomorrow. I'm about to be AFK for a while.

MrJoy · 2022-09-04T19:47:37Z

debug.log.cleansed.zip

I've stripped tokens/signatures/key IDs from the file, but it's otherwise as produced from running:

aws-vault exec mrjoy -- aws s3 sync --debug --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/

tim-finnigan · 2022-09-06T21:42:44Z

Thanks @MrJoy for following up and sharing your logs. I couldn't identify any anomalies after scanning through the logs. I think attempting to recreate the issue is a good idea, but for that I recommend reaching out through AWS Support to open a private communication channel. I'd also recommend trying to use --exact-timestamps when running the sync command to see if that addresses the issue you're seeing.

MrJoy · 2022-09-06T23:33:01Z

I went ahead and tried --exact-timestamps by itself and in combination with --size-only, and the behavior seems to be the same in all cases.

Going through AWS Support is not an option, as this is my personal account and I'm on the Basic plan.

tim-finnigan · 2022-11-16T17:46:23Z

Checking in on this issue again - thanks for your patience. I think this issue might actually overlap with #5730, #648 and/or #5369. Have you looked through any of those issues? Based on some of the comments it sounds like this could be due to how S3 handles timestamps.

MrJoy · 2022-11-16T23:59:43Z

Using --size-only without --exact-timestamps does not alleviate the problem. Doesn't --size-only cause aws-cli to disregard timestamps?

tim-finnigan · 2023-03-06T23:32:10Z

Hi again, and thanks for your patience. I lost track of this issue. Per the s3 sync documentation, --size-only does the following:

--size-only (boolean) Makes the size of each key the only criteria used to decide whether to sync from source to destination.

And --exact-timestamps does the following:

--exact-timestamps (boolean) When syncing from S3 to local, same-sized items will be ignored only when the timestamps match exactly. The default behavior is to ignore same-sized items unless the local version is newer than the S3 version.

Do you have any updates on your end as far as what you've tried? I still can't reproduce the issue but invite others to share their insights here if they know what the problem could be.
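
As a rough illustration of the two flag descriptions quoted above, here is a minimal sketch of the per-file decision for an S3-to-local sync. It follows only the wording of the documentation, not the aws-cli source, and `should_download`, its argument order, and the mode names are hypothetical. Combining both flags is not modeled.

```shell
# Sketch of the documented comparison rules for an S3 -> local sync.
# should_download is a hypothetical helper, not an aws-cli function.
# Arguments: mode, S3 size, local size, S3 mtime, local mtime
# (sizes in bytes, mtimes as Unix epoch seconds).
should_download() {
  mode=$1; s3_size=$2; local_size=$3; s3_mtime=$4; local_mtime=$5

  # A size mismatch triggers a download in every mode.
  if [ "$s3_size" -ne "$local_size" ]; then
    echo yes
    return 0
  fi

  case "$mode" in
    size-only)
      # Size is the only criterion, so timestamps are never consulted.
      echo no ;;
    exact-timestamps)
      # Same-sized items are skipped only when the timestamps match exactly.
      if [ "$s3_mtime" -eq "$local_mtime" ]; then echo no; else echo yes; fi ;;
    default)
      # Same-sized items are skipped unless the local copy is newer.
      if [ "$local_mtime" -gt "$s3_mtime" ]; then echo yes; else echo no; fi ;;
    *)
      echo "unknown mode: $mode" >&2
      return 2 ;;
  esac
}

# Same size and same timestamp, as with the spot-checked file in the report
# (the epoch value is an arbitrary placeholder).
should_download size-only 296522 296522 1611359604 1611359604
should_download exact-timestamps 296522 296522 1611359604 1611359604
should_download default 296522 296522 1611359604 1611359604
```

All three calls print `no`. Under these rules a file whose size and timestamp both match should be skipped in every mode, which suggests the re-downloads reported here come from somewhere other than the size and timestamp comparison.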

MrJoy · 2023-03-07T22:37:42Z

The totality of my script is, at present, this:

#!/bin/bash
IFS=$'\n\t'
set -euo pipefail

(
  cd ~/personal
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-billing-data/ ~/personal/Finance/AWS_Billing_Data/
  git add .
  git commit --all --allow-empty -m "AWS bill snapshot, post-fetch..."
)

(
  cd ~/mjbackup/aws
  git add .
  git commit --all --allow-empty -m "AWS log snapshot, pre-fetch..."
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-logs/ ~/mjbackup/aws/access/
  aws-vault exec mrjoy -- aws s3 sync --size-only s3://mrjoy-api-logs/ ~/mjbackup/aws/api/
  git add .
  git commit --all --allow-empty -m "AWS log snapshot, post-fetch..."
)

echo 'Done.'

As of today, that first sync job has an issue and the other two do not. So the problem is clearly dependent upon the data in S3 and/or my local filesystem.

In the case of the first sync job, it's notable that only the cur sub-directory is affected -- and every single object under there is affected. There's about 58MB of files that sit parallel to the cur folder of the bucket and they do not get re-synced on every run. The 744.4MB of files under cur are re-synced every single time, with no changes resulting.

Currently, I'm using aws-cli version:

aws-cli/2.11.0 Python/3.11.2 Darwin/21.6.0 source/arm64 prompt/off

I'm doing a test real quick to have the first sync happen to a different folder, so I can see if it's something to do with the local FS side of things. Will post results momentarily.

I'm happy to give you temporary access to that bucket so you can see if that's helpful in reproducing the issue.

MrJoy · 2023-03-07T22:39:26Z

(Just to clarify: When I say no changes result, I mean I wind up with an empty commit despite aws-cli downloading 744.4MB of data.)

MrJoy · 2023-03-07T22:40:59Z

Ok. Re-running (twice) against a clean sub-folder produces the same behavior of the data being re-synced. So it seems to be an issue on the S3 side, not something related to how the data (originally) got stored on disk locally.

MrJoy · 2023-03-07T22:44:23Z

% ls -la ~/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json
-rw-r--r--  1 jonathonfrisby  staff  6458 Jan 23  2021 /Users/jonathonfrisby/personal/Finance/AWS_Billing_Data/cur/billing_and_usage/20210101-20210201/20210123T082656Z/billing_and_usage-Manifest.json

An example of the details of one object that's getting re-synced.

MrJoy · 2023-04-15T00:32:48Z

@tim-finnigan Would it be helpful if I gave you access to the relevant S3 bucket?

MrJoy · 2023-08-18T05:12:58Z

I've identified the problem. I had AWS configured to put billing and usage reports under the prefix "/cur/". That got interpreted as a directory entry named "/" holding a directory entry named "cur" holding a directory entry named "/".
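
S3 keys are flat strings, so a key that starts with "/" or contains "//" has empty segments when split on "/". The sketch below counts those empty segments for a single key. `count_empty_segments` is a hypothetical helper, not an aws-cli command, and the assumption that these empty segments are what tripped up the comparison rests only on the diagnosis in this comment.

```shell
# count_empty_segments is a hypothetical helper, not an aws-cli command.
# It prints how many "/"-separated segments of an S3 key are empty,
# ignoring the final segment (the object's own name).
count_empty_segments() {
  key=$1
  count=0
  while :; do
    case "$key" in
      */*) ;;        # at least one "/" left: keep going
      *) break ;;    # only the final name remains
    esac
    segment=${key%%/*}   # text before the first "/"
    key=${key#*/}        # everything after the first "/"
    if [ -z "$segment" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# Key shape taken from the sync output earlier in this thread.
count_empty_segments "/cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz"
# The same key without the leading and doubled slashes.
count_empty_segments "cur/billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz"
```

The first call prints `2` (one empty segment before "cur" and one after it), the second prints `0`.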

After I corrected the prefix, and moved things out of the "/" folders, the sync process shows no changes.
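
The exact commands used for the cleanup are not shown in this thread. As a hedged sketch of the renaming step only, the hypothetical helper below computes what a key would look like with the empty segments removed; it does not move anything in S3, and whether this matches the layout actually chosen is an assumption.

```shell
# normalize_key is a hypothetical helper, not an aws-cli command.
# It collapses runs of "/" into a single "/" and strips a leading "/".
normalize_key() {
  printf '%s\n' "$1" | sed -e 's#//*#/#g' -e 's#^/##'
}

# Key shape taken from the sync output earlier in this thread.
old_key="/cur//billing_and_usage/20210101-20210201/20210122T235314Z/billing_and_usage-00001.csv.gz"
printf '%s -> %s\n' "$old_key" "$(normalize_key "$old_key")"
```

Printing the old and new key side by side makes it easy to review the mapping before moving any objects.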

github-actions · 2023-08-18T05:13:19Z

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

MrJoy added the bug and needs-triage labels Aug 29, 2022

tim-finnigan added the s3sync and s3 labels and removed the needs-triage label Aug 30, 2022

tim-finnigan added the response-requested label Aug 30, 2022

tim-finnigan added the p2 label Nov 16, 2022

tim-finnigan self-assigned this Nov 16, 2022

tim-finnigan removed their assignment Mar 6, 2023

tim-finnigan added the response-requested label and removed the bug and response-requested labels Mar 6, 2023

github-actions bot removed the response-requested label Mar 7, 2023

h-r-k-matsumoto mentioned this issue Aug 10, 2023

aws s3 sync --exact-timestamps {from-s3} {to-local} does not check the time correctly. #8092

Closed

MrJoy closed this as completed Aug 18, 2023
