Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3TransferManager directoryDownload downloads incomplete files without failing #5631

Open
1 task
bugrabenturk opened this issue Sep 27, 2024 · 12 comments
Open
1 task
Assignees
Labels
bug This issue is a bug. crt-client p2 This is a standard priority issue

Comments

@bugrabenturk
Copy link

bugrabenturk commented Sep 27, 2024

Describe the bug

Sometimes S3TransferManager does not actually download the complete file and it does not marks that as failed transfer instead it marks as complete transfer, it does not even throws any exception.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Should download the complete file, if it does not downloads the complete file should trigger as failed download or throw the corresponded exception.

Current Behavior

Does not downloads the complete file and does not triggers as failed download instead marks as download completed without throwing any exception.

First I thought that it's happening because of the assuming request's token expiration, since its set to 3600 secs (1 hour), i thought that when the download starts then the token expires and somehow it does fails the transfer but still marks as completed.

So based on that idea, i've created the session right before every transfer action, so it still did not help and problem still occurred.

However, when I have checked the logs i couldn't be able to see much but this is the last part thats been downloaded, rest of the parts are not downloaded at all.
[DEBUG] [2024-09-26T15:39:52Z] [00007fbe09e24b38] [S3MetaRequest] - id=0x7fbe07ff2300: 132 out of 998 parts have completed.

I'm attaching the log file, since it was very big i'm adding the part that its stopped downloading at the part of 132.

svhf.log

Reproduction Steps

We're using STSClient to gather temporary AWS Session credentials with assuming to the corresponding role and then doing the download operation

`

                stsClient = StsClient.builder()
                    .credentialsProvider(StaticCredentialsProvider.create(AwsBasicCredentials.create(accessKey, secretKey)))
                    .build()

                AssumeRoleRequest assumeRoleRequest = AssumeRoleRequest.builder()
                    .roleArn(roleArn)
                    .roleSessionName(roleSessionName)
                    .externalId(externalId)
                    .durationSeconds(3600)
                    .build();

                AssumeRoleResponse assumeRoleResponse = stsClient.assumeRole(assumeRoleRequest);
                Credentials stsCredentials = assumeRoleResponse.credentials();

                awsSessionCredentials = AwsSessionCredentials.create(
                        stsCredentials.accessKeyId(),
                        stsCredentials.secretAccessKey(),
                        stsCredentials.sessionToken()
                );

        s3AsyncClient = S3AsyncClient.crtBuilder()
                .credentialsProvider(StaticCredentialsProvider.create(awsSessionCredentials))
                .region(Region.EU_WEST_1)
                .maxConcurrency(10)
                .targetThroughputInGbps(5.0)
                .minimumPartSizeInBytes(80 * SizeConstant.MB)
                .retryConfiguration(r -> r.numRetries(5))
                .build();

        s3TransferManager = S3TransferManager.builder()
                .s3Client(s3AsyncClient)
                .build();`


       DirectoryDownload directoryDownload = s3TransferManager.downloadDirectory(DownloadDirectoryRequest.builder()
                .destination(assetDir)
                .bucket(amazonS3URI.getBucket())
                .listObjectsV2RequestTransformer(builder -> builder.prefix(amazonS3URI.getKey()))
                .build());

        CompletedDirectoryDownload completedDirectoryDownload = directoryDownload.completionFuture().join();

`

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.28.6

AWS CRT version used

0.31.1

JDK version used

21

Operating System and version

Alpine Linux v3.20

@bugrabenturk bugrabenturk added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Sep 27, 2024
@bugrabenturk
Copy link
Author

@debora-ito Can you help or point out to someone who can help please? We've been using this in production and we'd like to keep it that way but need to find-out why it's actually happening.

@debora-ito
Copy link
Member

@bugrabenturk Found this line in the logs:

[DEBUG] [2024-09-26T15:39:52Z] [00007fbe0a1b8b38] [S3MetaRequest] - id=0x7fbe07ff2300 Meta request finished with error code 14356 (The object modifed during download.)

Do you think it might be the case, the object was modified? I'm looking into why the client didn't surface an error.

@debora-ito debora-ito added investigating This issue is being investigated and/or work is in progress to resolve the issue. crt-client p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Oct 16, 2024
@debora-ito debora-ito self-assigned this Oct 16, 2024
@bugrabenturk
Copy link
Author

@debora-ito No, object did not get modified during download, thats strange to see that kind of message :/

@debora-ito
Copy link
Member

Some questions:

  1. Can you reproduce this reliably? How often does it occur?
  2. How many and how large are the files in the directory?
  3. I understand they are big, but can you share the full logs? Make sure to redact any sensitive information.

@bugrabenturk
Copy link
Author

bugrabenturk commented Oct 16, 2024

Some questions:

  1. Can you reproduce this reliably? How often does it occur?
  2. How many and how large are the files in the directory?
  3. I understand they are big, but can you share the full logs? Make sure to redact any sensitive information.

1 - I'm not able to reproduce this, it's very much random i'm afraid.

2- Normally in the directory there are more than 10 files but we filter the prefix and download only filtered ones, and on this case it's for .mov file, so let's assume that directory has 10 files but we do download only one of them at those failed ones and those downloaded files are around 60+ GB mostly, but its around 30 to 90GB. Issues are happening mostly with the big ones, but i don't think the size is important because i've seen some of the downloads just finished after %2 of the file downloaded, which is another strange part so i can't say it's happening to only large files, but it seemed like that.

3- they're really big more than 10gb so, i have had to change it to debug instead of trace after some failed downloads i am afraid. And i've kept few files, from below you can see how often that we've had those incomplete downloaded files, strange part is 20731570 that asset file seems like failed at the same day three times.

Seems like somehow other download operations, head object operations are being logged as well in the log files, they might confuse you, i will provide cleaner logs later on.

-rw-r--r--  1 root root  7944556 Oct 16 18:31 renditions-download-logging-20512540-1729102789185.log
-rw-r--r--  1 root root  4495375 Oct  7 16:28 renditions-download-logging-20649840-1728318139456.log
-rw-r--r--  1 root root  2266168 Oct  1 08:03 renditions-download-logging-20731570-1727769612592.log
-rw-r--r--  1 root root 10063600 Oct  1 12:07 renditions-download-logging-20731570-1727783574797.log
-rw-r--r--  1 root root  6223972 Oct  1 12:31 renditions-download-logging-20731570-1727785489400.log
-rw-r--r--  1 root root  3550197 Oct  2 10:11 renditions-download-logging-20807180-1727863621751.log
-rw-r--r--  1 root root  9985480 Oct 14 20:35 renditions-download-logging-V58350-1728937234963.log

renditions-download-logging-20512540-1729102789185.log
renditions-download-logging-20649840-1728318139456.log
renditions-download-logging-20731570-1727769612592.log
renditions-download-logging-20731570-1727783574797.log
renditions-download-logging-20731570-1727785489400.log
renditions-download-logging-20807180-1727863621751.log
renditions-download-logging-V58350-1728937234963.log

@debora-ito
Copy link
Member

Thank you for the files, I'll take a look at them.

but i don't think the size is important because i've seen some of the downloads just finished after %2 of the file downloaded, which is another strange part so i can't say it's happening to only large files, but it seemed like that.

When you say the downloads are finished, what exactly indicates they are finished?

@bugrabenturk
Copy link
Author

@debora-ito After few incomplete downloads to be able to understand more i have also added the transfer listener

addTransferListener(LoggingTransferListener.create())

after that, things got more interesting since transfer was getting Transfer complete! before it reaches to 100%, some did even before the first tick, so yeah thats how.

@debora-ito
Copy link
Member

transfer was getting Transfer complete! before it reaches to 100%, some did even before the first tick, so yeah thats how.

@bugrabenturk can you show how this looks like?

We have a known issue with the transfer listener where there's a long delay between 100% and printing "Transfer complete", but this is the opposite, I think.

@bugrabenturk
Copy link
Author

@debora-ito I've found this from old ones, hope its usefull.

incomplete

@debora-ito
Copy link
Member

Quick update: I'm still investigating this, the current hypothesis is CRT client is experiencing a specific error that is not being surfaced and not reaching the transfer listener. I'm working on a repro case, trying to figure out what's that original error.

@bugrabenturk I'm wondering if this issue happens with the standard s3 async client too, have you test it out of curiosity?

@bugrabenturk
Copy link
Author

@debora-ito No, i did not try with s3 async client, i might have next week.

@debora-ito
Copy link
Member

So we identified the issue, the way I reproduced is by deleting the object after the downloadDirectory has started. The transfer stops with no error, and no failure is registered in completedDirectoryDownload.failedTransfers().

We are working on a fix, to add it to the failedTransfers list when this happens.

In the meantime you can use the default S3AsyncClient as a workaround:

        S3AsyncClient s3AsyncClient =
                S3AsyncClient.builder()
                        .multipartEnabled(true)
                        .build();

@debora-ito debora-ito removed the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. crt-client p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

2 participants