Data corruption when uploading blob via NewAppendBlobClient=>AppendBlock API #24027
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
By default, the retry policy will retry requests that fail with a 500 (see the docs here). It looks like you're using the default behavior but let us know if I got that wrong. If you have client-side logging enabled you should see the failed request and subsequent retry. So, assuming that the default behavior is in play, it would explain why you see a 500 server-side but not client side (assuming that the retried request succeeded). I looked through some of the links (thanks for providing detailed information, it's super helpful) but didn't see any client-side logs (let us know if I missed it). Would it be possible to reproduce the failure with client-side logging enabled?
In the meantime, we'll also continue to investigate on our side.
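For reference, a minimal sketch of what enabling client-side logging could look like via the SDK's azcore log listener (the choice of events here is an assumption; adapt to your own logging setup):

```go
package main

import (
	"fmt"

	azlog "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
)

func main() {
	// Forward SDK log messages somewhere visible; a real service would plug in
	// its own structured logger here.
	azlog.SetListener(func(event azlog.Event, msg string) {
		fmt.Printf("[%s] %s\n", event, msg)
	})

	// Only the events relevant to this issue: outgoing requests, responses,
	// and the retry policy's decisions.
	azlog.SetEvents(azlog.EventRequest, azlog.EventResponse, azlog.EventRetryPolicy)

	// ... construct the azblob/appendblob client and run the upload as usual ...
}
```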
@jhendrixMSFT Thank you for your reply and time. Yes, we are using the default retry policy. I will work on reproducing the problem with client-side logging enabled tomorrow and try to provide you with more information.
We have a theory, but it doesn't quite align with your notes. According to the REST docs, a 500 with error code …
The block was either missing or out of order - I am 100% sure that it is one of these two options. If it was duplicated, I would have gotten a negative offset when searching for the block. To understand what exactly gets corrupted, I started sending blobs of data generated with Go's ChaCha8 pseudo-random generator and made sure to log the seed used in the tests' output. I then compared the first 20-something bytes of the first 512-byte block that the tests reported as corrupted against the whole pseudo-random stream that was sent to the backend (i.e. recreated locally using the seed the tests logged), and those bytes were found 4MiB later in the stream, as if the backend had skipped a block. I do not know if the block was uploaded out of order though - the tests stopped comparing the stream and aborted after they found the first discrepancy.
In case you would like to double-check my investigation:
and the output of one of the failing tests from the issue I linked to:
which translates to one of the paragraphs from that issue:
I double-checked my findings with a simple script below and this may indeed be a duplicate block after all - apologies, I was counting in the wrong direction :( It is pretty late for me, I should probably call it a day for today :)
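(The script itself isn't preserved in this thread; below is a hedged reconstruction of that kind of check, with a placeholder seed and offsets rather than the real values, assuming Go 1.23's math/rand/v2 ChaCha8 reader.)

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand/v2"
)

func main() {
	// Placeholder values for illustration only: the seed logged by the failing
	// test run, the total upload size, and the first corrupted 512-byte block
	// as read back from the blob.
	var seed [32]byte
	const totalSize = 32 * 1024 * 1024
	expectedOffset := int64(8 * 1024 * 1024)
	corrupted := make([]byte, 512)

	// Recreate the exact pseudo-random stream that was uploaded.
	stream := make([]byte, totalSize)
	rand.NewChaCha8(seed).Read(stream)

	// Search for the corrupted block's bytes in the regenerated stream. Finding
	// them *after* the expected offset means the backend skipped a block;
	// finding them *before* it means a block was duplicated.
	found := int64(bytes.Index(stream, corrupted))
	fmt.Printf("expected at %d, found at %d (delta %d)\n",
		expectedOffset, found, found-expectedOffset)
}
```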
Thanks so much for the help. Indeed, the consensus on our end is to not retry on …
sgtm.
Do we still need it?
No, we have enough info to move forward on this.
I started working on it and created a draft MR. Do you think you could help me or point me to relevant documentation/code regarding how to handle the OperationTimeout case, namely the part where things might have succeeded despite the error returned:
For the proper sequence to detect and recover, I'll defer to @gapra-msft, as this was also observed in AzCopy (the way they fixed it there is Azure/azure-storage-azcopy#2430, which might be of interest). The way she explained how it works in AzCopy is as follows.
I believe that using an …
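To make the idea concrete, here is a rough sketch of a conditional append, assuming the appendblob package's AppendPositionAccessConditions option and the bloberror helpers; this illustrates the approach, it is not the actual AzCopy or SDK fix:

```go
package upload

import (
	"context"
	"io"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/appendblob"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/bloberror"
)

// appendAt is a hypothetical helper: it asks the service to append the chunk only
// if the blob's current length equals offset. If a previous attempt actually
// committed the block despite returning OperationTimedOut, the retried request
// fails the precondition instead of appending the data twice.
func appendAt(ctx context.Context, client *appendblob.Client, chunk io.ReadSeekCloser, offset int64) error {
	_, err := client.AppendBlock(ctx, chunk, &appendblob.AppendBlockOptions{
		AppendPositionAccessConditions: &appendblob.AppendPositionAccessConditions{
			AppendPosition: to.Ptr(offset),
		},
	})
	if bloberror.HasCode(err, bloberror.AppendPositionConditionNotMet) {
		// The block is most likely already committed; a real implementation
		// should verify the blob's length before treating this as success.
		return nil
	}
	return err
}
```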
Thank you, that makes a lot of sense for AppendBlock. Do you have any advice on how to handle NewAppendBlobClient.Create and maybe StartCopyFromURL, as I mentioned in my previous comment?
For …, I think the same pattern applies to …
Thank you very much!
Bug Report
We recently noticed that the data we upload to Azure Blob Storage using the AppendBlock API is not the same as the data we later read back when the API is throttling our requests.
The problem is relatively easy to replicate and can be summarised by the logs below:
When uploading a large blob (32MiB) in chunks (4MiB) in a loop, one of the chunk uploads fails with `500 Operation timed out`, but the `AppendBlock` call does not return an error, so the upload continues. Later, when fetching the uploaded blob for inspection using GetBlob, we get a `200` status but an `InternalError` status text and a corrupted file. The corruption boils down to the chunk that failed to upload being missing. The code we are using is open source and can be found here: https://gitlab.com/gitlab-org/container-registry/-/blob/a7581d9e743bba63e8318f304a799a09ae9e4580/registry/storage/driver/azure/v2/azure.go#L645
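A stripped-down sketch of that upload pattern (illustrative only; the real driver code is at the link above):

```go
package upload

import (
	"bytes"
	"context"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/streaming"
	"github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/appendblob"
)

// chunkedAppend reduces the failing pattern to its essence: a 32MiB payload
// appended to an append blob as 4MiB chunks in a loop.
func chunkedAppend(ctx context.Context, client *appendblob.Client, payload []byte) error {
	const chunkSize = 4 * 1024 * 1024
	if _, err := client.Create(ctx, nil); err != nil {
		return err
	}
	for off := 0; off < len(payload); off += chunkSize {
		end := off + chunkSize
		if end > len(payload) {
			end = len(payload)
		}
		// Under throttling, one of these calls hits a server-side 500
		// (operation timeout); the default retry policy resends it, and this is
		// where the missing/duplicated block shows up in the committed blob.
		chunk := streaming.NopCloser(bytes.NewReader(payload[off:end]))
		if _, err := client.AppendBlock(ctx, chunk, nil); err != nil {
			return err
		}
	}
	return nil
}
```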
One of the tests that fails due to this issue: https://gitlab.com/gitlab-org/container-registry/-/blob/a7581d9e743bba63e8318f304a799a09ae9e4580/registry/storage/driver/testsuites/testsuites.go?page=2#L1428-1449
The internal issue where we track this issue and where you can potentially find more details is here: https://gitlab.com/gitlab-org/container-registry/-/issues/1470
github.com/Azure/azure-sdk-for-go/sdk/storage/azblob/blob
go version: 1.23.2