
TPC error in push-mode from StoRM WebDAV to EOS for file size greater than 1MB #38

Open
andrearendina opened this issue May 16, 2022 · 6 comments
@andrearendina

Hi all,

I have noticed a constant error in third-party push copies from StoRM WebDAV to EOS for files larger than 1 MB.

In my tests, the version of the StoRM WebDAV server is storm-webdav-1.4.1-1.el7.noarch, the EOS version is 5.0.22, and the XRootD version is 5.4.3. Below you can find two simple examples with the gfal tool: a successful transfer and a failed one.

$  gfal-copy -v -t 36000 --copy-mode push davs://ds-509.cr.cnaf.infn.it:8443/juno-test/rucio4juno/test_parallel_transfer/divina_commedia davs://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/
Copying 557042 bytes davs://ds-509.cr.cnaf.infn.it:8443/juno-test/rucio4juno/test_parallel_transfer/divina_commedia => davs://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/divina_commedia
[...]
event: [1652695555951] BOTH   http_plugin       TRANSFER:TYPE   3rd push
event: [1652695556028] BOTH   http_plugin       TRANSFER:EXIT   davs://ds-509.cr.cnaf.infn.it:8443/juno-test/rucio4juno/test_parallel_transfer/divina_commedia => davs://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/divina_commedia
$ gfal-copy -v -t 36000 --copy-mode push davs://ds-509.cr.cnaf.infn.it:8443/juno-test/rucio4juno/test_parallel_transfer/test2M davs://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/
Copying 2097152 bytes davs://ds-509.cr.cnaf.infn.it:8443/juno-test/rucio4juno/test_parallel_transfer/test2M => davs://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/test2M
[...]
event: [1652695787847] BOTH   http_plugin       TRANSFER:TYPE   3rd push
event: [1652695787899] BOTH   http_plugin       TRANSFER:EXIT   ERROR: Copy failed with mode 3rd push, with error: Transfer failed: failure: SSLException while pushing https://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/test2M: Connection reset by peer (Write failed)
event: [1652695787913] BOTH   http_plugin       TRANSFER:EXIT   davs://ds-509.cr.cnaf.infn.it:8443/juno-test/rucio4juno/test_parallel_transfer/test2M => davs://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/test2M
gfal-copy error: 5 (Input/output error) - TRANSFER  ERROR: Copy failed with mode 3rd push, with error: Transfer failed: failure: SSLException while pushing https://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/test2M: Connection reset by peer (Write failed)

Checking the EOS logs, there seems to be a miscommunication between the file transfer services: EOS attempts to parse the whole file as a header. In fact, in the first example above the entire file content was written into the logs.
From the following EOS configuration, I suppose this behaviour can only work while the file size stays below 1 MB:

220516 10:05:55 395 http_Protocol: getDataOneShot BuffAvailable: 1048576 maxread: 1048576
220516 10:05:55 395 http_Protocol: getDataOneShot sslavail: 1048576

which would explain why the transfers fail for files larger than 1 MB.
The EOS logs for the failed transfer show the following error:

220516 10:09:47 1641 http_Protocol: getDataOneShot BuffAvailable: 1043735 maxread: 1043735
220516 10:09:47 1641 http_Protocol: getDataOneShot sslavail: 1043735
220516 10:09:47 1641 http_Protocol: read 8192 of 1043735 bytes
220516 10:09:47 1641 http_Protocol:  rc:77 got hdr line: E9wjrlRm7txrFiOrA4nJWjotjWOZhe417faFkbGylOwm1YEbHj1cRdALQbKGtjD3CH9KxKRl6QHV

220516 10:09:47 1641 http_Protocol:  Parsing first line: E9wjrlRm7txrFiOrA4nJWjotjWOZhe417faFkbGylOwm1YEbHj1cRdALQbKGtjD3CH9KxKRl6QHV

220516 10:09:47 1641 http_Protocol:  Parsing of first line failed with -1
220516 10:09:47 1641 http_Protocol:  Cleanup
220516 10:09:47 1641 http_Protocol:  SSL_shutdown failed
220516 10:09:47 1641 http_Protocol:  Reset
220516 10:09:47 1641 http_Req:  XrdHttpReq request ended.

where EOS tries to parse the first line of the file content as an HTTP request line and fails.

As further information, the StoRM WebDAV logs show that the third-party copy is successfully submitted in both cases.
In summary, it seems that in this specific case EOS is not able to distinguish the header from the content and parses all of the content as the header.
As a consequence, the file transfer fails whenever the size is greater than 1 MB.
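If the 1 MiB buffer really is the cut-off, it should be easy to bracket it with two test files sitting just under and just over 1048576 bytes; a minimal sketch for generating them (the file names and temporary directory are made up):

```python
import os
import tempfile

# BuffAvailable / maxread reported in the EOS logs above: 1 MiB.
BUF_LIMIT = 1048576

tmpdir = tempfile.mkdtemp()
sizes = {"under_1M.dat": BUF_LIMIT - 1, "over_1M.dat": BUF_LIMIT + 1}
for name, size in sizes.items():
    with open(os.path.join(tmpdir, name), "wb") as f:
        f.write(os.urandom(size))

# If the 1 MiB buffer really is the cut-off, pushing the first file should
# succeed while pushing the second should fail.
for name in sizes:
    print(name, os.path.getsize(os.path.join(tmpdir, name)))
```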

Sorry if I made any mistakes or was unclear; please feel free to move this issue if needed.

@esindril

esindril commented May 17, 2022

Hi Andrea,

Thanks for the report! I tried to replicate your issue using two different EOS instances, one running version 4.8.82 (used as source) and one running 5.0.23 (used as destination), but everything worked as expected. I used the following test script, basically relying on curl to trigger the TPC transfer:

export SRC=https://esdss000.cern.ch:9000/eos/dev/replica/xfile1.dat
export DST=https://elvin-dev01.cern.ch:9000/eos/dev/replica/copy_xfile.dat
export TSRC=$(curl -4 --silent --cert /tmp/x509up_u$(id -u) --key /tmp/x509up_u$(id -u) --cacert /tmp/x509up_u$(id -u) --capath /etc/grid-security/certificates -X POST -H 'Content-Type: application/macaroon-request' -d '{"caveats": ["activity:DOWNLOAD"], "validity": "PT3000M"}' "$SRC" | jq -r '.macaroon')
export TDST=$(curl -4 --silent --cert /tmp/x509up_u$(id -u) --key /tmp/x509up_u$(id -u) --cacert /tmp/x509up_u$(id -u) --capath /etc/grid-security/certificates -X POST -H 'Content-Type: application/macaroon-request' -d '{"caveats": ["activity:UPLOAD,DELETE,LIST"], "validity": "PT3000M"}' "$DST" | jq -r '.macaroon')
curl -4 -v --capath /etc/grid-security/certificates -L -X COPY -H 'Secure-Redirection: 1' -H 'X-No-Delegate: 1' -H 'Credentials: none' -H "Authorization: Bearer $TSRC" -H "TransferHeaderAuthorization: Bearer $TDST" -H "TransferHeaderTest: Test" -H "Destination: $DST" "$SRC" 

In general EOS relies on the Content-Length HTTP header to know how much data it should expect to receive, so it might be that this piece of information is not sent when the PUT is done by StoRM?! This is just a wild guess and should be checked; I don't have a StoRM instance to easily check this. Trying this between different EOS instances seems to yield the expected result.
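As a side note, one way to see exactly which headers a given client emits on PUT is to point it at a throwaway local server and record them; a minimal sketch (the local address and path are made up):

```python
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

received = {}

class RecordingHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # Record the request headers: Content-Length is what tells the
        # server how much body to expect.
        received["headers"] = dict(self.headers)
        length = int(self.headers.get("Content-Length", 0))
        received["body"] = self.rfile.read(length)
        self.send_response(201)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), RecordingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("PUT", "/demo.dat", body=b"x" * 2048)  # Content-Length added automatically
resp = conn.getresponse()
resp.read()
print(resp.status, received["headers"].get("Content-Length"))
server.shutdown()
```

Pointing the real client (StoRM's HttpClient in this case) at such a listener would show directly whether Content-Length, Expect, or anything else is missing from the request.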

[esindril@elvin-dev01]$ eos version 
EOS_INSTANCE=eosdev
EOS_SERVER_VERSION=5.0.23 EOS_SERVER_RELEASE=1
EOS_CLIENT_VERSION=5.0.23 EOS_CLIENT_RELEASE=1

[esindril@elvin-dev01]$ eos ls -lrt /eos/dev/replica/
-rw-r--r--   2 esindril c3           29210264 May 17 17:33 copy_xfile.dat

Cheers,
Elvin

@esindril esindril self-assigned this May 17, 2022
@andrearendina

Hi Elvin,

thank you very much for the quick feedback!
I have replicated your script to perform the push copy in both cases, from WebDAV to EOS and from EOS to EOS, in order to compare the logs on the namespace server.

First of all, I noticed that the Content-Length header is present in both cases, respectively:

220519 08:25:45 467 http_Protocol:  rc:51 got hdr line: PUT /eos/dockertest/andrea/test2M_webdav HTTP/1.1
220519 08:25:45 467 http_Protocol:  Parsing first line: PUT /eos/dockertest/andrea/test2M_webdav HTTP/1.1
220519 08:25:45 467 http_Protocol:  rc:479 got hdr line: Authorization: Bearer
[...]
220519 08:25:45 467 http_Protocol:  rc:12 got hdr line: Test: Test
220519 08:25:45 467 http_Protocol:  rc:25 got hdr line: Content-Length: 2097152
220519 08:25:45 467 http_Protocol:  rc:40 got hdr line: Content-Type: application/octet-stream
220519 08:25:45 467 http_Protocol:  rc:36 got hdr line: Host: eos-mgm.cr.cnaf.infn.it:9000
220519 08:25:45 467 http_Protocol:  rc:24 got hdr line: Connection: Keep-Alive
220519 08:25:45 467 http_Protocol:  rc:52 got hdr line: User-Agent: Apache-HttpClient/4.5.12 (Java/11.0.4)
220519 08:25:45 467 http_Protocol:  rc:31 got hdr line: Accept-Encoding: gzip,deflate
220519 08:25:45 467 http_Protocol:  rc:2 got hdr line: 
220519 08:25:45 467 http_Protocol:  rc:2 detected header end.

220519 08:12:41 446 http_Protocol:  rc:44 got hdr line: PUT /eos/dockertest/andrea/test2M HTTP/1.1
220519 08:12:41 446 http_Protocol:  Parsing first line: PUT /eos/dockertest/andrea/test2M HTTP/1.1
220519 08:12:41 446 http_Protocol:  rc:35 got hdr line: User-Agent: xrootd-tpc/v5.4.3-rc1
220519 08:12:41 446 http_Protocol:  rc:40 got hdr line: Host: eos-ingress.cr.cnaf.infn.it:9000
220519 08:12:41 446 http_Protocol:  rc:13 got hdr line: Accept: */*
220519 08:12:41 446 http_Protocol:  rc:470 got hdr line: Authorization: Bearer
[...]
220519 08:12:41 446 http_Protocol:  rc:12 got hdr line: Test: Test
220519 08:12:41 446 http_Protocol:  rc:25 got hdr line: Content-Length: 2097152
220519 08:12:41 446 http_Protocol:  rc:22 got hdr line: Expect: 100-continue
220519 08:12:41 446 http_Protocol:  rc:2 got hdr line: 
220519 08:12:41 446 http_Protocol:  rc:2 detected header end.

In the failure case the file has been created, but it is never filled with data.

So it seems that the miscommunication between WebDAV and EOS involves another part of the transfer.
I am also in contact with the StoRM developers to understand the reasons for this.

Please keep this issue open; I will update it as soon as I have something new.

As further information, I attach the logs mentioned above to this comment.
Of course, I am not asking you to check them.

Thank you again for the great support,
cheers,
Andrea.

========================================================================

xrdlog.mgm.curl-eos-eos.txt
xrdlog.mgm.curl-webdav-eos.txt

@esindril

esindril commented May 19, 2022

Hi Andrea,

Looking at the logs that you provided, it seems that the WebDAV client might not properly follow the redirection it receives from the MGM daemon. Most likely you can easily replicate the behaviour with a simple PUT operation, as the TPC in the end boils down to this.
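For illustration, the redirect dance a well-behaved PUT client has to perform can be mimicked locally (a sketch with made-up paths; Python's http.client is used here precisely because it does not follow redirects by itself):

```python
import threading
import http.client
from http.server import BaseHTTPRequestHandler, HTTPServer

stored = {}

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        if self.path == "/gateway/file.dat":
            # Mimic the MGM: drain whatever was already sent and hand the
            # client over to the "data server" path with a redirect.
            self.rfile.read(length)
            self.send_response(307)
            self.send_header("Location", "/data/file.dat")
            self.end_headers()
        else:
            stored[self.path] = self.rfile.read(length)
            self.send_response(201)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), RedirectingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

body = b"y" * 4096
conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
conn.request("PUT", "/gateway/file.dat", body=body)
resp = conn.getresponse()
resp.read()
if resp.status in (302, 307):
    # http.client does not follow redirects on its own: the client must
    # re-issue the PUT, body included, against the Location target.
    location = resp.getheader("Location")
    conn.request("PUT", location, body=body)
    resp = conn.getresponse()
    resp.read()
print(resp.status, len(stored.get("/data/file.dat", b"")))
server.shutdown()
```

A client that stops after the first response, or re-sends the request without the body, would leave the destination file created but empty, much like what the logs above show.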

It would be interesting to know what the HTTP client from the other endpoint does/sees.

Cheers,
Elvin

@andrearendina

Dear Elvin,

I am waiting for feedback from the StoRM developers.
In the meantime, I have increased the debug level of the WebDAV TPCs.

In the following lines I found the PUT request that you mentioned in your last comment:

2022-05-19T14:34:09.075Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.authz.VOMSAuthenticationFilter   : Checking secure context token: null
2022-05-19T14:34:09.076Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.authz.VOMSAuthenticationFilter   : No client certificate found in request.
2022-05-19T14:34:09.076Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.authz.VOMSAuthenticationFilter   : No pre-authenticated principal found in request
2022-05-19T14:34:09.087Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.webdav.server.DefaultPathResolver  : /disk/test2M matches with access point /disk. Resolved storage area name: dteam
2022-05-19T14:34:09.088Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.a.v.UnanimousDelegatedVoter      : Voter: org.italiangrid.storm.webdav.authz.voters.FineGrainedAuthzVoter@5258a68, returned: 0
2022-05-19T14:34:09.088Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.a.v.UnanimousDelegatedVoter      : Voter: org.italiangrid.storm.webdav.authz.voters.FineGrainedCopyMoveAuthzVoter@777dfc3e, returned: 0
2022-05-19T14:34:09.097Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.webdav.server.DefaultPathResolver  : /disk/test2M matches with access point /disk. Resolved storage area name: dteam
2022-05-19T14:34:09.098Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.a.v.UnanimousDelegatedVoter      : Voter: org.italiangrid.storm.webdav.authz.voters.WlcgScopeAuthzVoter@74b6964, returned: 0
2022-05-19T14:34:09.099Z []  DEBUG 30403 --- [thread-pool-72] o.i.s.w.a.v.UnanimousDelegatedVoter      : Voter: org.italiangrid.storm.webdav.authz.voters.WlcgScopeAuthzCopyMoveVoter@3813593b, returned: 0
2022-05-19T14:34:09.101Z [dd1be1d1-d35e-44b1-8c2d-8551ae515733]  DEBUG 30403 --- [thread-pool-72] o.i.s.webdav.server.DefaultPathResolver  : /disk/test2M matches with access point /disk. Resolved path: /storage/gemss_test1/dteam/disk/test2M
2022-05-19T14:34:09.101Z [dd1be1d1-d35e-44b1-8c2d-8551ae515733]   INFO 30403 --- [thread-pool-72] o.i.storm.webdav.tpc.TransferFilter      : Push third-party transfer requested: Source: /disk/test2M, Destination: https://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/test2M, hasAuthorizationHeader: true, id: dd1be1d1-d35e-44b1-8c2d-8551ae515733
2022-05-19T14:34:09.101Z [dd1be1d1-d35e-44b1-8c2d-8551ae515733]  DEBUG 30403 --- [thread-pool-72] o.i.storm.webdav.tpc.TransferFilter      : PutTransferRequest [uuid=dd1be1d1-d35e-44b1-8c2d-8551ae515733, path=/disk/test2M, uri=https://eos-mgm.cr.cnaf.infn.it:9000/eos/dockertest/andrea/test2M, xferHeaders={Authorization=[Bearer [...]], Test=[Test]}, verifyChecksum=false, overwrite=true]
2022-05-19T14:34:09.102Z [dd1be1d1-d35e-44b1-8c2d-8551ae515733]  DEBUG 30403 --- [thread-pool-72] o.i.s.webdav.server.DefaultPathResolver  : /disk/test2M matches with access point /disk. Resolved path: /storage/gemss_test1/dteam/disk/test2M
2022-05-19T14:34:09.102Z [dd1be1d1-d35e-44b1-8c2d-8551ae515733]  DEBUG 30403 --- [thread-pool-72] o.i.s.w.t.t.impl.TransferRequestImpl     : TPC id: dd1be1d1-d35e-44b1-8c2d-8551ae515733, status: STARTED
2022-05-19T14:34:09.253Z [dd1be1d1-d35e-44b1-8c2d-8551ae515733]  ERROR 30403 --- [thread-pool-72] o.i.s.w.tpc.http.HttpTransferClient      : Connection reset by peer (Write failed)

I have also tried to increase the debug level of all the libraries, and I noticed that WebDAV tries to send the whole file content (maybe as part of the header?) until the connection is closed and the pipe breaks. I have attached this part below.
I hope it helps.

Thank you very much for your time and patience,
cheers,
Andrea.

================================================================================

Webav_log_TPC_push.txt

@esindril

Hi Andrea,

I believe the problem is that the Apache HttpClient used by StoRM to perform the PUT request does not send the Expect: 100-continue header, and therefore sends both the headers and the body in one go. According to the HttpClient documentation, this needs to be set explicitly:

https://hc.apache.org/httpcomponents-client-4.5.x/current/httpclient/apidocs/org/apache/http/client/protocol/RequestExpectContinue.html

Sending everything in one go will not work with any EOS endpoint, since the initial point of contact is the namespace metadata server and only after one redirection does the client connect to the actual (data) server that will store/provide the data.
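The difference between the two request shapes can be sketched with a toy HTTP/1.1 exchange (purely illustrative: local address, path, and sizes are made up). A client that advertises the body with Expect: 100-continue sends only the headers first and transmits the body only after the server's interim reply, leaving room for a redirect to arrive before any data is on the wire:

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class ExpectHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # 100-continue handling needs HTTP/1.1

    def do_PUT(self):
        # The base handler has already replied "100 Continue" to the Expect
        # header before this method runs, so the body is now on its way.
        self.rfile.read(int(self.headers["Content-Length"]))
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet

server = HTTPServer(("127.0.0.1", 0), ExpectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Raw-socket client: send ONLY the headers, advertising the body size, and
# hold the body back until the server's interim reply arrives.
body = b"z" * 1024
sock = socket.create_connection(("127.0.0.1", server.server_port))
sock.sendall(b"PUT /file.dat HTTP/1.1\r\n"
             b"Host: 127.0.0.1\r\n"
             b"Content-Length: %d\r\n"
             b"Expect: 100-continue\r\n\r\n" % len(body))
interim = sock.recv(4096)   # "HTTP/1.1 100 Continue" (or a redirect/error)
if b"100 Continue" in interim:
    sock.sendall(body)      # only now does any file data hit the wire
final = sock.recv(4096)
print(interim.split(b"\r\n")[0], final.split(b"\r\n")[0])
sock.close()
server.shutdown()
```

Without the Expect header, the headers and the 2 MB body arrive at the MGM in one go, and the MGM (which only wanted to redirect) ends up trying to parse file content as header lines, consistent with the EOS logs earlier in this thread.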

Cheers,
Elvin

@andrearendina

Hi Elvin,

sorry for the late reply. I have also opened a ticket about this issue with the StoRM WebDAV developers:
https://issues.infn.it/jira/browse/STOR-1560.

As you can see there, in the next release sending the Expect: 100-continue header will be configurable.
Then we can perform further tests and verify whether this solves the problem.

Until then, please leave this issue open.

Thank you again,
cheers,
Andrea
