stuck process in D -- I killed davfs2 #90
Comments
Meanwhile I am looking into upgrading davfs2 on drogon to 1.7.0 (backporting the package from Debian testing), so that before I complain we are using the most recent version.
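For context, such a backport would typically look something like the sketch below (assuming a deb-src entry for Debian testing is configured and the build dependencies resolve; these are not necessarily the exact steps used on drogon):

```sh
# Fetch the davfs2 1.7.0 source package from testing and rebuild it locally.
apt-get source davfs2/testing
sudo apt-get build-dep davfs2        # install build dependencies
cd davfs2-1.7.0*/
dpkg-buildpackage -us -uc -b         # build unsigned binary packages
sudo dpkg -i ../davfs2_1.7.0*.deb
```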
The …
OK, to sudoers I added a line with …
@yarikoptic FYI, the current run failed at the mount stage with: …
Removed now.
@yarikoptic The script is now running using a read-only mount.
Coolio, and the davfs2 package was updated. Let's see where we get! ATM it looks healthy-ish, albeit slow.
"Slow" as the above test processes seem to be already half an hour old, and it is only the …
The davfs2 stalled again already (only after a few hours)... It is not yet 100% stalled, but just became super slow even for … Some gory details which lead me to a conclusion that might explain the observed behavior: davfs2 (at least sometimes) decides to download the full file instead of accessing it sparsely, e.g. …
and subsequently it might be because it is very busy with IO itself, trying to cache locally potentially the entire file it is accessing!!!
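One way to see this directly (a sketch; the cache location comes from the listing further below, and the random suffix davfs2 appends to cached files is globbed) is to watch the cached copies grow while the reader only touches a few chunks:

```sh
# If access were sparse, cached copies would stay far below the full object sizes;
# steady growth toward the full size means davfs2 is fetching whole files.
watch -n 10 'du -sb /mnt/backup/cache/davfs2/webdav.dandiarchive.org+*/*.nwb-* | sort -n | tail'
```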
So we have 300G of cache even though we have … so the cache should not exceed 10G? And the most recently/currently modified and growing file is …
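For reference, the relevant knob should be cache_size in /etc/davfs2/davfs2.conf, which per davfs2.conf(5) is given in MiB; a sketch of checking the configured limit against reality (the 10240 value is hypothetical):

```sh
# Compare the configured limit with what the cache directory actually holds.
grep -E '^[[:space:]]*cache_size' /etc/davfs2/davfs2.conf
#   cache_size 10240       # hypothetical 10 GiB ceiling, value is in MiB
du -sch /mnt/backup/cache/davfs2/
```

If memory serves, davfs2 treats cache_size only as a soft limit and will exceed it to keep currently open files cached, which would help explain 300G of cache against a ~10G setting.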
I "guess" for this one it is …
which is …
so 19GB, for this call: …
so -- Python, and we know that Python does "sparse access". Logs from davfs for it: …
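In case it helps with reproducing this, a sketch of how such davfs log lines can be pulled on a systemd host (assuming davfs2 logs via syslog/journald under the mount.davfs tag and that debugging is enabled in davfs2.conf):

```sh
# One-time config change to make per-request activity visible:
#   /etc/davfs2/davfs2.conf:   debug most
# Then tail the recent messages from the mount helper:
journalctl -t mount.davfs --since "1 hour ago" | tail -n 50
```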
And it seems we have downloaded it fully:
❯ dandi ls -f json_pp dandi://dandi/000168@draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb | grep size
"size": 19419844643
root@drogon:/home/dandi# du -scb /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
19419844643 /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
Looking at posts like https://savannah.nongnu.org/support/?110422 suggests that there is actually NO sparse caching, and a full download of the file is expected! Did you check what happens to the file while benchmarking davfs2 a while back, @jwodder?
@yarikoptic No.
Then davfs2 is likely not an acceptable solution for us. Do you have ideas/recommendations on how we should proceed?
@yarikoptic My only other idea was to just download the files directly before operating on them, but you rejected that in discussion with Einar.
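For completeness, that alternative would amount to something like the following sketch, using the same dandi:// asset URL quoted above (whether the dandi CLI version on drogon accepts it exactly like this is an assumption):

```sh
# Fetch the ~19 GB asset to local disk first, then run the tests against the local copy.
dandi download 'dandi://dandi/000168@draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb'
```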
FWIW: I filed https://savannah.nongnu.org/support/index.php?111110 for now. Also found https://github.com/thehyve/davfs2/blob/main/TODO#L35 and https://cvs.savannah.nongnu.org/viewvc/davfs2/davfs2/TODO?view=markup#l35
(edit: interestingly, most of it, including the partial-download item, was removed in 150ce86f45a7cd67235f748a1d3511b3f357cd0a (tag: rel-1-5-0)). So maybe they just gave up on that TODO item, since I do not see any reflection of Range support in the code besides:
❯ git grep -i '\<range\>'
src/webdav.c: case 416: /* Requested Range Not Satisfiable */
Full downloads are pretty much prohibitive -- we are in effect observing their effect with this davfs2, which was just spending (wasting) most of its time downloading instead of doing a quick sparse download of only the necessary blocks. I still think that we would be better off with a WebDAV-based FUSE solution than reverting back to the fsspec-based datalad-fuse (maybe with completely disabled caching to avoid multithreading fiascos)... But it might also be worth checking if there were any related changes in fsspec since then. I will also think about this over the weekend. I would appreciate it if you also looked into alternatives, etc.
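One quick way to separate client from server behavior here would be to probe whether the WebDAV endpoint honors Range requests at all; the hostname comes from the cache directory name above, but the asset path below is a placeholder, so treat this as a sketch:

```sh
# Expect "HTTP 206, 1024 bytes" if Range is honored; "HTTP 200" with the full object
# size would mean the server itself forces whole-file downloads.
curl -s -o /dev/null -w 'HTTP %{http_code}, %{size_download} bytes\n' \
  -H 'Range: bytes=0-1023' \
  'https://webdav.dandiarchive.org/<path-to>/jGCaMP8f_ANM478349_cell07.nwb'  # path is a guess
```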
I am not sure if this is the same issue as …
I see some zombies, but the problem seems to be a process stuck in D state (so not killed at all?) which had been running for 2 weeks now.
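For the record, a quick way to spot these (a sketch; D is uninterruptible sleep, so kill -9 has no effect until whatever I/O the process is blocked on returns):

```sh
# List processes in uninterruptible sleep, what they are waiting on, and for how long.
ps -eo pid,stat,wchan:32,etime,cmd | awk 'NR==1 || $2 ~ /^D/'
```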
We also have a bunch of df processes stuck, and overall it all seems to be due to the stuck davfs mount. @jwodder, I see that we mount it with rw -- is there a reason? I think ro would be sufficient for our use case, wouldn't it? If so, what should be adjusted? If you could, please do that.
Meanwhile I killed that davfs2 process since the whole thing was stuck, so the run of healthstatus might report more errors etc. The script is now sleeping for its 600 seconds before the next round, when it would mount it again, I guess.
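If read-only is indeed enough, the change should just be the standard ro mount option; a sketch (the mountpoint is reconstructed from the cache directory name above, and the URL/fstab details are assumptions):

```sh
# One-off read-only mount:
sudo mount -t davfs -o ro https://webdav.dandiarchive.org/ /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
# Or, if it is mounted via fstab, add "ro" to the options column of the corresponding entry.
```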