
stuck process in D -- I killed davfs2 #90

Open
yarikoptic opened this issue Aug 23, 2024 · 13 comments

Comments

@yarikoptic
Member

yarikoptic commented Aug 23, 2024

I am not sure if this is the same issue as

I see some zombie processes, but the actual problem seems to be a process stuck in D state (uninterruptible sleep, so it cannot be killed at all?):

dandi    2630713  0.0  0.0   8980  1560 ?        Ss   Aug07   0:00   SCREEN
dandi    2630714  0.0  0.0  10620  2816 pts/1    Ss   Aug07   0:00     /bin/bash
dandi    2630723  0.0  0.0   2580  1536 pts/1    S+   Aug07   0:00       /bin/sh tools/run_loop.sh
dandi    2949169  0.0  0.0  16300  7168 pts/1    S+   Aug10   3:56         /usr/bin/perl /usr/bin/chronic ./run.sh --mode random-outdated-asset-first
dandi    2949189  0.0  0.0   6932  1792 pts/1    S+   Aug10   0:00           /bin/bash ./run.sh --mode random-outdated-asset-first
dandi    2958341  0.0  0.0 1421476 2620 pts/1    Sl+  Aug10   3:15             /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/bin/dandisets-healthstatus check -m /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -J 10 --mode random-outdated-asset-first
dandi     591845  0.0  0.0      0     0 pts/1    Zl+  Aug10   0:02               [python] <defunct>
dandi     592572  0.0  0.0      0     0 pts/1    Zl+  Aug10   0:02               [python] <defunct>
dandi     592663  0.0  0.0 412932  2988 pts/1    D+   Aug10   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000710/draft/sub-VR32/sub-VR32_ecephys.nwb

which has been running for two weeks now. We also have a bunch of stuck df processes, and overall it all seems to be due to a stuck davfs mount.
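
For reference, a quick way to list all D-state processes together with the kernel function they are blocked in (standard procps options, nothing specific to our setup):

$> ps -eo pid,stat,wchan:30,cmd | awk '$2 ~ /^D/'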

@jwodder, I see that we mount it with rw:

davfs2   2958663  2.2  0.0  56396  5940 ?        Ss   Aug10 424:37 /sbin/mount.davfs https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -o rw

Is there a reason? I think ro would be sufficient for our use case, wouldn't it? If so, what should be adjusted? If you could, please make that change.

Meanwhile, I killed that davfs2 process since the whole thing was stuck, so this healthstatus run might report more errors etc. The script is now sleeping for its 600 seconds before the next round, when I guess it will mount the share again.
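
For context, the driver here is just a loop; a hypothetical sketch of what tools/run_loop.sh does (the actual script is not reproduced here; the command and the 600-second sleep are taken from the ps listing and the comment above):

#!/bin/sh
# Hypothetical sketch of tools/run_loop.sh: run one healthstatus round,
# then sleep 600 seconds before the next round, which re-mounts the share.
while true; do
    chronic ./run.sh --mode random-outdated-asset-first
    sleep 600
done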

@yarikoptic yarikoptic changed the title stuck process in D stuck process in D -- I killed davfs2 Aug 23, 2024
@yarikoptic
Member Author

Meanwhile, I am looking into upgrading davfs2 on drogon to 1.7.0 (backporting the package from Debian testing), so that we are on the most recent version before I complain upstream.

@jwodder
Member

jwodder commented Aug 23, 2024

@yarikoptic

I think ro would be sufficient for our use case, wouldn't it? If so, what should be adjusted? If you could, please make that change.

The mount command will need to be adjusted in both the source code and the sudoers file to include -o ro.

@yarikoptic
Member Author

OK, I added a line with -t davfs -o ro to sudoers, so both could be used now. Please make the necessary code adjustments and restart that script.
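
With that sudoers entry in place, the mount invocation the script should now be able to run would look something like this (a sketch combining the options above with the existing mount line; the exact invocation lives in the healthstatus source):

$> sudo mount -t davfs -o ro https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse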

@jwodder
Member

jwodder commented Aug 23, 2024

@yarikoptic FYI, the current run failed at the mount stage with:

healthstatus: Mounting davfs2 mount ...
/sbin/mount.davfs: found PID file /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid.
Either /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse is used by another process,
or another mount process ended irregular

@yarikoptic
Member Author

Removed now:

root@drogon:/mnt/backup# cat /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid
2958663
root@drogon:/mnt/backup# ps auxw | grep `!!`
ps auxw | grep `cat /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid`
root      141051  0.0  0.0   6332  2176 pts/22   S+   12:17   0:00 grep 2958663
root@drogon:/mnt/backup# rm /var/run/mount.davfs/mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse.pid

jwodder added a commit that referenced this issue Aug 23, 2024
Add `-o ro` to davfs mount command
@jwodder
Member

jwodder commented Aug 23, 2024

@yarikoptic The script is now running using a read-only mount.

@yarikoptic
Member Author

Coolio, and the davfs2 package was updated. Let's see where we get! ATM it looks healthy-ish, albeit slow:

dandi    2630713  0.0  0.0   9244  2584 ?        Ss   Aug07   0:00   SCREEN
dandi    2630714  0.0  0.0  11676  5504 pts/1    Ss   Aug07   0:00     /bin/bash
dandi     145300  0.0  0.0   2580  1536 pts/1    S+   12:34   0:00       /bin/sh tools/run_loop.sh
dandi     145302  0.0  0.0  15872 10752 pts/1    S+   12:34   0:01         /usr/bin/perl /usr/bin/chronic ./run.sh --mode random-outdated-asset-first
dandi     145303  0.0  0.0   6932  3200 pts/1    S+   12:34   0:00           /bin/bash ./run.sh --mode random-outdated-asset-first
dandi     145557  2.4  0.5 1454332 342136 pts/1  Sl+  12:34   2:36             /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/bin/dandisets-healthstatus check -m /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -J 10 --mode random-outdated-asset-first
dandi     211730  0.0  0.1 413520 90132 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000051/draft/pons8-yo_16xdownsampled.nwb
dandi     211732  0.0  0.1 413520 90208 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000056/draft/sub-Mouse24/sub-Mouse24_ses-Mouse24-131216_behavior+ecephys.nwb
dandi     211734  0.0  0.1 413520 90080 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000054/draft/sub-R6/sub-R6_ses-20200209T210000_obj-1ouyda4_behavior+ophys.nwb
dandi     211739  0.0  0.1 413524 90092 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000055/draft/sub-09/sub-09_ses-5_behavior+ecephys.nwb
dandi     211747  0.0  0.1 413516 89644 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000053/draft/sub-Ella/sub-Ella_ses-20190402_behavior.nwb
dandi     211756  0.0  0.1 412924 89376 pts/1    Sl+  13:39   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000059/draft/sub-MS22/sub-MS22_ses-Peter-MS22-180712-102504-concat_desc-raw_ecephys.nwb
dandi     212244  0.0  0.1 412952 89376 pts/1    Sl+  13:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000061/draft/sub-Rat10/sub-Rat10_ses-Rat10-20140704_ecephys+image.nwb
dandi     212246  0.0  0.1 412872 89668 pts/1    Sl+  13:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000060/draft/sub-359855/sub-359855_ses-20161221_behavior+ecephys+ogen.nwb
dandi     212251  0.0  0.1 412920 89376 pts/1    Sl+  13:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000064/draft/sub-001/sub-001.nwb
dandi     215474  0.1  0.1 412996 89376 pts/1    Sl+  13:47   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000065/draft/sub-Kibbles/sub-Kibbles_behavior+ecephys.nwb
davfs2    145565 19.0  0.0  54176 10548 ?        Ss   12:34  20:32   /sbin/mount.davfs https://webdav.dandiarchive.org /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse -o ro

"Slow" because the test processes above already seem to be about half an hour old, and the only process I see keeping the CPU busy (at 10-20%) is mount.davfs. Oddly, those test processes are in S state, not D.

@yarikoptic
Member Author

davfs2 has already stalled again (after only a few hours)... It is not yet 100% stalled, but it has become super slow even for a df or ls call.
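
While it is wedged like this, one cheap check is what that mount.davfs process (PID 145565 in the listing above) is actually holding open, e.g.:

root@drogon:~# lsof -p 145565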

Some gory details that led me to a conclusion which might explain the observed behavior: davfs2 (at least sometimes) decides to download the full file instead of accessing it sparsely. E.g.:

$> time ls -l /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse
total 0
drwxr-xr-x  486 root root 0 Aug 23 12:34 dandisets/
drwx------    2 root root 0 Aug 23 12:18 lost+found/
drwxr-xr-x 3007 root root 0 Aug 23 12:34 zarrs/
LC_COLLATE=POSIX ls -bCF --color=auto -l   0.00s user 0.00s system 0% cpu 1.853 total

and a subsequent df was also faster than the first one. After a few seconds, though, df became super slow again...

It might be because davfs2 itself is very busy with I/O, trying to cache locally potentially the entire file it is accessing!!!

root@drogon:/home/yoh# du -scm /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300111  /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300111  total
root@drogon:/home/yoh# du -scm --apparent-size /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300646  /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/
300646  total

So we have about 300 GB of cache,

even though we have

root@drogon:/home/yoh# grep cache_size /etc/davfs2/davfs2.conf
# cache_size      50                # MiByte
cache_size 10000

so the cache should not exceed 10 GB? (Although, if I read the mount.davfs manpage correctly, cache_size is only a soft limit: davfs2 will exceed it to keep currently open files cached, which would explain what we see.)

and the most recently modified, still-growing file is:

root@drogon:/home/yoh# ls -ld /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
-rw------- 1 davfs2 davfs2 18067083509 Aug 23 16:45 /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
root@drogon:/home/yoh# ls -ld /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
-rw------- 1 davfs2 davfs2 18239804453 Aug 23 16:45 /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
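
The same growth can be watched live with a trivial one-liner over the cache path from above:

$> watch -n 5 'ls -l /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ'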

I "guess" it is for this one:

/mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb

which is

❯ dandi ls -f pyout dandi://dandi/000168@draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
ASSET_ID                 PATH                                   SIZE    CREATED                  MODIFIED                  BLOB                    
8fd3b1a0-3dab-4de2-82... jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb 19.4 GB 2022-01-21T03:36:10.7... 2022-01-21T03:50:24.00... 030b38c7-1196-4352-b8.

So 19 GB, for this call:

dandi     402971  0.3  0.1 413524 90176 pts/1    Sl+  16:40   0:02               /home/dandi/cronlib/dandisets-healthstatus/venv/bin/python /home/dandi/cronlib/dandisets-healthstatus/venv/lib/python3.8/site-packages/healthstatus/pynwb_open_load_ns.py /mnt/backup/dandi/dandisets-healthstatus/dandisets-fuse/dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb

So: Python, and we know that the Python script does "sparse access", reading only selected portions of the NWB file. The davfs logs for it:

$> journalctl -rg jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:45:31 drogon mount.davfs[145565]: getattr /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:45:31 drogon mount.davfs[145565]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: open /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: lookup /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]:   p 0x55e27e4678c0, jGCaMP8f_ANM478349_cell07.nwb
Aug 23 16:40:49 drogon mount.davfs[145565]: added /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 10 13:14:57 drogon mount.davfs[2958663]:   /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 10 12:59:26 drogon mount.davfs[2958663]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 16:22:30 drogon mount.davfs[1911909]:   /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 16:21:20 drogon mount.davfs[1911909]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:32:06 drogon mount.davfs[1911909]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:30:00 drogon mount.davfs[1911909]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:27:03 drogon mount.davfs[1911909]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:23:32 drogon mount.davfs[1911909]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:19:34 drogon mount.davfs[1911909]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 08 14:16:23 drogon mount.davfs[1911909]: added /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 04 15:06:58 drogon mount.davfs[3226681]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 21:02:12 drogon mount.davfs[4048318]:   /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 20:59:13 drogon mount.davfs[4048318]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 01:02:42 drogon mount.davfs[3268345]:   /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 00:58:08 drogon mount.davfs[3268345]:   /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 03 00:43:47 drogon mount.davfs[3268345]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:20:27 drogon mount.davfs[4084327]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:17:47 drogon mount.davfs[4084327]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:14:13 drogon mount.davfs[4084327]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 02:03:06 drogon mount.davfs[4084327]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 01:51:26 drogon mount.davfs[4084327]:   /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Aug 02 01:48:02 drogon mount.davfs[4084327]: added /dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb
Jul 31 10:16:13 drogon mount.davfs[1551830]: added /dandisets/000362/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb

And it seems we have downloaded it fully:

❯ dandi ls -f json_pp dandi://dandi/000168@draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb | grep size
    "size": 19419844643

root@drogon:/home/dandi# du -scb /mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ
19419844643	/mnt/backup/cache/davfs2/webdav.dandiarchive.org+mnt-backup-dandi-dandisets-healthstatus-dandisets-fuse+root/jGCaMP8f_ANM478349_cell07.nwb-ZV8NTQ

@yarikoptic
Member Author

Looking at posts like https://savannah.nongnu.org/support/?110422 suggests that there is actually NO sparse caching, and a full download of the file is expected! Did you check what happens to the file while benchmarking davfs2 a while back, @jwodder?

@jwodder
Member

jwodder commented Aug 23, 2024

@yarikoptic No.

@yarikoptic
Member Author

Then davfs2 is likely not an acceptable solution for us. Do you have ideas/recommendations on how we should proceed?

@jwodder
Member

jwodder commented Aug 23, 2024

@yarikoptic My only other idea was to just download the files directly before operating on them, but you rejected that in discussion with Einar.

@yarikoptic
Member Author

yarikoptic commented Aug 23, 2024

FWIW: I filed https://savannah.nongnu.org/support/index.php?111110 for now.

Also found this TODO item: https://github.com/thehyve/davfs2/blob/main/TODO#L35 (https://cvs.savannah.nongnu.org/viewvc/davfs2/davfs2/TODO?view=markup#l35)

- ranged requests for GET (partial download)

(Edit: interestingly, most of that TODO, including the partial-download item, was removed in 150ce86f45a7cd67235f748a1d3511b3f357cd0a (tag: rel-1-5-0). So maybe they just gave up on it; I do not see any handling of Range requests in the code besides:

❯ git grep -i '\<range\>'
src/webdav.c:        case 416:           /* Requested Range Not Satisfiable */

Full downloads are pretty much prohibitive; we are in effect observing their cost with this davfs2 mount, which was spending (wasting) most of its time downloading instead of quickly fetching only the necessary blocks sparsely.
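
For contrast, sparse access over WebDAV needs nothing more than HTTP Range requests. An illustrative probe against the asset from the logs above, using stock curl (if the server honors Range, this should report 206 and 1024 bytes downloaded rather than 200 and the full 19 GB):

$> curl -s -r 0-1023 -o /dev/null -w '%{http_code} %{size_download}\n' https://webdav.dandiarchive.org/dandisets/000168/draft/jGCaMP8f/jGCaMP8f_ANM478349_cell07.nwb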

I still think we would be better off with a WebDAV-based FUSE solution rather than reverting back to the fsspec-based datalad-fuse (maybe with caching completely disabled, to avoid the multithreading fiascos)... But it might also be worth checking whether there have been any related changes in fsspec since then.

I will also think about this over the weekend. I would appreciate it if you also looked into alternatives, etc.
