os.listdir (and presumably other os functions) fails in the face of signals #955

Open
emeryberger opened this issue Jul 29, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@emeryberger

Mountpoint for Amazon S3 version

mount-s3 1.7.2

AWS Region

us-east-1

Describe the running environment

Running on EC2, accessing an S3 bucket through my account, using this AMI: Deep Learning Base Proprietary Nvidia Driver GPU AMI (Ubuntu 20.04) 20240314. Same setup as here: plasma-umass/scalene#841

Mountpoint options

mount-s3 <MY_BUCKET> /mnt/mounted_bucket_dir

What happened?

This error (a failure when running against a mounted S3 file system) was brought to my attention by this Scalene profiler issue: plasma-umass/scalene#841

The root cause turns out to be the CPU timer signal: if os.listdir is interrupted by a SIGALRM, the call fails with an OSError. In the MRE below, I set the timer interval to a value that triggers the failure roughly half the time; setting it to 1 second makes the failure disappear. Since Scalene's default CPU sampling interval is 0.01 seconds, it fails consistently under Scalene. Note that the py-spy profiler also triggers this failure.

MRE here:

import os
import signal

def signal_handler(signum, frame):
    # The handler only notes that the signal arrived.
    print("signal")

signal.signal(signal.SIGALRM, signal_handler)  # install the handler before arming the timer
signal.setitimer(signal.ITIMER_REAL, 0.02)     # setting to 0.01 triggers failures reliably

dirname = "/mnt/mounted_bucket_dir"

assert os.path.exists(dirname)
assert os.access(dirname, os.R_OK)

print(len(os.listdir(dirname)))

Example failure:

Traceback (most recent call last):
  File "test-mount-s3.py", line 16, in <module>
    print(len(os.listdir(dirname)))
OSError: [Errno 22] Invalid argument: '/mnt/mounted_bucket_dir'

I have implemented a workaround for this situation (wrapping all os functions so that they block the SIGALRM signal; see plasma-umass/scalene#842), but it seems to expose a race condition in mount-s3 or, at a minimum, undesirable behavior.
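
For reference, here is a minimal sketch of that kind of wrapper (an illustrative approximation, not the actual patch in plasma-umass/scalene#842; the defer_sigalrm helper name is made up). It blocks SIGALRM around the call with signal.pthread_sigmask, so a pending timer signal is only delivered once the syscall has returned:

import os
import signal
from contextlib import contextmanager

@contextmanager
def defer_sigalrm():
    # Block SIGALRM so the wrapped call cannot be interrupted by the timer;
    # any signal that arrived in the meantime is delivered when the mask is restored.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGALRM})
    try:
        yield
    finally:
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGALRM})

with defer_sigalrm():
    print(len(os.listdir("/mnt/mounted_bucket_dir")))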

Relevant log output

2024-07-29T00:47:59.230133Z  WARN mountpoint_s3::cli: failed to detect network throughput. Using 10 gbps as throughput. Use --maximum-throughput-gbps CLI flag to configure a target throughput appropriate for the instance. Detection failed due to: failed to get network throughput
2024-07-29T00:48:15.393878Z  WARN readdirplus{req=10 ino=1 fh=1 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=3, actual=1
@vladem
Contributor

vladem commented Aug 13, 2024

Hey @emeryberger, thanks for reporting this issue and sorry for the delayed response. We've reproduced it and can confirm that the problem exists. Here is a debug log:

2024-07-31T15:56:07.404177Z DEBUG fuser::request: FUSE( 20) ino 0x0000000000000001 READDIRPLUS fh FileHandle(2), offset 0, size 4096
2024-07-31T15:56:07.404334Z DEBUG <request to s3 created>
2024-07-31T15:56:07.503964Z DEBUG fuser::request: FUSE( 21) ino 0x0000000000000000 INTERRUPT unique RequestId(20)
2024-07-31T15:56:07.745586Z DEBUG <got response from s3>
2024-07-31T15:56:07.746746Z DEBUG fuser::request: FUSE( 22) ino 0x0000000000000001 READDIRPLUS fh FileHandle(2), offset 1, size 4096
2024-07-31T15:56:07.746792Z  WARN readdirplus{req=22 ino=1 fh=2 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=24, actual=1

From the log we can see that when an application is interrupted in a readdir syscall, the underlying chain of readdir FUSE requests is reset to an offset that Mountpoint considers stale (in the log, the offset is 1 instead of 24). Because Mountpoint implements readdir as a stream, this behaviour is not currently supported. We'll continue exploring possible solutions for this problem and provide updates in this issue.
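
To illustrate what "out-of-order readdir" means here, below is a toy Python model (not Mountpoint's actual Rust implementation; the class name, method names, and numbers are purely illustrative) of a streaming directory handle that tracks the next expected offset and rejects a request that goes backwards:

import errno

class StreamingDirHandle:
    """Toy model: a directory listing streamed once, which cannot be rewound."""

    def __init__(self, entries):
        self._entries = list(entries)
        self._expected_offset = 0  # offset the next readdirplus request must carry

    def readdirplus(self, offset, count):
        if offset != self._expected_offset:
            # An interrupted readdir is restarted by the kernel at an old offset,
            # but the already-streamed entries are gone, so the request is rejected.
            raise OSError(errno.EINVAL,
                          f"out-of-order readdir, expected={self._expected_offset}, actual={offset}")
        batch = self._entries[offset:offset + count]
        self._expected_offset += len(batch)
        return batch

handle = StreamingDirHandle([f"key-{i}" for i in range(30)])
handle.readdirplus(0, 24)   # first request streams 24 entries
handle.readdirplus(1, 24)   # restarted request -> OSError(EINVAL), like the warning above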

A separate problem is possible errors caused by interrupts during other system calls, for which we don't yet have a reproduction. I've created a separate issue for that.
