Conversation

@xelxebar

@xelxebar xelxebar commented Jun 1, 2025

Test "setup_fifo + (already exists)" would sometimes error with ENOENT
on `/tmp/test-fifo`. We noticed that it was intermittent and nailed it
down to a race condition between "setup_fifo +" and "setup_fifo +
(already exists)" both racing to create and delete `/tmp/test-fifo`.

Previously, we were creating the FIFO thread in a detached state,
which made verification and management challenging. Tests were spawning
multiple copies of the thread, forcing workarounds and the issue above.

This commit introduces explicit management for the FIFO thread, which
allows us to be more direct and precise in the previously-failing tests.

@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from fea2db8 to 2c52f47 Compare June 1, 2025 17:46
@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from 2c52f47 to 173863e Compare June 1, 2025 17:59
@xelxebar xelxebar changed the title Explicitly shutdown FIFO thread Explicitly manage FIFO thread resource Jun 1, 2025
@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from 173863e to 7f0869a Compare June 1, 2025 18:05
@xelxebar
Author

xelxebar commented Jun 2, 2025

FWIW, this is just to spur discussion. No expectations!

@Seeker04
Owner

Seeker04 commented Jun 2, 2025

Hi,

Your code is really nice, it works and I'm glad for it!

I'm reluctant, because I'm not sure we should introduce complexity into our code for the sake of a warning in a unit test's output (which I can't reproduce in the Docker image, nor on my machine). Reproducibility doesn't actually matter, though: even if this happened everywhere, every time, I would rather adjust these two test cases than introduce complexity in the code.

We already have "sleep hacks" in other unit tests too; it's actually always for waiting on file creation. I guess we could create a wait_until_file_exists/1 predicate (using this) in utils.pl to make these nicer and not reliant on timing. What do you think?
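A minimal sketch of such a helper, assuming SWI-Prolog; the predicate name, the retry budget, and the poll interval are all illustrative, not the actual plwm code:

```prolog
%% wait_until_file_exists(+Path) - hypothetical helper for utils.pl.
%% Polls briefly until Path appears, instead of a fixed sleep; fails
%% if the file never shows up within the retry budget.
wait_until_file_exists(Path) :-
    wait_until_file_exists(Path, 50).   % up to ~5 s at 0.1 s per try

wait_until_file_exists(Path, Retries) :-
    Retries > 0,
    (   access_file(Path, exist)        % unlike exists_file/1, also true for FIFOs
    ->  true
    ;   sleep(0.1),
        Retries1 is Retries - 1,
        wait_until_file_exists(Path, Retries1)
    ).
```

Tests would then call this once after triggering FIFO creation, instead of sleeping a fixed amount and hoping the machine is fast enough.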

Or if you can give a reason outside of unit tests where this added complexity to fifo would pay off, let me know!

Thanks again for the PR and for the thought-provoking discussion!

@xelxebar
Author

xelxebar commented Jun 3, 2025

Thanks for the feedback!

I'm reluctant, because I'm not sure we should introduce complexity in our code for the sake of a warning in a unit test output.

process_fifo is definitely a tad more complicated. I agree this is the main argument against the patch.

if you can give a reason outside of unit tests, where this added complexity to fifo would pay of, let me know!

That said, if we consider implementing dynamic settings, I think this makes sense. Ideally, if a user changes the state of fifo_enabled, we probably want to start and stop the thread accordingly. (Or if fifo_path becomes temporarily invalid, etc.)

We might also want to set a thread alias, to make thread interaction easy for users, but maybe there's a better way.
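For instance, with the thread handle stored (or an alias set), reacting to a change of fifo_enabled could look roughly like this. This is a sketch only: the predicate name, the `fifo_thread` alias, and the shutdown protocol are hypothetical, and it assumes `process_fifo` is written to terminate cleanly when signalled:

```prolog
%% Hypothetical sketch: start or stop the aliased FIFO thread when
%% the fifo_enabled setting changes.
apply_fifo_enabled(true) :-
    (   catch(thread_property(fifo_thread, status(running)), _, fail)
    ->  true                                      % already running
    ;   thread_create(process_fifo, _, [alias(fifo_thread)])
    ).
apply_fifo_enabled(false) :-
    catch(thread_signal(fifo_thread, throw(shutdown)), _, true),
    catch(thread_join(fifo_thread, _), _, true).  % reap it, ignore exit status
```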

@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from 7f0869a to 6ec3ccd Compare June 5, 2025 00:33
@Seeker04
Owner

Seeker04 commented Jun 7, 2025

That said, if we consider implementing dynamic settings, I think this makes sense. Ideally, if a user changes the state of fifo_enabled, we probably want to start and stop the thread accordingly.

Valid point. That would definitely require us to store the thread ID.

Or if fifo_path becomes temporarily invalid, etc.

Yes. If I delete the fifo file and re-create it later, plwm can pick up processing it again with your solution, but not with mine.


However, this particular implementation (i.e. trying open/3 every 0.1 seconds, expecting an exception) yields performance issues. With this, plwm uses 0.7% CPU on my system about half the time while not doing anything (according to htop), whereas before this patch it's a constant 0.0%.

This can be solved by changing the 0.1 to e.g. 5.0 in your code. The fifo stays responsive, because open/3 blocks, and after the artificial timeout from the exception, re-opening is attempted immediately. (Not sure whether there's a small window of dead time between the two.)


The bigger problem, however, is that if I actually delete the fifo file, then plwm spins the CPU up to around 135% until it's re-created. This is because in this case open/3 throws immediately, every time. This is a non-issue in the current state, because the detached thread simply dies at that point, whereas here we keep catching the exception in an infinite recursion.

plwm could also try re-creating the fifo, but I think we'd mostly land in this state when it couldn't create the fifo in the first place during startup, so it probably couldn't do it later either.

If I add a sleep(5) as the exception handler in place of the true, then performance is fixed, but the fifo becomes slow to respond, because the delay also runs after the timeout-generated exceptions.

A proper solution would be if we could distinguish between the actual existence_error exceptions, thrown when the fifo does not exist, and the artificial time_limit_exceeded ones that only come when open/3 has been blocking for too long. Then we could wait, e.g., 5 secs only in the former case. It would look like this:

```prolog
catch(
	(call_with_time_limit(5.0, open(FifoPath, read, Fifo)),
		(read_terms(Fifo, Jobs) ->
			jobs_notify(Jobs)
		; true),
		close(Fifo)
	),
	Ex,
	(Ex \= time_limit_exceeded(_) -> sleep(5) ; true)
)
```

The problem though is that Ex always unifies to this:

`error(existence_error(source_sink,/tmp/plwm_fifo),context(system:open/3,Interrupted system call))`

Not sure why this is the case when it comes from call_with_time_limit/3. It should be a time_limit_exceeded(Context). I must be missing something.

tl;dr We're heading in a good direction, but we should solve this performance problem first, otherwise plwm goes brrrr in exactly the situation we're trying to mend.

@xelxebar
Author

xelxebar commented Jun 8, 2025

However, this particular implementation (i.e. trying open/3 every 0.1 seconds expecting an exception) yields performance issues.

Yikes! Good catch.

Not sure why this is the case when it comes from call_with_time_limit/3. It should be a time_limit_exceeded(Context). I'm not seeing something.

Probably the EINTR error handling from the interrupted open(2) syscall happens before the time_limit_exceeded(Context) unification can take place. My guess is that this is intentional, since even on a timeout you might want to retry I/O and whatnot, but arguably there's a documentation bug.
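One workaround that sidesteps the ambiguous exception term: in the handler, check whether the FIFO still exists and only back off when it is genuinely gone. A timeout-induced EINTR leaves the file in place, so an immediate retry is safe then. Sketch only; the predicate name is hypothetical:

```prolog
%% Hypothetical sketch: after open/3 throws, decide whether to back
%% off before retrying, without inspecting the exception term itself.
handle_fifo_open_error(FifoPath) :-
    (   access_file(FifoPath, exist)
    ->  true        % FIFO still there: likely our timeout fired; retry now
    ;   sleep(5)    % FIFO really missing: back off instead of spinning
    ).
```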

If I add a sleep(5) as exception handler in place of the true, then performance gets fixed, but the fifo becomes slow to respond, because the delay also runs after the timeout-generated exceptions.

Indeed. This is why I originally went with sleep(0.1). It's really obvious in the tests, too.

Arguably, the timeout is a hack, and letting open/3 block is actually correct. One option is to have an "update settings" thread that blocks on thread_get_message and restarts the fifo thread for us. This approach has some advantages:

  1. It fixes the performance problem you caught;
  2. The lag from sleep/1ing is gone; and
  3. It gives us a general "hook" for reacting to arbitrary updates in dynamic settings.

The disadvantage is that it complicates the architecture somewhat, and now every time we update dynamic settings, thread_send_message/2 would somehow need to be called on the "update settings" watcher.
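The watcher itself could be fairly small. A sketch, with hypothetical names throughout: `fifo_enabled/1` as the message shape, and some `apply_fifo_enabled/1` helper that starts or stops the FIFO thread:

```prolog
%% Hypothetical sketch: a watcher thread that sleeps on its message
%% queue (no polling, no CPU spin) and reacts to settings updates.
settings_watcher :-
    thread_get_message(Change),          % blocks until a setting changes
    (   Change = fifo_enabled(Enabled)
    ->  apply_fifo_enabled(Enabled)      % start/stop the FIFO thread
    ;   true                             % ignore other settings for now
    ),
    settings_watcher.
```

Setting code would then do `thread_send_message(settings_watcher, fifo_enabled(false))` (or similar) whenever the relevant setting changes.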

Good discussion. Thoughts?

@Seeker04
Owner

Probably EINTR error handling from the interrupted open(2) syscall is happening before time_limit_exceeded(Context) unification can happen. My guess is that this is intentional, since even on a timeout, you might want to retry I/O and whatnot, but arguably there's documentation bug.

Yeah, I think you're right on this.

Arguably, the timeout is a hack and open/3 blocking is actually correct. One option is to have an "update settings" thread that blocks on thread_get_message and restarts the fifo thread for us .

I'm currently working on the configuration redesign that will let us dynamically change any setting. A separate thread that watches for changes won't be needed, I think. I'd also avoid multi-threading as much as possible. Though this fifo handling will be a special case indeed, so I'm not dismissing this approach just yet.

I'll get back to this once I'm done with the config redesign. I'll let you know and ask for feedback as soon as I have a working state. Then we can continue this from there.

Thanks for the ideas! ^^
