Conversation

@xelxebar

@xelxebar xelxebar commented Jun 1, 2025

Test "setup_fifo + (already exists)" would sometimes error with ENOENT
on `/tmp/test-fifo`. We noticed that it was intermittent and nailed it
down to a race condition between "setup_fifo +" and "setup_fifo +
(already exists)" both racing to create and delete `/tmp/test-fifo`.

Previously, we were creating the FIFO thread in a detached state,
which made verification and management challenging. Tests were spawning
multiple copies of the thread, forcing workarounds and the issue above.

This commit introduces explicit management for the FIFO thread, which
allows us to be more direct and precise in the previously-failing tests.

@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from fea2db8 to 2c52f47 Compare June 1, 2025 17:46
@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from 2c52f47 to 173863e Compare June 1, 2025 17:59
@xelxebar xelxebar changed the title Explicitly shutdown FIFO thread Explicitly manage FIFO thread resource Jun 1, 2025
@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from 173863e to 7f0869a Compare June 1, 2025 18:05
@xelxebar
Author

xelxebar commented Jun 2, 2025

FWIW, this is just to spur discussion. No expectations!

@Seeker04
Owner

Seeker04 commented Jun 2, 2025

Hi,

Your code is really nice, it works and I'm glad for it!

I'm reluctant, because I'm not sure we should introduce complexity into our code for the sake of a warning in a unit test's output (which I can't reproduce in the Docker image, nor on my machine). Reproducibility doesn't actually matter, though: even if this happened everywhere, every time, I would rather adjust these two test cases than introduce complexity in the code.

We already have "sleep hacks" in other unit tests too; it's actually always for waiting on file creation. I guess we could create a wait_until_file_exists/1 predicate (using this) in utils.pl to make these nicer and not reliant on timing. What do you think?
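A minimal sketch of such a helper, assuming SWI-Prolog; the predicate name, the retry budget, and the poll interval are all illustrative, not the actual plwm code:

```prolog
%% wait_until_file_exists(+Path) - hypothetical helper for utils.pl.
%% Polls briefly until Path appears, instead of a fixed sleep; fails
%% if the file never shows up within the retry budget.
wait_until_file_exists(Path) :-
    wait_until_file_exists(Path, 50).   % up to ~5 s at 0.1 s per try

wait_until_file_exists(Path, Retries) :-
    Retries > 0,
    (   access_file(Path, exist)        % unlike exists_file/1, also true for FIFOs
    ->  true
    ;   sleep(0.1),
        Retries1 is Retries - 1,
        wait_until_file_exists(Path, Retries1)
    ).
```

Tests would then call this once after triggering FIFO creation, instead of sleeping a fixed amount and hoping the machine is fast enough.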

Or if you can give a reason outside of unit tests where this added complexity to fifo would pay off, let me know!

Thanks again for the PR and for the thought-provoking discussion!

@xelxebar
Author

xelxebar commented Jun 3, 2025

Thanks for the feedback!

I'm reluctant, because I'm not sure we should introduce complexity in our code for the sake of a warning in a unit test output.

process_fifo is definitely a tad more complicated. I agree this is the main argument against the patch.

if you can give a reason outside of unit tests, where this added complexity to fifo would pay of, let me know!

That said, if we consider implementing dynamic settings, I think this makes sense. Ideally, if a user changes the state of fifo_enabled, we probably want to start and stop the thread accordingly. (Or if fifo_path becomes temporarily invalid, etc.)

We might also want to set a thread alias, to make thread interaction easy for users, but maybe there's a better way.
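For instance, with the thread handle stored (or an alias set), reacting to a change of fifo_enabled could look roughly like this. This is a sketch only: the predicate name, the `fifo_thread` alias, and the shutdown protocol are hypothetical, and it assumes `process_fifo` is written to terminate cleanly when signalled:

```prolog
%% Hypothetical sketch: start or stop the aliased FIFO thread when
%% the fifo_enabled setting changes.
apply_fifo_enabled(true) :-
    (   catch(thread_property(fifo_thread, status(running)), _, fail)
    ->  true                                      % already running
    ;   thread_create(process_fifo, _, [alias(fifo_thread)])
    ).
apply_fifo_enabled(false) :-
    catch(thread_signal(fifo_thread, throw(shutdown)), _, true),
    catch(thread_join(fifo_thread, _), _, true).  % reap it, ignore exit status
```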

@xelxebar xelxebar force-pushed the push-nyqnzllumtyy branch from 7f0869a to 6ec3ccd Compare June 5, 2025 00:33
@Seeker04
Owner

Seeker04 commented Jun 7, 2025

That said, if we consider implementing dynamic settings, I think this makes sense. Ideally, if a user changes the state of fifo_enabled, we probably want to start and stop the thread accordingly.

Valid point. That would definitely require us to store the thread ID.

Or if fifo_path becomes temporarily invalid, etc.

Yes. If I delete the fifo file and re-create it later, plwm can pick up processing it again with your solution, but not with mine.


However, this particular implementation (i.e. trying open/3 every 0.1 seconds, expecting an exception) yields performance issues. With this, plwm uses 0.7% CPU on my system about half the time while not doing anything (according to htop), whereas before this patch it's a constant 0.0%.

This can be solved by changing the 0.1 to e.g. 5.0 in your code. The fifo stays responsive, because open/3 blocks, and after the artificial timeout from the exception, re-opening is attempted immediately. (Not sure whether there's a small window of dead time between the two.)


The bigger problem, however, is that if I actually delete the fifo file, then plwm spins the CPU up to around 135% until it's re-created. This is because in this case open/3 throws immediately, every time. This is a non-issue in the current state, because the detached thread simply dies at that point, whereas here we keep catching the exception in an infinite recursion.

plwm could also try re-creating the fifo, but I think we'd mostly land in this state when it couldn't create the fifo in the first place during startup, so it probably couldn't do it later either.

If I add a sleep(5) as the exception handler in place of the true, then performance is fixed, but the fifo becomes slow to respond, because the delay also runs after the timeout-generated exceptions.

A proper solution would be if we could distinguish between the actual existence_error exceptions, thrown when the fifo does not exist, and the artificial time_limit_exceeded ones that only come when open/3 has been blocking for too long. Then we could wait, e.g., 5 secs only in the former case. It would look like this:

```prolog
catch(
	(call_with_time_limit(5.0, open(FifoPath, read, Fifo)),
		(read_terms(Fifo, Jobs) ->
			jobs_notify(Jobs)
		; true),
		close(Fifo)
	),
	Ex,
	(Ex \= time_limit_exceeded(_) -> sleep(5) ; true)
)
```

The problem though is that Ex always unifies to this:

`error(existence_error(source_sink,/tmp/plwm_fifo),context(system:open/3,Interrupted system call))`

Not sure why this is the case when it comes from call_with_time_limit/3. It should be a time_limit_exceeded(Context). I must be missing something.

tl;dr We're heading in a good direction, but we should solve this performance problem first, otherwise plwm goes brrrr in exactly the situation we're trying to mend.

@xelxebar
Author

xelxebar commented Jun 8, 2025

However, this particular implementation (i.e. trying open/3 every 0.1 seconds expecting an exception) yields performance issues.

Yikes! Good catch.

Not sure why this is the case when it comes from call_with_time_limit/3. It should be a time_limit_exceeded(Context). I'm not seeing something.

Probably the EINTR error handling from the interrupted open(2) syscall happens before the time_limit_exceeded(Context) unification can take place. My guess is that this is intentional, since even on a timeout you might want to retry I/O and whatnot, but arguably there's a documentation bug.
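One workaround that sidesteps the ambiguous exception term: in the handler, check whether the FIFO still exists and only back off when it is genuinely gone. A timeout-induced EINTR leaves the file in place, so an immediate retry is safe then. Sketch only; the predicate name is hypothetical:

```prolog
%% Hypothetical sketch: after open/3 throws, decide whether to back
%% off before retrying, without inspecting the exception term itself.
handle_fifo_open_error(FifoPath) :-
    (   access_file(FifoPath, exist)
    ->  true        % FIFO still there: likely our timeout fired; retry now
    ;   sleep(5)    % FIFO really missing: back off instead of spinning
    ).
```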

If I add a sleep(5) as exception handler in place of the true, then performance gets fixed, but the fifo becomes slow to respond, because the delay also runs after the timeout-generated exceptions.

Indeed. This is why I originally went with sleep(0.1). It's really obvious in the tests, too.

Arguably, the timeout is a hack, and letting open/3 block is actually correct. One option is to have an "update settings" thread that blocks on thread_get_message and restarts the fifo thread for us. This approach has some advantages:

  1. It fixes the performance problem you caught;
  2. The lag from sleep/1ing is gone; and
  3. It gives us a general "hook" for reacting to arbitrary updates in dynamic settings.

The disadvantage is that it complicates the architecture somewhat, and now every time we update dynamic settings, thread_send_message/2 would somehow need to be called on the "update settings" watcher.
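The watcher itself could be fairly small. A sketch, with hypothetical names throughout: `fifo_enabled/1` as the message shape, and some `apply_fifo_enabled/1` helper that starts or stops the FIFO thread:

```prolog
%% Hypothetical sketch: a watcher thread that sleeps on its message
%% queue (no polling, no CPU spin) and reacts to settings updates.
settings_watcher :-
    thread_get_message(Change),          % blocks until a setting changes
    (   Change = fifo_enabled(Enabled)
    ->  apply_fifo_enabled(Enabled)      % start/stop the FIFO thread
    ;   true                             % ignore other settings for now
    ),
    settings_watcher.
```

Setting code would then do `thread_send_message(settings_watcher, fifo_enabled(false))` (or similar) whenever the relevant setting changes.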

Good discussion. Thoughts?

@Seeker04
Owner

Probably EINTR error handling from the interrupted open(2) syscall is happening before time_limit_exceeded(Context) unification can happen. My guess is that this is intentional, since even on a timeout, you might want to retry I/O and whatnot, but arguably there's documentation bug.

Yeah, I think you're right on this.

Arguably, the timeout is a hack and open/3 blocking is actually correct. One option is to have an "update settings" thread that blocks on thread_get_message and restarts the fifo thread for us .

I'm currently working on the configuration redesign that will let us dynamically change any setting. A separate thread that watches for changes won't be needed, I think. I'd also avoid multi-threading as much as possible. Though this fifo handling will be a special case indeed, so I'm not dismissing this approach just yet.

I'll get back to this once I'm done with the config redesign. I'll let you know and ask for feedback as soon as I have a working state. Then we can continue this from there.

Thanks for the ideas! ^^
