epoll integration: Complete network I/O overhaul across Volclava components #35
base: master
Conversation
Remove a space from Makefile.am
epoll integration: Complete network I/O overhaul across Volclava components

This commit consolidates comprehensive epoll integration across multiple Volclava components, replacing traditional socket handling with efficient event-driven I/O.

Key Features:
- Register/deregister epoll channels with static openSock() and %m logging
- epollChan_() implementation working across lim, lstools, and API libraries
- Proper listening socket cleanup on master restart during request processing
- Fixed epoll timeout handling for reliable operation
- Enabled epoll support in mbatchd and sbatchd network communication
- Create epoll_fd post-fork to prevent descriptor reset in child processes
- Non-blocking channel reads when events are ready

Code Quality:
- Reformatted lsb.job.c for consistency
- Cleaned up and untabified messy code sections for maintainability

The implementation provides scalable, non-blocking I/O handling that properly manages file descriptors across process boundaries and ensures robust error handling throughout the event loop.
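Below is a minimal, self-contained sketch (not the actual lib.channel.c implementation) of the event-driven pattern the commit message describes: an epoll instance created post-fork, channels registered for EPOLLIN/EPOLLRDHUP, failures logged with %m, and a bounded epoll_wait() timeout. Function names are placeholders.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

/* Register one socket with the epoll instance (placeholder for the
 * channel-registration step described above). */
static int register_channel(int epfd, int sock)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLRDHUP;   /* read readiness + peer close */
    ev.data.fd = sock;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev) < 0) {
        fprintf(stderr, "epoll_ctl(ADD) failed: %m\n");  /* %m: glibc extension */
        return -1;
    }
    return 0;
}

/* One pass of the event loop with a bounded timeout in milliseconds. */
static void poll_channels(int epfd, int timeout_ms)
{
    struct epoll_event ready[64];
    int n = epoll_wait(epfd, ready, 64, timeout_ms);
    for (int i = 0; i < n; i++) {
        if (ready[i].events & EPOLLIN) {
            /* drain available data with non-blocking reads */
        }
        if (ready[i].events & (EPOLLERR | EPOLLHUP | EPOLLRDHUP)) {
            /* mark the channel errored / schedule cleanup */
        }
    }
}

int main(void)
{
    /* Created after fork() so a child process cannot reset the parent's
     * epoll descriptor. */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        fprintf(stderr, "epoll_create1() failed: %m\n");
        return 1;
    }
    /* register_channel(epfd, some_socket); poll_channels(epfd, 1000); */
    (void)register_channel; (void)poll_channels;
    close(epfd);
    return 0;
}
```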
In lsf/lib/lib.channel.c:

```c
// Handle error conditions
if ((ev & EPOLLERR) || (ev & EPOLLHUP) || (ev & EPOLLRDHUP)) {
    channels[ch].events |= EPOLL_EVENTS_ERROR;
    channels[ch].chanerr = CHANE_SYSCALL;
    continue;
}
```
I think there might be an issue here. When a peer writes data and immediately closes the socket, the first epoll event may trigger EPOLLIN. If the data isn't fully read in one read() call, the next epoll event will likely include both EPOLLIN (remaining data) and EPOLLRDHUP (peer closed the connection). However, this logic treats EPOLLRDHUP as an error unconditionally, marking the channel as errored and stopping further processing. This could prevent reading remaining data in the buffer, leading to data loss.
Good observation. I think we saw this problem even before, regardless of epoll(): basically, the EOF on the connection arrives before the data. We saw it in testing on one host, but also in production.
The current client does chanRpc_() inside call_server(); I think it waits for a reply from the daemon about the status of the operation, so logically that should work.
Do you have a reproducible scenario, or is it just a consideration? If we have a real case we can investigate it closely.
Yes, I encountered this in a real case during MBD_PROBE handling. Here's the scenario:
- MBD calls probe_slave, sending a MBD_PROBE request to SBD.
- SBD processes it via do_probe, sends a response using chanWrite_, then closes the socket.
- MBD uses epoll to monitor the socket; epoll triggers both EPOLLIN (remaining data) and EPOLLRDHUP (peer closed).
The current logic unconditionally treats EPOLLRDHUP as an error, marking the channel as errored before fully reading data. This causes doProbeReply to be called with exception == TRUE, leading to a false error.
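A small self-contained demo (a sketch, not project code) of the combination described above: an AF_UNIX socketpair stands in for the TCP connection, the peer writes a reply and closes immediately (as do_probe does via chanWrite_), and on Linux the next epoll_wait() typically reports EPOLLIN and EPOLLRDHUP together, which is exactly what the error branch short-circuits.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    /* Peer side: write a reply and close right away. */
    if (write(sv[1], "reply", 5) < 0)
        return 1;
    close(sv[1]);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
    ev.data.fd = sv[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    struct epoll_event out;
    if (epoll_wait(epfd, &out, 1, 1000) == 1) {
        /* Expect both bits set; if the error branch runs before the read,
         * the 5 buffered bytes are never consumed. */
        printf("EPOLLIN=%d EPOLLRDHUP=%d\n",
               !!(out.events & EPOLLIN), !!(out.events & EPOLLRDHUP));
    }
    close(sv[0]);
    close(epfd);
    return 0;
}
```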
I understand, in the epoll loop the mbd handles the error first. OK, so this is the case in which the order of operations is inverted, or perhaps not inverted but two operations compressed into one, since the TCP layer does some buffering and both bits are on.
What is the effect on mbd? Does it think the host is down?
Actually, the impact is not significant. A failed doProbeReply will only output an ERR log. Additionally, it will set the flag through sbdPtr->hData->flags |= HOST_NEEDPOLL;. Later, when pollSbatchds is invoked, it will detect the HOST_NEEDPOLL flag and initiate another call to probe_slave.
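For context, a condensed, self-contained sketch of the recovery path described above; the struct layout, flag value, and function names here are simplified stand-ins, not the actual mbatchd definitions.

```c
#include <stdio.h>

#define HOST_NEEDPOLL 0x01              /* flag value chosen for the sketch */

struct hData   { int flags; };
struct sbdNode { struct hData *hostPtr; };

/* On a failed probe reply: only an ERR log plus a flag for re-polling. */
static void on_probe_reply_error(struct sbdNode *sbdPtr)
{
    fprintf(stderr, "doProbeReply: probe failed; host flagged for re-poll\n");
    sbdPtr->hostPtr->flags |= HOST_NEEDPOLL;
}

/* Periodic poller: any host flagged HOST_NEEDPOLL gets probed again. */
static void poll_sbatchds(struct sbdNode *nodes, int n)
{
    for (int i = 0; i < n; i++) {
        if (nodes[i].hostPtr->flags & HOST_NEEDPOLL) {
            nodes[i].hostPtr->flags &= ~HOST_NEEDPOLL;
            /* probe_slave() would be re-issued here */
        }
    }
}

int main(void)
{
    struct hData h = { 0 };
    struct sbdNode node = { &h };
    on_probe_reply_error(&node);
    poll_sbatchds(&node, 1);
    return 0;
}
```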
Here you go, so it is resilient already. It manages failures as part of the overall flow.
No need to change the code, since this is not a problem; it is a typical case in system programming where you hit a boundary case because you manage many moving parts. I would make a note in the code, though.
However, you could try the following: if EPOLLIN and EPOLLRDHUP are both on, process the read first and then continue (sketched after this comment), and see whether epoll_wait() returns next time with only EPOLLRDHUP. It is not clear what happens in this edge case, at least to me.
If you can consistently reproduce it, I would play around a bit just to understand it better.
Consider that in real life the daemons receive hundreds if not thousands of requests per second; some failures or hiccups are inevitable and part of normal functioning, so the goal is to tolerate them, not to die crying at the first apparent inconsistency :-)
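A rough sketch of that variant: drain the data first when EPOLLIN accompanies EPOLLRDHUP, and only fail the channel on a hard error or a hangup with nothing left to read. The constants and the channels[] bookkeeping are stand-ins modeled on the snippet above, not the actual lib.channel.c code.

```c
#include <sys/epoll.h>

#define EPOLL_EVENTS_ERROR 0x01     /* stand-ins for the real constants */
#define EPOLL_EVENTS_READ  0x02
#define CHANE_SYSCALL      1

struct channel { int events; int chanerr; };
static struct channel channels[1024];

static void handle_event(int ch, unsigned int ev)
{
    /* Data plus peer half-close: let the read path drain the buffered
     * data first; a later epoll_wait() should report the hangup alone. */
    if ((ev & EPOLLIN) && (ev & EPOLLRDHUP) && !(ev & EPOLLERR)) {
        channels[ch].events |= EPOLL_EVENTS_READ;
        return;
    }
    /* Hard error, or hangup with no pending data: mark the channel. */
    if (ev & (EPOLLERR | EPOLLHUP | EPOLLRDHUP)) {
        channels[ch].events |= EPOLL_EVENTS_ERROR;
        channels[ch].chanerr = CHANE_SYSCALL;
        return;
    }
    if (ev & EPOLLIN)
        channels[ch].events |= EPOLL_EVENTS_READ;
}

int main(void)
{
    handle_event(0, EPOLLIN | EPOLLRDHUP);  /* data + peer close: read wins */
    handle_event(0, EPOLLRDHUP);            /* pure hangup: mark the error  */
    return 0;
}
```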
This PR introduces a consolidated epoll-based I/O model across core Volclava components, replacing legacy socket handling with scalable, event-driven mechanisms. The changes were squashed into commit 88193e8 and cherry-picked here for visibility and review.
Key Features:
- Register/deregister epoll channels with static openSock() and %m logging
- epollChan_() working across lim, lstools, and the API libraries
- Listening socket cleanup on master restart during request processing
- Fixed epoll timeout handling
- epoll support enabled in mbatchd and sbatchd network communication
- epoll_fd created post-fork to prevent descriptor reset in child processes
- Non-blocking channel reads when events are ready
Code Quality Improvements:
- Reformatted lsb.job.c for consistency
- Cleaned up and untabified messy code sections
Test Coverage:
- Verified base and batch command functionality
- Ran full job lifecycle: submit → dispatch → execution → termination
- Observed correct channel readiness and socket cleanup
Caveats:
This integration is functional but not yet fully hardened. Some edge cases remain under investigation, particularly around fork timing and descriptor lifecycle. Reviewers are encouraged to test under mixed OS conditions and high-load scenarios.
Lu