epoll integration: Complete network I/O overhaul across Volclava components #35
base: master
Conversation
Remove a space from Makefile.am
epoll integration: Complete network I/O overhaul across Volclava components

This commit consolidates comprehensive epoll integration across multiple Volclava components, replacing traditional socket handling with efficient event-driven I/O.

Key Features:
- Register/deregister epoll channels with static openSock() and %m logging
- epollChan_() implementation working across lim, lstools, and API libraries
- Proper listening socket cleanup on master restart during request processing
- Fixed epoll timeout handling for reliable operation
- Enabled epoll support in mbatchd and sbatchd network communication
- Create epoll_fd post-fork to prevent descriptor reset in child processes
- Non-blocking channel reads when events are ready

Code Quality:
- Reformatted lsb.job.c for consistency
- Cleaned up and untabified messy code sections for maintainability

The implementation provides scalable, non-blocking I/O handling that properly manages file descriptors across process boundaries and ensures robust error handling throughout the event loop.
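Below is a minimal, self-contained sketch (not the actual lib.channel.c implementation) of the event-driven pattern the commit message describes: an epoll instance created post-fork, channels registered for EPOLLIN/EPOLLRDHUP, failures logged with %m, and a bounded epoll_wait() timeout. Function names are placeholders.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>

/* Register one socket with the epoll instance (placeholder for the
 * channel-registration step described above). */
static int register_channel(int epfd, int sock)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLRDHUP;   /* read readiness + peer close */
    ev.data.fd = sock;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev) < 0) {
        fprintf(stderr, "epoll_ctl(ADD) failed: %m\n");  /* %m: glibc extension */
        return -1;
    }
    return 0;
}

/* One pass of the event loop with a bounded timeout in milliseconds. */
static void poll_channels(int epfd, int timeout_ms)
{
    struct epoll_event ready[64];
    int n = epoll_wait(epfd, ready, 64, timeout_ms);
    for (int i = 0; i < n; i++) {
        if (ready[i].events & EPOLLIN) {
            /* drain available data with non-blocking reads */
        }
        if (ready[i].events & (EPOLLERR | EPOLLHUP | EPOLLRDHUP)) {
            /* mark the channel errored / schedule cleanup */
        }
    }
}

int main(void)
{
    /* Created after fork() so a child process cannot reset the parent's
     * epoll descriptor. */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) {
        fprintf(stderr, "epoll_create1() failed: %m\n");
        return 1;
    }
    /* register_channel(epfd, some_socket); poll_channels(epfd, 1000); */
    (void)register_channel; (void)poll_channels;
    close(epfd);
    return 0;
}
```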
In lsf/lib/lib.channel.c:

```c
// Handle error conditions
if ((ev & EPOLLERR) || (ev & EPOLLHUP) || (ev & EPOLLRDHUP)) {
    channels[ch].events |= EPOLL_EVENTS_ERROR;
    channels[ch].chanerr = CHANE_SYSCALL;
    continue;
}
```
I think there might be an issue here. When a peer writes data and immediately closes the socket, the first epoll event may trigger EPOLLIN. If the data isn't fully read in one read() call, the next epoll event will likely include both EPOLLIN (remaining data) and EPOLLRDHUP (peer closed the connection). However, this logic treats EPOLLRDHUP as an error unconditionally, marking the channel as errored and stopping further processing. This could prevent reading remaining data in the buffer, leading to data loss.
Good observation. I think we saw this problem even before, regardless of epoll(): basically, the EOF on the connection arrives before the data. We saw it in testing on one host, but also in production.
The current client does chanRpc_() inside call_server(); I think it waits for a reply from the daemon about the status of the operation, so logically that should work.
Do you have a reproducible scenario, or is it just a consideration? If we have a real case we can investigate it closely.
Yes, I encountered this in a real case during MBD_PROBE handling. Here's the scenario:
- MBD calls probe_slave, sending a MBD_PROBE request to SBD.
- SBD processes it via do_probe, sends a response using chanWrite_, then closes the socket.
- MBD uses epoll to monitor the socket; epoll triggers both EPOLLIN (remaining data) and EPOLLRDHUP (peer closed).
The current logic unconditionally treats EPOLLRDHUP as an error, marking the channel as errored before fully reading data. This causes doProbeReply to be called with exception == TRUE, leading to a false error.
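A small self-contained demo (a sketch, not project code) of the combination described above: an AF_UNIX socketpair stands in for the TCP connection, the peer writes a reply and closes immediately (as do_probe does via chanWrite_), and on Linux the next epoll_wait() typically reports EPOLLIN and EPOLLRDHUP together, which is exactly what the error branch short-circuits.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    /* Peer side: write a reply and close right away. */
    if (write(sv[1], "reply", 5) < 0)
        return 1;
    close(sv[1]);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLRDHUP };
    ev.data.fd = sv[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    struct epoll_event out;
    if (epoll_wait(epfd, &out, 1, 1000) == 1) {
        /* Expect both bits set; if the error branch runs before the read,
         * the 5 buffered bytes are never consumed. */
        printf("EPOLLIN=%d EPOLLRDHUP=%d\n",
               !!(out.events & EPOLLIN), !!(out.events & EPOLLRDHUP));
    }
    close(sv[0]);
    close(epfd);
    return 0;
}
```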
I understand, in the epoll loop the mbd handles the error first. OK, so this is the case in which the order of operations is inverted, or perhaps not inverted but two operations compressed into one, since the TCP layer does some buffering and both bits are on.
What is the effect on mbd? Does it think the host is down?
Actually, the impact is not significant. A failed doProbeReply will only output an ERR log. Additionally, it will set the flag through sbdPtr->hData->flags |= HOST_NEEDPOLL;. Later, when pollSbatchds is invoked, it will detect the HOST_NEEDPOLL flag and initiate another call to probe_slave.
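For context, a condensed, self-contained sketch of the recovery path described above; the struct layout, flag value, and function names here are simplified stand-ins, not the actual mbatchd definitions.

```c
#include <stdio.h>

#define HOST_NEEDPOLL 0x01              /* flag value chosen for the sketch */

struct hData   { int flags; };
struct sbdNode { struct hData *hostPtr; };

/* On a failed probe reply: only an ERR log plus a flag for re-polling. */
static void on_probe_reply_error(struct sbdNode *sbdPtr)
{
    fprintf(stderr, "doProbeReply: probe failed; host flagged for re-poll\n");
    sbdPtr->hostPtr->flags |= HOST_NEEDPOLL;
}

/* Periodic poller: any host flagged HOST_NEEDPOLL gets probed again. */
static void poll_sbatchds(struct sbdNode *nodes, int n)
{
    for (int i = 0; i < n; i++) {
        if (nodes[i].hostPtr->flags & HOST_NEEDPOLL) {
            nodes[i].hostPtr->flags &= ~HOST_NEEDPOLL;
            /* probe_slave() would be re-issued here */
        }
    }
}

int main(void)
{
    struct hData h = { 0 };
    struct sbdNode node = { &h };
    on_probe_reply_error(&node);
    poll_sbatchds(&node, 1);
    return 0;
}
```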
Here you go, so it is resilient already. It manages failures as part of the overall flow.
No need to change the code, since this is not a problem; it is a typical case in system programming where you hit a boundary case because you manage many moving parts. I would make a note in the code, though.
However, you could try the following: if EPOLLIN and EPOLLRDHUP are both on, process the read first and then continue (sketched after this comment), and see whether epoll_wait() returns next time with only EPOLLRDHUP. It is not clear what happens in this edge case, at least to me.
If you can consistently reproduce it, I would play around a bit just to understand it better.
Consider that in real life the daemons receive hundreds if not thousands of requests per second; some failures or hiccups are inevitable and part of normal functioning, so the goal is to tolerate them, not to die crying at the first apparent inconsistency :-)
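A rough sketch of that variant: drain the data first when EPOLLIN accompanies EPOLLRDHUP, and only fail the channel on a hard error or a hangup with nothing left to read. The constants and the channels[] bookkeeping are stand-ins modeled on the snippet above, not the actual lib.channel.c code.

```c
#include <sys/epoll.h>

#define EPOLL_EVENTS_ERROR 0x01     /* stand-ins for the real constants */
#define EPOLL_EVENTS_READ  0x02
#define CHANE_SYSCALL      1

struct channel { int events; int chanerr; };
static struct channel channels[1024];

static void handle_event(int ch, unsigned int ev)
{
    /* Data plus peer half-close: let the read path drain the buffered
     * data first; a later epoll_wait() should report the hangup alone. */
    if ((ev & EPOLLIN) && (ev & EPOLLRDHUP) && !(ev & EPOLLERR)) {
        channels[ch].events |= EPOLL_EVENTS_READ;
        return;
    }
    /* Hard error, or hangup with no pending data: mark the channel. */
    if (ev & (EPOLLERR | EPOLLHUP | EPOLLRDHUP)) {
        channels[ch].events |= EPOLL_EVENTS_ERROR;
        channels[ch].chanerr = CHANE_SYSCALL;
        return;
    }
    if (ev & EPOLLIN)
        channels[ch].events |= EPOLL_EVENTS_READ;
}

int main(void)
{
    handle_event(0, EPOLLIN | EPOLLRDHUP);  /* data + peer close: read wins */
    handle_event(0, EPOLLRDHUP);            /* pure hangup: mark the error  */
    return 0;
}
```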
This PR introduces a consolidated epoll-based I/O model across core Volclava components, replacing legacy socket handling with scalable, event-driven mechanisms. The changes were squashed into commit 88193e8 and cherry-picked here for visibility and review.
Key Features:
- Register/deregister epoll channels with static openSock() and %m logging
- epollChan_() working across lim, lstools, and the API libraries
- Listening socket cleanup on master restart during request processing
- Fixed epoll timeout handling
- epoll support enabled in mbatchd and sbatchd network communication
- epoll_fd created post-fork to prevent descriptor reset in child processes
- Non-blocking channel reads when events are ready
Code Quality Improvements:
- Reformatted lsb.job.c for consistency
- Cleaned up and untabified messy code sections
Test Coverage:
- Verified base and batch command functionality
- Ran full job lifecycle: submit → dispatch → execution → termination
- Observed correct channel readiness and socket cleanup
Caveats:
This integration is functional but not yet fully hardened. Some edge cases remain under investigation, particularly around fork timing and descriptor lifecycle. Reviewers are encouraged to test under mixed OS conditions and high-load scenarios.
Lu