Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: vsomeip slow to restart due to VSOMEIP_MAX_TCP_SENT_WAIT_TIME #701

Open
joeyoravec opened this issue May 13, 2024 · 0 comments
Open
Labels

Comments

@joeyoravec
Copy link

vSomeip Version

v3.4.10

Boost Version

1.82

Environment

QNX and Android

Describe the bug

When tcp client endpoint gets a broken socket during normal operation:

  • receive_cbk() should get Connection reset by peer(104)
  • send_cbk() should get Broken pipe (32)

but there's still a 10 second delay until communication gets re-established. This is because inside receive_cbk() while the state is ESTABLISHED it calls wait_until_sent():

if (state_ == cei_state_e::CONNECTING) {
VSOMEIP_WARNING << "tcp_client_endpoint receive_cbk already"
" restarting" << get_remote_information();
} else {
VSOMEIP_WARNING << "tcp_client_endpoint receive_cbk restarting.";
state_ = cei_state_e::CONNECTING;
shutdown_and_close_socket_unlocked(false);
its_lock.unlock();
// wait_until_sent interprets "no error" as timeout.
// Therefore call it with an error.
wait_until_sent(boost::asio::error::operation_aborted);

And at the top of that function it checks

  1. is_sending_ which was the subject of [BUG]: tcp_client_endpoint must unset is_sending_ during restart #668 and may be set at this point; likely if the system is under heavy load
  2. _error which is definitely set by the caller

void tcp_client_endpoint_impl::wait_until_sent(const boost::system::error_code &_error) {
std::unique_lock<std::recursive_mutex> its_lock(mutex_);
if (!is_sending_ || !_error) {

sending us down the else-path to wait hardcoded 10s:

} else {
std::chrono::milliseconds its_timeout(VSOMEIP_MAX_TCP_SENT_WAIT_TIME);
boost::system::error_code ec;
sent_timer_.expires_from_now(its_timeout, ec);
sent_timer_.async_wait(std::bind(&tcp_client_endpoint_impl::wait_until_sent,
std::dynamic_pointer_cast<tcp_client_endpoint_impl>(shared_from_this()),
std::placeholders::_1));
}

where instead we probably wanted to go down the if-path for an immediate restart, knowing the socket is broken -- noting that multiple code paths doing restart is going to exaggerate issues like #690

Reproduction Steps

Same repro like issue #668 with:

  1. Start with a TCE established, lots of traffic
  2. Break the socket on the service side (probability of is_sending_ being either true or false at time of break)

Repeat in a loop. Maybe 1-in-10 chance of hitting this code path

Expected behaviour

vsomeip should recover more quickly from tce socket break

Logs and Screenshots

After the socket break we see:

tcp_client_endpoint receive_cbk: Connection reset by peer(104) local: 10.6.0.10:39447 remote: 10.6.0.3:30510
tcp_client_endpoint receive_cbk restarting.
tce::write_completion_condition: Broken pipe(32) bytes transferred: 0 bytes to sent: 18 remote:10.6.0.3:30510 (20f9): [0fe2.4dbb.0a32]
tce::send_cbk endpoint is already restarting:10.6.0.3:30510
tce::send_cbk received error: Broken pipe (32) 10.6.0.3:30510 1 18 (20f9): [0fe2.4dbb.0a32]

then after the 10 second delay it's finally unblocked to continue:

wait_until_sent: Maximum wait time for send operation exceeded for tce.
tce::restart: dropping message: remote:10.6.0.3:30510 (20f9): [0fe2.4dbb.0a32] size: 18
tce::restart: dropping message: remote:10.6.0.3:30510 (20de): [0fe2.3f91.0047] size: 18
@joeyoravec joeyoravec added the bug label May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant