Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release ownership of entities after spinning cancelled #2556

Merged

Conversation

Barry-Xu-2018
Copy link
Collaborator

This issue was found in ros2/rmw_fastrtps#761.
In code of MoveIt, it used static executor. When the static executor is destructed (The executor is in a cancel state.), and it's time to free up the ownership of entity resources, Fast DDS resources have already been released.

After spinning is cancelled, the Executor should release ownership of entities at once.

@alsora
Copy link
Collaborator

alsora commented Jun 12, 2024

can we deprecate / remove the static executor rather than keep trying to fix it?
i don't see any reason why users should keep using it

@wjwwood
Copy link
Member

wjwwood commented Jun 12, 2024

Might be a misunderstanding, I thought they meant the StaticSingleThreadedExecutor too, but they mean they had a static-global executor, so it was destructed after the (also static-global) domain participant factory in fastrtps. At least that's my understanding.

So this affects all of the executors that use this part of the executor base class.

@wjwwood
Copy link
Member

wjwwood commented Jun 12, 2024

But yes, we can deprecate it in rolling at least, though that should be a separate pr.

@Barry-Xu-2018
Copy link
Collaborator Author

Might be a misunderstanding, I thought they meant the StaticSingleThreadedExecutor too, but they mean they had a static-global executor, so it was destructed after the (also static-global) domain participant factory in fastrtps. At least that's my understanding.

Your understanding are right. Sorry, my description static executor caused a misunderstanding.

@Barry-Xu-2018
Copy link
Collaborator Author

@wjwwood

How about moving release action to every exit point in different spin functions ?
0b4ab7e

rclcpp/test/rclcpp/test_executor.cpp Outdated Show resolved Hide resolved
rclcpp/src/rclcpp/executor.cpp Outdated Show resolved Hide resolved
rclcpp/src/rclcpp/executor.cpp Outdated Show resolved Hide resolved
@fujitatomoya
Copy link
Collaborator

Might be a misunderstanding, I thought they meant the StaticSingleThreadedExecutor too, but they mean they had a static-global executor, so it was destructed after the (also static-global) domain participant factory in fastrtps. At least that's my understanding.

i thought that too until looking into the reproducible program.

So this affects all of the executors that use this part of the executor base class.

true. actually they uses SingleThreadedExecutor, https://github.com/moveit/moveit_task_constructor/blob/bb047c894cd2a395a672abdd47ec76d709d1cb1e/core/src/introspection.cpp#L160

@wjwwood
Copy link
Member

wjwwood commented Jun 13, 2024

@Barry-Xu-2018 that was kind of what I was thinking, but we should take care to clear it before setting the spinning to false, so that a new spin doesn't enter before it can be reset.

@@ -275,7 +275,7 @@ Executor::spin_until_future_complete_impl(
if (spinning.exchange(true)) {
throw std::runtime_error("spin_until_future_complete() called while already spinning");
}
RCPPUTILS_SCOPE_EXIT(this->spinning.store(false); );
RCPPUTILS_SCOPE_EXIT(this->spinning.store(false);wait_result_.reset(););
Copy link
Collaborator

@alsora alsora Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this order is not thread-safe.
you are first setting spinning to false and then resetting the wait_result.

If there are two threads, one trying to start spin and the other exiting spin, there can be a concurrent access to wait_result_.

  • thread 1 is exiting spin.
  • thread 1 sets spinning to false
  • thread 2 is entering spin
  • thread 2 does something on the wait result
  • thread 1 finishes to exit and resets the wait result

We should at least invert the order, to ensure that no other threads can enter spin until we have cleared the wait result.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. It is better to invert the order.

Besides, There will be many issues if multiple threads executes the same executor's spin concurrently.
wait_result_ was completely unprotected.

void
Executor::wait_for_work(std::chrono::nanoseconds timeout)
{
TRACETOOLS_TRACEPOINT(rclcpp_executor_wait_for_work, timeout.count());
// Clear any previous wait result
this->wait_result_.reset();
{
std::lock_guard<std::mutex> guard(mutex_);
if (this->entities_need_rebuild_.exchange(false) || current_collection_.empty()) {
this->collect_entities();
}
}
this->wait_result_.emplace(wait_set_.wait(timeout));
if (!this->wait_result_ || this->wait_result_->kind() == WaitResultKind::Empty) {
RCUTILS_LOG_WARN_NAMED(
"rclcpp",
"empty wait set received in wait(). This should never happen.");
} else {
if (this->wait_result_->kind() == WaitResultKind::Ready && current_notify_waitable_) {
auto & rcl_wait_set = this->wait_result_->get_wait_set().get_rcl_wait_set();
if (current_notify_waitable_->is_ready(rcl_wait_set)) {
current_notify_waitable_->execute(current_notify_waitable_->take_data());
}
}
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least in the case of the multithreaded executor, the wait_result is protected by a mutex in the mt executor itself (

std::lock_guard wait_lock{wait_mutex_};
)

As for single-threaded executor, it should generally not be possible to call spin from multiple threads as there is an internal spinning atomic that should block other threads from spinning an executor that is already spinning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mjcarroll Thanks for your comments.
I didn't express myself clearly. I mean, For the SingleThreadedExecutor, if the user creates multiple threads to execute spin(), problems will arise.

@Barry-Xu-2018
Copy link
Collaborator Author

The failure of Rpr__rclcpp__ubuntu_noble_amd64 is related to test create_two_content_filters_with_same_topic_name_and_destroy. rmw_fastrtps in CI environment doesn't include ros2/rmw_fastrtps#762.

@Barry-Xu-2018
Copy link
Collaborator Author

@wjwwood @alsora I have updated code. Please review again.

@Barry-Xu-2018 Barry-Xu-2018 force-pushed the review/topic-fix-cancel-spin-issue branch from b30df3a to 50bce78 Compare July 3, 2024 03:01
@Barry-Xu-2018
Copy link
Collaborator Author

Rebase was done.

Signed-off-by: Barry Xu <[email protected]>
@Barry-Xu-2018
Copy link
Collaborator Author

@alsora I updated code. Please help to review again.

@Barry-Xu-2018
Copy link
Collaborator Author

@alsora I have moved test code. Please review again.

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Barry-Xu-2018
Copy link
Collaborator Author

CI:

  • Windows Build Status

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@fujitatomoya fujitatomoya merged commit 069a001 into ros2:rolling Jul 10, 2024
3 checks passed
@fujitatomoya
Copy link
Collaborator

@Mergifyio backport jazzy

Copy link
Contributor

mergify bot commented Jul 10, 2024

backport jazzy

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Jul 10, 2024
* Release ownership of entities after spinning cancelled

Signed-off-by: Barry Xu <[email protected]>

* Move release action to every exit point in different spin functions

Signed-off-by: Barry Xu <[email protected]>

* Move wait_result_.reset() before setting spinning to false

Signed-off-by: Barry Xu <[email protected]>

* Update test code

Signed-off-by: Barry Xu <[email protected]>

* Move test code to test_executors.cpp

Signed-off-by: Barry Xu <[email protected]>

---------

Signed-off-by: Barry Xu <[email protected]>
(cherry picked from commit 069a001)

# Conflicts:
#	rclcpp/test/rclcpp/executors/test_executors.cpp
fujitatomoya pushed a commit that referenced this pull request Jul 18, 2024
… (#2580)

* Release ownership of entities after spinning cancelled (#2556)

* Release ownership of entities after spinning cancelled

Signed-off-by: Barry Xu <[email protected]>

* Move release action to every exit point in different spin functions

Signed-off-by: Barry Xu <[email protected]>

* Move wait_result_.reset() before setting spinning to false

Signed-off-by: Barry Xu <[email protected]>

* Update test code

Signed-off-by: Barry Xu <[email protected]>

* Move test code to test_executors.cpp

Signed-off-by: Barry Xu <[email protected]>

---------

Signed-off-by: Barry Xu <[email protected]>
(cherry picked from commit 069a001)

# Conflicts:
#	rclcpp/test/rclcpp/executors/test_executors.cpp

* Fix backport issue (#2581)

Signed-off-by: Barry Xu <[email protected]>

---------

Signed-off-by: Barry Xu <[email protected]>
Co-authored-by: Barry Xu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants