Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Startup Crash Caused by Read() and Write() on Un-activated Ethercat #120

Merged

Conversation

tony-laza
Copy link
Contributor

On our system (multiple CiA 402 motor drivers), we have found there are frequent crashes (at least 25% of the time) at startup caused by accesses to the domain_info_ map before it is populated. We traced this to ros2_control calling the read() and write() functions on the ethercat interface before on_activate() completed. This PR is our fix for this, which applies:

  1. Implement a lock and introduce the activated_ variable into the hardware interface class, preventing access before activation.
  2. Have the lower-level activate() function return a boolean false if it fails to succeed, and propagate this back to ros2_control.
  3. Use at() instead of direct access for the domain_info_ map so it will throw an exception on out-of-bounds access instead of undefined behavior.

I'm happy to re-write this as the team sees fit, or drop this if it is unhelpful.

@yguel
Copy link
Contributor

yguel commented May 2, 2024

Many thanks for your report and your PR.
At first read, your fixes sound good.
I am currently reviewing the PR.
My plan is to devise a unit test that reproduces the problem to validate the solution.

@tony-laza tony-laza force-pushed the feature/laza-startup-crash-stability branch from af7a7e4 to 2e99cc6 Compare May 2, 2024 15:39
@tony-laza
Copy link
Contributor Author

tony-laza commented May 2, 2024

I have looked into writing a test for this, I think it would have to be hardware in the loop. There doesn't seem to be existing software-only tests that start a hardware_interface. I have started looking into if this is possible with the etherlab codebase.

@yguel
Copy link
Contributor

yguel commented May 3, 2024

I have looked into writing a test for this, I think it would have to be hardware in the loop. There doesn't seem to be existing software-only tests that start a hardware_interface. I have started looking into if this is possible with the etherlab codebase.

We have work in progress towards unit testing:

  1. a real hardware platform (we can expect it running tests during summer)
  2. a mock ethercat master to simulate data exchange (we can expect it ok at end of may)
  3. a simulation of the ethernet card (can test the interactions with the ethercat master, but not the ethernet communications)
  4. a simulation of ethercat slaves (we have no realistic schedule yet, in particular we have no good solution yet to simulate an ethercat slave)
    that will run tests as part of the CI/CD.

But I was not counting on those to create the unit test I had in mind. I am planning to mock anything to make the ethercat_driver achieve «inactive» state (pass on_init() and then configure()) and then call read, write, prepare_, perform_, etc. on it. That is before the activate() call is performed. It will obviously fail at the current state of the code since the ethercat communication would not be enabled (that is also why this test does not really require hardware). Unfortunately, this will not really testing your case, however it should pass with your PR.

In the meantime, I will merge your PR (you can expect the merge to be done by the end of the week).

The problem, you had, shows however a bigger problem of our design. It shows that our design does not follow the hardware interface lifecycle (https://control.ros.org/master/_images/hardware_interface_lifecycle.png)

Read and write calls are legit after the on_configure stage, just «no movement» should happen. Since for the moment we have neither the concept of a «motion enabled» nor the concept of a «motion disabled» state for our ethercat slaves, it seemed more «simple» to setup the ethercat communication in the activate() method, but we should change that.

Our proposal is to have something that emulate a write_inactive() and a write_active() method for each slave that will be called by the write() method according to the hardware interface state.

For the generic slave interface, we could add a new parameter to the channels providing values for the writes in the inactive mode:

  • either an inactive_value that will be written in place of the actual value in inactive state
  • or an inactive_mask that will be applied to the value to write in inactive state.
    Only one of the 2 (inactive_value or inactive_mask) would be accepted or we could always use the inactive_value if it exists and applied at the end the inactive_mask.

As it will break the existing mode of operation, however being more respectful of the ros2 control specifications, we will increment the major version of the lib, making it an api breaking change.

@yguel yguel merged commit 52be2c2 into ICube-Robotics:main May 29, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants