Handle msg not found #2384

Open

wants to merge 22 commits into noetic-devel

Conversation

12tbuchman

No description provided.

C. Andy Martin and others added 22 commits June 8, 2021 14:40
Add two new roslaunch substitution args, param and optparam.
These function similarly to env and optenv, but read from the param
server at the time of argument substitution. This means the params for
these substitutions are read *before* any ros params set in the launch
file itself. This is useful when a prior launch file or process has
stored global configuration params in the server, and later launch
files need to adapt to them, e.g. altering behavior based on sensor
configuration or whether sim time is enabled.
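
A minimal sketch of the intended flow, in Python (the param name is hypothetical, and the $(param ...) syntax shown in the comment is an assumption mirroring env/optenv):

    import rospy

    # A prior process stores global configuration in the param server
    # before the dependent launch files run (assumes a master is up).
    rospy.set_param('/global_config/use_sim_time', True)

    # A launch file evaluated afterwards can then read the value at
    # substitution time, e.g. $(param /global_config/use_sim_time) or
    # $(optparam /global_config/use_sim_time false).
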
Currently, roscpp sets the nodelay hint to true by default, while rospy
sets it to false. This is inconsistent. Fix the inconsistency by having
rospy follow roscpp.

It makes little sense to add any delay with ROS, as ROS is a
message-based system. It is generally better to send a message as soon
as it is available.
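
For reference, rospy already exposes the hint per subscriber; a minimal sketch (topic, message type, and callback are placeholders):

    import rospy
    from std_msgs.msg import String

    def on_msg(msg):
        pass  # placeholder callback

    rospy.init_node('nodelay_example')
    # tcp_nodelay=True asks the publisher to disable Nagle's algorithm
    # for this connection, delivering each message as soon as possible.
    rospy.Subscriber('chatter', String, on_msg, tcp_nodelay=True)
    rospy.spin()
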
Invoking rosbag to record bagfiles and asking it to split those
bagfiles at certain thresholds will cause the following (relevant)
events to occur:

1. rosbag subscribes to all topics specified
2. Data is recorded
3. When a threshold is hit:
   a. Stop writing to the current file
   b. Create a new file
   c. Start writing to the new file
   d. Repeat (a to c) at each subsequent threshold

We have a specific need to retain only a subset of these split files.
If that subset does not include the first (0th) file, then any latched
(static) topics will be lost, because they are only recorded upon the
initial subscription (step 1 above).

With this change, you can now specify (via regex) a set of topics that
should be re-subscribed to during a split, making these latched topics
available in every split file.
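
rosbag's recorder is C++, but the idea translates directly; a minimal Python sketch of resubscribe-on-split, with all names hypothetical:

    import re
    import rospy

    def resubscribe_matching(subscribers, pattern):
        """On each split, re-subscribe to topics matching the regex so
        their latched messages are re-delivered into the new file."""
        regex = re.compile(pattern)
        for topic in list(subscribers):
            if regex.match(topic):
                sub, data_class, callback = subscribers[topic]
                sub.unregister()
                subscribers[topic] = (
                    rospy.Subscriber(topic, data_class, callback),
                    data_class, callback)
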
This is the first of several steps to simplify time tracking in rosbag.
We still convert _duration_ to wall time to figure out how long to sleep
but we no longer need to keep track of an absolute wall-clock time which
in turn means that we do not need to track how long (in wall-clock time)
we've been sleeping while paused etc.
When the wall-clock horizon was removed the publish rate was skewed
because I forgot to take into account the run time of the code between
calls to runClock(). To fully correct this, the TimePublisher needed
to be aware of when we were paused. Since pausing the clock seems very
much relevant to TimePublisher, I moved the handling and tracking of
"paused" into TimePublisher.
This recv is done on the thread that handles accepting new connections
to this node. If for some reason a client fails to send the header for
an extended period of time, it will prevent any other connections from
being made to this node. If a client hasn't sent the ros header within
the timeout, we drop their connection and move on. In most cases the
client will retry the connection.
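
A minimal sketch of the idea using a plain socket timeout (the timeout value and buffer size are placeholders):

    import socket

    HEADER_TIMEOUT = 60.0  # seconds; placeholder value

    def read_connection_header(sock):
        """Read the ros header from a newly accepted client, dropping
        slow clients so the accept thread is never blocked for long."""
        sock.settimeout(HEADER_TIMEOUT)
        try:
            return sock.recv(4096)
        except socket.timeout:
            # Client never sent its header; drop the connection and
            # move on. The client will usually retry.
            sock.close()
            return None
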
If we hit one of the cases where we get an exception during the
connection sequence and a reconnection is desired then we need to close
the socket and set it to None. This will allow the connect loop in
robust_connect_subscriber() to continue for another iteration. Without
these changes the connect loop would exit because conn.socket was not
None and then enter the receive loop without successfully having
negotiated a connection.

This can cause a deadlock between a publisher and subscriber if the
connection was established but the ros header was not written out. This
will leave the publisher trying to receive the ros header from the
subscriber and the subscriber waiting to receive a message from the
publisher. When this case is hit, for whatever reason, it's best to just
close the current connection and establish a new connection.
If the connection is reset for some reason, we want to attempt to
reconnect. If the connection has actually gone away, the next attempt
will fail with ECONNREFUSED and we stop trying; if not, we may be able
to reconnect.
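
A sketch of the resulting connect loop (names are illustrative; exchange_ros_header stands in for the real handshake):

    import errno
    import socket
    import time

    def robust_connect(conn, addr):
        """Retry until the header handshake succeeds; only give up
        when the endpoint has definitively gone away."""
        while conn.socket is None:
            s = None
            try:
                s = socket.create_connection(addr)
                exchange_ros_header(s)  # hypothetical handshake helper
                conn.socket = s
            except OSError as e:
                if s is not None:
                    s.close()
                conn.socket = None  # let the loop retry cleanly
                if e.errno == errno.ECONNREFUSED:
                    raise  # the peer is really gone; stop trying
                time.sleep(1.0)
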
Long-running callbacks in rospy can cause extreme amounts of buffering
resulting in unnecessary delay, essentially ignoring the queue_size
setting (ros#1901). This can already be somewhat mitigated by setting
buff_size to be larger than the amount of data that could be buffered
by a long running callback. However, setting buff_size to a correct
value is not possible for the user of the API if the amount of time in
the callback or the amount of data that would be transmitted is
unknown. Also, even with a correct buff_size and a queue_size of 1, the
received data may still be the oldest of all data transmitted while the
callback was running.

Fix the delays in such cases by running callbacks in a separate thread.
The receive_loop then calls recv() concurrently with the long running
callback, enforcing queue_size as new data is received. This fixes the
latency in the data when queue_size is set to be similar to roscpp.

This fixes ros#1901
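
A minimal sketch of the technique (an illustration of the approach, not rospy's actual code): the receive loop keeps calling recv() while a worker thread runs the callback, and a bounded deque enforces queue_size by evicting the oldest messages:

    import threading
    from collections import deque

    class CallbackRunner:
        def __init__(self, callback, queue_size):
            self.callback = callback
            self.queue = deque(maxlen=queue_size)  # drops oldest when full
            self.ready = threading.Condition()
            threading.Thread(target=self._run, daemon=True).start()

        def push(self, msg):
            # Called from the receive loop; never blocks on the callback.
            with self.ready:
                self.queue.append(msg)
                self.ready.notify()

        def _run(self):
            while True:
                with self.ready:
                    while not self.queue:
                        self.ready.wait()
                    msg = self.queue.popleft()
                self.callback(msg)  # long-running work happens here
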
This bug is severe because it can break ordinary rospy.get_param()
after a rospy.get_param_cached() has been performed. And, furthermore,
rospy.log*() does a rospy.get_param_cached()!

When a parameter is set in the cache, the cached value will always be
returned, even when the caller uses ordinary rospy.get_param(). This is
as designed, as subscribeParam() is done to make the cache
receive updates from the param server whenever the value changes.

However, existing code uses only the presence of a key in the cache to
indicate whether the cached param is valid, and that is not enough.

Suppose the param server has these contents:
    top:
        min: 1
        max: 10

    print(rospy.get_param('/top'))
    # So far, so good. Output is:
    # {'min': 1, 'max': 10}

    print(rospy.get_param_cached('/top/min'))
    # As expected, this outputs:
    # 1
    # However, ParamServerCache now contains:
    # {'top': {'min': 1}}
    # This is correct, because only the value of 'min' is cached.

    print(rospy.get_param('/top'))
    # Incorrect output:
    # {'min': 1}
    # Ugh! 'max': 10 is missing!
    # ...because it looks like /top is in the cache, but it was never
    # actually cached; only the individual item within it, 'min', was
    # cached.

The fix is to track the actual keys that are cached in a Python set().
Only return a cached value if either the key or a parent of it is in
the set.

This was found by production code which attempts to get all params using
rospy.get_param('/'). This works OK as long as no calls to
rospy.get_param_cached() have been made. However, the rospy.log*()
functions all do
rospy.get_param_cached('rosout_disable_topics_generation'). So after the
first loginfo, subsequent calls to rospy.get_param('/') return only the
single cached key rather than all params.

Fixes: d16296a ("rospy: added get_param_cached (ros#1515)")
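
A minimal sketch of the fix (flat storage and illustrative names; the real ParamServerCache stores a nested dict):

    class ParamCacheSketch:
        def __init__(self):
            self._values = {}
            self._cached = set()  # keys that were actually cached

        def set(self, key, value):
            self._values[key] = value
            self._cached.add(key)

        def get(self, key):
            # A hit requires that this key, or a parent namespace of
            # it, was explicitly cached; mere presence is not enough.
            parts = key.strip('/').split('/')
            ancestors = {'/' + '/'.join(parts[:i + 1])
                         for i in range(len(parts))}
            if not ancestors & self._cached:
                raise KeyError(key)  # caller falls back to the master
            return self._values[key]
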
If rospy.get_param_cached() is called with a non-existent parameter,
existing code would attempt to encode that the parameter doesn't exist
by storing its value in the cache as an empty dictionary. And it would
correctly make a subscribeParam call to rosmaster to receive updates.

However, every subsequent .get_param_cached()
on this non-existent parameter would re-subscribe for updates from
rosmaster, nullifying any performance benefit from caching. Likewise, a
regular .get_param() would simply re-read the parameter from rosmaster
every time, still not benefitting from caching.

To make matters worse, the rospy.log* functions do a
rospy.get_param_cached('rosout_disable_topics_generation'), so we pay
this penalty on every log message whenever that parameter is not set (it
is not set by default!).

The cause of this problem is that the translation of an empty dict in
the cache to a KeyError occurs within the cache itself, so the
caller can't know whether the parameter is simply unset vs. not
subscribed for updates.

The fix is simple: Move the check for empty dict from the cache itself
to the caller.

Fixes: d16296a ("rospy: added get_param_cached (ros#1515)")
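
A sketch of the resulting control flow (subscribe_and_fetch is a hypothetical stand-in for the subscribeParam call):

    _UNSET = {}  # an empty dict in the cache encodes "param not set"

    def get_param_cached(cache, master, key):
        try:
            value = cache.get(key)  # KeyError => not yet subscribed
        except KeyError:
            value = subscribe_and_fetch(master, key)  # {} if unset
            cache.set(key, value)
        # The empty-dict check now lives here in the caller, so a
        # cached "unset" no longer re-subscribes on every call.
        if value == _UNSET:
            raise KeyError(key)
        return value
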
This bug affected latched topics that were published more than once
per recording session, and only when the `--repeat_latched` option
was set.

Latched messages were added to a std::map object using the `insert`
call. This call adds a new element to the map only if one does not
already exist for that key. As a result, only the first latched message
for each topic was recorded at the start of each split bagfile, not the
most recent one.
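
The same pitfall in Python terms: std::map::insert behaves like dict.setdefault, keeping the first value for a key, while the fix needs plain assignment semantics:

    latched = {}

    # Buggy pattern: like std::map::insert, setdefault keeps the
    # first message ever seen for the topic.
    latched.setdefault('/tf_static', 'first message')
    latched.setdefault('/tf_static', 'latest message')
    assert latched['/tf_static'] == 'first message'

    # Fixed pattern: assignment overwrites, so each split starts with
    # the most recent latched message.
    latched['/tf_static'] = 'latest message'
    assert latched['/tf_static'] == 'latest message'
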
Because our ROS system is entirely contained within a local network, we
do not need tcp_nodelay to be "false".

Also worth mentioning, since we are using upstream code for some of our
nodes, we don't have the option to go change the flags passed to all the
subscribers. Instead we change the default here.
When omitted from the subscriber, the queue size is None, which has the
potential to cause subscribers to buffer without bound. This can be
avoided by setting the queue size to 100 whenever it's omitted or None.

ROS-658
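
The guard itself is tiny; a sketch:

    DEFAULT_QUEUE_SIZE = 100  # the bound chosen by this change

    def effective_queue_size(queue_size):
        # None would mean unbounded buffering; substitute a sane bound.
        return DEFAULT_QUEUE_SIZE if queue_size is None else queue_size
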
Python 3.10

The collections abstract base classes are now under collections.abc in
Python 3.10.
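
The usual compatibility shim for this, which this change presumably applies:

    try:
        from collections.abc import Iterable  # Python 3.3+, required in 3.10
    except ImportError:
        from collections import Iterable      # older Pythons
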
roslaunch, while waiting for processes to exit, eats a lot of resources
by waking up every 0.1 seconds to check on things. Change this to a 1.0
second sleep. The wakeups add up: on our robot, they wasted about 5% of
a CPU core.
If a "required" node in a launch file goes down, we really want to pass
the error code up through the system; specifically to systemd. We can
use this information to do things like attempt restarts. This was an
overlooked feature of roslaunch, which always returns zero if nodes die.
roscore will only clean up a node if all of its subscriptions have been
unregistered. This means that any rospy node that uses roslog would
never get cleaned up by roscore. If such a node is then restarted,
roscore attempts to send a shutdown request to the previous node
because it shares the same node name, even though that node has
actually shut down.

To complicate this further, since the original node shut down, another
node might have been assigned the same address, and it would then
randomly receive the shutdown request even though it didn't actually
conflict.

These changes will ensure that we unregister any param that we had
subscribed for. Now the rospy nodes will actually get properly cleaned
up in roscore.
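
The Master API provides unsubscribeParam for this; a minimal sketch of calling it during shutdown (caller_id, caller_api, and the key list are placeholders):

    import os
    from xmlrpc.client import ServerProxy

    def unsubscribe_params(caller_id, caller_api, keys):
        """Drop our param subscriptions so roscore can fully clean up
        this node when it shuts down."""
        master = ServerProxy(os.environ['ROS_MASTER_URI'])
        for key in keys:
            code, msg, _ = master.unsubscribeParam(caller_id,
                                                   caller_api, key)
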
This exception currently crashes rosbag, but it can easily be handled
by skipping the message and publishing a readable warning that
indicates the topic causing the exception.
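
A sketch of the handling, in Python for consistency with the other examples (rosbag's own code is C++ and will differ):

    import rospy

    def deserialize_or_skip(topic, raw, deserialize):
        try:
            return deserialize(raw)
        except Exception as e:
            # Skip the message rather than crash, and name the topic
            # so the warning is actionable.
            rospy.logwarn("skipping message on topic %s: %s", topic, e)
            return None
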
7 participants