Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add qos_reliablity param #329

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

MrBlenny
Copy link

@MrBlenny MrBlenny commented Dec 5, 2024

Changelog

Adds a qos_reliablity param to the ros2 bridge that allows the default reliability to be set via param. Defaults to "automatic" to retain existing behavior.

Docs

N/A

Description

The web-socket bridge runs on TCP which causes issues on flakey internet connections. There is some discussion here:
https://github.com/orgs/foxglove/discussions/15

This PR adds the ability to specify "qos_reliability" of "best_effort" which is useful when running a bridge on a different computer to the robot for example:

ROBOT COMPUTER -> DDS/UDP -> REMOTE COMPUTER -> Websocket/TCP -> Foxglove / Browser

@jtbandes
Copy link
Member

jtbandes commented Dec 5, 2024

when running a bridge on a different computer to the robot

Just curious, why are you doing this rather than running the bridge on the robot?

@MrBlenny
Copy link
Author

MrBlenny commented Dec 5, 2024

Running the bridge on the robot means the connection from the Robot to Foxglove is TCP.
When comms is poor (for example satellite internet on a field robot) the TCP connection can go down for a few seconds which will lead to a build-up of messages, potentially up to the send_buffer_limit. When connection is established there will then be a large flood of messages, some of which may no longer be relevant.

In my use-case, it is fine if messages are dropped via the DDS middleware. Adding WebRTC or WebTransport would be a nicer solution but this param provides a quick fix.

README.md Outdated
@@ -84,6 +84,7 @@ Parameters are provided to configure the behavior of the bridge. These parameter
* (ROS 2) __num_threads__: The number of threads to use for the ROS node executor. This controls the number of subscriptions that can be processed in parallel. 0 means one thread per CPU core. Defaults to `0`.
* (ROS 2) __min_qos_depth__: Minimum depth used for the QoS profile of subscriptions. Defaults to `1`. This is to set a lower limit for a subscriber's QoS depth which is computed by summing up depths of all publishers. See also [#208](https://github.com/foxglove/ros-foxglove-bridge/issues/208).
* (ROS 2) __max_qos_depth__: Maximum depth used for the QoS profile of subscriptions. Defaults to `25`.
* (ROS 2) __qos_reliability__: The default QoS reliability setting for subscriptions the bridge creates. Can be 'reliable', 'best_effort', or 'automatic'. Defaults to `automatic`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to explain what the automatic setting does under the hood.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a description of all options and another best_effort_if_volatile.
I know this is borderline but... this is a very useful default if running the bridge remote as transient_local topics are typically not things that should be thrown away.

A better option would probably be to have a regex matcher on topics that specifies which QoS foxglove should use when in subscribes.

README.md Outdated
* (ROS 2) __qos_reliability__: The default QoS reliability setting for subscriptions the bridge creates. Defaults to `automatic`.
* `reliable`: ALWAYS subscribe with a "reliable" QoS profile.
* `best_effort`: ALWAYS subscribe with a "best effort" QoS profile.
* `best_effort_if_volatile`: subscribe as "best effort" if all the participants are "volatile". If any are "transient_local", subscribe as "reliable".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it wrong to subscribe with "best effort" if any of the participants are "transient_local" ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I can fully answer this question but the following info may be helpful: https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Quality-of-Service-Settings.html#qos-compatibilities

Copy link
Author

@MrBlenny MrBlenny Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a common pattern to publish a topic with "transient_local" durability whenever that value changes. For example a /robot/healthy Bool topic may be published only when the value changes from true<->false.

If the bridge subscribes to a transient_local topic with best_effort over a flakey connection. The historic transient_local message may be lost. These sorts of transient_local topics are typically not re-published so the bridge will never get the value.

best_effort_if_volatile is a useful default in this case but as mentioned, it's probably better to add a regex matcher param instead of this qos_reliability param.

i.e.
best_effort_qos_topic_whitelist: List of regular expressions (ECMAScript) for topics that should use be forced to use 'best_effort' QoS. Unmatched topics will use 'reliable' QoS if ALL publishers are 'reliable', 'best_effort' if any publishers are 'best_effort'. Defaults to ["(?!)"] (match nothing).

I might have a look at adding this today.

Copy link
Author

@MrBlenny MrBlenny Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made said change. Ready for review again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 we'll take a look

It's a common pattern to publish a topic with "transient_local" durability whenever that value changes. For example a /robot/healthy Bool topic may be published only when the value changes from true<->false.

FWIW my general advice on this is to to pick a cadence where you publish things like this even if they have not changed. It helps with data recording and debugging afterwards because you have the state information present without having to go back minutes or hours in history to lookup the topic value (which is hard in some workflows).

@defunctzombie defunctzombie requested a review from achim-k January 3, 2025 23:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants