
[Bug] Closing transport with multiple bridges/subscribers connected #314

Open
miltzhaw opened this issue Oct 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments

miltzhaw commented Oct 23, 2024

Describe the bug

I have a robot running the latest release of the zenoh bridge, connecting as a client to a Zenoh router in a Kubernetes cluster. On the same k8s cluster I have a container with a zenoh-bridge that connects to the same Zenoh router as a client; from there I can see the robot's topics and, for instance, use Rviz with nav2 to visualize and move the robot.

When I start another zenoh-bridge in a second container, connect it to the same router and visualize the topics, I get the following error and the zenoh-bridge on the robot stops working.

ERROR ThreadId(19) zenoh_transport::unicast::universal::tx: Unable to push non droppable network message to acac40b9496508dc4cf792ca876954fc. Closing transport!

Note: I also observed this behavior without starting a second container, just by adding a second subscriber to a topic (for instance ros2 topic echo) while rviz was already running. However, in this case it occurs inconsistently: sometimes it works, other times it does not.

Two warning messages that also seem to be related are the following:

WARN net-0 ThreadId(10) zenoh::net::runtime::orchestrator: Unable to connect to tcp/ip:port! Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: tcp/ip:port, dst: tcp/ip:port, mtu: 64995, is_reliable: true, is_streamed: true }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 64995, is_streamed: true, is_compression: false }, priorities: None, reliability: None } } at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zenoh-transport-1.0.0/src/unicast/establishment/open.rs:472.

WARN net-0 ThreadId(09) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/summit/lifecycle_manager_navigation/is_active <-> Zenoh:bot1/summit/lifecycle_manager_navigation/is_active): received error as reply for (2c2adf3057843613,26): ReplyError { payload: ZBytes(ZBuf { slices: [[54, 69, 6d, 65, 6f, 75, 74]] }), encoding: Encoding(Encoding { id: 0, schema: None }) }

To reproduce

  1. Start the ros2-dds-bridge on the robot with an almost default configuration, connecting to a Zenoh router in client mode (a minimal client-mode configuration is sketched after this list).
  2. Start the ros2-dds-bridge in a container in the cloud with an almost default configuration, connecting to the same Zenoh router in client mode.
  3. Start rviz, e.g. with nav2, in the cloud container.
  4. Repeat steps 2 and 3 in another container in the cluster.
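
For reference, by "almost default configuration" I mean essentially just setting client mode and the router endpoint. A rough sketch is below; the endpoint is a placeholder and the key names should be double-checked against the bridge's DEFAULT_CONFIG.json5.

```json5
{
  // Run the bridge as a client of an existing Zenoh router
  // (placeholder endpoint; replace with the router's actual address/port).
  mode: "client",
  connect: {
    endpoints: ["tcp/zenoh-router.example.svc:7447"],
  },
  plugins: {
    ros2dds: {
      // Left at defaults here: no allow/deny filtering, no rate limiting.
    },
  },
}
```

The same can presumably be done from the command line (something like `zenoh-bridge-ros2dds -m client -e tcp/<router>:7447`, if I recall the flags correctly).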

System info

Robot with a ROS 2 Humble container and the ros2dds bridge, stable latest version 1.0.0
Container on a Kubernetes cluster with ROS 2 Humble and the ros2dds bridge, stable latest version 1.0.0
Zenoh router, stable latest version 1.0.0

miltzhaw added the bug (Something isn't working) label on Oct 23, 2024
@miltzhaw (Author)

An update on this issue: I tested previous versions to identify the release in which this error first appears, and it seems to be 1.0.0-beta.4. Version 1.0.0-beta.3 does not show the specific error I described above.

@miltzhaw (Author)

As a follow-up: the issue seems to be related to the limited resources the Zenoh router had allocated as a container on Kubernetes. What are the minimum memory and CPU resources required? That would be helpful to know.

JEnoch (Member) commented Dec 3, 2024

> What are the minimum memory and CPU resources required? That would be helpful to know.

Hard to say, as it really depends on the number of routed DDS entities and on the amount of traffic to route.
We didn't do such a detailed characterization.
The bare minimum is an idle system, where the bridge uses ~10 MB of memory and 0.3% CPU (on macOS).
As far as I remember, on a Turtlebot3 (Raspberry Pi 3) the memory usage doesn't increase a lot, and the CPU usage remains <10%.

How did you find out that the resource allocation in Kubernetes was the cause?
What were the default settings, and how many resources did the bridge consume then?
Which resources did you increase for the bridge?
Your answers might help other users facing similar issues with Kubernetes.

miltzhaw (Author) commented Dec 4, 2024

We also didn't do a detailed analysis; it was more of a logical intuition.
In our setup the number of topics and the bandwidth used are significant, especially with depth camera topics, point clouds and so on. Since we are, for the moment, neither filtering topics nor limiting the publishing frequency in the Zenoh bridge configuration file, all topics are allowed, and we are aware that this is a lot of data.

We visualize the data in Rviz and/or Foxglove in containers connected to the same Zenoh router within the same Kubernetes cluster. We noticed that as soon as multiple Rviz/Foxglove instances were used, the error occurred, which made us think the router was not able to handle all the traffic. I therefore increased the CPU and RAM resources for the router (note: not for the bridges), and that solved the issue. To make sure it works, I use very high values for CPU (i.e., 8) and RAM (i.e., 25 GiB) for the router container. On the container with Foxglove and the bridge I now use 5 GiB of RAM and 2 CPUs, but our impression is that the key point is rather the resources for the Zenoh router.
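
For concreteness, the over-provisioned router settings above correspond roughly to the following container resources in the router's pod spec. This is only a sketch: the container name, image tag and the request values are placeholders, and only the limits reflect the values I mentioned.

```yaml
# Illustrative pod-spec excerpt for the Zenoh router container (placeholder names/values).
containers:
  - name: zenoh-router          # placeholder container name
    image: eclipse/zenoh:1.0.0  # assuming the official eclipse/zenoh image
    resources:
      requests:
        cpu: "2"      # placeholder request values
        memory: 4Gi
      limits:
        cpu: "8"      # deliberately over-provisioned, as described above
        memory: 25Gi
```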

Obviously, reducing the number of topics and the publishing frequency would help as well, but I was wondering whether there are some reference values we could use to be sure the router can handle all the traffic.
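
For illustration, this is roughly what such filtering and rate limiting would look like in the bridge configuration; the key names and regex values below are an assumption on my side and should be double-checked against the plugin's DEFAULT_CONFIG.json5.

```json5
{
  plugins: {
    ros2dds: {
      // Route only the interfaces matching these regexes; everything else is ignored.
      allow: {
        publishers: ["/tf", "/tf_static", "/map", ".*/scan"],
        subscribers: ["/cmd_vel"],
      },
      // Cap the routing frequency (in Hz) of high-bandwidth topics, per regex.
      pub_max_frequencies: [".*/points=2", ".*/image_raw=5"],
    },
  },
}
```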
