
[Bug] Closing transport with multiple bridges/subscribers connected #314

Open
miltzhaw opened this issue Oct 23, 2024 · 4 comments
Labels
bug Something isn't working

Comments

miltzhaw commented Oct 23, 2024

Describe the bug

I have a robot running the latest release of the zenoh bridge, connecting as a client to a Zenoh router in a Kubernetes cluster. On the same k8s cluster I have a container with a zenoh-bridge that connects to the same Zenoh router as a client; from there I can see the robot's topics and, for instance, use Rviz with nav2 to visualize and move the robot.

When I start another zenoh-bridge in a second container, connect it to the same router and visualize the topics, I get the following error and the zenoh-bridge on the robot stops working.

ERROR ThreadId(19) zenoh_transport::unicast::universal::tx: Unable to push non droppable network message to acac40b9496508dc4cf792ca876954fc. Closing transport!

Note: I also observed this behavior without starting a second container, just by adding a second subscriber to a topic (for instance ros2 topic echo) while rviz was already running. However, in this case it occurs inconsistently: sometimes it works, other times it does not.

Two warning messages that also seem to be related are the following:

WARN net-0 ThreadId(10) zenoh::net::runtime::orchestrator: Unable to connect to tcp/ip:port! Received a close message (reason MAX_LINKS) in response to an OpenSyn on: TransportLinkUnicast { link: Link { src: tcp/ip:port, dst: tcp/ip:port, mtu: 64995, is_reliable: true, is_streamed: true }, config: TransportLinkUnicastConfig { direction: Outbound, batch: BatchConfig { mtu: 64995, is_streamed: true, is_compression: false }, priorities: None, reliability: None } } at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/zenoh-transport-1.0.0/src/unicast/establishment/open.rs:472.

WARN net-0 ThreadId(09) zenoh_plugin_ros2dds::route_service_cli: Route Service Client (ROS:/summit/lifecycle_manager_navigation/is_active <-> Zenoh:bot1/summit/lifecycle_manager_navigation/is_active): received error as reply for (2c2adf3057843613,26): ReplyError { payload: ZBytes(ZBuf { slices: [[54, 69, 6d, 65, 6f, 75, 74]] }), encoding: Encoding(Encoding { id: 0, schema: None }) }

To reproduce

  1. Start the ros2-dds-bridge on the robot with an almost default configuration, connecting to a Zenoh router in client mode (a minimal client-mode configuration is sketched after this list).
  2. Start the ros2-dds-bridge in a container in the cloud with an almost default configuration, connecting to the same Zenoh router in client mode.
  3. Start rviz, e.g. with nav2, in the cloud container.
  4. Repeat steps 2 and 3 in another container in the cluster.
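
For reference, by "almost default configuration" I mean essentially just setting client mode and the router endpoint. A rough sketch is below; the endpoint is a placeholder and the key names should be double-checked against the bridge's DEFAULT_CONFIG.json5.

```json5
{
  // Run the bridge as a client of an existing Zenoh router
  // (placeholder endpoint; replace with the router's actual address/port).
  mode: "client",
  connect: {
    endpoints: ["tcp/zenoh-router.example.svc:7447"],
  },
  plugins: {
    ros2dds: {
      // Left at defaults here: no allow/deny filtering, no rate limiting.
    },
  },
}
```

The same can presumably be done from the command line (something like `zenoh-bridge-ros2dds -m client -e tcp/<router>:7447`, if I recall the flags correctly).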

System info

Robot with a ROS 2 Humble container and the ros2dds bridge, stable latest version 1.0.0
Container on a Kubernetes cluster with ROS 2 Humble and the ros2dds bridge, stable latest version 1.0.0
Zenoh router, stable latest version 1.0.0

miltzhaw added the bug (Something isn't working) label on Oct 23, 2024
@miltzhaw (Author)

An update on this issue: I tested previous versions to identify the release in which this error first appears, and it seems to be 1.0.0-beta.4. Version 1.0.0-beta.3 does not show the specific error I described above.

@miltzhaw (Author)

As a follow-up: the issue seems to be related to the limited resources the Zenoh router had allocated as a container on Kubernetes. What are the minimum memory and CPU resources required? That would be helpful to know.

JEnoch (Member) commented Dec 3, 2024

> What are the minimum memory and CPU resources required? That would be helpful to know.

Hard to say, as it really depends on the number of routed DDS entities and on the amount of traffic to route.
We didn't do such a detailed characterization.
The bare minimum is an idle system, where the bridge uses ~10 MB of memory and 0.3% CPU (on macOS).
As far as I remember, on a Turtlebot3 (Raspberry Pi 3) the memory usage doesn't increase a lot, and the CPU usage remains <10%.

How did you find out that the resource allocation in Kubernetes was the cause?
What were the default settings, and how many resources did the bridge consume then?
Which resources did you increase for the bridge?
Your answers might help other users facing similar issues with Kubernetes.

miltzhaw (Author) commented Dec 4, 2024

We also didn't do a detailed analysis; it was more of a logical intuition.
In our setup the number of topics and the bandwidth used are significant, especially with depth camera topics, point clouds and so on. Since we are, for the moment, neither filtering topics nor limiting the publishing frequency in the Zenoh bridge configuration file, all topics are allowed, and we are aware that this is a lot of data.

We visualize the data in Rviz and/or Foxglove in containers connected to the same Zenoh router within the same Kubernetes cluster. We noticed that as soon as multiple Rviz/Foxglove instances were used, the error occurred, which made us think the router was not able to handle all the traffic. I therefore increased the CPU and RAM resources for the router (note: not for the bridges), and that solved the issue. To make sure it works, I use very high values for CPU (i.e., 8) and RAM (i.e., 25 GiB) for the router container. On the container with Foxglove and the bridge I now use 5 GiB of RAM and 2 CPUs, but our impression is that the key point is rather the resources for the Zenoh router.
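
For concreteness, the over-provisioned router settings above correspond roughly to the following container resources in the router's pod spec. This is only a sketch: the container name, image tag and the request values are placeholders, and only the limits reflect the values I mentioned.

```yaml
# Illustrative pod-spec excerpt for the Zenoh router container (placeholder names/values).
containers:
  - name: zenoh-router          # placeholder container name
    image: eclipse/zenoh:1.0.0  # assuming the official eclipse/zenoh image
    resources:
      requests:
        cpu: "2"      # placeholder request values
        memory: 4Gi
      limits:
        cpu: "8"      # deliberately over-provisioned, as described above
        memory: 25Gi
```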

Obviously, reducing the number of topics and the publishing frequency would help as well, but I was wondering whether there are some reference values we could use to be sure the router can handle all the traffic.
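
For illustration, this is roughly what such filtering and rate limiting would look like in the bridge configuration; the key names and regex values below are an assumption on my side and should be double-checked against the plugin's DEFAULT_CONFIG.json5.

```json5
{
  plugins: {
    ros2dds: {
      // Route only the interfaces matching these regexes; everything else is ignored.
      allow: {
        publishers: ["/tf", "/tf_static", "/map", ".*/scan"],
        subscribers: ["/cmd_vel"],
      },
      // Cap the routing frequency (in Hz) of high-bandwidth topics, per regex.
      pub_max_frequencies: [".*/points=2", ".*/image_raw=5"],
    },
  },
}
```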
