Federation hangs requesting time #2574
Comments
Is there a place you can post the code so that we could try to correlate the error on our side? The increasing time for each time grant is most likely caused by something outside the HELICS API (at least it has been every time I've seen it in my own federations); are there any likely places to look? We've had problems with federations hanging in the past, and they've usually had to do with making a time request for helicsTimeMaxtime and relying on incoming publications or messages to wake up the federate. Are you doing anything like that? (It looks like you are based on your log files, but I want to make sure.)
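To be concrete, the pattern I mean is roughly the following PyHELICS sketch (the config file name and loop body are placeholders, not your actual code):

```python
import helics as h

# Sketch of a federate that asks for "max time" and relies on incoming
# publications/messages to wake it up (the file name is hypothetical).
fed = h.helicsCreateValueFederateFromConfig("storage_federate.json")
h.helicsFederateEnterExecutingMode(fed)

granted = 0.0
while granted < h.HELICS_TIME_MAXTIME:
    # The broker grants a time earlier than HELICS_TIME_MAXTIME whenever
    # one of this federate's inputs or endpoints receives new data.
    granted = h.helicsFederateRequestTime(fed, h.HELICS_TIME_MAXTIME)
    # ... react to whatever arrived at `granted` ...

h.helicsFederateDisconnect(fed)
h.helicsFederateFree(fed)
```

A federate written this way only wakes up when something arrives, which is exactly the situation where a missed message can leave it blocked inside the time request.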
Yes, the storage federate does make requests for helicsTimeMaxtime. I'm not sure where I would look for something outside the HELICS API. I've added logging on either side of the time request call. Unfortunately I can't post the code right now; I am trying to make a smaller reproducer that I can share.
Regarding the federation hanging, which federate is making the request to helicsTimeMaxtime?
The one I called "storage federate" above.
Is this always at a particular time step, or does the time step where it hangs vary?
It is variable.
Is it possible to remove the endpoint from the grid federate without breaking the federation? If possible, try running without it and see if it hangs. My experience has been that endpoints are more troublesome when using max-time requests.
One possibility is to use the async call to request time, then check in a loop whether it has completed; if it takes too long, make a query to "root" for "global_time_debugging". The result sometimes gives additional information that can be used for diagnostics.
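A rough sketch of that diagnostic loop, assuming the Python API (the timeout and polling interval are arbitrary choices):

```python
import time
import helics as h

def request_time_with_debug(fed, requested, timeout=30.0):
    # Request time asynchronously; if it is not granted within `timeout`
    # seconds, query the root broker for timing diagnostics and keep waiting.
    h.helicsFederateRequestTimeAsync(fed, requested)
    waited = 0.0
    while not h.helicsFederateIsAsyncOperationCompleted(fed):
        time.sleep(0.1)
        waited += 0.1
        if waited > timeout:
            query = h.helicsCreateQuery("root", "global_time_debugging")
            print(h.helicsQueryExecute(query, fed))
            h.helicsQueryFree(query)
            waited = 0.0  # report again periodically while still blocked
    return h.helicsFederateRequestTimeComplete(fed)
```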
Also, like Trevor said, there are a few possible corner cases with endpoints in the timing with max time. Using a targeted endpoint might resolve them as well, if your communication always goes over the same path.
Okay, I will try to eliminate the endpoints and see if that fixes it. When you say targeted, do you mean setting the "destination" for the endpoint?
Yeah, that might work. It depends on whether you are defining them through the API or a config file.
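Roughly, the two API-side options look like this (the endpoint and federate names are made up, and I'm going from memory on the targeted-endpoint calls):

```python
import helics as h

# Illustrative only; the config file and endpoint names are hypothetical.
fed = h.helicsCreateCombinationFederateFromConfig("storage.json")

# Option 1: a regular endpoint with a pinned default destination.
ept = h.helicsFederateRegisterEndpoint(fed, "charge_cmd", "")
h.helicsEndpointSetDefaultDestination(ept, "grid/charge_cmd")

# Option 2 (HELICS 3): a targeted endpoint, so the timing system knows
# exactly which endpoints this one exchanges messages with.
# tept = h.helicsFederateRegisterTargetedEndpoint(fed, "charge_cmd", "")
# h.helicsEndpointAddDestinationTarget(tept, "grid/charge_cmd")
```

In a JSON config file, the equivalent of option 1 is setting a "destination" field on the endpoint entry, if I recall the schema correctly.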
I missed that there is actually an endpoint in the "storage federate" (configured via the API) that is currently unused (I write messages to it, but nothing is listening for them). Eliminating it completely resolved the issue.
If you're willing, it would still be helpful to have a stripped-down version of the code to help us find the bug in HELICS. HELICS shouldn't hang a federation under the conditions you've described, so this is still a bug we need to fix.
Yeah, I'm going to keep working on my small example. At some point I am going to need/use that endpoint. If I get it to reproduce the problem, I will update this issue (unless you'd rather close it and I can make a new one later).
Let's keep it all here; thanks!
We have had a few reports of things like this before; they have all been resolvable with some minor tweaks or traceable to a problem with the federate itself. There is likely some missed edge condition in the timing system that we are not handling correctly in HELICS, but reproducing the right conditions to debug it in a compact, repeatable way has so far been elusive.
Any further update on this?
Describe the bug
The federation appears to hang while requesting the next time. All of the federates have requested a next time, and the logs show the following message(s) continuously:
Based on the logs it looks like the amount of time taken to process each time request increases with each time grant:
What is the expected behavior?
A time should be granted to the federates. The federation works (doesn't hang) with an older version of HELICS (3.1.2.post8).
To Reproduce
The behavior occurs in a federation with 2 federates configured as follows:
The grid federate also configures the following subscriptions and publications for each storage device:
storage federate
Environment:
Installed via pip install helics
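For completeness, the installed library version can be confirmed with something like:

```python
import helics as h

# Print the version string of the underlying HELICS library.
print(h.helicsGetVersion())
```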
Additional context and information
The increasing number of log statements for each time request makes me think there is some kind of resource leak and that eventually a new time would be granted.