Federation hangs requesting time #2574

Open
wfvining-snl opened this issue Nov 9, 2023 · 17 comments
Labels
bug Issue concerning incorrect code operation

Comments

@wfvining-snl

Describe the bug
The federation appears to hang while requesting the next time. All of the federates have requested a next time, and the logs show the following messages continuously:

...
[2023-11-09 12:00:05.464] [console] [trace] S814 (131074)[1808]::processing command time_request:From (1) Time(1809, 1809, 1809) to (131074)
[2023-11-09 12:00:05.464] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131075) Time(1809, 2708, 1809) to (131074)
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (1) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131074) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 9223372036.854776, 9223372036.854776) to (131075) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131075) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131074) Time(1809, 1809, 1809) to (131074)
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (1) Time(1809, 1809, 1809) to (131074) from 1
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131075) Time(1809, 2708, 1809) to (131074) from 131075
[2023-11-09 12:00:05.465] [console] [trace] S814 (131074)[1808]::processing command time_request:From (1) Time(1809, 1809, 1809) to (131074)
[2023-11-09 12:00:05.465] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131075) Time(1809, 2708, 1809) to (131074)
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (1) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131075) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 9223372036.854776, 9223372036.854776) to (131074) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131075) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131074) Time(1809, 9223372036.854776, 9223372036.854776) to (131074)
...

Based on the logs, it looks like the amount of work needed to process each time request (measured here as the number of time_request commands logged per granted time) increases with each time grant:

$ grep "\[t=1\]::|| cmd:time_request" grid.log | wc -l
51
$ grep "\[t=2\]::|| cmd:time_request" grid.log | wc -l
104
$ grep "\[t=3\]::|| cmd:time_request" grid.log | wc -l
72
$ grep "\[t=4\]::|| cmd:time_request" grid.log | wc -l
72753
$ grep "\[t=904\]::|| cmd:time_request" grid.log | wc -l
23102
$ grep "\[t=905\]::|| cmd:time_request" grid.log | wc -l
41955
$ grep "\[t=906\]::|| cmd:time_request" grid.log | wc -l
98845
$ grep "\[t=907\]::|| cmd:time_request" grid.log | wc -l
141920
$ grep "\[t=908\]::|| cmd:time_request" grid.log | wc -l
1054678

What is the expected behavior?
A time should be granted to the federates. The federation works (doesn't hang) with an older version of HELICS (3.1.2.post8).

To Reproduce
The behavior occurs in a federation with 2 federates configured as follows:

{
  "name": "grid",
  "core_name": "grid_core",
  "log_level": "trace",
  "core_type": "zmq",
  "period": 1.0,
  "subscriptions": [],
  "publications": [
    {
      "key": "total_power",
      "type":  "complex",
      "units": "kW"
    }
  ],
  "endpoints": [
    {
      "name": "reliability"
    },
    {
      "name": "metrics"
    }
  ]
}

The grid federate also configures the following subscriptions and publications for each storage device:

        # Per-device setup in the grid federate (pyhelics object API);
        # HelicsDataType comes from the helics package.
        self._power_sub = federate.register_subscription(
            f"{device.name}/power", units="kW"
        )
        self._power_sub.set_default(complex(0.0, 0.0))
        self._voltage_pub = federate.register_publication(
            f"storage.{device.name}.voltage",
            kind=HelicsDataType.DOUBLE,
            units="pu"
        )
        self._soc_pub = federate.register_publication(
            f"storage.{device.name}.soc",
            kind=HelicsDataType.DOUBLE,
            units=""
        )
        self._power_pub = federate.register_publication(
            f"storage.{device.name}.power",
            kind=HelicsDataType.COMPLEX,
            units="kW"
        )

The storage federate is configured as follows:

{
  "log_level": "trace",
  "core_type": "zmq",
  "subscriptions": [],
  "publications": [
    {
      "key": "power",
      "unit": "kW",
      "type": "complex"
    }
  ]
}

Environment (please complete the following information):

  • Operating System: Windows
  • Language Extension: Python
  • what compiler or setup process did you use: pip install helics
  • HELICS version:
$ helics --version
helics, version v3.4.0

Python HELICS version v3.4.0

HELICS Library version 3.4.0 (2023-01-19)

Additional context and information
The increasing number of log statements for each time request makes me think there is some kind of resource leak and that eventually a new time would be granted.

@wfvining-snl wfvining-snl added the bug Issue concerning incorrect code operation label Nov 9, 2023
@trevorhardy
Contributor

Is there a place you can post the code so that we could try to reproduce the error on our side?

The increasing time for each time grant is most likely something outside the HELICS API (at least it has been every time I've seen that in my own federations); are there any likely places to look?

We've had problems with federations hanging in the past, and they've usually had to do with making a time request for helicsTimeMaxtime and relying on incoming publications or messages to wake up the federate. Are you doing anything like that? (It looks like you are, based on your log files, but I want to make sure.)

@wfvining-snl
Author

Yes, the storage federate does make requests for helics_time_maxtime and relies on input from the other federate to wake it up. But the "grid" federate makes requests for times less than the maxtime.
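
For context, a minimal sketch of that wake-on-input pattern, assuming the pyhelics C-style API (fed and ep are placeholder handles, not names taken from the configs above):

import helics as h

# Placeholder handles: fed is an executing message federate, ep one of its endpoints.
# Request the maximum time and let an incoming message wake the federate early.
granted = h.helicsFederateRequestTime(fed, h.HELICS_TIME_MAXTIME)

# If a message arrives, the request returns at the message's delivery time.
while h.helicsEndpointHasMessage(ep):
    msg = h.helicsEndpointGetMessage(ep)
    # ... handle msg, then request HELICS_TIME_MAXTIME again on the next pass ...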

I'm not sure where I would look for something outside the HELICS API. I've added logging on either side of the call to Federate.request_time() and it simply never returns. What should I be looking for that might cause this?

Unfortunately I can't post the code right now. I am trying to make a smaller reproducer that I can share.

@trevorhardy
Contributor

Regarding the federation hanging, which federate is making the request to helics_time_maxtime?

@wfvining-snl
Author

The one I called "storage federate" above.

@phlptp
Member

phlptp commented Nov 10, 2023

Is this always at a particular time step, or does the time step it hangs at vary?

@wfvining-snl
Author

It is variable.

@trevorhardy
Contributor

Is it possible to remove the endpoint from the grid federate without breaking the federation? If so, try running without it and see if it still hangs. My experience has been that endpoints are more troublesome when using helics_time_maxtime.

@phlptp
Member

phlptp commented Nov 10, 2023

One possibility is to use the async call to request time, then check in a loop whether it has completed; if it takes too long, make a query to "root" for "global_time_debugging". The result sometimes gives additional information that can be used for diagnostics.
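
A minimal sketch of that diagnostic pattern, assuming the pyhelics C-style API (fed and next_time are placeholder names, and the 30-second patience threshold is arbitrary):

import time
import helics as h

# Placeholder names: fed is the federate handle, next_time the desired next time.
h.helicsFederateRequestTimeAsync(fed, next_time)

deadline = time.monotonic() + 30.0  # arbitrary patience threshold
while not h.helicsFederateIsAsyncOperationCompleted(fed):
    if time.monotonic() > deadline:
        # Ask the root broker for its view of the global timing state.
        query = h.helicsCreateQuery("root", "global_time_debugging")
        print(h.helicsQueryExecute(query, fed))
        deadline = time.monotonic() + 30.0
    time.sleep(0.5)

granted = h.helicsFederateRequestTimeComplete(fed)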

@phlptp
Member

phlptp commented Nov 10, 2023

Also, as Trevor said, there are a few possible corner cases with endpoints in the timing around max time. Using a targeted endpoint might resolve them as well, if your communication always goes over the same path.

@wfvining-snl
Author

Okay, I will try to eliminate the endpoints and see if that fixes it. When you say targeted, do you mean setting the "destination" for the endpoint?

@phlptp
Member

phlptp commented Nov 10, 2023

Yeah, that might work. It depends on whether you are defining them through the API or a config file.
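
For illustration, a rough sketch of both options through the pyhelics C-style API in recent HELICS 3.x releases (fed is a placeholder federate handle; the endpoint names follow the grid config, but the destination names are hypothetical):

import helics as h

# Option 1: a regular endpoint with a default destination set via the API.
ep = h.helicsFederateRegisterEndpoint(fed, "reliability", "")
h.helicsEndpointSetDefaultDestination(ep, "storage/reliability")

# Option 2: a targeted endpoint, which declares its destination up front so the
# timing system can track the dependency along that path.
tep = h.helicsFederateRegisterTargetedEndpoint(fed, "metrics", "")
h.helicsEndpointAddDestinationTarget(tep, "storage/metrics")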

@wfvining-snl
Author

I missed that there is actually an endpoint in the "storage federate" (configured via the API) that is currently unused (I write messages to it, but nothing is listening for them). Eliminating it completely resolved the issue.

@trevorhardy
Contributor

If you're willing, it would still be helpful to have a stripped-down version of the code to help us find the bug in HELICS. HELICS shouldn't hang a federation under the conditions you've described, so this is still a bug we need to fix.

@wfvining-snl
Author

Yeah, I'm going to keep working on my small example. At some point I am going to need/use that endpoint. If I get it to reproduce the problem I will update this issue (unless you'd rather close it and I can make a new one later).

@trevorhardy
Contributor

Let's keep it all here; thanks!

@phlptp
Member

phlptp commented Nov 10, 2023

We have had a few reports of things like this before; they have all been resolvable with some minor tweaks or traceable to a problem with the federate itself. There is likely some missed edge condition in the timing system that we are not handling correctly in HELICS, but reproducing the right conditions to debug it in a compact, repeatable way has so far been very elusive.

@trevorhardy
Contributor

Any further update on this?
