Federation hangs requesting time #2574

Open
wfvining-snl opened this issue Nov 9, 2023 · 17 comments
Labels
bug Issue concerning incorrect code operation

Comments

@wfvining-snl

Describe the bug
The federation appears to hang while requesting the next time. All of the federates have requested a next time, and the logs show the following messages continuously:

...
[2023-11-09 12:00:05.464] [console] [trace] S814 (131074)[1808]::processing command time_request:From (1) Time(1809, 1809, 1809) to (131074)
[2023-11-09 12:00:05.464] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131075) Time(1809, 2708, 1809) to (131074)
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (1) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131074) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 9223372036.854776, 9223372036.854776) to (131075) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131075) from 131074
[2023-11-09 12:00:05.464] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131074) Time(1809, 1809, 1809) to (131074)
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (1) Time(1809, 1809, 1809) to (131074) from 1
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131075) Time(1809, 2708, 1809) to (131074) from 131075
[2023-11-09 12:00:05.465] [console] [trace] S814 (131074)[1808]::processing command time_request:From (1) Time(1809, 1809, 1809) to (131074)
[2023-11-09 12:00:05.465] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131075) Time(1809, 2708, 1809) to (131074)
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (1) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131075) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 9223372036.854776, 9223372036.854776) to (131074) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814_core (1879048194)[t=1808]::|| cmd:time_request:From (131074) Time(1809, 1809, 1809) to (131075) from 131074
[2023-11-09 12:00:05.465] [console] [trace] S814 (131074)[1808]::processing command time_request:From (131074) Time(1809, 9223372036.854776, 9223372036.854776) to (131074)
...

Based on the logs, it looks like the amount of work needed to process each time request (measured here as the number of time_request commands logged per granted time) increases with each time grant:

$ grep "\[t=1\]::|| cmd:time_request" grid.log | wc -l
51
$ grep "\[t=2\]::|| cmd:time_request" grid.log | wc -l
104
$ grep "\[t=3\]::|| cmd:time_request" grid.log | wc -l
72
$ grep "\[t=4\]::|| cmd:time_request" grid.log | wc -l
72753
$ grep "\[t=904\]::|| cmd:time_request" grid.log | wc -l
23102
$ grep "\[t=905\]::|| cmd:time_request" grid.log | wc -l
41955
$ grep "\[t=906\]::|| cmd:time_request" grid.log | wc -l
98845
$ grep "\[t=907\]::|| cmd:time_request" grid.log | wc -l
141920
$ grep "\[t=908\]::|| cmd:time_request" grid.log | wc -l
1054678

What is the expected behavior?
A time should be granted to the federates. The federation works (doesn't hang) with an older version of HELICS (3.1.2.post8).

To Reproduce
The behavior occurs in a federation with 2 federates configured as follows:

{
  "name": "grid",
  "core_name": "grid_core",
  "log_level": "trace",
  "core_type": "zmq",
  "period": 1.0,
  "subscriptions": [],
  "publications": [
    {
      "key": "total_power",
      "type":  "complex",
      "units": "kW"
    }
  ],
  "endpoints": [
    {
      "name": "reliability"
    },
    {
      "name": "metrics"
    }
  ]
}

The grid federate also configures the following subscriptions and publications for each storage device:

        # Per-device setup in the grid federate (pyhelics object API);
        # HelicsDataType comes from the helics package.
        self._power_sub = federate.register_subscription(
            f"{device.name}/power", units="kW"
        )
        self._power_sub.set_default(complex(0.0, 0.0))
        self._voltage_pub = federate.register_publication(
            f"storage.{device.name}.voltage",
            kind=HelicsDataType.DOUBLE,
            units="pu"
        )
        self._soc_pub = federate.register_publication(
            f"storage.{device.name}.soc",
            kind=HelicsDataType.DOUBLE,
            units=""
        )
        self._power_pub = federate.register_publication(
            f"storage.{device.name}.power",
            kind=HelicsDataType.COMPLEX,
            units="kW"
        )

The storage federate is configured as follows:

{
  "log_level": "trace",
  "core_type": "zmq",
  "subscriptions": [],
  "publications": [
    {
      "key": "power",
      "unit": "kW",
      "type": "complex"
    }
  ]
}

Environment (please complete the following information):

  • Operating System: Windows
  • Language Extension: Python
  • what compiler or setup process did you use: pip install helics
  • HELICS version:
$ helics --version
helics, version v3.4.0

Python HELICS version v3.4.0

HELICS Library version 3.4.0 (2023-01-19)

Additional context and information
The increasing number of log statements for each time request makes me think there is some kind of resource leak and that eventually a new time would be granted.

@wfvining-snl wfvining-snl added the bug Issue concerning incorrect code operation label Nov 9, 2023
@trevorhardy
Contributor

Is there a place you can post the code so that we could try to reproduce the error on our side?

The increasing time for each time grant is most likely something outside the HELICS API (at least it has been every time I've seen that in my own federations); are there any likely places to look?

We've had problems with federations hanging in the past, and they've usually had to do with making a time request for helicsTimeMaxtime and relying on incoming publications or messages to wake up the federate. Are you doing anything like that? (It looks like you are, based on your log files, but I want to make sure.)

@wfvining-snl
Author

Yes, the storage federate does make requests for helics_time_maxtime and relies on input from the other federate to wake it up. But the "grid" federate makes requests for times less than the maxtime.
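
For context, a minimal sketch of that wake-on-input pattern, assuming the pyhelics C-style API (fed and ep are placeholder handles, not names taken from the configs above):

import helics as h

# Placeholder handles: fed is an executing message federate, ep one of its endpoints.
# Request the maximum time and let an incoming message wake the federate early.
granted = h.helicsFederateRequestTime(fed, h.HELICS_TIME_MAXTIME)

# If a message arrives, the request returns at the message's delivery time.
while h.helicsEndpointHasMessage(ep):
    msg = h.helicsEndpointGetMessage(ep)
    # ... handle msg, then request HELICS_TIME_MAXTIME again on the next pass ...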

I'm not sure where I would look for something outside the HELICS API. I've added logging on either side of the call to Federate.request_time() and it simply never returns. What should I be looking for that might cause this?

Unfortunately I can't post the code right now. I am trying to make a smaller reproducer that I can share.

@trevorhardy
Contributor

Regarding the federation hanging, which federate is making the request to helics_time_maxtime?

@wfvining-snl
Author

The one I called "storage federate" above.

@phlptp
Member

phlptp commented Nov 10, 2023

Is this always at a particular time step, or does the time step it hangs at vary?

@wfvining-snl
Author

It is variable.

@trevorhardy
Contributor

Is it possible to remove the endpoint from the grid federate without breaking the federation? If so, try running without it and see if it still hangs. My experience has been that endpoints are more troublesome when using helics_time_maxtime.

@phlptp
Member

phlptp commented Nov 10, 2023

One possibility is to use the async call to request time, then check in a loop whether it has completed; if it takes too long, make a query to "root" for "global_time_debugging". The result sometimes gives additional information that can be used for diagnostics.
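
A minimal sketch of that diagnostic pattern, assuming the pyhelics C-style API (fed and next_time are placeholder names, and the 30-second patience threshold is arbitrary):

import time
import helics as h

# Placeholder names: fed is the federate handle, next_time the desired next time.
h.helicsFederateRequestTimeAsync(fed, next_time)

deadline = time.monotonic() + 30.0  # arbitrary patience threshold
while not h.helicsFederateIsAsyncOperationCompleted(fed):
    if time.monotonic() > deadline:
        # Ask the root broker for its view of the global timing state.
        query = h.helicsCreateQuery("root", "global_time_debugging")
        print(h.helicsQueryExecute(query, fed))
        deadline = time.monotonic() + 30.0
    time.sleep(0.5)

granted = h.helicsFederateRequestTimeComplete(fed)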

@phlptp
Member

phlptp commented Nov 10, 2023

Also, as Trevor said, there are a few possible corner cases with endpoints in the timing around max time. Using a targeted endpoint might resolve them as well, if your communication always goes over the same path.

@wfvining-snl
Author

Okay, I will try to eliminate the endpoints and see if that fixes it. When you say targeted, do you mean setting the "destination" for the endpoint?

@phlptp
Member

phlptp commented Nov 10, 2023

Yeah, that might work. It depends on whether you are defining them through the API or a config file.
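
For illustration, a rough sketch of both options through the pyhelics C-style API in recent HELICS 3.x releases (fed is a placeholder federate handle; the endpoint names follow the grid config, but the destination names are hypothetical):

import helics as h

# Option 1: a regular endpoint with a default destination set via the API.
ep = h.helicsFederateRegisterEndpoint(fed, "reliability", "")
h.helicsEndpointSetDefaultDestination(ep, "storage/reliability")

# Option 2: a targeted endpoint, which declares its destination up front so the
# timing system can track the dependency along that path.
tep = h.helicsFederateRegisterTargetedEndpoint(fed, "metrics", "")
h.helicsEndpointAddDestinationTarget(tep, "storage/metrics")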

@wfvining-snl
Author

I missed that there is actually an endpoint in the "storage federate" (configured via the API) that is currently unused (I write messages to it, but nothing is listening for them). Eliminating it completely resolved the issue.

@trevorhardy
Contributor

If you're willing, it would still be helpful to have a stripped-down version of the code to help us find the bug in HELICS. HELICS shouldn't hang a federation under the conditions you've described, so this is still a bug we need to fix.

@wfvining-snl
Author

Yeah, I'm going to keep working on my small example. At some point I am going to need/use that endpoint. If I get it to reproduce the problem I will update this issue (unless you'd rather close it and I can make a new one later).

@trevorhardy
Contributor

Let's keep it all here; thanks!

@phlptp
Member

phlptp commented Nov 10, 2023

We have had a few reports of things like this before; they have all been resolvable with some minor tweaks or traceable to a problem with the federate itself. There is likely some missed edge condition in the timing system that we are not handling correctly in HELICS, but reproducing the right conditions to debug it in a compact, repeatable way has so far been very elusive.

@trevorhardy
Contributor

Any further update on this?
