Cell timeout error #1815
Hello, I've been able to run the 2.3 hello-pt example in POC mode. However, when I switch to a secure provisioned environment across 3 machines running Docker containers (2 clients, each on a separate machine, plus 1 server and 1 admin on the same machine), I see errors. I am using the example project.yml with the dummy overseer (with the server name replaced by my server's FQDN) and the default local files.
I see the following error after clients connect to the server:
Then, when I submit a job, I see this message:
I've confirmed that on job submission the clients return True here.
If I downgrade to NVFLARE 2.2, I am able to run the 2.2 version of hello-pt in my setup. What am I doing wrong?
Replies: 2 comments 12 replies
@KSchmidtACR can you share the project.yml file? I suspect that something is wrong in the deployment configuration. In the current POC mode, "admin" is the default user name for FLARE Console (Admin Console) access. In production mode, the default user name is "admin@nvidia.com".
So I suspect the problem has something to do with your configuration, particularly the participants section.
The error message indicates that the sites did not have enough resources to run the job; the job requires a minimum of 2 sites that meet the resource requirement. Can you check the job config to see whether it specifies a resource requirement that the sites cannot satisfy? You can also run the admin command "list_jobs -d 0cd407a7-06ad-4a57-a758-3f47acfa5a0e" to see more details of the job's scheduling history.
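For anyone checking the job config, here is a minimal sketch of how one might inspect it. It assumes the job's meta.json uses the `min_clients` and `resource_spec` fields as in the hello-pt example; the helper name `summarize_job_meta` and the sample values are made up for illustration, so compare against your own file.

```python
import json

# Sample content modeled on the hello-pt job's meta.json (assumed layout;
# your own job folder may differ).
EXAMPLE_META = """
{
  "name": "hello-pt",
  "resource_spec": {},
  "min_clients": 2,
  "deploy_map": {"app": ["@ALL"]}
}
"""


def summarize_job_meta(meta: dict) -> dict:
    """Pull out the fields that affect whether the job can be scheduled.

    Hypothetical helper, not part of NVFLARE.
    """
    resource_spec = meta.get("resource_spec") or {}
    return {
        "min_clients": meta.get("min_clients"),
        # Sites that declare an explicit resource requirement.
        "sites_with_requirements": sorted(resource_spec.keys()),
    }


summary = summarize_job_meta(json.loads(EXAMPLE_META))
print(summary)
```

With an empty `resource_spec`, as in the sample above, no site declares a requirement, so a resource shortage reported by the scheduler would have to come from somewhere else.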
@KSchmidtACR
An update from our side: we have identified the issue and are working on a proper fix.
We will post further updates in #1821.
In the meantime, a temporary workaround (if feasible on your side) is to synchronize the clocks on the server and client machines so that they differ by only a few seconds rather than 30 minutes.
Thanks!
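A rough way to check the clock difference mentioned above, as a sketch only: collect a UTC timestamp from each machine (for example with `date -u`) and compare them. The 10-second tolerance and the function names below are assumptions for illustration, not NVFLARE settings.

```python
from datetime import datetime, timezone

# Hypothetical tolerance; the thread only says "some seconds".
MAX_SKEW_SECONDS = 10.0


def clock_skew_seconds(server_time: datetime, client_time: datetime) -> float:
    """Absolute difference between two timezone-aware timestamps, in seconds."""
    return abs((client_time - server_time).total_seconds())


def clocks_in_sync(server_time: datetime, client_time: datetime,
                   max_skew: float = MAX_SKEW_SECONDS) -> bool:
    """True if the two clocks differ by no more than max_skew seconds."""
    return clock_skew_seconds(server_time, client_time) <= max_skew


# Illustrative timestamps reproducing a 30-minute gap.
server = datetime(2023, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
client = datetime(2023, 6, 1, 12, 30, 0, tzinfo=timezone.utc)

print(clock_skew_seconds(server, client))  # 1800.0
print(clocks_in_sync(server, client))      # False
```

If the check fails, syncing each host against the same time source (NTP, for instance) should bring the difference down to the range of seconds.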