Cell timeout error #1815

KSchmidtACR · 2023-06-15T21:00:46Z

Hello,

I've been able to run the 2.3 hello-pt example in POC mode. However, when I switch to a securely provisioned environment running as Docker containers across 3 machines (2 clients, each on its own machine, plus 1 server and 1 admin sharing the third machine), I see errors. I am using the example project.yml with the dummy overseer (with my server FQDN substituted in) and the default local files.

I see the following error after clients connect to the server: 2023-06-15 20:20:19,840 - Cell - ERROR - [ME=server O=? D=site-1 F=? T=? CH=admin TP=admin] timeout on Request dce30617-7add-4004-b677-770a3ea6a879 for ['admin'] after 15 secs .

Then, when I submit a job I see this message 2023-06-15 20:19:20,728 - DefaultJobScheduler - INFO - [identity=example_project, run=?]: Try to schedule job 0cd407a7-06ad-4a57-a758-3f47acfa5a0e, get result: (not enough sites have enough resources (ok sites 0 < min sites 2).

I've confirmed that on job submission the clients return True here

If I downgrade to NVFLARE 2.2, I am able to run the 2.2 version of hello-pt in the same setup. What am I doing wrong?

chesterxgchen · 2023-06-17T18:23:50Z

Maintainer

@KSchmidtACR can you share the project.yml file? I suspect something is wrong in the deployment configuration.

In the current POC mode, "admin" is the default user name for the FLARE Console (Admin Console). In production mode, the default user name is "admin@nvidia.com".
The log shows

...['admin'] after 15 secs  ...

so I suspect the problem is in your configuration, particularly in the participants section.

1 reply

KSchmidtACR Jun 20, 2023
Author

Sure:

api_version: 3
name: example_project
description: NVIDIA FLARE sample project yaml file

participants:
  - name: myFQDN
    type: server
    org: nvidia
    fed_learn_port: 8002
    admin_port: 8003
  - name: site-1
    type: client
    org: nvidia
  - name: site-2
    type: client
    org: nvidia
  - name: admin@nvidia.com
    type: admin
    org: nvidia
    role: project_admin

builders:
  - path: nvflare.lighter.impl.workspace.WorkspaceBuilder
    args:
      template_file: master_template.yml
  - path: nvflare.lighter.impl.template.TemplateBuilder
  - path: nvflare.lighter.impl.static_file.StaticFileBuilder
    args:
      config_folder: config
      docker_image: my-docker-v23
      overseer_agent:
        path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent
        overseer_exists: false
        args:
          sp_end_point: myFQDN:8002:8003
  - path: nvflare.lighter.impl.cert.CertBuilder
  - path: nvflare.lighter.impl.signature.SignatureBuilder
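
For anyone checking a similar configuration: with the dummy overseer, the `sp_end_point` value is expected to line up with the server participant as SERVER_NAME:FL_PORT:ADMIN_PORT. The sketch below is only an illustration of that consistency check; `expected_sp_end_point` is a hypothetical helper, not part of NVFlare, and the values are hard-coded from the project.yml above rather than parsed from the file.

```python
# Relevant values from the project.yml above, hard-coded so the sketch
# needs no YAML parser.
server = {"name": "myFQDN", "fed_learn_port": 8002, "admin_port": 8003}
sp_end_point = "myFQDN:8002:8003"


def expected_sp_end_point(server_entry: dict) -> str:
    """Hypothetical helper: build SERVER_NAME:FL_PORT:ADMIN_PORT from the server participant."""
    return (
        f"{server_entry['name']}:"
        f"{server_entry['fed_learn_port']}:"
        f"{server_entry['admin_port']}"
    )


matches = sp_end_point == expected_sp_end_point(server)
print("sp_end_point matches server entry:", matches)
```

In this thread the two do match, so this part of the configuration is consistent.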

yhwen · 2023-06-20T14:34:54Z

Maintainer

I see the following error after clients connect to the server: 2023-06-15 20:20:19,840 - Cell - ERROR - [ME=server O=? D=site-1 F=? T=? CH=admin TP=admin] timeout on Request dce30617-7add-4004-b677-770a3ea6a879 for ['admin'] after 15 secs .

Then, when I submit a job I see this message 2023-06-15 20:19:20,728 - DefaultJobScheduler - INFO - [identity=example_project, run=?]: Try to schedule job 0cd407a7-06ad-4a57-a758-3f47acfa5a0e, get result: (not enough sites have enough resources (ok sites 0 < min sites 2).

The error message indicates that the sites didn't have enough resources to run the job, and the job requires a minimum of 2 sites that meet the resource requirement. Can you check the job config to see whether the sites are missing any required resource? You can also run the admin command "list_jobs -d 0cd407a7-06ad-4a57-a758-3f47acfa5a0e" to see more details of the job's scheduling history.

11 replies

YuanTingHsieh Jun 21, 2023
Maintainer

@KSchmidtACR

Thanks for the debugging log, that helps a lot.

I am assuming you are using "myFQDN" to represent "flpilot001vm01".

On reading the log, I found that:

2023-06-20 18:41:43,015 - Cell - ERROR - [ME=server O=? D=site-1 F=? T=? CH=admin TP=admin] timeout on Request f0b46dfc-5215-4152-b06b-4f4229dd14e6 for ['admin'] after 15 secs

Comparing the server and client logs further, the root cause is that the server-side and client-side timestamps differ.

server:

2023-06-20 19:48:15,089 - Cell - DEBUG - server: set up waiter 73c1427d-687c-447f-befa-3f275537da62 to wait for 15 secs

client:

2023-06-20 20:12:18,877 - Cell - DEBUG - site-1: received message: {'cn__topic': 'admin', 'cn__channel': 'admin', 'cn__destination': 'site-1', 'cn__req_id': '73c1427d-687c-447f-befa-3f275537da62', 'cn__reply_expected': True, 'cn__wait_until': 1687290510.0878868, 'cn__optional': False, 'cn__origin': 'server', 'cn__from': 'server', 'cn__msg_type': 'req', 'cn__route': [['server', 1687290495.0881503]], 'cn__to': 'site-1', 'cn__payload_encoding': 'fobs', 'cn__send_time': 1687290495.0882032}
2023-06-20 20:12:18,878 - Cell - DEBUG - site-1: processing incoming request
2023-06-20 20:12:18,878 - Cell - DEBUG - site-1: calling registered request CB
2023-06-20 20:12:18,878 - Cell - DEBUG - site-1: calling CB _dispatch_request
2023-06-20 20:12:18,878 - GPUResourceManager - DEBUG - [identity=41c6e867-7691-492c-bf00-ab4ef0f99968, run=?]: reserving resources: {} for requirements {}.
2023-06-20 20:12:18,879 - GPUResourceManager - DEBUG - [identity=41c6e867-7691-492c-bf00-ab4ef0f99968, run=?]: current resources: {}, reserved_resources {'492d5744-acf3-4287-a076-62bd0984cdf0': ({}, 5), '489bb50e-54c8-4cd2-838e-0104de25d312': ({}, 30)}.
2023-06-20 20:12:18,879 - Cell - DEBUG - site-1: don't send response - reply is too late

I think we have a bug in our code at https://github.com/NVIDIA/NVFlare/blob/dev/nvflare/fuel/f3/cellnet/cell.py#L1828-L1835

The client clock (2023-06-20 20:12:18,879) is far ahead of the server clock (2023-06-20 19:48:15,089), so when the client receives the request it immediately considers it timed out.
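
To make the failure mode concrete, here is a minimal sketch. It is not the NVFlare implementation: `is_too_late` is a hypothetical stand-in for the linked check, assumed to compare the sender's absolute `cn__wait_until` deadline against the receiver's own clock. The two timestamps are copied from the site-1 log above; the skew is the approximate gap between the server and client log lines, and the transit time is an assumed value.

```python
# Values copied from the site-1 debug log; both were stamped by the server's clock.
SEND_TIME = 1687290495.0882032   # cn__send_time
WAIT_UNTIL = 1687290510.0878868  # cn__wait_until (send time plus the ~15 s timeout)

# Approximate gap between the server log line (19:48:15,089)
# and the client log line (20:12:18,877) for the same request.
CLOCK_SKEW_SECS = 24 * 60 + 3.788

NETWORK_DELAY_SECS = 0.1  # assumed transit time, for illustration only


def is_too_late(receiver_now: float, wait_until: float) -> bool:
    """Hypothetical check: treat the sender's deadline as valid on the receiver's clock."""
    return receiver_now > wait_until


# Clocks in sync: the request arrives well inside the 15 s window.
in_sync = is_too_late(SEND_TIME + NETWORK_DELAY_SECS, WAIT_UNTIL)

# Client clock ~24 minutes ahead: the same request looks expired on arrival.
skewed = is_too_late(SEND_TIME + NETWORK_DELAY_SECS + CLOCK_SKEW_SECS, WAIT_UNTIL)

print("too late with synced clocks:", in_sync)
print("too late with skewed clocks:", skewed)
```

Under these assumptions the client drops every reply, which would also explain the scheduler reporting "ok sites 0": no site ever answers the resource check in time.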

@nvidianz @chesterxgchen @yhwen I think we need to fix this as different systems/machines will have different time.time() values.
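
One common way to make such a check independent of wall-clock differences is to send a duration instead of an absolute deadline and measure elapsed time only on the receiver's own monotonic clock. The sketch below illustrates that general idea; it is an assumption about a possible approach, not a description of the actual fix, and `remaining_budget` is a hypothetical helper.

```python
import time


def remaining_budget(timeout_secs: float, received_at: float, now: float) -> float:
    """Seconds left to reply, measured only on the receiver's own clock."""
    return timeout_secs - (now - received_at)


# The sender ships a duration (15 s) rather than an absolute wall-clock deadline.
timeout_secs = 15.0

received_at = time.monotonic()  # receiver notes arrival on its own clock
# ... request handling would happen here ...
now = time.monotonic()

budget = remaining_budget(timeout_secs, received_at, now)
can_reply = budget > 0
print("can still reply:", can_reply)
```

The trade-off is that transit time is not counted against the budget, so the sender still needs its own local timeout on the waiting side.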

KSchmidtACR Jun 21, 2023
Author

You are correct that "myFQDN" represents "flpilot001vm01." If this issue is caused by NVFLARE 2.3 not properly handling differing timestamps across machines, does it make sense that NVFLARE 2.2 did not have this issue?

yhwen Jun 21, 2023
Maintainer

NVFlare 2.3 changed how the server and clients communicate. This issue does not apply to NVFlare 2.2.

YuanTingHsieh Jun 21, 2023
Maintainer

@KSchmidtACR
An update from our side: we have identified the issue and we are working on a proper fix.
We will keep updating in #1821

Meanwhile, a temporary workaround (if doable on your side) is to synchronize the clocks on the server and client machines, so that they differ by only a few seconds instead of 30 minutes.
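
A quick way to estimate how far apart the clocks are is to compare the log timestamps for the same request on both sides. The sketch below uses the two lines quoted earlier for request 73c1427d-687c-447f-befa-3f275537da62; it assumes both logs are written in the same time zone and that network latency is negligible, and `log_time` is just an illustrative helper.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # matches "2023-06-20 19:48:15,089"
REQUEST_TIMEOUT_SECS = 15  # the "after 15 secs" timeout from the error message


def log_time(stamp: str) -> datetime:
    """Parse a timestamp in the format used by the logs quoted in this thread."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


server_sent = log_time("2023-06-20 19:48:15,089")      # server: set up waiter
client_received = log_time("2023-06-20 20:12:18,877")  # site-1: received message

skew_secs = (client_received - server_sent).total_seconds()
print(f"apparent clock skew: {skew_secs:.3f} s")
print("skew exceeds request timeout:", skew_secs > REQUEST_TIMEOUT_SECS)
```

If the apparent skew is larger than the request timeout, the client will keep discarding replies until the clocks are brought back in line.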

Thanks!

Answer selected by YuanTingHsieh

YuanTingHsieh Aug 25, 2023
Maintainer

It is fixed in #1942

Will be included in the next release (2.4)
