-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rdma: add separate bounce buffer freelist for data (eager) messages #614
base: master
Are you sure you want to change the base?
Conversation
19af9ec
to
512366e
Compare
512366e
to
6cbe500
Compare
The CI failure(s) are real:
Need to fix up some asserts. |
6cbe500
to
3494556
Compare
3494556
to
bf176e6
Compare
Rebased on master |
bot:aws:retest |
1 similar comment
bot:aws:retest |
5f22617
to
bb2483a
Compare
Rebase on master |
* Minimum ctrl recv buffers posted per rail. The plugin will attempt to post | ||
* more buffers if we dip below this threshold, allocating new buffers if needed. | ||
*/ | ||
OFI_NCCL_PARAM_INT(rdma_min_posted_ctrl_recv_buffers, "RDMA_MIN_POSTED_CTRL_RECV_BUFFERS", 64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a function out max outstanding requests, which today we have at 128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a bit tricky, because the outstanding requests max (NCCL_OFI_MAX_REQUESTS
) is
- A per-communicator value, whereas this parameter is a
per-endpointper-rail value, so we'd have to multiply this parameter by the number of communicators, which we don't know in advance - Much higher than NCCL actually keeps in flight typically, so setting this param to
NCCL_OFI_MAX_REQUESTS x num_comms
will be overkill.
That said, we may need to tune this value, but the current value in this PR is already significantly higher than what is in master today (which is 16-32 per rail, and shared between eager and ctrl recv buffers).
bb2483a
to
e5452be
Compare
Separate out bounce buffer freelists into a smaller-sized freelist for control messages and a larger size for data (eager) messages Also refactor bounce buffer freelists to be per-rail instead of shared across all rails. Parameters `min_posted_bounce_buffers` and `max_posted_bounce_buffers` are now per-rail values. Split parameters for posted ctrl and eager buffers. These are now separately configurable. Set a higher count for ctrl buffers. Signed-off-by: Eric Raut <eraut@amazon.com>
e5452be
to
7f36f35
Compare
Rebased on master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right approach. Now would be a great time to clean up some code, but this only makes the code more confusing.
Bounce buffers for eager messages and control messages are different things. Have different structures with different names so that there's no reason to figure out what's what. I'm also not sure I understand the desire to move the freelists from the endpoint into the per-rail structures, given that most of the data is really operations that will be per-endpoint.
@@ -239,18 +239,31 @@ OFI_NCCL_PARAM_INT(disable_dmabuf, "DISABLE_DMABUF", 0); | |||
OFI_NCCL_PARAM_UINT(min_stripe_size, "MIN_STRIPE_SIZE", (64 * 1024)); | |||
|
|||
/* | |||
* Minimum bounce buffers posted per endpoint. The plugin will attempt to post | |||
* Minimum ctrl recv buffers posted per rail. The plugin will attempt to post |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per what rail? device, endpoint, etc?
They are different things, but there is a lot of code that is shared between them (e.g., all of the buffer pool management code, posting receives). We should rename these structures to something more generic (like receive buffers instead of bounce buffers [which I wanted to do in a followup patch]), but I think duplicating all the buffer management code is overkill.
I'm confused... isn't having the freelists per-rail exactly what you were advocating in #614 (comment)? I think it makes sense to have the freelists per-rail because posted recv buffer counts are maintained per-rail. |
I think the problem with your patch is that it's way too much cognitive load to figrue out what a bounce buffer is. And the rest is just too confusing to review. Fix that. |
Converting this to draft until I have cycles to work on feedback above. |
Separate out bounce buffer freelists into a smaller-sized freelist for
control messages and a larger size for data (eager) messages
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.