Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rdma: add separate bounce buffer freelist for data (eager) messages #614

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

rauteric
Copy link
Contributor

Separate out bounce buffer freelists into a smaller-sized freelist for
control messages and a larger size for data (eager) messages

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@rauteric rauteric requested a review from a team as a code owner September 19, 2024 22:32
@rauteric rauteric force-pushed the fl-separate-pr branch 2 times, most recently from 19af9ec to 512366e Compare September 19, 2024 23:09
@rauteric rauteric changed the title rdma: add separate bounce buffer for data (eager messages) rdma: add separate bounce buffer freelist for data (eager) messages Sep 19, 2024
include/nccl_ofi_rdma.h Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
include/nccl_ofi_rdma.h Outdated Show resolved Hide resolved
src/nccl_ofi_rdma.c Outdated Show resolved Hide resolved
@rauteric
Copy link
Contributor Author

rauteric commented Oct 3, 2024

The CI failure(s) are real:

scatter_perf: nccl_ofi_rdma.c:2100: alloc_bounce_req: Assertion `NCCL_OFI_IS_PTR_ALIGNED(&bounce_fl_item->bounce_msg, BOUNCE_BUFFER_ALIGNMENT)' failed.

Need to fix up some asserts.

@rauteric
Copy link
Contributor Author

rauteric commented Oct 3, 2024

Force-push to 3494556 fixes the CI failure by having alignment on both (eager) bounce buffers and ctrl recv buffers.

This requirement will be relaxed when we refactor the data structure in the related #600.

@rauteric
Copy link
Contributor Author

rauteric commented Oct 3, 2024

Rebased on master

@AmedeoSapio
Copy link

bot:aws:retest

1 similar comment
@AvivBenchorin
Copy link
Contributor

bot:aws:retest

@rauteric
Copy link
Contributor Author

Rebase on master

* Minimum ctrl recv buffers posted per rail. The plugin will attempt to post
* more buffers if we dip below this threshold, allocating new buffers if needed.
*/
OFI_NCCL_PARAM_INT(rdma_min_posted_ctrl_recv_buffers, "RDMA_MIN_POSTED_CTRL_RECV_BUFFERS", 64);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a function out max outstanding requests, which today we have at 128?

Copy link
Contributor Author

@rauteric rauteric Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bit tricky, because the outstanding requests max (NCCL_OFI_MAX_REQUESTS) is

  1. A per-communicator value, whereas this parameter is a per-endpoint per-rail value, so we'd have to multiply this parameter by the number of communicators, which we don't know in advance
  2. Much higher than NCCL actually keeps in flight typically, so setting this param to NCCL_OFI_MAX_REQUESTS x num_comms will be overkill.

That said, we may need to tune this value, but the current value in this PR is already significantly higher than what is in master today (which is 16-32 per rail, and shared between eager and ctrl recv buffers).

Separate out bounce buffer freelists into a smaller-sized freelist for
control messages and a larger size for data (eager) messages

Also refactor bounce buffer freelists to be per-rail instead of shared
across all rails. Parameters `min_posted_bounce_buffers` and
`max_posted_bounce_buffers` are now per-rail values.

Split parameters for posted ctrl and eager buffers. These are now
separately configurable. Set a higher count for ctrl buffers.

Signed-off-by: Eric Raut <eraut@amazon.com>
@rauteric
Copy link
Contributor Author

Rebased on master

Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right approach. Now would be a great time to clean up some code, but this only makes the code more confusing.

Bounce buffers for eager messages and control messages are different things. Have different structures with different names so that there's no reason to figure out what's what. I'm also not sure I understand the desire to move the freelists from the endpoint into the per-rail structures, given that most of the data is really operations that will be per-endpoint.

@@ -239,18 +239,31 @@ OFI_NCCL_PARAM_INT(disable_dmabuf, "DISABLE_DMABUF", 0);
OFI_NCCL_PARAM_UINT(min_stripe_size, "MIN_STRIPE_SIZE", (64 * 1024));

/*
* Minimum bounce buffers posted per endpoint. The plugin will attempt to post
* Minimum ctrl recv buffers posted per rail. The plugin will attempt to post
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per what rail? device, endpoint, etc?

@rauteric
Copy link
Contributor Author

Bounce buffers for eager messages and control messages are different things. Have different structures with different names so that there's no reason to figure out what's what.

They are different things, but there is a lot of code that is shared between them (e.g., all of the buffer pool management code, posting receives). We should rename these structures to something more generic (like receive buffers instead of bounce buffers [which I wanted to do in a followup patch]), but I think duplicating all the buffer management code is overkill.

I'm also not sure I understand the desire to move the freelists from the endpoint into the per-rail structures, given that most of the data is really operations that will be per-endpoint.

I'm confused... isn't having the freelists per-rail exactly what you were advocating in #614 (comment)?

I think it makes sense to have the freelists per-rail because posted recv buffer counts are maintained per-rail.

@bwbarrett
Copy link
Contributor

I think the problem with your patch is that it's way too much cognitive load to figrue out what a bounce buffer is. And the rest is just too confusing to review. Fix that.

@rauteric
Copy link
Contributor Author

rauteric commented Dec 4, 2024

Converting this to draft until I have cycles to work on feedback above.

@rauteric rauteric marked this pull request as draft December 4, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants