
Free some resources after each step to avoid OOM #75

Open
wants to merge 1 commit into main

Conversation

andrewmoise

Fixes #74

@lucidrains
Owner

hmm, i think the GC should handle this?

@andrewmoise
Author

For some reason it seems like it doesn't free the memory until the variable goes out of scope; see https://pytorch.org/docs/stable/notes/faq.html ("If you assign a Tensor or Variable to a local, Python will not deallocate until the local goes out of scope.")

I don't know whether casting and deleting the loss is necessary; it made no difference in my run (I was just following the PyTorch docs above). But deleting the sample data after saving definitely made my test runs succeed where previously they were running out of GPU memory.
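
To make that concrete, here's a minimal sketch of the two changes, following the pattern in the PyTorch FAQ; the loop structure and names (compute_loss, model, save_samples, save_every, optimizer) are illustrative placeholders, not the actual code in this repo:

for step, batch in enumerate(loader):
    loss = compute_loss(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Keep only a Python float for logging; the tensor keeps the whole graph alive.
    loss_value = float(loss)
    del loss

    if step % save_every == 0:
        samples = model.sample()
        save_samples(samples, step)
        # Free the sampled tensors now, rather than holding them on the GPU
        # until this local is reassigned on a later save step.
        del samples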

@andrewmoise
Author

(I should clarify: my understanding is that the resources will be freed by the GC when the variable is reassigned on the next step, so we're not leaking significant memory on an ongoing basis. We're just holding a constant amount of GPU memory we don't need, because local variables with large GPU storage stay alive after they're done being used and until they're reassigned on the next loop iteration. For me that was enough to push consumption over the edge and make the program quit because it had no more GPU memory.)

@pengzhangzhi
Contributor

May I ask what GC stands for?

@lucidrains force-pushed the main branch 2 times, most recently from c894e2c to 60f760e on November 12, 2022 04:01
@mgrachten

mgrachten commented Dec 14, 2022

I've seen this issue in the past, and I think what @andrewmoise proposes makes sense. The garbage collector (GC) doesn't handle this situation by itself. For example, in the following for loop, the computational graph is constructed by compute_loss(batch) and assigned to loss. At the start of the next iteration, loss is still bound to the previous iteration's graph; a new graph is constructed by compute_loss(batch), and only once it is constructed is loss rebound to the new graph and the ref count of the old one decreased. That means this code requires enough memory to hold two graphs simultaneously, unless you drop the reference to the old graph before computing the new one, e.g. with del loss at the end of each iteration. Only then can the GC get rid of the old graph.

for batch in loader:
    loss = compute_loss(batch)
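
For comparison, a minimal sketch of the fix with the same illustrative names (loader and compute_loss are placeholders, not code from this repo):

for batch in loader:
    loss = compute_loss(batch)
    ...  # backward pass, optimizer step, etc.
    # Drop the reference before the next iteration builds a new graph,
    # so only one graph needs to fit in memory at a time.
    del loss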

@lucidrains force-pushed the main branch 2 times, most recently from ed62022 to 7c8e5aa on December 23, 2022 21:16
@lucidrains force-pushed the main branch 2 times, most recently from 1891844 to ac933e4 on June 19, 2023 04:50
@VimukthiRandika1997

@lucidrains Is this resolved now?

@andrewmoise
Author

Is this of interest? I'm cleaning out my GitHub, and I'd like to either get it in, or close it as WONTFIX and delete my branch.

I'm happy to recheck the state of the code now, since it looks like things have changed and the original PR won't apply anymore. If it's not of interest though, then no worries, just let me know and I can close the issue/PR.

Successfully merging this pull request may close these issues: Running out of memory
5 participants