
Free some resources after each step to avoid OOM #75

Open
wants to merge 1 commit into main

Conversation

andrewmoise

Fixes #74

@lucidrains
Owner

hmm, i think the GC should handle this?

@andrewmoise
Author

For some reason it seems like it doesn't free the memory until the variable goes out of scope; see https://pytorch.org/docs/stable/notes/faq.html ("If you assign a Tensor or Variable to a local, Python will not deallocate until the local goes out of scope.")

I don't know whether casting and deleting the loss is necessary; it made no difference in my run (I was just following the PyTorch docs above). But deleting the sample data after saving definitely made my test runs succeed where previously they were running out of GPU memory.
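
To make that concrete, here's a minimal sketch of the two changes, following the pattern in the PyTorch FAQ; the loop structure and names (compute_loss, model, save_samples, save_every, optimizer) are illustrative placeholders, not the actual code in this repo:

for step, batch in enumerate(loader):
    loss = compute_loss(batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Keep only a Python float for logging; the tensor keeps the whole graph alive.
    loss_value = float(loss)
    del loss

    if step % save_every == 0:
        samples = model.sample()
        save_samples(samples, step)
        # Free the sampled tensors now, rather than holding them on the GPU
        # until this local is reassigned on a later save step.
        del samples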

@andrewmoise
Author

(I should clarify: my understanding is that the resources will be freed by the GC when the variable is reassigned on the next step, so we're not leaking significant memory on an ongoing basis. We're just holding a constant amount of GPU memory we don't need, because local variables with large GPU storage stay alive after they're done being used and until they're reassigned on the next loop iteration. For me that was enough to push consumption over the edge and make the program quit because it had no more GPU memory.)

@pengzhangzhi
Contributor

May I ask what GC stands for?

@lucidrains force-pushed the main branch 2 times, most recently from c894e2c to 60f760e on November 12, 2022 04:01
@mgrachten

mgrachten commented Dec 14, 2022

I've seen this issue in the past, and I think what @andrewmoise proposes makes sense. The garbage collector (GC) doesn't handle this situation by itself. For example, in the following for loop, the computational graph is constructed by compute_loss(batch) and assigned to loss. At the start of the next iteration, loss is still bound to the previous iteration's graph; a new graph is constructed by compute_loss(batch), and only once it is constructed is loss rebound to the new graph and the ref count of the old one decreased. That means this code requires enough memory to hold two graphs simultaneously, unless you drop the reference to the old graph before computing the new one, e.g. with del loss at the end of each iteration. Only then can the GC get rid of the old graph.

for batch in loader:
    loss = compute_loss(batch)
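
For comparison, a minimal sketch of the fix with the same illustrative names (loader and compute_loss are placeholders, not code from this repo):

for batch in loader:
    loss = compute_loss(batch)
    ...  # backward pass, optimizer step, etc.
    # Drop the reference before the next iteration builds a new graph,
    # so only one graph needs to fit in memory at a time.
    del loss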

@lucidrains force-pushed the main branch 2 times, most recently from ed62022 to 7c8e5aa on December 23, 2022 21:16
@lucidrains force-pushed the main branch 2 times, most recently from 1891844 to ac933e4 on June 19, 2023 04:50
@VimukthiRandika1997

@lucidrains Is this resolved now?

@andrewmoise
Author

Is this of interest? I'm cleaning out my GitHub, and I'd like to either get it in, or close it as WONTFIX and delete my branch.

I'm happy to recheck the state of the code now, since it looks like things have changed and the original PR won't apply anymore. If it's not of interest though, then no worries, just let me know and I can close the issue/PR.

Successfully merging this pull request may close these issues: Running out of memory
5 participants