
Improve performance #28

Merged: 3 commits from sirmarcel_perf into main on Jul 29, 2024

Conversation

@sirmarcel (Contributor) commented Jul 26, 2024

This is to track progress on removing "obvious" bottlenecks to improve performance.

Timings before starting work (cutoff=10, 96-atom ZrO2, script attached; measured with the profiler, so slower than real):

CPU (M3 Max)
cpu time: 109.082ms

GPU (H100)
cpu time: 1.724s
cuda time: 262.400ms

profile.zip


📚 Documentation preview 📚: https://meshlode--28.org.readthedocs.build/en/28/
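
For context, here is a minimal sketch of how timings like the above can be collected with `torch.profiler`. The system setup and the `compute_potential` call are placeholders, not the attached benchmark script:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder system: 96 atoms in a cubic box with unit charges.
# The actual benchmark (96-atom ZrO2, cutoff=10) is in the attached script.
device = "cuda" if torch.cuda.is_available() else "cpu"
positions = torch.rand(96, 3, device=device)
cell = 10.0 * torch.eye(3, device=device)
charges = torch.ones(96, 1, device=device)

def compute_potential(positions, cell, charges):
    # Placeholder for the meshlode calculator call being profiled.
    return (charges * positions.pow(2).sum(dim=1, keepdim=True)).sum()

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    compute_potential(positions, cell, charges)

# The "cpu time" / "cuda time" totals quoted above come from tables like this one.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```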

@sirmarcel (Contributor Author)

Timing after removing a for loop in the short-range part:

CPU (M3 Max):
cpu time: 42.564ms

CUDA (H100)
cpu time: 3.976ms
cuda time: 817.129us
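
To illustrate the kind of change (a sketch, not the actual meshlode short-range code): the per-pair Python loop is replaced by a single batched `index_add_` over precomputed neighbor pairs. `pair_i`, `pair_j`, and `pair_potential` are hypothetical inputs.

```python
import torch

def short_range_loop(charges, pair_i, pair_j, pair_potential):
    # Slow: one Python iteration (and tiny kernel) per neighbor pair.
    potential = torch.zeros_like(charges)
    for i, j, v in zip(pair_i.tolist(), pair_j.tolist(), pair_potential.tolist()):
        potential[i] += charges[j] * v
    return potential

def short_range_vectorized(charges, pair_i, pair_j, pair_potential):
    # Fast: the same contraction as one batched scatter-add.
    potential = torch.zeros_like(charges)
    potential.index_add_(0, pair_i, charges[pair_j] * pair_potential)
    return potential
```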

@PicoCentauri (Contributor)

Nice! Should we add a profiling example showing some breakdowns? Something similar to https://luthaf.fr/rascaline/latest/examples/profiling.html, maybe?

@sirmarcel (Contributor Author)

I have the outputs on hand, but they're rather unwieldy and not particularly informative. Maybe we can try to do something polished once this is done.

@sirmarcel sirmarcel marked this pull request as draft July 26, 2024 12:41
Avoid some big multiplications.
@sirmarcel (Contributor Author)

Okay, optimised generate_kvectors a bit.

Before (H100, N=4116):
Self CPU time total: 21.154ms
Self CUDA time total: 19.132ms

After (H100, N=4116):
Self CPU time total: 17.907ms
Self CUDA time total: 15.858ms

Now it seems we're basically dominated by the FFTs and the convolution, which seems reasonable. I think the division by volume in pmepotential.py, line 125, is the next target: that's about 1 ms that seems avoidable, but I'm not sure it's worth doing. The most obvious things have been done now.
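
For the record, the idea behind "avoid some big multiplications" in generate_kvectors can be sketched like this (an assumption about the general approach, not the actual implementation): build the integer mesh indices once and map them to k-vectors with a single batched matmul against the reciprocal cell.

```python
import math
import torch

def generate_kvectors_sketch(cell: torch.Tensor, ns: torch.Tensor) -> torch.Tensor:
    """Sketch only: cell is (3, 3) with lattice vectors as rows, ns is (3,)
    mesh points per axis. Returns a (n1, n2, n3, 3) grid of k-vectors."""
    # Reciprocal lattice vectors (rows): 2*pi * (cell^-1)^T.
    reciprocal_cell = 2 * math.pi * torch.linalg.inv(cell).T

    # Integer FFT frequencies along each axis (0, 1, ..., -1 in FFT order).
    freqs = [torch.fft.fftfreq(int(n), d=1.0 / int(n), device=cell.device) for n in ns]
    nx, ny, nz = torch.meshgrid(*freqs, indexing="ij")
    indices = torch.stack([nx, ny, nz], dim=-1)  # (n1, n2, n3, 3)

    # A single matmul replaces per-axis elementwise products over the full grid.
    return indices @ reciprocal_cell
```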

(profiler tables for before and after attached)

@sirmarcel (Contributor Author)

Just for future reference, here are the timings for energy + forces. "Before" is main, "after" is the current state of this branch.

Before, CPU, M3 Max, 96 atoms
CPU time: 2.16s

(didn't bother with GPU)

After, CPU, M3 Max, 96 atoms
CPU time: 77.6ms

After, GPU, H100, 96 atoms
CPU time: 8.432ms
CUDA time: 1.680ms

After, GPU, H100, 4116 atoms
CPU time: 42.291ms
CUDA time: 39.356ms

After, GPU, H100, 8748 atoms
CPU time: 43.439ms
CUDA time: 40.820ms
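
For completeness, a sketch of how energy + forces numbers like these can be measured: forces come from `torch.autograd.grad` of the scalar energy with respect to positions, and on GPU a `torch.cuda.synchronize()` is needed before reading wall-clock timings. `compute_energy` is again a placeholder for the actual calculator call.

```python
import time
import torch

def energy_and_forces(compute_energy, positions, cell, charges):
    # Forces as the negative gradient of the (scalar) energy w.r.t. positions.
    positions = positions.detach().requires_grad_(True)
    energy = compute_energy(positions, cell, charges)
    (grad,) = torch.autograd.grad(energy, positions)
    return energy.detach(), -grad

def timed(fn, *args):
    # Wall-clock timing; synchronize so asynchronous CUDA work is included.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start
```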

@sirmarcel (Contributor Author)

Tests pass on kuma; removing draft status. Someone please check + merge. 🙏

@sirmarcel sirmarcel marked this pull request as ready for review July 29, 2024 11:37
@sirmarcel sirmarcel changed the title WIP: Improve performance Improve performance Jul 29, 2024
@PicoCentauri PicoCentauri merged commit 995fa93 into main Jul 29, 2024
0 of 7 checks passed
@PicoCentauri PicoCentauri deleted the sirmarcel_perf branch July 29, 2024 14:15