
Improve performance #28

Merged: 3 commits from sirmarcel_perf into main on Jul 29, 2024

Conversation

@sirmarcel (Contributor) commented Jul 26, 2024

This is to track progress on removing "obvious" bottlenecks to improve performance.

Timings before starting work (cutoff=10, 96-atom ZrO2, script attached; measured with the profiler, so slower than real):

CPU (M3 Max)
cpu time: 109.082ms

GPU (H100)
cpu time: 1.724s
cuda time: 262.400ms

profile.zip


📚 Documentation preview 📚: https://meshlode--28.org.readthedocs.build/en/28/
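
For context, here is a minimal sketch of how timings like the above can be collected with `torch.profiler`. The system setup and the `compute_potential` call are placeholders, not the attached benchmark script:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Placeholder system: 96 atoms in a cubic box with unit charges.
# The actual benchmark (96-atom ZrO2, cutoff=10) is in the attached script.
device = "cuda" if torch.cuda.is_available() else "cpu"
positions = torch.rand(96, 3, device=device)
cell = 10.0 * torch.eye(3, device=device)
charges = torch.ones(96, 1, device=device)

def compute_potential(positions, cell, charges):
    # Placeholder for the meshlode calculator call being profiled.
    return (charges * positions.pow(2).sum(dim=1, keepdim=True)).sum()

activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    compute_potential(positions, cell, charges)

# The "cpu time" / "cuda time" totals quoted above come from tables like this one.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```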

@sirmarcel (Contributor Author)

Timing after removing a for loop in the short-range part:

CPU (M3 Max):
cpu time: 42.564ms

CUDA (H100)
cpu time: 3.976ms
cuda time: 817.129us
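
To illustrate the kind of change (a sketch, not the actual meshlode short-range code): the per-pair Python loop is replaced by a single batched `index_add_` over precomputed neighbor pairs. `pair_i`, `pair_j`, and `pair_potential` are hypothetical inputs.

```python
import torch

def short_range_loop(charges, pair_i, pair_j, pair_potential):
    # Slow: one Python iteration (and tiny kernel) per neighbor pair.
    potential = torch.zeros_like(charges)
    for i, j, v in zip(pair_i.tolist(), pair_j.tolist(), pair_potential.tolist()):
        potential[i] += charges[j] * v
    return potential

def short_range_vectorized(charges, pair_i, pair_j, pair_potential):
    # Fast: the same contraction as one batched scatter-add.
    potential = torch.zeros_like(charges)
    potential.index_add_(0, pair_i, charges[pair_j] * pair_potential)
    return potential
```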

@PicoCentauri (Contributor)

Nice! Should we add a profiling example showing some breakdowns? Something similar to https://luthaf.fr/rascaline/latest/examples/profiling.html, maybe?

@sirmarcel (Contributor Author)

I have the outputs on hand, but they're rather unwieldy and not particularly informative. Maybe we can try to do something polished once this is done.

@sirmarcel sirmarcel marked this pull request as draft July 26, 2024 12:41
Avoid some big multiplications.
@sirmarcel (Contributor Author)

Okay, optimised generate_kvectors a bit.

Before (H100, N=4116):
Self CPU time total: 21.154ms
Self CUDA time total: 19.132ms

After (H100, N=4116):
Self CPU time total: 17.907ms
Self CUDA time total: 15.858ms

Now it seems we're basically dominated by the FFTs and the convolution, which seems reasonable. I think the division by volume in pmepotential.py, line 125, is the next target: that's about 1 ms that seems avoidable, but I'm not sure it's worth doing. The most obvious things have been done now.
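
For the record, the idea behind "avoid some big multiplications" in generate_kvectors can be sketched like this (an assumption about the general approach, not the actual implementation): build the integer mesh indices once and map them to k-vectors with a single batched matmul against the reciprocal cell.

```python
import math
import torch

def generate_kvectors_sketch(cell: torch.Tensor, ns: torch.Tensor) -> torch.Tensor:
    """Sketch only: cell is (3, 3) with lattice vectors as rows, ns is (3,)
    mesh points per axis. Returns a (n1, n2, n3, 3) grid of k-vectors."""
    # Reciprocal lattice vectors (rows): 2*pi * (cell^-1)^T.
    reciprocal_cell = 2 * math.pi * torch.linalg.inv(cell).T

    # Integer FFT frequencies along each axis (0, 1, ..., -1 in FFT order).
    freqs = [torch.fft.fftfreq(int(n), d=1.0 / int(n), device=cell.device) for n in ns]
    nx, ny, nz = torch.meshgrid(*freqs, indexing="ij")
    indices = torch.stack([nx, ny, nz], dim=-1)  # (n1, n2, n3, 3)

    # A single matmul replaces per-axis elementwise products over the full grid.
    return indices @ reciprocal_cell
```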

(profiler tables for before and after attached)

@sirmarcel (Contributor Author)

Just for future reference, here are the timings for energy + forces. "Before" is main, "after" is the current state of this branch.

Before, CPU, M3 Max, 96 atoms
CPU time: 2.16s

(didn't bother with GPU)

After, CPU, M3 Max, 96 atoms
CPU time: 77.6ms

After, GPU, H100, 96 atoms
CPU time: 8.432ms
CUDA time: 1.680ms

After, GPU, H100, 4116 atoms
CPU time: 42.291ms
CUDA time: 39.356ms

After, GPU, H100, 8748 atoms
CPU time: 43.439ms
CUDA time: 40.820ms
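
For completeness, a sketch of how energy + forces numbers like these can be measured: forces come from `torch.autograd.grad` of the scalar energy with respect to positions, and on GPU a `torch.cuda.synchronize()` is needed before reading wall-clock timings. `compute_energy` is again a placeholder for the actual calculator call.

```python
import time
import torch

def energy_and_forces(compute_energy, positions, cell, charges):
    # Forces as the negative gradient of the (scalar) energy w.r.t. positions.
    positions = positions.detach().requires_grad_(True)
    energy = compute_energy(positions, cell, charges)
    (grad,) = torch.autograd.grad(energy, positions)
    return energy.detach(), -grad

def timed(fn, *args):
    # Wall-clock timing; synchronize so asynchronous CUDA work is included.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start
```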

@sirmarcel (Contributor Author)

Tests pass on kuma; removing draft status. Someone please check + merge. 🙏

@sirmarcel sirmarcel marked this pull request as ready for review July 29, 2024 11:37
@sirmarcel sirmarcel changed the title WIP: Improve performance Improve performance Jul 29, 2024
@PicoCentauri PicoCentauri merged commit 995fa93 into main Jul 29, 2024
0 of 7 checks passed
@PicoCentauri PicoCentauri deleted the sirmarcel_perf branch July 29, 2024 14:15