Data Sizes #158
Hi @ax3l - yes, as you have noticed, this code really is intended to work on data that can be loaded into memory. A workaround would be to load and process data one 'chunk' at a time - this is already possible. Early on, I considered whether I should try to build a more abstract framework for much larger datasets, and I had a look at polars instead of pandas, which could have enabled 'lazy' evaluation. But ultimately, all the data I work with easily fits into memory (albeit my workstation has 128 GB of RAM :-P) and I was sort of trying to solve problems I didn't have, so I just decided to keep it simple... Dask looks interesting. At first glance, it seems more geared towards parallelizing operations than memory management?
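For illustration, a minimal sketch of the chunked-processing idea with pandas (the file name and column `x` are hypothetical placeholders, not this package's actual data schema): partial results are accumulated per chunk, so the full file never has to sit in memory at once.

```python
import pandas as pd

# Hypothetical file and column names - the real ParticlePhaseSpace schema may differ.
total, count = 0.0, 0

# Stream the file in fixed-size chunks instead of loading it all at once.
for chunk in pd.read_csv("phase_space.csv", chunksize=1_000_000):
    total += chunk["x"].sum()
    count += len(chunk)

print("mean x over the whole file:", total / count)
```

Polars' lazy API (`pl.scan_csv(...).select(...).collect()`) would be another route to a similar effect, deferring the actual reading until the reduced result is requested.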
Thank you for the details! Yes, I think with pandas you might already have some support for chunked operations, and upgrades to the mentioned backends could enable this in the future. As for Dask: parallelization includes memory management; limited shared memory per node is often the driving reason why one parallelizes in the first place :)
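As a rough sketch of that point with `dask.dataframe` (file pattern and column name are placeholders): Dask builds a lazy task graph over partitions and only streams them through memory when the small, reduced result is computed, which covers both the parallelization and the memory side.

```python
import dask.dataframe as dd

# Hypothetical file pattern and column name; each partition is ~blocksize bytes of input.
ddf = dd.read_csv("phase_space_*.csv", blocksize="256MB")

# Nothing is loaded yet - this only builds a task graph of per-partition work
# plus a final reduction.
stats = ddf["x"].describe()

# Partitions are read, reduced, and released one by one; only the small result is kept.
print(stats.compute())
```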
This is perfect and great scoping guidance for users and potential future directions! Thanks a lot. I am closing this as part of the JOSS review, but feel free to reopen it if you would like to keep it as an issue for tracking potential future developments/contributions.
Thank you for the JOSS submission in openjournals/joss-reviews#5375.
This is a follow-up question to #156.
In the design of this package, what are the envisioned data sizes for phase space data to be processed? Up to the size of a laptop's RAM / a single node?
I was looking at
https://bwheelz36.github.io/ParticlePhaseSpace/new_data_loader.html
and am wondering whether most of the operations here are map-reduce operations that could be implemented to stream over arbitrary data sizes, e.g., when large simulation data is being processed?
I did some experiments on processing such data with Dask:
openPMD/openPMD-api#963 (comment)
and wonder if something similar could be used as the backend here to scale up? 🚀
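To make the map-reduce point concrete, here is a rough sketch under assumed names (`iter_chunks` is a hypothetical chunk generator, not an existing loader in this package): a running mean and spread of some per-particle quantity can be reduced chunk by chunk, so the processable data size is bounded by disk rather than RAM.

```python
import numpy as np

def iter_chunks(n_chunks=10, chunk_size=1_000_000):
    """Hypothetical stand-in for a chunked loader: yields one NumPy array per chunk."""
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        yield rng.normal(loc=10.0, scale=1.0, size=chunk_size)

# Map step: per-chunk partial sums. Reduce step: combine into a global mean/std.
n, s, s2 = 0, 0.0, 0.0
for values in iter_chunks():
    n += values.size
    s += values.sum()
    s2 += np.square(values).sum()

mean = s / n
std = np.sqrt(s2 / n - mean**2)
print(f"streamed mean = {mean:.3f}, std = {std:.3f}")
```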