Data Sizes #158

Closed
ax3l opened this issue May 30, 2023 · 4 comments

Comments

@ax3l
Contributor

ax3l commented May 30, 2023

Thank you for the JOSS submission in openjournals/joss-reviews#5375 .

This is a follow-up question to #156.

In the design of this package, what are the envisioned data sizes for phase space data to be processed? Up to the size of a laptop RAM/single node?

I was looking at
https://bwheelz36.github.io/ParticlePhaseSpace/new_data_loader.html

and am wondering whether most of the operations here are map-reduce operations that could be implemented to stream over arbitrary data sizes, e.g., when large simulation data is being processed?
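
For example, an energy spectrum is a classic map-reduce: per-chunk histograms simply add together, so peak memory is one chunk rather than the whole dataset. A minimal sketch (the chunk generator here is a made-up stand-in for reading a large file):

```python
import numpy as np

def energy_chunks():
    """Hypothetical stand-in for reading particle energies from a large
    file one manageable chunk at a time; fabricates data so this runs."""
    rng = np.random.default_rng(42)
    for _ in range(10):
        yield rng.normal(10.0, 1.0, 1_000_000)

# Map: histogram each chunk. Reduce: add the counts.
bins = np.linspace(0.0, 20.0, 201)
counts = np.zeros(len(bins) - 1)
for e in energy_chunks():
    counts += np.histogram(e, bins=bins)[0]
# `counts` is now the spectrum of all 10M particles, computed with
# only ~1M energies resident in memory at any time.
```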

I did some experiments on processing such data with Dask:
openPMD/openPMD-api#963 (comment)

and wonder if something similar could be used as the backend here to scale up? 🚀
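
Roughly, I imagine a dask.dataframe backend could look like this (the file pattern and column name are placeholders):

```python
import dask.dataframe as dd

# A partitioned, lazily evaluated dataframe: partitions are loaded,
# processed, and released independently, so the dataset can exceed RAM.
df = dd.read_csv("phase_space_*.csv", blocksize="256MB")

mean_energy = df["Ek [MeV]"].mean()  # builds a task graph, reads nothing yet
print(mean_energy.compute())         # streams partitions through the graph
```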

@bwheelz36
Owner

Hi @ax3l - yes, as you have noticed, this code really is intended to work on data that can be loaded into memory. A workaround would be to load and process data one 'chunk' at a time - this is already possible with the IAEA DataLoader, and could be (but has not been) implemented for the other data loaders.
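
The pattern would be something like the following (`load_chunk` is an illustrative stand-in, not the actual DataLoader API):

```python
import numpy as np

def load_chunk(i):
    """Illustrative stand-in for reading chunk i of a large phase space
    file (e.g. via the IAEA DataLoader); fabricates data so this runs."""
    return np.random.default_rng(i).normal(10.0, 1.0, 500_000)

# Keep only per-chunk moments; combine them at the end, so the full
# dataset is never resident in memory at once.
s1, s2, n = 0.0, 0.0, 0
for i in range(8):
    e = load_chunk(i)
    s1, s2, n = s1 + e.sum(), s2 + (e**2).sum(), n + e.size
mean = s1 / n
std = np.sqrt(s2 / n - mean**2)
print(f"mean = {mean:.3f}, std = {std:.3f}")
```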

Early on I considered whether I should try to build a more abstract framework for much larger datasets, and I had a look at polars instead of pandas, which could have enabled 'lazy' evaluation. But ultimately, all the data I work with easily fits into memory (albeit my workstation has 128 GB of RAM :-P), and I was sort of trying to solve problems I didn't have, so I just decided to keep it simple...
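
For reference, the lazy pattern I had in mind with polars would look roughly like this (file and column names are made up):

```python
import polars as pl

# scan_csv builds a LazyFrame: the query is recorded, not executed, and
# polars can then stream the file instead of materialising the table.
lf = (
    pl.scan_csv("phase_space.csv")
      .filter(pl.col("particle type") == 11)  # e.g. electrons only
      .select(pl.col("Ek [MeV]").mean())
)
print(lf.collect(streaming=True))  # execute the plan in streaming mode
```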

Dask looks interesting. At first glance, it seems more geared towards parallelizing operations than memory management?

@ax3l
Contributor Author

ax3l commented Jun 5, 2023

Thank you for the details!

Yes, I think pandas already gives you some support for chunked operations, and upgrades to the mentioned backends could enable this in the future.
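
For example, pandas can already iterate over a large file in chunks (file and column names are placeholders):

```python
import pandas as pd

# chunksize turns read_csv into an iterator of DataFrames, so only one
# chunk is in memory at a time.
total, count = 0.0, 0
for chunk in pd.read_csv("phase_space.csv", chunksize=1_000_000):
    total += chunk["Ek [MeV]"].sum()
    count += len(chunk)
print(total / count)  # global mean without ever loading the full file
```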

For Dask: parallelization includes memory management; limited shared memory per node is often the driving reason one parallelizes :)
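
Concretely, Dask workers get an explicit memory budget and spill partitions to disk when they exceed it; a minimal (assumed) local setup:

```python
from dask.distributed import Client

# Four local workers, each capped at 4 GB; partitions that exceed the
# budget are spilled to disk, so the aggregate data can exceed node RAM.
client = Client(n_workers=4, memory_limit="4GB")
```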

bwheelz36 pushed a commit that referenced this issue Jun 6, 2023
bwheelz36 added a commit that referenced this issue Jun 6, 2023
add limitations to docs as per #158
@bwheelz36
Owner

Hey @ax3l - realistically I won't be addressing this concern any time soon, but I think it is a very valid point. As such, I've added a 'Limitations' page to the docs, which details this as what I think is the major limitation of this code at present...

@ax3l
Contributor Author

ax3l commented Jun 25, 2023

This is perfect - great scoping guidance for users and for potential future directions! Thanks a lot.

I am closing this as part of the JOSS review, but feel free to reopen it if you would like to keep it as an issue for tracking potential future developments/contributions.

ax3l closed this as completed Jun 25, 2023