[Feature] Add number of samples to read per chunk #308

Open
latot opened this issue Jul 17, 2023 · 3 comments

Comments

@latot

latot commented Jul 17, 2023

Hi, first, sorry, I was writing the issue and for some reason it got posted before I finished writing it!

Actually, there is already code to read data in chunks; I think it would be great to be able to pass the number of samples we want to read per chunk.

thx!

@latot latot changed the title Feature] [Feature] Add number of samples to read per chunk Jul 17, 2023
@adamreeve
Owner

Hi, if you're referring to the data_chunks method, this uses chunk sizes based on the sizes of the data segments stored in the TDMS file, so it isn't simple to change this to use your own chunk size. But you can use slice syntax to do something similar quite easily, e.g.:

channel = tdms_file[group_name][channel_name]
chunk_size = 1024  # number of samples to read per slice
for chunk_start in range(0, len(channel), chunk_size):
    chunk_data = channel[chunk_start:chunk_start + chunk_size]

Note that under the hood this still reads a full chunk of raw data from the required segments and then trims off any extra values.
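For reference, iterating the file's own chunks with data_chunks in streaming mode looks roughly like this (a minimal sketch; the file path, group and channel names are placeholders):

from nptdms import TdmsFile

group_name = "my_group"        # placeholder names
channel_name = "my_channel"

# Open in streaming mode so data is read from disk as it's iterated,
# rather than loading the whole file into memory first.
with TdmsFile.open("my_file.tdms") as tdms_file:
    for chunk in tdms_file.data_chunks():
        # Chunk sizes here follow the segment layout of the TDMS file itself.
        chunk_data = chunk[group_name][channel_name][:]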

Does this work for you, or if not, can you explain your use-case some more?

@latot
Author

latot commented Jul 17, 2023

Hi! I'm trying to read per chunk too, but as you wrote, maybe reading a slice still needs to read a full chunk.

I have 7.2k^3 samples, which is close to 54 GB of raw float64 data.

Every chunk uses close to 54 GB of RAM... maybe it's just a coincidence that it's close to the file size. If I try to read the full file Python crashes, but chunk by chunk works... well, if I set up 30 GB of swap and use my 33 GB of RAM.

I noticed there must be something that "must" be read, because when I try smaller slices it always uses the same amount of RAM. If you don't have enough RAM, numpy stops and says there isn't enough; from that I can tell npTDMS was always trying to allocate the same length even when I changed the slice, although it's still smaller than the full data size.

But using 54 GB of RAM is still too much, and not very efficient.

I want to do some statistics on the data: get distributions per minute, per 5 minutes, or find where the data is more or less persistent... it's still hard to actually do this.
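As a rough sketch of the kind of thing I want to compute (sample_rate and the file/group/channel names below are placeholders, and with the slice approach each window still pulls in a full raw chunk underneath):

import numpy as np
from nptdms import TdmsFile

sample_rate = 1000             # placeholder: samples per second
window = 60 * sample_rate      # one-minute windows

per_minute_means = []
with TdmsFile.open("my_file.tdms") as tdms_file:
    channel = tdms_file["my_group"]["my_channel"]
    for start in range(0, len(channel), window):
        window_data = channel[start:start + window]
        per_minute_means.append(np.mean(window_data))  # one statistic per window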

@adamreeve
Owner

Right, it sounds like your file might just have one big chunk. And if the data is interleaved this is even worse, as all channel data is read rather than skipping over data for other channels.

So rather than always reading a full chunk, it sounds like we need to add the ability to read subsets of data from chunks, which isn't a trivial change. The place to start would be read_raw_data_for_channel in reader.py.
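Purely as an illustration of the kind of API this could end up looking like (the max_samples argument below is hypothetical and does not exist today):

# Hypothetical: cap how many samples are materialised per iteration,
# instead of always returning data for a full file chunk.
for chunk in tdms_file.data_chunks(max_samples=1024):
    chunk_data = chunk[group_name][channel_name][:]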

I don't have a lot of time to spend on npTDMS at the moment, but would be happy to accept a PR for this if you wanted to implement this.
