[Feature] Add number of samples to read per chunk #308

Open
latot opened this issue Jul 17, 2023 · 3 comments

Comments

@latot

latot commented Jul 17, 2023

Hi, first, sorry, I was writing the issue and for some reason it got posted before I finished writing it!

Actually, there is already code to read data in chunks; I think it would be great to be able to pass the number of samples we want to read per chunk.

thx!

@latot latot changed the title Feature] [Feature] Add number of samples to read per chunk Jul 17, 2023
@adamreeve
Owner

Hi, if you're referring to the data_chunks method, this uses chunk sizes based on the sizes of the data segments stored in the TDMS file, so it isn't simple to change this to use your own chunk size. But you can use slice syntax to do something similar quite easily, e.g.:

channel = tdms_file[group_name][channel_name]
chunk_size = 1024  # number of samples to read per slice
for chunk_start in range(0, len(channel), chunk_size):
    chunk_data = channel[chunk_start:chunk_start + chunk_size]

Note that under the hood this still reads a full chunk of raw data from the required segments and then trims off any extra values.
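For reference, iterating the file's own chunks with data_chunks in streaming mode looks roughly like this (a minimal sketch; the file path, group and channel names are placeholders):

from nptdms import TdmsFile

group_name = "my_group"        # placeholder names
channel_name = "my_channel"

# Open in streaming mode so data is read from disk as it's iterated,
# rather than loading the whole file into memory first.
with TdmsFile.open("my_file.tdms") as tdms_file:
    for chunk in tdms_file.data_chunks():
        # Chunk sizes here follow the segment layout of the TDMS file itself.
        chunk_data = chunk[group_name][channel_name][:]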

Does this work for you, or if not, can you explain your use-case some more?

@latot
Author

latot commented Jul 17, 2023

Hi! I'm trying to read per chunk too, but as you wrote, maybe reading a slice still needs to read a full chunk.

I have 7.2k^3 samples, which is close to 54 GB of raw float64 data.

Every chunk uses close to 54 GB of RAM... maybe it's just a coincidence that it's close to the file size. If I try to read the full file Python crashes, but chunk by chunk works... well, if I set up 30 GB of swap and use my 33 GB of RAM.

I noticed there must be something that "must" be read, because when I try smaller slices it always uses the same amount of RAM. If you don't have enough RAM, numpy stops and says there isn't enough; from that I can tell npTDMS was always trying to allocate the same length even when I changed the slice, although it's still smaller than the full data size.

But using 54 GB of RAM is still too much, and not very efficient.

I want to do some statistics on the data: get distributions per minute, per 5 minutes, or find where the data is more or less persistent... it's still hard to actually do this.
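As a rough sketch of the kind of thing I want to compute (sample_rate and the file/group/channel names below are placeholders, and with the slice approach each window still pulls in a full raw chunk underneath):

import numpy as np
from nptdms import TdmsFile

sample_rate = 1000             # placeholder: samples per second
window = 60 * sample_rate      # one-minute windows

per_minute_means = []
with TdmsFile.open("my_file.tdms") as tdms_file:
    channel = tdms_file["my_group"]["my_channel"]
    for start in range(0, len(channel), window):
        window_data = channel[start:start + window]
        per_minute_means.append(np.mean(window_data))  # one statistic per window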

@adamreeve
Owner

Right, it sounds like your file might just have one big chunk. And if the data is interleaved this is even worse, as all channel data is read rather than skipping over data for other channels.

So rather than always reading a full chunk, it sounds like we need to add the ability to read subsets of data from chunks, which isn't a trivial change. The place to start would be read_raw_data_for_channel in reader.py.
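Purely as an illustration of the kind of API this could end up looking like (the max_samples argument below is hypothetical and does not exist today):

# Hypothetical: cap how many samples are materialised per iteration,
# instead of always returning data for a full file chunk.
for chunk in tdms_file.data_chunks(max_samples=1024):
    chunk_data = chunk[group_name][channel_name][:]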

I don't have a lot of time to spend on npTDMS at the moment, but would be happy to accept a PR for this if you wanted to implement this.
