Reading huge amount of data takes a lot of time #249

Open
TenchiMuyo1984 opened this issue Aug 27, 2021 · 10 comments

Comments

@TenchiMuyo1984

Hello,
I don't know why, but when I try to read a huge amount of data, it takes several minutes.

`from nptdms import TdmsFile
tdms_file = TdmsFile.open("file.tdms", "r")`

That takes a lot of time:

`l_data = tdms_file[group_name][channel_name][0:5000000]`

This takes a lot of time as well:

`import tempfile

with tempfile.TemporaryDirectory() as temp_memmap_dir:
    tdms_file = TdmsFile.read("file.tdms", memmap_dir=temp_memmap_dir)`

Is there a way to get the offset (address of the pointer) of every element in the file?
like:
p_element = tdms_file[group_name][channel_name][0].tell()

@adamreeve
Owner

Hi

The time it takes to read data will depend a lot on the TDMS file structure and the type of data being read, e.g. reading timestamp data is more complicated than reading plain floats, and data with many small segments will take longer to read. Interleaved data will also take longer than non-interleaved data, as the data points for a single channel are not contiguous.
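To illustrate the point about interleaved data, here is a minimal NumPy sketch. It is not npTDMS code, and the array sizes and values are made up for illustration. With a non-interleaved layout one channel is a single contiguous slice, whereas with an interleaved layout its values have to be picked out with a stride.

```python
import numpy as np

n_channels, n_samples = 4, 6

# Hypothetical raw data: value = channel * 100 + sample index
channels = np.array(
    [[c * 100 + i for i in range(n_samples)] for c in range(n_channels)],
    dtype=np.float64,
)

# Non-interleaved: all samples of channel 0, then all of channel 1, ...
contiguous = channels.ravel()
# Interleaved: sample 0 of every channel, then sample 1 of every channel, ...
interleaved = channels.T.ravel()

ch = 2
# One contiguous block of n_samples values
from_contiguous = contiguous[ch * n_samples:(ch + 1) * n_samples]
# Every n_channels-th value, so the reader has to skip over the other channels
from_interleaved = interleaved[ch::n_channels]

print(from_contiguous.strides, from_interleaved.strides)
```

Both selections hold the same values, but the interleaved one touches memory (or, for a file, disk) spread over the whole block instead of one compact range.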

Without an example file to use for profiling, it's hard to say why your specific files would take a long time to read. For example, if I read data from a file with a similar number of data points but only a single channel of floats, this takes less than 1 second on my machine. Are you able to provide one of your files for testing?

There isn't a way to get the positions of all data points, and I'm doubtful that this would allow you to read the data any faster.

numpy/numpy#13319 indicates that using np.fromfile can be a lot slower than expected when reading data in small chunks in Python 3, so possibly using f.readinto(buffer) or np.frombuffer as suggested there may help improve performance.
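As a rough sketch of the two approaches mentioned above (this is not the actual npTDMS implementation; the helper names and the temporary file are made up for illustration), reading a chunk with np.fromfile versus filling a preallocated array with f.readinto could look like this:

```python
import os
import tempfile

import numpy as np


def read_chunk_fromfile(f, dtype, count):
    # Let NumPy read directly from the file object
    return np.fromfile(f, dtype=dtype, count=count)


def read_chunk_readinto(f, dtype, count):
    # Preallocate the array and let the file fill its memory in place;
    # NumPy arrays expose a writable buffer, so readinto accepts them
    buffer = np.empty(count, dtype=dtype)
    bytes_read = f.readinto(buffer)
    if bytes_read != buffer.nbytes:
        raise EOFError("file ended before the chunk was complete")
    return buffer


data = np.arange(10, dtype=np.float64)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "samples.bin")
    data.tofile(path)

    # Read the same file as two chunks of five values with each helper
    with open(path, "rb") as f:
        chunks_fromfile = [read_chunk_fromfile(f, np.float64, 5) for _ in range(2)]
    with open(path, "rb") as f:
        chunks_readinto = [read_chunk_readinto(f, np.float64, 5) for _ in range(2)]
```

Both helpers return the same values; any speed difference only shows up when many small chunks are read, so it needs to be measured on real files.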

@adamreeve
Owner

I've pushed a change to the fromfile_perf branch that uses file.readinto instead of np.fromfile. Are you able to test with that branch to see if it improves performance?

@axel-kah

Hi,
npTDMS is one of two actively maintained TDMS Python packages, and the only one with partial read support. I did some benchmarking and compared this branch against the last release (1.3.1). I used two different files for testing. Unfortunately I can't make them available, so I added some meta info to give you a better idea of how they are structured. Test conditions:

  • Thinkpad T580
  • Windows 10
  • Defender enabled
  • Python 3.9.6
  • numpy 1.21.2

file: larger.tdms
groups: 1106
avg. ch/group: 20.0
filesize: 216.900 MB

npTDMS		1.3.1
$ python -m timeit --verbose -n 2 -r 3 -s "from nptdms import TdmsFile" "file = TdmsFile.read('larger.tdms')"
raw times: 201 sec, 203 sec, 169 sec

2 loops, best of 3: 84.6 sec per loop

npTDMS		090ed793271b6824b60105c198e55ef6be8f67b7
$ python -m timeit --verbose -n 2 -r 3 -s "from nptdms import TdmsFile" "file = TdmsFile.read('larger.tdms')"
raw times: 99.3 sec, 91.7 sec, 118 sec

2 loops, best of 3: 45.9 sec per loop

file: smaller.tdms
groups: 529
avg. ch/group: 20.0
filesize: 26.058 MB

npTDMS		1.3.1
$ python -m timeit --verbose -n 10 -r 5 -s "from nptdms import TdmsFile" "file = TdmsFile.read('smaller.tdms')"
raw times: 10.2 sec, 9.64 sec, 9.94 sec, 9.55 sec, 9.67 sec

10 loops, best of 5: 955 msec per loop

npTDMS		090ed793271b6824b60105c198e55ef6be8f67b7
$ python -m timeit --verbose -n 10 -r 5 -s "from nptdms import TdmsFile" "file = TdmsFile.read('smaller.tdms')"
raw times: 7.52 sec, 6.95 sec, 7.3 sec, 7.27 sec, 6.89 sec

10 loops, best of 5: 689 msec per loop

If you don't see any downside to this feature branch, the numbers are definitely in its favor, and merging it would be much appreciated.
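For anyone reading the timeit output above: the "best of" figure is the smallest raw time divided by the number of loops. The short sketch below recomputes it from the raw times quoted in this comment and derives the speedup of the branch over 1.3.1. The raw times are rounded in the output, so the recomputed per-loop values can differ slightly in the last digit (e.g. 84.5 instead of 84.6).

```python
# (loops per repetition, raw times in seconds), copied from the output above
raw_times = {
    ("larger.tdms", "1.3.1"): (2, [201, 203, 169]),
    ("larger.tdms", "branch"): (2, [99.3, 91.7, 118]),
    ("smaller.tdms", "1.3.1"): (10, [10.2, 9.64, 9.94, 9.55, 9.67]),
    ("smaller.tdms", "branch"): (10, [7.52, 6.95, 7.3, 7.27, 6.89]),
}


def best_per_loop(loops, times):
    # timeit reports the fastest repetition divided by the loop count
    return min(times) / loops


per_loop = {key: best_per_loop(*value) for key, value in raw_times.items()}

speedups = {
    name: per_loop[(name, "1.3.1")] / per_loop[(name, "branch")]
    for name in ("larger.tdms", "smaller.tdms")
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster on the branch")
```

This works out to roughly 1.8x for larger.tdms and 1.4x for smaller.tdms.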

@adamreeve
Owner

Hi @axel-kah, thanks for testing this. I also found some reasonable speed ups in my tests and don't see a reason not to make this change so will merge that branch.

@Nikolai-Hlubek

A comment that might help the original poster:

I have used 2 GB tdms files without any problems with nptdms.
However, I noticed that, depending on how the LabVIEW program is written, TDMS files can "get fragmented" (maybe this is interleaved data?), and this will slow down reading a lot. An indication that you have such a fragmentation issue is that the TDMS index file is large (MB instead of KB).
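A quick way to check this is to look at the size of the .tdms_index file that LabVIEW writes next to the data file. The sketch below is only a heuristic: the helper names are made up, and the 1 MB threshold is an assumption taken from the "MB instead of KB" rule of thumb above.

```python
import os


def index_file_size(tdms_path):
    """Return the size in bytes of the matching .tdms_index file, or None if it is missing."""
    index_path = tdms_path + "_index"
    if not os.path.exists(index_path):
        return None
    return os.path.getsize(index_path)


def looks_fragmented(tdms_path, threshold_bytes=1_000_000):
    """Heuristic only: treat an index file of 1 MB or more as a sign of many small segments."""
    size = index_file_size(tdms_path)
    if size is None:
        return None  # no index file, so nothing can be said
    return size >= threshold_bytes
```

For example, `looks_fragmented("file.tdms")` inspects `file.tdms_index` in the same directory and returns None when no index file exists.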

@pashaLyb

pashaLyb commented Feb 22, 2022

I have the same issue with a 260 MB file containing one group and 499 channels of floats, each 86400 rows long. On my HP Elite x2 laptop it takes ~90 sec per channel. It's probably a "bad" file issue.
The only solution I found is to use the multiprocessing Pool().map() method to read the channels in parallel.

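from multiprocessing import Pool

def read_TDMS_Data_Parallel(channelList):
    # Read every channel in channelList in its own worker process
    p = Pool()
    result = p.map(read_TDMS_Data, channelList)
    # Stop accepting work and wait for the workers to exit
    p.close()
    p.join()
    return result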

I also don't use the built-in as_dataframe() method. Instead, I create the DataFrame from the list returned by the read_TDMS_Data_Parallel() function:

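import pandas as pd

def store_DataFrame(Data, channelList, filePath):
    # Data holds one array per channel; transpose so each channel becomes a column
    df = pd.DataFrame(data=Data).T
    df.columns = channelList
    # Assumes channelList contains a channel named 'Time'
    df.set_index('Time', drop=True, inplace=True)
    df = df.convert_dtypes()
    df.to_pickle(filePath)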

@spri902

spri902 commented Jun 23, 2022

@pashaLyb

I have ~1 GB TDMS files I would like to read in as dataframes. I am currently using the built-in as_dataframe() method like so to read in only certain channels:

TdmsFile(file1).as_dataframe().iloc[:,channel_list]

I would like to read them in parallel using the multiprocessing package. What does the "read_TDMS_Data" function look like in your "read_TDMS_Data_Parallel" function?

@pashaLyb

pashaLyb commented Jul 20, 2022

@spri902 sorry for the late reply.

I had to define tdms_data_path and gr_name as globals because the function passed to p.map() takes only one argument, the item being mapped.

There are 499 channels in my TDMS file and I read only the ones that are relevant. That is why I'm using a predefined channelList.

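from multiprocessing import Pool
from nptdms import TdmsFile

# tdms_data_path and gr_name are globals defined elsewhere in the script

def read_TDMS_Data(channelName):
    # Each worker opens the file itself and reads one complete channel
    with TdmsFile.open(tdms_data_path) as tdms_file:
        data = tdms_file[gr_name][channelName][:]
        print(f'\n{channelName} is loaded')
        return data

def read_TDMS_Data_Parallel(channelList):
    p = Pool()
    result = p.map(read_TDMS_Data, channelList)
    p.close()
    p.join()
    return result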
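As a possible alternative to the globals, functools.partial can bind the file path and group name so that the mapped function still takes a single argument. This is only a sketch under that assumption: the function names are hypothetical, and the `reader` parameter exists just so the worker can be swapped out.

```python
from functools import partial
from multiprocessing import Pool


def read_channel(tdms_path, group_name, channel_name):
    # Imported here so that each worker process loads npTDMS itself
    from nptdms import TdmsFile

    with TdmsFile.open(tdms_path) as tdms_file:
        return tdms_file[group_name][channel_name][:]


def make_worker(tdms_path, group_name, reader=read_channel):
    # Bind the first two arguments; the result takes only the channel name
    return partial(reader, tdms_path, group_name)


def read_channels_parallel(tdms_path, group_name, channel_names, processes=None):
    worker = make_worker(tdms_path, group_name)
    # The context manager shuts the pool down once map() has returned
    with Pool(processes) as pool:
        return pool.map(worker, channel_names)
```

On Windows the call to `read_channels_parallel(...)` needs to sit under an `if __name__ == "__main__":` guard, because worker processes are started by re-importing the main module.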

@pashaLyb

> A comment that might help the original poster:
>
> I have used 2 GB tdms files without any problems with nptdms. However I noticed depending on how the labview program is written tdms files can "get fragmented" (maybe this is interleaved data?) and this will slow down reading a lot. An indication if you have such a fragmentation issue is that the tdms index file is large (MB instead of KB).

My TDMS files are indeed very fragmented.

@johannesloibl
Contributor

I have a TDMS file with several thousand groups (~300 MB), which I'm reading completely, but only by partially loading data from channels. This takes ~7000 s.
Here is a picture of the profiling:
[image]

You can see that I'm doing a lot of reads :D
Maybe you see some potential for improvement here, e.g. inside _read_channel_data_chunk or _have_daqmx_objects, by using some caching techniques for cases where the same channel is read many times in slices.
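Until something like that exists in the library, a user-side workaround might be to cache blocks of a channel so that repeated slice reads are served from memory. The sketch below is not part of npTDMS. `BlockCachedReader` and its parameters are hypothetical, and it only assumes that the wrapped object supports slicing, as a channel from TdmsFile.open does.

```python
from functools import lru_cache

import numpy as np


class BlockCachedReader:
    """Serve slices from fixed-size blocks that are read once and kept in memory."""

    def __init__(self, channel, block_size=100_000, max_blocks=64):
        self._channel = channel
        self._block_size = block_size
        self.reads = 0  # number of reads that reached the underlying channel
        # Per-instance LRU cache keyed by block index
        self._get_block = lru_cache(maxsize=max_blocks)(self._read_block)

    def _read_block(self, block_index):
        self.reads += 1
        start = block_index * self._block_size
        return np.asarray(self._channel[start:start + self._block_size])

    def read(self, start, stop):
        if stop <= start:
            return np.empty(0)
        size = self._block_size
        parts = []
        # Visit every block that overlaps the half-open range [start, stop)
        for block_index in range(start // size, (stop - 1) // size + 1):
            block = self._get_block(block_index)
            block_start = block_index * size
            lo = max(start, block_start) - block_start
            hi = min(stop, block_start + size) - block_start
            parts.append(block[lo:hi])
        return np.concatenate(parts)
```

A reader would be created once per channel, e.g. `reader = BlockCachedReader(tdms_file[group_name][channel_name])`, and then `reader.read(start, stop)` would replace direct slicing. Whether this helps depends on how much the requested slices overlap.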
