Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add data read #85

Draft
wants to merge 165 commits into
base: main
Choose a base branch
from
Draft

Add data read #85

wants to merge 165 commits into from

Conversation

oruebel
Copy link
Contributor

@oruebel oruebel commented Aug 31, 2024

This PR is to try and implement the proposed approach for data read from #83 to see if this approach is viable. This PR is experimental right now and should not be merged.

2. Proposed Implementation for reading data arrays

BaseReadData

  • Create a new ReadDatasetWrapper and ReadAttributeWrapper classes for reading data. This is modified from the proposal, which suggested a single BaseReadData class for reading any array (datasets, attributes) from a file.
  • Support conversion to boost multi-dimensional array for convenience
  • Note I did not update BaseRecordingData to inherit from ReadDataWrapper because the two are not compatible right now. BaseReadData uses the io and the path to access the data, whereas BaseRecordingData leaves that choice to the I/O backend and stores the references to the dataset. We may or may not want to change this.

BaseIO

  • Note I did not add abstract methods for lazy reading objects from a file to BaseIO (or more accurately I remove them) because: 1) I wanted to use the shared_ptr to the I/O rather than a raw pointer, which I can't get from BaseIO, and 2) with the ReadDatasetWrapper this is more approbriately done in the Container class directly.
  • Add pure virtual method to allow us to get the storage object type (Group, Dataset, Attribute) for a given path
  • Add pure virtual methods to read data values from a Dataset or Attribute that the BaseReadData can call for read

HDF5IO

  • Note In contrast to the proposal, I did not implement specific version of the BaseReadData for HDF5 but left read logic to HDF5IO itself so that the ReadDatasetWrapper can remain generic. To make this more manageable, I defined ReadAttributeWrapper separately
  • Implement the methods for reading data values from a Dataset or Attribute that the HDF5ReadDataSet and HDF5ReadAttribute wrappers can call can call for read
  • Implement the getObjectType method for getting the storage object type (Group, Dataset, Attribute) for a given path

Container

  • Store the io object on the Container so that we can call io->readDataset and io->readAttribute in the read methods

NWB types: TimeSeries, ElectricalSeries etc.

  • Remove storage of properties from the Container classes and replace them with access methods that return BaseReadData objects instead. This allows for reading in both read and write mode and avoids keeping data in memory that we have already written to disk. For example, in TimeSeries, these variables would need to change to properties:
    /**
    * @brief Base unit of measurement for working with the data. Actual stored
    * values are not necessarily stored in these units. To access the data in
    * these units, multiply ‘data’ by ‘conversion’ and add ‘offset’.
    */
    std::string unit;
    /**
    * @brief The description of the TimeSeries.
    */
    std::string description;
    /**
    * @brief Human-readable comments about the TimeSeries.
    */
    std::string comments;
    /**
    * @brief Size used in dataset creation. Can be expanded when writing if
    * needed.
    */
    SizeArray dsetSize;
    /**
    * @brief Chunking size used in dataset creation.
    */
    SizeArray chunkSize;
    /**
    * @brief Scalar to multiply each element in data to convert it to the
    * specified ‘unit’.
    */
    float conversion;
    /**
    * @brief Smallest meaningful difference between values in data, stored in the
    * specified by unit.
    */
    float resolution;
    /**
    * @brief Scalar to add to the data after scaling by ‘conversion’ to finalize
    * its coercion to the specified ‘unit’.
    */
    float offset;
    /**
    * @brief The starting time of the TimeSeries.
    */
    float startingTime = 0.0;
  • Add access methods that return BaseReadData for missing fields

3. Proposed implementation for reading whole Containers (e.g., to read an ElectricalSeries)

  • Add access methods on the respective Container that owns the respective objects, e.g., NWBFile owning ElectricalSeries objects to retrieve the object
  • Add abstract factory method (that is templated on the return type) to Container to create an instance of the specific Container type using only the io and path for the Container as input. The specific Container classes, such as TimeSeries will then need to implement a corresponding constructor that uses io and path as input.

Step 1: Define the Template Factory Method in Container

class Container {
public:
   
    template <typename T>
    static std::unique_ptr<T> create(const BaseIO& io, const std::string& path) {
        static_assert(std::is_base_of<Container, T>::value, "T must be a derived class of Container");
        return std::unique_ptr<T>(new T(path, io));
    }
};

Step 2: Implement the constructors on the specific Container classes (e.g., TimeSeries)

  • Add the necessary constructor
class TimeSeries : public Container {
public:
    TimeSeries(const std::string& path, const BaseIO& io) {
        // Implementation of TimeSeries constructor
    }
};

4. Proposed implementation for reading untyped groups (e.g., /acquisition)

I'm not sure we'll need do this, since a group by itself does not define data. To access the contents we could define access methods on the parent Container class (e.g., NWBFile) that owns the untyped group to access its contents.

TODO

Items moved to new issues

Next steps

@oruebel oruebel requested a review from stephprince September 1, 2024 08:45
@oruebel
Copy link
Contributor Author

oruebel commented Sep 1, 2024

@stephprince when you get a chance ,could you please do a first code review of this PR to make sure this is heading in the right direction. I now have a first outline of one possible solution for how we might implement read. There is still a lot more work to be done before this PR is ready, but it would be useful if you could take a look before I go any further with this approach.

I would start by looking at:

  1. tests/examples/test_ecephys_data_read.cpp which shows an example of how read works for the user
  2. BaseIO then defines the main new classes used for reading and HDF5IO then implements the actual reading
  3. Container and ElectricalSeries also have some relevant changes to allow us to construct Container objects for read and how we can get specific datasets/attributes

@oruebel
Copy link
Contributor Author

oruebel commented Sep 1, 2024

@stephprince I just added a documentation page as well, which hopefully helps explain the current proposed design for read so we can review and discuss.

@oruebel
Copy link
Contributor Author

oruebel commented Oct 23, 2024

We could modify the tests.yml to run all the tests with both states, i.e., with the code as is in the PR as well as with the temporary merge. This would make the CI runtime longer (since all tests would run twice) and make the workflow a bit longer but would help with finding merge errors.

We decide not to do this and to continue testing only for the merged version. We decided to add a note in the developer docs to clarify this behavior.

@oruebel
Copy link
Contributor Author

oruebel commented Dec 22, 2024

@stephprince I synced the branch with the main branch. However, Windows tests are currently failing due to 'boost/multi_array.hpp': No such file or directory. Can you check the Windows action to make sure boost is being installed correctly.

@oruebel
Copy link
Contributor Author

oruebel commented Dec 22, 2024

, Windows tests are currently failing due to 'boost/multi_array.hpp': No such file or directory

Windows tests are working again. I had to add boost multi-array to the windows Action to fix the include error for boost/multi_array.hpp and update used of variable length arrays in HDF5IO to use std::vector instead, because apparently gcc has an extension to support variable length arrays but the default compiler on Windows does not.

@oruebel oruebel mentioned this pull request Dec 26, 2024
17 tasks
@codecov-commenter
Copy link

codecov-commenter commented Jan 8, 2025

Codecov Report

Attention: Patch coverage is 89.33518% with 231 lines in your changes missing coverage. Please review.

Project coverage is 90.87%. Comparing base (a532ec5) to head (44e957d).

Files with missing lines Patch % Lines
src/io/hdf5/HDF5IO.cpp 76.42% 169 Missing ⚠️
src/io/hdf5/HDF5RecordingData.cpp 73.87% 29 Missing ⚠️
src/io/ReadIO.hpp 79.48% 8 Missing ⚠️
src/nwb/RegisteredType.cpp 79.48% 8 Missing ⚠️
src/nwb/NWBFile.cpp 87.23% 6 Missing ⚠️
src/nwb/RegisteredType.hpp 55.55% 4 Missing ⚠️
src/io/BaseIO.cpp 95.77% 3 Missing ⚠️
src/Utils.hpp 90.47% 2 Missing ⚠️
src/nwb/file/ElectrodeTable.hpp 50.00% 1 Missing ⚠️
tests/examples/test_HDF5IO_examples.cpp 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #85      +/-   ##
==========================================
+ Coverage   90.32%   90.87%   +0.55%     
==========================================
  Files          39       53      +14     
  Lines        1830     3289    +1459     
  Branches      126      229     +103     
==========================================
+ Hits         1653     2989    +1336     
- Misses        167      290     +123     
  Partials       10       10              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

* Add unit test for HDFIO::getH5ObjectType

* Fix behavior of HDF5IO::getH5ObjectType when object does not exist

* Add unit tests for HDF5IO::getNativeType

* Add unit test for HDF5IO::getH5Type

* Add unit test for HDF5IO::objectExists

* Add unit tests for HDF5IO::attributeExists

* add unit test for HDF5IO::getGroupObjects

* Add unit test for HDF5IO::getStorageObjectType

* Fix reading of arrays and select dtypes for HDF5IO::readAttribute

* Add initial unit tests for HDF5IO::readAttribute

* Fix reading of string array attributes via HDF5IO::readAttribute

* Fix reading of reference attributes

* Add HDF5IO::readAttribute unit test for invalid and double attribute

* Add HDF5IO::readAttriute unit test for reading attribute from a dataset

* Fix handling of non-existent file ReadOnly and ReadWrite mode

* Add unit tests for HDF5IO::open for file modes

* Add additonal unit tests for HDF5IO::readDataset

* Fix element selection in HDF5IO::readDataset

* Add unit test for reading hyperslap with HDF5IO::readDataset

* Fix hyperslap selection with block in HDF5IO::readDataset

* Expand unit tests for HDF5IO::readDataset with hyperslaps

* Add note with TODO item to the tests

* Improve coverage of numeric types in readDataset

* Fix determining string size on read

* Set string size when creating the dataset

* Fix error in commmented unit test

* Fix writing of fixed length string datasets

* Add test writing fixed length string data

* Fix reading of fixed length strings

* Fix writing of variable length string datasets

* Fix writing of variable length string datasets

* Fix reading of variable length string dataset

* Fix reading of empty string datasets

* Minor enhancement to check for address santizer

* Create RecordingData::writeStrinData to simplify writing of strings

* Update tests/testHDF5IO.cpp

Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>

* Update tests/testHDF5IO.cpp

Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>

* Fix formatting

---------

Co-authored-by: Steph Prince <40640337+stephprince@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Propose refactor of I/O class organization
3 participants