
Error when saving TensorFlowModelDataset as partition #759

Open
anabelchuinard opened this issue Aug 21, 2023 · 8 comments · May be fixed by #978
Labels
bug Something isn't working

Comments

anabelchuinard commented Aug 21, 2023

Description

Can't save TensorFlowModelDataset objects as partition.

Context

I am working on a project where I have to train several models concurrently. I started writing my code using PartitionedDataset, where each partition corresponds to the data for one training run. When I try to save the resulting TensorFlow models as a partition, I get an error. I wonder if this has to do with the fact that those inherit from AbstractVersionedDataset instead of AbstractDataset, and if so, I would be interested to know whether there is any workaround for batch-saving them.

This is the instance of my catalog corresponding to the models I want to save:

tensorflow_models:
  type: PartitionedDataset
  path: data/derived/ML/models
  filename_suffix: ".hdf5"
  dataset:
    type: kedro.extras.datasets.tensorflow.TensorFlowModelDataset

Note: Saving one model (not as partition) works.
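For context, the node feeding such a dataset returns a dict keyed by partition name, and PartitionedDataset saves one file per key. A minimal sketch of that pattern (fit_model, the keys, and the data are illustrative stand-ins, since the behaviour doesn't depend on TensorFlow itself):

```python
# Sketch of a node that trains one model per partition. PartitionedDataset
# passes the inputs as a dict of {partition_name: load_function}.
def train_models(partitioned_data):
    models = {}
    for name, load_func in partitioned_data.items():
        data = load_func()              # PartitionedDataset loads lazily
        models[name] = fit_model(data)  # stand-in for real model training
    return models

def fit_model(data):
    # Stand-in: a real node would return a trained tf.keras model here.
    return {"trained_on": len(data)}

result = train_models({"a": lambda: [1, 2, 3], "b": lambda: [4, 5]})
print(result)  # {'a': {'trained_on': 3}, 'b': {'trained_on': 2}}
```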

Steps to Reproduce

  1. Generate a bunch of trained models
  2. Try to save them in a partition as TensorFlowModelDataset objects

Expected Result

Should save one .hdf5 file per partition, with the name of the file being the associated dictionary key.

Actual Result

Getting this error:

DatasetError: Failed while saving data to data set PartitionedDataset(dataset_config={}, dataset_type=TensorFlowModelDataset,
path=...).
The first argument to `Layer.call` must always be passed.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.18.12
  • Python version used (python -V): 3.9.16
  • Operating system and version: Mac M2
astrojuanlu (Member) commented:

Hi @anabelchuinard, thanks for opening this issue and sorry for the delay. It will take us some time but I'm labeling this issue so we don't lose track of it.

@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Sep 5, 2023
merelcht (Member) commented Jul 8, 2024

Hi @anabelchuinard, do you still need help fixing this issue?

anabelchuinard (Author) commented:

@merelcht I found a non-kedronic workaround for this but would love to know if there is now a kedronic way for batch-saving those models.

merelcht (Member) commented Jul 9, 2024

Using the PartitionedDataset is definitely the recommended Kedro way for batch saving. I've done some digging, and it seems the following lines are what cause issues when using TensorFlowModelDataset with PartitionedDataset:

if callable(partition_data):
    partition_data = partition_data()  # noqa: PLW2901

@merelcht merelcht removed the Community Issue/PR opened by the open-source community label Jul 9, 2024
@merelcht merelcht changed the title Saving TensorFlowModelDataset as partition Error when saving TensorFlowModelDataset as partition Jul 9, 2024
@merelcht merelcht transferred this issue from kedro-org/kedro Jul 9, 2024
@merelcht merelcht added the bug Something isn't working label Jul 9, 2024
@merelcht merelcht moved this to To Do in Kedro Framework Aug 5, 2024
@ElenaKhaustova ElenaKhaustova self-assigned this Jan 6, 2025
@ElenaKhaustova ElenaKhaustova moved this from To Do to In Progress in Kedro Framework Jan 6, 2025
ElenaKhaustova (Contributor) commented Jan 7, 2025

Cause of the issue

The issue is in how we implement lazy saving for PartitionedDataset. To postpone materialising the data, we expect Callable values in the dictionary fed to PartitionedDataset instead of the actual objects:

if callable(partition_data):
    partition_data = partition_data()  # noqa: PLW2901

When saving the data, we check whether a Callable was passed and call it to get the actual object. Since a TensorFlow model is itself callable, we invoke it when saving, which causes the above error, even though the user didn't mean to apply lazy saving.

So PartitionedDataset currently cannot save Callable objects unless they're wrapped in another Callable, for example a lambda.
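The pitfall can be shown without TensorFlow at all, since it only depends on the object being callable. A minimal illustration (FakeModel is a stand-in for a tf.keras model, which defines __call__ for the forward pass):

```python
class FakeModel:
    """Stand-in for a tf.keras model: callable, forward pass needs inputs."""
    def __call__(self, *inputs):
        if not inputs:
            raise TypeError("The first argument to `Layer.call` must always be passed.")
        return inputs[0]

model = FakeModel()
print(callable(model))  # True: indistinguishable from a lazy-save wrapper

partition_data = model
try:
    if callable(partition_data):           # PartitionedDataset's lazy-save check
        partition_data = partition_data()  # unintentionally invokes the model
except TypeError as err:
    print(err)  # The first argument to `Layer.call` must always be passed.
```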

Current workaround

@anabelchuinard - To make PartitionedDataset save a callable object (such as a model) in the current Kedro version, you need to wrap it as if you wanted lazy saving:

# Without wrapping, PartitionedDataset's callable check invokes each model
# with no arguments, which raises the `Layer.call` error above:
save_dict = {
	"tensorflow_model_32": models["tensorflow_model_32"],
	"tensorflow_model_64": models["tensorflow_model_64"],
}

# Wrapping each model in a lambda prevents it being called when saving:
# PartitionedDataset calls the lambda, which returns the model intact.
save_dict = {
	"tensorflow_model_32": lambda: models["tensorflow_model_32"],
	"tensorflow_model_64": lambda: models["tensorflow_model_64"],
}

Suggested fix

Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978
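A sketch of the narrower check (illustrative only, not necessarily the implementation in #978): since named functions and lambdas share the same runtime type in Python, the lambda case can be distinguished by its compiler-assigned name.

```python
import types

def is_lambda(obj):
    # types.LambdaType is the same type as FunctionType, so the
    # "<lambda>" name is what actually narrows the check.
    return isinstance(obj, types.LambdaType) and obj.__name__ == "<lambda>"

print(is_lambda(lambda: 42))       # True: treated as a lazy-save wrapper

def named():
    return 42

print(is_lambda(named))            # False: named functions are ignored

class CallableModel:
    def __call__(self):
        return 42

print(is_lambda(CallableModel()))  # False: callable objects are ignored
```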

Following PR to update docs

kedro-org/kedro#4402

noklam (Contributor) commented Jan 7, 2025

Suggested fix
Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

To me this seems like a niche case, and changing PartitionedDataset to accept only lambdas is a bigger breaking change. Any useful callable will likely be more complicated than a simple lambda. Maybe we can let users disable lazy loading/saving (enabled by default) when specified?

ElenaKhaustova (Contributor) commented:

Suggested fix
Make PartitionedDataset accept only lambda functions for lazy saving and ignore other callable objects - #978

To me this seems to be a niche case, and changing PartitionedDataset to only accept lambda is a bigger breaking change. Any useful callable will likely be more complicated than a simple lambda. Maybe we can disable lazy loading/saving (default enable) when specified?

I see the point, but I think the issue is a bit broader than this case. In particular, I don't think it's right to call an arbitrary callable object and use that check to decide whether to apply lazy saving. This affects all ML-model cases (TensorFlow, PyTorch, scikit-learn, etc.) and can potentially execute unwanted code implemented in __call__. Moreover, it's not intuitive for users to have to wrap their objects to avoid such behaviour.

In the suggested solution I tried to narrow these cases down from any callable to lambdas only, so there's less chance of hitting them.

As an alternative, we can consider making lazy saving the default behaviour, so we internally wrap and unwrap objects automatically. But then the question is whether to make it the only option (as it is for lazy loading) or provide some interface to disable it.
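The alternative above can be sketched in a few lines (illustrative only, not Kedro's actual code): if the dataset wraps every value internally, lazy saving becomes the default and callables such as models are never invoked by the callable check.

```python
class CallableStub:
    """Stand-in for a callable model (e.g. a tf.keras model)."""
    def __call__(self, x):
        return x

def save_partitions(partitions, save_one):
    # Hypothetical internal loop: wrap every value, then unwrap just
    # before writing, so user-supplied callables pass through untouched.
    for key, value in partitions.items():
        deferred = lambda v=value: v  # default arg binds the value now
        save_one(key, deferred())     # internal unwrap just before writing

saved = {}
model = CallableStub()
save_partitions({"m32": model}, lambda k, v: saved.setdefault(k, v))
print(saved["m32"] is model)  # True: the model reaches the writer intact
```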

DimedS (Member) commented Jan 8, 2025

Thanks for the investigation and PR, @ElenaKhaustova! I agree with @noklam that relying solely on lambda functions for lazy saving doesn't seem like a generic solution. While it is a breaking change, it's hard to determine how much it will impact users. In my opinion, it would be better to avoid treating all Callables as participants in lazy saving by default. However, this would also be a breaking change. As a simpler alternative, we could provide an option to disable lazy saving, as you suggested.

Projects
Status: In Review