[tensorflow] MirroredStrategy, LossScaledOptimizer - merge_call failed #18666

Closed
crohkohl opened this issue Oct 22, 2023 · 5 comments

@crohkohl

Hi,

Using the LossScaleOptimizer with MirroredStrategy fails with the following exception:

Exception has occurred: RuntimeError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 105, in one_step_on_data  **
        return self.train_step(data)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/trainer.py", line 72, in train_step
        self.optimizer.apply_gradients(zip(gradients, trainable_weights))
    File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/base_optimizer.py", line 206, in apply_gradients
        self.apply(grads, trainable_variables)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/optimizers/loss_scale_optimizer.py", line 183, in apply
        ops.cond(finite, handle_finite_grads, handle_non_finite_grads)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/ops/core.py", line 594, in cond
        return Cond()(pred, true_fn, false_fn)
    File "/usr/local/lib/python3.10/dist-packages/keras/src/utils/traceback_utils.py", line 123, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/usr/local/lib/python3.10/dist-packages/keras/src/backend/tensorflow/optimizer.py", line 82, in _internal_apply_gradients
        tf.__internal__.distribute.interim.maybe_merge_call(

    RuntimeError: Exception encountered when calling Cond.call().
    
    `merge_call` called while defining a new graph or a tf.function. This can often happen if the function `fn` passed to `strategy.run()` contains a nested `@tf.function`, and the nested `@tf.function` contains a synchronization point, such as aggregating gradients (e.g, optimizer.apply_gradients), or if the function `fn` uses a control flow statement which contains a synchronization point in the body. Such behaviors are not yet supported. Instead, please avoid nested `tf.function`s or control flow statements that may potentially cross a synchronization boundary, for example, wrap the `fn` passed to `strategy.run` or the entire `strategy.run` inside a `tf.function` or move the control flow out of `fn`. If you are subclassing a `tf.keras.Model`, please avoid decorating overridden methods `test_step` and `train_step` in `tf.function`.

The reason for the exception is the following tf.cond() call:

ops.cond(finite, handle_finite_grads, handle_non_finite_grads)
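
For context, here is a distilled sketch of the pattern the error message warns about (an illustration only, not the Keras source): control flow inside the `fn` passed to `strategy.run()` that contains a synchronization point.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function
def step():
    def replica_fn():
        def sync_branch():
            # apply_gradients reaches a merge_call internally; invoking it
            # from inside a tf.cond branch means merge_call runs while a new
            # graph is being built, which is what the RuntimeError rejects.
            tf.distribute.get_replica_context().merge_call(lambda _: None)

        tf.cond(tf.constant(True), sync_branch, lambda: None)

    strategy.run(replica_fn)

step()  # expected to raise the same `merge_call` RuntimeError
```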

To reproduce, change the following line:

optimizer=optimizers.SGD(learning_rate=0.001, momentum=0.01),

to

optimizer=optimizers.LossScaleOptimizer(optimizers.SGD(learning_rate=0.001, momentum=0.01)),   

Alternatively, you can turn on the GPU and use mixed precision, which then automatically wraps the optimizer in a LossScaleOptimizer.
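
A minimal end-to-end reproduction could look as follows (a sketch only: the model and data are hypothetical placeholders, and it assumes Keras 3 on the TensorFlow backend with two or more GPUs visible to MirroredStrategy):

```python
import numpy as np
import tensorflow as tf
import keras
from keras import layers, optimizers

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = keras.Sequential([keras.Input(shape=(8,)), layers.Dense(1)])
    model.compile(
        loss="mse",
        # Wrapping SGD in LossScaleOptimizer introduces the ops.cond() call
        # quoted above and triggers the merge_call RuntimeError under
        # MirroredStrategy.
        optimizer=optimizers.LossScaleOptimizer(
            optimizers.SGD(learning_rate=0.001, momentum=0.01)
        ),
    )

x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, batch_size=16, epochs=1)  # fails in one_step_on_data
```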

@qlzh727
Member

qlzh727 commented Oct 23, 2023

Thanks for the report. Let me take a look.

@qlzh727 qlzh727 self-assigned this Oct 23, 2023
@qlzh727
Member

qlzh727 commented Oct 24, 2023

I see. It seems that we are missing a bunch of logic in the TF-specific backend when the loss scale optimizer runs with tf.distribute. Will fix that.

@qlzh727
Member

qlzh727 commented Oct 30, 2023

Should be addressed by #18691

@qlzh727 qlzh727 closed this as completed Oct 30, 2023
@iamsoroush

The same issue exists with the Adam and AdamW optimizers when setting use_ema=True and using MirroredStrategy.
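
For example (a sketch under the same assumptions as the reproduction above), swapping the optimizer for an EMA-enabled Adam or AdamW reportedly hits the same error:

```python
import keras

# use_ema makes the optimizer maintain an exponential moving average of the
# model weights; per the comment above, this also fails under MirroredStrategy.
optimizer = keras.optimizers.Adam(learning_rate=1e-3, use_ema=True)
# or:
# optimizer = keras.optimizers.AdamW(learning_rate=1e-3, use_ema=True)
```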

@IvanUkhov

IvanUkhov commented Dec 3, 2024

For me, gradient accumulation does not work with Adam, with or without use_ema, and results in the same error. It feels like the problem applies to the base optimizer in general. Opened #20582.
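
For example (again a sketch, assuming the `gradient_accumulation_steps` argument of the Keras 3 base optimizer), the reported failing configuration would be along the lines of:

```python
import keras

# Accumulate gradients over several steps before applying them; per the
# comment above, this reportedly triggers the same merge_call error when
# training under MirroredStrategy.
optimizer = keras.optimizers.Adam(
    learning_rate=1e-3,
    gradient_accumulation_steps=4,
)
```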
