Introduce post-operation callback #2115

charleskawczynski · 2025-01-03T21:27:55Z

It seems to me that, to date, one of the most challenging tasks that users have is: debug a large simulation that is breaking, e.g., yielding NaNs somewhere. This is especially complex for large models with many terms and implicit time-stepping with all the bells and whistles that we offer.

This PR attempts to provide tools to tackle this giant task, by introducing two functions:

call_post_op_callback
post_op_callback

One way that users may leverage this is, for example:

import ClimaCore
ClimaCore.DataLayouts.call_post_op_callback() = true
function ClimaCore.DataLayouts.post_op_callback(data)
    if any(isnan, data)
        error("Found NaNs!")
    end
end

# run simulation

Now, this assums that any(isnan, data) will work on all of the results-- it may not, but users can catch whatever data post_op_callback is called with and handle them accordingly. As such, this sort of debugging tool is capable of revealing that a given simulation errors in a wide range of locations, like here, here, or here, which may provide hints to users as to what is wrong.

I'm not sure exactly how useful this will be, but I'm curious myself to try in order to debug the flaky ClimaTimestepper issues (where I'm having this experiencing this problem of hunting down NaNs).

Since this would be a purely debugging tool, I would argue that it should not be a feature used in production (i.e., removing it should not be a breaking change).

Thoughts?

charleskawczynski · 2025-01-06T03:52:09Z

Here's an example of this feature "in action" where I'm using it to debug the flaky job(s) in ClimaTimesteppers.

Indeed, NaNs is first introduced (spatially) in solve_newton! -> ldiv! -> run_field_matrix_solver! -> single_field_solve!.

It looks like, since this was caught early enough, there is spatial information in where NaNs is showing up, which I hope turns out to be useful (I after-all suspected that the issue with this problem related to the boundary conditions).

charleskawczynski · 2025-01-06T20:48:26Z

One piece of feedback I received from today's standup meeting was: let's also call this callback with all of the arguments. I had thought about this, and I agree. I think we can call it with exactly two: the returned object (which will help make dispatch easier), and all of the arguments (which will allow users to inspect the inputs more easily).

For interactive use, this could actually be really helpful for users/debugging when tied together with Infiltrator.jl.

I'll try to update this PR to add the input arguments into the callback.

charleskawczynski requested review from dennisYatunin, Sbozzolo, szy21, juliasloan25, sriharshakandala, nefrathenrici, ph-kev and imreddyTeja January 3, 2025 21:27

Introduce post-operation callback

36c8118

charleskawczynski force-pushed the ck/debugging_hooks branch from b10fbfc to 36c8118 Compare January 4, 2025 00:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce post-operation callback #2115

Introduce post-operation callback #2115

charleskawczynski commented Jan 3, 2025

charleskawczynski commented Jan 6, 2025

charleskawczynski commented Jan 6, 2025

Introduce post-operation callback #2115

Are you sure you want to change the base?

Introduce post-operation callback #2115

Conversation

charleskawczynski commented Jan 3, 2025

charleskawczynski commented Jan 6, 2025

charleskawczynski commented Jan 6, 2025