fix torch.linspace #2416

Open: wants to merge 6 commits into base: main

@twoertwein twoertwein commented Dec 11, 2024

fixes #2412

@TobyRoseman where should I add test cases for this? I grepped through the tests and found none for torch.linspace. A quick tutorial on how to get the tests running locally would be great, too :) I am getting RuntimeError: BlobWriter not loaded when calling ct.convert.

@TobyRoseman
Collaborator

Thanks for looking into this.

Please add your unit test to the TestLinspace Test Class.

To get things running locally, see our Building from Source Document.

@twoertwein
Author

twoertwein commented Dec 12, 2024

This existing test fails: TestLinspace::test_linspace_static_large[compute_unit=ComputeUnit.CPU_ONLY-backend=('mlprogram', 'fp16')]. It computes torch.linspace(1, 2_000_000, 2_000_000), but the converted result contains infs:
array([ 1., 2., 3., ..., inf, inf, inf], dtype=float32)

Here is a minimal example showing where the infs start to appear:

arange = mb.range_1d(start=0.0, end=2_000_000.0, step=1.0) # no infs
res = mb.add(x=arange, y=1.0) # now we have infs! (same also if we do mul instead of add)

Do you have an idea how to avoid that?

Given the "fp16" in the test name, this seems somewhat expected: 2_000_000 > 65_504, the largest finite float16 value?
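The float16 limit mentioned above can be checked directly in NumPy. This is only a small sketch; the helper name `fits_in_fp16` is hypothetical and is not part of coremltools.

```python
import numpy as np

# Largest finite value that float16 can represent.
fp16_max = float(np.finfo(np.float16).max)


def fits_in_fp16(value: float) -> bool:
    """Return True if `value` is still finite after casting to float16."""
    with np.errstate(over="ignore"):  # silence the expected overflow warning
        as_half = np.asarray(value, dtype=np.float64).astype(np.float16)
    return bool(np.isfinite(as_half))


print(fp16_max)                   # 65504.0
print(fits_in_fp16(65_504.0))     # True
print(fits_in_fp16(2_000_000.0))  # False
```

Any linspace endpoint above this limit becomes inf as soon as an intermediate value is stored in float16.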

@TobyRoseman
Collaborator

The TestLinspace Unit Tests work for me with the most recent main. Are you saying the existing unit tests fail with your change? If so, what is the error and stack trace?

@twoertwein
Author

twoertwein commented Dec 13, 2024

Yes, it works on main and fails on this PR:

$ pytest -k 'test_linspace_static_large and fp16' coremltools/converters/mil/frontend/torch/test/test_torch_ops.py
F                                                                                                                                                                      [100%]
================================================================================== FAILURES ==================================================================================
__________________________________ TestLinspace.test_linspace_static_large[compute_unit=ComputeUnit.CPU_ONLY-backend=('mlprogram', 'fp16')] __________________________________

self = <coremltools.converters.mil.frontend.torch.test.test_torch_ops.TestLinspace object at 0x141c19290>, compute_unit = <ComputeUnit.CPU_ONLY: 3>
backend = ('mlprogram', 'fp16')

    @pytest.mark.parametrize("compute_unit, backend", itertools.product(compute_units, backends))
    def test_linspace_static_large(self, compute_unit, backend):
        input_shape = tuple([1])
    
        class Model(nn.Module):
            def forward(self, x):
                return torch.linspace(1, 2_000_000, 2_000_000)
    
        model = Model()
>       self.run_compare_torch(input_shape, model, backend=backend, compute_unit=compute_unit)

coremltools/converters/mil/frontend/torch/test/test_torch_ops.py:5189: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
coremltools/converters/mil/frontend/torch/test/testing_utils.py:369: in run_compare_torch
    model_spec, mlmodel, coreml_inputs, coreml_results = convert_and_compare(
coremltools/converters/mil/frontend/torch/test/testing_utils.py:313: in convert_and_compare
    np.testing.assert_allclose(coreml_result, torch_result, atol=atol, rtol=rtol)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

args = (<function assert_allclose.<locals>.compare at 0x157121440>, array([ 1.,  2.,  3., ..., inf, inf, inf], dtype=float32), array([1.000000e+00, 2.000000e+00, 3.000000e+00, ..., 1.999998e+06,
       1.999999e+06, 2.000000e+06], dtype=float32))
kwds = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=0.05, atol=0.5', 'strict': False, ...}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Not equal to tolerance rtol=0.05, atol=0.5
E           
E           +inf location mismatch:
E            ACTUAL: array([ 1.,  2.,  3., ..., inf, inf, inf], dtype=float32)
E            DESIRED: array([1.000000e+00, 2.000000e+00, 3.000000e+00, ..., 1.999998e+06,
E                  1.999999e+06, 2.000000e+06], dtype=float32)

../miniforge3/lib/python3.11/contextlib.py:81: AssertionError

This PR is closer to the pytorch and numpy behavior:

import numpy as np
import torch

torch.linspace(0, 2_000_000, 2_000_000).to(dtype=torch.float16)
# tensor([0., 1., 2.,  ..., inf, inf, inf], dtype=torch.float16)

torch.linspace(0, 2_000_000, 2_000_000, dtype=torch.float16)
# RuntimeError: value cannot be converted to type at::Half without overflow

np.linspace(0, 2_000_000, 2_000_000, dtype=np.float16)
# array([ 0.,  1.,  2., ..., inf, inf, inf], dtype=float16)

np.linspace(0, 2_000_000, 2_000_000).astype(np.float16)
# array([ 0.,  1.,  2., ..., inf, inf, inf], dtype=float16)

It might be good to explicitly cast the actual and expected outputs in run_compare_torch to the requested dtype (in this case np.float16)? Or to xfail this test, since 2 million can't be expressed in float16?

Edit: this static test ends up in the dynamic code path because the condition nums_val < MAX_SIZE_CONSTANT_FOLDING is false for float16.

@twoertwein
Author

I extended an existing test to trigger the reported bug on main (these tests pass on this PR):

$ pytest -k 'test_linspace_dynamic and 10' coremltools/converters/mil/frontend/torch/test/test_torch_ops.py       
F...F.......                                                                                                                                                           [100%]
================================================================================== FAILURES ==================================================================================
____________________ TestLinspace.test_linspace_dynamic[compute_unit=ComputeUnit.CPU_ONLY-backend=('mlprogram', 'fp16')-start_end=(-0.1, -0.7)-steps=10] _____________________

self = <coremltools.converters.mil.frontend.torch.test.test_torch_ops.TestLinspace object at 0x12ea12690>, compute_unit = <ComputeUnit.CPU_ONLY: 3>
backend = ('mlprogram', 'fp16'), start_end = (-0.1, -0.7), steps = 10

    @pytest.mark.parametrize(
        "compute_unit, backend, start_end, steps",
        itertools.product(
            compute_units,
            backends,
            [(-0.1, -0.7), (1, 10)],
            [1, 2, 10, 100],
        ),
    )
    def test_linspace_dynamic(self, compute_unit, backend, start_end, steps):
        start, end = start_end
    
        class Model(nn.Module):
            def forward(self, x):
                return torch.linspace(x[0], x[1], steps)
    
        model = Model()
        inputs = [torch.Tensor([start, end])]
>       self.run_compare_torch(
            inputs,
            model,
            backend=backend,
            compute_unit=compute_unit,
            input_as_shape=False,
        )

coremltools/converters/mil/frontend/torch/test/test_torch_ops.py:5551: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
coremltools/converters/mil/frontend/torch/test/testing_utils.py:369: in run_compare_torch
    model_spec, mlmodel, coreml_inputs, coreml_results = convert_and_compare(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

input_data = [tensor([-0.1000, -0.7000])], model_spec = Model(original_name=Model)
expected_results = [array([-0.1       , -0.16666666, -0.23333332, -0.29999998, -0.36666664,
       -0.43333334, -0.5       , -0.56666666, -0.6333333 , -0.7       ],
      dtype=float32)]
atol = 0.5, rtol = 0.05, backend = ('mlprogram', 'fp16'), converter_input_type = None, compute_unit = <ComputeUnit.CPU_ONLY: 3>, minimum_deployment_target = None
converter = <function convert at 0x12a155260>

    def convert_and_compare(
        input_data,
        model_spec,
        expected_results=None,
        atol=1e-4,
        rtol=1e-05,
        backend=("neuralnetwork", "fp32"),
        converter_input_type=None,
        compute_unit=ct.ComputeUnit.CPU_ONLY,
        minimum_deployment_target=None,
        converter=ct.convert,
    ):
        """
        If expected results is not set, it will by default
        be set to the flattened output of the torch model.
    
        Inputs:
    
        - input_data: torch.tensor or list[torch.tensor]
        """
        if isinstance(model_spec, str):
            torch_model = torch.jit.load(model_spec)
        else:
            torch_model = model_spec
        if _HAS_TORCH_EXPORT_API and isinstance(torch_model, ExportedProgram):
            torch_model = torch_model.module()
    
        if not isinstance(input_data, (list, tuple)):
            input_data = [input_data]
    
        if expected_results is None:
            torch_input = _copy_input_data(input_data)
            expected_results = torch_model(*torch_input)
        expected_results = flatten_and_detach_torch_results(expected_results)
    
        PYTEST_CURRENT_TEST = os.environ.get("PYTEST_CURRENT_TEST").split("(call)")[0].strip()
        if PYTEST_CURRENT_TEST in debug_save_mlmodels:
            serialization_path = _create_current_pytest_serialization_path()
            Path(serialization_path).mkdir(parents=True, exist_ok=True)
            flat_inputs = flatten_and_detach_torch_results(input_data)
            np.savez(serialization_path + "ref_inputs.npz", *flat_inputs)
            np.savez(serialization_path + "ref_outputs.npz", *expected_results)
    
        mlmodel = convert_to_mlmodel(
            model_spec,
            input_data,
            backend=backend,
            converter_input_type=converter_input_type,
            compute_unit=compute_unit,
            minimum_deployment_target=minimum_deployment_target,
            converter=converter,
        )
    
        coreml_inputs = convert_to_coreml_inputs(mlmodel.input_description, input_data)
    
        if not _IS_MACOS or (mlmodel.is_package and coremltoolsutils._macos_version() < (12, 0)):
            return model_spec, mlmodel, coreml_inputs, None
    
        _, dtype = backend
        if mlmodel.compute_unit != ct.ComputeUnit.CPU_ONLY or (dtype == "fp16"):
            atol = max(atol * 100.0, 5e-1)
            rtol = max(rtol * 100.0, 5e-2)
    
        if not coremltoolsutils._has_custom_layer(mlmodel._spec):
            coreml_preds = mlmodel.predict(coreml_inputs)
            coreml_outputs = mlmodel._spec.description.output
            coreml_results = [coreml_preds[output.name] for output in coreml_outputs]
            for torch_result, coreml_result in zip(expected_results, coreml_results):
    
                if torch_result.shape == ():
                    torch_result = np.array([torch_result])
>               np.testing.assert_equal(coreml_result.shape, torch_result.shape)
E               AssertionError: 
E               Items are not equal:
E               item=0
E               
E                ACTUAL: 11
E                DESIRED: 10

coremltools/converters/mil/frontend/torch/test/testing_utils.py:312: AssertionError

[...]


FAILED coremltools/converters/mil/frontend/torch/test/test_torch_ops.py::TestLinspace::test_linspace_dynamic[compute_unit=ComputeUnit.CPU_ONLY-backend=('mlprogram', 'fp16')-start_end=(-0.1, -0.7)-steps=10] - AssertionError: 
FAILED coremltools/converters/mil/frontend/torch/test/test_torch_ops.py::TestLinspace::test_linspace_dynamic[compute_unit=ComputeUnit.CPU_ONLY-backend=('mlprogram', 'fp16')-start_end=(1, 10)-steps=10] - AssertionError: 

@twoertwein
Author

I xfailed the test on fp16 for now.

Converting the PyTorch expected results to float16 in run_compare_torch is not a solution, because the position of the first inf is off by one:

+inf location mismatch:
 ACTUAL: array([ 1.,  2.,  3., ..., inf, inf, inf], dtype=float32)
 DESIRED: array([ 1.,  2.,  3., ..., inf, inf, inf], dtype=float16)
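One way such an off-by-one inf position can arise is a difference in where the rounding to float16 happens. The NumPy sketch below is only an illustration of that effect, not a reproduction of what Core ML does internally; the helper `first_inf_index` and the chosen value range are assumptions made for the example.

```python
import numpy as np


def first_inf_index(values: np.ndarray) -> int:
    """Return the index of the first +inf in `values`, or -1 if there is none."""
    hits = np.flatnonzero(np.isposinf(values))
    return int(hits[0]) if hits.size else -1


# float32 values straddling the float16 overflow threshold (65_520 rounds to inf).
base = np.arange(65_500, 65_530, dtype=np.float32)

with np.errstate(over="ignore"):  # silence the expected overflow warnings
    # Shift in float32, then round to float16 once at the end.
    shift_then_cast = (base + np.float32(1)).astype(np.float16)
    # Round to float16 first, then shift in float16.
    cast_then_shift = base.astype(np.float16) + np.float16(1)

idx_a = first_inf_index(shift_then_cast)  # 19
idx_b = first_inf_index(cast_then_shift)  # 20
```

Both arrays describe the same sequence mathematically, yet the first inf lands one index apart, so an elementwise comparison reports an inf location mismatch.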

@TobyRoseman
Collaborator

Since 2 million cannot be represented in fp16, rather than xfailing those tests, it would be better to change the model being converted so that it does not use values that high.

Any ideas why we were not running into this issue before your change?

@twoertwein
Author

Any ideas why we were not running into this issue before your change?

Unfortunately not. I don't understand the casting magic in coremltools: maybe you do some operations in fp32 and only cast them in later operations. The previous code might have created the results in fp32 and had no operations afterwards. The new code creates the array and then scales and shifts it (which might involve the actual casts to fp16)? Just speculating :)
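The "create the array, then scale and shift it" idea can be sketched in NumPy as follows. This illustrates the general technique only and is not the code in this PR; `linspace_scale_shift` is a hypothetical name.

```python
import numpy as np


def linspace_scale_shift(start: float, end: float, steps: int) -> np.ndarray:
    """Build a linspace from an integer index range that is scaled and shifted."""
    if steps == 1:
        # Like torch.linspace, a single step returns just `start`.
        return np.array([start], dtype=np.float32)
    index = np.arange(steps, dtype=np.float32)  # 0, 1, ..., steps - 1
    step = (end - start) / (steps - 1)          # spacing between neighbours
    return (index * step + start).astype(np.float32)


out = linspace_scale_shift(-0.1, -0.7, 10)
print(out.shape)  # (10,)
```

Because the length comes from the integer `steps` rather than from a floating-point step size, the result always has exactly `steps` elements. The scale and shift are also the points where a float16 computation can overflow for large endpoints.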

@twoertwein
Author

it would be better to change the model being converted so that it does not use values that high.

Changed: the test now uses the largest integer representable in float16, and it passes :)

Successfully merging this pull request may close these issues.

Incorrect behavior for torch.linspace
2 participants