
1.19: Clip operator with type FLOAT16 defaults to min or max value 0.0 if not explicitly given, breaking many models using FLOAT16 #21957

Closed
lesters opened this issue Sep 2, 2024 · 4 comments
Labels: regression (issues that demonstrate a regression in ORT functionality and need to be addressed immediately)

Comments

lesters commented Sep 2, 2024

Describe the issue

On version 1.19, for type FLOAT16, the Clip operator's min and max values default to 0.0 if they are not explicitly set. This is the wrong behaviour, and it breaks quite a few models that use FLOAT16.

For instance, clipping the tensor [-1, 0.5, 1.0] with a min value of 0.0 (and no max value set) results in [0.0, 0.0, 0.0] instead of the expected [0.0, 0.5, 1.0]. The same happens when a max value is given but no min value. The behaviour is correct when both are set.

If not given, the expected values of min and max are -inf and +inf respectively, not 0.0.

The regular FLOAT type, for instance, is not affected.

This was not a problem on 1.18.

To reproduce

The following Python code fails when run with FLOAT16:

import onnx
import numpy as np
import numpy.testing as npt
import onnxruntime as ort
from onnx import helper


def test_model(onnx_float_type, np_float_type):
    shape = (3,1)
    input_tensor = helper.make_tensor_value_info(name='input', elem_type=onnx_float_type, shape=shape)
    output_tensor = helper.make_tensor_value_info(name='output', elem_type=onnx_float_type, shape=shape)

    min_tensor = helper.make_tensor(
        name='min_tensor',
        data_type=onnx_float_type,
        dims=(1,),
        vals=[np_float_type(0.0)]
    )

    min_node = helper.make_node(
        'Constant',
        inputs=[],
        outputs=['min'],
        value=min_tensor
    )

    clip_node = helper.make_node(
        'Clip',
        inputs=['input', 'min'],  # no 'max' input
        outputs=['output'],
        name='clip'
    )

    graph = helper.make_graph(
        nodes=[min_node, clip_node],
        name='clip_model',
        inputs=[input_tensor],
        outputs=[output_tensor]
    )

    model = helper.make_model(graph)

    model_path = 'clip_model.onnx'
    onnx.save_model(model, model_path)
    session = ort.InferenceSession(model_path)

    test_input = np.array([[-1.0], [0.5], [1.5]], dtype=np_float_type)
    expected_output = np.array([[0.0], [0.5], [1.5]], dtype=np_float_type)

    outputs = session.run(None, {"input": test_input})

    print("Output:", outputs[0])
    npt.assert_array_equal(outputs[0], expected_output)


test_model(onnx.TensorProto.FLOAT, np.float32)
test_model(onnx.TensorProto.FLOAT16, np.float16)

This code does not fail on 1.18.

Urgency

URGENT: This is a blocker for us to upgrade to 1.19 as it breaks quite a few models using FLOAT16.

Platform

Linux

OS Version

all

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.19

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

@arnej27959 (Contributor) commented:

We think this is triggered by #21493, which enables MLFloat16 in onnxruntime/core/providers/cpu/math/clip.cc.
The implementation uses std::numeric_limits<T>::max() and lowest() as the default values for max_val and min_val, but there is no specialisation of std::numeric_limits for MLFloat16, so the unspecialized primary template is used and the defaults become 0 instead of the extremes Clip needs.

@tianleiwu (Contributor) commented:

@yihonglyu, please fix the CPU Clip with code like this.

@arnej27959 (Contributor) commented:

add-numeric-limits-patch.txt

I suggest adding specializations of std::numeric_limits for the extra floating-point types; that could also be generally useful for other operators. The attached patch works for MLFloat16 and BFloat16.

@sophies927 sophies927 added the regression issues that demonstrate a regression in ORT functionality and need to be addressed immediately label Sep 5, 2024
tianleiwu added a commit that referenced this issue Sep 26, 2024
### Description
* Add std::numeric_limits for MLFloat16 and BFloat16.
* Update some comments in csharp ORTFloat16.shared.cs.
* Add unit tests (including Clip)

Note that the canonical NaN is not consistent between C++ and C#: C# uses a negative quiet NaN as the canonical NaN, while C++ uses a positive quiet NaN. The CSharp Float16.NaN was chosen to be consistent with System.Half.NaN.

FP16 data returned from CUDA might have 0x7FFF as NaN; FP16 data from the CPU provider might have 0x7E00 as NaN. In any case, there is no consistent canonical NaN in ORT right now. Because all of these NaNs conform to the IEEE spec, this should not be an issue downstream.

### Motivation and Context
std::numeric_limits is used in the codebase but was not defined for MLFloat16 and BFloat16. This caused bugs like
#21957, introduced by
#21493.
@yihonglyu (Contributor) commented:

@arnej27959 Now that #22197 is merged, please close this issue if it is resolved.
