Corrupted value for model outputs that are also model inputs #21922

Open
adrianlizarraga opened this issue Aug 29, 2024 · 3 comments

Labels: core runtime (issues related to core runtime) · quantization (issues related to quantization) · stale (issues that have not been addressed in a while; categorized by a bot)


@adrianlizarraga (Contributor)

Describe the issue

There seems to be a memory corruption bug for model outputs that are also model inputs. Consider a model with an input that is also a model output:
[figure: model graph in which 'input_0' feeds an Add node (producing 'plus_10') and is also directly a graph output]

The above model is run multiple times, and each run's outputs are saved into a list. After the run loop, only the saved outputs for input_0 (the output that is also a graph input) have incorrect values, and only for some of the runs. Here's some pseudo-code (full repro script included below):

all_run_outputs = []
for run_index in range(num_runs):
    inputs = get_inputs(run_index)  # build the input dict for this run
    outputs = session.run(None, inputs)

    # All graph outputs are always correct here (immediately after session.run()).
    assert all_outputs_correct(outputs, ...)  # Passes

    # Add outputs to list
    all_run_outputs.append(outputs)

# After the run loop, the saved outputs for 'input_0' (which is also a graph input) are incorrect/corrupted.
for saved_run_outputs in all_run_outputs:
    assert all_outputs_correct(saved_run_outputs, ...)  # Fails
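
For reference, here is a minimal, self-contained sketch (plain numpy, no onnxruntime) of the aliasing behavior that the repro script's ctypes.data assertion detects: when the returned "output" array wraps the same buffer as another array, any later reuse of that buffer silently changes the saved output.

import numpy as np

inp = np.zeros((1, 2, 2, 2), dtype=np.float32)
out = inp.view()                    # stand-in for an output array that aliases the input buffer
assert np.shares_memory(inp, out)   # what the ctypes.data comparison in the repro detects
inp[...] = 6.0                      # later reuse of the buffer...
print(out.flatten()[0])             # ...silently changes the saved "output": prints 6.0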

One workaround is to copy the output numpy arrays for the problematic outputs (e.g., input_0). Here's pseudo-code:

all_run_outputs = []
for run_index in range(num_runs):
    inputs = get_inputs(run_index)  # build the input dict for this run
    outputs = session.run(None, inputs)

    # All graph outputs are always correct here (immediately after session.run()).
    assert all_outputs_correct(outputs, ...)  # Passes

    # Workaround: call np.ndarray.copy() on graph outputs that are also graph inputs.
    fixed_outputs = []
    for i, output in enumerate(outputs):
        if output_names[i] in input_names:
            fixed_outputs.append(output.copy())
        else:
            fixed_outputs.append(output)

    # Add fixed outputs to list
    all_run_outputs.append(fixed_outputs)

# After the run loop, all saved outputs are still correct.
for saved_run_outputs in all_run_outputs:
    assert all_outputs_correct(saved_run_outputs, ...)  # Passes
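
The per-output copy logic can also be factored into a small helper. This is just a sketch of the workaround using the standard onnxruntime Python API (session.get_inputs() / session.get_outputs()), not code from the repository:

def copy_aliased_outputs(session, outputs):
    # Copy any output whose name is also an input name, so the saved
    # array no longer aliases the input buffer.
    input_names = {node_arg.name for node_arg in session.get_inputs()}
    return [
        out.copy() if node_arg.name in input_names else out
        for node_arg, out in zip(session.get_outputs(), outputs)
    ]

With this helper, the loop body becomes all_run_outputs.append(copy_aliased_outputs(session, outputs)).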

To reproduce

Here's a full Python script (repro.py) that reproduces the issue.

from __future__ import annotations
import argparse
import onnx
import onnxruntime
import numpy as np
import ctypes

"""
Reproduces memory corruption error when a graph input is also a graph output

USAGE:
  python repro.py --num_runs 7

USAGE (apply workaround):
  python repro.py --num_runs 7 --use_workaround
"""

SHAPE = (1, 2, 2, 2)

def make_model():
    """
    Makes onnx model with an input that is also a graph output.
    'input_0' ---+----> (is graph output)
                 | 
                 +--> Add(+ 10) --> 'plus_10'
    """
    inp_shape = SHAPE
    input_0 = onnx.helper.make_tensor_value_info("input_0", onnx.TensorProto.FLOAT, inp_shape)
    output_0 = onnx.helper.make_tensor_value_info("plus_10", onnx.TensorProto.FLOAT, inp_shape)
    ten_const = onnx.numpy_helper.from_array(np.array(10, dtype=np.float32), "ten_const")

    add_node = onnx.helper.make_node("Add", ["input_0", "ten_const"], ["plus_10"], name="Add0")
    graph = onnx.helper.make_graph(
        [add_node],
        "AddTen_f32",
        [input_0],
        [output_0, input_0],
        initializer=[ten_const],
    )
    opset_imports = [onnx.helper.make_opsetid("", 21)]
    model = onnx.helper.make_model(graph, opset_imports=opset_imports)
    model = onnx.shape_inference.infer_shapes(model)
    onnx.checker.check_model(model, True)
    return model

def get_inputs(run_index: int):
    return {
        "input_0": np.full(SHAPE, float(run_index), dtype=np.float32),  # elems equal to run_index
    }

def get_expected_outputs(run_index: int):
    return {
        "plus_10": np.full(SHAPE, run_index + 10.0, dtype=np.float32),  # elems equal to run_index + 10
        "input_0": np.full(SHAPE, float(run_index), dtype=np.float32),  # elems equal to run_index
    }

def check_outputs(run_index, output_names, outputs, verbose=False) -> list[bool]:
    """
    Checks that the outputs for a run match the expected values.
    """
    expected_outputs = get_expected_outputs(run_index)
    output_correctness = [True] * len(outputs)
    for i, output in enumerate(outputs):
        output_name = output_names[i]
        expected_output = expected_outputs[output_name]

        if not np.array_equal(output, expected_output):
            if verbose:
                print(f"\tGraph output '{output_name}' is WRONG")
                print(f"\t\texpected: {expected_output.flatten().tolist()}")
                print(f"\t\tactual:   {output.flatten().tolist()}")
            output_correctness[i] = False
        else:
            if verbose:
                print(f"\tGraph output '{output_name}' is correct")
            output_correctness[i] = True
    
    return output_correctness

def parse_args():
    parser = argparse.ArgumentParser(description="Reproduces memory corruption error when a graph input is also a graph output")
    parser.add_argument("--use_workaround", action="store_true", default=False, help="Use a workaround for the problem")
    parser.add_argument("--num_runs", type=int, default=7, help="The number of times to run the model")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    model = make_model()
    sess_options = onnxruntime.SessionOptions()
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL
    session = onnxruntime.InferenceSession(
        model.SerializeToString(),
        sess_options=sess_options,
        providers=['CPUExecutionProvider'],
    )

    input_names = [node_arg.name for node_arg in session.get_inputs()]
    output_names = [node_arg.name for node_arg in session.get_outputs()]

    # Run the model multiple times and collect all run outputs in a list.
    # For each run, the expected values of the graph outputs 'plus_10' and 'input_0'
    # are (run_index + 10) and run_index, respectively.
    all_run_outputs = []
    for run_index in range(args.num_runs):
        inputs = get_inputs(run_index)
        outputs = session.run(None, inputs)

        # All graph outputs are always correct at this point (immediately after session.run()).
        # However, outputs that are also graph inputs will be incorrect **after** this loop due to a memory corruption(?).
        output_correctness = check_outputs(run_index, output_names, outputs, verbose=False)
        assert all(output_correctness), "All outputs should be correct immediately after session.run()"

        # Check that the input and output numpy arrays for 'input_0' point to the same memory.
        output_index_for_input_0 = output_names.index('input_0')
        assert inputs['input_0'].ctypes.data == outputs[output_index_for_input_0].ctypes.data, "input and output arrays should point to the same data"

        # Add this run's outputs to a list
        if not args.use_workaround:
            all_run_outputs.append(outputs)  # Doesn't work
            #all_run_outputs.append(outputs[:])  # Doesn't work either
        else:
            # The following would work: call np.ndarray.copy() on graph outputs that are also graph inputs.
            fixed_outputs = []
            for i, output in enumerate(outputs):
                if output_names[i] in input_names:
                    fixed_outputs.append(output.copy())
                    # Storing the input np.array would also work, but shouldn't have to do this.
                    #fixed_outputs.append(inputs[output_names[i]])
                else:
                    fixed_outputs.append(output)
            all_run_outputs.append(fixed_outputs)
            
            # The following one-liner also WORKS, but copies all np.arrays unnecessarily
            #all_run_outputs.append([o.copy() for o in outputs])


    assert len(all_run_outputs) == args.num_runs, "Unexpected number of elements in all_run_outputs"

    # Check if the outputs for each run are correct.
    times_output_is_wrong = [0] * len(output_names)
    for run_index, outputs in enumerate(all_run_outputs):
        print(f"\nRun {run_index}")
        output_correctness = check_outputs(run_index, output_names, outputs, verbose=True)

        # Count how many times a graph output has been incorrect
        for i, output_is_correct in enumerate(output_correctness):
            if not output_is_correct:
                times_output_is_wrong[i] += 1
    
    
    # Print a summary of results
    if any(wrong_count > 0 for wrong_count in times_output_is_wrong):
        print("\nFAILURE")
        for i, wrong_count in enumerate(times_output_is_wrong):
            print(f"Number of incorrect '{output_names[i]}' graph outputs = {wrong_count}")
    else:
        print("\nALL OUTPUTS OK")
        

Here's the console output from a sample run:

$ python repro.py --num_runs 7

Run 0
        Graph output 'plus_10' is correct
        Graph output 'input_0' is WRONG
                expected: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
                actual:   [6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0]

Run 1
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 2
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 3
        Graph output 'plus_10' is correct
        Graph output 'input_0' is WRONG
                expected: [3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]
                actual:   [6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0]

Run 4
        Graph output 'plus_10' is correct
        Graph output 'input_0' is WRONG
                expected: [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
                actual:   [14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0]

Run 5
        Graph output 'plus_10' is correct
        Graph output 'input_0' is WRONG
                expected: [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
                actual:   [15.0, 15.0, 15.0, 15.0, 15.0, 15.0, 15.0, 15.0]

Run 6
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

FAILURE
Number of incorrect 'plus_10' graph outputs = 0
Number of incorrect 'input_0' graph outputs = 4

Here's a run that applies the workaround described above:

$ python repro.py --num_runs 7 --use_workaround

Run 0
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 1
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 2
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 3
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 4
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 5
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

Run 6
        Graph output 'plus_10' is correct
        Graph output 'input_0' is correct

ALL OUTPUTS OK

Urgency

This issue is causing a crash when running the Python quantization tools with the Percentile, Distribution, and Entropy calibration methods. These calibration methods create an augmented ONNX model that makes all model inputs into model outputs. The output data from this augmented model is corrupted, which causes an eventual crash.

def collect_data(self, data_reader: CalibrationDataReader):
    """
    Entropy Calibrator collects operators' tensors as well as generates tensor histogram for each operator.
    """
    while True:
        inputs = data_reader.get_next()
        if not inputs:
            break
        self.intermediate_outputs.append(self.infer_session.run(None, inputs))
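
For comparison, here is a sketch of the same collection loop with the copy workaround applied. This is a hypothetical standalone function (not the actual code later merged in the workaround PR):

def collect_data_with_copy_workaround(infer_session, data_reader):
    input_names = {i.name for i in infer_session.get_inputs()}
    output_names = [o.name for o in infer_session.get_outputs()]
    intermediate_outputs = []
    while True:
        inputs = data_reader.get_next()
        if not inputs:
            break
        outputs = infer_session.run(None, inputs)
        # Copy any output that aliases a model input before storing it.
        intermediate_outputs.append([
            arr.copy() if name in input_names else arr
            for name, arr in zip(output_names, outputs)
        ])
    return intermediate_outputs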

Platform

Windows

OS Version

Windows 11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.19.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

@adrianlizarraga (Contributor, Author)

Hi @yuslepukhin,
I believe you've dealt with code related to how we wrap numpy arrays over OrtValues. Do you know what could be happening?

adrianlizarraga added a commit that referenced this issue on Sep 9, 2024: …calibrators (#21972)

### Description
- Applies a workaround that prevents the histogram-based calibrators
(percentile, entropy, distribution) from crashing. The workaround
involves copying inference outputs that come directly from model inputs.
A description of the bug is here:
#21922. **This PR does
not fix the root bug, but instead provides a workaround to _unblock_
users using histogram-based calibration.**
- Adds a unit test that runs all histogram-based calibrators to help
catch future regressions. We didn't have unit tests that ran these
calibration methods.

### Motivation and Context
Trying to quantize a model with the percentile, entropy, or distribution
calibration methods raises an exception:
```shell
  File "/.../site-packages/onnxruntime/quantization/quantize.py", line 691, in quantize
    quantize_static(
  File "/.../site-packages/onnxruntime/quantization/quantize.py", line 525, in quantize_static
    calibrator.collect_data(calibration_data_reader)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 571, in collect_data
    self.collector.collect(clean_merged_dict)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 746, in collect
    return self.collect_value(name_to_arr)
  File "/.../site-packages/onnxruntime/quantization/calibrate.py", line 836, in collect_value
    hist, hist_edges = np.histogram(data_arr, self.num_bins, range=(-threshold, threshold))
  File "<__array_function__ internals>", line 180, in histogram
  File ".../site-packages/numpy/lib/histograms.py", line 793, in histogram
    bin_edges, uniform_bins = _get_bin_edges(a, bins, range, weights)
  File "/.../site-packages/numpy/lib/histograms.py", line 426, in _get_bin_edges
    first_edge, last_edge = _get_outer_edges(a, range)
  File "/.../site-packages/numpy/lib/histograms.py", line 315, in _get_outer_edges
    raise ValueError(
ValueError: supplied range of [nan, nan] is not finite
```

The calibrators create an augmented model with all tensors (including
model inputs) set as model outputs. The data for outputs that are also
model inputs is corrupted as described in
#21922. The corrupted
data sometimes contains `NaN` values that cause numpy's histogram
utilities to raise an exception.
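
The numpy side of the failure is easy to demonstrate in isolation. A minimal standalone sketch (plain numpy, unrelated to the quantization code): a NaN in the collected tensor data makes the histogram range non-finite, so np.histogram raises.

```python
import numpy as np

data = np.array([1.0, 2.0, np.nan], dtype=np.float32)
try:
    np.histogram(data, bins=4)
except ValueError as err:
    print(err)  # e.g. "autodetected range of [nan, nan] is not finite"
```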
@github-actions (bot)

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Sep 29, 2024
@kshpv commented Nov 19, 2024

Hi, I've hit the same issue. Do you plan to take a look at it?
Note: the issue does not reproduce with ONNXRuntime==1.17.3.
