Commit 8795661

Comment on bytes representation for initializer deduplication key
Signed-off-by: Christoph Berganski <[email protected]>
1 parent 8a4902d commit 8795661

1 file changed (+12, -2 lines changed)

src/onnx_ir/passes/common/initializer_deduplication.py

Lines changed: 12 additions & 2 deletions
@@ -48,8 +48,18 @@ def _should_skip_initializer(initializer: ir.Value, size_limit: int) -> bool:
 
 
 def _tobytes(val: ir.TensorProtocol):
-    # StringTensor does not support tobytes. Use 'string_data'
-    # instead.
+    """StringTensor does not support tobytes. Use 'string_data' instead.
+    However, 'string_data' yields a list of bytes which cannot be hashed, i.e.,
+    cannot be used to index into a dict. To generate keys for identifying
+    tensors in initializer deduplication the following converts the list of
+    bytes to an array of fixed-length strings which can be flattened into a
+    bytes-string. This, together with the tensor shape, is sufficient for
+    identifying tensors for deduplication, but it differs from the
+    representation used for serializing tensors (that is string_data) by adding
+    padding bytes so that each string occupies the same number of consecutive
+    bytes in the flattened .tobytes representation.
+    """
+
     if val.dtype.is_string():
         return np.array(val.string_data()).tobytes()
     return val.tobytes()
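
The padding behaviour described in the docstring can be reproduced with NumPy alone. The following is a minimal sketch, not the code of the pass itself: `dedup_key` is a hypothetical helper introduced only for illustration, and plain lists of `bytes` stand in for what `string_data()` returns. It shows that a list is unhashable, that `np.array` on a list of `bytes` infers a fixed-width bytes dtype, and that the flattened result combined with the shape can serve as a dict key.

```python
import numpy as np


def dedup_key(strings, shape):
    """Hypothetical helper: build a hashable key from a list of bytes and a shape."""
    # np.array infers a fixed-width bytes dtype from the longest element;
    # shorter elements are padded with trailing NUL bytes.
    flat = np.array(strings).tobytes()
    return (tuple(shape), flat)


strings = [b"a", b"bcd"]

# A plain list of bytes cannot be used as a dict key.
try:
    hash(strings)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

key = dedup_key(strings, (2,))
seen = {key: "first_initializer"}

# An equal tensor (same strings, same shape) maps to the same key.
duplicate_key = dedup_key([b"a", b"bcd"], (2,))

print(list_is_hashable)
print(np.array(strings).dtype.itemsize)  # width in bytes of each padded string
print(key)
print(duplicate_key in seen)
```

Here `b"a"` is padded to the width of `b"bcd"`, so the flattened bytes are `b"a\x00\x00bcd"` rather than the concatenation `b"abcd"`. This is the difference from the serialized `string_data` representation that the docstring points out.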
