feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128 #2887
SzymonIwaniuk wants to merge 9 commits into open-telemetry:main
Conversation
```toml
weaver_resolved_schema = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
weaver_resolver = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
weaver_semconv = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
sha1 = { version = "0.10", features = ["oid"] }
```
Can we avoid making sha1 a default dependency here, or gate just the sha1() function behind a feature? otap-df-query-engine is pulled into the normal df_engine build through core-nodes/transform-processor, so this makes SHA-1 part of the default engine dependency graph. Also, the workspace sha1 dependency enables oid, which does not appear to be used by this implementation.
I gated sha1 behind a sha1-hash Cargo feature. The module, UDF registration, parser, and tests are all behind #[cfg(feature = "sha1-hash")].
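A minimal sketch of what this gating could look like (feature and crate names are taken from the comment above; `optional = true` and the `dep:` syntax are standard Cargo feature mechanics, shown as an illustration rather than the PR's exact diff):

```toml
# crates/query-engine/Cargo.toml (sketch)
[features]
sha1-hash = ["dep:sha1"]

[dependencies]
sha1 = { version = "0.10", optional = true }
```

with every SHA-1 item on the Rust side guarded the same way:

```rust
#[cfg(feature = "sha1-hash")]
mod sha1;
```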
@SzymonIwaniuk @lalitb That could be done in a future PR but I’m slightly concerned by the growing number of SHA-1 related features. While the current usages are valid (e.g. WebSocket protocol compatibility or non-cryptographic hash functions in OPL), multiplying fine-grained features could make the build matrix and dependency story harder to reason about over time.
I’d recommend converging toward a single high-level compatibility feature (e.g. sha1-compat) controlling all SHA-1 usage globally, instead of component-specific flags. Internally, all SHA-1 usage should go through a shared utility module with explicit documentation clarifying that it is used for protocol compatibility/non-security purposes only.
Please open a GH issue to track this if not integrated in this PR. Thanks.
@lquerel @lalitb I agree with the consolidation of SHA-1 related feature flags. I think this would be better addressed in a follow-up issue rather than in this PR, as it involves changes across multiple components. Feel free to assign it to me once the issue is created after this PR. @lalitb I'd like to know what you think about this approach, would that work for you?
```rust
FORMAT_DATETIME_FUNC_NAME => Self::new(to_char(), ExprLogicalType::String, false, None),
RTRIM_FUNC_NAME => Self::new(rtrim(), ExprLogicalType::String, true, None),
SHA256_FUNC_NAME => Self::new(sha256(), ExprLogicalType::Binary, true, None),
MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::Binary, true, None),
```
Can we check this return type? DataFusion md5() already returns a hex string, not raw bytes like sha256(). So marking it as Binary may be wrong.
This is correct, we should have String here, but it turns out this actually returns a DataType::Utf8View:
https://github.com/apache/datafusion/blob/aca4d1377bf9785e6fa9a154c579cc761ca7b54b/datafusion/functions/src/crypto/md5.rs#L88-L90
We will actually fail if we try to assign this as is done in the test. I think this should actually be:
```diff
- MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::Binary, true, None),
+ MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::String, true, Some(DataType::Utf8)),
```
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #2887     +/-  ##
==========================================
- Coverage   86.15%   82.25%    -3.91%
==========================================
  Files         707      181      -526
  Lines      267469    52892  -214577
==========================================
- Hits       230440    43505  -186935
+ Misses      36505     8863   -27642
  Partials      524      524
```
```rust
let query = r#"logs | extend
    attributes["str_attr"] = encode(md5(attributes["str_attr"]), "hex"),
    attributes["binary_attr"] = encode(md5(attributes["binary_attr"]), "hex")
"#;
```
Related to the md5() return type: DataFusion md5() already returns a hex string, so should these be md5(attributes["..."]) directly instead of wrapping with encode(..., "hex")?
This is correct, but the change from this comment will also be needed for this test to pass.
Done, I changed the return type to ExprLogicalType::String with Some(DataType::Utf8), removed the encode() wrapper from the test, and updated the expected hash values accordingly.
```rust
mod substring;
mod uuidv7;
mod xxh128;
mod xxh3;
```
This line got removed in the merge. Can we add it back?
```diff
  mod xxh3;
+ mod uuidv7;
```
Sorry, I added uuidv7 back.
```rust
ScalarValue::Utf8(v) | ScalarValue::LargeUtf8(v) => {
    Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes())))
}
ScalarValue::Utf8View(v) => Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes()))),
ScalarValue::Binary(v) | ScalarValue::LargeBinary(v) => {
    Ok(v.as_ref().map(|b| murmur3_32(b)))
}
```
Why do we support different types when handling a scalar versus handling an array?
For the array case, it looks like we only support Utf8, LargeUtf8 and Binary. I think these supported types should be consistent, unless there's a good reason for them not to be. Same comment for fnv, sha1 and xxh.
Good catch. I made the array path consistent with the scalar path by adding Utf8View and LargeBinary support to hash_array in all five UDF files.
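For context, the MurmurHash3 x86 32-bit routine these match arms feed into is small enough to sketch in plain Rust. This is a generic reference implementation of the public algorithm with the seed fixed to 0 (matching the single-argument `murmur3_32(bytes)` call shape quoted above), not necessarily byte-for-byte the PR's code:

```rust
// MurmurHash3 x86 32-bit over a byte slice, seed fixed to 0.
fn murmur3_32(data: &[u8]) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h1: u32 = 0; // seed

    // Body: process the input in 4-byte little-endian blocks.
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        let mut k1 = u32::from_le_bytes(chunk.try_into().unwrap());
        k1 = k1.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h1 ^= k1;
        h1 = h1.rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }

    // Tail: fold up to 3 remaining bytes in little-endian order.
    let tail = chunks.remainder();
    let mut k1: u32 = 0;
    for (i, &b) in tail.iter().enumerate() {
        k1 ^= (b as u32) << (8 * i);
    }
    if !tail.is_empty() {
        k1 = k1.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h1 ^= k1;
    }

    // Finalization: mix in the length, then the fmix32 avalanche.
    h1 ^= data.len() as u32;
    h1 ^= h1 >> 16;
    h1 = h1.wrapping_mul(0x85eb_ca6b);
    h1 ^= h1 >> 13;
    h1 = h1.wrapping_mul(0xc2b2_ae35);
    h1 ^= h1 >> 16;
    h1
}

fn main() {
    println!("{:#010x}", murmur3_32(b"hello"));
    println!("{:#010x}", murmur3_32(b""));
}
```

Because the algorithm only ever sees `&[u8]`, supporting the extra Arrow types (Utf8View, LargeBinary) is purely a matter of widening the match arms that extract the bytes, as described above.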
Hey, all changes addressed.
Change Summary
Add seven hash functions to the OTAP query-engine, per #2834:
- md5(): MD5 digest, backed by DataFusion's built-in crypto::md5 UDF.
- sha1(): SHA-1 digest, implemented as a custom ScalarUDF using the sha1 crate (DataFusion has no SHA-1 equivalent).
- sha512(): SHA-512 digest, backed by DataFusion's built-in crypto::sha512 UDF.
- fnv(): FNV-1a 64-bit hash, implemented as a custom ScalarUDF.
- murmur3(): MurmurHash3 32-bit hash, implemented as a custom ScalarUDF.
- xxh3(): XXH3 64-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
- xxh128(): XXH3 128-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
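Of the custom UDFs listed above, fnv() is the simplest; as a reference, the FNV-1a 64-bit core loop can be sketched in plain Rust (a generic illustration of the well-known algorithm, not the PR's exact implementation):

```rust
/// FNV-1a 64-bit: XOR each byte into the hash, then multiply by the FNV prime.
fn fnv1a_64(data: &[u8]) -> u64 {
    const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in data {
        hash ^= byte as u64; // XOR first: this is what makes it FNV-1a rather than FNV-1
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn main() {
    // Empty input hashes to the offset basis by definition.
    println!("{:#018x}", fnv1a_64(b""));
    println!("{:#018x}", fnv1a_64(b"hello"));
}
```

The `Int64` return type in the UDF implies the `u64` result is reinterpreted as a signed value on the Arrow side.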
Files

- consts.rs, parser.rs: register all 7 functions as external functions with param_placeholders(1).
- pipeline/expr.rs: import DataFusion's md5(), sha512() and the new custom UDFs, add them to DataFusionFunctionDef::from_func_name with requires_dict_downcast: true.
- pipeline/functions/fnv.rs (new): custom FnvHashFunc - FNV-1a 64-bit, returns Int64.
- pipeline/functions/murmur3.rs (new): custom Murmur3HashFunc - MurmurHash3 32-bit, returns Int64.
- pipeline/functions/sha1.rs (new): custom Sha1Func using the sha1 crate, returns Binary.
- pipeline/functions/xxh3.rs (new): custom Xxh3Func using xxhash-rust, returns Int64.
- pipeline/functions/xxh128.rs (new): custom Xxh128Func using xxhash-rust, returns Binary (16 bytes big-endian).
- pipeline/functions.rs: module declarations and make_udf_function! registrations.
- crates/query-engine/Cargo.toml: add sha1 and xxhash-rust workspace dependencies.

What issue does this PR close?
How are these changes tested?
Two layers of tests, all in this PR:
- Unit tests for the custom UDFs (fnv, murmur3, sha1, xxh3, xxh128): verify scalar string input, binary input, and null handling.
- pipeline::assign end-to-end tests for every function, exercising both OPL and KQL parsers through the full pipeline. Binary-returning functions (md5, sha1, sha512, xxh128) are wrapped with encode(..., "hex") and the output hex string is asserted. Integer-returning functions (fnv, murmur3, xxh3) assert the Int64 value directly.

cargo xtask quick-check passes clean.

Are there any user-facing changes?
Yes, users of the transform processor / query-engine can now call these hash functions in OPL/KQL programs.
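For example, a query could look like this (attribute names are hypothetical, following the pattern used in this PR's end-to-end tests; per the review discussion, md5() already yields a hex string, while binary-returning functions such as sha1() can be hex-encoded with encode):

```
logs | extend
    attributes["md5_hex"]  = md5(attributes["str_attr"]),
    attributes["sha1_hex"] = encode(sha1(attributes["str_attr"]), "hex"),
    attributes["fnv_hash"] = fnv(attributes["str_attr"])
```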