
feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128#2887

Open
SzymonIwaniuk wants to merge 9 commits into open-telemetry:main from SzymonIwaniuk:additional-hash-algorithms

Conversation


SzymonIwaniuk commented May 7, 2026

Change Summary

Add seven hash functions to the OTAP query-engine, per #2834:

  • md5(): MD5 digest, backed by DataFusion's built-in crypto::md5 UDF.
  • sha1(): SHA-1 digest, implemented as a custom ScalarUDF using the sha1 crate (DataFusion has no SHA-1 equivalent).
  • sha512(): SHA-512 digest, backed by DataFusion's built-in crypto::sha512 UDF.
  • fnv(): FNV-1a 64-bit hash, implemented as a custom ScalarUDF.
  • murmur3(): MurmurHash3 32-bit hash, implemented as a custom ScalarUDF.
  • xxh3(): XXH3 64-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
  • xxh128(): XXH3 128-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
Example OPL usage:

logs | extend attributes["hash"] = encode(md5(attributes["body"]), "hex")
logs | extend attributes["bucket"] = murmur3(attributes["service.name"])
logs | extend attributes["sig"] = xxh3(attributes["message"])
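The PR's fnv.rs itself is not shown in this thread; as a reference point, here is a minimal standalone sketch of the FNV-1a 64-bit algorithm the summary names. The `as i64` reinterpretation at the end is an assumption about how a u64 hash would surface as the Int64 return type, not something the PR confirms.

```rust
// FNV-1a 64-bit: the standard offset basis and prime from the FNV specification.
fn fnv1a_64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in data {
        hash ^= byte as u64;             // xor first (the "1a" variant)...
        hash = hash.wrapping_mul(PRIME); // ...then multiply
    }
    hash
}

fn main() {
    // Hashing empty input leaves the offset basis untouched.
    assert_eq!(fnv1a_64(b""), 0xcbf2_9ce4_8422_2325);
    // Assumed widening: a u64 hash reinterpreted as i64 to fit an Int64 column.
    let as_int64 = fnv1a_64(b"service-a") as i64;
    println!("{as_int64}");
}
```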

Files

  • consts.rs, parser.rs: register all 7 functions as external functions with param_placeholders(1).
  • pipeline/expr.rs: import DataFusion's md5(), sha512() and the new custom UDFs, add them to DataFusionFunctionDef::from_func_name with requires_dict_downcast: true.
  • pipeline/functions/fnv.rs (new): custom FnvHashFunc - FNV-1a 64-bit, returns Int64.
  • pipeline/functions/murmur3.rs (new): custom Murmur3HashFunc - MurmurHash3 32-bit, returns Int64.
  • pipeline/functions/sha1.rs (new): custom Sha1Func using the sha1 crate, returns Binary.
  • pipeline/functions/xxh3.rs (new): custom Xxh3Func using xxhash-rust, returns Int64.
  • pipeline/functions/xxh128.rs (new): custom Xxh128Func using xxhash-rust, returns Binary (16 bytes big-endian).
  • pipeline/functions.rs: module declarations and make_udf_function! registrations.
  • crates/query-engine/Cargo.toml: add sha1 and xxhash-rust workspace dependencies.
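For context on the murmur3.rs entry above (a 32-bit hash surfaced as Int64), here is a minimal standalone sketch of the MurmurHash3 x86 32-bit variant. This is a reference rendering of the public algorithm, not the PR's code; the lossless widening to i64 at the end is an assumption about how the Int64 return type would be filled.

```rust
// MurmurHash3, x86 32-bit variant.
fn murmur3_32(data: &[u8], seed: u32) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h = seed;
    // Body: mix each 4-byte little-endian block into the state.
    for chunk in data.chunks_exact(4) {
        let mut k = u32::from_le_bytes(chunk.try_into().unwrap());
        k = k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h = (h ^ k).rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }
    // Tail: fold in the remaining 1 to 3 bytes, if any.
    let tail = data.chunks_exact(4).remainder();
    if !tail.is_empty() {
        let mut k = 0u32;
        for (i, &b) in tail.iter().enumerate() {
            k |= (b as u32) << (8 * i);
        }
        k = k.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h ^= k;
    }
    // Finalizer (fmix32): xor in the length, then avalanche.
    h ^= data.len() as u32;
    h ^= h >> 16;
    h = h.wrapping_mul(0x85eb_ca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2_ae35);
    h ^= h >> 16;
    h
}

fn main() {
    // Empty input with seed 0 hashes to 0, a standard MurmurHash3 test vector.
    assert_eq!(murmur3_32(b"", 0), 0);
    // Assumed widening of the 32-bit result into the Int64 return type.
    let as_int64 = murmur3_32(b"checkout-service", 0) as i64;
    println!("{as_int64}");
}
```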

What issue does this PR close?

How are these changes tested?

Two layers of tests, all in this PR:

  • Unit tests in each custom UDF module (fnv, murmur3, sha1, xxh3, xxh128): verify scalar string input, binary input, and null handling.
  • pipeline::assign end-to-end tests for every function, exercising both the OPL and KQL parsers through the full pipeline. Binary-returning functions (md5, sha1, sha512, xxh128) are wrapped with encode(..., "hex") and the output hex string is asserted. Integer-returning functions (fnv, murmur3, xxh3) assert the Int64 value directly.

cargo xtask quick-check passes clean.
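As background for the encode(..., "hex") wrapping the end-to-end tests use, a minimal stand-in for the hex-encoding step; the helper and its name are illustrative, not the engine's code:

```rust
// Turn a binary digest into the lowercase hex string the test assertions
// compare against, mirroring what encode(..., "hex") produces.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    assert_eq!(to_hex(&[0xde, 0xad, 0xbe, 0xef]), "deadbeef");
    // A 16-byte digest (e.g. an md5 or xxh128 output) becomes 32 hex characters.
    assert_eq!(to_hex(&[0u8; 16]).len(), 32);
    println!("{}", to_hex(&[0xde, 0xad, 0xbe, 0xef]));
}
```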

Are there any user-facing changes?

Yes. Users of the transform processor / query-engine can now call these hash functions in OPL/KQL programs:

logs | extend attributes["id"] = encode(sha1(attributes["body"]), "hex")
logs | extend attributes["bucket"] = fnv(attributes["service.name"])
logs | extend attributes["sig"] = xxh3(attributes["message"])
logs | extend attributes["hash"] = encode(xxh128(attributes["message"]), "hex")

SzymonIwaniuk requested a review from a team as a code owner May 7, 2026 08:40
github-actions bot added labels: rust (Pull requests that update Rust code), query-engine (Query Engine / Transform related tasks), query-engine-columnar (Columnar query engine which uses DataFusion to process OTAP Batches) May 7, 2026
SzymonIwaniuk changed the title from "Additional hash algorithms" to "feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128" May 7, 2026
Comment thread on rust/otap-dataflow/Cargo.toml (Outdated)
weaver_resolved_schema = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
weaver_resolver = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
weaver_semconv = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
sha1 = { version = "0.10", features = ["oid"] }
Member

Can we avoid making sha1 a default dependency here, or gate just the sha1() function behind a feature? otap-df-query-engine is pulled into the normal df_engine build through core-nodes/transform-processor, so this makes SHA-1 part of the default engine dependency graph. Also, the workspace sha1 dependency enables oid, which does not look used by this implementation.

Member

for reference, the concerns on sha-1 - #2827

Author

I gated sha1 behind a sha1-hash cargo feature. The module, UDF registration, parser, and tests are all behind #[cfg(feature = "sha1-hash")].

Contributor

lquerel commented May 9, 2026

@SzymonIwaniuk @lalitb That could be done in a future PR but I’m slightly concerned by the growing number of SHA-1 related features. While the current usages are valid (e.g. WebSocket protocol compatibility or non-cryptographic hash functions in OPL), multiplying fine-grained features could make the build matrix and dependency story harder to reason about over time.

I’d recommend converging toward a single high-level compatibility feature (e.g. sha1-compat) controlling all SHA-1 usage globally, instead of component-specific flags. Internally, all SHA-1 usage should go through a shared utility module with explicit documentation clarifying that it is used for protocol compatibility/non-security purposes only.

Please open a GH issue to track this if not integrated in this PR. Thanks.

Author

SzymonIwaniuk commented May 9, 2026

@lquerel @lalitb I agree with the consolidation of SHA-1 related feature flags. I think this would be better addressed in a follow-up issue rather than in this PR, as it involves changes across multiple components. Feel free to assign it to me once the issue is created after this PR. @lalitb I'd like to know what you think about this approach, would that work for you?

FORMAT_DATETIME_FUNC_NAME => Self::new(to_char(), ExprLogicalType::String, false, None),
RTRIM_FUNC_NAME => Self::new(rtrim(), ExprLogicalType::String, true, None),
SHA256_FUNC_NAME => Self::new(sha256(), ExprLogicalType::Binary, true, None),
MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::Binary, true, None),
Member

Can we check this return type? DataFusion md5() already returns a hex string, not raw bytes like sha256(). So marking it as Binary may be wrong.

Member

This is correct; we should have String here, but it turns out this actually returns a DataType::Utf8View:
https://github.com/apache/datafusion/blob/aca4d1377bf9785e6fa9a154c579cc761ca7b54b/datafusion/functions/src/crypto/md5.rs#L88-L90

We will actually fail if we try to assign this as is done in the test:

_ => Err(Error::UnexpectedRecordBatchState {
reason: format!("unexpected attribute value type {:?}", logical_type),
}),

I think this should actually be:

Suggested change
MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::Binary, true, None),
MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::String, true, Some(DataType::Utf8)),

Author

SzymonIwaniuk commented May 9, 2026

Fixed.


codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.25%. Comparing base (fbf013e) to head (99b1b1a).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2887       +/-   ##
===========================================
- Coverage   86.15%   82.25%    -3.91%     
===========================================
  Files         707      181      -526     
  Lines      267469    52892   -214577     
===========================================
- Hits       230440    43505   -186935     
+ Misses      36505     8863    -27642     
  Partials      524      524               
Components Coverage Δ
otap-dataflow ∅ <ø> (∅)
query_abstraction 80.61% <ø> (ø)
query_engine 90.73% <ø> (ø)
otel-arrow-go 52.45% <ø> (ø)
quiver ∅ <ø> (∅)

let query = r#"logs | extend
attributes["str_attr"] = encode(md5(attributes["str_attr"]), "hex"),
attributes["binary_attr"] = encode(md5(attributes["binary_attr"]), "hex")
"#;
Member

Related to the md5() return type: DataFusion md5() already returns a hex string, so should these be md5(attributes["..."]) directly instead of wrapping with encode(..., "hex")?

Member

This is correct, but this test will need the change from this comment in order to pass.

Author

Done. I changed the return type to ExprLogicalType::String with Some(DataType::Utf8), removed the encode() wrapper from the test, and updated the expected hash values accordingly.

mod substring;
mod uuidv7;
mod xxh128;
mod xxh3;
Member

This line got removed in the merge. Can we add it back?

Suggested change
mod xxh3;
mod xxh3;
mod uuidv7;

Author

Sorry, I added uuidv7 back.

Comment on lines +130 to +136
ScalarValue::Utf8(v) | ScalarValue::LargeUtf8(v) => {
Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes())))
}
ScalarValue::Utf8View(v) => Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes()))),
ScalarValue::Binary(v) | ScalarValue::LargeBinary(v) => {
Ok(v.as_ref().map(|b| murmur3_32(b)))
}
Member

albertlockett commented May 7, 2026

Why do we support different types when handling a scalar versus handling an Array?

For the Array case, it looks like we only support Utf8, LargeUtf8 and Binary. I think these supported types should be consistent, unless there's a good reason for them not to be? Same comment for fnv, sha1 and xxh.

Author

Good catch, made the array path consistent with the scalar path by adding Utf8View and LargeBinary support to hash_array in all five udf files.
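The reviewer's point above is that the scalar and array paths should accept the same set of input types. A minimal sketch of one way to keep them from drifting, using a hypothetical `Input` enum as a stand-in for DataFusion's ScalarValue variants (the real UDF matches on ScalarValue::Utf8 / LargeUtf8 / Utf8View / Binary / LargeBinary):

```rust
// Hypothetical stand-in for the ScalarValue variants the UDF accepts.
#[allow(dead_code)]
enum Input {
    Utf8(Option<String>),
    LargeUtf8(Option<String>),
    Utf8View(Option<String>),
    Binary(Option<Vec<u8>>),
    LargeBinary(Option<Vec<u8>>),
}

// One helper that both the scalar path and the per-element array path call,
// so the supported type set cannot differ between them. Nulls map to None.
fn hash_input(input: &Input, hash: impl Fn(&[u8]) -> u32) -> Option<u32> {
    match input {
        Input::Utf8(v) | Input::LargeUtf8(v) | Input::Utf8View(v) => {
            v.as_deref().map(|s| hash(s.as_bytes()))
        }
        Input::Binary(v) | Input::LargeBinary(v) => v.as_deref().map(|b| hash(b)),
    }
}

fn main() {
    // A placeholder hash; the real UDFs plug in murmur3, fnv, etc.
    let h = |b: &[u8]| {
        b.iter()
            .fold(0u32, |a, &x| a.wrapping_mul(31).wrapping_add(x as u32))
    };
    // The same bytes hash identically whether they arrive as string or binary.
    assert_eq!(
        hash_input(&Input::Utf8(Some("abc".into())), h),
        hash_input(&Input::Binary(Some(b"abc".to_vec())), h),
    );
    // Null inputs propagate as null rather than erroring.
    assert_eq!(hash_input(&Input::Utf8View(None), h), None);
    println!("consistent");
}
```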

SzymonIwaniuk (Author)

Hey, all changes are addressed, and the cargo test -p otap-df-query-engine unit tests pass locally.

test result: ok. 623 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.14s

   Doc-tests otap_df_query_engine

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


Labels

query-engine (Query Engine / Transform related tasks), query-engine-columnar (Columnar query engine which uses DataFusion to process OTAP Batches), rust (Pull requests that update Rust code)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants