feat(query-engine): add hash functions fnv, murmur3, md5, sha1, sha512, xxh3, xxh128 #2887
SzymonIwaniuk wants to merge 9 commits into open-telemetry:main
Conversation
```toml
weaver_resolved_schema = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
weaver_resolver = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
weaver_semconv = { git = "https://github.com/open-telemetry/weaver.git", tag = "v0.21.2"}
sha1 = { version = "0.10", features = ["oid"] }
```
Can we avoid making sha1 a default dependency here, or gate just the sha1() function behind a feature? otap-df-query-engine is pulled into the normal df_engine build through core-nodes/transform-processor, so this makes SHA-1 part of the default engine dependency graph. Also, the workspace sha1 dependency enables oid, which does not appear to be used by this implementation.
I gated sha1 behind a sha1-hash Cargo feature. The module, UDF registration, parser, and tests are all behind #[cfg(feature = "sha1-hash")].
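A minimal sketch of what this gating could look like (feature and crate names are taken from the comment above; `optional = true` and the `dep:` syntax are standard Cargo feature mechanics, shown as an illustration rather than the PR's exact diff):

```toml
# crates/query-engine/Cargo.toml (sketch)
[features]
sha1-hash = ["dep:sha1"]

[dependencies]
sha1 = { version = "0.10", optional = true }
```

with every SHA-1 item on the Rust side guarded the same way:

```rust
#[cfg(feature = "sha1-hash")]
mod sha1;
```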
@SzymonIwaniuk @lalitb That could be done in a future PR but I’m slightly concerned by the growing number of SHA-1 related features. While the current usages are valid (e.g. WebSocket protocol compatibility or non-cryptographic hash functions in OPL), multiplying fine-grained features could make the build matrix and dependency story harder to reason about over time.
I’d recommend converging toward a single high-level compatibility feature (e.g. sha1-compat) controlling all SHA-1 usage globally, instead of component-specific flags. Internally, all SHA-1 usage should go through a shared utility module with explicit documentation clarifying that it is used for protocol compatibility/non-security purposes only.
Please open a GH issue to track this if not integrated in this PR. Thanks.
@lquerel @lalitb I agree with the consolidation of SHA-1 related feature flags. I think this would be better addressed in a follow-up issue rather than in this PR, as it involves changes across multiple components. Feel free to assign it to me once the issue is created after this PR. @lalitb I'd like to know what you think about this approach, would that work for you?
```rust
FORMAT_DATETIME_FUNC_NAME => Self::new(to_char(), ExprLogicalType::String, false, None),
RTRIM_FUNC_NAME => Self::new(rtrim(), ExprLogicalType::String, true, None),
SHA256_FUNC_NAME => Self::new(sha256(), ExprLogicalType::Binary, true, None),
MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::Binary, true, None),
```
Can we check this return type? DataFusion md5() already returns a hex string, not raw bytes like sha256(). So marking it as Binary may be wrong.
This is correct, we should have String here, but it turns out this actually returns a DataType::Utf8View:
https://github.com/apache/datafusion/blob/aca4d1377bf9785e6fa9a154c579cc761ca7b54b/datafusion/functions/src/crypto/md5.rs#L88-L90
We will actually fail if we try to assign this as is done in the test. I think this should actually be:
```diff
- MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::Binary, true, None),
+ MD5_FUNC_NAME => Self::new(md5(), ExprLogicalType::String, true, Some(DataType::Utf8)),
```
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main    #2887     +/-  ##
==========================================
- Coverage   86.15%   82.25%    -3.91%
==========================================
  Files         707      181      -526
  Lines      267469    52892  -214577
==========================================
- Hits       230440    43505  -186935
+ Misses      36505     8863   -27642
  Partials      524      524
```
```rust
let query = r#"logs | extend
    attributes["str_attr"] = encode(md5(attributes["str_attr"]), "hex"),
    attributes["binary_attr"] = encode(md5(attributes["binary_attr"]), "hex")
"#;
```
Related to the md5() return type: DataFusion md5() already returns a hex string, so should these be md5(attributes["..."]) directly instead of wrapping with encode(..., "hex")?
This is correct, but the change from this comment will also be needed for this test to pass.
Done, I changed the return type to ExprLogicalType::String with Some(DataType::Utf8), removed the encode() wrapper from the test, and updated the expected hash values accordingly.
```rust
mod substring;
mod uuidv7;
mod xxh128;
mod xxh3;
```
This line got removed in the merge. Can we add it back?
```diff
  mod xxh3;
+ mod uuidv7;
```
Sorry, I added uuidv7 back.
```rust
ScalarValue::Utf8(v) | ScalarValue::LargeUtf8(v) => {
    Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes())))
}
ScalarValue::Utf8View(v) => Ok(v.as_deref().map(|s| murmur3_32(s.as_bytes()))),
ScalarValue::Binary(v) | ScalarValue::LargeBinary(v) => {
    Ok(v.as_ref().map(|b| murmur3_32(b)))
}
```
Why do we support different types when handling a scalar versus handling an array?
For the array case, it looks like we only support Utf8, LargeUtf8 and Binary. I think these supported types should be consistent, unless there's a good reason for them not to be. Same comment for fnv, sha1 and xxh.
Good catch. I made the array path consistent with the scalar path by adding Utf8View and LargeBinary support to hash_array in all five UDF files.
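For context, the MurmurHash3 x86 32-bit routine these match arms feed into is small enough to sketch in plain Rust. This is a generic reference implementation of the public algorithm with the seed fixed to 0 (matching the single-argument `murmur3_32(bytes)` call shape quoted above), not necessarily byte-for-byte the PR's code:

```rust
// MurmurHash3 x86 32-bit over a byte slice, seed fixed to 0.
fn murmur3_32(data: &[u8]) -> u32 {
    const C1: u32 = 0xcc9e_2d51;
    const C2: u32 = 0x1b87_3593;
    let mut h1: u32 = 0; // seed

    // Body: process the input in 4-byte little-endian blocks.
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        let mut k1 = u32::from_le_bytes(chunk.try_into().unwrap());
        k1 = k1.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h1 ^= k1;
        h1 = h1.rotate_left(13).wrapping_mul(5).wrapping_add(0xe654_6b64);
    }

    // Tail: fold up to 3 remaining bytes in little-endian order.
    let tail = chunks.remainder();
    let mut k1: u32 = 0;
    for (i, &b) in tail.iter().enumerate() {
        k1 ^= (b as u32) << (8 * i);
    }
    if !tail.is_empty() {
        k1 = k1.wrapping_mul(C1).rotate_left(15).wrapping_mul(C2);
        h1 ^= k1;
    }

    // Finalization: mix in the length, then the fmix32 avalanche.
    h1 ^= data.len() as u32;
    h1 ^= h1 >> 16;
    h1 = h1.wrapping_mul(0x85eb_ca6b);
    h1 ^= h1 >> 13;
    h1 = h1.wrapping_mul(0xc2b2_ae35);
    h1 ^= h1 >> 16;
    h1
}

fn main() {
    println!("{:#010x}", murmur3_32(b"hello"));
    println!("{:#010x}", murmur3_32(b""));
}
```

Because the algorithm only ever sees `&[u8]`, supporting the extra Arrow types (Utf8View, LargeBinary) is purely a matter of widening the match arms that extract the bytes, as described above.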
Hey, all changes addressed.
Change Summary
Add seven hash functions to the OTAP query-engine, per #2834:
- md5(): MD5 digest, backed by DataFusion's built-in crypto::md5 UDF.
- sha1(): SHA-1 digest, implemented as a custom ScalarUDF using the sha1 crate (DataFusion has no SHA-1 equivalent).
- sha512(): SHA-512 digest, backed by DataFusion's built-in crypto::sha512 UDF.
- fnv(): FNV-1a 64-bit hash, implemented as a custom ScalarUDF.
- murmur3(): MurmurHash3 32-bit hash, implemented as a custom ScalarUDF.
- xxh3(): XXH3 64-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
- xxh128(): XXH3 128-bit hash, implemented as a custom ScalarUDF using xxhash-rust.
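Of the custom UDFs listed above, fnv() is the simplest; as a reference, the FNV-1a 64-bit core loop can be sketched in plain Rust (a generic illustration of the well-known algorithm, not the PR's exact implementation):

```rust
/// FNV-1a 64-bit: XOR each byte into the hash, then multiply by the FNV prime.
fn fnv1a_64(data: &[u8]) -> u64 {
    const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in data {
        hash ^= byte as u64; // XOR first: this is what makes it FNV-1a rather than FNV-1
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

fn main() {
    // Empty input hashes to the offset basis by definition.
    println!("{:#018x}", fnv1a_64(b""));
    println!("{:#018x}", fnv1a_64(b"hello"));
}
```

The `Int64` return type in the UDF implies the `u64` result is reinterpreted as a signed value on the Arrow side.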
Files

- consts.rs, parser.rs: register all 7 functions as external functions with param_placeholders(1).
- pipeline/expr.rs: import DataFusion's md5(), sha512() and the new custom UDFs, add them to DataFusionFunctionDef::from_func_name with requires_dict_downcast: true.
- pipeline/functions/fnv.rs (new): custom FnvHashFunc - FNV-1a 64-bit, returns Int64.
- pipeline/functions/murmur3.rs (new): custom Murmur3HashFunc - MurmurHash3 32-bit, returns Int64.
- pipeline/functions/sha1.rs (new): custom Sha1Func using the sha1 crate, returns Binary.
- pipeline/functions/xxh3.rs (new): custom Xxh3Func using xxhash-rust, returns Int64.
- pipeline/functions/xxh128.rs (new): custom Xxh128Func using xxhash-rust, returns Binary (16 bytes big-endian).
- pipeline/functions.rs: module declarations and make_udf_function! registrations.
- crates/query-engine/Cargo.toml: add sha1 and xxhash-rust workspace dependencies.

What issue does this PR close?
How are these changes tested?
Two layers of tests, all in this PR:
- Unit tests for the custom UDFs (fnv, murmur3, sha1, xxh3, xxh128): verify scalar string input, binary input, and null handling.
- pipeline::assign end-to-end tests for every function, exercising both OPL and KQL parsers through the full pipeline. Binary-returning functions (md5, sha1, sha512, xxh128) are wrapped with encode(..., "hex") and the output hex string is asserted. Integer-returning functions (fnv, murmur3, xxh3) assert the Int64 value directly.

cargo xtask quick-check passes clean.

Are there any user-facing changes?
Yes, users of the transform processor / query-engine can now call these hash functions in OPL/KQL programs.
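For example, a query could look like this (attribute names are hypothetical, following the pattern used in this PR's end-to-end tests; per the review discussion, md5() already yields a hex string, while binary-returning functions such as sha1() can be hex-encoded with encode):

```
logs | extend
    attributes["md5_hex"]  = md5(attributes["str_attr"]),
    attributes["sha1_hex"] = encode(sha1(attributes["str_attr"]), "hex"),
    attributes["fnv_hash"] = fnv(attributes["str_attr"])
```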