refactor: heuristic int delta_binary_packed encoding rule #19144
Draft: dantengsky wants to merge 16 commits into databendlabs:main from dantengsky:refactor/enhence-fuse-parquet-encoding
Commits
4a1e675 refactor: heuristics int delta_binary_packed encoding rule (dantengsky)
4703f32 split mods (dantengsky)
b54975c tweak logic tests (dantengsky)
048f697 refactor (dantengsky)
0c5b558 tweak logic test (dantengsky)
2b4514d tweak logic test (dantengsky)
a8753b6 refactor (dantengsky)
4348478 refine unit tests (dantengsky)
67ca2ef refine ut (dantengsky)
899b8b7 refactor: remove unnecessary traits (dantengsky)
ee18eb8 fix: bug introduced by refactoring (dantengsky)
31febad refactor (dantengsky)
e950b86 cleanup (dantengsky)
548be79 fix: skip calculating ordering stats if possible (dantengsky)
8848452 fmt (dantengsky)
12b67a0 fix: skip collecting bloom-filter-based NDV and column stats when enc… (dantengsky)
110 changes: 110 additions & 0 deletions
src/query/storages/common/blocks/src/encoding_rules/delta_binary_packed.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| // Copyright 2021 Datafuse Labs | ||
| // | ||
| // Licensed under the Apache License, Version 2.0 (the "License"); | ||
| // you may not use this file except in compliance with the License. | ||
| // You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, software | ||
| // distributed under the License is distributed on an "AS IS" BASIS, | ||
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
| | ||
| use databend_common_expression::Scalar; | ||
| use databend_common_expression::TableDataType; | ||
| use databend_common_expression::TableSchema; | ||
| use databend_common_expression::types::NumberDataType; | ||
| use databend_common_expression::types::number::NumberScalar; | ||
| use databend_storages_common_table_meta::meta::ColumnStatistics; | ||
| use parquet::basic::Encoding; | ||
| use parquet::file::properties::WriterPropertiesBuilder; | ||
| | ||
| use crate::encoding_rules::ColumnPathsCache; | ||
| use crate::encoding_rules::EncodingStatsProvider; | ||
| | ||
| // NDV must be close to row count (~90%+). Empirical value based on experiments and operational experience. | ||
| const DELTA_HIGH_CARDINALITY_RATIO: f64 = 0.9; | ||
| // Span (max - min + 1) should be close to NDV. Empirical value based on experiments and operational experience. | ||
| const DELTA_RANGE_TOLERANCE: f64 = 1.05; | ||
| | ||
| pub fn apply_delta_binary_packed_heuristic( | ||
| mut builder: WriterPropertiesBuilder, | ||
| metrics: &dyn EncodingStatsProvider, | ||
| table_schema: &TableSchema, | ||
| num_rows: usize, | ||
| column_paths_cache: &mut ColumnPathsCache, | ||
| ) -> WriterPropertiesBuilder { | ||
| for field in table_schema.leaf_fields() { | ||
| // Restrict the DBP heuristic to native INT32/UINT32 columns for now. | ||
| // INT64 columns with high zero bits already compress well with PLAIN+Zstd, and other | ||
| // widths need more validation before enabling DBP. | ||
| if !matches!( | ||
| field.data_type().remove_nullable(), | ||
| TableDataType::Number(NumberDataType::Int32) | ||
| | TableDataType::Number(NumberDataType::UInt32) | ||
| ) { | ||
| continue; | ||
| } | ||
| let column_id = field.column_id(); | ||
| let Some(stats) = metrics.column_stats(&column_id) else { | ||
| continue; | ||
| }; | ||
| let Some(ndv) = metrics.column_ndv(&column_id) else { | ||
| continue; | ||
| }; | ||
| if should_apply_delta_binary_packed(stats, ndv, num_rows) { | ||
| let column_paths = column_paths_cache.get_or_build(table_schema); | ||
| if let Some(path) = column_paths.get(&column_id) { | ||
| builder = builder | ||
| .set_column_dictionary_enabled(path.clone(), false) | ||
| .set_column_encoding(path.clone(), Encoding::DELTA_BINARY_PACKED); | ||
| } | ||
| } | ||
| } | ||
| builder | ||
| } | ||
| | ||
| /// Evaluate whether Delta Binary Packed (DBP) is worth enabling for a 32-bit integer column. | ||
| /// | ||
| /// The DBP heuristic rule is intentionally conservative: | ||
| /// - DBP is only considered when the block looks like a contiguous INT32/UINT32 range (no NULLs). | ||
| /// - NDV must be very close to the row count (`DELTA_HIGH_CARDINALITY_RATIO`). | ||
| /// - The `[min, max]` span should be close to NDV (`DELTA_RANGE_TOLERANCE`). | ||
| /// Experiments show that such blocks shrink dramatically after DBP + compression while decode CPU | ||
| /// remains affordable, yielding the best IO + CPU trade-off. | ||
| fn should_apply_delta_binary_packed(stats: &ColumnStatistics, ndv: u64, num_rows: usize) -> bool { | ||
| // Nulls weaken the contiguous-range signal, so we avoid the heuristic when they exist. | ||
| if num_rows == 0 || ndv == 0 || stats.null_count > 0 { | ||
| return false; | ||
| } | ||
| let Some(min) = scalar_to_i64(&stats.min) else { | ||
| return false; | ||
| }; | ||
| let Some(max) = scalar_to_i64(&stats.max) else { | ||
| return false; | ||
| }; | ||
| // Degenerate spans (single value) already compress well without DBP. | ||
| if max <= min { | ||
| return false; | ||
| } | ||
| // Use ratio-based heuristics instead of absolute NDV threshold to decouple from block size. | ||
| let ndv_ratio = ndv as f64 / num_rows as f64; | ||
| if ndv_ratio < DELTA_HIGH_CARDINALITY_RATIO { | ||
| return false; | ||
| } | ||
| let span = (max - min + 1) as f64; | ||
| let contiguous_ratio = span / ndv as f64; | ||
| contiguous_ratio <= DELTA_RANGE_TOLERANCE | ||
| } | ||
| | ||
| fn scalar_to_i64(val: &Scalar) -> Option<i64> { | ||
| // Only 32-bit integers reach the delta heuristic (see matches! check above), | ||
| // so we deliberately reject other widths to avoid misinterpreting large values. | ||
| match val { | ||
| Scalar::Number(NumberScalar::Int32(v)) => Some(*v as i64), | ||
| Scalar::Number(NumberScalar::UInt32(v)) => Some(*v as i64), | ||
| _ => None, | ||
| } | ||
| } | ||
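
To make the two thresholds concrete, here is a minimal standalone sketch of the same decision on plain integers. The function name `dbp_candidate` and its signature are hypothetical: it takes `min`, `max`, and `null_count` directly instead of reading them from `ColumnStatistics`, and it is not part of this PR. The constants are copied from the diff above.

```rust
// Thresholds copied from the diff above.
const DELTA_HIGH_CARDINALITY_RATIO: f64 = 0.9;
const DELTA_RANGE_TOLERANCE: f64 = 1.05;

/// Hypothetical stand-in for `should_apply_delta_binary_packed`.
/// `min` and `max` are assumed to be 32-bit values already widened to `i64`,
/// so `max - min + 1` cannot overflow.
fn dbp_candidate(min: i64, max: i64, null_count: u64, ndv: u64, num_rows: usize) -> bool {
    // Empty blocks, missing NDV, or any NULL disqualify the column.
    if num_rows == 0 || ndv == 0 || null_count > 0 {
        return false;
    }
    // A single-value (or inverted) range is rejected up front.
    if max <= min {
        return false;
    }
    // Nearly every row must carry a distinct value.
    let ndv_ratio = ndv as f64 / num_rows as f64;
    if ndv_ratio < DELTA_HIGH_CARDINALITY_RATIO {
        return false;
    }
    // The value range must be almost fully covered by the distinct values.
    let span = (max - min + 1) as f64;
    span / ndv as f64 <= DELTA_RANGE_TOLERANCE
}
```

For example, a 1000-row block holding exactly the values 1 through 1000 passes (NDV ratio 1.0, span/NDV 1.0), while 1000 distinct values spread over 1 through 1,000,000 fail the span check (span/NDV = 1000), and 100 distinct values over 1000 rows fail the cardinality check (NDV ratio 0.1).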
43 changes: 43 additions & 0 deletions
src/query/storages/common/blocks/src/encoding_rules/dictionary.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,43 @@ | ||
| // Copyright 2021 Datafuse Labs | ||
| // | ||
| // Licensed under the Apache License, Version 2.0 (the "License"); | ||
| // you may not use this file except in compliance with the License. | ||
| // You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, software | ||
| // distributed under the License is distributed on an "AS IS" BASIS, | ||
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
| | ||
| use databend_common_expression::TableSchema; | ||
| use parquet::file::properties::WriterPropertiesBuilder; | ||
| | ||
| use crate::encoding_rules::ColumnPathsCache; | ||
| use crate::encoding_rules::EncodingStatsProvider; | ||
| | ||
| /// Disable dictionary encoding once the NDV-to-row ratio is greater than this threshold. | ||
| const HIGH_CARDINALITY_RATIO_THRESHOLD: f64 = 0.1; | ||
| | ||
| pub fn apply_dictionary_high_cardinality_heuristic( | ||
| mut builder: WriterPropertiesBuilder, | ||
| metrics: &dyn EncodingStatsProvider, | ||
| table_schema: &TableSchema, | ||
| num_rows: usize, | ||
| column_paths_cache: &mut ColumnPathsCache, | ||
| ) -> WriterPropertiesBuilder { | ||
| if num_rows == 0 { | ||
| return builder; | ||
| } | ||
| let column_paths = column_paths_cache.get_or_build(table_schema); | ||
| for (column_id, column_path) in column_paths.iter() { | ||
| if let Some(ndv) = metrics.column_ndv(column_id) { | ||
| if (ndv as f64 / num_rows as f64) > HIGH_CARDINALITY_RATIO_THRESHOLD { | ||
| builder = builder.set_column_dictionary_enabled(column_path.clone(), false); | ||
| } | ||
| } | ||
| } | ||
| builder | ||
| } |
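
The per-column check in the loop above reduces to a single ratio comparison. A minimal sketch, assuming a hypothetical helper named `dictionary_disabled` that is not part of this PR (the constant is copied from the diff):

```rust
// Threshold copied from the diff above.
const HIGH_CARDINALITY_RATIO_THRESHOLD: f64 = 0.1;

/// Hypothetical helper: returns true when the rule would turn dictionary
/// encoding off for a column with `ndv` distinct values in `num_rows` rows.
fn dictionary_disabled(ndv: u64, num_rows: usize) -> bool {
    // Mirrors the early return for empty blocks: nothing is changed.
    if num_rows == 0 {
        return false;
    }
    // Strict comparison, as in the diff: the ratio must exceed the threshold.
    (ndv as f64 / num_rows as f64) > HIGH_CARDINALITY_RATIO_THRESHOLD
}
```

With 1000 rows, 50 distinct values (ratio 0.05) keep the dictionary, while 500 distinct values (ratio 0.5) disable it. Any column that satisfies the DBP rule's 0.9 NDV ratio also exceeds this 0.1 threshold, so both rules would disable the dictionary for such a column.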
80 changes: 80 additions & 0 deletions
src/query/storages/common/blocks/src/encoding_rules/mod.rs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| // Copyright 2021 Datafuse Labs | ||
| // | ||
| // Licensed under the Apache License, Version 2.0 (the "License"); | ||
| // you may not use this file except in compliance with the License. | ||
| // You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, software | ||
| // distributed under the License is distributed on an "AS IS" BASIS, | ||
| // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
| | ||
| use std::collections::HashMap; | ||
| | ||
| use databend_common_expression::ColumnId; | ||
| use databend_common_expression::TableSchema; | ||
| use databend_common_expression::converts::arrow::table_schema_arrow_leaf_paths; | ||
| use databend_storages_common_table_meta::meta::ColumnStatistics; | ||
| use databend_storages_common_table_meta::meta::StatisticsOfColumns; | ||
| use parquet::schema::types::ColumnPath; | ||
| | ||
| pub mod delta_binary_packed; | ||
| pub mod page_limit; | ||
| | ||
| pub mod dictionary; | ||
| | ||
| pub struct ColumnPathsCache { | ||
| cache: Option<HashMap<ColumnId, ColumnPath>>, | ||
| } | ||
| | ||
| impl ColumnPathsCache { | ||
| pub fn new() -> Self { | ||
| Self { cache: None } | ||
| } | ||
| | ||
| pub fn get_or_build(&mut self, table_schema: &TableSchema) -> &HashMap<ColumnId, ColumnPath> { | ||
| if self.cache.is_none() { | ||
| self.cache = Some( | ||
| table_schema_arrow_leaf_paths(table_schema) | ||
| .into_iter() | ||
| .map(|(id, path)| (id, ColumnPath::from(path))) | ||
| .collect(), | ||
| ); | ||
| } | ||
| self.cache.as_ref().unwrap() | ||
| } | ||
| } | ||
| | ||
| /// Provides per-column NDV statistics. | ||
| pub trait NdvProvider { | ||
| fn column_ndv(&self, column_id: &ColumnId) -> Option<u64>; | ||
| } | ||
| | ||
| impl NdvProvider for &StatisticsOfColumns { | ||
| fn column_ndv(&self, column_id: &ColumnId) -> Option<u64> { | ||
| self.get(column_id).and_then(|item| item.distinct_of_values) | ||
| } | ||
| } | ||
| | ||
| pub trait EncodingStatsProvider: NdvProvider { | ||
| fn column_stats(&self, column_id: &ColumnId) -> Option<&ColumnStatistics>; | ||
| } | ||
| | ||
| pub struct ColumnStatsView<'a>(pub &'a StatisticsOfColumns); | ||
| | ||
| impl<'a> NdvProvider for ColumnStatsView<'a> { | ||
| fn column_ndv(&self, column_id: &ColumnId) -> Option<u64> { | ||
| self.0 | ||
| .get(column_id) | ||
| .and_then(|item| item.distinct_of_values) | ||
| } | ||
| } | ||
| | ||
| impl<'a> EncodingStatsProvider for ColumnStatsView<'a> { | ||
| fn column_stats(&self, column_id: &ColumnId) -> Option<&ColumnStatistics> { | ||
| self.0.get(column_id) | ||
| } | ||
| } |
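
The trait layering above (`EncodingStatsProvider` as a subtrait of `NdvProvider`) lets a rule take a single `&dyn EncodingStatsProvider` and still call `column_ndv`. The sketch below reproduces that shape with simplified stand-ins so it compiles on its own. These are assumptions for illustration only: `ColumnId` is taken to be `u32`, `ColumnStatistics` is reduced to two fields, a plain `HashMap` replaces `StatisticsOfColumns`, and `passes_precheck` is a hypothetical consumer that is not part of this PR.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real Databend types (assumptions).
type ColumnId = u32;

struct ColumnStatistics {
    null_count: u64,
    distinct_of_values: Option<u64>,
}

trait NdvProvider {
    fn column_ndv(&self, column_id: &ColumnId) -> Option<u64>;
}

// Subtrait: anything exposing full column stats must also expose NDV.
trait EncodingStatsProvider: NdvProvider {
    fn column_stats(&self, column_id: &ColumnId) -> Option<&ColumnStatistics>;
}

// Borrowing view over a stats map, shaped like `ColumnStatsView` in the diff.
struct ColumnStatsView<'a>(&'a HashMap<ColumnId, ColumnStatistics>);

impl<'a> NdvProvider for ColumnStatsView<'a> {
    fn column_ndv(&self, column_id: &ColumnId) -> Option<u64> {
        self.0.get(column_id).and_then(|item| item.distinct_of_values)
    }
}

impl<'a> EncodingStatsProvider for ColumnStatsView<'a> {
    fn column_stats(&self, column_id: &ColumnId) -> Option<&ColumnStatistics> {
        self.0.get(column_id)
    }
}

/// Hypothetical consumer: true when a column has stats, an NDV, and no NULLs.
/// Both trait levels are reachable through the one trait object.
fn passes_precheck(metrics: &dyn EncodingStatsProvider, column_id: ColumnId) -> bool {
    let Some(stats) = metrics.column_stats(&column_id) else {
        return false;
    };
    stats.null_count == 0 && metrics.column_ndv(&column_id).is_some()
}
```

In a unit test, a small hand-built map wrapped in the view is enough to drive a rule without constructing a full block. A column missing from the map, or one without an NDV, simply fails the precheck.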