
[Docs] inconsistency between C++ implementation and the statistics schema #45560

Closed
arashandishgar opened this issue Feb 18, 2025 · 17 comments

@arashandishgar
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

I have implemented the example in C++ and, as mentioned in this link, I should expect a physical layout that has 9 elements in the column field, but I got only three elements in the column field.

main.zip
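
A condensed sketch of the kind of code involved (the schema, statistics values, and the Run wrapper below are placeholders for illustration, not the exact contents of main.zip):

#include <arrow/api.h>
#include <arrow/array/statistics.h>    // ArrayStatistics
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (testing helper)

#include <cstdint>
#include <iostream>
#include <memory>

arrow::Status Run() {
  // Placeholder schema and values just to exercise MakeStatisticsArray().
  auto schema = arrow::schema({arrow::field("int64", arrow::int64()),
                               arrow::field("float64", arrow::float64())});

  // Attach a few statistics to each column's ArrayData.
  auto int64_data = arrow::ArrayFromJSON(arrow::int64(), "[1, 2, 3]")->data()->Copy();
  int64_data->statistics = std::make_shared<arrow::ArrayStatistics>();
  int64_data->statistics->min = int64_t{1};
  int64_data->statistics->max = int64_t{3};

  auto float64_data =
      arrow::ArrayFromJSON(arrow::float64(), "[1.0, null, -1.0]")->data()->Copy();
  float64_data->statistics = std::make_shared<arrow::ArrayStatistics>();
  float64_data->statistics->min = -1.0;
  float64_data->statistics->max = 1.0;

  auto batch = arrow::RecordBatch::Make(
      schema, 3, {arrow::MakeArray(int64_data), arrow::MakeArray(float64_data)});

  ARROW_ASSIGN_OR_RAISE(auto statistics_array, batch->MakeStatisticsArray());
  // The result has one element per column with statistics (plus one for the
  // record batch itself), not one element per individual statistic as the
  // documentation example suggested.
  std::cout << statistics_array->length() << std::endl;
  return arrow::Status::OK();
}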

Component(s)

Documentation

@kou
Member

kou commented Feb 18, 2025

Good catch!

The example was wrong. column should not have duplicated values.
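
Roughly, the difference in the physical layout of the "column" child looks like this (the indices below are illustrative, not the exact values from the documentation page):

#include <arrow/api.h>
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (testing helper)

void IllustrateColumnChild() {
  // Old (wrong) example: one "column" entry per statistic, so the same
  // column index repeats.
  auto duplicated_column =
      arrow::ArrayFromJSON(arrow::int32(), "[null, 0, 0, 0, 0, 1, 1, 1, 1]");
  // Corrected layout: one "column" entry per column (null = the whole record
  // batch), with all of that column's statistics grouped into the
  // corresponding "statistics" map entry.
  auto grouped_column = arrow::ArrayFromJSON(arrow::int32(), "[null, 0, 1]");
}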

@kou kou changed the title [Docs] inconsistency between C++ implementation and the statistics schema. [Docs] inconsistency between C++ implementation and the statistics schema Feb 18, 2025
kou added a commit to kou/arrow that referenced this issue Feb 18, 2025
"column" in examples have duplicated values for the same column but
"column" must have only one value for each column.
@arashandishgar
Contributor Author

I think there is another issue in record batch statistics:

/// https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html

The link no longer exists.

@arashandishgar
Contributor Author

There is another issue in a test regarding Record batch statistics.

TEST_F(TestRecordBatch, MakeStatisticsArrayMaxApproximate) {
  auto schema =
      ::arrow::schema({field("no-statistics", boolean()), field("float64", float64())});
  auto no_statistics_array = ArrayFromJSON(boolean(), "[true, false, true]");
  auto float64_array_data = ArrayFromJSON(float64(), "[1.0, null, -1.0]")->data()->Copy();
  float64_array_data->statistics = std::make_shared<ArrayStatistics>();
  float64_array_data->statistics->min = -1.0;
  auto float64_array = MakeArray(std::move(float64_array_data));
  auto batch = RecordBatch::Make(schema, float64_array->length(),
                                 {no_statistics_array, float64_array});
  ASSERT_OK_AND_ASSIGN(auto statistics_array, batch->MakeStatisticsArray());
  ASSERT_OK_AND_ASSIGN(
      auto expected_statistics_array,
      MakeStatisticsArray("[null, 1]",
                          {{
                               ARROW_STATISTICS_KEY_ROW_COUNT_EXACT,
                           },
                           {
                               ARROW_STATISTICS_KEY_MIN_VALUE_APPROXIMATE,
                           }},
                          {{
                               ArrayStatistics::ValueType{int64_t{3}},
                           },
                           {
                               ArrayStatistics::ValueType{-1.0},
                           }}));
  AssertArraysEqual(*expected_statistics_array, *statistics_array, true);
}

This test is about the approximate min value, but the title is about the approximate max value.

@arashandishgar
Contributor Author

I think one thing that is not clear about the statistics schema specification is what I should expect from nested type statistics. I know that nested type statistics have not been implemented yet (link), but nested types can have statistics like a simple type. Should the implementation accept all of that? For example, is it acceptable for a struct in a record batch to have max_approximate?

@kou
Member

kou commented Feb 19, 2025

There is another issue in a test regarding Record batch statistics.

This test is about the approximate min value, but the title is about the approximate max value.

Good catch! Could you open a separate issue for this?

Let's fix it in a separate PR.

@kou
Member

kou commented Feb 19, 2025

Should the implementation accept all of that? For example, is it acceptable for a struct in a record batch to have max_approximate?

Yes.

We may need to add arrow::Scalar or something to arrow::ArrayStatistics::ValueType for it. (Or we may use arrow::Scalar instead of arrow::ArrayStatistics::ValueType.)
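
A rough sketch of that direction (an assumption about a possible shape, not existing API; I'm assuming the current ValueType is roughly a std::variant over bool/int64_t/uint64_t/double/std::string, consistent with how the tests in this thread use it):

#include <arrow/scalar.h>  // arrow::Scalar

#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <variant>

// Hypothetical sketch, not existing API: a ValueType that can also hold an
// arrow::Scalar, so nested types (struct, list, ...) could carry min/max values.
struct NestedAwareStatisticsSketch {
  using ValueType = std::variant<bool, int64_t, uint64_t, double, std::string,
                                 std::shared_ptr<arrow::Scalar>>;
  std::optional<ValueType> min;
  std::optional<ValueType> max;
};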

@arashandishgar
Contributor Author

arashandishgar commented Feb 19, 2025

Let's imagine a type like
auto type = struct_({field("list_a", fixed_size_list(int64(), 2)), field("list_b", fixed_size_list(int64(), 2))});
I think even if there is a max element for the struct (not the child type), it would probably be two fixed_size_lists again. Thus it could not be a scalar.

@arashandishgar
Contributor Author

There is another issue in a test regarding Record batch statistics.
This test is about the approximate min value, but the title is about the approximate max value.

Good catch! Could you open a separate issue for this?

Let's fix it in a separate PR.

I think things connected to the C++ implementation would be better solved in this issue. I will probably pick it up after I finish another issue. (One issue has already been assigned to me.)

@kou
Member

kou commented Feb 19, 2025

Let's imagine a type like auto type = struct_({field("list_a", fixed_size_list(int64(), 2)), field("list_b", fixed_size_list(int64(), 2))}); I think even if there is a max element for the struct (not the child type), it would probably be two fixed_size_lists again. Thus it could not be a scalar.

We have StructScalar and FixedSizeListScalar:

struct ARROW_EXPORT StructScalar : public Scalar {
  using TypeClass = StructType;
  using ValueType = std::vector<std::shared_ptr<Scalar>>;

  ScalarVector value;

  Result<std::shared_ptr<Scalar>> field(FieldRef ref) const;

  StructScalar(ValueType value, std::shared_ptr<DataType> type, bool is_valid = true)
      : Scalar(std::move(type), is_valid), value(std::move(value)) {}

  static Result<std::shared_ptr<StructScalar>> Make(ValueType value,
                                                    std::vector<std::string> field_names);
};

struct ARROW_EXPORT FixedSizeListScalar : public BaseListScalar {
  using TypeClass = FixedSizeListType;

  FixedSizeListScalar(std::shared_ptr<Array> value, std::shared_ptr<DataType> type,
                      bool is_valid = true);

  explicit FixedSizeListScalar(std::shared_ptr<Array> value, bool is_valid = true);
};

It seems that they can represent the max value.
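
For example, a candidate max value for the struct-of-fixed-size-lists type above could be built like this (a rough sketch; the MakeStructMaxSketch name and element values are made up for illustration, and ArrayFromJSON is the testing helper used elsewhere in this thread):

#include <arrow/api.h>
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (testing helper)

#include <memory>

// Hypothetical example values; only the scalar types are the point here.
arrow::Result<std::shared_ptr<arrow::StructScalar>> MakeStructMaxSketch() {
  auto list_type = arrow::fixed_size_list(arrow::int64(), 2);
  // One FixedSizeListScalar per child field; its value is the 2-element child array.
  auto list_a_max = std::make_shared<arrow::FixedSizeListScalar>(
      arrow::ArrayFromJSON(arrow::int64(), "[7, 8]"), list_type);
  auto list_b_max = std::make_shared<arrow::FixedSizeListScalar>(
      arrow::ArrayFromJSON(arrow::int64(), "[5, 6]"), list_type);
  // Wrap the per-field maxima in a StructScalar for the whole struct value.
  return arrow::StructScalar::Make({list_a_max, list_b_max}, {"list_a", "list_b"});
}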

@kou
Member

kou commented Feb 19, 2025

I think things connected to the C++ implementation would be better solved in this issue. I will probably pick it up after I finish another issue. (One issue has already been assigned to me.)

The issue is for enhancing the current feature, but the test isn't an enhancement; it's a bug fix. I think that we should use a separate issue/pull request.

I think that we can merge the latter soon because the fix will be straightforward.

If we mix them, we will not be able to merge soon, because the former will be complex and we can't merge only the latter.

@arashandishgar
Contributor Author

I think things connected to the C++ implementation would be better solved in this issue. I will probably pick it up after I finish another issue. (One issue has already been assigned to me.)

The issue is for enhancing the current feature, but the test isn't an enhancement; it's a bug fix. I think that we should use a separate issue/pull request.

I think that we can merge the latter soon because the fix will be straightforward.

If we mix them, we will not be able to merge soon, because the former will be complex and we can't merge only the latter.

Okay, I will do it.

By the way, what about this?

I think there is another issue in record batch statistics:

arrow/cpp/src/arrow/record_batch.h

Line 289 in ec3d283

/// https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html

The link no longer exists.

@kou
Member

kou commented Feb 19, 2025

Ah, let's work on it as another separate task too. We'll be able to merge it quickly.

@arashandishgar
Contributor Author

Ah, let's work on it as another separate task too. We'll be able to merge it quickly.

Okay, I will do it.

kou added a commit that referenced this issue Feb 27, 2025
### Rationale for this change

"column" in examples have duplicated values for the same column but "column" must have only one value for each column.

### What changes are included in this PR?

* Use "rowspan" for "column" in logical examples.
* Remove duplicated values from "column" in physical examples.
* Add missing "offsets" to "statistics" in physical examples.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #45560

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 20.0.0 milestone Feb 27, 2025
@kou
Member

kou commented Feb 27, 2025

Issue resolved by pull request #45561

@kou kou closed this as completed Feb 27, 2025
@arashandishgar
Contributor Author

arashandishgar commented Feb 27, 2025

Issue resolved by pull request #45561

Several things remain that have not been addressed yet:

1. Lack of implementation for nested types, as only bool, int, float, and string are supported in ArrayStatistics.
2. Several statistics attributes have not been implemented yet, such as "ARROW:row_count:approximate", "ARROW:average_byte_width:exact", and "ARROW:average_byte_width:approximate" (rough sketch below).
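
For item 2, a hypothetical sketch of how those keys might be emitted once the corresponding fields exist (the EmitAverageByteWidth helper and its average_byte_width/is_exact parameters are assumptions for illustration, not existing Arrow API; only the key strings come from the statistics schema):

#include <optional>
#include <string>
#include <utility>

// Statistics keys from the statistics schema (not yet emitted by the C++ implementation).
constexpr const char* kAverageByteWidthExact = "ARROW:average_byte_width:exact";
constexpr const char* kAverageByteWidthApproximate = "ARROW:average_byte_width:approximate";

// Hypothetical helper: pick the ":exact" or ":approximate" key depending on a flag,
// mirroring how min/max already switch between exact and approximate keys.
std::optional<std::pair<std::string, double>> EmitAverageByteWidth(
    std::optional<double> average_byte_width, bool is_exact) {
  if (!average_byte_width.has_value()) return std::nullopt;
  return std::make_pair(
      std::string(is_exact ? kAverageByteWidthExact : kAverageByteWidthApproximate),
      *average_byte_width);
}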

@kou
Member

kou commented Feb 27, 2025

1. Lack of implementation for nested types, as only bool, int, float, and string are supported in ArrayStatistics.

We should work on it as part of #45474, not this issue.

2. Several statistics attributes have not been implemented yet, such as "ARROW:row_count:approximate", "ARROW:average_byte_width:exact", and "ARROW:average_byte_width:approximate".

Could you open a new issue for this? Let's work on it as a separate task.
(BTW, arrow::RecordBatch::MakeStatisticsArray() doesn't need to use ARROW:row_count:approximate because it always knows the exact row count.)

@arashandishgar
Contributor Author

2. Several statistics attributes have not been implemented yet, such as "ARROW:row_count:approximate", "ARROW:average_byte_width:exact", and "ARROW:average_byte_width:approximate".

Could you open a new issue for this? Let's work on it as a separate task. (BTW, arrow::RecordBatch::MakeStatisticsArray() doesn't need to use ARROW:row_count:approximate because it always knows the exact row count.)

I've just created it: #45639
