
[Docs] inconsistency between C++ implementation and the statistics schema #45560

Closed
arashandishgar opened this issue Feb 18, 2025 · 17 comments

@arashandishgar
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

I have implemented the example in C++ and, as mentioned in this link, I should expect a physical layout that has 9 elements in the column field, but I got only three elements in the column field.

main.zip
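
A condensed sketch of the kind of code involved (the schema, statistics values, and the Run wrapper below are placeholders for illustration, not the exact contents of main.zip):

#include <arrow/api.h>
#include <arrow/array/statistics.h>    // ArrayStatistics
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (testing helper)

#include <cstdint>
#include <iostream>
#include <memory>

arrow::Status Run() {
  // Placeholder schema and values just to exercise MakeStatisticsArray().
  auto schema = arrow::schema({arrow::field("int64", arrow::int64()),
                               arrow::field("float64", arrow::float64())});

  // Attach a few statistics to each column's ArrayData.
  auto int64_data = arrow::ArrayFromJSON(arrow::int64(), "[1, 2, 3]")->data()->Copy();
  int64_data->statistics = std::make_shared<arrow::ArrayStatistics>();
  int64_data->statistics->min = int64_t{1};
  int64_data->statistics->max = int64_t{3};

  auto float64_data =
      arrow::ArrayFromJSON(arrow::float64(), "[1.0, null, -1.0]")->data()->Copy();
  float64_data->statistics = std::make_shared<arrow::ArrayStatistics>();
  float64_data->statistics->min = -1.0;
  float64_data->statistics->max = 1.0;

  auto batch = arrow::RecordBatch::Make(
      schema, 3, {arrow::MakeArray(int64_data), arrow::MakeArray(float64_data)});

  ARROW_ASSIGN_OR_RAISE(auto statistics_array, batch->MakeStatisticsArray());
  // The result has one element per column with statistics (plus one for the
  // record batch itself), not one element per individual statistic as the
  // documentation example suggested.
  std::cout << statistics_array->length() << std::endl;
  return arrow::Status::OK();
}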

Component(s)

Documentation

@kou
Member

kou commented Feb 18, 2025

Good catch!

The example was wrong. column should not have duplicated values.
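
Roughly, the difference in the physical layout of the "column" child looks like this (the indices below are illustrative, not the exact values from the documentation page):

#include <arrow/api.h>
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (testing helper)

void IllustrateColumnChild() {
  // Old (wrong) example: one "column" entry per statistic, so the same
  // column index repeats.
  auto duplicated_column =
      arrow::ArrayFromJSON(arrow::int32(), "[null, 0, 0, 0, 0, 1, 1, 1, 1]");
  // Corrected layout: one "column" entry per column (null = the whole record
  // batch), with all of that column's statistics grouped into the
  // corresponding "statistics" map entry.
  auto grouped_column = arrow::ArrayFromJSON(arrow::int32(), "[null, 0, 1]");
}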

@kou kou changed the title [Docs] inconsistency between C++ implementation and the statistics schema. [Docs] inconsistency between C++ implementation and the statistics schema Feb 18, 2025
kou added a commit to kou/arrow that referenced this issue Feb 18, 2025
"column" in examples have duplicated values for the same column but
"column" must have only one value for each column.
@arashandishgar
Contributor Author

I think there is another issue in record batch statistics:

/// https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html

The link no longer exists.

@arashandishgar
Contributor Author

There is another issue in a test regarding Record batch statistics.

TEST_F(TestRecordBatch, MakeStatisticsArrayMaxApproximate) {
  auto schema =
      ::arrow::schema({field("no-statistics", boolean()), field("float64", float64())});
  auto no_statistics_array = ArrayFromJSON(boolean(), "[true, false, true]");
  auto float64_array_data = ArrayFromJSON(float64(), "[1.0, null, -1.0]")->data()->Copy();
  float64_array_data->statistics = std::make_shared<ArrayStatistics>();
  float64_array_data->statistics->min = -1.0;
  auto float64_array = MakeArray(std::move(float64_array_data));
  auto batch = RecordBatch::Make(schema, float64_array->length(),
                                 {no_statistics_array, float64_array});
  ASSERT_OK_AND_ASSIGN(auto statistics_array, batch->MakeStatisticsArray());
  ASSERT_OK_AND_ASSIGN(
      auto expected_statistics_array,
      MakeStatisticsArray("[null, 1]",
                          {{
                               ARROW_STATISTICS_KEY_ROW_COUNT_EXACT,
                           },
                           {
                               ARROW_STATISTICS_KEY_MIN_VALUE_APPROXIMATE,
                           }},
                          {{
                               ArrayStatistics::ValueType{int64_t{3}},
                           },
                           {
                               ArrayStatistics::ValueType{-1.0},
                           }}));
  AssertArraysEqual(*expected_statistics_array, *statistics_array, true);
}

This test is about the approximate min value, but the title is about the approximate max value.

@arashandishgar
Contributor Author

I think one thing that is not clear about the statistics schema specification is what I should expect from nested type statistics. I know that nested type statistics have not been implemented yet (link), but nested types can have statistics like a simple type. Should the implementation accept all of that? For example, is it acceptable for a struct in a record batch to have max_approximate?

@kou
Member

kou commented Feb 19, 2025

There is another issue in a test regarding Record batch statistics.

This test is about the approximate min value, but the title is about the approximate max value.

Good catch! Could you open a separate issue for this?

Let's fix it in a separate PR.

@kou
Member

kou commented Feb 19, 2025

Should the implementation accept all of that? For example, is it acceptable for a struct in a record batch to have max_approximate?

Yes.

We may need to add arrow::Scalar or something to arrow::ArrayStatistics::ValueType for it. (Or we may use arrow::Scalar instead of arrow::ArrayStatistics::ValueType.)
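
A rough sketch of that direction (an assumption about a possible shape, not existing API; I'm assuming the current ValueType is roughly a std::variant over bool/int64_t/uint64_t/double/std::string, consistent with how the tests in this thread use it):

#include <arrow/scalar.h>  // arrow::Scalar

#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <variant>

// Hypothetical sketch, not existing API: a ValueType that can also hold an
// arrow::Scalar, so nested types (struct, list, ...) could carry min/max values.
struct NestedAwareStatisticsSketch {
  using ValueType = std::variant<bool, int64_t, uint64_t, double, std::string,
                                 std::shared_ptr<arrow::Scalar>>;
  std::optional<ValueType> min;
  std::optional<ValueType> max;
};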

@arashandishgar
Contributor Author

arashandishgar commented Feb 19, 2025

Let's imagine a type like
auto type = struct_({field("list_a", fixed_size_list(int64(), 2)), field("list_b", fixed_size_list(int64(), 2))});
I think even if there is a max element for the struct (not the child type), it would probably be two fixed_size_lists again. Thus it could not be a scalar.

@arashandishgar
Contributor Author

There is another issue in a test regarding Record batch statistics.
This test is about the approximate min value, but the title is about the approximate max value.

Good catch! Could you open a separate issue for this?

Let's fix it in a separate PR.

I think things connected to the C++ implementation would be better solved in this issue. I will probably pick it up after I finish another issue. (One issue has already been assigned to me.)

@kou
Member

kou commented Feb 19, 2025

Let's imagine a type like auto type = struct_({field("list_a", fixed_size_list(int64(), 2)), field("list_b", fixed_size_list(int64(), 2))}); I think even if there is a max element for the struct (not the child type), it would probably be two fixed_size_lists again. Thus it could not be a scalar.

We have StructScalar and FixedSizeListScalar:

struct ARROW_EXPORT StructScalar : public Scalar {
  using TypeClass = StructType;
  using ValueType = std::vector<std::shared_ptr<Scalar>>;

  ScalarVector value;

  Result<std::shared_ptr<Scalar>> field(FieldRef ref) const;

  StructScalar(ValueType value, std::shared_ptr<DataType> type, bool is_valid = true)
      : Scalar(std::move(type), is_valid), value(std::move(value)) {}

  static Result<std::shared_ptr<StructScalar>> Make(ValueType value,
                                                    std::vector<std::string> field_names);
};

struct ARROW_EXPORT FixedSizeListScalar : public BaseListScalar {
  using TypeClass = FixedSizeListType;

  FixedSizeListScalar(std::shared_ptr<Array> value, std::shared_ptr<DataType> type,
                      bool is_valid = true);

  explicit FixedSizeListScalar(std::shared_ptr<Array> value, bool is_valid = true);
};

It seems that they can represent the max value.
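
For example, a candidate max value for the struct-of-fixed-size-lists type above could be built like this (a rough sketch; the MakeStructMaxSketch name and element values are made up for illustration, and ArrayFromJSON is the testing helper used elsewhere in this thread):

#include <arrow/api.h>
#include <arrow/testing/gtest_util.h>  // ArrayFromJSON (testing helper)

#include <memory>

// Hypothetical example values; only the scalar types are the point here.
arrow::Result<std::shared_ptr<arrow::StructScalar>> MakeStructMaxSketch() {
  auto list_type = arrow::fixed_size_list(arrow::int64(), 2);
  // One FixedSizeListScalar per child field; its value is the 2-element child array.
  auto list_a_max = std::make_shared<arrow::FixedSizeListScalar>(
      arrow::ArrayFromJSON(arrow::int64(), "[7, 8]"), list_type);
  auto list_b_max = std::make_shared<arrow::FixedSizeListScalar>(
      arrow::ArrayFromJSON(arrow::int64(), "[5, 6]"), list_type);
  // Wrap the per-field maxima in a StructScalar for the whole struct value.
  return arrow::StructScalar::Make({list_a_max, list_b_max}, {"list_a", "list_b"});
}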

@kou
Member

kou commented Feb 19, 2025

I think things connected to the C++ implementation would be better solved in this issue. I will probably pick it up after I finish another issue. (One issue has already been assigned to me.)

The issue is for enhancing the current feature, but the test isn't an enhancement; it's a bug fix. I think that we should use a separate issue/pull request.

I think that we can merge the latter soon because the fix will be straightforward.

If we mix them, we will not be able to merge soon, because the former will be complex and we can't merge only the latter.

@arashandishgar
Contributor Author

I think things connected to the C++ implementation would be better solved in this issue. I will probably pick it up after I finish another issue. (One issue has already been assigned to me.)

The issue is for enhancing the current feature, but the test isn't an enhancement; it's a bug fix. I think that we should use a separate issue/pull request.

I think that we can merge the latter soon because the fix will be straightforward.

If we mix them, we will not be able to merge soon, because the former will be complex and we can't merge only the latter.

Okay, I will do it.

By the way, what about this?

I think there is another issue in record batch statistics:

arrow/cpp/src/arrow/record_batch.h

Line 289 in ec3d283

/// https://arrow.apache.org/docs/format/CDataInterfaceStatistics.html

The link no longer exists.

@kou
Member

kou commented Feb 19, 2025

Ah, let's work on it as another separate task too. We'll be able to merge it quickly.

@arashandishgar
Contributor Author

Ah, let's work on it as another separate task too. We'll be able to merge it quickly.

Okay, I will do it.

kou added a commit that referenced this issue Feb 27, 2025
### Rationale for this change

"column" in examples have duplicated values for the same column but "column" must have only one value for each column.

### What changes are included in this PR?

* Use "rowspan" for "column" in logical examples.
* Remove duplicated values from "column" in physical examples.
* Add missing "offsets" to "statistics" in physical examples.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* GitHub Issue: #45560

Authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 20.0.0 milestone Feb 27, 2025
@kou
Member

kou commented Feb 27, 2025

Issue resolved by pull request #45561

@kou kou closed this as completed Feb 27, 2025
@arashandishgar
Contributor Author

arashandishgar commented Feb 27, 2025

Issue resolved by pull request #45561

Several things remain that have not been addressed yet:

1. Lack of implementation for nested types, as only bool, int, float, and string are supported in ArrayStatistics.
2. Several statistics attributes have not been implemented yet, such as "ARROW:row_count:approximate", "ARROW:average_byte_width:exact", and "ARROW:average_byte_width:approximate" (rough sketch below).
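
For item 2, a hypothetical sketch of how those keys might be emitted once the corresponding fields exist (the EmitAverageByteWidth helper and its average_byte_width/is_exact parameters are assumptions for illustration, not existing Arrow API; only the key strings come from the statistics schema):

#include <optional>
#include <string>
#include <utility>

// Statistics keys from the statistics schema (not yet emitted by the C++ implementation).
constexpr const char* kAverageByteWidthExact = "ARROW:average_byte_width:exact";
constexpr const char* kAverageByteWidthApproximate = "ARROW:average_byte_width:approximate";

// Hypothetical helper: pick the ":exact" or ":approximate" key depending on a flag,
// mirroring how min/max already switch between exact and approximate keys.
std::optional<std::pair<std::string, double>> EmitAverageByteWidth(
    std::optional<double> average_byte_width, bool is_exact) {
  if (!average_byte_width.has_value()) return std::nullopt;
  return std::make_pair(
      std::string(is_exact ? kAverageByteWidthExact : kAverageByteWidthApproximate),
      *average_byte_width);
}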

@kou
Member

kou commented Feb 27, 2025

1. Lack of implementation for nested types, as only bool, int, float, and string are supported in ArrayStatistics.

We should work on it as part of #45474, not this issue.

2. Several statistics attributes have not been implemented yet, such as "ARROW:row_count:approximate", "ARROW:average_byte_width:exact", and "ARROW:average_byte_width:approximate".

Could you open a new issue for this? Let's work on it as a separate task.
(BTW, arrow::RecordBatch::MakeStatisticsArray() doesn't need to use ARROW:row_count:approximate because it always knows the exact row count.)

@arashandishgar
Contributor Author

2. Several statistics attributes have not been implemented yet, such as "ARROW:row_count:approximate", "ARROW:average_byte_width:exact", and "ARROW:average_byte_width:approximate".

Could you open a new issue for this? Let's work on it as a separate task. (BTW, arrow::RecordBatch::MakeStatisticsArray() doesn't need to use ARROW:row_count:approximate because it always knows the exact row count.)

I've just created it: #45639
