GH-33592: [C++] support casting nullable fields to non-nullable if there are no null values #43782

Open
wants to merge 3 commits into
base: main

NickCrews
Contributor

@NickCrews NickCrews commented Aug 21, 2024

Notes for myself/fixer:

  • tests that need to get updated (almost definitely not a complete list)
  • [update: actually, we should handle the Go implementation in the Go repository.] Hmm, it looks like the Go wrapper does its own nullability checks. I assume this is just an optimization to avoid going into the C++ and hitting the error down there. So if we just delete this check, I think the C++ logic will handle all of it?
  • how the plain column handles this cast; some logic like this probably needs to get ported over to the struct implementation
  • (running from /cpp/build) cmake .. --preset ninja-debug-basic, then cmake --build . && PYTHON=python ctest -R 'arrow-compute-scalar-cast-test' --output-on-failure to run the specific test
  • to format: uvx pre-commit run --all-files clang-format
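The rule these notes are working toward can be sketched in isolation. This is a minimal sketch that does not use Arrow at all: `FieldSketch`, `ColumnSketch`, `CountNulls`, and `CanCastNullability` are hypothetical names invented for illustration, and the actual kernel in this PR may be structured quite differently.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Arrow types.
struct FieldSketch {
  std::string name;
  bool nullable;
};

// A column of optional values: std::nullopt models a null slot.
using ColumnSketch = std::vector<std::optional<int64_t>>;

// Count the null slots in a column.
int64_t CountNulls(const ColumnSketch& column) {
  int64_t nulls = 0;
  for (const auto& v : column) {
    if (!v.has_value()) ++nulls;
  }
  return nulls;
}

// Returns true if a column described by `from` may be relabelled as `to`,
// as far as nullability is concerned.
bool CanCastNullability(const FieldSketch& from, const FieldSketch& to,
                        const ColumnSketch& column) {
  // Non-nullable -> anything, or nullable -> nullable, is always fine.
  if (!from.nullable || to.nullable) return true;
  // Nullable -> non-nullable: only allowed when no nulls are present.
  return CountNulls(column) == 0;
}
```

The only case that needs to look at the data is nullable to non-nullable; every other combination can be decided from the field metadata alone.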


Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@NickCrews NickCrews changed the title from "[C++][Go]: support casting nullable fields to non-nullable if there are no null values" to "GH-33592 [C++][Go]: support casting nullable fields to non-nullable if there are no null values" Aug 21, 2024

⚠️ GitHub issue #33592 has been automatically assigned in GitHub to PR creator.

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch 3 times, most recently from 52100dd to 49d5a99, August 22, 2024 00:40
@rustyconover

This PR looks pretty good to me. What additional help would you like to get it merged?

@NickCrews
Contributor Author

Thanks for the help! I'm a random contributor, so I'm not sure about the exact workflow, but I think both getting the CI run approved and having a C++ owner approve it are needed.

Do you have any more detailed thoughts on whether the tests are adequate, whether anywhere else in the codebase needs to change, or my handling of the Go implementation by just deleting the short-circuit? Basically, anything that would reduce the mental load on the code owner would, I think, increase the odds that they approve it :)

@mapleFU
Member

mapleFU commented Sep 15, 2024

Can you mark this as ready for review? Or is it still WIP?

@NickCrews NickCrews marked this pull request as ready for review September 15, 2024 20:11
@NickCrews
Contributor Author

Oops, I didn't mean for this to still be marked WIP. It looks like a few failures need to get fixed, but a high-level review of the general approach would still be appreciated in the meantime.

@zeroshade
Member

You probably need the same corresponding check on the Go side to verify that it's only allowed if there are no nulls.

Also, the Go implementation has been moved to the apache/arrow-go repository. So please file the PR there for the Go side. Sorry for the confusion, we just haven't removed the Go code from here yet.

@NickCrews
Contributor Author

Thanks @zeroshade, will do. Do both PRs need to land at about the same time, or can I do that independently?

I will undo my changes to the Go code here, leaving it untouched.

@zeroshade
Member

They can land independently, no issues there. Thanks!

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from 49d5a99 to eb9e7b6, September 16, 2024 16:27
@NickCrews
Contributor Author

Just pushed a new version:

  • dropped the Go changes
  • rebased on top of 3600db8, which is one commit behind main. main is currently failing CI but that commit passed, and I think the failing CI checks in this PR are unrelated to its changes
  • I discovered archery and formatted the files with archery lint --clang-format --fix

We will see if this passes CI now...

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from eb9e7b6 to e6d54d3, September 18, 2024 07:49
@NickCrews
Contributor Author

Pushed a new version, hopefully that fixed the broken tests:

  • to go from a nullable to a non-nullable type, you need to use the unsafe cast option
  • fixed a typo, changing int8 to int64 so the precision matches
  • removed the [Go] tag from the commit message

@NickCrews
Contributor Author

@zeroshade I think this is ready to review/merge. The failing CI runs look like flakes while trying to set up the environment?

@kou
Member

kou commented Sep 19, 2024

Can we move this to apache/arrow-go?

@kou kou changed the title from "GH-33592 [C++][Go]: support casting nullable fields to non-nullable if there are no null values" to "GH-33592: [C++] support casting nullable fields to non-nullable if there are no null values" Sep 19, 2024
@kou
Member

kou commented Sep 19, 2024

Ah, the Go part was removed from this PR.
I've removed "[Go]" from the PR title.

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from e6d54d3 to 106e627, September 24, 2024 18:53
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from 40ae139 to ed9b368, February 7, 2025 01:36
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from ed9b368 to 56c0d88, February 7, 2025 06:29
Member

@mapleFU mapleFU left a comment


General LGTM!

cpp/src/arrow/compute/kernels/scalar_cast_test.cc (outdated, resolved)
@NickCrews NickCrews closed this Feb 7, 2025
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from 56c0d88 to 9016a83, February 7, 2025 06:34
@NickCrews NickCrews reopened this Feb 7, 2025
@NickCrews
Contributor Author

Whoops, I accidentally force-pushed with no changes, which closed the PR. It's back up now.

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from b124c64 to ebe9a5a, February 7, 2025 08:01
@NickCrews
Contributor Author

@mapleFU as I added the slicing tests, I also refactored the tests to make them much more consistent and concise. Take another look and make sure that your "LGTM" still holds after those changes.
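Why slicing deserves its own tests can be shown with a small standalone sketch: a slice shares its parent's validity data, so the "no nulls" check has to look only at the slots the slice exposes. `SlicedColumnSketch` and `SliceHasNoNulls` are hypothetical names used for illustration; this is not Arrow's `ArrayData`, and it does not reproduce the tests in this PR.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a sliced array view; not an Arrow type.
// `validity` is shared with the parent; `offset`/`length` select the slice.
struct SlicedColumnSketch {
  std::shared_ptr<const std::vector<bool>> validity;  // true = valid slot
  std::size_t offset;
  std::size_t length;

  // Return a zero-copy view over `len` slots starting at `start` within
  // this view. The caller must keep start + len <= length.
  SlicedColumnSketch Slice(std::size_t start, std::size_t len) const {
    return SlicedColumnSketch{validity, offset + start, len};
  }

  // Count nulls inside the slice only, not across the whole parent buffer.
  std::size_t NullCount() const {
    std::size_t nulls = 0;
    for (std::size_t i = offset; i < offset + length; ++i) {
      if (!(*validity)[i]) ++nulls;
    }
    return nulls;
  }
};

// A nullable -> non-nullable cast is acceptable for this view only if the
// visible slots contain no nulls.
bool SliceHasNoNulls(const SlicedColumnSketch& column) {
  return column.NullCount() == 0;
}
```

In this sketch a parent with one null fails the check, while a slice that skips past the null passes it, which is the behaviour a slicing test would want to pin down.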

Member

@pitrou pitrou left a comment


Just a nit!

cpp/src/arrow/compute/kernels/scalar_cast_nested.cc (outdated, resolved)
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch 2 times, most recently from ce00550 to ea241a5, February 10, 2025 16:56
@NickCrews
Contributor Author

NickCrews commented Feb 10, 2025

I'm getting these UBSAN failures in CI; the relevant logs are below. Is it because of the new std::vector<std::shared_ptr<Array>> arrays_dest_ac = {arrays_dest[0], arrays_dest[2]} code that I added? The line numbers referenced appear to be the ones after macro expansion/preprocessing, so I'm not sure which line in the source file is actually the problem :( Is there a good way to silence or ignore this error? Any hints on how to run this locally so I can reproduce and test it?

[ RUN      ] Cast.StructToStructSubset
/usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:1064:34: runtime error: addition of unsigned offset to 0x60700006f300 overflowed to 0x60700006f2e0
    #0 0x7f811aca4bc8 in std::vector<std::shared_ptr<arrow::Field>, std::allocator<std::shared_ptr<arrow::Field> > >::operator[](unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:1064:34
    #1 0x7f811ac9b27a in arrow::DataType::field(int) const /arrow/cpp/src/arrow/type.h:153:61
    #2 0x7f8105e4f8f1 in arrow::compute::internal::(anonymous namespace)::CastStruct::Exec(arrow::compute::KernelContext*, arrow::compute::ExecSpan const&, arrow::compute::ExecResult*) /arrow/cpp/src/arrow/compute/kernels/scalar_cast_nested.cc:403:38
    #3 0x7f8105ab206e in arrow::compute::detail::(anonymous namespace)::ScalarExecutor::ExecuteNonSpans(arrow::compute::detail::ExecListener*) /arrow/cpp/src/arrow/compute/exec.cc:920:7
    #4 0x7f8105aa99fa in arrow::compute::detail::(anonymous namespace)::ScalarExecutor::Execute(arrow::compute::ExecBatch const&, arrow::compute::detail::ExecListener*) /arrow/cpp/src/arrow/compute/exec.cc:810:14
    #5 0x7f8105c471a6 in arrow::compute::detail::FunctionExecutorImpl::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, long) /arrow/cpp/src/arrow/compute/function.cc:278:5
    #6 0x7f8105c17bfe in arrow::compute::(anonymous namespace)::ExecuteInternal(arrow::compute::Function const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> >, long, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) /arrow/cpp/src/arrow/compute/function.cc:343:21
    #7 0x7f8105c16d84 in arrow::compute::Function::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const /arrow/cpp/src/arrow/compute/function.cc:350:10
    #8 0x7f8105a5d13c in arrow::compute::internal::(anonymous namespace)::CastMetaFunction::ExecuteImpl(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const /arrow/cpp/src/arrow/compute/cast.cc:124:23
    #9 0x7f8105c21b5c in arrow::compute::MetaFunction::Execute(std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) const /arrow/cpp/src/arrow/compute/function.cc:483:10
    #10 0x7f8105a8cfc8 in arrow::compute::CallFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::compute::FunctionOptions const*, arrow::compute::ExecContext*) /arrow/cpp/src/arrow/compute/exec.cc:1369:16
    #11 0x55f68eeb23c9 in arrow::compute::CheckScalarNonRecursive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::Datum const&, arrow::compute::FunctionOptions const*) /arrow/cpp/src/arrow/compute/kernels/test_util_internal.cc:80:3
    #12 0x55f68eeb8480 in arrow::compute::CheckScalar(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<arrow::Datum, std::allocator<arrow::Datum> > const&, arrow::Datum, arrow::compute::FunctionOptions const*) /arrow/cpp/src/arrow/compute/kernels/test_util_internal.cc:109:3
    #13 0x55f68eecc36c in arrow::compute::CheckScalarUnary(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, arrow::Datum, arrow::Datum, arrow::compute::FunctionOptions const*) /arrow/cpp/src/arrow/compute/kernels/test_util_internal.cc:255:3
    #14 0x55f68ef4d8f7 in arrow::compute::CheckCast(std::shared_ptr<arrow::Array>, std::shared_ptr<arrow::Array>, arrow::compute::CastOptions) /arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:111:3
    #15 0x55f68f079ab8 in arrow::compute::CheckStructToStructSubset(std::vector<std::shared_ptr<arrow::DataType>, std::allocator<std::shared_ptr<arrow::DataType> > > const&) /arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:3829:7
    #16 0x55f68f06d09f in arrow::compute::Cast_StructToStructSubset_Test::TestBody() /arrow/cpp/src/arrow/compute/kernels/scalar_cast_test.cc:4003:36
    #17 0x7f811af8260e in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/usr/local/lib/libarrow_testing.so.2000+0x108c60e) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #18 0x7f811af770c5 in testing::Test::Run() (/usr/local/lib/libarrow_testing.so.2000+0x10810c5) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #19 0x7f811af77244 in testing::TestInfo::Run() (/usr/local/lib/libarrow_testing.so.2000+0x1081244) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #20 0x7f811af777f8 in testing::TestSuite::Run() (/usr/local/lib/libarrow_testing.so.2000+0x10817f8) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #21 0x7f811af77efe in testing::internal::UnitTestImpl::RunAllTests() (/usr/local/lib/libarrow_testing.so.2000+0x1081efe) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #22 0x7f811af82bd6 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/usr/local/lib/libarrow_testing.so.2000+0x108cbd6) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #23 0x7f811af7730b in testing::UnitTest::Run() (/usr/local/lib/libarrow_testing.so.2000+0x108130b) (BuildId: 408b230591f17f240ad9d2b7d868274f1179009d)
    #24 0x55f68edefb33 in main (/build/cpp/debug/arrow-compute-scalar-cast-test+0x2adb33) (BuildId: c176b331f0537175685c7888a83fc8e112006f6a)
    #0 0x7f80f7d06d8f in
    #26 0x7f80f7d06e3f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: cd410b710f0f094c6832edd95931006d883af48e)
    #27 0x55f68edefb94 in _start (/build/cpp/debug/arrow-compute-scalar-cast-test+0x2adb94) (BuildId: c176b331f0537175685c7888a83fc8e112006f6a)

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_vector.h:1064:34 in
/build/cpp/src/arrow/compute/kernels

      Start 22: arrow-compute-scalar-cast-test
    Test #22: arrow-compute-scalar-cast-test ...............***Failed   12.44 sec

@pitrou
Member

pitrou commented Feb 11, 2025

Well, the Windows 2019 CI test shows a similar error, so I think it needs diagnosing and solving:

[ RUN      ] Cast.StructToStructSubset
C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Tools\MSVC\14.29.30133\include\vector(1563) : Assertion failed: vector subscript out of range
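Both logs point at an out-of-range std::vector subscript reached through arrow::DataType::field(int). In the UBSAN report the resulting address (0x60700006f2e0) is lower than the base address (0x60700006f300), which is consistent with a negative int index being converted to an unsigned one, although the actual cause in this PR is not confirmed here. The sketch below shows that general failure mode and a guard against it; `FindFieldIndex` and `FieldNameAt` are hypothetical helpers invented for illustration, not Arrow functions.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: index of `name` in `names`, or -1 when absent.
int FindFieldIndex(const std::vector<std::string>& names,
                   const std::string& name) {
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (names[i] == name) return static_cast<int>(i);
  }
  return -1;
}

// Bounds-checked accessor. Passing a negative or too-large `i` straight to
// names[i] would be undefined behaviour, because operator[] converts the
// index to an unsigned type and performs no range check.
std::optional<std::string> FieldNameAt(const std::vector<std::string>& names,
                                       int i) {
  if (i < 0 || static_cast<std::size_t>(i) >= names.size()) {
    return std::nullopt;
  }
  return names[static_cast<std::size_t>(i)];
}
```

Checking the index before the subscript turns the sanitizer failure into an ordinary "not found" result that the caller can report as an error.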

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from ea241a5 to 45f1ac2, February 11, 2025 17:01
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from 45f1ac2 to 0959343, February 11, 2025 17:07
@NickCrews
Contributor Author

I rebased on top of #45500 so that hopefully CI will pass
