Bus errors when writing DataFrame #474

Open
simsurace opened this issue Jul 3, 2023 · 8 comments

Comments

@simsurace
Contributor

simsurace commented Jul 3, 2023

I intermittently get bus errors that crash Julia back to the terminal when writing DataFrames to .arrow files:

julia> Arrow.write("df.arrow", df)
[8796] signal (7.2): Bus error
in expression starting at REPL[243]:1
getindex at ./essentials.jl:13 [inlined]
getindex at ~/.julia/packages/Arrow/R2Rvz/src/arraytypes/primitive.jl:48 [inlined]
getindex at ~/.julia/packages/ArrowTypes/Nb4EC/src/ArrowTypes.jl:412 [inlined]
iterate at ./abstractarray.jl:1220 [inlined]
iterate at ./abstractarray.jl:1218 [inlined]
writearray at ~/.julia/packages/Arrow/R2Rvz/src/utils.jl:49
writebuffer at ~/.julia/packages/Arrow/R2Rvz/src/arraytypes/primitive.jl:102
unknown function (ip: 0x7ff9d489d0d0)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
write at ~/.julia/packages/Arrow/R2Rvz/src/write.jl:363
macro expansion at ~/.julia/packages/Arrow/R2Rvz/src/write.jl:151 [inlined]
#124 at ./threadingconstructs.jl:373
unknown function (ip: 0x7ff9d487aa9f)
_jl_invoke at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/julia.h:1879 [inlined]
start_task at /cache/build/default-amdci4-6/julialang/julia-release-1-dot-9/src/task.c:1092
Allocations: 16091661486 (Pool: 15611643534; Big: 480017952); GC: 116806
Bus error (core dumped)

This was with Julia 1.9.1 and Arrow.jl v2.6.2

@simsurace
Contributor Author

Further details, as shared on Slack already:
The DataFrame was 84x13075, a single Date column and 13074 Float64 columns.

@Moelf
Contributor

Moelf commented Jul 3, 2023

> 13074 Float64 columns.

That's pretty extreme; why isn't the table transposed? Otherwise, I don't know what a bus error could indicate. Arrow.jl is not especially memory-friendly, but this doesn't look like an OOM.

@simsurace
Contributor Author

The table isn't transposed because in general the columns are more heterogeneous. But this should be well below any hard limits; I've even written tables with 100000 columns before without any problems.

@stuartthomas25

stuartthomas25 commented Jul 11, 2023

signal (7): Bus error
in expression starting at REPL[1]:1
getindex at ./array.jl:924 [inlined]
getindex at /home/snthomas/.julia/packages/Arrow/R2Rvz/src/arraytypes/primitive.jl:48 [inlined]
getindex at ./subarray.jl:315 [inlined]
iterate at /home/snthomas/.julia/packages/Arrow/R2Rvz/src/arraytypes/list.jl:174 [inlined]
writearray at /home/snthomas/.julia/packages/Arrow/R2Rvz/src/utils.jl:49
writebuffer at /home/snthomas/.julia/packages/Arrow/R2Rvz/src/arraytypes/primitive.jl:102
writebuffer at /home/snthomas/.julia/packages/Arrow/R2Rvz/src/arraytypes/map.jl:118
Allocations: 45171119 (Pool: 45154695; Big: 16424); GC: 36

I am getting a similar error when writing a TypedTable. This TypedTable is only 10x23. I got this error from running Arrow.write(datafile, data).

The result is a zero-byte file at datafile.

Arrow version 2.6.2, Julia version 1.8.5.

@Moelf
Contributor

Moelf commented Jul 11, 2023

Is it possible to provide the schema of the 10x23 table? Or, better, can you write a small reproducer that generates dummy data?

@stuartthomas25

stuartthomas25 commented Jul 11, 2023

I think I resolved the issue. It has to do with how Arrow reads Tables from disk. It does not load the entire table into memory but uses only a view. If you write this view to the same file, it causes this bus error.

using TypedTables
using Arrow

tab = Table(
    a=[i for i=1:10],
    b=[fill(0.1,10) for i=1:10]
)
filename = "test.arrow"
Arrow.write(filename, tab)

newtab = Table(Arrow.Table(filename))
Arrow.write(filename, deepcopy(newtab)) # always works
Arrow.write(filename, copy(newtab)) # only works for simple columns like :a, but not :b
Arrow.write(filename, newtab) # always fails

I wonder if this is also the case in @simsurace's error.
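
For reference, a minimal sketch of one more way to sidestep the view problem described above. It assumes the crash comes from `Arrow.write` truncating a file while the table being written is still a memory-mapped view into it. Instead of `deepcopy`, the file's bytes are read into memory first and the table is built from those bytes (`Arrow.Table` accepts a `Vector{UInt8}`). The sample data mirrors the reproducer; this has not been checked against every column type.

```julia
using Arrow

filename = "test.arrow"

# Same shape as the reproducer: one flat column and one list column.
Arrow.write(filename, (a = collect(1:10), b = [fill(0.1, 10) for _ in 1:10]))

# Read the whole file into a byte vector, so the table is backed by
# ordinary memory rather than by a memory map of `filename`.
bytes = read(filename)
tbl = Arrow.Table(bytes)

# Writing back to the same path no longer reads from the file being rewritten.
Arrow.write(filename, tbl)
```

The trade-off is that the whole file is held in memory, which gives up the main benefit of the lazy view for large tables.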

@simsurace
Contributor Author

I don't believe that was the case. I was writing some intermediate results from a large calculation to a new file name, and the error occurred intermittently.

@sprig

sprig commented Apr 30, 2024

Can confirm the same issue that @stuartthomas25 identified. My fairly obvious workaround is to write the table to a temporary file and then overwrite the original. However, Julia's mv and the undocumented Base.rename aren't guaranteed to be atomic, so I wonder what happens to existing views as the file is overwritten.
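
The temporary-file workaround could look roughly like the sketch below. `write_via_tempfile` is a hypothetical helper name, not part of Arrow.jl. The temporary file is created in the same directory as the target so the final move stays on one filesystem. Nothing here makes the move atomic, and it leaves open the question of what happens to views that are still alive.

```julia
using Arrow

# Hypothetical helper (not an Arrow.jl function): write `tbl` to a temporary
# file next to `path`, then move that file over `path`.
function write_via_tempfile(path::AbstractString, tbl)
    dir = dirname(abspath(path))
    tmp, io = mktemp(dir)   # returns the temporary path and an open IO handle
    close(io)               # Arrow.write reopens the file by path
    try
        Arrow.write(tmp, tbl)
        mv(tmp, path; force=true)   # replaces `path`; not guaranteed atomic
    catch
        rm(tmp; force=true)         # do not leave a partial temporary file behind
        rethrow()
    end
    return path
end
```

With this helper, `write_via_tempfile(filename, Arrow.Table(filename))` reads from the original file while writing to a different one, so the source is never truncated during the write. Whether a still-open view remains valid after the move likely depends on the platform and is not established here.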


4 participants