Skip to content

Commit 900f6fe

Browse files
jorgecarleitaoalamb
authored andcommitted
ARROW-10889: [Rust] [Proposal] Add guidelines about usage of unsafe
I would like to propose that we outline and enforce guidelines on the arrow crate implementation with respect to the usage of `unsafe`. The background of this proposal are PRs #8645 and #8829. In both cases, while addressing an unrelated issue, they hit undefined behavior (UB) due to an incorrect usage of `unsafe` in the code base. This UB was very time-consuming to identify and debug: combined, they accounted for more than 12hs of my time. Safety against undefined behavior is the core premise of the Rust language. In many cases, the maintenance burden (time to find and fix bugs) does not justify the performance improvements and the decrease in motivation in handling them (they are just painful due to how difficult they are to debug). In particular, IMO those 12 hours would have been better spent in other parts of the code if `unsafe` would have not been used in the first place, which would have been likely translated in faster code or more features. There are situations where `unsafe` is necessary, and the guidelines outline these cases. However, I also see many uses of `unsafe` that are not necessary nor properly documented. The goal of these guidelines is to motivate contributors of the Rust implementation to be conscious about the maintenance cost of `unsafe`, and outline specific necessary conditions for any new `unsafe` to be introduced in the code base. Closes #8901 from jorgecarleitao/arrow_unsafe Lead-authored-by: Jorge Leitao <jorgecarleitao@gmail.com> Co-authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
1 parent 13cf86f commit 900f6fe

File tree

1 file changed

+79
-0
lines changed

1 file changed

+79
-0
lines changed

rust/arrow/README.md

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -97,6 +97,85 @@ Arrow uses the following features:
9797
Other than `simd` all the other features are enabled by default. Disabling `prettyprint` might be necessary in order to
9898
compile Arrow to the `wasm32-unknown-unknown` WASM target.
9999

100+
## Guidelines in usage of `unsafe`
101+
102+
[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust.
103+
For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829)
104+
This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent.
105+
106+
### When can `unsafe` be used?
107+
108+
Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`:
109+
110+
* alloc, dealloc and realloc of buffers along cache lines
111+
* Interpreting bytes as certain rust types, for access, representation and compute
112+
* Foreign interfaces (C data interface)
113+
* Inter-process communication (IPC)
114+
* SIMD
115+
* Performance (e.g. omit bounds checks, use of pointers to avoid bound checks)
116+
117+
#### cache-line aligned memory management
118+
119+
The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior.
120+
However, Rust's global allocator does not allocate memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`.
121+
122+
#### Interpreting bytes
123+
124+
The arrow format is specified in bytes (`u8`), which can be logically represented as certain types
125+
depending on the `DataType`.
126+
For many operations, such as access, representation, numerical computation and string manipulation,
127+
it is often necessary to interpret bytes as other physical types (e.g. `i32`).
128+
129+
Usage of `unsafe` for the purpose of interpreting bytes in their corresponding type (according to the arrow specification) is allowed. Specifically, the pointer to the byte slice must be aligned to the type that it intends to represent and the length of the slice is a multiple of the size of the target type of the transmutation.
130+
131+
#### FFI
132+
133+
The arrow format declares an ABI for zero-copy from and to libraries that implement the specification
134+
(foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to
135+
the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment) from the source code alone as they are part of the FFI contract.
136+
137+
#### IPC
138+
139+
The arrow format declares a IPC protocol, which this crate supports. IPC is equivalent to a FFI in that the rust compiler can't reason about the contract's invariants.
140+
141+
#### SIMD
142+
143+
The API provided by the `packed_simd` library is currently `unsafe`. However, SIMD offers a significant performance improvement over non-SIMD operations.
144+
145+
#### Performance
146+
147+
Some operations are significantly faster when `unsafe` is used.
148+
149+
A common usage of `unsafe` is to offer an API to access the `i`th element of an array (e.g. `UInt32Array`).
150+
This requires accessing the values buffer e.g. `array.buffers()[0]`, picking the slice
151+
`[i * size_of<i32>(), (i + 1) * size_of<i32>()]`, and then transmuting it to `i32`. In safe Rust,
152+
this operation requires boundary checks that are detrimental to performance.
153+
154+
Usage of `unsafe` for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%).
155+
156+
### Considerations when introducing `unsafe`
157+
158+
Usage of `unsafe` in this crate *must*:
159+
160+
* not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior.
161+
* have code documentation for why `safe` is not used / possible
162+
* have code documentation about which invariant the user needs to enforce to ensure [soundness](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library), or which
163+
* invariant is being preserved.
164+
* if applicable, use `debug_assert`s to relevant invariants (e.g. bound checks)
165+
166+
Example of code documentation:
167+
168+
```rust
169+
// JUSTIFICATION
170+
// Benefit
171+
// Describe the benefit of using unsafe. E.g.
172+
// "30% performance degradation if the safe counterpart is used, see bench X."
173+
// Soundness
174+
// Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g.
175+
// "We bounded check these values at initialization and the array is immutable."
176+
let ... = unsafe { ... };
177+
```
178+
100179
# Publishing to crates.io
101180

102181
An Arrow committer can publish this crate after an official project release has

0 commit comments

Comments
 (0)