//!
//! These are normally part of the compiler-builtins crate. However, the default routines
//! do not use word-sized aligned instructions, which is slow and moreover leads to crashes
//! when using memories/processors which only allow aligned accesses.
//!
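//! As a rough illustration of the idea (a simplified sketch; the name `fill_word_aligned`
//! and its exact shape are assumptions, not this module's code): handle the unaligned head
//! byte by byte, do the bulk with aligned word-sized stores, then finish the tail in bytes.
//!
//! ```ignore
//! // Hypothetical sketch of a word-aligned fill; not the routine in this module.
//! unsafe fn fill_word_aligned(mut dst: *mut u8, val: u8, mut n: usize) {
//!     const WORD: usize = core::mem::size_of::<usize>();
//!     // Byte stores until `dst` is word aligned (or the buffer is exhausted).
//!     while n > 0 && (dst as usize) % WORD != 0 {
//!         dst.write(val);
//!         dst = dst.add(1);
//!         n -= 1;
//!     }
//!     // Replicate the byte across a word (0x0101..01 * val); every store here is aligned.
//!     let word = usize::from(val) * (usize::MAX / 255);
//!     while n >= WORD {
//!         (dst as *mut usize).write(word);
//!         dst = dst.add(WORD);
//!         n -= WORD;
//!     }
//!     // Remaining tail bytes.
//!     while n > 0 {
//!         dst.write(val);
//!         dst = dst.add(1);
//!         n -= 1;
//!     }
//! }
//! ```
//!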
//! The implementation is optimized for large blocks of data. The assumption is that calls with
//! small sizes are inlined by the compiler. Some optimization is done for frequently used small
//! sizes, as otherwise there is a significant slowdown in debug mode.
//!
//! The implementation is optimized for the case where dst/s1 and src/s2 have the same alignment.
//! If the alignment of s1 and s2 is unequal, then either the s1 or the s2 accesses are not aligned,
//! resulting in slower performance. (If s1 or s2 is aligned, then those accesses are aligned.)
//!
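//! As a rough sketch of why equal alignment helps (illustrative only; the name
//! `copy_same_alignment` and its shape are assumptions, not this module's code): once the
//! byte-wise head loop brings dst to a word boundary, src reaches a word boundary at the same
//! time, so the bulk loop can use aligned word accesses on both sides.
//!
//! ```ignore
//! // Hypothetical sketch of the equal-alignment fast path.
//! unsafe fn copy_same_alignment(mut dst: *mut u8, mut src: *const u8, mut n: usize) {
//!     const WORD: usize = core::mem::size_of::<usize>();
//!     // Precondition: dst and src share the same offset within a word.
//!     debug_assert_eq!((dst as usize) % WORD, (src as usize) % WORD);
//!     // Byte copies until both pointers reach a word boundary together.
//!     while n > 0 && (dst as usize) % WORD != 0 {
//!         dst.write(src.read());
//!         dst = dst.add(1);
//!         src = src.add(1);
//!         n -= 1;
//!     }
//!     // Bulk: every load and every store is an aligned word access.
//!     while n >= WORD {
//!         (dst as *mut usize).write((src as *const usize).read());
//!         dst = dst.add(WORD);
//!         src = src.add(WORD);
//!         n -= WORD;
//!     }
//!     // Tail bytes.
//!     while n > 0 {
//!         dst.write(src.read());
//!         dst = dst.add(1);
//!         src = src.add(1);
//!         n -= 1;
//!     }
//! }
//! ```
//!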
//! Further optimization is possible by having a dedicated code path for unaligned accesses,
//! which uses a 2*PTR_SIZE to PTR_SIZE shift operation (i.e. llvm.fshr);
//! but the implementation of this intrinsic is not yet optimized and currently leads to worse
//! results.
//!
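//! For illustration, such a shift-based path could look roughly like the sketch below
//! (assumptions: little-endian byte order, a source offset strictly between 0 and the word
//! size, and the hypothetical name `copy_unaligned_src`; the combine step is what a single
//! llvm.fshr would express):
//!
//! ```ignore
//! // Hypothetical sketch; reads whole aligned source words, so it may touch a few bytes
//! // just outside the exact source range.
//! unsafe fn copy_unaligned_src(mut dst: *mut usize, src: *const u8, words: usize) {
//!     const BITS: u32 = usize::BITS;
//!     let offset = (src as usize) % core::mem::size_of::<usize>();
//!     let shift = (offset * 8) as u32;
//!     // Aligned word pointer just below `src`.
//!     let mut s = (src as usize - offset) as *const usize;
//!     let mut cur = s.read();
//!     for _ in 0..words {
//!         s = s.add(1);
//!         let next = s.read();
//!         // Combine two aligned source words into one destination word.
//!         dst.write((cur >> shift) | (next << (BITS - shift)));
//!         dst = dst.add(1);
//!         cur = next;
//!     }
//! }
//! ```
//!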
//! Also, loop unrolling in the memcpy_reverse function is not fully optimal due to a current
//! llvm limitation: it uses an add with a negative offset plus a store, instead of a store with
//! a positive offset, so 3 instructions per loop iteration instead of 2.
//!
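//! For context, a reverse copy simply walks from the end of the buffers toward the start so
//! that overlapping moves with dst above src stay correct; the sketch below is illustrative
//! only (the name `copy_words_reverse` is an assumption) and ignores head/tail byte handling.
//!
//! ```ignore
//! // Hypothetical sketch of a high-to-low word copy.
//! unsafe fn copy_words_reverse(dst: *mut usize, src: *const usize, words: usize) {
//!     let mut i = words;
//!     while i > 0 {
//!         i -= 1;
//!         // Copy the highest remaining word first.
//!         dst.add(i).write(src.add(i).read());
//!     }
//! }
//! ```
//!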
//! A further future optimization possibility is using the zero-overhead loop feature, but that
//! is currently not yet supported by llvm for xtensa.
//!
//! For large aligned blocks, memset and memcpy reach ~88% of the maximum memory bandwidth;
//! memcpy_reverse reaches ~60%.
#[allow(warnings)]
#[cfg(target_pointer_width = "64")]
type c_int = u64;
|