Compiler: Xuantie toolchain 2.8.0
Flags: -mcpu=c920 -mabi=lp64d -O3 -mtune=c920 -g -mrvv-vector-bits=128
HW: LicheePi 4A
Test data: 1920x1080 image of i32 x 1000 iterations
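For context, here is a rough sketch of the kind of timing loop behind the numbers below. This is a hypothetical harness, not our exact benchmark code; bench_ms and the setup are illustrative assumptions only.
#include <stddef.h>
#include <time.h>

int sum8(const int *data, size_t SZ);  /* one of the functions discussed below */

/* Runs fn 1000 times over a 1920x1080 buffer of i32 and returns the
   average per-iteration time in ms. */
static double bench_ms(int (*fn)(const int *, size_t), const int *img)
{
    const size_t n = 1920 * 1080;
    struct timespec t0, t1;
    volatile int sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int iter = 0; iter < 1000; ++iter)
        sink += fn(img, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return ((t1.tv_sec - t0.tv_sec) * 1e3 +
            (t1.tv_nsec - t0.tv_nsec) / 1e6) / 1000.0;
}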
We observe some inefficient machine code being generated by the toolchain in some cases when using RVV intrinsics.
Function 1 (sum all elements of an int array)
In this function we load data into an i32m2 register group by combining two i32m1 registers, then sum all elements. The produced machine code works well enough.
int sum8(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 8, data += 8)
    {
        vint32m2_t val = vundefined_i32m2();
        val = vset_v_i32m1_i32m2(val, 0, vle32_v_i32m1(data, 4));
        val = vset_v_i32m1_i32m2(val, 1, vle32_v_i32m1(data + 4, 4));
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m2_i32m1(z, val, z, 8));
    }
    return ret;
}
Here the produced machine code is quite reasonable and has good performance (the same as code written manually in assembly, ~5.2 ms). It could be optimized further by avoiding the vmv.v.v moves and loading data into two consecutive registers in the first place (e.g. v2, v3), so that they form a register group.
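For illustration, a minimal sketch of that variant at the intrinsics level, using the whole-group load mentioned in the notes further below (sum8_m2 is a hypothetical name; same pre-RVV-1.0 intrinsic naming as the snippets in this report):
int sum8_m2(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 8, data += 8)
    {
        /* one whole-group load instead of two m1 loads + vset calls */
        vint32m2_t val = vle32_v_i32m2(data, 8);
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m2_i32m1(z, val, z, 8));
    }
    return ret;
}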
Function 2 (sum all elements of an int array, 2x unrolled)
This function is almost the same as function 1, but it combines 4 registers into an i32m4 group instead of 2. Its performance is about 3.3 times worse (17.3 ms vs 5.2 ms) than the previous function and about 5.5 times worse (17.3 ms vs 3.1 ms) than manually written assembly implementing the same approach.
int sum16(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 16, data += 16)
    {
        vint32m4_t val = vundefined_i32m4();
        val = vset_v_i32m1_i32m4(val, 0, vle32_v_i32m1(data + 0, 4));
        val = vset_v_i32m1_i32m4(val, 1, vle32_v_i32m1(data + 4, 4));
        val = vset_v_i32m1_i32m4(val, 2, vle32_v_i32m1(data + 8, 4));
        val = vset_v_i32m1_i32m4(val, 3, vle32_v_i32m1(data + 12, 4));
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m4_i32m1(z, val, z, 16));
    }
    return ret;
}
The produced machine code contains two csrr/vsetvl blocks, which seemingly cause the observed performance degradation:
It looks like the compiler tries to reuse registers as much as possible and inserts an unnecessary vsetvl, which seems to hurt performance.
For comparison, below is a function with inline assembly that we expected to get from the code written with intrinsics. Its performance is ~3.1 ms.
int sum16_asm(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 16)
    {
        int val = 0;
        asm(
            "vsetvli zero,zero,e32,m1,d1\n"
            "vle.v v8,(%1)\n"
            "addi %1,%1,16\n"
            "vle.v v9,(%1)\n"
            "addi %1,%1,16\n"
            "vle.v v10,(%1)\n"
            "addi %1,%1,16\n"
            "vle.v v11,(%1)\n"
            "addi %1,%1,16\n"
            "vmv.v.i v12,0\n"
            "vsetvli zero,zero,e32,m4,d1\n"
            "vredsum.vs v12,v8,v12\n"
            "vsetvli zero,zero,e32,m1,d1\n"
            "vmv.x.s %0,v12\n"
            : "=r" (val), "+r" (data)
            :
            : "v8", "v9", "v10", "v11", "v12");
        ret += val;
    }
    return ret;
}
Notes:
the more efficient way would be to load data directly into a register group using vle32_v_i32m2 or vle32_v_i32m4 (see the sketch after these notes), but in our case we have a higher-level abstraction over the compiler intrinsics which works only with single registers and does not have a concept of register groups yet. We rely on the compiler to produce optimized code even if it is slightly less efficient than it could be. Recent mainstream compilers with RVV 1.0 support (clang 16-17) produce very efficient code in the described cases, practically the same as manually written assembly: https://godbolt.org/z/Y5Kdq7bdY
we use a fixed vector register size in these scenarios (-mrvv-vector-bits=128)
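For reference, a minimal sketch of that whole-group-load variant for the i32m4 case (sum16_m4 is a hypothetical name; it keeps the same pre-RVV-1.0 intrinsic naming as the snippets above):
int sum16_m4(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 16, data += 16)
    {
        /* one whole-group load instead of four m1 loads + vset calls */
        vint32m4_t val = vle32_v_i32m4(data, 16);
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m4_i32m1(z, val, z, 16));
    }
    return ret;
}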
all performance results:
sum8...............5.246 ms
sum8_asm...........5.385 ms
sum16.............17.380 ms
sum16_asm..........3.170 ms