RVV 0.7.1: Compiler inserts unnecessary csrr/vsetvl combinations in some cases resulting in slower code #12

mshabunin opened this issue Mar 12, 2024 · 0 comments

Compiler: Xuantie toolchain 2.8.0
Flags: -mcpu=c920 -mabi=lp64d -O3 -mtune=c920 -g -mrvv-vector-bits=128
HW: LicheePi 4A
Test data: 1920x1080 image of i32 x 1000 iterations

We observe inefficient machine code being generated by the toolchain in some cases when using RVV intrinsics.

Function 1 (sum all elements of an int array)

In this function we load data into an i32m2 register group by combining two i32m1 registers, then sum all elements. The produced machine code works well enough.

#include <riscv_vector.h>  /* RVV intrinsics header shipped with the toolchain */
#include <stddef.h>        /* size_t */

int sum8(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 8, data += 8)
    {
        vint32m2_t val = vundefined_i32m2();
        val = vset_v_i32m1_i32m2(val, 0, vle32_v_i32m1(data, 4));
        val = vset_v_i32m1_i32m2(val, 1, vle32_v_i32m1(data + 4, 4));
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m2_i32m1(z, val, z, 8));
    }
    return ret;
}

Here the produced machine code is quite reasonable and has good performance (the same as code written manually in assembly, ~5.2 ms). It could still be improved by avoiding the vmv.v.v copies and loading data into two consecutive registers in the first place (e.g. v2, v3), so that they would form a register group; see the sketch after the listing below.

12150:   008878d7            vsetvli a7,a6,e32,m1,d1
12154:   01078893            addi    a7,a5,16
12158:   0207f287            vle.v   v5,(a5)
1215c:   5e0030d7            vmv.v.i v1,0
12160:   0721                    addi    a4,a4,8
12162:   02078793            addi    a5,a5,32
12166:   0208f207            vle.v   v4,(a7)
1216a:   00807057            vsetvli zero,zero,e32,m1,d1  <- This
1216e:   5e028157            vmv.v.v v2,v5                <- block
12172:   5e0201d7            vmv.v.v v3,v4                <- can be avoided
12176:   009678d7            vsetvli a7,a2,e32,m2,d1
1217a:   0220a0d7            vredsum.vs  v1,v2,v1
1217e:   0086f8d7            vsetvli a7,a3,e32,m1,d1
12182:   321028d7            vmv.x.s a7,v1
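For reference, a minimal sketch of what we would expect here, written in the same style as the sum16_asm function shown further below: the loads go directly into two adjacent registers (v8, v9) forming an m2 group, so the vsetvli/vmv.v.v copies are not needed. This is an illustration only; the function name is ours and it is not necessarily identical to the sum8_asm variant measured in the results.

int sum8_asm_sketch(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 8)
    {
        int val = 0;
        asm(
            "vsetvli zero,zero,e32,m1,d1\n"
            "vle.v v8,(%1)\n"              /* data[0..3]  -> v8 */
            "addi %1,%1,16\n"
            "vle.v v9,(%1)\n"              /* data[4..7]  -> v9, v8:v9 form an m2 group */
            "addi %1,%1,16\n"
            "vmv.v.i v12,0\n"              /* zero accumulator */
            "vsetvli zero,zero,e32,m2,d1\n"
            "vredsum.vs v12,v8,v12\n"      /* reduce the whole m2 group */
            "vsetvli zero,zero,e32,m1,d1\n"
            "vmv.x.s %0,v12\n"
        : "=r" (val), "+r" (data)
        :
        : "v8", "v9", "v12");
        ret += val;
    }
    return ret;
}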

Function 2 (sum all elements of an int array, 2x unrolled)

This function is almost the same as function 1, but it combines 4 registers into an i32m4 group instead of 2. Its performance is about 3.3 times worse (17.3 ms vs 5.2 ms) than the previous function and about 5.5 times worse (17.3 ms vs 3.1 ms) than manually written assembly implementing the same approach.

int sum16(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 16, data += 16)
    {
        vint32m4_t val = vundefined_i32m4();
        val = vset_v_i32m1_i32m4(val, 0, vle32_v_i32m1(data + 0, 4));
        val = vset_v_i32m1_i32m4(val, 1, vle32_v_i32m1(data + 4, 4));
        val = vset_v_i32m1_i32m4(val, 2, vle32_v_i32m1(data + 8, 4));
        val = vset_v_i32m1_i32m4(val, 3, vle32_v_i32m1(data + 12, 4));
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m4_i32m1(z, val, z, 16));
    }
    return ret;
}

The produced machine code contains two csrr/vsetvl blocks which seemingly cause the observed performance degradation:

121b0:   0088f757            vsetvli a4,a7,e32,m1,d1
121b4:   01078713            addi    a4,a5,16
121b8:   02077107            vle.v   v2,(a4)
121bc:   02078313            addi    t1,a5,32
121c0:   03078713            addi    a4,a5,48
121c4:   06c1                    addi    a3,a3,16
121c6:   0207f087            vle.v   v1,(a5)
121ca:   04078793            addi    a5,a5,64
121ce:   c2002f73            csrr    t5,vl                  <-
121d2:   c2102ff3            csrr    t6,vtype               <-
121d6:   00807057            vsetvli zero,zero,e32,m1,d1    <-
121da:   5e0102d7            vmv.v.v v5,v2                  <-
121de:   81ff7057            vsetvl  zero,t5,t6             <-
121e2:   02037187            vle.v   v3,(t1)
121e6:   c2002f73            csrr    t5,vl                  <-
121ea:   c2102ff3            csrr    t6,vtype               <-
121ee:   00807057            vsetvli zero,zero,e32,m1,d1    <-
121f2:   5e008257            vmv.v.v v4,v1                  <-
121f6:   81ff7057            vsetvl  zero,t5,t6             <-
121fa:   5e0030d7            vmv.v.i v1,0
121fe:   02077107            vle.v   v2,(a4)
12202:   00807057            vsetvli zero,zero,e32,m1,d1
12206:   5e018357            vmv.v.v v6,v3
1220a:   5e0103d7            vmv.v.v v7,v2
1220e:   00a87757            vsetvli a4,a6,e32,m4,d1
12212:   0240a0d7            vredsum.vs  v1,v4,v1
12216:   00867757            vsetvli a4,a2,e32,m1,d1
1221a:   32102757            vmv.x.s a4,v1

It looks like the compiler tries to reuse registers as much as possible and inserts unnecessary csrr/vsetvl sequences (saving and restoring vl/vtype around the vmv.v.v copies), which seems to hurt performance.

For comparison, below is a function with inline assembly showing the code we expected to get from the intrinsics. Its performance is ~3.1 ms.

int sum16_asm(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 16)
    {
        int val = 0;
        asm(
            "vsetvli zero,zero,e32,m1,d1\n"
            "vle.v v8,(%1)\n"
            "addi %1,%1,16\n"
            "vle.v v9,(%1)\n"
            "addi %1,%1,16\n"
            "vle.v v10,(%1)\n"
            "addi %1,%1,16\n"
            "vle.v v11,(%1)\n"
            "addi %1,%1,16\n"
            "vmv.v.i v12,0\n"
            "vsetvli zero,zero,e32,m4,d1\n"
            "vredsum.vs  v12,v8,v12\n"
            "vsetvli zero,zero,e32,m1,d1\n"
            "vmv.x.s %0,v12\n"
        : "=r" (val), "+r" (data)
        :
        : "v8", "v9", "v10", "v11", "v12");
        ret += val;
    }
    return ret;
}

Notes:

  • a more efficient way would be to load data directly into a register group using vle32_v_i32m2 or vle32_v_i32m4 (see the sketch after this list), but in our case we have a higher-level abstraction over compiler intrinsics which works only with single registers and does not have a concept of register groups yet. We rely on the compiler to produce optimized code even if it is slightly less efficient than it could be. Recent mainstream compilers with RVV 1.0 support (clang 16-17) produce very efficient code in the described cases, practically the same as manually written assembly: https://godbolt.org/z/Y5Kdq7bdY
  • we use a fixed vector register size in these scenarios (-mrvv-vector-bits=128)
  • all performance results:
    sum8...............5.246 ms
    sum8_asm...........5.385 ms
    sum16.............17.380 ms
    sum16_asm..........3.170 ms
    
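For illustration, a minimal sketch of the direct register-group load mentioned in the first note above, assuming vle32_v_i32m4 takes the same (pointer, vl) arguments as the vle32_v_i32m1 calls used earlier; the function name sum16_m4 is ours:

int sum16_m4(const int *data, size_t SZ)
{
    int ret = 0;
    for (size_t i = 0; i < SZ; i += 16, data += 16)
    {
        /* load 16 ints straight into an i32m4 register group, no vset/vmv copies needed */
        vint32m4_t val = vle32_v_i32m4(data, 16);
        vint32m1_t z = vmv_v_x_i32m1(0, 4);
        ret += vmv_x_s_i32m1_i32(vredsum_vs_i32m4_i32m1(z, val, z, 16));
    }
    return ret;
}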