Conversation

@cbalint13 (Contributor) commented Aug 27, 2025

This PR adds RISC-V kernel templates compliant with the RVV v1.0 specification.


Notes

  • Enables high-performance kernels covering the majority of common ML input datatypes
  • Currently compliant with RVV spec version v1.0 (does not work with the older v0.7.1)
  • The TIR kernels implemented here use the recently added VLA (vector-length-agnostic) extension support

As with the other CPU intrinsics, only a limited set of operators is covered: currently dense (linear) works with metaschedule; a rough sketch of that computation is given below. The operator list will be revisited and extended in the near future to transposed flavours and convolutions.
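
For orientation, here is a minimal NumPy sketch (not part of the PR) of the quantized dense computation these kernels accelerate. The shapes are my assumption, chosen so that 2*M*N*K matches the 268435456 FLOP figure reported in the benchmarks below.

import numpy as np

# Hypothetical benchmark-like shapes: 2 * 512^3 = 268435456 FLOP (assumption).
M, N, K = 512, 512, 512

data = np.random.randint(0, 256, size=(M, K), dtype=np.uint8)      # activations
weight = np.random.randint(-128, 128, size=(N, K), dtype=np.int8)  # weights

# dense (linear): out[m, n] = sum_k data[m, k] * weight[n, k], accumulated in int32.
out = data.astype(np.int32) @ weight.astype(np.int32).T
print(out.shape, out.dtype)  # (512, 512) int32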

Benchmarks

The performance evaluation shows an approximately 10x improvement on a SpaceMIT-x60 SoC board (a quick cross-check of the numbers follows the tables below).

$ ./riscv64-dense-relax-metaschedule.py --num_trials 256 \
         --data_dtype uint8 --weight-dtype int8 --output-dtype int32
{...}
 ID |  Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
---------------------------------------------------------------------------------------------------------
  0 | dense | 268435456 |      1 |       852.2994 |     314.9544 |              314.9544 |    256 |      
---------------------------------------------------------------------------------------------------------

$ ./riscv64-dense-relax-metaschedule.py --num_trials 256 \
        --data_dtype float16 --weight-dtype float16 --output-dtype float16
 ID |  Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
---------------------------------------------------------------------------------------------------------
  0 | dense | 268435456 |      1 |       798.0926 |     336.3463 |              336.3463 |    256 |    Y 
---------------------------------------------------------------------------------------------------------

$ ./riscv64-dense-relax-metaschedule.py --num_trials 256 \
        --data_dtype float32 --weight-dtype float32 --output-dtype float32
 ID |  Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
---------------------------------------------------------------------------------------------------------
  0 | dense | 268435456 |      1 |       464.1279 |     578.3653 |              578.3653 |    256 |    Y 
---------------------------------------------------------------------------------------------------------
  • Previous performance (prior to this PR):
$ ./riscv64-dense-relax-metaschedule.py --num_trials 256 \
         --data_dtype uint8 --weight-dtype int8 --output-dtype int32
 ID |  Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
---------------------------------------------------------------------------------------------------------
  0 | dense | 268435456 |      1 |        54.0469 |    4966.7166 |             4966.7166 |    256 |    Y
---------------------------------------------------------------------------------------------------------

$ ./riscv64-dense-relax-metaschedule.py --num_trials 256 \
        --data_dtype float16 --weight-dtype float16 --output-dtype float16
 ID |  Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
---------------------------------------------------------------------------------------------------------
  0 | dense | 268435456 |      1 |        91.4154 |    2936.4361 |             2936.4361 |    256 |    Y
---------------------------------------------------------------------------------------------------------

$ ./riscv64-dense-relax-metaschedule.py --num_trials 256 \
        --data_dtype float32 --weight-dtype float32 --output-dtype float32
 ID |  Name |      FLOP | Weight | Speed (GFLOPS) | Latency (us) | Weighted Latency (us) | Trials | Done 
---------------------------------------------------------------------------------------------------------
  0 | dense | 268435456 |      1 |        50.4416 |    5321.7082 |             5321.7082 |    256 |    Y
---------------------------------------------------------------------------------------------------------
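
As a quick cross-check of the tables above (not part of the PR), the reported GFLOPS values follow directly from FLOP / latency, and the before/after latency ratios land around the claimed ~10x (about 15.8x for uint8/int8, 8.7x for float16, 9.2x for float32):

# Reproduce the GFLOPS figures and speedups from the numbers in the tables above.
FLOP = 268435456

cases = {
    # dtype combo: (latency_us_new, latency_us_old)
    "uint8/int8/int32": (314.9544, 4966.7166),
    "float16":          (336.3463, 2936.4361),
    "float32":          (578.3653, 5321.7082),
}

for name, (new_us, old_us) in cases.items():
    gflops_new = FLOP / (new_us * 1e-6) / 1e9
    gflops_old = FLOP / (old_us * 1e-6) / 1e9
    print(f"{name:18s} {gflops_new:7.1f} vs {gflops_old:5.1f} GFLOPS "
          f"-> {old_us / new_us:4.1f}x speedup")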

Tests

  • All kernel templates are tested for numerical correctness; the test program and logs are attached below (a plain NumPy reading of the kernel naming is sketched after the log excerpt).

kernel-numerical-testing.log.gz
riscv64-rvv-kernels-numerical-testing.py.gz
riscv64-rvv-kernels-numerical-testing.sh.gz

$ ./riscv64-rvv-kernels-numerical-testing.sh
{...}
$ cat kernel-numerical-testing.log | grep -e Testing 
Testing rvv_dot_4u8_8x4i8_8i32
Testing rvv_dot_4i8_8x4i8_8i32
Testing rvv_dot_4f16_8x4f16_8f16
Testing rvv_dot_4f32_8x4f32_8f32
Testing rvv_dot_8u8_8x8i8_8i32
Testing rvv_dot_8i8_8x8i8_8i32
Testing rvv_dot_8f16_8x8f16_8f16
Testing rvv_dot_8f32_8x8f32_8f32
Testing rvv_dot_16u8_8x16i8_8i32
Testing rvv_dot_16i8_8x16i8_8i32
Testing rvv_dot_16f16_8x16f16_8f16
Testing rvv_dot_16f32_8x16f32_8f32
Testing rvv_dot_32u8_8x32i8_8i32
Testing rvv_dot_32i8_8x32i8_8i32
Testing rvv_dot_32f16_8x32f16_8f16
Testing rvv_dot_32f32_8x32f32_8f32
Testing rvv_dot_64u8_8x64i8_8i32
Testing rvv_dot_64i8_8x64i8_8i32
Testing rvv_dot_64f16_8x64f16_8f16
Testing rvv_dot_64f32_8x64f32_8f32
Testing rvv_dot_128u8_8x128i8_8i32
Testing rvv_dot_128i8_8x128i8_8i32
Testing rvv_dot_128f16_8x128f16_8f16
Testing rvv_dot_128f32_8x128f32_8f32
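
For readers unfamiliar with the kernel naming, my reading (an assumption, not stated in the PR) is that e.g. rvv_dot_4u8_8x4i8_8i32 takes 4 uint8 data lanes and an 8x4 int8 weight tile and accumulates 8 int32 partial dot products. A plain NumPy reference of that contraction, usable as a correctness oracle, could look like:

import numpy as np

def ref_dot_4u8_8x4i8_8i32(data, weight, acc):
    # Assumed semantics: data (4,) uint8, weight (8, 4) int8, acc (8,) int32.
    # Each output lane n accumulates sum_k data[k] * weight[n, k].
    return acc + weight.astype(np.int32) @ data.astype(np.int32)

data = np.random.randint(0, 256, size=4, dtype=np.uint8)
weight = np.random.randint(-128, 128, size=(8, 4), dtype=np.int8)
acc = np.zeros(8, dtype=np.int32)
print(ref_dot_4u8_8x4i8_8i32(data, weight, acc))  # 8 int32 partial dot products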

cbalint13 force-pushed the riscv-rvv-metasch branch 4 times, most recently from d699397 to 92d6f9b on August 27, 2025 at 14:36
@cbalint13 (Contributor, Author) commented Aug 27, 2025

This is ready for review.

Cc: @vinx13 @MasterJH5574, @mshr-h @tqchen

Cc (folks with past RISC-V interests):
@jerryzj @JieGH @PhilippvK

cbalint13 marked this pull request as ready for review on August 27, 2025 at 15:14
cbalint13 force-pushed the riscv-rvv-metasch branch 5 times, most recently from 13630df to 72a99bd on August 27, 2025 at 20:42
cbalint13 marked this pull request as ready for review on September 7, 2025 at 22:40
cbalint13 merged commit 06fb02e into apache:main on Sep 7, 2025
14 checks passed