mnist: 28x28x1 input,4->8->16, pad valid
mnist_q_valid.h 2.4KB Flash 1.4KB RAM
suit for MCU have >=16KB Flash, >=2KB RAM
cifar: 32x32x3 input, 32->32->64->1024->10, 5x5 conv
cifar10_q.h 89KB Flash 11KB RAM
suit for MCU have >=128KB Flash, >=20KB RAM
vww96: vww model based on mobile net v1 0.25 96x96x3 input
vww96_q.tmdl 227KB Flash 54KB RAM
suit for MCU have >=256KB Flash, >=64KB RAM
https://mlcommons.org/en/inference-tiny-07/
mbnet128: mobile net v1 0.25 128x128x3 input
mbnet128_0.25_q.tmdl 485KB Flash 96KB RAM
suit for MCU have >=512KB Flash, >=128KB RAM
https://github.com/fchollet/deep-learning-models/releases
model infer time unit is ms;
Sort by performance, compare priority: mbnet128 > vww96 > cifar > mnist
Note1: arduino run another smaller mnist model due to limited memory
Note2: all model record fastest model type's infer time, for example, C906 use FP16 result. Note3: XXX means impossible run this model on that chip
Chip/Board | Core | Flash | RAM | Freq | mbnet | vww96 | cifar | mnist | Note |
---|---|---|---|---|---|---|---|---|---|
BL808's NPU | BLAI | 16MB | 0.8+64MB | 320M | 5 | 3 | <1 | <1 | |
i5-4590T | AMD64 | 256GB | 8GB | 2000M | 7/24 | 5/17 | 0.9/4 | 0.04/<1 | native/wasm |
RK3399's A72 | ARM A72 | 32GB | 4GB | 1800M | 15 | 10 | 3 | 0.07 | |
RK3399's A53 | ARM A53 | 32GB | 4GB | 1600M | 29 | 19 | 5 | 0.14 | |
D1-H | RV64V | 128GB | 2GB | 1008M | 43 | 22 | 3.5 | 0.29 | |
BL808's C906 | RV64V | 16MB | 0.8+64MB | 480M | 81 | 57 | 10 | <1 | |
STM32H750 | ARM CM7 | 1MB | 1024KB | 480M | 94 | 64 | 15 | <1 | |
BL808's E907 | RV32P | 16MB | 0.8+64MB | 320M | 188 | 149 | 35 | <1 | mdl in psram |
F1C200S | ARM926EJ-S | 16MB | 64MB | 608M | 213 | 145 | 38.5 | 0.75 | |
AT32F403A | ARM CM4 | 1MB | 96KB | 240M | 477 | 136 | 30 | 0.6 | mbnet in 224k ram mode |
STM32G474RE | ARM CM4 | 512KB | 128KB | 170M | XXX | 195 | 43 | 1 | |
CH32V307 | RV32F QingKe V4F |
480KB | 128KB | 144M | XXX | 357 | 64 | 1 | |
STM32F411CE | ARM CM4 | 512KB | 128KB | 150M | 558 | 366 | 75 | 2 | |
ESP32-S3 | Xtensa LX7 | 8MB | 512KB | 240M | 610 | 381 | 86 | 5 | mdl in flash |
LPC4337 | ARM CM4F | 1MB | 136KB | 204M | 654 | 627 | 91 | 3 | need confirm |
XR806 | ARMv8-M Star-MC1 |
2MB | 288KB | 160M | 712 | 453 | 104 | 1 | |
ESP32 | Xtensa LX6 | 4MB | 520KB | 240M | 755 | 476 | 132 | 2 | mdl in flash |
ACM32F403 | ARM CM33 | 512KB | 192KB | 180M | XXX | 458 | 139 | 2 | |
STM32F767 | ARM CM7 | 2MB | 512KB | 216M | 869 | 640 | 185 | 3 | need confirm |
STM32L496 | ARM CM4 | 1MB | 320KB | 80M | 809 | 695 | 162 | 3 | |
RP2040 | ARM CM0+ | 16MB | 264KB | 280M | 1211 | 716 | 200 | 2 | overclock 280M |
CH32V203G6 | RV32 QingKe V4B |
32KB | 10KB | 144M | XXX | XXX | XXX | 2.5 | |
ESP32-C3 | RV32 | 4MB | 400KB | 160M | 2370 | 1430 | 127 | 6 | mdl in flash |
STM32F103C8 | ARM CM3 | 64KB | 20KB | 72M | XXX | XXX | XXX | 8 | |
CH32V103 | RV32 QingKe V3A |
64KB | 20KB | 72M | XXX | XXX | XXX | 13 | |
SAMD21G18 | ARM CM0+ | 256KB | 32KB | 48M | XXX | XXX | 700 | 14 | seeed XIAO |
STM32G030F6 | ARM CM0+ | 32KB | 8KB | 64M | XXX | XXX | XXX | 18 | |
CM0(Kintex-7) | ARM CM0 | --- | 1024KB | 50M | XXX | XXX | 1362 | 23 | Kintex-7 |
STC32G12K128 | 80251 | 128KB | 12KB | 35M | XXX | XXX | XXX | 37 | |
PicoRV32(GW2A) | RV32 | 1MB | 64KB | 54M | XXX | XXX | 26935 | 385 | Tang Primer 20K |
Atmega328 | AVR | 32KB | 2KB | 16M | XXX | XXX | XXX | 50(*) |
Normalization to 100M freq to compare CPU efficiency, using cifar model:
Chip/Board | Core | cifar(ms) |
---|---|---|
BL808's NPU | BLAI | 2 |
D1-H | RV64V | 35 |
BL808's C906 | RV64V | 48 |
RK3399's A72 | ARM A72 | 52 |
STM32H750 | ARM CM7 | 72 |
AT32F403A | ARM CM4 | 72 |
STM32G474RE | ARM CM4 | 73 |
RK3399's A53 | ARM A53 | 79 |
CH32V307 | RV32 IMAC | 92 |
BL808's E907 | RV32P | 112 |
STM32F411CE | ARM CM4 | 113 |
STM32L496 | ARM CM4 | 130 |
XR806 | ARMv8-M Star-MC1 |
166 |
ESP32-C3 | RV32 | 203 |
ESP32-S3 | Xtensa LX7 | 206 |
F1C200S | ARM926EJ-S | 234 |
ACM32F403 | ARM CM33 | 250 |
ESP32 | Xtensa LX6 | 317 |
SAMD21G18 | ARM CM0+ | 336 |
RP2040 | ARM CM0+ | 560 |
CM0(Kintex-7) | ARM CM0 | 681 |
PicoRV32(GW2A) | RV32 | 14545 |
mbnet infer time under different input size
BL808 C906 core 480M, use RV64V, FP16 model
input size | infer time |
---|---|
96x 96 | 60ms |
128x128 | 81ms |
160x160 | 156ms |
192x192 | 183ms |
224x224 | 296ms |
Optimization for MCU which don't have FPU
STM32F103C8 run mnist
Options | infer time |
---|---|
TM_FASTSCALE=0 | 16ms |
TM_FASTSCALE=1 | 10ms |
Optimization for ARM MCU which have DSP (Cortex-M4,M7,etc.), suoport INT8 acceleration
STM32F411CE run mbnet 0.25, 128x128x3 input
Options | infer time |
---|---|
TM_ARCH_OPT0 && INT8 | 1199ms |
TM_ARCH_ARM_SIMD && INT8 | 840ms |
Optimization for ARM MCU which have MVEI instructions (Cortex-M55,etc.), suoport INT8 acceleration.
Experimental, not test data.
Optimization for ARM MPU which have NEON instructions (Cortex-A7 and newer), suoport INT8/FP32 acceleration
Raspberry Pi4 single core run mbnet 1.0, 224x224x3 input
(NEON INT8 not well optimized)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 860ms | 821ms |
TM_ARCH_CPU | FP32 | 2307ms | 2271ms |
TM_ARCH_ARM_NEON | FP32 | 1275ms | 1223ms |
TM_ARCH_ARM_NEON | INT8 | 959ms | 923ms |
Optimization for RISC-V MCU which have P-extend instructions (like T-Head E907), suoport INT8 acceleration
BL808 E907 core run mbnet 0.25, 128x128x3 input (mdl in psram, cpu run in 320M, O2)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 443ms | 283ms |
TM_ARCH_RV32P | INT8 | 345ms | 188ms |
Optimization for RISC-V MCU which have V-extend instructions (like T-Head C906), suoport INT8/FP32 acceleration
BL808 C906 core run mbnet 0.25, 128x128x3 input (mdl in psram, VLEN=128, cpu run in 480M, O2)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 153ms | 125ms |
TM_ARCH_CPU | FP32 | 215ms | 177ms |
TM_ARCH_RV64V | INT8 | 123ms | 95ms |
TM_ARCH_RV64V | FP32 | 160ms | 121ms |
TM_ARCH_RV64V | FP16 | 129ms | 81ms |
Use SmallCifar model. TinyMaix use stride=2's time multipy by 4.
NNoM&TinyMaix run with STM32H750@218M, other run with STM32F746@216M
InferLib | time(ms) |
---|---|
TFlite-micro | 393 |
MicroTVM untuned | 294 |
TinyMaix CPU O0 | 224 |
TinyMaix CPU O1 | 204 |
TinyMaix SIMD O0 | 176 |
NNoM | 159 |
MicroTVM tuned | 157 |
CMSIS-NN | 136 |
TinyMaix SIMD O1 | 132 |
tinyengine | 129 |