From 507d3ad62a5f3fba802c7ce06fcc804a07cc04d8 Mon Sep 17 00:00:00 2001
From: Wentao Guan
Date: Mon, 16 Jun 2025 15:18:42 +0800
Subject: [PATCH] LoongArch: optimize syscall reg save
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

deepin inclusion
category: performance

Remove one st.d from the hot syscall entry path. handle_syscall no
longer stores zero to PT_R0; do_syscall() sets regs->regs[0] = 0
instead, where the compiler can see the store and optimize it together
with the following regs->regs[0] = nr + 1 assignment. This slightly
improves syscall performance.

Tested on a 3A6000 with UnixBench. System Call Overhead improves from
1448942.6 to 1467021.7 lps (about 1.2%) with 1 parallel copy, and from
9320000.0 to 9407101.4 lps (about 0.9%) with 8 parallel copies.

After patch:

Benchmark Run: Mon Jun 16 2025 20:38:10 - 20:47:09
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       47066632.8 lps   (10.0 s, 2 samples)
Double-Precision Whetstone                     5036.1 MWIPS (10.0 s, 2 samples)
Execl Throughput                               4484.2 lps   (29.2 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks        656586.0 KBps  (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks          175086.0 KBps  (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks       1998702.0 KBps  (30.0 s, 1 samples)
Pipe Throughput                             1365130.7 lps   (10.0 s, 2 samples)
Pipe-based Context Switching                 126232.9 lps   (10.0 s, 2 samples)
Process Creation                               9202.7 lps   (30.0 s, 1 samples)
Shell Scripts (1 concurrent)                  12501.2 lpm   (60.0 s, 1 samples)
Shell Scripts (8 concurrent)                   4974.9 lpm   (60.0 s, 1 samples)
System Call Overhead                        1467021.7 lps   (10.0 s, 2 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   47066632.8   4033.1
Double-Precision Whetstone                       55.0       5036.1    915.7
Execl Throughput                                 43.0       4484.2   1042.8
File Copy 1024 bufsize 2000 maxblocks          3960.0     656586.0   1658.0
File Copy 256 bufsize 500 maxblocks            1655.0     175086.0   1057.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    1998702.0   3446.0
Pipe Throughput                               12440.0    1365130.7   1097.4
Pipe-based Context Switching                   4000.0     126232.9    315.6
Process Creation                                126.0       9202.7    730.4
Shell Scripts (1 concurrent)                     42.4      12501.2   2948.4
Shell Scripts (8 concurrent)                      6.0       4974.9   8291.5
System Call Overhead                          15000.0    1467021.7    978.0
                                                                   ========
System Benchmarks Index Score                                        1510.2

------------------------------------------------------------------------
Benchmark Run: Mon Jun 16 2025 20:47:09 - 20:56:08
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables      221748966.2 lps   (10.0 s, 2 samples)
Double-Precision Whetstone                    37218.5 MWIPS (10.0 s, 2 samples)
Execl Throughput                              24364.4 lps   (29.0 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks       3681637.0 KBps  (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks         1020033.0 KBps  (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks       8054794.0 KBps  (30.0 s, 1 samples)
Pipe Throughput                             8209249.1 lps   (10.0 s, 2 samples)
Pipe-based Context Switching                1058150.7 lps   (10.0 s, 2 samples)
Process Creation                              49636.4 lps   (30.0 s, 1 samples)
Shell Scripts (1 concurrent)                  43521.6 lpm   (60.0 s, 1 samples)
Shell Scripts (8 concurrent)                   5672.4 lpm   (60.0 s, 1 samples)
System Call Overhead                        9407101.4 lps   (10.0 s, 2 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  221748966.2  19001.6
Double-Precision Whetstone                       55.0      37218.5   6767.0
Execl Throughput                                 43.0      24364.4   5666.2
File Copy 1024 bufsize 2000 maxblocks          3960.0    3681637.0   9297.1
File Copy 256 bufsize 500 maxblocks            1655.0    1020033.0   6163.3
File Copy 4096 bufsize 8000 maxblocks          5800.0    8054794.0  13887.6
Pipe Throughput                               12440.0    8209249.1   6599.1
Pipe-based Context Switching                   4000.0    1058150.7   2645.4
Process Creation                                126.0      49636.4   3939.4
Shell Scripts (1 concurrent)                     42.4      43521.6  10264.5
Shell Scripts (8 concurrent)                      6.0       5672.4   9454.0
System Call Overhead                          15000.0    9407101.4   6271.4
                                                                   ========
System Benchmarks Index Score                                        7335.3

Before patch:

Benchmark Run: Mon Jun 16 2025 22:58:12 - 23:07:11
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       41001790.5 lps   (10.0 s, 2 samples)
Double-Precision Whetstone                     5036.1 MWIPS (10.0 s, 2 samples)
Execl Throughput                               4482.0 lps   (29.6 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks        654904.0 KBps  (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks          173158.0 KBps  (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks       2008222.0 KBps  (30.0 s, 1 samples)
Pipe Throughput                             1370314.7 lps   (10.0 s, 2 samples)
Pipe-based Context Switching                 126314.0 lps   (10.0 s, 2 samples)
Process Creation                               9063.9 lps   (30.0 s, 1 samples)
Shell Scripts (1 concurrent)                  12506.3 lpm   (60.0 s, 1 samples)
Shell Scripts (8 concurrent)                   4972.7 lpm   (60.0 s, 1 samples)
System Call Overhead                        1448942.6 lps   (10.0 s, 2 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0   41001790.5   3513.4
Double-Precision Whetstone                       55.0       5036.1    915.7
Execl Throughput                                 43.0       4482.0   1042.3
File Copy 1024 bufsize 2000 maxblocks          3960.0     654904.0   1653.8
File Copy 256 bufsize 500 maxblocks            1655.0     173158.0   1046.3
File Copy 4096 bufsize 8000 maxblocks          5800.0    2008222.0   3462.5
Pipe Throughput                               12440.0    1370314.7   1101.5
Pipe-based Context Switching                   4000.0     126314.0    315.8
Process Creation                                126.0       9063.9    719.4
Shell Scripts (1 concurrent)                     42.4      12506.3   2949.6
Shell Scripts (8 concurrent)                      6.0       4972.7   8287.8
System Call Overhead                          15000.0    1448942.6    966.0
                                                                   ========
System Benchmarks Index Score                                        1488.9

------------------------------------------------------------------------
Benchmark Run: Mon Jun 16 2025 23:07:11 - 23:16:11
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables      221753204.3 lps   (10.0 s, 2 samples)
Double-Precision Whetstone                    37215.6 MWIPS (10.0 s, 2 samples)
Execl Throughput                              24319.0 lps   (30.0 s, 1 samples)
File Copy 1024 bufsize 2000 maxblocks       3656936.0 KBps  (30.0 s, 1 samples)
File Copy 256 bufsize 500 maxblocks         1016886.0 KBps  (30.0 s, 1 samples)
File Copy 4096 bufsize 8000 maxblocks       7966493.0 KBps  (30.0 s, 1 samples)
Pipe Throughput                             8211487.8 lps   (10.0 s, 2 samples)
Pipe-based Context Switching                1066013.7 lps   (10.0 s, 2 samples)
Process Creation                              50743.5 lps   (30.0 s, 1 samples)
Shell Scripts (1 concurrent)                  43664.4 lpm   (60.0 s, 1 samples)
Shell Scripts (8 concurrent)                   5674.7 lpm   (60.0 s, 1 samples)
System Call Overhead                        9320000.0 lps   (10.0 s, 2 samples)

System Benchmarks Index Values               BASELINE       RESULT    INDEX
Dhrystone 2 using register variables         116700.0  221753204.3  19002.0
Double-Precision Whetstone                       55.0      37215.6   6766.5
Execl Throughput                                 43.0      24319.0   5655.6
File Copy 1024 bufsize 2000 maxblocks          3960.0    3656936.0   9234.7
File Copy 256 bufsize 500 maxblocks            1655.0    1016886.0   6144.3
File Copy 4096 bufsize 8000 maxblocks          5800.0    7966493.0  13735.3
Pipe Throughput                               12440.0    8211487.8   6600.9
Pipe-based Context Switching                   4000.0    1066013.7   2665.0
Process Creation                                126.0      50743.5   4027.3
Shell Scripts (1 concurrent)                     42.4      43664.4  10298.2
Shell Scripts (8 concurrent)                      6.0       5674.7   9457.8
System Call Overhead                          15000.0    9320000.0   6213.3
                                                                   ========
System Benchmarks Index Score                                        7336.1

Signed-off-by: Wentao Guan
---
 arch/loongarch/kernel/entry.S   | 3 ++-
 arch/loongarch/kernel/syscall.c | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/loongarch/kernel/entry.S b/arch/loongarch/kernel/entry.S
index 48e7e34e355e8..2dc49eda45d49 100644
--- a/arch/loongarch/kernel/entry.S
+++ b/arch/loongarch/kernel/entry.S
@@ -30,7 +30,8 @@ SYM_CODE_START(handle_syscall)
 	addi.d		sp, sp, -PT_SIZE
 	cfi_st		t2, PT_R3
 	cfi_rel_offset	sp, PT_R3
-	st.d		zero, sp, PT_R0
+	/* PT_R0 is zeroed in do_syscall() via regs->regs[0] = 0 */
+	/* st.d		zero, sp, PT_R0 */
 	csrrd		t2, LOONGARCH_CSR_PRMD
 	st.d		t2, sp, PT_PRMD
 	csrrd		t2, LOONGARCH_CSR_CRMD
diff --git a/arch/loongarch/kernel/syscall.c b/arch/loongarch/kernel/syscall.c
index b4c5acd7aa3b3..846a5777b8e57 100644
--- a/arch/loongarch/kernel/syscall.c
+++ b/arch/loongarch/kernel/syscall.c
@@ -44,6 +44,8 @@ void noinstr do_syscall(struct pt_regs *regs)
 	sys_call_fn syscall_fn;
 
 	nr = regs->regs[11];
+	/* Zeroed here instead of in handle_syscall to save a store there */
+	regs->regs[0] = 0;
 	/* Set for syscall restarting */
 	if (nr < NR_syscalls)
 		regs->regs[0] = nr + 1;