Skip to content

Conversation

aman4150
Copy link
Collaborator

@aman4150 aman4150 commented Sep 16, 2025

Instead of compiling individual division and remainder instructions inline, we now compile them as calls to trampoline functions. This will reduce the maximum number of compiled instruction per PolkaVM instruction.

We are doing this because division and remainder instructions are quite rarely used in practice, and inlining them bloats the compiled code size significantly.

Additionally, we have made division and remainder operations branchless.

For easy review, this is the final code for unsigned division operation:

push rax
push rdx
push rbx
mov rax, [rsp+0x20]
xor ebx, ebx
test rcx, rcx
sete bl
cmove rcx, rbx
xor edx, edx
div rcx
neg rbx
cmovne rax, rbx
mov rcx, rax
pop rbx
pop rdx
pop rax
ret

Fixes: #169

Instead of compiling individual division and remainder instructions inline,
we now compile them as calls to trampoline functions. This will reduce
the maximum number of compiled instruction per PolkaVM instruction.

We are doing this because division and remainder instructions are quite
rarely used in practice, and inlining them bloats the compiled code size
significantly.

Additionally, we have made division and remainder operations branchless.

Signed-off-by: Aman <[email protected]>
Comment on lines +1791 to +1795
self.push(mov(RegSize::R64, TMP_REG, conv_reg(s2)));
self.push(push(conv_reg(s1)));
self.call_to_label(label);
self.push(pop(conv_reg(d)));
self.push(mov(RegSize::R64, conv_reg(d), TMP_REG));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that all of these are constant-length, convert all of these to use asm.reserve (which makes codegen more efficient as it removes bounds checks from every push call).

Comment on lines +1782 to +1783
self.push(pop(conv_reg(d)));
self.push(mov(RegSize::R64, conv_reg(d), TMP_REG));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks weird. You're popping a value from the stack into d, and then you overwrite d with TMP_REG. Why? Is it only because you need to reverse the push you did before the call? In which case why don't we store the argument in vmctx? That will make the pop unnecessary, saving us one instruction per div.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did float this idea during our discussion but later I realized that this approach could be faster and smaller (total bytes). I can compare these two approaches on the test bench.

Comment on lines +735 to +738
match reg_size {
RegSize::R32 => self.push(mov_imm(rdx, imm32(cast(i32::MIN).to_unsigned()))),
RegSize::R64 => self.push(mov_imm64(rdx, cast(i64::MIN).to_unsigned())),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use mov_imm + imm64 instead of mov_imm64.

Comment on lines +744 to +750
match reg_size {
RegSize::R32 => self.push(cmp((TMP_REG, imm32(cast(-1_i32).to_unsigned())))),
RegSize::R64 => self.push(cmp((TMP_REG, imm64(-1_i32)))),
}

self.push(xor((RegSize::R32, r12, r12)));
self.push(setcc(Condition::Equal, r12));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. You're using cmp to compare TMP_REG with -1, but then you're throwing away the result of this comparison by overwriting the flag with xor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely incorrect. Not sure how it passed all the tests - I'll fix it and add a test for the same.

}

pub(crate) fn emit_divrem_trampoline(&mut self) {
log::trace!("Emitting trampoline: divrem");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make each trampoline have its own print (so that it's easy to see in the logs the exact assembly emitted for each)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Uninline the division in the AMD64 backend
2 participants