feat: add support for fine-grained assembly representation by ThinkOpenly · Pull Request #1527 · riscv/riscv-unified-db

ThinkOpenly · 2026-02-02T21:47:02Z

Currently, assembly syntax is represented as a simple string of comma-separated operands with a heuristic naming convention to indicate their respective type and purpose.

Make this more rigorous by adding a schema which supports:

- registers
- the registers' register file (GPR, FPR, VR, CSR, etc.)
- dereference syntax "(reg)"
- dereference+offset syntax "offset(reg)"
- immediates
- floating-point rounding mode and possible values
- FENCE scopes
- register lists (for POP/PUSH)

The new "operands" YAML field is currently optional and coexists with the existing "assembly" field. So, support can be added over time to both the YAML files and the infrastructure to support generation of actual assembly syntax where needed (documentation) until the "assembly" field is no longer needed.
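As a rough illustration of how a structured operand list could still produce the familiar assembly string, here is a minimal Python sketch. The record layout and the field names (`type`, `regfile`, `name`, `base`, `offset`) are assumptions made for this example and are not taken from the schema in this PR.

```python
# Hypothetical operand records: the field names below are illustrative only
# and do not reflect the actual "operands" schema proposed in the PR.
def render_operand(op):
    """Render one structured operand as assembly text."""
    kind = op["type"]
    if kind in ("register", "immediate"):
        return op["name"]
    if kind == "deref":            # dereference syntax "(reg)"
        return f"({op['base']})"
    if kind == "deref_offset":     # dereference+offset syntax "offset(reg)"
        return f"{op['offset']}({op['base']})"
    raise ValueError(f"unknown operand type: {kind}")


def render_assembly(mnemonic, operands):
    """Join the rendered operands into a comma-separated assembly string."""
    return f"{mnemonic} " + ", ".join(render_operand(o) for o in operands)


lw_operands = [
    {"type": "register", "regfile": "GPR", "name": "xd"},
    {"type": "deref_offset", "regfile": "GPR", "base": "xs1", "offset": "imm"},
]
print(render_assembly("lw", lw_operands))
```

With records like these, the existing "assembly" string becomes derivable from "operands", which is the property that would eventually let the "assembly" field be dropped.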

ThinkOpenly · 2026-02-02T21:48:45Z

Not terribly worried about this, but DO NOT MERGE.

I had to specify all required fields and could not depend on the "default" values being used when trying to get past pre-commit run check-jsonschema.

lenary · 2026-02-03T04:52:52Z

Thank you for getting this ball rolling, Paul.

I would like to talk about testing unified-db against existing tools, and how we want to go about it. Currently, we test LLVM by inspecting the internal data descriptions (TableGen), but I'm not sure this is sustainable long-term. I would prefer to do the testing end-to-end, i.e. given a set of extensions (+parameters?), generate a random valid assembly string and check whether the assembler turns it into the expected instruction (+relocations). Similarly, generate a random invalid assembly string and check that it is rejected. We should also be able to do the same by generating random encoding bits for disassembly (this is easier today, I think), for both valid and invalid cases. While these all seem like they should mirror each other, they don't quite, because the output of a disassembler is never meant to be assembled again. In reality you need an oracle to work out what the correct assembly should have been; in our case, that likely means writing an assembler/disassembler that can do so based entirely on UDB information. These are not small undertakings, but I think they would put us in a better position long-term.
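A minimal sketch of the round-trip idea, in Python, restricted to the single instruction `add`. The helper names (`encode_add`, `decode_add`, `random_valid_add`) are made up for this example; in a real harness the encoder would be derived from UDB data and act as the oracle, and the decoder would be replaced by the external assembler or disassembler under test.

```python
import random

OPCODE_OP = 0b0110011  # opcode for R-type integer register-register ops


def encode_add(rd, rs1, rs2):
    """Oracle: encode `add rd, rs1, rs2` (funct7 = 0, funct3 = 0)."""
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | OPCODE_OP


def decode_add(word):
    """Stand-in for the tool under test: return assembly text, or None."""
    if word & 0x7F != OPCODE_OP or (word >> 12) & 0x7 != 0 or (word >> 25) != 0:
        return None
    rd, rs1, rs2 = (word >> 7) & 0x1F, (word >> 15) & 0x1F, (word >> 20) & 0x1F
    return f"add x{rd}, x{rs1}, x{rs2}"


def random_valid_add(rng):
    """Generate a random valid assembly string and its expected encoding."""
    rd, rs1, rs2 = (rng.randrange(32) for _ in range(3))
    return f"add x{rd}, x{rs1}, x{rs2}", encode_add(rd, rs1, rs2)


rng = random.Random(0)  # fixed seed so a failure is reproducible
for _ in range(1000):
    text, word = random_valid_add(rng)
    assert decode_add(word) == text
```

The invalid-input direction works the same way: perturb a field that must be fixed (for example funct7) and check that the tool rejects the word.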

I think you're right that we need to see how to join this up with the field information at some point. I'm keen for us to have a side table of "these are operands others have used, please re-use them", which hopefully would cover some combo of fields and operands, but it's hard to know how this would look today, and will need iteration.

One kind of operand you've missed out is Symbol Expressions (foo, foo+1, %pcrel_hi(foo)) - exactly which specifiers (the <name> in %<name>(expr)) are allowed depends on the instruction/field, and corresponds to available relocations on the field.
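To make the shape of such operands concrete, here is a small Python sketch that splits `%<name>(expr)` into its specifier and expression and checks the specifier against a per-instruction table. The table is illustrative and deliberately incomplete, and treating a bare symbol as always acceptable is a simplification made for this example.

```python
import re

# Illustrative, incomplete table: which %specifiers an instruction's
# immediate accepts. A real table would follow the available relocations.
ALLOWED_SPECIFIERS = {
    "lui": {"hi"},
    "auipc": {"pcrel_hi"},
    "addi": {"lo", "pcrel_lo"},
}

SPECIFIER_RE = re.compile(r"^%(?P<name>[a-z_]+)\((?P<expr>[^()]+)\)$")


def parse_symbol_operand(text):
    """Return (specifier or None, expression) for foo, foo+1, %name(expr)."""
    m = SPECIFIER_RE.match(text.strip())
    if m:
        return m.group("name"), m.group("expr").strip()
    return None, text.strip()


def specifier_allowed(mnemonic, operand):
    """Check the operand's specifier against the table for this mnemonic."""
    spec, _ = parse_symbol_operand(operand)
    # Simplification: a bare symbol expression is accepted everywhere.
    return spec is None or spec in ALLOWED_SPECIFIERS.get(mnemonic, set())


print(parse_symbol_operand("%pcrel_hi(foo)"))
print(specifier_allowed("lui", "%pcrel_hi(foo)"))
```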

I started writing a much longer comment, which I'm going to hide below, because I think it's useful to write down these cases, but they start to look like the longer tail of things. I do think it's important to think about the more complex cases earlier, though, rather than designing something that works for simpler cases like add rd, rs1, rs2 and needs to be entirely changed later for complex cases.

More Hard Cases

Sorry for the stream-of-consciousness thoughts. I want to provide some degree of "here are complex cases we need to get right", rather than just thinking about simple cases like `add rd, rs1, rs2`:

One thing to be really careful about right now is PC-relative immediate operands, such as the offset in beq and jal. These are treated differently by GCC and by Clang -- beq a0, a1, 28 in GCC means branch to address 28, and in Clang means branch to address pc+28. Fixing this incompatibility is not something we should seek to do with the specification at this time, as fixing one of the assemblers is not a very easy thing to do. Note that beq a0, a1, symbol is treated identically by both.
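The difference can be shown with a few lines of Python. This is only a sketch of the two readings described above; the function names and the `convention` strings are invented for the example and do not correspond to any assembler option.

```python
def branch_target(pc, operand, convention):
    """Resolve a bare numeric branch operand under one of two readings."""
    if convention == "absolute":       # operand is the destination address
        return operand
    if convention == "pc_relative":    # operand is added to the branch's pc
        return pc + operand
    raise ValueError(f"unknown convention: {convention}")


def encoded_offset(pc, operand, convention):
    """The immediate stored in the instruction is always target - pc."""
    return branch_target(pc, operand, convention) - pc


pc = 0x100
print(encoded_offset(pc, 28, "absolute"))     # offset needed to reach address 28
print(encoded_offset(pc, 28, "pc_relative"))  # offset needed to reach pc + 28
```

The same source text therefore yields two different encodings unless the branch sits at address 0, which is why a numeric operand needs an explicit interpretation in any operand schema.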

Another "fun" snare is the xlen-dependent operands, which come up in shifts, where the immediate range accepted depends on xlen.
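For the shift case, the check is small enough to write down. A minimal Python sketch, assuming only the base slli/srli/srai forms and an xlen of 32 or 64 (the function name is made up for this example):

```python
def shamt_in_range(xlen, shamt):
    """slli/srli/srai accept 0..xlen-1: 5 bits on RV32, 6 bits on RV64."""
    if xlen not in (32, 64):
        raise ValueError("this sketch only covers xlen of 32 or 64")
    return 0 <= shamt < xlen


print(shamt_in_range(64, 40))  # valid on RV64
print(shamt_in_range(32, 40))  # rejected on RV32
```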

We probably also want to be careful with "which registers are valid to write", but that's quite difficult to do right now, and is closely connected to "what is encodable". This is especially the case when we want to say "this operand actually represents a GPR Pair, not a GPR", such as in Zilsd, but something similar also occurs in C/Zca instructions.

Zfinx/Zdinx are going to be a nightmare on the "which registers are valid to write", especially rv32 zdinx. I haven't looked at how these are represented, but they're a clear case of "one mnemonic can mean a bunch of different instructions depending on the operands", which is a joy to deal with.

We eventually want to cover pseudos (which expand to sequences of instructions). call and tail are probably good places to start, lw <reg>, <sym> and sw <reg>, <sym>, <reg> are harder instances of similar things, as are the la.tls.ie and la.tls.gd pseudos.

We probably also need to cover optional operands somewhere along the way. The 0 in 0(reg) is optional, and in the atomic instructions it does not correspond to any encoding bits (the offset in these can only be 0). There are similar complexities in vsetvli.
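A minimal Python sketch of parsing an `offset(reg)` operand where the offset may be omitted. The `offset_must_be_zero` flag is a hypothetical way to model the atomic-instruction case, where a written offset is accepted only if it is 0; it is not a field from the proposed schema.

```python
import re

# Optional offset, then a register name in parentheses.
MEM_RE = re.compile(r"^\s*(?P<offset>[-+]?\w+)?\s*\(\s*(?P<reg>\w+)\s*\)\s*$")


def parse_mem_operand(text, offset_must_be_zero=False):
    """Parse "(reg)" or "offset(reg)"; an omitted offset defaults to "0"."""
    m = MEM_RE.match(text)
    if not m:
        raise ValueError(f"not a memory operand: {text!r}")
    offset = m.group("offset") or "0"
    # For operands with no offset encoding bits, only a literal zero is legal.
    if offset_must_be_zero and int(offset, 0) != 0:
        raise ValueError(f"offset must be 0 here, got {offset!r}")
    return offset, m.group("reg")


print(parse_mem_operand("(a0)"))
print(parse_mem_operand("8(sp)"))
print(parse_mem_operand("0(a1)", offset_must_be_zero=True))
```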

We also made the decision in the toolchains recently that MOP/HINT-compatible instructions can always be written as their non-hint variant - i.e. c.sspush ra (from zicfiss) can always be written if you have c.mop.1 available (from zcmop) - see discussion here: riscv-non-isa/riscv-elf-psabi-doc#474 - this ends up adding some complexity to udb, but we don't believe that MOPs/HINTs can be re-allocated anyway.

codecov · 2026-02-03T13:28:13Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.24%. Comparing base (a1225d7) to head (95f77c5).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1527   +/-   ##
=======================================
  Coverage   72.24%   72.24%           
=======================================
  Files          52       52           
  Lines       27671    27671           
  Branches     6009     6009           
=======================================
  Hits        19992    19992           
  Misses       7679     7679

Flag	Coverage Δ
idlc	`76.18% <ø> (ø)`
udb	`66.26% <ø> (ø)`


dhower-qc · 2026-02-03T14:48:40Z

This PR is timely -- see #1435, which is just about ready. It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.

ThinkOpenly · 2026-02-03T17:49:42Z

> This PR is timely -- see #1435

Indeed. Looks like we're headed in the same rough direction.

> which is just about ready.

Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?

> It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to.

OK. Looking...

Should we discuss this here, there, in a meeting, in a GitHub "discussion"? This is a fairly big topic.

dhower-qc · 2026-02-03T17:53:25Z

> > which is just about ready.
>
> Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?

The versioning is supposed to indicate that we are nearing 1.0 of a schema. I was thinking that we'd get 0.9 in main and then start a review process.

> Should we discuss this here, there, in a meeting, in a GitHub "discussion"? This is a fairly big topic.

Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?

ThinkOpenly · 2026-02-03T20:42:45Z

> Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG?

I opened discussion #1532, since time zones can make getting everyone to a meeting challenging. I understand the gist of #1435, but I'm going to spend more time with it.


ThinkOpenly · 2026-04-29T04:30:10Z

Notes on current state...

- I treat csr operands as coming from a "csr" regfile
- vector mask uses "possible_values: [0]" for "v0.t"
- I kinda brute-forced reg_list definitions, but am content with the implementation
- cm.mvsa01 uses operands "r1s, r2s", which I find weird. I used operand description name "xdm" (X regfile, destination, "move") and "xsm" would be used for cm.mva01s, but I override them to "r1s" and "r2s", reluctantly.

Things that would be nice to address before merging:

- will that IDL "offset" syntax work? "operands[xs1].offset"
- add encoding and decoding tests for new "encdec" backend
- shall I implement "possible_values" for "csr" as "[0-4095]"? probably.
- add more descriptions to the YAML files? to operand definitions?
- operand description file names. This is subjective, and I'm not sure how important. Might be nice to document some conventions.

Can wait:

- encode.py and decode.py are very hacky/fragile. They need to invoke the IDL functions.
- need to validate "(reg)" operand definition "works" (xs-nooffset)
- need to better handle register names (ABI names)... should there be a parameter to decode()?
- what about CSR names as operand values?
- need real IDL testing for encode()/decode()... some way for an assembler/disassembler to utilize IDL.
- support for pushing JSON schema "default" values into resolved arch
- support for identifying "implicit" operands (e.g. register pair)
- some sort of tie-in with "regfile" content instead of hardcoding regfile names?
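On the ABI-names question, one possible shape is a flag on the name-rendering helper. The following Python sketch is an assumption about how that could look; `gpr_name`, `gpr_index`, and `use_abi_names` are invented for the example and are not the interface of the decode.py in this PR.

```python
# Standard integer register ABI names, indexed by register number x0..x31.
ABI_NAMES = (
    ["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]
    + [f"a{i}" for i in range(8)]       # x10..x17
    + [f"s{i}" for i in range(2, 12)]   # x18..x27
    + [f"t{i}" for i in range(3, 7)]    # x28..x31
)


def gpr_name(index, use_abi_names=False):
    """Render a GPR number as "xN" or as its ABI name."""
    if not 0 <= index < 32:
        raise ValueError(f"GPR index out of range: {index}")
    return ABI_NAMES[index] if use_abi_names else f"x{index}"


def gpr_index(name):
    """Accept either spelling on input; "fp" is an alias for s0 (x8)."""
    if name == "fp":
        return 8
    if name in ABI_NAMES:
        return ABI_NAMES.index(name)
    if name.startswith("x") and name[1:].isdigit() and int(name[1:]) < 32:
        return int(name[1:])
    raise ValueError(f"not a GPR name: {name!r}")


print(gpr_name(2), gpr_name(2, use_abi_names=True))
```

Accepting both spellings when encoding while making the decode side a caller's choice keeps round-trip tests simple: compare register indices rather than strings.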

henry-hsieh · 2026-05-07T16:28:56Z

Copying some ideas from #1814:

Implicit operand (with a new implicit attribute), e.g., c.addi4spn

$schema: "operand_schema.json#"
kind: operand
name: sp-implicit
data:
  $inherits: operand/xs.yaml#/data
  possible_values: [2]
  implicit: true

A name that covers multiple actual registers, e.g., c.ld in RV32

- Introduce a name-to-values mapping
- The name could be an array, to allow swapping in ABI names
- The values can be an array containing all included values

$schema: "operand_schema.json#"
kind: operand
name: xs-pair
data:
  type: reg_pair
  name: reg_pair_xs
  possible_values:
    - $inherits: operand/xs.yaml#/data
      name: ["x0", "zero"]
      values: [0, 1]
    - $inherits: operand/xs.yaml#/data
      name: ["x2", "sp"]
      values: [2, 3]
    - $inherits: operand/xs.yaml#/data
      name: ["x4", "tp"]
      values: [4, 5]
...

Instructions that use the same operand as both source and destination, e.g., c.and

$schema: "operand_schema.json#"
kind: operand
name: xsdc
data:
  $inherits: operand/xdc.yaml#/data
  source: true

Instructions that allow only one possible value for an operand, e.g., c.addi16sp

$schema: "operand_schema.json#"
kind: operand
name: xsd-sp-only
data:
  $inherits: operand/xs.yaml#/data
  possible_values: [2]
  destination: true

With Optimization 2 (the name-to-values mapping), the reg_range can be integrated with GPR more naturally:

$schema: "operand_schema.json#"
kind: operand
name: reg-range1-rv32
data:
  type: reg_range
  name: reg_range1
  possible_values:
    - $inherits: operand/xs.yaml#/data
      name: ["x8", "s0"]
      values: [8]
    - $inherits: operand/xs.yaml#/data
      name: ["x8-x9", "s0-s1"]
      values: [8, 9]
  optional: true

The problem of vm could be solved, too:

$schema: "operand_schema.json#"
kind: operand
name: vm
data:
  $inherits: operand/vs.yaml#/data
  name: vm
  possible_values:
    - $inherits: operand/vs.yaml#/data
      name: ["v0.t"]
      values: [0]
  optional: true
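To show how a tool might consume the name-to-values mapping, here is a minimal Python sketch that mirrors the xs-pair example as plain data. The helper names are invented, and the convention that the first name is the architectural one and the last is the ABI one is an assumption read off the examples above, not something the proposal states.

```python
# The xs-pair entries from the example above, as plain Python data.
XS_PAIR_POSSIBLE_VALUES = [
    {"name": ["x0", "zero"], "values": [0, 1]},
    {"name": ["x2", "sp"], "values": [2, 3]},
    {"name": ["x4", "tp"], "values": [4, 5]},
]


def build_name_index(possible_values):
    """Map every accepted spelling to the register numbers it covers."""
    index = {}
    for entry in possible_values:
        for name in entry["name"]:
            if name in index:
                raise ValueError(f"duplicate operand name: {name}")
            index[name] = tuple(entry["values"])
    return index


def name_for_values(possible_values, values, prefer_abi=False):
    """Reverse lookup for disassembly. Assumes the first listed name is the
    architectural one and the last is the ABI one."""
    for entry in possible_values:
        if list(entry["values"]) == list(values):
            return entry["name"][-1] if prefer_abi else entry["name"][0]
    raise KeyError(tuple(values))


index = build_name_index(XS_PAIR_POSSIBLE_VALUES)
print(index["sp"])
print(name_for_values(XS_PAIR_POSSIBLE_VALUES, [4, 5], prefer_abi=True))
```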

henry-hsieh · 2026-05-07T16:42:17Z

A small question about execution:

Would you consider moving this PR to a separate development branch so that everyone can contribute to it? It would take quite some time to migrate every extension to the new format. Moreover, this PR will be extremely large to review.

ThinkOpenly · 2026-05-07T18:07:30Z

> Would you consider moving this PR to a separate development branch so that everyone can contribute to it?

Fair question. I've been hoping to get a reasonably comprehensive schema in place as a "phase 1". I think it's close. "implicit" probably needs to be added first, although that could be a "phase 2". Thoughts?

Phases:

1. Add non-breaking support for enough schema to encode and decode everything. Implement at least one example of every type of operand. Get general agreement with the approach.
2. Add "implicit" to give downstream more insight into dependencies/inputs and side-effects/outputs. Implement at least one example of every type of implicit operand. Get general agreement with the approach.
3. Implement for all instructions

Note that this PR could just cover phase 1, to get things moving.

> It would take quite some time to migrate every extension to the new format.

It will be a fair amount of effort, but I'm really hoping that AI can pitch in a lot.

> Moreover, this PR will be extremely large to review.

I'm not sure the best way to mitigate that problem. One big gulp, or many bite-sized pieces. The effort will be roughly the same in aggregate. I'm in the process of creating some reasonably robust tests, although they do not utilize the IDL at all. I'd really like a way to invoke IDL functions from Python, but for now, I just did a very rough (hacky) equivalent translation of the IDL to Python.

henry-hsieh · 2026-05-08T01:27:18Z

> > Would you consider moving this PR to a separate development branch so that everyone can contribute to it?
>
> Fair question. I've been hoping to get a reasonably comprehensive schema in place as a "phase 1". I think it's close. "implicit" probably needs to be added first, although that could be a "phase 2". Thoughts?
>
> Phases:
>
> 1. Add non-breaking support for enough _schema_ to encode and decode everything. Implement at least one example of every type of operand. Get general agreement with the approach.
> 2. Add "implicit" to give downstream more insight into dependencies/inputs and side-effects/outputs. Implement at least one example of every type of implicit operand. Get general agreement with the approach.
> 3. Implement for all instructions

LGTM.

> Note that this PR could just cover phase 1, to get things moving.
>
> > It would take quite some time to migrate every extension to the new format.
>
> It will be a fair amount of effort, but I'm really hoping that AI can pitch in a lot.
>
> > Moreover, this PR will be extremely large to review.
>
> I'm not sure the best way to mitigate that problem. One big gulp, or many bite-sized pieces. The effort will be roughly the same in aggregate. I'm in the process of creating some reasonably robust tests, although they do not utilize the IDL at all. I'd really like a way to invoke IDL functions from Python, but for now, I just did a very rough (hacky) equivalent translation of the IDL to Python.

You're going to need to translate the grammar generated in Ruby to a parser interpreter in Python, such as Arpeggio. If I'm not wrong, both use PEG-based grammars. Then you visit the parse tree to get the result. I'm not very familiar with parsers, but this article provides some directions for doing this.

ThinkOpenly · 2026-05-08T13:43:26Z

> You're going to need to translate the grammar generated in Ruby

Or, call Ruby to do the work. There is some setup required as well, as "encode" needs access to the operands, and "decode" needs access to the fields.

This still needs thought and a fair amount of effort. :-/

ThinkOpenly requested a review from dhower-qc as a code owner February 2, 2026 21:47

ThinkOpenly mentioned this pull request Feb 10, 2026

fix: update c.nop encoding and assembly to support HINTs (#1177) #1289

Draft

ThinkOpenly marked this pull request as draft March 19, 2026 22:11

ThinkOpenly force-pushed the operands branch from 27aef65 to 907ea07 Compare March 24, 2026 03:37

ThinkOpenly force-pushed the operands branch 4 times, most recently from 811bd8a to 2cd9558 Compare April 24, 2026 18:19

ThinkOpenly added 15 commits April 28, 2026 13:14

v2

333db7e

support fence_scope operands

9d57a64

support rounding_mode operands

c56b8e7

missing 'operands' may not be an error

90f31e6

minor comment

1c2ca55

v3

0cc34f9

optionality

d99be15

wip

cd7bce9

use lowercase for operand mnemonics

1c94344

lowercase operand mnemonics, support optionality

f525193

better support for optionality, reg_list, and stack_adj

e439fee

support sreg

0600adf

implement schema and associated changes

e777062

fix CI failures

87f3c46

ThinkOpenly added 2 commits April 28, 2026 13:14

more CI fixes

e00170f

significant refactor

1eb7e7a

ThinkOpenly force-pushed the operands branch from 2cd9558 to 1eb7e7a Compare April 28, 2026 18:15

update golden instruction appendix

788e419

ThinkOpenly marked this pull request as ready for review April 28, 2026 20:54

tighter refactoring, fix a few bugs

95f77c5

ThinkOpenly force-pushed the operands branch from 795735a to 95f77c5 Compare April 29, 2026 04:04

henry-hsieh mentioned this pull request Apr 30, 2026

feat: optimize C extensions #1814

Open

ThinkOpenly added 5 commits April 30, 2026 11:24

refactor encode.py and some fixes

0df4a3c

more refactoring

3352a4c

add --debug to make quieter by default

8ecbbde

use register_file names

b2f79c5

make decode.py quiter

bafdbcb

This was referenced May 4, 2026

feat(qc_iu): Add Xqccmi custom compressed instruction lookup table extension #1802

Merged

Add a means to indicate instruction references/changes PC #1825

Open

move to tools, update decode for ABI names

f1971d0

ThinkOpenly mentioned this pull request May 5, 2026

Add explicit PC metadata for instructions (#1825) #1827

Open

ThinkOpenly added 2 commits May 5, 2026 10:17

support floating point immediate operand type

0172b82

support float_immediate

58b4c15


4 participants
