feat: add support for fine-grained assembly representation #1527
ThinkOpenly wants to merge 27 commits into riscv:main from
|
Not terribly worried about this, but DO NOT MERGE. I had to specify all required fields and could not depend on the "default" values being used when trying to get past |
|
Thank you for getting this ball rolling, Paul.

I would like to talk about testing unified-db against existing tools, and how we want to go about it. Currently, we test LLVM by inspecting the internal data descriptions (TableGen), but I'm not sure this is sustainable long-term. I would prefer to do the testing end-to-end, i.e. given a set of extensions (+parameters?), generate a random valid assembly string and check whether the assembler turns it into the expected instruction (+relocations). Similarly, generate a random invalid assembly string and check that it is rejected. We should also be able to do the same by generating random encoding bits for disassembling (this is easier today, I think), for both valid and invalid cases.

While these all seem like they should mirror each other, they don't quite, because the output of a disassembler is never meant to be assembled again. In reality you need an oracle to work out what the correct assembly should have been. In our case, that likely means writing an assembler/disassembler that can do so based entirely on UDB information. These are not small undertakings, but I think they would put us in a better position long-term.

I think you're right that we need to see how to join this up with the field information at some point. I'm keen for us to have a side table of "these are operands others have used, please re-use them", which hopefully would cover some combination of fields and operands, but it's hard to know how this would look today, and it will need iteration.

One kind of operand you've missed out is Symbol Expressions (

I started writing a much longer comment, which I'm going to hide below. I think it's useful to write down these cases, but they start to look like the longer tail of things. I do think it's important to think about the more complex cases earlier, though, rather than designing something that works for simpler cases like

More Hard Cases

Sorry for the stream-of-consciousness thoughts. I want to provide some degree of "here are the complex cases we need to get right", rather than just thinking about simple cases like `add rd, rs1, rs2`:

One thing to be really careful about right now is PC-relative immediate operands, such as the offset in

Another "fun" snare is the xlen-dependent operands, which come up in shifts, where the immediate range accepted depends on xlen.

We probably also want to be careful with "which registers are valid to write", but that's quite difficult to do right now, and is closely connected to "what is encodable". This is especially the case when we want to say "this operand actually represents a GPR pair, not a GPR", such as in Zilsd, but similar cases also occur in C/Zca instructions.

Zfinx/Zdinx are going to be a nightmare on the "which registers are valid to write" front, especially RV32 Zdinx. I haven't looked at how these are represented, but they're a clear case of "one mnemonic can mean a bunch of different instructions depending on the operands", which is a joy to deal with.

We eventually want to cover pseudos (which expand to sequences of instructions).

We probably also need to cover optional operands, somewhere along the way. the

We also made the decision in the toolchains recently that MOP/HINT-compatible instructions can always be written as their non-hint variant, i.e. |
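The end-to-end idea and the xlen-dependent shift case can be sketched together. The following is only an illustration, not code from this PR or from UDB: `shamt_limit`, `random_slli_case`, and `oracle_accepts` are hypothetical names, and the "oracle" only knows about `slli` with numeric `xN` register names. It relies on the fact that the shift amount must be below 32 on RV32 and below 64 on RV64.

```python
import random


def shamt_limit(xlen: int) -> int:
    """Exclusive upper bound for a shift amount: 32 on RV32, 64 on RV64."""
    if xlen not in (32, 64):
        raise ValueError(f"unsupported xlen: {xlen}")
    return xlen


def random_slli_case(xlen: int, valid: bool, rng: random.Random) -> tuple[str, bool]:
    """Build a random `slli` assembly string and the verdict we expect for it."""
    rd = rng.randrange(32)
    rs1 = rng.randrange(32)
    limit = shamt_limit(xlen)
    # Valid cases stay inside [0, limit); invalid cases land at or above it.
    shamt = rng.randrange(limit) if valid else rng.randrange(limit, limit + 64)
    return f"slli x{rd}, x{rs1}, {shamt}", valid


def oracle_accepts(asm: str, xlen: int) -> bool:
    """Toy oracle (hypothetical): accept only well-formed `slli` for this xlen."""
    mnemonic, _, rest = asm.partition(" ")
    parts = [p.strip() for p in rest.split(",")]
    if mnemonic != "slli" or len(parts) != 3:
        return False
    rd, rs1, shamt = parts
    for reg in (rd, rs1):
        if not (reg.startswith("x") and reg[1:].isdigit() and int(reg[1:]) < 32):
            return False
    return shamt.isdigit() and int(shamt) < shamt_limit(xlen)
```

In a real harness the generated string would be fed to the assembler under test and its accept/reject result compared with the oracle's; here the oracle is only compared against the generator's own expectation.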
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main    #1527   +/-   ##
=======================================
  Coverage   72.24%   72.24%
=======================================
  Files          52       52
  Lines       27671    27671
  Branches     6009     6009
=======================================
  Hits        19992    19992
  Misses       7679     7679
|
|
This PR is timely -- see #1435, which is just about ready. It introduces the new, new instruction schema that has dedicated operand objects that you could attach this information to. |
Indeed. Looks like we're headed in the same rough direction.
Is it? ;-) Is the versioning in some of the file names an indicator of "works in progress"?
OK. Looking... Should we discuss this here, there, in a meeting, in a GitHub "discussion"? This is a fairly big topic. |
The versioning is supposed to indicate that we are nearing 1.0 of a schema. I was thinking that we'd get 0.9 in main and then start a review process.
Probably big enough for a meeting. Maybe you, me, and Sam to start, then we can report back in the SIG? |
811bd8a to 2cd9558
Currently, assembly syntax is represented as a simple string of comma-separated operands with a heuristic naming convention to indicate their respective type and purpose.

Make this more rigorous by adding a schema which supports:
- registers
  - the registers' register file (GPR, FPR, VR, CSR, etc.)
  - dereference syntax "(reg)"
  - dereference+offset syntax "offset(reg)"
- immediates
- floating-point rounding mode and possible values
- FENCE scopes
- register lists (for POP/PUSH)

The new "operands" YAML field is currently optional and coexists with the existing "assembly" field. So, support can be added over time to both the YAML files and the infrastructure to support generation of actual assembly syntax where needed (documentation) until the "assembly" field is no longer needed.
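To make the motivation concrete, here is a rough sketch of what consuming the legacy "assembly" string by naming heuristics can look like. Everything in it is an assumption for illustration: the `Operand` class, the `REG_FILES` prefix table, and the rule "register-file letter followed by `d` or `s`" are hypothetical and are not the rules used by the existing tooling or the schema proposed in this PR.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical prefix-to-register-file table, assumed for this sketch only.
REG_FILES = {"x": "GPR", "f": "FPR", "v": "VR"}

# Matches "(reg)" and "offset(reg)".
DEREF = re.compile(r"^(?P<offset>\w*)\((?P<reg>\w+)\)$")


@dataclass
class Operand:
    kind: str                       # "reg", "imm", or "deref"
    name: str                       # operand name as written in the string
    reg_file: Optional[str] = None  # guessed register file, if any
    offset: Optional[str] = None    # offset operand name for "offset(reg)"


def classify(token: str) -> Operand:
    """Guess an operand's type from its name alone."""
    m = DEREF.match(token)
    if m:
        reg = m.group("reg")
        return Operand("deref", reg, REG_FILES.get(reg[0]), m.group("offset") or None)
    # Assumed heuristic: "<file letter>d..." or "<file letter>s..." is a register.
    if token[0] in REG_FILES and token[1:2] in ("d", "s"):
        return Operand("reg", token, REG_FILES[token[0]])
    # Anything else is treated as an immediate.
    return Operand("imm", token)


def parse_assembly(assembly: str) -> list[Operand]:
    """Split a comma-separated assembly string and classify each operand."""
    return [classify(tok.strip()) for tok in assembly.split(",") if tok.strip()]
```

With these assumed rules, `parse_assembly("xd, imm(xs1)")` yields a GPR register and a dereference with offset `imm`, while a name such as `rm` or `vm` falls through to "imm". That kind of guesswork is what an explicit "operands" field would remove.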
|
Notes on current state...
Things that would be nice to address before merging:
Can wait:
|
|
Copy some ideas from #1814 :
$schema: "operand_schema.json#"
kind: operand
name: sp-implicit
data:
  $inherits: operand/xs.yaml#/data
  possible_values: [2]
  implicit: true

$schema: "operand_schema.json#"
kind: operand
name: xs-pair
data:
  type: reg_pair
  name: reg_pair_xs
  possible_values:
    - $inherits: operand/xs.yaml#/data
      name: ["x0", "zero"]
      values: [0, 1]
    - $inherits: operand/xs.yaml#/data
      name: ["x2", "sp"]
      values: [2, 3]
    - $inherits: operand/xs.yaml#/data
      name: ["x4", "tp"]
      values: [4, 5]
    ...

$schema: "operand_schema.json#"
kind: operand
name: xsdc
data:
  $inherits: operand/xdc.yaml#/data
  source: true

$schema: "operand_schema.json#"
kind: operand
name: xsd-sp-only
data:
  $inherits: operand/xs.yaml#/data
  possible_values: [2]
  destination: true

With Optimization 2, the reg_range can be integrated with GPR more naturally:

$schema: "operand_schema.json#"
kind: operand
name: reg-range1-rv32
data:
  type: reg_range
  name: reg_range1
  possible_values:
    - $inherits: operand/xs.yaml#/data
      name: ["x8", "s0"]
      values: [8]
    - $inherits: operand/xs.yaml#/data
      name: ["x8-x9", "s0-s1"]
      values: [8, 9]
  optional: true

The problem of

$schema: "operand_schema.json#"
kind: operand
name: vm
data:
  $inherits: operand/vs.yaml#/data
  name: vm
  possible_values:
    - $inherits: operand/vs.yaml#/data
      name: ["v0.t"]
      values: [0]
  optional: true
|
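For readers unfamiliar with `$inherits`, the sketch below shows one plausible way such references could be expanded: copy the referenced data, then let the local keys override it. This is an assumption for illustration only. The `OPERANDS` table stands in for files on disk, the contents given for `operand/xs.yaml` are invented, and the `lookup`/`resolve` helpers are hypothetical; the actual resolver may merge differently.

```python
import copy

# Stand-in for operand files on disk. The contents of operand/xs.yaml are
# invented for this sketch; the real file may look different.
OPERANDS = {
    "operand/xs.yaml": {"data": {"type": "reg", "reg_file": "GPR", "source": True}},
}


def lookup(ref: str):
    """Follow a reference of the form 'path#/pointer' into OPERANDS."""
    path, _, pointer = ref.partition("#")
    node = OPERANDS[path]
    for key in pointer.strip("/").split("/"):
        if key:
            node = node[key]
    return node


def resolve(node):
    """Expand `$inherits` recursively; local keys replace inherited keys."""
    if isinstance(node, list):
        return [resolve(item) for item in node]
    if not isinstance(node, dict):
        return node
    merged = {}
    if "$inherits" in node:
        # Deep-copy so that resolving never mutates the shared parent data.
        merged = resolve(copy.deepcopy(lookup(node["$inherits"])))
    for key, value in node.items():
        if key != "$inherits":
            merged[key] = resolve(value)
    return merged
```

Under these assumptions, resolving the `sp-implicit` data above would give the inherited `xs` keys plus `possible_values: [2]` and `implicit: true`, and each entry in a `possible_values` list would be expanded the same way.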
|
A small question about execution: have you considered moving this PR to a separate development branch so that everyone can contribute to it? It would take quite some time to migrate every extension to the new format. Moreover, this PR will be extremely large to review. |
Fair question. I've been hoping to get a reasonably comprehensive schema in place as a "phase 1". I think it's close. "implicit" probably needs to be added first, although that could be a "phase 2". Thoughts? Phases:
Note that this PR could just cover phase 1, to get things moving.
It will be a fair amount of effort, but I'm really hoping that AI can pitch in a lot.
I'm not sure of the best way to mitigate that problem: one big gulp, or many bite-sized pieces. The effort will be roughly the same in aggregate. I'm in the process of creating some reasonably robust tests, although they do not utilize the IDL at all. I'd really like a way to invoke IDL functions from Python, but for now, I just did a very rough (hacky) equivalent translation of the IDL to Python. |
LGTM.
You would need to translate the grammar generated in Ruby into one for a Python parser interpreter such as Arpeggio. If I'm not mistaken, both use PEG-based grammars. You then visit the parse tree to get the result. I'm not very familiar with parsers, but this article provides some direction on how to do this. |
Or, call Ruby to do the work. There is some setup required as well, as "encode" needs access to the operands, and "decode" needs access to the fields. This still needs thought and a fair amount of effort. :-/ |