Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable binary encoding and printing of multi-package wit files #1609

Closed

Conversation

macovedj
Copy link
Contributor

Wit files that leverage multiple packages should be able to be encoded into binary and printed back out with the changes in this PR

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! On a skim this looks reasonable enough, but there's one part that will require careful review I haven't gotten to yet. Specifically this is affecting the binary format for wasm-encoded WIT packages and I need to review to see if this has an impact on how this interacts with older versions of wasm-tools. For example older versions need to be able to decode binaries produced by this PR and vice-versa, binaries produced here additionally need to be able to be decoded by older versions (assuming the new feature of more-than-one-package isn't used).

To perhaps help with this can you add some tests? There's no tests currently for this functionality and I think it would be good to have some in crates/wit-component/tests/interfaces/*.wit. Additionally would it be possible to avoid changing the v1 encoding? That in theory isn't being emitted by deafult any more and may even be able to be removed, it's been a bit since I reviewed the state of that.

@alexcrichton
Copy link
Member

Ok reading over this some more I don't think that this is going to work as-is. It looks like this is basically adding a few loops in a few places, but I think that the changes required are going to be more significant than that. For example each interface/world in a package, when encoded today, is exported as the kebab-cased-name of the interface/world. Each kebab-cased export is a standalone component type which has full context for just that interface/world.

With multiple packages something will need to change about this format. Otherwise it's not possible to encode two packages which contains interface foo for example since that would generate two kebab-cased exports named "foo". I haven't thought much about what the format should be, so I don't have an answer off the top of my head.

@alexcrichton
Copy link
Member

cc @lukewagner for possible brainstorming on the format here. The question being that today we have:

$ wasm-tools component wit -t <<< "package a:b; interface foo {}"
(component
  (type (;0;)
    (component
      (type (;0;)
        (instance)
      )
      (export (;0;) "a:b/foo" (instance (type 0)))
    )
  )
  (export (;1;) "foo" (type 0))
  (@custom "package-docs" "\00{}")
  (@producers
    (processed-by "wit-component" "0.210.0")
  )
)

What's the best way to represent this WIT package:

package a:b1 { interface foo {} }
package a:b2 { interface foo {} }

I'm tempted to say:

(component
  (component $a:b1 (; output of `wasm-tools component wit -t for just `package a:b1` ;) )
  (component $a:b2 (; output of `wasm-tools component wit -t for just `package a:b2` ;) )

  (export "a:b1" (component $a:b1))
  (export "a:b2" (component $a:b2))
)

This isn't currently valid though as "a:b1" isn't a valid export string, right now it needs to be fully qualified. This could be seen as motivation for relaxing those restrictions, but that's balooning the change a good deal so I'm curious if you have thoughts on this.

@lukewagner
Copy link

Ooh yeah, really good question. Agreed that the natural first step seems to be to put each nested WIT package into its own nested component. Then to your question: maybe this is weird but: what if we just didn't export those nested WIT packages at all? That would imply that when WIT tooling was slurping up a binary WIT package, it would need to recurse through all nested components and populate its internal namespace with all interfacename-d type exports of exported component types (rather than only doing so for the root component). Would that work?

What's neat about this is that it preserves the distinction between:

package ns:a;
package ns:b { interface i {} }
interface i {}

and

package ns:a { interface i {} }
package ns:b { interface i {} }

The former you could say is defining the package ns:a (which exports i) and bundling ns:b as an external dependency whereas the latter is defining an empty unnamed package that bundles external packages ns:a and ns:b. I think that distinction makes sense?

@alexcrichton
Copy link
Member

I think that could work yeah and that seems reasonable to me. For the start I think it'd just walk the root component looking for subcomponents rather than spidering the whole thing, but either probably wouldn't be too hard to implement. I'm thinking though it'd be best to only accept a conservative shape of components to keep our options open in the future to different organizations if we want.

Otherwise though this WIT:

package ns:a;
package ns:b { interface i {} }
interface i {}

currently isn't valid where WIT text files are either package foo:bar; mode or exclusively package foo:bar { ... } mode. That being said being able to express the distinction there eventually I think is good regardless if we want to support it in the future.

@lukewagner
Copy link

Ah right, good point; unless and until we add package { ... } blocks inside package { ... } blocks, you shouldn't need to spider.

@macovedj
Copy link
Contributor Author

So if I'm following correctly then for the scope of this PR, since mixed package declaration styles isn't yet supported, then in Alex's example where we have:

package a:b1 { interface foo {} }
package a:b2 { interface foo {} }

this would end up encoded with no exports:

(component
  (component $a:b1 (; output of `wasm-tools component wit -t for just `package a:b1` ;) )
  (component $a:b2 (; output of `wasm-tools component wit -t for just `package a:b2` ;) )
)

Is that right?

Then whenever we end up with mixed declaration styles being supported (maybe I"ll do that as a follow up or if it doesn't seem so bad I may try to include it in this PR) we could see both this and the current day encodings appear side by side in a single encoded wit file.

@alexcrichton
Copy link
Member

Indeed! So the encoder will basically say if there's one package do what it does today but with multiple packages being encoded they're encoded in this new format. Similarly the decoding process would need to recognize the new multi-package format and decode that in addition to supporting the previous single-package format.

There's still sort of a question of defaults, but I think that's reasonable to tackle at a later date. Specifically I'm wondering for example what the output of wasm-tools component wit would be for wasi:cli. Whether that by default transitively includes all dependency packages I'm not sure.

@macovedj
Copy link
Contributor Author

So I have implemented these changes locally for encoding multiple wit packages. When decoding though, I found that some modifications will be have to made here as the current tooling will try to parse the binary as a component instead of a wit package if no exports are present.

I think a convention needs to be decided on how to discern whether these are nested components in a component implementation vs being used to describe a wit package? I was thinking that maybe the component names can can provide a mechanism? If not then we could go back to expecting exports for these cases, and the convention could be established in how they're exported?

@macovedj
Copy link
Contributor Author

After some more thought... I'll actually probably investigate refactoring ComponentInfo into ComponentsInfo, with multiple packages in it. Then we could perform the existing check on each package, and it could also provide a clearer route to mixing package declaration styles.

@alexcrichton
Copy link
Member

That's a good point! I agree though that we can probably update how metadata is tracked and extend the "is empty" check to exclude the case of "no exports but has a bunch of components which all look like WIT packages"

@macovedj
Copy link
Contributor Author

macovedj commented Jun 21, 2024

Cool. I think I've essentially accomplished the thrust of what's involved in order to do to that now (locally)... As I'm tidying it up before requesting additional review, I'll be evaluating whether or not to include support for mixing package declaration styles or have that be a follow up.

@macovedj macovedj force-pushed the roundtrip-multi-package branch 2 times, most recently from 301c1a4 to fd677d1 Compare June 26, 2024 20:41
@macovedj
Copy link
Contributor Author

Hmm seems I botched something up switching branches... should be able to look into what went wrong shortly...

@macovedj
Copy link
Contributor Author

macovedj commented Jul 1, 2024

I think this should be ready for review now... just investigating why it's struggling to get gcc with yum in CI/CD.

let mut names = NameMap::new();
for pkg in self.packages {
let package = &self.resolve.packages[*pkg];
if let PackageKind::Explicit = package.kind {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused as to what this new PackageKind field is and how it's processed here. I think ideally this would be removed entirely since it seems to be an artifact of the original text encoding and shouldn't affect this binary encoding here.

Instead what I might recommend is:

  • If there's only a single package, encode in the old "v2 style".
  • If there's multiple packages, encoded in this "explicit" style where a sub-component is created.

I'd also recommend extracing the shared code between these two branches of the if since right now it looks like the body of the loop was copy/pasted.

Additionally I think long-term we'll want to move towards the "new v2 style" where all packages, even if it's just one, are encoded with a sub-component. In that sense I think it would make sense to have a few more knobs here like other transitions we've made with the binary format:

  • An env var should be added which if specified and is 0 or 1 indicates which encoding branch to use.
  • Faililng that a const with a doc block explaining what it's controlling.

Eventually in the future we'll invert the value of the const to default to the new encoding. Further in the future we'll remove the env var. Further still we'll remove decoding support for the original encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thought, and keeping in mind that we'll eventually want to support a wit file that has both an "implicit"/"unscoped" package declaration as well as an "explicit"/"scoped" package declaration, is the number of packages actually what we want to use as the criteria? If we do things this way instead of capturing the PackageKind... then I'm not sure how we can tell the difference between package a:b {} and package a:b if in the scoped case, it's the only scoped package. I think we'd end up transforming one package declaration style into the other when performing a roundtrip if this choice is made based on the number of packages that exist, which may be fine, but just wanted to make sure that we're considering this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the limit of time I would prefer to not spec/support multiple encoding formats for a WIT package in wasm. Right now we support v1/v2 and this PR is adding a third. I'd eventually like to sunset all the prior versions, however.

Given that my goal would be to use this new "v3" unconditionally. I'm proposing to distinguish based off 1-or-more packages purely for backwards compatibility. That way preexisting workflows don't break with older versions since the new format wouldn't be used by default. Once the new format is in enough supporting tooling it would then be enabled by default.

Personally I don't think it's too important to preserve package a:b { ... } vs package a:b; ... since they're semantically the same thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ok that makes sense to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thought that struck me...

Is there a use case where the scoped packages are used similarly to how the deps folder is used today?
In this case where we're mixing package declaration styles, the implicit package may be the one being used by guest lang tooling for knowing which world to target, and the others are there as deps. If we needed to encode in this case, then I think this logic would treat the implicit package like the deps, rather than it being treated as "higher" than the others? I guess perhaps we could rely on guest lang tooling to keep track of which package is the one that matters for targeting?

}
// Finally convert the decoded package to wasm again and make sure it
// matches the prior wasm.
let wasm2 = wit_component::encode(Some(true), resolve, &[decoded_package])?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised this test is passing because I would expect that the decoded_package argument should be the result of decoded.packages() now that multiple packages are supported. Could you investigate if a new test is needed to reflect that?

Copy link
Contributor Author

@macovedj macovedj Jul 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the assertion was wrong! accidentally setting wat2 to be wasm instead of wasm2 when checking the diff. You're right, when fixing the assertion, the test fails as is. Will push up changes to fix the assertion and handle decoded_package properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out when I fixed it fixed the assertion, it came out that the docs are being stored in the wat in a different order depending on if a wit is being parsed vs a binary, in such a way that it doesn't impact how the docs get printed I believe. Can fix the tests by performing a sort on the metadata fields, but still investigating how to just get encode things in the same order regardless.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a regression from this PR? Or a preexisting issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a preexisting issue.

if wasm != wasm2 {
        let wat2 = wasmprinter::print_bytes(&wasm)?;
        assert_eq!(wat, wat2, "document did not roundtrip correctly");
}

If I understand correctly, assert_eq is receiving two different wat samples created from wasm, neither represent wasm2... Was spending some time trying to dig into this, and it looks like modifying main also has this issue of doc metadata being reordered if I'm reading the code correctly...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, in that case feel free to leave this as-is and I can investigate it after landing this

cur_package.package_metadata = _package_metadata.clone();
} else if s.name() == "component-name" {
if let KnownCustom::ComponentName(reader) = s.as_known() {
let _packages = register_names(reader, &mut cur_package)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit hesitant to rely on a custom section for something so significant to the structure of the final WIT, would it be possible to avoid this inference of infer via other means?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I actually had similar thoughts when implementing... But I think that the way names of components are handled today is exclusively via custom sections? Will investigate some more, but if we want something similar to the wat outlined above where the name of the component is tied to the package name, I'm not sure of another way to get the information that informs a package name out of the binary format.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation and @since/@unstable are definitely all handled with custom sections, but to me those are semi-ok to strip at any time where the name of a package is significant enough that you'd never want it stripped. I'm mostly wondering if this can be inferred from the exported items or something like that rather than having to be listed in the name section. An alternative could be to export the components themselves from the outer component where the export name has the full package name (of sorts at least)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there's maybe an edge case here where you could have an empty scoped package? I think we could infer things from exports so long as there are exports, and as far as I know the only case where there wouldn't be is if the package is empty... Would you be comfortable erroring on empty scoped packages then? Or yeah, we could also do the alternative and export the components themselves...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukewagner do you have thoughts on this point specifically? The question being how is the package name recovered when multiple packages are encoded in a component and one package is possibly empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or yeah... I think I like your idea of exporting the component, seems like the easiest way to deduce the name of the component without relying on the name section... How the PR handles it at the moment, even if we can tell it's an explicit/scoped package from the way the exports are structured, I think we still need the name section to figured out the actual name of the package that should be printed from a binary (unless there's another mechanism I'm missing). Also curious to hear @lukewagner's thoughts about if we exported the components.

crates/wit-parser/src/decoding.rs Outdated Show resolved Hide resolved
crates/wit-component/src/encoding/wit/mod.rs Show resolved Hide resolved
crates/wit-component/src/encoding/wit/v2.rs Outdated Show resolved Hide resolved
crates/wit-component/src/encoding/wit/v2.rs Outdated Show resolved Hide resolved
let mut names = NameMap::new();
for pkg in self.packages {
let package = &self.resolve.packages[*pkg];
if let PackageKind::Explicit = package.kind {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the limit of time I would prefer to not spec/support multiple encoding formats for a WIT package in wasm. Right now we support v1/v2 and this PR is adding a third. I'd eventually like to sunset all the prior versions, however.

Given that my goal would be to use this new "v3" unconditionally. I'm proposing to distinguish based off 1-or-more packages purely for backwards compatibility. That way preexisting workflows don't break with older versions since the new format wouldn't be used by default. Once the new format is in enough supporting tooling it would then be enabled by default.

Personally I don't think it's too important to preserve package a:b { ... } vs package a:b; ... since they're semantically the same thing.

}
// Finally convert the decoded package to wasm again and make sure it
// matches the prior wasm.
let wasm2 = wit_component::encode(Some(true), resolve, &[decoded_package])?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a regression from this PR? Or a preexisting issue?

cur_package.package_metadata = _package_metadata.clone();
} else if s.name() == "component-name" {
if let KnownCustom::ComponentName(reader) = s.as_known() {
let _packages = register_names(reader, &mut cur_package)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukewagner do you have thoughts on this point specifically? The question being how is the package name recovered when multiple packages are encoded in a component and one package is possibly empty.

crates/wit-parser/src/decoding.rs Outdated Show resolved Hide resolved
crates/wit-parser/src/decoding.rs Outdated Show resolved Hide resolved
name: Option<PackageName>,
/// Wasmparser-defined type information learned after a component is fully
/// validated.
types: Option<types::Types>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be ok to only store the types at the "root" since I don't think a snapshot is needed along the way

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, at this point though I think I've been able to combine ComponentInfo with ExplicitPackageInfo, so I think it'll just have a value of None when for nested packages.

crates/wit-parser/src/decoding.rs Show resolved Hide resolved
@alexcrichton
Copy link
Member

I'm going to close this as the conclusion of this PR manifested in #1700 which is to instead of supporting the feature being added here instead explicitly don't support it. The encoding/printing phases will continue to work on one "primary package".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants