Convert protocol description to structured data format, generate everything from that #50
Here's the XML documentation format I started working on a while ago. I think it should be pretty self-explanatory, but I can provide some more info on what's going on if required.
Definitely a good idea! Now, it's nothing personal, but I don't think I can handle XML for one more nanosecond :D Humans are going to be writing and maintaining this spec, so the format should be easy for humans to work with. XML completely fails that criterion. I'm thinking either a modern lightweight format like {JSON, YAML, TOML}, or a simple home-brew that I can quickly whip up a parser for. It's not rocket surgery, after all :) I'll see if I can make an example later.
I wouldn't say that XML completely fails the criterion (sometimes it can be useful, especially when paired with a good editor... or maybe that's just me), but yes, I agree. Going with the 'easy to work with' criterion, I'm not sure JSON would be the best to use - it gets ugly fast, and I find it can be difficult to read (an alternative would be CSON, however I'm not sure how well supported it is). I've previously found YAML to be a bit too loose syntactically for my liking (but again, maybe that's just me and I just prefer strong nested structures like in XML), but IMO it's definitely one to consider. I've never actually used TOML before (just had a look at it now); it looks interesting and could potentially work well if we do it right. As for a custom format, that definitely could work. Since part of the goal of this project is to allow software to automatically generate protocol parsers from the docs, however, we might want to avoid this, as it will add overhead to implementing a parser for the documentation (whereas parsers for JSON/YAML/TOML would probably be available for most popular languages). I'll try to fiddle around with potential structures that I think could work with a few of these options sometime in the near future so we can compare how they'd look.
I agree that CSON is slightly better than JSON, and I'm also worried about the getting-ugly-fast potential of JSON. I'm much more worried about XML, however ;)
This is without comments, so far, but that would easily be added. Here's a preliminary example of the client packets:
And so on. The types I imagine are the following:
Almost the entire packet list can be specified like this. We also need to specify certain structs (for example, ships). It is a similar process:
A couple of points:
I haven't talked about bitfields, static/dynamic arrays, and delta objects, but I have a plan for those too. What do you guys think of this syntax? Anything I missed?
This is looking pretty good, but I'm not really sure I like the idea of including version data in the documentation - it seems like it could make some packets get very messy, and (depending on how it's done) might not be powerful enough for some changes through versions (e.g. the packet ID of something changing between two versions). Depending on the use for version metadata, I would propose we simply use Git tags to keep track of versions of the documentation for each Artemis version. Of course, this means that packet parsers that behave differently across different versions (i.e., they change the version of the protocol they use based on the version used by the other end) would be more difficult to create, but I'm not sure that's necessarily an important thing as, as far as I know, Artemis itself isn't backwards compatible protocol-wise. One more thing: you didn't include packet types/subtypes in the packet lists. I'm guessing those could look something like this?
I feel like this syntax/structure is simple enough that someone else is bound to have made some kind of markup language which would fit it, in which case I would feel more comfortable using that, as it's one less thing we, and implementors, have to create and maintain. One more thing: as far as I know, the enumerations are always used with the same type - this means that the enumeration type can be stored along with the enumeration, rather than where it is used.
Just did some experimenting with YAML, and came up with this as a potential syntax option (mostly based on your examples) - as you can see, it's surprisingly similar to your syntax in many ways (although there are some awkward bits, like defining types and subtypes). YAML also has a thing called 'anchors', which effectively allows a value to be named and then re-used later on - we could potentially use this for structures or enumerations, like I've shown below.

```yaml
---
enums:
  MainScreenView: &enums.MainScreenView
    - type: i32
    - Forward: 0x00
      Port: 0x01
      Starboard: 0x02
      # etc.
  ObjectType: &enums.ObjectType
    - type: i32
    - EndOfObjectUpdatePacket: 0x00
      PlayerShip: 0x01
      WeaponsConsole: 0x02
      # etc.
structs:
  Ship: &structs.Ship
    drive_type: *enums.DriveType
    ship_type: u32
    accent_color: u32
    unknown: u32
    name: string
client_packets:
  AudioCommand:
    - type: 0x6aadc57f
    # The ID for the audio message. This is given by the IncomingAudio packet.
    - audio_id: i32
      # The desired action to perform.
      audio_command: *enums.AudioCommand
  CaptainSelect:
    - type: 0x4c821d3c
      subtype: 0x11
    # The object ID for the new target, or 1 if the target has been cleared.
    - target_id: i32
```

I'm going to try out some other markup languages to see if I can find something that works well (because, as mentioned previously, I would prefer something where someone else has already done/is doing the work to maintain a markup language), but if not then I definitely like the layout and syntax of your idea.
Side note: I didn't include the version metadata in the code sample above (for the reasons I outlined above), but I believe YAML may have a feature that can do that - I'll see if I can figure something out.
Here's an example of what versioning could look like, using YAML's tag feature:

```yaml
structs:
  Ship: &structs.Ship
    drive_type: *enums.DriveType
    ship_type: u32
    !min=3.2 accent_color: u32
    unknown: u32
    name: string
```

The syntax of what goes in the tag would be completely up to us.
I haven't had a lot of luck with TOML, and I don't think JSON/CSON will provide enough features to result in a compact structure. Here's another crazy idea I had, though: S-Expressions. While this would require a custom parser (although parsing s-expressions isn't really hard), and probably isn't particularly practical, I thought it might be interesting to play around with:

```lisp
(artemis
  (enums
    (MainScreenView i32
      (Forward 0x00)
      (Port 0x01)
      (Starboard 0x02))
    (ObjectType i32
      (EndOfObjectUpdatePacket 0x00)
      (PlayerShip 0x01)
      (WeaponsConsole 0x02)))
  (structs
    (Ship
      (drive_type (enum DriveType))
      (ship_type u32)
      (accent_color u32 (min_version 3.2))
      (unknown u32)
      (name string)))
  (client
    (AudioCommand 0x6aadc57f
      (audio_id i32)
      (audio_command (enum AudioCommand)))
    (CaptainSelect 0x4c821d3c 0x11
      (target_id i32))))
```

But I digress. It looks like currently the only two real potential options are @chrivers' custom syntax and YAML - out of these, syntax-wise, I prefer the custom syntax; however, it does come with the issues I've discussed above. @rjwut, what do you think?
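For what it's worth, an s-expression reader really is only a handful of lines. Here's a minimal Python sketch (purely illustrative, not part of any proposal in this thread) that reads one expression into nested lists of atoms:

```python
import re

def parse_sexpr(text):
    """Read one s-expression into nested Python lists of string atoms."""
    # Tokens are parens or runs of non-space, non-paren characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        return tok  # plain atom

    return read()

tree = parse_sexpr("(AudioCommand 0x6aadc57f (audio_id i32))")
print(tree)  # ['AudioCommand', '0x6aadc57f', ['audio_id', 'i32']]
```

A real reader would also want error handling for unbalanced parentheses, but the core really is this small.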
I think that if we can make an existing data format work without too much difficulty, we should, so as to not have to maintain a custom parser. Of what's been shown so far, I like the YAML example the best. While I'm a fan of JSON and generally not a fan of syntactically-significant whitespace, I can see why YAML might be a good fit for this project. I was experimenting with a JSON implementation just to see what it would look like, but I agree that the anchor and tagging features of YAML would be useful for this. I'm not sure that I like using comments for descriptions; I get that it makes sense in that they are intended for humans and therefore not useful to code generators, but the documentation generator will need it, so it's actually data, not just a comment. Would a YAML parser throw the comments away, or are they accessible in the resulting data structure?
Lots of good comments here. To be honest, I think a few points have been overlooked :) It's late here, but I'll send a proper reply tomorrow.
That's a good point - I doubt the comments will remain accessible. One potential alternative is to use tags for the property type, with the comment as the value, like this:

```yaml
client_packets:
  AudioCommand:
    - type: 0x6aadc57f
    - !i32 audio_id: The ID for the audio message. This is given by the IncomingAudio packet.
      !enums.audioCommand audio_command: The desired action to perform.
```

However, this prevents us from using anchors, and I'm not sure if multiple tags (for versioning and property type) are supported. Going back to the comment-based style:

```yaml
client_packets:
  AudioCommand:
    - type: 0x6aadc57f
    # The ID for the audio message. This is given by the IncomingAudio packet.
    - audio_id: i32
    # The desired action to perform.
    - audio_command: *enums.AudioCommand
```

Using this style, maybe we could move comments into the list items, like this?

```yaml
client_packets:
  AudioCommand:
    - type: 0x6aadc57f
    - _: The ID for the audio message. This is given by the IncomingAudio packet.
      audio_id: i32
    - _: The desired action to perform.
      audio_command: *enums.AudioCommand
```

YAML requires all items in a map to have a key, hence why I'm using `_`.
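If that last `_`-keyed layout round-trips through a YAML parser as a list of small maps, pulling the descriptions back out is straightforward. A sketch with plain Python literals standing in for the parsed YAML (so it doesn't depend on any particular YAML library; the data shape is assumed, not tested against a real parser):

```python
# What a YAML parser would plausibly hand back for the AudioCommand entry above.
audio_command = [
    {"type": 0x6AADC57F},
    {"_": "The ID for the audio message. This is given by the IncomingAudio packet.",
     "audio_id": "i32"},
    {"_": "The desired action to perform.",
     "audio_command": "enums.AudioCommand"},
]

def fields_with_docs(items):
    """Yield (field, type, description) triples, skipping packet-level metadata."""
    for item in items:
        if "type" in item or "subtype" in item:
            continue  # header entry, not a field
        doc = item.get("_", "")
        # Exactly one non-"_" key per item in this layout.
        (field, ftype), = [(k, v) for k, v in item.items() if k != "_"]
        yield field, ftype, doc

for field, ftype, doc in fields_with_docs(audio_command):
    print(f"{field}: {ftype}  -- {doc}")
```

So the comments-as-data concern is solvable in this layout: the description travels with the field instead of being discarded as a comment.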
Since we now know the canonical names for each packet type, I'm guessing we want to replace the hex packet type values with those names?
Hey guys, I've been a bit busy. I'll try to do a proper writeup of my thoughts on this, asap. The current suggestions are not terrible, but there are some crucial pieces of information missing. Give me a day or two! :)
Hey guys! Sorry for the last reply being kind of vague and ominous - I was a bit pressed for time :) Here are my thoughts on the progress so far:

**I can understand the hesitation to use a custom format**

"Not invented here syndrome" is definitely a valid point of concern. However, I also think you might be too afraid of inventing a small wheel when existing wheels don't quite fit. Ok, terrible analogy. I have written parsers for countless things, and the example I gave was specifically designed to be parseable by a small handful of regexes. In fact, I would make it a design requirement that we keep a simple list of regexes in a "grammar" file. This would ensure that the complexity never gets out of hand, and that other parsers could easily be written. So, I still think the custom format would be appropriate, which leads me to:

**We are trying to hammer a round protocol into a square YAML**

YAML represents a data structure, and like JSON (et al.) it is almost exclusively concerned with values, not structure. The proposed way of using YAML is deceptively alluring, but has several problems. First, comments are thrown away by (almost?) all parsers. Second, anchors are usually not visible after parsing. Third, we NEED size specifiers on enums and bools. Fourth, using tags, while certainly possible, would probably have to be so prevalent as to be as complicated as just writing our own small regex-based parser. Fifth, we definitely need to support version data, and not on a different branch. Which leads me to:

**YAML (and JSON, etc) are data formats, not grammars**

Let's consider for a moment what we want to achieve here. I want the protocol to be in an introspectable format, because generating, checking and maintaining protocol code in several different languages (possibly even for different protocol versions) is boring, difficult and error-prone.
By having a common source of protocol truth, we can generate code for all the languages and projects we want, and not worry about whether some implementations lack a certain field, or got updated when we learned what unknown_field_17 did. It also means that the docs and code can always stay in sync. And of course, anyone is still welcome to implement by hand, but this change makes it possible to gradually change to generated code, for the boring bulk of the code, if nothing else. To be able to generate the protocol de/serializers, we need to know the exact size layout, we need to know some (rather simple) parsing rules, and we need to know the mapping between bytes and values. In my mind, we would have a common spec parser, and then a small generator for each language that we want to generate something for, along with a set of templates for other boilerplate code that isn't strictly related to the protocol.

**Half parsing, half structure**

One part of the challenge here is to describe the exact format of various pieces of structured data (packets). The other challenge is how to make decisions on how to parse them. This part includes packet type determination (by ID + possible sub-ID), arrays (fixed-length or token-delimited?) and bitstreams (how do we parse flags?). If we want to simplify the project, we could skip the parsing specification for now, even though that's probably pretty simple to do. When we make a parsing description, the canonical packet names from #52 would be an ideal addition to the grammar. From there, it would be a series of "match and branch" tables, something like:
Yes, there are a few corner cases we would have to figure out (like the damned inconsistent array formats), but that is not insurmountable. Personally, I'm less and less convinced that YAML will be a good fit for this - and I don't think JSON is any better (worse, probably). I'm still very much open to arguments one way or the other. We could start very small, and just describe enums in this (or some) format, and generate the index.html from a template that is 95% index.html and 5% generated content. Then we would have a starting point. At the same time, we could generate enums for the languages we use. Already at that point, that would have significant value. I look forward to hearing your comments on this :)
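The "simple list of regexes in a grammar file" idea can be sketched in a few lines of Python. Everything below is invented for illustration - this toy grammar is hypothetical, not the actual format proposed in this thread:

```python
import re

# Hypothetical two-rule line grammar:
#   "object <Name>"        starts a packet/struct definition
#   "    <field>: <type>"  declares a field (indented)
RULES = [
    ("object", re.compile(r"^object (?P<name>\w+)$")),
    ("field",  re.compile(r"^\s+(?P<field>\w+):\s*(?P<type>[\w.]+)$")),
]

def parse(text):
    """Parse the toy grammar into {object_name: [(field, type), ...]}."""
    objects, current = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        for kind, rx in RULES:
            m = rx.match(line)
            if m:
                break
        else:
            raise SyntaxError(f"unparseable line: {line!r}")
        if kind == "object":
            current = objects.setdefault(m["name"], [])
        else:
            current.append((m["field"], m["type"]))
    return objects

spec = """
object AudioCommand
    audio_id: i32
    audio_command: enums.AudioCommand
"""
print(parse(spec))
```

The point of the match-first-rule-that-fits loop is exactly the "grammar as a list of regexes" property: adding a construct to the format means adding one regex, and any unmatched line is a hard error rather than silently swallowed.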
Honestly, the more I think about it, the more I feel like that I wouldn't make use of code generation anyway for IAN, mainly because of certain intelligent enhancements (mostly in the form of static helper methods) that I have put in my code that generated code would lack. Some examples from IAN, just with the enums:
You get the idea. There are a bunch more in the packet classes. Given that I'm unlikely to use this to generate code, the only thing it does for me is generate documentation, at which point we've only shifted the work from one document using a well-understood HTML syntax to another with a proprietary syntax, with the added burden of creating and maintaining a parser. Perhaps we're going about this the wrong way. What if we could establish some conventions and make some modifications to the existing HTML documentation so that the data we want can be parsed from it with some XPath expressions?
I completely agree that missing all those pieces of implementation would be a dealbreaker. However, that was never the plan. In my Rust implementation, I have the same situation. My suggestion would be to have template source files that contain all the additional code. Depending on the language, this could be done in a number of ways. For example, I'm fairly sure you could make Java classes that inherit the basic grunt work of the protocol parsing, and add all the logic you need? Then there's no conflict, and the generation part really is not difficult. What do you think?
Alright, I think I can see now where our opinions on what exactly the structured format should do differ. I also have a few other thoughts to add, but I'm somewhat busy right now, so I'll try to do a 'proper writeup' as soon as possible (mostly in response to @chrivers' monolithic comment 😛).
Apologies for taking so long, but here are my thoughts - first off I'll explain how I think our opinions are differing (and what exactly my opinion on what this format should do is), and then I've got a few smaller comments/questions that are related. I can pretty much see where our disagreements are coming from, from the title "YAML (and JSON, etc) are data formats, not grammars" in #50 (comment). @chrivers's concept language for the documentation format is a DSL/grammar (akin to language parsers), which I guess makes sense. But I feel like a grammar language is far too complicated and unnecessary for what we need - the Artemis packet format is consistent enough that we can make assumptions in order to simplify what our documentation actually specifies. For example, this line from the example code above:
Two things here seem odd to me - firstly, specifying that the program should read a value describes behaviour rather than structure. Put simply, I don't think it should be up to our document to define how things are done. We should define the structure of those things, and provide some primitives in order to allow us to do that (e.g. specifying that something is an array or a bitmap), but it is going too far if we are writing code to specify how to do those operations. This is why I think a data format would best suit this project. A few other points:
Related to point 4, a few other things from #50 (comment) (these are probably verging on nit-picking, so don't take them too seriously)...
You have, yes. However, potential future users of this documentation may not have, and this restricts the growth of a potential community.
So now we have a grammar file for our grammar file? Also, different languages support different regex features - does this mean we'll need a grammar file for our grammar file for our grammar file, in order to list which regex features we require? Now for a few other general comments on things...
Correct; I did come up with a potential solution to that (for YAML, at least) - see above.
The original purpose in proposing anchors is that, if we design the format well, they wouldn't need to be visible after parsing - effectively also allowing "anonymous enumerations" and possibly other things such as arrays of property lists, etc.
Size specifiers on enums relate back to what I was talking about with assumptions. From some past experience working on a JS library for Artemis that used a similar format to what I'm proposing, enumerations are always used with the same size (e.g.
I'm not sure I see your point here. Depending on the parser, I believe that tags are either simply attached to the value, or a callback defined by the code using the parser is run that allows it to modify the value to be injected into the final result (in which case you could just attach the type to the value very easily). It's not very complicated.
I would like to hear your reasons behind this. Keep in mind that Git does have a feature called 'tags' that allows you to mark the repository at a certain commit (often used to mark versions) - there's no need for different branches. The reasoning behind my proposal to use tags instead of embedded version data is simply that embedded version data means we have to consider every possible way the protocol is likely to be changed and account for it - this could include enum changes, packet types could change names, fields could change types, etc. IMO it would end up getting far too messy and difficult to modify if we try to keep all of this information in the one file with embedded version data, and it would likely add a lot of complexity to the parser, no matter how we do it.
Continuing with assumptions, why do we need both the integer and string types here? Since we know how to convert from string to integer types, wouldn't it make sense just to use string types throughout? To wrap things up, I guess my opinion is that the parser generator (or simply a parser that consumes the documentation files) doesn't need to be completely dumb - we can definitely make assumptions on things in order to simplify the documentation. As a result, I think using a grammar-type language is overkill - we only want to define the structure of packets, not define how they should be parsed. Also, we need to clear up exactly how much this documentation format should be doing. In my opinion it should just be the low-level stuff: enough to generate an API where the user can do something like:

```js
server.send('CaptainSelect', {
    target_id: 500
});
```

The user is then free to build whatever they want on top of that - this includes static helper methods and other APIs. I guess this is similar to what @chrivers explained in #50 (comment). One more thing I wanted to mention: there's no requirement that this documentation file is used to generate code - it should be just as easy for a parser program to load the file and then use it to parse packets (i.e. no compilation step needed). While I don't think this would actually change anything, I thought I'd mention it anyway to ensure it's accounted for. Hopefully I've helped to clarify a few things on what I think the documentation format should look like. Thoughts?
No, because it's common practice to run a network sniffer (Wireshark et al.), or otherwise print out the raw values of unknown (or doubtful) packets. This means that seeing the integers on screen is not unheard of, and it's nicer to have those in a readily readable form.
The generated documentation would, I assume, still have the integer types (they could be calculated by the program that generates the documentation HTML files); I'm just talking about whether we need to store the integer types, as they can be calculated from the string types.
I thought that was my line ;-)
Ah! Yes, you are entirely correct, of course. I was definitely going for a minimalistic DSL, to have a way of describing the protocol data in a sane way. I think it's definitely the right way to go, and I think the worries over complexity are overblown - however, I can clearly see that there's not a huge impetus to go in this direction... :-)
Ah, perhaps the "read" was a poor choice of wording. The idea was that each "parser" (perhaps also a poorly named entity) would be a simple "match 1 value, take 1 action" type thing. Allow me to construct a slightly more fleshed-out example:
The idea was that any "parser" entity takes a value, compares it to some other values, and takes an action. The right-hand-side names aren't strings, they're other entities. If the next entity is a parser too, we repeat the process. If it's an "object", we read that, according to its list of fields. It's really a quite simple system, I think. It also allows us to succinctly and precisely describe parsing of all subtypes, which is something that is not super clear right now. For example, many (but not all!) subtype IDs changed between u32 and u8 between versions 2.1.1 and 2.4.
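As a rough illustration of this entity-chain idea, here is a Python sketch of nested "match 1 value, take 1 action" dispatch tables. The packet IDs come from examples earlier in the thread; the table encoding itself is invented for illustration:

```python
# Leaf entities: object descriptions, i.e. ordered (field, type) lists.
AUDIO_COMMAND = [("audio_id", "i32"), ("audio_command", "enums.AudioCommand")]
CAPTAIN_SELECT = [("target_id", "i32")]

# A "parser" entity is (width-of-value-to-match, {matched value: next entity}).
# The next entity is either another parser (here, the subtype table) or an object.
SUBTYPE_TABLE = ("u8", {0x11: CAPTAIN_SELECT})
PACKET_TABLE = ("u32", {
    0x6AADC57F: AUDIO_COMMAND,   # AudioCommand has no subtype
    0x4C821D3C: SUBTYPE_TABLE,   # shipAction-style packets branch again on subtype
})

def resolve(table, values):
    """Walk the dispatch chain, consuming one matched value per level."""
    entity = table
    for v in values:
        _width, branches = entity
        entity = branches[v]
        if isinstance(entity, list):  # reached an object description
            return entity
    return entity

print(resolve(PACKET_TABLE, [0x4C821D3C, 0x11]))  # CAPTAIN_SELECT's field list
```

The `u32`/`u8` widths on each table are the part that captures the 2.1.1-vs-2.4 subtype-size change: a different protocol version would simply carry a table with a different width.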
I completely agree - the spec should not be code! The "read" keyword was a bad choice. It's just the type of the value that the list is matched against. How this translates to parser code in real life, the spec makes no assumptions about. However, the fact that you need to compare a u32 to a list of known values is both valuable information, and something that any implementer absolutely has to do.
Well.. maybe. The spec isn't code - I completely agree with that. However, a fairly complicated YAML encoding is also not super easy for humans to read. Hopefully, the documentation is the easiest-possible specs for humans :)
Well, I certainly disagree here. We can standardize the syntax (YAML, for example), but we still need to agree exactly how it is used, especially since YAML is not a completely natural fit for this kind of thing. Realistically, I don't think this will be a problem in practice, with either YAML or a custom format.
For the syntax, yes. But you need to know what to do with the data.
I agree with a), certainly. Although it's really not a huge implementation, or much work. Since we are discussing the theoretical aspects, I would argue that b) would be just as easy to get started with, especially if we (or I) contribute example generators for a few good use cases.
Yes, I have :) I meant it more in a "therefore, I could get us up and running quickly" way. Right now we have no parsable format, so having one that perhaps has 1 downside, really isn't worse :)
Wait, wait. You just said it was bad if people would find this hard to parse, and now it's bad that we are helping them? You can't have it both ways, surely? ;-)
Well, it... ehm... I'm trying to be diplomatic here... ;-) The problem is, the file looks fine, but the data structure is bordering on insane. And it parks us in quirks-land. If the comment is blank, then a) it has to be there for it to work, and b) YAML parsers don't have consistent behaviour with "weird" keys. Sorry to say, but it's not my favorite. Maybe we could do something with tags instead?
I'm writing in a statically-typed language (Rust), and I need the enumerations to be there after parsing, otherwise I can't use this format. I know things are a little easier in soft-statically typed languages (Java, C#) or dynamically typed languages (Python, JS, etc).
Ah, that's a good point. I saw a whole lot of changes between 2.1.1 and 2.4, but it actually seems that most enums are either 32- or 8-bit, pretty consistently. However, when writing a parser, it is much easier to have the needed information in one place - that is, with the packet descriptions. Otherwise, one would have to jump all over the document to find out the field sizes - what's the gain in that? Although, if we generate the docs from this, we can probably save it just on the enum, and then write it everywhere we need, in the docs.
Well, git tags certainly would not work here. That would imply that we never learned anything new about the (for instance) 2.1.1 protocol, or that we would have to do some serious rebasing when we wanted to update the 2.1 version. That doesn't make sense to me. Branches are also no good, since we will then have ~95% shared code, but now suddenly writing a single library that speaks more than one version (without straight up having 2 complete libraries) becomes the much more difficult task of figuring out all the differences between 2 YAML data structures. That's quite a lot more difficult than tagging the few differences we do know about.
It's hard to say - I think separate branches would be quite a lot more complicated, especially for using more than one. Otherwise, we could just give up documenting old versions, and refer people to the historic git versions for reference, but I don't like that either.
As @IvanSanchez pointed out, it's very nice to have the hex values for network sniffing. Also, the "strings" here are meant to be references, as noted earlier.
I think that would be a nice feature, but if that's the deciding factor, we can certainly cut away that part :)
I'm not quite following? The spec should describe the protocol serialization format. So far, there's only very little about semantics (a small paragraph about common packet exchanges at the beginning of games, but that's it).
Thank you for taking the time to answer my mammoth post with a sibling! :) I'm still firmly of the opinion that it's much easier to do this with a custom format than everybody seems to think it is. However, if no one else wants this, I don't see it happening. There is another potential way forward. If we changed the HTML docs to be improved in certain ways, and to follow a strict style, then we could create a small program that checks the parseability of the HTML docs, without generating anything. Like the "checkpatch" tool that the Linux kernel uses to check for programming style, etc. That way, we could simply clean up what we have, and it could be used without much change - but at the same time, we could enforce the validity with a simple HTML parser that checks that certain simple conventions are kept in place (such as all fields having a data type and a size, for example). Thoughts?
Sssshhh 😛
Yeah, I understand what you meant, but I still feel like even having a separate 'entity' in this manner is still overcomplicating a bit - why not just provide a list of packets by name?
Again here I see what you mean, but I feel like representing it in this way is overcomplicating it compared to how we can represent it with a data format. Re-using your example:

```yaml
client_packets:
  StartGame:
    - type: startGame
    - difficulty: u32
      game_type: *enums.GameType # of course we don't need to do it like this, I'll come back to that
```

I just feel like this kind of format is easier to understand/read and write (especially for those who potentially aren't familiar with it). It isn't as flexible as your approach, but I really don't think we need that flexibility.
Good point, but I'm not sure this is solved by your example either (at least in its current form), as parsers expect the same type for the field they are looking at. This is definitely something to consider, however.
If we do it properly, I think it should be easy enough - of course the same thing goes for a custom format, but IMO what we've currently got with respect to YAML is easier to read than the current custom format (of course I'd be biased, however).
You are correct; however, with a custom format a user would potentially need to implement both a syntax parser and then something to actually run through the data and do things with it (these could be part of the same code, but they're effectively still different processes).
I think you misinterpreted my point - I was just trying to point out the irony/complexity in requiring a grammar file for a grammar file (and then maybe a grammar file for a grammar file for a grammar file). It was mostly tongue-in-cheek though, so not really important.
Yeah, those are valid points - I mostly proposed that just as a potential initial solution. Side note: I don't think the key/value pair for the comment would have to be there for it to work... I suppose this depends on the language/parser, but surely the `_` entry could simply be omitted when there's no comment?
Unfortunately, I think tags are pretty limited as to what characters they can contain - they're parsed as identifiers, so I don't believe they can have spaces.
That's a good point - but keep in mind we don't have to use anchors; something like the following would work fine:

```yaml
client_packets:
  StartGame:
    - type: startGame
    - difficulty: u32
      game_type: GameType
```

This would definitely need further investigation, however.
I don't really think this would be a problem - couldn't the type just be looked up from the list of enumerations that have been defined? (This would be especially easy, as the document would be parsed into an AST before the packets are all looked at to generate code, or whatever we're doing with the file.)
Hmm, interesting points. I definitely agree that branches are a no-go, and I can see what you mean with tags, but I just don't think that in-text version information (at least in the form you previously presented) is the best way to go - it seems like it will get very messy, very fast.
This is pretty much what tagging would be.
Keep in mind that the document isn't intended to be used in this way - that would be the purpose of the generated documentation HTML (unless it's a program requiring that information, of course).
Yes, but as I mentioned previously, IMO the whole thing with references and separate entities in that manner is overkill and makes the documentation harder to read/edit when required.
Yeah, don't worry about that... re-reading that bit I wrote, the point I was trying to make didn't really make a lot of sense.
Any time 😉
I like this idea, although I think a documentation file in a different format would potentially be better (if we can agree on how to do it, I guess 😛). This is, in a way, even vaguely bordering on my original XML idea - perhaps we could write an XSLT to transform the XML to HTML and display the site that way? (Just an idea, although I know you're not likely to agree with it.) And, I suppose, in the words of @chrivers...
So since it may take a while to figure out exactly what this documentation format should look like (and it would be a pretty big change from what we've currently got), I suppose the place to start would be re-arranging the markup to a standard format as @rjwut suggested - that way we get a parsable structure quickly, as well as the ability to easily convert this into a different format if/when we figure that out. I'm willing to do some work on this and submit a pull request, does anyone else want to try it? |
Hey - I'll make a, haha, "proper writeup" soon ;-) In the meantime, I'm a bit pressed for time. However, I thought it was easier to show you guys what I thought, instead of arguing about it, so after about 2-3 hours of coding, I've made great progress on a parser (almost complete) and a rust code generator (half done). Give me just a short while to clean some things up, implement a few examples, and then I'll present it. If you still don't like it, then at least I've tried :) So far, the entire parser is a whopping 88 lines of python. The code that generates rust modules for ClientPacket, ServerPacket, enums and bitfields is 51 lines of python. Sure, there are features missing, but it's really quite manageable, and very easy to read and maintain the input files :) |
Ok, so it turns out I ended up implementing the complete solution. I now have a (still quite small) parser for the custom format. This is then connected to Mako, which is a standard templating system, with good documentation. The templates then inspect and loop over the data structure as they see fit, which means absolutely any type of code or docs can be generated. I'll clean up a few loose ends, and show you the result. I hope you'll like it! At this point, the tool is so useful that I can't go without it :) Of course, this also means I've already converted the entire protocol spec to the new format. It's really very easy to read, and I hope we can all benefit from it. I'd be happy to help write generators for existing use cases like docs and Java, if anyone is interested :) Give me a day or two for real-life obligations, then I'll be back |
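The pipeline described above (parse the spec into a data structure, then let templates loop over it) can be sketched in a few lines. The real tool feeds Mako templates; this stdlib-only sketch uses plain string building instead, and the `GameType` enum values are invented for illustration:

```python
# Minimal sketch of spec-driven code generation, in the spirit of the
# transwarp pipeline: a parsed definition becomes a plain data structure,
# and a generator renders target-language source from it. The real tool
# uses Mako templates; this uses simple string formatting instead.

# A parsed enum definition, as a plain data structure (names invented)
enum = {
    "name": "GameType",
    "values": [("Siege", 0), ("SingleFront", 1), ("DoubleFront", 2)],
}

def render_rust_enum(enum):
    """Render a parsed enum definition as Rust source text."""
    lines = ["pub enum %s {" % enum["name"]]
    for name, value in enum["values"]:
        lines.append("    %s = %d," % (name, value))
    lines.append("}")
    return "\n".join(lines)

print(render_rust_enum(enum))
```

The same data structure can just as easily be fed to a documentation template, which is the point of separating spec from generator.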
Okay, so I promised I would get back in "a day or two".. that was five days ago :) I haven't been sitting idly by, though. As I mentioned, I ended up implementing the custom parser and generator to try out the idea. After lots of work on it, I can unequivocally say: it WORKS, and it's AWESOME :D I'm currently using a 100% generated protocol serializer/deserializer, including support for arrays, enums, structs, dynamic array sizes, etc. I want to give you guys an overview of what I've worked on here:

### Isolinear Chips

The isolinear chips project is a complete specification of the Artemis 2.4.x network protocol, in the STF format. Even if no one else wants to use this, the utility for me is so great that I'm definitely going to continue development of it. I very much hope that I can convince you that we should render the HTML docs based on this data source. We can make it a completely smooth transition, starting at 0% dynamic content, and slowly extending it.

### Transwarp

The Simple Type Format is not artemis-specific in any way! It's a generic format for describing network protocols and data structures. To do anything with STF files, you use the transwarp compiler.

### Tricorder

When using transwarp to generate all the protocol parsing code, it's extremely easy to try another field layout. Simply change a few lines in the specification, regenerate, and recompile. The risk of introducing errors is basically zero, since the program and the spec always stay in sync. The next challenge is testing the protocol code against a corpus of real-life data packets. I've had some long, and very fruitful, discussions with @NoseyNick about collecting and handling game network data to test against. The tricorder utility is going to make it very easy to work with packet dumps. Features include binary parsing, hex output, frame splitting, deframing (deadbeef-header removal), and packet type searches. Using this tool, one might search the raw corpus for all instances of a type of message, and collect them in a single file. This allows us to generate any set of test data we want, based on a common collection of captured streams.

### Holodeck

Tricorder is the tool to work with network dumps, but it does not include any corpus data. In the holodeck project, @NoseyNick and myself aim to collect a corpus of game data, to test parsing, do research, and test out new ideas for packet layouts. This project may or may not end up on github, due to the potential sheer size of the data. Disclaimer: I did not introduce these project names to @NoseyNick before mentioning them here, so they might not be final :)

### Next steps

I'm currently working on getting transwarp cleaned up, writing some examples for it, and putting it on github. This should give everybody a good chance of seeing how it works in real life. I'm very much looking forward to hearing your comments on the format, and the outlined ideas. I've aimed for good internal consistency in the format, but feedback is always welcome. If everybody is more or less on board with the format, then I hope we can discuss a transition plan for the docs, so we can gain the advantages that are to be had here :) |
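The "frame splitting / deframing (deadbeef-header removal)" feature mentioned above can be sketched briefly. This assumes each frame starts with a little-endian u32 `0xdeadbeef` magic followed by a little-endian u32 total frame length, as in the Artemis wire format; the sample payloads are made up, and the real header carries more fields than shown here:

```python
import struct

# Sketch of deadbeef-based frame splitting: walk the stream, check the
# magic at each frame boundary, and slice out each frame's payload.
MAGIC = 0xdeadbeef

def split_frames(stream):
    """Return the payload bytes of each deadbeef-delimited frame."""
    frames = []
    offset = 0
    while offset + 8 <= len(stream):
        magic, length = struct.unpack_from("<II", stream, offset)
        if magic != MAGIC:
            raise ValueError("bad magic at offset %d" % offset)
        # here, length counts the whole frame, including these 8 header bytes
        frames.append(stream[offset + 8 : offset + length])
        offset += length
    return frames

# Two fake frames: 8-byte header + 4-byte payload each
data = struct.pack("<II", MAGIC, 12) + b"AAAA" \
     + struct.pack("<II", MAGIC, 12) + b"BBBB"
print(split_frames(data))  # [b'AAAA', b'BBBB']
```

Searching a corpus for all instances of one packet type is then just a filter over the frames this produces.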
The name Data would seem more appropriate, but also far less useful as a search term, so Holodeck will do fine :-) I'll see if I can make good use of your Isolinear Chips in my Perl. Thanks! |
@NoseyNick Yeah, I agree :D But "data" is probably going to have a million or so search results.. oh well ;)

@everybody I've implemented transwarp as a complete compiler project. This is not just a one-off script file. It has a user-friendly command line interface, documentation (more coming), and should be relatively straightforward to use. Take a look here: transwarp project (https://github.com/chrivers/transwarp) Now, in terms of good examples, I'm not 100% there yet. However, the complete artemis protocol spec is available in the isolinear chips project (https://github.com/chrivers/isolinear-chips) Also, my complete rust templates are available in the duranium-templates project (https://github.com/chrivers/duranium-templates) Using the transwarp compiler on this dataset, you can see for yourself how a complete parser can be generated. Granted, it's in rust, but I'm sure you can imagine that this can really be used to generate absolutely anything. I think the next task will be creating a documentation template that also runs on isolinear chips. @rjwut what are your thoughts on slowly transitioning the documentation to a more generated format? I'll help, of course. I look forward to hearing your thoughts on this. The future is now ;-) |
This is really great work, and I definitely don't want it to be in vain, but I think we still need to have some discussion and refine everything to make it even moar awesome before we begin the transitioning process. A small nitpick - I think the name of the A small side-note that may be worth considering for future development:
I really don't think we should be making this as a generic format - it complicates the format and adds a lot of unnecessary info to this data; this relates back to some points I've previously made, and overall I guess my point here is that we can assume this format will only be used with the Artemis format. I haven't yet wrapped my head around the whole thing so this point might not be relevant, however. Also, if I have the time I may look at setting up a way to run automatic missions to collect even moar packet data for Holodeck - would you guys be interested in this? |
Also another small nitpick: Simple Type Format (the name) seems really generic - it doesn't actually really describe the format at all (other than saying that it's simple, which I'd say is arguable :P). If you really want it to be generic, how about something like Binary Schema Format? (Schema may not be the best word here, maybe Binary Structure Format or similar might be better). If this is gonna be Artemis-specific, something like Artemis Packet Format might be better. |
Oh, a few more things I forgot about with respect to Isolinear Chips:
Alright, I think that's all for now. I'm planning on trying to write a small STF parser in JS to test things out for asbs-lib, so I'll let you know if I come across any difficulties. |
Oh, yes, definitely :) I didn't intend for this to be the end-all-no-discussion version. But I thought, that since this issue is already of.. astronomical length (heh), it would be easier to make it, then show it :)
Good suggestion! I'm actually working on a (planned) slight change to the format already. All section names (enum, object, parser, etc) are going to be completely free-form! This means that we can just use the ones we think are the most descriptive. All sections can have sub-sections, and so on. This is almost where we are right now, but I took a few shortcuts to get this out the door.
Good point! It's actually very much deliberate. The compiler doesn't care which stf file a definition comes from, so grouping it into different files is just to make it easier for us. We can reorganize at any time, without affecting the output at all. With FrameType specifically, I put it in parser.stf, because that's where it is used. It was (and still is, when searching, editing, etc) much more useful to have it close to where it is referenced.
Good idea! The consts part is being addressed in the next version :)
It's already generic.. :) All I mean by this, is that I didn't make any artificial limitations, or gross assumptions about what it's used for. The templates are very (100%) specific to the project they are used in, but there's really no need for the stf itself to be. It's really just a markup language.
Oh yes! Please! :) Can you send me an email, and we can coordinate further? @NoseyNick and myself are still working on the challenge of having a reasonable capture format, generating test corpuses, etc. The ascii-based data dump format we use is (or rather, will be) documented in https://github.com/chrivers/tricorder. Right now, that repo does not have the newest stuff implemented. Perhaps you can capture in pcap format first? Then I'm sure we will find a way to convert the data later.
You know, that bugs me too. I think the best option would be to find a short, memorable name. That's what I tried to do with the other Artemis-related projects I made. Suggestions welcome, it's still in beta :) |
Yeah, it's not very logical. It's the size of the bitmask, in bytes. In the next version, all structures can have associated constants, as well as the body. So this would be:
(also, see below)
Good point! Again, there's bound to be a few rough edges here and there, and this is probably one of them. However, it is vital that the type is marked. Otherwise, the parser complexity explodes completely. Right now, the parsing rules for types are quite simple:
Now, without the struct type markup, there's no way to tell if, for instance, "f32" is the name of a struct, a built-in primitive, or something completely different? It's possible to fix, but it would require a lot of namespace code in the compiler, for no real perceived gain.
Ah, it's actually pretty deliberate! But I didn't get to that part of the documentation :) Notice how enums and flags (which also use "=") have values, whereas structs, packets, etc have fields with types. Anything after "=" is parsed (as an int, for example) while ":" denotes a type is coming up. That should be clearer, I agree.
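The "=" vs ":" rule described here is easy to sketch: anything after "=" is parsed as a literal value (an enum member, say), while ":" signals that a type name follows. The exact STF line grammar below is my approximation, not the official one:

```python
# Rough sketch of the '=' vs ':' distinction: '=' introduces a parsed
# value (here, an int, with 0x.. hex accepted), ':' introduces a type.
# The line syntax is an approximation of STF for illustration only.

def classify_field(line):
    """Return ('value', name, int) or ('type', name, typename)."""
    if "=" in line:
        name, _, rhs = line.partition("=")
        return ("value", name.strip(), int(rhs.strip(), 0))  # base 0: allows 0x..
    name, _, rhs = line.partition(":")
    return ("type", name.strip(), rhs.strip())

print(classify_field("Siege = 0x00"))     # ('value', 'Siege', 0)
print(classify_field("difficulty: u32"))  # ('type', 'difficulty', 'u32')
```

With this split, an enum body and a struct body can share the same line-level parser and only diverge on the right-hand side.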
I completely agree with this sentiment... but! It turns out the artemis protocol is so bloody inconsistent that this isn't true! I have 2 examples right now, but I think there's been a couple more I've seen in previous versions:
Now maybe, maybe one could do a few assumptions and other unsavory things to normalize this, but who's to say when the next crazy thing will happen?
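The width inconsistency being discussed can be made concrete: if the same enum appears on the wire as a u8 in one packet and an i32 in another, the field (not the enum) has to declare its width. The packet layout and values here are invented for the example:

```python
import struct

# Illustration of per-field enum widths: the reader takes the width from
# the field definition, so one enum can be serialized differently in
# different packets. Layout and values are made up for the example.

def read_enum(buf, offset, width):
    """Read an enum value stored with the given byte width."""
    fmt = {1: "<B", 4: "<i"}[width]
    (value,) = struct.unpack_from(fmt, buf, offset)
    return value

# The same enum value written once as u8, once as i32, back to back
wire = struct.pack("<Bi", 2, 2)
assert read_enum(wire, 0, 1) == read_enum(wire, 1, 4) == 2
```

This is why attaching a single fixed width to the enum itself would force exactly the kind of normalization hacks mentioned above.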
Sure, let me know if you need any help :) If you don't mind me asking, do you intend to implement a templating system/some kind of generator as well? Certainly, more implementations are better than fewer, but isn't it a little bit of reinventing the wheel we just made? :) |
Perhaps we should start making issues on the Isolinear Chips (and related) repo/s for discussing things now - that might help to organise thoughts better and avoid massive responses like this one.
By this, do you mean that sections don't have a type?
Yeah, I thought that would be the case, it's just that all the files seem to be named by the types of sections they contain, and then there's this random enum. Just curious, how does the parser decide which stf files to open? Does it just open all of the ones in a folder?
But AFAIK it's not actually used by the STF files at all, but instead read by the parser, which is why I think it should be a different type (all other enums are used by the program) - but I suppose this doesn't matter anymore with the free-form sections.
Yeah - I suppose my stance on this project has been that we should make the structure language specifically to describe the Artemis formats, as it allows us to simplify some things in the documentation. But of course, there are pros and cons to each side.
Sent :)
If you're wanting to publicise this and encourage people to use it for other projects, IMO a descriptive name for the format would probably be best (or at least a name that somewhat gives away what the format is for) - e.g. for a Star Trek layman, Isolinear Chips probably doesn't mean a whole lot (although in this case that's probably fine as it's not really meant as a major public project).
Oh gosh, now we have equals signs and colons in the same sections... I get that it's to differentiate between constants and types, but it also seems very easy to mix these up while writing - perhaps there's a better way to separate these? (e.g. surround a section of constants with
Couldn't you just disallow using those names? It seems to me this is a bit like using a function in a language that's defined in the standard API, as opposed to one you've defined - sections could just be thought of as types defined in the program. I guess this depends on how you parse the format.
Oh wow, okay... It definitely wasn't like that back when I was working on my old parser, but clearly I haven't kept up-to-date with the documentation. After I read this I was considering pitching the idea of having enums specify a 'base type' and then places where the enum is used that don't use that type would provide their own, but I think that'd get far too annoying to work on and probably complicate the compiler considerably.
Potentially, I'm not yet sure - I may end up making a packet parser that reads these structure files and 'interprets' them on-the-fly (i.e. no code gen necessary). Also, since we want other people working on Artemis-related projects to be able to use the format, I thought it'd be a good test to make sure people other than you can write a parser that makes sense ;P |
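The 'interpret on-the-fly' idea mentioned here (no code generation) would amount to walking a parsed struct definition at runtime and unpacking bytes directly. The type names and the StartGame layout below are assumptions for the example, not the real spec:

```python
import struct

# Sketch of interpreting a spec at runtime instead of generating code:
# each field's type name maps to a struct format, and the packet is
# unpacked field by field. Types and layout are illustrative only.

PRIMITIVES = {"u8": "<B", "u32": "<I", "f32": "<f"}

def parse_packet(fields, buf):
    """Interpret a list of (name, type) pairs against raw bytes."""
    result, offset = {}, 0
    for name, typename in fields:
        fmt = PRIMITIVES[typename]
        (result[name],) = struct.unpack_from(fmt, buf, offset)
        offset += struct.calcsize(fmt)
    return result

start_game = [("difficulty", "u32"), ("game_type", "u32")]
packet = struct.pack("<II", 5, 1)
print(parse_packet(start_game, packet))  # {'difficulty': 5, 'game_type': 1}
```

The trade-off versus code generation is flexibility at runtime against the type safety and speed of compiled parsers.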
Good idea! This infamous ticket #50 is definitely getting out of hand :) @everybody: so from now on, syntax and compiler -> https://github.com/chrivers/transwarp Specifically, this thread is continued here: |
I beg to differ on this one. In my own code, I've tried hard to maintain compatibility with "all protocol versions" if at all possible... I've found it very useful to be able to mark particular fields / packets / parts of packets as "PROTOVERSION >= 2.6" or whatever, and I'm often reverse-engineering things from OLDER versions of the protocol, not just "the current". I think you'd struggle to improve docs for ALL versions unless the protocol-docs can have a different version scheme to the protocol itself. In #78 I refer to some 2.4.0 updates that I'd presumably have to commit into 2.4.0, 2.5.1, AND 2.6 branches if we were maintaining separate protocol-docs for separate protocol versions 🤷♂️

In the meantime... Should I be waiting for https://github.com/chrivers/isolinear-chips to replace https://github.com/artemis-nerds/protocol-docs, or do we feel it already has done? In other words, the issues I've been adding (#75 #76 #77 #78 #79, for protocol 2.6.204, 2.6.0, and earlier)... should I be trying to turn them into pull-requests for https://github.com/chrivers/isolinear-chips, or https://github.com/artemis-nerds/protocol-docs, or... both? 😕 -- Cheers |
Someone just drew my attention to http://kaitai.io/ |
... and more generically: https://github.com/dloss/binary-parsing |