
opcode tables in definition section for binary encoding needs clarification #274

Closed
m4b opened this issue Jul 21, 2015 · 16 comments

@m4b

m4b commented Jul 21, 2015

In Global Structure in the definition section, I'm hoping for some clarification on the purpose of the opcode tables.

In particular, it seems like this description for the opcode table:

A sequence of standardized string literal opcode names, where order determines opcode index

indicates that the opcode table in the wasm module is a list of all the opcodes for... wasm? In use in the module? I don't see the purpose of either of these alternatives, hence my confusion (otherwise you're transmitting the opcode set (or subset) in every binary, which seems crazy and wasteful, as this can be accomplished by simply using instruction set version ordinals).

@jfbastien
Member

wasm defines textual opcodes, but no indices for them.
A module then defines which opcodes it uses in this textual table, and which indices it gives to each.

This means wasm doesn't define magic numbers, it lets the application choose which numbers to use to represent its instructions. It makes it much easier to add new features, potentially independently, and for an application to optimize its encoding based on which opcodes it actually uses.

If this explanation makes sense we can clarify the documentation, or you can send a pull request :-)
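
To make that concrete, here is a minimal sketch of a module-defined opcode table (the names and layout below are purely hypothetical, not the actual wasm format): the module lists the textual opcodes it uses, and the position of each name in the list is the index its binary code refers to.

```python
# Hypothetical sketch: a module declares only the opcodes it uses, and the
# position of each name in this list becomes the numeric index used in the
# module's own binary code.
OPCODE_TABLE = ["int32.add", "int32.load", "call", "int32.const"]

# The code section then refers to opcodes by index rather than by name ...
encoded_body = [3, 3, 0, 2]  # int32.const, int32.const, int32.add, call

# ... and a decoder maps indices back to the standardized textual names.
decoded_body = [OPCODE_TABLE[i] for i in encoded_body]
print(decoded_body)  # ['int32.const', 'int32.const', 'int32.add', 'call']
```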

@m4b
Author

m4b commented Jul 22, 2015

Thanks for the quick response, and sorry for the delay.

Ok this makes more sense; so are you saying:

  1. the index of the textual opcode = "number to use to represent it"
  2. the number of an opcode is its binary encoding value

? (and as you said, these numbers can change/are defined per binary/module/application)

If 2. isn't true (just a guess), why do opcodes need numbers to represent them at all? I don't think a pull request is necessary, but maybe a few comments in the doc saying just what you said above, or a link to a document describing this feature in more depth (I've never heard of defining opcode ordinals/encoding values per binary, so it's interesting).

@titzer

titzer commented Jul 22, 2015

The opcodes are to reduce each occurrence of the operation in the code down to a single byte.



@m4b
Author

m4b commented Jul 22, 2015

Right, so you're encoding the opcode as a single byte (the index) in the binary representation? Also, doesn't this limit you to 255 distinct opcodes?

@titzer

titzer commented Jul 22, 2015

You can use two-byte opcodes for infrequently used operations, with, e.g. #255 indicating that another byte follows.
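
As a rough sketch of that escape-byte idea (assuming 255 is the reserved value and that only one extra byte follows, which caps this particular scheme at 510 indices):

```python
ESCAPE = 0xFF  # assumed reserved value meaning "another byte follows"

def encode_opcode(index):
    """One byte for indices 0..254, two bytes (escape + remainder) above that."""
    if index < ESCAPE:
        return bytes([index])
    extra = index - ESCAPE
    assert extra <= 0xFF, "this sketch only covers a two-byte range"
    return bytes([ESCAPE, extra])

def decode_opcode(data, pos):
    """Return (opcode_index, next_position)."""
    first = data[pos]
    if first != ESCAPE:
        return first, pos + 1
    return ESCAPE + data[pos + 1], pos + 2

assert encode_opcode(7) == b"\x07"
assert encode_opcode(300) == b"\xff\x2d"        # 255 + 45
assert decode_opcode(b"\xff\x2d", 0) == (300, 2)
```

Frequently used opcodes (indices 0 through 254) then cost one byte each, and everything else costs two.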


@kg
Contributor

kg commented Jul 22, 2015

I don't think opcodes are bytes. I think we either pick a specific int size for the file (based on the size of its opcode table), or we encode them using a variable length scheme.

It is potentially feasible to keep the opcode space at 255 entries, though. This would rely on us encoding information into opcode arguments that would otherwise require a separate opcode: aligned vs. unaligned, int32 vs. float32, etc.
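
For reference, one widely used variable-length scheme is LEB128 (the encoding wasm later adopted elsewhere for integers); a sketch of the unsigned flavour, where indices below 128 fit in a single byte:

```python
def encode_uleb128(value):
    """Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uleb128(data, pos=0):
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

assert encode_uleb128(100) == b"\x64"       # one byte for indices < 128
assert encode_uleb128(300) == b"\xac\x02"   # two bytes beyond that
assert decode_uleb128(b"\xac\x02") == (300, 2)
```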

@MikeHolman
Member

I thought the current scheme was to use type inference to separate the opcode spaces without needing to encode that information?

@kg
Contributor

kg commented Jul 22, 2015

I'm not convinced that will actually work or provide any measurable size reduction. It's a good idea, but it directly ties the definition of the AST to the definition of the binary encoding and has other nasty effects like making backwards/forwards compatibility harder and prohibiting certain encoding optimizations.

FWIW, variable length integer encodings will automatically grant us 'most opcodes are one byte' if a compiler merely sorts the opcode table so that most-frequently-used opcodes are at the front. This can be done post-compile as an optimization pass and avoids the need for us to optimize specific opcodes by making them short or classifying them.
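
A sketch of that post-compile pass under the assumptions in this thread (a hypothetical helper, not part of any existing tool): count how often each textual opcode occurs, order the table by descending frequency, and re-emit the indices so the most common opcodes land in the cheapest range of a variable-length encoding.

```python
from collections import Counter

def sort_opcode_table(opcode_stream):
    """Build a frequency-sorted opcode table and the re-encoded index stream."""
    counts = Counter(opcode_stream)
    table = [name for name, _ in counts.most_common()]   # most frequent first
    index_of = {name: i for i, name in enumerate(table)}
    encoded = [index_of[name] for name in opcode_stream]
    return table, encoded

table, encoded = sort_opcode_table(["int32.add", "int32.add", "call", "int32.add"])
assert table == ["int32.add", "call"]
assert encoded == [0, 0, 1, 0]
```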

@m4b
Author

m4b commented Jul 22, 2015

So I'm still missing some pieces, maybe because they aren't in place yet, but let me know if this is more or less correct:

  1. the opcode table is a list of opcodes, mapping textual name -> integer (potentially > 1 byte, with 255 the escape byte)
  2. the binary encoding uses this mapping to represent opcodes in the on-disk format.
  3. As @kg suggested, after a compiler produces the wasm, a second size optimization pass could count the number of opcode occurrences, sort them, and write the binary encoding (again if necessary) using the new sorted table to reduce the number of on-disk bytes used by the binary encoding (Note, this only reduces the on-disk size iff there are >= 255 opcodes in use in the application)

@kg
Contributor

kg commented Jul 22, 2015

Sorting the values reduces the post-stream-compression size even if varints aren't in use, actually. Stream compressors are really friggin' smart. But yes, the size win is the biggest if it pulls most of the opcodes down to 1 byte (reducing the average size over the whole file).

And to clarify #1, the opcode table solves multiple problems - size is one of them, but it also lets us avoid baking in arbitrary opcode values and it provides a good extensibility path (detecting unknown opcodes in the table; expressing fallback strategies/polyfills in the table)

The numeric opcode mappings are binary-encoding only, as you describe in 2. Though a native implementation could use them if it wanted, and I suspect most of them will have a numeric enum internally of all the opcodes they actually support.

Most parts of our binary encoding scheme will be things you can do in passes, either inside or outside of the compiler. In the long run I think most production compilers (like LLVM) will integrate all the passes and do some of them simultaneously.
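
To illustrate the extensibility point above, here is a hypothetical consumer-side check (the names and the fallback shape are invented for the sketch): scan the module's opcode table once, up front, and either accept each name, substitute a declared polyfill, or reject the module before decoding any code.

```python
# Hypothetical: the set of textual opcodes this particular engine implements.
SUPPORTED = {"int32.add", "int32.load", "int32.const", "call"}

def resolve_opcode_table(table, fallbacks=None):
    """Validate a module's opcode table before decoding its code section.

    `fallbacks` is a hypothetical mapping from an unknown opcode name to a
    supported replacement -- a polyfill strategy expressed alongside the table.
    """
    fallbacks = fallbacks or {}
    resolved = []
    for name in table:
        if name in SUPPORTED:
            resolved.append(name)
        elif fallbacks.get(name) in SUPPORTED:
            resolved.append(fallbacks[name])          # substitute the polyfill
        else:
            raise ValueError(f"unsupported opcode: {name}")
    return resolved
```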

@m4b
Author

m4b commented Jul 22, 2015

Ok, yea this is good stuff. I'd suggest portions of this discussion definitely be included in the binary format documentation, as it's very elucidating.

Also, just a technical point, the optimization in 3 has to be performed after code generation, because until you generate the code, you don't know which opcodes you're using (and therefore how many), and so you can't sort them yet.

EDIT: on second thought, the final step of linking, relocation, and container meta-data generation would take care of this table, and at that point you already have your code ready to be embedded in the container, so it would be strange not to count and sort, since you're already making the table with the final data anyway.

@naturaltransformation
Contributor

Please excuse my jumping into the thread. I was wondering what the remaining issues regarding opcode tables are. The table was referenced in the BinaryFormat doc at some point, but I don't see it mentioned now. I assume someone will be reintroducing it. It seems at least some of the details regarding opcode tables are more Layer 1 concerns (e.g., mapping opcode + immediate(s)). But is the basic approach settled enough to be implemented?

@rossberg
Member

@naturaltransformation, no, I don't think it is. We don't have a concrete design yet.

@lukewagner
Member

As one of the people originally advocating for the opcode table, I'm increasingly thinking that (1) it could be added post-MVP (as just a new section type which defines the opcode table), and (2) it could well be subsumed by layer 1 (which is also post-MVP), depending on how layer 1 is designed. So I'd be fine removing this issue from the MVP milestone or closing the issue.

@lukewagner
Member

I'll close this for now; a new issue can be created for whatever gets proposed in the future.

@ghost

ghost commented Jul 25, 2016

A number of other encoding decisions were made on the basis that there would be good support for opcode tables. If it has been conceded that opcode tables are not practical (perhaps because of significant objections to the burden they place on literal wasm interpreters), then it would seem that some other issues need to be reconsidered. Some that come to mind are the immediate arguments to the memory access opcodes and the immediate arity arguments. It might now be useful to address some high-frequency opcode/immediate cases by having separately defined opcodes.
