
opcode tables in definition section for binary encoding needs clarification #274

Closed
m4b opened this issue Jul 21, 2015 · 16 comments

@m4b

m4b commented Jul 21, 2015

In Global Structure in the definition section, I'm hoping for some clarification on the purpose of the opcode tables.

In particular, it seems like this description for the opcode table:

A sequence of standardized string literal opcode names, where order determines opcode index

indicates that the opcode table in the wasm module is a list of all the opcodes for... wasm? In use in the module? I don't see the purpose of either of these alternatives, hence my confusion (otherwise you're transmitting the opcode set (or subset) in every binary, which seems crazy and wasteful, as this can be accomplished by simply using instruction set version ordinals).

@jfbastien
Member

wasm defines textual opcodes, but no indices for them.
A module then defines which opcodes it uses in this textual table, and which indices it gives to each.

This means wasm doesn't define magic numbers, it lets the application choose which numbers to use to represent its instructions. It makes it much easier to add new features, potentially independently, and for an application to optimize its encoding based on which opcodes it actually uses.

If this explanation makes sense we can clarify the documentation, or you can send a pull request :-)
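
To make that concrete, here is a minimal sketch of a module-defined opcode table (the names and layout below are purely hypothetical, not the actual wasm format): the module lists the textual opcodes it uses, and the position of each name in the list is the index its binary code refers to.

```python
# Hypothetical sketch: a module declares only the opcodes it uses, and the
# position of each name in this list becomes the numeric index used in the
# module's own binary code.
OPCODE_TABLE = ["int32.add", "int32.load", "call", "int32.const"]

# The code section then refers to opcodes by index rather than by name ...
encoded_body = [3, 3, 0, 2]  # int32.const, int32.const, int32.add, call

# ... and a decoder maps indices back to the standardized textual names.
decoded_body = [OPCODE_TABLE[i] for i in encoded_body]
print(decoded_body)  # ['int32.const', 'int32.const', 'int32.add', 'call']
```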

@m4b
Author

m4b commented Jul 22, 2015

Thanks for the quick response, and sorry for the delay.

Ok this makes more sense; so are you saying:

  1. the index of the textual opcode = "number to use to represent it"
  2. the number of an opcode is its binary encoding value

? (and as you said, these numbers can change/are defined per binary/module/application)

If 2. isn't true (just a guess), why do opcodes need numbers to represent them at all? I don't think a pull request is necessary, but maybe a few comments in the doc saying just what you said above, or a link to a document describing this feature in more depth (I've never heard of defining opcode ordinals/encoding values per binary, so it's interesting).

@titzer

titzer commented Jul 22, 2015

The opcodes are to reduce each occurrence of the operation in the code down to a single byte.



@m4b
Author

m4b commented Jul 22, 2015

Right, so you're encoding the opcode as a single byte (the index) in the binary representation? Also, doesn't this limit you to 255 distinct opcodes?

@titzer

titzer commented Jul 22, 2015

You can use two-byte opcodes for infrequently used operations, with, e.g. #255 indicating that another byte follows.
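
As a rough sketch of that escape-byte idea (assuming 255 is the reserved value and that only one extra byte follows, which caps this particular scheme at 510 indices):

```python
ESCAPE = 0xFF  # assumed reserved value meaning "another byte follows"

def encode_opcode(index):
    """One byte for indices 0..254, two bytes (escape + remainder) above that."""
    if index < ESCAPE:
        return bytes([index])
    extra = index - ESCAPE
    assert extra <= 0xFF, "this sketch only covers a two-byte range"
    return bytes([ESCAPE, extra])

def decode_opcode(data, pos):
    """Return (opcode_index, next_position)."""
    first = data[pos]
    if first != ESCAPE:
        return first, pos + 1
    return ESCAPE + data[pos + 1], pos + 2

assert encode_opcode(7) == b"\x07"
assert encode_opcode(300) == b"\xff\x2d"        # 255 + 45
assert decode_opcode(b"\xff\x2d", 0) == (300, 2)
```

Frequently used opcodes (indices 0 through 254) then cost one byte each, and everything else costs two.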


@kg
Contributor

kg commented Jul 22, 2015

I don't think opcodes are bytes. I think we either pick a specific int size for the file (based on the size of its opcode table), or we encode them using a variable length scheme.

It is potentially feasible to keep the opcode space at 255 entries, though. This would rely on us encoding information into opcode arguments that would otherwise require a separate opcode: aligned vs. unaligned, int32 vs. float32, etc.
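
For reference, one widely used variable-length scheme is LEB128 (the encoding wasm later adopted elsewhere for integers); a sketch of the unsigned flavour, where indices below 128 fit in a single byte:

```python
def encode_uleb128(value):
    """Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def decode_uleb128(data, pos=0):
    """Return (value, next_position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

assert encode_uleb128(100) == b"\x64"       # one byte for indices < 128
assert encode_uleb128(300) == b"\xac\x02"   # two bytes beyond that
assert decode_uleb128(b"\xac\x02") == (300, 2)
```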

@MikeHolman
Member

I thought the current scheme was to use type inference to separate the opcode spaces without needing to encode that information?

@kg
Contributor

kg commented Jul 22, 2015

I'm not convinced that will actually work or provide any measurable size reduction. It's a good idea, but it directly ties the definition of the AST to the definition of the binary encoding and has other nasty effects like making backwards/forwards compatibility harder and prohibiting certain encoding optimizations.

FWIW, variable length integer encodings will automatically grant us 'most opcodes are one byte' if a compiler merely sorts the opcode table so that most-frequently-used opcodes are at the front. This can be done post-compile as an optimization pass and avoids the need for us to optimize specific opcodes by making them short or classifying them.
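
A sketch of that post-compile pass under the assumptions in this thread (a hypothetical helper, not part of any existing tool): count how often each textual opcode occurs, order the table by descending frequency, and re-emit the indices so the most common opcodes land in the cheapest range of a variable-length encoding.

```python
from collections import Counter

def sort_opcode_table(opcode_stream):
    """Build a frequency-sorted opcode table and the re-encoded index stream."""
    counts = Counter(opcode_stream)
    table = [name for name, _ in counts.most_common()]   # most frequent first
    index_of = {name: i for i, name in enumerate(table)}
    encoded = [index_of[name] for name in opcode_stream]
    return table, encoded

table, encoded = sort_opcode_table(["int32.add", "int32.add", "call", "int32.add"])
assert table == ["int32.add", "call"]
assert encoded == [0, 0, 1, 0]
```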

@m4b
Author

m4b commented Jul 22, 2015

So I'm still missing some pieces, maybe because they aren't in place yet, but let me know if this is more or less correct:

  1. the opcode table is a list of opcodes, mapping textual name -> integer (potentially > 1 byte, with 255 the escape byte)
  2. the binary encoding uses this mapping to represent opcodes in the on-disk format.
  3. As @kg suggested, after a compiler produces the wasm, a second size optimization pass could count the number of opcode occurrences, sort them, and write the binary encoding (again if necessary) using the new sorted table to reduce the number of on-disk bytes used by the binary encoding (Note, this only reduces the on-disk size iff there are >= 255 opcodes in use in the application)

@kg
Contributor

kg commented Jul 22, 2015

Sorting the values reduces the post-stream-compression size even if varints aren't in use, actually. Stream compressors are really friggin' smart. But yes, the size win is the biggest if it pulls most of the opcodes down to 1 byte (reducing the average size over the whole file).

And to clarify #1, the opcode table solves multiple problems - size is one of them, but it also lets us avoid baking in arbitrary opcode values and it provides a good extensibility path (detecting unknown opcodes in the table; expressing fallback strategies/polyfills in the table)

The numeric opcode mappings are binary-encoding only, as you describe in 2. Though a native implementation could use them if it wanted, and I suspect most of them will have a numeric enum internally of all the opcodes they actually support.

Most parts of our binary encoding scheme will be things you can do in passes, either inside or outside of the compiler. In the long run I think most production compilers (like LLVM) will integrate all the passes and do some of them simultaneously.
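
To illustrate the extensibility point above, here is a hypothetical consumer-side check (the names and the fallback shape are invented for the sketch): scan the module's opcode table once, up front, and either accept each name, substitute a declared polyfill, or reject the module before decoding any code.

```python
# Hypothetical: the set of textual opcodes this particular engine implements.
SUPPORTED = {"int32.add", "int32.load", "int32.const", "call"}

def resolve_opcode_table(table, fallbacks=None):
    """Validate a module's opcode table before decoding its code section.

    `fallbacks` is a hypothetical mapping from an unknown opcode name to a
    supported replacement -- a polyfill strategy expressed alongside the table.
    """
    fallbacks = fallbacks or {}
    resolved = []
    for name in table:
        if name in SUPPORTED:
            resolved.append(name)
        elif fallbacks.get(name) in SUPPORTED:
            resolved.append(fallbacks[name])          # substitute the polyfill
        else:
            raise ValueError(f"unsupported opcode: {name}")
    return resolved
```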

@m4b
Author

m4b commented Jul 22, 2015

Ok, yea this is good stuff. I'd suggest portions of this discussion definitely be included in the binary format documentation, as it's very elucidating.

Also, just a technical point, the optimization in 3 has to be performed after code generation, because until you generate the code, you don't know which opcodes you're using (and therefore how many), and so you can't sort them yet.

EDIT: on second thought, the final step of linking, relocation, and container meta-data generation would take care of this table, and at that point you already have your code ready to be embedded in the container, so it would be strange not to count and sort, since you're already making the table with the final data anyway.

@naturaltransformation
Contributor

Please excuse my jumping into the thread. I was wondering what the remaining issues regarding opcode tables are. The table was referenced in the BinaryFormat doc at some point, but I don't see it mentioned now. I assume someone will be reintroducing it. It seems at least some of the details regarding opcode tables are more Layer 1 concerns (e.g., mapping opcode + immediate(s)). But is the basic approach settled enough to be implemented?

@rossberg
Member

@naturaltransformation, no, I don't think it is. We don't have a concrete design yet.

@lukewagner
Member

As one of the people originally advocating for the opcode table, I'm increasingly thinking that (1) it could be added post-MVP (as just a new section type which defines the opcode table), and (2) it could well be subsumed by layer 1 (which is also post-MVP), depending on how layer 1 is designed. So I'd be fine removing this issue from the MVP milestone or closing the issue.

@lukewagner
Member

I'll close this for now; a new issue can be created for whatever gets proposed in the future.

@ghost

ghost commented Jul 25, 2016

A number of other encoding decisions were made on the basis that there would be good support for opcode tables. If it has been conceded that opcode tables are not practical (perhaps because of significant objections to the burden they place on literal wasm interpreters), then it would seem that some other issues need to be reconsidered. Some that come to mind are the immediate arguments to the memory access opcodes and the immediate arity arguments. It might now be useful to address some high-frequency opcode/immediate cases by having separately defined opcodes.
