This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Concerns about integer vs floating-point instructions on x86 #125

Closed
Maratyszcza opened this issue Oct 28, 2019 · 6 comments

Comments

@Maratyszcza
Contributor

Maratyszcza commented Oct 28, 2019

In SSE and AVX instruction sets on x86 many instructions have separate integer, single-precision, and double-precision forms, e.g. MOVDQU/MOVUPS/MOVUPD. On "big" Intel and AMD cores, there is an extra penalty if a register produced by an integer SIMD op is consumed by a floating-point SIMD op, and vice versa.

However, WebAssembly SIMD doesn't make a distinction between, e.g., integer and FP loads, and although this information can, in theory, be reconstructed from the instruction stream, such reconstruction requires expensive analysis passes, which streaming WebAssembly engines cannot afford.

Only a few classes of ops have separate integer / floating-point instructions on x86:

  1. Loads and stores
  2. Shuffles
  3. Broadcasts ("load-and-splat")
  4. Binary logic (AND, OR, XOR, ANDNOT)
  5. Blends

I think it is worth considering splitting the corresponding WebAssembly instructions into separate integer and floating-point variants in the SIMD spec. Initially, both compilers and Wasm engines can treat the integer and floating-point variants the same, but at least it will allow fixing this properly in the future. Here is the list of instructions that would need two forms:

  • v128.const
  • v8x16.shuffle
  • v128.and
  • v128.or
  • v128.xor
  • v128.not
  • v128.andnot
  • v128.bitselect (decomposed into AND, ANDNOT, and OR on x86)
  • v128.load
  • v8x16.load_splat
  • v16x8.load_splat
  • v32x4.load_splat
  • v64x2.load_splat
  • v128.store

Note that the problem is specific to the distinction between integer and floating-point SIMD instructions on x86. ARM NEON doesn't distinguish between integer/floating-point variants at ISA level, and as far as I know no x86 CPUs distinguish between "double-precision" (e.g. ANDPD) and "single-precision" (e.g. ANDPS) instructions.

@Maratyszcza Maratyszcza changed the title Concerns about difference integer & floating-point instructions on x86 Concerns about integer vs floating-point instructions on x86 Oct 28, 2019
@AndrewScheidecker
Contributor

FWIW, there's some past discussion about this here: #1 (comment)

@nfrechette

I wrote about this last week on my blog here. I discuss real world use cases for this and performance measurements with/without. I only mention quaternion math related functions but usage of these instructions happens in lots of other code.

ARM64 seems to suffer when using XOR with floating-point inputs. It may well have a similar penalty internally but no instruction to bypass it. Different chips perform differently here; you can see numbers from my Pixel 3 and an iPad if you follow the links in my post. Performance ranged from slightly worse to much worse. Perhaps someday we'll see a NEON extension that adds these instructions as well. I just can't find good ARM internal documentation to shed light on this, and I don't have time to measure it myself.

@dtig
Member

dtig commented Dec 19, 2019

Labeling this as pending data, as the result of the discussion on 10/22/2019 was to gather some benchmarks to see how this affects usage in practice. (#121)

@dtig
Member

dtig commented May 20, 2020

Following up, the notes have an AI for @penzn to see if there's any benchmarking data here to share. Is this still the plan? If not, given that we won't be adding separate integer/floating point ops at this stage, I would suggest we close this issue.

@midnight-dev

Given the late phase, separate ops won't be implemented regardless of benchmarks, no? In which case, might as well close it.

I think it'd still be good to have a few benchmark samples. I'm planning to use SIMD for matrix & quaternion math and model volumetrics in the next year, aiming for native-grade performance, but an implicit conversion penalty on every op, or needing to waste cycles testing types, may derail my plans if the resulting overhead is too great.

Assuming this is set in stone for now, could this be revisited for the next revision of wasm SIMD?

@dtig
Member

dtig commented Dec 11, 2020

Closing as per #396.

@dtig dtig closed this as completed Dec 11, 2020