Conversation
Is the i8x16 variant of popcnt widely used in the implementations of these applications? If so, can you share any implementation references that use it on their critical path? This seems to be a risky addition to the current MVP given its non-trivial architecture support.
The applications are insensitive to the element size of the population count instruction: the maximum output for byte inputs is 8, which enables enough accumulation using packed 8-bit SIMD to amortize the overhead of converting to a wider data type. I chose 8-bit elements for the instruction as it is the most portable, but another option would be to support all variants (8/16/32/64-bit). This repository by @WojciechMula provides many implementations for population count, and the Larq compute engine uses the 8-bit population count on its critical path. I suggest we follow the agreed-upon criteria for evaluating instructions at Stage 3, and if they are demonstrably faster than emulation sequences on both x86-64 and ARM64, accept them into the SIMD spec.
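To make the accumulation argument concrete, here is a minimal sketch (mine, not from the discussion), assuming the pre-standardization builtin __builtin_wasm_popcnt_i8x16 used later in this thread and the standard wasm_simd128.h intrinsics: every byte of a popcount result is at most 8, so up to 31 results (31 * 8 = 248 <= 255) can be summed in 8-bit lanes before any widening is required.

```c
#include <stdint.h>
#include <wasm_simd128.h>

/* Accumulate per-byte population counts of 31 consecutive 16-byte blocks
   entirely in 8-bit lanes; widening (e.g. via extended pairwise additions)
   is only needed once per 31 blocks. */
v128_t accumulate_popcounts(const uint8_t* data /* 31 * 16 bytes */) {
  v128_t acc8 = wasm_i64x2_const(0, 0);
  for (int k = 0; k < 31; k++) {
    const v128_t block = wasm_v128_load(data + 16 * k);
    acc8 = wasm_i8x16_add(acc8, __builtin_wasm_popcnt_i8x16(block));
  }
  return acc8; /* per-byte partial sums, each at most 248 */
}
```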
Another key application is computational biology and genetics; as an example, here are two biologists exploring the popcount rabbit hole:
Published results on this algorithm are not clear-cut: it shows slowdowns on smaller inputs when compiled natively. Wasm use of this instruction would be even slower, so a good starting point would be some description of why this should produce any speedup at all.
We're removing two
If I remember correctly, the criterion is speedup against scalar, not against prior SIMD instructions.
I don't think we ever discussed the intended baseline, but de facto all previous proposals were evaluated against the baseline of prior SIMD instructions. I don't mind evaluating against both SIMD and scalar baselines in this case.
@tlively and @zeux can correct me, but I thought we evaluated new instructions against scalar, the logic being that it does not make sense to add SIMD instructions that are slower than scalar. Also, in the comment you are referencing there are the following criteria:
I am not sure that a lowering this complex on x86 makes it "well supported" on the platform (edit: the point is that SIMD popcount is known to be not well supported on x86).
As proposed at WebAssembly/simd#379. Use a target builtin and intrinsic rather than normal codegen patterns to make the instruction opt-in until it is merged to the proposal and stabilized in engines. Differential Revision: https://reviews.llvm.org/D89446
Since prototyping seems to be underway, I am curious how we are going to run this. The use cases here are a number of research papers (I personally can't access the chemistry one) and a Wikipedia article. I have to admit I did not read the entirety of the papers, but at a first glance I don't see code we can actually run.
My plan is to implement one or more algorithms built on top of Population Count (either binary GEMM or the N-point correlation base case) and evaluate them with the dedicated Population Count instruction and with its simulation using baseline SIMD.
That sounds great! @Maratyszcza Just for reference, in addition to the binary GEMM you linked to, we also have an indirect binary convolution implementation similar to XNNPack available in larq/compute-engine, which might be easier to experiment with for this analysis. Feel free to reach out to us if you run into any trouble.
As proposed in WebAssembly/simd#379. Since this instruction is still being evaluated for inclusion in the SIMD proposal, this PR does not add support for it to the C/JS APIs or to the fuzzer. This PR also performs a drive-by fix for unrelated instructions in c-api-kitchen-sink.c
This instruction has landed in both LLVM and Binaryen, so it should be ready to use from tip-of-tree Emscripten in a few hours. The builtin function to use is __builtin_wasm_popcnt_i8x16.
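For illustration only (not part of the original comment), a minimal usage sketch with the standard wasm_simd128.h types; the builtin returns the number of set bits in each byte of its input:

```c
#include <wasm_simd128.h>

/* Per-byte population count of a 128-bit vector using the pre-standardization
   builtin; each result byte holds a value in 0..8. */
v128_t per_byte_popcount(v128_t x) {
  return __builtin_wasm_popcnt_i8x16(x);
}
```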
To evaluate this proposal I implemented scalar and SIMD versions of the O(N^3) part of the 3-Point Correlation computation, which accounts for most of the time spent in the overall computation. The algorithm follows this paper and is loosely based on the reference code for it. The code is specialized for 128 points. Here's the scalar version:

```c
#include <stdint.h>
#include <stddef.h>

uint32_t ThreePointCorrelationsScalar64Bit(
const uint64_t pairwise_ab[2 * 3 * 128],
const uint64_t pairwise_ac[2 * 3 * 128],
const uint64_t pairwise_bc[2 * 3 * 128])
{
uint32_t count = 0;
for (size_t i = 128; i != 0; i--) {
uint64_t sat_ab0_hi = pairwise_ab[1];
uint64_t sat_ab1_hi = pairwise_ab[3];
uint64_t sat_ab2_hi = pairwise_ab[5];
const uint64_t sat_ac0_lo = pairwise_ac[0];
const uint64_t sat_ac0_hi = pairwise_ac[1];
const uint64_t sat_ac1_lo = pairwise_ac[2];
const uint64_t sat_ac1_hi = pairwise_ac[3];
const uint64_t sat_ac2_lo = pairwise_ac[4];
const uint64_t sat_ac2_hi = pairwise_ac[5];
pairwise_ac += 6;
const uint64_t* restrict bc = pairwise_bc;
for (size_t j = 64; j != 0; j--) {
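// Broadcast the top (sign) bit of each AB word across all 64 bit positions via an arithmetic right shift.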
const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_hi >> 63);
const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_hi >> 63);
const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_hi >> 63);
const uint64_t sat_bc0_lo = bc[0];
const uint64_t sat_bc0_hi = bc[1];
const uint64_t sat_bc1_lo = bc[2];
const uint64_t sat_bc1_hi = bc[3];
const uint64_t sat_bc2_lo = bc[4];
const uint64_t sat_bc2_hi = bc[5];
bc += 6;
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
(sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
(sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
(sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
(sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
sat_ab0_hi += sat_ab0_hi;
sat_ab1_hi += sat_ab1_hi;
sat_ab2_hi += sat_ab2_hi;
}
uint64_t sat_ab0_lo = pairwise_ab[0];
uint64_t sat_ab1_lo = pairwise_ab[2];
uint64_t sat_ab2_lo = pairwise_ab[4];
pairwise_ab += 6;
for (size_t j = 64; j != 0; j--) {
const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_lo >> 63);
const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_lo >> 63);
const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_lo >> 63);
const uint64_t sat_bc0_lo = bc[0];
const uint64_t sat_bc0_hi = bc[1];
const uint64_t sat_bc1_lo = bc[2];
const uint64_t sat_bc1_hi = bc[3];
const uint64_t sat_bc2_lo = bc[4];
const uint64_t sat_bc2_hi = bc[5];
bc += 6;
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
(sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
(sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
(sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
(sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
sat_ab0_lo += sat_ab0_lo;
sat_ab1_lo += sat_ab1_lo;
sat_ab2_lo += sat_ab2_lo;
}
}
return count;
}
```

And here's the SIMD version with the Population Count instruction:

```c
#include <stdint.h>
#include <wasm_simd128.h>

#define USE_X86_WORKAROUND 0
#define USE_EXTPADD 0
uint32_t ThreePointCorrelationsSIMDPopcount(
const uint64_t pairwise_ab[2 * 3 * 128],
const uint64_t pairwise_ac[2 * 3 * 128],
const uint64_t pairwise_bc[2 * 3 * 128])
{
#if !USE_EXTPADD
const v128_t mask_00FF00FF = wasm_i32x4_const(0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF);
const v128_t mask_0000FFFF = wasm_i32x4_const(0x0000FFFF, 0x0000FFFF, 0x0000FFFF, 0x0000FFFF);
#endif
v128_t count_simd32 = wasm_i32x4_const(0, 0, 0, 0);
for (size_t i = 128; i != 0; i--) {
const v128_t sat_ac0 = wasm_v128_load(pairwise_ac);
const v128_t sat_ac1 = wasm_v128_load(pairwise_ac + 2);
const v128_t sat_ac2 = wasm_v128_load(pairwise_ac + 4);
pairwise_ac += 6;
const uint64_t* restrict bc = pairwise_bc;
v128_t sat_ab0_hi = wasm_v64x2_load_splat(pairwise_ab + 1);
v128_t sat_ab1_hi = wasm_v64x2_load_splat(pairwise_ab + 3);
v128_t sat_ab2_hi = wasm_v64x2_load_splat(pairwise_ab + 5);
v128_t count_simd16 = wasm_i16x8_const(0, 0, 0, 0, 0, 0, 0, 0);
for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_hi, sat_ab0_hi, 1, 1, 3, 3), 31);
const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_hi, sat_ab1_hi, 1, 1, 3, 3), 31);
const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_hi, sat_ab2_hi, 1, 1, 3, 3), 31);
#else
const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_hi, 63);
const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_hi, 63);
const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_hi, 63);
#endif
const v128_t sat_bc0 = wasm_v128_load(bc);
const v128_t sat_bc1 = wasm_v128_load(bc + 2);
const v128_t sat_bc2 = wasm_v128_load(bc + 4);
bc += 6;
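// Combine the broadcast AB sign masks with the AC and BC bitmaps; every set bit of the result contributes one to the final count.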
const v128_t bitmask =
wasm_v128_or(
wasm_v128_or(
wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
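// Widen the per-byte counts (each at most 8) into 16-bit partial sums, either with extadd_pairwise or with its shift-and-mask emulation.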
#if USE_EXTPADD
count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
sat_ab0_hi = wasm_i64x2_shl(sat_ab0_hi, 1);
sat_ab1_hi = wasm_i64x2_shl(sat_ab1_hi, 1);
sat_ab2_hi = wasm_i64x2_shl(sat_ab2_hi, 1);
}
v128_t sat_ab0_lo = wasm_v64x2_load_splat(pairwise_ab);
v128_t sat_ab1_lo = wasm_v64x2_load_splat(pairwise_ab + 2);
v128_t sat_ab2_lo = wasm_v64x2_load_splat(pairwise_ab + 4);
pairwise_ab += 6;
for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_lo, sat_ab0_lo, 1, 1, 3, 3), 31);
const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_lo, sat_ab1_lo, 1, 1, 3, 3), 31);
const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_lo, sat_ab2_lo, 1, 1, 3, 3), 31);
#else
const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_lo, 63);
const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_lo, 63);
const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_lo, 63);
#endif
const v128_t sat_bc0 = wasm_v128_load(bc);
const v128_t sat_bc1 = wasm_v128_load(bc + 2);
const v128_t sat_bc2 = wasm_v128_load(bc + 4);
bc += 6;
const v128_t bitmask =
wasm_v128_or(
wasm_v128_or(
wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
#if USE_EXTPADD
count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
sat_ab0_lo = wasm_i64x2_shl(sat_ab0_lo, 1);
sat_ab1_lo = wasm_i64x2_shl(sat_ab1_lo, 1);
sat_ab2_lo = wasm_i64x2_shl(sat_ab2_lo, 1);
}
#if USE_EXTPADD
count_simd32 = wasm_i32x4_add(count_simd32, __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(count_simd16));
#else
count_simd32 = wasm_i32x4_add(count_simd32, wasm_u32x4_shr(count_simd16, 16));
count_simd32 = wasm_i32x4_add(count_simd32, wasm_v128_and(count_simd16, mask_0000FFFF));
#endif
}
count_simd32 = wasm_i32x4_add(count_simd32,
wasm_v32x4_shuffle(count_simd32, count_simd32, 2, 3, 0, 1));
return wasm_i32x4_extract_lane(count_simd32, 0) + wasm_i32x4_extract_lane(count_simd32, 1);
}
```

Additionally, I evaluated SIMD versions without the SIMD Population Count instruction. One version simulates SIMD Population Count using an i8x16.swizzle-based table lookup:
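As a sketch only, assuming the i8x16.swizzle-based approach described in the PR introduction (intrinsic names per current wasm_simd128.h, not necessarily the code that was benchmarked), such an emulation splits each byte into nibbles and looks up their counts in a 16-entry table:

```c
#include <wasm_simd128.h>

/* Per-byte popcount emulated with a nibble lookup table and i8x16.swizzle. */
static inline v128_t popcnt_i8x16_swizzle(v128_t x) {
  const v128_t table = wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3,
                                        1, 2, 2, 3, 2, 3, 3, 4); /* popcount of 0..15 */
  const v128_t lo = wasm_v128_and(x, wasm_i8x16_splat(0x0F)); /* low nibbles  */
  const v128_t hi = wasm_u8x16_shr(x, 4);                     /* high nibbles */
  return wasm_i8x16_add(wasm_i8x16_swizzle(table, lo),
                        wasm_i8x16_swizzle(table, hi));
}
```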
Another version simulates SIMD Population Count using the HACKMEM algorithm:
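Again only as a sketch, not the benchmarked code: a HACKMEM-style SWAR emulation expressed in WebAssembly intrinsics, mirroring the SSE2 lowering quoted later in this PR.

```c
#include <wasm_simd128.h>

/* Per-byte popcount via the classic SWAR bit-counting steps (HACKMEM style). */
static inline v128_t popcnt_i8x16_hackmem(v128_t x) {
  v128_t v = wasm_i8x16_sub(x, wasm_v128_and(wasm_u8x16_shr(x, 1), wasm_i8x16_splat(0x55)));
  v = wasm_i8x16_add(wasm_v128_and(v, wasm_i8x16_splat(0x33)),
                     wasm_v128_and(wasm_u8x16_shr(v, 2), wasm_i8x16_splat(0x33)));
  return wasm_v128_and(wasm_i8x16_add(v, wasm_u8x16_shr(v, 4)), wasm_i8x16_splat(0x0F));
}
```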
Performance on ARM64 devices is presented below:
The version with the
In all SIMD versions on x86 I use
Performance on Intel Celeron N3060 is poor for two reasons:
Performance on Intel Xeon is presumably poor due to the overhead of reloading the lookup table and the 0x0F mask. Because of suboptimal loading of constants in V8, it generates 5 more instructions than the necessary minimum. Still, despite the overhead related to loading constants in V8, which would presumably get fixed later on, the
This illustrates the point I've been trying to make for some time about the feasibility of building and measuring larger apps - shaving off a few instructions by adding a more specialized operation only pays off if there are no other bottlenecks. That's why a speedup on a larger app is not only more convincing, but may be necessary - it would reduce the likelihood of running into issues like this. I suspect some of the other recently added operations would suffer from "somebody else's slowdown" if run on all of the listed use cases (from the respective PR descriptions). Also, some of the issues from #380 (comment) would apply here as well.
Re-benchmarked performance on Celeron N3060 after the recent fix for inefficient
Adding a preliminary vote for the inclusion of the population count operation in the SIMD proposal below. Please vote with
- 👍 For including population count
I've implemented a number of HammingDistance functions with Neon, SSSE3 & AVX2. The first step is a vector popcnt that produces a count per element. Neon and AVX512 have that instruction. SSE/AVX work almost as well using pshufb.
ARM and ARM64 support only the 8-bit SIMD Population Count. A wider Population Count instruction could be emulated through extended pairwise addition instructions. Both HACKMEM- and lookup-based emulations likewise produce per-byte counts. To sum up, wider-element Population Count instructions are not exposed because they don't have an efficient lowering, and their inefficient lowering can be expressed using the more general extended pairwise addition instructions.
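For illustration (a sketch using the pre-standardization builtins that appear in the benchmark code above, not an API introduced by this PR), a wider-element population count can be assembled from the 8-bit one exactly as described:

```c
#include <wasm_simd128.h>

/* 32-bit-per-lane population count built from the 8-bit instruction plus
   unsigned extended pairwise additions. */
static inline v128_t popcnt_u32x4(v128_t x) {
  const v128_t bytes  = __builtin_wasm_popcnt_i8x16(x);                      /* 0..8 per byte  */
  const v128_t halves = __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(bytes); /* 0..16 per half */
  return __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(halves);               /* 0..32 per word */
}
```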
V8 now does memory loads, albeit not yet rip-relative (it needs an extra instruction to set up a general-purpose register for the load address).
@Maratyszcza Thanks, this helps. So it looks like 8-bit popcnt is the only width that's practical to run natively on ARM and the emulation on Intel is more expensive for larger widths as well. Wrt Intel lowering it looks like this is similar to some other instructions that we've standardized in this proposal where it's not 1-1, but there's no good substitute. IOW if your algorithm requires popcnt, there's no good alternative and as such it's not obvious that the perf. "cliff" is an issue... (It's somewhat surprising [to me] that on ARM the delta between emulation and native execution isn't as high as I would expect although maybe the throughput/latency characteristics of
Consider a dot product primitive:
Moreover, it is unnecessary to reduce precision down to a single bit: dot products of multi-bit elements can be processed bit-by-bit. The complexity is quadratic in the number of bits, but @ajtulloch found that it still outperforms direct integer computations for 3-bit elements (i.e. 9x the cost of single-bit computation).
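A sketch (mine, not from the comment) of the kind of single-bit dot product this refers to: with values packed as {0,1} bits, the dot product of two bit-packed vectors reduces to the population count of their bitwise AND; for a ±1 encoding the same idea uses XOR, with dot = n - 2 * popcount(a XOR b).

```c
#include <stdint.h>
#include <stddef.h>
#include <wasm_simd128.h>

/* Dot product of two {0,1} bit-packed vectors of n_blocks * 128 bits each. */
uint32_t BitDotProduct(const uint8_t* a, const uint8_t* b, size_t n_blocks) {
  v128_t acc32 = wasm_i64x2_const(0, 0);
  for (size_t i = 0; i < n_blocks; i++) {
    const v128_t bits   = wasm_v128_and(wasm_v128_load(a + 16 * i), wasm_v128_load(b + 16 * i));
    const v128_t bytes  = __builtin_wasm_popcnt_i8x16(bits);
    const v128_t halves = __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(bytes);
    acc32 = wasm_i32x4_add(acc32, __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(halves));
  }
  return wasm_i32x4_extract_lane(acc32, 0) + wasm_i32x4_extract_lane(acc32, 1) +
         wasm_i32x4_extract_lane(acc32, 2) + wasm_i32x4_extract_lane(acc32, 3);
}
```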
IIRC, OpenCV has a WAsm SIMD port in the works.
This is a good point for including SIMD Population Count too, as it has wider availability than scalar Population Count: ARM/ARM64 devices feature a hardware SIMD Population Count instruction, but no scalar Population Count instruction.
In the interest of being precise, for Intel it's the opposite - any CPU that supports AVX512BITALG supports POPCNT but the reverse is not true.
FWIW SVE and RISC-V V have popcnt instructions for masks, and on RVV that extends to whole registers (by taking LMUL=8).
Updated the table with x86-64 results following the merge of additional
As an implementor I'm pretty neutral on this and I'm not able to properly judge whether the instruction is useful or not - several concrete uses are presented above, while it's a tough job to argue against it except in the abstract ("exotic operation", "few applications"). On ARM64 it's a no-brainer to include it. The Intel AVX512 implementation may not see wide use given the dearth of consumer devices with AVX512, but even the SSSE3 implementation beats the alternatives (except on Celeron), if only barely.
AVX512 availability on future Intel client platforms is not guaranteed, and we should not make any decisions in a hurry at this point on the assumption that the IA implementation will eventually become efficient with AVX512 support.
@arunetm To be clear:
Sorry if I missed something, the thread is getting long.
This is not helping the generality argument, as outside of machine learning this would be viewed as an esoteric take on those three operations - which doesn't make it wrong by any measure, it is just that precision requirements outside of ML make this infeasible. To my best knowledge bio and chemistry algorithms would use it in a much more conventional way (as its name implies). However, I don't think that just this being niche is a good enough argument for not including it - I am not convinced the examples and 3-point correlation performance prove that we really need this.
As an aside and for background, among the 3-point correlation implementations the fastest one on Celeron is the scalar one (which is in and of itself problematic). Also, Xeon performance is far from great (it passes the bar technically, though). So firstly, I don't think adding one more SIMD operation that is slower than scalar is beneficial, as instead of two inefficient ways to write something we give users three (even if the new problematic operation is faster than the existing problematic operation). Surely, existing alternatives (
Given that, I am trying to understand why we actually need a
On the data side, we know from the 3-point example that simulating this instruction does not work on some of the lower-end x86 chips, while using an alternative sequence of instructions works on Arm. Frankly, I think the overall solution for porting code which uses Neon's
I think we should investigate a solution like this instead of adding a new instruction.
OpenCV uses SIMD population count on all platforms, and simulates it on x86. The SSE version uses a HACKMEM-like algorithm, the AVX version uses
Moreover, OpenCV has a port to WebAssembly SIMD, which simulates the SIMD population count too! The implementation uses a HACKMEM-like algorithm: it predates
Such a solution necessitates adding a new instruction. WAsm engines have the information (the microarchitecture ID) to generate an alternative lowering, but this information can't be passed to WAsm because of anti-fingerprinting considerations. Thus, a perfect workaround is possible only by abstracting the code generation for this operation away from the WAsm application into the WAsm engine.
So OpenCV already works and uses the HACKMEM solution - this still does not convince me we need a new instruction for
I investigated the performance issues on the Celeron N3060, and it turns out there's a lot of performance variance between V8 versions. E.g. on the latest V8 build, SIMD versions outperform scalar:
I also tried to change the SSE implementation of i8x16.popcnt to the following sequence:

```cpp
__ xorps(tmp, tmp);
__ pavgb(tmp, src);
__ Move(dst, src);
__ andps(tmp,
__ ExternalReferenceAsOperand(
ExternalReference::address_of_wasm_i8x16_splat_0x55()));
__ psubb(dst, tmp);
Operand splat_0x33 = __ ExternalReferenceAsOperand(
ExternalReference::address_of_wasm_i8x16_splat_0x33());
__ movaps(tmp, dst);
__ andps(dst, splat_0x33);
__ psrlw(tmp, 2);
__ andps(tmp, splat_0x33);
__ paddb(dst, tmp);
__ movaps(tmp, dst);
__ psrlw(dst, 4);
__ paddb(dst, tmp);
__ andps(dst,
__ ExternalReferenceAsOperand(
ExternalReference::address_of_wasm_i8x16_splat_0x0f()));
```

The resulting V8 build further improves the
The latest V8 build has this optimization: https://chromium.googlesource.com/v8/v8/+/173d660849c1e4f43b15b67321b074a89b73f9ce, so that's probably why it got slightly faster.
The optimization for Silvermont (and similar Atom-like processors) is merged in https://chromium.googlesource.com/v8/v8/+/71fc222f4930d7f9425e45ee1b2111bee742f20e
We voted on this instruction at the 1/29/21 meeting with these results: SF 1, F 6, N 3, A 2, SA 1. This is closer to the border between clear consensus and clear lack of consensus than many other votes we've taken, but I am inclined to interpret this vote as consensus to include the instruction. Here are the considerations leading me to this decision:
Code sequence from WebAssembly/simd#379, and exactly the same as x64, with minor tweaks for ExternalReferenceAsOperand. Bug: v8:11002 Change-Id: Icbfdac62b21c2734ad4886b3d48f34e29f7a8222 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2664860 Commit-Queue: Zhi An Ng <[email protected]> Reviewed-by: Deepti Gandluri <[email protected]> Cr-Commit-Position: refs/heads/master@{#72495}
Implementation for PPC will be added in a later CL. Port dd90d10 Original Commit Message: Code sequence from WebAssembly/simd#379, and exactly the same as x64, with minor tweaks for ExternalReferenceAsOperand. [email protected], [email protected], [email protected], [email protected] BUG= LOG=N Change-Id: I2be8a9cf04d0b327c15f47c2575877925238353c Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2672706 Reviewed-by: Junliang Yan <[email protected]> Commit-Queue: Milad Fa <[email protected]> Cr-Commit-Position: refs/heads/master@{#72510}
I believe the suggested lowering for SSE2 is incorrect. @Maratyszcza
@aqrit Good catch, updated the PR description and filed issue 11591 for V8.
@aqrit On a side note, would you be interested in participating in the WAsm SIMD meetings? The next one will be on April 9 (see WebAssembly/flexible-vectors#31 for details).
Thanks for the report, this is fixed in V8 in https://crrev.com/c/2780218 and should go out in canary tomorrow.
Introduction
This PR introduces a SIMD variant of the Population Count operation. This operation counts the number of bits set to 1 and is commonly used in algorithms involving the Hamming distance between two binary vectors. Population Count is represented in the scalar WebAssembly instruction set and in many native SIMD instruction sets; e.g. the ARM NEON and AVX512BITALG extensions natively support the 8-bit lane variant introduced in this proposal. Implementation in SSSE3 is less trivial, but still more efficient than emulation with the i8x16.swizzle instruction, due to the elimination of the PADDUSB instruction and the direct lowering of table lookups into PSHUFB. Moreover, the SIMD Population Count can be efficiently implemented on SSE2-level processors, where i8x16.swizzle-based emulation would have been scalarized.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512BITALG and AVX512VL instruction sets
y = i8x16.popcnt(x) is lowered to VPOPCNTB xmm_y, xmm_x
x86/x86-64 processors with AVX instruction set
y = i8x16.popcnt(x) is lowered to:
VMOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
VPANDN xmm_tmp1, xmm_tmp0, xmm_x
VPAND xmm_y, xmm_tmp0, xmm_x
VMOVDQA xmm_tmp0, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
VPSRLW xmm_tmp1, xmm_tmp1, 4
VPSHUFB xmm_y, xmm_tmp0, xmm_y
VPSHUFB xmm_tmp1, xmm_tmp0, xmm_tmp1
VPADDB xmm_y, xmm_y, xmm_tmp1
x86/x86-64 processors with SSSE3 instruction set
y = i8x16.popcnt(x) is lowered to:
MOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
MOVDQA xmm_tmp1, xmm_x
PAND xmm_tmp1, xmm_tmp0
PANDN xmm_tmp0, xmm_x
MOVDQA xmm_y, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSRLW xmm_tmp0, 4
PSHUFB xmm_y, xmm_tmp1
MOVDQA xmm_tmp1, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSHUFB xmm_tmp1, xmm_tmp0
PADDB xmm_y, xmm_tmp1
x86/x86-64 processors with SSE2 instruction set
y = i8x16.popcnt(x) is lowered to (based on @WojciechMula's algorithm here):
MOVDQA xmm_tmp, xmm_x
PSRLW xmm_tmp, 1
MOVDQA xmm_y, xmm_x
PAND xmm_tmp, [wasm_i8x16_splat(0x55)]
PSUBB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PAND xmm_y, [wasm_i8x16_splat(0x33)]
PSRLW xmm_tmp, 2
PAND xmm_tmp, [wasm_i8x16_splat(0x33)]
PADDB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PSRLW xmm_tmp, 4
PADDB xmm_y, xmm_tmp
PAND xmm_y, [wasm_i8x16_splat(0x0F)]
ARM64 processors
y = i8x16.popcnt(x) is lowered to CNT Vy.16B, Vx.16B
ARMv7 processors with NEON instruction set
y = i8x16.popcnt(x) is lowered to VCNT.8 Qy, Qx
POWER processors with POWER 2.07+ instruction set and VMX
y = i8x16.popcnt(x) is lowered to VPOPCNTB VRy, VRx
MIPS processors with MSA instruction set
y = i8x16.popcnt(x) is lowered to PCNT.B Wy, Wx