Conversation
Is the i8x16 variant of popcnt widely used in the implementations of these applications? If so, can you share any implementation references that use it on their critical path? This seems to be a risky addition to the current MVP given its non-trivial architecture support.
The applications are insensitive to the element size of the population count instruction: the maximum output for byte inputs is 8, which enables enough accumulation using packed 8-bit SIMD to amortize the overhead of converting to a wider data type. I chose 8-bit elements for the instruction as it is the most portable, but another option would be to support all variants (8/16/32/64-bit). This repository by @WojciechMula provides many implementations for population count, and the Larq compute engine uses the 8-bit population count on its critical path. I suggest we follow the agreed-upon criteria for evaluating instructions at Stage 3, and if they are demonstrably faster than emulation sequences on both x86-64 and ARM64, accept them into the SIMD spec.
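To make the accumulation argument concrete, here is a minimal sketch (mine, not from the discussion), assuming the pre-standardization builtin __builtin_wasm_popcnt_i8x16 used later in this thread and the standard wasm_simd128.h intrinsics: every byte of a popcount result is at most 8, so up to 31 results (31 * 8 = 248 <= 255) can be summed in 8-bit lanes before any widening is required.

```c
#include <stdint.h>
#include <wasm_simd128.h>

/* Accumulate per-byte population counts of 31 consecutive 16-byte blocks
   entirely in 8-bit lanes; widening (e.g. via extended pairwise additions)
   is only needed once per 31 blocks. */
v128_t accumulate_popcounts(const uint8_t* data /* 31 * 16 bytes */) {
  v128_t acc8 = wasm_i64x2_const(0, 0);
  for (int k = 0; k < 31; k++) {
    const v128_t block = wasm_v128_load(data + 16 * k);
    acc8 = wasm_i8x16_add(acc8, __builtin_wasm_popcnt_i8x16(block));
  }
  return acc8; /* per-byte partial sums, each at most 248 */
}
```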
Another key application is computational biology and genetics; as an example, here are two biologists exploring the popcount rabbit hole:
Published results on this algorithm are not clear-cut: it shows slowdowns on smaller inputs when compiled natively. Wasm use of this instruction would be even slower, so a good starting point would be some description of why this should produce any speedup at all.
We're removing two
If I remember correctly, the criterion is speedup against scalar, not against prior SIMD instructions.
I don't think we ever discussed the intended baseline, but de facto all previous proposals were evaluated against the baseline of prior SIMD instructions. I don't mind evaluating against both SIMD and scalar baselines in this case.
@tlively and @zeux can correct me, but I thought we evaluated new instructions against scalar, the logic being that it does not make sense to add SIMD instructions that are slower than scalar. Also, in the comment you are referencing there are the following criteria:
I am not sure that a lowering this complex on x86 makes it "well supported" on the platform (edit: the point is that SIMD popcount is known to be not well supported on x86).
As proposed at WebAssembly/simd#379. Use a target builtin and intrinsic rather than normal codegen patterns to make the instruction opt-in until it is merged to the proposal and stabilized in engines. Differential Revision: https://reviews.llvm.org/D89446
Since prototyping seems to be underway, I am curious how we are going to run this. The use cases here are a number of research papers (I personally can't access the chemistry one) and a Wikipedia article. I have to admit I did not read the entirety of the papers, but at a first glance I don't see code we can actually run.
My plan is to implement one or more algorithms built on top of Population Count (either binary GEMM or the N-point correlation base case) and evaluate them with the dedicated Population Count instruction and with its simulation using baseline SIMD.
That sounds great! @Maratyszcza Just for reference, in addition to the binary GEMM you linked to, we also have an indirect binary convolution implementation similar to XNNPack available in larq/compute-engine, which might be easier to experiment with for this analysis. Feel free to reach out to us if you run into any trouble.
As proposed in WebAssembly/simd#379. Since this instruction is still being evaluated for inclusion in the SIMD proposal, this PR does not add support for it to the C/JS APIs or to the fuzzer. This PR also performs a drive-by fix for unrelated instructions in c-api-kitchen-sink.c
This instruction has landed in both LLVM and Binaryen, so it should be ready to use from tip-of-tree Emscripten in a few hours. The builtin function to use is __builtin_wasm_popcnt_i8x16.
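For illustration only (not part of the original comment), a minimal usage sketch with the standard wasm_simd128.h types; the builtin returns the number of set bits in each byte of its input:

```c
#include <wasm_simd128.h>

/* Per-byte population count of a 128-bit vector using the pre-standardization
   builtin; each result byte holds a value in 0..8. */
v128_t per_byte_popcount(v128_t x) {
  return __builtin_wasm_popcnt_i8x16(x);
}
```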
To evaluate this proposal I implemented scalar and SIMD versions of the O(N^3) part of the 3-Point Correlation computation, which accounts for most of the time spent in the overall computation. The algorithm follows this paper and is loosely based on the reference code for it. The code is specialized for 128 points. Here's the scalar version:

```c
#include <stdint.h>
#include <stddef.h>

uint32_t ThreePointCorrelationsScalar64Bit(
const uint64_t pairwise_ab[2 * 3 * 128],
const uint64_t pairwise_ac[2 * 3 * 128],
const uint64_t pairwise_bc[2 * 3 * 128])
{
uint32_t count = 0;
for (size_t i = 128; i != 0; i--) {
uint64_t sat_ab0_hi = pairwise_ab[1];
uint64_t sat_ab1_hi = pairwise_ab[3];
uint64_t sat_ab2_hi = pairwise_ab[5];
const uint64_t sat_ac0_lo = pairwise_ac[0];
const uint64_t sat_ac0_hi = pairwise_ac[1];
const uint64_t sat_ac1_lo = pairwise_ac[2];
const uint64_t sat_ac1_hi = pairwise_ac[3];
const uint64_t sat_ac2_lo = pairwise_ac[4];
const uint64_t sat_ac2_hi = pairwise_ac[5];
pairwise_ac += 6;
const uint64_t* restrict bc = pairwise_bc;
for (size_t j = 64; j != 0; j--) {
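// Broadcast the top (sign) bit of each AB word across all 64 bit positions via an arithmetic right shift.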
const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_hi >> 63);
const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_hi >> 63);
const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_hi >> 63);
const uint64_t sat_bc0_lo = bc[0];
const uint64_t sat_bc0_hi = bc[1];
const uint64_t sat_bc1_lo = bc[2];
const uint64_t sat_bc1_hi = bc[3];
const uint64_t sat_bc2_lo = bc[4];
const uint64_t sat_bc2_hi = bc[5];
bc += 6;
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
(sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
(sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
(sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
(sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
sat_ab0_hi += sat_ab0_hi;
sat_ab1_hi += sat_ab1_hi;
sat_ab2_hi += sat_ab2_hi;
}
uint64_t sat_ab0_lo = pairwise_ab[0];
uint64_t sat_ab1_lo = pairwise_ab[2];
uint64_t sat_ab2_lo = pairwise_ab[4];
pairwise_ab += 6;
for (size_t j = 64; j != 0; j--) {
const uint64_t sat_ab0_bcast = (uint64_t) ((int64_t) sat_ab0_lo >> 63);
const uint64_t sat_ab1_bcast = (uint64_t) ((int64_t) sat_ab1_lo >> 63);
const uint64_t sat_ab2_bcast = (uint64_t) ((int64_t) sat_ab2_lo >> 63);
const uint64_t sat_bc0_lo = bc[0];
const uint64_t sat_bc0_hi = bc[1];
const uint64_t sat_bc1_lo = bc[2];
const uint64_t sat_bc1_hi = bc[3];
const uint64_t sat_bc2_lo = bc[4];
const uint64_t sat_bc2_hi = bc[5];
bc += 6;
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc1_lo))) |
(sat_ab1_bcast & ((sat_ac0_lo & sat_bc2_lo) | (sat_ac2_lo & sat_bc0_lo))) |
(sat_ab2_bcast & ((sat_ac0_lo & sat_bc1_lo) | (sat_ac1_lo & sat_bc0_lo))));
count += (uint32_t) __builtin_popcountll(
(sat_ab0_bcast & ((sat_ac1_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc1_hi))) |
(sat_ab1_bcast & ((sat_ac0_hi & sat_bc2_hi) | (sat_ac2_hi & sat_bc0_hi))) |
(sat_ab2_bcast & ((sat_ac0_hi & sat_bc1_hi) | (sat_ac1_hi & sat_bc0_hi))));
sat_ab0_lo += sat_ab0_lo;
sat_ab1_lo += sat_ab1_lo;
sat_ab2_lo += sat_ab2_lo;
}
}
return count;
}
```

And here's the SIMD version with the Population Count instruction:

```c
#include <stdint.h>
#include <wasm_simd128.h>

#define USE_X86_WORKAROUND 0
#define USE_EXTPADD 0
uint32_t ThreePointCorrelationsSIMDPopcount(
const uint64_t pairwise_ab[2 * 3 * 128],
const uint64_t pairwise_ac[2 * 3 * 128],
const uint64_t pairwise_bc[2 * 3 * 128])
{
#if !USE_EXTPADD
const v128_t mask_00FF00FF = wasm_i32x4_const(0x00FF00FF, 0x00FF00FF, 0x00FF00FF, 0x00FF00FF);
const v128_t mask_0000FFFF = wasm_i32x4_const(0x0000FFFF, 0x0000FFFF, 0x0000FFFF, 0x0000FFFF);
#endif
v128_t count_simd32 = wasm_i32x4_const(0, 0, 0, 0);
for (size_t i = 128; i != 0; i--) {
const v128_t sat_ac0 = wasm_v128_load(pairwise_ac);
const v128_t sat_ac1 = wasm_v128_load(pairwise_ac + 2);
const v128_t sat_ac2 = wasm_v128_load(pairwise_ac + 4);
pairwise_ac += 6;
const uint64_t* restrict bc = pairwise_bc;
v128_t sat_ab0_hi = wasm_v64x2_load_splat(pairwise_ab + 1);
v128_t sat_ab1_hi = wasm_v64x2_load_splat(pairwise_ab + 3);
v128_t sat_ab2_hi = wasm_v64x2_load_splat(pairwise_ab + 5);
v128_t count_simd16 = wasm_i16x8_const(0, 0, 0, 0, 0, 0, 0, 0);
for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_hi, sat_ab0_hi, 1, 1, 3, 3), 31);
const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_hi, sat_ab1_hi, 1, 1, 3, 3), 31);
const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_hi, sat_ab2_hi, 1, 1, 3, 3), 31);
#else
const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_hi, 63);
const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_hi, 63);
const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_hi, 63);
#endif
const v128_t sat_bc0 = wasm_v128_load(bc);
const v128_t sat_bc1 = wasm_v128_load(bc + 2);
const v128_t sat_bc2 = wasm_v128_load(bc + 4);
bc += 6;
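// Combine the broadcast AB sign masks with the AC and BC bitmaps; every set bit of the result contributes one to the final count.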
const v128_t bitmask =
wasm_v128_or(
wasm_v128_or(
wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
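// Widen the per-byte counts (each at most 8) into 16-bit partial sums, either with extadd_pairwise or with its shift-and-mask emulation.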
#if USE_EXTPADD
count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
sat_ab0_hi = wasm_i64x2_shl(sat_ab0_hi, 1);
sat_ab1_hi = wasm_i64x2_shl(sat_ab1_hi, 1);
sat_ab2_hi = wasm_i64x2_shl(sat_ab2_hi, 1);
}
v128_t sat_ab0_lo = wasm_v64x2_load_splat(pairwise_ab);
v128_t sat_ab1_lo = wasm_v64x2_load_splat(pairwise_ab + 2);
v128_t sat_ab2_lo = wasm_v64x2_load_splat(pairwise_ab + 4);
pairwise_ab += 6;
for (size_t j = 64; j != 0; j--) {
#if USE_X86_WORKAROUND
const v128_t sat_ab0_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab0_lo, sat_ab0_lo, 1, 1, 3, 3), 31);
const v128_t sat_ab1_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab1_lo, sat_ab1_lo, 1, 1, 3, 3), 31);
const v128_t sat_ab2_bcast = wasm_i32x4_shr(wasm_v32x4_shuffle(sat_ab2_lo, sat_ab2_lo, 1, 1, 3, 3), 31);
#else
const v128_t sat_ab0_bcast = wasm_i64x2_shr(sat_ab0_lo, 63);
const v128_t sat_ab1_bcast = wasm_i64x2_shr(sat_ab1_lo, 63);
const v128_t sat_ab2_bcast = wasm_i64x2_shr(sat_ab2_lo, 63);
#endif
const v128_t sat_bc0 = wasm_v128_load(bc);
const v128_t sat_bc1 = wasm_v128_load(bc + 2);
const v128_t sat_bc2 = wasm_v128_load(bc + 4);
bc += 6;
const v128_t bitmask =
wasm_v128_or(
wasm_v128_or(
wasm_v128_and(sat_ab0_bcast, wasm_v128_or(wasm_v128_and(sat_ac1, sat_bc2), wasm_v128_and(sat_ac2, sat_bc1))),
wasm_v128_and(sat_ab1_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc2), wasm_v128_and(sat_ac2, sat_bc0)))),
wasm_v128_and(sat_ab2_bcast, wasm_v128_or(wasm_v128_and(sat_ac0, sat_bc1), wasm_v128_and(sat_ac1, sat_bc0))));
const v128_t count_simd8 = __builtin_wasm_popcnt_i8x16(bitmask);
#if USE_EXTPADD
count_simd16 = wasm_i16x8_add(count_simd16, __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(count_simd8));
#else
count_simd16 = wasm_i16x8_add(count_simd16, wasm_u16x8_shr(count_simd8, 8));
count_simd16 = wasm_i16x8_add(count_simd16, wasm_v128_and(count_simd8, mask_00FF00FF));
#endif
sat_ab0_lo = wasm_i64x2_shl(sat_ab0_lo, 1);
sat_ab1_lo = wasm_i64x2_shl(sat_ab1_lo, 1);
sat_ab2_lo = wasm_i64x2_shl(sat_ab2_lo, 1);
}
#if USE_EXTPADD
count_simd32 = wasm_i32x4_add(count_simd32, __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(count_simd16));
#else
count_simd32 = wasm_i32x4_add(count_simd32, wasm_u32x4_shr(count_simd16, 16));
count_simd32 = wasm_i32x4_add(count_simd32, wasm_v128_and(count_simd16, mask_0000FFFF));
#endif
}
count_simd32 = wasm_i32x4_add(count_simd32,
wasm_v32x4_shuffle(count_simd32, count_simd32, 2, 3, 0, 1));
return wasm_i32x4_extract_lane(count_simd32, 0) + wasm_i32x4_extract_lane(count_simd32, 1);
}
```

Additionally, I evaluated SIMD versions without the SIMD Population Count instruction. One version simulates SIMD Population Count using an i8x16.swizzle-based table lookup:
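As a sketch only, assuming the i8x16.swizzle-based approach described in the PR introduction (intrinsic names per current wasm_simd128.h, not necessarily the code that was benchmarked), such an emulation splits each byte into nibbles and looks up their counts in a 16-entry table:

```c
#include <wasm_simd128.h>

/* Per-byte popcount emulated with a nibble lookup table and i8x16.swizzle. */
static inline v128_t popcnt_i8x16_swizzle(v128_t x) {
  const v128_t table = wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3,
                                        1, 2, 2, 3, 2, 3, 3, 4); /* popcount of 0..15 */
  const v128_t lo = wasm_v128_and(x, wasm_i8x16_splat(0x0F)); /* low nibbles  */
  const v128_t hi = wasm_u8x16_shr(x, 4);                     /* high nibbles */
  return wasm_i8x16_add(wasm_i8x16_swizzle(table, lo),
                        wasm_i8x16_swizzle(table, hi));
}
```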
Another version simulates SIMD Population Count using the HACKMEM algorithm:
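Again only as a sketch, not the benchmarked code: a HACKMEM-style SWAR emulation expressed in WebAssembly intrinsics, mirroring the SSE2 lowering quoted later in this PR.

```c
#include <wasm_simd128.h>

/* Per-byte popcount via the classic SWAR bit-counting steps (HACKMEM style). */
static inline v128_t popcnt_i8x16_hackmem(v128_t x) {
  v128_t v = wasm_i8x16_sub(x, wasm_v128_and(wasm_u8x16_shr(x, 1), wasm_i8x16_splat(0x55)));
  v = wasm_i8x16_add(wasm_v128_and(v, wasm_i8x16_splat(0x33)),
                     wasm_v128_and(wasm_u8x16_shr(v, 2), wasm_i8x16_splat(0x33)));
  return wasm_v128_and(wasm_i8x16_add(v, wasm_u8x16_shr(v, 4)), wasm_i8x16_splat(0x0F));
}
```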
Performance on ARM64 devices is presented below:
The version with the
In all SIMD versions on x86 I use
Performance on Intel Celeron N3060 is poor for two reasons:
Performance on Intel Xeon is presumably poor due to the overhead of reloading the lookup table and the 0x0F mask. Because of suboptimal loading of constants in V8, it generates 5 more instructions than the necessary minimum. Still, despite the overhead related to loading constants in V8, which would presumably get fixed later on, the
This illustrates the point I've been trying to make for some time about the feasibility of building and measuring larger apps - shaving off a few instructions by adding a more specialized operation only pays off if there are no other bottlenecks. That's why a speedup on a larger app is not only more convincing, but may be necessary - it would reduce the likelihood of running into issues like this. I suspect some of the other recently added operations would suffer from "somebody else's slowdown" if run on all of the listed use cases (from the respective PR descriptions). Also, some of the issues from #380 (comment) would apply here as well.
Re-benchmarked performance on Celeron N3060 after the recent fix for inefficient
Adding a preliminary vote for the inclusion of the population count operation in the SIMD proposal below. Please vote with
- 👍 For including population count
I've implemented a number of HammingDistance functions with Neon, SSSE3 & AVX2. The first step is a vector popcnt that produces a count per element. Neon and AVX512 have that instruction. SSE/AVX work almost as well using pshufb.
ARM and ARM64 support only the 8-bit SIMD Population Count. A wider Population Count instruction could be emulated through extended pairwise addition instructions. Both HACKMEM- and lookup-based emulations likewise produce per-byte counts. To sum up, wider-element Population Count instructions are not exposed because they don't have an efficient lowering, and their inefficient lowering can be expressed using the more general extended pairwise addition instructions.
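For illustration (a sketch using the pre-standardization builtins that appear in the benchmark code above, not an API introduced by this PR), a wider-element population count can be assembled from the 8-bit one exactly as described:

```c
#include <wasm_simd128.h>

/* 32-bit-per-lane population count built from the 8-bit instruction plus
   unsigned extended pairwise additions. */
static inline v128_t popcnt_u32x4(v128_t x) {
  const v128_t bytes  = __builtin_wasm_popcnt_i8x16(x);                      /* 0..8 per byte  */
  const v128_t halves = __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(bytes); /* 0..16 per half */
  return __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(halves);               /* 0..32 per word */
}
```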
V8 now does memory loads, albeit not yet rip-relative (it needs an extra instruction to set up a general-purpose register for the load address).
@Maratyszcza Thanks, this helps. So it looks like 8-bit popcnt is the only width that's practical to run natively on ARM and the emulation on Intel is more expensive for larger widths as well. Wrt Intel lowering it looks like this is similar to some other instructions that we've standardized in this proposal where it's not 1-1, but there's no good substitute. IOW if your algorithm requires popcnt, there's no good alternative and as such it's not obvious that the perf. "cliff" is an issue... (It's somewhat surprising [to me] that on ARM the delta between emulation and native execution isn't as high as I would expect although maybe the throughput/latency characteristics of
Consider a dot product primitive:
Moreover, it is unnecessary to reduce precision down to a single bit: dot products of multi-bit elements can be processed bit-by-bit. The complexity is quadratic in the number of bits, but @ajtulloch found that it still outperforms direct integer computations for 3-bit elements (i.e. 9x the cost of single-bit computation).
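A sketch (mine, not from the comment) of the kind of single-bit dot product this refers to: with values packed as {0,1} bits, the dot product of two bit-packed vectors reduces to the population count of their bitwise AND; for a ±1 encoding the same idea uses XOR, with dot = n - 2 * popcount(a XOR b).

```c
#include <stdint.h>
#include <stddef.h>
#include <wasm_simd128.h>

/* Dot product of two {0,1} bit-packed vectors of n_blocks * 128 bits each. */
uint32_t BitDotProduct(const uint8_t* a, const uint8_t* b, size_t n_blocks) {
  v128_t acc32 = wasm_i64x2_const(0, 0);
  for (size_t i = 0; i < n_blocks; i++) {
    const v128_t bits   = wasm_v128_and(wasm_v128_load(a + 16 * i), wasm_v128_load(b + 16 * i));
    const v128_t bytes  = __builtin_wasm_popcnt_i8x16(bits);
    const v128_t halves = __builtin_wasm_extadd_pairwise_i8x16_u_i16x8(bytes);
    acc32 = wasm_i32x4_add(acc32, __builtin_wasm_extadd_pairwise_i16x8_u_i32x4(halves));
  }
  return wasm_i32x4_extract_lane(acc32, 0) + wasm_i32x4_extract_lane(acc32, 1) +
         wasm_i32x4_extract_lane(acc32, 2) + wasm_i32x4_extract_lane(acc32, 3);
}
```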
IIRC, OpenCV has a WAsm SIMD port in the works.
This is a good point for including SIMD Population Count too, as it has wider availability than scalar Population Count: ARM/ARM64 devices feature a hardware SIMD Population Count instruction, but no scalar Population Count instruction.
In the interest of being precise, for Intel it's the opposite - any CPU that supports AVX512BITALG supports POPCNT but the reverse is not true.
FWIW SVE and RISC-V V have popcnt instructions for masks, and on RVV that extends to whole registers (by taking LMUL=8).
Updated the table with x86-64 results following the merge of additional
As an implementor I'm pretty neutral on this and I'm not able to properly judge whether the instruction is useful or not - several concrete uses are presented above, while it's a tough job to argue against it except in the abstract ("exotic operation", "few applications"). On ARM64 it's a no-brainer to include it. The Intel AVX512 implementation may not see wide use given the dearth of consumer devices with AVX512, but even the SSSE3 implementation beats the alternatives (except on Celeron), if only barely.
AVX512 availability on future Intel client platforms is not guaranteed, and we should not make any decisions in a hurry at this point on the assumption that the IA implementation will eventually become efficient with AVX512 support.
@arunetm To be clear:
Sorry if I missed something, the thread is getting long.
This is not helping the generality argument, as outside of machine learning this would be viewed as an esoteric take on those three operations - which doesn't make it wrong by any measure, it is just that precision requirements outside of ML make this infeasible. To my best knowledge bio and chemistry algorithms would use it in a much more conventional way (as its name implies). However, I don't think that just this being niche is a good enough argument for not including it - I am not convinced the examples and 3-point correlation performance prove that we really need this.
As an aside and for background, among the 3-point correlation implementations the fastest one on Celeron is the scalar one (which is in and of itself problematic). Also, Xeon performance is far from great (it passes the bar technically, though). So firstly, I don't think adding one more SIMD operation that is slower than scalar is beneficial, as instead of two inefficient ways to write something we give users three (even if the new problematic operation is faster than the existing problematic operation). Surely, existing alternatives (
Given that, I am trying to understand why we actually need a
On the data side, we know from the 3-point example that simulating this instruction does not work on some of the lower-end x86 chips, while using an alternative sequence of instructions works on Arm. Frankly, I think the overall solution for porting code which uses Neon's
I think we should investigate a solution like this instead of adding a new instruction.
OpenCV uses SIMD population count on all platforms, and simulates it on x86. The SSE version uses a HACKMEM-like algorithm, the AVX version uses
Moreover, OpenCV has a port to WebAssembly SIMD, which simulates the SIMD population count too! The implementation uses a HACKMEM-like algorithm: it predates
Such a solution necessitates adding a new instruction. WAsm engines have the information (the microarchitecture ID) to generate an alternative lowering, but this information can't be passed to WAsm because of anti-fingerprinting considerations. Thus, a perfect workaround is possible only by abstracting the code generation for this operation away from the WAsm application into the WAsm engine.
So OpenCV already works and uses the HACKMEM solution - this still does not convince me we need a new instruction for
I investigated the performance issues on the Celeron N3060, and it turns out there's a lot of performance variance between V8 versions. E.g. on the latest V8 build, SIMD versions outperform scalar:
I also tried to change the SSE implementation of i8x16.popcnt to the following sequence:

```cpp
__ xorps(tmp, tmp);
__ pavgb(tmp, src);
__ Move(dst, src);
__ andps(tmp,
__ ExternalReferenceAsOperand(
ExternalReference::address_of_wasm_i8x16_splat_0x55()));
__ psubb(dst, tmp);
Operand splat_0x33 = __ ExternalReferenceAsOperand(
ExternalReference::address_of_wasm_i8x16_splat_0x33());
__ movaps(tmp, dst);
__ andps(dst, splat_0x33);
__ psrlw(tmp, 2);
__ andps(tmp, splat_0x33);
__ paddb(dst, tmp);
__ movaps(tmp, dst);
__ psrlw(dst, 4);
__ paddb(dst, tmp);
__ andps(dst,
__ ExternalReferenceAsOperand(
ExternalReference::address_of_wasm_i8x16_splat_0x0f()));
```

The resulting V8 build further improves the
The latest V8 build has this optimization: https://chromium.googlesource.com/v8/v8/+/173d660849c1e4f43b15b67321b074a89b73f9ce, so that's probably why it got slightly faster.
The optimization for Silvermont (and similar Atom-like processors) is merged in https://chromium.googlesource.com/v8/v8/+/71fc222f4930d7f9425e45ee1b2111bee742f20e
We voted on this instruction at the 1/29/21 meeting with these results: SF 1, F 6, N 3, A 2, SA 1. This is closer to the border between clear consensus and clear lack of consensus than many other votes we've taken, but I am inclined to interpret this vote as consensus to include the instruction. Here are the considerations leading me to this decision:
Code sequence from WebAssembly/simd#379, and exactly the same as x64, with minor tweaks for ExternalReferenceAsOperand. Bug: v8:11002 Change-Id: Icbfdac62b21c2734ad4886b3d48f34e29f7a8222 Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2664860 Commit-Queue: Zhi An Ng <[email protected]> Reviewed-by: Deepti Gandluri <[email protected]> Cr-Commit-Position: refs/heads/master@{#72495}
Implementation for PPC will be added in a later CL. Port dd90d10 Original Commit Message: Code sequence from WebAssembly/simd#379, and exactly the same as x64, with minor tweaks for ExternalReferenceAsOperand. [email protected], [email protected], [email protected], [email protected] BUG= LOG=N Change-Id: I2be8a9cf04d0b327c15f47c2575877925238353c Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/2672706 Reviewed-by: Junliang Yan <[email protected]> Commit-Queue: Milad Fa <[email protected]> Cr-Commit-Position: refs/heads/master@{#72510}
I believe the suggested lowering for SSE2 is incorrect. @Maratyszcza
@aqrit Good catch, updated the PR description and filed issue 11591 for V8.
@aqrit On a side note, would you be interested in participating in the WAsm SIMD meetings? The next one will be on April 9 (see WebAssembly/flexible-vectors#31 for details).
Thanks for the report, this is fixed in V8 in https://crrev.com/c/2780218 and should go out in canary tomorrow.
Introduction
This PR introduces a SIMD variant of the Population Count operation. This operation counts the number of bits set to 1 and is commonly used in algorithms involving the Hamming distance between two binary vectors. Population Count is represented in the scalar WebAssembly instruction set and in many native SIMD instruction sets; e.g. the ARM NEON and AVX512BITALG extensions natively support the 8-bit lane variant introduced in this proposal. Implementation in SSSE3 is less trivial, but still more efficient than emulation with the i8x16.swizzle instruction, due to the elimination of the PADDUSB instruction and the direct lowering of table lookups into PSHUFB. Moreover, the SIMD Population Count can be efficiently implemented on SSE2-level processors, where i8x16.swizzle-based emulation would have been scalarized.
Applications
Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
x86/x86-64 processors with AVX512BITALG and AVX512VL instruction sets
y = i8x16.popcnt(x) is lowered to VPOPCNTB xmm_y, xmm_x
x86/x86-64 processors with AVX instruction set
y = i8x16.popcnt(x) is lowered to:
VMOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
VPANDN xmm_tmp1, xmm_tmp0, xmm_x
VPAND xmm_y, xmm_tmp0, xmm_x
VMOVDQA xmm_tmp0, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
VPSRLW xmm_tmp1, xmm_tmp1, 4
VPSHUFB xmm_y, xmm_tmp0, xmm_y
VPSHUFB xmm_tmp1, xmm_tmp0, xmm_tmp1
VPADDB xmm_y, xmm_y, xmm_tmp1
x86/x86-64 processors with SSSE3 instruction set
y = i8x16.popcnt(x) is lowered to:
MOVDQA xmm_tmp0, [wasm_i8x16_splat(0x0F)]
MOVDQA xmm_tmp1, xmm_x
PAND xmm_tmp1, xmm_tmp0
PANDN xmm_tmp0, xmm_x
MOVDQA xmm_y, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSRLW xmm_tmp0, 4
PSHUFB xmm_y, xmm_tmp1
MOVDQA xmm_tmp1, [wasm_i8x16_const(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4)]
PSHUFB xmm_tmp1, xmm_tmp0
PADDB xmm_y, xmm_tmp1
x86/x86-64 processors with SSE2 instruction set
y = i8x16.popcnt(x) is lowered to (based on @WojciechMula's algorithm here):
MOVDQA xmm_tmp, xmm_x
PSRLW xmm_tmp, 1
MOVDQA xmm_y, xmm_x
PAND xmm_tmp, [wasm_i8x16_splat(0x55)]
PSUBB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PAND xmm_y, [wasm_i8x16_splat(0x33)]
PSRLW xmm_tmp, 2
PAND xmm_tmp, [wasm_i8x16_splat(0x33)]
PADDB xmm_y, xmm_tmp
MOVDQA xmm_tmp, xmm_y
PSRLW xmm_tmp, 4
PADDB xmm_y, xmm_tmp
PAND xmm_y, [wasm_i8x16_splat(0x0F)]
ARM64 processors
y = i8x16.popcnt(x) is lowered to CNT Vy.16B, Vx.16B
ARMv7 processors with NEON instruction set
y = i8x16.popcnt(x) is lowered to VCNT.8 Qy, Qx
POWER processors with POWER 2.07+ instruction set and VMX
y = i8x16.popcnt(x) is lowered to VPOPCNTB VRy, VRx
MIPS processors with MSA instruction set
y = i8x16.popcnt(x) is lowered to PCNT.B Wy, Wx