SSSE3 implementations of 8x8 IDCT and YCbCr conversion #211

veluca93 · 2021-12-16T12:26:54Z

Hi folks!

This PR is more of a proof of concept / demo than an actual PR - I realize there are probably a few things to discuss, agree on, or do before merging this in, but I figured a benchmark was the best way to get the discussion started ;). See also #146 for a previous take on this.

This PR contains two SSSE3 implementations: one of the 8x8 IDCT and one of YCbCr conversion.
Why SSSE3? Two reasons: it is extremely common nowadays (99.25% of x86 PCs according to the Steam Hardware survey), while offering some operations (in particular blends) that provide some significant speedups over SSE3 or SSE2. On the other hand, the next "useful" upgrade in x86 SIMD would be AVX2, which wouldn't help that much (in particular it wouldn't help IDCT) and is not that common (85.98%).

From the benchmark, you can see speedups (with a single thread) of a factor of approximately 2x in the common case of a 444 YCbCr image.
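For readers unfamiliar with what the 8x8 IDCT computes, here is a minimal floating-point reference sketch. It is not the code from this PR (which works on integers and SIMD registers); the function name `idct_8x8_reference` and the row-major `coeffs[v * 8 + u]` layout are assumptions made for illustration only.

```rust
use std::f64::consts::{FRAC_1_SQRT_2, PI};

/// Naive O(n^4) 8x8 inverse DCT in floating point.
/// Hypothetical reference helper, not the PR's implementation.
/// `coeffs` is assumed to be row-major: coeffs[v * 8 + u].
pub fn idct_8x8_reference(coeffs: &[f64; 64]) -> [f64; 64] {
    let mut out = [0.0f64; 64];
    for y in 0..8 {
        for x in 0..8 {
            let mut sum = 0.0;
            for v in 0..8 {
                for u in 0..8 {
                    // Normalization factors: 1/sqrt(2) for the DC basis function.
                    let cu = if u == 0 { FRAC_1_SQRT_2 } else { 1.0 };
                    let cv = if v == 0 { FRAC_1_SQRT_2 } else { 1.0 };
                    let cos_x = ((2 * x + 1) as f64 * u as f64 * PI / 16.0).cos();
                    let cos_y = ((2 * y + 1) as f64 * v as f64 * PI / 16.0).cos();
                    sum += cu * cv * coeffs[v * 8 + u] * cos_x * cos_y;
                }
            }
            out[y * 8 + x] = sum / 4.0;
        }
    }
    out
}
```

A block with only a DC coefficient decodes to a flat block, which is a handy sanity check for any faster implementation.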

While I didn't (yet!) implement this for ARM NEON, I expect the speedup there to be similar, likely resolving #202. See also #79.

decode a 512x512 JPEG   time:   [1.7992 ms 1.8000 ms 1.8009 ms]                                   
                        change: [-49.653% -49.613% -49.574%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 512x512 progressive JPEG                                                                             
                        time:   [4.5663 ms 4.5848 ms 4.6057 ms]
                        change: [-29.470% -28.997% -28.514%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 512x512 grayscale JPEG                                                                            
                        time:   [780.95 us 782.16 us 783.31 us]
                        change: [-36.512% -36.366% -36.222%] (p = 0.00 < 0.05)
                        Performance has improved.


decode a 2268x1512 JPEG time:   [22.317 ms 22.339 ms 22.362 ms]                                    
                        change: [-52.237% -52.149% -52.073%] (p = 0.00 < 0.05)
                        Performance has improved.

This brings me to the open questions I have related to this PR:

1. Unfortunately, different rounding produces slightly different results in the SIMD code, although I am fairly confident this is still within the bounds of conformance as defined in the JPEG standard ("DCTs back to the same coefficients"). I'm not sure if we'd want to make sure that the non-SIMD implementation matches the SIMD one.
2. The SIMD implementation doesn't handle all the values that the non-SIMD one does -- in particular it doesn't handle i16::MAX, although to my understanding such values can never appear in a JPEG bitstream.
3. As this code uses hand-crafted intrinsics, it makes use of two unsafe functions. I'm confident they are safe (they just do arithmetic, swizzles, and loads/stores to arrays that are bounds-checked in advance), but it may be a good idea to put the SIMD code in a separate crate, in separate files and/or behind a feature flag.
4. Should we also SIMDfy other things? What comes to mind are other color conversions, smaller (4x4 and 2x2) DCTs and chroma upsampling.
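To illustrate the rounding question, here is a small sketch comparing a floating-point YCbCr to RGB conversion with a 16-bit fixed-point one. It is a generic illustration, not the arithmetic used in this PR: the function names, the `SHIFT` precision, and the JFIF-style coefficients are assumptions. The point is only that two reasonable roundings can disagree by one in the least significant bit.

```rust
// Assumed fixed-point precision for this sketch (not taken from the PR).
const SHIFT: i32 = 16;
const HALF: i32 = 1 << (SHIFT - 1);

/// Converts a real coefficient to fixed point with SHIFT fractional bits.
fn fix(c: f64) -> i32 {
    (c * (1i64 << SHIFT) as f64).round() as i32
}

/// Floating-point conversion, rounded to nearest and clamped.
pub fn ycbcr_to_rgb_float(y: u8, cb: u8, cr: u8) -> [u8; 3] {
    let y = y as f32;
    let cb = cb as f32 - 128.0;
    let cr = cr as f32 - 128.0;
    let r = y + 1.402 * cr;
    let g = y - 0.344136 * cb - 0.714136 * cr;
    let b = y + 1.772 * cb;
    [r, g, b].map(|v| v.round().clamp(0.0, 255.0) as u8)
}

/// Fixed-point conversion: multiply, add half, shift right, clamp.
pub fn ycbcr_to_rgb_fixed(y: u8, cb: u8, cr: u8) -> [u8; 3] {
    let y = y as i32;
    let cb = cb as i32 - 128;
    let cr = cr as i32 - 128;
    let r = y + ((fix(1.402) * cr + HALF) >> SHIFT);
    let g = y - ((fix(0.344136) * cb + fix(0.714136) * cr + HALF) >> SHIFT);
    let b = y + ((fix(1.772) * cb + HALF) >> SHIFT);
    [r, g, b].map(|v| v.clamp(0, 255) as u8)
}
```

Both functions agree on neutral gray and stay within one unit of each other per channel, which is the kind of bounded difference a reference test would need to tolerate.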

I'd be happy to hear your thoughts on this :)

Compared with baseline, no rayon: decode a 2268x1512 JPEG time: [35.694 ms 35.736 ms 35.778 ms] change: [-23.510% -23.401% -23.290%] (p = 0.00 < 0.05)

Note that this does not give identical results to the non-SSSE3 version. However, this should be OK as the JPEG specification doesn't mandate a specific IDCT implementation. decode a 2268x1512 JPEG time: [22.236 ms 22.260 ms 22.283 ms] change: [-36.889% -36.804% -36.726%] (p = 0.00 < 0.05)

197g

Thank you for this PR 🎉

Regarding folder structures it would make sense to me to have src/arch/sse3.rs for those particular algorithms. This has two purposes:

- #[allow(unsafe)] could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.
- It groups code by the specialized knowledge required to read it, understand it, review it, change it.
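A rough sketch of what such a layout could look like, with the unsafe code and the runtime dispatch confined to one module. Everything here is hypothetical: `abs_block` is a toy kernel standing in for the IDCT, and the names `get_abs_block` and `abs_block_scalar` are made up for the example. In a real crate the root would carry `#![deny(unsafe_code)]` and only the `arch` module would opt back in.

```rust
#[allow(unsafe_code)]
pub mod arch {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    mod ssse3 {
        #[cfg(target_arch = "x86")]
        use std::arch::x86::*;
        #[cfg(target_arch = "x86_64")]
        use std::arch::x86_64::*;

        /// Toy SSSE3 kernel: absolute value of eight i16 lanes.
        /// Safety: the caller must ensure the CPU supports SSSE3.
        #[target_feature(enable = "ssse3")]
        pub unsafe fn abs_block(block: &mut [i16; 8]) {
            unsafe {
                // The fixed-size array guarantees 16 readable and writable bytes.
                let v = _mm_loadu_si128(block.as_ptr() as *const __m128i);
                let v = _mm_abs_epi16(v);
                _mm_storeu_si128(block.as_mut_ptr() as *mut __m128i, v);
            }
        }
    }

    /// Portable fallback with the same wrapping behavior as the SIMD version.
    fn abs_block_scalar(block: &mut [i16; 8]) {
        for v in block.iter_mut() {
            *v = v.wrapping_abs();
        }
    }

    /// Dispatch: returns a SIMD implementation if the running CPU supports one.
    pub fn get_abs_block() -> Option<unsafe fn(&mut [i16; 8])> {
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        {
            if is_x86_feature_detected!("ssse3") {
                return Some(ssse3::abs_block);
            }
        }
        None
    }

    /// Safe entry point: picks SIMD when available, scalar otherwise.
    pub fn abs_block(block: &mut [i16; 8]) {
        match get_abs_block() {
            // Sound because get_abs_block only returns the SSSE3 function
            // after checking that the feature is present at runtime.
            Some(simd) => unsafe { simd(block) },
            None => abs_block_scalar(block),
        }
    }
}
```

Callers outside `arch` only ever see the safe `arch::abs_block`, so reviewing the unsafe surface means reviewing one module.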

Regarding the code, the implementation indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducibility of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder.

There's also some questions regarding style in the code itself. See below.

veluca93 · 2021-12-16T20:09:59Z

Replied to some higher-level comments while I deal with the other ones :)

veluca93 · 2021-12-16T20:59:56Z

I added an enabled-by-default "simd" feature to allow disabling compilation of the SIMD / unsafe code. PTAL :)

197g

I like the new structure better. Thanks for integrating all of the feedback, this is quite a new kind of addition to any of the libraries (png mostly relies on auto-vectorization, if you're looking for a task afterwards 😏) and thus I'd like to find a solution that we can use as a template for similar additions elsewhere.

veluca93 · 2021-12-20T10:48:35Z

> I like the new structure better. Thanks for integrating all of the feedback, this is quite a new kind of addition to any of the libraries (png mostly relies on auto-vectorization, if you're looking for a task afterwards 😏) and thus I'd like to find a solution that we can use as a template for similar additions elsewhere.

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

197g · 2021-12-20T11:23:54Z

> FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through miniz_oxide) is not simdified despite zlib-ng demonstrating how platform-specific code might show speed gains over the similarly written zlib. And then the actual color stage involves some line-by-line filtering, permutation of bytes, etc. There's always code that can be sped up by careful SIMD. The big question is whether LLVM has already done so implicitly, but then it's most likely a 'generic' variant based on avx or w/e is in the default set of extensions of the modern x86-64 ABI.
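For context on the adler32 stage mentioned above, a minimal scalar sketch follows. It is not the code of any of the crates discussed; the function name is an assumption. It uses the well-known zlib trick of deferring the modulo to once per chunk of `NMAX` bytes, which leaves an inner loop of plain additions that SIMD versions typically build on.

```rust
/// Largest prime below 2^16, as specified for Adler-32.
const MOD_ADLER: u32 = 65_521;
/// Largest chunk length for which the u32 sums cannot overflow
/// before the modulo is applied (the constant zlib uses).
const NMAX: usize = 5552;

/// Scalar Adler-32 with the modulo deferred to once per chunk.
pub fn adler32(data: &[u8]) -> u32 {
    let mut a: u32 = 1;
    let mut b: u32 = 0;
    for chunk in data.chunks(NMAX) {
        // Inner loop is additions only: this is the part worth vectorizing.
        for &byte in chunk {
            a += byte as u32;
            b += a;
        }
        a %= MOD_ADLER;
        b %= MOD_ADLER;
    }
    (b << 16) | a
}
```

The checksum of the ASCII string "Wikipedia" is the commonly cited test vector `0x11E60398`.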

veluca93 · 2021-12-20T11:49:55Z

> > FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)
>
> There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through miniz_oxide) is not simdified despite zlib-ng demonstrating how platform-specific code might show speed gains over the similarly written zlib. And then the actual color stage involves some line-by-line filtering, permutation of bytes, etc. There's always code that can be sped up by careful SIMD. The big question is whether LLVM has already done so implicitly, but then it's most likely a 'generic' variant based on avx or w/e is in the default set of extensions of the modern x86-64 ABI.

In my experience, most of the gains from manual SIMDfication would be in encoding for lz77, not in decoding - decoding is simple enough that compilers do a reasonable job about it. Although I might be wrong ;) (also probably adler32)

As for other things in png: line-by-line filtering in the decoder would be rather tricky to SIMDfy (except perhaps across the channels, but then you don't gain that much - IIRC libpng does that). Permutation of bytes would benefit though. OTOH, in the encoder, quite a bit can be done :)
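To make the filtering point concrete, here is a sketch of undoing the PNG "Sub" filter on one scanline. The function name and signature are assumptions for illustration, not the png crate's API. Each output byte depends on an already reconstructed byte `bpp` positions earlier, so the loop carries a serial dependency along the row; only the `bpp` channels within a pixel are independent of each other.

```rust
/// Reverses the PNG "Sub" filter in place.
/// `bpp` is the number of bytes per complete pixel (at least 1).
pub fn unfilter_sub(row: &mut [u8], bpp: usize) {
    for i in bpp..row.len() {
        // row[i - bpp] was already reconstructed in an earlier iteration,
        // so iteration i cannot start before iteration i - bpp has finished.
        row[i] = row[i].wrapping_add(row[i - bpp]);
    }
}
```

With `bpp = 1` the loop is a running (prefix) sum modulo 256, which is why vectorizing along the row is awkward.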

veluca93 · 2021-12-20T14:03:19Z

@HeroicKatora: Any clue why the CI would fail on beta and on beta only?

fintelia · 2021-12-20T15:50:04Z

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

veluca93 · 2021-12-21T10:01:50Z

> Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file...

This is looking like it will be a fun investigation :)

veluca93 · 2021-12-21T15:37:18Z

> > Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally
>
> I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file...
>
> This is looking like it will be a fun investigation :)

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects, so I modified the code not to use those intrinsics :)

197g · 2021-12-21T15:45:51Z

> I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects

Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect.

veluca93 · 2021-12-21T15:47:39Z

> > I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects
>
> Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect.

Yup, there is rust-lang/rust#84042 :)

197g

LGTM. Can you strew in just a few more comments on the approach and constants, rather than the specific implementation? The idct8 function starts off exemplary in this regard.

197g

LGTM.

veluca93 · 2022-01-08T12:47:59Z

Hi!
I was thinking of doing the aarch64/arm version, but would it be possible to merge this PR first?

paolobarbolini · 2022-01-27T18:13:59Z

src/arch/ssse3.rs

+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+#[target_feature(enable = "ssse3")]
+pub unsafe fn dequantize_and_idct_block_8x8(
+    coefficients: &[i16],


Drive-by review: just like #215 replaced some slices with arrays it would make sense to do the same here
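A sketch of what the array-based signature could buy, under the assumption that a block always holds exactly 64 coefficients. The function names and the dequantization step are illustrative only, not the PR's code. With `&[i16; 64]` the length is part of the type, so indexing with `0..64` needs no runtime length check, and callers holding a slice convert once at the boundary.

```rust
use std::convert::TryFrom;

/// Multiplies each coefficient by its quantization step.
/// Hypothetical helper: the array types make the block size a compile-time fact.
pub fn dequantize(coefficients: &[i16; 64], quantization_table: &[u16; 64]) -> [i32; 64] {
    let mut out = [0i32; 64];
    for i in 0..64 {
        out[i] = coefficients[i] as i32 * quantization_table[i] as i32;
    }
    out
}

/// Boundary adapter for callers that only have a slice.
/// Returns None if the slice is not exactly 64 elements long.
pub fn dequantize_slice(coefficients: &[i16], quantization_table: &[u16; 64]) -> Option<[i32; 64]> {
    let block = <&[i16; 64]>::try_from(coefficients).ok()?;
    Some(dequantize(block, quantization_table))
}
```

For an unsafe SIMD function this also moves the "is the buffer long enough" precondition from documentation into the type system.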

197g · 2022-01-27T18:28:48Z

Some formatting changes in another branch had caused a conflict; I fixed that formatting in a merge.

veluca93 added 5 commits December 16, 2021 13:04

Add large image benchmark

4d6502a

Autoformat decoder.rs

6806e03

SSE3 YCbCr conversion

e831f95

Compared with baseline, no rayon: decode a 2268x1512 JPEG time: [35.694 ms 35.736 ms 35.778 ms] change: [-23.510% -23.401% -23.290%] (p = 0.00 < 0.05)

Autoformat idct.rs

d7af61d

197g reviewed Dec 16, 2021

Address review comments.

774806a

veluca93 marked this pull request as ready for review December 16, 2021 20:58

Move allow_unsafe to the whole arch module.

71eb5d0

fintelia reviewed Dec 16, 2021

Remove debuginfo

5a33d5d

197g reviewed Dec 20, 2021

Rename "simd" feature to not(platform_independent)

5db115c

Also return fn pointers rather than using has_* functions.

Fix ssse3 code on beta+.

80cc0be

197g reviewed Dec 21, 2021

Update rust version message.

5a425fe

197g reviewed Dec 21, 2021

Add more comments.

3aa8a4f

197g approved these changes Dec 21, 2021

paolobarbolini reviewed Jan 27, 2022

Merge remote-tracking branch 'origin/master' into HEAD

2986d5a

197g merged commit 56bb2a0 into image-rs:master Jan 27, 2022

197g mentioned this pull request Jan 27, 2022

Drive-by review: just like #215 replaced some slices with arrays it would make sense to do the same here #217

Closed

wartmanm pushed a commit to wartmanm/jpeg-decoder that referenced this pull request Sep 15, 2022

Merge pull request image-rs#211 from veluca93/master

edcb0d8

SSSE3 implementations of 8x8 IDCT and YCbCr conversion
