Skip to content

SSSE3 implementations of 8x8 IDCT and YCbCr conversion #211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 27, 2022

Conversation

veluca93
Copy link
Contributor

Hi folks!

This PR is more of a proof of concept / demo than an actual PR - I realize there's probably a few things to discuss/agree on/do before merging this in, but I figured a benchmark was the best way to get the discussion started ;). See also #146 for a previous take on this.

This PR contains two SSSE3 implementations of respectively 8x8 IDCT and YCbCr conversion.
Why SSSE3? Two reasons: it is extremely common nowadays (99.25% of x86 PCs according to the Steam Hardware survey), while offering some operations (in particular blends) that provide some significant speedups over SSE3 or SSE2. On the other hand, the next "useful" upgrade in x86 SIMD would be AVX2, which wouldn't help that much (in particular it wouldn't help IDCT) and is not that common (85.98%).

From the benchmark, you can see speedups (with a single thread) of a factor of approximately 2x in the common case of a 444 YCbCr image.

While I didn't (yet!) implement this for ARM NEON, I expect the speedup there to be similar, likely resolving #202. See also #79.

decode a 512x512 JPEG   time:   [1.7992 ms 1.8000 ms 1.8009 ms]                                   
                        change: [-49.653% -49.613% -49.574%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 512x512 progressive JPEG                                                                             
                        time:   [4.5663 ms 4.5848 ms 4.6057 ms]
                        change: [-29.470% -28.997% -28.514%] (p = 0.00 < 0.05)
                        Performance has improved.

decode a 512x512 grayscale JPEG                                                                            
                        time:   [780.95 us 782.16 us 783.31 us]
                        change: [-36.512% -36.366% -36.222%] (p = 0.00 < 0.05)
                        Performance has improved.


decode a 2268x1512 JPEG time:   [22.317 ms 22.339 ms 22.362 ms]                                    
                        change: [-52.237% -52.149% -52.073%] (p = 0.00 < 0.05)
                        Performance has improved.

This brings me to the open questions I have related to this PR:

  • Unfortunately, different rounding produces slightly different results in the SIMD code, although I am fairly confident this is still within the bounds of conformance as defined in the JPEG standard ("DCTs back to the same coefficients"). I'm not sure if we'd want to make sure that the non-SIMD implementation matches the SIMD one.
  • The SIMD implementation doesn't handle all the values that the non-SIMD one does -- in particular it doesn't handle i16::MAX, although to my understanding such values can never appear in a JPEG bitstream.
  • As this code uses hand-crafted intrinsics, it makes use of two unsafe functions. I'm confident they are safe (they just do arithmetic, swizzle, and loads/store to arrays that are bounds-checked in advance), but it may be a good idea to put the SIMD code in a separate crate, in separate files and/or behind a feature flag.
  • Should we also SIMDfy other things? What comes to mind are other color conversions, smaller (4x4 and 2x2) DCTs and chroma upsampling.

I'd be happy to hear your thoughts on this :)

Compared with baseline, no rayon:

decode a 2268x1512 JPEG time:   [35.694 ms 35.736 ms 35.778 ms]
                        change: [-23.510% -23.401% -23.290%] (p = 0.00 < 0.05)
Note that this does not give identical results to the non-SSSE3 version.
However, this should be OK as the JPEG specification doesn't mandate a
specific IDCT implementation.

decode a 2268x1512 JPEG time:   [22.236 ms 22.260 ms 22.283 ms]
                        change: [-36.889% -36.804% -36.726%] (p = 0.00 < 0.05)
Copy link
Member

@197g 197g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR 🎉

Regarding folder structures it would make sense to me to have src/arch/sse3.rs for those particular algorithms. This has two purposes:

  • #[allow(unsafe)] could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.
  • It groups code by the specialized knowledge required to read it, understand it, review it, change it.

Regarding the code, the implement indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducability of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder.

There's also some questions regarding style in the code itself. See below.

@veluca93
Copy link
Contributor Author

Thank you for this PR tada

Regarding folder structures it would make sense to me to have src/arch/sse3.rs for those particular algorithms. This has two purposes:

  • #[allow(unsafe)] could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.
  • It groups code by the specialized knowledge required to read it, understand it, review it, change it.

Regarding the code, the implement indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducability of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder.

There's also some questions regarding style in the code itself. See below.

Replied to some higher-level comments while I deal with the other ones :)

@veluca93 veluca93 marked this pull request as ready for review December 16, 2021 20:58
@veluca93
Copy link
Contributor Author

Thank you for this PR tada
Regarding folder structures it would make sense to me to have src/arch/sse3.rs for those particular algorithms. This has two purposes:

  • #[allow(unsafe)] could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.
  • It groups code by the specialized knowledge required to read it, understand it, review it, change it.

Regarding the code, the implement indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducability of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder.
There's also some questions regarding style in the code itself. See below.

Replied to some higher-level comments while I deal with the other ones :)

I added a enabled-by-default "simd" feature to allow disabling compilation of the SIMD / unsafe code. PTAL :)

Copy link
Member

@197g 197g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new structure better. Thanks for integrating all of the feedback, this is quite a new kind of addition to any of the libraries (png mostly relies on auto-vectorization, if you're looking for a task afterwards 😏) and thus I'd like to find a solution that we can use as a template for similar additions otherwhere.

@veluca93
Copy link
Contributor Author

I like the new structure better. Thanks for integrating all of the feedback, this is quite a new kind of addition to any of the libraries (png mostly relies on auto-vectorization, if you're looking for a task afterwards smirk) and thus I'd like to find a solution that we can use as a template for similar additions otherwhere.

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

@197g
Copy link
Member

197g commented Dec 20, 2021

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through miniz_oxide) is not simdified despite zlib-ng demonstrating how platform-specific code might show speed gains over the similarly written zlib. And then the actual color stage involve some line-by-line filtering, permutation of bytes, etc. There's always code that can be sped up by careful SIMD. The big question is whether LLVM has already done so implicitly but then it's most likely a 'generic' variant based on avx or w/e is in the default set of extensions of modern x86-64 ABI.

Also return fn pointers rather than using has_* functions.
@veluca93
Copy link
Contributor Author

veluca93 commented Dec 20, 2021

FWIW I doubt SIMD in png decoders can help much :D (perhaps in the encoder...)

There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through miniz_oxide) is not simdified despite zlib-ng demonstrating how platform-specific code might show speed gains over the similarly written zlib. And then the actual color stage involve some line-by-line filtering, permutation of bytes, etc. There's always code that can be sped up by careful SIMD. The big question is whether LLVM has already done so implicitly but then it's most likely a 'generic' variant based on avx or w/e is in the default set of extensions of modern x86-64 ABI.

In my experience, most of the gains from manual SIMDfication would be in encoding for lz77, not in decoding - decoding is simple enough that compilers do a reasonable job about it. Although I might be wrong ;) (also probably adler32)

As for other things in png: line-by-line filtering in the decoder would be rather tricky to SIMDfy (except perhaps across the channels, but then you don't gain that much - IIRC libpng does that). Permutation of bytes would benefit though. OTOH, in the encoder, quite a bit can be done :)

@veluca93
Copy link
Contributor Author

@HeroicKatora: Any clue why the CI would fail on beta and on beta only?

@fintelia
Copy link
Contributor

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

@veluca93
Copy link
Contributor Author

veluca93 commented Dec 21, 2021

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file...

This is looking like it will be a fun investigation :)

@veluca93
Copy link
Contributor Author

Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally

I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file...

This is looking like it will be a fun investigation :)

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects, so I modified the code not to use those intrinsics :)

@197g
Copy link
Member

197g commented Dec 21, 2021

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects

Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect.

@veluca93
Copy link
Contributor Author

I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects

Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect.

Yup, there is rust-lang/rust#84042 :)

Copy link
Member

@197g 197g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can you strew in just a few more comments on the approach and constants, rather than the specific implementation? The idct8 function starts off exemplary in this regard.

Copy link
Member

@197g 197g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@veluca93
Copy link
Contributor Author

veluca93 commented Jan 8, 2022

Hi!
I was thinking of doing the aarch64/arm version, but would it be possible to merge this PR first?

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "ssse3")]
pub unsafe fn dequantize_and_idct_block_8x8(
coefficients: &[i16],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by review: just like #215 replaced some slices with arrays it would make sense to do the same here

@197g
Copy link
Member

197g commented Jan 27, 2022

Some formatting changes in another branch had caused a conflict, I fixed that formatting in a merge.

@197g 197g merged commit 56bb2a0 into image-rs:master Jan 27, 2022
wartmanm pushed a commit to wartmanm/jpeg-decoder that referenced this pull request Sep 15, 2022
SSSE3 implementations of 8x8 IDCT and YCbCr conversion
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants