-
Notifications
You must be signed in to change notification settings - Fork 90
SSSE3 implementations of 8x8 IDCT and YCbCr conversion #211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Compared with baseline, no rayon: decode a 2268x1512 JPEG time: [35.694 ms 35.736 ms 35.778 ms] change: [-23.510% -23.401% -23.290%] (p = 0.00 < 0.05)
Note that this does not give identical results to the non-SSSE3 version. However, this should be OK as the JPEG specification doesn't mandate a specific IDCT implementation. decode a 2268x1512 JPEG time: [22.236 ms 22.260 ms 22.283 ms] change: [-36.889% -36.804% -36.726%] (p = 0.00 < 0.05)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR 🎉
Regarding folder structures it would make sense to me to have src/arch/sse3.rs
for those particular algorithms. This has two purposes:
#[allow(unsafe)]
could be scoped to just the arch module where it is generally accepted to be required. This would imply having dispatch code in that module as well.- It groups code by the specialized knowledge required to read it, understand it, review it, change it.
Regarding the code, the implement indeed looks fine on a first impression. I do not consider it to be necessary to aim for 100% reproducability of the non-SIMD code iff we make it some flag that can be turned off. It should at least be possible to enforce consistent behavior between systems from the deterministic decoder.
There's also some questions regarding style in the code itself. See below.
Replied to some higher-level comments while I deal with the other ones :) |
I added a enabled-by-default "simd" feature to allow disabling compilation of the SIMD / unsafe code. PTAL :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new structure better. Thanks for integrating all of the feedback, this is quite a new kind of addition to any of the libraries (png
mostly relies on auto-vectorization, if you're looking for a task afterwards 😏) and thus I'd like to find a solution that we can use as a template for similar additions otherwhere.
FWIW I doubt SIMD in |
There's a whole adler32 stage that's going to be simd-ified soon for some speed gains. Plus it seems that the main zlib (through |
Also return fn pointers rather than using has_* functions.
In my experience, most of the gains from manual SIMDfication would be in encoding for lz77, not in decoding - decoding is simple enough that compilers do a reasonable job about it. Although I might be wrong ;) (also probably adler32) As for other things in |
@HeroicKatora: Any clue why the CI would fail on beta and on beta only? |
Not sure why that's happening, but I tried rerunning the CI and the same test failed. It is probably worth trying to replicate locally |
I can reproduce locally both on beta and nightly, and with and without rayon -- albeit on a different file... This is looking like it will be a fun investigation :) |
I was right - somehow, between stable and beta some intrinsics managed to produce slightly different effects, so I modified the code not to use those intrinsics :) |
Have you checked rust-lang/rust for issues? This sounds like a serious codegen bug if the effect of (unsafe) instructions is incorrect. |
Yup, there is rust-lang/rust#84042 :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can you strew in just a few more comments on the approach and constants, rather than the specific implementation? The idct8
function starts off exemplary in this regard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Hi! |
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))] | ||
#[target_feature(enable = "ssse3")] | ||
pub unsafe fn dequantize_and_idct_block_8x8( | ||
coefficients: &[i16], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by review: just like #215 replaced some slices with arrays it would make sense to do the same here
Some formatting changes in another branch had caused a conflict, I fixed that formatting in a merge. |
SSSE3 implementations of 8x8 IDCT and YCbCr conversion
Hi folks!
This PR is more of a proof of concept / demo than an actual PR - I realize there's probably a few things to discuss/agree on/do before merging this in, but I figured a benchmark was the best way to get the discussion started ;). See also #146 for a previous take on this.
This PR contains two SSSE3 implementations of respectively 8x8 IDCT and YCbCr conversion.
Why SSSE3? Two reasons: it is extremely common nowadays (99.25% of x86 PCs according to the Steam Hardware survey), while offering some operations (in particular blends) that provide some significant speedups over SSE3 or SSE2. On the other hand, the next "useful" upgrade in x86 SIMD would be AVX2, which wouldn't help that much (in particular it wouldn't help IDCT) and is not that common (85.98%).
From the benchmark, you can see speedups (with a single thread) of a factor of approximately 2x in the common case of a 444 YCbCr image.
While I didn't (yet!) implement this for ARM NEON, I expect the speedup there to be similar, likely resolving #202. See also #79.
This brings me to the open questions I have related to this PR:
unsafe
functions. I'm confident they are safe (they just do arithmetic, swizzle, and loads/store to arrays that are bounds-checked in advance), but it may be a good idea to put the SIMD code in a separate crate, in separate files and/or behind a feature flag.I'd be happy to hear your thoughts on this :)