Consider accelerated IDCT? #79

lilith · 2017-09-13T22:41:18Z

I would imagine that with IDCT accelerated, Jpeg->YCbCr unscaled planar performance should meet/exceed libjpeg-turbo (it hasn't merged these yet - I think these handle multiple blocks).

https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jidctflt-avx2-64.asm
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jfdctflt-avx2-64.asm
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jfdctmflt-avx2-64.asm

kaksmet · 2017-09-15T19:52:59Z

Patches for SIMD acceleration are welcome, I don't have the time to implement it myself.

lilith · 2020-04-09T06:37:43Z

Is this something you would be able to delegate or work on with additional funding?

197g · 2020-04-09T15:35:17Z

For the forseeable future the project is mainly limited by dev head count/time. Delegation is not a problem as long as the changes are properly licensed. Porting libjpeg-turbo's SIMD implementation, which seems to be zlib licensed, sounds reasonable. We'd have to figure out how best to test the combination of simd/no-simd and the rest of the code but I don't see any blocking concerns there either.

lilith · 2020-04-09T16:15:34Z

I'm considering using this project in https://github.com/imazen/imageflow as the default jpeg decoder, but I expect I will need to expand things like color profile/exif support and add assembly and C to the build to optimize performance-critical bits. I'd like to make sure these changes align with the project direction before investing.

Ideally, I'd like to locate someone who can work on this codec full-time for a while and bring it up to parity with libjpeg-turbo (which is not usable in Rust due to setjmp usage).

197g · 2020-04-10T16:16:06Z

Ideally, I'd like to locate someone who can work on this codec full-time for a while and bring it up to parity with libjpeg-turbo (which is not usable in Rust due to setjmp usage).

That would be awesome. As for governance, this crate in particular was previously a separate entity. It moved to the image-rs organization to maintain and we're open to collaborators and I have no problem with another developer in a lead role per-se.¹

Adding C dependencies or large amounts of unsafe, however, would be controversial and require good arguments for why it is necessary. Adding assembly as a performance optimization, as long as these bits stay optional, is okay. It would be preferrable to use simd intrinsics of Rust but that's an opinion that can be changed with a convincing performance argument. We'd just need to figure out a way to avoid the incompatibility concerns/symbol collisions that come with linking through native interface (e.g. ring` has gone this route and using different versions simply doesn't work).

¹Random thought: Maybe Embark Studios would also be interested and/or have someone in mind?

lilith · 2020-04-10T16:56:13Z

I believe the use case for C would be the optimized huffman decoder (probably optional), but for assembly I'd like to keep the files in sync with the audited codebase of libjpeg-turbo if possible. https://github.com/libjpeg-turbo/libjpeg-turbo/tree/master/simd/x86_64

I'm not sure how to resolve multi-versioning with needing to use NASM specifically.
edit: perhaps a preprocessing build step could mangle the function names?

lovasoa · 2020-04-10T18:31:17Z

Even before re-implementing it in assembler, I think there is a lot that can be done to improve the performance of the existing IDCT code in safe rust.

I am in no way a specialist, but a quick look at the generated assembly lets me think it is far from optimal. There must be a way to make the compiler use at least some simd instructions by changing how we organize statements there.

See image-rs#79 The new code is both more readable and faster Old assembly: mov eax, edi cmp edi, 256 jb .LBB1_2 sar eax, 31 not al .LBB1_2: ret New assembly: xor ecx, ecx test edi, edi cmovns ecx, edi cmp ecx, 255 mov eax, 255 cmovl eax, ecx ret Benchmark results : Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.5s or reduce sample count to 40 decode a 512x512 JPEG time: [2.4692 ms 2.4873 ms 2.5106 ms] change: [-18.558% -17.141% -15.659%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) high mild 6 (6.00%) high severe Benchmarking decode a 512x512 progressive JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.0s or reduce sample count to 20 decode a 512x512 progressive JPEG time: [5.5010 ms 5.5212 ms 5.5459 ms] change: [-12.718% -11.746% -10.721%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe extract metadata from an image time: [1.3028 us 1.3110 us 1.3207 us] change: [+1.8341% +2.8787% +3.8439%] (p = 0.00 < 0.05) Performance has regressed. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe

197g · 2020-04-10T18:44:25Z

There was a similar (well, somewhat similar) code converting between color representation in image/webp. There are likely some lessons that can be learned from this, in particular about when Rust/llvm is able to elide bounds checks on slice access and consequently able to automatically vectorize simple loops.

lilith · 2020-04-10T23:20:08Z

After 4 years I finally was able to overcome the blocking issue I had with using libjpeg-turbo, so I probably won't be using this as the primary decoder, but I think I will keep it as a backup. Thanks!

See image-rs#79 The new code is both more readable and faster Old assembly: mov eax, edi cmp edi, 256 jb .LBB1_2 sar eax, 31 not al .LBB1_2: ret New assembly: xor ecx, ecx test edi, edi cmovns ecx, edi cmp ecx, 255 mov eax, 255 cmovl eax, ecx ret Benchmark results : Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.5s or reduce sample count to 40 decode a 512x512 JPEG time: [2.4692 ms 2.4873 ms 2.5106 ms] change: [-18.558% -17.141% -15.659%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) high mild 6 (6.00%) high severe Benchmarking decode a 512x512 progressive JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.0s or reduce sample count to 20 decode a 512x512 progressive JPEG time: [5.5010 ms 5.5212 ms 5.5459 ms] change: [-12.718% -11.746% -10.721%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe extract metadata from an image time: [1.3028 us 1.3110 us 1.3207 us] change: [+1.8341% +2.8787% +3.8439%] (p = 0.00 < 0.05) Performance has regressed. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe

lovasoa mentioned this issue Apr 10, 2020

small idct code improvement #140

Merged

lovasoa mentioned this issue Apr 13, 2020

[WIP] SIMD-accelerated IDCT #146

Closed

veluca93 mentioned this issue Dec 16, 2021

SSSE3 implementations of 8x8 IDCT and YCbCr conversion #211

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Consider accelerated IDCT? #79

Consider accelerated IDCT? #79

lilith commented Sep 13, 2017

kaksmet commented Sep 15, 2017

Uh oh!

lilith commented Apr 9, 2020

Uh oh!

197g commented Apr 9, 2020

Uh oh!

lilith commented Apr 9, 2020

Uh oh!

197g commented Apr 10, 2020

Uh oh!

lilith commented Apr 10, 2020 •

edited

Loading

Uh oh!

lovasoa commented Apr 10, 2020

Uh oh!

197g commented Apr 10, 2020 •

edited

Loading

Uh oh!

lilith commented Apr 10, 2020

Uh oh!

Consider accelerated IDCT? #79

Consider accelerated IDCT? #79

Comments

lilith commented Sep 13, 2017

kaksmet commented Sep 15, 2017

Uh oh!

lilith commented Apr 9, 2020

Uh oh!

197g commented Apr 9, 2020

Uh oh!

lilith commented Apr 9, 2020

Uh oh!

197g commented Apr 10, 2020

Uh oh!

lilith commented Apr 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lovasoa commented Apr 10, 2020

Uh oh!

197g commented Apr 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lilith commented Apr 10, 2020

Uh oh!

lilith commented Apr 10, 2020 •

edited

Loading

197g commented Apr 10, 2020 •

edited

Loading