Skip to content

Consider accelerated IDCT? #79

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lilith opened this issue Sep 13, 2017 · 9 comments
Open

Consider accelerated IDCT? #79

lilith opened this issue Sep 13, 2017 · 9 comments

Comments

@lilith
Copy link
Contributor

lilith commented Sep 13, 2017

I would imagine that with IDCT accelerated, Jpeg->YCbCr unscaled planar performance should meet/exceed libjpeg-turbo (it hasn't merged these yet - I think these handle multiple blocks).

https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jidctflt-avx2-64.asm
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jfdctflt-avx2-64.asm
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jfdctmflt-avx2-64.asm

@kaksmet
Copy link
Member

kaksmet commented Sep 15, 2017

Patches for SIMD acceleration are welcome, I don't have the time to implement it myself.

@lilith
Copy link
Contributor Author

lilith commented Apr 9, 2020

Is this something you would be able to delegate or work on with additional funding?

@197g
Copy link
Member

197g commented Apr 9, 2020

For the forseeable future the project is mainly limited by dev head count/time. Delegation is not a problem as long as the changes are properly licensed. Porting libjpeg-turbo's SIMD implementation, which seems to be zlib licensed, sounds reasonable. We'd have to figure out how best to test the combination of simd/no-simd and the rest of the code but I don't see any blocking concerns there either.

@lilith
Copy link
Contributor Author

lilith commented Apr 9, 2020

I'm considering using this project in https://github.com/imazen/imageflow as the default jpeg decoder, but I expect I will need to expand things like color profile/exif support and add assembly and C to the build to optimize performance-critical bits. I'd like to make sure these changes align with the project direction before investing.

Ideally, I'd like to locate someone who can work on this codec full-time for a while and bring it up to parity with libjpeg-turbo (which is not usable in Rust due to setjmp usage).

@197g
Copy link
Member

197g commented Apr 10, 2020

Ideally, I'd like to locate someone who can work on this codec full-time for a while and bring it up to parity with libjpeg-turbo (which is not usable in Rust due to setjmp usage).

That would be awesome. As for governance, this crate in particular was previously a separate entity. It moved to the image-rs organization to maintain and we're open to collaborators and I have no problem with another developer in a lead role per-se.¹

Adding C dependencies or large amounts of unsafe, however, would be controversial and require good arguments for why it is necessary. Adding assembly as a performance optimization, as long as these bits stay optional, is okay. It would be preferrable to use simd intrinsics of Rust but that's an opinion that can be changed with a convincing performance argument. We'd just need to figure out a way to avoid the incompatibility concerns/symbol collisions that come with linking through native interface (e.g. ring` has gone this route and using different versions simply doesn't work).

¹Random thought: Maybe Embark Studios would also be interested and/or have someone in mind?

@lilith
Copy link
Contributor Author

lilith commented Apr 10, 2020

I believe the use case for C would be the optimized huffman decoder (probably optional), but for assembly I'd like to keep the files in sync with the audited codebase of libjpeg-turbo if possible. https://github.com/libjpeg-turbo/libjpeg-turbo/tree/master/simd/x86_64

I'm not sure how to resolve multi-versioning with needing to use NASM specifically.
edit: perhaps a preprocessing build step could mangle the function names?

@lovasoa
Copy link
Contributor

lovasoa commented Apr 10, 2020

Even before re-implementing it in assembler, I think there is a lot that can be done to improve the performance of the existing IDCT code in safe rust.

I am in no way a specialist, but a quick look at the generated assembly lets me think it is far from optimal. There must be a way to make the compiler use at least some simd instructions by changing how we organize statements there.

lovasoa added a commit to lovasoa/jpeg-decoder that referenced this issue Apr 10, 2020
See image-rs#79
The new code is both more readable and faster

Old assembly:
        mov     eax, edi
        cmp     edi, 256
        jb      .LBB1_2
        sar     eax, 31
        not     al
   .LBB1_2:
        ret

New assembly:
        xor     ecx, ecx
        test    edi, edi
        cmovns  ecx, edi
        cmp     ecx, 255
        mov     eax, 255
        cmovl   eax, ecx
        ret

Benchmark results :

Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.5s or reduce sample count to 40
decode a 512x512 JPEG   time:   [2.4692 ms 2.4873 ms 2.5106 ms]
                        change: [-18.558% -17.141% -15.659%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking decode a 512x512 progressive JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.0s or reduce sample count to 20
decode a 512x512 progressive JPEG
                        time:   [5.5010 ms 5.5212 ms 5.5459 ms]
                        change: [-12.718% -11.746% -10.721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

extract metadata from an image
                        time:   [1.3028 us 1.3110 us 1.3207 us]
                        change: [+1.8341% +2.8787% +3.8439%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
@197g
Copy link
Member

197g commented Apr 10, 2020

There was a similar (well, somewhat similar) code converting between color representation in image/webp. There are likely some lessons that can be learned from this, in particular about when Rust/llvm is able to elide bounds checks on slice access and consequently able to automatically vectorize simple loops.

@lilith
Copy link
Contributor Author

lilith commented Apr 10, 2020

After 4 years I finally was able to overcome the blocking issue I had with using libjpeg-turbo, so I probably won't be using this as the primary decoder, but I think I will keep it as a backup. Thanks!

wartmanm pushed a commit to wartmanm/jpeg-decoder that referenced this issue Oct 4, 2021
See image-rs#79
The new code is both more readable and faster

Old assembly:
        mov     eax, edi
        cmp     edi, 256
        jb      .LBB1_2
        sar     eax, 31
        not     al
   .LBB1_2:
        ret

New assembly:
        xor     ecx, ecx
        test    edi, edi
        cmovns  ecx, edi
        cmp     ecx, 255
        mov     eax, 255
        cmovl   eax, ecx
        ret

Benchmark results :

Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.5s or reduce sample count to 40
decode a 512x512 JPEG   time:   [2.4692 ms 2.4873 ms 2.5106 ms]
                        change: [-18.558% -17.141% -15.659%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking decode a 512x512 progressive JPEG: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.0s or reduce sample count to 20
decode a 512x512 progressive JPEG
                        time:   [5.5010 ms 5.5212 ms 5.5459 ms]
                        change: [-12.718% -11.746% -10.721%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

extract metadata from an image
                        time:   [1.3028 us 1.3110 us 1.3207 us]
                        change: [+1.8341% +2.8787% +3.8439%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants