-
Notifications
You must be signed in to change notification settings - Fork 90
Consider accelerated IDCT? #79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Patches for SIMD acceleration are welcome, I don't have the time to implement it myself. |
Is this something you would be able to delegate or work on with additional funding? |
For the forseeable future the project is mainly limited by dev head count/time. Delegation is not a problem as long as the changes are properly licensed. Porting libjpeg-turbo's SIMD implementation, which seems to be zlib licensed, sounds reasonable. We'd have to figure out how best to test the combination of simd/no-simd and the rest of the code but I don't see any blocking concerns there either. |
I'm considering using this project in https://github.com/imazen/imageflow as the default jpeg decoder, but I expect I will need to expand things like color profile/exif support and add assembly and C to the build to optimize performance-critical bits. I'd like to make sure these changes align with the project direction before investing. Ideally, I'd like to locate someone who can work on this codec full-time for a while and bring it up to parity with libjpeg-turbo (which is not usable in Rust due to setjmp usage). |
That would be awesome. As for governance, this crate in particular was previously a separate entity. It moved to the Adding ¹Random thought: Maybe Embark Studios would also be interested and/or have someone in mind? |
I believe the use case for C would be the optimized huffman decoder (probably optional), but for assembly I'd like to keep the files in sync with the audited codebase of libjpeg-turbo if possible. https://github.com/libjpeg-turbo/libjpeg-turbo/tree/master/simd/x86_64 I'm not sure how to resolve multi-versioning with needing to use NASM specifically. |
Even before re-implementing it in assembler, I think there is a lot that can be done to improve the performance of the existing IDCT code in safe rust. I am in no way a specialist, but a quick look at the generated assembly lets me think it is far from optimal. There must be a way to make the compiler use at least some simd instructions by changing how we organize statements there. |
See image-rs#79 The new code is both more readable and faster Old assembly: mov eax, edi cmp edi, 256 jb .LBB1_2 sar eax, 31 not al .LBB1_2: ret New assembly: xor ecx, ecx test edi, edi cmovns ecx, edi cmp ecx, 255 mov eax, 255 cmovl eax, ecx ret Benchmark results : Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.5s or reduce sample count to 40 decode a 512x512 JPEG time: [2.4692 ms 2.4873 ms 2.5106 ms] change: [-18.558% -17.141% -15.659%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) high mild 6 (6.00%) high severe Benchmarking decode a 512x512 progressive JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.0s or reduce sample count to 20 decode a 512x512 progressive JPEG time: [5.5010 ms 5.5212 ms 5.5459 ms] change: [-12.718% -11.746% -10.721%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe extract metadata from an image time: [1.3028 us 1.3110 us 1.3207 us] change: [+1.8341% +2.8787% +3.8439%] (p = 0.00 < 0.05) Performance has regressed. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe
There was a similar (well, somewhat similar) code converting between color representation in |
After 4 years I finally was able to overcome the blocking issue I had with using libjpeg-turbo, so I probably won't be using this as the primary decoder, but I think I will keep it as a backup. Thanks! |
See image-rs#79 The new code is both more readable and faster Old assembly: mov eax, edi cmp edi, 256 jb .LBB1_2 sar eax, 31 not al .LBB1_2: ret New assembly: xor ecx, ecx test edi, edi cmovns ecx, edi cmp ecx, 255 mov eax, 255 cmovl eax, ecx ret Benchmark results : Benchmarking decode a 512x512 JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.5s or reduce sample count to 40 decode a 512x512 JPEG time: [2.4692 ms 2.4873 ms 2.5106 ms] change: [-18.558% -17.141% -15.659%] (p = 0.00 < 0.05) Performance has improved. Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) high mild 6 (6.00%) high severe Benchmarking decode a 512x512 progressive JPEG: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 28.0s or reduce sample count to 20 decode a 512x512 progressive JPEG time: [5.5010 ms 5.5212 ms 5.5459 ms] change: [-12.718% -11.746% -10.721%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 3 (3.00%) high mild 6 (6.00%) high severe extract metadata from an image time: [1.3028 us 1.3110 us 1.3207 us] change: [+1.8341% +2.8787% +3.8439%] (p = 0.00 < 0.05) Performance has regressed. Found 9 outliers among 100 measurements (9.00%) 7 (7.00%) high mild 2 (2.00%) high severe
I would imagine that with IDCT accelerated, Jpeg->YCbCr unscaled planar performance should meet/exceed libjpeg-turbo (it hasn't merged these yet - I think these handle multiple blocks).
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jidctflt-avx2-64.asm
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jfdctflt-avx2-64.asm
https://github.com/huaqiangwang/libjpeg-turbo/blob/avx2-dev/simd/jfdctmflt-avx2-64.asm
The text was updated successfully, but these errors were encountered: