
Optimizing parsepngstream #73

Open
ajstarks opened this issue May 24, 2024 · 4 comments

@ajstarks

Based on profiling my application, image processing is the most CPU-intensive part, with parsepngstream taking a good bit of the time. Is there an opportunity to optimize this function? Perhaps by using image/png from the standard library?

Here is my profile:

File: ppdfdeck
Type: cpu
Time: May 23, 2024 at 3:11pm (EDT)
Duration: 4.11s, Total samples = 3.91s (95.24%)
Showing nodes accounting for 3.83s, 97.95% of 3.91s total
Dropped 36 nodes (cum <= 0.02s)
      flat  flat%   sum%        cum   cum%
     0.64s 16.37% 16.37%      0.64s 16.37%  github.com/go-pdf/fpdf.(*wbuffer).u8 (inline)
     0.36s  9.21% 25.58%      0.36s  9.21%  compress/flate.(*deflateFast).matchLen
     0.36s  9.21% 34.78%      0.36s  9.21%  hash/adler32.update
     0.27s  6.91% 41.69%      0.27s  6.91%  runtime.memmove
     0.26s  6.65% 48.34%      0.77s 19.69%  compress/flate.(*deflateFast).encode
     0.24s  6.14% 54.48%      0.29s  7.42%  compress/flate.(*decompressor).huffSym
     0.23s  5.88% 60.36%      0.76s 19.44%  compress/flate.(*decompressor).huffmanBlock
     0.21s  5.37% 65.73%      0.21s  5.37%  runtime.memclrNoHeapPointers
     0.19s  4.86% 70.59%      0.37s  9.46%  compress/flate.(*huffmanBitWriter).writeTokens
     0.18s  4.60% 75.19%      3.67s 93.86%  github.com/go-pdf/fpdf.(*Fpdf).parsepngstream
     0.10s  2.56% 77.75%      0.11s  2.81%  compress/flate.(*huffmanBitWriter).writeCode
     0.08s  2.05% 79.80%      0.33s  8.44%  compress/flate.(*huffmanBitWriter).indexTokens
     0.07s  1.79% 81.59%      0.09s  2.30%  compress/flate.(*huffmanEncoder).bitCounts
     0.06s  1.53% 83.12%      0.14s  3.58%  compress/flate.(*dictDecoder).tryWriteCopy (inline)
     0.06s  1.53% 84.65%      0.06s  1.53%  compress/flate.load32 (inline)
     0.05s  1.28% 85.93%      0.05s  1.28%  bytes.(*Reader).ReadByte
     0.04s  1.02% 86.96%      0.04s  1.02%  compress/flate.(*decompressor).moreBits
     0.04s  1.02% 87.98%      0.04s  1.02%  compress/flate.(*dictDecoder).writeByte (inline)
     0.04s  1.02% 89.00%      0.04s  1.02%  compress/flate.(*huffmanBitWriter).writeBits
     0.04s  1.02% 90.03%      0.04s  1.02%  compress/flate.load64 (inline)
     0.04s  1.02% 91.05%      0.04s  1.02%  compress/flate.token.length (inline)
     0.04s  1.02% 92.07%      0.04s  1.02%  runtime/internal/syscall.Syscall6
     0.04s  1.02% 93.09%      0.05s  1.28%  sort.insertionSort
     0.03s  0.77% 93.86%      0.03s  0.77%  crypto/sha1.blockAVX2
     0.03s  0.77% 94.63%      0.04s  1.02%  sort.partition
     0.02s  0.51% 95.14%      0.02s  0.51%  compress/flate.(*dictDecoder).histSize (inline)
     0.02s  0.51% 95.65%      0.02s  0.51%  compress/flate.(*huffmanBitWriter).writeDynamicHeader
     0.02s  0.51% 96.16%      0.02s  0.51%  compress/flate.byLiteral.Less
     0.02s  0.51% 96.68%      0.02s  0.51%  compress/flate.lengthCode (inline)
     0.02s  0.51% 97.19%      0.02s  0.51%  compress/flate.token.offset (inline)
     0.02s  0.51% 97.70%      0.02s  0.51%  runtime.procyield
     0.01s  0.26% 97.95%      0.20s  5.12%  compress/flate.(*huffmanEncoder).generate
         0     0% 97.95%      1.10s 28.13%  bytes.(*Buffer).ReadFrom
         0     0% 97.95%      0.06s  1.53%  bytes.(*Buffer).WriteString
         0     0% 97.95%      0.16s  4.09%  bytes.(*Buffer).grow
         0     0% 97.95%      0.16s  4.09%  bytes.growSlice
         0     0% 97.95%      1.53s 39.13%  compress/flate.(*Writer).Write (inline)
         0     0% 97.95%      0.07s  1.79%  compress/flate.(*byFreq).sort (inline)
         0     0% 97.95%      0.03s  0.77%  compress/flate.(*byLiteral).sort (inline)
         0     0% 97.95%      1.51s 38.62%  compress/flate.(*compressor).encSpeed
         0     0% 97.95%      0.02s  0.51%  compress/flate.(*compressor).fillStore
         0     0% 97.95%      0.02s  0.51%  compress/flate.(*compressor).init
         0     0% 97.95%      1.53s 39.13%  compress/flate.(*compressor).write
         0     0% 97.95%      0.83s 21.23%  compress/flate.(*decompressor).Read
         0     0% 97.95%      0.14s  3.58%  compress/flate.(*decompressor).nextBlock
         0     0% 97.95%      0.74s 18.93%  compress/flate.(*huffmanBitWriter).writeBlockDynamic
         0     0% 97.95%      0.03s  0.77%  compress/flate.(*huffmanEncoder).assignEncodingAndSize
         0     0% 97.95%      0.04s  1.02%  compress/flate.NewWriter (inline)
         0     0% 97.95%      0.04s  1.02%  compress/flate.NewWriterDict
         0     0% 97.95%      1.78s 45.52%  compress/zlib.(*Writer).Write
         0     0% 97.95%      0.04s  1.02%  compress/zlib.(*Writer).writeHeader
         0     0% 97.95%      0.98s 25.06%  compress/zlib.(*reader).Read
         0     0% 97.95%      0.03s  0.77%  crypto/sha1.(*digest).Write
         0     0% 97.95%      0.03s  0.77%  crypto/sha1.block
         0     0% 97.95%      0.09s  2.30%  github.com/go-pdf/fpdf.(*Fpdf).Close
         0     0% 97.95%      3.77s 96.42%  github.com/go-pdf/fpdf.(*Fpdf).ImageOptions
         0     0% 97.95%      0.10s  2.56%  github.com/go-pdf/fpdf.(*Fpdf).Output
         0     0% 97.95%      3.77s 96.42%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptions
         0     0% 97.95%      3.77s 96.42%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptionsReader
         0     0% 97.95%      0.09s  2.30%  github.com/go-pdf/fpdf.(*Fpdf).enddoc
         0     0% 97.95%      0.06s  1.53%  github.com/go-pdf/fpdf.(*Fpdf).out
         0     0% 97.95%      3.73s 95.40%  github.com/go-pdf/fpdf.(*Fpdf).parsepng
         0     0% 97.95%      0.08s  2.05%  github.com/go-pdf/fpdf.(*Fpdf).putimage
         0     0% 97.95%      0.08s  2.05%  github.com/go-pdf/fpdf.(*Fpdf).putimages
         0     0% 97.95%      0.08s  2.05%  github.com/go-pdf/fpdf.(*Fpdf).putresources
         0     0% 97.95%      0.07s  1.79%  github.com/go-pdf/fpdf.(*Fpdf).putstream
         0     0% 97.95%      0.03s  0.77%  github.com/go-pdf/fpdf.(*idEncoder).bytes (inline)
         0     0% 97.95%      1.79s 45.78%  github.com/go-pdf/fpdf.(*xmempool).compress
         0     0% 97.95%      1.03s 26.34%  github.com/go-pdf/fpdf.(*xmempool).uncompress
         0     0% 97.95%      0.03s  0.77%  github.com/go-pdf/fpdf.generateImageID
         0     0% 97.95%      0.06s  1.53%  github.com/go-pdf/fpdf.newRBuffer
         0     0% 97.95%      0.36s  9.21%  hash/adler32.(*digest).Write
         0     0% 97.95%      0.03s  0.77%  internal/poll.(*FD).Read
         0     0% 97.95%      0.04s  1.02%  internal/poll.ignoringEINTRIO (inline)
         0     0% 97.95%      3.87s 98.98%  main.dodeck
         0     0% 97.95%      3.77s 96.42%  main.doslides
         0     0% 97.95%      3.87s 98.98%  main.main
         0     0% 97.95%      3.77s 96.42%  main.pdfslide
         0     0% 97.95%      0.03s  0.77%  os.(*File).Read
         0     0% 97.95%      0.03s  0.77%  os.(*File).read (inline)
         0     0% 97.95%      0.03s  0.77%  runtime.gcBgMarkWorker
         0     0% 97.95%      0.03s  0.77%  runtime.gcBgMarkWorker.func2
         0     0% 97.95%      0.03s  0.77%  runtime.gcDrain
         0     0% 97.95%      0.03s  0.77%  runtime.gcDrainMarkWorkerDedicated (inline)
         0     0% 97.95%      0.02s  0.51%  runtime.growslice
         0     0% 97.95%      3.87s 98.98%  runtime.main
         0     0% 97.95%      0.04s  1.02%  runtime.makeslice
         0     0% 97.95%      0.06s  1.53%  runtime.mallocgc
         0     0% 97.95%      0.03s  0.77%  runtime.markroot
         0     0% 97.95%      0.03s  0.77%  runtime.markroot.func1
         0     0% 97.95%      0.03s  0.77%  runtime.memclrNoHeapPointersChunked
         0     0% 97.95%      0.02s  0.51%  runtime.newobject
         0     0% 97.95%      0.03s  0.77%  runtime.suspendG
         0     0% 97.95%      0.04s  1.02%  runtime.systemstack
         0     0% 97.95%      0.10s  2.56%  sort.Sort
         0     0% 97.95%      0.10s  2.56%  sort.pdqsort
         0     0% 97.95%      0.04s  1.02%  syscall.RawSyscall6
         0     0% 97.95%      0.03s  0.77%  syscall.Read (inline)
         0     0% 97.95%      0.04s  1.02%  syscall.Syscall
         0     0% 97.95%      0.03s  0.77%  syscall.read
@sbinet
Contributor

sbinet commented May 24, 2024

hi Anthony,

do you have a simple reproducer?
(so I don't optimize something irrelevant for your use case)


note to self: parsepngstream looks into the pHYs ancillary PNG chunk; image/png.Decode ignores it.
here's some biblio about it:
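To make the pHYs point concrete: reading that chunk from the raw stream is cheap even if image/png handles the pixel data. Below is a minimal sketch of a chunk walker, assuming only the chunk layout from the PNG spec (8-byte signature, then length/type/data/CRC chunks); readPHYs is an illustrative helper, not existing fpdf or image/png API.

package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"os"
)

var pngSig = []byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}

// readPHYs scans the chunk stream for pHYs and returns X/Y pixels per
// unit and the unit byte (1 = metre). io.EOF means the chunk is absent.
func readPHYs(r io.Reader) (x, y uint32, unit byte, err error) {
	sig := make([]byte, 8)
	if _, err = io.ReadFull(r, sig); err != nil {
		return
	}
	if !bytes.Equal(sig, pngSig) {
		return 0, 0, 0, errors.New("not a PNG stream")
	}
	var hdr [8]byte // 4-byte big-endian length + 4-byte chunk type
	for {
		if _, err = io.ReadFull(r, hdr[:]); err != nil {
			return // io.EOF: stream ended without a pHYs chunk
		}
		n := binary.BigEndian.Uint32(hdr[:4])
		if string(hdr[4:]) == "pHYs" && n == 9 {
			var buf [9]byte
			if _, err = io.ReadFull(r, buf[:]); err != nil {
				return
			}
			return binary.BigEndian.Uint32(buf[:4]),
				binary.BigEndian.Uint32(buf[4:8]), buf[8], nil
		}
		// Skip this chunk's data plus its 4-byte CRC.
		if _, err = io.CopyN(io.Discard, r, int64(n)+4); err != nil {
			return
		}
	}
}

func main() {
	f, err := os.Open("hello-world.png")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	x, y, unit, err := readPHYs(f)
	fmt.Println(x, y, unit, err)
}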

@ajstarks
Author

Here is a reproducer: it makes 20 pages using copies of two files, "hello-world.png" and "hello-mars.png" (attached), so that each call processes a distinct file. I prepared the copies like this:

for i in `seq -f %02g 0 19`
do 
    cp hello-world.png hello-world-$i.png
    cp hello-mars.png hello-mars-$i.png
done

FYI, just processing the same files repeatedly is a lot faster (200 ms vs. 1.1 s), but that does not match my use case.
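(Aside on that 200 ms number: fpdf reuses an image already registered under a given name, so a user-side workaround when many files happen to share identical bytes is to key registration on a content hash, parsing each unique payload only once. A sketch, with registerByHash as an illustrative helper; the fpdf calls are real API:

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"

	"github.com/go-pdf/fpdf"
)

// registerByHash registers a PNG under its content hash, skipping
// files whose bytes were already registered, and returns the name
// to pass to pdf.ImageOptions.
func registerByHash(pdf *fpdf.Fpdf, path string, opt fpdf.ImageOptions, seen map[string]bool) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	name := fmt.Sprintf("%x", sha256.Sum256(b))
	if !seen[name] {
		opt.ImageType = "png" // the reader form needs an explicit type
		pdf.RegisterImageOptionsReader(name, opt, bytes.NewReader(b))
		seen[name] = true
	}
	return name, nil
}

func main() {
	pdf := fpdf.New("L", "pt", "A4", "")
	opt := fpdf.ImageOptions{AllowNegativePosition: true}
	seen := map[string]bool{}
	pdf.AddPage()
	name, err := registerByHash(pdf, "hello-world-00.png", opt, seen)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	pdf.ImageOptions(name, 0, 0, 400, 300, false, opt, 0, "")
	_ = pdf.OutputFileAndClose("out.pdf")
}

The reproducer itself: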

package main

import (
	"fmt"
	"os"
	"runtime/pprof"

	"github.com/go-pdf/fpdf"
)

func main() {

	// PDF page is 1920x1080
	pw := 1920.0
	ph := 1080.0

	// set up the page
	pageconfig := fpdf.InitType{
		UnitStr:    "pt",
		SizeStr:    "Widescreen",
		Size:       fpdf.SizeType{Wd: pw, Ht: ph},
		FontDirStr: "",
	}

	pdf := fpdf.NewCustom(&pageconfig)
	imgopt := fpdf.ImageOptions{AllowNegativePosition: true}

	// use copies of files to exercise processing individual files
	f1 := []string{}
	f2 := []string{}
	for i := 0; i < 20; i++ {
		f1 = append(f1, fmt.Sprintf("hello-world-%02d.png", i))
		f2 = append(f2, fmt.Sprintf("hello-mars-%02d.png", i))
	}

	// Profiling...
	pf, err := os.Create("default.pgo")
	if err != nil {
		fmt.Fprintf(os.Stderr, "could not create CPU profile: ", err)
		os.Exit(1)
	}
	defer pf.Close()

	if err := pprof.StartCPUProfile(pf); err != nil {
		fmt.Fprintf(os.Stderr, "could not start CPU profile: ", err)
		os.Exit(1)
	}
	defer pprof.StopCPUProfile()

	// make pages: four images per page, scaled 50%
	// substitute fixed file names for f1[i] and f2[i] to
	// test processing the same files in the loop
	scale := 50.0
	iw := pw * (scale / 100)
	ih := ph * (scale / 100)
	for i := 0; i < len(f1); i++ {
		pdf.AddPage()
		pdf.ImageOptions(f1[i], 0, 0, iw, ih, false, imgopt, 0, "")
		pdf.ImageOptions(f2[i], 960, 0, iw, ih, false, imgopt, 0, "")
		pdf.ImageOptions(f2[i], 0, 540, iw, ih, false, imgopt, 0, "")
		pdf.ImageOptions(f1[i], 960, 540, iw, ih, false, imgopt, 0, "")
	}
	// write the PDF
	if err := pdf.OutputFileAndClose("pdfimg.pdf"); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}

Here is its profile:

File: pdfimg
Type: cpu
Time: May 24, 2024 at 6:08pm (EDT)
Duration: 1.10s, Total samples = 990ms (89.86%)
Showing nodes accounting for 990ms, 100% of 990ms total
      flat  flat%   sum%        cum   cum%
     240ms 24.24% 24.24%      240ms 24.24%  github.com/go-pdf/fpdf.(*wbuffer).u8 (inline)
     150ms 15.15% 39.39%      150ms 15.15%  hash/adler32.update
     120ms 12.12% 51.52%      120ms 12.12%  compress/flate.(*deflateFast).matchLen
     100ms 10.10% 61.62%      100ms 10.10%  runtime.memmove
      80ms  8.08% 69.70%      960ms 96.97%  github.com/go-pdf/fpdf.(*Fpdf).parsepngstream
      80ms  8.08% 77.78%       80ms  8.08%  runtime.memclrNoHeapPointers
      50ms  5.05% 82.83%       50ms  5.05%  compress/flate.(*huffmanEncoder).bitCounts (inline)
      40ms  4.04% 86.87%      170ms 17.17%  compress/flate.(*deflateFast).encode (inline)
      20ms  2.02% 88.89%       20ms  2.02%  compress/flate.(*huffmanBitWriter).writeCode
      10ms  1.01% 89.90%       10ms  1.01%  bytes.(*Reader).ReadByte
      10ms  1.01% 90.91%       30ms  3.03%  compress/flate.(*dictDecoder).tryWriteCopy (inline)
      10ms  1.01% 91.92%       10ms  1.01%  compress/flate.(*dictDecoder).writeCopy
      10ms  1.01% 92.93%       10ms  1.01%  compress/flate.byFreq.Less
      10ms  1.01% 93.94%       10ms  1.01%  compress/flate.byFreq.Swap
      10ms  1.01% 94.95%       10ms  1.01%  compress/flate.hash (inline)
      10ms  1.01% 95.96%       10ms  1.01%  compress/flate.reverseBits (inline)
      10ms  1.01% 96.97%       10ms  1.01%  crypto/sha1.blockAVX2
      10ms  1.01% 97.98%      160ms 16.16%  hash/adler32.(*digest).Write (inline)
      10ms  1.01% 98.99%       10ms  1.01%  runtime.procyield
      10ms  1.01%   100%       20ms  2.02%  sort.insertionSort
         0     0%   100%       10ms  1.01%  bytes.(*Buffer).Grow (inline)
         0     0%   100%      230ms 23.23%  bytes.(*Buffer).ReadFrom
         0     0%   100%       30ms  3.03%  bytes.(*Buffer).grow
         0     0%   100%       30ms  3.03%  bytes.growSlice
         0     0%   100%      290ms 29.29%  compress/flate.(*Writer).Write (inline)
         0     0%   100%       30ms  3.03%  compress/flate.(*byFreq).sort (inline)
         0     0%   100%      280ms 28.28%  compress/flate.(*compressor).encSpeed
         0     0%   100%       10ms  1.01%  compress/flate.(*compressor).fillStore
         0     0%   100%      290ms 29.29%  compress/flate.(*compressor).write
         0     0%   100%      120ms 12.12%  compress/flate.(*decompressor).Read
         0     0%   100%       10ms  1.01%  compress/flate.(*decompressor).huffSym (inline)
         0     0%   100%       50ms  5.05%  compress/flate.(*decompressor).huffmanBlock
         0     0%   100%       90ms  9.09%  compress/flate.(*huffmanBitWriter).indexTokens
         0     0%   100%      110ms 11.11%  compress/flate.(*huffmanBitWriter).writeBlockDynamic
         0     0%   100%       20ms  2.02%  compress/flate.(*huffmanBitWriter).writeTokens (inline)
         0     0%   100%       10ms  1.01%  compress/flate.(*huffmanEncoder).assignEncodingAndSize
         0     0%   100%       90ms  9.09%  compress/flate.(*huffmanEncoder).generate
         0     0%   100%       40ms  4.04%  compress/flate.NewWriter (inline)
         0     0%   100%       40ms  4.04%  compress/flate.NewWriterDict
         0     0%   100%      400ms 40.40%  compress/zlib.(*Writer).Write
         0     0%   100%       40ms  4.04%  compress/zlib.(*Writer).writeHeader
         0     0%   100%      210ms 21.21%  compress/zlib.(*reader).Read
         0     0%   100%       10ms  1.01%  crypto/sha1.(*digest).Write
         0     0%   100%       10ms  1.01%  crypto/sha1.block
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).Close
         0     0%   100%      970ms 97.98%  github.com/go-pdf/fpdf.(*Fpdf).ImageOptions
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).Output
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).OutputFileAndClose
         0     0%   100%      970ms 97.98%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptions
         0     0%   100%      970ms 97.98%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptionsReader
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).enddoc
         0     0%   100%      960ms 96.97%  github.com/go-pdf/fpdf.(*Fpdf).parsepng
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).putpages
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*idEncoder).bytes (inline)
         0     0%   100%      410ms 41.41%  github.com/go-pdf/fpdf.(*xmempool).compress
         0     0%   100%      230ms 23.23%  github.com/go-pdf/fpdf.(*xmempool).uncompress
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.generateImageID
         0     0%   100%      980ms 98.99%  main.main
         0     0%   100%       10ms  1.01%  runtime.gcBgMarkWorker
         0     0%   100%       10ms  1.01%  runtime.gcBgMarkWorker.func2
         0     0%   100%       10ms  1.01%  runtime.gcDrain
         0     0%   100%       10ms  1.01%  runtime.gcDrainMarkWorkerDedicated (inline)
         0     0%   100%      980ms 98.99%  runtime.main
         0     0%   100%       10ms  1.01%  runtime.makeslice
         0     0%   100%       50ms  5.05%  runtime.mallocgc
         0     0%   100%       10ms  1.01%  runtime.markroot
         0     0%   100%       10ms  1.01%  runtime.markroot.func1
         0     0%   100%       10ms  1.01%  runtime.memclrNoHeapPointersChunked
         0     0%   100%       40ms  4.04%  runtime.newobject
         0     0%   100%       10ms  1.01%  runtime.suspendG
         0     0%   100%       10ms  1.01%  runtime.systemstack
         0     0%   100%       30ms  3.03%  sort.Sort
         0     0%   100%       10ms  1.01%  sort.partition
         0     0%   100%       30ms  3.03%  sort.pdqsort

Here are the test images (attached): hello-mars.png and hello-world.png.

@ajstarks
Author

I note that JPEG processing is about 5x faster than PNG (measured by converting the PNGs to JPEG via convert(1)).
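For opaque images, that workaround can live in the Go program itself: decode the PNG once with the standard library, re-encode it as JPEG in memory, and register the JPEG bytes. A sketch; registerAsJPEG and the quality setting are illustrative choices, and note that JPEG drops any alpha channel:

package main

import (
	"bytes"
	"image/jpeg"
	"image/png"
	"os"

	"github.com/go-pdf/fpdf"
)

// registerAsJPEG re-encodes a PNG file as JPEG in memory and hands
// the JPEG bytes to fpdf, sidestepping parsepngstream entirely.
func registerAsJPEG(pdf *fpdf.Fpdf, path, name string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	img, err := png.Decode(f)
	if err != nil {
		return err
	}
	var buf bytes.Buffer
	if err := jpeg.Encode(&buf, img, &jpeg.Options{Quality: 90}); err != nil {
		return err
	}
	pdf.RegisterImageOptionsReader(name, fpdf.ImageOptions{ImageType: "jpg"}, &buf)
	return nil
}

func main() {
	pdf := fpdf.New("L", "pt", "A4", "")
	pdf.AddPage()
	if err := registerAsJPEG(pdf, "hello-world.png", "hw-jpg"); err != nil {
		panic(err)
	}
	pdf.ImageOptions("hw-jpg", 0, 0, 400, 300, false, fpdf.ImageOptions{}, 0, "")
	_ = pdf.OutputFileAndClose("jpg-test.pdf")
}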

@ajstarks
Author

ajstarks commented May 26, 2024

It looks like the hot spots are these two nested loops, where the .u8 method is called byte-by-byte on the image data before it's compressed.

fpdf/png.go

Lines 190 to 200 in 504c6dd

for i := 0; i < height; i++ {
	pos = (1 + length) * i
	color.u8(data[pos])
	alpha.u8(data[pos])
	elPos = pos + 1
	for k := 0; k < width; k++ {
		color.u8(data[elPos])
		alpha.u8(data[elPos+1])
		elPos += 2
	}
}

and

fpdf/png.go

Lines 210 to 223 in 504c6dd

for i := 0; i < height; i++ {
	pos = (1 + length) * i
	color.u8(data[pos])
	alpha.u8(data[pos])
	elPos = pos + 1
	for k := 0; k < width; k++ {
		tmp := data[elPos : elPos+4]
		color.u8(tmp[0])
		color.u8(tmp[1])
		color.u8(tmp[2])
		alpha.u8(tmp[3])
		elPos += 4
	}
}
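One possible shape for a fix, sketched for the second (RGBA) loop: deinterleave each row into staging slices, then flush each slice with one bulk call per row, so buffer bookkeeping happens per row rather than per byte. Note that color.write and alpha.write below are hypothetical bulk-append methods (wbuffer currently exposes only u8), so this is a direction, not a drop-in patch:

// hypothetical batched variant of the RGBA loop above
colorRow := make([]byte, 3*width)
alphaRow := make([]byte, width)
for i := 0; i < height; i++ {
	pos := (1 + length) * i
	color.u8(data[pos]) // row filter byte, kept as in the original
	alpha.u8(data[pos])
	elPos := pos + 1
	for k := 0; k < width; k++ {
		tmp := data[elPos : elPos+4]
		copy(colorRow[3*k:], tmp[:3]) // RGB into the color plane
		alphaRow[k] = tmp[3]          // A into the alpha plane
		elPos += 4
	}
	color.write(colorRow) // hypothetical: one bulk append per row
	alpha.write(alphaRow)
}

The gray+alpha loop above would reshape the same way with a 2-byte pixel stride.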
