Improve math character extraction #2009

MartinThoma · 2023-07-24T11:07:31Z

Explanation

Extracting math content is super hard and at the moment completely out of reach (e.g. fractions, subscripts / superscripts, curly braces for two cases, roots). However, maybe we can improve the extraction a little bit by supporting some heavily used single characters:

hbar: https://www.compart.com/de/unicode/U+0127
integral: https://www.compart.com/de/unicode/U+222B
phi
delta
alpha
beta
gamma
partial derivative: https://www.compart.com/de/unicode/U+2202
times
cdot

Expectations and current state

File	Expected	pypdf	PyMuPDF	PDFium	Tika	Copy-paste from Evince
cdot.pdf	·		·	·	·	·
hbar.pdf	ħ	~	ℏ	~	~	~
integral.pdf	∫	R	�	R	∫	R
partial-derivative.pdf	∂	@	∂	∂	∂	∂
phi.pdf	φ		φ	φ	φ	φ
varphi.pdf	φ	'	ϕ	ϕ	ϕ	φ

Generated via:

from pypdf import PdfReader
import fitz as PyMuPDF
import pypdfium2 as pdfium
import tika
from tika import parser  # pip install tika

tika.initVM()

def pymupdf_get_text(path) -> str:
    with PyMuPDF.open(path) as doc:
        text = ""
        for page in doc:
            text += page.get_text() + "\n"
    return text

def pdfium_get_text(path: str) -> str:
    # PdfDocument accepts a file path; the loop below passes file names.
    text = ""
    pdf = pdfium.PdfDocument(path)
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        text += textpage.get_text_range() + "\n"
    return text

expected = {
    "integral.pdf": "∫",
    "cdot.pdf": "·",
    "phi.pdf": "φ",
    "varphi.pdf": "φ",
    "partial-derivative.pdf": "∂",
    "hbar.pdf": "ħ",
}

# Print header
file = "File"
expected_str = "Expected"
c_pypdf = "pypdf"
c_pymupdf = "PyMuPDF"
c_pdfium = "PDFium"
c_tika = "Tika"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")
file = "-" * 25
expected_str = "-------"
c_pypdf = "-----"
c_pymupdf = "-------"
c_pdfium = "------"
c_tika = "----"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

# Print data
for file in sorted(expected.keys()):
    expected_str = expected.get(file, "unknown")
    c_pypdf = PdfReader(file).pages[0].extract_text().strip()
    c_pymupdf = pymupdf_get_text(file).strip()
    c_pdfium = pdfium_get_text(file).strip()
    c_tika = parser.from_file(file)["content"].strip()
    print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

with those files:

Proof that it's relevant

MartinThoma · 2023-07-24T11:12:39Z

I guess this is about supporting the "math" part of https://fontinfo.opensuse.org/fonts/msbm10Medium.html .

pubpub-zz · 2023-07-24T11:33:33Z

@MartinThoma
Can you also add the output of copy/paste from Acrobat Reader?

MartinThoma · 2023-07-24T11:50:29Z

I don't have Acrobat Reader, but I added a column for Evince. The result is the same with the Google Chrome PDF viewer.

pubpub-zz · 2023-07-25T11:10:36Z

I've reviewed your files and they do not include the ToUnicode cmap that pypdf relies on to translate character codes into text.
The other programs seem to use the /CharSet entry within the /FontDescriptor. Your examples are very limited, as each file defines only one character (not even the space character).
To work out a strategy, can you produce a new example with all the characters on the same page, separated by spaces and with some other standard text around them?
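
For readers who want to check a file themselves, here is a minimal sketch of the inspection described above. The helper names `summarize_font` and `fonts_on_page` are made up for this illustration and are not part of pypdf. The sketch only assumes that a font dictionary behaves like a normal mapping, which is the case for the objects pypdf returns.

```python
from typing import Any, Dict, Mapping


def summarize_font(font: Mapping[str, Any]) -> Dict[str, Any]:
    """Report the entries of a PDF font dictionary that matter for text extraction."""
    descriptor = font["/FontDescriptor"] if "/FontDescriptor" in font else {}
    return {
        "base_font": font["/BaseFont"] if "/BaseFont" in font else None,
        # Without /ToUnicode, character codes cannot be mapped to Unicode directly.
        "has_to_unicode": "/ToUnicode" in font,
        # /CharSet (if present) lists the glyph names contained in the font subset.
        "charset": descriptor["/CharSet"] if "/CharSet" in descriptor else None,
    }


def fonts_on_page(path: str, page_index: int = 0) -> Dict[str, Dict[str, Any]]:
    """Summarize every font listed in the /Resources of one page (hypothetical helper)."""
    from pypdf import PdfReader  # imported lazily so summarize_font works without pypdf

    page = PdfReader(path).pages[page_index]
    fonts = page["/Resources"]["/Font"]
    return {name: summarize_font(fonts[name]) for name in fonts}
```

Calling `fonts_on_page("hbar.pdf")` on one of the test files should show whether `/ToUnicode` is present and which glyph names `/CharSet` lists; the exact values depend on the file.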

MartinThoma · 2023-07-25T16:14:53Z

Sure: math-in-text-created-via-latex.pdf

MartinThoma · 2023-07-25T16:16:54Z

[pdfs generated via pdflatex] do not include the ToUnicode cmap

Interesting. Maybe I can talk with the pdflatex developers to change that 🤔

However, there are tons of documents out there which are generated by pdflatex. That approach would not help for existing documents.

MartinThoma · 2023-07-25T16:20:09Z

https://tug.org/pipermail/pdftex/2001-February/000385.html : it seems they became aware of the issue in 2001:

one should embed ToUnicode CMap with
the font (PDF 1.3 Reference 5.9 sec.ed.). This CMap relates codes
of font's glyphs to Unicode codes, it can easily be made when such
relations are known.

But there were issues:

This can be done by defining pdf font resources but hooking it into your
font system could be non trivial

\pdffontattr is mentioned (from TeX)
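
To make the quoted mail more concrete: a ToUnicode CMap is a small text stream that lists character codes next to their UTF-16BE values. The following is a rough sketch of generating one for a one-byte font. The function name `build_to_unicode_cmap` is hypothetical, and the example mapping (code 0x40 to ∂) is an assumption that matches the `@` pypdf printed for partial-derivative.pdf in the table above. It only produces the CMap text; attaching it to a font (for example via \pdffontattr) is not covered.

```python
from typing import Mapping


def build_to_unicode_cmap(code_to_char: Mapping[int, str]) -> str:
    """Return the text of a ToUnicode CMap for a simple (one-byte) font."""
    if not code_to_char:
        raise ValueError("at least one mapping is required")
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    items = sorted(code_to_char.items())
    # A single bfchar section may hold at most 100 entries, so split if needed.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, char in chunk:
            if not 0 <= code <= 0xFF:
                raise ValueError(f"code {code} does not fit in one byte")
            # Destination values are written as UTF-16BE hex digits.
            target = char.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{target}>")
        lines.append("endbfchar")
    lines += ["endcmap", "CMapName currentdict /CMap defineresource pop", "end", "end"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_to_unicode_cmap({0x40: "∂"}))
```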

MartinThoma · 2023-07-25T16:26:57Z

In https://github.com/mozilla/pdf.js/pull/16735/files it was added to a glyph list which is used in the cmap lookup: https://github.com/mozilla/pdf.js/blob/86165a7ba6d843f3520aa697933d14e6607e1394/src/core/font_renderer.js#L562C70-L562C75 :

            cmap = lookupCmap(
              font.cmap,
              String.fromCharCode(font.glyphNameMap[StandardEncoding[bchar]])
            );

Can we just add it to the following?

charset_encoding: Dict[str, List[str]] = {
    "/StandardCoding": _std_encoding,
    "/WinAnsiEncoding": _win_encoding,
    "/MacRomanEncoding": _mac_encoding,
    "/PDFDocEncoding": _pdfdoc_encoding,
    "/Symbol": _symbol_encoding,
    "/ZapfDingbats": _zapfding_encoding,
}
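
One way to read the question above is as a per-font lookup table that is consulted when no ToUnicode cmap exists. The sketch below is only an illustration of that idea, not how pypdf or #2016 implements it. The names `TEX_MATH_OVERRIDES`, `normalize_font_name` and `decode` are hypothetical. The slot numbers are assumptions based on the usual TeX math font layouts; the entries for `@`, `R` and `~` agree with the pypdf column of the table above, and the Unicode targets follow the `expected` dict. phi and varphi are left out because their Unicode assignment is discussed further down in the thread.

```python
from typing import Dict

# Hypothetical override tables: font name -> {character code -> Unicode text}.
TEX_MATH_OVERRIDES: Dict[str, Dict[int, str]] = {
    "cmmi10": {0x0B: "α", 0x0C: "β", 0x0D: "γ", 0x40: "∂"},  # math italic
    "cmsy10": {0x01: "·", 0x02: "×"},  # math symbols
    "cmex10": {0x52: "∫"},  # math extension
    "msbm10": {0x7E: "ħ"},  # AMS symbols
}


def normalize_font_name(base_font: str) -> str:
    """Turn '/ABCDEF+CMMI10' into 'cmmi10' (drops the slash and the subset prefix)."""
    name = base_font.lstrip("/")
    if "+" in name:
        name = name.split("+", 1)[1]
    return name.lower()


def decode(raw: bytes, base_font: str) -> str:
    """Decode one-byte character codes, falling back to the code point itself."""
    overrides = TEX_MATH_OVERRIDES.get(normalize_font_name(base_font), {})
    return "".join(overrides.get(code, chr(code)) for code in raw)


if __name__ == "__main__":
    print(decode(b"@", "/ABCDEF+CMMI10"))
    print(decode(b"~", "/ABCDEF+MSBM10"))
```

Keying the table on the font name rather than on /Encoding is a design choice for this sketch: the test files from this issue carry no usable encoding entry, so the font name is the only hint left.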

pubpub-zz · 2023-07-25T17:14:54Z

thanks,
this gives me some ideas.

pubpub-zz · 2023-07-25T22:18:14Z

Sure: math-in-text-created-via-latex.pdf

this is my output:

Theαis followed by βandγ. The greek alphabet also has φ. A
variation is ϕ.
In Physics, ℏis used. They also like integrals, which are denoted by∫
as well as partial derivatives denoted by ∂.
3·4 = 12
I also see people write 3 ×4 = 12. Seems to be an American notation.

The two phi variants are swapped, which seems to be a bug in LaTeX (mathjax/MathJax#353 (comment))

from acrobat:

as text:
The � is followed by and
. The greek alphabet also has �. A
variation is '.
R In Physics, ~ is used. They also like integrals, which are denoted by
as well as partial derivatives denoted by @.
3 � 4 = 12
I also see people write 3�4 = 12. Seems to be an American notation.

we are doing better 😁🎇🎇

MartinThoma added the workflow-text-extraction label Jul 24, 2023

MartinThoma self-assigned this Jul 24, 2023

Snuffleupagus mentioned this issue Jul 24, 2023

Fix copying of the reduced Planck constant mozilla/pdf.js#16735

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 25, 2023

ENH : extract latex characters

bdfaa49

closes py-pdf#2009 note: code clean up removed duplicates from adobe_glyphs

pubpub-zz mentioned this issue Jul 25, 2023

ENH: Extract LaTeX characters #2016

Merged

MartinThoma closed this as completed in #2016 Jul 26, 2023

MartinThoma pushed a commit that referenced this issue Jul 26, 2023

ENH: Extract LaTeX characters (#2016)

a327df6

note: code clean up removed duplicates from adobe_glyphs Closes #2009
