Improve math character extraction #2009

Closed
MartinThoma opened this issue Jul 24, 2023 · 10 comments · Fixed by #2016

@MartinThoma
Member

MartinThoma commented Jul 24, 2023

Explanation

Extracting math content is super hard and at the moment completely out of reach (e.g. fractions, subscripts / superscripts, curly braces for two cases, roots). However, maybe we can improve the extraction a little bit by supporting some heavily used single characters:

Expectations and current state

File Expected pypdf PyMuPDF PDFium Tika Copy-paste from Evince
cdot.pdf · · · · ·
hbar.pdf ħ ~ ~ ~ ~
integral.pdf R R R
partial-derivative.pdf @
phi.pdf φ φ φ φ φ
varphi.pdf φ ' ϕ ϕ ϕ φ

Generated via:

from pypdf import PdfReader
import fitz as PyMuPDF
import pypdfium2 as pdfium
import tika
from tika import parser  # pip install tika

tika.initVM()

def pymupdf_get_text(path) -> str:
    with PyMuPDF.open(path) as doc:
        text = ""
        for page in doc:
            text += page.get_text() + "\n"
    return text

def pdfium_get_text(path: str) -> str:
    # Called with a file path below, so the parameter is a path, not raw bytes.
    text = ""
    pdf = pdfium.PdfDocument(path)
    for i in range(len(pdf)):
        page = pdf.get_page(i)
        textpage = page.get_textpage()
        text += textpage.get_text_range() + "\n"
    return text

expected = {
    "integral.pdf": "∫",
    "cdot.pdf": "·",
    "phi.pdf": "φ",
    "varphi.pdf": "φ",
    "partial-derivative.pdf": "∂",
    "hbar.pdf": "ħ",
}

# Print header
file = "File"
expected_str = "Expected"
c_pypdf = "pypdf"
c_pymupdf = "PyMuPDF"
c_pdfium = "PDFium"
c_tika = "Tika"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")
file = "-" * 25
expected_str = "-------"
c_pypdf = "-----"
c_pymupdf = "-------"
c_pdfium = "------"
c_tika = "----"
print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

# Print data
for file in sorted(expected.keys()):
    expected_str = expected.get(file, "unknown")
    c_pypdf = PdfReader(file).pages[0].extract_text().strip()
    c_pymupdf = pymupdf_get_text(file).strip()
    c_pdfium = pdfium_get_text(file).strip()
    c_tika = parser.from_file(file)["content"].strip()
    print(f"{file:<25}|{expected_str:<9}|{c_pypdf:<5}|{c_pymupdf:<7}|{c_pdfium:<6}|{c_tika:<5}")

with those files:

Proof that it's relevant

@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jul 24, 2023
@MartinThoma MartinThoma self-assigned this Jul 24, 2023
@MartinThoma
Member Author

I guess this is about supporting the "math" part of https://fontinfo.opensuse.org/fonts/msbm10Medium.html .

@pubpub-zz
Collaborator

@MartinThoma
Can you also add the output of copy/paste from Acrobat Reader?

@MartinThoma
Member Author

I don't have Acrobat Reader, but I added a column for Evince. It's the same for the Google Chrome PDF viewer.

@pubpub-zz
Collaborator

I've reviewed your files: they do not include the ToUnicode CMap that pypdf relies on to translate character codes.
The other programs seem to use the /CharSet entry within the /FontDescriptor. Your examples are very limited, as each one defines only a single character (not even the space character).
To work out a strategy, can you produce a new example with all the characters on the same page, separated by spaces and with some other standard text around them?
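
To check a file for the entries mentioned above, something like the following could be used. It is only a minimal sketch: `summarize_font` and `fonts_on_first_page` are hypothetical helper names, not part of pypdf, and the second function assumes the first page carries its own /Resources dictionary with a /Font entry.

```python
from typing import Any, Dict, Mapping


def summarize_font(font: Mapping[str, Any]) -> Dict[str, Any]:
    """Report which Unicode-related entries a PDF font dictionary carries."""
    descriptor = font.get("/FontDescriptor") or {}
    # pypdf may hand back an indirect reference here; resolve it if possible.
    if hasattr(descriptor, "get_object"):
        descriptor = descriptor.get_object()
    return {
        "base_font": font.get("/BaseFont"),
        "has_to_unicode": "/ToUnicode" in font,
        "encoding": font.get("/Encoding"),
        "charset": descriptor.get("/CharSet"),
    }


def fonts_on_first_page(path: str) -> Dict[str, Dict[str, Any]]:
    """Summarize every font referenced by the first page of a PDF."""
    # Imported lazily so that summarize_font can be tried without pypdf installed.
    from pypdf import PdfReader

    page = PdfReader(path).pages[0]
    # Assumption: the page has its own /Resources entry with a /Font dictionary.
    fonts = page["/Resources"]["/Font"]
    return {name: summarize_font(ref.get_object()) for name, ref in fonts.items()}


# Made-up font dictionary, only to show the shape of the result.
example = {"/BaseFont": "/CMMI10", "/FontDescriptor": {"/CharSet": "/partialdiff"}}
print(summarize_font(example))
```

A font for which `has_to_unicode` is false but `charset` is set would match the situation described in this comment.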

@MartinThoma
Member Author

[pdfs generated via pdflatex] do not include the ToUnicode cmap

Interesting. Maybe I can talk with the pdflatex developers to change that 🤔

However, there are tons of documents out there which are generated by pdflatex. That approach would not help for existing documents.

@MartinThoma
Member Author

https://tug.org/pipermail/pdftex/2001-February/000385.html : it seems they became aware of the issue in 2001:

one should embed ToUnicode CMap with
the font (PDF 1.3 Reference 5.9 sec.ed.). This CMap relates codes
of font's glyphs to Unicode codes, it can easily be made when such
relations are known.

But there were issues:

This can be done by defining pdf font resources but hooking it into your
font system could be non trivial

\pdffontattr is mentioned (from TeX)
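
To make the quoted idea concrete, here is a rough sketch of what a ToUnicode CMap provides to an extractor: a table from character codes to Unicode strings. The CMap text is hand-written for illustration and far smaller than a real one, `parse_bfchar` and `decode` are hypothetical helpers, and `bfrange` sections are ignored. The codes 0x40 and 0x52 were picked because "@" and "R" are what the other tools print in the table above; that these are the fonts' raw character codes is an assumption on my side.

```python
import re
from typing import Dict

# Hand-written, minimal bfchar section (illustration only).
SAMPLE_CMAP = """
2 beginbfchar
<40> <2202>
<52> <222B>
endbfchar
"""

_BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text: str) -> Dict[int, str]:
    """Collect code -> Unicode string pairs from all bfchar sections."""
    mapping: Dict[int, str] = {}
    for section in _BFCHAR_SECTION.findall(cmap_text):
        for src, dst in _PAIR.findall(section):
            # The destination is UTF-16BE, so it may hold more than one code unit.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes: bytes, mapping: Dict[int, str]) -> str:
    """Translate one-byte character codes; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(SAMPLE_CMAP)
print(decode(b"@R", mapping))
```

Without such a table, a tool can only fall back to other hints in the font or to the raw code itself.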

@MartinThoma
Member Author

MartinThoma commented Jul 25, 2023

In https://github.com/mozilla/pdf.js/pull/16735/files it was added to a glyph list which is used in the cmap lookup: https://github.com/mozilla/pdf.js/blob/86165a7ba6d843f3520aa697933d14e6607e1394/src/core/font_renderer.js#L562C70-L562C75 :

            cmap = lookupCmap(
              font.cmap,
              String.fromCharCode(font.glyphNameMap[StandardEncoding[bchar]])
            );

Can we just add it to the following?

charset_encoding: Dict[str, List[str]] = {
    "/StandardCoding": _std_encoding,
    "/WinAnsiEncoding": _win_encoding,
    "/MacRomanEncoding": _mac_encoding,
    "/PDFDocEncoding": _pdfdoc_encoding,
    "/Symbol": _symbol_encoding,
    "/ZapfDingbats": _zapfding_encoding,
}
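
For context, a glyph-name lookup of this kind could look roughly like the sketch below. It is not pypdf's actual implementation: `GLYPH_TO_UNICODE` is a tiny excerpt of pairs from the Adobe Glyph List, `apply_differences` is a hypothetical helper, and the /Differences array is made up. Whether the fonts in the test files really use these glyph names is an assumption.

```python
from typing import Dict, List, Union

# Tiny excerpt of glyph-name -> Unicode pairs from the Adobe Glyph List.
GLYPH_TO_UNICODE: Dict[str, str] = {
    "/alpha": "\u03b1",
    "/integral": "\u222b",
    "/partialdiff": "\u2202",
    "/periodcentered": "\u00b7",
    "/phi": "\u03c6",
    "/phi1": "\u03d5",
}


def apply_differences(differences: List[Union[int, str]]) -> Dict[int, str]:
    """Expand a /Differences array into a code -> character table.

    A number sets the next character code; each glyph name that follows
    is assigned to consecutive codes starting from that number.
    """
    code_to_char: Dict[int, str] = {}
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item
            continue
        if item in GLYPH_TO_UNICODE:
            code_to_char[code] = GLYPH_TO_UNICODE[item]
        # Unknown glyph names still occupy a code, so always advance.
        code += 1
    return code_to_char


# Made-up /Differences array, for illustration only.
table = apply_differences([30, "/phi", 39, "/phi1", 64, "/partialdiff"])
print(table)
```

Any glyph name missing from the lookup table simply yields no character, which is why extending the glyph list can recover symbols without a ToUnicode CMap.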

@pubpub-zz
Collaborator

thanks,
this gives me some ideas.

@pubpub-zz
Collaborator

pubpub-zz commented Jul 25, 2023

Sure: math-in-text-created-via-latex.pdf

this is my output:

Theαis followed by βandγ. The greek alphabet also has φ. A
variation is ϕ.
In Physics, ℏis used. They also like integrals, which are denoted by∫
as well as partial derivatives denoted by ∂.
3·4 = 12
I also see people write 3 ×4 = 12. Seems to be an American notation.

There is an inversion in the phi which seems to be a bug in latex (mathjax/MathJax#353 (comment))

from acrobat:
as text:
The � is followed by and
. The greek alphabet also has �. A
variation is '.
R In Physics, ~ is used. They also like integrals, which are denoted by
as well as partial derivatives denoted by @.
3 � 4 = 12
I also see people write 3�4 = 12. Seems to be an American notation.

we are doing better 😁🎇🎇

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 25, 2023
closes py-pdf#2009

note: code clean up removed duplicates from adobe_glyphs
MartinThoma pushed a commit that referenced this issue Jul 26, 2023
note: code clean up removed duplicates from adobe_glyphs

Closes #2009