Support rotated pages with extraction_mode="layout" #3270

hackowitz-af · 2025-04-30T16:52:06Z

Explanation

When extracting text from rotated pages, the current options limit useful extraction in layout mode.

If strip_rotated=True, a warning is issued and there is no output.
If strip_rotated=False, a warning is issued and the output is garbled.

I propose adding an optional parameter orientation: {"infer", 0, 90, 180, 270} = "infer" to PageObject.extract_text. "infer" could use either page['/Rotate'] or the actual rotation of the text. The names orientation, layout_mode_orientation, rotation, etc. are all the same to me.

I think it's best to add a keyword argument rather than implicitly use page['/Rotate'], so one could extract different groups of rotated text from the same page. For example, a page header/footer has 0 rotation, but the page content is rotated 90 degrees. There is value in being able to extract each.

rotated-page.pdf

Code Example

from pypdf import PdfReader
reader = PdfReader("./rotated-page.pdf")

# all to the same effect, for a 90-degree rotated page...
reader.pages[0].extract_text(extraction_mode="layout")
reader.pages[0].extract_text(extraction_mode="layout", orientation="infer")
reader.pages[0].extract_text(extraction_mode="layout", orientation=90)

# to collect different sections of a page, while preserving the layout of each.
header = reader.pages[0].extract_text(extraction_mode="layout", orientation=0)
body = reader.pages[0].extract_text(extraction_mode="layout", orientation=90)

stefan6419846 · 2025-04-30T19:05:56Z

Thanks for the report. We already have the orientations parameter for the default "plain" mode. Adding another parameter orientation solely for the layout mode with nearly the same name sounds confusing.

I think we should evaluate doing some more or less breaking changes for the text extraction here, maybe together with refactoring the plain mode as well (see #3010). What I have in mind:

Provide a new method for extracting the text in layout mode.
Deprecate the extraction_mode parameter in favor of the new method.
Clean up the parameters to have a clean interface without confusing users.
Get rid of the *args-specific code. I do not know if this has ever been useful.
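For reference, a minimal sketch of how the existing orientations parameter of the default "plain" mode can already separate text by angle. The helper name extract_by_orientation is hypothetical and not part of pypdf; it only assumes a page object whose extract_text accepts extraction_mode and orientations.

```python
from typing import Any


def extract_by_orientation(page: Any, angles=(0, 90, 180, 270)) -> dict[int, str]:
    """Plain-mode text per orientation, keyed by angle (hypothetical helper)."""
    # `orientations` is the existing plain-mode parameter mentioned above;
    # each call asks for text drawn at a single angle only.
    return {
        angle: page.extract_text(extraction_mode="plain", orientations=(angle,))
        for angle in angles
    }
```

With a real page, extract_by_orientation(reader.pages[0])[0] would hold the unrotated header/footer text and [90] the rotated body, although without the fixed-width layout that layout mode provides.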

hackowitz-af · 2025-04-30T19:18:49Z

I'm super happy to help with a refactor, both for this and for #3010. What would this change look like with respect to versioning? I wouldn't want to unnecessarily force pypdf 6.

hackowitz-af · 2025-04-30T19:29:20Z

I have found a good-enough working solution to my immediate need via page.transfer_rotation_to_content(). I will make a PR to add a test and will not change extract_text itself.
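A minimal sketch of that workaround, assuming the whole page is rotated through its /Rotate entry. The helper names extract_layout_upright and extract_file are hypothetical; rotation, transfer_rotation_to_content() and extract_text(extraction_mode="layout") are the existing page members being relied on.

```python
from typing import Any


def extract_layout_upright(page: Any) -> str:
    """Layout-mode text for a page that may carry a non-zero /Rotate (hypothetical helper)."""
    # Bake the /Rotate value into the content stream so that layout mode
    # sees upright text. Note that this mutates the in-memory page object.
    if page.rotation % 360 != 0:
        page.transfer_rotation_to_content()
    return page.extract_text(extraction_mode="layout")


def extract_file(path: str) -> list[str]:
    """Apply the helper to every page of a PDF file."""
    from pypdf import PdfReader  # imported lazily; requires pypdf to be installed

    reader = PdfReader(path)
    return [extract_layout_upright(page) for page in reader.pages]
```

This does not cover pages that mix orientations (e.g. an upright header on a rotated body); it only handles the case where the page is rotated wholesale.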

stefan6419846 · 2025-05-01T19:14:06Z

> I'm super happy to help with a refactor, both for this and for #3010. What would this change look like with respect to versioning? I wouldn't want to unnecessarily force pypdf 6.

We have a deprecation process (see developer docs) and tend to issue a new major release once a year when dropping an old Python version.

I am fine with just tackling the easy way in a PR and moving further refactoring to a dedicated issue - I will take care of this accordingly.

shartzog · 2025-05-20T04:37:57Z

Coming in a bit late, but...

@hackowitz-af, I like where your head's at. A provision to just do the rotation for you when all of the text on a page is rotated would be nice to have, but as you've already gathered, the biggest issue here is semantics. The strip_rotated parameter kinda implies that layout mode has at least some provisions for rotated text handling when in truth, it does not, at least not for a page that's rotated wholesale.

The strip_rotated parameter itself was a late addition to my original implementation, and its primary intent was to provide coverage for pages that contained text in multiple orientations, e.g. when everything is copacetic at 0 rotation except an annoying watermark (a la arXiv.org pdfs) or a clever typesetter's rotated titles on a PDF brochure, etc. Under those scenarios, a 'fixed width' algo becomes more or less impossible, so you're left to choose between ignoring the rotated text or junking up the output of the 'properly rotated' text.

I'd be happy to provide input if you wanted to try and identify a 'dominant rotation' and perform the extraction w.r.t. that orientation by default. Should come in handy for mixed PDFs that throw a rotated landscape page in the middle of 100 pages of vanilla portrait...
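One way the 'dominant rotation' idea could be sketched, purely as an illustration and not as pypdf's implementation: take the 6-element matrix (a, b, c, d, e, f) in effect for each text run, derive the run's angle from a and b, snap it to the nearest multiple of 90 degrees, and weight each angle by the amount of text drawn at it. The function name dominant_rotation and the (matrix, text) input shape are assumptions; how those pairs are collected from a page is left open.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def dominant_rotation(runs: Iterable[Tuple[Sequence[float], str]]) -> int:
    """Return the angle (0, 90, 180 or 270) carrying the most text (hypothetical helper)."""
    weights: Counter = Counter()
    for matrix, text in runs:
        a, b = matrix[0], matrix[1]
        # For a rotation by theta, a = cos(theta) and b = sin(theta).
        angle = math.degrees(math.atan2(b, a))
        # Snap to the nearest right angle and normalise into [0, 360).
        snapped = round(angle / 90) * 90 % 360
        weights[snapped] += len(text.strip())
    if not weights:
        return 0  # no text at all: treat the page as upright
    return weights.most_common(1)[0][0]
```

The returned angle is the counterclockwise rotation of the text in content space; mapping it onto the page's /Rotate convention would need care and is not attempted here.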

stefan6419846 added the workflow-text-extraction and is-feature labels Apr 30, 2025

hackowitz-af added a commit to hackowitz-af/pypdf that referenced this issue Apr 30, 2025

Demonstrate that py-pdf#3270 can be adressed using existing functiona…

70d9f52

…lity.

hackowitz-af mentioned this issue Apr 30, 2025

TST: Demonstrate that #3270 can be resolved using existing functionality. #3272

Merged

stefan6419846 closed this as completed in #3272 May 16, 2025

stefan6419846 closed this as completed in d59164b May 16, 2025
