Support rotated pages with extraction_mode="layout" #3270
Comments
Thanks for the report. We already have the I think we should evaluate doing some more or less breaking changes for the text extraction here, maybe together with refactoring the plain mode as well (see #3010). What I have in mind:
I'm super happy to help with a refactor, both for this and for #3010. What would this change look like with respect to versioning? I wouldn't want to unnecessarily force pypdf 6.
I have found a good-enough working solution to my immediate need via
We have a deprecation process (see developer docs) and tend to issue a new major release once a year when dropping an old Python version. I am fine with just tackling the easy way in a PR and moving further refactoring to a dedicated issue - I will take care of this accordingly.
Coming in a bit late, but... @hackowitz-af, I like where your head's at. A provision to just do the rotation for you when all of the text on a page is rotated would be nice to have, but as you've already gathered, the biggest issue here is semantics. The The I'd be happy to provide input if you wanted to try and identify a 'dominant rotation' and perform the extraction w.r.t. that orientation by default. Should come in handy for mixed PDFs that throw a rotated landscape page in the middle of 100 pages of vanilla portrait...
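One possible reading of the 'dominant rotation' idea, as a rough sketch only: snap each text run's rotation to the nearest multiple of 90 degrees and pick the angle that carries the most characters. The `(a, b, text)` tuples and both function names are invented for illustration and are not pypdf API; the sketch also assumes each text matrix is a plain rotation plus scale.

```python
import math
from collections import Counter


def text_rotation(a: float, b: float) -> int:
    """Snap the rotation encoded in a text matrix [a b c d e f] to 0/90/180/270.

    Assumption: the matrix is a pure rotation plus scale, so the
    counterclockwise angle in degrees is atan2(b, a).
    """
    angle = math.degrees(math.atan2(b, a))
    return round(angle / 90) % 4 * 90


def dominant_rotation(fragments) -> int:
    """Return the rotation that carries the most non-blank characters.

    `fragments` is a hypothetical iterable of (a, b, text) tuples,
    one per text run on the page. An empty page counts as unrotated.
    """
    weight = Counter()
    for a, b, text in fragments:
        weight[text_rotation(a, b)] += len(text.strip())
    if not weight:
        return 0
    return weight.most_common(1)[0][0]


# A short upright header plus a longer body rotated by 90 degrees.
page = [
    (1.0, 0.0, "Header"),
    (0.0, 1.0, "Body text rotated ninety degrees"),
    (0.0, 1.0, "More body text"),
]
print(dominant_rotation(page))
```

Weighting by character count rather than by number of runs keeps a handful of short upright headers or footers from outvoting a fully rotated body.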
Explanation
When extracting text from rotated pages, the current options limit useful extraction in layout mode.
- With `strip_rotated=True`, a warning is issued and there is no output.
- With `strip_rotated=False`, a warning is issued and the output is garbled.

I propose to add an optional `orientation: {"infer", 0, 90, 180, 270} = "infer"` to `PageObject.extract_text`. `infer` could either use the `page['/Rotate']` or use the actual rotation of the text. The names `orientation`, `layout_mode_orientation`, `rotation`, etc. are all the same to me.

I think it's best to add a keyword argument rather than to implicitly use the `page['/Rotate']`, so one could extract different groups of rotated text from the same page. For example, a page header/footer has 0 rotation, but the page content is rotated 90 degrees. There is value in being able to extract each.

rotated-page.pdf
Code Example
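A minimal sketch of how the proposed keyword could behave, not a reproduction against pypdf itself. `Fragment`, `resolve_orientation`, and `extract_oriented` are hypothetical names; `"infer"` is resolved from the page's `/Rotate` value here, which is only one of the two options floated in the proposal, and the sign convention between `/Rotate` and text rotation is glossed over.

```python
from typing import NamedTuple

VALID_ORIENTATIONS = (0, 90, 180, 270)


class Fragment(NamedTuple):
    """Hypothetical stand-in for one run of text and its rotation in degrees."""
    text: str
    rotation: int


def resolve_orientation(orientation, page_rotate: int = 0) -> int:
    """Turn the proposed keyword value into a concrete angle.

    Assumption: "infer" falls back to the page's /Rotate entry.
    """
    if orientation == "infer":
        return page_rotate % 360
    if orientation not in VALID_ORIENTATIONS:
        raise ValueError(f"orientation must be 'infer' or one of {VALID_ORIENTATIONS}")
    return orientation


def extract_oriented(fragments, orientation="infer", page_rotate: int = 0) -> str:
    """Keep only the text runs whose rotation matches the requested one."""
    target = resolve_orientation(orientation, page_rotate)
    return "\n".join(f.text for f in fragments if f.rotation == target)


# The header/footer case from the proposal: upright header, rotated body.
page = [
    Fragment("Header", 0),
    Fragment("Body line 1", 90),
    Fragment("Body line 2", 90),
]
print(extract_oriented(page, 0))
print(extract_oriented(page, 90))
```

Passing the angle explicitly is what lets the header and the body be pulled out in two separate calls, which an implicit `/Rotate` lookup alone would not allow.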