lxml #3
Can you suggest a particular use case? I think it might be easier to think and talk about this in terms of a concrete objective the user is trying to achieve. |
Sure. Say that the user wants to find all of the runs with size 14 text in a document and italicize those runs as well. With access to the Docx's |
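As a rough illustration of that use case at the XML level, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml. The function name `italicize_runs_xml` and the sample XML are made up for illustration. The sketch only looks at an explicit `w:sz` on the run itself (sizes are stored in half-points, so 14 pt is `28`); sizes inherited from styles are not considered, and the schema's child ordering inside `w:rPr` is ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def italicize_runs_xml(root, points):
    """Add <w:i/> to every run whose explicit w:sz equals `points`.

    Hypothetical helper, not part of python-docx. Returns the number of
    matching runs.
    """
    half_points = str(points * 2)  # w:sz values are in half-points
    matched = 0
    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find("w:rPr", NS)
        if rpr is None:
            continue  # no run properties, so no explicit size
        sz = rpr.find("w:sz", NS)
        if sz is None or sz.get(f"{{{W}}}val") != half_points:
            continue
        if rpr.find("w:i", NS) is None:
            # Appended at the end; a careful version would respect
            # the schema's element order.
            ET.SubElement(rpr, f"{{{W}}}i")
        matched += 1
    return matched


SAMPLE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
  <w:r><w:rPr><w:sz w:val="28"/></w:rPr><w:t>big</w:t></w:r>
  <w:r><w:rPr><w:sz w:val="22"/></w:rPr><w:t>normal</w:t></w:r>
  <w:r><w:t>no explicit size</w:t></w:r>
</w:p></w:body></w:document>"""

root = ET.fromstring(SAMPLE)
count = italicize_runs_xml(root, 14)
```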
Ah, good, that helps focus things, thanks :) A couple notions, not completely coherent, and in no particular order:
Anyway, that's probably enough reflection for one sitting. How does all that strike you? |
That strikes me as a lot more words than I was expecting! From the standpoint of a user, my ideal interface would revolve around multipurpose objects that had both higher-level methods and properties like you’ve implemented with the Paragraph/Run/Text classes, as well as lxml-like functionality. For instance, it would be great if I had an instance of Paragraph
Implementing this could be tricky though. I don't think it would be a good idea to allow the user to interact with both python-docx wrapper objects as well as the etree objects themselves. There wouldn't really be any clean way to separate the two types of objects, and you'd end up with users trying to call para.add_run() on their etree._Element objects.
The best idea I have right now is to implement a base element wrapper class. Each instance of the class maps to a particular etree element and would have methods overriding all of the normal etree methods. Something like:
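A minimal sketch of what such a base wrapper might look like, again using the standard library's `xml.etree.ElementTree` as a stand-in for lxml. lxml provides `getparent()` natively; the standard library does not, so the sketch keeps a parent map. The names `ElementWrapper` and `from_root` are hypothetical and are not python-docx's API, and dispatching to subclasses per tag is left out.

```python
import xml.etree.ElementTree as ET


class ElementWrapper:
    """Hypothetical base class: each instance wraps one XML element."""

    def __init__(self, element, parent_map):
        self._element = element
        self._parent_map = parent_map  # child element -> parent element

    @classmethod
    def from_root(cls, root):
        # Build the child -> parent lookup once for the whole tree.
        parent_map = {child: parent for parent in root.iter() for child in parent}
        return cls(root, parent_map)

    def _wrap(self, element):
        # A fuller version would pick a subclass based on element.tag.
        return ElementWrapper(element, self._parent_map)

    @property
    def tag(self):
        return self._element.tag

    def getparent(self):
        """Return the wrapper around the parent element, or None at the root."""
        parent = self._parent_map.get(self._element)
        return None if parent is None else self._wrap(parent)

    def iter(self, tag=None):
        """Yield wrappers for this element and its descendants."""
        for element in self._element.iter(tag):
            yield self._wrap(element)

    def find(self, path):
        """Return a wrapper for the first match of `path`, or None."""
        found = self._element.find(path)
        return None if found is None else self._wrap(found)


root = ET.fromstring("<body><p><r><t>hi</t></r></p></body>")
doc = ElementWrapper.from_root(root)
text = doc.find("p/r/t")
print(text.getparent().tag)  # r
```

Because every method returns wrappers rather than raw elements, the user never ends up holding an `etree._Element` by accident, which is the separation problem raised above.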
The getparent() method would return the object wrapping the parent of the etree object. Table, Text, etc. would all be subclasses of
Are the revision markers that you're referring to the RSID attributes that seem to break up perfectly good runs for no good reason? Those definitely gave me trouble when I first started using python-docx and I was wondering why some of my search() and replace() functions weren't working. I ended up writing replacement functions that ignored runs and just searched through the entire paragraph text as a single string and haven't had many issues since. What sort of challenges are you running into there? |
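The run-splitting problem and the "search the whole paragraph as one string" workaround can be sketched as follows. This is an illustration with the standard library's `xml.etree.ElementTree` and a made-up sample paragraph, not the functions mentioned in the comment; `paragraph_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_text(p):
    """Join the text of every w:t in the paragraph, ignoring run boundaries."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


# One word split across two runs that differ only in their rsid attribute.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r w:rsidR="00A1"><w:t>Hello wo</w:t></w:r>'
    '<w:r w:rsidR="00B2"><w:t>rld</w:t></w:r>'
    "</w:p>"
)
p = ET.fromstring(SAMPLE)

# A per-run search misses the word; a paragraph-level search finds it.
found_in_a_run = any("world" in (t.text or "") for t in p.iter(f"{{{W}}}t"))
found_in_paragraph = "world" in paragraph_text(p)
```

Searching this way locates the text, but writing a replacement back still means deciding which run receives the new text and its formatting.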
Yeah, apologies for that. There are a lot of related topics here and we just needed to find a focus spot; seems like we're on the trail of one now :) So let's focus on the bit about iterating over the
I would be strongly inclined to not try to combine these two in a single object. Rather, you can access the back door if you need it, but then you're in an
On the question of revision markers, I don't have any direct experience of them being a problem. I actually don't work with .docx an awful lot, I do a lot more with .pptx :). But from a design standpoint, they initially presented as a challenge when trying to provide a |
I definitely see the value of keeping lxml functionality at least somewhat separate from the objects that the user will primarily be working through. My biggest concern is that moving between the two different levels depending on particular needs can get messy quickly. I think the simplest solution, and the one that you seem to be leaning towards, would be to provide access to the etree._Element object through an attribute of an object for exceptional cases, and trust that the user knows enough about what he/she is doing to not mess things up. If that's the case, the biggest issue will be adding functionality in the current API to ensure that the user will only need lxml for special cases. Looking over the italics example that I gave above, it looks like there is almost already a simple implementation available:
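A sketch of how that might look through the high-level API. It assumes the attribute names found in current python-docx releases (`Document.paragraphs`, `Paragraph.runs`, `Run.font.size`, `Run.italic`), which may not match what was available when this thread was written. To keep the block self-contained it does not import python-docx; it only relies on those attributes, and the stand-in objects built with `SimpleNamespace` exist purely for demonstration. The function name `italicize_runs_api` is hypothetical.

```python
from types import SimpleNamespace

EMU_PER_PT = 12700  # python-docx expresses lengths as integer EMU


def italicize_runs_api(document, points):
    """Italicize every run whose explicit font size equals `points`.

    Works on any object exposing .paragraphs -> .runs -> .font.size / .italic.
    Runs that inherit their size (font.size is None) are skipped.
    """
    target = points * EMU_PER_PT
    matched = 0
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            if run.font.size == target:
                run.italic = True
                matched += 1
    return matched


def _fake_run(size_pt):
    # Stand-in for a python-docx Run, for demonstration only.
    size = None if size_pt is None else size_pt * EMU_PER_PT
    return SimpleNamespace(font=SimpleNamespace(size=size), italic=None)


runs = [_fake_run(14), _fake_run(11), _fake_run(None)]
doc = SimpleNamespace(paragraphs=[SimpleNamespace(runs=runs)])
matched = italicize_runs_api(doc, 14)
```

With python-docx installed, the same function could presumably be called on a real document, e.g. `italicize_runs_api(docx.Document("in.docx"), 14)`, where `in.docx` is a placeholder path.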
Just writing that out, it does seem that adding functionality to the current API without requiring the user to dive down to lxml would be easier than I initially thought. A lot of stuff that lxml does is already available by default, and it shouldn't be particularly difficult to add the equivalent for
I'll admit that I don't see any special reason to treat modified/deleted text any differently. Correct me if I'm wrong, but from an XML standpoint, there are three different implementations (original, markup, final), as "original showing markup" and "final showing markup" are only GUI settings within Word itself. I think that the best default option would be to account for the markup in paragraphs by including deleted paragraphs in |
Oooh, now that's an idea. I like that a lot on first take. So the library could attend to just discovering all the bits that were in there (paragraphs and whatever), characterizing them as to revision status along the way, and then the developer could use that information in whatever way suited their use case. Then if it turned out to be handy, it would be easy to add an additional collection like 'Final after revisions' or something that provided a high-level access point to commonly needed subsets. Maybe the default document.paragraphs could be 'final', including inserts but not deletions, and then have a flag to include deletions. That way you wouldn't have to consider revision marks unless they were likely to be a factor in your particular use-case. I suppose that would mean paragraphs would need to be a method instead of a property, something like
I'm liking this line of thinking a lot. On the other bit, like swapping back and forth between
This iteration question is big though. Once we have that down there's all kinds of things that can come after and build on that. |
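The "final by default, with a flag to include deletions" idea could be sketched roughly as below, using the standard library's `xml.etree.ElementTree`. The function names and the `include_deleted` parameter are hypothetical, not python-docx's API. The sketch makes a simplifying assumption: a paragraph counts as deleted when its paragraph mark carries a `w:del` under `w:pPr/w:rPr`. That glosses over cases such as a deleted mark merging two paragraphs, and over moved content.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def iter_paragraphs(body, include_deleted=False):
    """Yield body-level paragraphs, skipping deleted ones unless asked.

    Assumption: a paragraph is 'deleted' if its mark has w:pPr/w:rPr/w:del.
    """
    for p in body.findall("w:p", NS):
        mark_deleted = p.find("w:pPr/w:rPr/w:del", NS) is not None
        if mark_deleted and not include_deleted:
            continue
        yield p


def final_text(p):
    """Text as it reads after revisions: w:t only, w:delText is ignored."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:r><w:t>kept</w:t></w:r></w:p>
  <w:p>
    <w:pPr><w:rPr><w:ins w:id="1" w:author="a"/></w:rPr></w:pPr>
    <w:ins w:id="2" w:author="a"><w:r><w:t>added</w:t></w:r></w:ins>
  </w:p>
  <w:p>
    <w:pPr><w:rPr><w:del w:id="3" w:author="a"/></w:rPr></w:pPr>
    <w:del w:id="4" w:author="a"><w:r><w:delText>removed</w:delText></w:r></w:del>
  </w:p>
</w:body>"""

body = ET.fromstring(SAMPLE)
final = [final_text(p) for p in iter_paragraphs(body)]
everything = list(iter_paragraphs(body, include_deleted=True))
```

A caller who never thinks about revision marks gets the "final" view from the default call, while `include_deleted=True` exposes the rest for the less common cases.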
The revision cases look pretty complex (see the Eric White article on revisions), but definitely not insurmountable, especially if taken on an object-by-object basis. For paragraph, for a start, it looks like these are the cases:
However, to get the "effective after revisions" is simpler, just digging one level into:
I could probably update the current .paragraphs to include these fellows in a single sitting. Might make sense to make it a method at the same time, just to hold open the options. |
Okay Evan, so after a day's reflection, here's what I think makes sense for document.paragraphs, in the tone of how it might appear in the documentation:
Additional features for accessing deleted paragraphs (and perhaps moved paragraphs in their original location) could be added later. I don't have a clear idea of a use case for accessing those, so I'm inclined to leave it alone until one surfaces. The advantage of this approach is that the casual user doesn't have to even know about revision marks in order to operate on the document in the form it would most naturally appear in Word. What do you think? I'll add it as a feature request in a separate issue thread if you concur. |
That sounds good. I suspect that manipulating documents with markup will be a very small subset of use cases anyways, so as long as there is a basic implementation in place for most users, it shouldn't be an issue. People who want to do crazier things can, again, dive into the lxml. |
It looks like you want the user to work entirely through python-docx, as Etree elements are abstracted away through wrapper classes. If that's the case, what are you planning with regards to methods such as iter(), find(), xpath expressions etc.? I know that for simpler documents, statements like document.add_paragraph() are sufficient, but I've found lxml methods like the ones I mentioned above to be invaluable for more involved Docx scripting.