-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959
Conversation
cd85c26
to
03a0e03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, one small comment.
Lib/test/datetimetester.py
Outdated
# the separator, the error message contains the original string | ||
dtstr = "2018-01-03\ud80001:0113" | ||
|
||
with self.assertRaisesRegex(ValueError, f".*{dtstr}"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO just checking that ValueError
is raised is enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The regex is actually the point of the test. Before these changes, the error message was accidentally including the sanitized string, so 2018-01-03\ud80001:0113
would give you something like Invalid isoformat str: 2018-01-03T01:0113
.
As a result of this test, I also realized that the C and pure Python versions had slightly different error messages, so I corrected that (and this test enforces it).
If we don't care to enforce any particular conditions on the error message, we can drop the whole test, since the part about raising ValueError
has pretty good test coverage elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the needs_decref
flag a must to accomplish the logic? Isn't Py_XDECREF
enough?
There is similar issue in the pure Python version. Don't use I suggested to use While we are here, please make other fromisoformat related code PEP-7 compliant: move top-level Also, the parsing code can be simplified. Since different error codes are not distinguished, it is enough to use just |
I would use the following code instead of PyObject *bytes = NULL;
Py_ssize_t len;
const char * dt_ptr = PyUnicode_AsUTF8AndSize(dtstr, &len);
if (dt_ptr == NULL) {
PyErr_Clear();
bytes = _PyUnicode_AsUTF8String(dtstr, "surrogatepass");
if (bytes == NULL) {
return NULL;
}
dt_ptr = PyBytes_AS_STRING(bytes);
len = PyBytes_GET_SIZE(bytes);
}
...
Py_XDECREF(bytes); |
@zhangyangyu Indeed no, because
@serhiy-storchaka That is similar to the code I was using originally, but this version is much faster in certain cases. I think you can see the reasoning in the original PR, #8862.
Good point.
Fine by me, I was mainly worried about the consistency.
Thanks for the tips. I don't do enough PEP 7 C programming to have these rules internalized - is there a standard code formatter (or even a linter) I can use that you recommend?
The reason for the different error codes in the parsing is that in an alternate version of this code (a full-spec ISO8601 parser), I gave more detailed error messages, e.g. "error parsing time" and "error parsing time zone", etc, which required more insight into where the error happened. I decided that for the moment the error should not be that specific to minimize the API maintenance burden, but because we might want to give a richer error message in the future and the work is already done to support it, I decided to leave it. I don't think it hurts anything to leave the return values as is. |
I used |
Lib/test/datetimetester.py
Outdated
with self.assertRaises(ValueError) as cm: | ||
self.theclass.fromisoformat(dtstr) | ||
|
||
msg = cm.exception.args[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was having trouble using a assertRaisesRegex
here, I think because of the escape characters, but in
works fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use assertRaisesRegex(ValueError, re.escape(f"{dtstr!r}"))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you! I thought there must be a way to do this but it seemed like a lot of work to figure it out. Thanks for the pointer. 😄
Sorry I don't get you. I mean something like: PyObject *dtstr_clean = _sanitize_isoformat_str(dtstr);
if (duster_clean == NULL) goto error; // or just `return NULL` and `Py_DECREF` in error
...
Py_DECREF(tz_info);
Py_DECREF(dtstr_clean);
...
error:
Py_XDECREF(dtstr_clean); I think it's easier to understand and maintain. |
@zhangyangyu Consider if I make the following change: diff --git a/Modules/_datetimemodule.c b/Modules/_datetimemodule.c
index 68b1b2f501..78c08bdb19 100644
--- a/Modules/_datetimemodule.c
+++ b/Modules/_datetimemodule.c
@@ -4975,18 +4975,14 @@ datetime_fromisoformat(PyObject *cls, PyObject *dtstr)
second, microsecond, tzinfo, cls);
Py_DECREF(tzinfo);
- if (needs_decref) {
- Py_DECREF(dtstr_clean);
- }
+ Py_DECREF(dtstr_clean);
return dt;
invalid_string_error:
PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);
error:
- if (needs_decref) {
- Py_DECREF(dtstr_clean);
- }
+ Py_XDECREF(dtstr_clean);
return NULL;
} After running
The segfault is because in the first case |
@pganssle Sorry I forget mention we need to |
@zhangyangyu Ah, yeah, that's an elegant solution, I've pushed a new commit implementing it. (Though weirdly it seems to not be showing up here - hopefully just a delay). |
e3466eb
to
255b823
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
255b823
to
253217a
Compare
@serhiy-storchaka Ping. Everything look good? |
@serhiy-storchaka I think this is still just waiting on your review. Given that this fixes some (albeit hard to reach) NULL dereference bugs, it'd be good to get these merged before 3.7.1 |
253217a
to
4c9c91a
Compare
Instead _sanitize_isoformat_str returns a new reference, even to the original string.
I have rewritten the branch's history to group together related changes in atomic commits. The only formatting changes are now in I can squash those together if you prefer, though they handle three separate issues and could be independently reverted if any one of them causes an issue. The first and last commits also handle separate bugs. If you don't want to do a non-squash merge, we can go ahead and do 6 separate PRs (though I think they'd have to be merged in a specific order). |
Hum. I forgot to explain that CPython policy doesn't allow to merge a PR, only to squash changes into one unique commit. That's why I'm asking for a second PR. |
Your reasoning for why you want two PRs applies equally well to each individual atomic commit, not just to "style changes" and "bug fixes", in fact, it's even more important not to squash multiple commits that have actual effects on the program than it is to avoid squashing style changes together with behavior changes. If you squash style changes together and have to revert, you at least aren't changing unrelated things. If you squash two bugfixes together, you can't revert one without reverting the other. If you want to prioritize process here, I can make 6 PRs, or I can make 4 PRs, or we can do a squash merge on what exists now and not worry about it too much. That said, I don't see the point of doing such a thing. The point of a squash merge policy is because it keeps a relatively clean history while avoiding a lot of nitpicking about re-writing the history to squash fixups and the like. Given that I already have created a clean history, none of the downsides of doing a regular merge exist for this PR, but splitting it up into multiple PRs has all of the downsides of a "regular merge with clean histories" policy. |
Again, the only available choice for me is [Squash and merge], the other choices are disabled anyway.
Honestly, I don't think that your refactoring changes deserve so many commits, it's fine to squash them into a single one. For me, Lib/datetime.py + Lib/test/datetimetester.py change is one PR that should be applied to other branches. All other changes should be into a single other PR. Maybe the first PR should be the bugfix. Maybe wait until this one is merged before creating another them (but keep the cleanup commits in a local branch on your side). |
Feel free to merge or not merge as desired. You're also welcome to cherry pick my commits into whatever number of PRs you want. |
I read individual commits and now I'm confused which parts are related to the bugfix or not. I let you (@pganssle) reorganize commits for that :-)
This change looks a bugfix, so I suggest to include it in the bugfix PR as well. |
You seem to have a pretty good idea of what you want this to look like, so I'm not terribly interested in doing a guess-and-check. IMO the only style-only commit is d809d71. Everything else actually changes the behavior of the program. If you're going to merge all the bug fixes together, might as well merge the style changes with them and if something needs reversion I'll just make a partial reversion PR or something. Seems like less work than rewriting the history and cherry-picking out separate PRs just on the off chance that one of these things needs to be reverted (and a partial re-reversion would need to happen again anyway, since reverting the "all kinds of bugfixes" commit would revert a bunch of stuff that still needs fixing). |
I think we can still "rebase and merge", just not via the GitHub UI. I don't mind doing it myself. No need for additional PRs, IMO. Please stop spending time on this discussion. |
Lib/test/datetimetester.py
Outdated
self.theclass.fromisoformat(dtstr) | ||
|
||
msg = cm.exception.args[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just str(cm.exception)
.
@pganssle: So I tried to extract what I consider as a "bugfix", but I had to read the full history of this PR and full history of https://bugs.python.org/issue34454 and it seems like I misunderstood this PR. This PR is "mostly" cleanup, except that it also fix an inconsistency between the C and the Python implementation. Previously, the input string was formated by str() in Python but repr() in C. With the PR, repr() is used in C and Python, which is the right solution. Sorry, I skipped most of the history when I reviewed your PR and when I followed the link to the bug, I saw that there was a bug about surrogate characters. I understood that the PR fixed the bug... except that the bug is already fixed... [UPDATE] I was mostly confused by the fact that you added a new test, as if you fixed a bug. Well, your PR changes the behaviour, but it's a subtle change about an error message. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Note: I agree that this change must be backported to 3.7 (even if it's mostly "cleanup") to ease future bugfixes in fromisoformat(). |
@vstinner I think the new test actually does fix a bug resulting from the PR for which this is a cleanup, but there are a few other bugs fixed in there for which no test is possible. The bug fixed where adding a test was possible was that you could pass an invalid string like this: The first commit in the PR also fixes a null dereference bug IIRC. One of the exception-handling commits also fixes a bug with over-zealous error catching. |
@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI." How would you proceed? |
I merged this PR. I has been approved by 4 developers including 3 core developers, I agree with @taleinat: it has been discussed enough :-) |
Note: I ran " ./python -m test -R 3:3 test_datetime" before merging the PR, and regrtest didn't find any reference leak. |
For future reference: Just use a git client e.g. the cli; rebase, merge into master, push master. (... unless GitHub's master branch is now protected and we can't push directly into it?) |
So apparently I was wrong. Currently, all of our long-lived branches are protected, and the only way to merge a PR as two commits is to create an extra PR. |
…GH-8959) * Use _PyUnicode_Copy in sanitize_isoformat_str * Use repr in fromisoformat error message This reverses commit 67b74a98b2 per Serhiy Storchaka's suggestion: I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters. We agree that it is better to change both the C and pure Python versions to use repr. * Retain non-sanitized dtstr for error printing This does not create an extra string, it just holds on to a reference to the original input string for purposes of creating the error message. * PEP 7 fixes to from_isoformat * Separate handling of Unicode and other errors In the initial implementation, errors other than encoding errors would both raise an error indicating an invalid format, which would not be true for errors like MemoryError. * Drop needs_decref from _sanitize_isoformat_str Instead _sanitize_isoformat_str returns a new reference, even to the original string. (cherry picked from commit 3df8540) Co-authored-by: Paul Ganssle <[email protected]>
GH-10041 is a backport of this pull request to the 3.7 branch. |
* Use _PyUnicode_Copy in sanitize_isoformat_str * Use repr in fromisoformat error message This reverses commit 67b74a98b2 per Serhiy Storchaka's suggestion: I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters. We agree that it is better to change both the C and pure Python versions to use repr. * Retain non-sanitized dtstr for error printing This does not create an extra string, it just holds on to a reference to the original input string for purposes of creating the error message. * PEP 7 fixes to from_isoformat * Separate handling of Unicode and other errors In the initial implementation, errors other than encoding errors would both raise an error indicating an invalid format, which would not be true for errors like MemoryError. * Drop needs_decref from _sanitize_isoformat_str Instead _sanitize_isoformat_str returns a new reference, even to the original string. (cherry picked from commit 3df8540) Co-authored-by: Paul Ganssle <[email protected]>
This is a fixup PR for #8862, per @serhiy-storchaka's comments. I have addressed two PEP-7 violations, switched to using
_PyUnicode_Copy
in_sanitize_fromisoformat_str
, and fixed an issue where the sanitized string (rather than the original string) was displayed as part of the error message.Additionally, I noticed that the pure Python implementation uses the equivalent of
%U
instead of%R
in its error printing, so I switched all thefromisoformat
errors over to using%U
for consistency.bpo-34454
https://bugs.python.org/issue34454