Skip to content

bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 22, 2018

Conversation

pganssle
Copy link
Member

@pganssle pganssle commented Aug 27, 2018

This is a fixup PR for #8862, per @serhiy-storchaka's comments. I have addressed two PEP-7 violations, switched to using _PyUnicode_Copy in _sanitize_fromisoformat_str, and fixed an issue where the sanitized string (rather than the original string) was displayed as part of the error message.

Additionally, I noticed that the pure Python implementation uses the equivalent of %U instead of %R in its error printing, so I switched all the fromisoformat errors over to using %U for consistency.

bpo-34454

https://bugs.python.org/issue34454

@pganssle pganssle force-pushed the cleanup_fromisoformat_surrogate branch 3 times, most recently from cd85c26 to 03a0e03 Compare August 27, 2018 21:26
@pganssle
Copy link
Member Author

pganssle commented Aug 27, 2018

CC @taleinat @izbyshev

Copy link
Contributor

@taleinat taleinat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one small comment.

# the separator, the error message contains the original string
dtstr = "2018-01-03\ud80001:0113"

with self.assertRaisesRegex(ValueError, f".*{dtstr}"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO just checking that ValueError is raised is enough.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex is actually the point of the test. Before these changes, the error message was accidentally including the sanitized string, so 2018-01-03\ud80001:0113 would give you something like Invalid isoformat str: 2018-01-03T01:0113.

As a result of this test, I also realized that the C and pure Python versions had slightly different error messages, so I corrected that (and this test enforces it).

If we don't care to enforce any particular conditions on the error message, we can drop the whole test, since the part about raising ValueError has pretty good test coverage elsewhere.

Copy link
Member

@zhangyangyu zhangyangyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the needs_decref flag a must to accomplish the logic? Isn't Py_XDECREF enough?

@serhiy-storchaka
Copy link
Member

PyUnicode_AsUTF8AndSize() can fail not only due to a lone surrogate, but because of MemoryError. In this case the exception should not be replaced with a ValueError.

There is similar issue in the pure Python version. Don't use except Exception, it can hide unexpected errors.

I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters. It is better to change the pure Python version.

While we are here, please make other fromisoformat related code PEP-7 compliant: move top-level { on separate line, break a line after } in } else, break a line after static int in static int parse_isoformat_date, add spaces around operators (e.g. rv?-5:1 => rv ? -5 : 1).

Also, the parsing code can be simplified. Since different error codes are not distinguished, it is enough to use just -1 and merge conditions.

@serhiy-storchaka
Copy link
Member

serhiy-storchaka commented Aug 28, 2018

I would use the following code instead of _sanitize_isoformat_str:

    PyObject *bytes = NULL;
    Py_ssize_t len;
    const char * dt_ptr = PyUnicode_AsUTF8AndSize(dtstr, &len);

    if (dt_ptr == NULL) {
        PyErr_Clear();
        bytes = _PyUnicode_AsUTF8String(dtstr, "surrogatepass");
        if (bytes == NULL) {
            return NULL;
        }
        dt_ptr = PyBytes_AS_STRING(bytes);
        len = PyBytes_GET_SIZE(bytes);
    }

    ...

    Py_XDECREF(bytes);

@pganssle
Copy link
Member Author

Is the needs_decref flag a must to accomplish the logic? Isn't Py_XDECREF enough?

@zhangyangyu Indeed no, because _sanitize_isoformat_str returns the original string in most cases, and the reference to dtstr is owned by the calling function.

I would use the following code instead of _sanitize_isoformat_str:

@serhiy-storchaka That is similar to the code I was using originally, but this version is much faster in certain cases. I think you can see the reasoning in the original PR, #8862.

PyUnicode_AsUTF8AndSize() can fail not only due to a lone surrogate, but because of MemoryError. In this case the exception should not be replaced with a ValueError.

Good point.

I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters. It is better to change the pure Python version.

Fine by me, I was mainly worried about the consistency.

While we are here, please make other fromisoformat related code PEP-7 compliant: move top-level { on separate line, break a line after } in } else, break a line after static int in static int parse_isoformat_date, add spaces around operators (e.g. rv?-5:1 => rv ? -5 : 1).

Thanks for the tips. I don't do enough PEP 7 C programming to have these rules internalized - is there a standard code formatter (or even a linter) I can use that you recommend?

Also, the parsing code can be simplified. Since different error codes are not distinguished, it is enough to use just -1 and merge conditions.

The reason for the different error codes in the parsing is that in an alternate version of this code (a full-spec ISO8601 parser), I gave more detailed error messages, e.g. "error parsing time" and "error parsing time zone", etc, which required more insight into where the error happened. I decided that for the moment the error should not be that specific to minimize the API maintenance burden, but because we might want to give a richer error message in the future and the work is already done to support it, I decided to leave it. I don't think it hurts anything to leave the return values as is.

@pganssle
Copy link
Member Author

I used clang-format with this PEP 7-style settings file (coincidentally from an iso 8601 parser, the first google result for a PEP 7 clang-format file). I don't think there's any guarantee that I got all the PEP 7 violations (plus clang-format changed a lot more than my code in that file, so I had to filter it out manually). @serhiy-storchaka @abalkin Please let me know if you spot any more violations.

with self.assertRaises(ValueError) as cm:
self.theclass.fromisoformat(dtstr)

msg = cm.exception.args[0]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was having trouble using a assertRaisesRegex here, I think because of the escape characters, but in works fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use assertRaisesRegex(ValueError, re.escape(f"{dtstr!r}")).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you! I thought there must be a way to do this but it seemed like a lot of work to figure it out. Thanks for the pointer. 😄

@zhangyangyu
Copy link
Member

zhangyangyu commented Aug 28, 2018

Indeed no, because _sanitize_isoformat_str returns the original string in most cases, and the reference to dtstr is owned by the calling function.

Sorry I don't get you. I mean something like:

PyObject *dtstr_clean = _sanitize_isoformat_str(dtstr);
if (duster_clean == NULL) goto error;  // or just `return NULL` and `Py_DECREF` in error
...
Py_DECREF(tz_info);
Py_DECREF(dtstr_clean);
...
error:
    Py_XDECREF(dtstr_clean);

I think it's easier to understand and maintain.

@pganssle
Copy link
Member Author

@zhangyangyu _sanitize_fromisoformat_str(dtstr) when dtstr = "2018-01-01" will return dtstr, not NULL. Since we have a borrowed reference to dtstr we cannot decrement the reference count. dtstr_clean is NULL only if an error has occurred in _sanitize_fromisoformat_str. In most cases it returns the input (since most inputs are pre-sanitized), but occasionally it returns a new object, and datetime_fromisoformat must own that object. needs_decref is a flag to indicate whether or not datetime_fromisoformat owns dtstr_clean.

Consider if I make the following change:

diff --git a/Modules/_datetimemodule.c b/Modules/_datetimemodule.c
index 68b1b2f501..78c08bdb19 100644
--- a/Modules/_datetimemodule.c
+++ b/Modules/_datetimemodule.c
@@ -4975,18 +4975,14 @@ datetime_fromisoformat(PyObject *cls, PyObject *dtstr)
                                             second, microsecond, tzinfo, cls);
 
     Py_DECREF(tzinfo);
-    if (needs_decref) {
-        Py_DECREF(dtstr_clean);
-    }
+    Py_DECREF(dtstr_clean);
     return dt;
 
 invalid_string_error:
     PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);
 
 error:
-    if (needs_decref) {
-        Py_DECREF(dtstr_clean);
-    }
+    Py_XDECREF(dtstr_clean);
 
     return NULL;
 }

After running make, I do:

$ ./python -c "from datetime import datetime; print(datetime.fromisoformat('2018-01-01'))"
Fatal Python error: Objects/listobject.c:341 object at 0x7f4dc6ef5f20 has negative ref count -2604246222170760230

Current thread 0x00007f4dc742b080 (most recent call first):
Aborted (core dumped)
$ ./python -c "from datetime import datetime; print(datetime.fromisoformat('2018-01-01\ud80001:01:01'))"
2018-01-01 01:01:01

The segfault is because in the first case datetime_fromisoformat had a dtstr_clean pointing to dtstr, a borrowed reference. In the second case, dtstr_clean pointed to a temporary object owned by datetime_fromisoformat.

@zhangyangyu
Copy link
Member

zhangyangyu commented Aug 28, 2018

@pganssle Sorry I forget mention we need to Py_INCREF(dtstr) in _sanitize. (I edited my original comment but you have already replied :-()

@pganssle
Copy link
Member Author

pganssle commented Aug 28, 2018

@zhangyangyu Ah, yeah, that's an elegant solution, I've pushed a new commit implementing it. (Though weirdly it seems to not be showing up here - hopefully just a delay).

@pganssle pganssle force-pushed the cleanup_fromisoformat_surrogate branch 2 times, most recently from e3466eb to 255b823 Compare August 28, 2018 18:55
Copy link
Contributor

@izbyshev izbyshev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pganssle pganssle force-pushed the cleanup_fromisoformat_surrogate branch from 255b823 to 253217a Compare September 4, 2018 14:36
@pganssle
Copy link
Member Author

pganssle commented Sep 4, 2018

@serhiy-storchaka Ping. Everything look good?

@pganssle
Copy link
Member Author

@serhiy-storchaka I think this is still just waiting on your review. Given that this fixes some (albeit hard to reach) NULL dereference bugs, it'd be good to get these merged before 3.7.1

@pganssle pganssle force-pushed the cleanup_fromisoformat_surrogate branch from 253217a to 4c9c91a Compare September 19, 2018 13:23
Instead _sanitize_isoformat_str returns a new reference, even to the
original string.
@pganssle
Copy link
Member Author

I have rewritten the branch's history to group together related changes in atomic commits. The only formatting changes are now in d809d712. There are three separate commits for error handling-related changes:

I can squash those together if you prefer, though they handle three separate issues and could be independently reverted if any one of them causes an issue. The first and last commits also handle separate bugs.

If you don't want to do a non-squash merge, we can go ahead and do 6 separate PRs (though I think they'd have to be merged in a specific order).

@vstinner
Copy link
Member

I have rewritten the branch's history to group together related changes in atomic commits.

Hum. I forgot to explain that CPython policy doesn't allow to merge a PR, only to squash changes into one unique commit. That's why I'm asking for a second PR.

@pganssle
Copy link
Member Author

pganssle commented Oct 22, 2018

Hum. I forgot to explain that CPython policy doesn't allow to merge a PR, only to squash changes into one unique commit. That's why I'm asking for a second PR.

Your reasoning for why you want two PRs applies equally well to each individual atomic commit, not just to "style changes" and "bug fixes", in fact, it's even more important not to squash multiple commits that have actual effects on the program than it is to avoid squashing style changes together with behavior changes. If you squash style changes together and have to revert, you at least aren't changing unrelated things. If you squash two bugfixes together, you can't revert one without reverting the other.

If you want to prioritize process here, I can make 6 PRs, or I can make 4 PRs, or we can do a squash merge on what exists now and not worry about it too much. That said, I don't see the point of doing such a thing. The point of a squash merge policy is because it keeps a relatively clean history while avoiding a lot of nitpicking about re-writing the history to squash fixups and the like. Given that I already have created a clean history, none of the downsides of doing a regular merge exist for this PR, but splitting it up into multiple PRs has all of the downsides of a "regular merge with clean histories" policy.

@vstinner
Copy link
Member

it's even more important not to squash multiple commits

Again, the only available choice for me is [Squash and merge], the other choices are disabled anyway.

If you want to prioritize process here, I can make 6 PRs, or I can make 4 PRs, ...

Honestly, I don't think that your refactoring changes deserve so many commits, it's fine to squash them into a single one.

For me, Lib/datetime.py + Lib/test/datetimetester.py change is one PR that should be applied to other branches.

All other changes should be into a single other PR. Maybe the first PR should be the bugfix. Maybe wait until this one is merged before creating another them (but keep the cleanup commits in a local branch on your side).

@pganssle
Copy link
Member Author

Feel free to merge or not merge as desired. You're also welcome to cherry pick my commits into whatever number of PRs you want.

@vstinner
Copy link
Member

I read individual commits and now I'm confused which parts are related to the bugfix or not. I let you (@pganssle) reorganize commits for that :-)

   Py_ssize_t len = PyUnicode_GetLength(dtstr);
+    if (len < 0) {
+        return NULL;
+    }

This change looks a bugfix, so I suggest to include it in the bugfix PR as well.

@pganssle
Copy link
Member Author

pganssle commented Oct 22, 2018

You seem to have a pretty good idea of what you want this to look like, so I'm not terribly interested in doing a guess-and-check. IMO the only style-only commit is d809d71. Everything else actually changes the behavior of the program.

If you're going to merge all the bug fixes together, might as well merge the style changes with them and if something needs reversion I'll just make a partial reversion PR or something. Seems like less work than rewriting the history and cherry-picking out separate PRs just on the off chance that one of these things needs to be reverted (and a partial re-reversion would need to happen again anyway, since reverting the "all kinds of bugfixes" commit would revert a bunch of stuff that still needs fixing).

@taleinat
Copy link
Contributor

I think we can still "rebase and merge", just not via the GitHub UI. I don't mind doing it myself. No need for additional PRs, IMO. Please stop spending time on this discussion.

self.theclass.fromisoformat(dtstr)

msg = cm.exception.args[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or just str(cm.exception).

@vstinner
Copy link
Member

vstinner commented Oct 22, 2018

@pganssle: So I tried to extract what I consider as a "bugfix", but I had to read the full history of this PR and full history of https://bugs.python.org/issue34454 and it seems like I misunderstood this PR. This PR is "mostly" cleanup, except that it also fix an inconsistency between the C and the Python implementation. Previously, the input string was formated by str() in Python but repr() in C. With the PR, repr() is used in C and Python, which is the right solution.

Sorry, I skipped most of the history when I reviewed your PR and when I followed the link to the bug, I saw that there was a bug about surrogate characters. I understood that the PR fixed the bug... except that the bug is already fixed...

[UPDATE] I was mostly confused by the fact that you added a new test, as if you fixed a bug. Well, your PR changes the behaviour, but it's a subtle change about an error message.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@vstinner
Copy link
Member

Note: I agree that this change must be backported to 3.7 (even if it's mostly "cleanup") to ease future bugfixes in fromisoformat().

@pganssle
Copy link
Member Author

@vstinner I think the new test actually does fix a bug resulting from the PR for which this is a cleanup, but there are a few other bugs fixed in there for which no test is possible.

The bug fixed where adding a test was possible was that you could pass an invalid string like this: "2018-01-03\ud80001:0113" to datetime.fromisoformat, and the error message would display the string with an error in it as "2018-01-03T01:0113" instead - it was leaking an implementation detail.

The first commit in the PR also fixes a null dereference bug IIRC. One of the exception-handling commits also fixes a bug with over-zealous error catching.

@vstinner vstinner merged commit 3df8540 into python:master Oct 22, 2018
@vstinner
Copy link
Member

@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI."

How would you proceed?

@vstinner
Copy link
Member

I merged this PR. I has been approved by 4 developers including 3 core developers, I agree with @taleinat: it has been discussed enough :-)

@vstinner
Copy link
Member

Note: I ran " ./python -m test -R 3:3 test_datetime" before merging the PR, and regrtest didn't find any reference leak.

@taleinat
Copy link
Contributor

@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI."

How would you proceed?

For future reference: Just use a git client e.g. the cli; rebase, merge into master, push master.

(... unless GitHub's master branch is now protected and we can't push directly into it?)

@miss-islington
Copy link
Contributor

Thanks @pganssle for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

@taleinat
Copy link
Contributor

@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI."
How would you proceed?

For future reference: Just use a git client e.g. the cli; rebase, merge into master, push master.

(... unless GitHub's master branch is now protected and we can't push directly into it?)

So apparently I was wrong. Currently, all of our long-lived branches are protected, and the only way to merge a PR as two commits is to create an extra PR.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Oct 22, 2018
…GH-8959)

* Use _PyUnicode_Copy in sanitize_isoformat_str

* Use repr in fromisoformat error message

This reverses commit 67b74a98b2 per Serhiy Storchaka's suggestion:

     I suggested to use %R in the error message because including the raw
     string can be confusing in the case of empty string, or string
     containing trailing whitespaces, invisible or unprintable characters.

We agree that it is better to change both the C and pure Python versions
to use repr.

* Retain non-sanitized dtstr for error printing

This does not create an extra string, it just holds on to a reference to
the original input string for purposes of creating the error message.

* PEP 7 fixes to from_isoformat

* Separate handling of Unicode and other errors

In the initial implementation, errors other than encoding errors would
both raise an error indicating an invalid format, which would not be
true for errors like MemoryError.

* Drop needs_decref from _sanitize_isoformat_str

Instead _sanitize_isoformat_str returns a new reference, even to the
original string.
(cherry picked from commit 3df8540)

Co-authored-by: Paul Ganssle <[email protected]>
@bedevere-bot
Copy link

GH-10041 is a backport of this pull request to the 3.7 branch.

miss-islington added a commit that referenced this pull request Oct 22, 2018
* Use _PyUnicode_Copy in sanitize_isoformat_str

* Use repr in fromisoformat error message

This reverses commit 67b74a98b2 per Serhiy Storchaka's suggestion:

     I suggested to use %R in the error message because including the raw
     string can be confusing in the case of empty string, or string
     containing trailing whitespaces, invisible or unprintable characters.

We agree that it is better to change both the C and pure Python versions
to use repr.

* Retain non-sanitized dtstr for error printing

This does not create an extra string, it just holds on to a reference to
the original input string for purposes of creating the error message.

* PEP 7 fixes to from_isoformat

* Separate handling of Unicode and other errors

In the initial implementation, errors other than encoding errors would
both raise an error indicating an invalid format, which would not be
true for errors like MemoryError.

* Drop needs_decref from _sanitize_isoformat_str

Instead _sanitize_isoformat_str returns a new reference, even to the
original string.
(cherry picked from commit 3df8540)

Co-authored-by: Paul Ganssle <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants