bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959

pganssle · 2018-08-27T21:22:10Z

This is a fixup PR for #8862, per @serhiy-storchaka's comments. I have addressed two PEP-7 violations, switched to using _PyUnicode_Copy in _sanitize_fromisoformat_str, and fixed an issue where the sanitized string (rather than the original string) was displayed as part of the error message.

Additionally, I noticed that the pure Python implementation uses the equivalent of %U instead of %R in its error printing, so I switched all the fromisoformat errors over to using %U for consistency.

bpo-34454

https://bugs.python.org/issue34454

pganssle · 2018-08-27T21:26:26Z

CC @taleinat @izbyshev

taleinat

LGTM, one small comment.

taleinat · 2018-08-27T22:06:17Z

Lib/test/datetimetester.py

+        # the separator, the error message contains the original string
+        dtstr = "2018-01-03\ud80001:0113"
+
+        with self.assertRaisesRegex(ValueError, f".*{dtstr}"):


IMO just checking that ValueError is raised is enough.

The regex is actually the point of the test. Before these changes, the error message was accidentally including the sanitized string, so 2018-01-03\ud80001:0113 would give you something like Invalid isoformat str: 2018-01-03T01:0113.

As a result of this test, I also realized that the C and pure Python versions had slightly different error messages, so I corrected that (and this test enforces it).

If we don't care to enforce any particular conditions on the error message, we can drop the whole test, since the part about raising ValueError has pretty good test coverage elsewhere.
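The leak described above can be modelled in a few lines of pure Python. This is only a sketch, not the C implementation: `sanitize_separator` and `error_message` are hypothetical helper names, and the separator is assumed to sit at index 10, as it does for a `YYYY-MM-DD` date.

```python
def sanitize_separator(dtstr):
    """Replace a lone surrogate in the separator slot (index 10) with 'T'."""
    if len(dtstr) > 10 and "\ud800" <= dtstr[10] <= "\udfff":
        return dtstr[:10] + "T" + dtstr[11:]
    return dtstr


def error_message(reported):
    """Build the error text from whichever string the caller hands in."""
    return f"Invalid isoformat string: {reported!r}"


original = "2018-01-03\ud80001:0113"
sanitized = sanitize_separator(original)

leaky = error_message(sanitized)   # exposes the internal, sanitized copy
correct = error_message(original)  # shows what the caller actually passed
```

Here `leaky` mentions `2018-01-03T01:0113`, a string the caller never supplied, while `correct` shows the input with its surrogate escaped.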

zhangyangyu

Is the needs_decref flag a must to accomplish the logic? Isn't Py_XDECREF enough?

serhiy-storchaka · 2018-08-28T05:17:33Z

PyUnicode_AsUTF8AndSize() can fail not only due to a lone surrogate, but because of MemoryError. In this case the exception should not be replaced with a ValueError.

There is a similar issue in the pure Python version. Don't use except Exception; it can hide unexpected errors.
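A small sketch of why the broad handler is a problem, assuming a stand-in parser that fails for a reason unrelated to the input. The names `parse_broad`, `parse_narrow` and `out_of_memory` are hypothetical and exist only for this illustration.

```python
def parse_broad(parser, text):
    """Anti-pattern: every failure is reported as a format error."""
    try:
        return parser(text)
    except Exception:
        raise ValueError(f"Invalid isoformat string: {text!r}")


def parse_narrow(parser, text):
    """Only the parser's own ValueError is rewritten; other errors propagate."""
    try:
        return parser(text)
    except ValueError:
        raise ValueError(f"Invalid isoformat string: {text!r}") from None


def out_of_memory(text):
    # Stand-in for a parser that fails because of resource exhaustion.
    raise MemoryError("simulated allocation failure")
```

With `out_of_memory` as the parser, `parse_broad` reports a misleading ValueError about the string format, whereas `parse_narrow` lets the MemoryError reach the caller unchanged.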

I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters. It is better to change the pure Python version.

While we are here, please make other fromisoformat related code PEP-7 compliant: move top-level { on separate line, break a line after } in } else, break a line after static int in static int parse_isoformat_date, add spaces around operators (e.g. rv?-5:1 => rv ? -5 : 1).

Also, the parsing code can be simplified. Since different error codes are not distinguished, it is enough to use just -1 and merge conditions.

serhiy-storchaka · 2018-08-28T05:31:06Z

I would use the following code instead of _sanitize_isoformat_str:

    PyObject *bytes = NULL;
    Py_ssize_t len;
    const char * dt_ptr = PyUnicode_AsUTF8AndSize(dtstr, &len);

    if (dt_ptr == NULL) {
        PyErr_Clear();
        bytes = _PyUnicode_AsUTF8String(dtstr, "surrogatepass");
        if (bytes == NULL) {
            return NULL;
        }
        dt_ptr = PyBytes_AS_STRING(bytes);
        len = PyBytes_GET_SIZE(bytes);
    }

    ...

    Py_XDECREF(bytes);
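For readers less familiar with the C API, roughly the same fallback can be sketched in Python. `encode_for_parsing` is a hypothetical name, and unlike the C snippet above this version retries only on an encoding error rather than clearing whatever exception occurred.

```python
def encode_for_parsing(dtstr):
    """Encode to UTF-8, letting lone surrogates through only when needed."""
    try:
        # Fast path: ordinary strings encode without any special handler.
        return dtstr.encode("utf-8")
    except UnicodeEncodeError:
        # Slow path: "surrogatepass" encodes a lone surrogate as three bytes.
        return dtstr.encode("utf-8", "surrogatepass")
```

A lone U+D800 comes out as the bytes `ED A0 80`, so a byte-oriented parser sees a multi-byte separator instead of failing at the encoding step.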

pganssle · 2018-08-28T13:06:28Z

Is the needs_decref flag a must to accomplish the logic? Isn't Py_XDECREF enough?

@zhangyangyu Indeed no, because _sanitize_isoformat_str returns the original string in most cases, and the reference to dtstr is owned by the calling function.

I would use the following code instead of _sanitize_isoformat_str:

@serhiy-storchaka That is similar to the code I was using originally, but this version is much faster in certain cases. I think you can see the reasoning in the original PR, #8862.

PyUnicode_AsUTF8AndSize() can fail not only due to a lone surrogate, but because of MemoryError. In this case the exception should not be replaced with a ValueError.

Good point.

I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters. It is better to change the pure Python version.

Fine by me, I was mainly worried about the consistency.
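The difference between the two format codes is easy to see from Python, where `%U` corresponds to inserting the string as-is and `%R` to inserting its repr. The helper names below are hypothetical.

```python
def message_with_str(dtstr):
    # Equivalent of %U: the raw string is pasted into the message.
    return f"Invalid isoformat string: {dtstr}"


def message_with_repr(dtstr):
    # Equivalent of %R: quotes and escapes make odd inputs visible.
    return f"Invalid isoformat string: {dtstr!r}"


empty_raw = message_with_str("")
empty_repr = message_with_repr("")
trailing_repr = message_with_repr("2018-01-01 ")
invisible_repr = message_with_repr("2018-01-01\u200b")
```

With the raw form, an empty input produces a message that simply ends after the colon. The repr form shows `''`, keeps the trailing space inside the quotes, and escapes the zero-width space as `\u200b`.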

While we are here, please make other fromisoformat related code PEP-7 compliant: move top-level { on separate line, break a line after } in } else, break a line after static int in static int parse_isoformat_date, add spaces around operators (e.g. rv?-5:1 => rv ? -5 : 1).

Thanks for the tips. I don't do enough PEP 7 C programming to have these rules internalized - is there a standard code formatter (or even a linter) I can use that you recommend?

Also, the parsing code can be simplified. Since different error codes are not distinguished, it is enough to use just -1 and merge conditions.

The reason for the different error codes in the parsing is that in an alternate version of this code (a full-spec ISO8601 parser), I gave more detailed error messages, e.g. "error parsing time" and "error parsing time zone", etc, which required more insight into where the error happened. I decided that for the moment the error should not be that specific to minimize the API maintenance burden, but because we might want to give a richer error message in the future and the work is already done to support it, I decided to leave it. I don't think it hurts anything to leave the return values as is.

pganssle · 2018-08-28T14:13:18Z

I used clang-format with this PEP 7-style settings file (coincidentally from an ISO 8601 parser, the first Google result for a PEP 7 clang-format file). I don't think there's any guarantee that I got all the PEP 7 violations (plus clang-format changed a lot more than my code in that file, so I had to filter it out manually). @serhiy-storchaka @abalkin Please let me know if you spot any more violations.

pganssle · 2018-08-28T14:15:05Z

Lib/test/datetimetester.py

+        with self.assertRaises(ValueError) as cm:
+            self.theclass.fromisoformat(dtstr)
+
+        msg = cm.exception.args[0]


I was having trouble using assertRaisesRegex here, I think because of the escape characters, but `in` works fine.

You could use assertRaisesRegex(ValueError, re.escape(f"{dtstr!r}")).

Ah, thank you! I thought there must be a way to do this but it seemed like a lot of work to figure it out. Thanks for the pointer. 😄
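A self-contained version of the suggested assertion might look like the following. It assumes an interpreter that already includes this change, so that the error message contains the repr of the input. The class and method names are hypothetical.

```python
import re
import unittest
from datetime import datetime


class FromIsoformatErrorMessageTest(unittest.TestCase):
    def test_error_message_shows_original_string(self):
        dtstr = "2018-01-03\ud80001:0113"
        # repr() renders the lone surrogate as the characters \ud800, and
        # re.escape() stops the regex engine from reinterpreting that
        # backslash sequence as an escape for the surrogate itself.
        with self.assertRaisesRegex(ValueError, re.escape(repr(dtstr))):
            datetime.fromisoformat(dtstr)
```

Without `re.escape`, the pattern would contain `\ud800`, which the `re` module reads as the code point U+D800 rather than as a literal backslash followed by `ud800`, so it would not match the escaped text in the message.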

zhangyangyu · 2018-08-28T15:34:38Z

Indeed no, because _sanitize_isoformat_str returns the original string in most cases, and the reference to dtstr is owned by the calling function.

Sorry I don't get you. I mean something like:

PyObject *dtstr_clean = _sanitize_isoformat_str(dtstr);
if (dtstr_clean == NULL) goto error;  // or just `return NULL` and `Py_DECREF` in error
...
Py_DECREF(tz_info);
Py_DECREF(dtstr_clean);
...
error:
    Py_XDECREF(dtstr_clean);

I think it's easier to understand and maintain.

pganssle · 2018-08-28T15:48:54Z

@zhangyangyu _sanitize_fromisoformat_str(dtstr) when dtstr = "2018-01-01" will return dtstr, not NULL. Since we have a borrowed reference to dtstr we cannot decrement the reference count. dtstr_clean is NULL only if an error has occurred in _sanitize_fromisoformat_str. In most cases it returns the input (since most inputs are pre-sanitized), but occasionally it returns a new object, and datetime_fromisoformat must own that object. needs_decref is a flag to indicate whether or not datetime_fromisoformat owns dtstr_clean.

Consider if I make the following change:

diff --git a/Modules/_datetimemodule.c b/Modules/_datetimemodule.c
index 68b1b2f501..78c08bdb19 100644
--- a/Modules/_datetimemodule.c
+++ b/Modules/_datetimemodule.c
@@ -4975,18 +4975,14 @@ datetime_fromisoformat(PyObject *cls, PyObject *dtstr)
                                             second, microsecond, tzinfo, cls);
 
     Py_DECREF(tzinfo);
-    if (needs_decref) {
-        Py_DECREF(dtstr_clean);
-    }
+    Py_DECREF(dtstr_clean);
     return dt;
 
 invalid_string_error:
     PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);
 
 error:
-    if (needs_decref) {
-        Py_DECREF(dtstr_clean);
-    }
+    Py_XDECREF(dtstr_clean);
 
     return NULL;
 }

After running make, I do:

$ ./python -c "from datetime import datetime; print(datetime.fromisoformat('2018-01-01'))"
Fatal Python error: Objects/listobject.c:341 object at 0x7f4dc6ef5f20 has negative ref count -2604246222170760230

Current thread 0x00007f4dc742b080 (most recent call first):
Aborted (core dumped)
$ ./python -c "from datetime import datetime; print(datetime.fromisoformat('2018-01-01\ud80001:01:01'))"
2018-01-01 01:01:01

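The crash occurs because in the first case datetime_fromisoformat had a dtstr_clean pointing to dtstr, a borrowed reference, so the unconditional Py_DECREF released a reference the function did not own. In the second case, dtstr_clean pointed to a temporary object owned by datetime_fromisoformat, so the Py_DECREF was correct.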

zhangyangyu · 2018-08-28T15:56:20Z

@pganssle Sorry, I forgot to mention that we need to Py_INCREF(dtstr) in _sanitize. (I edited my original comment, but you had already replied :-()

pganssle · 2018-08-28T16:12:36Z

@zhangyangyu Ah, yeah, that's an elegant solution, I've pushed a new commit implementing it. (Though weirdly it seems to not be showing up here - hopefully just a delay).
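The ownership rule agreed on here can be illustrated with a toy reference-counting model in Python. Nothing below is CPython's real machinery: `ToyObject`, `incref`, `decref`, `sanitize` and `fromisoformat_model` are hypothetical names, and the counter is an ordinary attribute used only to show the bookkeeping.

```python
class ToyObject:
    """A value with an explicit, hand-managed reference count."""

    def __init__(self, value):
        self.value = value
        self.refcount = 1  # the creator owns one reference


def incref(obj):
    obj.refcount += 1


def decref(obj):
    if obj.refcount <= 0:
        raise RuntimeError("reference count would go negative")
    obj.refcount -= 1


def sanitize(dtstr):
    """Always hand the caller a reference it owns."""
    text = dtstr.value
    if len(text) > 10 and "\ud800" <= text[10] <= "\udfff":
        # New object: the caller owns its single reference.
        return ToyObject(text[:10] + "T" + text[11:])
    # Same object: add a reference so the caller can release it safely.
    incref(dtstr)
    return dtstr


def fromisoformat_model(dtstr):
    clean = sanitize(dtstr)
    try:
        return clean.value
    finally:
        decref(clean)  # unconditional, so no needs_decref flag is required
```

Because `sanitize` returns an owned reference on both paths, the caller releases it the same way every time and the input object's count is back to its starting value afterwards.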

izbyshev

LGTM!

Lib/test/datetimetester.py

pganssle · 2018-09-04T14:38:29Z

@serhiy-storchaka Ping. Everything look good?

pganssle · 2018-09-19T13:22:35Z

@serhiy-storchaka I think this is still just waiting on your review. Given that this fixes some (albeit hard to reach) NULL dereference bugs, it'd be good to get these merged before 3.7.1

pganssle · 2018-10-21T16:27:17Z

I have rewritten the branch's history to group together related changes in atomic commits. The only formatting changes are now in d809d712. There are three separate commits for error handling-related changes:

I can squash those together if you prefer, though they handle three separate issues and could be independently reverted if any one of them causes an issue. The first and last commits also handle separate bugs.

If you don't want to do a non-squash merge, we can go ahead and do 6 separate PRs (though I think they'd have to be merged in a specific order).

vstinner · 2018-10-22T14:09:47Z

I have rewritten the branch's history to group together related changes in atomic commits.

Hum. I forgot to explain that CPython policy doesn't allow to merge a PR, only to squash changes into one unique commit. That's why I'm asking for a second PR.

pganssle · 2018-10-22T14:33:17Z

Hum. I forgot to explain that CPython policy doesn't allow to merge a PR, only to squash changes into one unique commit. That's why I'm asking for a second PR.

Your reasoning for why you want two PRs applies equally well to each individual atomic commit, not just to "style changes" and "bug fixes". In fact, it's even more important not to squash multiple commits that have actual effects on the program than it is to avoid squashing style changes together with behavior changes. If you squash style changes together and have to revert, you at least aren't changing unrelated things. If you squash two bugfixes together, you can't revert one without reverting the other.

If you want to prioritize process here, I can make 6 PRs, or I can make 4 PRs, or we can do a squash merge on what exists now and not worry about it too much. That said, I don't see the point of doing such a thing. The point of a squash merge policy is because it keeps a relatively clean history while avoiding a lot of nitpicking about re-writing the history to squash fixups and the like. Given that I already have created a clean history, none of the downsides of doing a regular merge exist for this PR, but splitting it up into multiple PRs has all of the downsides of a "regular merge with clean histories" policy.

vstinner · 2018-10-22T15:11:13Z

it's even more important not to squash multiple commits

Again, the only available choice for me is [Squash and merge], the other choices are disabled anyway.

If you want to prioritize process here, I can make 6 PRs, or I can make 4 PRs, ...

Honestly, I don't think that your refactoring changes deserve so many commits, it's fine to squash them into a single one.

For me, Lib/datetime.py + Lib/test/datetimetester.py change is one PR that should be applied to other branches.

All other changes should go into a single other PR. Maybe the first PR should be the bugfix. Maybe wait until this one is merged before creating another one (but keep the cleanup commits in a local branch on your side).

pganssle · 2018-10-22T15:19:39Z

Feel free to merge or not merge as desired. You're also welcome to cherry pick my commits into whatever number of PRs you want.

vstinner · 2018-10-22T15:20:00Z

I read the individual commits and now I'm confused about which parts are related to the bugfix and which are not. I'll let you (@pganssle) reorganize the commits for that :-)

   Py_ssize_t len = PyUnicode_GetLength(dtstr);
+    if (len < 0) {
+        return NULL;
+    }

This change looks like a bugfix, so I suggest including it in the bugfix PR as well.

pganssle · 2018-10-22T15:36:06Z

You seem to have a pretty good idea of what you want this to look like, so I'm not terribly interested in doing a guess-and-check. IMO the only style-only commit is d809d71. Everything else actually changes the behavior of the program.

If you're going to merge all the bug fixes together, might as well merge the style changes with them and if something needs reversion I'll just make a partial reversion PR or something. Seems like less work than rewriting the history and cherry-picking out separate PRs just on the off chance that one of these things needs to be reverted (and a partial re-reversion would need to happen again anyway, since reverting the "all kinds of bugfixes" commit would revert a bunch of stuff that still needs fixing).

taleinat · 2018-10-22T15:46:18Z

I think we can still "rebase and merge", just not via the GitHub UI. I don't mind doing it myself. No need for additional PRs, IMO. Please stop spending time on this discussion.

serhiy-storchaka · 2018-08-28T13:58:25Z

Lib/test/datetimetester.py

            self.theclass.fromisoformat(dtstr)

+        msg = cm.exception.args[0]


Or just str(cm.exception).

vstinner · 2018-10-22T16:23:04Z

@pganssle: So I tried to extract what I consider a "bugfix", but I had to read the full history of this PR and the full history of https://bugs.python.org/issue34454, and it seems I misunderstood this PR. This PR is "mostly" cleanup, except that it also fixes an inconsistency between the C and the Python implementations. Previously, the input string was formatted by str() in Python but by repr() in C. With the PR, repr() is used in both C and Python, which is the right solution.

Sorry, I skipped most of the history when I reviewed your PR and when I followed the link to the bug, I saw that there was a bug about surrogate characters. I understood that the PR fixed the bug... except that the bug is already fixed...

[UPDATE] I was mostly confused by the fact that you added a new test, as if you fixed a bug. Well, your PR changes the behaviour, but it's a subtle change about an error message.

vstinner

LGTM.

vstinner · 2018-10-22T16:31:06Z

Note: I agree that this change must be backported to 3.7 (even if it's mostly "cleanup") to ease future bugfixes in fromisoformat().

pganssle · 2018-10-22T16:32:49Z

@vstinner I think the new test actually does fix a bug resulting from the PR for which this is a cleanup, but there are a few other bugs fixed in there for which no test is possible.

The bug fixed where adding a test was possible was that you could pass an invalid string like this: "2018-01-03\ud80001:0113" to datetime.fromisoformat, and the error message would display the string with an error in it as "2018-01-03T01:0113" instead - it was leaking an implementation detail.

The first commit in the PR also fixes a null dereference bug IIRC. One of the exception-handling commits also fixes a bug with over-zealous error catching.

vstinner · 2018-10-22T16:33:31Z

@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI."

How would you proceed?

vstinner · 2018-10-22T16:34:10Z

I merged this PR. It has been approved by 4 developers including 3 core developers, and I agree with @taleinat: it has been discussed enough :-)

vstinner · 2018-10-22T16:35:33Z

Note: I ran " ./python -m test -R 3:3 test_datetime" before merging the PR, and regrtest didn't find any reference leak.

taleinat · 2018-10-22T18:44:19Z

@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI."

How would you proceed?

For future reference: Just use a git client e.g. the cli; rebase, merge into master, push master.

(... unless GitHub's master branch is now protected and we can't push directly into it?)

miss-islington · 2018-10-22T19:37:22Z

Thanks @pganssle for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

taleinat · 2018-10-22T19:37:27Z

@taleinat: "I think we can still "rebase and merge", just not via the GitHub UI."
How would you proceed?

For future reference: Just use a git client e.g. the cli; rebase, merge into master, push master.

(... unless GitHub's master branch is now protected and we can't push directly into it?)

So apparently I was wrong. Currently, all of our long-lived branches are protected, and the only way to merge a PR as two commits is to create an extra PR.

…GH-8959)

* Use _PyUnicode_Copy in sanitize_isoformat_str

* Use repr in fromisoformat error message

This reverses commit 67b74a98b2 per Serhiy Storchaka's suggestion:

     I suggested to use %R in the error message because including the raw string can be confusing in the case of empty string, or string containing trailing whitespaces, invisible or unprintable characters.

We agree that it is better to change both the C and pure Python versions to use repr.

* Retain non-sanitized dtstr for error printing

This does not create an extra string, it just holds on to a reference to the original input string for purposes of creating the error message.

* PEP 7 fixes to from_isoformat

* Separate handling of Unicode and other errors

In the initial implementation, errors other than encoding errors would both raise an error indicating an invalid format, which would not be true for errors like MemoryError.

* Drop needs_decref from _sanitize_isoformat_str

Instead _sanitize_isoformat_str returns a new reference, even to the original string.

(cherry picked from commit 3df8540)

Co-authored-by: Paul Ganssle <[email protected]>

bedevere-bot · 2018-10-22T20:45:42Z

GH-10041 is a backport of this pull request to the 3.7 branch.

the-knights-who-say-ni added the CLA signed label Aug 27, 2018

bedevere-bot added the awaiting review label Aug 27, 2018

pganssle force-pushed the cleanup_fromisoformat_surrogate branch 3 times, most recently from cd85c26 to 03a0e03 Compare August 27, 2018 21:26

taleinat added skip news needs backport to 3.7 labels Aug 27, 2018

taleinat approved these changes Aug 27, 2018

bedevere-bot added awaiting merge and removed awaiting review labels Aug 27, 2018

serhiy-storchaka self-requested a review August 27, 2018 22:22

zhangyangyu reviewed Aug 28, 2018

pganssle commented Aug 28, 2018

pganssle force-pushed the cleanup_fromisoformat_surrogate branch 2 times, most recently from e3466eb to 255b823 Compare August 28, 2018 18:55

izbyshev approved these changes Aug 29, 2018

taleinat reviewed Aug 30, 2018

pganssle force-pushed the cleanup_fromisoformat_surrogate branch from 255b823 to 253217a Compare September 4, 2018 14:36

pganssle force-pushed the cleanup_fromisoformat_surrogate branch from 253217a to 4c9c91a Compare September 19, 2018 13:23

Drop needs_decref from _sanitize_isoformat_str

90a2e71

Instead _sanitize_isoformat_str returns a new reference, even to the original string.

serhiy-storchaka approved these changes Oct 22, 2018

vstinner approved these changes Oct 22, 2018

vstinner merged commit 3df8540 into python:master Oct 22, 2018

vstinner mentioned this pull request Oct 22, 2018

bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin… #8878

Merged

bedevere-bot added awaiting merge and removed awaiting changes labels Oct 22, 2018

bedevere-bot removed the awaiting merge label Oct 22, 2018

bedevere-bot removed the needs backport to 3.7 label Oct 22, 2018

movermeyer mentioned this pull request Dec 1, 2022

Add support for PyPy movermeyer/backports.datetime_fromisoformat#24

Open
