bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin… #8878

izbyshev · 2018-08-23T17:49:47Z

…gs in datetime classes

A follow-up of [bpo-34454](https://www.bugs.python.org/issue34454).

https://bugs.python.org/issue34482

pganssle · 2018-08-23T18:25:07Z

Lib/test/datetimetester.py

+        # FIXME: The C datetime implementation raises an exception
+        # while the pure-Python one succeeds.
+        try:
+            t.strftime('\ud800')


This seems worthy of a separate test, or at least a subtest.

Also, you may as well assert something about the result of this, e.g. self.assertEqual(t.strftime('\ud800'), '\ud800').

@pganssle, why assert that? It's not clear to me that this should be something we ensure.

IMO the test is fine. I'd just suggest removing the "FIXME" paragraph entirely (here and above).

Are we saying that .strftime(unicode_with_surrogates) is undefined behavior? Implementation dependent?

I think it's reasonable to either pick something and stick with it or formally make this an error on the pure-python side as well.

Sounds good to me. Let's go with your fix (in the PR to this PR's branch) and the test you suggest above. @izbyshev, could you update accordingly?

@taleinat OK, but I think it'd be cleaner to go with two separate PRs for bpo-34481 and bpo-34482. @pganssle suggests the same.

Hm.. This seems like a separate bug on Mac, honestly, rather than implementation-dependent behavior. Usually the "implementation-dependent" stuff in strftime is in how different platforms interpret the different formatting directives, not stuff like this.

I'd be tempted to move the surrogate assertion portion added into a separate test that you can wrap in @unittest.skipIf or the like.

As for what to do about it, technically it's probably a bug in the platform's strftime implementation, of course, but it might be worth adding a workaround like I did for the C implementation into the Python implementation. It definitely seems wrong that dt.strftime('%Y-%m-%dT%H:%M:%S\ud800') would return ''. It should either raise an error or return the formatted date with the surrogate character in place.

It turns out that the behavior of strftime with surrogates is even worse. If there is no wcsftime, the string is encoded with the surrogateescape error handler, which means that my tests should fail with UnicodeEncodeError even on pure-Python datetime, because the surrogate I use (\uD800) is not in the range of surrogates used for escaping (\uDC80-\uDCFF).

This is indeed what happens on my Windows 8.1 with Python 3.7 in a manual test:

    Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
    >>> import time
    >>> time.strftime('\ud800')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'locale' codec can't encode character '\ud800' in position 0: encoding error

Moreover:

    >>> time.strftime('\udc80')
    '\x80'

This I can't explain currently. Decoding in timemodule.c uses the locale encoding (with surrogateescape), and my locale's encoding is not Latin-1. If I do what I think should be the same, I get a different result:

    >>> ord('\udc80'.encode('mbcs', 'surrogateescape').decode('mbcs', 'surrogateescape'))
    1026  # This is the Unicode code point corresponding to b'\x80' in the cp1251 encoding

I'll try to understand this tomorrow.

Here is the summary of what I've learned.

strftime implementation uses either strftime or wcsftime from the C standard library, depending on the platform (usually wcsftime is available, e.g. on Linux and macOS, but strftime is used on Windows because wcsftime is broken).

Those functions are locale-dependent (in the sense of setlocale()).

Those functions don't report errors via errno according to POSIX and man pages. All they can do is return 0, except that on Windows errno is set to EINVAL if buffer is too short.

If wcsftime is used, Python converts the format string into a wchar_t string, and there are no issues with surrogates at this point. However, wcsftime may perform additional validation of wchars for the current locale. glibc apparently doesn't do that, since I could round-trip any code point through wcsftime, including surrogates. But wcsftime on macOS does perform the validation: in the C locale, it rejects code points > 255 (by returning zero; errno was also set to EILSEQ in my tests, but that's not documented). Even in a locale with a UTF-8 charset, it rejects surrogates. This is why the tests on macOS failed.

If strftime is used, Python encodes the input into a locale-dependent charset with surrogateescape error handler, and decodes the result back in the same way. The implications:

Surrogates outside of \uDC80-\uDCFF range cause UnicodeEncodeError.

Other surrogates can't be reliably round-tripped because they turn into lone bytes on encoding, which are then reinterpreted with the locale encoding on the way back. This explains why I got 0x80 from time.strftime('\udc80') on Windows: on encoding, the surrogate was unwrapped into b'\x80'; on decoding, since there were no setlocale() calls and the C locale was active, each byte was simply converted to the code point with the same numerical value. I was initially confused because I thought that "locale" means the user-default code page (which would give a different interpretation for b'\x80').
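The two implications above can be reproduced without calling strftime at all. The following is a minimal sketch, not the code in timemodule.c: `strftime_byte_path` is a hypothetical helper that imitates the encode/decode round trip for a format string with no % directives, and the codec names are assumptions standing in for the locale charset (latin-1 approximates the byte-to-code-point mapping described for the C locale, cp1251 the user-default code page).

```python
def strftime_byte_path(fmt, encoding, decoding=None):
    """Hypothetical helper: imitate encode -> C strftime -> decode.

    Assumes fmt has no % directives, so the C function would copy the
    bytes through unchanged. `encoding` and `decoding` stand in for the
    locale charset used on the way in and on the way out.
    """
    decoding = decoding or encoding
    raw = fmt.encode(encoding, "surrogateescape")
    return raw.decode(decoding, "surrogateescape")


# An escape-range surrogate is unwrapped into the lone byte 0x80 and then
# reinterpreted by whatever charset is used for decoding.
c_locale_result = strftime_byte_path("\udc80", "latin-1")
cp1251_result = strftime_byte_path("\udc80", "cp1251")

# A surrogate outside \uDC80-\uDCFF cannot be encoded at all.
try:
    strftime_byte_path("\ud800", "latin-1")
    outside_range_error = None
except UnicodeEncodeError as exc:
    outside_range_error = exc
```

Under these assumptions `c_locale_result` is `'\x80'`, `ord(cp1251_result)` is 1026, and `outside_range_error` holds the UnicodeEncodeError, which matches the three observations in the manual Windows session.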

Given the above, I don't think that we should check the return value of strftime in surrogate-related tests, at least in this PR. If such tests are desired, I think that they belong to the time module and require some experiments/doc research on platforms not covered by build bots.

@izbyshev Awesome work. I'm thinking the best way forward at this point is to drop the assertion portion and to copy all this information into a separate bug report about the inconsistency in strftime. It might be a wontfix or back-burnered, but I think we can probably get a lot more consistency than we already have.

@pganssle After thinking about a separate bug report, I realized that the whole situation is not so different from the usual stuff with file-system-related functions. Some platforms have wide-character APIs, others have only byte-oriented APIs, and Python tries to appease both. open('\ud800') fails on Linux with UnicodeEncodeError, but is OK on Windows. And the twist with wcsftime on macOS rejecting surrogates is somewhat similar to Windows refusing to open('*'). It seems unlikely that we can reach consistency here, given that existing users may rely on specific behavior. So I've submitted bpo-34512 to at least try to update the docs.
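The file name analogy can be illustrated with the two error handlers involved. This is only a sketch under stated assumptions: it assumes a UTF-8 file system encoding on both platforms, uses "surrogateescape" for the POSIX style and "surrogatepass" for the Windows style (the values `sys.getfilesystemencodeerrors()` typically reports there), and `encode_like` is a hypothetical helper, not a CPython function.

```python
# Assumed mapping from platform style to the file system error handler.
HANDLERS = {"posix": "surrogateescape", "windows": "surrogatepass"}


def encode_like(platform_style, name):
    """Hypothetical helper: encode a file name the way each platform style would."""
    return name.encode("utf-8", HANDLERS[platform_style])


# Windows style: any lone surrogate is passed through as three bytes.
windows_bytes = encode_like("windows", "\ud800")

# POSIX style: only \uDC80-\uDCFF are accepted, each becoming one raw byte.
posix_bytes = encode_like("posix", "\udc80")

# POSIX style: \ud800 is outside the escape range, so encoding fails.
try:
    encode_like("posix", "\ud800")
    posix_error = None
except UnicodeEncodeError as exc:
    posix_error = exc
```

The same split shows up in strftime: a byte-oriented path with surrogateescape rejects '\ud800', while a wide-character path can carry it unless the C library itself objects.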

pganssle · 2018-08-23T18:29:37Z

I think rather than doing try/except, it's best to use expectedFailure. Unfortunately, it doesn't seem like there is a way to do conditional expected failures like there is with pytest.xfail.

Alternatively, maybe use skipIf?
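For reference, a conditional expected failure can be approximated with a small wrapper around unittest.expectedFailure. This is a sketch only: `expected_failure_if`, `SURROGATES_UNSUPPORTED`, `ConditionalDemo` and `run_demo` are hypothetical names, and the condition is hard-coded here where a real suite would probe the platform or the implementation under test.

```python
import unittest


def expected_failure_if(condition):
    """Hypothetical helper: apply unittest.expectedFailure only if condition holds."""
    if condition:
        return unittest.expectedFailure
    return lambda test_func: test_func


# Placeholder condition (an assumption for the sketch).
SURROGATES_UNSUPPORTED = True


class ConditionalDemo(unittest.TestCase):
    @expected_failure_if(SURROGATES_UNSUPPORTED)
    def test_expected_failure(self):
        # Stands in for an assertion known to fail on the affected platform.
        self.assertEqual("", "\ud800")

    @unittest.skipIf(SURROGATES_UNSUPPORTED, "surrogates not supported here")
    def test_skipped(self):
        self.fail("never runs while the condition is true")


def run_demo():
    """Run the demo case and return the TestResult for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConditionalDemo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The difference between the two decorators is what gets recorded: with expectedFailure the test body still runs and an unexpected pass is reported, while skipIf never executes the body.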

pganssle · 2018-08-23T18:40:25Z

I think this also needs tests for date.strftime as well, no?

pganssle · 2018-08-23T20:11:47Z

I made a PR against @izbyshev's branch that fixes this bug as part of adding the test suite.

I can move that to a separate PR against master if preferred.

taleinat · 2018-08-23T20:32:33Z

IMO a NEWS entry isn't needed here, since this only affects tests.

pganssle · 2018-08-23T20:49:38Z

By the way, in light of the fact that I have a PR to fix this, I withdraw my comments about expectedFailure and such. The whole point of expectedFailure is to make it clear that you are testing known-pathological behavior so it's not misinterpreted (and so that you can check in an automated way whether your expectations are false).

Since it will be fixed almost immediately it's a pointless bit of formalism to explicitly mark them as failing just to have them switched over to succeeding immediately.

izbyshev · 2018-08-23T21:12:56Z

@pganssle

I think this also needs tests for date.strftime as well, no?

My tests are in TestDate and TestTime test cases. TestDateTime is a subclass of TestDate, so my tests run for all three classes.
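A stripped-down sketch of that inheritance pattern, for readers unfamiliar with datetimetester.py: the class names and the `theclass` attribute follow the file, but the test body, the constructor arguments, and the `run_both` helper are assumptions made for illustration, not the code in this PR.

```python
import unittest
from datetime import date, datetime


class TestDate(unittest.TestCase):
    # Subclasses override this to rerun every inherited test on another type.
    theclass = date

    def test_strftime_surrogate(self):
        t = self.theclass(2004, 12, 31)
        # Only check that surrogates don't cause a crash; the result
        # (or a UnicodeEncodeError) is platform-dependent.
        try:
            t.strftime("\ud800")
        except UnicodeEncodeError:
            pass


class TestDateTime(TestDate):
    theclass = datetime


def run_both():
    """Hypothetical helper: run the test once per class and return the result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite([
        loader.loadTestsFromTestCase(TestDate),
        loader.loadTestsFromTestCase(TestDateTime),
    ])
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Because TestDateTime inherits the method, the single test definition is collected twice, once with `date` and once with `datetime`.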

pganssle · 2018-08-23T21:17:39Z

@taleinat Per discussion on izbyshev#1, I think it might be good to just do these as two separate PRs, if this one is ready to merge. It would be inconvenient for me to update my PR in response to review comments if it has to go through PRs to Alexey's fork first, and it's probably better to centralize the review comments on the main fork.

izbyshev · 2018-08-23T21:37:57Z

Thanks for the review, @pganssle and @taleinat! I've updated the PR. I haven't removed try/except since @pganssle wants to merge his PR on top of mine.

izbyshev · 2018-08-23T21:39:48Z

@pganssle And I've eventually decided to add a separate test for datetime.strftime since it won't hurt :)

Its behavior across platforms is inconsistent and hard to test.

izbyshev · 2018-08-24T20:18:55Z

I've removed assertEqual from strftime tests per discussion with @pganssle.

izbyshev · 2018-08-29T23:07:00Z

@taleinat Would you give this PR another look?

taleinat · 2018-08-30T09:50:33Z

Lib/test/datetimetester.py

@@ -2352,6 +2373,12 @@ def test_more_strftime(self):
            t = t.replace(tzinfo=tz)
            self.assertEqual(t.strftime("%z"), "-0200" + z)

+        # bpo-34482: Check that surrogates don't cause a crash.


I don't think this belongs in the test_more_strftime() test method.

Why? TestDateTime.test_more_strftime is a datetime-specific extension to TestDate.strftime (which is inherited by TestDateTime), and the latter contains my test for surrogates. I suggest either moving all surrogate-related strftime tests to separate methods or keeping things as is.

I personally think breaking the tests up a bit more is a good idea, though I don't know the overall test philosophy of CPython.

If the idea is to put all the strftime tests in one test, I'd probably use subTest to distinguish the various failure modes, at the very least.
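A sketch of what the subTest variant could look like, assuming a single test method that loops over several surrogate-containing format strings. The format strings, the class name `StrftimeSurrogateDemo`, and the `run_subtests` helper are illustrative assumptions, not taken from the PR.

```python
import unittest
from datetime import datetime


class StrftimeSurrogateDemo(unittest.TestCase):
    def test_surrogates(self):
        t = datetime(2004, 12, 31, 6, 22, 33)
        # Illustrative inputs: a lone high surrogate, one mixed with a
        # directive, and one from the surrogateescape range.
        for fmt in ("\ud800", "%Y\ud800", "\udc80%m"):
            with self.subTest(fmt=fmt):
                # Accept either outcome; only a crash or another
                # exception type would be reported for this sub-case.
                try:
                    t.strftime(fmt)
                except UnicodeEncodeError:
                    pass


def run_subtests():
    """Hypothetical helper: run the demo case and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(StrftimeSurrogateDemo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With subTest, a failure in one format string is reported with its `fmt` value and the remaining inputs still run, which keeps the failure modes distinguishable inside one test method.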

taleinat · 2018-08-30T09:55:10Z

Lib/test/datetimetester.py

-        self.assertEqual(expected, got)
-        self.assertIs(type(expected), self.theclass)
-        self.assertIs(type(got), self.theclass)
+        inputs = [


Let's leave the first "happy path" test as it is.

I suggest separating the surrogate examples, keeping the loop just for them, and removing the type assertions for them, i.e. just checking that the result equals what is expected.

Sounds reasonable, fixed.

vstinner · 2018-10-22T16:34:56Z

I just merged PR #8959: you may have to update/rebase this PR (which I didn't review).

izbyshev · 2018-10-22T18:02:06Z

Thanks, @vstinner, but this PR doesn't need rebasing right now.

BTW, I don't know why Travis is unhappy. Could anybody tell it to try again?

taleinat · 2018-10-22T18:13:55Z

BTW, I don't know why Travis is unhappy. Could anybody tell it to try again?

Restarted.

Two runs, one of them a "docs" run, ran out of memory. I hope it's just transient.

taleinat · 2018-10-22T18:33:13Z

BTW, I don't know why Travis is unhappy. Could anybody tell it to try again?

Restarted.

Two runs, one of them a "docs" run, ran out of memory. I hope it's just transient.

The test run succeeded. The "docs" run fails due to recent changes having left it in a broken state in master for a short while; this just needs a merge/rebase. I'm on it.

taleinat

LGTM

vstinner

LGTM. I love to see more tests on edge cases!

izbyshev · 2018-10-22T21:38:41Z

Thanks to @taleinat for helping with Travis and reviewing, and to @vstinner for reviewing!

taleinat · 2018-10-23T06:33:55Z

I'm backporting this to 3.7 similarly to the related PRs GH-8862 and GH-8959, for consistency and easier future backporting.

miss-islington · 2018-10-23T06:36:10Z

Thanks @izbyshev for the PR, and @taleinat for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

bedevere-bot · 2018-10-23T06:36:20Z

GH-10049 is a backport of this pull request to the 3.7 branch.

…ings (pythonGH-8878) (cherry picked from commit 3b0047d) Co-authored-by: Alexey Izbyshev <[email protected]>

bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin…

97037bd

…gs in datetime classes A follow-up of bpo-34454.

the-knights-who-say-ni added the CLA signed label Aug 23, 2018

bedevere-bot added the awaiting review label Aug 23, 2018

izbyshev mentioned this pull request Aug 23, 2018

bpo-34454: datetime: Fix crash on PyUnicode_AsUTF8AndSize() failure #8850

Closed

pganssle reviewed Aug 23, 2018

taleinat added the skip news label Aug 23, 2018

taleinat added the tests Tests in the Lib/test dir label Aug 23, 2018

Alexey Izbyshev added 3 commits August 24, 2018 00:27

Remove FIXMEs and use assertEqual in strftime tests

a7e435d

Remove the NEWS entry

4603f7e

Add a separate test for datetime.strftime()

beca2bd

Remove check for strftime() return value

51bd826

Its behavior across platforms is inconsistent and hard to test.

pganssle mentioned this pull request Aug 28, 2018

bpo-34481: Fix surrogate-handling in strftime #8983

Closed

taleinat reviewed Aug 30, 2018

Don't mix regular and surrogate-related tests in test_strptime()

f24c13c

Merge branch 'master' into bpo-34482

faadbb9

taleinat approved these changes Oct 22, 2018

bedevere-bot added awaiting merge and removed awaiting review labels Oct 22, 2018

vstinner approved these changes Oct 22, 2018

taleinat added the needs backport to 3.7 label Oct 23, 2018

taleinat merged commit 3b0047d into python:master Oct 23, 2018

bedevere-bot removed the awaiting merge label Oct 23, 2018

bedevere-bot removed the needs backport to 3.7 label Oct 23, 2018

miss-islington added a commit that referenced this pull request Oct 23, 2018

bpo-34482: test datetime classes' handling of non-UTF-8-encodable str…

313e501

…ings (GH-8878) (cherry picked from commit 3b0047d) Co-authored-by: Alexey Izbyshev <[email protected]>
