-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin… #8878
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…gs in datetime classes A follow-up of bpo-34454.
Lib/test/datetimetester.py
Outdated
# FIXME: The C datetime implementation raises an exception | ||
# while the pure-Python one succeeds. | ||
try: | ||
t.strftime('\ud800') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems worth of a separate test, or at least a subtest.
Also, you may as well assert something about the result of this, self.assertEqual(t.strftime('\ud800'), '\ud800')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pganssle, why assert that? It's not clear to me that this should be something we ensure.
IMO the test is fine. I'd just suggest removing the "FIXME" paragraph entirely (here and above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we saying that .strftime(unicode_with_surrogates)
is undefined behavior? Implementation dependent?
I think it's reasonable to either pick something and stick with it or formally make this an error on the pure-python side as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Let's go with your fix (in the PR to this PR's branch) and the test you suggest above. @izbyshev, could you update accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taleinat OK, but I think it'd cleaner to go with two separate PRs for bpo-34481 and bpo-34482. @pganssle suggests the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. This seems like a separate bug on Mac, honestly, rather than implementation-dependent behavior. Usually the "implementation-dependent" stuff in strftime is in how different platforms interpret the different formatting directives, not stuff like this.
I'd be tempted to move the surrogate assertion portion added into a separate test that you can wrap in @unittest.skipIf
or the like.
As for what to do about it, technically it's probably a bug in the platform's strftime
implementation, of course, but it might be worth adding a workaround like I did for the C implementation into the Python implementation. It definitely seems wrong that dt.strftime('%Y-%m-%dT%H:%M:%S\ud800')
would return ''
. It should either raise an error or return the formatted date with the surrogate character in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out that behavior of strftime
with surrogates is even worse. If there is no wcsftime
, the string is encoded with surrogateescape
error handler, which means that my tests should fail with UnicodeEncodeError
even on pure-Python datetime
because the surrogate I use (\uD800
) is not in the range of surrogates used for escaping (\uDC80-\uDCFF
).
This is indeed what happens on my Windows 8.1 with Python 3.7 in a manual test:
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)] on win32
>>> import time
>>> time.strftime('\ud800')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\ud800' in position 0: encoding error
Moreover:
>>> time.strftime('\udc80')
'\x80'
This I can't explain currently. Decoding in timemodule.c uses the locale encoding (with surrogateescape
), and my locale's encoding is not Latin-1. If I do what I think should be the same, I get a different result:
>>> ord('\udc80'.encode('mbcs', 'surrogateescape').decode('mbcs', 'surrogateescape'))
1026 # This is the unicode code point corresponding to b'0x80' in cp1251 encoding
I'll try to understand this tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the summary of what I've learned.
strftime
implementation uses eitherstrftime
orwcsftime
from the C standard library, depending on the platform (usuallywcsftime
is available, e.g. on Linux and macOS, butstrftime
is used on Windows becausewcsftime
is broken).- Those functions are locale-dependent (in the sense of
setlocale()
). - Those functions don't report errors via
errno
according to POSIX and man pages. All they can do is return 0, except that on Windowserrno
is set toEINVAL
if buffer is too short. - If
wcsftime
is used, Python converts the format string intowchar_t
string, and there are no issues with surrogates at this point. However,wcsftime
may perform additional validation of wchars for the current locale.glibc
apparently doesn't do that, since I could roundtrip any code point throughwcsftime
, including surrogates. Butwcsftime
on macOS does perform the validation: in C locale, it rejects code points > 255 (by returning zero;errno
was also set toEILSEQ
in my tests, but it's not documented). Even in a locale with UTF-8 charset, it rejects surrogates. This why tests on macOS failed. - If
strftime
is used, Python encodes the input into a locale-dependent charset withsurrogateescape
error handler, and decodes the result back in the same way. The implications:- Surrogates outside of
\uDC80-\uDCFF
range causeUnicodeEncodeError
. - Other surrogates can't be reliably round-tripped because they turn into lone bytes on encoding which are then reinterpreted with the locale encoding on the way back. This explains why I got
0x80
fromtime.strftime('\udc80')
on Windows: on encoding, the surrogate was unwrapped intob'0x80'
, on decoding, since there was nosetlocale()
calls and the C locale was active, each byte was simply converted to the code point with the same numerical value. I was initially confused because I thought that "locale" means the user-default code page (which would give a different interpretation forb'0x80'
).
- Surrogates outside of
Given the above, I don't think that we should check the return value of strftime
in surrogate-related tests, at least in this PR. If such tests are desired, I think that they belong to the time
module and require some experiments/doc research on platforms not covered by build bots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@izbyshev Awesome work. I'm thinking the best way forward at this point is to drop the assertion portion and to copy all this information into a separate bug report about the inconsistency in strftime. It might be a wontfix
or back-burnered, but I think we can probably get a lot more consistency than we already have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pganssle After thinking about a separate bug report, I realized that the whole situation is not so different from the usual stuff with file-system-related functions. Some platforms have wide-character APIs, others have only byte-oriented APIs, and Python tries to appease both. open('\ud800')
fails on Linux with UnicodeEncodeError
, but is OK on Windows. And the twist with wcsftime
on macOS rejecting surrogates is somewhat similar Windows refusing to open('*')
. Seems unlikely that we can reach consistency here, given that existing users may rely on specific behavior. So I've submitted bpo-34512 to at least try to update the docs.
II think rather than doing Alternatively, maybe use |
I think this also needs tests for |
I made a PR against @izbyshev's branch that fixes this bug as part of adding the test suite. I can move that to a separate PR against master if preferred. |
IMO a NEWS entry isn't needed here, since this only affects tests. |
By the way, in light of the fact that I have a PR to fix this, I withdraw my comments about expectedFailure and such. The whole point of expectedFailure is to make it clear that you are testing known-pathological behavior so it's not misinterpreted (and so that you can check in an automated way whether your expectations are false). Since it will be fixed almost immediately it's a pointless bit of formalism to explicitly mark them as failing just to have them switched over to succeeding immediately. |
My tests are in |
@taleinat Per discussion on izbyshev#1, I think it might be good to just do these as two separate PRs, if this one is ready to merge. It would be inconvenient for me to update my PR in response to review comments if it has to go through PRs to Alexey's fork first, and it's probably better to centralize the review comments on the main fork. |
@pganssle And I've eventually decided to add a separate test for |
Its behavior across platforms is inconsistent and hard to test.
I've removed |
@taleinat Would you give this PR another look? |
@@ -2352,6 +2373,12 @@ def test_more_strftime(self): | |||
t = t.replace(tzinfo=tz) | |||
self.assertEqual(t.strftime("%z"), "-0200" + z) | |||
|
|||
# bpo-34482: Check that surrogates don't cause a crash. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this belongs in the test_more_strftime()
test method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? TestDateTime.test_more_strftime
is a datetime
-specific extension to TestDate.strftime
(which in inherited by TestDateTime
), and the latter contains my test for surrogates. I suggest to either move all surrogate-related strftime
tests to separate methods or keep things as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think breaking the tests up a bit more is a good idea, though I don't know the overall test philosophy of CPython.
If the idea is to put all the strftime tests in one test, I'd probably use subTest to distinguish the various failure modes, at the very least.
self.assertEqual(expected, got) | ||
self.assertIs(type(expected), self.theclass) | ||
self.assertIs(type(got), self.theclass) | ||
inputs = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave the first "happy path" test as it is.
I suggest separating the surrogate examples, keeping the loop just for them, and removing the type assertions for them, i.e. just checking that the result equals what is expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable, fixed.
I just merged the PR #8959: you may have to update/rebase this PR (that I didn't review it). |
Thanks, @vstinner, but this PR doesn't need rebasing right now. BTW, I don't know why Travis is unhappy. Could anybody tell it to try again? |
Restarted. Two runs, one of the a "docs" run, ran out of memory. I hope it's just transient. |
The test run succeeded. The "docs" run fails due to recent changes having left it a broken state in master for a short while; this just needs a merge/rebase. I'm on it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I love to see more tests on edge cases!
GH-10049 is a backport of this pull request to the 3.7 branch. |
…ings (pythonGH-8878) (cherry picked from commit 3b0047d) Co-authored-by: Alexey Izbyshev <[email protected]>
…ings (GH-8878) (cherry picked from commit 3b0047d) Co-authored-by: Alexey Izbyshev <[email protected]>
…gs in datetime classes
A follow-up of [bpo-34454](https://www.bugs.python.org/issue34454).
https://bugs.python.org/issue34482