group names of bytes regexes are strings #85152
I noticed that match.groupdict() returns string keys, even for a bytes regex:
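A minimal reproduction (the original snippet did not survive migration; this is a sketch of the behavior described):

```python
import re

# A bytes pattern with an ASCII group name: the matched data stays bytes,
# but the group name comes back as a str key.
m = re.match(rb"(?P<word>\w+)", b"hello")
print(m.groupdict())  # {'word': b'hello'} -- str key, bytes value
```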
This seems somewhat strange, because string and bytes matching in re are kind of two separate parts, cf. doc:
This also affects functions/methods expecting a group name as a parameter (e.g. match.group): the group name has to be passed as a string.
A group name is just a name. Other names in Python are also strings.
Agreed to some extent, but there is the difference that group names are embedded in the pattern, which has to be bytes if the target is bytes. My use case is in an all-bytes, no-string project where I construct a large regular expression at startup, with semi-dynamical group names. So it seems natural to have everything in bytes to concatenate the regular expression, incl. the group names. But then group names that I receive back are strings, so I cannot look them up directly into the set of group names that I used to create the expression in the first place. Of course I can live with it by storing them as strings in the first place and encode()'ing them during concatenation, but it does not feel "natural". Furthermore, even if it is "just a name", a non-ascii group name will raise an error in bytes, even if encoded...:
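For instance (a reconstructed sketch, since the original example was lost in migration): a utf-8-encoded Δ is rejected in a bytes pattern even though the same name works in a str pattern.

```python
import re

# In a str pattern, Δ is accepted: it is a valid Python identifier.
print(re.match("(?P<Δ>x)", "x").groupdict())  # {'Δ': 'x'}

# In a bytes pattern, its utf-8 encoding b'\xce\x94' is rejected.
try:
    re.compile(b"(?P<" + "Δ".encode("utf-8") + b">x)")
except re.error as e:
    print("re.error:", e)
```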
So no, it's not really "just a name", considering that in Python "é" is a valid name.
Looks like this is a language limitation:

    >>> b'é'
      File "<stdin>", line 1
    SyntaxError: bytes can only contain ASCII literal characters.

No problem if you use an escaped character:

    >>> re.match(b'(?P<\xe9>)', b'').groupdict()
    {'é': b''}

There may be some inconveniences in your program, but IMO there is nothing wrong; maybe this issue can be closed.
Of course an inconvenience in my program is not per se the reason to change the language. I just wanted to motivate that the current situation gives unexpected results. "\xe9" doesn't look like proper utf-8 to me:
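An illustrative sketch: the utf-8 encoding of "é" is two bytes, not b'\xe9'; the single byte is its latin-1 encoding.

```python
# "é" (U+00E9) encodes differently depending on the codec:
print("é".encode("utf-8"))    # b'\xc3\xa9' -- two bytes in utf-8
print("é".encode("latin-1"))  # b'\xe9'     -- one byte in latin-1
```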
Let's try another one: how would you go for Δ ("\u0394") as a group name?
So b'\xe9' is mapped to '\u00e9'; it is a direct byte-to-code-point mapping. Of course, characters with a Unicode code point greater than 0xff cannot appear in a bytes pattern.
Yes, but \xe9 is not valid utf-8, or rather not the utf-8 representation of "é". So there is no way to get \xe9 starting from é without leaving utf-8. Consequently, starting with é as a group name, I cannot programmatically encode it into a bytes pattern.
But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) in a group name fails. According to the docs, the sole constraint on group names is that they must be valid and unique Python identifiers. So this should work:
In this case, you can only use 'latin1', which directly maps one character (\u0000-\u00FF) to/from one byte. If you use 'utf-8', it may map one character to multiple bytes, such as 'Δ' -> b'\xce\x94'. '\x94' is an invalid identifier, so it raises an error:

    >>> '\xce'.isidentifier()  # '\xce' is 'Î'
    True
    >>> '\x94'.isidentifier()
    False

You may close this issue (I can't close it); we can continue the discussion.
But Δ has no latin-1 representation. So Δ currently cannot be used as a group name in a bytes regex, although it is a valid Python identifier. So that's a bug. I mean, if you insist on having group names as strings even for bytes regexes, then it is not reasonable to prevent them from going _in_. b"(?P<\xce\x94>)" is a valid utf-8-encoded bytestring, so why wouldn't you accept it as a valid re pattern? IMHO, either:
The issue with the second variant is that utf-8 is an arbitrary (although default) choice. But re is already making that same arbitrary choice when decoding the group names into a string, which is my original complaint!
It seems you are missing some knowledge of encodings. Naturally, "utf-8", "utf-16" and "utf-32" are all encoding codecs; "utf-8" should not have a special status in this context.
I don't have to be ashamed of my knowledge of encoding. Yet you are right that I was missing a subtlety, which is that latin-1 is a strict subset of Unicode rather than a completely arbitrary encoding. Thank you for that. So what you are saying is that group names in bytes regexes can only be specified directly (without -explicit- encoding), so de facto they are limited to the latin-1 subset. Very well. But then, once again:
I prove my point that the decoding to string is arbitrary:
For any dynamically constructed bytes regex pattern, a string group name as output is unusable; only after latin-1 re-encoding can it be safely compared. This choice of latin-1 is arbitrary.
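The ambiguity can be shown without re at all (a sketch reusing the byte from the cp1250 example in this thread): the same single byte decodes to different valid identifiers under different codecs, and only latin-1 re-encoding recovers the original byte.

```python
raw = b"\xd8"                        # one byte; its origin encoding is unknown
assert raw.decode("cp1250") == "Ř"   # one possible reading
assert raw.decode("latin-1") == "Ø"  # the reading re's parser would produce
# Only latin-1 re-encoding recovers the original byte:
assert "Ø".encode("latin-1") == raw
print("both decodings are valid identifiers:",
      "Ř".isidentifier(), "Ø".isidentifier())
```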
Not all latin-1 characters are valid identifiers, for example:

    >>> '\x94'.encode('latin1')
    b'\x94'
    >>> '\x94'.isidentifier()
    False

There is a workaround: you can convert the returned str name back to bytes with latin-1.
Please look at these:

    >>> orig_name = "Ř"
    >>> orig_ch = orig_name.encode("cp1250")  # Because why not?
    >>> orig_ch
    b'\xd8'
    >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
    >>> name
    'Ø'  # '\xd8'
    >>> name == orig_name
    False
    >>> name.encode("latin-1")
    b'\xd8'
    >>> name.encode("latin-1") == orig_ch
    True

"Ř" (\u0158) --cp1250--> b'\xd8' --latin-1--> "Ø" (\u00d8)
True but that's not the point. Δ is a valid Python identifier but not a valid group name in bytes regexes, because it is not in the latin-1 plane. The documentation does not mention this.
I am not searching for a workaround for my current code. And the simplest workaround is to latin-1-convert back to bytes, precisely because re should not latin-1-convert to string in the first place. Are you saying that the proper way to use bytes regexes is to use string regexes instead?
That's no surprise, I carefully crafted this example. :-) Rather, that is exactly my point: several different strings (which can all be valid Python identifiers) can have the same single-byte representation, simply by means of different encodings (duh). So why convert group names to strings when outputting them from matches, when you don't know where the bytes came from, or even whether they ever were strings? That should be left to the programmer.
And there's no need for a cryptic encoding like cp1250 for this problem to arise. Here is a simple example with Python's default encoding utf-8:
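A reconstructed sketch of such an example: utf-8-encode a perfectly ordinary name, read the bytes back through latin-1 (the decoding used internally by the parser), and you get classic mojibake that nevertheless passes the identifier check.

```python
name = "ê"
raw = name.encode("utf-8")       # b'\xc3\xaa' -- valid utf-8 for an identifier
decoded = raw.decode("latin-1")  # what a latin-1 decode makes of those bytes
print(decoded)                   # Ãª -- mojibake, not the original name
assert decoded != name
assert decoded.isidentifier()    # it even passes the identifier check
```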
For reference, here is the very source of the issue: https://github.com/python/cpython/blob/master/Lib/sre_parse.py#L228
The problem can also be played in reverse, maybe it is more telling:
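A sketch of the reverse direction: put a raw byte into a bytes pattern, take the str name handed back, and re-encode it with Python's default codec; the result no longer matches the input bytes.

```python
raw = b"\xe9"                   # the byte placed in a bytes pattern
name = raw.decode("latin-1")    # 'é' -- the str name re would hand back
# Re-encoding with Python's default codec does not round-trip:
print(name.encode("utf-8"))     # b'\xc3\xa9'
assert name.encode("utf-8") != raw
assert name.encode("latin-1") == raw  # only latin-1 round-trips
```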
You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):
So I have that bytestring that comes from somewhere; maybe it was originally utf-8- or cp1250- or otherwise encoded, but I won't tell or don't know. The only thing I swear is that it originally was a valid Python identifier. So: latin-1 is an arbitrary choice that is no better than any other, and the fact that it "naturally" converts bytes to Unicode code points is an implementation detail.
I just had an "aha moment": What re claims is that, rather than doing as I suggested:
the actual way to know what group name is represented would be to look at the (unicode) string with the same "graphical representation":
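A sketch of that identification: latin-1 simply equates a byte's numeric value with a code point, so the byte 0xE9 becomes U+00E9 because the numbers coincide, not because any encoding of the name was considered.

```python
# latin-1 maps every byte value n directly to code point n:
b = 0xE9
assert bytes([b]).decode("latin-1") == chr(b)  # b'\xe9' -> 'é' (U+00E9)
assert all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))
print("latin-1 identifies byte values with code points for all 256 bytes")
```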
This way of going from bytes to strings _naively_ (which happens to be called latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be the same value, just because they "look the same" in the source code. This is like throwing away everything we ever learned about Unicode and how a code point is fundamentally different from what is stored in memory.
Why do you always want to use a utf-8-encoded identifier as a group name in a bytes pattern? The direction is: a group name written in the bytes pattern is decoded to a str name, not encoded from a str name.
Because utf-8 is Python's default encoding, e.g. in source files, decode() and encode(). Literally everywhere. If you ask around "I have a bytestring, I need a string, what do I do?", latin-1 will not be the first answer (and moreover, the correct answer should be "it depends on the encoding", which re happily ignores by just asserting one). Saying "just strip that b prefix, it's fine" cannot be taken seriously. Yes, latin-1 will never raise an error when decoding a bytestring, because it covers all 256 byte values, but saying that this is the reason it should be used over any other is forgetting why we have Unicode in the first place. **It is just pretending that Unicode never was a thing.** The fact that it can decode any bytestring does not mean it will not return garbage _when the bytestring is not latin-1-encoded in the first place_. Take a look at the documentation: https://docs.python.org/3/howto/unicode.html . latin-1 used to be prominent in the 2.x world; it should slowly be time to recognize that this is over, and that we cannot ignore anymore that encoding is a thing.
If I shouldn't have to think about the str -> bytes direction, then re should first stop going in the other direction. When I have bytes regexes I actually don't care about strings and would happily receive group names as bytes. But no, re decides that latin-1 is the way to go, and in doing so it 1) reduces my freedom in the choice of group names, and 2) forces me to go read the internals to understand that the encoding it arbitrarily chose is latin-1, so that I can undo it properly and get back what I always wanted: a bytes group name.
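That undo step can be sketched as a small helper (groupdict_bytes is a hypothetical name; the latin-1 re-encoding mirrors the parser's internal decoding):

```python
import re

def groupdict_bytes(m):
    """Undo re's latin-1 decoding of group names: return bytes keys."""
    return {k.encode("latin-1"): v for k, v in m.groupdict().items()}

m = re.match(rb"(?P<word>\w+)", b"hello")
print(groupdict_bytes(m))  # {b'word': b'hello'}
```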
bytes are _not_ Unicode code points, not even in the 0-255 range. End of story.
Latin-1 is an implementation detail of the RE parser. There is no deep meaning in it. Group names are intended to be human-readable; this is why they are limited to identifiers. Non-ASCII characters in a bytes pattern are not human-readable. I think we should only allow ASCII-only identifiers as group names in bytes patterns. The question is whether this needs a deprecation period.
Close this issue as "not a bug". See #91760 for more strict rules which can eliminate confusion. |