group names of bytes regexes are strings #85152
I noticed that match.groupdict() returns string keys, even for a bytes regex:
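A minimal reproduction (the original snippet did not survive migration; this is a sketch of the behavior described):

```python
import re

# A bytes pattern with an ASCII group name: the matched data stays bytes,
# but the group name comes back as a str key.
m = re.match(rb"(?P<word>\w+)", b"hello")
print(m.groupdict())  # {'word': b'hello'} -- str key, bytes value
```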
This seems somewhat strange, because string and bytes matching in re are kind of two separate parts, cf. doc:
This also affects functions/methods expecting a group name as a parameter (e.g. match.group): the group name has to be passed as a string.
A group name is just a name. Other names in Python are also strings.
Agreed to some extent, but there is the difference that group names are embedded in the pattern, which has to be bytes if the target is bytes. My use case is in an all-bytes, no-string project where I construct a large regular expression at startup, with semi-dynamical group names. So it seems natural to have everything in bytes to concatenate the regular expression, incl. the group names. But then group names that I receive back are strings, so I cannot look them up directly into the set of group names that I used to create the expression in the first place. Of course I can live with it by storing them as strings in the first place and encode()'ing them during concatenation, but it does not feel "natural". Furthermore, even if it is "just a name", a non-ascii group name will raise an error in bytes, even if encoded...:
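For instance (a reconstructed sketch, since the original example was lost in migration): a utf-8-encoded Δ is rejected in a bytes pattern even though the same name works in a str pattern.

```python
import re

# In a str pattern, Δ is accepted: it is a valid Python identifier.
print(re.match("(?P<Δ>x)", "x").groupdict())  # {'Δ': 'x'}

# In a bytes pattern, its utf-8 encoding b'\xce\x94' is rejected.
try:
    re.compile(b"(?P<" + "Δ".encode("utf-8") + b">x)")
except re.error as e:
    print("re.error:", e)
```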
So no, it's not really "just a name", considering that in Python "é" is a valid name.
Looks like this is a language limitation:

    >>> b'é'
      File "<stdin>", line 1
    SyntaxError: bytes can only contain ASCII literal characters.

No problem if you use an escaped character:

    >>> re.match(b'(?P<\xe9>)', b'').groupdict()
    {'é': b''}

There may be some inconveniences in your program, but IMO there is nothing wrong; maybe this issue can be closed.
Of course an inconvenience in my program is not per se the reason to change the language. I just wanted to motivate that the current situation gives unexpected results. "\xe9" doesn't look like proper utf-8 to me:
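An illustrative sketch: the utf-8 encoding of "é" is two bytes, not b'\xe9'; the single byte is its latin-1 encoding.

```python
# "é" (U+00E9) encodes differently depending on the codec:
print("é".encode("utf-8"))    # b'\xc3\xa9' -- two bytes in utf-8
print("é".encode("latin-1"))  # b'\xe9'     -- one byte in latin-1
```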
Let's try another one: how would you go for Δ ("\u0394") as a group name?
So b'\xe9' is mapped to '\u00e9'; it is a direct byte-to-code-point mapping. Of course, characters with a Unicode code point greater than 0xff cannot appear in a bytes pattern.
Yes, but \xe9 is not valid utf-8, or rather not the utf-8 representation of "é". So there is no way to get \xe9 starting from é without leaving utf-8. Consequently, starting with é as a group name, I cannot programmatically encode it into a bytes pattern.
But \xce and \x94 are both lower than \xff, yet using \xce\x94 ("Δ".encode()) in a group name fails. According to the docs, the sole constraint on group names is that they must be valid and unique Python identifiers. So this should work:
In this case, you can only use 'latin1', which directly maps one character (\u0000-\u00FF) to/from one byte. If you use 'utf-8', it may map one character to multiple bytes, such as 'Δ' -> b'\xce\x94'. '\x94' is an invalid identifier, so it raises an error:

    >>> '\xce'.isidentifier()  # '\xce' is 'Î'
    True
    >>> '\x94'.isidentifier()
    False

You may close this issue (I can't close it); we can continue the discussion.
But Δ has no latin-1 representation. So Δ currently cannot be used as a group name in a bytes regex, although it is a valid Python identifier. So that's a bug. I mean, if you insist on having group names as strings even for bytes regexes, then it is not reasonable to prevent them from going _in_. b"(?P<\xce\x94>)" is a valid utf-8-encoded bytestring, so why wouldn't you accept it as a valid re pattern? IMHO, either:
The issue with the second variant is that utf-8 is an arbitrary (although default) choice. But re is already making that same arbitrary choice when decoding the group names into a string, which is my original complaint!
It seems you are missing some knowledge of encodings. Naturally, "utf-8", "utf-16" and "utf-32" are all encoding codecs; "utf-8" should not have a special status in this context.
I don't have to be ashamed of my knowledge of encoding. Yet you are right that I was missing a subtlety, which is that latin-1 is a strict subset of Unicode rather than a completely arbitrary encoding. Thank you for that. So what you are saying is that group names in bytes regexes can only be specified directly (without -explicit- encoding), so de facto they are limited to the latin-1 subset. Very well. But then, once again:
I prove my point that the decoding to string is arbitrary:
For any dynamically constructed bytes regex pattern, a string group name as output is unusable; only after latin-1 re-encoding can it be safely compared. This choice of latin-1 is arbitrary.
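The ambiguity can be shown without re at all (a sketch reusing the byte from the cp1250 example in this thread): the same single byte decodes to different valid identifiers under different codecs, and only latin-1 re-encoding recovers the original byte.

```python
raw = b"\xd8"                        # one byte; its origin encoding is unknown
assert raw.decode("cp1250") == "Ř"   # one possible reading
assert raw.decode("latin-1") == "Ø"  # the reading re's parser would produce
# Only latin-1 re-encoding recovers the original byte:
assert "Ø".encode("latin-1") == raw
print("both decodings are valid identifiers:",
      "Ř".isidentifier(), "Ø".isidentifier())
```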
Not all latin-1 characters are valid identifiers, for example:

    >>> '\x94'.encode('latin1')
    b'\x94'
    >>> '\x94'.isidentifier()
    False

There is a workaround: you can convert the returned str name back to bytes with latin-1.
Please look at these:

    >>> orig_name = "Ř"
    >>> orig_ch = orig_name.encode("cp1250")  # Because why not?
    >>> orig_ch
    b'\xd8'
    >>> name = list(re.match(b"(?P<" + orig_ch + b">)", b"").groupdict().keys())[0]
    >>> name
    'Ø'  # '\xd8'
    >>> name == orig_name
    False
    >>> name.encode("latin-1")
    b'\xd8'
    >>> name.encode("latin-1") == orig_ch
    True

"Ř" (\u0158) --cp1250--> b'\xd8' --latin-1--> "Ø" (\u00d8)
True but that's not the point. Δ is a valid Python identifier but not a valid group name in bytes regexes, because it is not in the latin-1 plane. The documentation does not mention this.
I am not searching for a workaround for my current code. And the simplest workaround is to latin-1-convert back to bytes, precisely because re should not latin-1-convert to string in the first place. Are you saying that the proper way to use bytes regexes is to use string regexes instead?
That's no surprise, I carefully crafted this example. :-) Rather, that is exactly my point: several different strings (which can all be valid Python identifiers) can have the same single-byte representation, simply by means of different encodings (duh). So why convert group names to strings when outputting them from matches, when you don't know where the bytes came from, or even whether they ever were strings? That should be left to the programmer.
And there's no need for a cryptic encoding like cp1250 for this problem to arise. Here is a simple example with Python's default encoding utf-8:
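A reconstructed sketch of such an example: utf-8-encode a perfectly ordinary name, read the bytes back through latin-1 (the decoding used internally by the parser), and you get classic mojibake that nevertheless passes the identifier check.

```python
name = "ê"
raw = name.encode("utf-8")       # b'\xc3\xaa' -- valid utf-8 for an identifier
decoded = raw.decode("latin-1")  # what a latin-1 decode makes of those bytes
print(decoded)                   # Ãª -- mojibake, not the original name
assert decoded != name
assert decoded.isidentifier()    # it even passes the identifier check
```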
For reference, here is the very source of the issue: https://github.com/python/cpython/blob/master/Lib/sre_parse.py#L228
The problem can also be played in reverse, maybe it is more telling:
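A sketch of the reverse direction: put a raw byte into a bytes pattern, take the str name handed back, and re-encode it with Python's default codec; the result no longer matches the input bytes.

```python
raw = b"\xe9"                   # the byte placed in a bytes pattern
name = raw.decode("latin-1")    # 'é' -- the str name re would hand back
# Re-encoding with Python's default codec does not round-trip:
print(name.encode("utf-8"))     # b'\xc3\xa9'
assert name.encode("utf-8") != raw
assert name.encode("latin-1") == raw  # only latin-1 round-trips
```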
You questioned my knowledge of encodings. Let's quote from one of the most famous introductory articles on the subject (https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/):
So I have that bytestring that comes from somewhere; maybe it was originally utf-8- or cp1250- or otherwise encoded, but I won't tell or don't know. The only thing I swear is that it originally was a valid Python identifier. So: latin-1 is an arbitrary choice that is no better than any other, and the fact that it "naturally" converts bytes to Unicode code points is an implementation detail.
I just had an "aha moment": What re claims is that, rather than doing as I suggested:
the actual way to know what group name is represented would be to look at the (unicode) string with the same "graphical representation":
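A sketch of that identification: latin-1 simply equates a byte's numeric value with a code point, so the byte 0xE9 becomes U+00E9 because the numbers coincide, not because any encoding of the name was considered.

```python
# latin-1 maps every byte value n directly to code point n:
b = 0xE9
assert bytes([b]).decode("latin-1") == chr(b)  # b'\xe9' -> 'é' (U+00E9)
assert all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))
print("latin-1 identifies byte values with code points for all 256 bytes")
```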
This way of going from bytes to strings _naively_ (which happens to be called latin-1) makes IMHO as much sense as saying that 0x10, 0b10 and 0o10 should be the same value, just because they "look the same" in the source code. This is like throwing away everything we ever learned about Unicode and how a code point is fundamentally different from what is stored in memory.
Why do you always want to use a utf-8-encoded identifier as a group name in a bytes pattern? The direction is: a group name written in the bytes pattern is decoded to a str name, not encoded from a str name.
Because utf-8 is Python's default encoding, e.g. in source files, decode() and encode(). Literally everywhere. If you ask around "I have a bytestring, I need a string, what do I do?", latin-1 will not be the first answer (and moreover, the correct answer should be "it depends on the encoding", which re happily ignores by just asserting one). Saying "just strip that b prefix, it's fine" cannot be taken seriously. Yes, latin-1 will never raise an error when decoding a bytestring, because it covers all 256 byte values, but saying that this is the reason it should be used over any other is forgetting why we have Unicode in the first place. **It is just pretending that Unicode never was a thing.** The fact that it can decode any bytestring does not mean it will not return garbage _when the bytestring is not latin-1-encoded in the first place_. Take a look at the documentation: https://docs.python.org/3/howto/unicode.html . latin-1 used to be prominent in the 2.x world; it should slowly be time to recognize that this is over, and that we cannot ignore anymore that encoding is a thing.
If I shouldn't have to think about the str -> bytes direction, then re should first stop going in the other direction. When I have bytes regexes I actually don't care about strings and would happily receive group names as bytes. But no, re decides that latin-1 is the way to go, and in doing so it 1) reduces my freedom in the choice of group names, and 2) forces me to go read the internals to understand that the encoding it arbitrarily chose is latin-1, so that I can undo it properly and get back what I always wanted: a bytes group name.
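That undo step can be sketched as a small helper (groupdict_bytes is a hypothetical name; the latin-1 re-encoding mirrors the parser's internal decoding):

```python
import re

def groupdict_bytes(m):
    """Undo re's latin-1 decoding of group names: return bytes keys."""
    return {k.encode("latin-1"): v for k, v in m.groupdict().items()}

m = re.match(rb"(?P<word>\w+)", b"hello")
print(groupdict_bytes(m))  # {b'word': b'hello'}
```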
bytes are _not_ Unicode code points, not even in the 0-255 range. End of story.
Latin-1 is an implementation detail of the RE parser. There is no deep meaning in it. Group names are intended to be human-readable; this is why they are limited to identifiers. Non-ASCII characters in a bytes pattern are not human-readable. I think we should only allow ASCII-only identifiers as group names in bytes patterns. The question is whether this needs a deprecation period.
Close this issue as "not a bug". See #91760 for more strict rules which can eliminate confusion. |