bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters #30130

pablogsal · 2021-12-16T00:25:03Z

https://bugs.python.org/issue46091

…s with continuation characters

Parser/tokenizer.c

Lib/test/test_syntax.py

…ce lines with continuation characters

gvanrossum

Hopefully @lysnikolaou can review the fix to the C code? It's deleting an awful lot of code.

gvanrossum · 2021-12-16T01:40:46Z

Lib/test/test_syntax.py

+def fib(n):
+    \
+'''Print a Fibonacci series up to n.'''
+    \
+a, b = 0, 1
+    \
+while a < n:
+        \
+print(a, end=' ')
+        \
+a, b = b, a+b
+    \
+print()


This example is both too long and too short.

It doesn't really test anything more than a shorter example, say just the first three lines.

Moreover, it doesn't test that the code is parsed the same as the equivalent version without backslashes.

And it doesn't test the stated property quoted in bpo-46091 from the docs: "Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation."

Moreover, it doesn't test that the code is parsed the same as the equivalent version without backslashes.

We could check the AST, but likely you want to check that is tokenized the same as the code without backslashes, but that's is a bit more complicated to test here as we don't have direct access to the C tokenizer.

And it doesn't test the stated property quoted in bpo-46091 from the docs: "Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation."

Hummmm is testing the second part at least as it's checking that the whitespace up the the \ determines the identation of the next line as opposed to something like:

pass \ pass

where the \ and the \n will just produce an empty line

Since we don't have access to the tokenizer, checking that the ASTs are the same is reasonable. In fact, you can probably just use ast.unparse() and compare the output with the equivalent without backslashes.

But this reminds me. Does tokenize.py have the same behavior?

This is the tokenization of:

def fib(n): \ """Print a Fibonacci series up to n.""" \ a, b = 0, 1

with \:

0,0-0,0: ENCODING 'utf-8' 1,0-1,3: NAME 'def' 1,4-1,7: NAME 'fib' 1,7-1,8: OP '(' 1,8-1,9: NAME 'n' 1,9-1,10: OP ')' 1,10-1,11: OP ':' 1,11-1,12: NEWLINE '\n' 2,0-2,4: INDENT ' ' 3,0-3,39: STRING '"""Print a Fibonacci series up to n."""' 3,39-3,40: NEWLINE '\n' 5,0-5,1: NAME 'a' 5,1-5,2: OP ',' 5,3-5,4: NAME 'b' 5,5-5,6: OP '=' 5,7-5,8: NUMBER '0' 5,8-5,9: OP ',' 5,10-5,11: NUMBER '1' 5,11-5,12: NEWLINE '\n' 6,0-6,0: DEDENT '' 6,0-6,0: ENDMARKER ''

and without:

0,0-0,0: ENCODING 'utf-8' 1,0-1,3: NAME 'def' 1,4-1,7: NAME 'fib' 1,7-1,8: OP '(' 1,8-1,9: NAME 'n' 1,9-1,10: OP ')' 1,10-1,11: OP ':' 1,11-1,12: NEWLINE '\n' 2,0-2,4: INDENT ' ' 2,4-2,43: STRING '"""Print a Fibonacci series up to n."""' 2,43-2,44: NEWLINE '\n' 3,4-3,5: NAME 'a' 3,5-3,6: OP ',' 3,7-3,8: NAME 'b' 3,9-3,10: OP '=' 3,11-3,12: NUMBER '0' 3,12-3,13: OP ',' 3,14-3,15: NUMBER '1' 3,15-3,16: NEWLINE '\n' 4,0-4,0: DEDENT '' 4,0-4,0: ENDMARKER ''

But unfortunately the tokenization for:

pass \ pass

is wrong:

0,0-0,0: ENCODING 'utf-8' 1,0-1,4: NAME 'pass' 1,4-1,5: NEWLINE '\n' 2,0-2,4: INDENT ' ' 3,0-3,1: NEWLINE '\n' 4,0-4,0: DEDENT '' 4,0-4,4: NAME 'pass' 4,4-4,5: NEWLINE '\n' 5,0-5,0: ENDMARKER ''

but that should be fixed separately to this PR

pablogsal · 2021-12-16T01:54:04Z

It's deleting an awful lot of code.

Is really moving it into a function and calling that function in a new place (in addition to the old place). The important part is this:

            else if (c == '\\') {
                c = tok_continuation_line(tok);
                if (c == -1) {
                    return ERRORTOKEN;
                }
            }

which is basically skipping over pairs of \ followed by \n if the previous characters in the line are whitespace characters. And handling errors correctly (as when \ is followed by something else than \n).

…hitespace lines with continuation characters

pablogsal · 2021-12-16T11:46:00Z

I have pushed a bunch of tests testing the C tokenizer (I forgot I actually exposed an API for testing long ago 😅 ) and some extra corrections. Turns out this is very tricky to get right because "the backslash and the newline just need to eliminate themselves" and "the indentation is marked by the first backlash that we see" is complicated to keep in sync.

Parser/tokenizer.c

Lib/test/test_syntax.py

Lib/test/test_tokenize.py

…s for whitespace lines with continuation characters

github-actions · 2022-01-16T00:07:31Z

This PR is stale because it has been open for 30 days with no activity.

miss-islington · 2022-01-25T22:12:17Z

Thanks @pablogsal for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10.
🐍🍒⛏🤖

miss-islington · 2022-01-25T22:12:22Z

Sorry, @pablogsal, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker a0efc0c1960e2c49e0092694d98395555270914c 3.10

miss-islington · 2022-01-25T22:12:25Z

Sorry @pablogsal, I had trouble checking out the 3.9 backport branch.
Please backport using cherry_picker on command line.
cherry_picker a0efc0c1960e2c49e0092694d98395555270914c 3.9

…ce lines with continuation characters (pythonGH-30130). (cherry picked from commit a0efc0c) Co-authored-by: Pablo Galindo Salgado <[email protected]>

bedevere-bot · 2022-01-25T22:14:01Z

GH-30898 is a backport of this pull request to the 3.10 branch.

…ce lines with continuation characters (GH-30130). (GH-30898) (cherry picked from commit a0efc0c) Co-authored-by: Pablo Galindo Salgado <[email protected]>

…e lines with continuation characters (pythonGH-30130). (pythonGH-30898) (cherry picked from commit a0efc0c) Co-authored-by: Pablo Galindo Salgado <[email protected]>. (cherry picked from commit 3fc8b74) Co-authored-by: Pablo Galindo Salgado <[email protected]>

pablogsal requested a review from lysnikolaou as a code owner December 16, 2021 00:25

pablogsal added 🤖 automerge needs backport to 3.10 only security fixes needs backport to 3.9 only security fixes labels Dec 16, 2021

the-knights-who-say-ni added the CLA signed label Dec 16, 2021

bedevere-bot added the awaiting core review label Dec 16, 2021

pablogsal requested review from gvanrossum and isidentical December 16, 2021 00:25

bpo-46091: Correctly calculate indentation levels for whitespace line…

de9998a

…s with continuation characters

pablogsal force-pushed the whitespace_cont branch from 2591366 to de9998a Compare December 16, 2021 00:32

ucodery reviewed Dec 16, 2021

View reviewed changes

Parser/tokenizer.c Outdated Show resolved Hide resolved

Lib/test/test_syntax.py Show resolved Hide resolved

fixup! bpo-46091: Correctly calculate indentation levels for whitespa…

118b66c

…ce lines with continuation characters

gvanrossum reviewed Dec 16, 2021

View reviewed changes

fixup! fixup! bpo-46091: Correctly calculate indentation levels for w…

c623e38

…hitespace lines with continuation characters

pablogsal force-pushed the whitespace_cont branch from 03d63f1 to c623e38 Compare December 16, 2021 11:44

pablogsal commented Dec 16, 2021

View reviewed changes

Parser/tokenizer.c Outdated Show resolved Hide resolved

Fix comment

95b10a6

ucodery reviewed Dec 16, 2021

View reviewed changes

Lib/test/test_syntax.py Show resolved Hide resolved

Lib/test/test_tokenize.py Show resolved Hide resolved

fixup! fixup! fixup! bpo-46091: Correctly calculate indentation level…

61c95fe

…s for whitespace lines with continuation characters

github-actions bot added the stale Stale PR or inactive for long period of time. label Jan 16, 2022

pablogsal merged commit a0efc0c into python:main Jan 25, 2022

pablogsal deleted the whitespace_cont branch January 25, 2022 22:12

bedevere-bot removed the awaiting core review label Jan 25, 2022

miss-islington assigned pablogsal Jan 25, 2022

bedevere-bot removed the needs backport to 3.10 only security fixes label Jan 25, 2022

ZeroIntensity removed the needs backport to 3.9 only security fixes label Feb 17, 2025

mattip mentioned this pull request Feb 20, 2025

Parsing files with form feeds and backslash-space fails (regression) pypy/pypy#5221

Closed

Uh oh!

bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters #30130

bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters #30130

Uh oh!

Conversation

pablogsal commented Dec 16, 2021 • edited by bedevere-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gvanrossum left a comment

Choose a reason for hiding this comment

Uh oh!

gvanrossum Dec 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pablogsal Dec 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gvanrossum Dec 16, 2021

Choose a reason for hiding this comment

Uh oh!

pablogsal Dec 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pablogsal Dec 16, 2021

Choose a reason for hiding this comment

Uh oh!

pablogsal commented Dec 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pablogsal commented Dec 16, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Jan 16, 2022

Uh oh!

miss-islington commented Jan 25, 2022

Uh oh!

miss-islington commented Jan 25, 2022

Uh oh!

miss-islington commented Jan 25, 2022

Uh oh!

bedevere-bot commented Jan 25, 2022

Uh oh!

Uh oh!

pablogsal commented Dec 16, 2021 •

edited by bedevere-bot

Loading

gvanrossum Dec 16, 2021 •

edited

Loading

pablogsal Dec 16, 2021 •

edited

Loading

pablogsal Dec 16, 2021 •

edited

Loading

pablogsal commented Dec 16, 2021 •

edited

Loading