Skip to content

bpo-46091: Correctly calculate indentation levels for whitespace lines with continuation characters #30130

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 25, 2022

Conversation

pablogsal
Copy link
Member

@pablogsal pablogsal commented Dec 16, 2021

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully @lysnikolaou can review the fix to the C code? It's deleting an awful lot of code.

Comment on lines 1518 to 1530
def fib(n):
\
'''Print a Fibonacci series up to n.'''
\
a, b = 0, 1
\
while a < n:
\
print(a, end=' ')
\
a, b = b, a+b
\
print()
Copy link
Member

@gvanrossum gvanrossum Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is both too long and too short.

It doesn't really test anything more than a shorter example, say just the first three lines.

Moreover, it doesn't test that the code is parsed the same as the equivalent version without backslashes.

And it doesn't test the stated property quoted in bpo-46091 from the docs: "Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation."

Copy link
Member Author

@pablogsal pablogsal Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, it doesn't test that the code is parsed the same as the equivalent version without backslashes.

We could check the AST, but likely you want to check that is tokenized the same as the code without backslashes, but that's is a bit more complicated to test here as we don't have direct access to the C tokenizer.

And it doesn't test the stated property quoted in bpo-46091 from the docs: "Indentation cannot be split over multiple physical lines using backslashes; the whitespace up to the first backslash determines the indentation."

Hummmm is testing the second part at least as it's checking that the whitespace up the the \ determines the identation of the next line as opposed to something like:

pass
      \
      
pass

where the \ and the \n will just produce an empty line

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't have access to the tokenizer, checking that the ASTs are the same is reasonable. In fact, you can probably just use ast.unparse() and compare the output with the equivalent without backslashes.

But this reminds me. Does tokenize.py have the same behavior?

Copy link
Member Author

@pablogsal pablogsal Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the tokenization of:

def fib(n):
    \
"""Print a Fibonacci series up to n."""
    \
a, b = 0, 1

with \:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'fib'
1,7-1,8:            OP             '('
1,8-1,9:            NAME           'n'
1,9-1,10:           OP             ')'
1,10-1,11:          OP             ':'
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
3,0-3,39:           STRING         '"""Print a Fibonacci series up to n."""'
3,39-3,40:          NEWLINE        '\n'
5,0-5,1:            NAME           'a'
5,1-5,2:            OP             ','
5,3-5,4:            NAME           'b'
5,5-5,6:            OP             '='
5,7-5,8:            NUMBER         '0'
5,8-5,9:            OP             ','
5,10-5,11:          NUMBER         '1'
5,11-5,12:          NEWLINE        '\n'
6,0-6,0:            DEDENT         ''
6,0-6,0:            ENDMARKER      ''

and without:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,3:            NAME           'def'
1,4-1,7:            NAME           'fib'
1,7-1,8:            OP             '('
1,8-1,9:            NAME           'n'
1,9-1,10:           OP             ')'
1,10-1,11:          OP             ':'
1,11-1,12:          NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
2,4-2,43:           STRING         '"""Print a Fibonacci series up to n."""'
2,43-2,44:          NEWLINE        '\n'
3,4-3,5:            NAME           'a'
3,5-3,6:            OP             ','
3,7-3,8:            NAME           'b'
3,9-3,10:           OP             '='
3,11-3,12:          NUMBER         '0'
3,12-3,13:          OP             ','
3,14-3,15:          NUMBER         '1'
3,15-3,16:          NEWLINE        '\n'
4,0-4,0:            DEDENT         ''
4,0-4,0:            ENDMARKER      ''

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But unfortunately the tokenization for:

pass
        \
pass

is wrong:

0,0-0,0:            ENCODING       'utf-8'
1,0-1,4:            NAME           'pass'
1,4-1,5:            NEWLINE        '\n'
2,0-2,4:            INDENT         '    '
3,0-3,1:            NEWLINE        '\n'
4,0-4,0:            DEDENT         ''
4,0-4,4:            NAME           'pass'
4,4-4,5:            NEWLINE        '\n'
5,0-5,0:            ENDMARKER      ''

but that should be fixed separately to this PR

@pablogsal
Copy link
Member Author

pablogsal commented Dec 16, 2021

It's deleting an awful lot of code.

Is really moving it into a function and calling that function in a new place (in addition to the old place). The important part is this:

            else if (c == '\\') {
                c = tok_continuation_line(tok);
                if (c == -1) {
                    return ERRORTOKEN;
                }
            }

which is basically skipping over pairs of \ followed by \n if the previous characters in the line are whitespace characters. And handling errors correctly (as when \ is followed by something else than \n).

…hitespace lines with continuation characters
@pablogsal
Copy link
Member Author

I have pushed a bunch of tests testing the C tokenizer (I forgot I actually exposed an API for testing long ago 😅 ) and some extra corrections. Turns out this is very tricky to get right because "the backslash and the newline just need to eliminate themselves" and "the indentation is marked by the first backlash that we see" is complicated to keep in sync.

…s for whitespace lines with continuation characters
@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Jan 16, 2022
@pablogsal pablogsal merged commit a0efc0c into python:main Jan 25, 2022
@pablogsal pablogsal deleted the whitespace_cont branch January 25, 2022 22:12
@miss-islington
Copy link
Contributor

Thanks @pablogsal for the PR 🌮🎉.. I'm working now to backport this PR to: 3.9, 3.10.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry, @pablogsal, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker a0efc0c1960e2c49e0092694d98395555270914c 3.10

@miss-islington
Copy link
Contributor

Sorry @pablogsal, I had trouble checking out the 3.9 backport branch.
Please backport using cherry_picker on command line.
cherry_picker a0efc0c1960e2c49e0092694d98395555270914c 3.9

pablogsal added a commit to pablogsal/cpython that referenced this pull request Jan 25, 2022
…ce lines with continuation characters (pythonGH-30130).

(cherry picked from commit a0efc0c)

Co-authored-by: Pablo Galindo Salgado <[email protected]>
@bedevere-bot bedevere-bot removed the needs backport to 3.10 only security fixes label Jan 25, 2022
@bedevere-bot
Copy link

GH-30898 is a backport of this pull request to the 3.10 branch.

pablogsal added a commit that referenced this pull request Jan 25, 2022
…ce lines with continuation characters (GH-30130). (GH-30898)

(cherry picked from commit a0efc0c)

Co-authored-by: Pablo Galindo Salgado <[email protected]>
pablogsal added a commit to pablogsal/cpython that referenced this pull request Jan 25, 2022
…e lines with continuation characters (pythonGH-30130). (pythonGH-30898)

(cherry picked from commit a0efc0c)

Co-authored-by: Pablo Galindo Salgado <[email protected]>.
(cherry picked from commit 3fc8b74)

Co-authored-by: Pablo Galindo Salgado <[email protected]>
@ZeroIntensity ZeroIntensity removed the needs backport to 3.9 only security fixes label Feb 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Stale PR or inactive for long period of time.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants