gh-127833: Reword and expand the Notation section (GH-134443)

encukou · blaisep · AA-Turner · web-flow · commit 28d91d06f13c · 2025-06-09T15:50:11.000+02:00
Prepare the docs to use the same notation as the `python.gram`
file. If we want to sync the two, the meta-syntax should be the same.

Link the Full Grammar docs here; keep only a few extras.

Also, remove the distinction between lexical and syntactic rules,
except for whitespace handling.
With f- and t-strings, the line between the two is blurry.

Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-authored-by: Colin Marquardt <cmarqu42@gmail.com>
diff --git a/Doc/reference/grammar.rst b/Doc/reference/grammar.rst
@@ -8,15 +8,15 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`).
 The version here omits details related to code generation and
 error recovery.
 
-The notation is a mixture of `EBNF
-<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
-and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
-In particular, ``&`` followed by a symbol, token or parenthesized
-group indicates a positive lookahead (i.e., is required to match but
-not consumed), while ``!`` indicates a negative lookahead (i.e., is
-required *not* to match).  We use the ``|`` separator to mean PEG's
-"ordered choice" (written as ``/`` in traditional PEG grammars). See
-:pep:`617` for more details on the grammar's syntax.
+The notation used here is the same as in the preceding docs,
+and is described in the :ref:`notation <notation>` section,
+except for a few extra complications:
+
+* ``&e``: a positive lookahead (that is, ``e`` is required to match but
+  not consumed)
+* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
+* ``~`` ("cut"): commit to the current alternative; if it then fails to
+  parse, the rule fails without trying the remaining alternatives
 
 .. literalinclude:: ../../Grammar/python.gram
   :language: peg
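
Illustration (not part of the patch): a minimal Python sketch of how PEG
ordered choice and the two lookaheads described above behave. The helper
names (lit, seq, choice, pos_lookahead, neg_lookahead) are hypothetical and
exist only for this example; this is not how the CPython parser is
implemented, and the cut operator is not modeled.

```python
# Each matcher takes (text, pos) and returns the new position on success,
# or None on failure. All names here are illustrative assumptions.

def lit(s):
    """Match the literal string s."""
    def match(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return match

def seq(*items):
    """e1 e2 ...: every item must match, one after the other."""
    def match(text, pos):
        for item in items:
            pos = item(text, pos)
            if pos is None:
                return None
        return pos
    return match

def choice(*alts):
    """e1 | e2: ordered choice, the first alternative that matches wins."""
    def match(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return match

def pos_lookahead(e):
    """&e: e must match here, but no input is consumed."""
    def match(text, pos):
        return pos if e(text, pos) is not None else None
    return match

def neg_lookahead(e):
    """!e: e must NOT match here; no input is consumed."""
    def match(text, pos):
        return pos if e(text, pos) is None else None
    return match

# "a" matches first, so the longer alternative "ab" is never considered.
print(choice(lit("a"), lit("ab"))("ab", 0))             # 1
# The lookahead checks for "a" without consuming it.
print(seq(pos_lookahead(lit("a")), lit("ab"))("ab", 0))  # 2
# The negative lookahead rejects input that starts with "ab".
print(seq(neg_lookahead(lit("ab")), lit("a"))("ab", 0))  # None
```

The first call shows why the order of alternatives matters in a PEG: unlike
in a context-free grammar, the second alternative is not tried once the
first one has matched.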
diff --git a/Doc/reference/introduction.rst b/Doc/reference/introduction.rst
@@ -90,44 +90,122 @@ Notation
 
 .. index:: BNF, grammar, syntax, notation
 
-The descriptions of lexical analysis and syntax use a modified
-`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
-notation.  This uses the following style of definition:
-
-.. productionlist:: notation
-   name: `lc_letter` (`lc_letter` | "_")*
-   lc_letter: "a"..."z"
-
-The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
-of zero or more ``lc_letter``\ s and underscores.  An ``lc_letter`` in turn is
-any of the single characters ``'a'`` through ``'z'``.  (This rule is actually
-adhered to for the names defined in lexical and grammar rules in this document.)
-
-Each rule begins with a name (which is the name defined by the rule) and
-``::=``.  A vertical bar (``|``) is used to separate alternatives; it is the
-least binding operator in this notation.  A star (``*``) means zero or more
-repetitions of the preceding item; likewise, a plus (``+``) means one or more
-repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
-one occurrences (in other words, the enclosed phrase is optional).  The ``*``
-and ``+`` operators bind as tightly as possible; parentheses are used for
-grouping.  Literal strings are enclosed in quotes.  White space is only
-meaningful to separate tokens. Rules are normally contained on a single line;
-rules with many alternatives may be formatted alternatively with each line after
-the first beginning with a vertical bar.
-
-.. index:: lexical definitions, ASCII
-
-In lexical definitions (as the example above), two more conventions are used:
-Two literal characters separated by three dots mean a choice of any single
-character in the given (inclusive) range of ASCII characters.  A phrase between
-angular brackets (``<...>``) gives an informal description of the symbol
-defined; e.g., this could be used to describe the notion of 'control character'
-if needed.
-
-Even though the notation used is almost the same, there is a big difference
-between the meaning of lexical and syntactic definitions: a lexical definition
-operates on the individual characters of the input source, while a syntax
-definition operates on the stream of tokens generated by the lexical analysis.
-All uses of BNF in the next chapter ("Lexical Analysis") are lexical
-definitions; uses in subsequent chapters are syntactic definitions.
-
+The descriptions of lexical analysis and syntax use a grammar notation that
+is a mixture of
+`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
+and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
+For example:
+
+.. grammar-snippet::
+   :group: notation
+
+   name:   `letter` (`letter` | `digit` | "_")*
+   letter: "a"..."z" | "A"..."Z"
+   digit:  "0"..."9"
+
+In this example, the first line says that a ``name`` is a ``letter`` followed
+by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
+A ``letter`` in turn is any of the single characters ``'a'`` through
+``'z'`` or ``'A'`` through ``'Z'``; a ``digit`` is a single character from
+``'0'`` to ``'9'``.
+
+Each rule begins with a name (which identifies the rule that's being defined)
+followed by a colon, ``:``.
+The definition to the right of the colon uses the following syntax elements:
+
+* ``name``: A name refers to another rule.
+  Where possible, it is a link to the rule's definition.
+
+  * ``TOKEN``: An uppercase name refers to a :term:`token`.
+    For the purposes of grammar definitions, tokens are the same as rules.
+
+* ``"text"``, ``'text'``: Text in single or double quotes must match literally
+  (without the quotes). The type of quote is chosen according to the meaning
+  of ``text``:
+
+  * ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
+  * ``"case"``: A name in double quotes denotes a
+    :ref:`soft-keyword <soft-keywords>`.
+  * ``'@'``: A non-letter symbol in single quotes denotes an
+    :py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
+    :ref:`operator <operators>`.
+
+* ``e1 e2``: Items separated only by whitespace denote a sequence.
+  Here, ``e1`` must be followed by ``e2``.
+* ``e1 | e2``: A vertical bar is used to separate alternatives.
+  It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
+  not considered.
+  In traditional PEG grammars, this is written as a slash, ``/``, rather than
+  a vertical bar.
+  See :pep:`617` for more background and details.
+* ``e*``: A star means zero or more repetitions of the preceding item.
+* ``e+``: Likewise, a plus means one or more repetitions.
+* ``[e]``: A phrase enclosed in square brackets means zero or
+  one occurrence. In other words, the enclosed phrase is optional.
+* ``e?``: A question mark has exactly the same meaning as square brackets:
+  the preceding item is optional.
+* ``(e)``: Parentheses are used for grouping.
+* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
+  of any single character in the given (inclusive) range of ASCII characters.
+  This notation is only used in
+  :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
+* ``<...>``: A phrase between angle brackets gives an informal description
+  of the matched symbol (for example, ``<any ASCII character except "\">``),
+  or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
+  This notation is only used in
+  :ref:`lexical definitions <notation-lexical-vs-syntactic>`.
+
+The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
+the vertical bar (``|``) binds most loosely.
+
+White space is only meaningful to separate tokens.
+
+Rules are normally contained on a single line, but rules that are too long
+may be wrapped:
+
+.. grammar-snippet::
+   :group: notation
+
+   literal: stringliteral | bytesliteral
+            | integer | floatnumber | imagnumber
+
+Alternatively, rules may be formatted with the first line ending at the colon,
+and each alternative beginning with a vertical bar on a new line.
+For example:
+
+
+.. grammar-snippet::
+   :group: notation-alt
+
+   literal:
+      | stringliteral
+      | bytesliteral
+      | integer
+      | floatnumber
+      | imagnumber
+
+This does *not* mean that there is an empty first alternative.
+
+.. index:: lexical definitions
+
+.. _notation-lexical-vs-syntactic:
+
+Lexical and Syntactic definitions
+---------------------------------
+
+*Lexical* and *syntactic* analysis differ in what they operate on:
+the :term:`lexical analyzer` operates on the individual characters of the
+input source, while the *parser* (syntactic analyzer) operates on the stream
+of :term:`tokens <token>` generated by the lexical analysis.
+However, in some cases the exact boundary between the two phases is a
+CPython implementation detail.
+
+The practical difference between the two is that in *lexical* definitions,
+all whitespace is significant.
+The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
+converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
+*Syntactic* definitions then use these tokens, rather than source characters.
+
+This documentation uses the same grammar notation for both styles of definitions.
+All uses of it in the next chapter (:ref:`lexical`) are lexical definitions;
+uses in subsequent chapters are syntactic definitions.
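
Illustration (not part of the patch): the lexical/syntactic split described
in the new section can be observed with the standard library's tokenize
module. This is a rough sketch only. The helper name token_names is
hypothetical, and tokenize also reports tokens such as COMMENT and NL that
the parser itself does not use.

```python
import io
import token
import tokenize

def token_names(source):
    """Return (token type name, token string) pairs for source."""
    readline = io.StringIO(source).readline
    return [(token.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]

source = "if x:\n    y = 1\n"
for name, text in token_names(source):
    print(f"{name:10} {text!r}")
```

The spaces between ``if`` and ``x``, and around ``=``, produce no tokens,
while the leading indentation and the line endings show up as INDENT,
DEDENT and NEWLINE tokens.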