Commit 28d91d0

encukou, blaisep, AA-Turner, lysnikolaou, and cmarqu authored
gh-127833: Reword and expand the Notation section (GH-134443)
Prepare the docs for using the notation used in the `python.gram` file. If we want to sync the two, the meta-syntax should be the same.

Link the Full Grammar docs here; keep only a few extras.

Also, remove the distinction between lexical and syntactic rules, except for whitespace handling. With f- and t-strings, the line between the two is blurry.

Co-authored-by: Blaise Pabon <[email protected]>
Co-authored-by: Adam Turner <[email protected]>
Co-authored-by: Lysandros Nikolaou <[email protected]>
Co-authored-by: Colin Marquardt <[email protected]>
1 parent f90483e commit 28d91d0

2 files changed

+128
-50
lines changed

Doc/reference/grammar.rst

Lines changed: 9 additions & 9 deletions
@@ -8,15 +8,15 @@ used to generate the CPython parser (see :source:`Grammar/python.gram`).
88
The version here omits details related to code generation and
99
error recovery.
1010

11-
The notation is a mixture of `EBNF
12-
<https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
13-
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
14-
In particular, ``&`` followed by a symbol, token or parenthesized
15-
group indicates a positive lookahead (i.e., is required to match but
16-
not consumed), while ``!`` indicates a negative lookahead (i.e., is
17-
required *not* to match). We use the ``|`` separator to mean PEG's
18-
"ordered choice" (written as ``/`` in traditional PEG grammars). See
19-
:pep:`617` for more details on the grammar's syntax.
11+
The notation used here is the same as in the preceding docs,
12+
and is described in the :ref:`notation <notation>` section,
13+
except for a few extra complications:
14+
15+
* ``&e``: a positive lookahead (that is, ``e`` is required to match but
16+
not consumed)
17+
* ``!e``: a negative lookahead (that is, ``e`` is required *not* to match)
18+
* ``~`` ("cut"): commit to the current alternative and fail the rule
19+
even if this fails to parse
2020

2121
.. literalinclude:: ../../Grammar/python.gram
2222
:language: peg
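
Reviewer's illustration (not part of the commit): the lookahead operators ``&e`` and ``!e`` described in the added lines test the upcoming input without consuming it. As a rough analogy only, and not how the CPython parser is implemented, Python's ``re`` module offers lookahead assertions with the same "check but do not consume" behaviour. The patterns and variable names below are made up for this sketch.

```python
import re

# Analogy for the PEG item sequence  "a" &"b":
# match "a" only if "b" follows; the "b" itself is not consumed.
positive = re.compile(r"a(?=b)")

# Analogy for the PEG item sequence  "a" !"b":
# match "a" only if "b" does NOT follow.
negative = re.compile(r"a(?!b)")

pos_hit = positive.match("ab")    # succeeds, consumes only "a"
pos_miss = positive.match("ac")   # fails: lookahead for "b" is not satisfied
neg_hit = negative.match("ac")    # succeeds, consumes only "a"
neg_miss = negative.match("ab")   # fails: "b" follows, which is forbidden

# The match ends after one character: the looked-ahead "b" stays in the input.
consumed = pos_hit.end()
```

In both successful cases the match ends at position 1, so whatever was inspected by the lookahead is still available to the next item in the sequence.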

Doc/reference/introduction.rst

Lines changed: 119 additions & 41 deletions
@@ -90,44 +90,122 @@ Notation
9090

9191
.. index:: BNF, grammar, syntax, notation
9292

93-
The descriptions of lexical analysis and syntax use a modified
94-
`Backus–Naur form (BNF) <https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form>`_ grammar
95-
notation. This uses the following style of definition:
96-
97-
.. productionlist:: notation
98-
name: `lc_letter` (`lc_letter` | "_")*
99-
lc_letter: "a"..."z"
100-
101-
The first line says that a ``name`` is an ``lc_letter`` followed by a sequence
102-
of zero or more ``lc_letter``\ s and underscores. An ``lc_letter`` in turn is
103-
any of the single characters ``'a'`` through ``'z'``. (This rule is actually
104-
adhered to for the names defined in lexical and grammar rules in this document.)
105-
106-
Each rule begins with a name (which is the name defined by the rule) and
107-
``::=``. A vertical bar (``|``) is used to separate alternatives; it is the
108-
least binding operator in this notation. A star (``*``) means zero or more
109-
repetitions of the preceding item; likewise, a plus (``+``) means one or more
110-
repetitions, and a phrase enclosed in square brackets (``[ ]``) means zero or
111-
one occurrences (in other words, the enclosed phrase is optional). The ``*``
112-
and ``+`` operators bind as tightly as possible; parentheses are used for
113-
grouping. Literal strings are enclosed in quotes. White space is only
114-
meaningful to separate tokens. Rules are normally contained on a single line;
115-
rules with many alternatives may be formatted alternatively with each line after
116-
the first beginning with a vertical bar.
117-
118-
.. index:: lexical definitions, ASCII
119-
120-
In lexical definitions (as the example above), two more conventions are used:
121-
Two literal characters separated by three dots mean a choice of any single
122-
character in the given (inclusive) range of ASCII characters. A phrase between
123-
angular brackets (``<...>``) gives an informal description of the symbol
124-
defined; e.g., this could be used to describe the notion of 'control character'
125-
if needed.
126-
127-
Even though the notation used is almost the same, there is a big difference
128-
between the meaning of lexical and syntactic definitions: a lexical definition
129-
operates on the individual characters of the input source, while a syntax
130-
definition operates on the stream of tokens generated by the lexical analysis.
131-
All uses of BNF in the next chapter ("Lexical Analysis") are lexical
132-
definitions; uses in subsequent chapters are syntactic definitions.
133-
93+
The descriptions of lexical analysis and syntax use a grammar notation that
94+
is a mixture of
95+
`EBNF <https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form>`_
96+
and `PEG <https://en.wikipedia.org/wiki/Parsing_expression_grammar>`_.
97+
For example:
98+
99+
.. grammar-snippet::
100+
:group: notation
101+
102+
name: `letter` (`letter` | `digit` | "_")*
103+
letter: "a"..."z" | "A"..."Z"
104+
digit: "0"..."9"
105+
106+
In this example, the first line says that a ``name`` is a ``letter`` followed
107+
by a sequence of zero or more ``letter``\ s, ``digit``\ s, and underscores.
108+
A ``letter`` in turn is any of the single characters ``'a'`` through
109+
``'z'`` and ``A`` through ``Z``; a ``digit`` is a single character from ``0``
110+
to ``9``.
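
Reviewer's illustration (not part of the commit): the three example rules above can be read as a regular expression. The sketch below assumes only the example grammar shown in this hunk; it is not Python's actual identifier rule, and ``NAME`` and ``is_name`` are hypothetical names chosen for the illustration.

```python
import re

# letter: "a"..."z" | "A"..."Z"            ->  [a-zA-Z]
# digit:  "0"..."9"                        ->  [0-9]
# name:   letter (letter | digit | "_")*   ->  one letter, then any mix of
#                                              letters, digits and underscores
NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def is_name(text):
    """Return True if the whole string matches the example ``name`` rule."""
    return NAME.fullmatch(text) is not None
```

Under this reading, ``is_name("spam_1")`` is true, while ``is_name("1spam")`` and ``is_name("_x")`` are false, because the example rule requires the first character to be a ``letter``.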
111+
112+
Each rule begins with a name (which identifies the rule that's being defined)
113+
followed by a colon, ``:``.
114+
The definition to the right of the colon uses the following syntax elements:
115+
116+
* ``name``: A name refers to another rule.
117+
Where possible, it is a link to the rule's definition.
118+
119+
* ``TOKEN``: An uppercase name refers to a :term:`token`.
120+
For the purposes of grammar definitions, tokens are the same as rules.
121+
122+
* ``"text"``, ``'text'``: Text in single or double quotes must match literally
123+
(without the quotes). The type of quote is chosen according to the meaning
124+
of ``text``:
125+
126+
* ``'if'``: A name in single quotes denotes a :ref:`keyword <keywords>`.
127+
* ``"case"``: A name in double quotes denotes a
128+
:ref:`soft-keyword <soft-keywords>`.
129+
* ``'@'``: A non-letter symbol in single quotes denotes an
130+
:py:data:`~token.OP` token, that is, a :ref:`delimiter <delimiters>` or
131+
:ref:`operator <operators>`.
132+
133+
* ``e1 e2``: Items separated only by whitespace denote a sequence.
134+
Here, ``e1`` must be followed by ``e2``.
135+
* ``e1 | e2``: A vertical bar is used to separate alternatives.
136+
It denotes PEG's "ordered choice": if ``e1`` matches, ``e2`` is
137+
not considered.
138+
In traditional PEG grammars, this is written as a slash, ``/``, rather than
139+
a vertical bar.
140+
See :pep:`617` for more background and details.
141+
* ``e*``: A star means zero or more repetitions of the preceding item.
142+
* ``e+``: Likewise, a plus means one or more repetitions.
143+
* ``[e]``: A phrase enclosed in square brackets means zero or
144+
one occurrences. In other words, the enclosed phrase is optional.
145+
* ``e?``: A question mark has exactly the same meaning as square brackets:
146+
the preceding item is optional.
147+
* ``(e)``: Parentheses are used for grouping.
148+
* ``"a"..."z"``: Two literal characters separated by three dots mean a choice
149+
of any single character in the given (inclusive) range of ASCII characters.
150+
This notation is only used in
151+
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
152+
* ``<...>``: A phrase between angular brackets gives an informal description
153+
of the matched symbol (for example, ``<any ASCII character except "\">``),
154+
or an abbreviation that is defined in nearby text (for example, ``<Lu>``).
155+
This notation is only used in
156+
:ref:`lexical definitions <notation-lexical-vs-syntactic>`.
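
Reviewer's illustration (not part of the commit): the "ordered choice" meaning of ``|`` is the element most likely to surprise readers used to classic BNF. The sketch below is a minimal, hypothetical set of parser combinators (``literal``, ``sequence`` and ``choice`` are names invented here, not CPython APIs) showing that once an earlier alternative matches, later alternatives are not considered, even if that makes the enclosing rule fail.

```python
def literal(text):
    """Match ``text`` exactly; return the new position, or None on failure."""
    def parse(source, pos):
        return pos + len(text) if source.startswith(text, pos) else None
    return parse


def sequence(*items):
    """``e1 e2``: every item must match, one after the other."""
    def parse(source, pos):
        for item in items:
            pos = item(source, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*alternatives):
    """``e1 | e2``: ordered choice; the first matching alternative wins."""
    def parse(source, pos):
        for alternative in alternatives:
            end = alternative(source, pos)
            if end is not None:
                return end  # commit: later alternatives are never tried
        return None
    return parse


# rule_short_first: ("a" | "ab") "c"
rule_short_first = sequence(choice(literal("a"), literal("ab")), literal("c"))
# rule_long_first:  ("ab" | "a") "c"
rule_long_first = sequence(choice(literal("ab"), literal("a")), literal("c"))

short_first_result = rule_short_first("abc", 0)  # None: "a" wins, then "c" != "b"
long_first_result = rule_long_first("abc", 0)    # 3: "ab" wins, then "c" matches
```

With unordered BNF alternatives both rules would accept ``abc``; with ordered choice only the second does, which is why the order of alternatives matters when reading these grammar snippets.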
157+
158+
The unary operators (``*``, ``+``, ``?``) bind as tightly as possible;
159+
the vertical bar (``|``) binds most loosely.
160+
161+
White space is only meaningful to separate tokens.
162+
163+
Rules are normally contained on a single line, but rules that are too long
164+
may be wrapped:
165+
166+
.. grammar-snippet::
167+
:group: notation
168+
169+
literal: stringliteral | bytesliteral
170+
| integer | floatnumber | imagnumber
171+
172+
Alternatively, rules may be formatted with the first line ending at the colon,
173+
and each alternative beginning with a vertical bar on a new line.
174+
For example:
175+
176+
177+
.. grammar-snippet::
178+
:group: notation-alt
179+
180+
literal:
181+
| stringliteral
182+
| bytesliteral
183+
| integer
184+
| floatnumber
185+
| imagnumber
186+
187+
This does *not* mean that there is an empty first alternative.
188+
189+
.. index:: lexical definitions
190+
191+
.. _notation-lexical-vs-syntactic:
192+
193+
Lexical and Syntactic definitions
194+
---------------------------------
195+
196+
There is some difference between *lexical* and *syntactic* analysis:
197+
the :term:`lexical analyzer` operates on the individual characters of the
198+
input source, while the *parser* (syntactic analyzer) operates on the stream
199+
of :term:`tokens <token>` generated by the lexical analysis.
200+
However, in some cases the exact boundary between the two phases is a
201+
CPython implementation detail.
202+
203+
The practical difference between the two is that in *lexical* definitions,
204+
all whitespace is significant.
205+
The lexical analyzer :ref:`discards <whitespace>` all whitespace that is not
206+
converted to tokens like :data:`token.INDENT` or :data:`~token.NEWLINE`.
207+
*Syntactic* definitions then use these tokens, rather than source characters.
208+
209+
This documentation uses the same BNF grammar for both styles of definitions.
210+
All uses of BNF in the next chapter (:ref:`lexical`) are lexical definitions;
211+
uses in subsequent chapters are syntactic definitions.
