Commit b9fdb7a

Issue 19548: update codecs module documentation
- clarified the distinction between text encodings and other codecs
- clarified relationship with builtin open and the io module
- consolidated documentation of error handlers into one section
- clarified type constraints of some behaviours
- added tests for some of the new statements in the docs
1 parent fcfed19 commit b9fdb7a

9 files changed: 417 additions and 363 deletions

Doc/glossary.rst

Lines changed: 4 additions & 1 deletion
@@ -820,10 +820,13 @@ Glossary
       :meth:`~collections.somenamedtuple._asdict`. Examples of struct sequences
       include :data:`sys.float_info` and the return value of :func:`os.stat`.

+   text encoding
+      A codec which encodes Unicode strings to bytes.
+
    text file
       A :term:`file object` able to read and write :class:`str` objects.
       Often, a text file actually accesses a byte-oriented datastream
-      and handles the text encoding automatically.
+      and handles the :term:`text encoding` automatically.

       .. seealso::
          A :term:`binary file` reads and write :class:`bytes` objects.
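
The new glossary entry separates text encodings (str to bytes) from other codecs. A minimal sketch of that distinction, assuming Python 3.4 or later, where `str.encode()` is expected to reject codecs that are not text encodings; the sample strings are illustrative only:

```python
import codecs

# A text encoding maps str to bytes, so str.encode() accepts it.
text_encoded = "caf\u00e9".encode("utf-8")

# "hex" is a bytes-to-bytes codec, not a text encoding; it is reached
# through codecs.encode() rather than str.encode().
hex_encoded = codecs.encode(b"abc", "hex")

# Asking str.encode() for a non-text codec fails with LookupError.
try:
    "abc".encode("hex")
    rejected = False
except LookupError:
    rejected = True
```

The split matters mostly for APIs such as `open()` and `str.encode()`, which only make sense with codecs that produce bytes from text.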

Doc/library/codecs.rst

Lines changed: 327 additions & 303 deletions
Large diffs are not rendered by default.

Doc/library/functions.rst

Lines changed: 5 additions & 3 deletions
@@ -939,15 +939,17 @@ are always available. They are listed here in alphabetical order.
    *encoding* is the name of the encoding used to decode or encode the file.
    This should only be used in text mode. The default encoding is platform
    dependent (whatever :func:`locale.getpreferredencoding` returns), but any
-   encoding supported by Python can be used. See the :mod:`codecs` module for
+   :term:`text encoding` supported by Python
+   can be used. See the :mod:`codecs` module for
    the list of supported encodings.

    *errors* is an optional string that specifies how encoding and decoding
    errors are to be handled--this cannot be used in binary mode.
-   A variety of standard error handlers are available, though any
+   A variety of standard error handlers are available
+   (listed under :ref:`error-handlers`), though any
    error handling name that has been registered with
    :func:`codecs.register_error` is also valid. The standard names
-   are:
+   include:

    * ``'strict'`` to raise a :exc:`ValueError` exception if there is
      an encoding error. The default value of ``None`` has the same

Doc/library/stdtypes.rst

Lines changed: 2 additions & 2 deletions
@@ -1512,7 +1512,7 @@ expression support in the :mod:`re` module).
    a :exc:`UnicodeError`. Other possible
    values are ``'ignore'``, ``'replace'``, ``'xmlcharrefreplace'``,
    ``'backslashreplace'`` and any other name registered via
-   :func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
+   :func:`codecs.register_error`, see section :ref:`error-handlers`. For a
    list of possible encodings, see section :ref:`standard-encodings`.

    .. versionchanged:: 3.1
@@ -2384,7 +2384,7 @@ arbitrary binary data.
    error handling scheme. The default for *errors* is ``'strict'``, meaning
    that encoding errors raise a :exc:`UnicodeError`. Other possible values are
    ``'ignore'``, ``'replace'`` and any other name registered via
-   :func:`codecs.register_error`, see section :ref:`codec-base-classes`. For a
+   :func:`codecs.register_error`, see section :ref:`error-handlers`. For a
    list of possible encodings, see section :ref:`standard-encodings`.

    .. note::

Doc/library/tarfile.rst

Lines changed: 1 addition & 1 deletion
@@ -794,7 +794,7 @@ metadata must be either decoded or encoded. If *encoding* is not set
 appropriately, this conversion may fail.

 The *errors* argument defines how characters are treated that cannot be
-converted. Possible values are listed in section :ref:`codec-base-classes`.
+converted. Possible values are listed in section :ref:`error-handlers`.
 The default scheme is ``'surrogateescape'`` which Python also uses for its
 file system calls, see :ref:`os-filenames`.
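
The documentation hunks above all point readers at the consolidated error-handlers section. As a rough illustration of what a few of the standard handler names do, here is a small sketch using `bytes.decode()`; the input bytes are a made-up example of Latin-1 data being misread as UTF-8:

```python
raw = b"caf\xe9"  # Latin-1 bytes; not valid UTF-8

# 'strict' (the default) raises on undecodable input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' drops the bad byte; 'replace' substitutes U+FFFD.
ignored = raw.decode("utf-8", "ignore")
replaced = raw.decode("utf-8", "replace")

# 'surrogateescape' smuggles the bad byte through as a lone surrogate,
# so encoding with the same handler restores the original bytes.
escaped = raw.decode("utf-8", "surrogateescape")
round_trip = escaped.encode("utf-8", "surrogateescape")
```

The round trip is the property that makes `'surrogateescape'` a reasonable default for file names and archive metadata of unknown encoding.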

Lib/codecs.py

Lines changed: 30 additions & 41 deletions
@@ -346,8 +346,7 @@ def __init__(self, stream, errors='strict'):

         """ Creates a StreamWriter instance.

-            stream must be a file-like object open for writing
-            (binary) data.
+            stream must be a file-like object open for writing.

             The StreamWriter may use different error handling
             schemes by providing the errors keyword argument. These
@@ -421,8 +420,7 @@ def __init__(self, stream, errors='strict'):

         """ Creates a StreamReader instance.

-            stream must be a file-like object open for reading
-            (binary) data.
+            stream must be a file-like object open for reading.

             The StreamReader may use different error handling
             schemes by providing the errors keyword argument. These
@@ -450,13 +448,12 @@ def read(self, size=-1, chars=-1, firstline=False):
         """ Decodes data from the stream self.stream and returns the
             resulting object.

-            chars indicates the number of characters to read from the
-            stream. read() will never return more than chars
-            characters, but it might return less, if there are not enough
-            characters available.
+            chars indicates the number of decoded code points or bytes to
+            return. read() will never return more data than requested,
+            but it might return less, if there is not enough available.

-            size indicates the approximate maximum number of bytes to
-            read from the stream for decoding purposes. The decoder
+            size indicates the approximate maximum number of decoded
+            bytes or code points to read for decoding. The decoder
             can modify this setting as appropriate. The default value
             -1 indicates to read and decode as much as possible. size
             is intended to prevent having to decode huge files in one
@@ -467,7 +464,7 @@ def read(self, size=-1, chars=-1, firstline=False):
             will be returned, the rest of the input will be kept until the
             next call to read().

-            The method should use a greedy read strategy meaning that
+            The method should use a greedy read strategy, meaning that
             it should read as much data as is allowed within the
             definition of the encoding and the given size, e.g. if
             optional encoding endings or state markers are available
@@ -602,7 +599,7 @@ def readline(self, size=None, keepends=True):
     def readlines(self, sizehint=None, keepends=True):

         """ Read all lines available on the input stream
-            and return them as list of lines.
+            and return them as a list.

             Line breaks are implemented using the codec's decoder
             method and are included in the list entries.
@@ -750,19 +747,18 @@ def __exit__(self, type, value, tb):

 class StreamRecoder:

-    """ StreamRecoder instances provide a frontend - backend
-        view of encoding data.
+    """ StreamRecoder instances translate data from one encoding to another.

         They use the complete set of APIs returned by the
         codecs.lookup() function to implement their task.

-        Data written to the stream is first decoded into an
-        intermediate format (which is dependent on the given codec
-        combination) and then written to the stream using an instance
-        of the provided Writer class.
+        Data written to the StreamRecoder is first decoded into an
+        intermediate format (depending on the "decode" codec) and then
+        written to the underlying stream using an instance of the provided
+        Writer class.

-        In the other direction, data is read from the stream using a
-        Reader instance and then return encoded data to the caller.
+        In the other direction, data is read from the underlying stream using
+        a Reader instance and then encoded and returned to the caller.

     """
     # Optional attributes set by the file wrappers below
@@ -774,22 +770,17 @@ def __init__(self, stream, encode, decode, Reader, Writer,

         """ Creates a StreamRecoder instance which implements a two-way
             conversion: encode and decode work on the frontend (the
-            input to .read() and output of .write()) while
-            Reader and Writer work on the backend (reading and
-            writing to the stream).
+            data visible to .read() and .write()) while Reader and Writer
+            work on the backend (the data in stream).

-            You can use these objects to do transparent direct
-            recodings from e.g. latin-1 to utf-8 and back.
+            You can use these objects to do transparent
+            transcodings from e.g. latin-1 to utf-8 and back.

             stream must be a file-like object.

-            encode, decode must adhere to the Codec interface, Reader,
+            encode and decode must adhere to the Codec interface; Reader and
             Writer must be factory functions or classes providing the
-            StreamReader, StreamWriter interface resp.
-
-            encode and decode are needed for the frontend translation,
-            Reader and Writer for the backend translation. Unicode is
-            used as intermediate encoding.
+            StreamReader and StreamWriter interfaces resp.

             Error handling is done in the same way as defined for the
             StreamWriter/Readers.
@@ -864,7 +855,7 @@ def __exit__(self, type, value, tb):

 ### Shortcuts

-def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
+def open(filename, mode='r', encoding=None, errors='strict', buffering=1):

     """ Open an encoded file using the given mode and return
         a wrapped version providing transparent encoding/decoding.
@@ -874,10 +865,8 @@ def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
         codecs. Output is also codec dependent and will usually be
         Unicode as well.

-        Files are always opened in binary mode, even if no binary mode
-        was specified. This is done to avoid data loss due to encodings
-        using 8-bit values. The default file mode is 'rb' meaning to
-        open the file in binary read mode.
+        Underlying encoded files are always opened in binary mode.
+        The default file mode is 'r', meaning to open the file in read mode.

         encoding specifies the encoding which is to be used for the
         file.
@@ -913,13 +902,13 @@ def EncodedFile(file, data_encoding, file_encoding=None, errors='strict'):
     """ Return a wrapped version of file which provides transparent
         encoding translation.

-        Strings written to the wrapped file are interpreted according
-        to the given data_encoding and then written to the original
-        file as string using file_encoding. The intermediate encoding
+        Data written to the wrapped file is decoded according
+        to the given data_encoding and then encoded to the underlying
+        file using file_encoding. The intermediate data type
         will usually be Unicode but depends on the specified codecs.

-        Strings are read from the file using file_encoding and then
-        passed back to the caller as string using data_encoding.
+        Bytes read from the file are decoded using file_encoding and then
+        passed back to the caller encoded using data_encoding.

         If file_encoding is not given, it defaults to data_encoding.
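
The reworded `EncodedFile` docstring describes a two-way transcoding: bytes in `data_encoding` on the caller's side, bytes in `file_encoding` in the underlying file. A short sketch of both directions, using in-memory `io.BytesIO` streams as stand-ins for real files (the variable names are illustrative):

```python
import codecs
import io

# Reading: the underlying stream holds UTF-8, the caller sees Latin-1.
source = io.BytesIO(b"\xc3\xbc")  # U+00FC encoded as UTF-8
reader = codecs.EncodedFile(source, data_encoding="latin-1",
                            file_encoding="utf-8")
read_back = reader.read()

# Writing: the caller supplies Latin-1, the stream receives UTF-8.
sink = io.BytesIO()
writer = codecs.EncodedFile(sink, data_encoding="latin-1",
                            file_encoding="utf-8")
writer.write(b"\xfc")
stored = sink.getvalue()
```

Both sides of the wrapper deal in bytes here; the str value only exists as the intermediate form between the two codecs.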

Lib/test/test_codecs.py

Lines changed: 37 additions & 9 deletions
@@ -1139,6 +1139,8 @@ def test_recoding(self):
         # Python used to crash on this at exit because of a refcount
         # bug in _codecsmodule.c

+        self.assertTrue(f.closed)
+
 # From RFC 3492
 punycode_testcases = [
     # A Arabic (Egyptian):
@@ -1591,6 +1593,16 @@ def test_incremental_encode(self):
         self.assertEqual(encoder.encode("ample.org."), b"xn--xample-9ta.org.")
         self.assertEqual(encoder.encode("", True), b"")

+    def test_errors(self):
+        """Only supports "strict" error handler"""
+        "python.org".encode("idna", "strict")
+        b"python.org".decode("idna", "strict")
+        for errors in ("ignore", "replace", "backslashreplace",
+                "surrogateescape"):
+            self.assertRaises(Exception, "python.org".encode, "idna", errors)
+            self.assertRaises(Exception,
+                b"python.org".decode, "idna", errors)
+
 class CodecsModuleTest(unittest.TestCase):

     def test_decode(self):
@@ -1668,6 +1680,24 @@ def test_all(self):
         for api in codecs.__all__:
             getattr(codecs, api)

+    def test_open(self):
+        self.addCleanup(support.unlink, support.TESTFN)
+        for mode in ('w', 'r', 'r+', 'w+', 'a', 'a+'):
+            with self.subTest(mode), \
+                    codecs.open(support.TESTFN, mode, 'ascii') as file:
+                self.assertIsInstance(file, codecs.StreamReaderWriter)
+
+    def test_undefined(self):
+        self.assertRaises(UnicodeError, codecs.encode, 'abc', 'undefined')
+        self.assertRaises(UnicodeError, codecs.decode, b'abc', 'undefined')
+        self.assertRaises(UnicodeError, codecs.encode, '', 'undefined')
+        self.assertRaises(UnicodeError, codecs.decode, b'', 'undefined')
+        for errors in ('strict', 'ignore', 'replace', 'backslashreplace'):
+            self.assertRaises(UnicodeError,
+                codecs.encode, 'abc', 'undefined', errors)
+            self.assertRaises(UnicodeError,
+                codecs.decode, b'abc', 'undefined', errors)
+
 class StreamReaderTest(unittest.TestCase):

     def setUp(self):
@@ -1801,13 +1831,10 @@ def test_basic(self):
 # "undefined"

 # The following encodings don't work in stateful mode
-broken_unicode_with_streams = [
+broken_unicode_with_stateful = [
     "punycode",
     "unicode_internal"
 ]
-broken_incremental_coders = broken_unicode_with_streams + [
-    "idna",
-]

 class BasicUnicodeTest(unittest.TestCase, MixInCheckStateHandling):
     def test_basics(self):
@@ -1827,7 +1854,7 @@ def test_basics(self):
             (chars, size) = codecs.getdecoder(encoding)(b)
             self.assertEqual(chars, s, "encoding=%r" % encoding)

-            if encoding not in broken_unicode_with_streams:
+            if encoding not in broken_unicode_with_stateful:
                 # check stream reader/writer
                 q = Queue(b"")
                 writer = codecs.getwriter(encoding)(q)
@@ -1845,7 +1872,7 @@ def test_basics(self):
                 decodedresult += reader.read()
                 self.assertEqual(decodedresult, s, "encoding=%r" % encoding)

-            if encoding not in broken_incremental_coders:
+            if encoding not in broken_unicode_with_stateful:
                 # check incremental decoder/encoder and iterencode()/iterdecode()
                 try:
                     encoder = codecs.getincrementalencoder(encoding)()
@@ -1894,7 +1921,7 @@ def test_basics_capi(self):
         from _testcapi import codec_incrementalencoder, codec_incrementaldecoder
         s = "abc123" # all codecs should be able to encode these
         for encoding in all_unicode_encodings:
-            if encoding not in broken_incremental_coders:
+            if encoding not in broken_unicode_with_stateful:
                 # check incremental decoder/encoder (fetched via the C API)
                 try:
                     cencoder = codec_incrementalencoder(encoding)
@@ -1934,7 +1961,7 @@ def test_seek(self):
         for encoding in all_unicode_encodings:
             if encoding == "idna": # FIXME: See SF bug #1163178
                 continue
-            if encoding in broken_unicode_with_streams:
+            if encoding in broken_unicode_with_stateful:
                 continue
             reader = codecs.getreader(encoding)(io.BytesIO(s.encode(encoding)))
             for t in range(5):
@@ -1967,7 +1994,7 @@ def test_decoder_state(self):
         # Check that getstate() and setstate() handle the state properly
         u = "abc123"
         for encoding in all_unicode_encodings:
-            if encoding not in broken_incremental_coders:
+            if encoding not in broken_unicode_with_stateful:
                 self.check_state_handling_decode(encoding, u, u.encode(encoding))
                 self.check_state_handling_encode(encoding, u, u.encode(encoding))

@@ -2171,6 +2198,7 @@ def test_encodedfile(self):
         f = io.BytesIO(b"\xc3\xbc")
         with codecs.EncodedFile(f, "latin-1", "utf-8") as ef:
             self.assertEqual(ef.read(), b"\xfc")
+        self.assertTrue(f.closed)

     def test_streamreaderwriter(self):
         f = io.BytesIO(b"\xc3\xbc")
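
The new `test_open` checks that `codecs.open()` returns a `StreamReaderWriter` for every mode when an encoding is given, in line with the updated docstring stating that the underlying file is always opened in binary mode. A small standalone sketch of that behaviour outside the test suite; the temporary directory and the file name `demo.txt` are assumptions made for illustration:

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical file name

    # With an explicit encoding, codecs.open() wraps a binary file.
    with codecs.open(path, "w", encoding="utf-8") as f:
        wrapper_ok = isinstance(f, codecs.StreamReaderWriter)
        f.write("gr\u00fc\u00df")

    # The bytes on disk are the UTF-8 encoding of the written text.
    with open(path, "rb") as raw_file:
        raw = raw_file.read()

    # Reading back through codecs.open() decodes them again.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()
```

For ordinary text files the builtin `open()` is generally the more convenient interface; this sketch only mirrors what the added test asserts.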

Misc/NEWS

Lines changed: 8 additions & 0 deletions
@@ -265,6 +265,10 @@ IDLE
 Tests
 -----

+- Issue #19548: Added some additional checks to test_codecs to ensure that
+  statements in the updated documentation remain accurate. Patch by Martin
+  Panter.
+
 - Issue #22838: All test_re tests now work with unittest test discovery.

 - Issue #22173: Update lib2to3 tests to use unittest test discovery.
@@ -297,6 +301,10 @@ Build
 Documentation
 -------------

+- Issue #19548: Update the codecs module documentation to better cover the
+  distinction between text encodings and other codecs, together with other
+  clarifications. Patch by Martin Panter.
+
 - Issue #22914: Update the Python 2/3 porting HOWTO to describe a more automated
   approach.

Modules/_codecsmodule.c

Lines changed: 3 additions & 3 deletions
@@ -54,9 +54,9 @@ PyDoc_STRVAR(register__doc__,
 "register(search_function)\n\
 \n\
 Register a codec search function. Search functions are expected to take\n\
-one argument, the encoding name in all lower case letters, and return\n\
-a tuple of functions (encoder, decoder, stream_reader, stream_writer)\n\
-(or a CodecInfo object).");
+one argument, the encoding name in all lower case letters, and either\n\
+return None, or a tuple of functions (encoder, decoder, stream_reader,\n\
+stream_writer) (or a CodecInfo object).");

 static
 PyObject *codec_register(PyObject *self, PyObject *search_function)
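
The revised `register()` docstring makes explicit that a search function may return `None` when it does not recognise a name. A minimal sketch of such a function from the Python side; the alias `demo_utf8`, the function name `demo_search` and the unknown name `demo_missing_codec` are all hypothetical and exist only for this example:

```python
import codecs

def demo_search(name):
    """Hypothetical search function that resolves one made-up alias."""
    if name == "demo_utf8":
        return codecs.lookup("utf-8")  # a CodecInfo object
    return None  # not ours: let other search functions try

codecs.register(demo_search)

# The alias now behaves like UTF-8.
encoded = "\u00fc".encode("demo_utf8")

# A name that no search function recognises ends in LookupError.
try:
    codecs.lookup("demo_missing_codec")
    missing_found = False
except LookupError:
    missing_found = True
```

Returning `None` rather than raising keeps the lookup chain going, so several independent search functions can coexist.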
