Faster ASCII and possibly UTF-8 decoding of text files with TextIOWrapper #101289

Closed as not planned
@rhpvorderman

Description

Feature or enhancement

Run an ASCII check over the entire buffer in TextIOWrapper and save the result in a variable, such as self->buffer_is_ascii. Use this knowledge to create strings more quickly.

Pitch

PyUnicode_Decode* functions scan the data to determine its maximum character. For instance, PyUnicode_DecodeLatin1 still scans the input, and if it is actually ASCII, an ASCII string is created. A similar process happens when TextIOWrapper decodes a text file.

However, in the ASCII case all characters are ASCII, and in the UTF-8 case all characters may well be ASCII. When they are, a PyUnicode_New call to initialize an ASCII string followed by a simple memcpy of the data is much faster than the alternative. This approach is used in the dnaio parser for FASTQ files.

The following code runs at 20 GB/s: https://github.com/rhpvorderman/ascii-check/blob/main/ascii_check.h#L41. It is therefore almost cost-free when run on io.DEFAULT_BUFFER_SIZE chunks (8 KiB, IIRC). An SSE2 implementation is also provided in the same repository.
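To make the idea concrete, here is a minimal sketch of a word-at-a-time ASCII check in plain C. It is not the code from the linked repository, and the throughput figure above does not apply to it. The helper name `buffer_is_ascii` and the `ASCII_HIGH_BITS` macro are hypothetical names chosen for illustration. The check relies only on the fact that every ASCII byte has its high bit clear.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* High bit of every byte in a 64-bit word; a set bit marks a non-ASCII byte. */
#define ASCII_HIGH_BITS UINT64_C(0x8080808080808080)

/* Hypothetical helper (not a CPython API): returns 1 if all `length`
 * bytes in `data` are below 0x80, and 0 otherwise. */
static int
buffer_is_ascii(const char *data, size_t length)
{
    const char *cursor = data;
    const char *end = data + length;
    uint64_t accumulated = 0;

    /* OR whole 8-byte words together. memcpy sidesteps unaligned loads
     * and is typically compiled down to a single load instruction. */
    while ((size_t)(end - cursor) >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, cursor, sizeof(word));
        accumulated |= word;
        cursor += sizeof(word);
    }
    /* Fold the remaining 0 to 7 tail bytes into the same accumulator. */
    while (cursor < end) {
        accumulated |= (uint64_t)(unsigned char)*cursor;
        cursor++;
    }
    /* One test at the end instead of one branch per byte. */
    return (accumulated & ASCII_HIGH_BITS) == 0;
}
```

Under the proposal, the result of such a check would be stored once per buffer refill (for example in the suggested self->buffer_is_ascii field), so that later string creation can take the PyUnicode_New plus memcpy path without rescanning.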

Once this check has been performed, much of the translation and decoding can be skipped if the data turns out to be ASCII. Since UTF-8 files are quite common, this could yield a real-world performance benefit.
