Description
Feature or enhancement
Do an ASCII check on the entire buffer for TextIOWrapper and save the result as a variable, such as self->buffer_is_ascii. Use this knowledge to create strings more quickly.
Pitch
PyUnicode_Decode* functions check what the maximum character in the data is. For instance, PyUnicode_DecodeLatin1 still scans the input, and if the data is actually ASCII, an ASCII string is created. A similar process happens when TextIOWrapper decodes a text file.
However, in the ASCII case all characters are ASCII, and in the UTF-8 case possibly all characters are ASCII. When that is so, a PyUnicode_New call to initialize an ASCII string followed by a simple memcpy of the data is much faster than the alternative. This is utilized in the dnaio parser for FASTQ files.
The following code runs at 20 GB/s https://github.com/rhpvorderman/ascii-check/blob/main/ascii_check.h#L41 and is therefore almost cost-free when running on io.DEFAULT_BUFFER_SIZE chunks (8 KiB, IIRC). An SSE2 implementation is also provided in the same repository.
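For readers who do not want to follow the link, here is a minimal sketch of what a check of this kind can look like. It is not the code from the linked repository, and it makes no claim about matching the throughput quoted above. The function name buffer_is_ascii is hypothetical, borrowed from the variable name suggested earlier in this issue. The idea is to OR eight bytes at a time into an accumulator and test the high bit of every byte once at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a CPython API): returns 1 if every byte in
 * buf[0..len) is below 0x80, i.e. the buffer is pure ASCII, else 0. */
static int
buffer_is_ascii(const unsigned char *buf, size_t len)
{
    /* One high bit per byte; the mask is the same on any endianness. */
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    uint64_t acc = 0;
    unsigned char tail = 0;
    size_t i = 0;

    /* Bulk loop: OR eight bytes at a time into the accumulator. */
    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, buf + i, sizeof(word));  /* alignment-safe load */
        acc |= word;
    }

    /* Tail: fewer than eight bytes remain. */
    for (; i < len; i++) {
        tail |= buf[i];
    }

    return ((acc & high_bits) == 0) && ((tail & 0x80) == 0);
}
```

The loop has no early exit, which keeps it branch-free and easy for a compiler to vectorize. That seems like a reasonable trade-off when the expected common case is an all-ASCII chunk of a few kilobytes.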
After this check, much of the translation and decoding can be skipped if the data turns out to be ASCII. Since UTF-8 files are quite common, this could be a real-world performance benefit.