UTF-8 encoding fixes + tests for xcp.pci, .cmd and .net.biosdevname #22

bernhardkaindl · 2023-04-21T18:06:55Z

The changes are unit-tested by the included test-suite improvements, which exercise UTF-8 handling.
The changes to xcp.pci.PCIIds and xcp.net.biosdevname are tested by @alexhimmel using XenRT with his Python3 port of the ACK xapi-plugin.

Commits 1-21 by ydirson are from #17; this PR exists only to allow a separate review of my additional commits.

Foundational information:

For reading UTF-8 characters from files using Python3, there are, AFAIK, three possibilities:

1. On Unix, the Python3 interpreter's locale is set to a UTF-8 locale.
2. Files are opened using open(file, encoding="utf-8"), and subprocess pipes are opened using Popen(..., encoding="utf-8").
3. Binary mode/bytes are used for I/O (open(file, "rb")), and all data read and written is decoded and encoded using explicit decode()/encode() calls with the utf-8 codec.

My observations on these are:

1. While XenServer has en_US.UTF-8 in /etc/locale, which shell logins get passed as LANG=, it may not always be set. For example, it is good practice to clear the environment when daemons are started. The ACK xapi-plugin appears to be an example of this situation. The locale can be enforced interpreter-wide by setting LC_CTYPE/LC_ALL through the environment or locale.setlocale(), but that affects the entire process including all threads, calls to setlocale() are not thread-safe, and it would not be good practice for a library like python-libs/xcp to change the locale used by the process.
2. encoding="utf-8" works nicely and is easy to apply to all open and Popen calls, but it is neither supported nor needed on Python2. It can be passed to all open and Popen calls through a **kwargs keyword parameter, which supplies the needed arguments as a dict.
3. Adding encode and decode calls at every location where data needs to be converted from string to bytes or from bytes to string, while ensuring that it is not overdone, can go wrong. For example, in Python2 an encode(args) call on bytes internally results in decode().encode(args), which can raise UnicodeDecodeError because the implicit decode() is done without encoding= and errors= arguments.
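To illustrate the third observation, here is a minimal sketch that spells out the implicit decode step explicitly so that it also runs on Python3. The helper name `py2_style_encode` is hypothetical and is not part of xcp.

```python
def py2_style_encode(data, encoding="utf-8"):
    """Mimic what Python2 does for encode() on a byte string (hypothetical helper).

    Python2 first decodes the bytes with the default codec (ASCII, strict
    error handling) and only then encodes the result.
    """
    return data.decode("ascii").encode(encoding)


ascii_only = b"Intel Corporation"
already_utf8 = u"F\xe0brica".encode("utf-8")

# Pure ASCII input survives the round trip unchanged.
round_trip = py2_style_encode(ascii_only)

# Input that is already UTF-8 encoded fails in the implicit decode step.
try:
    py2_style_encode(already_utf8)
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True
```

The failure only shows up once non-ASCII data arrives, which is why it is easy to miss in tests that use ASCII-only fixtures.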

Option 2, passing encoding="utf-8", errors="replace" (which replaces undecodable bytes with the U+FFFD replacement character instead of raising an exception) through a **kwargs keyword parameter dict that is populated on python3 and empty on python2, is the easiest and safest option that I can imagine.
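A minimal sketch of option 2, assuming a module-level dict; the name `text_kwargs` is hypothetical and the code is an illustration, not the change made in this PR:

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical name. Empty on Python2, where open() and Popen() do not
# accept these keyword arguments.
if sys.version_info >= (3, 0):
    text_kwargs = {"encoding": "utf-8", "errors": "replace"}
else:
    text_kwargs = {}

# A file holding valid UTF-8 (an e-acute) followed by one invalid byte (0xff).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\xc3\xa9 \xff\n")
    path = tmp.name

with open(path, "r", **text_kwargs) as handle:
    file_text = handle.read()
os.unlink(path)

# The same dict works for pipes: the child writes UTF-8 bytes to its stdout.
child = subprocess.Popen(
    [sys.executable, "-c", "import os; os.write(1, b'caf\\xc3\\xa9')"],
    stdout=subprocess.PIPE,
    **text_kwargs
)
pipe_text, _ = child.communicate()
```

On Python3 the invalid byte is read back as U+FFFD rather than raising, and neither call depends on the locale of the process.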

There is no guarantee about the ordering of dict elements, and tests compare results derived from enumerating a dict element. We could have used an OrderedDict to store the formulae and get a predictable output order, but just considering the output as a set seems better. Only applying this to rules expected to hold more than one element. Signed-off-by: Yann Dirson <[email protected]>
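The idea can be sketched as follows. The rule data and the `render_rules` helper are made up for illustration and are not taken from the xcp test suite.

```python
# Hypothetical rule data keyed by device name.
formulae = {"eth1": "00:11:22:33:44:55", "eth0": "66:77:88:99:aa:bb"}


def render_rules(rules):
    """Return one 'name=mac' line per entry, in dict iteration order."""
    return ["%s=%s" % (name, mac) for name, mac in rules.items()]


expected = ["eth0=66:77:88:99:aa:bb", "eth1=00:11:22:33:44:55"]

# Comparing as sets makes the check independent of the iteration order.
order_independent_match = set(render_rules(formulae)) == set(expected)
```

A set comparison ignores duplicates as well as order, which is acceptable here only because each rule line is expected to appear once.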

This is supposed to be just a module renaming to conform to PEP8, see https://docs.python.org/3/whatsnew/3.0.html#library-changes. The SafeConfigParser class has been renamed to ConfigParser in Python 3.2, and backported as an add-on package. The `readfp` method now triggers a deprecation warning asking to replace it with `read_file`. FIXME: With python3 some Accessor implementations (e.g. FileAccessor) provide a text stream for repository config (and with python2 all implementations), while others (e.g. HTTPAccessor) provide a binary stream. But on python3 ConfigParser will bomb out if given a binary stream, so use a TextIOWrapper to access the config. This is a hack, which cannot be used when it is binary data which has to be read (see later commits), so I don't consider this commit to be correct in that respect.
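A Python3-only sketch of the TextIOWrapper approach described above, with an in-memory `io.BytesIO` standing in for the binary stream an accessor might return (the section and option names are invented):

```python
import configparser
import io

# Stand-in for the binary stream an HTTP-style accessor might hand out.
binary_stream = io.BytesIO(b"[repository]\nname = caf\xc3\xa9\n")

# Wrap the binary stream so that ConfigParser receives decoded text.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")

parser = configparser.ConfigParser()
parser.read_file(text_stream)  # read_file() replaces the deprecated readfp()

repo_name = parser.get("repository", "name")
```

As the commit message notes, the wrapper only helps for text content; a stream that carries binary data has to stay unwrapped.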

Testing several accessor classes causes code duplication, which can be avoided with help from the `parametrized` package (unfortunately, `pytest` support cannot be used together with `unittest`). Not a big deal right now, but starts becoming painful when adding new tests or testing other Accessor classes. Signed-off-by: Yann Dirson <[email protected]>

Now that we use encoding="utf-8" to open /usr/share/hwdata/pci.ids, enhance the test case: include the UTF-8 characters that exist in /usr/share/hwdata/pci.ids in the unit test, and ensure that xcp.pci does not crash on them and returns the expected output.

We might be called with the locale not set, in which case python2's default charset is ASCII. This happens when code is running as an xapi-plugin, for example with the ACK plugin, which uses xcp.pci.PCIIds(). This means we have to test that the code uses encoding="utf-8" correctly in all cases (python2 and python3). This test adds that check while processing UTF-8 data in xcp.cmd and xcp.pci.PCIIds().read().

Signed-off-by: Bernhard Kaindl <[email protected]>

psafont · 2023-04-24T11:49:59Z

The DCO check is failing because the signoffs are missing from some commits. To easily sign them off, run `git rebase --signoff 09ba68aca491f26e8552e935fe36a6af634b55a5`

bernhardkaindl · 2023-05-24T15:02:01Z

All commits of this PR have now been refactored and merged/obsoleted by PRs which are now merged.

ydirson and others added 28 commits January 20, 2023 17:45

python3: use six.string_types not version-dependant types

05a4e60

Use of `unicode` needed to be immediately handled, but a few checks relying on `str` could become insufficient in python2 with the larger usage of unicode strings. Signed-off-by: Yann Dirson <[email protected]>

python3: use "six.ensure_binary" and "six.ensure_text" for str/bytes …

84d172a

…conversion Signed-off-by: Yann Dirson <[email protected]>

Remove direct call's to file's constructor and replace them with call…

7ac16be

…s to open() as this is considered best practice. (cherry picked from cpython commit 6cef076ba5edbfa42239924951d8acbb087b3b19) Signed-off-by: Yann Dirson <[email protected]>

python3: xcp.net.mac: use six.python_2_unicode_compatible for stringi…

520d419

…fication Signed-off-by: Yann Dirson <[email protected]>

xcp.net.ifrename.logic: use "logger.warning", "logger.warn" is deprec…

a59c3ba

…ated Signed-off-by: Yann Dirson <[email protected]>

python3: use raw strings for regexps, fixes insufficient quoting

0832486

Running tests on python3 did reveal some of them. Signed-off-by: Yann Dirson <[email protected]>
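For illustration, a small sketch of the raw-string form; the pattern below is an invented example and not one of the regexps touched by this commit:

```python
import re

# Without the r prefix, "\d" and "\." are invalid string escapes that newer
# Python3 versions warn about; a raw string passes them to re unchanged.
PCI_ADDRESS = re.compile(r"^(\d{4}):([0-9a-f]{2}):([0-9a-f]{2})\.(\d)$")

address_parts = PCI_ADDRESS.match("0000:04:00.1").groups()

# A raw string is only a different spelling of the same pattern text.
same_pattern = r"\d+\.\d+" == "\\d+\\.\\d+"
```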

test_dom0: mock "open()" in a python3-compatible way

7e14499

Signed-off-by: Yann Dirson <[email protected]>
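One common way to mock `open()` on Python3 is `unittest.mock.mock_open`. The sketch below assumes that approach and uses a hypothetical function under test; it is not the code from test_dom0.

```python
from unittest import mock


def read_first_line(path):
    """Hypothetical function under test: return the first line of a file."""
    with open(path) as handle:
        return handle.readline().rstrip("\n")


fake_open = mock.mock_open(read_data="dom0_mem=752M\nsecond line\n")

# On Python3 the built-in lives in "builtins" (on Python2: "__builtin__").
with mock.patch("builtins.open", fake_open):
    first_line = read_first_line("/proc/cmdline")
```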

test_cpio: ensure paths are handled as text

ae79078

Caught by extended test. Signed-off-by: Yann Dirson <[email protected]>

cpiofile: migrate last "list.sort()" call still using a "cmp" argument

346ebc0

This goes away in python3. Signed-off-by: Yann Dirson <[email protected]>
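A sketch of one way to migrate such a call, using `functools.cmp_to_key` so that the existing comparison function can be kept; the comparison function and the data are invented for illustration:

```python
import functools


def compare_names(left, right):
    """Old-style comparison: negative, zero or positive result."""
    return (left > right) - (left < right)


members = ["usr/bin", "etc", "usr"]

# Python2 only: members.sort(cmp=compare_names)
# Works on both: wrap the comparison function as a key function.
members.sort(key=functools.cmp_to_key(compare_names))
```

When the comparison is as simple as this one, a plain `key=` function (or no argument at all) is the more idiomatic replacement.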

WIP python3: fix xmlunwrap and its test to align with the use of bytes

5fd6cdf

FIXME: I'm quite unsure why xcp.xmlunwrap would want to use bytes and not unicode strings, but the encode/decode calls make it quite clear it wants to work with bytes. That makes the API painful to use in python3.

xcp.repository: switch from md5 to hashlib.md5

67225f7

hashlib came with python 2.5, and old md5 module disappears in 3.0 Signed-off-by: Yann Dirson <[email protected]>
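A minimal sketch of the hashlib form; the helper name `md5_hexdigest` is hypothetical:

```python
import hashlib


def md5_hexdigest(chunks):
    """Hash an iterable of byte chunks incrementally (hypothetical helper)."""
    digest = hashlib.md5()  # replaces md5.new() from the old md5 module
    for chunk in chunks:
        digest.update(chunk)  # chunks must be bytes, not text, on Python3
    return digest.hexdigest()


checksum = md5_hexdigest([b"hello ", b"world"])
```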

WIP test_accessor: check for I/O on binary files

cdd5ee8

This test uses the same kind of I/O (file copy) that prepare_host_upgrade.py does. FIXME: the copy cannot proceed this way in python3

WIP test_accessor: write into copy file as binary

71183e4

This works properly for the http case, but FileAccessor provides us with a text fileobj handle, and `read()` gets a UTF-8 decoding error. FIXME: Accessor ctor requires a `mode` argument
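The underlying problem can be shown with in-memory streams. In this sketch `io.BytesIO` stands in for a binary file object; it illustrates why the copy needs binary handles on Python3 and is not the accessor code itself.

```python
import io
import shutil

# Stand-in for a binary payload; the bytes are deliberately not valid UTF-8.
payload = b"\x1f\x8b\x08\x00\xff\xfe binary data"

# Copying between binary streams moves the bytes through untouched.
source = io.BytesIO(payload)
target = io.BytesIO()
shutil.copyfileobj(source, target)
copied = target.getvalue()

# Decoding the same bytes as UTF-8 text, which a text-mode handle would do
# on read(), fails.
try:
    payload.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False
```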

Pylint complements: honor len-as-condition convention

d6566dd

Signed-off-by: Yann Dirson <[email protected]>

Pylint complements: whitespace in expressions

ac1b1b8

Signed-off-by: Yann Dirson <[email protected]>

Pylint complements: test_ifrename_logic: disable "no-member" warning

d58e988

Reported under python3 for members created on-the-fly in `setUp()` Signed-off-by: Yann Dirson <[email protected]>

Pylint complements: avoid no-else-raise "refactor" issues

efeecfa

With python3, pylint complains about `else: raise()` constructs. This rework avoids them and reduces cyclomatic complexity by using the error-out-first idiom. Signed-off-by: Yann Dirson <[email protected]>
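A sketch of the error-out-first idiom, assuming the reported pattern is an `else` branch that follows a `raise`; both functions are invented examples:

```python
def parse_port_with_else(value):
    """Before: the normal path is nested in else branches after each raise."""
    if not value.isdigit():
        raise ValueError("not a number: %r" % (value,))
    else:
        port = int(value)
        if port < 1 or port > 65535:
            raise ValueError("port out of range: %d" % port)
        else:
            return port


def parse_port(value):
    """After: errors are raised first, the normal path stays unindented."""
    if not value.isdigit():
        raise ValueError("not a number: %r" % (value,))
    port = int(value)
    if port < 1 or port > 65535:
        raise ValueError("port out of range: %d" % port)
    return port
```

Both versions behave the same; the second one simply drops the `else` branches that can never be reached after a `raise`.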

CI: also run tests with python3

09ba68a

diff-cover defaults to origin/main in new version, it seems. Signed-off-by: Yann Dirson <[email protected]>

Introduce xcp.xcp_popen_text_kwargs and use it for Popen and open

74daac5

Also use xcp.xcp_popen_text_kwargs for all affected unit tests because they need to handle the encoding decode/encode likewise.

tests/test_cmd.py: Also test UTF-8 in stderr of the cmd

0f0c19f

tests/test_cmd.py: Full test for stdin/stdout (UTF-8) using "cat"

4463538

tests/test_pci.py: Verify the returned Dict to have expected content.

75b2cf1

xcp.net.ip.ip_link_set_name(): Support Python3 and add testcase for it

c636d9b

Signed-off-by: Bernhard Kaindl <[email protected]>

bernhardkaindl changed the title ~~Testsuite driven py3, with UTF-8 encoding fixes + tests for xcp.pci, .cmd and .net.biosdevname~~ UTF-8 encoding fixes + tests for xcp.pci, .cmd and .net.biosdevname May 15, 2023

bernhardkaindl added the "Will be fixed by other PRs" and "bug" labels May 15, 2023

bernhardkaindl closed this May 24, 2023

bernhardkaindl mentioned this pull request Mar 12, 2024

CP-47555 Porting usb_scan.py to python3 xapi-project/xen-api#5424

Merged
