
UTF-8 encoding fixes + tests for xcp.pci, .cmd and .net.biosdevname #22


Closed
wants to merge 28 commits into from

Conversation

bernhardkaindl
Collaborator

@bernhardkaindl bernhardkaindl commented Apr 21, 2023

  • The changes are unit-tested using the included testsuite improvements to test UTF-8 handling.
  • The changes to xcp.pci.PCIIds and xcp.net.biosdevname are tested by @alexhimmel using XenRT with his Python3 port of the ACK xapi-plugin.

Commits 1-21, by ydirson, are from #17; this PR exists only to allow a separate review of my additional commits.

Foundational information:

For reading UTF-8 characters from files using Python3, there are, AFAIK, three possibilities:

  1. On Unix, the Python3 interpreter's locale must be set to a UTF-8 locale.
  2. Files are opened using open(file, encoding="utf-8"), and subprocess pipes are opened using Popen(..., encoding="utf-8").
  3. Binary mode/bytes are used for I/O (e.g. open(file, "rb")), and all data read and written is decoded and encoded using explicit decode()/encode() calls with the utf-8 codec.
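
As a rough illustration of options 2 and 3 (a minimal sketch, not code from this PR; the temporary file and its contents are made up for the example), both approaches should yield the same text independently of the process locale, while forcing the ASCII codec stands in for a process that has no UTF-8 locale:

```python
import os
import tempfile

# Hypothetical sample data: one line containing a non-ASCII character.
SAMPLE = u"Caf\u00e9 device\n"

# Write the sample as UTF-8 bytes to a temporary file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(SAMPLE.encode("utf-8"))

# Option 2: text mode with an explicit encoding (Python 3 only).
with open(path, encoding="utf-8") as f:
    text_mode = f.read()

# Option 3: binary mode plus an explicit decode() call.
with open(path, "rb") as f:
    binary_mode = f.read().decode("utf-8")

# Forcing the ASCII codec emulates what happens when option 1 is not met.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

os.remove(path)
```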

My observations on these are:

  1. While XenServer has en_US.UTF-8 in /etc/locale, which shell logins get passed as LANG=, it may not always be set. For example, it is good practice to clear the environment when daemons are started. The ACK xapi-plugin appears to be an example of this situation. The locale can be enforced interpreter-wide by setting LC_CTYPE/LC_ALL through the environment or locale.setlocale(), but that affects the entire process including all threads, calls to setlocale() are not thread-safe, and it would not be good practice for a library like python-libs/xcp to change the locale used by the process.
  2. encoding='utf-8' works nicely and is easy to apply to all open and Popen calls, but is neither supported nor needed for Python2. Passing it to all open and Popen calls can be done through a **kwargs keyword parameter which passes the needed arguments as a dict.
  3. Adding encode and decode calls to all locations where data would need to be encoded from string to bytes and from bytes to string, and ensuring that it is not over-done, can go wrong. For example, in Python2 an encode(args) call on bytes internally results in decode().encode(args), which can go wrong and can even raise a UnicodeDecodeError because the implicit decode() is done without encoding= and errors= arguments.
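
One way to keep option 3 from being over-done is a guarded conversion that decodes only when it is actually given bytes. This is only a sketch; `to_text` is a hypothetical helper name and is not part of python-libs/xcp:

```python
def to_text(data, encoding="utf-8", errors="replace"):
    """Return text for bytes or text input, decoding only when needed.

    Hypothetical helper for illustration; not an API of xcp.
    """
    if isinstance(data, bytes):
        # Explicit codec and error handler: no implicit ASCII decode.
        return data.decode(encoding, errors)
    # Already text: return unchanged instead of decoding a second time.
    return data
```

Calling it on data that is already text is then a no-op, so a call site that is reached twice does not corrupt the data.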

Option 2, passing encoding="utf-8", errors="replace" (which replaces undecodable bytes with the U+FFFD replacement character instead of raising an exception) through a **kwargs keyword parameter dict that is populated on python3 and empty on python2, is the easiest and safest option that I can imagine.
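
A minimal sketch of that idea follows. The names `TEXT_KWARGS`, `read_text` and `run_text` are assumptions made for this example and are not necessarily what the commits in this PR use:

```python
import subprocess
import sys

# Hypothetical name: keyword arguments shared by open() and Popen().
if sys.version_info >= (3, 0):
    TEXT_KWARGS = {"encoding": "utf-8", "errors": "replace"}
else:
    # Python 2: open() and Popen() do not accept these arguments.
    TEXT_KWARGS = {}


def read_text(path):
    """Read a file as text, independent of the process locale."""
    with open(path, **TEXT_KWARGS) as f:
        return f.read()


def run_text(argv):
    """Run a command and return its stdout as text."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, **TEXT_KWARGS)
    out, _ = proc.communicate()
    return out
```

With errors="replace", a stray invalid byte in the input ends up as a replacement character in the result rather than aborting the caller with an exception.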

ydirson and others added 28 commits January 20, 2023 17:45
Use of `unicode` needed to be immediately handled, but a few checks
relying on `str` could become insufficient in python2 with the larger
usage of unicode strings.

Signed-off-by: Yann Dirson <[email protected]>
…s to

open() as this is considered best practice.

(cherry picked from cpython commit 6cef076ba5edbfa42239924951d8acbb087b3b19)

Signed-off-by: Yann Dirson <[email protected]>
Running tests on python3 did reveal some of them.

Signed-off-by: Yann Dirson <[email protected]>
There is no guarantee about ordering of dict elements, and tests compare
results derived from enumerating a dict element.  We could have used an
OrderedDict to store the formulae and get a predictable output order, but
just considering the output as a set seems better.

Only applying this to rules expected to hold more than one element.

Signed-off-by: Yann Dirson <[email protected]>
Caught by extended test.

Signed-off-by: Yann Dirson <[email protected]>
FIXME: I'm quite unsure why xcp.xmlunwrap would want to use bytes and not
unicode strings, but the encode/decode calls make it quite clear it wants
to work with bytes.  That makes the API painful to use in python3.
hashlib came with python 2.5, and old md5 module disappears in 3.0

Signed-off-by: Yann Dirson <[email protected]>
This is supposed to be just a module renaming to conform to PEP8, see
https://docs.python.org/3/whatsnew/3.0.html#library-changes

The SafeConfigParser class has been renamed to ConfigParser in Python
3.2, and backported as addon package.  The `readfp` method now
triggers a deprecation warning to replace it with `read_file`.

FIXME: With python3 some Accessor implementations (e.g. FileAccessor)
provide a text stream for repository config (and with python2 all
implementations), while others (e.g. HTTPAccessor) provide a binary
stream.  But on python3 ConfigParser will bomb out if given a binary
stream, so use a TextIOWrapper to access the config.  This is a hack,
which cannot be used when it is binary data which has to be read (see
later commits), so I don't consider this commit to be correct in that
respect.
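
The TextIOWrapper approach mentioned in the commit message above can be sketched roughly as follows. This is an illustration only: the BytesIO object stands in for a binary stream such as one an HTTP accessor might return, and the section and option names are made up:

```python
import configparser
import io

# Stand-in for a binary stream; the config content is hypothetical.
binary_stream = io.BytesIO(b"[repo]\nname = caf\xc3\xa9\n")

# ConfigParser.read_file() needs text on Python 3, so wrap the bytes.
text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")

parser = configparser.ConfigParser()
parser.read_file(text_stream)
name = parser.get("repo", "name")
```

As the commit message notes, this only helps when the payload really is text; it cannot be used for streams that carry binary data.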
Testing several accessor classes causes code duplication, which can be
avoided with help from the `parametrized` package (unfortunately, `pytest`
support cannot be used together with `unittest`).

Not a big deal right now, but starts becoming painful when adding new tests
or testing other Accessor classes.

Signed-off-by: Yann Dirson <[email protected]>
This test uses the same kind of I/O (file copy) that prepare_host_upgrade.py
does.

FIXME: the copy cannot proceed this way in python3
This works properly for the http case, but FileAccessor provides us with
a text fileobj handle, and `read()` gets a UTF-8 decoding error.

FIXME: Accessor ctor requires a `mode` argument
Reported under python3 for members created on-the-fly in `setUp()`

Signed-off-by: Yann Dirson <[email protected]>
With python3, pylint complains about `else: raise()` constructs.
This rework avoids them and reduces cyclomatic complexity by using
the error-out-first idiom.

Signed-off-by: Yann Dirson <[email protected]>
diff-cover defaults to origin/main in new version, it seems.

Signed-off-by: Yann Dirson <[email protected]>
Also use xcp.xcp_popen_text_kwargs for all affected unit tests
because they need to handle the encoding decode/encode likewise.
Now that we use encoding="utf-8" to open /usr/share/hwdata/pci.ids,
enhance the test case to ensure that xcp.pci does not crash and
returns the expected output when the UTF-8 characters present in
/usr/share/hwdata/pci.ids are included in the unit test.
We might be called with the locale not set, in which case python2's default
charset is ASCII. This happens when code is running as an xapi-plugin.

For example, this happens with the ACK plugin which uses xcp.pci.PCIIds()

This means we have to test that the code uses encoding="utf-8" correctly
in all cases (python2 and python3), and this test adds that check while
processing UTF-8 data in xcp.cmd and xcp.pci.PCIIds().read()
@psafont
Member

psafont commented Apr 24, 2023

The DCO check is failing because the signoffs are missing from some commits. To easily sign them off, run git rebase --signoff 09ba68aca491f26e8552e935fe36a6af634b55a5

@bernhardkaindl bernhardkaindl changed the title from "Testsuite driven py3, with UTF-8 encoding fixes + tests for xcp.pci, .cmd and .net.biosdevname" to "UTF-8 encoding fixes + tests for xcp.pci, .cmd and .net.biosdevname" May 15, 2023
@bernhardkaindl bernhardkaindl added the "Will be fixed by other PRs" (Leave until as issues in it are fixed by other PRs) and "bug" labels May 15, 2023
@bernhardkaindl
Collaborator Author

All commits of this PR have now been refactored and merged/obsoleted by PRs which are now merged.

4 participants