Description
Still exploring this, but it was uncovered in issue #643...
If the config is char-encoding: latin1, and an invalid character, say 0x96 (decimal 150), is found in an HTML comment, then when tidy encounters that 0x96 it uses DecodeWin1252 to get the Unicode code point 0x2013 (decimal 8211) to store in the lexer, and outputs an INVALID_SGML_CHARS warning:
line ?? column ?? - Warning: replacing invalid character code 150
...
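To make the decoding step concrete, here is a minimal sketch of the kind of lookup involved. It is not tidy's DecodeWin1252 source; the table and the name sketch_decode_win1252 are assumptions for illustration, built from the standard Windows-1252 assignments for bytes 0x80 to 0x9F.

```c
/* Sketch only: a stand-in for the kind of lookup DecodeWin1252 performs.
   The table holds the standard Windows-1252 assignments for 0x80..0x9F;
   0 marks the five bytes that Windows-1252 leaves undefined. */
static const unsigned int win1252_high[32] = {
    0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
    0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
};

/* Hypothetical name, not tidy's function. Returns the mapped Unicode code
   point for a defined byte in 0x80..0x9F, and the input unchanged otherwise. */
static unsigned int sketch_decode_win1252(unsigned int c)
{
    if (c >= 0x80 && c <= 0x9F && win1252_high[c - 0x80] != 0)
        return win1252_high[c - 0x80];
    return c;
}
```

With this table, sketch_decode_win1252(0x96) yields 0x2013 (EN DASH), which matches the substitution described above.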
But when that comment text, with the substitute character 0x2013 (decimal 8211), is output to the HTML file, it is written raw as 0x13 in the document... Or maybe as 0x13 0x20, or maybe 0x20 0x13, but the point is that the 0x13 (decimal 19) makes it into the document output stream... This is the non-printable DC3 control character, and it has no place in any text document...
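The arithmetic behind that stray byte can be sketched as follows. This does not reproduce tidy's output code, and which byte order (if any) is actually written is still unconfirmed; the helper names are hypothetical and only show how 0x2013 splits into the bytes 0x20 and 0x13 when a code point above 0xFF is written without proper encoding.

```c
/* Hypothetical helpers, not part of tidy. */

/* Low 8 bits of a code point: what survives if it is narrowed to one byte. */
static unsigned char low_byte(unsigned int cp)
{
    return (unsigned char)(cp & 0xFF);
}

/* Next 8 bits of a code point: the other half of a raw 16-bit write. */
static unsigned char high_byte(unsigned int cp)
{
    return (unsigned char)((cp >> 8) & 0xFF);
}

/* True for C0 control bytes other than tab, line feed and carriage return,
   i.e. bytes that should not appear in a text document. */
static int is_stray_control(unsigned char b)
{
    return b < 0x20 && b != '\t' && b != '\n' && b != '\r';
}
```

For 0x2013, low_byte gives 0x13 (DC3) and high_byte gives 0x20 (space), so is_stray_control(low_byte(0x2013)) is true. A check along these lines over the output file might be one way to detect the problem in a regression test.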
A sample HTML document with this 0x96 character can be found in our regression test repo, case-445557.html. It is the first - after the word had, in the comment on the 2nd line...
Of course, make sure you use char-encoding: latin1 in testing, although the problem may exist even when other character encodings are used...
As stated, still to fully explore and understand this... any help appreciated... thanks...