Potential bug in html comment output #651

Open
@geoffmcl

Still exploring this, but it was uncovered in issue #643...

If the config is char-encoding: latin1, and an invalid character, say 0x96 (decimal 150), is found in an HTML comment, tidy uses DecodeWin1252 to map that 0x96 to the Unicode code point 0x2013 (decimal 8211, an en dash), stores it in the lexer, and outputs a warning, INVALID_SGML_CHARS: line ?? column ?? - Warning: replacing invalid character code 150...
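The mapping described above can be checked independently of tidy. The sketch below uses Python's built-in cp1252 codec, not tidy's DecodeWin1252, on the assumption that both follow the standard Windows-1252 table for this byte. The helper name win1252_to_codepoint is made up for illustration.

```python
def win1252_to_codepoint(byte_value: int) -> int:
    """Return the Unicode code point that Windows-1252 assigns to one byte."""
    # Assumption: Python's cp1252 codec stands in for tidy's DecodeWin1252.
    return ord(bytes([byte_value]).decode("cp1252"))


codepoint = win1252_to_codepoint(0x96)
print(f"0x96 -> U+{codepoint:04X} (decimal {codepoint})")  # 0x96 -> U+2013 (decimal 8211)
```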

But when that comment text, now containing the substitute character 0x2013 (8211), is output to the HTML file, it is raw encoded as 0x13 in the document...

Or maybe as 0x13 0x20, or maybe 0x20 0x13; either way, the point is that the 0x13 (decimal 19) makes it into the document output stream... This is the non-printable DC3 control character, and it has no place in any text document...
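To make these byte values concrete: U+2013 cannot be represented in latin1 at all, and the three candidate outputs match either truncating the code point to its low byte or writing it as a 16-bit value in one of the two byte orders. Whether tidy's output routine actually does any of these is not established here; this is only a sketch of where 0x13 and 0x20 could come from.

```python
import unicodedata

EN_DASH = "\u2013"  # the substitute character stored in the lexer

# latin1 only covers U+0000..U+00FF, so a strict encoder must reject U+2013.
try:
    EN_DASH.encode("latin-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

low_byte = ord(EN_DASH) & 0xFF               # truncation to 8 bits (hypothesis)
big_endian = EN_DASH.encode("utf-16-be")     # high byte first
little_endian = EN_DASH.encode("utf-16-le")  # low byte first
category = unicodedata.category(chr(low_byte))

print(fits_latin1)             # False
print(hex(low_byte))           # 0x13
print(big_endian.hex(" "))     # 20 13
print(little_endian.hex(" "))  # 13 20
print(category)                # Cc (a control character)
```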

A sample HTML document containing this 0x96 character can be found in our regression test repo: case-445557.html. The character is the first "-" after the word "had" in the 2nd line comment...

Of course, make sure you use char-encoding: latin1 in testing, although the problem may also exist when other character encodings are used...
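When testing, one quick way to confirm whether the stray byte reached the output is to scan the written file for C0 control bytes other than tab, line feed and carriage return. This is a minimal sketch: find_control_bytes is a hypothetical helper, and the sample bytes are made up, not the content of case-445557.html.

```python
from typing import List, Tuple

ALLOWED = {0x09, 0x0A, 0x0D}  # tab, LF and CR are normal in text files


def find_control_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte) pairs for C0 control bytes that should not appear in text."""
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in ALLOWED]


# Made-up sample: a comment with a stray DC3 (0x13) byte in it.
sample = b"<!-- had \x13 a dash -->\n"
print(find_control_bytes(sample))  # [(9, 19)]
```

To check a real output file, read it in binary mode and pass the bytes in, for example find_control_bytes(open(path, "rb").read()), where path is whatever file tidy wrote.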

As stated, I still need to fully explore and understand this... any help appreciated... thanks...
