Potential bug in html comment output

Still exploring this, but it was uncovered in issue #643...

If the config is `char-encoding: latin1` and an invalid character, say 0x96 (decimal 150), is found in an HTML comment, tidy uses `DecodeWin1252` to map that `0x96` to Unicode 0x2013 (decimal 8211), stores the result in the lexer, and outputs an `INVALID_SGML_CHARS` warning: `line ?? column ?? - Warning: replacing invalid character code 150`...
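To make the mapping concrete, here is a minimal Python sketch of the substitution described above. It is not tidy's `DecodeWin1252`; the function name is hypothetical, and it leans on Python's built-in `cp1252` codec, assuming that codec agrees with tidy's table for the byte in question (0x96).

```python
from typing import Tuple


def decode_latin1_with_win1252_fallback(byte_value: int) -> Tuple[int, bool]:
    """Hypothetical helper: return (code_point, was_replaced) for one input byte.

    Bytes 0x80-0x9F are C1 controls in latin1, so they are treated as
    Windows-1252 instead, which is the behaviour this issue describes.
    """
    if 0x80 <= byte_value <= 0x9F:
        try:
            ch = bytes([byte_value]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) have no Windows-1252 mapping in Python.
            return 0xFFFD, True
        return ord(ch), True
    return byte_value, False


code_point, replaced = decode_latin1_with_win1252_fallback(0x96)
print(hex(code_point), code_point, replaced)  # 0x2013 8211 True
```

So the value held in the lexer is a code point well above 0xFF, which a single latin1 output byte cannot represent.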

But when that comment text, now containing the substitute character 0x2013 (8211), is **output** to the HTML file, it will be **raw** encoded as `0x13` in the document...

Or maybe as `0x13 0x20`, or maybe `0x20 0x13`, but the point is that the `0x13` (decimal 19) makes it into the document output stream... This is the non-printable `DC3` control character, and it has no place in **any** text document...
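A small Python sketch of how a `0x13` could end up in the stream. This is only a guess at the mechanism (keeping the low byte, or writing the 16-bit value raw); I have not confirmed which one, if any, tidy actually does. The function names are hypothetical, and the numeric character reference is just one possible safe fallback. Note that references are not expanded inside an HTML comment, so a real fix might prefer a different substitution there.

```python
def raw_low_byte(code_point: int) -> bytes:
    """Assumed faulty path: keep only the low 8 bits of the code point."""
    return bytes([code_point & 0xFF])


def encode_for_latin1(code_point: int) -> bytes:
    """Hypothetical safe path: never emit a byte the code point did not ask for."""
    if code_point <= 0xFF:
        return bytes([code_point])
    # Not representable in latin1: fall back to a numeric character reference.
    return "&#{};".format(code_point).encode("ascii")


EN_DASH = 0x2013

print(raw_low_byte(EN_DASH))             # b'\x13'
print(EN_DASH.to_bytes(2, "big"))        # the `0x20 0x13` variant
print(EN_DASH.to_bytes(2, "little"))     # the `0x13 0x20` variant
print(encode_for_latin1(EN_DASH))        # b'&#8211;'
```

In all three raw variants the `DC3` byte is present, which matches what is seen in the output regardless of the exact byte order.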

A sample HTML document containing this 0x96 character can be found in our **regression** test repo: [case-445557.html](https://github.com/htacg/tidy-html5-tests/blob/next/cases/testbase/case-445557.html). It is the first `-` after the word `had` in the 2nd line comment...

Of course, make sure you use `char-encoding: latin1` when testing, although the problem may exist even when other character encodings are used...

As stated, I still need to fully explore and understand this... any help appreciated... thanks...
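When checking the output under different encodings, something like the following Python sketch can flag stray control bytes. The helper name and the sample bytes are made up for illustration; in practice you would read the bytes of the file tidy wrote.

```python
# Tab, line feed and carriage return are the only C0 controls expected in text.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def find_control_bytes(data: bytes):
    """Return (offset, byte) pairs for C0 control bytes that should not be in text."""
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if value < 0x20 and value not in ALLOWED_CONTROLS
    ]


# Hypothetical output fragment containing the DC3 byte described in this issue.
sample = b"<!-- had \x13 here -->\n"
print(find_control_bytes(sample))  # [(9, 19)]
```

An empty list for the same input under another `char-encoding` would suggest the problem is specific to latin1 output.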



Potential bug in html comment output #651
