Description
Consider this page, specifically this line from its html:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
From my testing, tidy doesn't seem to respect that encoding. Instead, if I read the code correctly, src/clean.c:2316 forcibly replaces the "windows-1252" value with "utf-8".
The problem is that when tidy processes the page above, the output is not valid UTF-8. For example, the accented character near the string "des Mondes" gets destroyed (grep for that string and you should see it).
In my own code, this line
tidyOptSetInt(tidyDoc, TidyInCharEncoding, TidyEncWin1252);
fixes the issue, and I get valid UTF-8 out with the accents intact. But I can't hardcode that, because it would make my code incorrect for every other HTML page out there. I also can't recover the encoding from the cleaned HTML, because tidy has already overwritten it.
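A possible way to avoid hardcoding, sketched here under assumptions rather than as anything tidy provides: sniff the declared charset from the raw bytes before tidy sees them, then pick the input encoding per page. The helpers `find_ci`, `sniff_meta_charset` and `tidy_name_for` are hypothetical names, the 1024-byte scan limit is arbitrary, and the mapping covers only a few labels. The short names it returns ("win1252", "latin1", "utf8") are what I understand tidy's input-encoding option to accept.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: case-insensitive search for `needle`
 * within the first `len` bytes of `hay`. */
const char *find_ci(const char *hay, size_t len, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0 || len < n)
        return NULL;
    for (size_t i = 0; i + n <= len; i++) {
        size_t j = 0;
        while (j < n &&
               tolower((unsigned char)hay[i + j]) ==
               tolower((unsigned char)needle[j]))
            j++;
        if (j == n)
            return hay + i;
    }
    return NULL;
}

/* Hypothetical helper: copy the charset label declared in the raw HTML
 * into `out`, lowercased. Returns 1 if a label was found, 0 otherwise.
 * Only the first 1024 bytes are scanned (an arbitrary limit). */
int sniff_meta_charset(const char *html, size_t len, char *out, size_t outsz)
{
    if (len > 1024)
        len = 1024;
    const char *p = find_ci(html, len, "charset=");
    if (p == NULL || outsz == 0)
        return 0;
    const char *end = html + len;
    p += strlen("charset=");
    if (p < end && (*p == '"' || *p == '\''))
        p++; /* skip an optional opening quote */
    size_t k = 0;
    while (p < end && k + 1 < outsz &&
           (isalnum((unsigned char)*p) || *p == '-' || *p == '_'))
        out[k++] = (char)tolower((unsigned char)*p++);
    out[k] = '\0';
    return k > 0;
}

/* Hypothetical mapping from a lowercased charset label to a tidy
 * encoding name. Deliberately incomplete; returns NULL when unknown. */
const char *tidy_name_for(const char *label)
{
    if (strcmp(label, "windows-1252") == 0 || strcmp(label, "cp1252") == 0)
        return "win1252";
    if (strcmp(label, "iso-8859-1") == 0)
        return "latin1";
    if (strcmp(label, "utf-8") == 0)
        return "utf8";
    return NULL;
}
```

If the mapping returns a name, it could be passed to something like `tidySetInCharEncoding(tidyDoc, name)` before parsing instead of the hardcoded `TidyEncWin1252`. This is a plain substring scan, not an HTML parser, so a "charset=" that appears in a comment or script near the top of the page would fool it.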
I think one of these two things should happen:
- tidy leaves the encoding declaration as-is, so I can find the relevant meta tag myself through XPath and convert to UTF-8 with iconv
- tidy reads and respects that encoding when converting to UTF-8
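For illustration, the iconv step in the first option might look roughly like the sketch below. It assumes the charset label has already been extracted from the meta tag and that the platform's iconv recognizes it (for example "WINDOWS-1252"). `to_utf8` is a hypothetical helper name, not part of tidy or iconv.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert `in_len` bytes of `in` from `from_charset`
 * to a freshly malloc'd, NUL-terminated UTF-8 string.
 * Returns NULL on any failure; the caller frees the result. */
char *to_utf8(const char *from_charset, const char *in, size_t in_len)
{
    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL; /* charset not supported by this iconv */

    /* 4 output bytes per input byte is a generous upper bound. */
    size_t out_cap = in_len * 4 + 1;
    char *out = malloc(out_cap);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in; /* iconv takes char **, it does not modify the input */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_cap - 1; /* keep one byte for the terminator */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left); /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1) {
        free(out);
        return NULL; /* invalid or incomplete input sequence */
    }
    *dst = '\0';
    return out;
}
```

A call such as `to_utf8("WINDOWS-1252", buf, len)` would then produce the UTF-8 bytes to hand on. On some platforms iconv lives in a separate library and needs `-liconv` at link time.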
I don't know if I'm doing something wrong in the way I call tidy, but after trying several options I still can't get it to give me a correctly converted UTF-8 HTML string.