Problem with tests 427664 and 427672 on certain OSes! #3
Description
There is a difference in the output message text on ARM / Raspberry Pi 2 (RPI), first reported as Issue 258, Issue 266, and Issue 269 by @vielmetti, back on Sep 13, 2015. Thank you for that report. Maybe others...
One or both may also be a problem on Mac OS X, reported by @balthisar. To be verified.
First, to try to examine the exact reason for these two tests...
Test 427664 - now https://sourceforge.net/p/tidy/bugs/4/
#4 Missing attr values cause NULL segfault
Created: 2001-05-27 Creator: Terry Teague
Test 427672 - now https://sourceforge.net/p/tidy/bugs/10/
#10 Non-std attrs w/multibyte names segfault
Created: 2001-05-27 Creator: Terry Teague
Both these test inputs existed in the SF CVS source, without a special config file. Both reported a segfault at that time! And both input files seem exactly the same as in this github tests repo in a binary compare, namely in_427664.html == case-427664.html and in_427672.html == case-427672.html! So no change has been made in the inputs. Of course, the SF CVS has no `testbase-expects` output to compare with...
However, re-running `tidy04aug00`, even adding the suggested `-utf8` option, on each file, does NOT produce a segfault, as far as I can see...
But running `tidy04aug00`, for which I do not have the source, on both inputs, using DrMemory, does show it has -
Error #1: UNADDRESSABLE ACCESS beyond top of stack: reading 4 byte(s)
But this is not exactly a segfault due to a NULL pointer! And repeating the tests using `tidy2000`, for which we do have the source, does not show any problems...
And while, for some reason, I cannot yet run DrMemory using the current tidy 5.1.45++, it also appears to not have a segfault! I also need to try in linux using valgrind, ASAN, testing...
But, for sure, that segfault, the reason for the two tests, seems to have been solved.
So there remains this mystery of the character encoding differences in the message output in certain OS environments, which still needs to be solved.
What is in the `testbase` input, and `testbase-expects`?
Essentially both input files have `<body name="xx">`. A comment in the files says the `name` is supposed to be 2 bytes, hex `c3 87`, but it is not! Now maybe this is a corruption from a long way back, but even in the SF CVS source the `name` is a 4 byte sequence of `C3 31 2F 32`.
Thus, in their present state, both inputs do not verify as valid utf-8 text. They would if changed back to the `c3 87` given in the comment, and it is yet to be tested whether that changes the situation.
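The validity claim can be checked independently of Tidy. What follows is a minimal Python sketch that assumes only the byte values quoted above; the helper name `is_valid_utf8` is hypothetical and not part of any Tidy tooling.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Attribute name as currently found in both inputs (4 bytes).
current_name = bytes.fromhex("c3312f32")
# Attribute name as described by the comment in the files (2 bytes).
commented_name = bytes.fromhex("c387")

# C3 opens a 2-byte sequence, but 31 is not a continuation byte.
print(is_valid_utf8(current_name))
print(is_valid_utf8(commented_name))
# Code point of the single character the 2-byte form decodes to.
print(f"U+{ord(commented_name.decode('utf-8')):04X}")
```

The 4 byte form fails strict decoding, while `c3 87` decodes to the single character U+00C7.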
In parsing this document, tidy finds this 4 byte sequence is not a valid attribute `name`, and outputs a warning. Now it is the value output for that `name` in the warning message that differs in RPI OS. And maybe in OS X, still to be verified.
Tidy in Windows and Ubuntu linux consistently outputs a 9 byte sequence `EF BF BF EF BF BF 31 2F 32`, and this is what is in `testbase-expects`, so the compare is exact. No problem.
While Tidy in RPI outputs, in `testbase-results`, a 7 byte sequence `c3 83 c2 83 31 2f 32`, so the diff fails. A problem.
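To see what the two outputs actually contain, both byte sequences can be decoded and listed as code points. This is only a sketch using the bytes quoted above; the helper name `code_points` is hypothetical, and it says nothing about why Tidy emits different bytes on each platform.

```python
# Bytes emitted on Windows / Ubuntu, as stored in testbase-expects.
expected = bytes.fromhex("efbfbfefbfbf312f32")
# Bytes emitted on RPI, as found in testbase-results.
rpi = bytes.fromhex("c383c283312f32")


def code_points(data: bytes) -> list:
    """Decode data as UTF-8 and return its code points as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in data.decode("utf-8")]


print(len(expected), code_points(expected))
print(len(rpi), code_points(rpi))
```

Both sequences are valid utf-8 and end in the same `1/2`. Only the first two code points differ: U+FFFF U+FFFF in the expected output versus U+00C3 U+0083 on RPI.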
What can we do?
- Reduce the attribute name to just `1/2`, which is still invalid, so keeps the test's meaning.
- Change the file back to valid utf-8 `c3 87`, change the expected accordingly - to be tested.
- Maybe a fix in Tidy code could force RPI to use `EF BF BF` output.
- If also a problem in OS X, maybe exclude the 2 tests.
- If only in RPI, be ready to explain that this difference exists in these 2 tests.
- Other choices?
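The second option, restoring the attribute name to `c3 87`, could be tried with a small byte-level replacement. This is a hypothetical sketch only: the function name `restore_attr_name` and the sample markup are assumptions for illustration, and the real input files should be edited on copies and re-run through the tests before the expected output is changed.

```python
BAD_NAME = bytes.fromhex("c3312f32")  # 4 byte name currently in the inputs
GOOD_NAME = bytes.fromhex("c387")     # 2 byte name given in the file comment


def restore_attr_name(data: bytes) -> bytes:
    """Replace every occurrence of the 4 byte name with the 2 byte one."""
    return data.replace(BAD_NAME, GOOD_NAME)


# Hypothetical stand-in for the relevant line of an input file.
sample = b"<body " + BAD_NAME + b'="xx">'
fixed = restore_attr_name(sample)

print(fixed.hex(" "))
# Strict decoding now succeeds, so the sample is valid utf-8.
print(fixed.decode("utf-8").isprintable())
```

Working on raw bytes avoids any decode step, which would fail on the current inputs because they are not valid utf-8.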
I seek ideas and comments on what would be best.
As previously expressed, I think it is important that we have a consistent set of tests across all OSes, and to not have to try and explain a difference every time someone stumbles across it.
Help Needed