Skip to content

Wrong code with optimization on i386 FreeBSD #40569

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
llvmbot opened this issue Mar 25, 2019 · 5 comments
Open

Wrong code with optimization on i386 FreeBSD #40569

llvmbot opened this issue Mar 25, 2019 · 5 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Copy link
Member

llvmbot commented Mar 25, 2019

Bugzilla Link 41224
Version 7.0
OS FreeBSD
Attachments Program that demonstrates the issue, Assembly when a.c is compiled without -DKLUDGE, Assembly when a.c is compiled with -DKLUDGE
Reporter LLVM Bugzilla Contributor
CC @andykaylor,@topperc,@DimitryAndric,@emaste

Extended Description

The attached testcase, a.c, demonstrates a code generation issue on FreeBSD running on an i686 class hardware (i.e., 32-bit i386/387). FreeBSD sets the i387 FPU to a 53-bit precision when the FPU is first accessed. clang or llvm seems to have no knowledge of this setting and unconditionally assumes a 64-bit precision. This leads to wrong for floating point codes that use the 32-bit float type when optimization is used. Consider,

gcc8 (FreeBSD Ports Collection) 8.3.0

gcc8 -fno-builtin -O0 -o z a.c -lm && ./z
gcc8 -fno-builtin -O1 -o z a.c -lm && ./z
gcc8 -fno-builtin -O2 -o z a.c -lm && ./z
gcc8 -fno-builtin -O3 -o z a.c -lm && ./z

The above command lines yield

Maximum ULP: 2.297073

of ULP > 21: 0

This is the expected result.

gcc8 -fno-builtin -O0 -DKLUDGE -o z a.c -lm && ./z
gcc8 -fno-builtin -O1 -DKLUDGE -o z a.c -lm && ./z
gcc8 -fno-builtin -O2 -DKLUDGE -o z a.c -lm && ./z
gcc8 -fno-builtin -O3 -DKLUDGE -o z a.c -lm && ./z

The above command lines yield

Maximum ULP: 2.297073

of ULP > 21: 0

This is the expected result.

Now, consider

FreeBSD clang version 7.0.1 (tags/RELEASE_701/final 349250)
(based on LLVM 7.0.1)
Target: i386-unknown-freebsd13.0
Thread model: posix

/usr/bin/clang -fno-builtin -O0 -o z a.c -lm && ./z

The above command line yields

Maximum ULP: 2.297073

of ULP > 21: 0

This is the expected result.

/usr/bin/clang -fno-builtin -O1 -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O2 -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O3 -o z a.c -lm && ./z

The above command lines yield

Maximum ULP: 23.061242

of ULP > 21: 39

This is not the expected result. In fact, in my numerical testsuite I have observed 6 digit Max ULP estimates (i.e., only a single digit is correct).

/usr/bin/clang -fno-builtin -O0 -DKLUDGE -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O1 -DKLUDGE -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O2 -DKLUDGE -o z a.c -lm && ./z
/usr/bin/clang -fno-builtin -O3 -DKLUDGE -o z a.c -lm && ./z

The above command lines yield

Maximum ULP: 2.297073

of ULP > 21: 0

which is again the expected results. The -DKLUDGE option causes
the source to use 'volatile float x, y' instead of just 'float x, y'.
AFAICT, from the generated asm (see attachments), the use of volatile
forces clang to spill/reload x, y (thus, using the correct precision
for the type).

@llvmbot
Copy link
Member Author

llvmbot commented Mar 25, 2019

Changed the "Importance" field to "normal".

@llvmbot
Copy link
Member Author

llvmbot commented Mar 25, 2019

A discussion of the bug can be found in the FreeBSD toolchain mailing list archive at

https://lists.freebsd.org/pipermail/freebsd-toolchain/2019-March/004458.html

@topperc
Copy link
Collaborator

topperc commented Mar 26, 2019

Myself and Andy Kaylor here at Intel spent a great deal of time playing around with this today using both clang and gcc. I don't believe the precision being set to 53 is necessary to hit the issue. I was able to get the test to report errors on the normal linux configuration.

As I think we mentioned in the freebsd mailing list, we are passing x and y to dp_csinh without rounding to float after the additions. We spill them around the call to csinhf using a 64 bit stack slot. This is due to the X87 codegen treating an fp_extend as nothing more than a copy from float register class to double register class. And the register coalescer rewriting some of the machine IR to use the double register class. This causes spill slot size to be calculated as 64-bits.

Using volatile circumvents this and forces a store as float instead of the spill. This causes the value to be rounded to float before being extended to double.

gcc's codegen seems to be affected by -std=c11 vs -std=gnu11. I believe the difference is really -fexcess-precision=standard vs -fexcess-precision=fast. When -fexcess-precision=fast is in effect, gcc's maximum ULP gets worse if dp_csinh is called before csinhf. Though it does not exceed the limit of 21.

It looks as though -fexcess-precision=standard causes gcc to insert intermediate casts to long double in their intermediate representation. At least on linux. That's what I observed from dumping the output of the 004t.original in godbolt. They might be casts to double on freebsd when the precision is 53 bits. I assume this has some effect on how instructions are generated later.

Another interesting note, I noticed clang does set FLT_EVAL_METHOD to 1 instead of 2 on some versions of NetBSD without SSE. But not for FreeBSD and we don't seem to do anything with the information other than set the define.

@llvmbot
Copy link
Member Author

llvmbot commented Mar 26, 2019

Myself and Andy Kaylor here at Intel spent a great deal of time playing
around with this today using both clang and gcc. I don't believe the
precision being set to 53 is necessary to hit the issue. I was able to get
the test to report errors on the normal linux configuration.

Craig,

Thanks for taking a look at this issue. FreeBSD on i386/387
has been using 53-bit precision for likely more than 2 decades.
To save you some code spelunking, one finds (sorry about long url)

https://svnweb.freebsd.org/base/head/sys/x86/include/fpu.h?revision=314436view=markup

lines 186-208 are

/*
* The hardware default control word for i387's and later coprocessors is
* 0x37F, giving:
*
*      round to nearest
*      64-bit precision
*      all exceptions masked.
*
* FreeBSD/i386 uses 53 bit precision for things like fadd/fsub/fsqrt etc
* because of the difference between memory and fpu register stack arguments.
* If its using an intermediate fpu register, it has 80/64 bits to work
* with.  If it uses memory, it has 64/53 bits to work with.  However,
* gcc is aware of this and goes to a fair bit of trouble to make the
* best use of it.
*
* This is mostly academic for AMD64, because the ABI prefers the use
* SSE2 based math.  For FreeBSD/amd64, we go with the default settings.
*/
#define __INITIAL_FPUCW__       0x037F
#define __INITIAL_FPUCW_I386__  0x127F
#define __INITIAL_NPXCW__       __INITIAL_FPUCW_I386__
#define __INITIAL_MXCSR__       0x1F80
#define __INITIAL_MXCSR_MASK__  0xFFBF

The comment above that refers to GCC knowing about this setting can
be traced to

https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/config/i386/freebsd.h?revision=267494&view=markup

lines 120-123 are

/* FreeBSD sets the rounding precision of the FPU to 53 bits.  Let the
   compiler get the contents of <float.h> and std::numeric_limits correct. */
#undef TARGET_96_ROUND_53_LONG_DOUBLE
#define TARGET_96_ROUND_53_LONG_DOUBLE (!TARGET_64BIT)

I've looked through GCC sources, and know the above effects the
processing of gcc/config/i386/i386-modes.def, but that's as far
as I'm able to go at the moment.

--
steve

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@llvmbot
Copy link
Member Author

llvmbot commented Nov 5, 2022

@llvm/issue-subscribers-backend-x86

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

3 participants