Memory issues when using libFLAME dpotrf_() with OpenBLAS #2562

Closed
gmargari opened this issue Apr 15, 2020 · 22 comments

@gmargari

gmargari commented Apr 15, 2020

Hello all,
I'm running this minimal test file with DPOTRF, linking it against libFLAME and OpenBLAS. I tried various matrix dimensions from 2 up to 32768 (an 8 GB matrix), and ran it with 1, 2, 4, and 8 threads.
I have this weird behaviour:

  • when the matrix dimension is 2, 4, ..., 16384, everything is OK
  • when the matrix dimension is 32768 and the number of threads is 1, 4, or 8, it's OK
  • when the matrix dimension is 32768 and the number of threads is 2, I get all kinds of different memory errors (I have tried both release and debug builds), from bus errors and segmentation faults inside OpenBLAS to errors in my free() at the end of the program ("invalid size", "double free or corruption", "munmap_chunk() failed", etc.)

I get this on both an Intel i7-6800K and an AMD Ryzen Threadripper 1950X (both with 64 GB RAM).
Maybe I'm messing something up, but I don't know what. Not sure if this has anything to do with #1137 since I'm getting bus errors, but currently I can't test a more recent version on our internal network.
I'm using OpenBLAS 0.3.9.

It may also be related to #352.
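For reference, a minimal sketch of an equivalent driver (not the exact test file; the thread count is assumed to be controlled via OMP_NUM_THREADS or OPENBLAS_NUM_THREADS):

    /* Sketch of an equivalent test: allocate an n x n SPD matrix in
     * column-major order, factor it in place with dpotrf_, free it. */
    #include <stdio.h>
    #include <stdlib.h>   /* malloc / free */

    /* Fortran LAPACK symbol as exported by libFLAME / OpenBLAS */
    extern void dpotrf_(const char *uplo, const int *n, double *a,
                        const int *lda, int *info);

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 32768;
        int lda = n, info = 0;

        double *a = malloc((size_t)n * (size_t)n * sizeof(double));
        if (!a) { fprintf(stderr, "allocation failed\n"); return 1; }

        /* Trivially SPD: large diagonal, small constant off-diagonal. */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                a[(size_t)j * n + i] = (i == j) ? (double)n : 0.5;

        dpotrf_("L", &n, a, &lda, &info);
        printf("dpotrf info = %d\n", info);

        free(a);   /* the reported crashes often surface here */
        return 0;
    }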


Background:
I'm running some benchmarks to compare MKL LAPACK vs libFLAME performance, using MKL BLAS or OpenBLAS as the underlying BLAS library.

I first benchmarked MKL BLAS vs OpenBLAS on DDOT, DGEMV, and DGEMM.
I tested matrix sizes up to 32768 x 32768 (8 GB), with 1, 2, 4, and 8 threads.
Everything was OK (by the way, OpenBLAS was very close to MKL on Intel CPUs, and 40%-60% faster on AMD).

I repeated the same settings, but now testing MKL LAPACK vs libFLAME on DPOTRF, DGETRF, and DSYTRF.
When using MKL BLAS as the underlying BLAS library for both, everything is OK.
When I use libFLAME with OpenBLAS, everything is OK until I run DPOTRF with size 32768 and 2 threads.

Unfortunately I'm working on an internal corporate network, so I can't copy out much information such as lscpu output, dumps, or benchmark results (I copied the test file by hand).

@martin-frbg
Collaborator

martin-frbg commented Apr 15, 2020

(1) this should not happen(TM)
(2) DPOTRF is actually one of the few LAPACK functions that OpenBLAS replaces with its own optimized version, so you will not get to test libFLAME performance in this case even if things do not blow up (this does not apply if you built OpenBLAS without its LAPACK, though).
(3) for MKL on AMD you will need to set the environment variable MKL_DEBUG_CPU_TYPE=5 to make it accept that the "weird foreign" CPU actually is capable of AVX/AVX2 (a sketch of setting this appears after this list)
(4) I have not been able to reproduce this yet but maybe valgrind will tell me something (later)
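Regarding (3), a minimal sketch of setting the variable from inside the benchmark itself rather than in the shell (the helper name is made up for illustration, and this assumes MKL reads the variable at its first dispatch):

    #include <stdlib.h>   /* setenv */

    /* Equivalent to `export MKL_DEBUG_CPU_TYPE=5` before launching;
     * must run before the first MKL call.  The flag is undocumented
     * and may be removed by Intel at any time. */
    static void enable_mkl_avx2_on_amd(void)
    {
        setenv("MKL_DEBUG_CPU_TYPE", "5", 1 /* overwrite */);
    }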

@gmargari
Author

gmargari commented Apr 15, 2020

The best I can do regarding how I compile and run the program (the second recompilation is not needed though; I just changed the number of threads from 4 to 2 and reran it):
https://imgur.com/PVCUy0J
https://imgur.com/N3UXnpo

@gmargari
Author

gmargari commented Apr 15, 2020

I know about the MKL_DEBUG_CPU_TYPE=5 hack to "fix" MKL performance, but since this flag may be removed at any time without notice, we want to benchmark MKL vs (OpenBLAS + libFLAME) "as is", and maybe switch to the latter on AMD machines.

@martin-frbg
Collaborator

Nothing found so far; the only change I made to your test case was adding the stdlib include for malloc/free. And I am currently building without libFLAME (which I think should not be an issue unless the memory corruption happens there).

@gmargari
Author

gmargari commented Apr 15, 2020

While bisecting to find the minimum size that causes the issue, which seems to be the weird magic number 22659, I also got this (but maybe it's just a random error among the rest of the memory errors): https://imgur.com/a/QovF9MN

@gmargari
Author

gmargari commented Apr 15, 2020

Linked without libFLAME, the issue still exists.
A different trace I just got: https://imgur.com/8SSiglN (sorry for the low-quality image, my dumbphone is doing its best)

@martin-frbg
Collaborator

That is certainly suspicious (though I would hope any bugs in the Haswell/SkylakeX dgemm_ncopy kernel would have surfaced by now). Will try to reproduce this on a bigger machine later today; sometimes the laptop is not good enough at uncovering bugs.

@martin-frbg
Collaborator

Now reproduced on Skylake-X.

@martin-frbg
Collaborator

Unfortunately valgrind on SkylakeX hits an unimplemented instruction, and valgrind on Haswell just hangs. I suspect the calculation of the memory buffer addresses in lapack/dpotrf/dpotrf_parallel.c is going haywire.

@martin-frbg
Collaborator

Heisenbug - apparently it just does not want to happen again here, and valgrind on Haswell finds nothing wrong at more manageable sizes (a bit over 16k) while taking days at n=32768.

@gmargari
Author

Same here. For small sizes nothing showed up in valgrind. I tried larger sizes, but they take too long due to the (I suppose) O(N^3) asymptotic complexity, so I tried to bisect to find the minimum size that causes the issue. But since you reproduced it, I didn't test further.

Btw, did you test whether size 22659 also causes the problem in your setting? It should be much quicker to test than 32768.

@martin-frbg
Collaborator

Good point about the 22659 - I did not test this yet (but my Haswell currently refuses to have any problem with 32768 as well - for now I'll just leave the valgrind job running in the hope that it uncovers something eventually).

@gmargari
Author

Btw, 22659 triggers the issue on my i7-6800K. I will also check it on the Threadripper 1950X.

32768 triggers it on both machines. I have no idea whether this reproduces on Haswell.

@martin-frbg
Collaborator

While the i7-6800 is using Broadwell kernels, the Threadripper basically shares the Haswell kernels (and the underlying issue is probably in the memory buffer setup / workload splitting code that is shared between all CPU types anyway).

@martin-frbg
Collaborator

The Haswell valgrind run finally completed after three days without showing any issues at n=32768. And ignore my earlier comment about Broadwell kernels - I got the lineage confused past Sandy Bridge; OpenBLAS on Broadwell uses the exact same Haswell kernels and GEMM parameters.

@martin-frbg
Collaborator

I still fail to reproduce this on anything other than SkylakeX; possibly this is just another case of the default BUFFER_SIZE being too small for large problems. At least increasing the value in common_x86_64.h to what is used on Haswell makes the problem go away, though it could certainly be that this only adds a safe drop zone for things that get written to a miscalculated address.
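Purely as an illustration of the kind of compile-time change meant here (the actual conditionals and values in common_x86_64.h differ by OpenBLAS version and build target, so the numbers below are placeholders):

    /* Placeholder values, not the real common_x86_64.h contents:
     * BUFFER_SIZE fixes the per-thread GEMM work buffer at compile
     * time, and the experiment amounts to raising the SkylakeX value
     * to the larger Haswell-class one. */
    #ifdef SKYLAKEX
    #define BUFFER_SIZE (16 << 20)   /* placeholder: the "too small" default */
    #else
    #define BUFFER_SIZE (32 << 20)   /* placeholder: the Haswell-class value */
    #endif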

@gmargari
Author

Did you try bisecting to see if the same size causes the issue for you, or whether this 22659 is just a random number? (At least for me it was deterministic: all smaller sizes were OK, all larger ones caused issues.)

If you decrease BUFFER_SIZE, can you reproduce the issue at smaller, and thus more manageable, matrix sizes?

Also, did you try different numbers of threads? How can the number of threads be involved in this? E.g. in my case only omp_num_threads = 2 caused the issue; other numbers of threads were OK.

@martin-frbg
Collaborator

The problem is that so far I have seen this only on SkylakeX (where it fails for values down to about n=15700 with the 0.3.9 BUFFER_SIZE), and glibc's built-in MALLOC_CHECK_ is of no use in tracking down the actual source of the (heap?) corruption. (And valgrind refuses to work as it does not support AVX512, as I mentioned above - it seems Intel provided an experimental patchset for this in 2018 that never got approved, and the Intel developers went back to working on something else.)

@martin-frbg
Collaborator

OK, by increasing n (or decreasing BUFFER_SIZE) I can indeed make dgemm_oncopy write beyond the "buffer" on Haswell too (and the fault must be higher up, in either trsm_L.c, the threaded level3 gemm/syrk code, or ultimately blas_server_omp.c, as the generic gemm_ncopy8 fails in the same way as its optimized counterparts). No idea yet why nthreads=2 would be special (and potentially be assigned non-adjacent regions of the buffer?).

@martin-frbg
Collaborator

Actually nthreads=2 does not seem to be that special. Valgrind tells me that, at least in my somewhat contrived example with the decreased BUFFER_SIZE, the gemm_ncopy writes outside the allocated memory whenever the number of threads is less than the number of actual cores in the system. One possible link is the buffer pre-allocation introduced with d744c95, but I see nothing immediately wrong with that code.

@martin-frbg
Collaborator

Need to retest this with 0.3.11; it may have been connected to the huge BLAS3 stack requirements that were recently improved in #2879.

@martin-frbg
Collaborator

Closing as not reproducible with any later version
