Memory issues when using libFLAME dpotrf_() with OpenBLAS #2562

Closed
gmargari opened this issue Apr 15, 2020 · 22 comments

@gmargari

gmargari commented Apr 15, 2020

Hello all,
I'm running this minimal test file with DPOTRF, linking it against libFLAME and OpenBLAS. I tried various matrix dimensions from 2 up to 32768 (an 8 GB matrix), and ran it with 1, 2, 4, and 8 threads.
I have this weird behaviour:

  • when the matrix dimension is 2, 4, ..., 16384, everything is OK
  • when the matrix dimension is 32768 and the number of threads is 1, 4, or 8, it's OK
  • when the matrix dimension is 32768 and the number of threads is 2, I get all kinds of different memory errors (I have tried both release and debug builds), from bus errors and segmentation faults inside OpenBLAS to errors in my free() at the end of the program ("invalid size", "double free or corruption", "munmap_chunk() failed", etc.)

I get this on both an Intel i7-6800K and an AMD Ryzen Threadripper 1950X (both with 64 GB RAM).
Maybe I'm messing something up, but I don't know what. Not sure if this has anything to do with #1137 since I'm getting bus errors, but currently I can't test a more recent version on our internal network.
I'm using OpenBLAS 0.3.9.

It may also be related to #352.
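For reference, a minimal sketch of an equivalent driver (not the exact test file; the thread count is assumed to be controlled via OMP_NUM_THREADS or OPENBLAS_NUM_THREADS):

    /* Sketch of an equivalent test: allocate an n x n SPD matrix in
     * column-major order, factor it in place with dpotrf_, free it. */
    #include <stdio.h>
    #include <stdlib.h>   /* malloc / free */

    /* Fortran LAPACK symbol as exported by libFLAME / OpenBLAS */
    extern void dpotrf_(const char *uplo, const int *n, double *a,
                        const int *lda, int *info);

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 32768;
        int lda = n, info = 0;

        double *a = malloc((size_t)n * (size_t)n * sizeof(double));
        if (!a) { fprintf(stderr, "allocation failed\n"); return 1; }

        /* Trivially SPD: large diagonal, small constant off-diagonal. */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                a[(size_t)j * n + i] = (i == j) ? (double)n : 0.5;

        dpotrf_("L", &n, a, &lda, &info);
        printf("dpotrf info = %d\n", info);

        free(a);   /* the reported crashes often surface here */
        return 0;
    }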


Background:
I'm running some benchmarks to compare MKL LAPACK vs libFLAME performance, using MKL BLAS or OpenBLAS as the underlying BLAS library.

I first benchmarked MKL BLAS vs OpenBLAS on DDOT, DGEMV, and DGEMM.
I tested matrix sizes up to 32768 x 32768 (8 GB), with 1, 2, 4, and 8 threads.
Everything was OK (by the way, OpenBLAS was very close to MKL on Intel CPUs, and 40%-60% faster on AMD).

I repeated the same settings, but now testing MKL LAPACK vs libFLAME on DPOTRF, DGETRF, and DSYTRF.
When using MKL BLAS as the underlying BLAS library for both, everything is OK.
When I use libFLAME with OpenBLAS, everything is OK until I run DPOTRF with size 32768 and 2 threads.

Unfortunately I'm working on an internal corporate network, so I can't copy out much information such as lscpu output, dumps, or benchmark results (I copied the test file by hand).

@martin-frbg
Collaborator

martin-frbg commented Apr 15, 2020

(1) this should not happen(TM)
(2) DPOTRF is actually one of the few LAPACK functions that OpenBLAS replaces with its own optimized version, so you will not get to test libFLAME performance in this case even if things do not blow up (this does not apply if you built OpenBLAS without its LAPACK, though).
(3) for MKL on AMD you will need to set the environment variable MKL_DEBUG_CPU_TYPE=5 to make it accept that the "weird foreign" CPU actually is capable of AVX/AVX2 (a sketch of setting this appears after this list)
(4) I have not been able to reproduce this yet but maybe valgrind will tell me something (later)
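Regarding (3), a minimal sketch of setting the variable from inside the benchmark itself rather than in the shell (the helper name is made up for illustration, and this assumes MKL reads the variable at its first dispatch):

    #include <stdlib.h>   /* setenv */

    /* Equivalent to `export MKL_DEBUG_CPU_TYPE=5` before launching;
     * must run before the first MKL call.  The flag is undocumented
     * and may be removed by Intel at any time. */
    static void enable_mkl_avx2_on_amd(void)
    {
        setenv("MKL_DEBUG_CPU_TYPE", "5", 1 /* overwrite */);
    }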

@gmargari
Author

gmargari commented Apr 15, 2020

The best I can do regarding how I compile and run the program (the second recompilation is not needed though; I just changed the number of threads from 4 to 2 and reran it):
https://imgur.com/PVCUy0J
https://imgur.com/N3UXnpo

@gmargari
Author

gmargari commented Apr 15, 2020

I know about the MKL_DEBUG_CPU_TYPE=5 hack to "fix" MKL performance, but since this flag may be removed at any time without notice, we want to benchmark MKL vs (OpenBLAS + libFLAME) "as is", and maybe switch to the latter on AMD machines.

@martin-frbg
Collaborator

Nothing found so far; the only change I made to your test case was adding the stdlib include for malloc/free. And I am currently building without libFLAME (which I think should not be an issue unless the memory corruption happens there).

@gmargari
Author

gmargari commented Apr 15, 2020

While bisecting to find the minimum size that causes the issue, which seems to be the weird magic number 22659, I also got this (but maybe it's just a random error among the rest of the memory errors): https://imgur.com/a/QovF9MN

@gmargari
Author

gmargari commented Apr 15, 2020

Linked without libFLAME, the issue still exists.
A different trace I just got: https://imgur.com/8SSiglN (sorry for the low-quality image, my dumbphone is doing its best)

@martin-frbg
Collaborator

That is certainly suspicious (though I would hope any bugs in the Haswell/SkylakeX dgemm_ncopy kernel would have surfaced by now). Will try to reproduce this on a bigger machine later today; sometimes the laptop is not good enough at uncovering bugs.

@martin-frbg
Collaborator

Now reproduced on Skylake-X.

@martin-frbg
Collaborator

Unfortunately valgrind on SkylakeX hits an unimplemented instruction, and valgrind on Haswell just hangs. I suspect the calculation of the memory buffer addresses in lapack/dpotrf/dpotrf_parallel.c is going haywire.

@martin-frbg
Collaborator

Heisenbug - apparently it just does not want to happen again here, and valgrind on Haswell finds nothing wrong at more manageable sizes (a bit over 16k) while taking days at n=32768.

@gmargari
Author

Same here. For small sizes nothing showed up in valgrind. I tried larger sizes, but they take too long due to the (I suppose) O(N^3) asymptotic complexity, so I tried to bisect to find the minimum size that causes the issue. But since you reproduced it, I didn't test further.

Btw, did you test whether size 22659 also causes the problem in your setting? It should be much quicker to test than 32768.

@martin-frbg
Collaborator

Good point about the 22659 - I did not test this yet (but my Haswell currently refuses to have any problem with 32768 as well - for now I'll just leave the valgrind job running in the hope that it uncovers something eventually).

@gmargari
Author

Btw, 22659 triggers the issue on my i7-6800K. I will also check it on the Threadripper 1950X.

32768 triggers it on both machines. I have no idea whether this reproduces on Haswell.

@martin-frbg
Collaborator

While the i7-6800 is using Broadwell kernels, the Threadripper basically shares the Haswell kernels (and the underlying issue is probably in the memory buffer setup / workload splitting code that is shared between all CPU types anyway).

@martin-frbg
Collaborator

The Haswell valgrind run finally completed after three days without showing any issues at n=32768. And ignore my earlier comment about Broadwell kernels - I got the lineage confused past Sandy Bridge; OpenBLAS on Broadwell uses the exact same Haswell kernels and GEMM parameters.

@martin-frbg
Collaborator

I still fail to reproduce this on anything other than SkylakeX; possibly this is just another case of the default BUFFER_SIZE being too small for large problems. At least increasing the value in common_x86_64.h to what is used on Haswell makes the problem go away, though it could certainly be that this only adds a safe drop zone for things that get written to a miscalculated address.
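Purely as an illustration of the kind of compile-time change meant here (the actual conditionals and values in common_x86_64.h differ by OpenBLAS version and build target, so the numbers below are placeholders):

    /* Placeholder values, not the real common_x86_64.h contents:
     * BUFFER_SIZE fixes the per-thread GEMM work buffer at compile
     * time, and the experiment amounts to raising the SkylakeX value
     * to the larger Haswell-class one. */
    #ifdef SKYLAKEX
    #define BUFFER_SIZE (16 << 20)   /* placeholder: the "too small" default */
    #else
    #define BUFFER_SIZE (32 << 20)   /* placeholder: the Haswell-class value */
    #endif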

@gmargari
Author

Did you try bisecting to see if the same size causes the issue for you, or whether this 22659 is just a random number? (At least for me it was deterministic: all smaller sizes were OK, all larger ones caused issues.)

If you decrease BUFFER_SIZE, can you reproduce the issue at smaller, and thus more manageable, matrix sizes?

Also, did you try different numbers of threads? How can the number of threads be involved in this? E.g. in my case only omp_num_threads = 2 caused the issue; other numbers of threads were OK.

@martin-frbg
Collaborator

The problem is that so far I have seen this only on SkylakeX (where it fails for values down to about n=15700 with the 0.3.9 BUFFER_SIZE), and glibc's built-in MALLOC_CHECK_ is of no use in tracking down the actual source of the (heap?) corruption. (And valgrind refuses to work as it does not support AVX512, as I mentioned above - it seems Intel provided an experimental patchset for this in 2018 that never got approved, and the Intel developers went back to working on something else.)

@martin-frbg
Collaborator

OK, by increasing n (or decreasing BUFFER_SIZE) I can indeed make dgemm_oncopy write beyond the "buffer" on Haswell too (and the fault must be higher up, in either trsm_L.c, the threaded level3 gemm/syrk code, or ultimately blas_server_omp.c, as the generic gemm_ncopy8 fails in the same way as its optimized counterparts). No idea yet why nthreads=2 would be special (and potentially be assigned non-adjacent regions of the buffer?).

@martin-frbg
Collaborator

Actually nthreads=2 does not seem to be that special. Valgrind tells me that, at least in my somewhat contrived example with the decreased BUFFER_SIZE, the gemm_ncopy writes outside the allocated memory whenever the number of threads is less than the number of actual cores in the system. One possible link is the buffer pre-allocation introduced with d744c95, but I see nothing immediately wrong with that code.

@martin-frbg
Collaborator

Need to retest this with 0.3.11; it may have been connected to the huge BLAS3 stack requirements that were recently improved in #2879.

@martin-frbg
Collaborator

Closing as not reproducible with any later version
