Memory issues when using libFLAME dpotrf_() with OpenBLAS #2562
(1) this should not happen(TM)
The best I can do regarding how I compile and run the program (the second recompilation is not actually needed; I just changed the number of threads from 4 to 2 and reran it):
I know about the
Nothing found so far; the only change I made to your test case was adding the stdlib include for malloc/free. And I am currently building without libFLAME (which I think should not be an issue, unless the memory corruption happens there).
During bisecting to find the minimum size that causes the issue, which seems to be the weird magic number 22659, I also got this (but maybe it's just a random error alongside the rest of the memory errors): https://imgur.com/a/QovF9MN
Linked without libFLAME; the issue still exists.
That is certainly suspicious (though I would hope any bugs in the Haswell/SkylakeX dgemm_ncopy kernel would have surfaced by now). Will try to reproduce this on a bigger machine later today; sometimes the laptop is not good enough at uncovering bugs.
Now reproduced on Skylake-X.
Unfortunately valgrind on SkylakeX hits an unimplemented instruction, and valgrind on Haswell just hangs. I suspect the calculation of the memory buffer addresses in lapack/dpotrf/dpotrf_parallel.c is going haywire.
Heisenbug: apparently it just does not want to happen again here, and valgrind on Haswell finds nothing wrong at more manageable sizes (a bit over 16k), while taking days at n=32768.
Same here. For small sizes nothing showed up in valgrind. I tried larger sizes, but due to the (I suppose) O(N^3) asymptotic complexity, I instead tried to bisect to find the minimum size that causes the issue. But since you reproduced it, I didn't test further. Btw, did you test whether size 22659 also causes the problem in your setting? It should be much better than 32768.
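For reference, a minimal sketch of such a bisection driver, assuming a hypothetical wrapper binary `./dpotrf_test` (not part of the original report) that takes the matrix size as its argument and exits nonzero, or aborts, when the corruption hits:

```c
/* Binary-search the smallest matrix size n that makes the test fail.
 * "./dpotrf_test" is a hypothetical wrapper around the test file that
 * takes n on the command line. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int lo = 2;          /* known-good size */
    int hi = 32768;      /* known-bad size */
    char cmd[64];

    while (lo + 1 < hi) {
        int mid = lo + (hi - lo) / 2;
        snprintf(cmd, sizeof cmd, "./dpotrf_test %d", mid);
        if (system(cmd) == 0)
            lo = mid;    /* mid still passes: raise the lower bound */
        else
            hi = mid;    /* mid already fails: lower the upper bound */
    }
    printf("smallest failing size: %d\n", hi);
    return 0;
}
```

Each probe is an O(n^3) factorization, so the roughly log2(32768) = 15 probes are still far cheaper than scanning sizes linearly.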
Good point about the 22659; did not test this yet (but my Haswell currently refuses to have any problem with 32768 as well, so for now I'll just leave the valgrind job running in the hope that it eventually uncovers something).
Btw, 22659 triggers the issue on my i7-6800. I will also check it on the Threadripper 1950X. 32768 triggers it on both machines. I have no idea if this reproduces on Haswell.
While the i7-6800 is using Broadwell kernels, the Threadripper basically shares the Haswell kernels (and the underlying issue is probably in the memory buffer setup / workload splitting code that is shared between all cpu types anyway).
The Haswell valgrind run finally completed after three days without showing any issues at n=32768. And ignore my earlier comment about Broadwell kernels; I got the lineage confused past Sandybridge: OpenBLAS on Broadwell uses the exact same Haswell kernels and GEMM parameters.
I still fail to reproduce this on anything other than SkylakeX; possibly this is just another case of the default BUFFER_SIZE being too small for large problems. At least increasing the value in common_x86_64.h to what is used on Haswell makes the problem go away (see the sketch below), though it could certainly be that this only adds a safe drop zone for things that get written to a miscalculated address.
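For context, a hedged sketch of the kind of change being described; the concrete numbers below are illustrative placeholders, not the actual OpenBLAS per-target defaults (the real definition lives in common_x86_64.h):

```c
/* common_x86_64.h (sketch only; the values are illustrative placeholders).
 * BUFFER_SIZE sets the scratch area used by the threaded GEMM/POTRF code;
 * enlarging it on SkylakeX made the observed corruption go away, possibly
 * just by giving stray writes a safe drop zone. */
#ifdef SKYLAKEX
#define BUFFER_SIZE (32 << 22)   /* enlarged to the Haswell-class value */
#else
#define BUFFER_SIZE (32 << 20)
#endif
```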
Did you try bisecting to see if the same size causes the issue for you, or if this 22659 is just a random number? (At least for me it was deterministic: all smaller numbers were ok, all larger ones caused issues.) If you decrease BUFFER_SIZE, can you reproduce the issue on smaller, thus more manageable, matrix sizes? Also, did you try with different numbers of threads? How can the number of threads be involved in this? E.g., in my case only omp_num_threads = 2 caused the issue; other numbers of threads were ok.
The problem is that so far I have seen this only on SkylakeX (where it fails for values down to about n=15700 with the 0.3.9 BUFFER_SIZE), and glibc's builtin MALLOC_CHECK_ is of no use in tracking down the actual source of the (heap?) corruption. (And valgrind refuses to work, as it does not support AVX512 as I mentioned above; it seems Intel provided an experimental patchset for this in 2018 that never got approved, and the Intel guys went back to working on something else.)
Ok: by increasing n (or decreasing BUFFER_SIZE) I can indeed make dgemm_oncopy write beyond the "buffer" on Haswell too (and the fault must be higher up, in either trsm_L.c, the threaded level3 gemm/syrk code, or ultimately blas_server_omp.c, as the generic gemm_ncopy8 fails in the same way as its optimized counterparts). No idea yet why nthreads=2 would be special (and potentially be assigned non-adjacent regions of the buffer?).
Actually nthreads=2 does not seem to be that special. Valgrind tells me that, at least in my somewhat contrived example with the decreased BUFFER_SIZE, the gemm_ncopy writes outside the allocated memory whenever the number of threads is less than the number of actual cores in the system. One possible link is the buffer pre-allocation introduced with d744c95, but I see nothing immediately wrong with that code.
Need to retest this with 0.3.11; this may have been connected to the huge BLAS3 stack requirements that were recently improved in #2879.
Closing as not reproducible with any later version.
Hello all,
I'm running this minimal test file with `DPOTRF`, linking it with libFLAME and OpenBLAS. I tried various matrix dimensions from 2 to 32768 (an 8 GB matrix), and I run it with number of threads = 1, 2, 4, 8. I get this weird behaviour: errors from `free()` at the end of the program ("invalid size", "double free or corruption", "munmap_chunk() failed", etc.). I get this on both an Intel i7-6800K and an AMD Ryzen Threadripper 1950X (both with 64 GB RAM).
Maybe I'm messing something up, but I don't know what. Not sure if this has anything to do with #1137, since I'm getting bus errors, but currently I can't test a more recent version on our internal network.
I'm using OpenBLAS 0.3.9.
It may also be related to #352.
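Since the test file itself couldn't be copied out, here is a hypothetical reconstruction of what such a minimal reproducer might look like (the matrix fill, the `"L"` uplo choice, and the C prototype of `dpotrf_` are assumptions, not the original code):

```c
/* Hypothetical reconstruction of the minimal test: allocate an n x n
 * symmetric positive definite matrix, factor it with LAPACK's dpotrf_,
 * then free the buffer (the free() is where the corruption surfaced). */
#include <stdio.h>
#include <stdlib.h>

extern void dpotrf_(const char *uplo, const int *n, double *a,
                    const int *lda, int *info);

int main(void) {
    const int n = 32768;  /* 32768 x 32768 doubles = 8 GB, as in the report */
    double *a = malloc((size_t)n * (size_t)n * sizeof(double));
    if (a == NULL) return 1;

    /* Diagonally dominant fill: diagonal n, off-diagonal 1.0, which
     * guarantees the matrix is symmetric positive definite. */
    for (size_t j = 0; j < (size_t)n; j++)
        for (size_t i = 0; i < (size_t)n; i++)
            a[j * (size_t)n + i] = (i == j) ? (double)n : 1.0;

    int info = 0;
    dpotrf_("L", &n, a, &n, &info);
    printf("dpotrf_ returned info = %d\n", info);

    free(a);  /* "invalid size" / "double free or corruption" hit here */
    return 0;
}
```

Linking would be along the lines of `gcc test.c -lflame -lopenblas` (the flags are assumptions; the exact build commands from the report are not available).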
Background:
I'm running some benchmarks to compare MKL LAPACK vs libFLAME performance, using MKL BLAS or OpenBLAS as the underlying BLAS library.
I first benchmarked MKL BLAS vs OpenBLAS on `DDOT`, `DGEMV`, `DGEMM`. I tested matrix sizes up to 32768 x 32768 = 8 GB, with number of threads 1, 2, 4, 8. Everything was ok (btw, OpenBLAS was very close to MKL on Intel CPUs, and was 40%-60% faster on AMD).
I repeated the same settings, but now testing MKL LAPACK vs libFLAME on `DPOTRF`, `DGETRF`, `DSYTRF`. When using MKL BLAS as the underlying BLAS library for both, everything is ok. When I use libFLAME with OpenBLAS, everything is ok until I run `DPOTRF` with size `32768` and num threads = `2`. Unfortunately I'm working on an internal corporate network, so I can't copy much info such as `lscpu` output, dumps, or the results of the benchmarks (I copied the test file by hand).