
BUG: np.dot is not thread-safe with OpenBLAS #11046


Closed

artemru opened this issue May 4, 2018 · 21 comments

@artemru

artemru commented May 4, 2018

I'm using numpy (1.14.1) linked against OpenBLAS 0.2.18, and it looks like np.dot
(which uses the dgemm routine from OpenBLAS) is not thread-safe:

import numpy as np
from multiprocessing.pool import ThreadPool

dim = 4   # for larger values of dim, there's no issue
a = np.arange(10**5 // dim) / 10.**5
b = np.arange(10**5).reshape(-1, dim) / 10.**5

# Run the same dot product concurrently in 4 threads.
pp = ThreadPool(4)
threaded_result = pp.map(a.dot, [b] * 4)
pp.close()
pp.terminate()

result = a.dot(b)
print([np.max(np.abs(x - result)) for x in threaded_result])

# prints, e.g.:
# [1822.7068840452998, 1540.2636287421, 96.10628199050007, 0.0]
# or other rather random results, whereas it should print all zeros

I don't know if this kind of behavior is expected. Is it a numpy bug or rather an OpenBLAS one?

Notes:

  • numpy with MKL BLAS does not have this issue at all
  • everything runs fine if OpenBLAS threading is turned off (export OPENBLAS_NUM_THREADS=1); see the sketch after this list
  • I don't know how to test OpenBLAS 0.2.20, which may solve this
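
A minimal sketch of that workaround, assuming the variable is set before numpy (and therefore OpenBLAS) is first imported, since OpenBLAS only reads it at initialization:

import os
# Must be set before numpy is imported; OpenBLAS reads the variable
# once, when the library is initialized.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np  # BLAS kernels now run single-threaded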

Some extra info if needed:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Stepping:              4
CPU MHz:               2500.060
BogoMIPS:              5000.12
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm retpoline kaiser fsgsbase smep erms xsaveopt
np.show_config()
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
@mattip
Member

mattip commented May 4, 2018

Recurrence of #4813; there, the issue was solved by upgrading to OpenBLAS 0.2.9.

@artemru
Author

artemru commented May 4, 2018

With multiprocessing everything is fine; this is about multi-threading. I think it's a different issue.

@artemru
Author

artemru commented May 16, 2018

Could anyone reproduce this issue?

@pv
Member

pv commented May 16, 2018

Yes, reproducible on Fedora 28 with OpenBLAS 0.2.20; it didn't seem to occur with ATLAS.
NumPy, IIRC, just assumes the BLAS/LAPACK libraries are thread-safe; there are no extra locks.
I'm not sure anything needs to be done on the NumPy side; this looks like an OpenBLAS issue.

@charris
Member

charris commented May 16, 2018

There is some experimentation going on with the OpenBLAS versions. It would be good to have a test for this, probably in numpy/linalg/tests/, maybe in test_regressions.py or in its own test_threading.py module.
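
For illustration, a minimal sketch of what such a test could look like (the test name, pool size, and use of assert_allclose are my own choices here, not an existing numpy test):

import numpy as np
from multiprocessing.pool import ThreadPool


def test_dot_thread_safety():
    # Regression test for gh-11046: concurrent np.dot calls returned
    # corrupted results with some threaded OpenBLAS builds.
    dim = 4
    a = np.arange(10**5 // dim) / 10.**5
    b = np.arange(10**5).reshape(-1, dim) / 10.**5

    expected = a.dot(b)
    with ThreadPool(4) as pool:
        results = pool.map(a.dot, [b] * 4)

    for res in results:
        # With the bug present, the error could be as large as ~1e3.
        np.testing.assert_allclose(res, expected)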

@ogrisel
Contributor

ogrisel commented May 22, 2018

@artemru did you report this bug to the OpenBLAS developers? If so, what is the URL of the report?

@artemru artemru closed this as completed May 22, 2018
@artemru artemru reopened this May 22, 2018
@artemru
Author

artemru commented May 23, 2018

@ogrisel, indeed it looks like a pure OpenBLAS issue. I did not report it to the OpenBLAS developers (lack of time, and I'm not fluent in C++). Yet it's not clear to me whether OpenBLAS guarantees thread-safety; I've just looked at https://github.com/xianyi/OpenBLAS/wiki/faq, which says: "If your application is already multi-threaded, it will conflict with OpenBLAS multi-threading. Thus, you must set OpenBLAS to use single thread as following."

@seberg
Member

seberg commented Oct 30, 2018

This issue is troubling me, but I am not quite sure how it can be solved. Possibly we can push OpenBLAS to fix the big bugs? Even then, it would be annoying if other BLAS implementations are also not thread-safe (by default).

I don't like hacks, but for this one I don't mind how seriously ugly the solution is; I would just prefer that there is one at all...

@ogrisel
Contributor

ogrisel commented Oct 31, 2018

I think we should work with upstream OpenBLAS to make it thread-safe.

@ogrisel
Contributor

ogrisel commented Oct 31, 2018

It would also be interesting to try building OpenBLAS with OpenMP instead of its internal libpthread backend and check whether the race condition reported by @artemru still happens in that case. OpenMP runtimes are thread-safe by design (I believe), so it's likely that this would fix the issue.

In the past, @matthew-brett decided to build the OpenBLAS included in the numpy & scipy wheels with the libpthread backend instead of OpenMP, so as to avoid the fork-safety issues of GOMP, the GCC implementation of the OpenMP runtime. @njsmith submitted a patch to the GOMP developers to make it fork-safe, but the review stalled: https://gcc.gnu.org/ml/gcc-patches/2014-02/msg00813.html. As a result, C libraries that use OpenMP are still liable to deadlock or crash Python programs that use multiprocessing with the fork start method.

Nowadays I suspect that OpenBLAS could be built with OpenMP using clang, so as to avoid running into the GOMP fork-safety limitations. clang/llvm use the OpenMP runtime implementation open-sourced by Intel, and as far as I know it is fork-safe.

Edit: the thread-safety issue in OpenBLAS is apparently unrelated to its threading backend (pthread vs. OpenMP), as it also occurs when OpenBLAS is compiled with the single-thread-mode flag (OpenMathLib/OpenBLAS#1844).

@charris
Member

charris commented Oct 31, 2018

Note that the current NumPy wheels are linked against OpenBLAS 0.3.0.

@artemru
Author

artemru commented Oct 31, 2018

It's also reproducible with OpenBLAS 0.3.0.

@seberg
Member

seberg commented Oct 31, 2018

Just to note, I have opened an issue at OpenBLAS (OpenMathLib/OpenBLAS#1844) to hopefully continue the discussion there. Since I do not know the technical details here, any continuation of the discussion there would be very welcome. For all I know right now, this seems like a high-priority issue (it also happens by default on Linux systems when OpenBLAS is used), and if we can provide some help to OpenBLAS, that would be good.

As far as I can see, downstream users have no reason to suspect such issues, and this could randomly, once in a while, produce incorrect results (frankly, I might suspect such issues, but half the people who work in environments similar to mine are probably not even aware that OpenBLAS uses threads).

@mattip mattip changed the title np.dot is not thread-safe with OpenBLAS==0.2.18 np.dot is not thread-safe with OpenBLAS Nov 1, 2018
@mattip mattip changed the title np.dot is not thread-safe with OpenBLAS BUG: np.dot is not thread-safe with OpenBLAS Nov 1, 2018
@mattip
Member

mattip commented Nov 1, 2018

There might be a need to hold the GIL for some LAPACK/BLAS implementations if they cannot promise thread safety. Unfortunately, we do not have a way to query, at runtime, which implementation we are using; see issue #11826.
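
For what it's worth, OpenBLAS itself exports a query function, openblas_get_config. A hedged sketch of probing for it at runtime follows; the extension module name below comes from newer numpy versions, and the symbol-lookup details are assumptions that vary by platform and build:

import ctypes
import numpy as np

def probe_openblas():
    """Return the OpenBLAS config string if numpy is linked against it."""
    # dlopen numpy's core extension; dlsym on that handle also searches
    # its dependencies, where an OpenBLAS build exposes its symbols.
    core = ctypes.CDLL(np.core._multiarray_umath.__file__)
    try:
        get_config = core.openblas_get_config
    except AttributeError:
        return None  # some other BLAS (MKL, ATLAS, reference, ...)
    get_config.restype = ctypes.c_char_p
    return get_config().decode("ascii")

print(probe_openblas())  # e.g. "OpenBLAS 0.3.0 ... MAX_THREADS=32"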

@ogrisel
Contributor

ogrisel commented Nov 1, 2018

I think it's better to work with upstream to ensure that the implementations are all thread-safe. MKL is thread-safe, and OpenBLAS can probably be fixed. I don't know about BLIS, but I would believe so.

@seberg
Member

seberg commented Nov 1, 2018

Well, maybe we can add code such as the one you linked to in order to change the number of threads. If numpy recognizes the BLAS implementation, it could release the GIL, and refuse to release it if it sees one it does not recognize. For OpenBLAS and the other typical implementations, of course, the bug itself should rather be fixed.
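
As an illustration of the runtime-thread-count idea, a sketch using OpenBLAS's C-level control function openblas_set_num_threads (a real OpenBLAS export; the library lookup is the hand-wavy part, since in numpy itself the handle would come from the BLAS it was actually linked against):

import ctypes
import ctypes.util

# Locate an OpenBLAS shared library; this may find a different copy
# than the one numpy is linked against, so treat it as illustrative.
libpath = ctypes.util.find_library("openblas")
openblas = ctypes.CDLL(libpath)

# Force the BLAS kernels to run single-threaded: the same effect as
# OPENBLAS_NUM_THREADS=1, but switchable at runtime.
openblas.openblas_set_num_threads(1)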

@bbbbbbbbba

My impression is that, for reasonable behavior under multithreading, the thread server (blas_server.c) in OpenBLAS may need to be rewritten completely. Currently, if the calling program spawns multiple threads, each of those threads becomes a main thread, and they all share the same n-1 worker threads, which is not that bad (since the amount of parallelism is upper-bounded by n anyway). However, blas_server.c doesn't expect there to be more than one main thread, so it makes a lot of questionable design choices, e.g.:

  • When dispatching tasks, busy-waiting on the worker threads until one becomes idle;
  • When waiting for results, waiting on each worker thread as long as it is busy, even if it is busy with a task some other main thread gave it.

Despite not affecting correctness, these problems lead to worse performance than one can reasonably expect; worse than, say, if each main thread spawned its own n-1 worker threads, or if they shared the same n-1 worker threads in a sensible way. A rough illustration follows this comment.

And then there are some one-off things that become outright bugs in a multithreaded setting, like this global buffer, and some other bug I don't yet understand that happens with this code snippet. This last one has been frustrating me for quite a while.
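
A rough way to observe that contention from Python; this is a measurement sketch, not a fix, and the matrix sizes, thread counts, and outcome all depend on the machine and the OpenBLAS build:

import time
import numpy as np
from multiprocessing.pool import ThreadPool

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

# Serial baseline: a single main thread drives OpenBLAS's worker pool.
t0 = time.perf_counter()
for _ in range(8):
    a.dot(b)
serial = time.perf_counter() - t0

# Several main threads now contend for the same worker pool; per the
# analysis above, busy-waiting in blas_server.c can make this no
# faster (or even slower) although the total work is identical.
with ThreadPool(4) as pool:
    t0 = time.perf_counter()
    pool.map(lambda _: a.dot(b), range(8))
threaded = time.perf_counter() - t0

print("serial: %.2fs  threaded: %.2fs" % (serial, threaded))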

@matthew-brett
Contributor

Would you consider opening an OpenBLAS issue on GitHub, to give the discussion a home?

@bbbbbbbbba

There is already an OpenBLAS issue (OpenMathLib/OpenBLAS#1844), and I have been trying to discuss it there for a while. I decided to escape here for my mental health.

@mattip
Member

mattip commented Nov 12, 2018

OpenBLAS fixed OpenMathLib/OpenBLAS#1844.

@seberg
Member

seberg commented Jan 5, 2019

I guess we can close this, since OpenBLAS is fixed and we are making sure to link against a newer version (we even point it out in the release notes).

@seberg seberg closed this as completed Jan 5, 2019