Openblas 2.19 and above is not working on Ubuntu 16.04 for Power 8 #1037
Most of the POWER8-specific changes since 0.2.18 appear to have happened in a narrow timespan between April 19 and May 22. While I cannot comment on the POWER8 assembly, I understand that wernsaar also adjusted some thresholds for thread creation in the GEMM functions, which may simply have made thread contention more likely, and 8310d4d dropped the ALLOC_SHM define from Makefile.power without explanation (though it may have been spurious all along). Unless somebody else comes up with a better idea, I wonder if you could try a snapshot from somewhere in the middle of what appears to have been the crucial period (say 0551e57 from April 26), and/or see whether limiting the number of threads created on each node via OMP_NUM_THREADS has any influence. Also, do the Ubuntu and RHEL machines you mention use the exact same binary, or was OpenBLAS
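For reference, trying those two suggestions would look roughly like this (a sketch only; the commit hash is the one named above, and the mpirun invocation is the one used elsewhere in this thread):
git checkout 0551e57
make clean && make
export OMP_NUM_THREADS=1
mpirun -np 2 -bind-to-core --allow-run-as-root --mca btl sm,self,tcp xhpl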
Did you use any flags when compiling OpenBLAS? Attach to the frozen process with a debugger, dump all backtraces, and attach the typescript file here.
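One way to capture that (a sketch, assuming gdb is installed and xhpl is the stuck process; "script" records the terminal session into a file named typescript):
script typescript
gdb -p $(pgrep -n xhpl)      # attach to the most recently started xhpl rank
(gdb) thread apply all bt    # dump backtraces of all threads
(gdb) detach
(gdb) quit
exit                         # ends the script recording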
I compiled OpenBLAS with just the make command (default options) on both Ubuntu and RHEL, set export LD_LIBRARY_PATH= and ran HPL with the following command: mpirun -np 2 -bind-to-core --allow-run-as-root --mca btl sm,self,tcp xhpl. I used GCC and gfortran 5.3.1 on both Ubuntu and RHEL. I also tried to collect the traces but I see the following error: [ No Source Available ]. Am I missing something? Does it require connecting any debugger, since this is a guest machine on KVM? Any other way of collecting the traces?
You may need to build debuggable versions of OpenBLAS and HPL first to get any meaningful backtraces with gdb.
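A sketch of such a debug build (DEBUG=1 is the debug switch in OpenBLAS's Makefile.rule; adding -g to CCFLAGS in the HPL make file is an assumption about your particular Make.ppc64):
cd /home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19
make clean
make DEBUG=1
# then rebuild HPL after adding -g to CCFLAGS in its Make.<arch> file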
Just a quick peek into the problem - can you compare the CPUID on the compile machine and the run machine?
Does it work single-threaded and/or without the MPI binding options?
@rakshithprakash Can you provide me your make.inc from the HPL benchmark so that I can recompile it easily?
@brada4 Attaching the traces:
@brada4 Both the compile machine and the run machine are the same in my case.
@grisuthedragon Please find the attached Makefile
I tested it with the current development version of OpenBLAS on an IBM Power8+ running CentOS 7.3, and everything works fine with the HPL benchmark. I compiled OpenBLAS using
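(the exact build command did not survive here; judging from the NUM_THREADS=1 USE_OPENMP=0 options referenced later in this thread, it was presumably along the lines of)
make NUM_THREADS=1 USE_OPENMP=0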
This is because, for the HPL benchmark, it is quite common to use only the parallelization coming from the MPI processes.
Probably related to #660; running with only one thread is bound to avoid any deadlocks from multithreading.
Even with multithreading enabled in OpenBLAS, the HPL code works fine without running into a deadlock on my machine.
That is what I was looking for. The current backtrace looks like the OpenMP-enabled system OpenBLAS (symlinked to /usr/lib/libblas.so.3 via update-alternatives) - are you sure HPL is linked against the freshly built OpenBLAS? (Check with ldd.)
@brada4 Please find the attached backtrace for all threads. I also looked at the ldd output and can see that libopenblas is linked to the 2.19 version: libopenblas.so.0 => /home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19/libopenblas.so.0 (0x00003fffb4790000)
I have doubts about the linked imports of xhpl (the program the backtrace comes from).
I did export LD_PRELOAD=/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19 and I'm seeing the following error: ERROR: ld.so: object '/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19' from LD_PRELOAD cannot be preloaded (cannot read file data): ignored. And this is the ldd output with export LD_LIBRARY_PATH=/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19: ldd ./xhpl
Unlike LD_LIBRARY_PATH, you need to include the name of the library itself in LD_PRELOAD, so:
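presumably (using the library name reported by ldd above):
export LD_PRELOAD=/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19/libopenblas.so.0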
If I look at Makefile_ppc.txt attached earlier, it uses both -lblas and -lopenblas
Users of Ubuntu 16.04 on POWER8 may also want to take note of Ubuntu Bug #1641241 here:
I did an export LD_PRELOAD for both versions of OpenBLAS, 2.18 and 2.19. Below is the ldd for 2.19: ldd ./xhpl
Both versions 2.18 and 2.19 seem to work now after using LD_PRELOAD, but I see that 2.19 is taking approximately 2.4x the time to complete compared to 2.18. Please find the results below:
2.18 : WR11R2C4 2000 140 1 2 2.44 2.190e+00
2.19 : WR11R2C4 2000 140 1 2 5.98 8.922e-01
I collected the perf data and annotations for them, please find them below:
Samples: 293K of event 'cycles:ppp', Event count (approx.): 293830000000
Not sure how to read the perf data (do you have the 2.18 values for comparison?) - are these results reproducible (and was the workload etc. on the machine the same during both runs)? If the values are stable, it could be that the changed thresholds mentioned above are not favorable for the matrix sizes in this particular benchmark. Perhaps @grisuthedragon has benchmark results from his machine easily available?
@martin-frbg Here are the perf data for 2.18 for comparison:
Samples: 301K of event 'cycles:ppp', Event count (approx.): 301774000000
Yes, it's reproducible.
Can you fix the conflicting libblas and libopenblas dependencies?
Is it conceivable that you built 0.2.18 with different options, something more like the NUM_THREADS=1 USE_OPENMP=0 that grisuthedragon recommended above for HPL? (No libgomp and no reference to threading in its perf results would explain less overhead...)
Most likely /usr/lib/libblas.so.3 is the Ubuntu-supplied OpenBLAS 0.2.18, built with OpenMP and without CBLAS or LAPACK...
@martin-frbg Here is the result of my benchmark (gcc 4.8.5, glibc 2.17, CentOS 7.3, kernel 4.8 [from Fedora 25], current OpenBLAS 0.2.20dev, OpenMPI 1.8). I optimized the HPL.dat for my machine, which now contains the following:
and for the largest experiment (N = 40960) I get:
which is quite good compared to the pure DGEMM performance (490 GFlop/s) obtained by the DGEMM benchmark of OpenBLAS.
Again, check with ldd whether you are using the Fedora-supplied /usr/lib64/libopenblas(p/o).so, the one you built, or both.
@brada4 Please find the results below for the update-alternatives --list command, for both LD_PRELOAD & LD_LIBRARY_PATH:
update-alternatives --list libblas.so.3
/usr/lib/libblas/libblas.so.3
update-alternatives --list libopenblas.so.0
update-alternatives: error: no alternatives for libopenblas.so.0
@martin-frbg I have used just the make command to use default options for both 2.18 & 2.19.
@grisuthedragon Hi, can you please give it a try on Ubuntu 16.04 once?
The main idea is to avoid linking to the system BLAS (remove the -lblas option), use -L/where/openblas/is/built -lopenblas instead, and check with ldd that you are actually testing the OpenBLAS build you intended to test.
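For example, a quick check of which BLAS the binary resolves to:
ldd ./xhpl | grep -i blas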
@rakshithprakash I do not have Ubuntu 16.04 running on this machine. Furthermore, IBM suggests RHEL/CentOS on this type of machine.
@grisuthedragon On the IBM site I find a contrary statement.... This problem will not be fixed by switching to Red Hat, since the system default BLAS will be linked in addition to OpenBLAS.
@brada4 Removing the -lblas option doesn't seem to work for me. Please find the error below:
mpicc -L/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19/ -lopenblas -L/opt/ibm/lib/ -lm -R/opt/ibm/lib -o /home/hpl-2.2/hpl-2.2/bin/ppc64/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /home/hpl-2.2/hpl-2.2/lib/ppc64/libhpl.a
And my make file is:
# ----------------------------------------------------------- shell
SHELL = /bin/sh
CD = cd
# ----------------------------------------------------------- Platform identifier
ARCH = ppc64
# ----------------------------------------------------------- HPL Directory Structure / HPL library
TOPdir = /home/hpl-2.2/hpl-2.2
HPLlib = $(LIBdir)/libhpl.a
# ----------------------------------------------------------- Message Passing library (MPI)
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
MPdir =
# ----------------------------------------------------------- Linear Algebra library (BLAS or VSIPL)
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
LAdir =
# ----------------------------------------------------------- F77 / C interface
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. One and only one option should be chosen in each of the
# 3 following categories:
# 1) name space (How C calls a Fortran 77 routine)
#    -DAdd_     : all lower case and a suffixed underscore (Suns, Intel, ...), [default]
#    -DNoChange : all lower case (IBM RS6000),
#    -DUpCase   : all upper case (Cray),
#    -DAdd__    : the FORTRAN compiler in use is f2c.
# 2) C and Fortran 77 integer mapping
#    -DF77_INTEGER=int   : Fortran 77 INTEGER is a C int, [default]
#    -DF77_INTEGER=long  : Fortran 77 INTEGER is a C long,
#    -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
# 3) Fortran 77 string handling
#    -DStringSunStyle  : The string address is passed at the string location
#                        on the stack, and the string length is then passed
#                        as an F77_INTEGER after all explicit stack arguments, [default]
#    -DStringStructPtr : The address of a structure is passed by a Fortran 77
#                        string, and the structure is of the form:
#                        struct {char *cp; F77_INTEGER len;},
#    -DStringStructVal : A structure is passed by value for each Fortran 77
#                        string, and the structure is of the form:
#                        struct {char *cp; F77_INTEGER len;},
#    -DStringCrayStyle : Special option for Cray machines, which uses Cray fcd
#                        (fortran character descriptor) for interoperation.
F2CDEFS = -DAdd_ -DF77_INTEGER=int -DStringSunStyle
# ----------------------------------------------------------- HPL includes / libraries / specifics
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH)
# - Compile time options -------------------------------------
#    -DHPL_COPY_L          force the copy of the panel L before bcast;
#    -DHPL_CALL_CBLAS      call the cblas interface;
#    -DHPL_CALL_VSIPL      call the vsip library;
#    -DHPL_DETAILED_TIMING enable detailed timers;
# By default HPL will:
#    *) not copy L before broadcast,
#    *) call the BLAS Fortran 77 interface,
#    *) not display detailed timing information.
HPL_OPTS =
HPL_DEFS =
# ----------------------------------------------------------- Compilers / linkers - Optimization flags
export OMPI_CFLAGS:=
CCNOOPT   = $(HPL_DEFS) -m64
CCFLAGS   = $(HPL_DEFS) -m64 -O3 -mcpu=power8 -mtune=power8
LINKER    = mpicc
LINKFLAGS = -L/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19/ -lopenblas -L/opt/ibm/lib/ -lm -R/opt/ibm/lib
ARCHIVER  = ar
I tried adding another -L/home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19/ -lblas and it didn't work. I can get the same make file compiled by using the -lblas option in LAlib.
Please put the "-lopenblas" in the LAlib list where the -lblas was - libhpl.a depends on it, and the sequence within the library list matters.
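In terms of the HPL make file above, that would look something like this (a sketch only, using the build path mentioned earlier in this thread):
LAdir = /home/hpl-2.2/openblas-2.19/OpenBLAS-0.2.19
LAinc =
LAlib = -L$(LAdir) -lopenblas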
@rakshithprakash @brada4 If you're using the IBM XL compilers + ESSL + CUDA, then IBM support told me the other way around. But no more on that here. ;-)
It compiled now after adding the entire path to the LAlib variable, but I do not see the path in the ldd output: ldd ./xhpl
Using export LD_LIBRARY_PATH, however, I can see the output for both 2.18 & 2.19. 2.18:
================================================================================
Probably they add ESSL as an alternative to libblas.so.3 and everything works well by default. You could try that route with OpenBLAS too - 'make install' will install /opt/OpenBLAS/lib/libopenblas.so
So if I read your most recent results correctly, 2.19 is now performing better than 2.18 (Gflops went from 2.948 to 15.12 for that test)?
@brada4 I do not think that they use ESSL as an alternative for libblas.so.3, because ESSL is designed to work with the XL compiler and therefore its Fortran symbols do not have the underscore at the end. So installing ESSL as an alternative would break all applications.
Just build HPL against -lblas and update the alternatives. It is the easiest way.
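A sketch of that route (the alternatives link /usr/lib/libblas.so.3 and the /opt/OpenBLAS install prefix come from earlier comments; the priority value is arbitrary):
make install
update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 /opt/OpenBLAS/lib/libopenblas.so 50
update-alternatives --config libblas.so.3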
Hi,
I was running HPL-2.2 with OpenBLAS 2.19 and 2.20 and saw that the benchmark never exits, even after running overnight with a very small problem size such as 200. I cross-verified it on 2.18 and saw that it completes in less than a second. Please find the command that I'm using:
mpirun -np 2 -bind-to-core --allow-run-as-root --mca btl sm,self,tcp xhpl
My guest configuration is as below:
Number of cores : 2 cores
SMT mode : 8
Memory : 16GB
OS : Ubuntu 16.04 LTS
Kernel version : 4.4.0-21-generic
I verified the same thing on x86 and could see that it is working fine.
After looking at the perf data on 2.19 I could observe that 90% of the time is spent in inner_thread
Attaching the perf data of 2.19:
and the annotations of inner_thread look like this:
annotations.txt
I could observe the same behavior on another Ubuntu machine as well, but it works fine on RHEL.
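For reference, perf data like the above is typically collected along these lines (a sketch; adjust events and options as needed, and interrupt with Ctrl-C once the run hangs - perf.data is written on exit):
perf record -g -- mpirun -np 2 -bind-to-core --allow-run-as-root --mca btl sm,self,tcp xhpl
perf report                  # where the cycles are spent
perf annotate inner_thread   # per-instruction view of the hot function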