# [C++] OMP vs. MKL parallelization


In the documentation, parallelization is controlled by two environment variables, MKL_NUM_THREADS and OMP_NUM_THREADS. However, in my benchmark, changing MKL_NUM_THREADS has no effect, and the number of CPUs used is fully controlled by OMP_NUM_THREADS. Given this observation, I have the following questions.

Is it correct that MKL_NUM_THREADS controls the number of CPUs used in a matrix-matrix multiplication, while OMP_NUM_THREADS controls the parallelization over different quantum number blocks? If so, does the current version of ITensor only support parallelization over quantum number blocks?


Hi Chia-Min,
Your understanding of the role of these two environment variables is correct, assuming that the BLAS you are using is actually MKL. A user with a different BLAS, such as OpenBLAS, would need to set OPENBLAS_NUM_THREADS instead of MKL_NUM_THREADS.

To answer your last question: within ITensor, the only explicit parallelization is over quantum number blocks, with the number of threads used for that controlled by OMP_NUM_THREADS. However, that does not mean ITensor "only" supports that kind of multithreading. Since we use BLAS to do the tensor contractions, if you turn on multithreading for your BLAS, then the BLAS calls made by ITensor will also be multithreaded. This is something ITensor does not control itself; you control it by setting MKL_NUM_THREADS or similar.
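For example, the two levels can be set independently from the shell when launching a program that uses both OpenMP and a threaded BLAS. A minimal sketch, where the executable name is just a placeholder for your own compiled program:

```shell
# Sketch: controlling the two threading levels independently.
# "my_itensor_app" is a placeholder, not a real ITensor binary.

# 4 threads inside each BLAS (MKL) call, no block-level OpenMP threading:
export MKL_NUM_THREADS=4
export OMP_NUM_THREADS=1
# ./my_itensor_app

# 4-way block-sparse (OpenMP) threading, single-threaded BLAS calls:
export MKL_NUM_THREADS=1
export OMP_NUM_THREADS=4
# ./my_itensor_app
```

With OpenBLAS instead of MKL, replace MKL_NUM_THREADS with OPENBLAS_NUM_THREADS.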

If you did not see any effect from setting MKL_NUM_THREADS, this could be for a number of different reasons. A less likely one is that the variable is not actually being set by your code or terminal. More likely, it is one of two other things (or both):
1. there is competition for resources between the multithreading over the blocks and the multithreading over the matrix data within BLAS
2. many of the blocks or tensors are simply too small for BLAS multithreading to have much of an effect
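On point 1, note that the two thread counts multiply when BLAS calls run inside OpenMP-parallel regions, so it is easy to oversubscribe the available cores. A quick sketch of the arithmetic (the core count here is an assumption):

```shell
# Oversubscription arithmetic on an assumed 8-core machine.
CORES=8
OMP=4                   # block-level (OMP_NUM_THREADS) threads
MKL=4                   # threads inside each BLAS call (MKL_NUM_THREADS)
TOTAL=$((OMP * MKL))    # worst case when the two levels nest
echo "up to $TOTAL threads competing for $CORES cores"
# Safer starting points: OMP_NUM_THREADS=8 MKL_NUM_THREADS=1, or vice versa.
```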

In general we have seen that BLAS multithreading often does not scale very well, giving only something like a factor of 2 speedup even when more than two threads are used for it.

To be more precise about all these things, please see the benchmarks in the latest version of the ITensor paper, Section 12: https://arxiv.org/abs/2007.14822

Here is a link to the actual code that was used to obtain these benchmarks. I link to the line that sets the MKL_NUM_THREADS and OMP_NUM_THREADS variables so you can see that is indeed how it is done:
https://github.com/ITensor/ITensorBenchmarks.jl/blob/12e3a1f0ff3e587fd026d22d79a00bf36668cb34/src/runbenchmarks.jl#L234

Of course, if you have any follow-up questions, please ask. I may also ask Matt Fishman to weigh in, since he ran those benchmarks and wrote the block-sparse multithreading code.

Best regards,
Miles

commented by
Thank you for guiding me to the paper. The benchmark is very useful.