# ITensor3 is slower than same code on ITensor2

Hello,

I recently switched from ITensor 2 to version 3 and I noticed that the same code runs noticeably slower on the newer version.
A calculation that was taking ~7 hours in ITensor 2 now takes almost twice as long.
I am not sure whether ITensor is responsible for this or something in the configuration of the cluster where I am running the code, but I would appreciate any help or ideas about what might be causing it.

In particular, on the cluster I am using, I only have access to certain gcc compilers, and in order to use C++17 I had to switch from gcc 4.9.3 to gcc 9.1.0 (the only available version that supports C++17).
In both cases, I am using LAPACK 3.6.1.

Thanks,

commented by (47.7k points)
Hi, could you please say more about which type of calculation you are running? Is it a DMRG calculation? If so, what is a typical bond dimension you are reaching and are you conserving any quantum numbers? Also is the MKL library available on your system?
commented by (47.7k points)
Oh, and are you using the very newest version of ITensor 3 (version 3.1.0)?
commented by (330 points)
Yes, I am using the latest ITensor 3.1.0.
The calculation is DMRG followed by time evolution using applyMPO (2nd order Zalatel). The DMRG step is negligible. Bond dimension is bounded by 10.  I am not using any quantum number conservation.
MKL is available as part of the Intel Compiler Suite, but in the options.mk file you say "Intel C++ Compiler not recommended"?
commented by (8.5k points)
ITensor v3.1.1 has an important bug fix compared to v3.1.0, so we recommend using that version (though it shouldn't affect the issue you are seeing).

That note only refers to the C++ compiler (the Intel compiler is generally farther behind in terms of features compared to gcc or clang). You can still use Intel MKL with gcc/clang, which you set farther down in options.mk (i.e. use PLATFORM=mkl and set the library and include flags for the location of MKL on your platform).
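For example, the relevant settings in options.mk might look roughly like the following (a hedged sketch: the variable names follow the ITensor v3 options.mk, but the MKL paths and link line are placeholders that depend on your cluster):

```make
# Sketch of options.mk settings for compiling ITensor with gcc + Intel MKL.
# MKLROOT and the exact link flags are placeholders; use the MKL location
# and link line appropriate for your system.
CCCOM=g++ -m64 -std=c++17 -fconcepts -fPIC
PLATFORM=mkl
BLAS_LAPACK_LIBFLAGS=-L$(MKLROOT)/lib/intel64 -lmkl_rt -lpthread -lm -ldl
BLAS_LAPACK_INCLUDEFLAGS=-I$(MKLROOT)/include
```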

However, if you are comparing the same BLAS/LAPACK implementation, the efficiency should be the same between ITensor v2 and v3 for dense operations (i.e. with no quantum numbers). It may be good to confirm you see the same issue when comparing ITensor v2 and v3, both using Intel MKL (I have seen a speedup of a factor of 2 of Intel MKL vs. other BLAS implementations).
commented by (8.5k points)
edited
Also, could you please confirm that all of the bond dimensions involved in your calculation (the maximum bond dimension of the MPS and of the MPO) are the same between ITensor v2 and v3? That is the only thing I can think of that could account for such a large difference in performance.
commented by (330 points)
The bond dimension is the same (to within 0.01). I recompiled ITensor and it seems to be a lot faster now, although I am not sure what changed; I added some GCC_LIBS flags to the compilation commands.
ITensor 3 is still slower by about 0.5-1 hours, but I guess I can live with that, although it would be good to know whether that is expected or not.
I will try the suggestion to use MKL.
commented by (47.7k points)
Matt and I were wondering another thing as well: if the bond dimension is bounded by 10, how can the calculation be taking many hours to complete? Is the number of sites or time steps very large, say? Or perhaps the MPO is very large? Ideally it would be great if you could provide us with a minimal working code that we could run ourselves to reproduce the slowdown between versions. You can email it to us if you do not want to post it here. Best regards, Miles
commented by (8.5k points)
If all of the bond dimensions are the same, it is unexpected to hear that ITensor v3 would be slower than ITensor v2 (if anything, dense contractions in the latest v3 should be a bit faster than v2 since we made an optimization to the contraction code to decrease the allocation time of the output ITensor storage). All timing comparisons we have done for contractions and DMRG have confirmed this as well. As Miles said, a minimal working code that shows a slowdown would be useful (perhaps after you compare ITensor v2 and v3 both using Intel MKL).

Cheers,
Matt
commented by (330 points)
Hello Miles and Matt,

I switched to Intel MKL and I have to say it is amazing! What used to take ~7 hours now takes ~2.5 hours!
Unfortunately, this is with v2; v3 is lagging behind by about an hour, at ~3.5 hours.
To be precise, I am using the latest v2 and v3 from GitHub, both compiled with C++17 using gcc 9.1.0 and MKL 2019.5.281.

The reason it takes so long with a bond dimension of 10 is that I have a system of 400 bosonic sites, with up to 4 bosons per site (technically, one site is special, with twice the internal states). I am doing time evolution with time step 0.01 up to time 300 (so 300*100 time steps). I am making measurements and saving results (the results themselves as well as the full MPS) to file every 100 time steps, which I think takes a non-negligible amount of time.

I would gladly send you my code. However, right now it is far from "minimal," as it uses multiple additional headers and source files. For example, while in v3 I am using the provided Boson site class, in v2 I had to write my own.

Do you have any examples for which you have done the timing comparisons that I can repeat on my end? If not, I might try to do some testing with a simple spin chain and evolve under some random Hamiltonian.
commented by (8.5k points)
A quick comparison you can do is to run some of the DMRG sample codes included in the ITensor installation: in the directory where you have installed ITensor, go into the 'sample/' directory (which should be roughly the same for ITensor v2 and v3). You can play around with some of those sample codes and compare between the two versions.

For the 'dmrgj1j2.cc' sample, I modified it to use SpinOne instead of SpinHalf (to make it a bit harder), turned off QNs (since you are using dense tensors), and ran ITensor v2 and v3 with J2 = 1. The timings for the last sweep using ITensor v2 are:

```
vN Entropy at center bond b=50 = 1.731717097752
Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
Largest m during sweep 5/5 was 200
Largest truncation error: 7.20283e-07
Energy after sweep 5/5 is -151.154829221677
Sweep 5/5 CPU time = 9m, 27.9s (Wall time = 50.28s)
```

while for the same run using ITensor v3 I get:

```
vN Entropy at center bond b=50 = 1.731717627565
Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
Largest link dim during sweep 5/5 was 200
Largest truncation error: 2.21877e-07
Energy after sweep 5/5 is -151.154829255446
Sweep 5/5 CPU time = 5m, 52.6s (Wall time = 31.84s)
```

You can see the bond dimensions are the same, and I am seeing that ITensor v3 is quite a bit faster. This is due to an optimization that we did that should speed up (to different extents) every contraction in ITensor. For every test case I did I have seen either that ITensor v2 and v3 are about the same speed or that ITensor v3 is faster. Let me know if you see something different.

Maybe the reading and writing (or something else besides standard tensor operations like contraction and svd) is taking up a significant amount of time in your calculations? Have you tried using timers or profilers (like gperftools)?

Cheers,
Matt
commented by (47.7k points)
I agree having a gperftools profile of your code would be very clarifying, Ron. One tip that Matt taught me is that you need to make sure to set OMP_NUM_THREADS=1 (shell environment variable) when using it to get meaningful results. (Otherwise it reports a ton of threading functions that are misleading.) But apart from that it's really great at revealing performance bottlenecks.
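Concretely, a gperftools run might look like the following (a sketch: "./my_app" is a placeholder binary name, and the binary must be linked with -lprofiler for the CPUPROFILE variable to take effect):

```shell
# Sketch of a gperftools CPU-profiling session (binary name is a placeholder).
export OMP_NUM_THREADS=1     # single thread, so threading functions don't clutter the profile
export CPUPROFILE=prof.out   # gperftools writes its samples here
# ./my_app                        # run the program (linked with -lprofiler)
# pprof --text ./my_app prof.out  # print a flat profile of the hottest functions
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```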
commented by (330 points)
I ran the dmrgj1j2 code with the same modifications as Matt.
v2 and v3 seem approximately the same; I don't see the speedup that Matt sees.

I will look into gperftools.
commented by (330 points)
V2 Code:

```cpp
#include "itensor/all.h"

using namespace itensor;

int main(int argc, char* argv[])
{
    int N = 100;

    println("Input J2 value:");
    Real J2 = 0;
    std::cin >> J2;

    //
    // Initialize the site degrees of freedom.
    //
    auto sites = SpinOne(N);

    //
    // Create the Hamiltonian matrix product operator,
    // using AutoMPO as a convenient interface for
    // specifying the Hamiltonian terms.
    //
    auto ampo = AutoMPO(sites);
    for(int j = 1; j < N; ++j)
    {
        ampo += 0.5,"S+",j,"S-",j+1;
        ampo += 0.5,"S-",j,"S+",j+1;
        ampo +=     "Sz",j,"Sz",j+1;
    }
    for(int j = 1; j < N-1; ++j)
    {
        ampo += 0.5*J2,"S+",j,"S-",j+2;
        ampo += 0.5*J2,"S-",j,"S+",j+2;
        ampo +=     J2,"Sz",j,"Sz",j+2;
    }
    auto H = MPO(ampo);

    //
    // Set the initial wavefunction matrix product state
    // to be a Neel state.
    //
    auto state = InitState(sites);
    for(int i = 1; i <= N; ++i)
        state.set(i,(i%2==1 ? "Up" : "Dn"));

    auto psi = MPS(state);

    //
    // overlap calculates matrix elements of MPOs with respect to MPSs:
    // overlap(psi,H,psi) = <psi|H|psi>
    //
    printfln("Initial energy = %.5f",overlap(psi,H,psi));

    //
    // Set the parameters controlling the accuracy of the DMRG
    // calculation for each DMRG sweep. Here fewer than 10 maxm
    // values are provided, so all remaining sweeps will use the
    // last maxm (= 200).
    //
    auto sweeps = Sweeps(5);
    sweeps.maxm() = 50,50,100,100,200;
    sweeps.cutoff() = 1E-8;
    println(sweeps);

    //
    // Begin the DMRG calculation
    //
    auto energy = dmrg(psi,H,sweeps,"Quiet");

    //
    // Print the final energy reported by DMRG
    //
    printfln("\nGround State Energy = %.10f",energy);
    printfln("\nUsing overlap = %.10f\n", overlap(psi,H,psi) );

    //println("\nTotal QN of Ground State = ",totalQN(psi));

    //
    // Measure S.S on every bond
    //
    for(int b = 1; b < N; ++b)
    {
        psi.position(b);
        auto ketzz = psi.A(b)*psi.A(b+1)*sites.op("Sz",b)*sites.op("Sz",b+1);
        auto ketpm = psi.A(b)*psi.A(b+1)*sites.op("Sp",b)*sites.op("Sm",b+1)*0.5;
        auto ketmp = psi.A(b)*psi.A(b+1)*sites.op("Sm",b)*sites.op("Sp",b+1)*0.5;
        auto bra = dag(psi.A(b)*psi.A(b+1));
        bra.prime(Site);
        auto SdS = (bra*ketzz).real() + (bra*ketpm).real() + (bra*ketmp).real();
        printfln("S.S b %d = %.10f",b,SdS);
    }

    return 0;
}
```

Output:

```
vN Entropy at center bond b=50 = 1.281288148412
Eigs at center bond b=50: 0.4629 0.3626 0.0686 0.0426 0.0402 0.0115 0.0094 0.0017
Largest m during sweep 1/5 was 9
Largest truncation error: 5.70055e-16
Energy after sweep 1/5 is -133.047725652099
Sweep 1/5 CPU time = 0.0630s (Wall time = 0.0677s)

vN Entropy at center bond b=50 = 1.538710479300
Eigs at center bond b=50: 0.4453 0.2765 0.1089 0.0847 0.0356 0.0177 0.0088 0.0059 0.0036 0.0027
Largest m during sweep 2/5 was 50
Largest truncation error: 2.64072e-05
Energy after sweep 2/5 is -149.576391371753
Sweep 2/5 CPU time = 1.439s (Wall time = 1.861s)

vN Entropy at center bond b=50 = 1.679144043470
Eigs at center bond b=50: 0.3261 0.2293 0.2017 0.1810 0.0098 0.0086 0.0081 0.0036 0.0033 0.0030
Largest m during sweep 3/5 was 100
Largest truncation error: 5.82099e-06
Energy after sweep 3/5 is -151.121715673937
Sweep 3/5 CPU time = 17.58s (Wall time = 21.05s)

vN Entropy at center bond b=50 = 1.720886079247
Eigs at center bond b=50: 0.3090 0.2085 0.2072 0.2060 0.0101 0.0100 0.0100 0.0033 0.0033 0.0033
Largest m during sweep 4/5 was 100
Largest truncation error: 1.34707e-05
Energy after sweep 4/5 is -151.147226028195
Sweep 4/5 CPU time = 21.91s (Wall time = 26.23s)

vN Entropy at center bond b=50 = 1.731658618466
Eigs at center bond b=50: 0.3080 0.2069 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
Largest m during sweep 5/5 was 200
Largest truncation error: 7.18139e-07
Energy after sweep 5/5 is -151.154828974666
Sweep 5/5 CPU time = 1m, 43.2s (Wall time = 1m, 52.6s)
```
commented by (330 points)
V3 Code:

```cpp
#include "itensor/all.h"

using namespace itensor;

int main(int argc, char* argv[])
{
    int N = 100;

    println("Input J2 value:");
    Real J2 = 0;
    std::cin >> J2;

    //
    // Initialize the site degrees of freedom
    // (with quantum number conservation turned off).
    //
    auto sites = SpinOne(N,{"ConserveQNs=",false});

    //
    // Create the Hamiltonian matrix product operator,
    // using AutoMPO as a convenient interface for
    // specifying the Hamiltonian terms.
    //
    auto ampo = AutoMPO(sites);
    for(int j = 1; j < N; ++j)
    {
        ampo += 0.5,"S+",j,"S-",j+1;
        ampo += 0.5,"S-",j,"S+",j+1;
        ampo +=     "Sz",j,"Sz",j+1;
    }
    for(int j = 1; j < N-1; ++j)
    {
        ampo += 0.5*J2,"S+",j,"S-",j+2;
        ampo += 0.5*J2,"S-",j,"S+",j+2;
        ampo +=     J2,"Sz",j,"Sz",j+2;
    }
    auto H = toMPO(ampo);

    //
    // Set the initial wavefunction matrix product state
    // to be a Neel state.
    //
    auto state = InitState(sites);
    for(int i = 1; i <= N; ++i)
        state.set(i,(i%2==1 ? "Up" : "Dn"));

    auto psi0 = MPS(state);

    //
    // inner calculates matrix elements of MPOs with respect to MPSs:
    // inner(psi0,H,psi0) = <psi0|H|psi0>
    //
    printfln("Initial energy = %.5f",inner(psi0,H,psi0));

    //
    // Set the parameters controlling the accuracy of the DMRG
    // calculation for each DMRG sweep. Here fewer than 10 maxdim
    // values are provided, so all remaining sweeps will use the
    // last maxdim (= 200).
    //
    auto sweeps = Sweeps(5);
    sweeps.maxdim() = 50,50,100,100,200;
    sweeps.cutoff() = 1E-8;
    println(sweeps);

    //
    // Begin the DMRG calculation
    //
    auto [energy,psi] = dmrg(H,psi0,sweeps,"Quiet");

    //
    // Print the final energy reported by DMRG
    //
    printfln("\nGround State Energy = %.10f",energy);
    printfln("\nUsing inner = %.10f\n", inner(psi,H,psi) );

    //println("\nTotal QN of Ground State = ",totalQN(psi));

    //
    // Measure S.S on every bond
    //
    for(int b = 1; b < N; ++b)
    {
        psi.position(b);
        auto ketzz = psi(b)*psi(b+1)*op(sites,"Sz",b)*op(sites,"Sz",b+1);
        auto ketpm = psi(b)*psi(b+1)*op(sites,"Sp",b)*op(sites,"Sm",b+1)*0.5;
        auto ketmp = psi(b)*psi(b+1)*op(sites,"Sm",b)*op(sites,"Sp",b+1)*0.5;
        auto bra = dag(psi(b)*psi(b+1));
        bra.prime("Site");
        auto SdS = elt(bra*ketzz) + elt(bra*ketpm) + elt(bra*ketmp);
        printfln("S.S b %d = %.10f",b,SdS);
    }

    return 0;
}
```

Output:

```
vN Entropy at center bond b=50 = 1.281288148412
Eigs at center bond b=50: 0.4629 0.3626 0.0686 0.0426 0.0402 0.0115 0.0094 0.0017
Largest link dim during sweep 1/5 was 9
Largest truncation error: 3.99609e-16
Energy after sweep 1/5 is -133.047725652099
Sweep 1/5 CPU time = 0.109s (Wall time = 0.112s)

vN Entropy at center bond b=50 = 1.538717778431
Eigs at center bond b=50: 0.4453 0.2765 0.1089 0.0847 0.0356 0.0177 0.0088 0.0059 0.0036 0.0027
Largest link dim during sweep 2/5 was 50
Largest truncation error: 1.19503e-05
Energy after sweep 2/5 is -149.576253943409
Sweep 2/5 CPU time = 1.534s (Wall time = 1.952s)

vN Entropy at center bond b=50 = 1.679145692220
Eigs at center bond b=50: 0.3261 0.2293 0.2017 0.1810 0.0098 0.0086 0.0081 0.0036 0.0033 0.0030
Largest link dim during sweep 3/5 was 100
Largest truncation error: 1.86552e-06
Energy after sweep 3/5 is -151.121709478775
Sweep 3/5 CPU time = 18.79s (Wall time = 22.47s)

vN Entropy at center bond b=50 = 1.720891954614
Eigs at center bond b=50: 0.3090 0.2085 0.2072 0.2060 0.0101 0.0100 0.0100 0.0033 0.0033 0.0033
Largest link dim during sweep 4/5 was 100
Largest truncation error: 4.20862e-06
Energy after sweep 4/5 is -151.147225986774
Sweep 4/5 CPU time = 22.59s (Wall time = 27.49s)

vN Entropy at center bond b=50 = 1.731658867223
Eigs at center bond b=50: 0.3080 0.2069 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
Largest link dim during sweep 5/5 was 200
Largest truncation error: 2.21218e-07
Energy after sweep 5/5 is -151.154828916378
Sweep 5/5 CPU time = 1m, 45.5s (Wall time = 1m, 54.5s)
```
commented by (8.5k points)
It looks like you may be running MKL with a single thread (since your CPU time and Wall time are about equal). When I run with a single thread, I also get similar times for ITensor v2 and v3. For example, ITensor v2 with a single thread:

```
vN Entropy at center bond b=50 = 1.731717097752
Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
Largest m during sweep 5/5 was 200
Largest truncation error: 7.20283e-07
Energy after sweep 5/5 is -151.154829221677
Sweep 5/5 CPU time = 42.71s (Wall time = 1m, 0.0s)
```

ITensor v3 with a single thread:

```
vN Entropy at center bond b=50 = 1.731717627565
Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
Largest link dim during sweep 5/5 was 200
Largest truncation error: 2.21877e-07
Energy after sweep 5/5 is -151.154829255446
Sweep 5/5 CPU time = 39.90s (Wall time = 54.32s)
```

In this case, it seems that using only a single BLAS thread slows down the matrix multiplications, so the speedup from ITensor v2 to v3 is less pronounced (the optimization sped up the allocation of the output ITensor storage, which is a subleading cost compared to the matrix multiplications performed in a contraction). You should make sure you are running MKL with multithreading turned on for optimal performance (and note that MKL performance may depend on the machine you are running on).
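For instance, MKL's thread count can be controlled through environment variables before launching the program (a sketch; 4 threads is an illustrative value, and the right count depends on the cores allocated to your job):

```shell
# Control how many threads MKL uses for BLAS/LAPACK calls.
# 4 is illustrative; match it to your job's core allocation.
export MKL_NUM_THREADS=4
export OMP_NUM_THREADS=4
echo "MKL_NUM_THREADS=$MKL_NUM_THREADS"
```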

Likely the slowdown in your code is not due to standard tensor operations like svd and contraction (since you are seeing that dmrg is comparable between ITensor v2 and v3, and dmrg is dominated by tensor contractions and svd operations). I think the next step would be to do some timing and profiling to try to pinpoint what operations are dominating your calculation.

Cheers,
Matt