ITensor3 is slower than same code on ITensor2

Question

ITensor3 is slower than same code on ITensor2

asked Dec 11, 2019 by mpspik (440 points)

Hello,

I recently switched from ITensor 2 to version 3 and I noticed that the same code runs noticeably slower on the newer version.
A calculation that was taking ~7 hours in ITensor 2 now takes almost twice as much time.
I am not sure whether it is ITensor that is responsible to this or something in the cluster configuration where I am running the code, but I would appreciate any help or ideas about what might be causing this.

In particular, on the cluster I am using, I only have access to certain gcc compilers, and in order to use c++17 I had to switch from gcc 4.9.3 to gcc9.1.0 (the only one that supports c++17).
In both cases, I am using LAPACK 3.6.1.

Thanks,

commented Dec 11, 2019 by miles (70.2k points)

Hi, could you please say more about which type of calculation you are running? Is it a DMRG calculation? If so, what is a typical bond dimension you are reaching and are you conserving any quantum numbers? Also is the MKL library available on your system?

commented Dec 11, 2019 by miles (70.2k points)

Oh, and are you using the very newest version of ITensor 3 (version 3.1.0)?

commented Dec 11, 2019 by mpspik (440 points)

Yes, I am using the latest ITensor 3.1.0.
The calculation is DMRG followed by time evolution using applyMPO (2nd order Zalatel). The DMRG step is negligible. Bond dimension is bounded by 10. I am not using any quantum number conservation.
MKL is available as part of the Intel Compiler Suite, but in the option.mk file you say "Intel C++ Compiler not recommended" ?

commented Dec 12, 2019 by MattFishman (14.1k points)

ITensor v3.1.1 has an important bug fix compared to v3.1.0, it is recommended to use that version (but that shouldn't affect the issue you are seeing).

That note only refers to the C++ compiler (the Intel compiler is generally farther behind in terms of features compared to gcc or clang). You can still use Intel MKL with gcc/clang, which you set farther down in options.mk (i.e. use `PLATFORM=mkl` and set the library and include flags for the location of MKL on your platform).

However, if you are comparing the same BLAS/LAPACK implementation, the efficiency should be the same between ITensor v2 and v3 for dense operations (i.e. with no quantum numbers). It may be good to confirm you see the same issue when comparing ITensor v2 and v3, both using Intel MKL (I have seen a speedup of a factor of 2 of Intel MKL vs. other BLAS implementations).

commented Dec 12, 2019 by MattFishman (14.1k points)
edited Dec 12, 2019 by MattFishman

Also, could you please confirm that all of the bond dimensions involved in your calculation (the maximum bond dimension of the MPS and the MPO) are the same between ITensor v2 and v3? That is the only thing I could think could account for such a large difference in performance.

commented Dec 14, 2019 by mpspik (440 points)

The bond dimension is the same (within 0.01 margin). I recompiled ITensor and it seems to be a lot better, although I am not sure what changed. I added some GCC_LIBS flags to the compilation commands.
ITensor3 is still slower by about ~0.5-1 hours, but I guess I can live with that, although it would be good to know if it expected or not.
I will try the suggestion to use MKL.

commented Dec 14, 2019 by miles (70.2k points)

Matt and I were wondering another thing also: if the bond dimension is bounded by 10, how can the calculation be taking many hours to complete? Is the number of sites or time steps very large say? Or perhaps the MPO is very large? Ideally it would be great if you could provide us with a minimal working code that we could run ourselves to reproduce the slowdown between versions. You can email it to us if you do not want to post it here. Best regards, Miles

commented Dec 16, 2019 by MattFishman (14.1k points)

If all of the bond dimensions are the same, it is unexpected to hear that ITensor v3 would be slower than ITensor v2 (if anything, dense contractions in the latest v3 should be a bit faster than v2 since we made an optimization to the contraction code to decrease the allocation time of the output ITensor storage). All timing comparisons we have done for contractions and DMRG have confirmed this as well. As Miles said, a minimal working code that shows a slowdown would be useful (perhaps after you compare ITensor v2 and v3 both using Intel MKL).

Cheers,
Matt

commented Dec 20, 2019 by mpspik (440 points)

Hello Miles and Matt,

I switched to intel MKL and I have to say it is amazing! what used to take ~7 hours now takes ~2.5 hours!
Unfortunately, this is with v2. v3 is lagging behind at about an hour longer ~3.5 hours.
To be precise, I am using the latest v2 and v3 from github, both compiled with c++17 with gcc/9.1.0 and MKL2019.5.281.

The reason it takes so long with a bond dimension of 10 is because I have a system of 400 bosonic sites , with up to 4 bosons per sites (technically, one site is special with twice the internal states) . I am doing time evolution with time step 0.01, until 300 (so 300*100 time steps). I making measurements and saving results (results themselves as well as the full MPS) to file every 100 time steps. I think this take a nonnegligible amount of time.

I would gladly send you my code. However, right now it is far from being "minimal" as it uses multiple additional headers and source files. For example, while in v3 I am using the provided bosonsite class, in v2 I had to write my own.

Do you have any examples for which you have done the timing comparisons that I can repeat on my end? If not, I might try to do some testing with a simple spin chain and evolve under some random Hamiltonian.

commented Dec 20, 2019 by MattFishman (14.1k points)

A quick comparison you can do is just run some of the dmrg sample codes that are included in the itensor installation (in the directory where you have installed ITensor, go into the 'sample/' directory which should be roughly the same for ITensor v2 and v3). You can play around with some of those sample codes and compare between the two versions.

For the 'dmrgj1j2.cc' sample, I modified it to use SpinOne instead of SpinHalf (to make it a bit harder), turned off QNs (since you are using dense tensors), and ran ITensor v2 and v3 with J2 = 1. The timings for the last sweep using ITensor v2 are:

    vN Entropy at center bond b=50 = 1.731717097752
    Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
    Largest m during sweep 5/5 was 200
    Largest truncation error: 7.20283e-07
    Energy after sweep 5/5 is -151.154829221677
    Sweep 5/5 CPU time = 9m, 27.9s (Wall time = 50.28s)

while for the same run using ITensor v3 I get:

    vN Entropy at center bond b=50 = 1.731717627565
    Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
    Largest link dim during sweep 5/5 was 200
    Largest truncation error: 2.21877e-07
    Energy after sweep 5/5 is -151.154829255446
    Sweep 5/5 CPU time = 5m, 52.6s (Wall time = 31.84s)

You can see the bond dimensions are the same, and I am seeing that ITensor v3 is quite a bit faster. This is due to an optimization that we did that should speed up (to different extents) every contraction in ITensor. For every test case I did I have seen either that ITensor v2 and v3 are about the same speed or that ITensor v3 is faster. Let me know if you see something different.

Maybe the reading and writing (or something else besides standard tensor operations like contraction and svd) is taking up a significant amount of time in your calculations? Have you tried using timers or profilers (like gperftools)?

Cheers,
Matt

commented Dec 20, 2019 by miles (70.2k points)

I agree having a gperftools profile of your code would be very clarifying, Ron. One tip that Matt taught me is that you need to make sure to set OMP_NUM_THREADS=1 (shell environment variable) when using it to get meaningful results. (Otherwise it reports a ton of threading functions that are misleading.) But apart from that it's really great at revealing performance bottlenecks.

commented Dec 20, 2019 by mpspik (440 points)

I ran the dmrgj1j2 code with the same modifications as Matt.
Both v2 and v3 seems approximately the same. I dont see the speedup that Matt sees.

I will look into gperftools.

commented Dec 20, 2019 by mpspik (440 points)

V2 Code:

#include "itensor/all.h"

using namespace itensor;

int main(int argc, char* argv[])
    {
    int N = 100;

    println("Input J2 value:");
    Real J2 = 0;
    std::cin >> J2;

    //
    // Initialize the site degrees of freedom.
    //
    auto sites = SpinOne(N);

    //
    // Create the Hamiltonian matrix product operator.
    // Here we use the IQMPO class which is an MPO of
    // IQTensors, tensors whose indices are sorted
    // with respect to quantum numbers
    //
    auto ampo = AutoMPO(sites);
    for(int j = 1; j < N; ++j)
        {
        ampo += 0.5,"S+",j,"S-",j+1;
        ampo += 0.5,"S-",j,"S+",j+1;
        ampo +=     "Sz",j,"Sz",j+1;
        }
    for(int j = 1; j < N-1; ++j)
        {
        ampo += 0.5*J2,"S+",j,"S-",j+2;
        ampo += 0.5*J2,"S-",j,"S+",j+2;
        ampo +=     J2,"Sz",j,"Sz",j+2;
        }
    auto H = MPO(ampo);

    // Set the initial wavefunction matrix product state
    // to be a Neel state.
    //
    auto state = InitState(sites);
    for(int i = 1; i <= N; ++i)
        state.set(i,(i%2==1 ? "Up" : "Dn"));

    auto psi = MPS(state);

    //
    // overlap calculates matrix elements of MPO's with respect to MPS's
    // overlap(psi,H,psi) = <psi|H|psi>
    //
    printfln("Initial energy = %.5f",overlap(psi,H,psi));

    //
    // Set the parameters controlling the accuracy of the DMRG
    // calculation for each DMRG sweep. Here less than 10 maxm
    // values are provided, so all remaining sweeps will use the
    // last maxm (= 200).
    //
    auto sweeps = Sweeps(5);
    sweeps.maxm() = 50,50,100,100,200;
    sweeps.cutoff() = 1E-8;
    println(sweeps);

    //
    // Begin the DMRG calculation
    //
    auto energy = dmrg(psi,H,sweeps,"Quiet");

    //
    // Print the final energy reported by DMRG
    //
    printfln("\nGround State Energy = %.10f",energy);
    printfln("\nUsing overlap = %.10f\n", overlap(psi,H,psi) );

    //println("\nTotal QN of Ground State = ",totalQN(psi));

    //
    // Measure S.S on every bond
    //
    for(int b = 1; b < N; ++b)
        {
        psi.position(b);
        auto ketzz = psi.A(b)*psi.A(b+1)*sites.op("Sz",b)*sites.op("Sz",b+1);
        auto ketpm = psi.A(b)*psi.A(b+1)*sites.op("Sp",b)*sites.op("Sm",b+1)*0.5;
        auto ketmp = psi.A(b)*psi.A(b+1)*sites.op("Sm",b)*sites.op("Sp",b+1)*0.5;
        auto bra = dag(psi.A(b)*psi.A(b+1));
        bra.prime(Site);
        auto SdS = (bra*ketzz).real() + (bra*ketpm).real() + (bra*ketmp).real();
        printfln("S.S b %d = %.10f",b,SdS);
        }

    return 0;
    }

Output:

    vN Entropy at center bond b=50 = 1.281288148412
    Eigs at center bond b=50: 0.4629 0.3626 0.0686 0.0426 0.0402 0.0115 0.0094 0.0017
    Largest m during sweep 1/5 was 9
    Largest truncation error: 5.70055e-16
    Energy after sweep 1/5 is -133.047725652099
    Sweep 1/5 CPU time = 0.0630s (Wall time = 0.0677s)

    vN Entropy at center bond b=50 = 1.538710479300
    Eigs at center bond b=50: 0.4453 0.2765 0.1089 0.0847 0.0356 0.0177 0.0088 0.0059 0.0036 0.0027
    Largest m during sweep 2/5 was 50
    Largest truncation error: 2.64072e-05
    Energy after sweep 2/5 is -149.576391371753
    Sweep 2/5 CPU time = 1.439s (Wall time = 1.861s)

    vN Entropy at center bond b=50 = 1.679144043470
    Eigs at center bond b=50: 0.3261 0.2293 0.2017 0.1810 0.0098 0.0086 0.0081 0.0036 0.0033 0.0030
    Largest m during sweep 3/5 was 100
    Largest truncation error: 5.82099e-06
    Energy after sweep 3/5 is -151.121715673937
    Sweep 3/5 CPU time = 17.58s (Wall time = 21.05s)

    vN Entropy at center bond b=50 = 1.720886079247
    Eigs at center bond b=50: 0.3090 0.2085 0.2072 0.2060 0.0101 0.0100 0.0100 0.0033 0.0033 0.0033
    Largest m during sweep 4/5 was 100
    Largest truncation error: 1.34707e-05
    Energy after sweep 4/5 is -151.147226028195
    Sweep 4/5 CPU time = 21.91s (Wall time = 26.23s)

    vN Entropy at center bond b=50 = 1.731658618466
    Eigs at center bond b=50: 0.3080 0.2069 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
    Largest m during sweep 5/5 was 200
    Largest truncation error: 7.18139e-07
    Energy after sweep 5/5 is -151.154828974666
    Sweep 5/5 CPU time = 1m, 43.2s (Wall time = 1m, 52.6s)

commented Dec 20, 2019 by mpspik (440 points)

V3 Code:

#include "itensor/all.h"

using namespace itensor;

int main(int argc, char* argv[])
    {
    int N = 100;

    println("Input J2 value:");
    Real J2 = 0;
    std::cin >> J2;

    //
    // Initialize the site degrees of freedom.
    //
    auto sites = SpinOne(N,{"ConserveQNs=",false});

    //
    // Create the Hamiltonian matrix product operator.
    // Here we use the MPO class which is an MPO of
    // IQTensors, tensors whose indices are sorted
    // with respect to quantum numbers
    //
    auto ampo = AutoMPO(sites);
    for(int j = 1; j < N; ++j)
        {
        ampo += 0.5,"S+",j,"S-",j+1;
        ampo += 0.5,"S-",j,"S+",j+1;
        ampo +=     "Sz",j,"Sz",j+1;
        }
    for(int j = 1; j < N-1; ++j)
        {
        ampo += 0.5*J2,"S+",j,"S-",j+2;
        ampo += 0.5*J2,"S-",j,"S+",j+2;
        ampo +=     J2,"Sz",j,"Sz",j+2;
        }
    auto H = toMPO(ampo);

    // Set the initial wavefunction matrix product state
    // to be a Neel state.
    //
    auto state = InitState(sites);
    for(int i = 1; i <= N; ++i)
        state.set(i,(i%2==1 ? "Up" : "Dn"));

    auto psi0 = MPS(state);

    //
    // inner calculates matrix elements of MPO's with respect to MPS's
    // inner(psi0,H,psi0) = <psi0|H|psi0>
    //
    printfln("Initial energy = %.5f",inner(psi0,H,psi0));

    //
    // Set the parameters controlling the accuracy of the DMRG
    // calculation for each DMRG sweep. Here less than 10 maxdim
    // values are provided, so all remaining sweeps will use the
    // last maxdim (= 200).
    //
    auto sweeps = Sweeps(5);
    sweeps.maxdim() = 50,50,100,100,200;
    sweeps.cutoff() = 1E-8;
    println(sweeps);

    //
    // Begin the DMRG calculation
    //
    auto [energy,psi] = dmrg(H,psi0,sweeps,"Quiet");

    //
    // Print the final energy reported by DMRG
    //
    printfln("\nGround State Energy = %.10f",energy);
    printfln("\nUsing inner = %.10f\n", inner(psi,H,psi) );

    //println("\nTotal QN of Ground State = ",totalQN(psi));

    //
    // Measure S.S on every bond
    //
    for(int b = 1; b < N; ++b)
        {
        psi.position(b);
        auto ketzz = psi(b)*psi(b+1)*op(sites,"Sz",b)*op(sites,"Sz",b+1);
        auto ketpm = psi(b)*psi(b+1)*op(sites,"Sp",b)*op(sites,"Sm",b+1)*0.5;
        auto ketmp = psi(b)*psi(b+1)*op(sites,"Sm",b)*op(sites,"Sp",b+1)*0.5;
        auto bra = dag(psi(b)*psi(b+1));
        bra.prime("Site");
        auto SdS = elt(bra*ketzz) + elt(bra*ketpm) + elt(bra*ketmp);
        printfln("S.S b %d = %.10f",b,SdS);
        }

    return 0;
    }

Output:

    vN Entropy at center bond b=50 = 1.281288148412
    Eigs at center bond b=50: 0.4629 0.3626 0.0686 0.0426 0.0402 0.0115 0.0094 0.0017
    Largest link dim during sweep 1/5 was 9
    Largest truncation error: 3.99609e-16
    Energy after sweep 1/5 is -133.047725652099
    Sweep 1/5 CPU time = 0.109s (Wall time = 0.112s)

    vN Entropy at center bond b=50 = 1.538717778431
    Eigs at center bond b=50: 0.4453 0.2765 0.1089 0.0847 0.0356 0.0177 0.0088 0.0059 0.0036 0.0027
    Largest link dim during sweep 2/5 was 50
    Largest truncation error: 1.19503e-05
    Energy after sweep 2/5 is -149.576253943409
    Sweep 2/5 CPU time = 1.534s (Wall time = 1.952s)

    vN Entropy at center bond b=50 = 1.679145692220
    Eigs at center bond b=50: 0.3261 0.2293 0.2017 0.1810 0.0098 0.0086 0.0081 0.0036 0.0033 0.0030
    Largest link dim during sweep 3/5 was 100
    Largest truncation error: 1.86552e-06
    Energy after sweep 3/5 is -151.121709478775
    Sweep 3/5 CPU time = 18.79s (Wall time = 22.47s)

    vN Entropy at center bond b=50 = 1.720891954614
    Eigs at center bond b=50: 0.3090 0.2085 0.2072 0.2060 0.0101 0.0100 0.0100 0.0033 0.0033 0.0033
    Largest link dim during sweep 4/5 was 100
    Largest truncation error: 4.20862e-06
    Energy after sweep 4/5 is -151.147225986774
    Sweep 4/5 CPU time = 22.59s (Wall time = 27.49s)

    vN Entropy at center bond b=50 = 1.731658867223
    Eigs at center bond b=50: 0.3080 0.2069 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
    Largest link dim during sweep 5/5 was 200
    Largest truncation error: 2.21218e-07
    Energy after sweep 5/5 is -151.154828916378
    Sweep 5/5 CPU time = 1m, 45.5s (Wall time = 1m, 54.5s)

commented Dec 20, 2019 by MattFishman (14.1k points)

It looks like you may be running MKL with a single thread (since your CPU time and Wall time are about equal). When I run with a single thread I also get a similar time for ITensor v2 and v3. For example:

ITensor v2 with a single thread (export OMP_NUM_THREADS=1):

    vN Entropy at center bond b=50 = 1.731717097752
    Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
    Largest m during sweep 5/5 was 200
    Largest truncation error: 7.20283e-07
    Energy after sweep 5/5 is -151.154829221677
    Sweep 5/5 CPU time = 42.71s (Wall time = 1m, 0.0s)

ITensor v3 with a single thread:

    vN Entropy at center bond b=50 = 1.731717627565
    Eigs at center bond b=50: 0.3080 0.2068 0.2068 0.2068 0.0103 0.0103 0.0103 0.0033 0.0033 0.0033
    Largest link dim during sweep 5/5 was 200
    Largest truncation error: 2.21877e-07
    Energy after sweep 5/5 is -151.154829255446
    Sweep 5/5 CPU time = 39.90s (Wall time = 54.32s)

In this case, it seems as though using only a single BLAS thread slows down the matrix multiplication and therefore the optimization from ITensor v2 to v3 is not as pronounced (since the optimization was in speeding up the allocation of ITensor data, so it is a subleading cost compared to the matrix multiplication performed in the contraction). You should make sure you are running MKL with multithreading turned on for optimal performance (and note the MKL performance may depend on the machine you are running on).

Likely the slowdown in your code is not due to standard tensor operations like svd and contraction (since you are seeing that dmrg is comparable between ITensor v2 and v3, and dmrg is dominated by tensor contractions and svd operations). I think the next step would be to do some timing and profiling to try to pinpoint what operations are dominating your calculation.

Cheers,
Matt

ITensor3 is slower than same code on ITensor2

Please log in or register to add a comment.

Please log in or register to answer this question.

Categories