0 votes
asked by (700 points)

Hi Miles,

I recently ran into a strange error when saving an MPS to an .h5 file while running my code on multiple cores (a single core doesn't have enough memory for large bond dimensions).

For small enough bond dimensions, everything works fine on a single core. However, when I set the number of cores to more than 1, I get the error copied and pasted at the end of this post. To create and write the file, I used the standard

    f = h5open("mps-$N-free.h5","w")
    write(f,"mps-$N-free",psi)
    close(f)

Is it because this way of creating and writing the file doesn't support multi-core runs? I'm not sure whether you have seen this kind of error before.
Thanks a lot for your time.
-Mason

HDF5-DIAG: Error detected in HDF5 (1.12.0) thread 0:
#000: H5F.c line 705 in H5Fcreate(): unable to create file
major: File accessibility
minor: Unable to open file
#001: H5VLcallback.c line 3393 in H5VL_file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#002: H5VLcallback.c line 3358 in H5VL__file_create(): file create failed
major: Virtual Object Layer
minor: Unable to create file
#003: H5VLnative_file.c line 65 in H5VL__native_file_create(): unable to create file
major: File accessibility
minor: Unable to open file
#004: H5Fint.c line 1707 in H5F_open(): unable to read superblock
major: File accessibility
minor: Read failed
#005: H5Fsuper.c line 412 in H5F__super_read(): file signature not found
major: File accessibility
minor: Not an HDF5 file
ERROR: LoadError: Error creating file mps-10-free.h5
Stacktrace:
[1] error(::String, ::String)
@ Base ./error.jl:42
[2] h5f_create
@ ~/.julia/packages/HDF5/0iEnL/src/api.jl:504 [inlined]
[3] h5open(filename::String, mode::String; swmr::Bool, pv::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
@ HDF5 ~/.julia/packages/HDF5/0iEnL/src/HDF5.jl:428
[4] h5open(filename::String, mode::String)
@ HDF5 ~/.julia/packages/HDF5/0iEnL/src/HDF5.jl:412
[5] top-level scope
@ ~/new/D10/N-30-free/CFermion.jl:382
in expression starting at /public1/home/sc61251/new/D10/N-30-free/CFermion.jl:192

1 Answer

0 votes
answered by (700 points)

For anyone who is interested, I accidentally found a workaround.

The original shell script I used for job submission contained the following:

    #SBATCH -n 6
    srun -n 4

Here #SBATCH -n 6 reserves 6 tasks (one core each by default) for the job, and srun -n 4 launches the program on 4 of them. The extra 2 cores are there to provide additional memory in case the 4 cores aren't enough, so that the job doesn't get killed.

It's exactly the -n 4 passed to srun that seems to cause the problem when creating the .h5 file and saving the data, and I still haven't figured out why. In my new shell script, I simply changed -n 4 to -n 1, which, surprisingly, worked very well. The program now runs on a single core, so there is no file creation issue, and the 6 reserved cores are still there as backup, so there is no memory issue either.
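
One plausible explanation, offered as an assumption rather than something confirmed in this thread: srun -n 4 starts four separate copies of the script, and all four try to create the same .h5 file in "w" mode at the same moment, so they can collide. If several tasks really are needed, a minimal sketch of one way to avoid the clash is to give each task its own file name. The helper task_filename below is hypothetical (it is not part of ITensors or HDF5.jl); SLURM_PROCID is the environment variable srun sets to the index of each task.

```julia
# Hypothetical helper (not part of ITensors.jl or HDF5.jl): build an output
# file name that is unique per Slurm task. srun sets SLURM_PROCID to
# 0, 1, ..., ntasks-1 for the tasks it launches; when the variable is unset
# (for example outside Slurm) we fall back to "0".
function task_filename(base::AbstractString; env=ENV)
    rank = get(env, "SLURM_PROCID", "0")
    return "$(base)-task$(rank).h5"
end

N = 10  # illustrative value, matching the file name in the error message
fname = task_filename("mps-$N-free")
println("this task would write to: ", fname)
```

The resulting name would then be passed to h5open(fname,"w") in place of the fixed "mps-$N-free.h5". With srun -n 1 there is only one task, so none of this is needed.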

My initial misconception was that srun -n has to account for most of the memory consumption, so that more than 1 core would be needed to resolve the memory issue. This turned out not to be the case. Again, I still don't know why, given my very limited knowledge of computer science.

Hopefully this will be useful to people in the future, and I would also love to hear from the experts about why this is the case.
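
A possible reading of the Slurm options, hedged because it was not confirmed by anyone in this thread: the -n given to srun sets how many copies (tasks) of the command are launched, not how much memory a single copy may use, and how the reserved memory is shared depends on the cluster's configuration. A small diagnostic like the sketch below, placed at the top of the script, would show whether the script is being run more than once. The function name slurm_task_info is made up for illustration; SLURM_PROCID and SLURM_NTASKS are environment variables set by Slurm.

```julia
# Hypothetical diagnostic (the function name is illustrative): report which
# Slurm task this process is and how many tasks Slurm says were launched.
# Outside Slurm both variables are unset, so we default to task 0 of 1.
function slurm_task_info(env=ENV)
    rank   = parse(Int, get(env, "SLURM_PROCID", "0"))
    ntasks = parse(Int, get(env, "SLURM_NTASKS", "1"))
    return (rank=rank, ntasks=ntasks)
end

info = slurm_task_info()
println("task $(info.rank) of $(info.ntasks), pid $(getpid())")
```

If this line is printed four times with different task numbers and process IDs under srun -n 4, the script is running as four independent processes, each of which would try to create the same output file.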
