Problem with reading an HDF5 mesh on multiple compute nodes


540 views · written 8 months ago by SQ
Dear Community,

When I try to read an HDF5 mesh in FEniCS on multiple compute nodes via MPI, I run into some problems. To read the mesh I use the following code:

from dolfin import *

# Read the mesh file (mesh_file is the path to the .h5 file)
mesh = Mesh()
hdf = HDF5File(mpi_comm_world(), mesh_file, "r")
hdf.read(mesh, "/mesh", False)
info_green('Reading mesh... DONE')
info_green('Reading markers...')
# Read the cell and facet markers stored alongside the mesh
subdomains = CellFunction("size_t", mesh)
hdf.read(subdomains, "/subdomains")
boundaries = FacetFunction("size_t", mesh)
hdf.read(boundaries, "/boundaries")
info_green('Reading markers... DONE')
bmesh = BoundaryMesh(mesh, "exterior")


Typically the error reads something like:
[17]PETSC ERROR: ------------------------------------------------------------------------
[17]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[17]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[17]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[17]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors


The strange thing is that the problems only occur after the mesh itself has been read, i.e. after "Reading mesh... DONE" and "Reading markers..." have been displayed, at the moment the subdomains and boundaries are being read. The problems are completely absent when I read the complete mesh with subdomains and boundaries on a single compute node. Furthermore, the problems do not always occur when I comment out
hdf.read(subdomains, "/subdomains")
Since commenting out this line only solves the problem about 50% of the time, it is not really a solution. However, when I do get the code to work I see a significant speedup, so all other functionality of the code works properly on multiple nodes. Does anybody know how I can resolve this problem?

Thanks,
Sjeng
Community: FEniCS Project
Hi, I am unable to reproduce the error.
Could you be more specific about how you save the mesh and mesh functions (how many nodes, etc.)? Please set set_log_level(DEBUG) and provide the full error message.
It would be best if you could provide us with a small .h5 file that fails. Thanks!
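For reference, a minimal sketch of where the log level call would go (assuming the legacy dolfin Python interface used in your snippet above; the file name is just a placeholder):

from dolfin import *

set_log_level(DEBUG)  # raise DOLFIN's verbosity before any mesh I/O

mesh_file = "mesh.h5"  # hypothetical path to the HDF5 mesh file
mesh = Mesh()
hdf = HDF5File(mpi_comm_world(), mesh_file, "r")
hdf.read(mesh, "/mesh", False)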
written 8 months ago by Michal Habera  
Thanks for your reply!

The process of exporting the mesh is quite inefficient, but in short: I create the mesh in SALOME-PLATFORM and export it to a GMF *.mesh file, which is not directly recognized by fenics-convert but is the only exportable format that GMSH recognizes. I load this mesh into GMSH and export it to the *.msh format. I then convert this file to FEniCS XML, which I subsequently convert to an HDF5 file using the following code:

print("Converting: " + mesh_file + ".xml")
mesh = Mesh(mesh_file + ".xml")
subdomains = MeshFunction("size_t", mesh, mesh_file +
                          "_physical_region.xml")
boundaries = MeshFunction("size_t", mesh, mesh_file +
                          "_facet_region.xml")
hdf = HDF5File(mesh.mpi_comm(), mesh_file + ".h5", "w")
hdf.write(mesh, "/mesh")
hdf.write(subdomains, "/subdomains")
hdf.write(boundaries, "/boundaries")

if args.clean:
    os.remove(mesh_file + ".xml")
    os.remove(mesh_file + "_physical_region.xml")
    os.remove(mesh_file + "_facet_region.xml")
​
The number of elements is approximately 3*10^6 - 5*10^6, with around 0.5*10^6 nodes (~32-50 GB of RAM to solve the problems, hence the need for more than one compute node). Note that the meshes are read correctly when I use only one compute node. I cannot provide a small mesh today, but I will try to send you one tomorrow.
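Since the meshes read fine in serial, one sanity check (a sketch using the same legacy HDF5File API as above) is to re-read the freshly written .h5 file on a single process right after conversion, to rule out a corrupted file:

# Re-read the converted file in serial as a sanity check
mesh_check = Mesh()
hdf_check = HDF5File(mpi_comm_self(), mesh_file + ".h5", "r")
hdf_check.read(mesh_check, "/mesh", False)
hdf_check.close()
print("Re-read mesh with %d cells" % mesh_check.num_cells())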

Sadly, I do not get more error messages when I use set_log_level(DEBUG). I think this is because the errors occur in some external PETSc module. I will try to obtain a more elaborate error message as soon as possible.

Sjeng
written 8 months ago by SQ  
I tried reproducing the same error with a smaller mesh (~2*10^5 elements, just a straight tube) that I made in exactly the same way, but that mesh loads without any problem. Could the way I export the meshes become a problem for larger meshes? I notice that the mesh files themselves get very large when I export them to HDF5 (about 5 times as large as the corresponding GMSH files, or ~500 MB in the case of 3*10^6 elements). Might this be the problem? If you like, you can download the smaller mesh file from https://surfdrive.surf.nl/files/index.php/s/Y6V33zoHjcBQykU; it is structured in exactly the same way as the meshes that produce the errors.

Sjeng
written 8 months ago by SQ  
So this conversion script is run on only one node (mpi_comm_self)? For converting the mesh, you could also give https://github.com/nschloe/meshio a try.

It really is a huge mesh and you have many conversion steps. I hope there is no "trim" behavior in any of them.
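As a rough sketch (assuming a reasonably recent meshio; the exact call signatures may differ between versions, and the file names are just placeholders), the GMSH file could be converted straight to XDMF:

import meshio

# Convert the GMSH .msh file directly to XDMF, skipping the FEniCS XML stage
msh = meshio.read("mesh.msh")
meshio.write("mesh.xdmf", msh)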
written 8 months ago by Michal Habera  
The conversion script I use now is run on only one CPU. I managed to use meshio on my mesh and I agree that it works much more nicely than the methods I used before. However, I'm not really sure how to correctly use the XDMF meshes in FEniCS (or how to correctly export surface IDs from either GMSH or SALOME meshes). The layout of the XDMF file seems to be significantly different from that of regular FEniCS XDMF meshes, and FEniCS doesn't seem to recognize the element type. Is there a way to do this, as far as you know?
written 8 months ago by SQ  
I made a mistake in setting the log_level. The error log is provided in: https://surfdrive.surf.nl/files/index.php/s/oLKuUmRk6ymtR1t
written 8 months ago by SQ  
Dear All,

I contacted the support desk of the computing facility with my problem and they were able to create a stack trace of the error. Apparently the error comes from some SCOTCH subroutine, but I'm unable to resolve this issue. Does anybody know what the problem might be?

[squicken@int2 surfsara]$ pretty_print_strace_out.py --tree strace.out 

=== exit_group ===
     1    _exit()+57 (/usr/lib64/libc-2.17.so)
     1      __run_exit_handlers()+155 (/usr/lib64/libc-2.17.so)
     1        <???> (/usr/lib64/libc-2.17.so)
     1          MPIU_Exit()+1 (../../src/util/dbg/exit.c:22)
     1            MPID_Abort()+99 (../../src/mpid/ch3/src/mpid_abort.c:106)
     1              PMPI_Abort()+507 (../../src/mpi/init/abort.c:137)
     1                PetscSignalHandlerDefault()+471 (/nfs/admin/hpc/sw/RedHatEnterpriseServer7/PETSc/3.7.6-intel-2016b-Python-2.7.12/lib/libpetsc.so.3.7.6)
     1                  PetscSignalHandler_Private()+136 (/nfs/admin/hpc/sw/RedHatEnterpriseServer7/PETSc/3.7.6-intel-2016b-Python-2.7.12/lib/libpetsc.so.3.7.6)
     1                    __restore_rt()+0 (/usr/lib64/libc-2.17.so)
     1                      _SCOTCHbgraphBipartFm()+6609 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                        _SCOTCHbgraphBipartSt()+451 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                          _SCOTCHbdgraphBipartSq()+199 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                            _SCOTCHbdgraphBipartSt()+451 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                              _SCOTCHbdgraphBipartSt()+565 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                _SCOTCHbdgraphBipartSt()+545 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                  _SCOTCHbdgraphBipartBd()+1197 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                    _SCOTCHbdgraphBipartSt()+451 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                      bdgraphBipartMl2()+657 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                        bdgraphBipartMl2()+600 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                          bdgraphBipartMl2()+600 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                            bdgraphBipartMl2()+600 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                              bdgraphBipartMl2()+600 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                _SCOTCHbdgraphBipartMl()+615 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                  _SCOTCHbdgraphBipartSt()+451 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                    kdgraphMapRbPart2()+422 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                      _SCOTCHkdgraphMapRbPart()+806 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                        _SCOTCHkdgraphMapSt()+118 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                          SCOTCH_dgraphMapCompute()+130 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                            SCOTCH_dgraphMap()+39 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                              SCOTCH_dgraphPart()+59 (/nfs/home6/squicken/fenics_alt/lib/libdolfin.so.2017.1.0)
     1                                                                dolfin::SCOTCH::partition<int>(int, dolfin::CSRGraph<int>&, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::set<long, std::less<long>, std::allocator<long> > const&, unsigned long, std::vector<int, std::allocator<int> >&, std::map<long, std::vector<int, std::allocator<int> >, std::less<long>, std::allocator<std::pair<long const, std::vector<int, std::allocator<int> > > > >&)+986 (/home/squicken/software/fenics/dolfin/dolfin/graph/SCOTCH.cpp:328)
     1                                                                  dolfin::SCOTCH::compute_partition(int, std::vector<int, std::allocator<int> >&, std::map<long, std::vector<int, std::allocator<int> >, std::less<long>, std::allocator<std::pair<long const, std::vector<int, std::allocator<int> > > > >&, boost::multi_array<long, 2ul, std::allocator<long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, long, long, dolfin::CellType const&)+1011 (/home/squicken/software/fenics/dolfin/dolfin/graph/SCOTCH.cpp:77)
     1                                                                    dolfin::MeshPartitioning::partition_cells(int const&, dolfin::LocalMeshData const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<int, std::allocator<int> >&, std::map<long, std::vector<int, std::allocator<int> >, std::less<long>, std::allocator<std::pair<long const, std::vector<int, std::allocator<int> > > > >&)+218 (/home/squicken/software/fenics/dolfin/dolfin/mesh/MeshPartitioning.cpp:193)
     1                                                                      dolfin::MeshPartitioning::build_distributed_mesh(dolfin::Mesh&, dolfin::LocalMeshData const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+897 (/home/squicken/software/fenics/dolfin/dolfin/mesh/MeshPartitioning.cpp:134)
     1                                                                        dolfin::XDMFFile::read(dolfin::Mesh&) const()+961 (/home/squicken/software/fenics/dolfin/dolfin/io/XDMFFile.cpp:1089)
     1                                                                          _wrap_XDMFFile_read(PyObject*, PyObject*)+418 (/home/squicken/software/fenics/dolfin/build/dolfin/swig/modules/io/modulePYTHON_wrap.cxx:21084)
     1                                                                            call_function()+677 (Python/ceval.c:4564)
     1                                                                              PyEval_EvalFrameEx()+20145 (Python/ceval.c:2987)
     1                                                                                PyEval_EvalCodeEx.A()+1329 (Python/ceval.c:3582)
     1                                                                                  PyEval_EvalCode()+1 (Python/ceval.c:669)
     1                                                                                    PyRun_FileExFlags()+126 (Python/pythonrun.c:1376)
     1                                                                                      PyRun_SimpleFileExFlags()+414 (Python/pythonrun.c:948)
     1                                                                                        Py_Main()+2522 (Modules/main.c:640)
     1                                                                                          __libc_start_main()+245 (/usr/lib64/libc-2.17.so)
     [1]
written 8 months ago by SQ  
I was having a similar problem running on more than one node on a Cray. I got the same PETSc SEGV, apparently from reading the mesh. But I investigated further by adding print statements and flushing with sys.stdout.flush(). It only appeared to be crashing while reading the mesh from the HDF5 file; the actual crash happened when defining the FunctionSpace to go with the mesh, but the later output hadn't been flushed out of the stdout buffer yet.

Ultimately I found that the crash stopped happening when I recompiled with cray-petsc-64 instead of cray-petsc (version 3.7.6.2 from craype/2.5.13). We had fewer than 40 million cells, so we shouldn't have needed 64-bit PETSc, but it's what we needed to get around a segfault similar to the one you reported.
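A minimal sketch of the kind of instrumentation I mean (the file name and function space are illustrative, not my exact script):

import sys
from dolfin import *

mesh = Mesh()
hdf = HDF5File(mpi_comm_world(), "mesh.h5", "r")
hdf.read(mesh, "/mesh", False)
print("rank %d: mesh read" % MPI.rank(mpi_comm_world()))
sys.stdout.flush()  # flush so the message survives a crash further down

V = FunctionSpace(mesh, "CG", 1)
print("rank %d: FunctionSpace built" % MPI.rank(mpi_comm_world()))
sys.stdout.flush()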
written 6 months ago by Drew Parsons  