Unable to run FEniCS (OASIS) on multiple compute nodes with MPI


Dear FEniCS community,

I have a problem running FEniCS on multiple compute nodes of a cluster with MPI. The errors below do not occur when I run the program with MPI on a single compute node.

OVERVIEW OF SOFTWARE & SYSTEM
The machines I run the code on are bullx nodes with 24 processors each, managed by the Slurm workload manager. I compiled FEniCS 2017.1.0 from source. The cluster's tech support told me the problem most likely lies somewhere in my FEniCS scripts, so they could not help me. I use the OASIS CFD solvers for my problems. Because of the workload manager I have to submit compute jobs via batch scripts, which are pretty standard: they claim a number of compute nodes/processors. To run the programs, I am advised to use the command:
srun python SOME_FENICS_PROGRAM.py
Here srun is a wrapper around mpirun that supplies all the required options/flags to run the program on the requested number of compute nodes/processors. All input data (HDF5 meshes) is located in, and all output is written to, a shared temporary directory.
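
For completeness, the batch scripts look roughly like the sketch below; the node counts, wall time and partition name are illustrative placeholders rather than my exact settings:

#!/bin/bash
#SBATCH --nodes=2                 # e.g. two 24-core nodes = 48 MPI tasks
#SBATCH --ntasks-per-node=24
#SBATCH --time=01:00:00           # placeholder wall time
#SBATCH --partition=normal        # placeholder partition name

# (environment setup for my own FEniCS 2017.1.0 build goes here)

cd /scratch-shared/tmp.XXXXXXXXXX    # shared temporary directory holding the mesh
srun python SOME_FENICS_PROGRAM.py

srun then derives the number of tasks and their placement from these settings, which is why I do not pass any explicit MPI flags to it myself.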

ERROR LOG
The error I get is:
Submitted batch job 3651270
/scratch-shared/tmp.Oh9K4HOXVP
Reading input from: /scratch-shared/tmp.Oh9K4HOXVP/pspec_mesh2.h5
Saving output to:   /scratch-shared/tmp.Oh9K4HOXVP/graft_pspec
Reading mesh...
[40] ***ASSERTION failed on line 207 of file /nfs/home6/squicken/petsc/petsc-3.7.6/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/comm.c:sendind[i] >= firstvtx && sendind[i] < lastvtx
[40] 0 2372037 2429071
python: /nfs/home6/squicken/petsc/petsc-3.7.6/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/comm.c:207: libparmetis__CommSetup: Assertion `sendind[i] >= firstvtx && sendind[i] < lastvtx' failed.
srun: error: tcn1115: task 40: Aborted
srun: Terminating job step 3651269.0
slurmstepd: *** STEP 3651269.0 ON tcn1114 CANCELLED AT 2017-10-09T11:48:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: tcn1114: tasks 0-23: Killed
srun: error: tcn1115: tasks 24-39,41-47: Killed

This points to a ParMETIS routine inside my local PETSc build. This PETSc was compiled with ParMETIS support (more specifically, the PETSc configure script was run with the --download-parmetis=1 flag), so it's strange that this would not work.
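
If it is relevant: as far as I understand, ParMETIS only enters here as the graph partitioner DOLFIN uses to distribute the mesh when it is read in parallel, and DOLFIN exposes that choice as a runtime parameter. A minimal sketch of what I could set before reading the mesh, to check whether the crash is specific to ParMETIS (assuming SCOTCH is available in my build):

from dolfin import parameters

# Ask DOLFIN to partition the distributed mesh with SCOTCH instead of ParMETIS;
# this has to be set before the mesh is read in parallel.
parameters["mesh_partitioner"] = "SCOTCH"

If the crash goes away with SCOTCH, that would at least narrow the problem down to the ParMETIS path.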

The code crashes on one of these lines:

from dolfin import *

# Read the mesh file (mesh_file is the path to the HDF5 mesh, defined earlier in the script)
mesh = Mesh()
hdf = HDF5File(mesh.mpi_comm(), mesh_file, "r")
hdf.read(mesh, "/mesh", False)
# Read the subdomains
subdomains = CellFunction("size_t", mesh)
hdf.read(subdomains, "/subdomains")
# Read the boundaries
boundaries = FacetFunction("size_t", mesh)
hdf.read(boundaries, "/boundaries")
# Compute the boundary mesh
bmesh = BoundaryMesh(mesh, "exterior")

# Define surface normal at the inlet
inlet_normal = [-0.14058317, 0.034723956, -0.98945975]

restart_folder = None  # Specify folder for reading previous simulations
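
For what it's worth, to isolate the mesh reading/distribution step from the rest of OASIS I would run a stripped-down script along these lines (the path below is a placeholder for the actual mesh file):

from dolfin import Mesh, HDF5File, MPI

mesh_file = "/scratch-shared/.../pspec_mesh2.h5"   # placeholder: actual path to the HDF5 mesh

# Read and distribute only the mesh, nothing else
mesh = Mesh()
hdf = HDF5File(mesh.mpi_comm(), mesh_file, "r")
hdf.read(mesh, "/mesh", False)
hdf.close()

# Report how many cells each MPI rank owns after partitioning
print("rank %d owns %d cells" % (MPI.rank(mesh.mpi_comm()), mesh.num_cells()))

Run with the same srun command on two nodes, this should show whether the crash occurs independently of the rest of the solver.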

QUESTION
In conclusion, the question I would like to ask: can somebody help me resolve this issue, or at least help me locate the problem? In particular:

• Is the problem caused by the way I run the program? For example, should I specify extra options when using mpirun (or srun, for that matter) on multiple machines, or should I do something special with, for instance, the meshes?
• Do I need to add extra code to my Python program to make it run in parallel on multiple machines?
• Is there a problem with my PETSc build, and if so, why does it not show up when I run on a single machine?

Thank you in advance for taking the time to look at my problem. If any extra information is needed, I will gladly provide it.

Kind regards,
Sjeng