Unable to run FEniCS (OASIS) on multiple compute nodes with MPI
10 months ago by
I have a problem running FEniCS on multiple compute nodes in a cluster with MPI. The error described below does not occur when I run the program with MPI on a single compute node.
OVERVIEW OF SOFTWARE & SYSTEM
The machines I run the code on are bullx machines with 24 processors each, managed by the Slurm workload manager. I compiled FEniCS 2017.1.0 from source. The cluster's tech support told me the problem most likely lies somewhere in my FEniCS scripts, so they could not help me. I use the OASIS CFD solvers for my problems. Because of the workload manager, I have to submit compute jobs via batch scripts, which are pretty standard: they claim a number of compute nodes/processors. To run the programs, I am advised to use the command:
srun python SOME_FENICS_PROGRAM.py
where srun is a wrapper for mpirun that sets all the options/flags required to run the program on the requested number of compute nodes/processors. All input data (HDF5 meshes) is located in, and all output is written to, a shared temporary directory.
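One way to check that srun really spreads the tasks over both nodes, independently of FEniCS, is to launch a tiny script that only reports where each task runs. The sketch below is an assumption-laden illustration: it relies on the environment variables that Slurm normally exports to each task (SLURM_PROCID, SLURM_NTASKS, SLURM_NODEID, SLURM_LOCALID), and the file name where_am_i.py is hypothetical.

```python
import os
import socket


def describe_task(env=None, hostname=None):
    """Return a one-line description of where this task is running.

    `env` defaults to the process environment and `hostname` to the
    local host name; both can be passed explicitly for testing.
    """
    env = os.environ if env is None else env
    hostname = socket.gethostname() if hostname is None else hostname
    # Variables normally set by Slurm for every task started with srun.
    rank = env.get("SLURM_PROCID", "?")
    ntasks = env.get("SLURM_NTASKS", "?")
    node = env.get("SLURM_NODEID", "?")
    local = env.get("SLURM_LOCALID", "?")
    return "task {}/{} on {} (node {}, local id {})".format(
        rank, ntasks, hostname, node, local
    )


if __name__ == "__main__":
    print(describe_task())
```

Submitted from the same batch script as `srun python where_am_i.py`, this should print one line per task. If every task reports the same rank, or the question marks remain, the launch configuration rather than the FEniCS script would be the first thing to look at.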
The error I get is:
Submitted batch job 3651270
/scratch-shared/tmp.Oh9K4HOXVP
Reading input from: /scratch-shared/tmp.Oh9K4HOXVP/pspec_mesh2.h5
Saving output to: /scratch-shared/tmp.Oh9K4HOXVP/graft_pspec
Reading mesh...
***ASSERTION failed on line 207 of file /nfs/home6/squicken/petsc/petsc-3.7.6/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/comm.c:sendind[i] >= firstvtx && sendind[i] < lastvtx
0 2372037 2429071
python: /nfs/home6/squicken/petsc/petsc-3.7.6/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/comm.c:207: libparmetis__CommSetup: Assertion `sendind[i] >= firstvtx && sendind[i] < lastvtx' failed.
srun: error: tcn1115: task 40: Aborted
srun: Terminating job step 3651269.0
slurmstepd: *** STEP 3651269.0 ON tcn1114 CANCELLED AT 2017-10-09T11:48:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: tcn1114: tasks 0-23: Killed
srun: error: tcn1115: tasks 24-39,41-47: Killed
The failed assertion points to a ParMETIS routine (libparmetis__CommSetup) in my local PETSc build. This version of PETSc was compiled with ParMETIS support (more specifically, the PETSc configure script was run with the --download-parmetis=1 flag), so it is strange that this does not work.
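To make the failing condition concrete, the assertion can be replayed in a few lines of Python. This is only a sketch under an assumption: that the three numbers printed after the assertion message (0 2372037 2429071) are sendind[i], firstvtx and lastvtx, in that order. The log does not confirm this; it is a guess based on the order of the names in the asserted expression.

```python
def in_local_range(vertex, firstvtx, lastvtx):
    """Replay the asserted condition: firstvtx <= vertex < lastvtx."""
    return firstvtx <= vertex < lastvtx


# Values taken from the error output; their meaning is an assumption.
sendind_i, firstvtx, lastvtx = 0, 2372037, 2429071

print(in_local_range(sendind_i, firstvtx, lastvtx))  # False
print(lastvtx - firstvtx)                            # 57034
```

Under that reading, the aborting task expected a vertex index inside a half-open range of 57034 indices starting at 2372037, but received index 0.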
The code crashes on one of these lines:
# Read the mesh file
mesh = Mesh()
hdf = HDF5File(mesh.mpi_comm(), mesh_file, "r")
hdf.read(mesh, "/mesh", False)

# Read the subdomains
subdomains = CellFunction("size_t", mesh)
hdf.read(subdomains, "/subdomains")

# Read the boundaries
boundaries = FacetFunction("size_t", mesh)
hdf.read(boundaries, "/boundaries")

# Compute the boundary mesh
bmesh = BoundaryMesh(mesh, "exterior")

# Define surface normal at the inlet
inlet_normal = [-0.14058317, 0.034723956, -0.98945975]
restart_folder = None  # Specify folder for reading previous simulations
In conclusion, my question is:
Can somebody help me resolve this issue, or help me locate the problem?
• Is the problem caused by the way I run the program: e.g. should I specify some extra options when using mpirun (or srun for that matter) on multiple machines, or should I do something special with, for instance, the meshes?
• Do I need to specify some extra parts of code in my python program to make it run in parallel on multiple machines?
• Are there problems with my PETSc, and if so, why isn't this a problem when I run my problems on one machine?
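As a small aid for locating the problem, the srun lines in the error output above can be summarised per rank and node. The sketch below is a hypothetical helper, not part of FEniCS, OASIS or Slurm; it assumes srun reports failures in the `srun: error: NODE: task(s) RANKS: STATUS` form seen in that output.

```python
import re

# Matches e.g. "srun: error: tcn1115: tasks 24-39,41-47: Killed"
TASK_LINE = re.compile(r"srun: error: (\S+): tasks? ([\d,\-]+): (\w+)")


def expand_ranks(spec):
    """Turn a rank list such as '24-39,41-47' into a list of ints."""
    ranks = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ranks.extend(range(int(lo), int(hi) + 1))
        else:
            ranks.append(int(part))
    return ranks


def tasks_by_status(log_text):
    """Group (rank, node) pairs by the status srun reported for them."""
    result = {}
    for node, spec, status in TASK_LINE.findall(log_text):
        for rank in expand_ranks(spec):
            result.setdefault(status, []).append((rank, node))
    return result


log = """\
srun: error: tcn1115: task 40: Aborted
srun: error: tcn1114: tasks 0-23: Killed
srun: error: tcn1115: tasks 24-39,41-47: Killed
"""
summary = tasks_by_status(log)
print(summary["Aborted"])      # [(40, 'tcn1115')]
print(len(summary["Killed"]))  # 47
```

For this job it shows that only rank 40, on the second node (tcn1115), hit the assertion; the other 47 tasks were killed as a consequence when the job step was terminated.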
Thank you in advance for taking the time to look at my problem. If there is some extra information needed to solve my problem, I will gladly provide it.
Community: FEniCS Project