### Unable to run FEniCS (OASIS) on multiple compute nodes with MPI

Dear FEniCS community,

I have a problem running FEniCS on multiple compute nodes of a cluster with MPI. These errors do not occur when I run the program with MPI on a single compute node.

OVERVIEW OF SOFTWARE & SYSTEM
The machines I run the code on are bullx nodes, each with 24 processors, managed by the SLURM workload manager. I have compiled FEniCS 2017.1.0 from source. The cluster's tech support told me the problems likely occur somewhere in my FEniCS scripts, so they could not help me. I use the OASIS CFD solvers for my problems. Because of the workload manager I have to submit compute jobs via batch scripts, which are pretty standard: they claim a number of compute nodes/processors. To run the programs I am advised to use the command:
srun python SOME_FENICS_PROGRAM.py
where srun is a wrapper around mpirun that passes all the required options/flags to run the program on the requested number of compute nodes/processors. All input data (HDF5 meshes) is read from, and all output written to, a shared temporary directory.
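As a first sanity check (not part of the original post), a stdlib-only snippet like the following can be run with `srun python check_launch.py` to confirm that the tasks really span the requested nodes; the `SLURM_*` variables are exported by SLURM into every task started by srun, and the fallbacks apply outside a SLURM job:

```python
import os
import socket

def slurm_task_info():
    """Collect per-task launch info from SLURM's environment.

    srun exports these SLURM_* variables into every task it starts;
    outside a SLURM job they are absent, so we fall back to defaults.
    """
    return {
        "rank": int(os.environ.get("SLURM_PROCID", "0")),
        "ntasks": int(os.environ.get("SLURM_NTASKS", "1")),
        "nnodes": int(os.environ.get("SLURM_NNODES", "1")),
        "host": socket.gethostname(),
    }

if __name__ == "__main__":
    info = slurm_task_info()
    print("task %(rank)d of %(ntasks)d on %(host)s (%(nnodes)d node(s))" % info)
```

If the printed hostnames never include the second node, the problem is in the launch, not in FEniCS.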

ERROR LOG
The error I get is:
Submitted batch job 3651270
/scratch-shared/tmp.Oh9K4HOXVP
Reading input from: /scratch-shared/tmp.Oh9K4HOXVP/pspec_mesh2.h5
Saving output to:   /scratch-shared/tmp.Oh9K4HOXVP/graft_pspec
[40] ***ASSERTION failed on line 207 of file /nfs/home6/squicken/petsc/petsc-3.7.6/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/comm.c:sendind[i] >= firstvtx && sendind[i] < lastvtx
[40] 0 2372037 2429071
python: /nfs/home6/squicken/petsc/petsc-3.7.6/arch-linux2-c-opt/externalpackages/git.parmetis/libparmetis/comm.c:207: libparmetis__CommSetup: Assertion `sendind[i] >= firstvtx && sendind[i] < lastvtx' failed.
srun: error: tcn1115: task 40: Aborted
srun: Terminating job step 3651269.0
slurmstepd: *** STEP 3651269.0 ON tcn1114 CANCELLED AT 2017-10-09T11:48:50 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: tcn1114: tasks 0-23: Killed
srun: error: tcn1115: tasks 24-39,41-47: Killed

This points to a ParMETIS routine inside my local PETSc build. This version of PETSc was compiled with ParMETIS support (more specifically, the PETSc configure script was run with the --download-parmetis=1 flag), so it is strange that this would not work.
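For what it's worth, one way to test whether ParMETIS itself is the culprit is to ask DOLFIN to partition the mesh with SCOTCH instead. A hedged sketch (not from the original post; it assumes the "mesh_partitioner" entry in DOLFIN's global parameter set, which DOLFIN 2017.1.0 provides, and must run before any mesh is read):

```python
# Sketch: switch DOLFIN's mesh partitioner from ParMETIS to SCOTCH.
# Assumes the "mesh_partitioner" global parameter of DOLFIN; set it
# before any mesh is read/distributed so the crashing ParMETIS code
# path is never entered.
try:
    from dolfin import parameters
    parameters["mesh_partitioner"] = "SCOTCH"
    partitioner = parameters["mesh_partitioner"]
except ImportError:
    partitioner = None  # DOLFIN not installed in this environment
```

If the run succeeds with SCOTCH, that narrows the problem down to the ParMETIS build rather than the script or the launcher.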

The code crashes on one of these lines:

from dolfin import *  # DOLFIN 2017.1.0

# Read the mesh file (the dataset names passed to hdf.read below are
# placeholders; they must match how the HDF5 file was written)
mesh = Mesh()
hdf = HDF5File(mesh.mpi_comm(), mesh_file, "r")
hdf.read(mesh, "/mesh", False)
# Read the subdomains
subdomains = CellFunction("size_t", mesh)
hdf.read(subdomains, "/subdomains")
# Read the boundaries
boundaries = FacetFunction("size_t", mesh)
hdf.read(boundaries, "/boundaries")
# Compute the boundary mesh
bmesh = BoundaryMesh(mesh, "exterior")

# Define surface normal at the inlet
inlet_normal = [-0.14058317, 0.034723956, -0.98945975]

restart_folder = None  # Specify folder for reading previous simulations
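Since a corrupt or partially copied mesh file in the shared scratch directory could also make the parallel reader fail, a cheap stdlib-only pre-check is possible (a sketch, not part of the original script; `is_hdf5` is a hypothetical helper):

```python
def is_hdf5(path):
    """Return True if the file starts with the 8-byte HDF5 signature.

    HDF5 files carry the signature \\x89HDF\\r\\n\\x1a\\n at offset 0;
    a truncated or non-HDF5 file fails this check up front instead of
    crashing deep inside the parallel mesh reader.
    """
    signature = b"\x89HDF\r\n\x1a\n"
    with open(path, "rb") as fh:
        return fh.read(len(signature)) == signature
```

Running this on the mesh file from one node before submitting the job rules out a broken transfer to the scratch directory.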

QUESTION
In conclusion, the question I would like to ask is:
Can somebody help me resolve this issue, or at least help me locate the problem?

• Is the problem caused by the way I run the program: e.g. should I specify some extra options when using mpirun (or srun for that matter) on multiple machines, or should I do something special with, for instance, the meshes?
• Do I need to add anything to my Python program to make it run in parallel across multiple machines?
• Is there a problem with my PETSc build, and if so, why does it not show up when I run on a single machine?

Thank you in advance for taking the time to look at my problem. If there is some extra information needed to solve my problem, I will gladly provide it.

Kind regards,
Sjeng
Community: FEniCS Project
I ran into the same issue with the 2017.1.0 version of FEniCS last year and thought it might be some strange issue with the SWIG Python wrapper. This year I tried the pybind11-based 2018.1.0 version, thinking I might have more success, but unfortunately the issue remains the same.

So I concluded that the issue is not with the wrappers but rather with the environment variable settings on the compute nodes. I am still not sure what is going wrong here. I set all the relevant environment variables, such as LD_LIBRARY_PATH and PYTHONPATH, in the SLURM script, hoping the compute nodes would pick them up from there, but still no success. It might also be due to the fact that I use python3 from anaconda3 while the compute nodes may conflict with the system's version of Python, but I am not sure and cannot check this.
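To see whether the compute nodes really pick up the intended interpreter and paths, one stdlib-only option (a sketch, not from the original reply; `interpreter_report` is a hypothetical helper) is to have every task report what it is actually running:

```python
import os
import sys

def interpreter_report():
    """Report which interpreter and search paths this task is using.

    Launched under `srun python script.py`, every task executes this,
    so a node that silently falls back to the system Python (instead
    of, say, an anaconda3 one) becomes visible immediately.
    """
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "ld_library_path": os.environ.get("LD_LIBRARY_PATH", ""),
    }

if __name__ == "__main__":
    for key, value in sorted(interpreter_report().items()):
        print("%s: %s" % (key, value))
```

If the reported executables or library paths differ between nodes, the environment is not being propagated by the SLURM script as intended.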

Does anyone else still face this kind of problem? Please share if you have a solution!
written 2 days ago by sandeep shrivastava