Demo parallel segmentation fault


4 months ago by Ting
Hi everyone,

I have a weird problem. I'm trying to run the C++ demos Poisson and Hyperelasticity in parallel. They work great in serial, and they also run correctly in parallel on my own workstation with "mpirun -np 4 ./demo_poisson" and "mpirun -np 4 ./demo_hyperelasticity". On the HPC cluster they work correctly in serial, but they keep giving me a segmentation fault when I try to run them in parallel through sbatch files. Specifically, the Poisson demo works fine with a 64*64 mesh, but when I try to run a 128*128 mesh in parallel it gives me a segmentation fault. I've tested PETSc separately, and it works correctly.

Here is the error message:
srun: error: b059: tasks 0-1: Segmentation fault

Does anyone know what this problem is and how to fix it? Thanks in advance for your help!

Best,
Ting
Community: FEniCS Project
@Ting:
Could you post your sbatch job file (and also any modifications to the demo files)? Are you properly allocating 4 CPUs with #SBATCH for the job, for example? Your question requires more information to answer.
written 4 months ago by jwinkle  
Hi Jwinkle,

Here is my sbatch file:

#!/bin/bash
#SBATCH -J Poisson_test
#SBATCH -o test.out
#SBATCH -e test.err
#SBATCH -n 2
#SBATCH --ntasks-per-node=2
#SBATCH -p standard-mem-s
#SBATCH -t 10
#SBATCH --exclusive

srun ./demo_poisson


And the modification to the demo files is:

// Create mesh and function space
long unsigned int N = 256;
auto mesh = std::make_shared<Mesh>(
    UnitSquareMesh::create({{N, N}}, CellType::Type::triangle));
auto V = std::make_shared<Poisson::FunctionSpace>(mesh);

at the beginning of the Poisson demo main.cpp. I just changed the 32*32 mesh into a 256*256 mesh and kept everything else the same. I've tried using this sbatch file to run the 32*32 mesh, and it works fine and gives me the correct solution. But if I change the mesh to 256*256, it crashes.
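
As a side note, the mesh size could also be read from the command line so that different sizes can be tested without recompiling. A rough sketch (not part of the original demo; everything after the mesh creation stays the same):

// Rough sketch (not the original demo code): read the mesh size from
// argv[1] so that 64*64, 128*128, 256*256 can be tested without recompiling.
#include <cstdlib>
#include <dolfin.h>

using namespace dolfin;

int main(int argc, char* argv[])
{
  std::size_t N = (argc > 1) ? std::strtoul(argv[1], nullptr, 10) : 32;

  auto mesh = std::make_shared<Mesh>(
      UnitSquareMesh::create({{N, N}}, CellType::Type::triangle));

  // ... rest of the demo (function space, forms, solve) unchanged ...
  return 0;
}

It could then be launched as, for example, "srun ./demo_poisson 128".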

Thank you so much for your help.
written 4 months ago by Ting  
It gives me the segmentation fault even with 2 MPI tasks for the 256*256 mesh, and the same for the 128*128 mesh. It only works for 32*32 or 64*64.
written 4 months ago by Ting  
Where are you calling mpirun? (I don't see it in the sbatch file). 

I don't know if it is related to the node size, but you need to change the ghost_mode parameter.  See:
https://www.allanswered.com/post/xnqop/mpi-segmentation-fault-with-c-demo-periodic/

That example crashes for a different reason, but you may need to set
parameters["ghost_mode"] = "shared_facet";
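
For example, something along these lines near the top of main(), before the mesh is created (just a sketch, assuming DOLFIN 2017.x; I don't know whether it makes a difference for the plain Poisson demo):

#include <dolfin.h>

using namespace dolfin;

int main()
{
  // Request ghosted facets globally; this has to be set before the mesh
  // is created and partitioned.
  parameters["ghost_mode"] = "shared_facet";

  auto mesh = std::make_shared<Mesh>(
      UnitSquareMesh::create({{128, 128}}, CellType::Type::triangle));

  // ... rest of the demo unchanged ...
  return 0;
}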
written 4 months ago by jwinkle  
The simple CG Poisson demo shouldn't require ghosted facets.

I also can't see where the mpirun is issued. Perhaps srun -N 1 -n 1 mpirun -np $NP ./demo_poisson ?
written 4 months ago by Nate  
Hi, on our HPC system we use "-n" to set the number of MPI tasks. I used "#SBATCH -n 2" here, which means I have 2 MPI tasks and should be equivalent to "-np 2".
written 4 months ago by Ting  
Another thought:  Does the cluster have multiple MPI implementations (e.g., MVAPICH, OpenMPI, Intel MPI, etc.) installed?  It could be that the FEniCS installation is only compatible with one of them.
written 4 months ago by David Kamensky  
The installation should be correct, since the Poisson demo can run in parallel with a 32*32 mesh.

Here is the output for the same sbatch file to run the Poisson Demo with 32*32 mesh.

Process 0: Solving linear variational problem.
Process 1: Solving linear variational problem.

A 64*64 mesh also works fine, but it gives me a segmentation fault starting with a 128*128 mesh.
written 4 months ago by Ting  
@Ting:
I tried this on our server and it works fine for me with 128x128 (using an sbatch job submission). The only difference I have is that we only have FEniCS version 2017.1.0. This version does not recognize the new mesh constructor in the demo. Instead (and I suggest you try this for a sanity check), I used:
auto mesh = std::make_shared<UnitSquareMesh>(128, 128);

Can you confirm your FEniCS version? (e.g., type: ffc -V)
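
If it helps, the version the executable is actually linked against can also be printed from C++. A small sketch (assuming dolfin_version(), has_mpi() and has_petsc() are available, as they should be in 2017.x):

#include <iostream>
#include <dolfin.h>

int main()
{
  // Report which DOLFIN build this binary is linked against and whether
  // MPI and PETSc support were compiled in.
  std::cout << "DOLFIN " << dolfin::dolfin_version()
            << ", MPI enabled: " << dolfin::has_mpi()
            << ", PETSc enabled: " << dolfin::has_petsc() << std::endl;
  return 0;
}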
written 4 months ago by jwinkle  
Hi Jwinkle,

My DOLFIN version is 2017.2.0, and FFC is also 2017.2.0, with Python 3.6.3.
written 4 months ago by Ting  
Can you try to change the mesh constructor to:
auto mesh = std::make_shared<UnitSquareMesh>(128, 128);

just to see if that has any effect (I don't know that it will; you may have MPI installation issues, as mentioned below).
written 4 months ago by jwinkle  

1 Answer


4 months ago by Emek
Hi Ting,

could you try to use 3, 5, or 7 CPUs instead of 4 and report the outcome? I sometimes have odd problems when I use even numbers; I could not find the reason. Please post the sbatch file here as well (as jwinkle already mentioned).

Best, Emek
Hi Emek,

Here is the error message I get if I run with 5 or 7 tasks:

srun: Warning: can't honor --ntasks-per-node set to 4 which doesn't match the requested tasks 5 with the number of requested nodes 2. Ignoring --ntasks-per-node.
Fatal error in PMPI_Scatterv: Unknown error class, error stack:
PMPI_Scatterv(499)........................: MPI_Scatterv(sbuf=(nil), scnts=(nil), displs=(nil), MPI_LONG, rbuf=0x2aaaaab70010, rcount=26212, MPI_LONG, root=0, comm=0x84000004) failed
MPIR_Scatterv_impl(268)...................:
MPIR_Scatterv(152)........................:
MPIC_Recv(430)............................:
MPIC_Wait(314)............................:
MPIDI_CH3i_Progress_wait(249).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(461):
MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=3,errno=104:Connection reset by peer)
Fatal error in PMPI_Scatterv: Unknown error class, error stack:
PMPI_Scatterv(499)........................: MPI_Scatterv(sbuf=(nil), scnts=(nil), displs=(nil), MPI_LONG, rbuf=0x2aaaaab13010, rcount=26216, MPI_LONG, root=0, comm=0x84000002) failed
MPIR_Scatterv_impl(268)...................:
MPIR_Scatterv(152)........................:
MPIC_Recv(430)............................:
MPIC_Wait(314)............................:
MPIDI_CH3i_Progress_wait(249).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(461):
MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=2,errno=104:Connection reset by peer)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2101159.0 ON b062 CANCELLED AT 2018-04-01T14:22:58 ***
Fatal error in PMPI_Scatterv: Unknown error class, error stack:
PMPI_Scatterv(499)........................: MPI_Scatterv(sbuf=(nil), scnts=(nil), displs=(nil), MPI_LONG, rbuf=0x2aaaaab13010, rcount=26212, MPI_LONG, root=0, comm=0x84000002) failed
MPIR_Scatterv_impl(268)...................:
MPIR_Scatterv(152)........................:
MPIC_Recv(430)............................:
MPIC_Wait(314)............................:
MPIDI_CH3i_Progress_wait(249).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(461):
MPIDU_Socki_handle_read(649)..............: connection failure (set=0,sock=3,errno=104:Connection reset by peer)
srun: error: b062: task 0: Segmentation fault
srun: error: b062: task 1: Broken pipe
srun: error: b062: task 2: Killed
srun: error: b063: tasks 3-4: Broken pipe
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 2101158 ON b082 CANCELLED AT 2018-04-01T14:30:43 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 2101158.0 ON b082 CANCELLED AT 2018-04-01T14:30:43 DUE TO TIME LIMIT ***

Basically the same error with 4 or 8 nodes.


Or the message is:

srun: error: b149: task 0: Segmentation fault (core dumped)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 2101169 ON b149 CANCELLED AT 2018-04-01T14:40:43 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 2101169.0 ON b149 CANCELLED AT 2018-04-01T14:40:43 DUE TO TIME LIMIT ***

These are basically the two error messages I received.

Thank you so much Emek.
written 4 months ago by Ting  
But you need to change
--ntasks-per-node=2
in your file as well. Have a look at the error message: if you request 5 processes but only 2 processes per node, it won't work.

Why do you use only 2 tasks per node?

What is the content of the srun file?
written 4 months ago by Emek  
Here is my sbatch file:

#!/bin/bash
#SBATCH -J Poisson_test
#SBATCH -o test.out
#SBATCH -e test.err
#SBATCH -n 7
#SBATCH -p standard-mem-s
#SBATCH -t 10
#SBATCH --exclusive

srun ./demo_poisson

for running 7 tasks.

If I just run the 64*64 mesh problem, it gives me the solution:

Process 0: Solving linear variational problem.
Process 1: Solving linear variational problem.
Process 2: Solving linear variational problem.
Process 3: Solving linear variational problem.
Process 4: Solving linear variational problem.
Process 5: Solving linear variational problem.
Process 6: Solving linear variational problem.

But it gives me the segmentation fault if I try to run the 128*128 or 256*256 mesh problem with the exact same sbatch file.
written 4 months ago by Ting  
Sorry about asking once more, what is srun?
written 4 months ago by Emek  
Oh I'm sorry, I did not notice that.

The "srun" is the SLURM command which required to launch parallel jobs - both batch and interactive.

Here is the link for "srun" detail. https://computing.llnl.gov/tutorials/linux_clusters/index.html#Starting
written 4 months ago by Ting  