Problem when using FEniCS on an InfiniBand cluster


7 months ago by
Dear friends in FEniCS community,

I am having some problems using FEniCS 2017.1.0 on an InfiniBand cluster at the University of Minnesota, and I have three questions.

First question: FEniCS works well when I use fewer than 200 cores on a node-sharing cluster. But when I increase the core count to 480, the program becomes very slow during the compilation process. I wonder whether there is a problem with my build. I have attached the scripts I used to compile on the cluster.

Second question: I found a warning on the official websites of Instant and dijitso: "OFED-fork safe system call method might be required to avoid crashes on OFED-based (InfiniBand) clusters! If using python 2, installing subprocess32 is recommended." I have tried both methods, but the program still runs slowly on the cluster with 480 cores.

Third question: Is there any big difference between MPICH, Open MPI, and MVAPICH? I found that MPICH 3.2 also supports InfiniBand. I wonder whether the choice of MPI implementation would affect the speed of the program.

Thank you for all your help.

Yours sincerely,
Qiming Zhu

File attached: build-pre.sh (4.87 KB)

File attached: build-dolfin.sh (6.21 KB)

Community: FEniCS Project
In your first question, are you referring to the JIT compilation?  If you do the JIT compilation in parallel, the default behavior is for all of the MPI tasks to write to your ~/.cache directory, which can be inefficient on distributed filesystems.  On Stampede, I've even seen the process fail with an error, which I assume is due to some sort of synchronization/file-locking problem.  My work-around was to just run the problem with a tiny mesh on one core first, which compiles all of the forms.  Then, when running a big problem in parallel, all of the tasks just read from the cache. 
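The warm-up run described above could look roughly like the sketch below. The Poisson-type form, the `--warmup` flag, the function names, and the resolutions are placeholders for illustration and are not taken from the original program. The one thing that matters is that the serial run builds the same forms and elements (on the same cell type) as the production run, so that the cached modules match.

```python
import sys


def pick_resolution(argv, warmup_n=2, full_n=128):
    """Return a tiny mesh resolution for the warm-up run, the full one otherwise."""
    return warmup_n if "--warmup" in argv else full_n


def run(argv=None):
    argv = sys.argv[1:] if argv is None else argv

    # Imported lazily so that pick_resolution can be used without FEniCS installed.
    from dolfin import (FunctionSpace, TestFunction, TrialFunction,
                        UnitCubeMesh, assemble, dx, grad, inner)

    n = pick_resolution(argv)
    mesh = UnitCubeMesh(n, n, n)
    V = FunctionSpace(mesh, "Lagrange", 1)   # JIT-compiles element and dofmap
    u, v = TrialFunction(V), TestFunction(V)
    a = inner(grad(u), grad(v)) * dx         # placeholder form
    assemble(a)                              # JIT-compiles the form, or loads it from the cache
```

With `run()` called at the end of the script, one would first launch it on a single core with `--warmup`, and only afterwards start the full job under MPI without the flag.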

Another source of major slow-downs for me when running big problems, which may at first appear to be related to compilation, was the DOF re-ordering when FunctionSpaces are created.  If you have enough DOFs, the FunctionSpace constructor can take hours to run.  I found the following options to help:

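from dolfin import parameters
# Options reported above to speed up FunctionSpace construction for very large DOF counts
parameters['reorder_dofs_serial'] = False
parameters['dof_ordering_library'] = 'random'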
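
To check whether the time is going into JIT compilation or into FunctionSpace construction, one option is to time the stages separately. The helper below is a minimal sketch using only the standard library; `timed` and `timings` are names made up for illustration and are not part of FEniCS.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall-clock seconds


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the `with` block under `label`."""
    start = time.time()
    try:
        yield
    finally:
        timings[label] = time.time() - start


# Intended use in a FEniCS script (not executed here):
#     with timed("FunctionSpace"):
#         V = FunctionSpace(mesh, "Lagrange", 1)
#     with timed("assemble"):
#         A = assemble(a)
#     print(timings)
```

Each MPI rank keeps its own `timings` dictionary. If the JIT cache is already warm and the `FunctionSpace` stage is still slow, that would point toward DOF re-ordering rather than compilation.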

written 7 months ago by David Kamensky  
Thank you for your reply. Regarding the first question, the JIT compilation is indeed very slow in my case, so I will try running the problem with a small mesh first. As for the DOF re-ordering, I use fieldsplit with PETSc in my program, and I do not know whether changing the ordering would have side effects there. Thank you again for your help.
written 7 months ago by zqm1992  