ffc fails on MPI cluster (multiple nodes): Source file should not exist at this point!


FEniCS 2017.2.0 is failing to run for me on an Ubuntu-based cloud cluster. When I run with 32 processes across 2 nodes (16 processes per node), I get errors like:

Moving new file over differing existing file:
src: /tmp/tmp926dy6f7/ffc_form_bbc590299c63ad57e1103617fd94585c79ed63d7.cpp.gz
dst: /home/username/.cache/dijitso/src/ffc_form_bbc590299c63ad57e1103617fd94585c79ed63d7.cpp.gz
backup: /home/username/.cache/dijitso/src/ffc_form_bbc590299c63ad57e1103617fd94585c79ed63d7.cpp.gz.old
Backup file exists, overwriting.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/dolfin/compilemodules/jit.py", line 142, in jit
    result = ffc.jit(ufl_object, parameters=p)
  File "/usr/lib/python3/dist-packages/ffc/jitcompiler.py", line 218, in jit
    module = jit_build(ufl_object, module_name, parameters)
  File "/usr/lib/python3/dist-packages/ffc/jitcompiler.py", line 134, in jit_build
    generate=jit_generate)
  File "/usr/lib/python3/dist-packages/dijitso/jit.py", line 180, in jit
    params)
  File "/usr/lib/python3/dist-packages/dijitso/build.py", line 183, in build_shared_library
    lockfree_move_file(temp_src_filename, src_filename)
  File "/usr/lib/python3/dist-packages/dijitso/system.py", line 272, in lockfree_move_file
    return _lockfree_move_file(src, dst, False)
  File "/usr/lib/python3/dist-packages/dijitso/system.py", line 299, in _lockfree_move_file
    _lockfree_move_file(dst, backup, True)
  File "/usr/lib/python3/dist-packages/dijitso/system.py", line 338, in _lockfree_move_file
    raise RuntimeError("Source file should not exist at this point!")
RuntimeError: Source file should not exist at this point!

The same code runs successfully on 16 processes (a single node).

I suspect the problem is related to the NFSv4 filesystem mounted at /data, used to give both nodes access to the same script. On the other hand, the error complains about /home/username/.cache/dijitso, which is local to each node, not shared.

I have flufl.lock 2.4.1 installed, which is supposed to help with NFS, but it doesn't seem to have made a difference. What else should I do to get the code running?
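One workaround worth trying (this is a suggestion, not a confirmed fix) is to redirect the dijitso JIT cache to node-local scratch storage via the DIJITSO_CACHE_DIR environment variable, in case the home directory is NFS-backed after all. The variable must be set before dolfin/dijitso is imported; the path below is just an example:

```python
# Point the dijitso JIT cache at node-local scratch space instead of a
# (potentially NFS-backed) home directory. This must happen before
# dolfin/dijitso is imported, since the cache dir is read at import time.
import os

os.environ["DIJITSO_CACHE_DIR"] = "/tmp/dijitso-cache"  # example path

# import dolfin  # import only after the environment variable is set
```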
Community: FEniCS Project
Are you using InfiniBand?

I have seen multiple reports of dijitso failing on clusters (presumably on NFS). Note that dijitso does not use file locking: the flufl.lock package is relevant only for Instant. Dijitso instead uses a lock-free approach built on atomic IO operations. I suspect dijitso needs some fixes for NFS, which, as far as I understand it, does not guarantee that the file system looks the same from every node.
written 11 weeks ago by Jan Blechta  
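The lock-free pattern described above — generate into a temporary file, then publish it with an atomic rename — can be sketched roughly as follows. This is a simplified illustration, not dijitso's actual implementation; the function name is invented:

```python
import os
import tempfile

def lockfree_publish(temp_path, dst_path):
    """Simplified sketch of a lock-free 'write, then atomically publish' move.

    On a local POSIX filesystem os.replace() is atomic, so concurrent
    processes publishing the same file can never observe a half-written
    destination. NFS gives weaker guarantees (rename semantics and
    close-to-open cache consistency), so two nodes can disagree about
    whether the destination or a backup file already exists -- the kind
    of race that can surface as errors like
    "Source file should not exist at this point!".
    """
    os.replace(temp_path, dst_path)  # atomic on local filesystems

# Demonstration: write generated code to a temp file, then publish it.
workdir = tempfile.mkdtemp()
tmp = os.path.join(workdir, "form.cpp.gz.tmp")
dst = os.path.join(workdir, "form.cpp.gz")
with open(tmp, "w") as f:
    f.write("generated code")
lockfree_publish(tmp, dst)
```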
That would explain why flufl.lock didn't help, if it's Instant that uses it rather than dijitso.

I'm not certain whether this system is running on InfiniBand; I'll need to check with the admin staff. At launch, mpirun gives the warning:
--------------------------------------------------------------------------
[[25330,1],3]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: cloudhost-3

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
written 11 weeks ago by Drew Parsons  