### Deterministic mesh partitioning with MPI

Hi,

I want to restart my simulations from a checkpoint in which I saved all my solution vectors (I use the OASIS solver, where this is the default checkpointing method; if possible I would like to keep using it). This works when I use a single processor, and also with very simple 2D meshes. However, with larger 3D meshes I run into problems, caused by the fact that the degrees of freedom of the first simulation do not match those of the second simulation, even when I use the same number of processors (e.g. if I map the imported solution vectors onto the mesh and export the result, it is nonsensical). I think this has to do with the mesh partitioning. Is there a way to do mesh partitioning so that it is the same every time?
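To illustrate why a restart can break when the partitioning changes, here is a minimal plain-Python sketch (not the OASIS or FEniCS API; `save_checkpoint` and `restore_checkpoint` are hypothetical helpers): a checkpoint that stores values together with their global dof indices survives a repartitioning, whereas a raw dump of per-rank local arrays does not.

```python
# Conceptual sketch, NOT the actual OASIS/FEniCS checkpointing code.
# The idea: store (global dof index, value) pairs, so the vector can be
# rebuilt correctly even if the second run partitions the mesh differently.

def save_checkpoint(local_values, global_indices):
    """Store values keyed by their global dof indices (hypothetical helper)."""
    return dict(zip(global_indices, local_values))

def restore_checkpoint(checkpoint, new_global_indices):
    """Rebuild the local vector for a possibly different new partitioning."""
    return [checkpoint[g] for g in new_global_indices]

# First run: this rank owns global dofs [0, 3, 5].
ckpt = save_checkpoint([1.0, 2.0, 3.0], [0, 3, 5])
# Second run: a different partitioning assigns this rank [5, 0, 3].
restored = restore_checkpoint(ckpt, [5, 0, 3])
print(restored)  # [3.0, 1.0, 2.0] -- correct despite the new ordering
```

A restart that simply reloads the raw local arrays implicitly assumes the dof numbering is identical in both runs, which is exactly what fails here.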

Thanks!
Sjeng
Community: FEniCS Project
Deeper control over mesh partitioning is something we want to add into dolfinX, the next generation FEniCS solver. At the moment, I don't think what you want is possible.
written 4 months ago by Jack Hale
How do you save the solution vector? If you use the HDF5File or XDMFFile interface, both should also store information about the mesh partitioning. How do you save and read back the mesh? A lot of details are missing; please provide a MWE.
written 4 months ago by Michal Habera
Thanks for the quick answer. I have written a MWE that demonstrates the problem (at least on my PC). Due to the setup of the example (two programs + mesh) I put the MWE on bitbucket: git@bitbucket.org:squicken/mwe_nondeterministic.git. You can run the example with:

mpirun -n 8 python MWE_nondeterministic_write.py && mpirun -n 8 python MWE_nondeterministic_read.py

Each script will output an XDMF file to be viewed in ParaView. When I use an XDMF version 2.0 mesh that I had lying around (u_tube.xdmf in the repository), everything seems to be OK, whereas it goes wrong with a current XDMF 3.0 mesh (I export via FEniCS, so there is not much to be done about the XDMF version). Note that the associated HDF5 files have a somewhat different hierarchy.
written 4 months ago by SQ
Does it demonstrate the claimed non-determinism of mesh partitioning?
written 4 months ago by Jan Blechta
On my PC it demonstrates the problem. If you run
mpirun -n 8 python MWE_nondeterministic_write.py
an XDMF file is exported that, if opened in ParaView, shows the partitioning by coloring the mesh by MPI rank. This coloring is also exported as a vector to HDF5.

If you run
mpirun -n 8 python MWE_nondeterministic_read.py
the HDF5 vector and the mesh are read, and the vector is mapped onto the mesh. This result is again exported to an XDMF file that can be opened in ParaView. With the XDMF 3.0 mesh you will see that the results are not the same, whereas with the XDMF 2.0 mesh this does not seem to happen.
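The symptom can be reproduced in miniature without DOLFIN at all. In this hedged sketch (`block_partition` and `interleaved_partition` are made-up stand-ins for two different partitioner outcomes), the writer saves rank colors under one partitioning, and the reader maps them back by local position under another; the stored colors then disagree with the reader's actual cell ownership:

```python
# Toy reproduction of the symptom, in plain Python (not the actual DOLFIN I/O).

def block_partition(n_cells, n_ranks):
    """Contiguous block partitioning: cell i -> rank i // block."""
    block = -(-n_cells // n_ranks)  # ceiling division
    return [i // block for i in range(n_cells)]

def interleaved_partition(n_cells, n_ranks):
    """A different (round-robin) partitioning of the same mesh."""
    return [i % n_ranks for i in range(n_cells)]

n_cells, n_ranks = 12, 4
written = block_partition(n_cells, n_ranks)             # colors saved by the writer
reader_owner = interleaved_partition(n_cells, n_ranks)  # reader's own partitioning

# Reading the stored colors back position-by-position "succeeds", but the
# values disagree with the reader's cell ownership on most cells.
mismatches = sum(1 for a, b in zip(written, reader_owner) if a != b)
print(mismatches, "of", n_cells, "cells are colored wrongly")  # 8 of 12
```

This is the same effect as the scrambled solution vectors in the MWE: the data is intact, but its implicit ordering no longer matches the mesh it is mapped onto.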

This is illustrated in an image that I added to the repository (top = XDMF 3.0, bottom = XDMF 2.0; left = result from MWE_write, right = result from MWE_read).
written 4 months ago by SQ
I have had a similar experience with mesh partitioning and MPI use. In my case (and this may be easier to test/show) I created two 2D rectangle meshes in my program sequentially, which I expected to be exactly the same mesh with respect to the partitioning across the nodes. With small mesh sizes they were, but larger ones repeatedly failed to produce the same partitioning. I nearly posted a question about this to the forum, but then solved my problem by creating a singleton mesh (sufficient for my purposes).

Thus, it appears that if one creates two "identical" meshes in the same program, the partitioning can be (and certainly was in my case, for larger meshes) different, all things being equal. I detected the difference in partitioning because I was assigning cell-reaction objects at each vertex and had a verification step checking that the vertices assigned to each processor were the same for both meshes. I could create a MWE for this (in C++) if anyone thinks that would be helpful here or in the dolfin issue tracker, per @Jan's response below.
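For contrast, a partitioner that is deterministic by construction would sidestep this. Here is a minimal sketch (a hypothetical rule, not a DOLFIN API): assigning each cell purely from its global index guarantees that two "identical" meshes, or two runs of the same program, always get the same partition, which is the property parallel graph partitioners do not generally guarantee.

```python
# Hypothetical deterministic partitioner: rank depends only on the global
# cell index, so identical inputs always yield identical partitions.

def deterministic_partition(num_cells, num_ranks):
    """Assign cell i to a rank by a fixed rule depending only on i."""
    block = -(-num_cells // num_ranks)  # ceiling division
    return [min(i // block, num_ranks - 1) for i in range(num_cells)]

# Two "identical meshes" (same cell count) partition identically, every run.
first = deterministic_partition(1000, 8)
second = deterministic_partition(1000, 8)
print("identical partitions:", first == second)  # identical partitions: True
```

Of course, a naive index-based rule ignores mesh connectivity, so it will generally produce worse communication patterns than a graph partitioner; determinism here trades off against partition quality.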
written 4 months ago by jwinkle