std::length_error when running a 3D unit cube problem larger than 410x410x410


asked 9 months ago by sandeep shrivastava
I am trying to run a 3D unit cube elasticity problem larger than 410x410x410 on 32 nodes, each with 128 GB of RAM. I found that it gets stuck creating the BoxMesh, which works perfectly fine for problem sizes below 400x400x400.

It throws std::length_error with what(): vector::_M_default_append. I tried debugging the issue and traced it back to the file MPI.h. It gets stuck resizing sendbuf while broadcasting the cell vertices, because the size variable n is deduced as int rather than size_t by the std::accumulate call. I got rid of it by explicitly making the third argument of std::accumulate a size_t.
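To make the overflow concrete, here is a minimal standalone sketch of the accumulate bug and the fix (the counts below are made up; in dolfin they come from the per-process cell data sizes):

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
  // Mock per-process data sizes whose total exceeds INT_MAX.
  std::vector<std::size_t> data_size(32, 100000000);  // sum = 3.2e9

  // Buggy form: the int literal 0 fixes the accumulator type to int,
  // so the running sum is truncated to 32 bits and comes out negative here.
  int n_bad = std::accumulate(data_size.begin(), data_size.end(), 0);

  // Fixed form: a size_t initial value keeps the accumulator 64-bit.
  std::size_t n = std::accumulate(data_size.begin(), data_size.end(),
                                  std::size_t(0));

  std::cout << n_bad << " vs " << n << "\n";  // -1094967296 vs 3200000000

  // sendbuf.resize(n_bad) converts the negative int to an enormous
  // size_t, which is what makes vector::_M_default_append throw
  // std::length_error.
  return 0;
}
```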

But now I have another problem: the offsets vector in the same function is also declared with int elements, and it overflows its limit too. I tried changing its type to size_t, but the mpi_scatterv call taking it as an argument doesn't support size_t; it only accepts int for the offsets vector.

I searched for the mpi_scatterv arguments explicitly and found that MPI versions below 3 don't support any argument type other than int for the counts and displacements.
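For reference, this is the C binding of the underlying MPI_Scatterv; both the counts and the displacements are plain 32-bit int arrays:

```cpp
// From the MPI C bindings (mpi.h):
int MPI_Scatterv(const void* sendbuf,
                 const int sendcounts[],  // per-rank element counts: int only
                 const int displs[],      // offsets into sendbuf: int only
                 MPI_Datatype sendtype, void* recvbuf, int recvcount,
                 MPI_Datatype recvtype, int root, MPI_Comm comm);
```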

Has anybody faced the same issue, or am I missing something? I think ideally it should work, since the performance test problem seems to be run with 12 billion DOFs.

Just for information: I am using the dolfin development version, last updated in August 2017.


Community: FEniCS Project
As Chris has answered, it's better (and more scalable) to use refinement to get to very large problems. The test code at https://bitbucket.org/fenics-project/performance-tests uses parallel refinement.

Nonetheless, it would be nice to fix the integer overflow.  Is the issue with std::accumulate just in dolfin/common/MPI.h?

In which function precisely are you seeing the problem with mpi_scatterv?
written 9 months ago by Garth Wells  
It seems to be only in dolfin/common/MPI.h, since updating that file solved the std::length_error problem. To be precise: line 530 for std::accumulate and line 542 for mpi_scatterv in the MPI.h of the latest repository.
written 9 months ago by sandeep shrivastava  
Trying to figure out where it's hitting the 32-bit limit. It's probably in packing the cell topology on process 0 for sending, i.e. (410^3)*12*4 ≈ 3.3 billion > 2^31 (the signed 32-bit limit). The 12 is the number of tetrahedra per 'unit cube' of vertices, and the 4 is for the 4 vertex indices per cell.

Indeed, MPI_Alltoallv is limited to 32-bit int for offsets into the send buffer. Since we're constrained by the MPI specification there isn't much we can do without splitting up the communication, which isn't worth the effort since refinement or reading from disk is preferable.
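For the record, the usual trick for pushing past this limit (used by e.g. BigMPI) is to communicate in units of a larger derived datatype, so that the int counts and displacements stay below 2^31. A rough sketch with a hypothetical helper, assuming each rank's chunk is an exact multiple of a common block size:

```cpp
#include <mpi.h>
#include <vector>

// Sketch only: scatter more than INT_MAX doubles by counting in blocks.
// Assumes every rank's share is an exact multiple of `block` elements.
void scatterv_big(const double* sendbuf,
                  const std::vector<long long>& elems_per_rank,
                  double* recvbuf, long long my_elems, int block,
                  MPI_Comm comm)
{
  MPI_Datatype block_t;
  MPI_Type_contiguous(block, MPI_DOUBLE, &block_t);
  MPI_Type_commit(&block_t);

  // Counts and offsets measured in blocks now fit comfortably in int.
  std::vector<int> counts, displs;
  long long offset = 0;
  for (long long n : elems_per_rank)
  {
    counts.push_back(static_cast<int>(n / block));
    displs.push_back(static_cast<int>(offset / block));
    offset += n;
  }

  MPI_Scatterv(sendbuf, counts.data(), displs.data(), block_t, recvbuf,
               static_cast<int>(my_elems / block), block_t, 0, comm);
  MPI_Type_free(&block_t);
}
```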
written 9 months ago by Garth Wells  
Adding the type size_t to the third argument of std::accumulate (i.e. passing the zero as a size_t) does the trick for the variable n. To be more precise, the limit seems to be 415x415x415, based on ((2^31)/(6*5))^(1/3) ≈ 415 as per my observation. The cell topology stores 5 data entries per cell, with 6 tetrahedral cells per unit cube.

But, as you mentioned, changing the offsets data type to size_t doesn't help due to the MPI constraint.

I read somewhere that MPI versions above 3 support types other than int, but I'm not sure if that's true. You may explore it if it gets prioritized later.
written 9 months ago by sandeep shrivastava  
I see, you need more digits for the memory offset; never mind the ninja suit, RAM will not help you.
written 9 months ago by pf4d  
You got it right :) Thanks anyways
written 9 months ago by sandeep shrivastava  
What happens when you ask for more processors?
written 9 months ago by pf4d  
It's independent of the number of processors used.
written 9 months ago by sandeep shrivastava  
Just curious.  It must be nice to be able to burn so much energy on toy problems.
written 9 months ago by pf4d  

1 Answer


answered 9 months ago by Chris Richardson
The problem is that BoxMesh is created on one process and then distributed. If you want a bigger box, it is better to make a smaller one and then use refine(), as this will be done in parallel. 
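For example, with the C++ API (the coarse size and number of refinements here are just illustrative; each uniform refinement doubles the divisions per edge):

```cpp
#include <dolfin.h>
using namespace dolfin;

int main()
{
  // Coarse mesh: cheap to build on one process and distribute.
  auto mesh = std::make_shared<UnitCubeMesh>(52, 52, 52);

  // refine() works in parallel; three rounds give 52 * 2^3 = 416
  // divisions per edge, which exceeds the 410 wanted above.
  for (int i = 0; i < 3; ++i)
    mesh = std::make_shared<Mesh>(refine(*mesh));

  return 0;
}
```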
Ohh, I see. Thanks.
Then is this the only way to create large meshes with dolfin's built-in functions, or do BoxMesh and the others need to be improved to work in parallel at any size?

And what about the mesh quality under refinement? I think a tetrahedron will not retain its aspect ratio when refined.
Am I right? If so, how do we create a structured mesh without degrading the mesh quality?
written 9 months ago by sandeep shrivastava  
Refinement does not seriously degrade the tet quality. The algorithm is designed with this in mind. Making BoxMesh work in parallel at any size could be done, but is not a priority.
written 9 months ago by Chris Richardson  
That's fine, Chris. I mentioned mesh quality because I once did tet refinement through edge bisection, which creates 8 tets. The corner tets have the same quality as the parent element, but the 4 interior elements come out a little flat. With further consecutive refinements the interior elements got more and more flat and distorted. One level of refinement was enough for me, though, so it was not a big problem.

Anyway, it was just for information. Thanks for the clarification.

I have one more question, if you can clarify: do XML, XDMF and other mesh files with large data sets have a similar limitation to BoxMesh?
written 9 months ago by sandeep shrivastava  
The XML format is not scalable and will be deprecated at some point in the future.

XDMF with HDF5 storage is read in parallel, with each process reading only part of the data file, which makes it scalable in memory. It scales well in time on systems with a good parallel filesystem. If the filesystem is poor the read time won't be great.

XDMF with ASCII storage is read on process zero, so is not scalable.
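Reading is along these lines ("box.xdmf" is just a placeholder filename):

```cpp
#include <dolfin.h>
using namespace dolfin;

int main()
{
  // With HDF5 storage, each rank reads only its slice of the data.
  Mesh mesh(MPI_COMM_WORLD);
  XDMFFile file(MPI_COMM_WORLD, "box.xdmf");
  file.read(mesh);
  return 0;
}
```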
written 9 months ago by Garth Wells  
Thanks, Garth. That's very useful information.
written 9 months ago by sandeep shrivastava  
If you had a node with more RAM, you could increase your box that way... You could just buy some at Best Buy, sneak into the SSC, and slip it into one of the nodes, real quiet. Bring your ninja suit.
written 9 months ago by pf4d  
Aah ha!  Interesting.
written 9 months ago by pf4d  