2019-03-22, 11:48   #3
jasonp (Tribal Bullet)

The existing CUDA code is definitely not MPI-aware; MPI processes could each use a GPU for a smaller matrix multiply, but data would have to be transferred to and from the GPU for every such operation. I've never even tried using it that way, so the odds are 100% that it is broken.
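A rough sketch of why that per-operation traffic hurts, assuming a hypothetical sparse matrix-vector kernel `spmv_kernel` (not msieve's actual code): each multiply pays a host-to-device copy of the input vector and a device-to-host copy of the result so that MPI can exchange data, on top of the kernel itself.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical per-rank kernel; stands in for the real matrix multiply.
__global__ void spmv_kernel(const float *x, float *y, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = 2.0f * x[i];  // placeholder for the actual sparse multiply
}

// The pattern the current code would force: every multiply round-trips
// the vector through host memory so MPI can see it.
void mpi_rank_multiply(const float *host_x, float *host_y, size_t n) {
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));

    // host -> GPU before the multiply ...
    cudaMemcpy(dx, host_x, n * sizeof(float), cudaMemcpyHostToDevice);

    spmv_kernel<<<(unsigned)((n + 255) / 256), 256>>>(dx, dy, n);

    // ... and GPU -> host afterwards, for the MPI exchange step.
    cudaMemcpy(host_y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dx);
    cudaFree(dy);
}
```

Both copies sit on the critical path of every iteration of the Lanczos loop, which is why the scheme is unattractive even if it worked.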

A better implementation would keep the data buffers resident on the GPUs at all times and do direct copies from one GPU to another. Modern CUDA makes this possible via peer-to-peer transfers, but it has to be explicitly set up.
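A minimal sketch of that setup using the CUDA runtime API (this is an illustration of peer-to-peer transfers in general, not code from msieve): peer access is queried and enabled once, after which `cudaMemcpyPeer` moves data device-to-device without staging through host memory.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Sketch: keep buffers resident on two GPUs and copy results directly
// between them. Assumes devices 0 and 1 support peer access
// (e.g. same PCIe root complex or NVLink-connected).
int main(void) {
    const size_t n = 1 << 20;
    float *buf0 = NULL, *buf1 = NULL;

    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);

    cudaSetDevice(0);
    cudaMalloc(&buf0, n * sizeof(float));
    if (can01)
        cudaDeviceEnablePeerAccess(1, 0);  // let device 0 address device 1

    cudaSetDevice(1);
    cudaMalloc(&buf1, n * sizeof(float));

    /* ... kernels fill buf0 on device 0 ... */

    // Direct device-to-device copy; with peer access enabled the driver
    // routes this over PCIe/NVLink without a host-memory bounce.
    cudaMemcpyPeer(buf1, 1, buf0, 0, n * sizeof(float));

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```

With MPI in the picture, each rank would own one device and enable peer access to its neighbors' devices (or use a CUDA-aware MPI that does the equivalent internally); the point is that vectors never leave GPU memory between iterations.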