This is actually something I'm in the middle of too.
Porting GNFS polynomial selection to use a GPU took several months of spare-time work, even though the main primitive involved (radix sorting) had a very fast library implementation ready-made, which was used as a black box. Most of the effort went into retooling the host code to issue sorting calls in parallel, and into retooling the support code that generated the sort data so that it ran efficiently on the GPU. That support code already ran on the GPU thanks to jrk's work; doing it from scratch would have been even more effort. In the case of poly selection it paid off handsomely: the speedup is on the order of 70x.

I've also been porting the sparse linear algebra in Msieve to use a GPU. This is going to be a lot harder, because there are no ready-made GPU library routines that do sparse matrix multiplies in a Galois field. One can use a segmented scan (available in library form) to implement a sparse matrix multiply, but that still requires support code to assemble the scan problems, and it triples the memory use, so one would need a GPU cluster just to assemble enough GPU memory for even medium-size problems. My initial proof-of-concept code was a complete rewrite of the original matrix multiply source, using an algorithm specifically designed for vector processors, and the performance on my low-end card is only a little better than running the same thing on my low-end CPU.

Meanwhile, Greg tried tuning the current CPU code a little to run on the Phi; it worked out of the box, but was 8x slower than running on a Haswell. After adding MPI plus threads he got it down to 2.5x slower. I suspect improving on that will also take major work.
[QUOTE=ldesnogu;356509]Ernst, Oliver, is it really that hard to get a working program on a GPU? Of course there are some changes to make, but they seem rather small if all you want is something working (that is, something similar to the initial porting effort to the Phi). Am I completely wrong?[/QUOTE]
I'll let you know once I have successfully done so. :) My main point here is this: many (if not most) HPC codes have already been parallelized using one of the small number of widespread threading APIs. It is really annoying that GPU vendors do not let such code "work out of the box", i.e. that during the long process of developing their own compilers and APIs they did not do the work needed to map that existing parallelism onto their particular architectures. As a developer, I fully expect to have to work to *tune* code to the particulars of a given architecture, but I should not have to completely rewrite the parallelization interface. This part of the game Intel has gotten right.