Craig wrote:
We should probably start a separate GPU thread.
MJ: That would be good. Could anybody move the relevant posts to a separate post? Or should I open a new post altogether?

The next bit should be discussed in the new post.

You can have 3584 simultaneous gap searches. In my 65-bit optimized code, roughly 1/2 of the candidates are prime after sieving. If I processed 32 candidates for a single gap, I would be giving up about a factor of 16 in performance. I think the term Nvidia uses for the way a warp executes is single instruction, multiple thread (SIMT), which is similar to SIMD.
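A back-of-the-envelope sketch of where a factor of about 16 can come from, in Python. It rests on assumptions that are not stated above: the search for one gap stops at the first prime candidate, each sieved candidate is independently prime with probability 1/2, and a warp is 32 threads wide. The function names are made up for illustration.

```python
WARP_SIZE = 32   # threads per warp on Nvidia GPUs
P_PRIME = 0.5    # assumed: ~1/2 of sieved candidates are prime


def expected_tests_until_prime(p):
    """Mean of a geometric distribution: tests needed to reach the first prime."""
    return 1.0 / p


def warp_waste_factor(warp_size, p):
    """Tests a full warp spends on one gap, divided by the tests actually needed."""
    return warp_size / expected_tests_until_prime(p)


waste = warp_waste_factor(WARP_SIZE, P_PRIME)
print(waste)
```

Under those assumptions only about 2 of the 32 tests are useful on average, so giving each thread its own gap keeps every lane of the warp doing work that counts.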

A single GPU core is going to be slower than a single CPU core. GPU operations are 32-bit. Integer multiplication is slower than floating-point multiplication (I think a 32-bit integer multiply takes ~7x the cycles of a 32-bit float multiply on mine). Integer division and mod are very slow on a GPU and should be avoided as much as possible. GPUs are designed for fast 32-bit float multiply and fast memory access. Everything else is slower.

For an idea of the processing power the specs are listed here

The 1080 Ti has about 10000 GFLOPS. That counts a multiply and an accumulate as separate operations, so about 5000 GMAC/s. Dividing by 7 gives me about 700E9 32-bit integer multiply-accumulates per second.
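The same arithmetic as a small Python sketch, in case anyone wants to plug in numbers for a different card. The 7x slowdown is my rough estimate from above, not a published spec, and the function name is just for illustration.

```python
PEAK_GFLOPS = 10_000      # approximate FP32 peak for the 1080 Ti, as quoted above
FLOPS_PER_MAC = 2         # a multiply-accumulate is counted as two operations
INT_MUL_SLOWDOWN = 7      # rough estimate: integer multiply ~7x slower than float


def int_mac_rate(peak_gflops, flops_per_mac=FLOPS_PER_MAC, slowdown=INT_MUL_SLOWDOWN):
    """Estimated 32-bit integer multiply-accumulates per second."""
    gmacs = peak_gflops / flops_per_mac   # float MACs, in units of 1e9 per second
    return gmacs / slowdown * 1e9


rate = int_mac_rate(PEAK_GFLOPS)
print(f"{rate:.3g} integer MACs per second")
```

This lands a little above 7E11, which I round to 700E9.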

On the memory side, global memory bandwidth is about 480 GB/s and shared memory is about a factor of 10 higher.
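To put the bandwidth next to the compute estimate, here is a rough Python sketch. It assumes 4-byte (32-bit) operands and uses the rounded 700E9 integer multiply-accumulate figure from above. The names are hypothetical and the result is only an order-of-magnitude guide.

```python
GLOBAL_BW_BYTES = 480e9   # ~480 GB/s global memory bandwidth, from above
SHARED_BW_FACTOR = 10     # shared memory is roughly 10x faster, from above
INT_MAC_RATE = 700e9      # rounded integer MAC estimate from above
WORD_BYTES = 4            # assumed operand size: one 32-bit word


def words_per_second(bandwidth_bytes, word_bytes=WORD_BYTES):
    """How many 32-bit words the memory can deliver per second."""
    return bandwidth_bytes / word_bytes


global_words = words_per_second(GLOBAL_BW_BYTES)
shared_words = words_per_second(GLOBAL_BW_BYTES * SHARED_BW_FACTOR)

# Integer MACs the cores could do in the time it takes to fetch one global word.
macs_per_global_word = INT_MAC_RATE / global_words
print(global_words, shared_words, macs_per_global_word)
```

With these numbers a kernel needs roughly 6 integer multiply-accumulates per word read from global memory before compute, rather than memory, becomes the limit.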

In my experience, my internet browser is slow while the GPU is running. Everything else I do is unaffected, but I don't use that computer for much.

I've only used Linux and the GTX 10 series, so I'm not sure about cross-platform behavior or driver updates. Different GPU architectures (9 series vs. 10 series, etc.) have different instruction sets and features. Even within a single family, the optimal parameters will probably change due to different capabilities.
