MJansen · 2021-06-24, 13:00 · #7
Craig wrote:
We should probably start a separate GPU thread.
MJ: That would be good. Could anybody move the relevant posts to a separate thread, or should I open a new thread altogether?

The next bit should be discussed in the new thread.

You can have 3584 simultaneous gap searches. In my 65-bit optimized code roughly half the candidates are prime after sieving, so on average a gap search only needs the first couple of candidate tests. If I processed 32 candidates for a single gap, I would be giving up about a factor of 16 in performance, because most of the warp's threads would be doing work that is never needed. I think the term Nvidia uses for the warp execution model is single-instruction, multiple-thread (SIMT), which is similar to SIMD.
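
Below is a minimal sketch of the one-thread-per-gap layout described above. The names and the is_prime_slow() placeholder are mine, not Craig's actual code, and a real 65-bit candidate would need a two-word representation plus a fast Fermat/PRP test instead of trial division.

Code:
// Sketch: one gap search per GPU thread. Each thread walks its own
// candidate list, in contrast to dedicating a 32-thread warp to one gap.
#include <stdint.h>

// Placeholder primality test (trial division). Stands in for an optimized
// Fermat/PRP test; far too slow for real use.
__device__ bool is_prime_slow(uint64_t n)
{
    if (n < 2)      return false;
    if (n % 2 == 0) return n == 2;
    for (uint64_t d = 3; d <= n / d; d += 2)
        if (n % d == 0) return false;
    return true;
}

__global__ void gap_search(const uint64_t *centers,      // one centre per gap
                           const uint32_t *candidates,   // sieved offsets
                           int cand_per_gap,
                           uint32_t *first_prime_offset, // result per gap
                           int num_gaps)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;        // one gap per thread
    if (g >= num_gaps) return;

    for (int k = 0; k < cand_per_gap; k++) {
        uint32_t off = candidates[(size_t)g * cand_per_gap + k];
        if (is_prime_slow(centers[g] + off)) {
            first_prime_offset[g] = off;
            return;                                       // this thread is done
        }
    }
    first_prime_offset[g] = 0;                            // no prime in range
}

Launched as, say, gap_search<<<(num_gaps + 255) / 256, 256>>>(...), this keeps all 3584 cores busy as long as num_gaps is comfortably larger than the core count.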

A single GPU core is going to be slower than a single CPU core. GPU arithmetic is natively 32-bit. Integer multiplication is slower than floating-point multiplication (on my card a 32-bit integer multiply costs roughly 7x the cycles of a 32-bit float multiply). Integer division and mod are very slow on a GPU and should be avoided as much as possible. GPUs are designed for fast 32-bit float multiplies and fast memory access; everything else is slower.
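
As a hedged illustration of the "avoid division and mod" advice (my names, not from any actual gap-search code): a sieve can pay one 64-bit mod per prime to locate the first hit in the window and then do the rest with additions only.

Code:
// Sketch: one thread per sieving prime. The single '%' locates the first
// multiple of p in the window; the inner loop is pure 32-bit addition.
// Assumes window_start > p, so p itself never gets marked.
#include <stdint.h>

__global__ void sieve_window(unsigned char *composite,    // window_len flags
                             uint64_t window_start,
                             uint32_t window_len,
                             const uint32_t *primes, int num_primes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_primes) return;

    uint32_t p = primes[i];
    uint64_t r = window_start % p;                 // one slow op per prime
    uint32_t j = (r == 0) ? 0 : (uint32_t)(p - r); // first multiple in window

    for (; j < window_len; j += p)                 // additions only
        composite[j] = 1;  // racy same-value writes are harmless here
}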

For an idea of the processing power, the specs are listed here:
https://en.wikipedia.org/wiki/GeForce_10_series

The 1080 Ti is rated at about 10000 GFLOPS. That counts a multiply and an accumulate as separate operations, so about 5000 GMAC/s. Dividing by 7 for the integer-multiply penalty gives me roughly 700E9 32-bit integer multiply-accumulates per second.
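
For what it's worth, here is a small host-side sketch that redoes that arithmetic from the device properties. The 128 cores per SM is a Pascal (GTX 10 series) assumption that cudaDeviceProp does not report, and the factor of 7 is just the observed integer-multiply penalty quoted above.

Code:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int cores_per_sm = 128;                    // assumption: Pascal
    double clock_hz = prop.clockRate * 1000.0;       // clockRate is in kHz
    double cores    = (double)prop.multiProcessorCount * cores_per_sm;
    double gflops   = cores * clock_hz * 2.0 / 1e9;  // mul + add per cycle
    double gmacs    = gflops / 2.0;
    double int_macs = gmacs / 7.0;                   // ~7x integer penalty

    printf("%s: ~%.0f GFLOPS, ~%.0f GMAC/s, ~%.0f G int32 MAC/s\n",
           prop.name, gflops, gmacs, int_macs);
    return 0;
}

On a 1080 Ti (28 SMs at roughly 1.6 GHz boost) this lands near the 10000 GFLOPS and 700E9 figures above.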

On the memory side, global memory bandwidth is about 480 GB/s and shared memory is about a factor of 10 higher.
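
A hedged sketch of why that factor of 10 matters in practice: a block can stage a small prime table into shared memory once, then reread it thousands of times at shared-memory speed instead of hitting global memory on every iteration. Names and sizes below are illustrative only (and the '%' is kept for brevity, despite the advice above).

Code:
#include <stdint.h>

#define SMALL_PRIMES 1024   // 4 KB of shared memory

__global__ void count_survivors(const uint32_t *primes,  // SMALL_PRIMES entries
                                const uint64_t *cand, int n,
                                uint32_t *survives)
{
    __shared__ uint32_t sp[SMALL_PRIMES];

    // Cooperative load: each thread copies a strided slice of the table.
    for (int i = threadIdx.x; i < SMALL_PRIMES; i += blockDim.x)
        sp[i] = primes[i];
    __syncthreads();

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;

    uint64_t c = cand[t];
    uint32_t ok = 1;
    for (int i = 0; i < SMALL_PRIMES; i++)       // reads hit shared memory
        if (c % sp[i] == 0) { ok = 0; break; }
    survives[t] = ok;
}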

In my experience, my internet browser is slow while the GPU is running. Everything else I do is unaffected, but I don't use that computer for much.

I've only used Linux and the GTX 10 series, so I'm not sure about cross-platform issues or driver updates. Different GPU architectures (9 series vs 10 series, etc.) have different instruction sets and features. Even within a single family the optimal parameters will probably change with the card's capabilities.
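
Along the same lines, launch parameters can at least be chosen at run time from the device properties rather than hard-coded for one card. A sketch with placeholder numbers:

Code:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads_per_block = 256;                  // placeholder default
    if (prop.major >= 6)                          // Pascal (10 series) or newer
        threads_per_block = 512;

    int blocks = prop.multiProcessorCount * 8;    // a few blocks per SM

    printf("%s: compute %d.%d -> %d blocks x %d threads\n",
           prop.name, prop.major, prop.minor, blocks, threads_per_block);
    return 0;
}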