mersenneforum.org  

2016-05-10, 10:21   #1
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

2²·3²·179 Posts
What to expect from GTX1080 for ECM?

A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.

A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?

2016-05-10, 10:41   #2
diep
 
Sep 2006
The Netherlands

11·71 Posts

Quote:
Originally Posted by fivemack
A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.

A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?

If ECM is using FP64 resources, I wouldn't count on the GTX1080 being faster at all.

2016-05-10, 15:07   #3
ATH
Einyen
 
Dec 2003
Denmark

110010000000₂ Posts

The GTX 970 also has poor FP64 performance, so I doubt it uses FP64.

The GTX 950, 960, 970, 980, 980 Ti and Titan X all have FP64 = 1/32 of FP32.

Last fiddled with by ATH on 2016-05-10 at 15:11

2016-05-10, 15:28   #4
bsquared
 
"Ben"
Feb 2007

3×1,193 Posts

Quote:
Originally Posted by ecm-gpu.h

#ifndef ECM_GPU_NB_DIGITS
#define ECM_GPU_NB_DIGITS 32 //by default
#endif

#ifndef ECM_GPU_DIGITS
#define ECM_GPU_DIGITS 0
#endif

#if (ECM_GPU_DIGITS==0)
#define ECM_GPU_SIZE_DIGIT 32
typedef unsigned int digit_t;
typedef int carry_t;
#endif

It appears to use 32-bit unsigned ints as its bignum digits, as expected.
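
To make the digit layout concrete, here is a minimal host-side C sketch. It assumes the same `digit_t`/`carry_t` typedefs as the quoted header, a 32-bit `unsigned int`, and 32 digits per number (1024 bits). The helper `bignum_add` is hypothetical and is not taken from GMP-ECM; it only illustrates schoolbook carry propagation over 32-bit digits.

```c
#include <assert.h>
#include <stdint.h>

#define ECM_GPU_NB_DIGITS 32      /* default from the quoted header */
typedef unsigned int digit_t;     /* one bignum digit; assumed to be 32 bits wide */
typedef int carry_t;

/* Hypothetical helper (not GMP-ECM code): r = a + b, where each operand is
   ECM_GPU_NB_DIGITS little-endian 32-bit digits.
   Returns the carry out of the most significant digit (0 or 1). */
static carry_t bignum_add(digit_t *r, const digit_t *a, const digit_t *b)
{
    assert(r && a && b);
    uint64_t acc = 0;
    for (int i = 0; i < ECM_GPU_NB_DIGITS; i++) {
        acc += (uint64_t)a[i] + b[i];   /* digit sum plus incoming carry */
        r[i] = (digit_t)acc;            /* keep the low 32 bits */
        acc >>= 32;                     /* carry into the next digit */
    }
    return (carry_t)acc;
}
```

With 32 digits of 32 bits each, one number occupies 1024 bits, which is the "kilobit arithmetic" scale mentioned elsewhere in the thread.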

I see the memory clock and bandwidth also scale up, but not by quite as much as the core clock (7 Gb/s --> 10 Gb/s, 224 GB/sec --> 320 GB/sec, a factor of about 1.43, versus 1607/1050 = 1.53 for the core clock). So whether GPU-ECM is compute bound or memory bound, I think it's reasonable to expect the performance to improve by a factor of 1.4-1.5.
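
The arithmetic behind that range can be sketched in a few lines of C. This is only a back-of-envelope model under a loud assumption: time per block scales inversely with whichever resource (core clock or memory bandwidth) is the bottleneck, and the larger 640-curve block costs no extra time because the extra execution units absorb the extra curves. The function names are made up for this sketch.

```c
#include <assert.h>

/* Ratio new/old of one resource (core clock in MHz, or bandwidth in GB/s). */
static double speedup(double old_value, double new_value)
{
    assert(old_value > 0.0);
    return new_value / old_value;
}

/* Hours per block on the new card, ASSUMING time per block shrinks by
   `factor` and that the bigger block is absorbed by the extra units. */
static double block_hours(double old_hours, double factor)
{
    assert(factor > 0.0);
    return old_hours / factor;
}
```

With the figures from the thread, `speedup(1050, 1607)` is about 1.53 and `speedup(224, 320)` is about 1.43, so `block_hours(24, ...)` lands between roughly 15.7 and 16.8 hours, which brackets the 16 hours asked about in the first post.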

2016-05-10, 15:31   #5
diep
 
Sep 2006
The Netherlands

11×71 Posts

If I look at graphs of the number of GFLOPS (double precision) that LLR-CUDA type software gets out of GPUs, it is very bad indeed. It seems to scale with bandwidth, so not having many FP64 resources is no big deal for that software. They need to implement a different transform; right now they seemingly burn big watts for nothing.

...

Yet objectively, if you look at FP64 resources, I doubt the GTX1080 has many. They can clock it way higher when not using them. What would stop them from doing that?

Last fiddled with by diep on 2016-05-10 at 15:33

2016-05-10, 15:40   #6
danaj
 
"Dana Jacobsen"
Feb 2011
Bangkok, TH

38D₁₆ Posts

Some sources are claiming much better double precision, albeit these are more like "educated rumors" right now. E.g. one article says "For starters the company is bringing back double precision floating point compute performance to the forefront. Where only one double precision CUDA core existed for every 32 standard CUDA cores in Maxwell, Pascal is built with one double precision CUDA core for every two standard CUDA cores."

The Wikipedia article also cites two sources for its claim of "Half precision FP16 operations executed at twice the rate of FP32, while FP64 operations run at 1/2 the speed respectively."

It wouldn't surprise me if the press completely misinterpreted this and it was still crippled on the consumer models. The sources seem to be discussions of the Pascal architecture, which is targeted to both Tesla and consumers.

2016-05-10, 15:40   #7
diep
 
Sep 2006
The Netherlands

1415₈ Posts

If you're no good at GPU programming, then you scale with bandwidth.
The GTX980 has versions with 224 GB/s, going up to versions with 336 GB/s bandwidth.
The initially released GTX1080 has 320 GB/s bandwidth.

So if the software is not very well programmed and is bandwidth oriented rather than CPU oriented, then it'll be a lot slower.

edit: GTX980ti = 336 GB/s

Last fiddled with by diep on 2016-05-10 at 15:54

2016-05-10, 15:45   #8
diep
 
Sep 2006
The Netherlands

30D₁₆ Posts

Danaj, that ratio of one FP64 unit per two FP32 units is of course only true for the Tesla P100.

Nvidia is a bunch of very clever guys. They aren't going to sell you for 600 euro a card that they want to sell you for 120k dollars (with 8 in one node).

The Tesla P100 is a very impressive card. I bet it'll dominate the HPC machines being built right now.

So they aren't going to offer you a factor-100 price reduction on a gamer's card with nearly the same performance as 1/3 of a Tesla, as they wouldn't sell a single Tesla then.

2016-05-10, 16:05   #9
diep
 
Sep 2006
The Netherlands

11·71 Posts

If the card has a fast 32x32 -> 64-bit multiplication, then we can consider producing an FFT that uses only integers.

Maybe an NTT using CRT to put more bits in a unit and to scale to larger transforms. Those GPUs are so fast now and you have so many resources that it would be worth investigating.

The overhead of an NTT is pretty big compared to what Woltman is doing on the CPU. Calculating a modulo is not fast on a GPU: lots of instructions per unit compared to an ideal double precision transform.
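
As a rough illustration of the integer-only idea, here is a small C sketch of a number-theoretic transform built on a 32x32 -> 64-bit multiply. Everything in it is an assumption chosen for illustration and not anything from GMP-ECM or Prime95: the prime 2013265921 = 15·2^27 + 1 with primitive root 31, the naive O(n^2) transform (a real implementation would use a butterfly FFT structure), the 64-element size limit, and a single prime with no CRT step. The `%` in `mulmod` is the per-multiply reduction cost the post refers to.

```c
#include <assert.h>
#include <stdint.h>

#define NTT_P 2013265921u   /* 15 * 2^27 + 1, prime (chosen for this sketch) */
#define NTT_G 31u           /* a primitive root modulo NTT_P */

/* 32x32 -> 64-bit product followed by reduction mod NTT_P. */
static uint32_t mulmod(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % NTT_P);
}

/* base^exp mod NTT_P by square-and-multiply. */
static uint32_t powmod(uint32_t base, uint32_t exp)
{
    uint32_t result = 1;
    while (exp) {
        if (exp & 1u) result = mulmod(result, base);
        base = mulmod(base, base);
        exp >>= 1;
    }
    return result;
}

/* Naive O(n^2) transform: out[k] = sum_j in[j] * w^(j*k) mod NTT_P,
   where w is a primitive n-th root of unity mod NTT_P. */
static void ntt_naive(uint32_t *out, const uint32_t *in, uint32_t n, uint32_t w)
{
    for (uint32_t k = 0; k < n; k++) {
        uint32_t wk = powmod(w, k);   /* w^k */
        uint32_t wjk = 1;             /* runs through w^(j*k) */
        uint64_t acc = 0;
        for (uint32_t j = 0; j < n; j++) {
            acc = (acc + mulmod(in[j], wjk)) % NTT_P;
            wjk = mulmod(wjk, wk);
        }
        out[k] = (uint32_t)acc;
    }
}

/* Cyclic convolution c = a (*) b of length n, with n dividing NTT_P - 1.
   Results are exact only while the true coefficients stay below NTT_P. */
static void cyclic_convolve(uint32_t *c, const uint32_t *a,
                            const uint32_t *b, uint32_t n)
{
    assert(n > 0 && n <= 64 && (NTT_P - 1u) % n == 0);
    uint32_t fa[64], fb[64];
    uint32_t w = powmod(NTT_G, (NTT_P - 1u) / n);   /* primitive n-th root */
    uint32_t w_inv = powmod(w, NTT_P - 2u);         /* inverse via Fermat */
    uint32_t n_inv = powmod(n, NTT_P - 2u);
    ntt_naive(fa, a, n, w);
    ntt_naive(fb, b, n, w);
    for (uint32_t i = 0; i < n; i++) fa[i] = mulmod(fa[i], fb[i]);
    ntt_naive(c, fa, n, w_inv);                     /* inverse transform */
    for (uint32_t i = 0; i < n; i++) c[i] = mulmod(c[i], n_inv);
}
```

Because the arithmetic is exact, there is no rounding error to budget for, but each coefficient holds fewer than 31 bits. Combining several such primes with CRT, as suggested above, would be one way to carry more bits per element.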

2016-05-10, 18:41   #10
bsquared
 
"Ben"
Feb 2007

3×1,193 Posts

Quote:
Originally Posted by fivemack
A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.

A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?

Out of curiosity, are these timings for half size or full size (1018-bit) inputs?

2016-05-11, 09:07   #11
xilman
Bamboozled!
 
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across

11,027 Posts

Quote:
Originally Posted by fivemack
A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.

A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?

Curious. My 970 runs 832 curves at once. This is with stock GMP-ECM, meaning kilobit arithmetic.

Code:
Input number is 3739842974907858397552172903385764902385045463835258551832047737904189157686320432379794586053438820815117446826957517100181574651282354947806538316031297323554516867664691768863509608671071081 (199 digits)
Using B1=11000000, B2=0, sigma=3:3799984916-3:3799985747 (832 curves)
GPU: Block: 32x32x1 Grid: 26x1x1 (832 parallel curves)
Computing 832 Step 1 took 59050ms of CPU time / 2989511ms of GPU time
Code:
pcl@horus /opt/cuda/sdk/bin/x86_64/linux/release $ ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version 
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 970"
  CUDA Driver Version:                           7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4095 MBytes (4294246400 bytes)
MapSMtoCores for SM 5.2 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 5.2 is undefined.  Default to use 128 Cores/SM
  (13) Multiprocessors, (128) CUDA Cores/MP:     1664 CUDA Cores
  GPU Clock rate:                                1216 MHz (1.22 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 1835008 bytes
  Max Texture Dimension Sizes                    1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):    (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS

600/11 * 2990/3600 * 1216/1050 * 416/832 = 26 hours. Assuming that I placed the numerators and denominators in the correct order, that's consistent with your timings once the density of primes < B1 has been taken into account.
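
The same estimate can be written out as a small C function, which makes it easier to check that each ratio points the right way. This is only a sketch of the scaling used in the post: it assumes Step 1 time is linear in B1, in the number of curves, and in the inverse of the core clock. The function name and parameter names are made up for illustration.

```c
#include <assert.h>

/* Rough Step 1 prediction in hours for a target run, scaled from a
   measured run. ASSUMES time is linear in B1 and in curve count, and
   inversely proportional to the core clock. */
static double predict_hours(double b1_target, double b1_measured,
                            double measured_seconds,
                            double clock_measured_mhz, double clock_target_mhz,
                            double curves_target, double curves_measured)
{
    assert(b1_measured > 0.0 && clock_target_mhz > 0.0 && curves_measured > 0.0);
    double measured_hours = measured_seconds / 3600.0;
    return (b1_target / b1_measured)                 /* more work per curve */
         * measured_hours
         * (clock_measured_mhz / clock_target_mhz)   /* slower target clock */
         * (curves_target / curves_measured);        /* fewer curves per block */
}
```

Calling `predict_hours(6e8, 11e6, 2990, 1216, 1050, 416, 832)` reproduces the product above and gives a little over 26 hours, against the 24 hours reported in the first post.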

Last fiddled with by xilman on 2016-05-11 at 09:09 Reason: Fix formatting

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.