|
|
#1 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts |
A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.
A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours? |
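As a rough sanity check of that guess, here is a minimal Python sketch using only the figures quoted above, assuming per-curve step-1 time scales purely with core clock and that the larger block simply means more curves in flight at once:

Code:
# Back-of-the-envelope check of the "16 hours" guess, using only the numbers
# quoted above. Assumption: per-curve speed scales with core clock, and the
# bigger block just runs more curves in parallel (ignores memory bandwidth).
gtx970_hours, gtx970_mhz = 24.0, 1050
gtx1080_curves, gtx1080_mhz = 640, 1607

est_hours = gtx970_hours * gtx970_mhz / gtx1080_mhz
print(f"Estimated GTX 1080 block time: {est_hours:.1f} h for {gtx1080_curves} curves")
# -> about 15.7 h, i.e. the quoted "16 hours", *if* the kernel is clock-bound.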
|
|
|
|
|
#2 | |
|
Sep 2006
The Netherlands
3×269 Posts |
Quote:
|
|
|
|
|
|
|
#3 |
|
Einyen
Dec 2003
Denmark
2²×863 Posts |
The GTX 970 also has poor FP64 performance, so I doubt GPU-ECM uses FP64.
The GTX 950, 960, 970, 980, 980 Ti and Titan X all have FP64 at 1/32 of the FP32 rate. Last fiddled with by ATH on 2016-05-10 at 15:11 |
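To put that 1/32 ratio into numbers, a rough sketch follows; the 1664 cores and ~1.05 GHz are the published GTX 970 specs rather than figures from this thread, and 2 FLOPs per core per cycle assumes FMA:

Code:
# Rough peak throughput for a GTX 970, assuming 1664 CUDA cores at ~1.05 GHz
# and 2 FLOPs per core per cycle (FMA). FP64 runs at 1/32 of the FP32 rate.
cores, clock_ghz = 1664, 1.05
fp32_gflops = cores * clock_ghz * 2
fp64_gflops = fp32_gflops / 32
print(f"FP32 ~ {fp32_gflops:.0f} GFLOPS, FP64 ~ {fp64_gflops:.0f} GFLOPS")
# -> roughly 3.5 TFLOPS FP32 vs ~110 GFLOPS FP64, so any FP64-heavy kernel
#    is a poor fit for these consumer Maxwell cards.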
|
|
|
|
|
#4 | |
|
"Ben"
Feb 2007
3·5·251 Posts |
Quote:
I see the memory clock and bandwidth also scale up, but not by quite as much (7 Gb/s --> 10 Gb/s, 224 GB/s --> 320 GB/s). So whether GPU-ECM is compute-bound or memory-bound, I think it's reasonable to expect the performance to improve by a factor of 1.4-1.5. |
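The arithmetic behind that 1.4-1.5x range, as a small sketch using only the figures quoted in this thread:

Code:
# The two ratios behind the "1.4-1.5x" estimate.
bandwidth_ratio = 320 / 224   # GB/s: memory-bound case  -> ~1.43x
clock_ratio = 1607 / 1050     # MHz:  clock-bound case   -> ~1.53x
print(f"bandwidth-bound speedup: ~{bandwidth_ratio:.2f}x")
print(f"clock-bound speedup:     ~{clock_ratio:.2f}x")
# Either way the expected per-block speedup lands in the quoted 1.4-1.5x range.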
|
|
|
|
|
|
#5 |
|
Sep 2006
The Netherlands
1100100111₂ Posts |
If I look at graphs of the number of GFLOPS (double precision) that LLR-CUDA type software gets out of GPUs, then this is very bad indeed. It seems to scale with bandwidth. Not having many FP64 resources is no big deal to them, then. They need a different transform to implement... ...right now they seemingly burn big watts for nothing.
... Yet objectively, if you look at the FP64 resources of the GTX 1080, I doubt it has many. They can clock it way higher when not using them. What would stop them from doing that? Last fiddled with by diep on 2016-05-10 at 15:33 |
|
|
|
|
|
#6 |
|
"Dana Jacobsen"
Feb 2011
Bangkok, TH
2×5×7×13 Posts |
Some sources are claiming much better double precision, albeit these are more like "educated rumors" right now. E.g. this says "For starters the company is bringing back double precision floating point compute performance to the forefront. Where only one double precision CUDA core existed for every 32 standard CUDA cores in Maxwell, Pascal is built with one double precision CUDA core for every two standard CUDA cores."
The Wikipedia article also cites two sources for its claim that "Half precision FP16 operations executed at twice the rate of FP32, while FP64 operations run at 1/2 the speed respectively." It wouldn't surprise me if the press completely misinterpreted this and it was still crippled on the consumer models. The sources seem to be discussions of the Pascal architecture, which is targeted at both Tesla and consumer cards. |
|
|
|
|
|
#7 |
|
Sep 2006
The Netherlands
3×269 Posts |
If you're no good at GPU programming then you scale with bandwidth.
The GTX 980 has versions with 224 GB/s going up to versions with 336 GB/s of bandwidth. The initially released GTX 1080 has 320 GB/s of bandwidth. So if the software is not very well programmed and is bandwidth-bound rather than compute-bound, then it'll be a lot slower. edit: GTX 980 Ti = 336 GB/s Last fiddled with by diep on 2016-05-10 at 15:54 |
|
|
|
|
|
#8 |
|
Sep 2006
The Netherlands
1447₈ Posts |
Danaj, that ratio of one FP64 unit per two FP32 units is of course only true for the Tesla P100.
Nvidia is a bunch of very clever guys. They aren't gonna sell for 600 euros a card that they want to sell to you for 120k dollars (with 8 in one node). The Tesla P100 is a very impressive card; I bet it'll dominate the HPC machines being built right now. So they aren't gonna offer you a factor-of-100 price reduction on a gamer card with nearly the same performance as 1/3 of a Tesla, as they wouldn't sell a single Tesla then. |
|
|
|
|
|
#9 |
|
Sep 2006
The Netherlands
3·269 Posts |
If the card has fast 32x32 == 64-bit multiplication then we can consider taking a look at producing an FFT that uses only integers.
Maybe an NTT using the CRT to put more bits per unit and be able to scale to larger transforms. Those GPUs are so fast now and you have so many resources; that would be worth investigating. The overhead of an NTT is pretty big compared to what Woltman is doing on the CPU. Calculating a modulo is not fast on a GPU: lots of instructions for each unit compared to an ideal double-precision transform. |
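A toy illustration of that idea (nothing here comes from an existing GPU code; the primes and helper names are just for the sketch): do the per-element multiplications modulo two NTT-friendly 32-bit primes, so every product fits a 32x32 -> 64-bit multiply, then recombine the residues with the CRT.

Code:
# Sketch of the "integer transform + CRT" idea: arithmetic is done modulo two
# NTT-friendly 32-bit primes (each product fits 32x32 -> 64 bits), and the CRT
# stitches the residues into a result modulo p1*p2 (~61 bits per coefficient).
P1 = 998244353      # 119 * 2^23 + 1, supports power-of-two length NTTs
P2 = 2013265921     # 15  * 2^27 + 1

def mulmod(a, b, p):
    # On a GPU this reduction mod p is the costly part mentioned above:
    # a full 32x32 -> 64-bit product followed by several reduction instructions.
    return (a * b) % p

def crt2(r1, r2):
    # Garner's form of the CRT for two coprime moduli: 0 <= result < P1*P2.
    inv_p1 = pow(P1, -1, P2)          # modular inverse (Python 3.8+)
    return r1 + P1 * (((r2 - r1) * inv_p1) % P2)

# Example: recover the exact product of two ~30-bit values from the residues.
a, b = 123456789, 987654321
x = crt2(mulmod(a, b, P1), mulmod(a, b, P2))
assert x == a * b                      # exact, since a*b < P1*P2
print(x)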
|
|
|
|
|
#10 | |
|
"Ben"
Feb 2007
3×5×251 Posts |
Quote:
|
|
|
|
|
|
|
#11 | |
|
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
11798₁₀ Posts |
Quote:
Code:
Input number is 3739842974907858397552172903385764902385045463835258551832047737904189157686320432379794586053438820815117446826957517100181574651282354947806538316031297323554516867664691768863509608671071081 (199 digits)
Using B1=11000000, B2=0, sigma=3:3799984916-3:3799985747 (832 curves)
GPU: Block: 32x32x1 Grid: 26x1x1 (832 parallel curves)
Computing 832 Step 1 took 59050ms of CPU time / 2989511ms of GPU time
Code:
pcl@horus /opt/cuda/sdk/bin/x86_64/linux/release $ ./deviceQueryDrv
./deviceQueryDrv Starting...
CUDA Device Query (Driver API) statically linked version
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 970"
CUDA Driver Version: 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 4095 MBytes (4294246400 bytes)
MapSMtoCores for SM 5.2 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 5.2 is undefined. Default to use 128 Cores/SM
(13) Multiprocessors, (128) CUDA Cores/MP: 1664 CUDA Cores
GPU Clock rate: 1216 MHz (1.22 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 1835008 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Result = PASS
Last fiddled with by xilman on 2016-05-11 at 09:09 Reason: Fix formatting |
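Converting that output into per-curve figures (just arithmetic on the numbers reported above):

Code:
# 832 curves at B1=11e6, step 1 in 2989511 ms of GPU time on the GTX 970.
curves, gpu_ms = 832, 2_989_511
print(f"block time: {gpu_ms / 3.6e6:.2f} h")              # ~0.83 h for 832 curves
print(f"per curve:  {gpu_ms / curves / 1e3:.1f} s GPU")   # ~3.6 s of GPU time each
# 832 is twice the 416-curve batch mentioned in post #1 (Grid 26x1x1, 32 curves/block).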
|
|
|
|