What to expect from GTX1080 for ECM?
A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.
A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?
[QUOTE=fivemack;433493]A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.
A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?[/QUOTE] If ECM is using FP64 resources, I wouldn't count on the GTX1080 being faster at all.
The GTX 970 also has poor FP64 performance, so I doubt it uses FP64.
The GTX 950, 960, 970, 980, 980Ti and Titan X all have FP64 = 1/32 of FP32.
[QUOTE=ecm-gpu.h]
#ifndef ECM_GPU_NB_DIGITS
#define ECM_GPU_NB_DIGITS 32 //by default
#endif

#ifndef ECM_GPU_DIGITS
#define ECM_GPU_DIGITS 0
#endif

#if (ECM_GPU_DIGITS==0)
#define ECM_GPU_SIZE_DIGIT 32
typedef unsigned int digit_t;
typedef int carry_t;
#endif
[/QUOTE] It appears to use 32-bit unsigned ints as its bignum digits, as expected. I see the memory clock and bandwidth also scale up, but not by quite as much (7 Gb/s --> 10 Gb/s, 224 GB/sec --> 320 GB/sec). So whether GPU-ECM is compute or memory bound, I think it's reasonable to expect the performance to improve by a factor of 1.4-1.5.
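For reference, a minimal sketch of where that 1.4-1.5 range comes from, using only the clocks and bandwidths quoted in this thread (the assumption that the extra execution units are absorbed by the larger 640-curve batch is mine):
[code]
#include <stdio.h>

/* With the extra execution units soaked up by the larger 640-curve batch,
 * the per-curve speedup is roughly the clock ratio if step 1 is compute
 * bound, or the bandwidth ratio if it is memory bound.
 * Figures are the ones quoted in this thread. */
int main(void)
{
    double clock_970 = 1050.0, clock_1080 = 1607.0;  /* MHz */
    double bw_970 = 224.0, bw_1080 = 320.0;          /* GB/s */

    printf("compute-bound factor:   %.2f\n", clock_1080 / clock_970);  /* ~1.53 */
    printf("bandwidth-bound factor: %.2f\n", bw_1080 / bw_970);        /* ~1.43 */
    return 0;
}
[/code]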
If I look at graphs of the number of GFLOPS (double precision) that LLR-CUDA type software gets out of GPUs, then this is very bad indeed. It seems to scale with bandwidth, so not having many FP64 resources is no big deal to them. They need to implement a different transform... right now they seemingly burn big watts for nothing.
Yet objectively, if you look at the FP64 resources of the GTX1080, I doubt it has many. They can clock it way higher when not using them - what would stop them from doing that?
Some sources are claiming much better double precision, albeit these are more like "educated rumors" right now. E.g. [URL="http://www.mobipicker.com/nvidia-pascal-gtx-1060-launching-fall/"]this[/URL] says "For starters the company is bringing back double precision floating point compute performance to the forefront. Where only one double precision CUDA core existed for every 32 standard CUDA cores in Maxwell, Pascal is built with one double precision CUDA core for every two standard CUDA cores."
The [URL="https://en.wikipedia.org/wiki/GeForce_10_series"]Wikipedia article[/URL] also cites two sources for its claim of "Half precision FP16 operations executed at twice the rate of FP32, while FP64 operations run at 1/2 the speed respectively." It wouldn't surprise me if the press completely misinterpreted this and it was still crippled on the consumer models. The sources seem to be discussions of the Pascal architecture, which is targeted at both Tesla and consumers.
If you're no good at GPU programming, then you scale with bandwidth.
The GTX980 has versions with 224 GB/s, going up to versions with 336 GB/s of bandwidth, while the initially released GTX1080 has 320 GB/s. So if the software is not so well programmed, and is bandwidth bound rather than compute bound, then it'll be a lot slower. edit: GTX980Ti = 336 GB/s
Danaj, that one FP64 unit per two FP32 units is of course only true for the Tesla P100.
Nvidia is a bunch of very clever guys. They aren't going to sell for 600 euro a card that they want to sell you for 120k dollars (with 8 in one node). The Tesla P100 is a very impressive card; I bet it'll dominate the HPC machines being built right now. So they aren't going to offer a factor-100 price reduction on a gamer's card with nearly the same performance as a third of a Tesla, because then they wouldn't sell a single Tesla.
If the card has fast 32x32 -> 64-bit multiplication, then we can consider taking a look at producing an FFT that uses only integers.
Maybe an NTT using the CRT to pack more bits per unit and be able to scale to larger transforms. Those GPUs are so fast now, and you have so many resources, that it would be worth investigating. The overhead of an NTT is pretty big compared to what Woltman is doing on the CPU: calculating a modulo is not fast on a GPU, and each unit takes a lot of instructions compared to an ideal double-precision transform.
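To make that instruction-count point concrete, here is a plain C sketch (my own illustration, not anything from GMP-ECM or an existing GPU NTT) of the 32x32 -> 64-bit modular multiply an integer NTT butterfly would be built on, using Montgomery reduction to avoid a general modulo; even in this friendlier form it costs three multiplies plus shifts, adds and a conditional subtract per 32-bit word:
[code]
#include <stdint.h>
#include <stdio.h>

/* -p^-1 mod 2^32 via Newton iteration (valid for any odd p).
 * Each step doubles the number of correct low bits. */
static uint32_t neg_inv32(uint32_t p)
{
    uint32_t x = p;                 /* correct to 3 bits for odd p */
    for (int i = 0; i < 4; i++)
        x *= 2 - p * x;             /* now x = p^-1 mod 2^32 */
    return (uint32_t)(0u - x);
}

/* Montgomery multiplication: returns a*b*2^-32 mod p for a, b < p.
 * Three 32x32->64 multiplies, a few adds/shifts and one conditional
 * subtract - per word, per operand, in an integer NTT butterfly. */
static uint32_t mont_mul(uint32_t a, uint32_t b, uint32_t p, uint32_t pinv)
{
    uint64_t t = (uint64_t)a * b;
    uint32_t m = (uint32_t)t * pinv;                 /* low 32 bits only */
    uint64_t u = (t >> 32) + (((uint64_t)m * p + (uint32_t)t) >> 32);
    return (u >= p) ? (uint32_t)(u - p) : (uint32_t)u;
}

int main(void)
{
    uint32_t p = 2147483647u;                        /* 2^31 - 1, an odd modulus */
    uint32_t pinv = neg_inv32(p);
    /* Convert 3 and 5 to Montgomery form (x * 2^32 mod p), multiply, convert back. */
    uint32_t a = (uint32_t)(((uint64_t)3 << 32) % p);
    uint32_t b = (uint32_t)(((uint64_t)5 << 32) % p);
    uint32_t r = mont_mul(mont_mul(a, b, p, pinv), 1, p, pinv);
    printf("3 * 5 mod p = %u\n", r);                 /* prints 15 */
    return 0;
}
[/code]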
[QUOTE=fivemack;433493]A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.
A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?[/QUOTE] Out of curiosity, are these timings for half size or full size (1018-bit) inputs?
[QUOTE=fivemack;433493]A GTX970 wants to run 416 curves at a time, and at B1=6e8 each block takes 24 hours.
A GTX1080 has more execution units, so will run 640 curves at a time, and is clocked at 1607MHz vs 1050MHz for the 970 - does that mean it's reasonable to anticipate that it will do a 640-curve block in 16 hours?[/QUOTE]Curious. My 970 runs 832 curves at once. This with stock GMP-ECM, meaning kilobit arithmetic.
[code]
Input number is 3739842974907858397552172903385764902385045463835258551832047737904189157686320432379794586053438820815117446826957517100181574651282354947806538316031297323554516867664691768863509608671071081 (199 digits)
Using B1=11000000, B2=0, sigma=3:3799984916-3:3799985747 (832 curves)
GPU: Block: 32x32x1 Grid: 26x1x1 (832 parallel curves)
Computing 832 Step 1 took 59050ms of CPU time / 2989511ms of GPU time
[/code]
[code]
pcl@horus /opt/cuda/sdk/bin/x86_64/linux/release $ ./deviceQueryDrv
./deviceQueryDrv Starting...

CUDA Device Query (Driver API) statically linked version

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 970"
  CUDA Driver Version:                           7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4095 MBytes (4294246400 bytes)
MapSMtoCores for SM 5.2 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 5.2 is undefined.  Default to use 128 Cores/SM
  (13) Multiprocessors, (128) CUDA Cores/MP:     1664 CUDA Cores
  GPU Clock rate:                                1216 MHz (1.22 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 1835008 bytes
  Max Texture Dimension Sizes                    1D=(65536) 2D=(65536, 65536) 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Texture alignment:                             512 bytes
  Maximum memory pitch:                          2147483647 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Result = PASS
[/code]
600/11 * 2990/3600 * 1216/1050 * 416/832 = 26 hours. Assuming that I placed the numerators and denominators in the correct order, that's consistent with your timings once the density of primes < B1 has been taken into account.
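A minimal sketch of that cross-check, assuming step 1 time scales roughly linearly with B1 (all other figures are the ones posted above):
[code]
#include <stdio.h>

/* Scale the measured 832-curve run at B1=1.1e7 on this 970 (boosting to
 * 1216 MHz) to fivemack's 416-curve blocks at B1=6e8 on a 1050 MHz 970. */
int main(void)
{
    double b1_scale    = 600.0 / 11.0;     /* B1 = 6e8 vs 1.1e7 */
    double gpu_hours   = 2990.0 / 3600.0;  /* 2989511 ms of GPU time */
    double clock_scale = 1216.0 / 1050.0;  /* slower clock on the other card */
    double curve_scale = 416.0 / 832.0;    /* half-size blocks */

    printf("estimated block time: %.0f hours\n",
           b1_scale * gpu_hours * clock_scale * curve_scale);  /* ~26 */
    return 0;
}
[/code]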