mersenneforum.org  

2014-10-26, 20:44   #309
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
1CEF₁₆ Posts

FFT length is way too small

2014-10-26, 20:46   #310
Bdot
Nov 2010
Germany
3×199 Posts

Quote:
Originally Posted by Prime95
FFT length is way too small
You mean an FFT size of 3538944 is way too small for M64847711?

2014-10-26, 21:08   #311
legendarymudkip
Jun 2014
2³×3×5 Posts

I think George was talking about the 2^17 FFT length earlier.

2015-01-08, 10:04   #312
Lorenzo
Aug 2010
Republic of Belarus
2×89 Posts

Hello! Where can I see benchmarks for a 100M exponent (ETA, ms)? I'm very interested in results for the 295X2. The card is very cheap now and I want to buy one!

A second question: does the OpenCL LL test use only the GPU, or GPU+CPU?

2015-01-10, 19:40   #313
Bdot
Nov 2010
Germany
3·199 Posts

Quote:
Originally Posted by Lorenzo
Hello! Where can I see benchmarks for a 100M exponent (ETA, ms)? I'm very interested in results for the 295X2. The card is very cheap now and I want to buy one!

A second question: does the OpenCL LL test use only the GPU, or GPU+CPU?
It uses only the GPU (plus a negligible share of the CPU to drive the GPU). As for benchmarks, there is only what you find in this thread, just above: AK76's R290 tests and my HD7950 tests. 100M exponents are not a good choice for AMD cards right now because the suitable FFT length (6291456) is not a power of 2. The HD7950 (same 1100/1400 MHz clocks) runs these at iteration times of ~46 ms (ETA for M100000007: 1276h). The next power of 2 (8388608) runs at 13.3 ms, but it is too big for M100. With the 8M FFT, M130000007's ETA is just 480h.
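
The ETAs quoted above follow from simple arithmetic: an LL test of M(p) needs p - 2 squarings, so the total time is roughly the exponent times the iteration time. A minimal sketch in Python, assuming a constant iteration time over the whole run; the function name `eta_hours` is made up for illustration. Since the ~46 ms figure is rounded, the second result lands near the quoted 1276h rather than exactly on it.

```python
def eta_hours(exponent: int, ms_per_iter: float) -> float:
    """Approximate wall-clock hours for a full LL test of M(exponent)."""
    iterations = exponent - 2  # an LL test of M(p) performs p - 2 squarings
    return iterations * ms_per_iter / 1000.0 / 3600.0


# 8M FFT (8388608) at 13.3 ms/iter, as quoted for M130000007
print(f"M130000007: {eta_hours(130_000_007, 13.3):.0f} h")

# 6291456 FFT at roughly 46 ms/iter, as quoted for M100000007
print(f"M100000007: {eta_hours(100_000_007, 46.0):.0f} h")
```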

The 295x2 can run two such tests in parallel. I'd expect each to run slightly slower than my results: the HD7950 has 717 DP GFlops and a 240 GB/s memory rate, OC'd to 1100/1400 MHz ==> 985 GFlops and 268 GB/s. The R295x2 has 2x 717 GFlops and 2x 320 GB/s memory rate. I'm not sure which matters more: the lower DP throughput or the better memory bandwidth.

In an attempt to answer this last question, I separately reduced the GPU core and memory clocks by 10%. 10% lower GFlops results in 5.9% longer iteration times, whereas 10% lower bandwidth causes 5.2% longer iteration times. If both clocks are lowered by 10%, the iteration times increase by 10.1%. So it seems both GFlops and memory rate are important, but GFlops a tiny bit more so.
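
One way to read these three measurements is through a simple additive model, which is an assumption for illustration and not something measured in this thread: iteration time = a / core clock + b / memory clock, normalized so that the time is 1 at full clocks. The sketch below fits `a` and `b` from the two single-clock experiments and then checks what the model would predict when both clocks drop together. The function name `fit_additive_model` is hypothetical.

```python
def fit_additive_model(core_slowdown: float, mem_slowdown: float,
                       clock_cut: float = 0.10) -> tuple[float, float]:
    """Fit t = a/core + b/mem (clocks normalized to 1.0 at full speed).

    core_slowdown / mem_slowdown are the fractional increases in iteration
    time when only that clock is lowered by clock_cut.
    """
    # Lowering a clock by 10% stretches its term by 1/0.9 - 1, about 11.1%.
    stretch = 1.0 / (1.0 - clock_cut) - 1.0
    a = core_slowdown / stretch  # compute-bound share of the iteration time
    b = mem_slowdown / stretch   # bandwidth-bound share of the iteration time
    return a, b


a, b = fit_additive_model(0.059, 0.052)
stretch = 1.0 / 0.9 - 1.0
print(f"compute share   ~ {a:.3f}")
print(f"bandwidth share ~ {b:.3f}")
print(f"model prediction, both clocks -10%: +{(a + b) * stretch:.1%}")
```

Under this model the two shares come out close to each other, with the compute share slightly larger, which matches the conclusion above. The model predicts a somewhat larger slowdown for the combined case than the measured 10.1%, so it should be taken as a rough description only.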

2015-01-11, 09:51   #314
Lorenzo
Aug 2010
Republic of Belarus
2·89 Posts

Thank you so much for the answer!

2015-07-25, 02:05   #315
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
3²×241 Posts

Nice performance improvements with the latest clFFT library... playing around with it now

clFFT 2.0 (current binary)
Code:
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga

M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, clLucas v1.01 err = 0.1211 (0:03 real, 0.3194 ms/iter, ETA 6:36)
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.1016 (0:04 real, 0.3781 ms/iter, ETA 8:41)
M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, clLucas v1.01 err = 0.05078 (0:06 real, 0.6307 ms/iter, ETA 31:06)
M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, clLucas v1.01 err = 0.0625 (0:06 real, 0.6283 ms/iter, ETA 31:31)
M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, clLucas v1.01 err = 0.04688 (0:13 real, 1.2852 ms/iter, ETA 2:29:05)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03223 (0:29 real, 2.9072 ms/iter, ETA 10:51:42)
M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, clLucas v1.01 err = 0.09375 (0:50 real, 5.0678 ms/iter, ETA 29:32:02)
M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, clLucas v1.01 err = 0.1875 (1:04 real, 6.4269 ms/iter, ETA 42:52:54)
M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, clLucas v1.01 err = 0.02051 (1:26 real, 8.5475 ms/iter, ETA 61:36:48)
M( 30402457 )C, 0x0b8600ef47e69d27, n = 1638400, clLucas v1.01 err = 0.3125 (1:38 real, 9.8404 ms/iter, ETA 83:04:09)
M( 32582657 )C, 0x02751b7fcec76bb1, n = 1769472, clLucas v1.01 err = 0.2969 (2:29 real, 14.9789 ms/iter, ETA 135:31:01)
M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, clLucas v1.01 err = 0.1201 (0:56 real, 5.5684 ms/iter, ETA 57:26:50)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.2031 (2:34 real, 15.4499 ms/iter, ETA 182:57:10)
M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, clLucas v1.01 err = 0.2656 (2:34 real, 15.4018 ms/iter, ETA 184:23:39)
clFFT 2.6
Code:
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga

M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, clLucas v1.01 err = 0.1094 (0:03 real, 0.3001 ms/iter, ETA 6:12)
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.09375 (0:05 real, 0.5239 ms/iter, ETA 12:03)
M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, clLucas v1.01 err = 0.04883 (0:09 real, 0.8545 ms/iter, ETA 42:09)
M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, clLucas v1.01 err = 0.06641 (0:08 real, 0.8560 ms/iter, ETA 42:56)
M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, clLucas v1.01 err = 0.05139 (0:14 real, 1.4218 ms/iter, ETA 2:44:55)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03125 (0:24 real, 2.3861 ms/iter, ETA 8:54:52)
M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, clLucas v1.01 err = 0.09375 (0:36 real, 3.5629 ms/iter, ETA 20:45:50)
M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, clLucas v1.01 err = 0.2031 (0:41 real, 4.1131 ms/iter, ETA 27:26:37)
M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, clLucas v1.01 err = 0.02002 (0:42 real, 4.1595 ms/iter, ETA 29:59:00)
M( 30402457 )C, 0x0b8600ef47e69d27, n = 1638400, clLucas v1.01 err = 0.2881 (0:50 real, 5.0494 ms/iter, ETA 42:37:29)
M( 32582657 )C, 0x02751b7fcec76bb1, n = 1769472, clLucas v1.01 err = 0.3125 (0:49 real, 4.8774 ms/iter, ETA 44:07:37)
M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, clLucas v1.01 err = 0.1074 (0:47 real, 4.7492 ms/iter, ETA 48:59:46)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.209 (1:10 real, 6.9796 ms/iter, ETA 82:39:00)
M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, clLucas v1.01 err = 0.2656 (1:10 real, 6.9811 ms/iter, ETA 83:34:46)
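
To compare the two listings line by line, the ms/iter figures can be pulled out with a regular expression. The following is a small sketch, not part of clLucas; the names `parse` and `speedups` are made up, and only four lines from each listing are reproduced to keep it short. It shows that the gain is not uniform: the larger FFT lengths get much faster with clFFT 2.6, while some of the small ones (for example n = 73728) actually get slower.

```python
import re

# Matches the self-test lines above: exponent, FFT length n, and ms/iter.
LINE = re.compile(r"M\(\s*(\d+)\s*\)C,.*?n = (\d+),.*?([\d.]+) ms/iter")

OLD_LOG = """\
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.1016 (0:04 real, 0.3781 ms/iter, ETA 8:41)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03223 (0:29 real, 2.9072 ms/iter, ETA 10:51:42)
M( 32582657 )C, 0x02751b7fcec76bb1, n = 1769472, clLucas v1.01 err = 0.2969 (2:29 real, 14.9789 ms/iter, ETA 135:31:01)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.2031 (2:34 real, 15.4499 ms/iter, ETA 182:57:10)
"""

NEW_LOG = """\
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.09375 (0:05 real, 0.5239 ms/iter, ETA 12:03)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03125 (0:24 real, 2.3861 ms/iter, ETA 8:54:52)
M( 32582657 )C, 0x02751b7fcec76bb1, n = 1769472, clLucas v1.01 err = 0.3125 (0:49 real, 4.8774 ms/iter, ETA 44:07:37)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.209 (1:10 real, 6.9796 ms/iter, ETA 82:39:00)
"""


def parse(log: str) -> dict[int, tuple[int, float]]:
    """Map exponent -> (FFT length, ms/iter) for every matching log line."""
    return {int(m.group(1)): (int(m.group(2)), float(m.group(3)))
            for m in LINE.finditer(log)}


def speedups(old_log: str, new_log: str) -> dict[int, float]:
    """Old ms/iter divided by new ms/iter; values above 1.0 mean faster."""
    old, new = parse(old_log), parse(new_log)
    return {exp: old[exp][1] / new[exp][1] for exp in old if exp in new}


for exp, factor in speedups(OLD_LOG, NEW_LOG).items():
    print(f"M{exp}: {factor:.2f}x")
```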

Last fiddled with by kracker on 2015-07-25 at 02:06

2015-07-25, 07:09   #316
frmky
Jul 2003
So Cal
2·3·347 Posts

Quote:
Originally Posted by kracker
Nice performance improvements with the latest clFFT library... playing around with it now
Nice! On my R9 290X:

clFFT 2.2:
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
start M35064059 fft length = 2097152
Iteration 10000 0x005a9a8bbdfa894b, n = 2097152 err = 0.02771 (0:42 real, 4.1596 ms/iter, ETA 40:29:55)
Iteration 20000 0x085623e4553c8c01, n = 2097152 err = 0.02771 (0:41 real, 4.1491 ms/iter, ETA 40:23:05)

clFFT 2.6:
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
start M35064059 fft length = 2097152
Iteration 10000 0x005a9a8bbdfa894b, n = 2097152 err = 0.02734 (0:30 real, 2.9535 ms/iter, ETA 28:45:21)
Iteration 20000 0x085623e4553c8c01, n = 2097152 err = 0.02734 (0:29 real, 2.8963 ms/iter, ETA 28:11:27)

2015-07-25, 19:28   #317
frmky
Jul 2003
So Cal
2×3×347 Posts

With the clFFT speed improvements, perhaps it's time to make clLucas more user friendly by bringing over code from CUDALucas. Reading from worktodo.txt is essential. Supporting offsets would be nice. Benchmarking FFTs ahead of time and auto-choosing the fastest would be cool.
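
The last idea, benchmarking FFT lengths ahead of time and picking the fastest, could look roughly like the sketch below. This is not CUDALucas or clLucas code: the function `pick_fft_length` and the `max_bits_per_word` limit of 18.5 are hypothetical, chosen only for illustration, and the timings are the clFFT 2.0 figures kracker posted for the Tonga card. The point is that the smallest usable FFT length is not always the fastest one: with those timings, n = 2097152 beats both n = 1769472 and n = 2359296.

```python
def pick_fft_length(exponent: int,
                    ms_per_iter_by_length: dict[int, float],
                    max_bits_per_word: float = 18.5) -> int:
    """Return the fastest benchmarked FFT length that can hold the exponent.

    max_bits_per_word is an illustrative threshold (assumption), not a
    constant taken from clLucas: a length n is accepted if exponent / n
    does not exceed it.
    """
    usable = {n: t for n, t in ms_per_iter_by_length.items()
              if exponent / n <= max_bits_per_word}
    if not usable:
        raise ValueError("no benchmarked FFT length is large enough")
    return min(usable, key=usable.get)


# ms/iter per FFT length, taken from the clFFT 2.0 listing earlier in the thread
tonga_clfft_2_0 = {
    1572864: 8.5475,
    1638400: 9.8404,
    1769472: 14.9789,
    2097152: 5.5684,
    2359296: 15.4499,
}

print(pick_fft_length(32_582_657, tonga_clfft_2_0))
```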

2015-07-25, 23:41   #318
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
3²×241 Posts

Quote:
Originally Posted by frmky
With the clFFT speed improvements, perhaps it's time to make clLucas more user friendly by bringing over code from CUDALucas. Reading from worktodo.txt is essential. Supporting offsets would be nice. Benchmarking FFTs ahead of time and auto-choosing the fastest would be cool.
That would be really nice... but it won't be me.

Also, it seems that the faster cards get a bigger boost.

I wonder how the Fury X cards perform with their HBM...

2015-07-26, 01:39   #319
frmky
Jul 2003
So Cal
2·3·347 Posts

Quote:
Originally Posted by kracker
I wonder how the Fury X cards perform with their HBM...
I expect no faster than an R9 290X. The Fiji GPU runs DP at 1/16 SP, while the Hawaii GPU runs DP at 1/8 SP. So although the Fury X is over 50% faster than the R9 290X at SP, it will likely lag behind the R9 290X at clLucas even with the faster memory.

Edit: Let me know when you have a Windows version ready and I'll try it on my R9 280 at home. The Tahiti GPU runs DP at 1/4 SP, so it might be faster than both the R9 290X and the Fury X.
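
The DP ratios above translate into peak numbers as follows. This is a back-of-the-envelope sketch: the shader counts and clocks are nominal reference specifications filled in as assumptions (they are not given in this thread, and actual cards may clock differently), and peak SP throughput is taken as shaders x 2 FLOPs x clock. Only the DP:SP ratios come from the post.

```python
# name: (shaders, clock in GHz, DP:SP ratio)
# Shader counts and clocks are assumed nominal reference specs.
CARDS = {
    "R9 280 (Tahiti)":  (1792, 0.933, 1 / 4),
    "R9 290X (Hawaii)": (2816, 1.000, 1 / 8),
    "Fury X (Fiji)":    (4096, 1.050, 1 / 16),
}


def peak_gflops(shaders: int, clock_ghz: float, dp_ratio: float) -> tuple[float, float]:
    """Return (SP GFlops, DP GFlops) assuming one FMA (2 FLOPs) per shader per clock."""
    sp = shaders * 2 * clock_ghz
    return sp, sp * dp_ratio


for name, spec in CARDS.items():
    sp, dp = peak_gflops(*spec)
    print(f"{name}: {sp:7.0f} SP GFlops, {dp:5.0f} DP GFlops")
```

On peak DP alone this ordering puts the R9 280 first and the Fury X last, but it ignores memory bandwidth, which Bdot's clock experiments earlier in the thread suggest matters almost as much as GFlops.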

Last fiddled with by frmky on 2015-07-26 at 01:45

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.