mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   LL with OpenCL (https://www.mersenneforum.org/showthread.php?t=18297)

Prime95 2014-10-26 20:44

FFT length is way too small

Bdot 2014-10-26 20:46

[QUOTE=Prime95;386178]FFT length is way too small[/QUOTE]
You mean an FFT size of 3538944 is way too small for M64847711?

legendarymudkip 2014-10-26 21:08

I think George was talking about the 2^17 FFT length earlier.
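For readers wondering why one length is fine and the other "way too small", a rough check is the average number of residue bits each FFT element has to hold. The sketch below assumes, as a rule of thumb that is not stated in the thread, that a double-precision FFT can carry only somewhere under about 20 bits per element before round-off error ruins the result. The helper name `bits_per_word` is made up for illustration.

```python
def bits_per_word(exponent, fft_length):
    """Average number of bits of the residue mod 2^p - 1 stored per FFT element."""
    return exponent / fft_length

# Compare the 2^17 length mentioned above with the 3538944 length Bdot asked about,
# both paired (hypothetically, for the 2^17 case) with M64847711.
for n in (2**17, 3538944):
    print(f"n = {n}: {bits_per_word(64847711, n):.2f} bits per element")
```

Under that assumption, 3538944 elements leave a little over 18 bits each, while 2^17 elements would need hundreds of bits each, which no double can hold.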

Lorenzo 2015-01-08 10:04

Hello! Where can I see benchmarks for a 100M exponent (ETA, ms/iter)? I'm very interested in results for the 295X2. The card is very cheap now and I want to buy one!

And a second question: does the OpenCL LL test use only the GPU, or GPU+CPU?

Bdot 2015-01-10 19:40

It uses only the GPU (plus a negligible share of the CPU to drive the GPU). As for benchmarks, there is only what you find in this thread, just above: the R290 tests by AK76 and my HD7950 tests. 100M exponents are not a good choice for AMD cards right now because the suitable FFT length (6291456) is not a power of 2. The HD7950 (at the same 1100/1400 MHz clocks) runs that length at ~46 ms per iteration (ETA for M100000007: 1276h). The next power of 2 (8388608) runs at 13.3 ms per iteration, but it is larger than a 100M exponent needs. With the 8M FFT, the ETA for M130000007 is just 480h.
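The ETA figures follow directly from the iteration time: a Lucas-Lehmer test of M(p) takes p - 2 squarings, so the total is roughly p times ms/iter. A minimal sketch of that arithmetic, assuming the iteration time stays constant over the whole run (the function name `eta_hours` is made up for illustration):

```python
def eta_hours(exponent, ms_per_iter):
    """Hours for a full Lucas-Lehmer test: p - 2 squarings at a fixed iteration time."""
    iterations = exponent - 2
    return iterations * ms_per_iter / 3_600_000  # milliseconds -> hours

print(round(eta_hours(130_000_007, 13.3)))  # 480, matching the 8M FFT figure
print(round(eta_hours(100_000_007, 46.0)))  # 1278; the quoted 1276h implies slightly under 46 ms
```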

The 295X2 can run two such tests in parallel. I'd expect each to run slightly slower than my results: the HD7950 has 717 DP GFlops and a 240 GB/s memory rate at stock, which overclocking to 1100/1400 MHz raises to 985 GFlops and 268 GB/s. The R9 295X2 has 2x 717 GFlops and 2x 320 GB/s memory rate. I'm not sure which counts for more: the lower DP throughput or the better memory bandwidth.

To answer that last question, I separately reduced the GPU core clock and the memory clock by 10%. 10% lower GFlops results in 5.9% longer iteration times, whereas 10% lower bandwidth results in 5.2% longer iteration times. If both clocks are lowered by 10%, the iteration times increase by 10.1% :smile:. So both GFlops and memory rate matter, GFlops a tiny bit more.
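The numbers in the two paragraphs above can be reproduced with simple ratios. The sketch below makes two assumptions that are not stated in the post: the HD7950 reference clocks are taken as 800 MHz core and 1250 MHz memory, and peak throughput is assumed to scale linearly with clock. The second helper compares each measured slowdown with the slowdown expected if iteration time depended on that one clock alone. Both function names are made up for illustration.

```python
STOCK_CORE_MHZ = 800    # assumption: HD7950 reference core clock
STOCK_MEM_MHZ = 1250    # assumption: HD7950 reference memory clock

def scale_with_clock(value_at_stock, stock_mhz, new_mhz):
    """Scale a peak rate linearly with clock frequency."""
    return value_at_stock * new_mhz / stock_mhz

def share_of_ideal_slowdown(observed_pct, clock_cut_pct=10.0):
    """Observed slowdown as a fraction of the slowdown a fully clock-bound kernel would show."""
    ideal_pct = (100.0 / (100.0 - clock_cut_pct) - 1.0) * 100.0  # 10% cut -> about 11.1%
    return observed_pct / ideal_pct

oc_gflops = scale_with_clock(717, STOCK_CORE_MHZ, 1100)     # close to the quoted 985 GFlops
oc_bandwidth = scale_with_clock(240, STOCK_MEM_MHZ, 1400)   # close to the quoted 268 GB/s
print(oc_gflops, oc_bandwidth)
print(share_of_ideal_slowdown(5.9), share_of_ideal_slowdown(5.2))
```

Read this way, the core clock accounts for a bit over half of the ideal slowdown and the memory clock for a bit under half, which is consistent with the conclusion that both matter and GFlops slightly more. This is only one simple way to interpret the measurements.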

Lorenzo 2015-01-11 09:51

Thank you so much for the answer!

kracker 2015-07-25 02:05

Nice performance improvements with the latest clFFT library... playing around with it now :smile:

clFFT 2.0 (current binary)
[code]
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga

M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, clLucas v1.01 err = 0.1211 (0:03 real, 0.3194 ms/iter, ETA 6:36)
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.1016 (0:04 real, 0.3781 ms/iter, ETA 8:41)
M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, clLucas v1.01 err = 0.05078 (0:06 real, 0.6307 ms/iter, ETA 31:06)
M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, clLucas v1.01 err = 0.0625 (0:06 real, 0.6283 ms/iter, ETA 31:31)
M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, clLucas v1.01 err = 0.04688 (0:13 real, 1.2852 ms/iter, ETA 2:29:05)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03223 (0:29 real, 2.9072 ms/iter, ETA 10:51:42)
M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, clLucas v1.01 err = 0.09375 (0:50 real, 5.0678 ms/iter, ETA 29:32:02)
M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, clLucas v1.01 err = 0.1875 (1:04 real, 6.4269 ms/iter, ETA 42:52:54)
M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, clLucas v1.01 err = 0.02051 (1:26 real, 8.5475 ms/iter, ETA 61:36:48)
M( 30402457 )C, 0x0b8600ef47e69d27, n = 1638400, clLucas v1.01 err = 0.3125 (1:38 real, 9.8404 ms/iter, ETA 83:04:09)
M( 32582657 )C, 0x02751b7fcec76bb1, n = 1769472, clLucas v1.01 err = 0.2969 (2:29 real, 14.9789 ms/iter, ETA 135:31:01)
M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, clLucas v1.01 err = 0.1201 (0:56 real, 5.5684 ms/iter, ETA 57:26:50)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.2031 (2:34 real, 15.4499 ms/iter, ETA 182:57:10)
M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, clLucas v1.01 err = 0.2656 (2:34 real, 15.4018 ms/iter, ETA 184:23:39)
[/code]

clFFT 2.6
[code]
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga

M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, clLucas v1.01 err = 0.1094 (0:03 real, 0.3001 ms/iter, ETA 6:12)
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.09375 (0:05 real, 0.5239 ms/iter, ETA 12:03)
M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, clLucas v1.01 err = 0.04883 (0:09 real, 0.8545 ms/iter, ETA 42:09)
M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, clLucas v1.01 err = 0.06641 (0:08 real, 0.8560 ms/iter, ETA 42:56)
M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, clLucas v1.01 err = 0.05139 (0:14 real, 1.4218 ms/iter, ETA 2:44:55)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03125 (0:24 real, 2.3861 ms/iter, ETA 8:54:52)
M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, clLucas v1.01 err = 0.09375 (0:36 real, 3.5629 ms/iter, ETA 20:45:50)
M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, clLucas v1.01 err = 0.2031 (0:41 real, 4.1131 ms/iter, ETA 27:26:37)
M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, clLucas v1.01 err = 0.02002 (0:42 real, 4.1595 ms/iter, ETA 29:59:00)
M( 30402457 )C, 0x0b8600ef47e69d27, n = 1638400, clLucas v1.01 err = 0.2881 (0:50 real, 5.0494 ms/iter, ETA 42:37:29)
M( 32582657 )C, 0x02751b7fcec76bb1, n = 1769472, clLucas v1.01 err = 0.3125 (0:49 real, 4.8774 ms/iter, ETA 44:07:37)
M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, clLucas v1.01 err = 0.1074 (0:47 real, 4.7492 ms/iter, ETA 48:59:46)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.209 (1:10 real, 6.9796 ms/iter, ETA 82:39:00)
M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2359296, clLucas v1.01 err = 0.2656 (1:10 real, 6.9811 ms/iter, ETA 83:34:46)
[/code]
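To compare the two runs line by line, the ms/iter values can be pulled out of the self-test output and divided. The sketch below is a hypothetical helper, not part of clLucas: the regular expression is written only against the line format shown in the two listings above, and only three lines from each run are pasted in to keep it short.

```python
import re

# Written against lines of the form shown above, e.g.
# M( 6972593 )C, 0x..., n = 393216, clLucas v1.01 err = 0.04688 (0:13 real, 1.2852 ms/iter, ETA 2:29:05)
LINE_RE = re.compile(r"M\(\s*(\d+)\s*\)C,.*?n = (\d+),.*?([\d.]+) ms/iter")

def parse_timings(log_text):
    """Map exponent -> (fft_length, ms_per_iter) for every self-test line found."""
    timings = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            timings[int(m.group(1))] = (int(m.group(2)), float(m.group(3)))
    return timings

def speedups(old_log, new_log):
    """Return {exponent: old_ms / new_ms}; values above 1.0 mean the new run is faster."""
    old, new = parse_timings(old_log), parse_timings(new_log)
    return {p: old[p][1] / new[p][1] for p in old if p in new}

CLFFT_2_0 = """
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.1016 (0:04 real, 0.3781 ms/iter, ETA 8:41)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03223 (0:29 real, 2.9072 ms/iter, ETA 10:51:42)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.2031 (2:34 real, 15.4499 ms/iter, ETA 182:57:10)
"""
CLFFT_2_6 = """
M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, clLucas v1.01 err = 0.09375 (0:05 real, 0.5239 ms/iter, ETA 12:03)
M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, clLucas v1.01 err = 0.03125 (0:24 real, 2.3861 ms/iter, ETA 8:54:52)
M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, clLucas v1.01 err = 0.209 (1:10 real, 6.9796 ms/iter, ETA 82:39:00)
"""

ratios = speedups(CLFFT_2_0, CLFFT_2_6)
for exponent, ratio in sorted(ratios.items()):
    print(f"M{exponent}: {ratio:.2f}x")
```

On these listings the gain is not uniform: the n = 73728, 163840 and 393216 lines are slower with clFFT 2.6, while every length from 786432 up is faster, by more than 2x at n = 2359296.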

frmky 2015-07-25 07:09

[QUOTE=kracker;406437]Nice performance improvements with the latest clFFT library... playing around with it now :smile:[/QUOTE]
Nice! On my R9 290X:

clFFT 2.2:
[code]
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
start M35064059 fft length = 2097152
Iteration 10000 0x005a9a8bbdfa894b, n = 2097152 err = 0.02771 (0:42 real, 4.1596 ms/iter, ETA 40:29:55)
Iteration 20000 0x085623e4553c8c01, n = 2097152 err = 0.02771 (0:41 real, 4.1491 ms/iter, ETA 40:23:05)
[/code]

clFFT 2.6:
[code]
Platform :Advanced Micro Devices, Inc.
Device 0 : Hawaii
Build Options are : -D KHR_DP_EXTENSION
start M35064059 fft length = 2097152
Iteration 10000 0x005a9a8bbdfa894b, n = 2097152 err = 0.02734 (0:30 real, 2.9535 ms/iter, ETA 28:45:21)
Iteration 20000 0x085623e4553c8c01, n = 2097152 err = 0.02734 (0:29 real, 2.8963 ms/iter, ETA 28:11:27)
[/code]

frmky 2015-07-25 19:28

With the clFFT speed improvements, perhaps it's time to make clLucas more user friendly by bringing over code from CUDALucas. Reading from worktodo.txt is essential. Supporting offsets would be nice. Benchmarking FFTs ahead of time and auto-choosing the fastest would be cool.
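As a rough idea of what "benchmark ahead of time and auto-choose the fastest" could look like, here is a minimal sketch. It is not code from clLucas or CUDALucas. The timing table reuses a few ms/iter values from kracker's clFFT 2.0 run on Tonga, and the function name `fastest_fft` is made up. It assumes the minimum usable length for an exponent is already known and that timing depends only on FFT length.

```python
# ms/iter by FFT length, taken from the clFFT 2.0 listing earlier in the thread (Tonga)
BENCH_MS = {
    1572864: 8.5475,
    1638400: 9.8404,
    1769472: 14.9789,
    2097152: 5.5684,
    2359296: 15.4499,
}

def fastest_fft(min_length, bench_ms):
    """Pick the benchmarked length >= min_length with the lowest ms/iter."""
    candidates = {n: ms for n, ms in bench_ms.items() if n >= min_length}
    if not candidates:
        raise ValueError("no benchmarked FFT length is large enough")
    return min(candidates, key=candidates.get)

# An exponent that needs at least n = 1769472 is better served by the larger 2097152,
# because the power-of-2 length runs far faster in this table.
print(fastest_fft(1769472, BENCH_MS))
```

The point of selecting by measured time instead of taking the smallest sufficient length is visible in the table itself: the power-of-2 length beats both of its non-power-of-2 neighbours.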

kracker 2015-07-25 23:41

[QUOTE=frmky;406480]With the clFFT speed improvements, perhaps it's time to make clLucas more user friendly by bringing over code from CUDALucas. Reading from worktodo.txt is essential. Supporting offsets would be nice. Benchmarking FFTs ahead of time and auto-choosing the fastest would be cool.[/QUOTE]

That would be really nice... but it won't be me :sad:

Also, it seems that the faster cards get a bigger boost.

I wonder how the Fury X's perform with their [URL="http://www.tomshardware.com/news/amd-fury-x-fiji-preview,29400.html"]HBM memory[/URL]...

frmky 2015-07-26 01:39

[QUOTE=kracker;406489]I wonder how the Fury X's perform with their [URL="http://www.tomshardware.com/news/amd-fury-x-fiji-preview,29400.html"]HBM memory[/URL]...[/QUOTE]
I expect no faster than an R9 290X. The Fiji GPU runs DP at 1/16 SP, while the Hawaii GPU runs DP at 1/8 SP. So although the Fury X is over 50% faster than the R9 290X at SP, it will likely lag behind the R9 290X at clLucas even with the faster memory.

Edit: Let me know when you have a Windows version ready and I'll try it on my R9 280 at home. The Tahiti GPU runs DP at 1/4 SP, so it might be faster than both the R9 290X and the Fury X.
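The DP ratios can be turned into rough peak figures. The sketch below assumes shader counts and clocks taken from public spec sheets, not from this thread (Fury X: 4096 shaders at 1050 MHz, R9 290X: 2816 at 1000 MHz, R9 280: 1792 at up to 933 MHz), and counts 2 single-precision FLOPs per shader per clock. These are theoretical peaks only; as the clock experiments earlier in the thread show, memory bandwidth also affects clLucas iteration times.

```python
# (shaders, clock in MHz, DP divisor); shader counts and clocks are assumed spec-sheet values
CARDS = {
    "Fury X (Fiji)": (4096, 1050, 16),
    "R9 290X (Hawaii)": (2816, 1000, 8),
    "R9 280 (Tahiti)": (1792, 933, 4),
}

def sp_gflops(shaders, mhz):
    """Peak single-precision GFLOPS, counting one fused multiply-add as 2 FLOPs."""
    return shaders * 2 * mhz / 1000

def dp_gflops(shaders, mhz, dp_divisor):
    """Peak double-precision GFLOPS given the SP:DP ratio of the GPU."""
    return sp_gflops(shaders, mhz) / dp_divisor

for name, (shaders, mhz, divisor) in CARDS.items():
    print(f"{name}: SP {sp_gflops(shaders, mhz):.0f}, DP {dp_gflops(shaders, mhz, divisor):.0f} GFLOPS")
```

With these assumed specs the Fury X comes out a little over 50% ahead of the R9 290X in SP but behind it in DP, and the R9 280 has the highest DP peak of the three.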

