mersenneforum.org

LL with OpenCL (https://www.mersenneforum.org/showthread.php?t=18297)

Robish 2013-10-09 16:51

[QUOTE=LaurV;355730]Well, man, you are serious about it! :smile:

If you want to use cudaLucas, then you can do better. With a GTX Titan, even better... :w00t: Do you have one? (if yes, send it to me! :razz:)

Start P95 (or switch to it if it is already running) and use the "Advanced -> Time..." option to time your exponent: enter the exponent in the box, select 100 iterations, and click OK. Note which FFT length it uses and write it down.

Experiment with the "cudalucas -cufftbench x y z" option, where x is a bit smaller than the FFT length chosen by P95, y is about 20% higher, and z is a multiple of 1024 (see the discussions in the cudaLucas thread).

With the last command you have "tuned" the FFT size for your card. Select an FFT length that is close to the one chosen by P95 (it may be a bit smaller, or up to 10-20 percent larger) AND that is the fastest on your card. Then run a few iterations with the command you used above, but with the FFT length you found with cufftbench.
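The tuning procedure above is essentially a small search: benchmark a window of FFT lengths around the one P95 picked, then keep the fastest length that is still inside the window. A minimal Python sketch of that bookkeeping, under stated assumptions: "a bit smaller" is taken as 5% below the P95 length, z is treated as the benchmark step, the helper names `cufftbench_range` and `pick_fft` are hypothetical, and the timings are made-up placeholders rather than real cufftbench output.

```python
def cufftbench_range(p95_fft: int, step: int = 1024,
                     below_pct: int = 5, above_pct: int = 20) -> tuple:
    """Hypothetical helper: (x, y, z) arguments for 'cudalucas -cufftbench x y z'.

    x is below_pct percent under the P95 FFT length (rounded down to a
    multiple of step), y is above_pct percent over it (rounded up), z = step.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    x = p95_fft * (100 - below_pct) // 100 // step * step
    y = -(-p95_fft * (100 + above_pct) // (100 * step)) * step  # ceiling division
    return x, y, step


def pick_fft(p95_fft: int, timings_ms: dict,
             below_pct: int = 5, above_pct: int = 20) -> int:
    """Return the fastest benchmarked FFT length inside the allowed window."""
    window = {
        n: t for n, t in timings_ms.items()
        if p95_fft * (100 - below_pct) <= n * 100 <= p95_fft * (100 + above_pct)
    }
    if not window:
        raise ValueError("no benchmarked FFT length inside the window")
    return min(window, key=window.get)


# Made-up timings (ms/iter) standing in for real cufftbench output.
timings = {3735552: 9.1, 3932160: 8.7, 4194304: 7.9, 4718592: 9.6, 5242880: 7.5}
print(cufftbench_range(3932160))   # (3735552, 4718592, 1024)
print(pick_fft(3932160, timings))  # 4194304: 5242880 is faster but outside the window
```

The window check is what keeps a fast but oversized length from being chosen; whether a candidate inside the window is actually usable still depends on the round-off error clLucas/CUDALucas reports for it.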

Try it both with -t and without, and with "-polite 0" and without. (Edit: use -k; then you can toggle "polite" interactively from the keyboard by pressing "p" and Enter, and see the speed difference. Watch the GPU occupancy meanwhile: it should drop a bit while polite is active.)

My estimate (from a rough mental calculation and from the error your test shows) is that an FFT length close to 18M would be much faster and would keep the error under 0.21 (I may be wrong). 2097152*3^2 (= 18874368) is better than 2097152*2*5 for splitting the "butterflies" of the FFT multiplication, so if the error reported for the ~18M8 FFT size is not too large, it would be much faster on your card.
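For reference, 2097152 = 2^21, so the two lengths being compared are 18874368 (the "~18M8" size) and 20971520. FFT libraries build transforms out of small radices, so it helps to see each length as a product of small primes. A small Python sketch that only does the factoring (the helper name `small_prime_factors` is hypothetical); the claim that the 3^2 variant runs faster is LaurV's observation, not something this code measures:

```python
def small_prime_factors(n: int) -> dict:
    """Factor an FFT length over 2, 3, 5 and 7; reject anything else."""
    exponents = {}
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
            exponents[p] = exponents.get(p, 0) + 1
    if n != 1:
        raise ValueError(f"leftover factor {n}: not built from 2, 3, 5, 7")
    return exponents


for length in (2097152 * 3**2, 2097152 * 2 * 5):
    print(length, small_prime_factors(length))
# 18874368 {2: 21, 3: 2}
# 20971520 {2: 22, 5: 1}
```

Besides the different radix mix, the 3^2 length is also about 10% shorter, which by itself reduces the work per iteration.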

If my estimate is right, you may finish a few weeks sooner than your current ETA.

Keep us informed.

Edit: if you can save the residues along the way, even better, in case someone wants to double-check (DC) later.[/QUOTE]


Thanks again, LaurV.

This is exactly what I needed. It'll keep me busy for a while, but I'll report back with the results. :bow:

Second clLucas result:
M( 63083221 )C, 0x3c3b3e3282ec79__, n = 4194304, clLucas v1.01

kracker 2013-10-11 00:09

Makefile for MinGW (Windows):
[code]
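# Relative path to the tree containing make/openclsdkdefs.mk
# (the AMD APP SDK samples build system)
DEPTH = ../../../../..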

include $(DEPTH)/make/openclsdkdefs.mk

# Root path of clFFT
CLFFT = /opt/clFFT-2.0

####
#
# Targets
#
####

OPENCL = 1
SAMPLE_EXE = 1
EXE_TARGET = clLucas
EXE_TARGET_INSTALL = clLucas

####
#
# C/CPP files
#
####

FILES = clLucas
CLFILES = Kernels.cl

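# Optimize, disable the conversion-null, write-strings and pointer-arith
# warnings, and add the AMD APP SDK and clFFT header directories
CFLAGS += -O3 -Wno-conversion-null -Wno-write-strings -Wno-pointer-arith -I . -I /opt/AMDAPP/include/ -I $(CLFFT)/include -I $(CLFFT)/src/include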

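# SDKUtil comes from the AMD APP SDK; libclFFT.dll.a is the MinGW
# import library for the clFFT DLL
LLIBS += SDKUtil
LDFLAGS += -O3 $(CLFFT)/library/libclFFT.dll.a -lOpenCL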

include $(DEPTH)/make/openclsdkrules.mk

[/code]

kracker 2013-10-13 17:11

First first-time LL test verified :smile:
[URL="http://mersenne.org/report_exponent/?exp_lo=58191149"]M58191149[/URL]

LaurV 2013-10-14 05:56

BTW, a supermod could mask the residue two posts above... Just in case...

@Robish: please do not post full residues of unverified LL tests. Some "credit hunters" may be tempted to "verify" them by reusing your residue, and if that residue is wrong, 20 years later we may find out that we missed a prime.

kracker 2013-10-15 14:47

Just finished my [I]ninth[/I] DC (38M) with no mismatches: all residues matched, even with one card's memory overclocked all the way to the max.

[URL="http://mersenne.org/report_exponent/?exp_lo=38198291"]38198291[/URL] Looks oddly familiar.

Also, after upgrading the driver from 13.9 to 13.11, my timing went from 13.4 ms to 12.0 ms.
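To put the driver speed-up into wall-clock terms: an LL test of 2^p - 1 performs p - 2 modular squarings, so total time is roughly (p - 2) times the per-iteration time. A minimal Python sketch, assuming the 13.4 ms and 12.0 ms figures are per-iteration times and using the 38198291 exponent linked above as an example (the post does not say the timings were taken on that exponent; `ll_eta_hours` is a hypothetical helper):

```python
def ll_eta_hours(p: int, ms_per_iter: float) -> float:
    """Estimated wall-clock hours for a Lucas-Lehmer test of 2^p - 1.

    The LL test performs p - 2 modular squarings; this assumes every
    iteration takes the same time (no checkpoint overhead, no slowdowns).
    """
    iterations = p - 2
    return iterations * ms_per_iter / 1000 / 3600


p = 38198291  # the 38M double-check exponent linked above, used as an example
for driver, ms in (("13.9", 13.4), ("13.11", 12.0)):
    print(f"driver {driver}: {ms} ms/iter -> {ll_eta_hours(p, ms):.1f} h")
```

Under those assumptions this comes to roughly 142 h versus 127 h, i.e. about 15 hours saved per 38M DC.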

Manpowre 2013-10-16 10:02

[QUOTE=kracker;356295]Just finished my [I]ninth[/I] DC (38M) with no mismatches: all residues matched, even with one card's memory overclocked all the way to the max.

[URL="http://mersenne.org/report_exponent/?exp_lo=38198291"]38198291[/URL] Looks oddly familiar.

Also, after upgrading the driver from 13.9 to 13.11, my timing went from 13.4 ms to 12.0 ms.[/QUOTE]

Which board are you using?

I am considering getting an R9 290X, which has almost the same number of compute cores and transistors as the GK110 on a Titan. I'm wondering whether we would see similar iteration times. What do you think, kracker?

Karl M Johnson 2013-10-16 11:00

[QUOTE=Manpowre;356387]Which board are you using?

I am considering getting an R9 290X, which has almost the same number of compute cores and transistors as the GK110 on a Titan. I'm wondering whether we would see similar iteration times. What do you think, kracker?[/QUOTE]
It can be very roughly estimated from the 7970's timings, since we know the number of shaders and their clocks for both cards.
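The rough approximation above amounts to assuming that iteration time scales inversely with (shader count x clock). A minimal Python sketch of that naive scaling, assuming reference specs of 2048 shaders at 925 MHz for the HD 7970 and 2816 shaders at up to 1000 MHz for the R9 290X, and using a hypothetical 8.0 ms/iter baseline rather than a real measurement (`scale_iteration_time` is a hypothetical helper):

```python
def scale_iteration_time(t_old_ms: float, shaders_old: int, clock_old_mhz: float,
                         shaders_new: int, clock_new_mhz: float) -> float:
    """Naive estimate: iteration time scales inversely with shaders x clock."""
    return t_old_ms * (shaders_old * clock_old_mhz) / (shaders_new * clock_new_mhz)


# Reference specs (assumed): HD 7970 = 2048 shaders @ 925 MHz,
# R9 290X = 2816 shaders @ up to 1000 MHz.
t_7970 = 8.0  # hypothetical ms/iter on a 7970, not a real benchmark
estimate = scale_iteration_time(t_7970, 2048, 925, 2816, 1000)
print(f"R9 290X estimate: {estimate:.2f} ms/iter")
```

This gives about 5.38 ms/iter for the hypothetical baseline, i.e. roughly a 33% reduction. It deliberately ignores memory bandwidth and any architectural differences, which can easily dominate for FFT-heavy LL work.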

kracker 2013-10-16 15:23

[QUOTE=Karl M Johnson;356390]It can be very roughly estimated from the 7970's timings, since we know the number of shaders and their clocks for both cards.[/QUOTE]

Hate to disagree, but I don't think so. We do know the number of stream processors and the clocks, but the R9 290(X) is a new architecture, and that can (and will) change things. I have my own ideas about its performance, but...

chris2be8 2013-10-16 16:02

Is there a range around 76M where the next larger power-of-2 FFT length would be optimal for first-time checks?

Chris

LaurV 2013-10-16 16:35

Yes, up to a maximum of 77M, where the error stays at 0.25 (so it is a bit risky: if the error increases midway, a hundred hours are gone).

It is safe to use from about 68M to about 76M5 (76.5M), for which a Tahiti gets an ETA of 155-168 hours with the CPU idle and 230-260 hours with the CPU busy (P95 running on all cores). With these numbers, the question is whether it would be better to let the Tahiti do TF or coin mining.
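The reason the usable range tops out near 77M is the load per FFT word: the p-bit residue is spread over the n words of the transform, so each word carries p/n bits on average, and more bits per word means larger round-off error. A small Python sketch for the power-of-two length 4194304 (2^22) across the range quoted above; `bits_per_word` is a hypothetical helper, and the 0.25 error limit itself comes from the program's own round-off checking, not from this arithmetic:

```python
FFT_LEN = 4194304  # 2^22, the power-of-two FFT length discussed in this thread


def bits_per_word(p: int, n: int) -> float:
    """Average number of bits of the p-bit residue held in each of n FFT words."""
    return p / n


for p in (68_000_000, 76_500_000, 77_000_000):
    print(f"p = {p:,}: {bits_per_word(p, FFT_LEN):.2f} bits/word")
```

This prints about 16.21, 18.24 and 18.36 bits/word, so the "safe" and "risky" ends of the range differ by roughly two bits per word.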

The 2^x length (4194304 in our case) is indeed much faster. For comparison, the preceding FFT length (3932160) takes twice as long (330 hours at a 70M exponent), and the next length after the power of two (4718592) needs even more time (650+ hours at a 78M exponent). Interestingly, the non-power-of-two sizes are not affected by whether P95 is running or not (!?!?!?). (For the record, I use driver 13.1; all the later ones are either slower or have other problems.)
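Since the quoted ETAs are for different exponents, converting them to time per iteration makes the FFT lengths easier to compare directly. A minimal Python sketch that inverts ETA = (p - 2) x time per iteration, assuming round exponents (70M = 70,000,000 and so on) and assuming the 155-hour end of the 155-168 hour range belongs to the ~68M exponent; `ms_per_iteration` is a hypothetical helper:

```python
def ms_per_iteration(p: int, hours: float) -> float:
    """Invert an ETA: average milliseconds per LL iteration (p - 2 iterations)."""
    return hours * 3600 * 1000 / (p - 2)


# (FFT length, exponent, ETA hours) as quoted above, with round exponents.
quoted = [
    (3932160, 70_000_000, 330),
    (4194304, 68_000_000, 155),  # assumed pairing of 155 h with ~68M
    (4718592, 78_000_000, 650),
]
for n, p, hours in quoted:
    print(f"FFT {n}: ~{ms_per_iteration(p, hours):.1f} ms/iter at p = {p:,}")
```

Under these assumptions the per-iteration times come out near 17.0, 8.2 and 30.0 ms, so the smaller 3932160 length really is about twice as slow per iteration as 4194304.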

kracker 2013-10-16 17:52

[QUOTE=LaurV;356430]Yes, up to a maximum of 77M, where the error stays at 0.25 (so it is a bit risky: if the error increases midway, a hundred hours are gone).[/QUOTE]

On my 38M DC, the error started at 0.13 and rose to 0.2, and the result was still fine.

Frankly, I'm waiting for a mismatch on my heavily overclocked card, but none yet.


All times are UTC.