LL with OpenCL (https://www.mersenneforum.org/showthread.php?t=18297)

kracker 2014-02-11 19:35

[QUOTE=Shirik;366675]I'm trying this out, and keep getting the following error:

[code]
X:\cllucas>clLucas_x64.exe -f 41943040 332233123
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M332233123 fft length = 41943040
OPENCL_V_THROWERROR< CLFFT_NOTIMPLEMENTED > (772): Failed to clfftBakePlan.
terminate called after throwing an instance of 'std::runtime_error'
what(): OPENCL_V_THROWERROR< CLFFT_NOTIMPLEMENTED > (772): Failed to clfftBakePlan.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
[/code]

Running on a 7970.

The problem goes away if I don't use the -f argument, but based on this thread I was under the impression that I have to use it, or it will either be slow or wrong. (If I can get away without the -f argument, then I have nothing to worry about.)

I'm running a double-check on 20000003 now without the -f argument just to see what the result is, but I was wondering if anyone knows about this.[/QUOTE]

Well, you need (-f) I guess. How much memory do you have on your 7970?

For double checks, -f 2097152 is the fastest, although you have to start from the beginning again.

Shirik 2014-02-12 02:20

[QUOTE=kracker;366676]Well, you need (-f) I guess. How much memory do you have on your 7970?

For double checks, -f 2097152 is the fastest, although you have to start from the beginning again.[/QUOTE]
Wow, so I have no idea how I screwed this up but apparently there was an extra 0 in there. I was trying to double 2097152 and ended up adding a 0 at the end (which is simply bogus, considering it's not a power of 2).

However I have a separate problem now.

Again, without the -f argument everything's fine:
[code]
X:\cllucas>clLucas_x64.exe 20000003
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M20000003 fft length = 1048576
Iteration 10000 0xe1d19e8defcac129, n = 1048576 err = 0.3125 (0:28 real, 2.7999 ms/iter, ETA 15:32:49)
Iteration 20000 0x0a9bb6bde838dd3c, n = 1048576 err = 0.3125 (0:27 real, 2.7667 ms/iter, ETA 15:21:18)
Iteration 30000 0x9e2eb9c2edcf5d6b, n = 1048576 err = 0.3125 (0:28 real, 2.7688 ms/iter, ETA 15:21:32)
[/code]
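As a sanity check on the log above, the ETA column appears to be just the remaining iterations multiplied by the reported ms/iter. Here is a minimal Python sketch of that arithmetic. It assumes an LL test of M(p) needs p - 2 iterations; the function names are made up for illustration and are not part of clLucas.

```python
def eta_seconds(exponent: int, iterations_done: int, ms_per_iter: float) -> float:
    """Estimated seconds left, assuming p - 2 LL iterations in total."""
    remaining = (exponent - 2) - iterations_done
    return remaining * ms_per_iter / 1000.0


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


# Values taken from the first three log lines for M20000003.
for done, ms in ((10000, 2.7999), (20000, 2.7667), (30000, 2.7688)):
    print(done, format_hms(eta_seconds(20000003, done, ms)))
```

The results land within a second or so of the ETAs in the log, so the ms/iter figure is enough to estimate a whole run after the first checkpoint.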

Or if I specify -f 1048576 (which it seems to be choosing by default) it's fine:
[code]
X:\cllucas>clLucas_x64.exe -f 1048576 20000003
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M20000003 fft length = 1048576
Iteration 10000 0xe1d19e8defcac129, n = 1048576 err = 0.3125 (0:28 real, 2.8004 ms/iter, ETA 15:33:00)
[/code]

But if I try to specify a different value:
[code]
X:\cllucas>clLucas_x64.exe -f 2097152 20000003
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M20000003 fft length = 2097152
err = 4.5036e+015,fft length = 2097152 exiting.
Warning: Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).

X:\cllucas>clLucas_x64.exe -f 4194304 20000003
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M20000003 fft length = 4194304
err = 4.5036e+015,fft length = 4194304 exiting.
Warning: Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).
[/code]

If I choose a larger exponent, though, it's OK. Does this indicate the FFT length was simply too large? I don't entirely understand the math here; how can I find a good FFT length given an exponent? Is trial-and-error really the right way to go?

[code]
X:\cllucas>clLucas_x64.exe -f 4194304 60000229
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M60000229 fft length = 4194304
Iteration 10000 0x0db207ceeff383bc, n = 4194304 err = 0.0009766 (1:17 real, 7.6787 ms/iter, ETA 127:57:22)
[/code]

To directly answer your earlier question, my GPU has 3GB of memory.

Thanks for the pointers,

axn 2014-02-12 04:05

You might have had an "extra" 0, but you also had an extra digit in the exponent. So it evens out :smile: For 332,233,123 you need f = 2^25 (33554432). f = 2^24 (16777216) will be too small.

As to the problem with 20000003, it could be caused by resuming from a save file written with a different FFT length. Try deleting the save file and starting afresh with the larger FFT, and see if it still causes an error.

Shirik 2014-02-12 04:15

[QUOTE=axn;366707]You might have had an "extra" 0, but you also had an extra digit in the exponent. So it evens out :smile: For 332,233,123 you need f = 2^25 (33554432). f = 2^24 (16777216) will be too small.[/QUOTE]
Wow, I really wasn't paying attention :smile: But how did you come up with the required FFT size?

[QUOTE=axn;366707]
As to the problem with 20000003, it could be caused by resuming from a save file written with a different FFT length. Try deleting the save file and starting afresh with the larger FFT, and see if it still causes an error.[/QUOTE]
Thanks, but I had already made sure of this. In the process I learned that even if you specify an FFT size, when a save file exists clLucas seems to keep using the FFT size recorded in the save file. (Or at least it reports that FFT size.)

kracker 2014-02-12 04:24

[quote]
[code]
X:\cllucas>clLucas_x64.exe -f 2097152 20000003
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M20000003 fft length = 2097152
err = 4.5036e+015,fft length = 2097152 exiting.
Warning: Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).

X:\cllucas>clLucas_x64.exe -f 4194304 20000003
Platform :Advanced Micro Devices, Inc.
Device 0 : Tahiti

Build Options are : -D KHR_DP_EXTENSION

start M20000003 fft length = 4194304
err = 4.5036e+015,fft length = 4194304 exiting.
Warning: Program terminating, but clFFT resources not freed.
Please consider explicitly calling clfftTeardown( ).
[/code]
[/quote]

For M20000003, you need a smaller FFT or a bigger exponent. Try -f 1048576 [strike]or no -f at all[/strike] and see what it does.

EDIT: Assuming GIMPS LL and DC (non-332M):

Double checks: -f 2097152
First-time checks: -f 4194304

axn 2014-02-12 05:00

[QUOTE=Shirik;366709]But how did you come up with the required FFT size?[/QUOTE]

You can always plug the number into Prime95 and see what size it uses, then pick the next higher power-of-2. In this case, Prime95 uses 18M (9*2^21) FFT. I use a rule of thumb (exponent/18) to get a ballpark figure.
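The rule of thumb above can be sketched in a few lines of Python. This is only an illustration of the "exponent/18, then round up to a power of 2" idea from this post; the function names are hypothetical, and clLucas and Prime95 use their own selection logic.

```python
def is_power_of_two(n: int) -> bool:
    """True if n is an exact power of two."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def ballpark_fft_length(exponent: int, divisor: int = 18) -> int:
    """Rule-of-thumb FFT length: ceil(exponent / 18), rounded up to a power of two."""
    estimate = -(-exponent // divisor)  # ceiling division
    return next_power_of_two(estimate)


# Exponents mentioned in this thread.
for p in (32603359, 60000229, 332233123):
    print(p, ballpark_fft_length(p))

print(is_power_of_two(41943040))  # the mistyped -f value
print(is_power_of_two(4194304))   # the value that was intended
```

For 332233123 this gives 33554432 (2^25), and for 60000229 it gives 4194304, matching the sizes used above. It is only a ballpark, though: for M20000003 it suggests 2097152, while in this thread clLucas picked 1048576 on its own and exited with an error at 2097152.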

Shirik 2014-02-12 05:49

[QUOTE=axn;366717]You can always plug the number into Prime95 and see what size it uses, then pick the next higher power-of-2. In this case, Prime95 uses 18M (9*2^21) FFT. I use a rule of thumb (exponent/18) to get a ballpark figure.[/QUOTE]
Awesome, thanks.

And thanks for the help everyone. I've been running some lower-order double-checks (both primes and non-primes) and so far it looks like everything's working well so long as I get the FFT size right, so I picked up a double-check assignment and it looks like it will finish in just under 2 days, so here's hoping everything checks out :smile:

LaurV 2014-02-12 06:48

Just to add: somewhere on this forum there is a discussion about the "optimal exponents" for clLucas. Because the OpenCL FFT is much faster for powers of two, you should pick the largest exponent that a given power-of-two FFT size can handle. That way you do more work in about the same time (and get more credit, help GIMPS more, etc.).

For example (a made-up example for some particular nVidia GPU; the card is not important, but the proportions between the numbers are right):

cudaLucas (which is very efficient at selecting the FFT and is also fast with non-power-of-2 sizes) may take [B]25 hours[/B] to LL one exponent in the 30M range with the optimal FFT size, which is not a power of 2. It will take [B]30 hours[/B] to do the same exponent with the next power-of-2 FFT size, because the FFT is longer and so each iteration takes longer. The CUDA FFT library (cuFFT) is quite well optimized, and it does not care much whether the size is a power of 2 or not. So, if you do a 37M exponent (for which, luckily, the default optimal FFT size is also a power of two, the same one used for the 30-hour test above), it will take about [B]37 hours[/B], because the time per iteration is the same as in the 30-hour case, but you have to do 37M iterations instead of only 30M.

The same calculation for clLucas looks different. Because the OpenCL FFT library is not optimized for non-power-of-2 sizes (some are not even possible), you may get something like [B]35 hours[/B] of work for a 30M exponent with the shortest usable FFT, but only [B]33 hours[/B] if you select the next power-of-2 size. So you are "forced" to use power-of-2 sizes, because they are faster. Going to 37M exponents (where the default optimal FFT size is, luckily, a power of 2 too), the calculation is the same: each iteration takes the same amount of time, but there are more iterations, so ~33*37/30, which is about [B]41 hours[/B]. So, as long as you can LL one 37M exponent in 41 hours, it makes no sense to spend about 80% of that time on a 30M exponent and get much less credit (and help GIMPS less). Using the larger FFT for a lower exponent is kind of "wasting time": other GPUs (or even P95) can LL the lower exponent in half the time you need for the bigger one.

So, there is an "optimum" point for LL or DC on AMD GPUs.

For DC (and depending on your card) that point is close to 37.5M exponents. There, your HD 7970 (if you have one) is as efficient as a GTX 580. At 30M exponents, the HD card is only about 60-70% as efficient as the same GTX 580, because you must use a larger FFT than required (the required one is even slower).

The GPU72 site used to offer a bunch of 37M exponents for DC for exactly this reason: on AMD cards they are more efficient to LL than other ranges.

Look at it like this: some other card needs almost double the time to LL a 37M exponent compared with a 28M one. The HD card will need only ~40% more time to LL a 37M compared with a 28M.

(This is not because the card is faster, but because the 28M is slower on this card. But who cares about the reason :razz:, the fact is that it is better to do 37M on it.)

Your absolute times (in bold above) may vary with your GPU, but the ratios are correct. You can test and see for yourself (highly recommended before you commit to such long jobs!).
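The scaling argument in this post boils down to one proportion: at a fixed FFT length the time per iteration is constant, so total runtime grows roughly in step with the exponent. A minimal Python sketch follows, using the illustrative hour figures from this post (they are not measurements) and a hypothetical helper name.

```python
def scaled_hours(known_hours: float, known_exponent: int, new_exponent: int) -> float:
    """Estimate runtime at the SAME FFT length for a different exponent.

    Assumes identical time per iteration, so runtime is proportional to the
    iteration count, which is roughly the exponent itself.
    """
    return known_hours * new_exponent / known_exponent


# clLucas example: 33 hours for a 30M exponent at the power-of-2 FFT size.
print(f"{scaled_hours(33, 30_000_000, 37_000_000):.1f}")  # prints 40.7, i.e. about 41 hours

# cudaLucas example: 30 hours for a 30M exponent at the same power-of-2 size.
print(f"{scaled_hours(30, 30_000_000, 37_000_000):.1f}")  # prints 37.0
```

Timing one short run on your own card and feeding the real numbers into the same proportion gives a quick estimate before committing to a multi-day job.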

kracker 2014-02-12 15:16

:smile: :goodposting:

Shirik 2014-02-14 18:08

Finished off my [url=http://mersenne.org/report_exponent/default.php?exp_lo=32603359&exp_hi=32603359&full=1]double-check[/url] yesterday so it looks like the math/settings are working right. I wanted something a little easier (while useful) just so I'm not wasting more time if something's wrong.

LaurV makes a good point and so I think I'm going to try to request exponents in a higher range relative to a given FFT. It just makes sense.

sanaris 2014-03-10 20:43

Guys, why is OpenCL performance low?
The usual bandwidth problems?
clFFT problems?
What does the OOURA code do? Does it only check rounding errors?


All times are UTC.