mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   LL with OpenCL (https://www.mersenneforum.org/showthread.php?t=18297)

chalsall 2015-10-29 19:09

[QUOTE=axn;414169]If the project's goal is to find a prime as soon as possible, then 1 LL test is what you should optimize for. You don't need doublecheck to find a prime.[/QUOTE]

Just to be pedantic... Your second statement is (almost certainly) true, but your first statement is conditional on your personal desires...

If you want to _help_ GIMPS find the next MP then you should either do LL tests or TF'ing in the LL range, based on the economic cross-over points of the hardware you have available to bring to bear.

If *you* want to find the next MP, then you should do LL'ing regardless of the relative efficiencies.

airsquirrels 2015-10-29 19:24

If you are truly operating only for yourself and ignoring the project, you should TF your own exponents to ideal levels and then immediately LL them.

Or if you're like me, on your handful of 100M tests you run concurrent very-high-level TF work because you'd rather find out you aren't going to get a prime a month or two in than 12 months in, even if it doesn't make mathematical sense.

I'm curious to see if the multi-GPU clLucas/clFFT patch I'm working on is very effective in speeding up total throughput. I haven't tested it, but now that the 24 bit FFT limit is removed we should be able to do 100M tests with clLucas

chalsall 2015-10-29 22:20

[QUOTE=airsquirrels;414225]If you are truly operating only for yourself and ignoring the project, you should TF your own exponents to ideal levels and then immediately LL them.[/QUOTE]

Good (and correct) point.

[QUOTE=airsquirrels;414225]I'm curious to see if the multi-GPU clLucas/clFFT patch I'm working on is very effective in speeding up total throughput. I haven't tested it, but now that the 24 bit FFT limit is removed we should be able to do 100M tests with clLucas[/QUOTE]

I'm sure many are interested in hearing how that goes.... :smile:

axn 2015-10-30 03:05

[QUOTE=chalsall;414219]Just to be pedantic... Your second statement is (almost certainly) true, but your first statement is conditional on your personal desires...[/quote]
No, Chris.

[QUOTE=chalsall;414219]If you want to _help_ GIMPS find the next MP then you should either do LL tests or TF'ing in the LL range, based on the economic cross-over points of the hardware you have available to bring to bear.[/QUOTE]
If you want GIMPS to find the next MP, all resources should be diverted to first-time LL -- you should not even think about DC.

If you want to be sure that GIMPS has found all the MPs (in a given range), then, and only then, you should factor in how to optimally move both LL and DC wavefront.

You need to understand the subtle difference between the two goals. These have nothing to do with individual motivations. Both of these are potential goals for the project. But which one is more relevant is for the project to decide.

But what I said in the previous post stands. "IF the project's goal is to find a prime as soon as possible ... "

EDIT:- Just to clarify. The principle of checking the economic cross-over point of an individual piece of hardware is correct. But whether it should be based on 1 LL or 2 LL saved is the point of contention (I am saying 1 LL, *IF* goal is to speed up the finding of next MP)

LaurV 2015-10-30 03:36

[QUOTE=axn;414264]
If you want GIMPS to find the next MP, all resources should be diverted to first-time LL -- you should not even think about DC.
[/QUOTE]
Yes. That is totally right; it is elementary probability calculus. Say there is a 5% chance that an LL test is bad (a deliberately pessimistic maximum chosen for this example; the real rate is about 3%-4%). To find the "possibly missed prime" by DC you then have to DC 20 exponents, which in the end gives you the same gain toward the first goal as doing a single LL test: the gain is that [U]one[/U] exponent was cleared. Of course, 20 DCs take about 5 times as long as one LL, so computers doing DC are 5 times slower toward the goal "find a prime". But doing DC contributes the remaining 80% to the second goal, "don't miss a prime", to which LL contributes only 20%. Well... roughly...
(the last conclusion is forced, yeah, I know, but you get the idea).

As an LL test takes about 4 times longer than a DC (double the number of iterations, double the FFT length for each iteration), we would "break even" doing DC if the error rate of first LL tests were somewhere around 25%. Then one LL in 4 would be wrong, and we could do 4 DCs in the same time to find that wrong LL. But we don't have such a high error rate, thank god! :razz:
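
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the 5% error rate and the 4:1 LL-to-DC cost ratio are the example figures from this post, not measured values.

```python
def dcs_per_bad_ll(error_rate):
    """Expected number of double checks needed to hit one bad first-time LL."""
    return 1.0 / error_rate


def dc_effort_in_ll_units(error_rate, ll_to_dc_cost=4.0):
    """Time spent on those double checks, measured in first-time LL tests."""
    return dcs_per_bad_ll(error_rate) / ll_to_dc_cost


def break_even_error_rate(ll_to_dc_cost=4.0):
    """Error rate at which that DC effort costs the same as one first-time LL."""
    return 1.0 / ll_to_dc_cost


if __name__ == "__main__":
    rate = 0.05  # example figure from the post (assumed, pessimistic)
    print("DCs per bad LL:", dcs_per_bad_ll(rate))
    print("DC effort in LL units:", dc_effort_in_ll_units(rate))
    print("break-even error rate:", break_even_error_rate())
```

With the assumed 5% rate the DC route costs about five first-time LL tests per cleared exponent, and the two routes only cost the same once the error rate reaches one over the cost ratio.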

In the end, everyone does what he likes. I don't like loose ends.
(and probably other people don't like them either, which is why some of us concentrated on the "rip DCTF" subproject)

chalsall 2015-10-30 16:14

[QUOTE=axn;414264]The principle of checking the economic cross-over point of an individual piece of hardware is correct. But whether it should be based on 1 LL or 2 LL saved is the point of contention (I am saying 1 LL, *IF* goal is to speed up the finding of next MP)[/QUOTE]

OK, I don't disagree with that argument. But, thus, there is still value in TF'ing before LL'ing for most current candidates, just one bit lower, iff one accepts the argument that finding the MP is the only goal; there are still more candidates TF'ed to below 74 (91,535) than are at 74 (86,568) (below 80M).
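
A rough sketch of why the cross-over moves by about one bit, assuming the usual GIMPS rule of thumb that the chance of a factor between 2^b and 2^(b+1) is about 1/b, and that each extra bit level costs twice as much TF time as the previous one. The function name and all hour figures below are invented for illustration; they are not measurements from any real card.

```python
def optimal_tf_level(start_bit, tf_hours_first_step, ll_hours, tests_saved):
    """Return the bit level at which further trial factoring stops paying off.

    start_bit            bit level the exponent is already factored to
    tf_hours_first_step  hypothetical cost of going from start_bit to start_bit + 1
    ll_hours             hypothetical cost of one LL test on the same hardware
    tests_saved          1 if only the first-time LL counts, 2 if the DC counts too
    """
    bit = start_bit
    step_cost = tf_hours_first_step
    # Keep factoring while the expected saving (chance of a factor times the
    # LL time it would save) exceeds the cost of the next bit level.
    while (tests_saved * ll_hours) / bit > step_cost:
        bit += 1
        step_cost *= 2  # each bit level doubles the TF work
    return bit


if __name__ == "__main__":
    for saved in (1, 2):
        level = optimal_tf_level(70, 1.0, 400.0, saved)
        print(f"tests saved = {saved}: TF to {level} bits")
```

Because the cost doubles per bit while the saving only doubles when going from 1 to 2 tests saved, the two answers differ by a single bit level for these example numbers.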

msft 2015-11-30 17:59

1 Attachment(s)
Hi,
I am bringing in code from cudalucas-code-37-trunk.
[CODE]
$ ./clLucas -h

Platform 0 : Advanced Micro Devices, Inc.
$ CUDALucas -h|-v

$ CUDALucas [-d device_number] [-info] [-i inifile] [-threads 32|64|128|256] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-polite iteration] [-k] exponent|input_filename

$ CUDALucas [-d device_number] [-info] [-i inifile] [-threads 32|64|128|256] [-polite iteration] -r

$ CUDALucas [-d device_number] [-info] -cufftbench start end distance

-h print this help message
-v print version number
-info print device information
-i set .ini file name (default = "CUDALucas.ini")
-threads set threads number (default = 256)
-f set fft length (if round off error then exit)
-s save all checkpoint files
-t check round off error all iterations
-polite GPU is polite every n iterations (default -polite 1) (-polite 0 = GPU aggressive)
-cufftbench exec CUFFT benchmark (Ex. $ ./CUDALucas -d 1 -cufftbench 1179648 6291456 32768 )
-r exec residue test.
-k enable keys (p change -polite, t disable -t, s change -s)

$ ./clLucas -cufftbench 524288 4194304 524288

Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Capeverde

Build Options are : -D KHR_DP_EXTENSION
CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size= 524288 time= 1.524630 msec
CUFFT_Z2Z size= 1048576 time= 2.947390 msec
CUFFT_Z2Z size= 1572864 time= 4.713490 msec
CUFFT_Z2Z size= 2097152 time= 5.878710 msec
CUFFT_Z2Z size= 2621440 time= 10.299940 msec
CUFFT_Z2Z size= 3145728 time= 9.566070 msec
CUFFT_Z2Z size= 3670016 time= 11.889020 msec
CUFFT_Z2Z size= 4194304 time= 11.951850 msec
[/CODE]
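
Note that the timings above are not monotonic in FFT size: 2621440 is slower than 3145728 on this card. A minimal sketch of how one might use such a table to pick a length, assuming any length at or above the minimum needed for the exponent is acceptable. The table is copied from the Capeverde output above; the function name is hypothetical and not part of clLucas.

```python
# FFT size -> msec per transform, copied from the Capeverde run above.
CAPEVERDE_MS = {
    524288: 1.524630,
    1048576: 2.947390,
    1572864: 4.713490,
    2097152: 5.878710,
    2621440: 10.299940,
    3145728: 9.566070,
    3670016: 11.889020,
    4194304: 11.951850,
}


def fastest_length(bench, min_size):
    """Return the benchmarked FFT size >= min_size with the lowest time."""
    candidates = {size: ms for size, ms in bench.items() if size >= min_size}
    if not candidates:
        raise ValueError("no benchmarked size is large enough")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical requirement: at least 2,400,000 points.
    print(fastest_length(CAPEVERDE_MS, 2_400_000))
```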

kracker 2015-12-01 00:14

[QUOTE=msft;417783]Hi,
I am bringing in code from cudalucas-code-37-trunk.
[/QUOTE]

Nice, thank you!! :bow::bow:

One thing I've noticed though, is that the results from -cufftbench and actual "work" differ. Right now for the 2048K FFT I'm running at 4.5 ms/iter.

[code]

Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga

Build Options are : -D KHR_DP_EXTENSION
CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size= 524288 time= 0.679310 msec
CUFFT_Z2Z size= 1048576 time= 1.078500 msec
CUFFT_Z2Z size= 1572864 time= 1.698430 msec
CUFFT_Z2Z size= 2097152 time= 1.913530 msec
CUFFT_Z2Z size= 2621440 time= 3.322380 msec
CUFFT_Z2Z size= 3145728 time= 3.213230 msec
CUFFT_Z2Z size= 3670016 time= 3.945440 msec
CUFFT_Z2Z size= 4194304 time= 3.810150 msec
[/code]

msft 2015-12-01 03:01

[QUOTE=kracker;417820]One thing I've noticed though, is that the results from -cufftbench and actual "work" differ. Right now for the 2048K FFT I'm running at 4.5 ms/iter.
[/QUOTE]
Hi,
The real work in each iteration includes two FFTs, a multiplication, and a normalization.
Here is CUDALucas on a GT720:
[CODE]
cudalucas-code-37-trunk$ ./CUDALucas -cufftbench 524288 4194304 524288

CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size= 524288 time= 4.295260 msec
CUFFT_Z2Z size= 1048576 time= 8.666337 msec
CUFFT_Z2Z size= 1572864 time= 15.023411 msec
CUFFT_Z2Z size= 2097152 time= 17.336903 msec
CUFFT_Z2Z size= 2621440 time= 26.231358 msec
CUFFT_Z2Z size= 3145728 time= 31.551888 msec
CUFFT_Z2Z size= 3670016 time= 36.858215 msec
CUFFT_Z2Z size= 4194304 time= 35.139626 msec
cudalucas-code-37-trunk$ ./CUDALucas 37156667

Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, CUDALucas v2.04 Beta err = 0.0859 (7:34 real, 45.4005 ms/iter, ETA 468:22:53)
[/CODE]
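
To make the "two FFTs, a multiplication, and a normalization" point concrete, here is a toy pure-Python sketch of one LL squaring step. It is an assumption-laden illustration, not how CUDALucas or clLucas are written: it uses a plain zero-padded FFT with a separate reduction mod 2^p - 1 instead of the weighted transform real programs use, and every name in it is made up.

```python
import cmath

BASE_BITS = 8  # small digits keep floating-point round-off far below 0.5


def fft(a, invert=False):
    """Recursive radix-2 FFT (unscaled); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def square_via_fft(x, n):
    """Square x using length-n FFTs; x*x must fit in n digits of BASE_BITS bits."""
    mask = (1 << BASE_BITS) - 1
    digits = [(x >> (BASE_BITS * i)) & mask for i in range(n)]
    spectrum = fft(digits)                 # FFT number 1 (forward)
    spectrum = [v * v for v in spectrum]   # pointwise multiplication
    conv = fft(spectrum, invert=True)      # FFT number 2 (inverse)
    # Normalization: scale by 1/n, round to integers, propagate carries.
    result, carry = 0, 0
    for i, v in enumerate(conv):
        total = round(v.real / n) + carry
        result |= (total & mask) << (BASE_BITS * i)
        carry = total >> BASE_BITS
    return result + (carry << (BASE_BITS * n))


def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime, for a small odd prime p."""
    m = (1 << p) - 1
    n = 1
    while n * BASE_BITS < 2 * p:  # room for the full square, no wrap-around
        n *= 2
    s = 4
    for _ in range(p - 2):
        s = (square_via_fft(s, n) - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (11, 13, 17, 19, 23, 31):
        print(p, lucas_lehmer(p))
```

A benchmark that times a single transform therefore covers only part of what one iteration of this loop does, which is consistent with the per-iteration times being larger than the -cufftbench figures.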

msft 2015-12-01 14:08

[QUOTE=airsquirrels;414225]I'm curious to see if the multi-GPU clLucas/clFFT patch I'm working on is very effective in speeding up total throughput. I haven't tested it, but now that the 24 bit FFT limit is removed we should be able to do 100M tests with clLucas[/QUOTE]
A 24-bit complex-to-complex FFT means a 25-bit FFT in the GIMPS world.
[CODE]
Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 18432K, CUDALucas v2.04 Beta err = 0.2637 (1:24:39 real, 507.9446 ms/iter, ETA 46873:24:40)
Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 19200K, clLucas v1.026 err = 0.1582 (2:11:16 real, 787.5769 ms/iter, ETA 72678:02:03)
[/CODE]
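
For anyone comparing status lines like the two above, a small sketch that pulls out the ms/iter figure and estimates the remaining time and the speed ratio. The regular expression is an assumption fitted to the lines in this thread, not a documented output format, and the estimate simply multiplies the remaining iterations (out of p - 2) by the reported ms/iter, so it will not match the program's own ETA to the second.

```python
import re

# Pattern fitted to the status lines quoted in this thread (an assumption).
LINE_RE = re.compile(
    r"Iteration (\d+) M\( (\d+) \)C, (0x[0-9a-f]+), n = (\d+)K, "
    r"(\w+) v\S+.*?\(\S+ real, ([\d.]+) ms/iter"
)


def parse_status(line):
    """Return a dict with iteration, exponent, FFT length (K), program, ms/iter."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognized status line")
    iteration, exponent, _residue, fft_k, program, ms = match.groups()
    return {
        "iteration": int(iteration),
        "exponent": int(exponent),
        "fft_k": int(fft_k),
        "program": program,
        "ms_per_iter": float(ms),
    }


def remaining_hours(status):
    """Rough time left, assuming p - 2 iterations at a constant ms/iter."""
    left = status["exponent"] - 2 - status["iteration"]
    return left * status["ms_per_iter"] / 1000.0 / 3600.0


CUDA_LINE = ("Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 18432K, "
             "CUDALucas v2.04 Beta err = 0.2637 (1:24:39 real, 507.9446 ms/iter, "
             "ETA 46873:24:40)")
CL_LINE = ("Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 19200K, "
           "clLucas v1.026 err = 0.1582 (2:11:16 real, 787.5769 ms/iter, "
           "ETA 72678:02:03)")

if __name__ == "__main__":
    cuda, cl = parse_status(CUDA_LINE), parse_status(CL_LINE)
    print(cuda["program"], round(remaining_hours(cuda)), "hours left")
    print(cl["program"], round(remaining_hours(cl)), "hours left")
    print("ms/iter ratio:", cl["ms_per_iter"] / cuda["ms_per_iter"])
```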

msft 2015-12-09 07:40

Fixed the clFFT precision problem.
With clFFT 2.8:
[CODE]
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, clLucas v1.026 err = 0.1064 (2:15 real, 13.5008 ms/iter, ETA 139:16:58)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.026 err = 0.2188 (4:38 real, 27.7515 ms/iter, ETA 578:04:48)
[/CODE]
[url]https://github.com/shoichiro-yamada/clFFT[/url]
[CODE]
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, clLucas v1.026 err = 0.0781 (2:15 real, 13.5368 ms/iter, ETA 139:39:16)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4000K, clLucas v1.026 err = 0.2500 (5:27 real, 32.6878 ms/iter, ETA 680:54:18)
[/CODE]


All times are UTC.