[QUOTE=axn;414169]If the project's goal is to find a prime as soon as possible, then 1 LL test is what you should optimize for. You don't need doublecheck to find a prime.[/QUOTE]
Just to be pedantic... Your second statement is (almost certainly) true, but your first statement is conditional on your personal desires... If you want to _help_ GIMPS find the next MP, then you should either do LL tests or TF'ing in the LL range, based on the economic cross-over points of the hardware you have available to bring to bear. If *you* want to find the next MP, then you should do LL'ing regardless of the relative efficiencies.
If you are truly operating only for yourself and ignoring the project, you should TF your own exponents to ideal levels and then immediately LL them.
Or if you're like me: on your handful of 100M tests you run concurrent very-high-level TF work, because you'd rather find out a month or two in that you aren't going to get a prime than 12 months in, even if it doesn't make mathematical sense. I'm curious to see if the multi-GPU clLucas/clFFT patch I'm working on is very effective in speeding up total throughput. I haven't tested it, but now that the 24-bit FFT limit is removed we should be able to do 100M tests with clLucas.
[QUOTE=airsquirrels;414225]If you are truly operating only for yourself and ignoring the project, you should TF your own exponents to ideal levels and then immediately LL them.[/QUOTE]
Good (and correct) point. [QUOTE=airsquirrels;414225]I'm curious to see if the multi-GPU clLucas/clFFT patch I'm working on is very effective in speeding up total throughput. I haven't tested it, but now that the 24 bit FFT limit is removed we should be able to do 100M tests with clLucas[/QUOTE] I'm sure many are interested in hearing how that goes... :smile:
[QUOTE=chalsall;414219]Just to be pedantic... Your second statement is (almost certainly) true, but your first statement is conditional on your personal desires...[/quote]
No, Chris. [QUOTE=chalsall;414219]If you want to _help_ GIMPS find the next MP then you should either do LL tests or TF'ing in the LL range, based on the economic cross-over points of the hardware you have available to bring to bear.[/QUOTE] If you want GIMPS to find the next MP, all resources should be diverted to first-time LL -- you should not even think about DC. If you want to be sure that GIMPS has found all the MPs (in a given range), then, and only then, should you factor in how to optimally move both the LL and DC wavefronts. You need to understand the subtle difference between the two goals.

These have nothing to do with individual motivations. Both are potential goals for the project, and which one is more relevant is for the project to decide. But what I said in the previous post stands: "IF the project's goal is to find a prime as soon as possible..."

EDIT: Just to clarify. The principle of checking the economic cross-over point of an individual piece of hardware is correct. But whether it should be based on 1 LL or 2 LLs saved is the point of contention (I am saying 1 LL, *IF* the goal is to speed up finding the next MP).
[QUOTE=axn;414264]
If you want GIMPS to find the next MP, all resource should be diverted to fist-time LL -- you should not even think about DC. [/QUOTE] Yes. That is totally right -- elementary probability calculus. There is, say, a 5% chance an LL test is bad (a deliberately high figure for this example; the real rate is about 3%-4%). So to find a "possibly missed prime" by DC you have to DC 20 exponents, which in the end gives you the same gain (toward the first goal) as doing a single LL test: [U]one[/U] exponent was cleared. Of course, 20 DCs take about 5 times as long as one LL, so computers doing DC are about 5 times slower toward the goal "find a prime". But doing DC contributes the remaining 80% toward the second goal, "don't miss a prime", to which LL contributes only 20%. Well... roughly... (that last conclusion is forced, yeah, I know, but you get the idea).

As an LL test takes about 4 times longer than a DC (double the number of iterations, double the FFT length for each iteration), we would "break even" doing DC if the error rate of first-time LL tests were somewhere around 25%. Then one LL in 4 would be wrong, and we could do 4 DCs in the time of one LL to find that wrong LL. But we don't have such a high error rate, thank God! :razz:

In the end, everyone does what he likes. I don't like loose ends. (And probably other people don't like them either; that is why some of us concentrated on the "rip DCTF" subproject.)
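The break-even arithmetic above can be sketched in a few lines. This is only an illustration of the reasoning, with the thread's assumed numbers (an LL costs about 4 DCs of time; first-test error rate around 5%), not measured project statistics:

```python
# Sketch of the LL-vs-DC throughput argument above. Assumptions (from the
# post, not measured): a first-time LL takes ~4x the time of a DC, and the
# first-test error rate is ~5%.

def ll_per_dc_time(err_rate, ll_cost_in_dc_units=4.0):
    """Expected first-time-LL-equivalents cleared per unit of DC time.

    A DC only advances the 'find a prime ASAP' goal when it catches a bad
    first test: with error rate e, one DC effectively clears e exponents,
    while the same time spent on LL clears 1/ll_cost_in_dc_units.
    """
    dc_yield = err_rate                    # exponents cleared per DC
    ll_yield = 1.0 / ll_cost_in_dc_units   # exponents cleared per DC-time of LL
    return dc_yield / ll_yield             # < 1 means LL wins for this goal

print(ll_per_dc_time(0.05))   # ~0.2: DC is ~5x slower toward finding a prime
print(ll_per_dc_time(0.25))   # 1.0: break-even at a ~25% error rate
```

With a 5% error rate the ratio is 0.2, i.e. the "5 times slower" figure in the post; break-even would indeed need a ~25% error rate.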
[QUOTE=axn;414264]The principle of checking the economic cross-over point of an individual piece of hardware is correct. But whether it should be based on 1 LL or 2 LL saved is the point of contention (I am saying 1 LL, *IF* goal is to speed up the finding of next MP)[/QUOTE]
OK, I don't disagree with that argument. But then there is still value in TF'ing before LL'ing for most current candidates, just one bit lower, iff one accepts the argument that finding the MP is the only goal: there are still more candidates (below 80M) TF'ed to below 74 bits (91,535) than to 74 bits (86,568).
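The cross-over test being argued about can be sketched numerically. This uses the standard GIMPS rule of thumb that a Mersenne number has roughly a 1/b chance of a factor between 2^(b-1) and 2^b; the timings below are made-up placeholders, not real hardware figures:

```python
# Sketch of the TF economic cross-over argument. The 1/bit_level factor
# probability is the usual GIMPS heuristic; tf_hours and ll_hours are
# hypothetical numbers for some piece of hardware.

def tf_worthwhile(tf_hours, ll_hours, bit_level, tests_saved):
    """True if TF'ing one more bit level is expected to save more LL time
    than it costs. tests_saved is 1 (first-time LL only) or 2 (LL + DC) --
    exactly the point of contention in this thread."""
    p_factor = 1.0 / bit_level  # heuristic chance of a factor in this bit range
    return tf_hours < p_factor * tests_saved * ll_hours

# Hypothetical: 2-3 hours to TF bit 74, 200 hours for one LL test.
print(tf_worthwhile(2.0, 200.0, 74, tests_saved=1))  # True:  2 < 200/74 = ~2.7
print(tf_worthwhile(3.0, 200.0, 74, tests_saved=1))  # False: 3 > ~2.7
print(tf_worthwhile(3.0, 200.0, 74, tests_saved=2))  # True:  3 < 400/74 = ~5.4
```

The third call shows why "1 LL or 2 LLs saved" matters: counting the DC as also saved moves the cross-over roughly one bit higher.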
Hi,
bringing code from cudalucas-code-37-trunk.

[CODE]
$ ./clLucas -h
Platform 0 : Advanced Micro Devices, Inc.
$ CUDALucas -h|-v
$ CUDALucas [-d device_number] [-info] [-i inifile] [-threads 32|64|128|256] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-polite iteration] [-k] exponent|input_filename
$ CUDALucas [-d device_number] [-info] [-i inifile] [-threads 32|64|128|256] [-polite iteration] -r
$ CUDALucas [-d device_number] [-info] -cufftbench start end distance
  -h           print this help message
  -v           print version number
  -info        print device information
  -i           set .ini file name (default = "CUDALucas.ini")
  -threads     set threads number (default = 256)
  -f           set fft length (if round off error then exit)
  -s           save all checkpoint files
  -t           check round off error all iterations
  -polite      GPU is polite every n iterations (default -polite 1) (-polite 0 = GPU aggressive)
  -cufftbench  exec CUFFT benchmark (Ex. $ ./CUDALucas -d 1 -cufftbench 1179648 6291456 32768)
  -r           exec residue test.
  -k           enable keys (p change -polite, t disable -t, s change -s)

$ ./clLucas -cufftbench 524288 4194304 524288
Platform 0 : Advanced Micro Devices, Inc.
Platform :Advanced Micro Devices, Inc.
Device 0 : Capeverde
Build Options are : -D KHR_DP_EXTENSION
CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size=  524288 time=  1.524630 msec
CUFFT_Z2Z size= 1048576 time=  2.947390 msec
CUFFT_Z2Z size= 1572864 time=  4.713490 msec
CUFFT_Z2Z size= 2097152 time=  5.878710 msec
CUFFT_Z2Z size= 2621440 time= 10.299940 msec
CUFFT_Z2Z size= 3145728 time=  9.566070 msec
CUFFT_Z2Z size= 3670016 time= 11.889020 msec
CUFFT_Z2Z size= 4194304 time= 11.951850 msec
[/CODE]
[QUOTE=msft;417783]Hi,
bringing code from cudalucas-code-37-trunk. ...[/QUOTE] Nice, thank you!! :bow::bow:

One thing I've noticed though, is that the results from -cufftbench and actual "work" differ. Right now for the 2048K FFT I'm running at 4.5 ms/iter.

[code]
Platform 0 : Advanced Micro Devices, Inc.
Platform 1 : Intel(R) Corporation
Platform :Advanced Micro Devices, Inc.
Device 0 : Tonga
Build Options are : -D KHR_DP_EXTENSION
CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size=  524288 time= 0.679310 msec
CUFFT_Z2Z size= 1048576 time= 1.078500 msec
CUFFT_Z2Z size= 1572864 time= 1.698430 msec
CUFFT_Z2Z size= 2097152 time= 1.913530 msec
CUFFT_Z2Z size= 2621440 time= 3.322380 msec
CUFFT_Z2Z size= 3145728 time= 3.213230 msec
CUFFT_Z2Z size= 3670016 time= 3.945440 msec
CUFFT_Z2Z size= 4194304 time= 3.810150 msec
[/code]
[QUOTE=kracker;417820]One thing I've noticed though, is that the results from -cufftbench and actual "work" differ. Right now for the 2048K FFT I'm running at 4.5 ms/iter.
[/QUOTE] Hi,
real work includes two FFTs, a pointwise multiply, and a normalize per iteration.

CUDALucas on a GT720:
[CODE]
cudalucas-code-37-trunk$ ./CUDALucas -cufftbench 524288 4194304 524288
CUFFT bench start = 524288 end = 4194304 distance = 524288
CUFFT_Z2Z size=  524288 time=  4.295260 msec
CUFFT_Z2Z size= 1048576 time=  8.666337 msec
CUFFT_Z2Z size= 1572864 time= 15.023411 msec
CUFFT_Z2Z size= 2097152 time= 17.336903 msec
CUFFT_Z2Z size= 2621440 time= 26.231358 msec
CUFFT_Z2Z size= 3145728 time= 31.551888 msec
CUFFT_Z2Z size= 3670016 time= 36.858215 msec
CUFFT_Z2Z size= 4194304 time= 35.139626 msec

cudalucas-code-37-trunk$ ./CUDALucas 37156667
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, CUDALucas v2.04 Beta err = 0.0859 (7:34 real, 45.4005 ms/iter, ETA 468:22:53)
[/CODE]
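For readers unfamiliar with what an "iteration" in this output is, here is a minimal integer Lucas-Lehmer test: each iteration is one modular squaring of the running residue. GPU programs like CUDALucas/clLucas implement that squaring as forward FFT, pointwise multiply, inverse FFT, then carry/normalize, which is why one iteration of real work costs more than one raw FFT from the bench:

```python
# Minimal Lucas-Lehmer primality test (big-integer version, no FFT).
# One loop pass here corresponds to one "iteration" in the CUDALucas output.

def lucas_lehmer(p):
    """Return True if 2^p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number M(p)
    s = 4                     # standard LL starting value
    for _ in range(p - 2):    # p - 2 iterations of s -> s^2 - 2 (mod M(p))
        s = (s * s - 2) % m   # GPU codes do this squaring via FFTs
    return s == 0

print([p for p in (3, 5, 7, 11, 13, 17, 19) if lucas_lehmer(p)])
# -> [3, 5, 7, 13, 17, 19]  (M(11) = 2047 = 23 * 89 is composite)
```

Obviously this is impossibly slow at 37M-digit sizes; the whole point of the FFT codes is to do each squaring in roughly O(n log n).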
[QUOTE=airsquirrels;414225]I'm curious to see if the multi-GPU clLucas/clFFT patch I'm working on is very effective in speeding up total throughput. I haven't tested it, but now that the 24 bit FFT limit is removed we should be able to do 100M tests with clLucas[/QUOTE]
A 24-bit complex-to-complex FFT means a 25-bit FFT in the GIMPS world.
[CODE]
Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 18432K, CUDALucas v2.04 Beta err = 0.2637 (1:24:39 real, 507.9446 ms/iter, ETA 46873:24:40)
Iteration 10000 M( 332220523 )C, 0x1a313d709bfa6663, n = 19200K, clLucas v1.026 err = 0.1582 (2:11:16 real, 787.5769 ms/iter, ETA 72678:02:03)
[/CODE]
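A quick sanity check on the FFT lengths quoted above: the average number of bits each FFT word carries is roughly exponent / fft_length, and the DWT needs that to stay small enough for double-precision round-off to be recoverable (hence the larger clLucas FFT giving a smaller err). The figures below come straight from the posted runs:

```python
# Back-of-the-envelope bits-per-FFT-word for the runs quoted in the post.
# Rule of thumb only: real limits depend on the implementation's DWT details.

def bits_per_word(exponent, fft_len_k):
    """Average bits carried per FFT word, with the FFT length given in K."""
    return exponent / (fft_len_k * 1024)

print(round(bits_per_word(332220523, 18432), 2))  # CUDALucas run: ~17.6
print(round(bits_per_word(332220523, 19200), 2))  # clLucas run:   ~16.9
print(round(bits_per_word(37156667, 2048), 2))    # 2048K run:     ~17.72
```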
Fix clFFT precision problem.
clFFT 2.8:
[CODE]
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, clLucas v1.026 err = 0.1064 (2:15 real, 13.5008 ms/iter, ETA 139:16:58)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.026 err = 0.2188 (4:38 real, 27.7515 ms/iter, ETA 578:04:48)
[/CODE]
With [url]https://github.com/shoichiro-yamada/clFFT[/url]:
[CODE]
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2048K, clLucas v1.026 err = 0.0781 (2:15 real, 13.5368 ms/iter, ETA 139:39:16)
Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4000K, clLucas v1.026 err = 0.2500 (5:27 real, 32.6878 ms/iter, ETA 680:54:18)
[/CODE]