[QUOTE=apsen;293792]Use cygwin :smile:
[CODE][I]cmd [/I]| tee [-a] [I]log.file[/I][/CODE][/QUOTE] Indeed... If I may be a little "catty" here, it constantly blows my mind that the most popular Operating System in the world doesn't let the Power User do the most basic of things at the Command Line. But then, if popular meant good, McDonald's and KFC would be gourmet food.... :wink:
FRIED CHICKEN
(American Classic) Consider that much of the awesomeness of GNU is the number of programs that can be easily and powerfully used with Bash. As far as I can tell, there isn't much that Bash (meaning specifically only bash) can do that cmd.exe can't.
[QUOTE=Dubslow;293796]As far as I can tell, there isn't much that Bash (meaning specifically only bash) can do that cmd.exe can't do.[/QUOTE]
Open the Windows Command Prompt. CD into a directory. "grep" for a particular string within the files. "sort" the results. Then run that through a regex tool (sed, Perl, et al) to transform it into what you want... Please tell us what, [I][U]exactly[/U][/I], we have to type into cmd.exe to do that.
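As an aside, here is a self-contained sketch of that kind of pipeline under bash. The scratch directory, log files, and the "ERROR" pattern are invented for this illustration; only grep, sort, and sed themselves come from the post above:

```shell
# Create a couple of throwaway log files so the example runs anywhere.
dir=$(mktemp -d)
printf 'ok: one\nERROR: zeta\nERROR: alpha\n' > "$dir/a.log"
printf 'ERROR: beta\nok: two\n' > "$dir/b.log"

# grep for a string, sort the matches, then transform them with sed.
result=$(grep -h '^ERROR' "$dir"/*.log | sort | sed 's/^ERROR: //')
echo "$result"

rm -r "$dir"
```

Each stage is a separate program; the shell merely wires them together with pipes, which is exactly the IPC/modularity point raised later in the thread.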
Grep and sort are external programs, not bash features/utilities. It is in theory possible to compile their source code in MSVS and use them with cmd.exe. I'm not defending Windows, just pointing out both the modularity and completeness(/power) of the GNU system.
Edit: IPC is one of the many strong points of *nix. (And yes, in this case IPC is implemented by bash with pipes, etc...) (I have a source here I can reference as soon as I get home.)
[QUOTE=Dubslow;293804]Grep and sort are external programs, not bash features/utilities. It is in theory possible to compile their source code in MSVS and use them with cmd.exe. I'm not defending Windows, just pointing out both the modularity and completeness(/power) of the GNU system.[/QUOTE]
OK. Then let's step back... How does one, at the Windows Command Prompt, issue a command and send the results to both a file and the console? Let me make this easy for you... Make the results of the command "ping 8.8.8.8" appear both at the console, and in a file.
[QUOTE=chalsall;293808]OK.
Then let's step back... How does one, at the Windows Command Prompt, issue a command and send the results to both a file and the console? Let me make this easy for you... Make the results of the command "ping 8.8.8.8" appear both at the console, and in a file.[/QUOTE] Indeed... I went back and edited my previous post right before you posted, mentioning something related... While I admit I have no clue how to do that with bash, I wouldn't be surprised if it isn't possible with cmd.exe. Edit: Hurg... my brother took so damn long in the hardware store... exactly 60 minutes since my previous post... here's that citation I was talking about: [url]http://www.catb.org/~esr/writings/taoup/html/ch03s01.html[/url]
[QUOTE=Dubslow;293811]...while I admit I have no clue how to do that with bash, I wouldn't be surprised if it isn't possible with cmd.exe.[/QUOTE]
As apsen points out, it is what Power Users are used to doing under Unix every day... The "tee" program in a pipeline....
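For readers who haven't met it, a minimal sketch of tee; a plain echo stands in here for the real ping output, since the point is only where the text ends up:

```shell
# tee copies its stdin to stdout AND to the named file, so the same
# output appears on the console and in the log.
logfile=$(mktemp)
echo 'Reply from 8.8.8.8: bytes=32 time=12ms' | tee "$logfile"
```

With -a, tee appends to the log instead of truncating it, matching apsen's "cmd | tee -a log.file" above.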
Not exactly on point, but Windows 7 (and Server 2008R2) install powershell by default, which includes tee. So,[CODE]powershell
ping 127.0.0.1 | tee pinged.txt[/CODE]gives the desired result.
[QUOTE=sdbardwick;293814]Not exactly on point, but Windows 7 (and Server 2008R2) install powershell by default, which includes tee. So,[CODE]powershell
ping 127.0.0.1 | tee pinged.txt[/CODE]gives the desired result.[/QUOTE] That's good to know. Just wondering... Does "powershell" include Perl?
I am not a PowerShell expert (or even a regular user); I've picked up some of its capabilities while searching for solutions to specific problems.
With that in mind, I'm going to say no; the programming functionality is akin to C#, as PowerShell is .NET based.
[QUOTE=sdbardwick;293824]With that in mind, I'm going to say no; programming functionality is akin to C#, as PowerShell is .Net based.[/QUOTE]
Is that anything like "Lie back and think of England."?
Heh. More like "Turn your head and cough".
PowerShell is, in fact, rather powerful. But it focuses on system administration rather than general purpose programming.
Success for 1.69[CODE]Processing result: M( 26275721 )C, 0x8c6e4dc676cb816a, n = 1474560, CUDALucas v1.69
LL test successfully completes double-check of M26275721 [/CODE]
OT
[QUOTE=chalsall;293795]Indeed...
If I may be a little "catty" here, it constantly blows my mind that the most popular Operating System in the world doesn't let the Power User do the most basic of things at the Command Line.[/QUOTE]The MPOSITW didn't become the most popular by catering to Power Users (or by keeping track of all its memory allocations).
Wow, look what I have started :smile:
In the midst of it I hardly even noticed the 1.69 success report. But to stay off topic: PowerShell is extremely powerful, but it is not portable and requires a bit of different thinking. I looked into it, but since I already have cygwin I was too lazy to get used to it. Anyway, I did write a couple of scripts that were easier to implement with it than with perl and cygwin. Andriy
I'd like to put together a CUDALucas performance comparison chart, similar to what I have for [url=http://mersenne-aries.sili.net/mfaktc.php]mfaktc[/url]. I'll need a wider variety of data samples than the single one I have from my own system. If anyone reading this thread could fire up CUDAlucas and PM/email me some iteration times for a variety of exponent sizes (at least 25M, 50M and 75M would be great), that would be much appreciated. If possible, a variety of CUDAlucas versions would also be interesting. Naturally I'd also need to know what GPU you're using (and at what clock speed, if overclocked (whether factory or by yourself)).
[QUOTE=James Heinrich;293952]I'd like to put together a CUDALucas performance comparison chart, similar to what I have for [URL="http://mersenne-aries.sili.net/mfaktc.php"]mfaktc[/URL]. I'll need a wider variety of data samples than the single one I have from my own system. If anyone reading this thread could fire up CUDAlucas and PM/email me some iteration times for a variety of exponent sizes (at least 25M, 50M and 75M would be great), that would be much appreciated. If possible, a variety of CUDAlucas versions would also be interesting. Naturally I'd also need to know what GPU you're using (and at what clock speed, if overclocked (whether factory or by yourself)).[/QUOTE]
I'll compile some data for you. It may take me a few days to get it all. On a bad note, I had a 1.69 mismatch today. I'm re-running the exponent to see if the original is bad, but it will need a P95 run anyway. Anyone want to run it? If not, I'll start P95 on it Saturday when my other P95 DC completes. M( 26229943 )C, 0x2bb14485101dd1__, n = 1474560, CUDALucas v1.69
How long would your run take?
[QUOTE=Dubslow;293958]How long would your run take?[/QUOTE]
My CuLu DC will be done in 14 hours. Since it's a DC for me I'll run it on a slower laptop; the P95 run will take about a week. Can you run it faster?
Yeah, I think I can do it in a few days. I'll run it now.
1 Attachment(s)
[QUOTE=Dubslow;293962]Yeah, I think I can do it in a few days. I'll run it now.[/QUOTE]
Cool, thanks :smile:. I'll post my CuLu re-run results when it's done... Edit: I attached the full test run (minus the last residue)
Wait, whoops...
When you reported the result, you lost the assignment key and it was reassigned. I'd rather not poach... Edit: Repeat for emphasis: If you want me to do a quick DC with Prime95, you must not submit the result to PrimeNet, so that we retain control of the exponent. (Yes, that does mean checking the residue for a match before submitting.) Sorry about that, flash.
[QUOTE=Dubslow;293966]Wait, whoops...
When you reported the result, you lost the assignment key and it was reassigned. I'd rather not poach... Edit: Repeat for emphasis: If you want me to do a quick DC with Prime95, you must not submit the result to PrimeNet, so that we retain control of the exponent. (Yes, that does mean checking the residue for a match before submitting.) Sorry about that, flash.[/QUOTE] No, my fault. I should have checked before I submitted. I've gotten used to everything matching, so I didn't check first. I'll let my CuLu finish, but you're right. Next time...
It took me a while and a few re-runs, but I ended up with two good residues with v1.69. :smile: (26251817 and 26240761).
After a lot of experimenting I reached the conclusion that you have to use -t always (that is: IT IS A MUST), just to be on the safe side. In this case, for "production", polite versus aggressive has no influence. Because checking the sums/errors at every iteration works somewhat like the polite "trick" with the memory, it will "give a break" to the GPU for a while, about 20%: with the polite trick the GPU is only about 79% busy, and with aggressive it becomes about 81% busy. In both cases -t costs the most, and in both cases -t is necessary to be on the safe side (otherwise you will be sorry at the end when the residues don't match), and in both cases the computer is responsive enough (this means good!) for daily work and average-hungry graphics applications. If you need more output, the next step is to disable -t and enable aggressive mode [B]at the same time[/B]. In that case the GPU load goes to 98-99% and you WILL get 25% more output (from 100 to 80 it is 20%, but from 80 to 100 it is 25% :P), but your computer is hotter, louder, much less responsive (assuming the card is also used as the primary graphics card) and you lose confidence in the result. For DC it could be OK, if you can afford it, because you have the former residue on PrimeNet and can check your result. But it is still not recommended. For [B]first-time LL, running without -t would be a BIG mistake[/B], unless you are sure, but SURE, objectively, not subjectively (like "my card is the best because it is mine!"), that your card is a very stable one and does not produce hardware errors, does not get hot, etc. Much better to leave -t on, and when you really-really want to maximize your GPU, add one copy of mfaktc. This way you can make nice credit too :D
[QUOTE=James Heinrich;293952]I'd like to put together a CUDALucas performance comparison chart[/QUOTE]Thanks to those who have submitted data, but I need more data points, please. :smile:
After looking over a few benchmark results, I'm going to standardize and ask that everyone submit results using v1.69 on three specific exponents:[code]CUDAlucas -polite 0 26214400
CUDAlucas -polite 0 52428800
CUDAlucas -polite 0 78643200[/code]And (important) I need to know what FFT size was used. You may see it start with a smaller FFT size at first and then move up if the error is too high:[quote]C:\Prime95\cudalucas>CUDALucas_169_20 -polite 0 26214400
[color=red]start M26214400 fft length = 1310720
iteration = 22 < 1000 && err = 0.26196 >= 0.25, increasing n from 1310720[/color]
[color=blue]start M26214400 fft length = 1572864[/color]
[color=gray]Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.02403 (0:31 real, 3.0623 ms/iter, ETA 22:17:12)[/color]
[b]Iteration 20000[/b] M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.02403 (0:30 real, [b]3.0247 ms/iter[/b], ETA 22:00:15)[/quote]For consistency, I'm using the timing data as reported on iteration 20000. So for anyone willing to run (or re-run) benchmark data for me, please:
* use v1.69 ([url=http://www.mersenneforum.org/showpost.php?p=293735&postcount=1062]Windows binaries here[/url])
* use the exact 3 command lines above
* send me the output from the start through the 20000th iteration (as in the example above).
[QUOTE=LaurV;294019]you have to use -t always (that means: IS A MUST), just to be on the safe side.[/QUOTE]
Good point. I experimented with cudaMemcpyAsync(), but it was slow.
[QUOTE=msft;294028]Good point.
I experimented with cudaMemcpyAsync(), but it was slow.[/QUOTE] The -t option doesn't have to copy g_error to the CPU every iteration. It could copy every 10th, or 100th, or whatever. Just make sure you check g_error before writing a new save file.
[QUOTE=Prime95;294030]The -t option doesn't have to copy g_error to the CPU every iteration. It could copy every 10th, or 100th, or whatever. Just make sure you check g_error before writing a new save file.[/QUOTE]
Yes, yes. Now testing. [code]
Iteration 80000 M( 86243 )C, 0x871aac1149a65db1, n = 4608, CUDALucas v2.00 err = 0.01172 (0:17 real, 1.7138 ms/iter, ETA 0:00)
M( 86243 )C, 0x0000000000000000, n = 4608, CUDALucas v2.00
[/code]:lol:
[QUOTE=flashjh;293964]Cool, thanks :smile:. I'll post my CuLu re-run results when it's done...
Edit: I attached the full test run (minus the last residue)[/QUOTE] The original is correct based on my second run, so the P95 DC will match. M( 26229943 )C, 0x76916187254012__, n = 1474560, CUDALucas v1.69
I logged in to compile v2.0, but it's gone. Where did it go, msft?
1 Attachment(s)
[QUOTE=flashjh;294044]I logged in to compile v2.0, but it's gone. Where did it go, msft?[/QUOTE]
Sorry, found a fatal error. Ver 2.00: 1) Speed up with the -t option. 2) Use "sEXPONENT.ITERATION.RESIDUE.txt" file names. [code]
$ ./CUDALucas -polite 0 26974951
Iteration 23300000 M( 26974951 )C, 0x31b4d280a170995a, n = 1474560, CUDALucas v2.00 err = 0.1797 (0:56 real, 5.6171 ms/iter, ETA 5:43:34)
$ ./CUDALucas -polite 0 26974951 -t
Iteration 23320000 M( 26974951 )C, 0x537f9e116a703252, n = 1474560, CUDALucas v2.00 err = 0.207 (0:56 real, 5.6250 ms/iter, ETA 5:42:11)
[/code]
Does anyone have a link to the 4.1 cudart64 and cufft64 DLLs? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.
[QUOTE=James Heinrich;294025]Thanks to those who have submitted data, but I need more data points, please. :smile:
After looking over a few benchmark results, I'm going to standardize and ask that everyone submit results using v1.69 on three specific exponents:[code]CUDAlucas -polite 0 26214400
CUDAlucas -polite 0 52428800
CUDAlucas -polite 0 78643200[/code]And (important), I need to know what FFT size was used. You may see it start with a smaller FFT size at first and then move up if the error is too high. For consistency, I'm using the timing data as reported on iteration 20000. So for anyone willing to run (or re-run) benchmark data for me, please:
* use v1.69 ([url=http://www.mersenneforum.org/showpost.php?p=293735&postcount=1062]Windows binaries here[/url])
* use the exact 3 commandlines above
* send me the output from start to 20000 iteration (as the above example).[/QUOTE] I am using a GTX275, CUDA toolkit 3.0, cc 1.3. Here are my benchmarks:
[code]
luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 26214400
start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.287598 >= 0.25, increasing n from 1310720
start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.04517 (4:56 real, 29.6005 ms/iter, ETA 215:25:33)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.04517 (4:54 real, 29.4113 ms/iter, ETA 213:58:00)
---
luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 52428800
start M52428800 fft length = 2621440
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 2621440
start M52428800 fft length = 3145728
Iteration 10000 M( 52428800 )C, 0x3ceee1cc01747326, n = 3145728, CUDALucas v1.69 err = 0.05469 (9:09 real, 54.8493 ms/iter, ETA 798:30:51)
Iteration 20000 M( 52428800 )C, 0x9281347573ff62eb, n = 3145728, CUDALucas v1.69 err = 0.05469 (9:00 real, 53.9812 ms/iter, ETA 785:43:32)
---
luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 78643200
start M78643200 fft length = 3932160
iteration = 20 < 1000 && err = 0.25 >= 0.25, increasing n from 3932160
start M78643200 fft length = 4194304
iteration = 25 < 1000 && err = 0.339844 >= 0.25, increasing n from 4194304
start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v1.69 err = 0.07617 (13:12 real, 79.2440 ms/iter, ETA 1730:49:15)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v1.69 err = 0.07617 (13:16 real, 79.5197 ms/iter, ETA 1736:37:17)
[/code]
The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown).
Luigi
1 Attachment(s)
[QUOTE=msft;294046]Sorry, found a fatal error.
Ver 2.00: 1) Speed up with the -t option. 2) Use "sEXPONENT.ITERATION.RESIDUE.txt" file names. [code]
$ ./CUDALucas -polite 0 26974951
Iteration 23300000 M( 26974951 )C, 0x31b4d280a170995a, n = 1474560, CUDALucas v2.00 err = 0.1797 (0:56 real, 5.6171 ms/iter, ETA 5:43:34)
$ ./CUDALucas -polite 0 26974951 -t
Iteration 23320000 M( 26974951 )C, 0x537f9e116a703252, n = 1474560, CUDALucas v2.00 err = 0.207 (0:56 real, 5.6250 ms/iter, ETA 5:42:11)
[/code][/QUOTE] 2.00 x64 Binaries (untested):
* 3.2 / sm 1.3
* 4.0 / sm 2.0
* 4.1 / sm 2.0
[QUOTE=ET_;294049]I am using a GTX275, CUDA toolkit 3.0, cc 1.3... The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown).[/QUOTE]These timings seem odd. For 52M I'd expect timings (based on other data I've seen) somewhere around 11ms, not 54ms.
Wow
Initial testing on 2.00 with the new -t code is ~20% faster (3.2 / 1.3)!
[QUOTE]e:\cuda2\cuda -d 1 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301 >> 26232301.txt[/QUOTE] 1.69[CODE]
Iteration 14010000 M( 26232301 )C, 0xdcf162b969f4b93f, n = 1474560, CUDALucas v1.69 err = 0.125 (0:25 real, 2.5782 ms/iter, ETA 8:45:06)
Iteration 14020000 M( 26232301 )C, 0x0280e1e19768d6c5, n = 1474560, CUDALucas v1.69 err = 0.125 (0:25 real, 2.5630 ms/iter, ETA 8:41:33)
Iteration 14030000 M( 26232301 )C, 0xdf3cb8472cf8663e, n = 1474560, CUDALucas v1.69 err = 0.125 (0:26 real, 2.5630 ms/iter, ETA 8:41:08)
Iteration 14040000 M( 26232301 )C, 0x76a8dff0761ecaac, n = 1474560, CUDALucas v1.69 err = 0.125 (0:26 real, 2.5692 ms/iter, ETA 8:41:58)
[/CODE] 2.00[CODE]
continuing work from a partial result
M26232301 fft length = 1474560
iteration = 14040001
Iteration 14050000 M( 26232301 )C, 0xf64c760ed90b27d4, n = 1474560, CUDALucas v2.00 err = 0.1094 (0:21 real, 2.1435 ms/iter, ETA 7:15:07)
Iteration 14060000 M( 26232301 )C, 0xf2b2b558215ee274, n = 1474560, CUDALucas v2.00 err = 0.1172 (0:22 real, 2.1427 ms/iter, ETA 7:14:37)
Iteration 14070000 M( 26232301 )C, 0x72fd99a3b37b01ef, n = 1474560, CUDALucas v2.00 err = 0.1172 (0:21 real, 2.1432 ms/iter, ETA 7:14:21)
Iteration 14080000 M( 26232301 )C, 0x29a7604f2a950ae8, n = 1474560, CUDALucas v2.00 err = 0.1172 (0:22 real, 2.1277 ms/iter, ETA 7:10:51)
[/CODE]
[QUOTE=bcp19;294048]Does anyone have a link to the 4.1 cudart64 and cufft64 dll's? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.[/QUOTE]
Here you go, let me know if you need a different file. [URL="http://www.sendspace.com/file/qyyqxl"]Link[/URL]
[QUOTE=James Heinrich;294061]These timings seem odd. For 52M I'd expect timings (based on other data I've seen) somewhere around 11ms, not 54ms.[/QUOTE]
I get 23ms with v1.3. Now I've upgraded to Ubuntu 11.04, and have new drivers and a toolkit to install. I'll let you know tomorrow how they perform. :smile: Luigi
[QUOTE=flashjh;294069]Here you go, let me know if you need a different file.
[URL="http://www.sendspace.com/file/qyyqxl"]Link[/URL][/QUOTE] That is what I needed, but 3.2 still seems fastest. Is this normal or do certain cards work better with 4.0/4.1? |
[QUOTE=bcp19;294048]Does anyone have a link to the 4.1 cudart64 and cufft64 dll's? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.[/QUOTE]
[URL="http://home.htp-tel.de/shornbostel/"]http://home.htp-tel.de/shornbostel/[/URL]
Another DC success and a DC fail for v1.69 (26242253 and 26269081 respectively); the first is reported, the second is now running a TC. This makes the score 3 to 1. For the former version I had 12 to 2. The mismatches were caused by hardware (OC, memory, playing around too much, whatever).
Switching to v2.0. I will resume the current work done with 1.69, a few checkpoints behind, to see what's going on, i.e. whether I get the same residues. Theoretically I am now able to build my own exe, but I prefer to use the one provided by flashjh, as it is known to be well compiled and to run without issues, at least until I am confident with my own builds.
[CODE]start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v1.69 err = 0.04327 (1:19 real, 7.9267 ms/iter, ETA 173:07:58)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v1.69 err = 0.04785 (1:18 real, 7.7783 ms/iter, ETA 169:52:09)
Iteration 30000 M( 78643200 )C, 0xe0b1e59a43b7098b, n = 4718592, CUDALucas v1.69 err = 0.04785 (1:19 real, 7.8894 ms/iter, ETA 172:16:24)[/CODE] Mean[{7.9267,7.7783,7.8894}] = 7.8648
[CODE]Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v2.00 err = 0.04327 (1:13 real, 7.2161 ms/iter, ETA 157:36:44)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v2.00 err = 0.04785 (1:11 real, 7.1533 ms/iter, ETA 156:13:09)
Iteration 30000 M( 78643200 )C, 0xe0b1e59a43b7098b, n = 4718592, CUDALucas v2.00 err = 0.04785 (1:12 real, 7.1515 ms/iter, ETA 156:09:39)[/CODE] Mean[{7.2161,7.1533,7.1515}] = 7.17363
Avg. 8.79% faster. Very good result.
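The averaging can be checked with a quick awk one-liner; the ms/iter values are copied from the two code blocks above, nothing else is assumed:

```shell
# Means of the three ms/iter samples for each version, plus the speedup.
summary=$(awk 'BEGIN {
  v169 = (7.9267 + 7.7783 + 7.8894) / 3
  v200 = (7.2161 + 7.1533 + 7.1515) / 3
  printf "v1.69: %.4f  v2.00: %.5f  speedup: %.2f%%", v169, v200, 100 * (v169 - v200) / v169
}')
echo "$summary"
```

which agrees with the quoted means and gives a speedup just under 8.8%.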
[QUOTE=ET_;294049]I am using a GTX275, CUDA toolkit 3.0, cc 1.3.
Here are my benchmarks: The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown). Luigi[/QUOTE] Never mind... I just upgraded my system to Ubuntu 11.04, CUDA 4.1 (driver 295.33). My system is again responsive, and version 1.69 seems twice as fast as 1.3. Gonna try v2.0 now. James, you were right: I get 11.3 ms/iteration. Please cancel my previous results, I'll send something more up to date during the week. Luigi
Gah, I can't get anything beyond 270.xx to install properly on 11.04 :P
[QUOTE=Dubslow;294155]Gah, I can't get anything beyond 270.xx to install properly on 11.04 :P[/QUOTE]
I had similar problems running the drivers package downloaded from NVIDIA. I resolved it by stopping gdm (sudo service gdm stop) from a console (ctrl-alt-F2) and issuing the following commands:
[code]
sudo add-apt-repository ppa:ubuntu-x-swat/x-updates
sudo apt-get update
sudo apt-get install nvidia-current nvidia-settings
sudo service gdm start
[/code]
HTH :smile:
Luigi
Ah, I've been wondering why nvidia-current wasn't getting past 270.xx. Thanks, I'll have to try that when I get back to my desktop.
[QUOTE=LaurV;294102]Another 1 DC success and 1 DC fail for v1.69 (26242253 and respective 26269081), first reported, second running TC. This makes the score 3 to 1. For the former version I had 12 to 2. The mismatches were caused by hardware (OC, memory, playing too much around, whatever).
Switching to v2.0. I will resume the current work done with 1.69, few checkpoints behind, to see what's going on, if I get same residues. Theoretically I am now able to build my own exe, but I prefer to use the one provided by flashjh, as it is now recognized as well compiled and running without issues, until I would be confident with my play'around.[/QUOTE] Got a match for the job started with 1.69 and resumed with v2.0, for exponent 26244851. So, the procedure works. Also, my TC for [COLOR=Red][B]26269081[/B][/COLOR] (started with 1.69 and resumed with 2.0 after about 5M iterations) got the [B]same residue as my previous DC[/B] (done with 1.69 in full). Unfortunately I was stupid enough to report it (copy/paste mistake), so we lost the assignment. But I can say almost 100% (the reservation is for a possible software bug in CUDALucas) that my CL DC+TC residue is correct and the original P95 residue is wrong. If one of you gets this exponent, [B]don't run it with CUDALucas[/B]. I will add it to the "Don't DC..." thread too. I am curious to know the final result when the exponent is cleared, but I can't P95 it myself right now.
Huh...
That raises some very disconcerting questions about the accuracy of curtisc's tests... (unless cosmic ray?)
[QUOTE=Dubslow;294207]Huh...
That raises some very disconcerting questions about the accuracy of curtisc's tests... (unless cosmic ray?)[/QUOTE] Not really; this is not the first time we have met a DC that does not agree with the FC, where a TC then shows that the DC was right and the FC was wrong. That is why we do DC/TC, and the historical error rate is not insignificant. There ARE exponents with a wrong residue for the first LL test in the PrimeNet DB. edit: in fact about 3 LL tests in 200 are wrong, and this is normal, according to the last paragraph of [URL="http://www.mersenne.org/various/math.php"]this[/URL]. It is not the first time for me either, see the "Don't DC them..." thread, and I still keep the residue chain for all of them until they are cleared. For the exponent in question, I have all the checkpoint files every 100k iterations, starting from 5M (the part done with CL2.0). If anyone is interested in repeating the test, please use P95, and put the lines "InterimFiles=100000" and "InterimResidues=100000" into the prime.txt file. I can post (or send) the list of residues for confirmation.
[QUOTE=LaurV;294213]Not really, this is not the first time we meet a DC which is not conform with FC, and when run TC shows that DC was right and FC was wrong. That is why we do DC/TC, and historical error rate is not insignificant.[/QUOTE]
I'm saying, though, that the first test was done by curtisc, and at the moment it appears to be wrong. How many others of his are wrong? Or was it just a cosmic ray, not a hardware error? [QUOTE=ET_;294159]I had similar problems running the drivers package downloaded from NVIDIA. I resolved stopping the gdm (sudo service gdm stop), from a console (ctrl-alt-F2) and issuing the following commands: [code] sudo add-apt-repository ppa:ubuntu-x-swat/x-updates sudo apt-get update sudo apt-get install nvidia-current nvidia-settings sudo service gdm start [/code] HTH :smile: Luigi[/QUOTE] Oh my goodness, how did you figure out what to do? I've been floundering with this one since like last October. mfaktc 0.18, here I come!!!
On Debian unstable, I had a performance problem with 295.20 on mfaktc and gpu-ecm (performance was divided by more than 3), but it came back to the previous level (290.x) after installing 295.33.
When -t saves yer ass:
[CODE]
Iteration 18600000 M( 26276197 )C, 0xcfa7786f54f16d9b, n = 1474560, CUDALucas v2.00 err = 0.1348 (5:05 real, 3.0521 ms/iter, ETA 6:26:35)
Iteration 18700000 M( 26276197 )C, 0xc1f11399edd62632, n = 1474560, CUDALucas v2.00 err = 0.1348 (5:06 real, 3.0634 ms/iter, ETA 6:22:55)
Iteration 18800000 M( 26276197 )C, 0xfb97497ee5183685, n = 1474560, CUDALucas v2.00 err = 0.1348 (5:08 real, 3.0807 ms/iter, ETA 6:19:57)
iteration = 18822701 >= 1000 && err = 0.498047 >= 0.35,fft length = 1474560
write checkpoint file and exit.(when enable -t option)
[/CODE] By the way, I found out that the residue written into the file name in this case is not the last known-good residue, but the last residue written to a file. This is not a big deal; anyhow, I resumed a few checkpoints earlier, just to be sure.
[QUOTE=LaurV;294242]By the way, I found out that the residue written into the file name in this case is not the last known-good residue, but the last residue written to a file. This is not a big deal; anyhow, I resumed a few checkpoints earlier, just to be sure.[/QUOTE]
Please post a log.
[QUOTE=James Heinrich;294025]Thanks to those who have submitted data, but I need more data points, please. :smile:[/QUOTE]Thanks to everyone who has submitted data. I now have a pretty good picture of CUDALucas throughput, and can currently predict timings to within +/- 6% for pretty much any card.
[url]http://mersenne-aries.sili.net/cudalucas.php[/url] Performance depends on GFLOPS, FFT size, and compute version. As with mfaktc, compute 2.0 is best, 2.1 is second-best and 1.3 is slowest.
[QUOTE=James Heinrich;294256]Thanks to everyone who has submitted data. I now have a pretty good picture of CUDALucas throughput, and can currently predict timings to within +/- 6% for pretty much any card.
[url]http://mersenne-aries.sili.net/cudalucas.php[/url] Performance depends on GFLOPS, FFT size, and compute version. As with mfaktc, compute 2.0 is best, 2.1 is second-best and 1.3 is slowest.[/QUOTE] Thanks for putting this together.
[QUOTE=flashjh;294258]Thanks for putting this together.[/QUOTE]No problem. I'm not entirely sure how to present it best. The "efficiency" of CUDALucas in terms of performance-per-day varies across exponent size as the chosen FFT sizes for CUDALucas and Prime95 (from which credit values are derived) don't align. It's more obvious if I show more columns (e.g. every 1M instead of every 10M), but that leads to many columns. But the overall trend is that CUDALucas appears more efficient at larger exponent sizes, especially around 70M.
CUDALucas 2.00 on a GTX680 (sm 3.0):
[CODE]start M26214400 fft length = 1310720
iteration = 22 < 1000 && err = 0.370434 >= 0.25, increasing n from 1310720
start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v2.00 err = 0.02403 (0:33 real, 3.2582 ms/iter, ETA 23:42:45)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v2.00 err = 0.02403 (0:33 real, 3.2477 ms/iter, ETA 23:37:38)
[/CODE]
[CODE]start M52428800 fft length = 2621440
iteration = 22 < 1000 && err = 0.359292 >= 0.25, increasing n from 2621440
start M52428800 fft length = 3145728
Iteration 10000 M( 52428800 )C, 0x3ceee1cc01747326, n = 3145728, CUDALucas v2.00 err = 0.03371 (1:04 real, 6.4324 ms/iter, ETA 93:38:44)
Iteration 20000 M( 52428800 )C, 0x9281347573ff62eb, n = 3145728, CUDALucas v2.00 err = 0.03371 (1:05 real, 6.4328 ms/iter, ETA 93:37:57)
[/CODE]
[CODE]start M78643200 fft length = 3932160
iteration = 22 < 1000 && err = 0.300339 >= 0.25, increasing n from 3932160
start M78643200 fft length = 4194304
iteration = 25 < 1000 && err = 0.313914 >= 0.25, increasing n from 4194304
start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v2.00 err = 0.04076 (1:36 real, 9.5832 ms/iter, ETA 209:18:46)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v2.00 err = 0.04204 (1:36 real, 9.5889 ms/iter, ETA 209:24:40)
[/CODE]
[QUOTE=James Heinrich;294262]No problem. I'm not entirely sure how to present it best. The "efficiency" of CUDALucas in terms of performance-per-day varies across exponent size as the chosen FFT sizes for CUDALucas and Prime95 (from which credit values are derived) don't align. It's more obvious if I show more columns (e.g. every 1M instead of every 10M), but that leads to many columns. But the overall trend is that CUDALucas appears more efficient at larger exponent sizes, especially around 70M.[/QUOTE]
It is rather curious how the older GTX 2xx cards are more efficient at CL, relative to TF, than their newer cousins.
[QUOTE=bcp19;294277]It is rather curious how the older GTX 2xx cards are more efficient at CL, relative to TF, than their newer cousins.[/QUOTE]Assuming compute 2.1 as a baseline:
CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%
mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = ? (no data)
What strikes me as somewhat unexpected is the [i]horrible[/i] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).
[QUOTE=James Heinrich;294283]Assuming compute 2.1 as a baseline:
CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%
mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = ? (no data)
What strikes me as somewhat unexpected is the [I]horrible[/I] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).[/QUOTE] I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.
[QUOTE=bcp19;294285]I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.[/QUOTE]There are minor performance differences based on the software version; my comparison numbers are based on hardware capabilities (the GTX 560 is compute 2.1 whereas the GTX 570 is compute 2.0, for example). Going from 1.3 to 2.0 was a big improvement, but (gaming aside) it seems to have been downhill from there. :sad:
|
[QUOTE=bcp19;294285]I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.[/QUOTE]
They are; for the GTX 5xx, sm1.3 is faster than sm2.0 with drv 4.0, which is faster than sm2.0 with drv 4.1 (I do not have sm2.1 cards to compare). |
Just tested 2.00 with mixed results...
Letting CUDALucas decide the FFT, no noticeable speedup: [code]cudalucas1.69.cuda3.2.sm_13.x64 -polite 0 26214400 start M26214400 fft length = 1310720 iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 1310720 start M26214400 fft length = 1572864 Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.6947 ms/iter, ETA 56:00:01) Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.7383 ms/iter, ETA 56:17:46) Iteration 30000 M( 26214400 )C, 0x2603d4f32b1447b1, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.6608 ms/iter, ETA 55:42:40) cudalucas2.00.cuda3.2.sm_13.x64 -polite 0 26214400 start M26214400 fft length = 1310720 iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 1310720 start M26214400 fft length = 1572864 Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.6971 ms/iter, ETA 56:01:03) Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.6540 ms/iter, ETA 55:40:58) Iteration 30000 M( 26214400 )C, 0x2603d4f32b1447b1, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.7160 ms/iter, ETA 56:06:45) Iteration 40000 M( 26214400 )C, 0xad8c5ef324794a7f, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:16 real, 7.6570 ms/iter, ETA 55:39:43) [/code] Specifying an FFT, ~5% speedup: [code]cudalucas1.69.cuda3.2.sm_13.x64 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301 start M26232301 fft length = 1474560 Iteration 10000 M( 26232301 )C, 0xf6f119964a437acf, n = 1474560, CUDALucas v1.69 err = 0.1094 (1:17 real, 7.6517 ms/iter, ETA 55:43:47) Iteration 20000 M( 26232301 )C, 0x3c43951af66bdf31, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:16 real, 7.6471 ms/iter, ETA 55:40:30) Iteration 30000 M( 26232301 )C, 0x56a23afa69fbb918, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:17 real, 7.6466 ms/iter, ETA 55:39:00) Iteration 40000 
M( 26232301 )C, 0xd2d3eeab0f0b0e40, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:16 real, 7.6486 ms/iter, ETA 55:38:37) ^C caught. Writing checkpoint. cudalucas2.00.cuda3.2.sm_13.x64 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301 continuing work from a partial result M26232301 fft length = 1474560 iteration = 40043 Iteration 50000 M( 26232301 )C, 0x71785e5f16f5da16, n = 1474560, CUDALucas v2.00 err = 0.1074 (1:12 real, 7.2297 ms/iter, ETA 52:34:34) Iteration 60000 M( 26232301 )C, 0xf745bd35ce3b0ab5, n = 1474560, CUDALucas v2.00 err = 0.1094 (1:13 real, 7.2783 ms/iter, ETA 52:54:32) Iteration 70000 M( 26232301 )C, 0x3a3a81d0ce422b82, n = 1474560, CUDALucas v2.00 err = 0.1094 (1:13 real, 7.2781 ms/iter, ETA 52:53:16) [/code] Is it possible in future versions to 'clean up' the FFT selection? |
[QUOTE=James Heinrich;294283]
What strikes me as somewhat unexpected is the [i]horrible[/i] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).[/QUOTE] Go see the Kepler thread; from the reviews that were linked there, we have all been expecting (sadly) reduced compute performance for the 680. However, since the 680 is the GK104, not GK110, (and as such is more related to the 560 Ti than the 580) we're waiting to see what the GK110 can do. |
Compute 2.0 cards do DP at 1/8 of their SP GFlops; 2.1 cards, at 1/12.
From AnandTech's treatment, 3.0 was expected to be 1/24, and so it, sadly, appears to be. [QUOTE="Python"][B]Shop Owner[/B]: Remarkable bird, the Norwegian Blue, isn't it, eh? Beautiful plumage! [B]Mr. Praline[/B]: The plumage don't enter into it. It's stone dead! [/QUOTE] |
I've had 4 2.00 successes and 1 mismatch. I don't know for sure what caused the mismatch, but I had a driver failure that Win7 recovered from, so that's probably it (I caused it by closing CuLu too soon after starting).[CODE]
M( 26232301 )C, 0x[COLOR=red]251f67a97a93197a[/COLOR], n = 1474560, CUDALucas v2.00 M( 26232803 )C, 0xd00b85dcfaee04b3, n = 1474560, CUDALucas v2.00 M( 26232301 )C, 0x[COLOR=lime]040e8dd990e95b17[/COLOR], n = 1474560, CUDALucas v2.00 M( 26240933 )C, 0x68d29225ff867aa5, n = 1474560, CUDALucas v2.00 M( 26296561 )C, 0x60db292b00734623, n = 1474560, CUDALucas v2.00 [/CODE] 2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile: |
[QUOTE=flashjh;294446]
[/CODE]2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile:[/QUOTE] Good luck with that. I expect someone will want it. |
[url]http://mersenne.org/report_exponent/?exp_lo=26232301[/url]
...How did you get the good result before the bad one? :huh: |
[QUOTE=flashjh;294446]I've had 4 2.00 successes and 1 mismatch. I don't know for sure what caused the mismatch, but I had a driver failure that Win7 recovered from, so that's probably it (I caused it by closing CuLu too soon after starting).[CODE]
M( 26232301 )C, 0x[COLOR=red]251f67a97a93197a[/COLOR], n = 1474560, CUDALucas v2.00 M( 26232803 )C, 0xd00b85dcfaee04b3, n = 1474560, CUDALucas v2.00 M( 26232301 )C, 0x[COLOR=lime]040e8dd990e95b17[/COLOR], n = 1474560, CUDALucas v2.00 M( 26240933 )C, 0x68d29225ff867aa5, n = 1474560, CUDALucas v2.00 M( 26296561 )C, 0x60db292b00734623, n = 1474560, CUDALucas v2.00 [/CODE] 2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile:[/QUOTE] could try ebay, $600+ is the current selling price. |
[QUOTE=Dubslow;294456][URL]http://mersenne.org/report_exponent/?exp_lo=26232301[/URL]
...How did you get the good result before the bad one? :huh:[/QUOTE] In my file it was bad then good; PrimeNet shows it differently, I don't know why. Basically I got the bad DC, so I ran a TC to get the good result. [QUOTE=kladner;294450]Good luck with that. I expect someone will want it.[/QUOTE] [QUOTE=bcp19;294459]could try ebay, $600+ is the current selling price.[/QUOTE] I'm sure it will sell; I can't list it until I get home. Ironically, I haven't even seen it yet and I already need to sell it. Here's hoping nVidia does [I]much[/I] better with GK110. |
[QUOTE=bcp19;294321]
Is it possible in future versions to 'clean up' the FFT selection?[/QUOTE] [code] CUFFT_Z2Z size= 1081344 time= 17.912073 msec CUFFT_Z2Z size= 1179648 time= 3.668577 msec CUFFT_Z2Z size= 1277952 time= 23.579775 msec CUFFT_Z2Z size= 1376256 time= 4.184882 msec CUFFT_Z2Z size= 1474560 time= 4.181284 msec CUFFT_Z2Z size= 1572864 time= 4.305398 msec CUFFT_Z2Z size= 1671168 time= 29.604942 msec CUFFT_Z2Z size= 1769472 time= 4.611794 msec CUFFT_Z2Z size= 1867776 time= 35.498314 msec CUFFT_Z2Z size= 1966080 time= 5.232563 msec CUFFT_Z2Z size= 2064384 time= 6.078244 msec CUFFT_Z2Z size= 2162688 time= 30.571829 msec CUFFT_Z2Z size= 2260992 time= 45.150196 msec CUFFT_Z2Z size= 2359296 time= 6.338501 msec CUFFT_Z2Z size= 2457600 time= 8.342790 msec CUFFT_Z2Z size= 2555904 time= 37.028366 msec CUFFT_Z2Z size= 2654208 time= 8.117778 msec CUFFT_Z2Z size= 2752512 time= 7.537333 msec CUFFT_Z2Z size= 2850816 time= 63.291031 msec CUFFT_Z2Z size= 2949120 time= 7.693821 msec [/code] Performance at multiples of 98304 is unstable. |
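For anyone wading through dumps like the one above, a small reader-side helper (not part of CUDALucas; the line format is copied from the benchmark output shown) can rank the benchmarked FFT sizes by time per transform point, so the stable lengths stand out:

```python
import re

# Matches lines like "CUFFT_Z2Z size= 1179648 time= 3.668577 msec",
# the format of the cufftbench output quoted above.
LINE = re.compile(r"CUFFT_Z2Z\s+size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")

def rank_fft_sizes(bench_output):
    """Return (size, msec) pairs, best first, sorted by time per 1M points."""
    rows = [(int(n), float(t)) for n, t in LINE.findall(bench_output)]
    return sorted(rows, key=lambda r: r[1] / (r[0] / 1048576))

sample = """
CUFFT_Z2Z size= 1081344 time= 17.912073 msec
CUFFT_Z2Z size= 1179648 time= 3.668577 msec
CUFFT_Z2Z size= 1277952 time= 23.579775 msec
CUFFT_Z2Z size= 1376256 time= 4.184882 msec
"""
ranked = rank_fft_sizes(sample)
```

Normalizing by size keeps a larger but proportionally faster FFT from losing to a small one on raw milliseconds alone.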
Another 2 successful DCs, CudaLucas v2.0, -f 1474560, for 26276197 and 26247433.
|
I just completed my first full run on CuLu (2.0) of a DC. Residue matched the original LL and Jerry's good DC. This gives me confidence to try some "real work" now.
[CODE]M( 26229943 )C, 0x769161872540121c, n = 1572864, CUDALucas v2.00[/CODE] |
[QUOTE=kladner;294514]I just completed my first full run on CuLu (2.0) of a DC. Residue matched the original LL and Jerry's good DC. This gives me confidence to try some "real work" now.
[CODE]M( 26229943 )C, 0x769161872540121c, n = 1572864, CUDALucas v2.00[/CODE][/QUOTE] Did you try -f 1474560? It will increase your throughput; just use -t and watch your error. |
[QUOTE=flashjh;294517]Did you try -f 1474560. It will increase your throughput, just use -t and watch your error.[/QUOTE]
No, I didn't. I'm still baffled by selecting FFT sizes. I messed around a little, but only succeeded in terminating the program. I've started another 26M DC. The automatic system set 1572864. I will try it with 1474560. Thanks for the suggestion! EDIT: This is weird. I have this command line: [CODE]start CUDALucas2.00.cuda3.2.sm_13.x64 -c 10000 -f 1474560 -t -s check -polite 1 -k worktodo.txt[/CODE] But CuLu continues to start with 1572864. |
[QUOTE=kladner;294524]No, I didn't. I'm still baffled by selecting FFT sizes. I messed around a little, but only succeeded in terminating the program.
I've started another 26M DC. The automatic system set 1572864. I will try it with 1474560. Thanks for the suggestion![/QUOTE] It will terminate with an FFT that causes too high an error; if you read back through the thread you can see some examples. CuLu also includes an FFT test: cudalucas.exe -cufftbench [I]start stop distance[/I] start: smallest FFT length to test stop: largest FFT length distance: step between FFT sizes (32768 is normal) Once you run that, you can take the fastest FFTs and run each one until you find a stable one. For me, all the FFTs that were faster than 1572864 caused errors except 1474560. So far with -f 1474560 -t -polite 0 I have had very good runs except one (which was my fault), but the triple check was good. Jerry |
Thanks Jerry, for the explanation. I'll go back and run -cufftbench and try to get a handle on things. But at the moment, it ignores -f for some reason. I'll get back to it in a bit.
|
Is there any kind of benchmark (or reference in the code) that gives a list of all suggested FFT sizes for a given exponent size? A quick glance at the source didn't show me where the lookup table is for the FFT size CUDALucas will start with by default. How can I find this (ideally for all possible exponents that it can handle)?
|
Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.
|
[QUOTE=James Heinrich;294529]Is there any kind of benchmark (or reference in the code) that gives a list of all suggested FFT sizes for a given exponent size? A quick glance at the source didn't jump out at me where the lookup table was for what FFT size CUDALucas will start with by default. How can I find this (ideally for all possible exponents that it can handle)?[/QUOTE]
As far as I know, right now you just pick a range around the exponent and run the cufft test to choose the best FFT for your card/system/exponent. I have not yet tested different thread sizes and the corresponding cufft test for different FFTs. When I get some time, I'll see what I can do. Look back through the thread for more FFT discussions. |
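In lieu of a lookup table, a rough rule of thumb can be pulled from this thread's own runs: M26232301 is stable at an FFT length of 1474560, which works out to about 17.8 bits per FFT word. Here is a sketch built on that single empirical data point (the constant is an assumption derived from the thread, not anything in the CUDALucas source):

```python
BITS_PER_WORD = 17.8  # empirical guess from M26232301 running stably at fft = 1474560

def min_fft_length(exponent, step=32768):
    """Smallest multiple of `step` holding at least exponent / BITS_PER_WORD words."""
    words = exponent / BITS_PER_WORD
    return (int(words) // step + 1) * step
```

This only brackets a starting candidate; the resulting length still has to be one the card runs quickly, so verify it with -cufftbench and watch the error with -t before committing.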
[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]
I've had the same question as James about Prime95. Where in the P95 source is the table of possible FFT lengths? 15-20 minutes digging around didn't turn up much. |
[QUOTE=Dubslow;294533]I've had the same question as James about Prime95. Where in the P95 source is the table of possible FFT lengths? 15-20 minutes digging around didn't turn up much.[/QUOTE]
We're getting a little off topic. P95 uses a table in mult.asm. The xjmptable is for SSE2, the yjmptable is for AVX. The table includes FFT size, maximum Mersenne exponent, estimated timing, mem used, which CPU architectures should use it, and some other stuff. |
[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]
Found a problem. [CODE]CUFFT_Z2Z size= 1146880 time= 1.417851 msec Y CUFFT_Z2Z size= 1179648 time= 1.390691 msec [/CODE] |
Looking at the multipliers, there are definite patterns. The multipliers 1, 3, 5, 7, 9, 21, 27, 45, 49 and 81 are always selected as preferred, except for one instance of 21 vs. 45: 1376256 is slower than 1474560. It might be worth re-benchmarking the following four to see if the results are consistent. [CODE]CUFFT_Z2Z size= 1376256 time= 1.818975 msec (21) CUFFT_Z2Z size= 1474560 time= 1.809079 msec Y (45) CUFFT_Z2Z size= 2752512 time= 3.812189 msec Y (21) CUFFT_Z2Z size= 2949120 time= 3.853927 msec Y (45) [/CODE] |
[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]
Rehashed to show a bit more about the FFT size, as well as axn's correction. Edit: Whoops, cross post. @axn: msft said CuLu can use any multiple of 32K, that's why I did as such. Edit2: Redone to show more lengths that are "reasonable", but not best. Those are marked with M. [code]CUFFT_Z2Z size= 1048576 = 1024K = 32*32K time= 1.130540 msec Y CUFFT_Z2Z size= 1146880 = 1120K = 35*32K time= 1.417851 msec M CUFFT_Z2Z size= 1179648 = 1152K = 36*32K time= 1.390691 msec Y CUFFT_Z2Z size= 1310720 = 1280K = 40*32K time= 1.533345 msec Y CUFFT_Z2Z size= 1376256 = 1344K = 42*32K time= 1.818975 msec M CUFFT_Z2Z size= 1474560 = 1440K = 45*32K time= 1.809079 msec Y CUFFT_Z2Z size= 1572864 = 1536K = 48*32K time= 1.937807 msec Y CUFFT_Z2Z size= 1605632 = 1568K = 49*32K time= 2.023415 msec Y CUFFT_Z2Z size= 1638400 = 1600K = 50*32K time= 2.217558 msec M CUFFT_Z2Z size= 1769472 = 1728K = 54*32K time= 2.141137 msec Y CUFFT_Z2Z size= 1835008 = 1792K = 56*32K time= 2.163136 msec Y CUFFT_Z2Z size= 1966080 = 1920K = 60*32K time= 2.700584 msec M CUFFT_Z2Z size= 2064384 = 2016K = 63*32K time= 2.551482 msec M CUFFT_Z2Z size= 2097152 = 2048K = 64*32K time= 2.409963 msec Y CUFFT_Z2Z size= 2293760 = 2240K = 70*32K time= 3.018234 msec M CUFFT_Z2Z size= 2359296 = 2304K = 72*32K time= 2.766602 msec Y CUFFT_Z2Z size= 2457600 = 2400K = 75*32K time= 3.627161 msec M CUFFT_Z2Z size= 2621440 = 2560K = 80*32K time= 3.239111 msec Y CUFFT_Z2Z size= 2654208 = 2592K = 81*32K time= 3.409978 msec Y CUFFT_Z2Z size= 2752512 = 2688K = 84*32K time= 3.812189 msec Y CUFFT_Z2Z size= 2949120 = 2880K = 90*32K time= 3.853927 msec Y CUFFT_Z2Z size= 3145728 = 3072K = 96*32K time= 4.029561 msec Y CUFFT_Z2Z size= 3211264 = 3136K = 98*32K time= 4.324980 msec Y CUFFT_Z2Z size= 3276800 = 3200K = 100*32K time= 4.702814 msec M CUFFT_Z2Z size= 3440640 = 3360K = 105*32K time= 4.934543 msec M CUFFT_Z2Z size= 3538944 = 3456K = 108*32K time= 4.573230 msec Y CUFFT_Z2Z size= 3670016 = 3584K = 112*32K time= 4.591721 msec Y CUFFT_Z2Z 
size= 3932160 = 3840K = 120*32K time= 5.395338 msec M CUFFT_Z2Z size= 4128768 = 4032K = 126*32K time= 5.436691 msec M CUFFT_Z2Z size= 4194304 = 4096K = 128*32K time= 5.049356 msec Y CUFFT_Z2Z size= 4423680 = 4320K = 135*32K time= 5.862155 msec M CUFFT_Z2Z size= 4587520 = 4480K = 140*32K time= 6.353941 msec M CUFFT_Z2Z size= 4718592 = 4608K = 144*32K time= 5.858453 msec Y CUFFT_Z2Z size= 4816896 = 4704K = 147*32K time= 7.085539 msec M CUFFT_Z2Z size= 4915200 = 4800K = 150*32K time= 7.661496 msec M [/code] [QUOTE=Prime95;294535]We're getting a little off topic. P95 uses a table in mult.asm. The xjmptable is for SSE2, the yjmptable is for AVX. The table includes FFT size, maximum Mersenne exponent, estimated timing, mem used, which CPU architectures should use it, and some other stuff.[/QUOTE] :ouch1: ...The former is 2800 lines. Did you write those all by hand? |
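The "size = NK = m*32K" annotations in the table above are mechanical to produce; a trivial sketch of the same decomposition:

```python
def annotate(size):
    """Format an FFT size the way the table above does, e.g. 1474560 -> '1440K = 45*32K'."""
    assert size % 32768 == 0, "CuLu FFT sizes are multiples of 32K"
    return f"{size // 1024}K = {size // 32768}*32K"
```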
[QUOTE=Dubslow;294539]Rehashed to show a bit more about the FFT size, as well as axn's correction.
Edit: Whoops, cross post. @axn: msft said CuLu can use any multiple of 32K, that's why I did as such. Edit2: Redone to show more lengths that are "reasonable", but not best. Those are marked with M. [/QUOTE] I did a similar exercise, this time normalizing the time by dividing it by (FFT/1048576). There is a clear pattern. Any multiplier that is 7-smooth yields decent (not necessarily preferred) performance. Anything that is not 7-smooth yields terrible performance. Something like 4x or worse. |
[QUOTE=axn;294542]I did a similar exercise, this time normalizing the time by dividing it by (FFT/1048576). There is a clear pattern. Any multiplier that is 7-smooth yields decent (not necessarily preferred) performance. Anything that is not 7-smooth yields terrible performance. Something like 4x or worse.[/QUOTE]
Could you post a chart of the multiplier's factorizations or do you want me to do it? |
[QUOTE=Dubslow;294544]Could you post a chart of the multiplier's factorizations or do you want me to do it?[/QUOTE]
[CODE] FFT Pref Mult Smooth Time (ms) Normalized 1048576 Y 1 1 1.1305 1.130 2097152 Y 1 1 2.4099 1.204 4194304 Y 1 1 5.0493 1.262 1572864 Y 3 3 1.9378 1.291 3145728 Y 3 3 4.0295 1.343 1310720 Y 5 5 1.5333 1.226 2621440 Y 5 5 3.2391 1.295 1835008 Y 7 7 2.1631 1.236 3670016 Y 7 7 4.5917 1.311 1179648 Y 9 3 1.3906 1.236 2359296 Y 9 3 2.7666 1.229 4718592 Y 9 3 5.8584 1.301 1441792 11 11 12.0507 8.764 2883584 11 11 25.2414 9.178 1703936 13 13 15.8089 9.728 3407872 13 13 32.9923 10.151 1966080 15 5 2.7005 1.440 3932160 15 5 5.3953 1.438 1114112 17 17 17.8324 16.783 2228224 17 17 23.8903 11.242 4456448 17 17 54.9814 12.936 1245184 19 19 14.2480 11.998 2490368 19 19 28.5571 12.024 4980736 19 19 65.0263 13.689 1376256 ? 21 7 1.8189 1.385 2752512 Y 21 7 3.8121 1.452 1507328 23 23 19.8118 13.782 3014656 23 23 40.1851 13.977 1638400 25 5 2.2175 1.419 3276800 25 5 4.7028 1.504 1769472 Y 27 3 2.1411 1.268 3538944 Y 27 3 4.5732 1.355 1900544 29 29 30.3831 16.763 3801088 29 29 61.4396 16.948 2031616 31 31 33.3520 17.213 4063232 31 31 67.5301 17.427 1081344 33 11 9.9185 9.618 2162688 33 11 18.5583 8.997 4325376 33 11 40.4085 9.796 1146880 35 7 1.4178 1.296 2293760 35 7 3.0182 1.379 4587520 35 7 6.3539 1.452 1212416 37 37 22.6343 19.575 2424832 37 37 45.7872 19.799 4849664 37 37 99.3098 21.472 1277952 39 13 12.9222 10.602 2555904 39 13 23.8400 9.780 1343488 41 41 27.0680 21.126 2686976 41 41 54.9051 21.426 1409024 43 43 29.3962 21.876 2818048 43 43 59.6049 22.178 1474560 Y 45 5 1.8090 1.286 2949120 Y 45 5 3.8539 1.370 1540096 47 47 33.5578 22.847 3080192 47 47 68.1485 23.199 1605632 Y 49 7 2.0234 1.321 3211264 Y 49 7 4.3249 1.412 1671168 51 17 18.4646 11.585 3342336 51 17 37.9425 11.903 1736704 53 53 13.0645 7.888 3473408 53 53 26.7417 8.072 1802240 55 11 15.3619 8.937 3604480 55 11 33.7740 9.825 1867776 57 19 22.4452 12.600 3735552 57 19 45.0705 12.651 1933312 59 59 15.6682 8.498 3866624 59 59 32.2185 8.737 1998848 61 61 17.2398 9.043 3997696 61 61 36.1076 9.470 2064384 63 7 
2.5514 1.295 4128768 63 7 5.4366 1.380 2129920 65 13 20.0319 9.861 4259840 65 13 43.8546 10.794 2195456 67 67 14.6807 7.011 4390912 67 67 31.1684 7.443 2260992 69 23 30.2865 14.045 4521984 69 23 62.9652 14.600 2326528 71 71 17.2002 7.752 4653056 71 71 36.2993 8.180 2392064 73 73 21.2844 9.330 4784128 73 73 44.9508 9.852 2457600 75 5 3.6271 1.547 4915200 75 5 7.6614 1.634 2523136 77 11 21.6817 9.010 2588672 79 79 20.6799 8.376 2654208 Y 81 3 3.4099 1.347 2719744 83 83 19.7181 7.602 2785280 85 17 31.6756 11.924 2850816 87 29 44.9787 16.543 2916352 89 89 20.1506 7.245 2981888 91 13 28.5019 10.022 3047424 93 31 49.9787 17.197 3112960 95 19 37.6290 12.675 3178496 97 97 26.7683 8.830 3244032 99 11 32.7558 10.587 3309568 101 101 23.0312 7.297 3375104 103 103 26.8451 8.340 3440640 105 7 4.9345 1.503 3506176 107 107 23.0966 6.907 3571712 109 109 30.6403 8.995 3637248 111 37 70.8848 20.435 3702784 113 113 26.2526 7.434 3768320 115 23 52.5471 14.621 3833856 117 13 38.0365 10.403 3899392 119 17 44.3925 11.937 3964928 121 11 54.2576 14.349 4030464 123 41 84.7203 22.041 4096000 125 5 6.4623 1.654 4161536 127 127 26.5980 6.701 4227072 129 43 91.9276 22.803 4292608 131 131 36.1097 8.820 4358144 133 19 52.7457 12.690 4423680 135 5 5.8621 1.389 4489216 137 137 40.8194 9.534 4554752 139 139 36.4544 8.392 4620288 141 47 104.9800 23.825 4685824 143 13 68.4353 15.314 4751360 145 29 79.7051 17.590 4816896 147 7 7.0855 1.542 4882432 149 149 38.9179 8.358 4947968 151 151 38.3244 8.121 [/CODE] |
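axn's rule is easy to check mechanically. A minimal sketch, with the timing values copied from a few rows of the table above:

```python
def largest_prime_factor(n):
    """Largest prime factor of n (returns 1 for n = 1)."""
    p, best = 2, 1
    while p * p <= n:
        while n % p == 0:
            best, n = p, n // p
        p += 1
    return max(best, n) if n > 1 else best

def is_7_smooth(n):
    return largest_prime_factor(n) <= 7

rows = [  # (fft_size, multiplier, time_msec), copied from the table above
    (1572864,  3,  1.9378),
    (1441792, 11, 12.0507),
    (1474560, 45,  1.8090),
    (1245184, 19, 14.2480),
]
# Normalize each time by (FFT / 1048576), as axn describes.
normalized = {mult: t / (fft / 1048576) for fft, mult, t in rows}
```

On these rows the 7-smooth multipliers (3, 45) normalize to about 1.3 ms while 11 and 19 land near 8.8 and 12.0, matching the 4x-or-worse gap axn reports.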
[QUOTE=kladner;294528]But at the moment, it ignores -f for some reason. I'll get back to it in a bit.[/QUOTE]
I just had to move the check files out of the folder. -f 474560 does run faster than the default 1572864. [CODE]474560 err = 0.09766 (0:50 real, 4.9958 ms/iter 1572864 err = 0.02148 , 5.3635 ms/iter[/CODE] It seems the fft could go smaller, but I'll have to read the part of the thread that's been posted since I started experimenting and writing about it. |
[QUOTE=kladner;294547]-f 474560 does run faster than the default 1572864. [CODE]474560
err = 0.09766 (0:50 real, 4.9958 ms/iter 1572864 err = 0.02148 , 5.3635 ms/iter[/CODE]It seems the fft could go smaller, but I'll have to read the part of the thread that's been posted since I started experimenting and writing about it.[/QUOTE] Oops. That is 1474560. So far the smallest that doesn't terminate on a GTX 460 with a 26M exponent. |
[QUOTE=kladner;294571]Oops. That is 1474560. So far the smallest that doesn't terminate on a GTX 460 with a 26M exponent.[/QUOTE]
That was the same for me on a 580 |
[QUOTE=kladner;294524]
EDIT: This is weird. <snip> But CuLu continues to start with 1572864.[/QUOTE] You have to delete the checkpoint files "cXXXXX" and "tXXXXX". If a checkpoint file exists it will always resume from where it left off, and the checkpoint files are not interchangeable: they carry the size of the FFT used. So, if a file with the old FFT size exists, it will use THAT size regardless of what -f you pass. You can appreciate, then, that if your job is more than 10-20% done, it is faster to let it finish with the old 1572864 (= 32768*48) and use 1474560 (= 32768*45) starting with the next exponent. Both sizes work well for the 26-27M range; the shorter one is between 10% and 30% faster depending on your card. Use the cufftbench option as explained before to check exactly for your card. |
[QUOTE=flashjh;294586]That was the same for me on a 580[/QUOTE]
Thanks again for throwing that out there. It made a difference for me. |
[code]
Processing result: M( 26768243 )C, 0x3280d4e28ef0b188, n = 1474560, CUDALucas v2.00 LL test successfully completes double-check of M26768243 [/code] |
Successful LLDC with CuLu:
[CODE]26158007  No factors below 2^69  P-1 B1=390000
Verified LL  B50D7F090E32331F  by "David Triggerson"
Verified LL  B50D7F090E32331F  by "ktony" on 2012-03-31
History  no factor for M26158007 from 2^67 to 2^68 [mfaktc 0.17-Win barrett79_mul32] by "lalera" on 2011-12-05
History  no factor for M26158007 from 2^68 to 2^69 [mfaktc 0.18-pre7 71bit_mul24] by "Luigi Morelli" on 2011-12-06
History  b50d7f090e3233__ by "ktony" on 2012-03-31[/CODE] |
I've been assigned a triple check of 28982959, where the first two checks mismatched.
I've run the check twice with different FFT lengths (and -t both times) and the residues from both runs match. Could someone run it through P95? Thanks, Andriy |
[QUOTE=apsen;295165]I've been assigned triple check and got mismatch with the first two checks for 28982959.
I've run the check twice with different FFT lengths (and -t both times) and got all residues match. Could someone run it through P95? Thanks, Andriy[/QUOTE] I'll run it. Will take a few days. |