mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

flashjh 2012-03-24 03:28

[QUOTE=Dubslow;293966]Wait, whoops...
When you reported the result, you lost the assignment key and it was reassigned. I'd rather not poach...

Edit: Repeat for emphasis: If you want me to do a quick DC with Prime95, you must not submit the result to PrimeNet, so that we retain control of the exponent. (Yes, that does mean checking the residue for a match before submitting.) Sorry about that flash.[/QUOTE]
No, my fault. I should have checked before I submitted. I've gotten used to everything matching, so I didn't check first. I'll let my CuLu finish, but you're right. Next time...

LaurV 2012-03-24 11:12

It took me a while and a few re-runs, but I ended up with two good residues with v1.69. :smile: ([SIZE=2]26251817 and 26240761[/SIZE]).

After a lot of experimenting I reached the conclusion that you must always use -t (that means: IT IS A MUST), just to be on the safe side. For "production" runs it then makes no difference whether you use polite or aggressive mode: checking the sums/errors at every iteration works much like the polite memory "trick" and gives the GPU a break of about 20%, so with polite the GPU is only about 79% busy, and with aggressive it becomes about 81% busy. In both cases -t costs the most, in both cases -t is necessary to be on the safe side (otherwise you will be sorry at the end when the residues don't match), and in both cases the computer stays responsive enough (this is good!) for daily work and moderately demanding graphics applications.

If you need more output, the next step is to disable -t and enable aggressive mode [B]at the same time[/B]. The GPU load then goes to 98-99% and you WILL get 25% more output (from 100 down to 80 is 20%, but from 80 back up to 100 is 25% :P), but your computer runs hotter, louder, and much less responsively (assuming the card is also used as the primary graphics adapter), and you lose confidence in the result. For DC it could be OK, if you can afford it, because the earlier residue is on PrimeNet and you can check your result against it. But it is still not recommended. For [B]first-time LL, running without -t would be a BIG mistake[/B], unless you are sure, but SURE, objectively (not subjectively, like "my card is the best because it's mine!"), that your card is very stable, does not produce hardware errors, does not overheat, etc.

Much better: leave -t on, and when you really, really want to maximize your GPU, add one copy of mfaktc. That way you earn some nice credit too :D

James Heinrich 2012-03-24 12:57

[QUOTE=James Heinrich;293952]I'd like to put together a CUDAlucas performance comparison chart[/QUOTE]Thanks to those who have submitted data, but I need more data points, please. :smile:

After looking over a few benchmark results, I'm going to standardize and ask that everyone submit results using v1.69 on three specific exponents:[code]CUDAlucas -polite 0 26214400
CUDAlucas -polite 0 52428800
CUDAlucas -polite 0 78643200[/code]And (important), I need to know what FFT size was used. You may see it start with a smaller FFT size at first and then move up if the error is too high:[quote]C:\Prime95\cudalucas>CUDALucas_169_20 -polite 0 26214400

[color=red]start M26214400 fft length = 1310720
iteration = 22 < 1000 && err = 0.26196 >= 0.25, increasing n from 1310720[/color]

[color=blue]start M26214400 fft length = 1572864[/color]
[color=gray]Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69
err = 0.02403 (0:31 real, 3.0623 ms/iter, ETA 22:17:12)[/color]
[b]Iteration 20000[/b] M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69
err = 0.02403 (0:30 real, [b]3.0247 ms/iter[/b], ETA 22:00:15)[/quote]For consistency, I'm using the timing data as reported on iteration 20000. So for anyone willing to run (or re-run) benchmark data for me, please:
* use v1.69 ([url=http://www.mersenneforum.org/showpost.php?p=293735&postcount=1062]Windows binaries here[/url])
* use the exact 3 command lines above
* send me the output from the start through iteration 20000 (as in the example above).

msft 2012-03-24 13:40

[QUOTE=LaurV;294019]you have to use -t always (that means: IS A MUST), just to be on the safe side.[/QUOTE]
Good point.
I experimented with cudaMemcpyAsync(),
but it was slow.

Prime95 2012-03-24 14:18

[QUOTE=msft;294028]Good point.
I experimented with cudaMemcpyAsync(),
but it was slow.[/QUOTE]

The -t option doesn't have to copy g_error to the CPU every iteration. It could copy every 10th, or 100th, or whatever. Just make sure you check g_error before writing a new save file.

msft 2012-03-24 14:49

[QUOTE=Prime95;294030]The -t option doesn't have to copy g_error to the CPU every iteration. It could copy every 10th, or 100th, or whatever. Just make sure you check g_error before writing a new save file.[/QUOTE]
Yes, yes. Testing now.
[code]
Iteration 80000 M( 86243 )C, 0x871aac1149a65db1, n = 4608, CUDALucas v2.00 err = 0.01172 (0:17 real, 1.7138 ms/iter, ETA 0:00)
M( 86243 )C, 0x0000000000000000, n = 4608, CUDALucas v2.00
[/code]:lol:

flashjh 2012-03-24 16:57

[QUOTE=flashjh;293964]Cool, thanks :smile:. I'll post my CuLu re-run results when it's done...

Edit: I attached the full run test (minus the last residue)[/QUOTE]
Based on my second run, the original result stands, so the P95 DC will come back correct.

M( 26229943 )C, 0x76916187254012__, n = 1474560, CUDALucas v1.69

flashjh 2012-03-24 16:59

I logged in to compile v2.0, but it's gone. Where did it go, msft?

msft 2012-03-24 17:18

1 Attachment(s)
[QUOTE=flashjh;294044]I logged in to compile v2.0, but it's gone. Where did it go, msft?[/QUOTE]
Sorry, I found a fatal error.

Ver 2.00
1) Sped up the -t option.
2) Save files now use the name format "sEXPONENT.ITERATION.RESIDUE.txt".
[code]
$ ./CUDALucas -polite 0 26974951
Iteration 23300000 M( 26974951 )C, 0x31b4d280a170995a, n = 1474560, CUDALucas v2.00 err = 0.1797 (0:56 real, 5.6171 ms/iter, ETA 5:43:34)
$ ./CUDALucas -polite 0 26974951 -t
Iteration 23320000 M( 26974951 )C, 0x537f9e116a703252, n = 1474560, CUDALucas v2.00 err = 0.207 (0:56 real, 5.6250 ms/iter, ETA 5:42:11)
[/code]

bcp19 2012-03-24 17:23

Does anyone have a link to the 4.1 cudart64 and cufft64 DLLs? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.

ET_ 2012-03-24 17:26

[QUOTE=James Heinrich;294025]Thanks to those who have submitted data, but I need more data points, please. :smile:

After looking over a few benchmark results, I'm going to standardize and ask that everyone submit results using v1.69 on three specific exponents:[code]CUDAlucas -polite 0 26214400
CUDAlucas -polite 0 52428800
CUDAlucas -polite 0 78643200[/code]And (important), I need to know what FFT size was used. You may see it start with a smaller FFT size at first and then move up if the error is too high:For consistency, I'm using the timing data as reported on iteration 20000. So for anyone willing to run (or re-run) benchmark data for me, please:
* use v1.69 ([url=http://www.mersenneforum.org/showpost.php?p=293735&postcount=1062]Windows binaries here[/url])
* use the exact 3 command lines above
* send me the output from the start through iteration 20000 (as in the example above).[/QUOTE]

I am using a GTX275, CUDA toolkit 3.0, cc 1.3.

Here are my benchmarks:

[code]
luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 26214400

start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.287598 >= 0.25, increasing n from 1310720

start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.04517 (4:56 real, 29.6005 ms/iter, ETA 215:25:33)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.04517 (4:54 real, 29.4113 ms/iter, ETA 213:58:00)

---

luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 52428800

start M52428800 fft length = 2621440
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 2621440

start M52428800 fft length = 3145728
Iteration 10000 M( 52428800 )C, 0x3ceee1cc01747326, n = 3145728, CUDALucas v1.69 err = 0.05469 (9:09 real, 54.8493 ms/iter, ETA 798:30:51)
Iteration 20000 M( 52428800 )C, 0x9281347573ff62eb, n = 3145728, CUDALucas v1.69 err = 0.05469 (9:00 real, 53.9812 ms/iter, ETA 785:43:32)

---

luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 78643200

start M78643200 fft length = 3932160
iteration = 20 < 1000 && err = 0.25 >= 0.25, increasing n from 3932160

start M78643200 fft length = 4194304
iteration = 25 < 1000 && err = 0.339844 >= 0.25, increasing n from 4194304

start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v1.69 err = 0.07617 (13:12 real, 79.2440 ms/iter, ETA 1730:49:15)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v1.69 err = 0.07617 (13:16 real, 79.5197 ms/iter, ETA 1736:37:17)

[/code]

The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown).

Luigi

