mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

chalsall 2012-03-22 21:29

[QUOTE=apsen;293792]Use cygwin :smile:
[CODE][I]cmd [/I]| tee [-a] [I]log.file[/I][/CODE][/QUOTE]

Indeed...

If I may be a little "catty" here, it constantly blows my mind that the most popular Operating System in the world doesn't let the Power User do the most basic of things at the Command Line.

But then, if popular meant good, McDonald's and KFC would be gourmet food... :wink:

Dubslow 2012-03-22 22:03

FRIED CHICKEN

(American Classic)

Consider that much of the awesomeness of GNU is the number of programs that can be easily and powerfully used with Bash. As far as I can tell, there isn't much that Bash (meaning specifically only bash) can do that cmd.exe can't do.

chalsall 2012-03-22 22:18

[QUOTE=Dubslow;293796]As far as I can tell, there isn't much that Bash (meaning specifically only bash) can do that cmd.exe can't do.[/QUOTE]

Open the Windows Command Prompt.

CD into a directory.

"grep" for a particular string within the files.

"sort" the results.

Then run that through a regex (sed, Perl, et al) to transform it into what you want...

Please tell us what, [I][U]exactly[/U][/I], do we have to type into cmd.exe to do that.
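For readers following along, the pipeline being described might be sketched like this under bash (the directory, sample files, and pattern are invented purely for illustration):

```shell
# Hypothetical sketch of the pipeline described above:
# grep for a string in files, sort the hits, then transform them with sed.
mkdir -p demo && cd demo
printf 'boot: ERROR: disk full\n' > f1.log   # made-up sample data
printf 'auth: ERROR: bad login\n' > f2.log

grep -r "ERROR" . \
  | sort \
  | sed 's/^.*ERROR: //'   # keep only the text after the last "ERROR: "
```

Each stage is a separate program; the shell merely wires them together with pipes.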

Dubslow 2012-03-22 22:40

Grep and sort are external programs, not bash features/utilities. It is in theory possible to compile their source code in MSVS and use them with cmd.exe. I'm not defending Windows, just pointing out both the modularity and completeness(/power) of the GNU system.
Edit: IPC is one of the many strong points of *nix. (And yes, in this case IPC is implemented by bash with pipes, etc...) (I have a source here I can reference as soon as I get home.)

chalsall 2012-03-22 22:52

[QUOTE=Dubslow;293804]Grep and sort are external programs, not bash features/utilities. It is in theory possible to compile their source code in MSVS and use them with cmd.exe. I'm not defending Windows, just pointing out both the modularity and completeness(/power) of the GNU system.[/QUOTE]

OK.

Then let's step back...

How does one, at the Windows Command Prompt, issue a command and send the results to both a file and the console?

Let me make this easy for you...

Make the results of the command "ping 8.8.8.8" appear both at the console, and in a file.

Dubslow 2012-03-22 22:59

[QUOTE=chalsall;293808]OK.

Then let's step back...

How does one, at the Windows Command Prompt, issue a command and send the results to both a file and the console?

Let me make this easy for you...

Make the results of the command "ping 8.8.8.8" appear both at the console, and in a file.[/QUOTE]

Indeed... I went back and edited my previous post right before you posted, mentioning something related... While I admit I have no clue how to do that with bash, I wouldn't be surprised if it isn't possible with cmd.exe.


Edit: Hurg... my brother took so damn long in the hardware store... exactly 60 minutes since my previous post... here's that citation I was talking about:
[url]http://www.catb.org/~esr/writings/taoup/html/ch03s01.html[/url]

chalsall 2012-03-22 23:14

[QUOTE=Dubslow;293811]...while I admit I have no clue how to do that with bash, I wouldn't be surprised if it isn't possible with cmd.exe.[/QUOTE]

As apsen points out, it is what Power Users are used to doing under Unix every day... The "tee" program in a pipeline....
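For anyone unfamiliar with it, the tee idiom being referred to looks something like this (a sketch; the host and log file name are just examples):

```shell
# Send a command's output to the console AND to a file at the same time.
ping -c 4 8.8.8.8 | tee ping.log   # -c 4: stop after 4 packets (GNU ping)
# "tee -a ping.log" would append to the log instead of overwriting it.
```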

sdbardwick 2012-03-22 23:41

Not exactly on point, but Windows 7 (and Server 2008R2) install powershell by default, which includes tee. So,[CODE]powershell
ping 127.0.0.1 | tee pinged.txt[/CODE]gives the desired result.

chalsall 2012-03-23 00:01

[QUOTE=sdbardwick;293814]Not exactly on point, but Windows 7 (and Server 2008R2) install powershell by default, which includes tee. So,[CODE]powershell
ping 127.0.0.1 | tee pinged.txt[/CODE]gives the desired result.[/QUOTE]

That's good to know.

Just wondering... Does "powershell" include Perl?

sdbardwick 2012-03-23 00:30

I am not a PowerShell expert (or even a regular user); I've picked up some of its capabilities while searching for solutions to specific problems.
With that in mind, I'm going to say no; the programming functionality is akin to C#, as PowerShell is .NET-based.

chalsall 2012-03-23 00:49

[QUOTE=sdbardwick;293824]With that in mind, I'm going to say no; programming functionality is akin to C#, as PowerShell is .Net based.[/QUOTE]

Is that anything like "Lie back and think of England."?

sdbardwick 2012-03-23 01:02

Heh. More like "Turn your head and cough".
PowerShell is, in fact, rather powerful. But it focuses on system administration rather than general purpose programming.

flashjh 2012-03-23 03:33

Success for 1.69[CODE]Processing result: M( 26275721 )C, 0x8c6e4dc676cb816a, n = 1474560, CUDALucas v1.69
LL test successfully completes double-check of M26275721
[/CODE]

cheesehead 2012-03-23 07:18

OT
 
[QUOTE=chalsall;293795]Indeed...

If I may be a little "catty" here, it constantly blows my mind that the most popular Operating System in the world doesn't let the Power User do the most basic of things at the Command Line.[/QUOTE]The MPOSITW didn't become the most popular by catering to Power Users (or by keeping track of all its memory allocations).

apsen 2012-03-23 13:31

Wow, look what I have started :smile:

In the midst of it I hardly even noticed the 1.69 success report.

But to stay off topic: PowerShell is extremely powerful, but it is not portable and requires a somewhat different way of thinking. I looked into it, but since I already have Cygwin I was too lazy to get used to it. Anyway, I did write a couple of scripts that were easier to implement with it than with Perl and Cygwin.

Andriy

James Heinrich 2012-03-23 22:12

I'd like to put together a CUDALucas performance comparison chart, similar to what I have for [url=http://mersenne-aries.sili.net/mfaktc.php]mfaktc[/url]. I'll need a wider variety of data samples than the single one I have from my own system, so it would be great if anyone reading this thread could fire up CUDALucas and PM/email me some iteration times for a variety of exponent sizes (at least 25M, 50M and 75M). If possible, a variety of CUDALucas versions would also be interesting. Naturally I'd also need to know what GPU you're using (and at what clock speed, if overclocked (whether factory or by yourself)).

flashjh 2012-03-23 23:48

[QUOTE=James Heinrich;293952]I'd like to put together a CUDALucas performance comparison chart, similar to what I have for [URL="http://mersenne-aries.sili.net/mfaktc.php"]mfaktc[/URL]. I'll need a wider variety of data samples than the single one I have from my own system, so it would be great if anyone reading this thread could fire up CUDALucas and PM/email me some iteration times for a variety of exponent sizes (at least 25M, 50M and 75M). If possible, a variety of CUDALucas versions would also be interesting. Naturally I'd also need to know what GPU you're using (and at what clock speed, if overclocked (whether factory or by yourself)).[/QUOTE]


I'll compile some data for you. It may take me a few days to get it all.

On a bad note, I had a 1.69 mismatch today. I'm re-running the exponent to see if the original is bad, but it will need a P95 run anyway. Anyone want to run it? If not, I'll start P95 on it Saturday when my other P95 DC completes.

M( 26229943 )C, 0x2bb14485101dd1__, n = 1474560, CUDALucas v1.69

Dubslow 2012-03-23 23:52

How long would your run take?

flashjh 2012-03-23 23:55

[QUOTE=Dubslow;293958]How long would your run take?[/QUOTE]

My CuLu DC will be done in 14 hours. Since it's a DC for me, I'll run it on a slower laptop; the P95 run will take about a week. Can you run it faster?

Dubslow 2012-03-23 23:58

Yeah, I think I can do it in a few days. I'll run it now.

flashjh 2012-03-24 00:01

1 Attachment(s)
[QUOTE=Dubslow;293962]Yeah, I think I can do it in a few days. I'll run it now.[/QUOTE]
Cool, thanks :smile:. I'll post my CuLu re-run results when it's done...

Edit: I attached the full run test (minus the last residue)

Dubslow 2012-03-24 00:20

Wait, whoops...
When you reported the result, you lost the assignment key and it was reassigned. I'd rather not poach...

Edit: Repeat for emphasis: If you want me to do a quick DC with Prime95, you must not submit the result to PrimeNet, so that we retain control of the exponent. (Yes, that does mean checking the residue for a match before submitting.) Sorry about that flash.

flashjh 2012-03-24 03:28

[QUOTE=Dubslow;293966]Wait, whoops...
When you reported the result, you lost the assignment key and it was reassigned. I'd rather not poach...

Edit: Repeat for emphasis: If you want me to do a quick DC with Prime95, you must not submit the result to PrimeNet, so that we retain control of the exponent. (Yes, that does mean checking the residue for a match before submitting.) Sorry about that flash.[/QUOTE]
No, my fault. I should have checked before I submitted. I've gotten used to everything matching, so I didn't check first. I'll let my CuLu finish, but you're right. Next time...

LaurV 2012-03-24 11:12

It took me a while and a few re-runs, but I ended up with two good residues with v1.69. :smile: (26251817 and 26240761).

After a lot of experimenting I have reached the conclusion that you must always use -t (that is: IT IS A MUST), just to be on the safe side. In that case, for "production", polite versus aggressive makes little difference. Checking the sums/errors at every iteration works much like the polite "trick" with the memory: it gives the GPU a break about 20% of the time, so with polite the GPU is only 79% busy, and with aggressive it becomes 81% busy. In both cases -t dominates, in both cases -t is necessary to be on the safe side (otherwise you will be sorry at the end when the residues don't match), and in both cases the computer stays responsive enough (which is good!) for daily work and moderately hungry graphics applications. If you need more output, the next step is to disable -t and enable aggressive mode [B]at the same time[/B]. In that case the GPU load goes to 98-99% and you WILL get 25% more output (from 100 to 80 is 20%, but from 80 to 100 it is 25% :P), but your computer is hotter, louder, much less responsive (assuming the card is also used as the primary graphics card) and you lose confidence in the result. For a DC that could be OK, if you can afford it, because you have the former residue on PrimeNet and can check your result. But it is still not recommended. For a [B]first-time LL, running without -t would be a BIG mistake[/B], unless you are sure, but SURE, objectively, not subjectively (like "my card is the best because it's mine!"), that your card is a very stable one, does not produce hardware errors, does not get hot, etc.

It is much better to leave -t on, and when you really, really want to maximize your GPU, add one copy of mfaktc. This way you can earn nice credit too :D

James Heinrich 2012-03-24 12:57

[QUOTE=James Heinrich;293952]I'd like to put together a CUDALucas performance comparison chart[/QUOTE]Thanks to those who have submitted data, but I need more data points, please. :smile:

After looking over a few benchmark results, I'm going to standardize and ask that everyone submit results using v1.69 on three specific exponents:[code]CUDAlucas -polite 0 26214400
CUDAlucas -polite 0 52428800
CUDAlucas -polite 0 78643200[/code]And (important), I need to know what FFT size was used. You may see it start with a smaller FFT size at first and then move up if the error is too high:[quote]C:\Prime95\cudalucas>CUDALucas_169_20 -polite 0 26214400

[color=red]start M26214400 fft length = 1310720
iteration = 22 < 1000 && err = 0.26196 >= 0.25, increasing n from 1310720[/color]

[color=blue]start M26214400 fft length = 1572864[/color]
[color=gray]Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69
err = 0.02403 (0:31 real, 3.0623 ms/iter, ETA 22:17:12)[/color]
[b]Iteration 20000[/b] M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69
err = 0.02403 (0:30 real, [b]3.0247 ms/iter[/b], ETA 22:00:15)[/quote]For consistency, I'm using the timing data as reported on iteration 20000. So for anyone willing to run (or re-run) benchmark data for me, please:
* use v1.69 ([url=http://www.mersenneforum.org/showpost.php?p=293735&postcount=1062]Windows binaries here[/url])
* use the exact 3 command lines above
* send me the output from the start through iteration 20000 (as in the example above).

msft 2012-03-24 13:40

[QUOTE=LaurV;294019]you have to use -t always (that means: IS A MUST), just to be on the safe side.[/QUOTE]
Good point.
I experimented with cudaMemcpyAsync(), but it was slow.

Prime95 2012-03-24 14:18

[QUOTE=msft;294028]Good point.
I experimented with cudaMemcpyAsync(), but it was slow.[/QUOTE]

The -t option doesn't have to copy g_error to the CPU every iteration. It could copy every 10th, or 100th, or whatever. Just make sure you check g_error before writing a new save file.

msft 2012-03-24 14:49

[QUOTE=Prime95;294030]The -t option doesn't have to copy g_error to the CPU every iteration. It could copy every 10th, or 100th, or whatever. Just make sure you check g_error before writing a new save file.[/QUOTE]
Yes, yes. Now testing:
[code]
Iteration 80000 M( 86243 )C, 0x871aac1149a65db1, n = 4608, CUDALucas v2.00 err = 0.01172 (0:17 real, 1.7138 ms/iter, ETA 0:00)
M( 86243 )C, 0x0000000000000000, n = 4608, CUDALucas v2.00
[/code]:lol:

flashjh 2012-03-24 16:57

[QUOTE=flashjh;293964]Cool, thanks :smile:. I'll post my CuLu re-run results when it's done...

Edit: I attached the full run test (minus the last residue)[/QUOTE]
The original result is correct based on my second run, so the P95 DC will match.

M( 26229943 )C, 0x76916187254012__, n = 1474560, CUDALucas v1.69

flashjh 2012-03-24 16:59

I logged in to compile v2.0, but it's gone? Where did it go, msft?

msft 2012-03-24 17:18

1 Attachment(s)
[QUOTE=flashjh;294044]I logged in to complie v2.0, it's gone? Where did it go msft?[/QUOTE]
Sorry, I found a fatal error.

Ver 2.00:
1) Sped up the -t option.
2) Save files now use the name "sEXPONENT.ITERATION.RESIDUE.txt".
[code]
$ ./CUDALucas -polite 0 26974951
Iteration 23300000 M( 26974951 )C, 0x31b4d280a170995a, n = 1474560, CUDALucas v2.00 err = 0.1797 (0:56 real, 5.6171 ms/iter, ETA 5:43:34)
$ ./CUDALucas -polite 0 26974951 -t
Iteration 23320000 M( 26974951 )C, 0x537f9e116a703252, n = 1474560, CUDALucas v2.00 err = 0.207 (0:56 real, 5.6250 ms/iter, ETA 5:42:11)
[/code]

bcp19 2012-03-24 17:23

Does anyone have a link to the 4.1 cudart64 and cufft64 dll's? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.

ET_ 2012-03-24 17:26

[QUOTE=James Heinrich;294025]Thanks to those who have submitted data, but I need more data points, please. :smile:

After looking over a few benchmark results, I'm going to standardize and ask that everyone submit results using v1.69 on three specific exponents:[code]CUDAlucas -polite 0 26214400
CUDAlucas -polite 0 52428800
CUDAlucas -polite 0 78643200[/code]And (important), I need to know what FFT size was used. You may see it start with a smaller FFT size at first and then move up if the error is too high:For consistency, I'm using the timing data as reported on iteration 20000. So for anyone willing to run (or re-run) benchmark data for me, please:
* use v1.69 ([url=http://www.mersenneforum.org/showpost.php?p=293735&postcount=1062]Windows binaries here[/url])
* use the exact 3 commandlines above
* send me the output from start to 20000 iteration (as the above example).[/QUOTE]

I am using a GTX275, CUDA toolkit 3.0, cc 1.3.

Here are my benchmarks:

[code]
luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 26214400

start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.287598 >= 0.25, increasing n from 1310720

start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.04517 (4:56 real, 29.6005 ms/iter, ETA 215:25:33)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.04517 (4:54 real, 29.4113 ms/iter, ETA 213:58:00)

---

luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 52428800

start M52428800 fft length = 2621440
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 2621440

start M52428800 fft length = 3145728
Iteration 10000 M( 52428800 )C, 0x3ceee1cc01747326, n = 3145728, CUDALucas v1.69 err = 0.05469 (9:09 real, 54.8493 ms/iter, ETA 798:30:51)
Iteration 20000 M( 52428800 )C, 0x9281347573ff62eb, n = 3145728, CUDALucas v1.69 err = 0.05469 (9:00 real, 53.9812 ms/iter, ETA 785:43:32)

---

luigi@luigi-desktop:~/luigi/CUDA/cudaLucas/test/cudalucas.1.69$ ./CUDALucas -polite 0 78643200

start M78643200 fft length = 3932160
iteration = 20 < 1000 && err = 0.25 >= 0.25, increasing n from 3932160

start M78643200 fft length = 4194304
iteration = 25 < 1000 && err = 0.339844 >= 0.25, increasing n from 4194304

start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v1.69 err = 0.07617 (13:12 real, 79.2440 ms/iter, ETA 1730:49:15)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v1.69 err = 0.07617 (13:16 real, 79.5197 ms/iter, ETA 1736:37:17)

[/code]

The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown).

Luigi

flashjh 2012-03-24 18:05

1 Attachment(s)
[QUOTE=msft;294046]Sorry find fatal error.

Ver 2.00
1) Speed up with -t option.
2) use "sEXPONENT.ITERATION.RESIDUE.txt"
[code]
$ ./CUDALucas -polite 0 26974951
Iteration 23300000 M( 26974951 )C, 0x31b4d280a170995a, n = 1474560, CUDALucas v2.00 err = 0.1797 (0:56 real, 5.6171 ms/iter, ETA 5:43:34)
$ ./CUDALucas -polite 0 26974951 -t
Iteration 23320000 M( 26974951 )C, 0x537f9e116a703252, n = 1474560, CUDALucas v2.00 err = 0.207 (0:56 real, 5.6250 ms/iter, ETA 5:42:11)
[/code][/QUOTE]

2.00 x64 Binaries (untested):

3.2 / sm 1.3
4.0 / sm 2.0
4.1 / sm 2.0

James Heinrich 2012-03-24 19:33

[QUOTE=ET_;294049]I am using a GTX275, CUDA toolkit 3.0, cc 1.3. ... The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown).[/QUOTE]These timings seem odd. For 52M I'd expect timings (based on other data I've seen) of somewhere around 11ms, not 54ms.

flashjh 2012-03-24 20:46

Wow
 
Initial testing shows 2.00 with the new -t handling is ~20% faster (CUDA 3.2 / sm 1.3)!

[QUOTE]e:\cuda2\cuda -d 1 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301 >> 26232301.txt[/QUOTE]

1.69[CODE]
Iteration 14010000 M( 26232301 )C, 0xdcf162b969f4b93f, n = 1474560, CUDALucas v1.69 err = 0.125 (0:25 real, 2.5782 ms/iter, ETA 8:45:06)
Iteration 14020000 M( 26232301 )C, 0x0280e1e19768d6c5, n = 1474560, CUDALucas v1.69 err = 0.125 (0:25 real, 2.5630 ms/iter, ETA 8:41:33)
Iteration 14030000 M( 26232301 )C, 0xdf3cb8472cf8663e, n = 1474560, CUDALucas v1.69 err = 0.125 (0:26 real, 2.5630 ms/iter, ETA 8:41:08)
Iteration 14040000 M( 26232301 )C, 0x76a8dff0761ecaac, n = 1474560, CUDALucas v1.69 err = 0.125 (0:26 real, 2.5692 ms/iter, ETA 8:41:58)
[/CODE]
2.00[CODE]
continuing work from a partial result M26232301 fft length = 1474560 iteration = 14040001
Iteration 14050000 M( 26232301 )C, 0xf64c760ed90b27d4, n = 1474560, CUDALucas v2.00 err = 0.1094 (0:21 real, 2.1435 ms/iter, ETA 7:15:07)
Iteration 14060000 M( 26232301 )C, 0xf2b2b558215ee274, n = 1474560, CUDALucas v2.00 err = 0.1172 (0:22 real, 2.1427 ms/iter, ETA 7:14:37)
Iteration 14070000 M( 26232301 )C, 0x72fd99a3b37b01ef, n = 1474560, CUDALucas v2.00 err = 0.1172 (0:21 real, 2.1432 ms/iter, ETA 7:14:21)
Iteration 14080000 M( 26232301 )C, 0x29a7604f2a950ae8, n = 1474560, CUDALucas v2.00 err = 0.1172 (0:22 real, 2.1277 ms/iter, ETA 7:10:51)
[/CODE]

flashjh 2012-03-24 20:59

[QUOTE=bcp19;294048]Does anyone have a link to the 4.1 cudart64 and cufft64 dll's? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.[/QUOTE]
Here you go, let me know if you need a different file.

[URL="http://www.sendspace.com/file/qyyqxl"]Link[/URL]

ET_ 2012-03-24 22:04

[QUOTE=James Heinrich;294061]These timings seem odd. For 52M I'd expect timings (based on other data I've seen) somewhere around 11ms, not 54ms.[/QUOTE]

I get 23ms with v1.3

Now I upgraded to Ubuntu 11.04, and have new drivers and toolkit to install. I'll let you know tomorrow how they perform.:smile:

Luigi

bcp19 2012-03-24 22:19

[QUOTE=flashjh;294069]Here you go, let me know if you need a different file.

[URL="http://www.sendspace.com/file/qyyqxl"]Link[/URL][/QUOTE]

That is what I needed, but 3.2 still seems fastest. Is this normal or do certain cards work better with 4.0/4.1?

Brain 2012-03-24 23:03

[QUOTE=bcp19;294048]Does anyone have a link to the 4.1 cudart64 and cufft64 dll's? I tested 3.2 and 4.0 on one GPU so far, and 3.2 is faster, so I wanted to check 4.1 as well. Thanks.[/QUOTE]

[URL="http://home.htp-tel.de/shornbostel/"]http://home.htp-tel.de/shornbostel/[/URL]

LaurV 2012-03-25 06:07

Another DC success and another DC failure for v1.69 (26242253 and 26269081, respectively); the first has been reported, the second is running a TC. That makes the score 3 to 1. With the former version I had 12 to 2. The mismatches were caused by hardware (OC, memory, playing around too much, whatever).

Switching to v2.0. I will resume the work currently in progress under 1.69 from a few checkpoints back, to see what's going on and whether I get the same residues.

In theory I am now able to build my own exe, but until I'm confident in my tinkering I prefer to use the one provided by flashjh, as it is known to be well compiled and to run without issues.

Karl M Johnson 2012-03-25 08:35

[CODE]start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v1.69 err = 0.04327 (1:19 real, 7.9267 ms/iter, ETA 173:07:58)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v1.69 err = 0.04785 (1:18 real, 7.7783 ms/iter, ETA 169:52:09)
Iteration 30000 M( 78643200 )C, 0xe0b1e59a43b7098b, n = 4718592, CUDALucas v1.69 err = 0.04785 (1:19 real, 7.8894 ms/iter, ETA 172:16:24)[/CODE]

Mean[{7.9267,7.7783,7.8894}] = 7.8648



[CODE]Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v2.00 err = 0.04327 (1:13 real, 7.2161 ms/iter, ETA 157:36:44)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v2.00 err = 0.04785 (1:11 real, 7.1533 ms/iter, ETA 156:13:09)
Iteration 30000 M( 78643200 )C, 0xe0b1e59a43b7098b, n = 4718592, CUDALucas v2.00 err = 0.04785 (1:12 real, 7.1515 ms/iter, ETA 156:09:39)[/CODE]

Mean[{7.2161,7.1533,7.1515}] = 7.17363


Avg. 8.78% faster.
Very good result.

ET_ 2012-03-25 19:16

[QUOTE=ET_;294049]I am using a GTX275, CUDA toolkit 3.0, cc 1.3.

Here are my benchmarks:

The timings were higher than with v1.3, and my computer was nearly unusable (with v1.3 there was no apparent slowdown).

Luigi[/QUOTE]

Never mind... I just upgraded my system to Ubuntu 11.04, CUDA 4.1 (driver 295.33).
My system is again responsive, and version 1.69 seems twice as fast as 1.3. Gonna try v2.0 now.

James, you were right, I get 11.3 ms/iteration. Please cancel my previous results, I'll send something more updated during the week.

Luigi

Dubslow 2012-03-25 19:25

Gah, I can't get anything beyond 270.xx to install properly on 11.04 :P

ET_ 2012-03-25 19:47

[QUOTE=Dubslow;294155]Gah, I can't get anything beyond 270.xx to install properly on 11.04 :P[/QUOTE]

I had similar problems running the drivers package downloaded from NVIDIA.

I resolved it by stopping gdm (sudo service gdm stop), switching to a console (Ctrl-Alt-F2), and issuing the following commands:

[code]
sudo add-apt-repository ppa:ubuntu-x-swat/x-updates

sudo apt-get update

sudo apt-get install nvidia-current nvidia-settings

sudo service gdm start
[/code]

HTH :smile:

Luigi

Dubslow 2012-03-25 19:53

Ah, I've been wondering why nvidia-current wasn't getting past 270.xx. Thanks, I'll have to try that when I get back to my desktop.

LaurV 2012-03-26 02:44

[QUOTE=LaurV;294102]Another DC success and another DC failure for v1.69 (26242253 and 26269081, respectively); the first has been reported, the second is running a TC. That makes the score 3 to 1. With the former version I had 12 to 2. The mismatches were caused by hardware (OC, memory, playing around too much, whatever).

Switching to v2.0. I will resume the work currently in progress under 1.69 from a few checkpoints back, to see what's going on and whether I get the same residues.

In theory I am now able to build my own exe, but until I'm confident in my tinkering I prefer to use the one provided by flashjh, as it is known to be well compiled and to run without issues.[/QUOTE]

Got a match for the job started with 1.69 and resumed with v2.0, for exponent 26244851. So, the procedure works.

Also, my TC for [COLOR=Red][B]26269081[/B][/COLOR] (started with 1.69 and resumed with 2.0 after about 5M iterations) got the [B]same residue as my previous DC[/B] (done entirely with 1.69). Unfortunately I was stupid enough to report it (a copy/paste mistake), so we lost the assignment. But I am almost 100% sure (the reservation being a possible software bug in CUDALucas) that my CL DC+TC residue is correct and the original P95 residue is wrong.

If one of you got this exponent, [B]don't run it with CUDALucas[/B]. I will add it to the "Don't DC..." thread too. I am curious to know the final result once the exponent is cleared, but I can't P95 it myself right now.

Dubslow 2012-03-26 04:28

Huh...
That raises some very disconcerting questions about the accuracy of curtisc's tests... (unless cosmic ray?)

LaurV 2012-03-26 04:59

[QUOTE=Dubslow;294207]Huh...
That raises some very disconcerting questions about the accuracy of curtisc's tests... (unless cosmic ray?)[/QUOTE]
Not really; this is not the first time we've met a DC that does not match the FC, where the TC then shows the DC was right and the FC was wrong. That is why we do DC/TC, and the historical error rate is not insignificant. There ARE exponents with a wrong first-LL residue in the PrimeNet DB.

edit: in fact about 3 LL tests in 200 are wrong, and this is normal, according to the last paragraph of [URL="http://www.mersenne.org/various/math.php"]this[/URL]. It is not the first time for me either; see the "Don't DC them..." thread, and I still keep the residue chain for all of them until they are cleared. For the exponent in question, I have all the checkpoint files every 100k iterations, starting from 5M (the part done with CL 2.0). If anyone is interested in repeating the test, please use P95 and put the lines "InterimFiles=100000" and "InterimResidues=100000" into the prime.txt file. I can post (or send) the list of residues for confirmation.

Dubslow 2012-03-26 05:05

[QUOTE=LaurV;294213]Not really; this is not the first time we've met a DC that does not match the FC, where the TC then shows the DC was right and the FC was wrong. That is why we do DC/TC, and the historical error rate is not insignificant.[/QUOTE]


I'm saying though, the first test was done by curtisc, and at the moment, it appears to be wrong. How many others of his are wrong? Or was it just a cosmic ray, not a hardware error?


[QUOTE=ET_;294159]I had similar problems running the drivers package downloaded from NVIDIA.

I resolved stopping the gdm (sudo service gdm stop), from a console (ctrl-alt-F2) and issuing the following commands:

[code]
sudo add-apt-repository ppa:ubuntu-x-swat/x-updates

sudo apt-get update

sudo apt-get install nvidia-current nvidia-settings

sudo service gdm start
[/code]

HTH :smile:

Luigi[/QUOTE]

Oh my goodness, how did you figure out what to do? I've been floundering with this one since like last October. mfaktc 0.18, here I come!!!

debrouxl 2012-03-26 05:52

On Debian unstable, I had a performance problem with 295.20 on mfaktc and gpu-ecm (performance was divided by more than 3), but it came back to the previous level (290.x) after installing 295.33.

LaurV 2012-03-26 12:09

When -t saves yer ass:

[CODE]
Iteration 18600000 M( 26276197 )C, 0xcfa7786f54f16d9b, n = 1474560, CUDALucas v2.00 err = 0.1348 (5:05 real, 3.0521 ms/iter, ETA 6:26:35)
Iteration 18700000 M( 26276197 )C, 0xc1f11399edd62632, n = 1474560, CUDALucas v2.00 err = 0.1348 (5:06 real, 3.0634 ms/iter, ETA 6:22:55)
Iteration 18800000 M( 26276197 )C, 0xfb97497ee5183685, n = 1474560, CUDALucas v2.00 err = 0.1348 (5:08 real, 3.0807 ms/iter, ETA 6:19:57)
iteration = 18822701 >= 1000 && err = 0.498047 >= 0.35,fft length = 1474560 write checkpoint file and exit.(when enable -t option)
[/CODE]

By the way, I found out that the residue written into the file name in this case is not the last known-good residue, but the last residue written to a file. Not a big deal; I resumed from a few checkpoints earlier anyway, just to be sure.

msft 2012-03-26 15:44

[QUOTE=LaurV;294242]by the way, I found out that the residue written to the name of the file in this case is not the last known good residue, but the last residue written to a file. This is not a big deal, anyhow I resumed few checkpoints before, just to be sure.[/QUOTE]
Please post a log.

James Heinrich 2012-03-26 16:26

[QUOTE=James Heinrich;294025]Thanks to those who have submitted data, but I need more data points, please. :smile:[/QUOTE]Thanks to everyone who has submitted data; I now have a pretty good picture of CUDALucas throughput, and can currently predict timings within +/- 6% for pretty much any card.

[url]http://mersenne-aries.sili.net/cudalucas.php[/url]

Performance depends on GFLOPS, FFT size, and compute version. As with mfaktc, compute 2.0 is best, 2.1 is second-best and 1.3 is slowest.

flashjh 2012-03-26 16:34

[QUOTE=James Heinrich;294256]Thanks to everyone who has submitted data; I now have a pretty good picture of CUDALucas throughput, and can currently predict timings within +/- 6% for pretty much any card.

[url]http://mersenne-aries.sili.net/cudalucas.php[/url]

Performance depends on GFLOPS, FFT size, and compute version. As with mfaktc, compute 2.0 is best, 2.1 is second-best and 1.3 is slowest.[/QUOTE]

Thanks for putting this together.

James Heinrich 2012-03-26 16:41

[QUOTE=flashjh;294258]Thanks for putting this together.[/QUOTE]No problem. I'm not entirely sure how best to present it. The "efficiency" of CUDALucas in terms of performance-per-day varies with exponent size, since the FFT sizes chosen by CUDALucas and Prime95 (from which credit values are derived) don't align. It's more obvious if I show more columns (e.g. every 1M instead of every 10M), but that leads to many columns. The overall trend is that CUDALucas appears more efficient at larger exponent sizes, especially around 70M.

BigBrother 2012-03-26 17:25

CUDALucas 2.00 on a GTX680 (sm 3.0):

[CODE]start M26214400 fft length = 1310720
iteration = 22 < 1000 && err = 0.370434 >= 0.25, increasing n from 1310720

start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v2.00 err = 0.02403 (0:33 real, 3.2582 ms/iter, ETA 23:42:45)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v2.00 err = 0.02403 (0:33 real, 3.2477 ms/iter, ETA 23:37:38)
[/CODE]

[CODE]start M52428800 fft length = 2621440
iteration = 22 < 1000 && err = 0.359292 >= 0.25, increasing n from 2621440

start M52428800 fft length = 3145728
Iteration 10000 M( 52428800 )C, 0x3ceee1cc01747326, n = 3145728, CUDALucas v2.00 err = 0.03371 (1:04 real, 6.4324 ms/iter, ETA 93:38:44)
Iteration 20000 M( 52428800 )C, 0x9281347573ff62eb, n = 3145728, CUDALucas v2.00 err = 0.03371 (1:05 real, 6.4328 ms/iter, ETA 93:37:57)
[/CODE]

[CODE]start M78643200 fft length = 3932160
iteration = 22 < 1000 && err = 0.300339 >= 0.25, increasing n from 3932160

start M78643200 fft length = 4194304
iteration = 25 < 1000 && err = 0.313914 >= 0.25, increasing n from 4194304

start M78643200 fft length = 4718592
Iteration 10000 M( 78643200 )C, 0x0a6f35cd25e82e0f, n = 4718592, CUDALucas v2.00 err = 0.04076 (1:36 real, 9.5832 ms/iter, ETA 209:18:46)
Iteration 20000 M( 78643200 )C, 0x00dda91d63971fb3, n = 4718592, CUDALucas v2.00 err = 0.04204 (1:36 real, 9.5889 ms/iter, ETA 209:24:40)
[/CODE]

bcp19 2012-03-26 18:04

[QUOTE=James Heinrich;294262]No problem. I'm not entirely sure how best to present it. The "efficiency" of CUDALucas in terms of performance-per-day varies with exponent size, since the FFT sizes chosen by CUDALucas and Prime95 (from which credit values are derived) don't align. This is more obvious if I show more columns (e.g. every 1M instead of every 10M), but that makes for an unwieldy table. The overall trend is that CUDALucas appears more efficient at larger exponent sizes, especially around 70M.[/QUOTE]

It is rather curious how the older GTX 2XX cards are more efficient at CL, compared to TF, than their newer cousins.

James Heinrich 2012-03-26 18:24

[QUOTE=bcp19;294277]It is rather curious how the older GTX 2XX cards are more efficient at CL, compared to TF, than their newer cousins.[/QUOTE]Assuming compute 2.1 as a baseline:

CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%

mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = ? (no data)

What strikes me as somewhat unexpected is the [i]horrible[/i] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).

bcp19 2012-03-26 18:31

[QUOTE=James Heinrich;294283]Assuming compute 2.1 as a baseline:

CUDALucas:
compute 1.3 = 82%
compute 2.0 = 137%
compute 2.1 = 100%
compute 3.0 = 56%

mfaktc:
compute 1.3 = 54%
compute 2.0 = 150%
compute 2.1 = 100%
compute 3.0 = ? (no data)

What strikes me as somewhat unexpected is the [I]horrible[/I] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).[/QUOTE]

I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.

James Heinrich 2012-03-26 18:35

[QUOTE=bcp19;294285]I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.[/QUOTE]There are minor performance differences based on the software version, but my comparison numbers are based on hardware capabilities (e.g. the GTX 560 is compute 2.1 whereas the GTX 570 is compute 2.0). Going from 1.3 to 2.0 was a big improvement, but (gaming aside) it seems to have been downhill from there. :sad:

LaurV 2012-03-26 18:54

[QUOTE=bcp19;294285]I thought the timings from my cards were faster using the 1.3 version than the 2.0 one.[/QUOTE]
They are: for gtx5xx, sm1.3 is faster than sm2.0 with drv 4.0, which is faster than sm2.0 with drv 4.1 (I do not have sm2.1 cards to compare).

bcp19 2012-03-26 21:09

Just tested 2.00 with mixed results...

Letting CUDALucas decide the FFT, no noticeable speedup:
[code]cudalucas1.69.cuda3.2.sm_13.x64 -polite 0 26214400
start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 1310720
start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.6947 ms/iter, ETA 56:00:01)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.7383 ms/iter, ETA 56:17:46)
Iteration 30000 M( 26214400 )C, 0x2603d4f32b1447b1, n = 1572864, CUDALucas v1.69 err = 0.02441 (1:17 real, 7.6608 ms/iter, ETA 55:42:40)

cudalucas2.00.cuda3.2.sm_13.x64 -polite 0 26214400
start M26214400 fft length = 1310720
iteration = 21 < 1000 && err = 0.25 >= 0.25, increasing n from 1310720
start M26214400 fft length = 1572864
Iteration 10000 M( 26214400 )C, 0x0344448e4bf0eb62, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.6971 ms/iter, ETA 56:01:03)
Iteration 20000 M( 26214400 )C, 0x9f4a57b1f324d325, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.6540 ms/iter, ETA 55:40:58)
Iteration 30000 M( 26214400 )C, 0x2603d4f32b1447b1, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:17 real, 7.7160 ms/iter, ETA 56:06:45)
Iteration 40000 M( 26214400 )C, 0xad8c5ef324794a7f, n = 1572864, CUDALucas v2.00 err = 0.02441 (1:16 real, 7.6570 ms/iter, ETA 55:39:43)
[/code]

Specifying an FFT, ~5% speedup:
[code]cudalucas1.69.cuda3.2.sm_13.x64 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301
start M26232301 fft length = 1474560
Iteration 10000 M( 26232301 )C, 0xf6f119964a437acf, n = 1474560, CUDALucas v1.69 err = 0.1094 (1:17 real, 7.6517 ms/iter, ETA 55:43:47)
Iteration 20000 M( 26232301 )C, 0x3c43951af66bdf31, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:16 real, 7.6471 ms/iter, ETA 55:40:30)
Iteration 30000 M( 26232301 )C, 0x56a23afa69fbb918, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:17 real, 7.6466 ms/iter, ETA 55:39:00)
Iteration 40000 M( 26232301 )C, 0xd2d3eeab0f0b0e40, n = 1474560, CUDALucas v1.69 err = 0.1133 (1:16 real, 7.6486 ms/iter, ETA 55:38:37)
^C caught. Writing checkpoint.

cudalucas2.00.cuda3.2.sm_13.x64 -threads 512 -c 10000 -f 1474560 -t -polite 0 26232301
continuing work from a partial result M26232301 fft length = 1474560 iteration = 40043
Iteration 50000 M( 26232301 )C, 0x71785e5f16f5da16, n = 1474560, CUDALucas v2.00 err = 0.1074 (1:12 real, 7.2297 ms/iter, ETA 52:34:34)
Iteration 60000 M( 26232301 )C, 0xf745bd35ce3b0ab5, n = 1474560, CUDALucas v2.00 err = 0.1094 (1:13 real, 7.2783 ms/iter, ETA 52:54:32)
Iteration 70000 M( 26232301 )C, 0x3a3a81d0ce422b82, n = 1474560, CUDALucas v2.00 err = 0.1094 (1:13 real, 7.2781 ms/iter, ETA 52:53:16)
[/code]

Is it possible in future versions to 'clean up' the FFT selection?

Dubslow 2012-03-26 21:24

[QUOTE=James Heinrich;294283]
What strikes me as somewhat unexpected is the [i]horrible[/i] performance of the GTX 680 as posted above (the 56% is based on the single benchmark 2 posts above).[/QUOTE]

Go see the Kepler thread; from the reviews linked there, we have all been (sadly) expecting reduced compute performance for the 680. However, since the 680 is GK104, not GK110 (and as such is more closely related to the 560 Ti than the 580), we're waiting to see what the GK110 can do.

Batalov 2012-03-26 21:30

Compute 2.0 cards have DP throughput at 1/8 of SP GFlops; 2.1 cards, 1/12.
From AnandTech's coverage, compute 3.0 was expected to have 1/24, and so, sadly, it appears to be.

[QUOTE="Python"][B]Shop Owner[/B]: Remarkable bird, the Norwegian Blue, isn't it, eh? Beautiful plumage!
[B]Mr. Praline[/B]: The plumage don't enter into it. It's stone dead!
[/QUOTE]

flashjh 2012-03-27 23:50

I've had four v2.00 successes and one mismatch. I don't know for sure what caused the mismatch, but I had a driver failure that Win7 recovered from, so that's probably it (I caused it by closing CuLu too soon after starting).[CODE]
M( 26232301 )C, 0x[COLOR=red]251f67a97a93197a[/COLOR], n = 1474560, CUDALucas v2.00
M( 26232803 )C, 0xd00b85dcfaee04b3, n = 1474560, CUDALucas v2.00
M( 26232301 )C, 0x[COLOR=lime]040e8dd990e95b17[/COLOR], n = 1474560, CUDALucas v2.00
M( 26240933 )C, 0x68d29225ff867aa5, n = 1474560, CUDALucas v2.00
M( 26296561 )C, 0x60db292b00734623, n = 1474560, CUDALucas v2.00
[/CODE]

2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile:

kladner 2012-03-28 00:24

[QUOTE=flashjh;294446]2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile:[/QUOTE]

Good luck with that. I expect someone will want it.

Dubslow 2012-03-28 01:15

[url]http://mersenne.org/report_exponent/?exp_lo=26232301[/url]

...How did you get the good result before the bad one? :huh:

bcp19 2012-03-28 01:36

[QUOTE=flashjh;294446]I've had four v2.00 successes and one mismatch. I don't know for sure what caused the mismatch, but I had a driver failure that Win7 recovered from, so that's probably it (I caused it by closing CuLu too soon after starting).[CODE]
M( 26232301 )C, 0x[COLOR=red]251f67a97a93197a[/COLOR], n = 1474560, CUDALucas v2.00
M( 26232803 )C, 0xd00b85dcfaee04b3, n = 1474560, CUDALucas v2.00
M( 26232301 )C, 0x[COLOR=lime]040e8dd990e95b17[/COLOR], n = 1474560, CUDALucas v2.00
M( 26240933 )C, 0x68d29225ff867aa5, n = 1474560, CUDALucas v2.00
M( 26296561 )C, 0x60db292b00734623, n = 1474560, CUDALucas v2.00
[/CODE]

2.00 is very fast, even with -t. I'm getting just over 15 hours per DC. Too bad my new GTX680 is not worth opening up... any gamer out there want to trade an unopened 680 for a 580+some cash :smile:[/QUOTE]

Could try eBay; $600+ is the current selling price.

flashjh 2012-03-28 01:49

[QUOTE=Dubslow;294456][URL]http://mersenne.org/report_exponent/?exp_lo=26232301[/URL]

...How did you get the good result before the bad one? :huh:[/QUOTE]
In my file it was bad then good; PrimeNet shows it differently, I don't know why. Basically I got the bad DC, so I ran a TC to get the good result.
[QUOTE=kladner;294450]Good luck with that. I expect someone will want it.[/QUOTE]

[QUOTE=bcp19;294459]could try ebay, $600+ is the current selling price.[/QUOTE]

I'm sure it will sell; I can't list it until I get home. Ironically, I haven't even seen it yet and I need to sell it. Here's to hoping nVidia does [I]much[/I] better with GK110.

msft 2012-03-28 02:26

[QUOTE=bcp19;294321]
Is it possible in future versions to 'clean up' the FFT selection?[/QUOTE]
[code]
CUFFT_Z2Z size= 1081344 time= 17.912073 msec
CUFFT_Z2Z size= 1179648 time= 3.668577 msec
CUFFT_Z2Z size= 1277952 time= 23.579775 msec
CUFFT_Z2Z size= 1376256 time= 4.184882 msec
CUFFT_Z2Z size= 1474560 time= 4.181284 msec
CUFFT_Z2Z size= 1572864 time= 4.305398 msec
CUFFT_Z2Z size= 1671168 time= 29.604942 msec
CUFFT_Z2Z size= 1769472 time= 4.611794 msec
CUFFT_Z2Z size= 1867776 time= 35.498314 msec
CUFFT_Z2Z size= 1966080 time= 5.232563 msec
CUFFT_Z2Z size= 2064384 time= 6.078244 msec
CUFFT_Z2Z size= 2162688 time= 30.571829 msec
CUFFT_Z2Z size= 2260992 time= 45.150196 msec
CUFFT_Z2Z size= 2359296 time= 6.338501 msec
CUFFT_Z2Z size= 2457600 time= 8.342790 msec
CUFFT_Z2Z size= 2555904 time= 37.028366 msec
CUFFT_Z2Z size= 2654208 time= 8.117778 msec
CUFFT_Z2Z size= 2752512 time= 7.537333 msec
CUFFT_Z2Z size= 2850816 time= 63.291031 msec
CUFFT_Z2Z size= 2949120 time= 7.693821 msec
[/code]
Performance across these multiples of 98304 is unstable.

LaurV 2012-03-28 11:23

Another two successful DCs, CUDALucas v2.00, -f 1474560, for 26276197 and 26247433.

kladner 2012-03-28 15:54

I just completed my first full run on CuLu (2.0) of a DC. Residue matched the original LL and Jerry's good DC. This gives me confidence to try some "real work" now.

[CODE]M( 26229943 )C, 0x769161872540121c, n = 1572864, CUDALucas v2.00[/CODE]

flashjh 2012-03-28 16:59

[QUOTE=kladner;294514]I just completed my first full run on CuLu (2.0) of a DC. Residue matched the original LL and Jerry's good DC. This gives me confidence to try some "real work" now.

[CODE]M( 26229943 )C, 0x769161872540121c, n = 1572864, CUDALucas v2.00[/CODE][/QUOTE]

Did you try -f 1474560? It will increase your throughput; just use -t and watch your error.

kladner 2012-03-28 18:40

[QUOTE=flashjh;294517]Did you try -f 1474560? It will increase your throughput; just use -t and watch your error.[/QUOTE]

No, I didn't. I'm still baffled by selecting FFT sizes. I messed around a little, but only succeeded in terminating the program.

I've started another 26M DC. The automatic system set 1572864. I will try it with 1474560. Thanks for the suggestion!

EDIT: This is weird. I have this command line: [CODE]start CUDALucas2.00.cuda3.2.sm_13.x64 -c 10000 -f 1474560 -t -s check -polite 1 -k worktodo.txt[/CODE]

But CuLu continues to start with 1572864.

flashjh 2012-03-28 18:50

[QUOTE=kladner;294524]No, I didn't. I'm still baffled by selecting FFT sizes. I messed around a little, but only succeeded in terminating the program.

I've started another 26M DC. The automatic system set 1572864. I will try it with 1474560. Thanks for the suggestion![/QUOTE]

It will terminate with an FFT that causes too high an error. If you read back through the thread you can see some examples. CuLu also includes an FFT benchmark:

cudalucas.exe -cufftbench [I]start stop distance[/I]

start: smallest FFT length to test
stop: largest FFT length to test
distance: step between FFT lengths (32768 is typical)

Once you run that, you can try the fastest FFTs and run each one until you find one that is stable. For me, all the FFTs that were faster than 1572864 caused errors except 1474560. So far with -f 1474560 -t -polite 0 I have had very good runs except one (which was my fault), but the triple check was good.

Jerry
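[Editor's note] If you capture the -cufftbench output to a file, a few lines of Python can rank the candidate lengths for you. This is only a convenience sketch: the `CUFFT_Z2Z size= ... time= ... msec` line format is copied from the benchmark output posted in this thread, and sorting by time-per-point rather than raw time is my own choice, since it makes lengths of different sizes comparable.

```python
import re

# Matches benchmark lines like: "CUFFT_Z2Z size= 1474560 time= 4.181284 msec"
LINE = re.compile(r"CUFFT_Z2Z\s+size=\s*(\d+)\s+time=\s*([\d.]+)\s+msec")

def rank_ffts(text):
    """Parse cufftbench output; sort FFT lengths by time per point, fastest first."""
    results = [(int(m.group(1)), float(m.group(2))) for m in LINE.finditer(text)]
    return sorted(results, key=lambda r: r[1] / r[0])

sample = """\
CUFFT_Z2Z size= 1376256 time= 1.818975 msec
CUFFT_Z2Z size= 1474560 time= 1.809079 msec
CUFFT_Z2Z size= 1572864 time= 1.937807 msec
"""
for size, ms in rank_ffts(sample):
    print(f"{size:>8}  {ms:.4f} ms")
```

The top few lengths in the ranking are then the ones worth stability-testing with -t, as described above.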

kladner 2012-03-28 19:07

Thanks Jerry, for the explanation. I'll go back and run -cufftbench and try to get a handle on things. But at the moment, it ignores -f for some reason. I'll get back to it in a bit.

James Heinrich 2012-03-28 19:20

Is there any kind of benchmark (or reference in the code) that gives a list of all suggested FFT sizes for a given exponent size? From a quick glance at the source I couldn't find the lookup table for the FFT size CUDALucas starts with by default. How can I find this (ideally for all possible exponents that it can handle)?

Prime95 2012-03-28 19:38

1 Attachment(s)
Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.

flashjh 2012-03-28 19:46

[QUOTE=James Heinrich;294529]Is there any kind of benchmark (or reference in the code) that gives a list of all suggested FFT sizes for a given exponent size? From a quick glance at the source I couldn't find the lookup table for the FFT size CUDALucas starts with by default. How can I find this (ideally for all possible exponents that it can handle)?[/QUOTE]

As far as I know, right now you just pick a range around the exponent and run the cufft test to choose the best FFT for your card/system/exponent. I have not yet tested different thread sizes and the corresponding cufft test for different FFTs. When I get some time, I'll see what I can do. Look back through the thread for more FFT discussion.

Dubslow 2012-03-28 19:50

[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]

I've had the same question as James about Prime95. Where in the P95 source is the table of possible FFT lengths? 15-20 minutes digging around didn't turn up much.

Prime95 2012-03-28 19:58

[QUOTE=Dubslow;294533]I've had the same question as James about Prime95. Where in the P95 source is the table of possible FFT lengths? 15-20 minutes digging around didn't turn up much.[/QUOTE]

We're getting a little off topic. P95 uses a table in mult.asm. The xjmptable is for SSE2, the yjmptable is for AVX. The table includes FFT size, maximum Mersenne exponent, estimated timing, mem used, which CPU architectures should use it, and some other stuff.

axn 2012-03-28 19:59

[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]

Found a problem.
[CODE]CUFFT_Z2Z size= 1146880 time= 1.417851 msec Y
CUFFT_Z2Z size= 1179648 time= 1.390691 msec
[/CODE]

axn 2012-03-28 20:13

Looking at the multipliers, there are definite patterns. The multipliers 1, 3, 5, 7, 9, 21, 27, 45, 49 and 81 are always selected as preferred.

The one exception is 21 vs 45: 1376256 is slower than 1474560. It might be worth re-benchmarking the following four to see if the results are consistent.

[CODE]CUFFT_Z2Z size= 1376256 time= 1.818975 msec (21)
CUFFT_Z2Z size= 1474560 time= 1.809079 msec Y (45)

CUFFT_Z2Z size= 2752512 time= 3.812189 msec Y (21)
CUFFT_Z2Z size= 2949120 time= 3.853927 msec Y (45)
[/CODE]

Dubslow 2012-03-28 20:13

[QUOTE=Prime95;294530]Attached is my cufftbench for a GTX460. I flagged with a "Y" the FFT sizes that make sense.[/QUOTE]

Rehashed to show a bit more about the FFT size, as well as axn's correction.

Edit: Whoops, cross post.
@axn: msft said CuLu can use any multiple of 32K; that's why I did it that way.

Edit2: Redone to show more lengths that are "reasonable", but not best. Those are marked with M.

[code]CUFFT_Z2Z size= 1048576 = 1024K = 32*32K time= 1.130540 msec Y
CUFFT_Z2Z size= 1146880 = 1120K = 35*32K time= 1.417851 msec M
CUFFT_Z2Z size= 1179648 = 1152K = 36*32K time= 1.390691 msec Y
CUFFT_Z2Z size= 1310720 = 1280K = 40*32K time= 1.533345 msec Y
CUFFT_Z2Z size= 1376256 = 1344K = 42*32K time= 1.818975 msec M
CUFFT_Z2Z size= 1474560 = 1440K = 45*32K time= 1.809079 msec Y
CUFFT_Z2Z size= 1572864 = 1536K = 48*32K time= 1.937807 msec Y
CUFFT_Z2Z size= 1605632 = 1568K = 49*32K time= 2.023415 msec Y
CUFFT_Z2Z size= 1638400 = 1600K = 50*32K time= 2.217558 msec M
CUFFT_Z2Z size= 1769472 = 1728K = 54*32K time= 2.141137 msec Y
CUFFT_Z2Z size= 1835008 = 1792K = 56*32K time= 2.163136 msec Y
CUFFT_Z2Z size= 1966080 = 1920K = 60*32K time= 2.700584 msec M
CUFFT_Z2Z size= 2064384 = 2016K = 63*32K time= 2.551482 msec M
CUFFT_Z2Z size= 2097152 = 2048K = 64*32K time= 2.409963 msec Y
CUFFT_Z2Z size= 2293760 = 2240K = 70*32K time= 3.018234 msec M
CUFFT_Z2Z size= 2359296 = 2304K = 72*32K time= 2.766602 msec Y
CUFFT_Z2Z size= 2457600 = 2400K = 75*32K time= 3.627161 msec M
CUFFT_Z2Z size= 2621440 = 2560K = 80*32K time= 3.239111 msec Y
CUFFT_Z2Z size= 2654208 = 2592K = 81*32K time= 3.409978 msec Y
CUFFT_Z2Z size= 2752512 = 2688K = 84*32K time= 3.812189 msec Y
CUFFT_Z2Z size= 2949120 = 2880K = 90*32K time= 3.853927 msec Y
CUFFT_Z2Z size= 3145728 = 3072K = 96*32K time= 4.029561 msec Y
CUFFT_Z2Z size= 3211264 = 3136K = 98*32K time= 4.324980 msec Y
CUFFT_Z2Z size= 3276800 = 3200K = 100*32K time= 4.702814 msec M
CUFFT_Z2Z size= 3440640 = 3360K = 105*32K time= 4.934543 msec M
CUFFT_Z2Z size= 3538944 = 3456K = 108*32K time= 4.573230 msec Y
CUFFT_Z2Z size= 3670016 = 3584K = 112*32K time= 4.591721 msec Y
CUFFT_Z2Z size= 3932160 = 3840K = 120*32K time= 5.395338 msec M
CUFFT_Z2Z size= 4128768 = 4032K = 126*32K time= 5.436691 msec M
CUFFT_Z2Z size= 4194304 = 4096K = 128*32K time= 5.049356 msec Y
CUFFT_Z2Z size= 4423680 = 4320K = 135*32K time= 5.862155 msec M
CUFFT_Z2Z size= 4587520 = 4480K = 140*32K time= 6.353941 msec M
CUFFT_Z2Z size= 4718592 = 4608K = 144*32K time= 5.858453 msec Y
CUFFT_Z2Z size= 4816896 = 4704K = 147*32K time= 7.085539 msec M
CUFFT_Z2Z size= 4915200 = 4800K = 150*32K time= 7.661496 msec M
[/code]
[QUOTE=Prime95;294535]We're getting a little off topic. P95 uses a table in mult.asm. The xjmptable is for SSE2, the yjmptable is for AVX. The table includes FFT size, maximum Mersenne exponent, estimated timing, mem used, which CPU architectures should use it, and some other stuff.[/QUOTE]
:ouch1:

...The former is 2800 lines. Did you write those all by hand?

axn 2012-03-28 20:40

[QUOTE=Dubslow;294539]Rehashed to show a bit more about the FFT size, as well as axn's correction.

Edit: Whoops, cross post.
@axn: msft said CuLu can use any multiple of 32K, that's why I did as such.

Edit2: Redone to show more lengths that are "reasonable", but not best. Those are marked with M.

[/QUOTE]

I did a similar exercise, this time normalizing the time by dividing it by (FFT/1048576). There is a clear pattern. Any multiplier that is 7-smooth yields decent (not necessarily preferred) performance. Anything that is not 7-smooth yields terrible performance. Something like 4x or worse.
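[Editor's note] The 7-smooth rule can be checked mechanically for any candidate length. A small sketch (the helper names are mine, and the rule itself is axn's empirical observation from these benchmarks, not a documented CUFFT guarantee):

```python
def largest_prime_factor(n):
    """Largest prime factor of n by trial division (returns 1 for n = 1)."""
    factor, p = 1, 2
    while p * p <= n:
        while n % p == 0:
            factor, n = p, n // p
        p += 1
    return n if n > 1 else factor

def fft_looks_fast(size, base=32768):
    """Heuristic: an FFT length (a multiple of 32K) should perform decently
    if its multiplier is 7-smooth, and badly otherwise."""
    mult, rem = divmod(size, base)
    return rem == 0 and largest_prime_factor(mult) <= 7

# 1474560 = 45*32K (45 = 3*3*5, smooth) vs 1081344 = 33*32K (33 = 3*11, not)
for size in (1474560, 1572864, 1081344, 2162688):
    print(size, size // 32768, fft_looks_fast(size))
```

This reproduces the table above: every "Y" length has a 7-smooth multiplier, and the lengths with a prime factor of 11 or more are the ones showing 4x-or-worse normalized times.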

Dubslow 2012-03-28 20:48

[QUOTE=axn;294542]I did a similar exercise, this time normalizing the time by dividing it by (FFT/1048576). There is a clear pattern. Any multiplier that is 7-smooth yields decent (not necessarily preferred) performance. Anything that is not 7-smooth yields terrible performance. Something like 4x or worse.[/QUOTE]
Could you post a chart of the multiplier's factorizations or do you want me to do it?

axn 2012-03-28 21:16

[QUOTE=Dubslow;294544]Could you post a chart of the multiplier's factorizations or do you want me to do it?[/QUOTE]

[CODE] FFT Pref Mult Smooth Time (ms) Normalized
1048576 Y 1 1 1.1305 1.130
2097152 Y 1 1 2.4099 1.204
4194304 Y 1 1 5.0493 1.262
1572864 Y 3 3 1.9378 1.291
3145728 Y 3 3 4.0295 1.343
1310720 Y 5 5 1.5333 1.226
2621440 Y 5 5 3.2391 1.295
1835008 Y 7 7 2.1631 1.236
3670016 Y 7 7 4.5917 1.311
1179648 Y 9 3 1.3906 1.236
2359296 Y 9 3 2.7666 1.229
4718592 Y 9 3 5.8584 1.301
1441792 11 11 12.0507 8.764
2883584 11 11 25.2414 9.178
1703936 13 13 15.8089 9.728
3407872 13 13 32.9923 10.151
1966080 15 5 2.7005 1.440
3932160 15 5 5.3953 1.438
1114112 17 17 17.8324 16.783
2228224 17 17 23.8903 11.242
4456448 17 17 54.9814 12.936
1245184 19 19 14.2480 11.998
2490368 19 19 28.5571 12.024
4980736 19 19 65.0263 13.689
1376256 ? 21 7 1.8189 1.385
2752512 Y 21 7 3.8121 1.452
1507328 23 23 19.8118 13.782
3014656 23 23 40.1851 13.977
1638400 25 5 2.2175 1.419
3276800 25 5 4.7028 1.504
1769472 Y 27 3 2.1411 1.268
3538944 Y 27 3 4.5732 1.355
1900544 29 29 30.3831 16.763
3801088 29 29 61.4396 16.948
2031616 31 31 33.3520 17.213
4063232 31 31 67.5301 17.427
1081344 33 11 9.9185 9.618
2162688 33 11 18.5583 8.997
4325376 33 11 40.4085 9.796
1146880 35 7 1.4178 1.296
2293760 35 7 3.0182 1.379
4587520 35 7 6.3539 1.452
1212416 37 37 22.6343 19.575
2424832 37 37 45.7872 19.799
4849664 37 37 99.3098 21.472
1277952 39 13 12.9222 10.602
2555904 39 13 23.8400 9.780
1343488 41 41 27.0680 21.126
2686976 41 41 54.9051 21.426
1409024 43 43 29.3962 21.876
2818048 43 43 59.6049 22.178
1474560 Y 45 5 1.8090 1.286
2949120 Y 45 5 3.8539 1.370
1540096 47 47 33.5578 22.847
3080192 47 47 68.1485 23.199
1605632 Y 49 7 2.0234 1.321
3211264 Y 49 7 4.3249 1.412
1671168 51 17 18.4646 11.585
3342336 51 17 37.9425 11.903
1736704 53 53 13.0645 7.888
3473408 53 53 26.7417 8.072
1802240 55 11 15.3619 8.937
3604480 55 11 33.7740 9.825
1867776 57 19 22.4452 12.600
3735552 57 19 45.0705 12.651
1933312 59 59 15.6682 8.498
3866624 59 59 32.2185 8.737
1998848 61 61 17.2398 9.043
3997696 61 61 36.1076 9.470
2064384 63 7 2.5514 1.295
4128768 63 7 5.4366 1.380
2129920 65 13 20.0319 9.861
4259840 65 13 43.8546 10.794
2195456 67 67 14.6807 7.011
4390912 67 67 31.1684 7.443
2260992 69 23 30.2865 14.045
4521984 69 23 62.9652 14.600
2326528 71 71 17.2002 7.752
4653056 71 71 36.2993 8.180
2392064 73 73 21.2844 9.330
4784128 73 73 44.9508 9.852
2457600 75 5 3.6271 1.547
4915200 75 5 7.6614 1.634
2523136 77 11 21.6817 9.010
2588672 79 79 20.6799 8.376
2654208 Y 81 3 3.4099 1.347
2719744 83 83 19.7181 7.602
2785280 85 17 31.6756 11.924
2850816 87 29 44.9787 16.543
2916352 89 89 20.1506 7.245
2981888 91 13 28.5019 10.022
3047424 93 31 49.9787 17.197
3112960 95 19 37.6290 12.675
3178496 97 97 26.7683 8.830
3244032 99 11 32.7558 10.587
3309568 101 101 23.0312 7.297
3375104 103 103 26.8451 8.340
3440640 105 7 4.9345 1.503
3506176 107 107 23.0966 6.907
3571712 109 109 30.6403 8.995
3637248 111 37 70.8848 20.435
3702784 113 113 26.2526 7.434
3768320 115 23 52.5471 14.621
3833856 117 13 38.0365 10.403
3899392 119 17 44.3925 11.937
3964928 121 11 54.2576 14.349
4030464 123 41 84.7203 22.041
4096000 125 5 6.4623 1.654
4161536 127 127 26.5980 6.701
4227072 129 43 91.9276 22.803
4292608 131 131 36.1097 8.820
4358144 133 19 52.7457 12.690
4423680 135 5 5.8621 1.389
4489216 137 137 40.8194 9.534
4554752 139 139 36.4544 8.392
4620288 141 47 104.9800 23.825
4685824 143 13 68.4353 15.314
4751360 145 29 79.7051 17.590
4816896 147 7 7.0855 1.542
4882432 149 149 38.9179 8.358
4947968 151 151 38.3244 8.121
[/CODE]

kladner 2012-03-28 21:47

[QUOTE=kladner;294528]But at the moment, it ignores -f for some reason. I'll get back to it in a bit.[/QUOTE]

I just had to move the check files out of the folder. -f 474560 does run faster than the default 1572864. [CODE]474560
err = 0.09766 (0:50 real, 4.9958 ms/iter
1572864
err = 0.02148 , 5.3635 ms/iter[/CODE]

It seems the fft could go smaller, but I'll have to read the part of the thread that's been posted since I started experimenting and writing about it.

kladner 2012-03-28 23:51

[QUOTE=kladner;294547]-f 474560 does run faster than the default 1572864. [CODE]474560
err = 0.09766 (0:50 real, 4.9958 ms/iter
1572864
err = 0.02148 , 5.3635 ms/iter[/CODE]It seems the fft could go smaller, but I'll have to read the part of the thread that's been posted since I started experimenting and writing about it.[/QUOTE]


Oops. That is 1474560. So far the smallest that doesn't terminate on a GTX 460 with a 26M exponent.

flashjh 2012-03-29 01:50

[QUOTE=kladner;294571]Oops. That is 1474560. So far the smallest that doesn't terminate on a GTX 460 with a 26M exponent.[/QUOTE]
That was the same for me on a 580

LaurV 2012-03-29 02:26

[QUOTE=kladner;294524]
EDIT: This is weird. <snip>
But CuLu continues to start with 1572864.[/QUOTE]
You have to delete the checkpoint files "cXXXXX" and "tXXXX". If a checkpoint file exists, CuLu will always resume from where it left off, and checkpoint files are not interchangeable: they carry the size of the FFT used. So if a file with the old FFT size exists, it will use THAT size regardless of what -f you specify. As you can appreciate, if your job is more than about 10-20% done, it is faster to let it finish with the old 1572864 (= 32768*48), then use 1474560 (= 32768*45) starting with the next exponent. Both sizes work well in the 26-27M range; the shorter one is faster by 10% to 30% depending on your card. Use the -cufftbench option as explained before to check exactly for your card.

kladner 2012-03-29 02:26

[QUOTE=flashjh;294586]That was the same for me on a 580[/QUOTE]

Thanks again for throwing that out there. It made a difference for me.

msft 2012-03-30 14:45

[code]
Processing result: M( 26768243 )C, 0x3280d4e28ef0b188, n = 1474560, CUDALucas v2.00
LL test successfully completes double-check of M26768243
[/code]

kladner 2012-03-31 16:59

Successful LLDC with CuLu:

[CODE]26158007    No factors below 2^69
P-1         B1=390000
Verified LL B50D7F090E32331F by "David Triggerson"
Verified LL B50D7F090E32331F by "ktony" on 2012-03-31
History     no factor for M26158007 from 2^67 to 2^68 [mfaktc 0.17-Win barrett79_mul32] by "lalera" on 2011-12-05
History     no factor for M26158007 from 2^68 to 2^69 [mfaktc 0.18-pre7 71bit_mul24] by "Luigi Morelli" on 2011-12-06
History     b50d7f090e3233__ by "ktony" on 2012-03-31[/CODE]

apsen 2012-04-02 15:49

I've been assigned a triple check and got a mismatch with the first two checks for 28982959.

I've run the check twice with different FFT lengths (and -t both times) and got all residues match.

Could someone run it through P95?

Thanks,
Andriy

flashjh 2012-04-02 16:59

[QUOTE=apsen;295165]I've been assigned a triple check and got a mismatch with the first two checks for 28982959.

I've run the check twice with different FFT lengths (and -t both times) and got all residues match.

Could someone run it through P95?

Thanks,
Andriy[/QUOTE]

I'll run it. Will take a few days.

