mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

flashjh 2012-03-10 15:06

I had a match, but the assignment [URL="http://www.mersenneforum.org/showthread.php?p=292493#post292493"]had already been turned in[/URL]. The good news is that the original LL was bad, because my 1.64 run matched David's CUDALucas run.

M( 26002063 )C, 0x1c5e4ca283b033__, n = 1572864, CUDALucas v1.64

Prime95 2012-03-10 15:12

[QUOTE=LaurV;292500]Which indeed would need to be much lower to catch the spikes going over 0.5. If you check it at every iteration (what -t is doing) then comparing it with 0.5 would be enough.[/QUOTE]

This is not correct. A roundoff error of 0.49 is harmless, but a roundoff error of 0.51 is deadly. The problem is the program will correctly report both as 0.49.

So if CUDALucas reports a round off error of 0.49, how confident are you that it really wasn't a deadly roundoff of 0.51??? This is why PFGW aborts (actually switches to a larger FFT length) when the roundoff error exceeds 0.45. Prime95 retries the iteration if the roundoff exceeds 0.40.

LaurV 2012-03-10 17:02

[QUOTE=Prime95;292511]This is not correct. A roundoff error of 0.49 is harmless, but a roundoff error of 0.51 is deadly. The problem is the program will correctly report both as 0.49.

So if CUDALucas reports a round off error of 0.49, how confident are you that it really wasn't a deadly roundoff of 0.51??? This is why PFGW aborts (actually switches to a larger FFT length) when the roundoff error exceeds 0.45. Prime95 retries the iteration if the roundoff exceeds 0.40.[/QUOTE]
That was EXACTLY what I was talking about. You may not get that if you only read post 931, but please read my post 929 carefully [edit: the first observation, last part].

msft 2012-03-10 22:20

[QUOTE=Prime95;292511]So if CUDALucas reports a round off error of 0.49, how confident are you that it really wasn't a deadly roundoff of 0.51??? This is why PFGW aborts (actually switches to a larger FFT length) when the roundoff error exceeds 0.45. Prime95 retries the iteration if the roundoff exceeds 0.40.[/QUOTE]

[code]
Ver 1.64

default:
    if ((iteration % 100) == 0 || iteration < 1000)
        if (roundoff > 0.35)
            increase FFT length

-t option:
    if (roundoff > 0.49)
        exit program
    else if (roundoff > 0.35)
        increase FFT length
[/code]
[code]
if (roundoff > 0.49)
    exit program
[/code]
This is experimental code.
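The policy msft outlines above could be sketched as a single decision function. This is an illustration of the described v1.64 behavior, not the actual CUDALucas source; the enum and function names are assumptions.

```c
/* Sketch of the v1.64 roundoff policy described above: with -t every
 * iteration is checked and an error above 0.49 is fatal; without -t
 * only every 100th iteration (plus the first 1000) is checked, and
 * an error above 0.35 triggers a move to the next FFT length. */
typedef enum { ROE_OK, ROE_GROW_FFT, ROE_FATAL } roe_action;

static roe_action check_roundoff(double roundoff, int t_flag, long iteration) {
    if (!t_flag && (iteration % 100) != 0 && iteration >= 1000)
        return ROE_OK;              /* not checked this iteration */
    if (t_flag && roundoff > 0.49)
        return ROE_FATAL;           /* -t: error is unrecoverable, exit */
    if (roundoff > 0.35)
        return ROE_GROW_FFT;        /* switch to a larger FFT length */
    return ROE_OK;
}
```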

flashjh 2012-03-11 02:48

Another good one
 
Another 1.64 success
[CODE]
Processing result: M( 26134351 )C, 0xb9d6a5672486c791, n = 1572864, CUDALucas v1.64
LL test successfully completes double-check of M26134351

[/CODE]

kladner 2012-03-11 03:50

[QUOTE=flashjh;292579]Another 1.64 success
[CODE]
Processing result: M( 26134351 )C, 0xb9d6a5672486c791, n = 1572864, CUDALucas v1.64
LL test successfully completes double-check of M26134351

[/CODE][/QUOTE]

Encouraging.

apsen 2012-03-11 18:35

[QUOTE=msft;292504][code]
M( 29198173 )C, 0x6fd7e4d6557f5b77, n = 1572864, CUDALucas v1.58
[/code]
correct.[/QUOTE]

That does not match the first-time test. I guess I'd better rerun it with P95.

flashjh 2012-03-11 19:29

[QUOTE=apsen;292627]That does not match first time test. I guess I better rerun it with P95.[/QUOTE]

You should submit the result to PrimeNet; it may be correct.

LaurV 2012-03-12 03:38

I finished first-time LL tests for 45130601 and 4520386. The tests were done with CL 1.64 with -s and -t, so the intermediate residues and all checkpoint files (every 250k iterations) are available if someone wants to do the double-check with P95. When (and if) my cores become less loaded, I will attempt a DC with P95 myself, but that will not be in the coming weeks.

Currently, I am testing another exponent in the same range (45221537) with two cards at the same time, no overclocking. This is to check whether CL 1.64 is "reliable" in the 45M range (in fact, it is more a test of whether the "cheap" GTX 580s with 1.5 GiB of memory that I use are reliable from the hardware point of view: at the factory speed of 782 MHz they should produce the same results, regardless of whether the software is mathematically correct or not). So far, 19M iterations are done on both (they run at about the same speed; one is a bit slower, maybe because it is used as the primary display?) and both residues match.

edit: Roughly 40 hours to go. I don't use -s and -t; in fact that is the idea, to see how reliable it is without checking every iteration. But I am saving the checkpoints (using my batch file posted before) every 30 minutes, so that if there is a mismatch I can avoid starting everything from the beginning. Without the -t switch, CL is faster, as discussed before.

Anyhow, if two copies are testing the same exponent (in two different folders), then [B]-s cannot be used[/B], as they will try to write the [B]SAME[/B] checkpoint files. The idea with the "backup" subfolder was to have it [B]in the current folder[/B], not in the root of the disk... like ".\backup\......." and not "c:\backup\.....". You could argue that no one will test the same exponent with more than one copy of CL at the same time, but if you re-test the same exponent later using -s, the checkpoint files will be overwritten too... Why not let the user customize the output path?
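The customizable output path being asked for could be as simple as prefixing checkpoint filenames with a user-supplied directory. The `checkpoint_path` helper and the "c" + exponent naming below are assumptions for illustration, not CUDALucas's actual file-naming scheme.

```c
#include <stdio.h>

/* Build a checkpoint file path under a user-chosen directory
 * (e.g. ".\backup" relative to the working folder) instead of a
 * hard-coded location, so two instances, or a later re-test with
 * -s, need not clobber each other's files. */
static void checkpoint_path(char *buf, size_t len,
                            const char *ckpt_dir, unsigned exponent) {
    snprintf(buf, len, "%s/c%u", ckpt_dir, exponent);
}
```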

Brain 2012-03-12 05:33

Responsibility
 
[QUOTE=James Heinrich;292430]I just started experimenting with CUDAlucas yesterday. First impressions: it uses zero CPU, but the GPU usage is more aggressive than mfaktc. Normal Windows usage is fine, I can't watch even DVD-quality video smoothly with CUDAlucas whereas it's only 1080 video I have to switch mfaktc off for. Most likely I'll go back to mfaktc, partly for usability, but also because the extra two cores don't scale so well with the new AVX cores in Prime95 (iteration times when running 6 workers are significantly slower than 4 workers).[/QUOTE]
I cannot even run low-res playback with 1.64, because of lags / bad responsiveness. I suggest, again, a command-line switch, for example --polite or --aggressive, where --polite would be the default. This would insert an artificial CUDA wait loop where other apps (playback) get a go.

The lag was introduced when an unnecessary cudaMemcpy was removed.

Karl M Johnson 2012-03-12 06:15

[QUOTE=Brain;292671]I cannot even run low-res playback with 1.64, because of lags / bad responsiveness. I suggest, again, a command-line switch, for example --polite or --aggressive, where --polite would be the default. This would insert an artificial CUDA wait loop where other apps (playback) get a go.

The lag was introduced when an unnecessary cudaMemcpy was removed.[/QUOTE]
Or a CL option to control threads and blocks. That way, it is up to the user to decide whether to run at max performance or in some GPU-idle state.
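The --polite throttle proposed above could be sketched as a short sleep between kernel launches, giving the display driver a window to schedule other work. The function name, the `polite` flag, and the 1 ms delay are all assumptions for illustration; this is not an existing CUDALucas option.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Sketch of a --polite mode: after each kernel launch (or batch of
 * iterations), yield briefly so the GPU can service the desktop.
 * Returns the number of nanoseconds requested, 0 in --aggressive
 * mode.  The 1 ms value is an arbitrary placeholder. */
static long polite_wait(int polite) {
    if (!polite)
        return 0;                             /* --aggressive: no throttle */
    struct timespec ts = { 0, 1000000L };     /* 1 ms */
    nanosleep(&ts, NULL);
    return ts.tv_nsec;
}
```

The cost is a small throughput loss per iteration; the benefit is that video playback and normal desktop use stay responsive while the LL test runs.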

