CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

James Heinrich 2013-11-28 04:08

[QUOTE=flashjh;360493]Yes, I'll need to talk with James to have the PHP code updated to recognize the 2.05 format.[/QUOTE]Yes, you will. :smile:

If you can keep the [b]M( <exponent> )C[/b] style that would be useful. As in, something like:[code]M( 10061 )C, 0x56eb9bb91825b188, offset = 4054, n = 1K, CUDALucas v2.05 Beta[/code]
Other than that, the only change is the addition of the "offset" parameter? If so I can add support for that relatively easily.

flashjh 2013-11-28 04:10

That's no problem, do you want the rest of the line to stay the same?

Edit: keep the AID at the end, if it is there?

BTW - Thanks!

James Heinrich 2013-11-28 04:31

These lines are now handled by mersenne.ca, and will (soon, hopefully) be handled by near-identical code on mersenne.org when I finish debugging the new manual-results parser there:[code]M( 10061 )C, 0x56eb9bb91825b188, offset = 9029, n = 1K, CUDALucas v2.05 Beta
M( 216091 )P, offset = 1234, n = 12K, CUDALucas v2.05 Beta[/code]I don't know if offset is relevant or would be printed in case of a prime, but it's handled anyways.

Lines can also be prefixed by userID/compID if known:[code]UID: flashjh/Server, M( 25928543 )C, 0x24b8387cb9765463, n = 1572864, CUDALucas v1.46[/code]And yes, keep the AID at the end if available. So, with everything in:[code]UID: flashjh/Server, M( 10061 )C, 0x56eb9bb91825b188, offset = 4054, n = 1K, CUDALucas v2.05 Beta, AID: DD556623539A3B33B816E3C5F77D1D97[/code]
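
For anyone who wants to check their own results.txt lines against this format, here is a minimal sketch of a parser. It is not the PHP code running on mersenne.ca; the regular expression, the field names, and the `parse_result_line` function are assumptions inferred only from the sample lines in this thread.

```python
import re

# Hypothetical pattern, inferred from the sample lines quoted in this thread.
# Optional parts: the UID prefix, the residue (absent for primes), the offset,
# and the trailing AID.
RESULT_RE = re.compile(
    r"^(?:UID:\s*(?P<user>[^/,]+)/(?P<computer>[^,]+),\s*)?"
    r"M\(\s*(?P<exponent>\d+)\s*\)(?P<status>[CP])"
    r"(?:,\s*0x(?P<residue>[0-9a-fA-F]{16}))?"
    r"(?:,\s*offset\s*=\s*(?P<offset>\d+))?"
    r",\s*n\s*=\s*(?P<fft>\d+K?)"
    r",\s*CUDALucas\s+v(?P<version>[^,]+?)"
    r"(?:,\s*AID:\s*(?P<aid>[0-9A-Fa-f]{32}))?\s*$"
)


def parse_result_line(line):
    """Return a dict of fields for a result line, or None if it does not match."""
    match = RESULT_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["exponent"] = int(fields["exponent"])
    if fields["offset"] is not None:
        fields["offset"] = int(fields["offset"])
    return fields
```

Fields that are missing from a line (for example the residue on a prime result, or the AID) come back as `None`, so a caller can tell "not printed" apart from "printed as zero".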

flashjh 2013-11-28 15:36

[QUOTE=James Heinrich;360500]<>I don't know if offset is relevant or would be printed in case of a prime, but it's handled anyways.<>[/QUOTE]
You are correct, this is the line:[CODE]M( 57885161 )P, n = 3136K, CUDALucas v2.05 Beta[/CODE]

kladner 2013-11-29 23:54

Memtest results GTX 570, 844 core, 1600 vram
 
Last few lines of:

E:\CUDA\2.05-BETA>CudaLucas.exe -memtest 56 1 -d 1
[QUOTE]Position 22, Data Type 0, Iteration 1110000, Errors: 0, completed 47.23%, Read 82.60GB/s, Write 27.53GB/s, ETA 18:20)
Position 22, Data Type 1, Iteration 1120000, Errors: 0, completed 47.66%, Read 82.44GB/s, Write 27.48GB/s, ETA 18:11)
Position 22, Data Type 2, Iteration 1130000, Errors: 0, completed 48.09%, Read 82.32GB/s, Write 27.44GB/s, ETA 18:02)
Position 22, Data Type 3, Iteration 1140000, Errors: 0, completed 48.51%, Read 82.41GB/s, Write 27.47GB/s, ETA 17:53)
Position 22, Data Type 4, Iteration 1150000, Errors: 0, completed 48.94%, Read 82.67GB/s, Write 27.56GB/s, ETA 17:44)
Position 23, Data Type 0, Iteration 1160000, Errors: 0, completed 49.36%, Read 82.30GB/s, Write 27.43GB/s, ETA 17:35)
Position 23, Data Type 1, Iteration 1170000, Errors: 0, completed 49.79%, Read 82.50GB/s, Write 27.50GB/s, ETA 17:27)
Position 23, Data Type 2, Iteration 1180000, Errors: 0, completed 50.21%, Read 82.60GB/s, Write 27.53GB/s, ETA 17:18)
Position 23, Data Type 3, Iteration 1190000, Errors: 0, completed 50.64%, Read 82.67GB/s, Write 27.56GB/s, ETA 17:09)
Position 23, Data Type 4, Iteration 1200000, Errors: 0, completed 51.06%, Read 82.28GB/s, Write 27.43GB/s, ETA 17:00)
C:/CUDA/CuLu/src/CUDALucas.cu(1438) : cudaSafeCall() Runtime API error 2: out of memory.[/QUOTE]

EDIT: The GTX 570 is a secondary GPU which does not drive a display, FWIW.

I am now running

CUDALucas -memtest 35 10 -d 1

4.11% complete, ETA: 04:09.

kladner 2013-11-30 05:19

1 Attachment(s)
[QUOTE=kladner;360682].....

I am now running

CUDALucas -memtest 35 10 -d 1

4.11% complete, ETA: 04:09.[/QUOTE]

The above completed successfully. GTX 570, 844 core, 1600 VRAM

Attached is the latter part of the run; the buffer for cmd was not large enough.

flashjh 2013-11-30 15:31

[QUOTE=LaurV;360489]:shock: They are allowed for ages, since 1.48 (the first stable one), few years ago. <...> Edit: sorry, let me be stupid few minutes each day... No coffee yet, this morning.<...> Beside of "shifts", any reasons to switch?[/QUOTE]
I read your original post the other day, and came back today since I finally have time, and saw that you edited your post. :smile:
I am, indeed, talking about being able to do both tests with CUDALucas. There have been a significant number of changes besides the shifting. A lot of work was done to eliminate errors from bad FFT selection. Primarily, CUDALucas now handles FFT errors by reverting to the last save and increasing the FFT length appropriately. In fact, the initial FFT selection is much better too. It makes sense to run the -cufftbench as you discussed to generate a good FFT file for your card. Memtest is also incorporated into this version now.

[QUOTE]Edit 2: some simple mechanism to protect against fraud is still missing, I would[U] vote against[/U] accepting "first-time LL" [B][U]and[/U][/B] "DC" from cudaLucas, for the same exponent. What stops me to edit the "offset" parameter, to get the credit two times? You will find after 20 years that we missed a prime because some idiot credit-whore (I learned the word here on the forum, as someone called it, sorry). At least, with P95 is not so easy for childish individuals to fake a report, due to the we1 checksum, etc. Some simple security mechanism should be implemented, beside of shifting, to make it safer. Don't get me wrong, no disrespect for your work, shifting is an [B][U]immense[/U][/B] improvement to guard against software (FFT bugs), for which I am very grateful.[/QUOTE]
I agree, but was falsely under the impression that the shift was what was missing to allow 1st-time and DCs on the same exponent. Your suggestion is heard loud and clear, but we need to know what ideas can be implemented to allow for both checks. Ideas?

[QUOTE]Edit 3: (BTW, after updating the drivers, I am also getting negative iteration times and negative ETA's too, which are very accurate if you multiply them with (about) minus 28 (!?!??!), and consider them in minutes, not in hours :smile:, using the "old good version" 2.04, untouched since Dubslow made it. But the residues are right, and it is about 1% faster, so I let it be).[/QUOTE]
If I may, I request that you please download and test 2.05Beta after your next exponent completes (there is a debug version if you need it). -t is always enabled now, but it is quite fast. I compile it with CUDA 5.5 using sm 13,20,30,35. I have actually tried to 'break' it messing with FFT sizes, etc. With rare exception it handles everything I throw at it and despite all my testing the checksums have all matched (so far). I've even had two that matched with two bad checksums ([URL="http://www.mersenne.org/report_exponent/?exp_lo=30424021&exp_hi=10000&B1=Get+status"]30424021[/URL] & [URL="http://www.mersenne.org/report_exponent/?exp_lo=30793229&exp_hi=10000&B1=Get+status"]30793229[/URL]). I also ran M[URL="http://www.mersenne.org/report_exponent/?exp_lo=62807803&exp_hi=62807803&B1=Get+status"]62807803[/URL] that [URL="http://www.mersenneforum.org/showthread.php?p=359101#post359101"]Lan_party[/URL] had trouble with and got a match, though I can't submit the result :rolleyes:

If you (or anyone else) start testing, please test the keyboard interaction on Windows as I have been unable to get it to work. If it does work for you then I'll know it's my system.

[QUOTE=James Heinrich;360500]These lines are now handled by mersenne.ca, and will (soon, hopefully) be handled by near-identical code on mersenne.org<>[/QUOTE]

[QUOTE=Prime95;359314]Can we change the intermediate output (example below) so that it does not look very much like the final result lines?<>[/QUOTE]

The changes to the output of 2.05Beta are done and I'll upload them shortly. The results file now outputs the format above, but the screen output is simplified for better formatting and to not allow GIMPS to read the result without a lot of changes that someone would need to do on purpose. In the future I'm sure we can implement output formatting similar to mfaktX.

kladner 2013-11-30 17:25

I ran the r47 version (not debug) overnight on both cards: 570 and 580. Event viewer shows two display driver restarts.

>>I experimented last night with the interactive feature. These are some great enhancements! Increase/decrease checkpoint interval works fine, as does +/- FFT. Auto FFT seems to make the best choices, so far. Decreasing always provoked a "restart with larger" response. Increasing several steps gave progressively slower performance. Toggle Polite works, as it always has.<<

I have up to date FFT and Threads files, generated yesterday. Stability seems to have improved, as I got multiple successful -r selftest runs on both cards. -memtest 35 10 completed on the 570. It ran for about two hours on the 580 without errors, but I got impatient and did not let it finish.

Let me know if there is any other info I might be able to provide. The two assignments are still running, so I don't know about residue matches, yet. I have 13 and 17 hours to go on those.

flashjh 2013-11-30 17:45

1 Attachment(s)
[URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]r48[/URL] is up with the updated display and results file output

So the Windows version allows you to use the interactive mode?

Edit: Attached a zip file with a .bat of [URL="http://www.mersenneforum.org/showthread.php?p=359102#post359102"]LaurV's post[/URL] (corrected) :smile:

kladner 2013-11-30 17:48

[QUOTE=flashjh;360733]r48 is up with the updated display and results file output

So the Windows version allows you to use the interactive mode?[/QUOTE]
Yes. Win 7 Pro, 64 bit. 331.82 drivers.

I'll get r48 if that is on Sourceforge now.

EDIT: Got r48. Should have refreshed the page the last time I looked.

Manpowre 2013-11-30 19:21

[QUOTE=kladner;360732]I ran the r47 version (not debug) overnight on both cards: 570 and 580. Event viewer shows two display driver restarts.

[/QUOTE]

I get device driver restarts with 2.03 and the latest drivers too, on both 580 and 590 boards. I found 306.23 to be the most stable driver for 5xx boards.

flashjh 2013-11-30 20:42

What FFT lengths are you using when the restarts happen?

On my current 580 system with the latest drivers (and several of the last few) I get restarts on the following FFT lengths:[CODE]
CUDALucas -cufftbench 3360 3360 6
CUDALucas -cufftbench 5040 5040 6
CUDALucas -cufftbench 5670 5670 6
CUDALucas -cufftbench 6720 6720 6[/CODE] I discovered these when I ran the .bat file I attached above. I have more 580s to test this on, but I haven't yet. I have tried updating and downgrading drivers with no luck. I have not rolled back to 306.23, I may try and see if it helps.

Anyone notice if these match the ones you're having trouble with or are they different?

[STRIKE]So the interactive does not work on my system. Anyone have any ideas as to what setting could be causing that? If I press a key with interactive enabled, I can see it (them) on the screen, but the program pauses and the only way to restart is to ^c and restart.
[/STRIKE]

I realized you have to press ENTER after the key press :redface:

EDIT:[QUOTE=kladner;360735]EDIT: Got r48. Should have refreshed the page the last time I looked.[/QUOTE]

I just put it there a while ago, it was probably after you looked the first time.

Prime95 2013-11-30 21:04

[QUOTE=flashjh;360724]I agree, but was falsely under the impression that the shift was what was missing to allow 1st-time and DCs on the same exponent. Your suggestion is heard loud and clear, but we need to know what ideas can be implemented to allow for both checks. Ideas?[/QUOTE]

We have 3 options that I see:

1) Disallow CUDALucas from double-checking CUDALucas results (the status quo).
2) Allow double-checks as long as the shift counts are different. The downside: it is real easy to forge a double-check.
3) Add a security code to the CUDALucas final result output (a simple hash of the exponent and shift count). This code could be secret, which isn't allowed if CUDALucas is GPL. Those building executables would be entrusted with the secret code. Or, the code can be public. At least a forger of results has to go to the trouble of reading C code.

There is an optional add-on to options 2 and 3. The Primenet server and/or GPU72 server can be upgraded to only give double-check exponents that were first tested by prime95.
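
To make option 3 concrete, here is a rough sketch of what a "simple hash of the exponent and shift count" could look like. This is not Prime95's actual checksum and not anything CUDALucas implements; the use of HMAC-SHA256, the 8-character truncation, the `security_code` and `verify` names, and the placeholder key are all assumptions for illustration.

```python
import hashlib
import hmac

# Placeholder only. In the "secret" variant this key would live in the
# official builds; in the "public" variant it would simply be in the source.
PLACEHOLDER_KEY = b"not-a-real-key"


def security_code(exponent, shift, key=PLACEHOLDER_KEY):
    """Derive a short code from the exponent and the shift count."""
    message = f"{exponent},{shift}".encode("ascii")
    digest = hmac.new(key, message, hashlib.sha256).hexdigest()
    return digest[:8].upper()


def verify(exponent, shift, code, key=PLACEHOLDER_KEY):
    """Check that a reported code matches the reported exponent and shift."""
    expected = security_code(exponent, shift, key)
    return hmac.compare_digest(expected, code.upper())
```

With something like this, editing only the "offset" value in a results line invalidates the code, which addresses the hand-editing case. It does nothing against someone who reads the source or extracts the key from a binary.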

chalsall 2013-11-30 22:12

[QUOTE=Prime95;360756]1) Disallow CUDALucas from double-checking CUDALucas results (the status quo).

There is an optional add-on to options 2 and 3. The Primenet server and/or GPU72 server can be upgraded to only give double-check exponents that were first tested by prime95.[/QUOTE]

I would like to argue that Option #1 is the only one which guarantees the integrity of the GIMPS knowledge of the status of MPs. Or, at least, doesn't leave it wide-open to attack. There are rarely times where security through obscurity is warranted, but this might be one of them. [COLOR="White"](Anyone good with a de-compiler / dis-assembler could still break this thing, though...)[/COLOR]

Towards the end of not wasting resources, I will add to the top of my todo list to implement an additional option on the GPU72 manual assignment page: For CPU, For GPU.

Prime95 2013-11-30 22:24

[QUOTE=chalsall;360759]I would like to argue that Option #1 is the only one which guarantees the security of the GIMPS knowledge of the status of MPs.[/QUOTE]

I don't disagree, except that we are already vulnerable. Prime95 uses option 3 with "secret" code. Will we make matters any worse by giving CUDALucas the exact same vulnerability?

chalsall 2013-11-30 22:26

[QUOTE=Prime95;360761]Will we make matters any worse by giving CUDALucas the exact same vulnerability?[/QUOTE]

Perhaps this vulnerability should be closed.

kladner 2013-12-01 03:10

[QUOTE=kladner;360735]Yes. Win 7 Pro, 64 bit. 331.82 drivers.

[/QUOTE]

Updated driver to 331.93 BETA. No noticeable difference.

r48 seems determined to output each checkpoint on two lines, no matter how wide the CMD box.

flashjh 2013-12-01 03:13

[QUOTE=kladner;360775]Updated driver to 331.93 BETA. No noticeable difference.

r48 seems determined to output each checkpoint on two lines, no matter how wide the CMD box.[/QUOTE]

It always scrolled over two lines anyway, so I added a newline to make it easier to read. [STRIKE]Are you on windows or linux?[/STRIKE]

Edit: I'll remove the newline on the next commit.

kladner 2013-12-01 03:42

[QUOTE=flashjh;360776]It always scrolled over two lines anyway, so I added a newline to make it easier to read. [STRIKE]Are you on windows or linux?[/STRIKE]

Edit: I'll remove the newline on the next commit.[/QUOTE]

No biggy. I just wanted to confirm that it is not a malfunction.

EDIT: Can it be made an option, to split or not to split?

flashjh 2013-12-01 03:49

It's no problem. Changes are posted on sourceforge. If anything else needs updating, etc. just post it and I'll include it in a future commit.

I want to bring in the custom output formatting from mfaktX, so I'll look at line breaks as well. That will also allow for adding username and computer id to the results file line.

kladner 2013-12-01 04:08

[QUOTE=flashjh;360752]What FFT lengths are you using when the restarts happen?
.......
[/QUOTE]

Got the data, but forgot to post it, till now-[INDENT]30.8M exponent, 1728K, GTX 570
37.5M exponent, 2048K, GTX 580

[/INDENT]

kladner 2013-12-01 05:55

306.23 gives this-
[CODE]E:\CUDA\2.05-BETA\CL_2.05_A>cudalucas -r

device_number >= device_count ... exiting
(This is probably a driver problem)[/CODE]I'll move back up to something a bit more recent and see what happens.

EDIT: Shoot! 314.22 gives the same result with CUDALucas.......R49.

kladner 2013-12-01 07:58

Back on driver 331.82. Seeming to run pretty well, again.

I have noticed that CUDALucas does not load the GPU's as heavily as mfaktc. My line-measured power consumption is down ~80 W with CL running on both cards. This is with nearly the OC core settings that mfaktc will run at. I still feel better about turning down the VRAM even from stock speeds to run CL, and it does affect the iteration time and the power consumption.

Regardless, with CL running on both cards, the whole system is pulling ~720 W with P95 running all eight cores of an FX-8350 on P-1, with 24 GB of RAM allowed. If the GPU's were running mfaktc, the power draw would be a bit over 800 W.

ET_ 2013-12-01 10:35

May I gently ask to also post GPU type, OS source name and release when you test new drivers?

Thanks :bow:

Luigi

flashjh 2013-12-01 12:15

[QUOTE=kladner;360788]306.23 gives this-
[CODE]E:\CUDA\2.05-BETA\CL_2.05_A>cudalucas -r

device_number >= device_count ... exiting
(This is probably a driver problem)[/CODE]I'll move back up to something a bit more recent and see what happens.

EDIT: Shoot! 314.22 gives the same result with CUDALucas.......R49.[/QUOTE]

320.18 is the first WHQL driver built on CUDA 5.5, which should make it the earliest release driver that works with this build of CUDALucas. I have every CUDA version from 3.2 up installed. I could try building against earlier versions if you're interested.

Manpowre 2013-12-01 13:21

[QUOTE=kladner;360792]Back on driver 331.82. Seeming to run pretty well, again.

I have noticed that CUDALucas does not load the GPU's as heavily as mfaktc. My line-measured power consumption is down ~80 W with CL running on both cards. This is with nearly the OC core settings that mfaktc will run at. I still feel better about turning down the VRAM even from stock speeds to run CL, and it does affect the iteration time and the power consumption.

Regardless, with CL running on both cards, the whole system is pulling ~720 W with P95 running all eight cores of an FX-8350 on P-1, with 24 GB of RAM allowed. If the GPU's were running mfaktc, the power draw would be a bit over 800 W.[/QUOTE]

This is probably due to the amount of memcopy done back and forth between each iteration, host->device->host. As far as I understand mfaktc, it keeps its data in device memory and therefore keeps the card a lot busier.

kladner 2013-12-01 14:46

[QUOTE=ET_;360795]May I gently ask to also post GPU type, OS source name and release when you test new drivers?

Thanks :bow:

Luigi[/QUOTE]

Sorry. I was running in sloppy late-night mode.

Driver 331.82, latest WHQL
The cards are a Gigabyte GTX 570, and an Asus GTX 580.
Windows 7 Pro 64 bit, SP 1, all current Windows updates.
More on request if I missed something.

EDIT: Completed a DC on each card, matched residues on both. Before completion, the 580 log showed three batch file starts, or two restarts. This is an incomplete picture as it restarted several times in the previous evening. Some of these were spontaneous, while others had to do with switching out drivers.

kladner 2013-12-01 15:16

[QUOTE=flashjh;360781]It's no problem. Changes are posted on sourceforge. If anything else needs updating, etc. just post it and I'll include it in a future commit.

I want to bring in the custom output formatting from mfaktX, so I'll look at line breaks as well. That will also allow for adding username and computer id to the results file line.[/QUOTE]

Thanks, Jerry. For some reason, I can sort out the lines more easily without the break, in spite of the rather wide box that requires.

EDIT: Another display driver restart, GTX 570 running CUDALucas_BETA_2.05_r49, 580 running mfaktc.
[CODE]C:/CUDA/CuLu/src/CUDALucas.cu(372) : cudaSafeCall() Runtime API error 30: unknown error.[/CODE]Aside from the CL batch file loop restart, this did not seem to cause any disruption. mfaktc appeared to be unaffected.

LaurV 2013-12-01 16:36

Well, it took me a while to uncover this bug... I was beginning to think I am stupid :smile:, because for all of you it was working, but for me not...

Then suddenly it came... (I had to take the options one by one and play with them!)

I got so many errors about my cards not having enough memory, registers, wheels, purple lights, whatever, it even said I have minus few terabytes of RAM (!?!?), I was ready to give up... Then I tried to use the -info switch to see what freaking card he believes I have...

... And with -info switch it worked!

That is when it hit me! I have "PrintDeviceInfo=0" in the ini file ("who the heck needs that? I know what kind of card I have!").

If you have "PrintDeviceInfo=0" in the ini file, then the program not only skips printing the device info on screen, but also skips reading it for itself... :razz:

[CODE]
e:\CudaLucas\CL0>cl205b_x64r49 -info

------- DEVICE 0 -------
name GeForce GTX 580
Compatibility 2.0
clockRate (MHz) 1564
memClockRate (MHz) 2004
totalGlobalMem 1610612736
totalConstMem 65536
l2CacheSize 786432
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 16
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1

mkdir: cannot create directory `backup0': File exists
Using threads: norm1 256, mult 128, norm2 128.
Starting M37500769 fft length = 2048K
SIGINT caught, writing checkpoint. Estimated time spent so far: 0:39

[COLOR=Red]<it works perfectly>[/COLOR]

e:\CudaLucas\CL0>cl205b_x64r49

mkdir: cannot create directory `backup0': File exists
Using threads: norm1 256, mult 128, norm2 128.
over specifications Grid = 4096
try increasing norm1 threads (256) or decreasing FFT length (2048K)

[COLOR=Red]<freaks out>[/COLOR]

e:\CudaLucas\CL0>
[/CODE]

flashjh 2013-12-02 01:23

LaurV, Good find!

Found the problem in the init_device function.

I tested it, but please test again, thanks! :smile:

Committed the change and updated the .exe [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]files[/URL].

Edit:
[QUOTE=Prime95;360761]I don't disagree, except that we are already vulnerable. Prime95 uses option 3 with "secret" code. Will we make matters any worse by giving CUDALucas the exact same vulnerability?[/QUOTE]

[QUOTE=chalsall;360762]Perhaps this vulnerability should be closed.[/QUOTE]

I like the idea of keeping the code open.

I think the changes to Primenet/G72 are a great option for now, but they require a reasonable amount of work, right? (And there is no guarantee that someone getting assignments would use the right option anyway). As such, I think leaving things as they are, may work best and once CUDALucas is stable and produces reliable results we can readdress the need for the secret code. Thoughts?

Also, CUDA 6 is going to (potentially significantly) change CUDALucas. This is one reason I don't think making big changes right now is a good idea.

LaurV 2013-12-02 03:41

Just for the record, and as a guy who makes a living from writing code, I have nothing against "secret" CRCs. A small function in a dll: cudaLucas can call it to generate some key, which may also depend on the assignment key (if the work was "legally" reserved). It can call the function every 1M iterations, and each time add a few characters to the key string. At the end, the keys would be easy to verify without redoing the whole work. We should not be afraid of "vulnerabilities", and it does not need to be something very complicated. Prime95 is fine as it is.

My point is that people who [U]know[/U] how to exploit the vulnerability are too clever and too mature to use the exploit; they are "above" the "credit hunting fever". You don't get money for it (you cannot "fake" a prime, for example, as it will be verified by others immediately), and you don't even get "fame". On the contrary, someone can realize you are cheating the system, and you will have more to lose and suffer from the community. The "guarding" has to be against "childish" and "cmd*-like" stuff, like editing a text line and reporting it twice, which anybody could do. (I wanted to write "any kid", but realized that kids today are so clever... hehe...)


(* for the new users here, "cmd" is a mersenneforum user who liked to do this kind of stupid thing, like adding all numbers with 37 digits to factorDB)

LaurV 2013-12-02 17:24

Did [URL="http://www.mersenne.org/report_exponent/?exp_lo=37500769&exp_hi=&B1=Get+status"]37500769[/URL] a few times, it is already "multiple" checked :smile: (I changed the version of cL to 2.05 on the way to a triple test; the first was a mismatch, then I realized that it didn't resume the work, but started from scratch. As the other test was almost finished, I decided to go back to 2.04 and finish both of them.) So now I got the same residue 3 times, so mine is good, for sure! Bookmarked, waiting for some P95 checkers to confirm...

The release 50 seems to work ok. I like the interactive options, and the thread tuning. The new drivers are a bit slower for older cards (gtx 580), but this is compensated by the tuning features and other small nice things...

Good job!

flashjh 2013-12-02 17:34

When I get some time, I'm going to try and compile from other CUDA versions to see if that helps with speed at all. I've gotten away from tracking the exact iteration times because I want to get the program stable. Maybe I'll worry about the speed when I'm racing to beat you all for DCing the next MP :razz:

So far when testing 2.05 the mismatches already had two LLs done, so my DC matched one. I completed [URL="http://www.mersenne.org/report_exponent/?exp_lo=30612941&exp_hi=&B1=Get+status"]30612941[/URL] yesterday and it was a mismatch. petrw1 picked up the assignment for DC, so I'll have to wait until it's done. I could run it for a TC, maybe tomorrow? For 2.05's sake, hopefully the DC matches mine.

owftheevil 2013-12-02 17:56

[QUOTE=Prime95;360756]We have 3 options that I see:

1) Disallow CUDALucas from double-checking CUDALucas results (the status quo).
2) Allow double-checks as long as the shift counts are different. The downside: it is real easy to forge a double-check.
3) Add a security code to the CUDALucas final result output (a simple hash of the exponent and shift count). This code could be secret, which isn't allowed if CUDALucas is GPL. Those building executables would be entrusted with the secret code. Or, the code can be public. At least a forger of results has to go to the trouble of reading C code.

There is an optional add-on to options 2 and 3. The Primenet server and/or GPU72 server can be upgraded to only give double-check exponents that were first tested by prime95.[/QUOTE]

The code that generates the security code could be a separate application.

kracker 2013-12-02 18:26

[QUOTE=LaurV;360902]Did [URL="http://www.mersenne.org/report_exponent/?exp_lo=37500769&exp_hi=&B1=Get+status"]37500769[/URL] few times, it is already "multiple" checked :smile: [/QUOTE]

Yay... I somehow got it. I'll probably toss it on my 7770.

LaurV 2013-12-02 18:44

[QUOTE=kracker;360921]Yay... I somehow got it. I'll probably toss it on my 7770.[/QUOTE]
If you do that, you will get no credit. PrimeNet will not accept a result for this expo, if it does not come from P95. That is why all the discussions about shifts and secret CRCs. In the past we had a thread for this type of exponents (which were DC-ed and TC-ed [U]with CudaLucas[/U], and still gave different residues from the original test), to warn the potential crunchers that they must use P95 for them, otherwise they waste the time. I even was moderator and did the maintenance for that thread, but since the rights were restricted, the thread is forgotten and I can not find it.

kracker 2013-12-02 18:52

[QUOTE=LaurV;360925]If you do that, you will get no credit. PrimeNet will not accept a result for this expo, if it does not come from P95. That is why all the discussions about shifts and secret CRCs. In the past we had a thread for this type of exponents (which were DC-ed and TC-ed [U]with CudaLucas[/U], and still gave different residues from the original test), to warn the potential crunchers that they must use P95 for them, otherwise they waste the time. I even was moderator and did the maintenance for that thread, but since the rights were restricted, the thread is forgotten and I can not find it.[/QUOTE]

I use clLucas. Is it treated the same as CudaLucas, or considered different?

EDIT: "if it does not come from P95" Sorry didn't see that. Tossing the exponent on a CPU.
EDIT2: [URL="http://mersenneforum.org/showthread.php?t=16281"]This?[/URL]

LaurV 2013-12-02 19:00

[QUOTE=kracker;360926]EDIT: "if it does not come from P95" Sorry didn't see that. Tossing the exponent on a CPU.
[/QUOTE]
Yes, one of the checks, either the first LL or the DC, must come from P95; there is no other way, because up to now (i.e. before cudalucas 2.05, done last week) only P95 implemented the random shift. All other programs used shift zero (clLucas included), and in case of an FFT error they will all get the same (wrong) residue.
[QUOTE]
EDIT2: [URL="http://mersenneforum.org/showthread.php?t=16281"]This?[/URL][/QUOTE]Yes, thanks, didn't really look (1:55 A.M. here!), had a bookmark, which "expired", hehe. I will update the bookmark. You should read it, it has some basic theory which will help you understand the shifting process :razz:
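
For readers new to the shift idea, the following toy sketch shows why a shifted run should end on the same residue as an unshifted one. It uses plain Python integers on tiny exponents and has nothing to do with how CUDALucas or Prime95 implement it on FFT data; the function name and structure are assumptions for illustration. The state is stored multiplied by a power of two, that power doubles (mod p) at every squaring, and it is divided out again at the end.

```python
def lucas_lehmer_residue(p, shift=0):
    """Final Lucas-Lehmer value for M_p = 2^p - 1, computed on shifted data.

    The stored value x equals s * 2^k (mod M_p), where s is the ordinary
    Lucas-Lehmer term. Squaring doubles the shift, so k -> 2k (mod p),
    because 2^p is congruent to 1 (mod M_p).
    """
    m = (1 << p) - 1
    k = shift % p
    x = (4 << k) % m                      # s_0 = 4, stored shifted by k bits
    for _ in range(p - 2):
        # (s * 2^k)^2 - 2 * 2^(2k) = (s^2 - 2) * 2^(2k)
        x = (x * x - (2 << (2 * k))) % m
        k = (2 * k) % p
    # Undo the remaining shift: multiply by 2^(p - k), the inverse of 2^k.
    return (x * pow(2, p - k, m)) % m
```

Every shift gives the same final value in exact arithmetic. The point of choosing a different shift for the double-check is that the intermediate data differs, so a hardware or FFT rounding error is unlikely to reproduce the same wrong residue in both runs.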

flashjh 2013-12-02 19:17

[QUOTE=owftheevil;360911]The code that generates the security code could be a separate application.[/QUOTE]
Would it be a program called by CUDALucas or could it be in a .dll file? Do you have something in mind?

owftheevil 2013-12-02 19:26

[QUOTE=flashjh;360932]Would it be a program called by CUDALucas or could it be in a .dll file? Do you have something in mind?[/QUOTE]

Could be either, or something else, and yes I have something in mind.

kladner 2013-12-03 00:21

FWIW, I have run one DC on my GTX 580, and two on the 570. All have matched. All were single first-time LL's. These runs included both intentional restarts, and program crash restarts. Various interactive features were invoked, and I played with clock speeds for both core and VRAM in the course of the tests.

chappjc 2013-12-03 01:21

I haven't been around for a while, but I recently started TF'ing with mfaktc on a 580, 670 and C2075. Now, with the latest updates to CUDALucas 2.05, I am actually able to run it on these GPUs (although the 580 has random runtime API errors if FFT length is not tweaked).

Are results from CUDALucas 2.05 accepted at PrimeNet? I completed an exponent in a few days, and the system didn't find any CUDALucas lines in the results.txt, but there was a line there indicating version 2.05.

Also, are there any tests I should run and report here?

flashjh 2013-12-03 03:25

1 Attachment(s)
Welcome back!

A few of us have some particular FFTs that won't work and the issue with the program stopping will get addressed. We're using a [URL="http://www.mersenneforum.org/showpost.php?p=360417&postcount=2030"][COLOR=#0066cc]simple loop[/COLOR][/URL] to keep CUDALucas going if it stops until the code is fixed.
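
The linked post has the loop that is actually in use. Purely as an illustration of the idea, a restart wrapper could look like the sketch below. It assumes the program exits with a non-zero return code when it stops on an error, which has not been confirmed for CUDALucas; the function name, the restart cap, and the delay are all made up for this example.

```python
import subprocess
import sys
import time


def run_until_clean_exit(command, max_restarts=5, delay_seconds=0.0):
    """Re-run `command` until it exits with code 0 or the restarts run out.

    Returns the number of times the command was launched.
    """
    launches = 0
    while launches <= max_restarts:
        launches += 1
        result = subprocess.run(command)
        if result.returncode == 0:
            break
        print(f"exit code {result.returncode}, launch {launches}", file=sys.stderr)
        time.sleep(delay_seconds)         # give the driver a moment to recover
    return launches
```

A call such as `run_until_clean_exit(["CUDALucas", "-d", "1"], max_restarts=20, delay_seconds=10)` would then play the role of the batch loop. The cap on restarts is there so a card that fails immediately every time does not spin forever.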

As for tests:

1) For your cards, you should run the batch file attached for each card. It will take a while and some of the FFTs may fail as you've experienced, but it will create two files that help fine-tune CUDALucas for each card.

2) Run the built-in [URL="http://www.mersenneforum.org/showpost.php?p=359754&postcount=2003"][COLOR=#0066cc]memtest[/COLOR][/URL]. CUDALucas -memtest k n. Read from mid Nov threads until now to see more info.

3) Run the built-in test CUDALucas -r. Make sure all residues match.

The results are accepted as long as the exponent(s) don't already have a CUDALucas/mlucas residue. Download the latest version from [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"][COLOR=#0066cc]sourceforge[/COLOR][/URL] and it will format the results.txt file correctly. Use the format to properly format previous results.

If you have any bugs/suggestions, let us know. Thanks for testing and your contribution.

LaurV 2013-12-03 04:02

[QUOTE=flashjh;360992]The results are accepted as long as the exponent(s) don't already have a CUDALucas/mlucas [U]same[/U] residue [U](i.e. different residues are accepted, the server can't know which one is good, until DC-ed) and as long as you don't use the "user/computer/timestamp" option of cudaLucas. You ca use manual report form to report the results[/U].[/QUOTE]

The underlined text is mine. The rest is as Jerry said.

flashjh 2013-12-03 04:04

Said much better, thanks. :smile:

henryzz 2013-12-03 19:24

[QUOTE=owftheevil;360911]The code that generates the security code could be a separate application.[/QUOTE]
The API to call that program wouldn't be secret though and that could probably be abused.

chappjc 2013-12-03 19:26

Thanks for the advice. I was running r47, so the formatting changes were not included. Once I reformatted the results.txt, [URL="http://www.mersenne.org/report_exponent/?exp_lo=56803127&exp_hi=&B1=Get+status"]PrimeNet recognized it[/URL]. I left the "AID" part at the end of the line. Should I run the same exponent again just to verify? This result is from the 670. I'll have another result in just 92 hours! :smile:

All cards pass all residue tests (CUDALucas -r). I ran a few very short memory tests (i.e. -memtest 6 2), and a longer one is presently running on the 580.

One thing to note about the 580, which always has runtime API errors, is that it is also the display card. Often the driver (331.82) stops responding and recovers. The other two cards, on which I have never seen a runtime error (yet), are in two different machines and are not the display cards.

I ran the batch script, which generated the fft and threads .txt files, but some of the results are surprising to me. At 2592K, the optimal thread values drop sharply:

[CODE]...
2048 512 512 256 2.8779
2240 512 512 256 3.3209
2304 512 512 128 3.3607
2352 512 512 1024 3.8242
2592 64 32 32 3.9552
2688 64 64 32 4.6925
2880 64 32 32 4.6117
3024 64 32 32 5.1544
3136 64 32 32 4.9940
...[/CODE]

That probably makes sense for a 580 with 3GB, but I just wanted to make sure.

owftheevil 2013-12-03 19:38

[QUOTE=henryzz;361055]The API to call that program wouldn't be secret though and that could probably be abused.[/QUOTE]


You are right. I thought more about it last night and came to the same conclusion. Personally, I have no problem with changing the license to account for a closed source authenticator.

owftheevil 2013-12-03 19:47

[QUOTE=chappjc;361056]Thanks for the advice. I was running r47, so the formatting changes were not included. Once I reformatted the results.txt, [URL="http://www.mersenne.org/report_exponent/?exp_lo=56803127&exp_hi=&B1=Get+status"]PrimeNet recognized it[/URL]. I left the "AID" part at the end of the line. Should I run the same exponent again just to verify? This result is from the 670. I'll have another result in just 92 hours! :smile:

All cards pass all residue tests (CUDALucas -r). I ran a few very short memory tests (i.e. -memtest 6 2), and a longer one is presently running on the 580.

One thing to note about the 580, which always has runtime API errors, is that it is also the display card. Often the driver (331.82) stops responding and recovers. The other two cards, on which I have never seen a runtime error (yet), are in two different machines and are not the display cards.

I ran the batch script, which generated the fft and threads .txt files, but some of the results are surprising to me. At 2592K, the optimal thread values drop sharply:

[CODE]...
2048 512 512 256 2.8779
2240 512 512 256 3.3209
2304 512 512 128 3.3607
2352 512 512 1024 3.8242
2592 64 32 32 3.9552
2688 64 64 32 4.6925
2880 64 32 32 4.6117
3024 64 32 32 5.1544
3136 64 32 32 4.9940
...[/CODE]That probably makes sense for a 580 with 3GB, but I just wanted to make sure.[/QUOTE]

That looks fishy to me. The third thread parameter flops around a lot, but the first two are usually pretty stable. How are the timings as compared to the fft bench test? I'd like to see the corresponding section of <gpu> fft.txt.
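A quick way to spot this kind of discontinuity in a threads file is to compare each row with the previous one. The sketch below assumes the column layout seen in the posted excerpt (FFT length in K, three thread values, a time in msec); the column meanings, the function name `find_thread_jumps` and the factor of 4 are assumptions, not something CUDALucas defines.

```python
def find_thread_jumps(text, factor=4):
    """Report rows where the 1st or 2nd thread value jumps by >= factor."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skips "..." markers and blank lines
        try:
            fft = int(parts[0])
            threads = tuple(int(p) for p in parts[1:4])
            float(parts[4])  # validate the timing column
        except ValueError:
            continue
        rows.append((fft, threads))
    jumps = []
    for prev, cur in zip(rows, rows[1:]):
        for idx in (0, 1):  # the third column is expected to be noisy
            a, b = prev[1][idx], cur[1][idx]
            if max(a, b) >= factor * min(a, b):
                jumps.append((cur[0], idx + 1, a, b))
    return jumps
```

On the excerpt quoted above, this flags only the 2592K row, where the first two values fall from 512 to 64 and 32.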

flashjh 2013-12-03 20:18

[QUOTE=chappjc;361056]<>Should I run the same exponent again just to verify? This result is from the 670. I'll have another result in just 92 hours! :smile:[/QUOTE]You can run it again, but use P95. Otherwise, just let the natural DC process test it (whenever that happens). Also, if you're in the process of 'verifying' that your cards are stable, I recommend you pull DCs from PrimeNet or GPU72; that way you will know whether your card is producing good results. If a result mismatches, you can post it [URL="http://www.mersenneforum.org/showthread.php?p=333819&goto=newpost"]here[/URL], which will tell others not to use CUDALucas to DC/TC the exponent. Sometimes folks will do a quick run on it for you so you can see which result was wrong (or whether both were). You can always do another run on the GPU, but PrimeNet won't accept it unless the residue is different.

[QUOTE]One thing to note about the 580, which always has runtime API errors, is that it is also the display card. Often the driver (331.82) stops responding and recovers. The other two cards, on which I have never seen a runtime error (yet), are in two different machines and are not the display cards.[/QUOTE]I have a 580 with the same issue, and others have this problem with other cards. owftheevil said it's caused by the drivers, but it will get fixed. My 580 is not the display card and it still happens.

owftheevil 2013-12-03 20:57

[QUOTE=flashjh;361062]You can run it again, but use P95. Otherwise, just let the natural DC process test it (whenever that happens). Also, if you're in the process of 'verifying' that your cards are stable, I recommend you pull DCs from PrimeNet or GPU72; that way you will know whether your card is producing good results. If a result mismatches, you can post it [URL="http://www.mersenneforum.org/showthread.php?p=333819&goto=newpost"]here[/URL], which will tell others not to use CUDALucas to DC/TC the exponent. Sometimes folks will do a quick run on it for you so you can see which result was wrong (or whether both were). You can always do another run on the GPU, but PrimeNet won't accept it unless the residue is different.

I have a 580 with the same issue, and others have this problem with other cards. owftheevil said it's caused by the drivers, but it will get fixed. My 580 is not the display card and it still happens.[/QUOTE]


Maybe I misunderstand you, but the problem won't be fixed until Nvidia does something about their drivers. All I'm trying to do is make the batch files unnecessary for restarting CL when the error does occur. It won't take away the fft hangs, resetting drivers etc. By the way I have it working on Linux, but Windows is again another story.

flashjh 2013-12-03 21:06

[QUOTE=owftheevil;361069]Maybe I misunderstand you, but the problem won't be fixed until Nvidia does something about their drivers. All I'm trying to do is make the batch files unnecessary for restarting CL when the error does occur. It won't take away the fft hangs, resetting drivers etc. By the way I have it working on Linux, but Windows is again another story.[/QUOTE]
Ok, so you can detect and restart, but the 'real' problem is the drivers? I thought it was a good fix. Sorry for the confusion.

If you have the code working for Linux, can you commit/merge it with the changes on SourceForge so I can take a look at it on Windows?

chappjc 2013-12-03 21:07

[QUOTE=owftheevil;361060]That looks fishy to me. The third thread parameter flops around a lot, but the first two are usually pretty stable. How are the timings as compared to the fft bench test? I'd like to see the corresponding section of <gpu> fft.txt.[/QUOTE]

From "GeForce GTX 580 fft.txt":

[CODE] 2048 38492887 2.9761
2160 40551479 3.5742
2240 42020509 3.6679
2304 43194913 3.6846
2592 48471289 3.9861
2880 53735041 4.6150
3072 57237889 4.9730
3136 58404433 4.9740[/CODE]

Do you want to see the full output from -cufftbench 2592 2592 6?
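To answer the question about how the threads-file timings compare with the fft bench, the two excerpts can be joined on FFT length. This is a small sketch that assumes the column positions seen in the posted excerpts (time in the fifth column of the threads file and the third column of fft.txt); the helper names are invented for illustration.

```python
def parse_columns(text, key_col, value_col):
    """Map int(column key_col) -> float(column value_col) for valid rows."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        try:
            table[int(parts[key_col])] = float(parts[value_col])
        except (IndexError, ValueError):
            continue  # skip "..." markers, blanks and malformed rows
    return table


def compare_timings(threads_text, fft_text):
    """Return (fft, threads_ms, bench_ms, ratio) for sizes in both files."""
    threads = parse_columns(threads_text, 0, 4)
    bench = parse_columns(fft_text, 0, 2)
    common = sorted(set(threads) & set(bench))
    return [(k, threads[k], bench[k], threads[k] / bench[k]) for k in common]
```

For the six FFT lengths that appear in both excerpts (2048, 2240, 2304, 2592, 2880 and 3136), the threads-file times stay within roughly 10% of the fft.txt times, so the timings themselves do not explain the drop in thread values at 2592K.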

chalsall 2013-12-03 21:22

[QUOTE=flashjh;361070]If you have the code working for Linux, can you commit/merge it with the changes on SourceForge so I can take a look at it on Windows?[/QUOTE]

Just putting this out there for thought...

If you're not having fun, perhaps you should be doing something different.

Clearly we're having fun here, even if some don't understand the interest, the work, or the humor....

owftheevil 2013-12-03 22:31

[QUOTE=chappjc;361071]From "GeForce GTX 580 fft.txt":

[CODE] 2048 38492887 2.9761
2160 40551479 3.5742
2240 42020509 3.6679
2304 43194913 3.6846
2592 48471289 3.9861
2880 53735041 4.6150
3072 57237889 4.9730
3136 58404433 4.9740[/CODE]Do you want to see the full output from -cufftbench 2592 2592 6?[/QUOTE]

Yes, that would be useful.

chappjc 2013-12-03 23:07

1 Attachment(s)
Attached is the output of [FONT="Courier New"]CUDALucas -cufftbench 2592 2592 6[/FONT]. I don't get what it means by "best time" as it seems unrelated to the "ave time" values reported for the different threads.

owftheevil 2013-12-04 14:43

Thanks for those results. I'm still perplexed.

The first 36 lines are only timing the two normalization kernels which are the only things that depend on the thread values being varied. The last six lines are testing a full LL iteration with the two normalization kernels, the multiplication kernel, and two ffts.

flashjh 2013-12-08 21:11

CUDALucas 2.05 Beta r52 posted to sourceforge. New Windows executables are [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]here[/URL].

The code to exit when one of those fft hangs occurs is deleted. The problem is that Windows resets the driver after the timeout error, so the code needs to wait and then check that everything is ready to go.
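The "wait and then check" step is a plain polling loop. The sketch below shows the pattern in Python rather than in the CUDA source; `probe` stands for any hypothetical readiness check (for example, a small device call that succeeds once the driver is back), and the timeout and interval values are assumptions.

```python
import time


def wait_until_ready(probe, timeout_s=60.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s has elapsed.

    Returns True if the probe succeeded, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)  # back off before probing the device again
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without real delays.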

[B]Just a heads-up: the timing is now handled a little differently, so tests resumed from old savefiles will give incorrect ETAs.[/B]

There is also now a simple checksum to verify the disk data, replacing the old "does the save file have the prime q in the correct location" method of verification. So that old savefiles can still be used with the new format, the checksum is not enforced yet; a mismatch only produces a warning.
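The warn-but-don't-enforce behaviour can be illustrated as follows. The actual checksum and savefile layout used by r52 are not described in this thread, so everything here is a stand-in: a sum of 32-bit words stored in the last four bytes, with the names `simple_checksum` and `load_savefile` chosen for illustration.

```python
import struct
import warnings


def simple_checksum(payload):
    """Sum of little-endian 32-bit words, modulo 2**32 (illustrative only)."""
    padded = payload + b"\x00" * (-len(payload) % 4)
    words = struct.unpack("<%dI" % (len(padded) // 4), padded)
    return sum(words) & 0xFFFFFFFF


def load_savefile(blob, enforce=False):
    """Split blob into payload + trailing 4-byte checksum and verify it."""
    if len(blob) < 4:
        raise ValueError("savefile too short")
    payload = blob[:-4]
    stored = struct.unpack("<I", blob[-4:])[0]
    if simple_checksum(payload) != stored:
        if enforce:
            raise ValueError("savefile checksum mismatch")
        # Transitional behaviour: old savefiles still load, with a warning.
        warnings.warn("savefile checksum mismatch; loading anyway")
    return payload
```

Switching `enforce` to `True` later would turn the warning into a hard failure once old savefiles no longer need to be supported.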

Other changes:
1. overflow error checking
2. consolidated device memory allocations, reduces amount used slightly
3. tighter fft selection
4. better error handling
5. method for thread testing a range instead of just a single fft (e.g. ./CUDALucas -cufftbench 8192 1 5, end of range first)
6. put the ffts back into threads test, slower but much more accurate results on cards used for display

Please test and post results. If anyone has verified mismatches with r50, please post them, and if you get any with this version, let us know. Thanks!

ET_ 2013-12-08 21:24

[QUOTE=flashjh;361490]CUDALucas 2.05 Beta r52 posted to sourceforge. New Windows executables are [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]here[/URL].

The code to exit when one of those fft hangs occurs is deleted. The problem is that Windows resets the driver after the timeout error, so the code needs to wait and then check that everything is ready to go.

[B]Just a heads-up: the timing is now handled a little differently, so tests resumed from old savefiles will give incorrect ETAs.[/B]

There is also now a simple checksum to verify the disk data, replacing the old "does the save file have the prime q in the correct location" method of verification. So that old savefiles can still be used with the new format, the checksum is not enforced yet; a mismatch only produces a warning.

Other changes:
1. overflow error checking
2. consolidated device memory allocations, reduces amount used slightly
3. tighter fft selection
4. better error handling
5. method for thread testing a range instead of just a single fft (e.g. ./CUDALucas -cufftbench 8192 1 5, end of range first)
6. put the ffts back into threads test, slower but much more accurate results on cards used for display

Please test and post results. If anyone has verified mismatches with r50, please post them, and if you get any with this version, let us know. Thanks![/QUOTE]

Is this a Windows-only update?

Luigi

flashjh 2013-12-08 22:12

No, it applies to Linux as well. I requested a Linux build for SourceForge, but if you can compile it yourself, the updates need to be tested. Thanks.

flashjh 2013-12-10 23:44

I'm still getting the stop error and the batch file needs to keep CUDALucas going.

This is the code identified for the error:[CODE]
void reset_err(float* maxerr, float value)
{
  cutilSafeCall (cudaMemset (g_err, 0, sizeof (float)));
  *maxerr *= value;
}
[/CODE] This is the screen output:[CODE]Using threads: norm1 128, mult 256, norm2 128.
C:/CUDA/CuLu/src/CUDALucas.cu(543) : cufftSafeCall() CUFFT error 9999: CUFFT Unknown error code[/CODE]

kladner 2013-12-10 23:48

[QUOTE] Code:
Using threads: norm1 128, mult 256, norm2 128. C:/CUDA/CuLu/src/CUDALucas.cu(543) : cufftSafeCall() CUFFT error 9999: CUFFT Unknown error code
[/QUOTE]

Interesting. The error has changed, at least from the one I saw when the program quit for me.

flashjh 2013-12-10 23:58

It's different now because owftheevil made changes to the code. We're still seeing whether we can get the program to catch and clear the fault without exiting on Windows.

mognuts 2013-12-13 20:38

I'm getting this error, if it's of any use to anybody:
[CODE]D:\Cuda\CUDALucas>CUDALucas_205Beta_x64_r52.exe -cufftbench 1 8192 5
------- DEVICE 0 -------
name GeForce GTX 570
Compatibility 2.0
clockRate (MHz) 1464
memClockRate (MHz) 1900
totalGlobalMem 1342177280
totalConstMem 65536
l2CacheSize 655360
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
CUDA bench, testing reasonable fft sizes 1K to 8192K, doing 5 passes.
fft size = 1K, ave time = 0.0273 msec, max-ave = 0.00060
fft size = 2K, ave time = 0.0329 msec, max-ave = 0.00684
fft size = 3K, ave time = 0.0716 msec, max-ave = 0.00737
fft size = 4K, ave time = 0.0540 msec, max-ave = 0.00778
fft size = 5K, ave time = 0.0529 msec, max-ave = 0.00765
fft size = 6K, ave time = 0.0525 msec, max-ave = 0.00007
fft size = 7K, ave time = 0.1198 msec, max-ave = 0.00315
fft size = 8K, ave time = 0.0514 msec, max-ave = 0.00005
fft size = 9K, ave time = 0.0540 msec, max-ave = 0.00015
fft size = 10K, ave time = 0.0605 msec, max-ave = 0.00526
fft size = 12K, ave time = 0.0639 msec, max-ave = 0.00018
fft size = 14K, ave time = 0.0599 msec, max-ave = 0.00308
fft size = 15K, ave time = 0.1332 msec, max-ave = 0.00249
fft size = 16K, ave time = 0.0682 msec, max-ave = 0.00304
fft size = 18K, ave time = 0.0624 msec, max-ave = 0.00113
fft size = 20K, ave time = 0.0726 msec, max-ave = 0.00299
fft size = 21K, ave time = 0.0738 msec, max-ave = 0.00237
fft size = 24K, ave time = 0.0891 msec, max-ave = 0.00284
fft size = 25K, ave time = 0.1378 msec, max-ave = 0.00328
fft size = 27K, ave time = 0.1417 msec, max-ave = 0.00018
fft size = 28K, ave time = 0.0928 msec, max-ave = 0.00369
fft size = 30K, ave time = 0.0948 msec, max-ave = 0.00302
fft size = 32K, ave time = 0.0824 msec, max-ave = 0.00008
fft size = 35K, ave time = 0.1550 msec, max-ave = 0.00241
fft size = 36K, ave time = 0.0995 msec, max-ave = 0.00247
fft size = 40K, ave time = 0.1051 msec, max-ave = 0.00247
fft size = 42K, ave time = 0.1085 msec, max-ave = 0.00206
fft size = 45K, ave time = 0.1684 msec, max-ave = 0.00200
fft size = 48K, ave time = 0.1081 msec, max-ave = 0.00262
fft size = 49K, ave time = 0.1212 msec, max-ave = 0.00285
fft size = 50K, ave time = 0.1188 msec, max-ave = 0.00167
fft size = 54K, ave time = 0.1316 msec, max-ave = 0.00317
fft size = 56K, ave time = 0.1183 msec, max-ave = 0.00104
fft size = 60K, ave time = 0.1417 msec, max-ave = 0.00308
fft size = 63K, ave time = 0.1869 msec, max-ave = 0.00165
fft size = 64K, ave time = 0.1429 msec, max-ave = 0.00278
fft size = 70K, ave time = 0.1678 msec, max-ave = 0.00228
fft size = 72K, ave time = 0.1714 msec, max-ave = 0.00192
fft size = 75K, ave time = 0.2364 msec, max-ave = 0.00292
fft size = 80K, ave time = 0.1697 msec, max-ave = 0.00258
fft size = 81K, ave time = 0.1969 msec, max-ave = 0.00242
fft size = 84K, ave time = 0.1873 msec, max-ave = 0.00353
fft size = 90K, ave time = 0.1956 msec, max-ave = 0.00296
fft size = 96K, ave time = 0.1912 msec, max-ave = 0.00261
fft size = 98K, ave time = 0.2060 msec, max-ave = 0.00251
fft size = 100K, ave time = 0.2082 msec, max-ave = 0.00247
fft size = 105K, ave time = 0.2809 msec, max-ave = 0.01314
fft size = 108K, ave time = 0.2220 msec, max-ave = 0.00268
fft size = 112K, ave time = 0.2066 msec, max-ave = 0.00269
fft size = 120K, ave time = 0.2396 msec, max-ave = 0.00223
fft size = 125K, ave time = 0.3224 msec, max-ave = 0.00303
fft size = 126K, ave time = 0.2600 msec, max-ave = 0.00272
fft size = 128K, ave time = 0.2473 msec, max-ave = 0.00267
fft size = 135K, ave time = 0.3416 msec, max-ave = 0.00243
fft size = 140K, ave time = 0.2864 msec, max-ave = 0.00152
fft size = 144K, ave time = 0.2645 msec, max-ave = 0.00264
fft size = 147K, ave time = 0.3659 msec, max-ave = 0.00392
fft size = 150K, ave time = 0.3193 msec, max-ave = 0.00333
fft size = 160K, ave time = 0.2903 msec, max-ave = 0.00386
fft size = 162K, ave time = 0.3330 msec, max-ave = 0.00242
fft size = 168K, ave time = 0.3331 msec, max-ave = 0.00439
fft size = 175K, ave time = 0.4022 msec, max-ave = 0.00344
fft size = 180K, ave time = 0.3385 msec, max-ave = 0.00578
fft size = 189K, ave time = 0.4385 msec, max-ave = 0.00424
fft size = 192K, ave time = 0.3540 msec, max-ave = 0.00371
fft size = 196K, ave time = 0.3763 msec, max-ave = 0.00530
fft size = 200K, ave time = 0.3905 msec, max-ave = 0.00511
fft size = 210K, ave time = 0.4171 msec, max-ave = 0.00389
fft size = 216K, ave time = 0.4135 msec, max-ave = 0.00383
fft size = 224K, ave time = 0.3805 msec, max-ave = 0.00748
fft size = 225K, ave time = 0.4789 msec, max-ave = 0.00789
fft size = 240K, ave time = 0.4466 msec, max-ave = 0.01557
fft size = 243K, ave time = 0.4917 msec, max-ave = 0.00647
fft size = 245K, ave time = 0.5389 msec, max-ave = 0.00815
fft size = 250K, ave time = 0.4767 msec, max-ave = 0.00844
fft size = 252K, ave time = 0.4824 msec, max-ave = 0.00267
fft size = 256K, ave time = 0.4456 msec, max-ave = 0.00454
fft size = 270K, ave time = 0.5332 msec, max-ave = 0.00474
fft size = 280K, ave time = 0.5253 msec, max-ave = 0.00931
fft size = 288K, ave time = 0.4752 msec, max-ave = 0.01467
fft size = 294K, ave time = 0.5797 msec, max-ave = 0.01844
fft size = 300K, ave time = 0.5838 msec, max-ave = 0.01188
fft size = 315K, ave time = 0.6671 msec, max-ave = 0.00862
fft size = 320K, ave time = 0.5398 msec, max-ave = 0.00571
fft size = 324K, ave time = 0.6093 msec, max-ave = 0.00350
fft size = 336K, ave time = 0.6200 msec, max-ave = 0.00447
fft size = 343K, ave time = 0.6894 msec, max-ave = 0.00486
fft size = 350K, ave time = 0.6783 msec, max-ave = 0.00658
fft size = 360K, ave time = 0.6460 msec, max-ave = 0.00605
fft size = 375K, ave time = 0.8148 msec, max-ave = 0.00743
fft size = 378K, ave time = 0.7359 msec, max-ave = 0.00680
fft size = 384K, ave time = 0.6703 msec, max-ave = 0.00187
fft size = 392K, ave time = 0.7014 msec, max-ave = 0.00381
fft size = 400K, ave time = 0.7023 msec, max-ave = 0.00418
fft size = 405K, ave time = 0.8098 msec, max-ave = 0.00206
C:/CUDA/CuLu/src/CUDALucas.cu(1877) : cudaSafeCall() Runtime API error 6: the launch timed out and was terminated.
C:/CUDA/CuLu/src/CUDALucas.cu(1886) : cudaSafeCall() Runtime API error 46: all CUDA-capable devices are busy or unavailable.[/CODE]
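The benchmark output above shows that a larger FFT is sometimes faster than a smaller one (3K at 0.0716 msec versus 8K at 0.0514 msec, for example). One plausible way to prune such a list is to keep only the lengths that beat every larger length. The sketch below does that on the posted log format; it is an illustration of the idea, not necessarily the rule CUDALucas uses for its "reasonable fft sizes", and the function names are assumptions.

```python
import re

# Matches lines such as "fft size = 1K, ave time = 0.0273 msec, ..."
BENCH_RE = re.compile(r"fft size = (\d+)K, ave time = ([0-9.]+) msec")


def parse_bench(text):
    """Return a list of (fft_k, ave_msec) pairs found in the log text."""
    return [(int(k), float(ms)) for k, ms in BENCH_RE.findall(text)]


def useful_sizes(bench):
    """Keep sizes that are faster than every larger size in the list."""
    kept = []
    best_larger = float("inf")
    for size, ms in sorted(bench, reverse=True):  # largest size first
        if ms < best_larger:
            kept.append((size, ms))
            best_larger = ms
    return kept[::-1]  # back to ascending size order
```

A partial log, such as one cut short by the launch timeout, still parses, because only the benchmark lines are matched.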

Prime95 2013-12-15 03:40

I just tried r52 - no luck. I get the error "device_number >= device_count". I'm presently running CUDALucas 2.00 without problems on this Windows 7 box with a GTX 460.

mognuts 2013-12-15 10:32

[QUOTE=Prime95;362071]I just tried r52 - no luck. I get the error "device_number >= device_count". I'm presently running CUDALucas 2.00 without problems on this Windows 7 box with a GTX 460.[/QUOTE]I get that error if I use version 2xx.xx drivers with r52. Upgrading the drivers solved this for me.

mognuts 2013-12-15 11:45

I'm getting bad selftests on a GTX460 with r52. I have never had this before with earlier versions.

[CODE]
C:\Users\John\Desktop\cudalucas>CUDALucas_205Beta_x64_r52.exe -r
------- DEVICE 0 -------
name GeForce GTX 460
Compatibility 2.1
clockRate (MHz) 1430
memClockRate (MHz) 1800
totalGlobalMem 1073741824
totalConstMem 65536
l2CacheSize 524288
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 7
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M86243 fft length = 4K
Iteration 10000 / 86243, 0x23992ccd735a03d9, 4K, CUDALucas v2.05 Beta err = 0.26563 (0:01 real, 0.0651 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M132049 fft length = 8K
Iteration 10000 / 132049, 0x4c52a92b54635f9e, 8K, CUDALucas v2.05 Beta err = 0.00046 (0:01 real, 0.0709 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M216091 fft length = 16K
Iteration 10000 / 216091, 0x30247786758b8792, 16K, CUDALucas v2.05 Beta err = 0.00001 (0:00 real, 0.0884 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M756839 fft length = 40K
Iteration 10000 / 756839, 0x5d2cbe7cb24a109a, 40K, CUDALucas v2.05 Beta err = 0.03320 (0:02 real, 0.1868 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M859433 fft length = 48K
Iteration 10000 / 859433, 0x3c4ad525c2d0aed0, 48K, CUDALucas v2.05 Beta err = 0.01074 (0:02 real, 0.1988 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M1257787 fft length = 64K
Iteration 10000 / 1257787, 0x3f45bf9bea7213ea, 64K, CUDALucas v2.05 Beta err = 0.10938 (0:03 real, 0.2440 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M1398269 fft length = 128K
Iteration 10000 / 1398269, 0xa4a6d2f0e34629db, 128K, CUDALucas v2.05 Beta err = 0.00000 (0:04 real, 0.4409 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M2976221 fft length = 256K
Iteration 10000 / 2976221, 0x2a7111b7f70fea2f, 256K, CUDALucas v2.05 Beta err = 0.00001 (0:09 real, 0.8995 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M3021377 fft length = 256K
Iteration 10000 / 3021377, 0x6387a70a85d46baf, 256K, CUDALucas v2.05 Beta err = 0.00001 (0:09 real, 0.8994 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M6972593 fft length = 512K
Iteration 10000 / 6972593, 0x88f1d2640adb89e1, 512K, CUDALucas v2.05 Beta err = 0.00011 (0:18 real, 1.7766 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M13466917 fft length = 1024K
Iteration 10000 / 13466917, 0x9fdc1f4092b15d69, 1024K, CUDALucas v2.05 Beta err = 0.00009 (0:37 real, 3.6937 ms/iter)
This residue is correct.
The fft length 2048K is too large for exponent 20996011, decreasing to 1024K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M20996011 fft length = 1024K
Iteration 10000 / 20996011, 0x2a354d3a0f96e64e, 1024K, CUDALucas v2.05 Beta err = 0.50000 (0:37 real, 3.6876 ms/iter)
[COLOR=red]Expected residue [5fc58920a821da11] does not match actual residue [2a354d3a0f96e64e]
[/COLOR]The fft length 2048K is too large for exponent 24036583, decreasing to 1024K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M24036583 fft length = 1024K
Iteration 10000 / 24036583, 0x47fba1785d32a924, 1024K, CUDALucas v2.05 Beta err = 1.00000 (0:51 real, 5.1785 ms/iter)
[COLOR=red]Expected residue [cbdef38a0bdc4f00] does not match actual residue [47fba1785d32a924][/COLOR]
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M25964951 fft length = 2048K
Iteration 10000 / 25964951, 0x62eb3ff0a5f6237c, 2048K, CUDALucas v2.05 Beta err = 0.00008 (1:14 real, 7.4363 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M30402457 fft length = 2048K
Iteration 10000 / 30402457, 0x0b8600ef47e69d27, 2048K, CUDALucas v2.05 Beta err = 0.00131 (1:15 real, 7.4195 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M32582657 fft length = 2048K
Iteration 10000 / 32582657, 0x02751b7fcec76bb1, 2048K, CUDALucas v2.05 Beta err = 0.00537 (1:14 real, 7.4358 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M37156667 fft length = 2048K
Iteration 10000 / 37156667, 0x67ad7646a1fad514, 2048K, CUDALucas v2.05 Beta err = 0.11719 (1:14 real, 7.4356 ms/iter)
This residue is correct.
The fft length 4096K is too large for exponent 42643801, decreasing to 2048K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M42643801 fft length = 2048K
Iteration 10000 / 42643801, 0x93ec1e0141513b57, 2048K, CUDALucas v2.05 Beta err = 1.00000 (1:15 real, 7.4357 ms/iter)
[COLOR=red]Expected residue [8f90d78d5007bba7] does not match actual residue [93ec1e0141513b57]
[/COLOR]The fft length 4096K is too large for exponent 43112609, decreasing to 2048K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M43112609 fft length = 2048K
Iteration 10000 / 43112609, 0x93f526f2d01c1686, 2048K, CUDALucas v2.05 Beta err = 1.00000 (1:14 real, 7.4352 ms/iter)
[COLOR=red]Expected residue [e86891ebf6cd70c4] does not match actual residue [93f526f2d01c1686]
[/COLOR]Using threads: norm1 256, mult 128, norm2 128.
Starting self test M57885161 fft length = 4096K
Iteration 10000 / 57885161, 0x76c27556683cd84d, 4096K, CUDALucas v2.05 Beta err = 0.00076 (2:37 real, 15.7022 ms/iter)
This residue is correct.
[COLOR=red]Error: There were 4 bad selftests!
[/COLOR]C:\Users\John\Desktop\cudalucas>pause
Press any key to continue . . .
[/CODE]
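In the log above, all four mismatches report err of 0.5 or more, and each follows a "too large for exponent ..., decreasing to" message. A short script can pull that summary out of a long self-test log. It is a sketch written against the output format shown here, with invented names, and it has not been checked against other CUDALucas versions.

```python
import re

START_RE = re.compile(r"Starting self test M(\d+) fft length = (\d+)K")
ERR_RE = re.compile(r"err = ([0-9.]+)")


def summarize_selftest(log):
    """Return one dict per self test: exponent, fft_k, err, ok."""
    results = []
    current = None
    for line in log.splitlines():
        m = START_RE.search(line)
        if m:
            current = {"exponent": int(m.group(1)),
                       "fft_k": int(m.group(2)),
                       "err": None, "ok": None}
            results.append(current)
            continue
        if current is None:
            continue  # device info printed before the first test
        e = ERR_RE.search(line)
        if e:
            current["err"] = float(e.group(1))
        if "This residue is correct" in line:
            current["ok"] = True
        elif "does not match actual residue" in line:
            current["ok"] = False
    return results
```

Filtering the result for `ok is False` lists the failing exponents together with their FFT length and reported err.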

flashjh 2013-12-15 12:03

I can't speak for the bad self test yet, but the other problems are probably from the driver version, as stated above. I build with CUDA 5.5 now. If you need a different version let me know and I'll try to build one. Otherwise, updating to the newest drivers should fix the problem.

The bad self test may have something to do with FFT selection. We'll look at it.
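One way to see why FFT selection is a likely suspect is to work out how many bits of the exponent each FFT word has to carry, using the exponents and lengths from the self-test log. The figures below are taken from that log; the helper name `bits_per_word` is just for illustration, and the calculation is a rough guide only, since the tolerable load also depends on the FFT length.

```python
def bits_per_word(exponent, fft_k):
    """Average bits carried per FFT word for a length of fft_k * 1024."""
    return exponent / (fft_k * 1024.0)


# (exponent, FFT length in K, passed?) copied from the self-test log.
SELFTESTS = [
    (13466917, 1024, True),
    (20996011, 1024, False),
    (24036583, 1024, False),
    (25964951, 2048, True),
    (37156667, 2048, True),
    (42643801, 2048, False),
    (43112609, 2048, False),
]

for exponent, fft_k, passed in SELFTESTS:
    print("M%-9d %5dK  %6.2f bits/word  %s"
          % (exponent, fft_k, bits_per_word(exponent, fft_k),
             "ok" if passed else "MISMATCH"))
```

At 1024K and 2048K, the four failing tests all come out above 20 bits per word, while the passing tests at the same lengths stay below 18, which is consistent with the length having been reduced too far for those exponents.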

Prime95 2013-12-15 16:47

[QUOTE=mognuts;362083]I get that error if I use version 2xx.xx drivers with r52. Upgrading the drivers solved this for me.[/QUOTE]

I'm using driver 311.06. I'll try a newer one.

Prime95 2013-12-15 18:00

[QUOTE=mognuts;362084]I'm getting bad selftests on a GTX460 with r52. I have never had this before with earlier versions.[/QUOTE]

FWIW, my GTX460 passes the selftest.

I do have one minor bug. I ran "-cufftbench 2000 4100 1". It ran all the benches successfully, but the file to mail to James contained only one line for FFT length 2048K.

mognuts 2013-12-15 18:53

[QUOTE=Prime95;362117]FWIW, my GTX460 passes the selftest.

I do have one minor bug. I ran "-cufftbench 2000 4100 1". It ran all the benches successfully, but the file to mail to James contained only one line for FFT length 2048K.[/QUOTE] -cufftbench is broken for me with r52. It crashes but doesn't bring down the driver. It makes no difference whether I'm benchmarking a range of FFTs or threads for a given FFT. r50 was fine.

flashjh 2013-12-15 19:12

A lot of code was rewritten for r52, so it will need debugging. Keep posting errors and bugs, thanks :smile:

owftheevil 2013-12-16 14:48

[QUOTE=mognuts;362084]I'm getting bad selftests on a GTX460 with r52. I have never had this before with earlier versions.

[CODE]
C:\Users\John\Desktop\cudalucas>CUDALucas_205Beta_x64_r52.exe -r
------- DEVICE 0 -------
name GeForce GTX 460
Compatibility 2.1
clockRate (MHz) 1430
memClockRate (MHz) 1800
totalGlobalMem 1073741824
totalConstMem 65536
l2CacheSize 524288
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 7
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M86243 fft length = 4K
Iteration 10000 / 86243, 0x23992ccd735a03d9, 4K, CUDALucas v2.05 Beta err = 0.26563 (0:01 real, 0.0651 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M132049 fft length = 8K
Iteration 10000 / 132049, 0x4c52a92b54635f9e, 8K, CUDALucas v2.05 Beta err = 0.00046 (0:01 real, 0.0709 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M216091 fft length = 16K
Iteration 10000 / 216091, 0x30247786758b8792, 16K, CUDALucas v2.05 Beta err = 0.00001 (0:00 real, 0.0884 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M756839 fft length = 40K
Iteration 10000 / 756839, 0x5d2cbe7cb24a109a, 40K, CUDALucas v2.05 Beta err = 0.03320 (0:02 real, 0.1868 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M859433 fft length = 48K
Iteration 10000 / 859433, 0x3c4ad525c2d0aed0, 48K, CUDALucas v2.05 Beta err = 0.01074 (0:02 real, 0.1988 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M1257787 fft length = 64K
Iteration 10000 / 1257787, 0x3f45bf9bea7213ea, 64K, CUDALucas v2.05 Beta err = 0.10938 (0:03 real, 0.2440 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M1398269 fft length = 128K
Iteration 10000 / 1398269, 0xa4a6d2f0e34629db, 128K, CUDALucas v2.05 Beta err = 0.00000 (0:04 real, 0.4409 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M2976221 fft length = 256K
Iteration 10000 / 2976221, 0x2a7111b7f70fea2f, 256K, CUDALucas v2.05 Beta err = 0.00001 (0:09 real, 0.8995 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M3021377 fft length = 256K
Iteration 10000 / 3021377, 0x6387a70a85d46baf, 256K, CUDALucas v2.05 Beta err = 0.00001 (0:09 real, 0.8994 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M6972593 fft length = 512K
Iteration 10000 / 6972593, 0x88f1d2640adb89e1, 512K, CUDALucas v2.05 Beta err = 0.00011 (0:18 real, 1.7766 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M13466917 fft length = 1024K
Iteration 10000 / 13466917, 0x9fdc1f4092b15d69, 1024K, CUDALucas v2.05 Beta err = 0.00009 (0:37 real, 3.6937 ms/iter)
This residue is correct.
The fft length 2048K is too large for exponent 20996011, decreasing to 1024K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M20996011 fft length = 1024K
Iteration 10000 / 20996011, 0x2a354d3a0f96e64e, 1024K, CUDALucas v2.05 Beta err = 0.50000 (0:37 real, 3.6876 ms/iter)
[COLOR=red]Expected residue [5fc58920a821da11] does not match actual residue [2a354d3a0f96e64e]
[/COLOR]The fft length 2048K is too large for exponent 24036583, decreasing to 1024K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M24036583 fft length = 1024K
Iteration 10000 / 24036583, 0x47fba1785d32a924, 1024K, CUDALucas v2.05 Beta err = 1.00000 (0:51 real, 5.1785 ms/iter)
[COLOR=red]Expected residue [cbdef38a0bdc4f00] does not match actual residue [47fba1785d32a924][/COLOR]
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M25964951 fft length = 2048K
Iteration 10000 / 25964951, 0x62eb3ff0a5f6237c, 2048K, CUDALucas v2.05 Beta err = 0.00008 (1:14 real, 7.4363 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M30402457 fft length = 2048K
Iteration 10000 / 30402457, 0x0b8600ef47e69d27, 2048K, CUDALucas v2.05 Beta err = 0.00131 (1:15 real, 7.4195 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M32582657 fft length = 2048K
Iteration 10000 / 32582657, 0x02751b7fcec76bb1, 2048K, CUDALucas v2.05 Beta err = 0.00537 (1:14 real, 7.4358 ms/iter)
This residue is correct.
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M37156667 fft length = 2048K
Iteration 10000 / 37156667, 0x67ad7646a1fad514, 2048K, CUDALucas v2.05 Beta err = 0.11719 (1:14 real, 7.4356 ms/iter)
This residue is correct.
The fft length 4096K is too large for exponent 42643801, decreasing to 2048K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M42643801 fft length = 2048K
Iteration 10000 / 42643801, 0x93ec1e0141513b57, 2048K, CUDALucas v2.05 Beta err = 1.00000 (1:15 real, 7.4357 ms/iter)
[COLOR=red]Expected residue [8f90d78d5007bba7] does not match actual residue [93ec1e0141513b57]
[/COLOR]The fft length 4096K is too large for exponent 43112609, decreasing to 2048K
Using threads: norm1 256, mult 128, norm2 128.
Starting self test M43112609 fft length = 2048K
Iteration 10000 / 43112609, 0x93f526f2d01c1686, 2048K, CUDALucas v2.05 Beta err = 1.00000 (1:14 real, 7.4352 ms/iter)
[COLOR=red]Expected residue [e86891ebf6cd70c4] does not match actual residue [93f526f2d01c1686]
[/COLOR]Using threads: norm1 256, mult 128, norm2 128.
Starting self test M57885161 fft length = 4096K
Iteration 10000 / 57885161, 0x76c27556683cd84d, 4096K, CUDALucas v2.05 Beta err = 0.00076 (2:37 real, 15.7022 ms/iter)
This residue is correct.
[COLOR=red]Error: There were 4 bad selftests!
[/COLOR]C:\Users\John\Desktop\cudalucas>pause
Press any key to continue . . .
[/CODE][/QUOTE]

This should be fixed with r53. I forgot to reinitialize a pointer after freeing the memory.

owftheevil 2013-12-16 15:00

[QUOTE=Prime95;362117]FWIW, my GTX460 passes the selftest.

I do have one minor bug. I ran "-cufftbench 2000 4100 1". It ran all the benches successfully, but the file to mail to james contained only one line for FFT length 2048K.[/QUOTE]

Found the problem. I was making the silly assumption that limits would always be powers of 2. I should have the time to fix it tonight.
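For anyone curious what that kind of assumption looks like, here is a minimal C sketch. It is not the actual CUDALucas code; the helper names `is_power_of_two` and `floor_power_of_two` are hypothetical and only illustrate how bounds such as 2000 and 4100 differ from power-of-two bounds.

```c
#include <stdbool.h>

/* Hypothetical helper: true when n is an exact power of two (1, 2, 4, ...). */
bool is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: largest power of two that is <= n (0 when n == 0). */
unsigned int floor_power_of_two(unsigned int n)
{
    unsigned int p = 1;

    if (n == 0)
        return 0;
    while (p <= n / 2)   /* comparing against n / 2 avoids overflow in p * 2 */
        p *= 2;
    return p;
}
```

With bounds 2000 and 4100, neither value passes `is_power_of_two`, so any code that silently relies on that property needs an explicit check or a rounding step like the one above.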

owftheevil 2013-12-16 15:02

[QUOTE=mognuts;362127]-cufftbench is broken for me with r52. It crashes but doesn't bring down the driver. Makes no difference if I'm benchmarking a range of FFTs, or threads for a given FFT. r50 was fine.[/QUOTE]

Crashes how?

mognuts 2013-12-16 19:25

[QUOTE=owftheevil;362199]Crashes how?[/QUOTE]

This is the console output:[CODE]
C:\Users\John\Desktop\cudalucas>CUDALucas_205Beta_x64_r52 -cufftbench 2048 1 5
------- DEVICE 0 -------
name GeForce GTX 460
Compatibility 2.1
clockRate (MHz) 1430
memClockRate (MHz) 1800
totalGlobalMem 1073741824
totalConstMem 65536
l2CacheSize 524288
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 7
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
Thread bench, testing various thread sizes for ffts 1K to 2048K, doing 5 passes.
fft size = 1K, ave time = 6.5100 msec, Norm1 threads 32, Norm2 threads 32
fft size = 1K, ave time = 6.5094 msec, Norm1 threads 32, Norm2 threads 64
fft size = 1K, ave time = 6.5098 msec, Norm1 threads 32, Norm2 threads 128
fft size = 1K, ave time = 6.5084 msec, Norm1 threads 32, Norm2 threads 256
fft size = 1K, ave time = 6.5089 msec, Norm1 threads 32, Norm2 threads 512
fft size = 1K, ave time = 6.5085 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 1K, ave time = 6.5084 msec, Norm1 threads 64, Norm2 threads 32
fft size = 1K, ave time = 6.5088 msec, Norm1 threads 64, Norm2 threads 64
fft size = 1K, ave time = 6.5085 msec, Norm1 threads 64, Norm2 threads 128
fft size = 1K, ave time = 6.5087 msec, Norm1 threads 64, Norm2 threads 256
fft size = 1K, ave time = 6.5080 msec, Norm1 threads 64, Norm2 threads 512
fft size = 1K, ave time = 6.5087 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 1K, ave time = 6.5084 msec, Norm1 threads 128, Norm2 threads 32
fft size = 1K, ave time = 6.5080 msec, Norm1 threads 128, Norm2 threads 64
fft size = 1K, ave time = 6.5082 msec, Norm1 threads 128, Norm2 threads 128
fft size = 1K, ave time = 6.5079 msec, Norm1 threads 128, Norm2 threads 256
fft size = 1K, ave time = 6.5082 msec, Norm1 threads 128, Norm2 threads 512
fft size = 1K, ave time = 6.5072 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 1K, ave time = 6.5090 msec, Norm1 threads 256, Norm2 threads 32
fft size = 1K, ave time = 6.5091 msec, Norm1 threads 256, Norm2 threads 64
fft size = 1K, ave time = 6.5085 msec, Norm1 threads 256, Norm2 threads 128
fft size = 1K, ave time = 6.5088 msec, Norm1 threads 256, Norm2 threads 256
fft size = 1K, ave time = 6.5078 msec, Norm1 threads 256, Norm2 threads 512
fft size = 1K, ave time = 6.5085 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 1K, ave time = 6.5099 msec, Norm1 threads 512, Norm2 threads 32
fft size = 1K, ave time = 6.5098 msec, Norm1 threads 512, Norm2 threads 64
fft size = 1K, ave time = 6.5093 msec, Norm1 threads 512, Norm2 threads 128
fft size = 1K, ave time = 6.5098 msec, Norm1 threads 512, Norm2 threads 256
fft size = 1K, ave time = 6.5096 msec, Norm1 threads 512, Norm2 threads 512
fft size = 1K, ave time = 6.5099 msec, Norm1 threads 512, Norm2 threads 1024
fft size = 1K, ave time = 5.9309 msec, Norm1 threads 128, Mult threads 32, Norm2 threads 1024
fft size = 1K, ave time = 5.9307 msec, Norm1 threads 128, Mult threads 64, Norm2 threads 1024
fft size = 1K, ave time = 5.9311 msec, Norm1 threads 128, Mult threads 128, Norm2 threads 1024
fft size = 1K, ave time = 5.9318 msec, Norm1 threads 128, Mult threads 256, Norm2 threads 1024
Best time for fft = 1K, time: 5.9307, t0 = 128, t1 = 64, t2 = 1024
[/CODE] Followed by a dialogue box containing the following text:

[B]CUDALucas_205Beta_x64_r52.exe has stopped working[/B]
A problem caused the program to stop working correctly. Windows will close the program and notify you if a solution is available.

This happens regardless of the parameters used for -cufftbench.

mognuts 2013-12-16 21:24

On a more positive note, r52 correctly found 3 known primes.:showoff:

M( 11213 )P, n = 1K, CUDALucas v2.05 Beta
M( 1257787 )P, n = 64K, CUDALucas v2.05 Beta
M( 2976221 )P, n = 256K, CUDALucas v2.05 Beta
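Since the result-line format matters for the manual-results parser, here is a small C sketch of reading the leading `M( <exponent> )P` or `M( <exponent> )C` part of such a line. This is only an illustration, not the parser used on mersenne.ca or mersenne.org; the struct and function names are hypothetical, and lines with a `UID:` prefix are deliberately not handled.

```c
#include <stdio.h>

/* Hypothetical record for the leading fields of a result line. */
struct result_head {
    unsigned long exponent; /* the p in M( p ) */
    char status;            /* 'P' = prime, 'C' = composite */
};

/* Parse the "M( <exponent> )<P|C>" prefix of a result line.
 * Returns 1 on success, 0 if the line does not start with that pattern. */
int parse_result_head(const char *line, struct result_head *out)
{
    unsigned long exponent;
    char status;

    if (sscanf(line, "M( %lu )%c", &exponent, &status) != 2)
        return 0;
    if (status != 'P' && status != 'C')
        return 0;
    out->exponent = exponent;
    out->status = status;
    return 1;
}
```

The remaining fields (residue, offset, n, version, AID) are comma-separated and could be split in a second step.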

owftheevil 2013-12-17 15:04

R53 is up, fixing the sparse <gpu> fft.txt file issue, the uninitialized pointer causing mismatched residues in the self-test, an incorrect fft length in the threads bench, and a bad boundary-case condition in the fft initialization.

@mognuts: I could not get the behaviour your 460 showed to happen, so I don't know if the problem is fixed or not.

Windows version is not up yet.

ET_ 2013-12-17 15:29

[QUOTE=owftheevil;362287]R53 is up, fixing the sparse <gpu> fft.txt file issue, the uninitialized pointer causing mismatched residues in the self-test, an incorrect fft length in the threads bench, and a bad boundary-case condition in the fft initialization.

@mognuts: I could not get the behaviour your 460 showed to happen, so I don't know if the problem is fixed or not.

Windows version is not up yet.[/QUOTE]

You are referring to CUDALucas, not CUDAPm1 issues, aren't you?

Luigi

flashjh 2013-12-17 15:29

r53 is on [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]SourceForge[/URL].

[B].ini file is updated, please re-download.[/B]

Formatting output can be customized now.

Please run the tests in this [URL="http://www.mersenneforum.org/showthread.php?p=360992#post360992"][COLOR=#0066cc]post[/COLOR][/URL] and continue to post any issues or bugs.

Thanks!

[QUOTE=ET_;362291]You are referring to CUDALucas, not CUDAPm1 issues, aren't you?

Luigi[/QUOTE]
Yes

flashjh 2013-12-17 16:42

Posted Win32 .exe files on SourceForge - first time I've built Win32 with 2.05 Beta, please test accordingly.

flashjh 2013-12-18 02:10

[URL="http://www.mersenneforum.org/26926727"]Successful[/URL] test of Win32 version of r53

I am now able to build for CUDA 4.0 and up (64-bit only). If anyone needs a version, let me know.

flashjh 2013-12-19 01:42

[URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]SourceForge[/URL] updated with latest commit, currently r55. Minor formatting changes and updated makefile.win file to allow for Win32 or x64 compiles with CUDA 4.0 up to 5.5.

If anyone wants help compiling with make or in MSVS, let me know.

Had another successful DC with the Win32 version. With the help of petrw1 I have 23/24 good DCs. The bad one was probably caused by all my stopping/starting while compiling, etc. Nonetheless, that's why we DC.

flashjh 2013-12-21 03:39

1 Attachment(s)
[QUOTE=mognuts;362210]This is the console output:[CODE]<snip>
[/CODE] Followed by a dialogue box containing the following text:

[B]CUDALucas_205Beta_x64_r52.exe has stopped working[/B]
A problem caused the program to stop working correctly. Windows will close the program and notify you if a solution is available.

This happens regardless of the parameters used for -cufftbench.[/QUOTE]
I am running tests to reproduce the NVIDIA Windows Kernel Mode Driver failure, testing all NVIDIA WHQL driver versions since 296.10. Those results will come later...

@mognuts, I was able to (accidentally) reproduce the results you experienced.

@owftheevil

-Anytime I run -cufftbench fft# [B]smallerfft# [/B]1 it causes CUDALucas to crash like mognuts experienced
-When I run -cufftbench fft# fft# any# it skips [U]some[/U] of the fft tests completely

See the attached file for screenshot and bench.txt output for the skipped tests. I included the .exe file I'm using for testing. I'm currently on 314.22, but it doesn't seem to matter what driver I use.

owftheevil 2013-12-21 21:53

I'll take a look.

New commit r56, fixes a regression concerning command line input. Try to specify a nonstandard fft like 3150k and you'll see what I'm talking about.

flashjh 2013-12-22 19:44

Windows r56 executables posted to [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]SourceForge[/URL]

mognuts 2013-12-26 18:37

[QUOTE=flashjh;362676]Windows r56 executables posted to [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]SourceForge[/URL][/QUOTE]

r56 just successfully completed double check of [URL="http://www.mersenne.org/report_exponent/?exp_lo=31010747&exp_hi=31010747&B1=Get+status"]31010747[/URL].

mognuts 2013-12-26 22:40

[QUOTE=mognuts;362958]r56 just successfully completed double check of [URL="http://www.mersenne.org/report_exponent/?exp_lo=31010747&exp_hi=31010747&B1=Get+status"]31010747[/URL].[/QUOTE]
Through the run I had a couple of these, but it didn't affect the result, or cause the drivers to stop working.
[CODE]| Dec 26 22:33:05 | M 54297883 2820000 0xa5d98db4daef2036 | 3136K 0.06641 5.3687 53.68s | 3:04:59:19 5.19% |
| Dec 26 22:33:59 | M 54297883 2830000 0x4a7f94d8efb62886 | 3136K 0.06641 5.3695 53.69s | 3:04:58:22 5.21% |
| Date Time | Test Num Iter Residue | FFT Error ms/It Time | ETA Done |
| Dec 26 22:34:53 | M 54297883 2840000 0xa2c3927d7aefb869 | 3136K 0.06641 5.3689 53.68s | 3:04:57:25 5.23% |
| Dec 26 22:35:47 | M 54297883 2850000 0xf5b7f62de86145e4 | 3136K 0.06641 5.3705 53.70s | 3:04:56:28 5.24% |
C:/CUDA/CuLu/src/CUDALucas.cu(1509) : cudaSafeCall() Runtime API error 6: the launch timed out and was terminated.
Resetting device and restarting from last checkpoint.
Using threads: norm1 256, mult 256, norm2 1024.
C:/CUDA/CuLu/src/CUDALucas.cu(891) : cudaSafeCall() Runtime API error 46: all CUDA-capable devices are busy or unavailable.[/CODE]

frmky 2014-01-09 01:50

In 64-bit Linux, r56 segfaults when I run with -r, but has correctly completed a number of double-checks. r59 simply exits without starting a test from worktodo.txt.

owftheevil 2014-01-09 14:35

Does r59 work ok with -r?

Edit: For the exiting without starting a test in r59, on line 3346 in CUDALucas.cu, take the negation off of get_next_assignment. That should do it, although I can't check it myself right now.

frmky 2014-01-09 23:41

No, r59 crashes as well.

[CODE]Program received signal SIGSEGV, Segmentation fault.
0x000000000040420f in init_lucas (x_packed=0x6ddba0, q=86243,
n=0x7fffffffe044, j=0x7fffffffe040, offset=0x7fffffffe03c, total_time=0x0,
time_adj=0x0, iter_adj=0x0) at CUDALucas.cu:1317
1317 *time_adj = *total_time;
(gdb) bt
#0 0x000000000040420f in init_lucas (x_packed=0x6ddba0, q=86243,
n=0x7fffffffe044, j=0x7fffffffe040, offset=0x7fffffffe03c, total_time=0x0,
time_adj=0x0, iter_adj=0x0) at CUDALucas.cu:1317
#1 0x000000000040bb65 in check_residue (ls=0) at CUDALucas.cu:2624
#2 0x000000000040df57 in main (argc=2, argv=0x7fffffffe1f8)
at CUDALucas.cu:3334
(gdb) print total_time
$1 = (unsigned long long *) 0x0
(gdb) print time_adj
$2 = (unsigned long long *) 0x0
[/CODE]
So time_adj is a null pointer.
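The backtrace shows `init_lucas` being called with `total_time` and `time_adj` both 0x0 and then dereferencing them at line 1317. One common way to guard a statement like that is sketched below in C. I don't know what the real fix will look like, so treat this as an assumption; `copy_time_adj` is a hypothetical stand-in, not a function from CUDALucas.cu.

```c
#include <stddef.h>

/* Hypothetical stand-in for the failing statement `*time_adj = *total_time;`.
 * Returns 1 if the copy happened, 0 if either pointer was NULL and the
 * copy was skipped instead of crashing. */
int copy_time_adj(unsigned long long *time_adj,
                  const unsigned long long *total_time)
{
    if (time_adj == NULL || total_time == NULL)
        return 0;   /* caller passed no timing storage */
    *time_adj = *total_time;
    return 1;
}
```

The alternative is to have the caller always pass valid pointers, which keeps the callee simpler but has to be enforced at every call site.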

owftheevil 2014-01-10 14:51

Thanks frmky, your information was very useful. r60 should fix these bugs.

flashjh 2014-01-12 05:02

Getting close to release!
 
r60 compiled and tested (still needs more). CUDA 4.2 up to 5.5 all working, release and debug. All posted to [URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]SourceForge[/URL]

This version (and r57 and up) includes new rcb code from Prime95 that gives about a 1% speed improvement! Exciting for CUDALucas, but it does need testing, please.

In my testing CUDA 5.5 and Win32 are slightly faster than earlier versions or x64 (but you may need a batch file to keep it going, see below)

What works:
-cufftbench
-r
-normal testing

What Doesn't:
-threadbench

Didn't test:
-memtest

[U][B]For those experiencing stops: This is an nVidia driver issue. Here is some info and I included some workarounds[/B][/U]

Drivers <=306.97 work perfectly fine with x86/x64 CUDA 4.2 and CUDA 5.0 builds and produce no restarts (at least none in my testing over several days).

Drivers >=310.70 have resets regardless of platform or CUDA version, including CUDA 5.5 with drivers >=320.18.

There are two workarounds for anyone experiencing a similar problem described by [URL="http://www.mersenneforum.org/showthread.php?p=362968#post362968"]mognuts[/URL]:

1) The best way to fix the error is to downgrade your driver to one of the versions <=306.97 as mentioned above.

CUDA Driver Versions:

[CODE]CUDA 5.5:           CUDA 5.0:           CUDA 4.2:
331.82 19-Nov-13    314.22 25-Mar-13    301.42 22-May-12
331.65 07-Nov-13    314.07 18-Feb-13    296.10 13-Mar-12
331.58 21-Oct-13    310.90 05-Jan-13    295.73 21-Feb-12
327.23 19-Sep-13    310.70 17-Dec-12    285.62 24-Oct-11
320.49 01-Jul-13    [B]306.97 10-Oct-12[/B]    280.26 09-Aug-11
320.18 23-May-13    306.23 13-Sep-12    275.33 01-Jun-11
[/CODE] I did not actually test below 296.10, so I don't know at which driver version support drops below CUDA 4.2, but I figure most people will be on at least 296.10 by now.

Windows CUDALucas from CUDA 4.0 up to 5.5, 32 or 64 bit are on SourceForge

Request: I need to know who else is having the *stop* issue and what driver and video card you have. I'm working with NVidia to try and get the drivers fixed, so it will be helpful to know what other cards have this issue.

2) The other 'fix' for this issue is to use a batch file similar to this:
[CODE]@echo off
Set count=0
Set program=CUDALucas2.05Beta-CUDA5.0-Win32-r60
:loop
TITLE %program% Current Reset Count = %count%
Set /A count+=1
rem echo %count% >> log.txt
rem echo %count%
%program%.exe
GOTO loop[/CODE] This will restart CUDALucas each time it stops and allow you see how many resets have occurred, if you care.

I have not been able to thoroughly test speeds yet. I know that CUDA 5.5 is usually faster, but at the cost of having the driver lock up. Combined with the batch file, there really is no issue unless the restarts bother you; I've run many good DCs with the batch file.

With <=306.97, you don't need the batch file and there are no restarts, but it could potentially be *slightly* slower. I would love to see actual test data from everyone. Also, if anyone does experience the *stop* while on <=306.97, please let me know ASAP so I can update this info and inform nVidia.

As for reliability, I have completed many successful tests with 2.05 Beta, CUDA 4.0 up to 5.5, 32 and 64 bit, many of them with a lot of stops and restarts and forced FFT size changes for testing the code.

:smile:

flashjh 2014-01-14 22:11

[URL="https://sourceforge.net/projects/cudalucas/files/2.05%20Beta/"]r62 posted[/URL] to fix the -threadbench problem

Usage for testing:

[B]CUDALucas -cufftbench lb ub p (e.g. CUDALucas -cufftbench 1 8192 6)[/B]

It gives a warning if either lb or ub is not a power of two. It works when they are not, but non-optimal lengths near the edges of the range are likely to be included in <gpu> fft.txt.

[B]CUDALucas -threadbench lb ub p m (e.g. CUDALucas -threadbench 1 8192 6 1)
[/B]
The new parameter m (usually 0 or 1) controls a little bit of the behavior of the test. m = 0 causes all reasonable fft lengths (n a multiple of 1K, largest prime factor of n is 7) between lb * 1K and ub * 1K to be tested; m = 1 tests only the lengths in <gpu> fft.txt and the table in init_ffts.
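To make the "reasonable fft lengths" condition concrete, here is a C sketch under two assumptions of mine: that "largest prime factor of n is 7" means n has no prime factor larger than 7, and that the lb/ub range is inclusive. The function names are hypothetical and this is not the code from the threadbench itself.

```c
#include <stdbool.h>

/* True when n has no prime factor larger than 7,
 * i.e. n = 2^a * 3^b * 5^c * 7^d. */
bool is_7_smooth(unsigned int n)
{
    static const unsigned int primes[] = { 2, 3, 5, 7 };
    unsigned int i;

    if (n == 0)
        return false;
    for (i = 0; i < 4; i++)
        while (n % primes[i] == 0)
            n /= primes[i];
    return n == 1;
}

/* Hypothetical candidate test: length_k is the fft length in units of 1K,
 * so n = length_k * 1024 is a multiple of 1K by construction. Because
 * 1024 = 2^10, n is 7-smooth exactly when length_k is. */
bool is_candidate_length(unsigned int length_k, unsigned int lb, unsigned int ub)
{
    return length_k >= lb && length_k <= ub && is_7_smooth(length_k);
}
```

Under this reading, 3136K (2^6 * 7^2) and 3150K (2 * 3^2 * 5^2 * 7) both qualify, while a length such as 11K does not.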

When testing the new versions run:

CUDALucas -r
CUDALucas -cufftbench 1 8192 6
CUDALucas -threadbench 1 8192 6 1

You can also run a [URL="http://www.mersenneforum.org/showthread.php?p=359754#post359754"]memtest[/URL]:

CUDALucas -memtest k n
where k * 25 MB of memory are tested, n * 10000 iterations are done for each of 5 data types at each of the k positions
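As a quick way to size a memtest run, the arithmetic in that description can be written out in C. This just restates the sentence above (k chunks of 25 MB, n * 10000 iterations for each of 5 data types at each of the k positions); the function names are hypothetical and nothing here is taken from the CUDALucas source.

```c
/* Memory covered by `-memtest k n`, in MB: k chunks of 25 MB each. */
unsigned long memtest_megabytes(unsigned long k)
{
    return k * 25UL;
}

/* Total iterations: n * 10000 per data type, 5 data types, k positions. */
unsigned long long memtest_total_iterations(unsigned long k, unsigned long n)
{
    return (unsigned long long)k * 5ULL * n * 10000ULL;
}
```

For example, k = 10 and n = 2 would cover 250 MB and run 1,000,000 iterations in total.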

blip 2014-02-17 21:48

I got the following output:

[CODE]
-- polite interval increased to 2
-- error_reset increased to 95
[/CODE]

What does that mean?

flashjh 2014-02-17 22:39

Can you post your Cudalucas.ini and the command line you're using to run the program?

blip 2014-02-17 23:39

1 Attachment(s)
I am running it as

[CODE]
CUDALucas -d 0
[/CODE]

