mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

LaurV 2012-07-23 08:52

There is a 2.03 Stable version and a (better, but still in development) 2.04 Beta version, both on the [URL="https://sourceforge.net/projects/cudalucas/files"]sourceforge[/URL] page. I personally use the Beta right now. There is no difference in the math, only in "cosmetic" things: the Beta has some "improvements", some of which work and some of which are still being worked on. :smile:

TObject 2012-07-24 00:04

CUDALucas 2.03

I accidentally hit Ctrl-C twice on the CUDALucas window. As a result, when I continued the test I got the following message: “The checkpoint doesn’t match current test. Current test will be restarted.”

This is bad. It shouldn’t be that easy to lose all the work, especially when some people may be accustomed to hitting Ctrl-C twice on an mfaktc window to exit immediately.

Dubslow 2012-07-24 00:39

[QUOTE=TObject;305680]CUDALucas 2.03

I accidentally hit Ctrl-C twice on the CUDALucas window. As a result, when I continued the test I got the following message: “The checkpoint doesn’t match current test. Current test will be restarted.”

This is bad. It shouldn’t be that easy to lose all the work, especially when some people may be accustomed to hitting Ctrl-C twice on an mfaktc window to exit immediately.[/QUOTE]

One of the problems (and one of the changes in 2.04) is that the message could mean a variety of things: that the metadata was corrupted, that the exponents didn't match, or (most likely, I think) that the main data was corrupted.
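To illustrate how those three cases could be told apart, here is a minimal sketch in Python (not CUDALucas's own C). The file layout is an assumption made up for this example: a small header holding the exponent, the iteration, and a CRC of the data, followed by a CRC of the header itself. The names `Status`, `encode`, and `diagnose` are hypothetical and do not come from the CUDALucas source.

```python
import struct
import zlib
from enum import Enum


class Status(Enum):
    OK = "ok"
    BAD_METADATA = "metadata corrupt"
    WRONG_EXPONENT = "exponent mismatch"
    BAD_DATA = "main data corrupt"


# Hypothetical header fields: exponent, iteration, CRC32 of the data.
FIELDS = struct.Struct("<III")
# The header is those fields plus a 4-byte CRC32 of the fields themselves.
HEADER_SIZE = FIELDS.size + 4


def encode(exponent, iteration, data):
    """Build a checkpoint blob in the assumed layout."""
    fields = FIELDS.pack(exponent, iteration, zlib.crc32(data))
    return fields + struct.pack("<I", zlib.crc32(fields)) + data


def diagnose(blob, expected_exponent):
    """Say which of the three failure cases applies, if any."""
    if len(blob) < HEADER_SIZE:
        return Status.BAD_METADATA
    fields = blob[:FIELDS.size]
    (header_crc,) = struct.unpack_from("<I", blob, FIELDS.size)
    if zlib.crc32(fields) != header_crc:
        return Status.BAD_METADATA
    exponent, _iteration, data_crc = FIELDS.unpack(fields)
    if exponent != expected_exponent:
        return Status.WRONG_EXPONENT
    if zlib.crc32(blob[HEADER_SIZE:]) != data_crc:
        return Status.BAD_DATA
    return Status.OK
```

With separate checks like these, each case could get its own message instead of a single catch-all.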

The ^C doesn't do anything itself except set a global quitting variable, which is in turn checked every iteration. A double ^C thus should not have had any effect, except perhaps printing the quitting message twice.
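The pattern described here can be sketched as follows. This is an illustration in Python, not the CUDALucas source; `quitting`, `on_sigint`, and `run` are hypothetical names, and the loop body is only a stand-in for a real iteration.

```python
import signal

quitting = False  # global flag, analogous to the quitting variable described above


def on_sigint(signum, frame):
    """The handler only records the request; it does no file I/O itself."""
    global quitting
    quitting = True


def run(total_iterations, start=0):
    """Loop until finished or until the flag is seen; return the iteration reached."""
    done = start
    while done < total_iterations and not quitting:
        done += 1  # stand-in for one Lucas-Lehmer iteration
    return done


if __name__ == "__main__":
    signal.signal(signal.SIGINT, on_sigint)
    reached = run(50_000_000)
    # The checkpoint would be written here, outside the signal handler.
    print("stopped at iteration", reached)
```

In this arrangement a second ^C only sets an already-set flag, so by itself it should change nothing.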

The only thing I can think of is that the second ^C perhaps arrived while one of the various fwrite() calls was executing, and that this somehow caused a corruption somewhere. I'll defer to more experienced programmers on that matter.

FWIW, I couldn't replicate it in 2.04 Beta.
[code]Iteration 7680000 M( 26661529 )C, 0x6a13e9d50b44c72e, n = 1440K, CUDALucas v2.04 Beta err = 0.1523 (1:48 real, 5.4042 ms/iter, ETA 28:29:30)
^C SIGINT caught, writing checkpoint. Estimated time spent so far: 11:44:11

bill@Gravemind:~/CUDALucas∰∂ ^C
bill@Gravemind:~/CUDALucas∰∂ CUDALucas

Continuing work from a partial result of M26661529 fft length = 1440K iteration = 7689302
^C^C SIGINT caught, writing checkpoint. SIGINT caught, writing checkpoint. Estimated time spent so far: 11:44:11

bill@Gravemind:~/CUDALucas∰∂ CUDALucas

Continuing work from a partial result of M26661529 fft length = 1440K iteration = 7689345
Iteration 7700000 M( 26661529 )C, 0x5e6f65ddfa011c0a, n = 1440K, CUDALucas v2.04 Beta err = 0.1406 (0:59 real, 2.9549 ms/iter, ETA 15:33:44)
Iteration 7720000 M( 26661529 )C, 0x572d5b0fd4b87e69, n = 1440K, CUDALucas v2.04 Beta err = 0.1523 (1:52 real, 5.5877 ms/iter, ETA 29:23:50)
Iteration 7740000 M( 26661529 )C, 0xa9f5f7180a3fd8c2, n = 1440K, CUDALucas v2.04 Beta err = 0.1543 (1:50 real, 5.4870 ms/iter, ETA 28:50:14)
Iteration 7760000 M( 26661529 )C, 0x65353774d697b137, n = 1440K, CUDALucas v2.04 Beta err = 0.1453 (1:49 real, 5.4559 ms/iter, ETA 28:38:36)
Iteration 7780000 M( 26661529 )C, 0x474870feb62f6ea0, n = 1440K, CUDALucas v2.04 Beta err = 0.1504 (1:49 real, 5.4690 ms/iter, ETA 28:40:55)
Iteration 7800000 M( 26661529 )C, 0x00e7204a64ae247d, n = 1440K, CUDALucas v2.04 Beta err = 0.1484 (1:50 real, 5.4655 ms/iter, ETA 28:37:59)
^C^C SIGINT caught, writing checkpoint. SIGINT caught, writing checkpoint. Estimated time spent so far: 11:55:57

bill@Gravemind:~/CUDALucas∰∂ CUDALucas

Continuing work from a partial result of M26661529 fft length = 1440K iteration = 7817985[/code]

TObject 2012-07-24 01:00

Thank you for the explanation. Hopefully 2.04 already fixed it. With 2.03 I can reliably duplicate the issue: every time I press Ctrl-C twice in quick succession, this error pops up on the next start.

TObject 2012-07-24 01:10

I upgraded to the beta version CUDALucas-2.04 Beta-4.1-sm_21-x64.exe and I confirm that the error I reported in the post [url=http://www.mersenneforum.org/showpost.php?p=305680&postcount=1498]1498[/url] has been fixed.

Thank you.

Edit: I spoke too soon. The error is still there, although it took a few tries to replicate it with 2.04.

TObject 2012-07-24 01:42

The error message in 2.04 is “The checkpoint appears to be corrupt. Current test will be restarted.”

A few thoughts:
a) Obviously, it would be nice if the corruption did not happen in the first place.
b) Do not overwrite the backup save file until it is determined that the main file is in good shape.
c) Instead of restarting the test from the beginning, attempt to restart it from the backup save file.
d) Consider asking “Do you want to restart the test?” rather than restarting automatically. Some people may have the ability to restore save files from backups or restore points, so they would answer no and go looking for a good version of the save file before newer versions are piled on top.

These are just friendly suggestions, not complaints. Thank you for your hard work on the application.
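Suggestions (c) and (d) can be sketched roughly as below, in Python rather than CUDALucas's C. The checkpoint layout (exponent and iteration, then the data, then a SHA-256 digest of everything before it) is an assumption for this example only, and `pack_checkpoint`, `read_checkpoint`, and `resume` are hypothetical names.

```python
import hashlib
import os
import struct

HEADER = struct.Struct("<II")  # hypothetical layout: exponent, iteration
DIGEST_LEN = hashlib.sha256().digest_size


def pack_checkpoint(exponent, iteration, residue):
    """Serialize a checkpoint and append a digest of its contents."""
    body = HEADER.pack(exponent, iteration) + residue
    return body + hashlib.sha256(body).digest()


def read_checkpoint(path, exponent):
    """Return (iteration, residue), or None if the file is unusable."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except OSError:
        return None  # missing or unreadable
    if len(blob) < HEADER.size + DIGEST_LEN:
        return None  # truncated
    body, digest = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    if hashlib.sha256(body).digest() != digest:
        return None  # corrupt
    saved_exponent, iteration = HEADER.unpack_from(body)
    if saved_exponent != exponent:
        return None  # belongs to a different test
    return iteration, body[HEADER.size:]


def resume(exponent, main_path, backup_path):
    """Try the main file first, then the backup; never restart silently."""
    for path in (main_path, backup_path):
        state = read_checkpoint(path, exponent)
        if state is not None:
            return path, state
    return None, None
```

When `resume` returns `(None, None)`, the caller decides what happens next, which is the natural place to ask the user before restarting from iteration zero.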

Dubslow 2012-07-24 01:51

Hmm... I've no idea what might be causing it. Any experts want to weigh in?

a) Well yeah :smile:
b) You mean when writing checkpoints?
c) Already on the todo list for 2.05
d) Anyone with that ability can restore the checkpoint regardless of whether the test is restarted: just delete or overwrite the new, short, restarted checkpoint.

TObject 2012-07-24 02:08

[QUOTE=Dubslow;305696]
b) You mean when writing checkpoints?
[/QUOTE]

Every time the backup save file is overwritten, as long as the check does not cause too much of a performance hit. The idea is to prevent a good backup save file from being overwritten by a corrupted save file.

Dubslow 2012-07-24 02:23

[QUOTE=TObject;305699]Every time the backup save file is overwritten, as long as the check does not cause too much of a performance hit. The idea is to prevent a good backup save file from being overwritten by a corrupted save file.[/QUOTE]

Here's the current pseudo-[URL="http://sourceforge.net/p/cudalucas/code/37/tree/trunk/CUDALucas.cu?force=True"]code[/URL] (Ctrl+F "write_checkpoint" if you're curious):
[code]delete t-ckp
move c-ckp to t-ckp
write current data to c-ckp[/code]
What about that would you change?
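One possible reordering, sketched in Python rather than CUDALucas's C: write the new data to a temporary file first, read it back, and only then rotate the files. The `.tmp` suffix and this function's signature are assumptions for illustration, and the read-back only catches short or failed writes, not every kind of disk-level corruption.

```python
import os


def write_checkpoint(data, current="c-ckp", backup="t-ckp"):
    """Write data to a temp file, verify it, then rotate current -> backup."""
    tmp = current + ".tmp"  # hypothetical temporary file name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to disk
    # Read the file back before touching either existing checkpoint.
    with open(tmp, "rb") as f:
        if f.read() != data:
            os.remove(tmp)
            return False  # both old checkpoints are left untouched
    if os.path.exists(current):
        os.replace(current, backup)  # previous verified checkpoint becomes the backup
    os.replace(tmp, current)  # new verified checkpoint becomes current
    return True
```

With this ordering, an interruption during the write should damage only the temporary file, and the backup is only ever replaced by a file that passed the same read-back check when it was written.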

kladner 2012-07-24 02:46

I just got CL up and running by itself on the GTX 460. With "SaveAllCheckpoints=1" set, I already have 143 saved checkpoints in 3h27m. Obviously, I need to slow the write timing way down.

I mention this to ask: do you have this set? Did the error wreck all checkpoints? Does it happen before more than one is created?

Sorry if this has already been addressed. I don't see it in the previous posts (or didn't understand it). I agree that the error should not be happening. But there is a multi-backup function available which would greatly mitigate its effects until the coding gods figure out what's going on. :smile:

Dubslow 2012-07-24 02:48

[QUOTE=kladner;305707]until the coding gods figure out what's going on. :smile:[/QUOTE]
I for one welcome any advice they might have. :razz:

Yes though, as kladner points out, SaveAllCheckpoints is a decent workaround.


All times are UTC.