mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

 2017-04-27, 10:36 #45 preda     "Mihai Preda" Apr 2015 2·23·29 Posts

Observed reliability

Some information from my testing so far: 2 double-checks in the 77M range were matches. (The following are all known primes.) 24036583 failed once, succeeded once. 42643801 failed twice, succeeded twice. All of these succeeded on the first run: 20996011, 25964951, 30402457, 32582657, 37156667, 74207281, 57885161. So the empirical error rate is about 20% (quite high IMO). The error rate may be affected by the temperature of the GPU.

Up to now I was unable to find a cause for these errors in software, especially since the software is supposed to be deterministic (producing the same result every time, without variation), and yet the results for the same exponent vary. That means either my assumption of determinism is wrong, or hardware errors are involved.

Anyway, such a high error rate is bad news. An erroneous result becomes more likely as the computation length increases (higher exponents). It would be great to have a way to validate the result on every LL iteration ("error detection"). If a wrong iteration can be detected, it can simply be re-tried until correct, and the ghost of unreliable hardware goes away. (Of course, there would be some cost involved in error detection.)
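The retry-until-correct idea above can be sketched in a few lines of Python. This is a hedged illustration of the principle, not gpuOwl code: it uses exact integer arithmetic for the LL recurrence, where two passes always agree; on real hardware the two passes would exercise the floating-point FFT path independently, so a transient fault would show up as a mismatch.

```python
# Sketch (not gpuOwl code): validate each LL iteration by computing it
# twice and retrying on mismatch, as suggested in the post above.

def ll_step(s, m):
    """One Lucas-Lehmer iteration: s -> s^2 - 2 (mod 2^p - 1)."""
    return (s * s - 2) % m

def ll_step_checked(s, m, max_retries=3):
    """Compute the step twice; accept the result only when both runs agree."""
    for _ in range(max_retries):
        a = ll_step(s, m)
        b = ll_step(s, m)  # on a GPU this would be an independent pass
        if a == b:
            return a
    raise RuntimeError("persistent mismatch: hardware suspect")

def ll_test(p):
    """Full LL test of the Mersenne number 2^p - 1 (p an odd prime)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = ll_step_checked(s, m)
    return s == 0  # 2^p - 1 is prime iff the final s is 0
```

For example, M13 = 8191 passes the test, while M11 = 2047 = 23·89 fails it. The cost of this scheme is a full 2x slowdown; cheaper consistency checks are possible but would need more machinery.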
2017-04-27, 10:39   #46
preda

"Mihai Preda"
Apr 2015

536₁₆ Posts

Quote:
 Originally Posted by airsquirrels I should have some time this weekend to do ISA dumps and try upgrading drivers / APP SDK to new versions on one of my FuryX Systems. Also, most consumer cards do occasionally have errors. I have seen them less often on AMD cards than NVIDIA, but they do happen. If it is helpful, I have a system with 3x W9100s in it which have ECC memory and (ideally) do not exhibit hardware errors (100s of double checks agree). If you setup to select the GPU I can run a few double check exponents on those cards to check stability.
Yep I'll try to add GPU selection. I don't have a multi-gpu system to test, but maybe it'll "just work" :)

2017-04-27, 12:55   #47
LaurV
Romulan Interpreter

Jun 2011
Thailand

3⁴·113 Posts

Quote:
 Originally Posted by preda But, to keep *all* the old checkpoints around? -- each file is 16MB. If you get 4000 of those, that'd be 64GB, probably too much.
Yes please. With a checkpoint interval of 500k or 1M, you have 160 files for an ~80M exponent. Even with 100k, you have 10 per million, or 800 per test, which is no more than 12 gig (but this never happens in practice, as they will be cleared once they match the parallel run; that is not the job of gpuOwl, but of additional tools). BTW, cudaLucas has two different counters, one for displaying on screen and one for saving checkpoint files, and both are interactive (you can decrement/increment them by pressing t/T, etc.).

But yes, ALL residues have to be kept, if the user so chooses, until he (the user) decides what to do with them.

After a LL/DC match, one can (manually or automatically, from the batch file, tool, etc) delete them, or whatever.

But there are many situations when one may need them.

One may have the surprise that, when doing a DC, a non-matching residue pops up; then he/she runs it again and gets the same non-match, with all residues matching all the way between the two runs. Comparing them against cudaLucas or P95 is how you find a hidden bug, in either gpuOwl, or cudaLucas, or (why not?) P95.

For example.

Last fiddled with by LaurV on 2017-04-27 at 12:56

 2017-04-27, 14:36 #48 preda     "Mihai Preda" Apr 2015 2466₈ Posts

@LaurV, I understand that you want a record of the residues (the 64-bit values); that makes sense. Right now the residue sequence is saved to gpuowl.log (though, arguably, the tools may not know how to parse that?). But do you also need to store the full 16MB checkpoints? The full checkpoint is only needed to allow a restart from that point. Do you want to restart from arbitrary points in time, or do you just need a full record of the residues?

I would like the default behavior to be friendly to a non-expert user (i.e., not fill their storage by default). I'll think about it.
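Extracting the residue sequence from gpuowl.log can be done with a short script. A hedged sketch in Python, based on the log lines quoted later in this thread; the exact line format is an assumption and may differ between gpuOwl versions:

```python
import re

# Match log lines such as:
#   07610000 / 76453229 [9.95%], ms/iter: 4.653, ETA: 3d 16:58; 00000000c8e14255 error 0.207659 (max 0.256673)
# capturing the iteration number and the 16-hex-digit residue.
LINE = re.compile(r"^(\d+) / \d+ .*; ([0-9a-f]{16})\b")

def residues(log_lines):
    """Return (iteration, residue) pairs from gpuowl.log lines."""
    out = []
    for line in log_lines:
        m = LINE.match(line.strip())
        if m:
            out.append((int(m.group(1)), m.group(2)))
    return out

sample = [
    "07610000 / 76453229 [9.95%], ms/iter: 4.653, ETA: 3d 16:58; 00000000c8e14255 error 0.207659 (max 0.256673)",
    "OpenCL setup: 1092 ms",  # non-progress lines are skipped
]
```

Here `residues(sample)` yields the single pair (7610000, "00000000c8e14255"); the setup line is ignored.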
 2017-04-27, 15:01 #49 wombatman I moo ablest echo power!     May 2013 1,741 Posts Why not keep only the two most recent checkpoints (so there's a backup in case the most recent is corrupted or something)? Since the residues are being kept elsewhere, it doesn't seem necessary to retain all the checkpoint files.
2017-04-27, 15:27   #50
preda

"Mihai Preda"
Apr 2015

2·23·29 Posts

Quote:
 Originally Posted by wombatman Why not keep only the two most recent checkpoints (so there's a backup in case the most recent is corrupted or something)? Since the residues are being kept elsewhere, it doesn't seem necessary to retain all the checkpoint files.
Yes, right now only the two most recent are kept, in save-N.bin and save-N.old. But I'm open to changing that; I just need a bit more thought about what to change it to.
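The two-file scheme described here can be sketched as a simple rotation: before writing a fresh checkpoint, the previous one is demoted to .old. The file naming follows the save-N.bin / save-N.old pattern from the post; the actual gpuOwl implementation may differ.

```python
import os

def write_checkpoint(exponent, data, directory="."):
    """Write a checkpoint, keeping only the two most recent copies."""
    new = os.path.join(directory, f"save-{exponent}.bin")
    old = os.path.join(directory, f"save-{exponent}.old")
    if os.path.exists(new):
        os.replace(new, old)   # demote the previous checkpoint to .old
    with open(new, "wb") as f:
        f.write(data)          # the fresh checkpoint becomes .bin
```

Using `os.replace` makes the demotion atomic on POSIX systems, so a crash between the two steps cannot leave zero valid checkpoints on disk.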

2017-04-27, 15:36   #51
preda

"Mihai Preda"
Apr 2015

1334₁₀ Posts

Quote:
 Originally Posted by preda Yep I'll try to add GPU selection. I don't have a multi-gpu system to test, but maybe it'll "just work" :)
I've added a command line option to select a specific device (see --help for list of devices).

I've tested by playing with running on the CPU. Surprisingly, it worked, but it was sooo slow... something like 100 times slower than mprime. Which shows both what a good implementation mprime has, and what a poor compiler the OpenCL-on-CPU stack has.

2017-04-27, 21:19   #52
ewmayer
2ω=0

Sep 2002
República de California

5×7×331 Posts

Quote:
 Originally Posted by LaurV Additional "complaints", besides the fact that the little thief is stealing my hex digits of the residue: 1. When something like this happens (yes, we can force it, by giving unrealistic work to the card at the same time it is doing gpuOwl): Code: Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4) , then the error should be saved in the file too, and carried along at the next restart (use a byte in the header, or so), or the program should exit and resume from an earlier saved file. Otherwise, (first) it is wasting precious time (yes, the residue is wrong; the correct one ends in 2BBB1710, and it continues with the wrong calculation, wasting cycles in vain)
Preda, does your code check fractional errors for every convolution output word on every iteration, or not?

2017-04-27, 23:50   #53
preda

"Mihai Preda"
Apr 2015

536₁₆ Posts

Quote:
 Originally Posted by ewmayer Preda, does your code check fractional errors for every convolution output word on every iteration, or not?
Yes, it takes into account the rounding error from *every* double-to-long conversion that is done.

It works like this:
On every iteration, a rounding error for every word is computed, and a running global maximum is updated.

At every logstep (every 20000 iterations), this global maximum is read, printed, and reset to 0.
Thus, while the rounding error is computed on every iteration, it is only visible in aggregated (max) form once per logstep.

But an excessive rounding error at any point should be caught.
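The mechanism described above can be sketched on the host side. This is a hedged illustration of the bookkeeping only (the real work happens in gpuOwl's OpenCL kernels); the function names are hypothetical.

```python
# Sketch: track the worst round-off of the double-to-long conversion per
# word, and report/reset the running maximum once per logstep.
LOGSTEP = 20000

def round_words(words_float):
    """Round each convolution output word, tracking the worst round-off."""
    max_err = 0.0
    out = []
    for x in words_float:
        r = round(x)
        max_err = max(max_err, abs(x - r))  # |fractional part| of this word
        out.append(r)
    return out, max_err

def run(iterations_of_words):
    """Aggregate the per-iteration maxima, printing once per logstep."""
    global_max = 0.0
    for it, words in enumerate(iterations_of_words, start=1):
        _, err = round_words(words)
        global_max = max(global_max, err)
        if it % LOGSTEP == 0:
            print(f"{it}: max error {global_max}")
            global_max = 0.0  # reset after each logstep, as described above
```

A maximum approaching 0.5 means a word may have been rounded to the wrong integer, which is why the "Error 4 is too large!" line quoted earlier signals a definitely-corrupt result.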

2017-04-28, 00:29   #54
LaurV
Romulan Interpreter

Jun 2011
Thailand

21701₈ Posts

Quote:
 Originally Posted by preda But do you also need to store the full 16MB checkpoints? the full checkpoint is only needed to allow a re-start from that point. Do you want to re-start from arbitrary points in time, or just need a full track of the residues?
yes, and yes.

We had all this "argument" for cudaLucas in the past, and I won't be very disappointed to win it again...

Quote:
 Originally Posted by wombatman Why not keep only the two most recent checkpoints
Not useful. The program is doing that right now. But the two files are usually both wrong, with 50% probability, unless the error happens in the very early stage of the test. When you test with two cards in parallel, one is usually faster and gets some distance ahead (more than two checkpoints) by the time the other spits out a residue which does not match. At that point, you have a 50% chance that the error is from the slower card, and a 50% chance that... you have to completely restart the other test from scratch.

@Mihai: you can keep the interface as it is, simple for the normal user, and keep the log and everything (very useful; granted, for other programs you can redirect the output to a file, and I have to do that sometimes), but please provide a "-save" switch (same as cudaLucas; you can call it whatever) which will save all the checkpoints to a "backup" folder. See cudaLucas' .ini file.
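The requested behavior amounts to copying each checkpoint into a backup folder under a unique name instead of overwriting it. A hedged sketch in Python; the "-save" switch and the naming scheme are the poster's request plus my assumptions, not an existing gpuOwl feature:

```python
import os
import shutil

def backup_checkpoint(path, iteration, backup_dir="backup"):
    """Copy a checkpoint file into backup_dir, named by iteration so
    successive checkpoints never overwrite each other."""
    os.makedirs(backup_dir, exist_ok=True)
    base = os.path.basename(path)
    dest = os.path.join(backup_dir, f"{iteration:08d}-{base}")
    shutil.copy2(path, dest)   # the original stays in place for restarts
    return dest
```

With the iteration number zero-padded into the file name, the backups sort chronologically, and a post-DC cleanup tool can delete them all at once.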

Like this:

(mind that these files are even double the size; with today's disks I wouldn't mind. Or, I can increase the number of iterations between checkpoints, to save space)

Last fiddled with by LaurV on 2017-04-28 at 00:32

 2017-04-28, 12:52 #55 LaurV Romulan Interpreter     Jun 2011 Thailand 3⁴·113 Posts

Grr... replying to myself... We have a bigger problem... Scratch the part with "the program is doing this right now". It does this only when you start it; then it resumes from .bin instead of .new, therefore the progress of the last run is always lost, and you resume from the point where the former run (the one before last) left off.

Code:
07610000 / 76453229 [9.95%], ms/iter: 4.653, ETA: 3d 16:58; 00000000c8e14255 error 0.207659 (max 0.256673)
07620000 / 76453229 [9.97%], ms/iter: 4.668, ETA: 3d 17:15; 00000000eca854d1 error 0.211343 (max 0.256673)
07630000 / 76453229 [9.98%], ms/iter: 4.674, ETA: 3d 17:21; 000000003f13c300 error 0.23212 (max 0.256673)
^C
e:\99 - Prime\gpuOwl>gpuowl -logstep 10000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 710000
OpenCL setup: 1092 ms
00720000 / 76453229 [0.94%], ms/iter: 4.668, ETA: 4d 02:12; 000000007c15e2ce error 0.205605 (max 0.205605)
^C

Edit: the workaround would be to manually copy the .new file to .bin, after interruption (or before resuming). Scrap the .bin, or rename it. Here the file header (where you can see the iteration number in the clear) actually helps a lot! That was a very, very good idea!!!

Last fiddled with by LaurV on 2017-04-28 at 13:12

