mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl
Old 2017-04-27, 10:36   #45
preda ("Mihai Preda")

Observed reliability

Some information from my testing so far:

2 double-checks in 77M range were matches.

(the following are all known primes)

24036583 failed once, succeeded once.
42643801 failed twice, succeeded twice.
All of these succeeded on the first run: 20996011, 25964951, 30402457, 32582657, 37156667, 74207281, 57885161.

So the empirical error rate is about 20% (quite high, IMO).
The error rate may be affected by the temperature of the GPU.

Up to now I have been unable to find a cause for these errors in software, especially since the software is supposed to be deterministic (producing the same result every time, without variation), and yet the results for the same exponent vary; so either my assumption of determinism is wrong, or hardware errors are involved.

Anyway, such a high error rate is bad news. An erroneous result becomes more likely as the computation length increases (higher exponents).

It would be great to have a way to validate the result on every LL iteration ("error detection"). If a wrong iteration can be detected, it can simply be re-tried until correct, and the ghost of unreliable hardware goes away. (Of course, there would be some cost involved in error detection.)
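The retry idea can be sketched like this (an illustrative Python sketch, not GpuOwl code; `step` and `check` are hypothetical stand-ins for one LL squaring and a per-iteration validity check):

```python
def run_with_retry(state, iterations, step, check, max_retries=5):
    """Run `iterations` LL steps, retrying any step that fails validation."""
    for _ in range(iterations):
        for _attempt in range(max_retries):
            candidate = step(state)
            if check(candidate):        # validation passed: accept this step
                state = candidate
                break
        else:
            # every retry failed validation: likely persistent hardware trouble
            raise RuntimeError("iteration kept failing validation")
    return state
```

A transient hardware error then costs only one repeated iteration instead of an entire wasted test.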
Old 2017-04-27, 10:39   #46
preda ("Mihai Preda")

Quote:
Originally Posted by airsquirrels
I should have some time this weekend to do ISA dumps and try upgrading drivers / APP SDK to new versions on one of my FuryX Systems.

Also, most consumer cards do occasionally have errors. I have seen them less often on AMD cards than on NVIDIA, but they do happen. If it is helpful, I have a system with 3x W9100s which have ECC memory and (ideally) do not exhibit hardware errors (hundreds of double checks agree). If you set up GPU selection, I can run a few double-check exponents on those cards to check stability.
Yep, I'll try to add GPU selection. I don't have a multi-GPU system to test on, but maybe it'll "just work" :)
Old 2017-04-27, 12:55   #47
LaurV (Romulan Interpreter)

Quote:
Originally Posted by preda
But, to keep *all* the old checkpoints around? -- each file is 16MB. If you get 4000 of those, that'd be 64GB, probably too much.
Yes please. With a checkpoint interval set to 500k or 1M, you have 160 files for some 80M exponent. Even with 100K, you have 10 per million, or 800 per test, which is no more than 12 gig (but this never happens, as they will be cleared by matching against the parallel run; this is not the job of gpuOwl, but of the additional tools). BTW, cudaLucas has two different counters, one for displaying on screen and one for saving checkpoint files, and both are interactive (you can decrement/increment them by pressing t/T, etc.).
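The storage arithmetic can be sanity-checked with a quick sketch (assuming 16 MB per checkpoint file, as stated earlier in the thread): an 80M exponent at a 500k interval gives 160 files (2.5 GB), and at a 100k interval 800 files, i.e. 12800 MB, matching the "no more than 12 gig" figure.

```python
def checkpoint_storage_mb(exponent, checkpoint_interval, file_mb=16):
    """Number of checkpoint files for a full test, and their total size in MB."""
    files = exponent // checkpoint_interval
    return files, files * file_mb
```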

But yes, ALL residues have to be kept, if the user so chooses, until he (the user) decides what to do with them.

After an LL/DC match, one can (manually or automatically, from a batch file, tool, etc.) delete them, or whatever.

But there are many situations when one may need them.

One may have the surprise that, when doing a DC, a non-matching residue pops up; then he/she runs it again and gets the same non-match, with all intermediate residues matching all the way, and by comparing them with cudaLucas or P95 -- that is how you find a hidden bug, in either gpuOwl, or cudaLucas, or (why not?) P95.

For example.

Last fiddled with by LaurV on 2017-04-27 at 12:56
Old 2017-04-27, 14:36   #48
preda ("Mihai Preda")

@LaurV, I understand that you want a record of the residues (the 64-bit values); that makes sense. Right now the residue sequence is saved to gpuowl.log (but, arguably, the tools may not know how to parse that?).

But do you also need to store the full 16MB checkpoints? The full checkpoint is only needed to allow a restart from that point. Do you want to restart from arbitrary points in time, or do you just need a full record of the residues?

I would like the default behavior to be friendly for a non-expert user (i.e., not fill their storage by default). I'll think about it.
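For what it's worth, a tool could recover the residue sequence from gpuowl.log with a small parser. This is a hedged sketch based on the log-line format shown elsewhere in this thread (e.g. `07610000 / 76453229 [9.95%], ms/iter: 4.653, ETA: 3d 16:58; 00000000c8e14255 error 0.207659 (max 0.256673)`); the exact format may differ between versions.

```python
import re

# Matches "<iteration> / <exponent> ...; <16-hex-digit residue> error ..."
LINE = re.compile(r"^(\d+) / \d+ .*; ([0-9a-f]{16}) error")

def residues(lines):
    """Extract (iteration, residue) pairs from gpuowl.log lines."""
    out = []
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            out.append((int(m.group(1)), m.group(2)))
    return out
```

Lines that are not progress reports (e.g. startup messages) are simply skipped.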
Old 2017-04-27, 15:01   #49
wombatman (I moo ablest echo power!)

Why not keep only the two most recent checkpoints (so there's a backup in case the most recent is corrupted or something)? Since the residues are being kept elsewhere, it doesn't seem necessary to retain all the checkpoint files.
Old 2017-04-27, 15:27   #50
preda ("Mihai Preda")

Quote:
Originally Posted by wombatman
Why not keep only the two most recent checkpoints (so there's a backup in case the most recent is corrupted or something)? Since the residues are being kept elsewhere, it doesn't seem necessary to retain all the checkpoint files.
Yes, right now only the two most recent are kept, in save-N.bin and save-N.old. But I'm open to changing that; I just need to think a bit more about what to change it to.
Old 2017-04-27, 15:36   #51
preda ("Mihai Preda")

Quote:
Originally Posted by preda
Yep I'll try to add GPU selection. I don't have a multi-gpu system to test, but maybe it'll "just work" :)
I've added a command line option to select a specific device (see --help for list of devices).

I've tested by playing with running on the CPU. Surprisingly, it worked, but it was so slow: something like 100 times slower than mprime. Which shows both what a good implementation mprime has, and what a poor job the OpenCL CPU compiler does.
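The selection logic might look like the following sketch (illustrative only; the `-device` flag name and the device list are assumptions here, and real device names would come from an OpenCL platform/device query such as `clGetDeviceIDs`):

```python
def pick_device(devices, argv):
    """Pick a device by index from a hypothetical '-device N' command-line flag."""
    if "-device" in argv:
        idx = int(argv[argv.index("-device") + 1])
    else:
        idx = 0  # default: first device found
    if not 0 <= idx < len(devices):
        raise ValueError("no device %d; only %d available" % (idx, len(devices)))
    return devices[idx]
```

With a multi-GPU box, each instance of the program would then be pointed at a different index.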
Old 2017-04-27, 21:19   #52
ewmayer (2ω=0)

Quote:
Originally Posted by LaurV
Additional "complaints", besides the fact that the little thief is stealing my hex digits of the residue:

1. When something like this happens (yes, we can force it by giving unrealistic work to the card at the same time it is doing gpuOwl):
Code:
Error 4 is too large!00030000 / 76453229 [0.04%], ms/iter: 5.561, ETA: 4d 22:03; 00000000eb2493b3 error 4 (max 4)
, then the error should be saved in the file too, and carried along at the next restart (use a byte in the header, or so), or the program should exit and resume from an earlier saved file. Otherwise, (first) it is wasting precious time (yes, the residue is wrong; the correct one ends in 2BBB1710, and it continues with the wrong calculation, wasting cycles in vain).
Preda, does your code check fractional errors for every convolution output word on every iteration, or not?
Old 2017-04-27, 23:50   #53
preda ("Mihai Preda")

Quote:
Originally Posted by ewmayer
Preda, does your code check fractional errors for every convolution output word on every iteration, or not?
Yes, it takes into account the rounding error from *every* double-to-long conversion that is done.

It works like this:
On every iteration, the rounding error for every word is computed, and a global maximum of that is updated.

On every logstep (20000 iterations), this global maximum is read, printed, and reset to 0.
Thus, while the rounding error is computed on every iteration, it is only visible in aggregated (max) form at every logstep.

But an error in rounding at any point should be caught.
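The scheme described above can be sketched as follows (a simplified model, not GpuOwl's actual OpenCL code; `word_errors` stands in for the per-word rounding errors of one iteration):

```python
class RoundoffTracker:
    """Track the max per-word rounding error; report and reset every logstep."""

    def __init__(self, logstep=20000):
        self.logstep = logstep
        self.max_err = 0.0
        self.reports = []

    def iteration(self, i, word_errors):
        # per iteration: fold this iteration's worst word error into the running max
        self.max_err = max(self.max_err, max(word_errors))
        if i % self.logstep == 0:
            # on every logstep: emit the aggregated max, then reset it
            self.reports.append((i, self.max_err))
            self.max_err = 0.0
```

This matches the behavior visible in the log lines later in the thread, where each progress line carries an `error ... (max ...)` pair.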
Old 2017-04-28, 00:29   #54
LaurV (Romulan Interpreter)

Quote:
Originally Posted by preda
But do you also need to store the full 16MB checkpoints? the full checkpoint is only needed to allow a re-start from that point. Do you want to re-start from arbitrary points in time, or just need a full track of the residues?
yes, and yes.

We had all this "argument" for cudaLucas in the past, and I wouldn't be too disappointed to win it again...

Quote:
Originally Posted by wombatman
Why not keep only the two most recent checkpoints
Not useful. The program is doing that right now. But the two files are usually both wrong, with 50% chance, unless the error happens in the very early stage of the test. When you test with two cards in parallel, one is usually faster and builds up a lead (more than two checkpoints) by the time the other spits out a residue which does not match. At that point, there is a 50% chance that the error is from the slower card, and 50% that... you have to completely restart the other test from scratch?

@Mihai: you can keep the interface as it is, simple for the normal user, and keep the log and everything (very useful; but OK, for other programs you can redirect the output to a file, and I have to do that sometimes), but please provide a "-save" switch (same as cudaLucas; you can call it whatever) which will save all the checkpoints to a "backup" folder. See cudaLucas' .ini file.

Like this:
[Attachment: resi.PNG]

(mind that these files are even double the size; for nowadays' disks I wouldn't mind. Or, I can increase the number of iterations between checkpoints, to save space)
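The requested "-save" behavior could be sketched like this (names and folder layout are assumptions for illustration, not cudaLucas' or gpuOwl's actual code): after each checkpoint is written, an extra copy goes into a backup folder under a name keyed by iteration, so every checkpoint survives.

```python
import shutil
from pathlib import Path

def backup_checkpoint(save_file, iteration, backup_dir="backup"):
    """Copy a freshly written checkpoint into a backup folder, keyed by iteration."""
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)
    target = dst / ("save-%d.bin" % iteration)
    shutil.copy2(save_file, target)  # copy2 preserves timestamps as well
    return target
```

After an LL/DC match, the whole backup folder can simply be deleted.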

Last fiddled with by LaurV on 2017-04-28 at 00:32
Old 2017-04-28, 12:52   #55
LaurV (Romulan Interpreter)

Grr.. replying to myself...

We have a bigger problem... Scratch the part about "the program is doing this right now". It does that only when you start it; it then resumes from .bin instead of .new, therefore the progress of the last run is always lost, and you resume from the point where the former run (the one before last) left off.

Code:
07610000 / 76453229 [9.95%], ms/iter: 4.653, ETA: 3d 16:58; 00000000c8e14255 error 0.207659 (max 0.256673)
07620000 / 76453229 [9.97%], ms/iter: 4.668, ETA: 3d 17:15; 00000000eca854d1 error 0.211343 (max 0.256673)
07630000 / 76453229 [9.98%], ms/iter: 4.674, ETA: 3d 17:21; 000000003f13c300 error 0.23212 (max 0.256673)
^C
e:\99 - Prime\gpuOwl>gpuowl -logstep 10000
gpuOwL v0.1 GPU Lucas-Lehmer primality checker
Tahiti - OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 76453229 (18.23 bits/word) at iteration 710000
OpenCL setup: 1092 ms
00720000 / 76453229 [0.94%], ms/iter: 4.668, ETA: 4d 02:12; 000000007c15e2ce error 0.205605 (max 0.205605)
^C
Edit: the workaround would be to manually copy the .new file over the .bin after an interruption (or before resuming). Scrap the .bin, or rename it. Here the file header (where you can see the iteration number in clear text) actually helps a lot! That was a very, very good idea!!!
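The workaround can be automated with a small sketch (a simplification: it compares file modification times rather than reading the iteration number from the checkpoint header, which would be the more robust test):

```python
import shutil
from pathlib import Path

def promote_newest(bin_path, new_path):
    """If the .new checkpoint is newer than the .bin one, copy it over .bin."""
    b, n = Path(bin_path), Path(new_path)
    if n.exists() and (not b.exists() or n.stat().st_mtime > b.stat().st_mtime):
        shutil.copy2(n, b)  # the newer .new file becomes the resume point
        return True
    return False
```

Run before restarting gpuOwl, this makes the resume point the most recent checkpoint instead of the one before last.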

Last fiddled with by LaurV on 2017-04-28 at 13:12