mersenneforum.org  

Old 2019-11-28, 04:43   #1475
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12475₈ Posts
DobleCheck etc.

I know that gpuowl v0.6 is ancient history at this point and is not being maintained, but I find it useful to be able to run LL DC with Jacobi check on exponents ~50M-77M on an AMD GPU.
A Radeon VII can knock one of those out in 15-20 hours.
I've noticed repeated errors in worktodo.txt. At first I thought they were typos I had made.
But it appears that when a result is produced and worktodo.txt is rewritten by the program, the DoubleCheck= key is getting altered for the following assignments. They then fail the validity test, and the program terminates since it has nothing to do, causing considerable loss of throughput. I've seen DobleCheck, DubleCheck, and oubleCheck generated.

It also failed to remove the worktodo item Test=57885161 after finding it prime, and then terminated instead of continuing with the following work.
In that case it also produced a result record with the AID equal to the exponent.
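
Until the cause is found, one possible workaround is a small pre-flight check that flags mangled keys in worktodo.txt before the program is started. The following is only a sketch, not part of gpuowl: the function name `suspicious_lines` is hypothetical, and the set of accepted keys (`Test`, `DoubleCheck`) is an assumption based on the keys mentioned in this post.

```python
from typing import List, Tuple

# Assumption: these are the only keys this gpuowl version accepts.
KNOWN_KEYS = {"Test", "DoubleCheck"}


def suspicious_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) for every non-blank line whose key is unknown."""
    bad = []
    # splitlines() treats both \n and \r\n as line endings.
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        key = line.split("=", 1)[0]
        if "=" not in line or key not in KNOWN_KEYS:
            bad.append((number, line))
    return bad
```

Running it over the file contents after each result would show a line such as DobleCheck=... as soon as it appears, so it can be corrected before the program exits for lack of valid work.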
Old 2019-11-28, 06:21   #1476
preda
 
"Mihai Preda"
Apr 2015

3·457 Posts

Funny. Probably something to do with Windows newlines (\r\n). Or memory corruption.
At some point I'd like to add LL back to master. Sorry, I don't think I'll look into fixing 0.6.

Old 2019-11-28, 07:09   #1477
preda
 
"Mihai Preda"
Apr 2015

10101011011₂ Posts

Quote:
Originally Posted by kriesel View Post
Had an error in a gpuowl v6.11-9 P-1 run:
Code:
2019-11-27 13:03:48 Exception NSt10filesystem7__cxx1116filesystem_errorE: filesystem error: cannot rename: File exists [C:\Users\ken\Document
414000187\414000187-new.p2.owl] [C:\Users\ken\Documents\v6.11-9-g9ae3189\414000187\414000187.p2.owl]
2019-11-27 13:03:48 waiting for background GCDs..
2019-11-27 13:03:48 Bye
This is strange, I don't understand why it happened. Can you reproduce it? Does it happen every time? Is there anything special, like a full disk, a read-only folder, a read-only file, etc.?

There are 3 files:
foo-old.owl ("old")
foo.owl ("savefile")
foo-new.owl ("new")

The sequence is:
1. write "new"
2. remove "old" (ignoring errors)
3. rename "savefile" to "old" (ignoring errors)
4. rename "new" to "savefile"

It seems in your case step 4 failed. It failed because "savefile" was there. That suggests that step 3 silently failed.
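
For readers who want to experiment with the failure mode, the four steps above can be sketched in Python. This is an illustration only, not GpuOwl's actual code (GpuOwl is C++ and the log shows it using std::filesystem); the function name `rotate_save` is hypothetical. Python's `os.rename` happens to refuse to overwrite an existing destination on Windows, so if step 3 silently fails there, step 4 raises an error much like the one in the log.

```python
import os
from pathlib import Path


def rotate_save(folder: Path, stem: str, data: bytes) -> None:
    """Write a new savefile, keeping the previous one as "-old"."""
    old = folder / f"{stem}-old.owl"
    save = folder / f"{stem}.owl"
    new = folder / f"{stem}-new.owl"

    # 1. write "new"
    new.write_bytes(data)

    # 2. remove "old" (ignoring errors, e.g. it does not exist yet)
    try:
        old.unlink()
    except OSError:
        pass

    # 3. rename "savefile" to "old" (ignoring errors)
    try:
        os.rename(save, old)
    except OSError:
        pass  # if this fails silently, "savefile" is still in place

    # 4. rename "new" to "savefile"; errors are NOT ignored here.
    # On Windows os.rename raises FileExistsError if "savefile" still exists.
    os.rename(new, save)
```

One way to test the race-condition theory would be to log the error in step 3 instead of swallowing it, which would show whether something else had the savefile open at that moment.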
Old 2019-11-28, 07:11   #1478
kracker
 
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

Quote:
Originally Posted by preda View Post
At some point I'd like to add LL back to master. Sorry, I don't think I'll look into fixing 0.6.


Also, thank you!!!
Old 2019-11-28, 13:59   #1479
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

3×3,221 Posts

Quote:
Originally Posted by preda View Post
At some point I'd like to add LL back to master.
We salute that idea and are waiting for it to be sculpted in that stone called gpuOwl...

Last fiddled with by LaurV on 2019-11-28 at 14:00
Old 2019-11-28, 16:12   #1480
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12475₈ Posts

Quote:
Originally Posted by preda View Post
This is strange, I don't understand why it happened. Can you reproduce it? Does it happen every time? Is there anything special, like a full disk, a read-only folder, a read-only file, etc.?
I've only seen it once on this system, while cranking through four 41xM P-1 runs on a GTX 1080. The disk has 1.34 TB free, no files are read-only, the exponent folder has the same properties as the others that did not show the issue, and the user has full-control permissions; I haven't modified any permissions. Maybe it's some sort of race condition with Windows file indexing, which is enabled?
Old 2019-11-28, 16:15   #1481
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5437₁₀ Posts

Quote:
Originally Posted by preda View Post
At some point I'd like to add LL back to master.
That would be very welcome, as it would likely incorporate the accumulated accommodations for Windows/Linux differences and provide a supported version for LL DC. Ideally it would include the Jacobi check and a pseudorandom offset.
Old 2019-11-28, 20:43   #1482
preda
 
"Mihai Preda"
Apr 2015

3·457 Posts
-pool <dir>

My usual way of running multiple instances of GpuOwl was:
- each in its own folder, each with its own worktodo.txt and results.txt
- the script primenet.py watching all these folders, keeping the right amount of work queued in worktodo and sending the results out.

I started thinking about how to do a "common worktodo.txt", i.e. multiple instances feeding from one worktodo. This is the solution I came up with:

- specify one "shared" directory (using "-pool <dir>")
- this shared dir contains only worktodo.txt and results.txt
- every instance of GpuOwl works as before, inside its own local folder, with these two changes:
a) when the local worktodo.txt is empty, extract the first assignment from the shared worktodo and move it to the local worktodo
b) write any result to the shared results.txt instead of the local one

This allows primenet.py to now watch only the shared folder, and not the local ones.
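
As a rough illustration of changes a) and b), here is a minimal Python sketch. It is not GpuOwl's implementation: the function names `take_from_pool` and `append_result` are hypothetical, and the sketch does no file locking, so it assumes that only one instance touches the shared worktodo.txt at a time. The post does not say how concurrent access is handled.

```python
from pathlib import Path
from typing import Optional


def take_from_pool(pool_dir: Path, local_dir: Path) -> Optional[str]:
    """a) If the local worktodo is empty, move the first shared assignment into it."""
    local = local_dir / "worktodo.txt"
    shared = pool_dir / "worktodo.txt"

    if local.exists() and local.read_text().strip():
        return None  # local work remains, leave the pool alone
    if not shared.exists():
        return None

    lines = [ln for ln in shared.read_text().splitlines() if ln.strip()]
    if not lines:
        return None  # the pool is empty too

    first, rest = lines[0], lines[1:]
    # NOTE: no locking here; a real multi-instance setup would need some.
    shared.write_text("".join(ln + "\n" for ln in rest))
    local.write_text(first + "\n")
    return first


def append_result(pool_dir: Path, result_line: str) -> None:
    """b) Append a result to the shared results.txt instead of the local one."""
    with open(pool_dir / "results.txt", "a") as f:
        f.write(result_line.rstrip("\n") + "\n")
```

With something like this, each instance would call `take_from_pool` whenever its local worktodo runs dry, and a watcher script would only need to look at the pool directory.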


Number of assignments:
Before: for N instances, I was queuing 3*N PRP assignments
After: for N instances, I queue N in the shared worktodo, plus 1 in each of the N local folders, for a total of 2*N.
Old 2019-11-29, 03:22   #1483
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

9663₁₀ Posts

Quote:
Originally Posted by preda View Post
The usual way for me running multiple instances of GpuOwl was:
- each in its own folder, each with its own worktodo.txt and results.txt
There is nothing wrong with that, and we were doing it for cudaLucas for ages. The advantage was that our rusted, OCD-etched soul felt happy managing the stuff 'face-to-face', in person.

Of course, we salute the new idea of a common pool (like misfit does for the mfaktX programs), and generally we salute any improvement, even though we think your efforts and commendable skills are being a bit wasted, channeled in the wrong direction. Make theOwl faster and better, add back the LL, improve the P-1, add a few additional FFTs, optimize the old ones, fix old bugs, etc., and let us, 'the stupid masses', handle multiple instances by ourselves. It is not like we are doing thousands of assignments per day as in TF. We just have one or two worktodo files, which change once or twice weekly when some LL finishes, and looking at our TWO folders once per week is not such a bothersome activity... or is it?

Let's be serious, how many of you have 50 GPUs in your rigs? Most of us have 1, few have 2, rarely 3 or 4. Those with more than one, anyhow are "hooked", they spend all the day looking at the folder where LL is running, with nothing else on the screen, and doing nothing else than counting iterations all day... "Yeaahh, 1% done, still 99% to finish! Good... WTF? it was the same 20 seconds ago? No progress?"

Moreover, adding a common pool would be detrimental when you have two instances and run the same exponent in both (LL+DC) - well, some of us are still doing that, better to waste some resources than lose a prime, so the result will still be two different folders, with two different pools, each pool running a single instance, each instance sucking from its own pool, or so...

Last fiddled with by LaurV on 2019-11-29 at 03:35
Old 2019-11-29, 11:04   #1484
R. Gerbicz
 
"Robert Gerbicz"
Oct 2005
Hungary

1486₁₀ Posts

Quote:
Originally Posted by LaurV View Post
Moreover, adding a common pool would be detrimental when you have two instances and run the same exponent in both (LL+DC) - well, some of us are still doing that, better to waste some resources than lose a prime
And have you ever found a mismatch in the residues?
Old 2019-11-30, 00:22   #1485
storm5510
Random Account
 
Aug 2009

7AB₁₆ Posts

Quote:
Originally Posted by LaurV View Post
...WTF? it was the same 20 seconds ago? No progress?"
I have two caveats:

#1: The screen writes could be more frequent. The interval appears to be 10,000 iterations, or 10,000 of something. With my vision being what it is, I walk by the screen and wonder whether it is still running or has frozen. Please let the user decide by making this a config.txt option. Being an antique programmer, I understand there may be some effort involved.

#2: For every exponent run, a folder containing checkpoint information is created but not deleted after the test completes. The housekeeping could be better.

Other than these, I feel gpuOwl does a really good job. I have only run P-1 with it. Stage 2 is far faster than in any of the other programs I have used.
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.