mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Old 2020-10-07, 08:55   #56
preda ("Mihai Preda", Apr 2015)
getting there

It's sort of starting to work. Feel free to experiment. Probably plenty of rough corners & bugs remaining; I'm still working on it.
Old 2020-10-07, 09:01   #57
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by preda View Post
We know that for PRP, it's beneficial to run two GpuOwl processes per GPU, and this fits nicely with the very low memory requirement of the PRP.

But now in the merged PRP + P-1, there are two steps (P1, P2) that require a lot of memory. What to do? The simple solutions are: run only one process and give it the full memory of the GPU (-maxAlloc), or run two processes and give each 50% of RAM. Both solutions are somehow suboptimal.

I attempted a different solution, let's call it "Memlock". Each process knows on which device it runs (that small number in -device 0). We can use -maxAlloc to allow each process to use (almost) the full RAM of one GPU. (E.g. for a 16GB GPU that is *not* running the monitor, I would use -maxAlloc 15G; if running the monitor, -maxAlloc 14G.) This works in conjunction with -pool <dir>, which indicates a directory shared by all GpuOwl processes. Each process, when entering a "big memory" region (P1, P2), will attempt to acquire a memory lock on the device by creating an entry in the pool directory (e.g. /pool/memlock-1) and wait if another process is already in a big memory region.

On normal exit the process will properly release the lock, but on crash the lock may need to be removed manually -- just delete that memlock-N directory (it's an empty directory BTW).
If I understand your proposal correctly, I see issues with that.
1) It requires using -pool. Not everyone uses it. I don't.

2) Treating stage 1 as a large-memory phase means only one P-1 run per GPU at a time. A pair of P-1-required-only worktodo files on the same GPU would run one process and stall the other, even if there's enough GPU memory to run four stage 1s. Traditionally stage 1 does not require much more RAM than a primality test, which takes only 573 MB even at a 181M exponent in v6.11-380, and 371 MB for a 100M exponent.

3) Performance advantage from multiple instances was reduced by recent raw performance improvements in gpuowl. As I recall cases were found where running multiple instances reduced performance.

Are GPU RAM requirements much larger for v7 for the same exponent? With v6.11 I've successfully run both P-1 stages up to a 500M exponent on 8GB GPUs, and up to 1G on 16GB (single instance/GPU).

Last fiddled with by kriesel on 2020-10-07 at 09:04
Old 2020-10-07, 09:37   #58
preda ("Mihai Preda", Apr 2015)

Quote:
Originally Posted by kriesel View Post
If I understand your proposal correctly, I see issues with that.
1) It requires using -pool. Not everyone uses it. I don't.
Using -pool with "memlock" allows running two processes, each with -maxAlloc of 100% of GPU RAM. It is always possible to not use -pool and take either of the other options mentioned: a single process with 100% of RAM, or two processes with 50% of RAM each.

Quote:
2) Treating stage 1 as a large-memory phase means only one P-1 run per GPU at a time. A pair of P-1-required-only worktodo files on the same GPU would run one process and stall the other, even if there's enough GPU memory to run four stage 1s. Traditionally stage 1 does not require much more RAM than a primality test, which takes only 573 MB even at a 181M exponent in v6.11-380, and 371 MB for a 100M exponent.
In the combined PRP/P-1, the P-1 first stage uses a lot of RAM; it simply becomes faster with more RAM. Also, in the combined setup P1 (i.e. the first stage) typically runs for less than 10% of the PRP length, so there is not really a need to run multiple P1s simultaneously.

But it's possible -- just give each 50% of RAM; they go a bit slower, that's all.

Quote:
3) Performance advantage from multiple instances was reduced by recent raw performance improvements in gpuowl. As I recall cases were found where running multiple instances reduced performance.
I'd like single-instance to be the fastest too, but that's not what I see in my setup.

Quote:
Are GPU RAM requirements much larger for v7 for the same exponent? With v6.11 I've successfully run both P-1 stages up to a 500M exponent on 8GB GPUs, and up to 1G on 16GB (single instance/GPU).
Yes, memory requirements of first-stage increased. Second-stage decreased a bit.
Old 2020-10-07, 10:32   #59
preda ("Mihai Preda", Apr 2015)
P-1 bounds

The standalone P-1 worktype has been replaced by, and integrated into, PRP. So a worktodo line is a normal PRP line:

PRP=xxxxxAIDxxxx,1,2,100238077,-1,76,2

Note the last digit above, "2", indicating that a P-1 test is desired for this exponent before the PRP (or, in other words, the "2" indicates that the exponent didn't have P-1 done before).

The line can optionally be preceded by explicit bounds, e.g.:

B1=6000000;PRP=xxxxxAIDxxxx,1,2,100238077,-1,76,2
B2=50000000;PRP=xxxxxAIDxxxx,1,2,100238077,-1,76,2
B1=6000000,B2=50000000;PRP=xxxxxAIDxxxx,1,2,100238077,-1,76,2

(this allows setting bounds per-exponent).
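As a sketch of how such a line could be split into an optional bounds prefix and the PRP fields (a hypothetical Python helper for illustration, not GpuOwl's actual parser):

```python
# Hypothetical sketch: split an optional "B1=...[,B2=...];" prefix off a
# PRP worktodo line and extract the exponent and the trailing P-1 digit.
def parse_worktodo(line):
    bounds = {}
    if ";" in line and line.split(";", 1)[0].startswith("B"):
        prefix, line = line.split(";", 1)
        for part in prefix.split(","):           # e.g. "B1=6000000"
            key, _, value = part.partition("=")
            bounds[key] = int(value)
    fields = line.split("=", 1)[1].split(",")    # drop the "PRP=" tag
    exponent = int(fields[3])                    # 4th field is the exponent
    pm1_needed = int(fields[-1])                 # trailing 0, 1 or 2
    return bounds, exponent, pm1_needed

bounds, exp, pm1 = parse_worktodo(
    "B1=6000000,B2=50000000;PRP=xxxxxAIDxxxx,1,2,100238077,-1,76,2")
# bounds == {"B1": 6000000, "B2": 50000000}, exp == 100238077, pm1 == 2
```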

The bounds can also be specified "for all exponents" in config.txt or on the command line with -b1 and -b2.

GpuOwl will run P-1 during the PRP if any of these is met:
- "1" or "2" at the end of the PRP line (instead of "0")
- B1 or B2 specified on the PRP line
- b1 or b2 on command line or config

How the bounds are established: explicit bounds override defaults, and per-exponent bounds override config bounds.

The default B1 (used when no explicit B1 is specified) is roughly equal to exponent/20, which comes to 5M or 5.5M at the wavefront. The default B2 is 20×B1.

(So for a 100M exponent without explicit bounds but with "2" at the end, the default bounds would be B1=5M,B2=100M).
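The default-bounds arithmetic above can be sketched as follows (formulas taken from the description: B1 ≈ exponent/20, B2 = 20×B1; illustrative only, not GpuOwl's exact rounding):

```python
# Sketch of the default P-1 bounds described above (assumed formulas:
# B1 ~ exponent/20, B2 = 20*B1).
def default_bounds(exponent):
    b1 = exponent // 20
    b2 = 20 * b1
    return b1, b2

b1, b2 = default_bounds(100_000_000)
# b1 == 5_000_000 (5M) and b2 == 100_000_000 (100M), matching the example
```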
Old 2020-10-07, 10:35   #60
M344587487 ("Composite as Heck", Oct 2017)

Assuming the large memory requirement falls within a single time interval, memlock is a simple way to get the processes out of phase and should do the job fine. Assuming the normal case of the queued exponents being close and slowly increasing, there should be barely any stalls after the initial one, barring the occasional small exponent from a previously expired allocation knocking the processes back into phase. It might be wise to let the processes get a bit more out of phase than immediately needed, to account for most of the variability, but that's only if micro-stalls are considered a problem.
Old 2020-10-07, 10:38   #61
preda ("Mihai Preda", Apr 2015)
Changing bounds

The B1 bound can't be changed during a PRP test. It must be specified with the same value over the length of the PRP (and from the beginning).

To change/update B1 after the PRP test has started, you need to move or delete the exponent folder (with its savefiles) and restart the PRP with the new B1.

The B2 can be changed during the PRP. Simply specify the new value, and P2 will either be extended (if the new value is larger) or end early (if smaller).

Unintentionally changing B1 during an ongoing PRP test can be a bit annoying, as the PRP test will refuse to start with a changed B1. If so, one can always override B1 "only for this exponent" in the worktodo line (to keep it constant for the ongoing test).
Old 2020-10-07, 12:14   #62
Aramis Wyler ("Bill Staffen", Jan 2013, Pittsburgh, PA, USA)

I find myself again on the same page as M344587487 and again confused by the response.

The P-1 portion of the PRP run is at the beginning, right? It does first stage (FS) up to the bound, stops PRP'ing, runs the second stage (SS), and then picks back up again with the PRP. Wouldn't it release the memory at the end of the SS and spend the rest of the time with a small memory footprint, at which point we could start the second PRP job with large memory?
Old 2020-10-07, 13:46   #63
M344587487 ("Composite as Heck", Oct 2017)

I was responding to the description of memlock and theorising how to minimise the number of stalls for the typical use case, which at best is a micro-optimisation as the stalls might be numerous in that case but short. preda is giving us an info dump of how to customise bounds, not responding to me.



Your description matches my understanding: big memory is required until it isn't. Memlock is just a simple way to let only one process use big memory at a time.
Old 2020-10-07, 14:34   #64
Prime95 ("P90 years forever!", Aug 2002, Yeehaw, FL)

Quote:
Originally Posted by preda View Post
On normal exit the process will properly release the lock, but on crash the lock may need to be removed manually -- just delete that memlock-N directory (it's an empty directory BTW).

Should manual lock removal become an irritation, the memlock-N directory could contain the process-ID of gpuowl. The memory would be considered locked only if memlock-N exists and a gpuowl process with that recorded ID is still running.
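A minimal sketch of that PID-stamped lock, assuming a pool directory containing a memlock-N directory with a "pid" file inside (names and layout hypothetical, not GpuOwl's actual code):

```python
import os

# Hypothetical PID-stamped memlock sketch: the directory creation is the
# atomic acquire step; a recorded PID lets a later process detect staleness.
def try_acquire_memlock(pool, device):
    lock = os.path.join(pool, "memlock-%d" % device)
    pidfile = os.path.join(lock, "pid")
    try:
        os.mkdir(lock)                 # atomic: only one process can succeed
    except FileExistsError:
        try:
            owner = int(open(pidfile).read())
            os.kill(owner, 0)          # signal 0: existence check only
            return False               # owner still alive; caller should wait
        except (OSError, ValueError):
            pass                       # stale lock (dead owner): claim it
    with open(pidfile, "w") as f:      # record ourselves as the owner
        f.write(str(os.getpid()))
    return True

def release_memlock(pool, device):
    lock = os.path.join(pool, "memlock-%d" % device)
    os.remove(os.path.join(lock, "pid"))
    os.rmdir(lock)
```

Note the stale-claim path is not fully race-free: two processes that both find a dead owner could both claim the lock, so a real implementation would want an atomic rename or a retry loop there.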
Old 2020-10-07, 14:37   #65
retina ("The unspeakable one", Jun 2006, My evil lair)

Quote:
Originally Posted by preda View Post
On normal exit the process will properly release the lock, but on crash the lock may need to be removed manually -- just delete that memlock-N directory (it's an empty directory BTW).
How come a mutex or semaphore can't work here? That is the primary use case those primitives were made for.
Old 2020-10-07, 15:55   #66
M344587487 ("Composite as Heck", Oct 2017)

Creating a file lock is implementing a mutex at the process level. To run two gpuowl workers you run two instances of the program, so there are two processes. A normal mutex that you'd use to control threads within a process does not apply AFAIK.
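For reference, an OS-level advisory file lock behaves like such a cross-process mutex, with the bonus that the kernel releases it automatically if the holder crashes. A POSIX-only sketch (an assumed alternative, not what gpuowl does):

```python
import fcntl
import os

# Sketch: an advisory flock() on a lock file acts as a cross-process mutex.
# Unlike a manually managed lock directory, the kernel drops the lock
# automatically when the holding process exits or crashes.
def acquire(path):
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until the lock is free
    return fd                        # keep fd open while holding the lock

def release(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```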