mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl
2020-10-08, 21:19   #78
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Gpuowl-win V7.0-18 observations

Quote:
Originally Posted by preda:
Hi Ken, could you please check, is there a genuine permission issue preventing the file rename?
No permission issues found. Renamed easily manually. Settings for the file and folder match what works in v6.11-380 in another folder.

Note: one of the awkward things during P2 is that there is no ETA (for the more likely NF case, or the less likely F case).

The % complete is inconsistent between a stop and a resume.
The P2 iterations apparently don't all get saved at a stop and resume.
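For what it's worth, a rough NF-case ETA can be back-computed from any two progress lines; a back-of-envelope sketch, with the field values copied from the session log below and the variable names my own:

```python
# Back-of-envelope P2 ETA from two gpuowl progress lines (values copied from
# the session log below; variable names are illustrative, not gpuowl's).
B2 = 100_000_000      # second-stage bound (denominator of the progress field)
prev = 12_392_205     # progress at 15:56:41
done = 12_854_205     # progress at 15:58:54
interval_s = 133      # wall-clock seconds between those two log lines

rate = (done - prev) / interval_s   # progress units per second
eta_s = (B2 - done) / rate          # seconds remaining to reach B2
print(f"ETA ~ {eta_s / 3600:.1f} h")
```

Nothing fancy -- something like this printed by gpuowl itself would close the gap.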

It's natural there would be some "rough edges" after a substantial rewrite.
Code:
2020-10-08 15:55:17 condorella/rx480 100352873 P2 GCD : no factor
2020-10-08 15:56:41 condorella/rx480 100352873 P2 12392205/100000000 (  8%); 24578 muls, 5434 us/mul
2020-10-08 15:58:54 condorella/rx480 100352873 P2 12854205/100000000 (  8%); 24522 muls, 5422 us/mul
2020-10-08 16:00:17 condorella/rx480 100352873 P2 Starting GCD
2020-10-08 16:01:07 condorella/rx480 100352873 P2 13316205/100000000 (  9%); 24402 muls, 5468 us/mul
2020-10-08 16:01:56 condorella/rx480 100352873 P2 GCD : no factor
2020-10-08 16:03:20 condorella/rx480 100352873 P2 13778205/100000000 (  9%); 24318 muls, 5473 us/mul
2020-10-08 16:03:47 condorella/rx480 100352873 P2 13822095/100000000 (  9%); 2352 muls, 11091 us/mul
2020-10-08 16:03:47 condorella/rx480 100352873 P2 Released memory lock 'memlock-0'
2020-10-08 16:03:47 condorella/rx480 Exiting because "stop requested"
2020-10-08 16:03:47 condorella/rx480 Bye
Terminate batch job (Y/N)? n

>gpuowl-win
2020-10-08 16:04:29 gpuowl v7.0-18-g69c2b85
2020-10-08 16:04:29 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -proof 9 -use NO_ASM
2020-10-08 16:04:29 device 0, unique id ''
2020-10-08 16:04:29 condorella/rx480 100352873 FFT: 5.50M 1K:11:256 (17.40 bpw)
2020-10-08 16:04:33 condorella/rx480 100352873 OpenCL args "-DEXP=100352873u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DAMDGPU=1 -DCARRY64=1 -DCARRYM64=1 -DWEIGHT_STEP_MINUS_1=0x8.3d478387094c8p-4 -DIWEIGHT_STEP_MINUS_1=-0xa.e0996ab7cf31p-5 -DNO_ASM=1  -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-10-08 16:04:37 condorella/rx480 100352873 OpenCL compilation in 4.68 s
2020-10-08 16:04:37 condorella/rx480 100352873 maxAlloc: 7.3 GB
2020-10-08 16:04:37 condorella/rx480 100352873 Space for 327 B1 buffers (available mem 7184.4 MB, buf size 22.0 MB)
2020-10-08 16:04:38 condorella/rx480 100352873 B1=5000000 (7214911 bits)
2020-10-08 16:04:41 condorella/rx480 100352873 OK  7215000 loaded: blockSize 500, 80feeffcdb81f502
2020-10-08 16:04:41 condorella/rx480 100352873 validating proof residues for power 9
2020-10-08 16:04:44 condorella/rx480 100352873 Proof using power 9
2020-10-08 16:04:50 condorella/rx480 100352873 OK  7216000   7.19%; 5657 us/it; ETA 6d 02:22; 902affbabbb0173f
2020-10-08 16:04:50 condorella/rx480 100352873 P2 (5000000,100000000) will continue from B2=13142955
2020-10-08 16:04:51 condorella/rx480 100352873 P2 B1=5000000, B2=100000000, D=210: 4903534 primes in [13142955, 100000005], selected 4301057 (87.7%) (602477 doubles + 3698580 singles)
2020-10-08 16:04:51 condorella/rx480 100352873 P2 B1=5000000, B2=100000000, D=210 from B2=13142955 : 413605 blocks starting at 62586
2020-10-08 16:04:51 condorella/rx480 100352873 P2 Aquired memory lock 'memlock-0'
2020-10-08 16:04:51 condorella/rx480 100352873 P2 Allocated 24 P2 buffers
2020-10-08 16:04:52 condorella/rx480 100352873 P2 Setup 24 P2 buffers in 919.4 ms
2020-10-08 16:04:52 condorella/rx480 100352873 P2 13143165/100000000 (  0%); 8 muls, 6456 us/mul
2020-10-08 16:07:05 condorella/rx480 100352873 P2 13605165/100000000 (  1%); 24363 muls, 5435 us/mul
2020-10-08, 21:43   #79
preda ("Mihai Preda", Apr 2015)

And a factor:
Code:
? factor(183358639670121039844487732119-1)
%14 = 
[        2 1]
[        3 1]
[        7 1]
[       11 1]
[      739 1]
[     1301 1]
[    48271 1]
[ 85261691 1]
[100299191 1]
2020-10-09, 02:40   #80
kriesel

update

Sweet, 92+ bits.

Gpuowl-win v7.0-26-g8e6a1d1 seems to address the worktodo rename issue. It handled P-1 & PRP on several low Mersenne primes.
However, CPU usage was unusually high and GPU usage unusually low during P-1 activity. Time estimates show corresponding changes.
Possibly it relates to a Radeon VII on a Celeron G1840 system, a fast-GPU/slow-CPU combo. I've not seen it with V6.11-380 running there, though, in LLDC, PRP, or P-1. This is on Win10. The effect would not be as visible on the RX480, because there the GPU is slower, the CPUs are faster, and GPU-Z monitoring for AMD via remote desktop does not work on Win7.
Attached thumbnail: gpu p-1 load anomaly.png (98.4 KB)

Last fiddled with by kriesel on 2020-10-09 at 03:20
2020-10-10, 09:13   #81
preda

Quote:
Originally Posted by kriesel:
However, CPU usage was unusually high and GPU usage unusually low during P-1 activity. Time estimates show corresponding changes.
Possibly it relates to a Radeon VII on a Celeron G1840 system, a fast-GPU/slow-CPU combo. I've not seen it with V6.11-380 running there, though, in LLDC, PRP, or P-1. This is on Win10. The effect would not be as visible on the RX480, because there the GPU is slower, the CPUs are faster, and GPU-Z monitoring for AMD via remote desktop does not work on Win7.
The most CPU-intensive operation is the GCD. After that, building the P2 plan (once at the start of P2) can also use more CPU. Beyond those, the P1 CPU use may have increased a (tiny) bit. Let's keep an eye on it; as we work out the kinks we'll see whether the CPU problem you see persists. I do not reproduce it -- I see an increase of ~1% (from 1.7% to 2.7%) in CPU use when gpuowl is running P1.
2020-10-11, 08:20   #82
preda

I've been blessed with a flaky GPU

My broken GPU generates about 20 errors per PRP test. Those are handled just fine by the error check for PRP, but the situation is different for P-1.

I tried to harden the P1 (first stage). I reimplemented the "fold" step, which is at the core of the new P1 implementation (the "fold" operation aggregates the many P1 buffers into a single residue, used for saving or at the end of P1), to make it less exposed to GPU errors. And I added the Jacobi check for P1, which happens (it seems) to work nicely with the new P1 algorithm.

While there, I also added residue==0 detection for PRP, which was a long-requested feature. This simply flags any res64==0 as suspicious and does an early check when possible.

I added back a more-frequent res64 display in the log (now that I read it more frequently from the GPU side); I do not expect this to affect performance.


P1: Jacobi check

A Jacobi check is done at the start (on load) of P1, and every 1M iterations while P1 is ongoing. The Jacobi check is a CPU operation, about as slow as a GCD (say 35 s of CPU time at the wavefront).
*If* the Jacobi check fails (which is rare), a rather tricky rollback over savefiles is attempted, to start anew from an earlier point that is hopefully not affected (we'll see what bugs hide in there).
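For readers unfamiliar with the check: the Jacobi symbol of the residue modulo Mp can be computed on the CPU with the standard binary algorithm. A minimal sketch of the symbol computation (not gpuowl's code):

```python
def jacobi(a: int, n: int) -> int:
    """Jacobi symbol (a/n) for odd n > 0, via the standard binary algorithm."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:            # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):      # (2/n) = -1 when n = 3 or 5 (mod 8)
                result = -result
        a, n = n, a                  # quadratic reciprocity swap
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0   # 0 when gcd(a, n) > 1
```

The idea, as I understand it, is that the correct residue's symbol is predictable, while a corrupting error flips it with probability about 1/2, so each periodic check catches roughly half of the errors since the last one.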


P2 : What about the error-hardening of P2 (i.e. second stage of P-1)?
I think the Jacobi check is not applicable to P2; OTOH I also think that computation errors are not critical in P2, I'll explain why.

For P1 we need the precisely-correct final P1 result; even the slightest error during P1 would make both the P1 GCD and the whole of the following P2 useless.

The situation is different, though (IMO), for P2. An error during P2 would only affect the prime factors that happened to be P2-accumulated between the last P2 GCD and the location of the error. The follow-up multiplications (those that take place after the error) are not affected by it. Thus, in P2, an error "erases" just a few stage-two primes (depending on how often the P2 GCD is done), without the "catastrophic" consequence of nullifying the whole P2. There is one exception: if the P2 accumulator becomes zero (due to an error) it would remain stuck there -- but this special value can be detected and handled (e.g. by resetting the accumulator to 1).
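The zero-accumulator exception fits in a few lines; a sketch with illustrative names, not gpuowl's actual code:

```python
# Sketch of the zero-accumulator guard described above (illustrative, not
# gpuowl's actual code). Each selected stage-2 prime contributes one modular
# multiplication into an accumulator; a zero caused by an error would
# otherwise propagate through every later product.
def p2_accumulate(terms, N):
    acc = 1
    for t in terms:
        acc = (acc * t) % N
        if acc == 0:      # stuck state: all later primes would be nullified
            acc = 1       # reset; only the primes since the last GCD are lost
    return acc
```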
2020-10-11, 21:04   #83
kriesel

Quote:
Originally Posted by preda:
The most CPU-intensive operation is the GCD. After that, building the P2 plan (once at the start of P2) can also use more CPU. Beyond those, the P1 CPU use may have increased a (tiny) bit. Let's keep an eye on it; as we work out the kinks we'll see whether the CPU problem you see persists. I do not reproduce it -- I see an increase of ~1% (from 1.7% to 2.7%) in CPU use when gpuowl is running P1.
Have you tried lower exponents, such as below 30M? I saw reproducibly low GPU usage and anomalously long iteration times at low exponents (2400 us/it at p=13466917 during P1 on a Radeon VII, in V7.0-35-gf06bc5b).
The effect was much reduced at 20996011, and seems also reduced at 24M. So hopefully it is slight at the wavefront.

The addition of some P-1 error checking is great news BTW.

Last fiddled with by kriesel on 2020-10-11 at 21:05
2020-10-11, 22:30   #84
preda

Quote:
Originally Posted by kriesel:
Have you tried lower exponents, such as below 30M? I saw reproducibly low GPU usage and anomalously long iteration times at low exponents (2400 us/it at p=13466917 during P1 on a Radeon VII, in V7.0-35-gf06bc5b).
The effect was much reduced at 20996011, and seems also reduced at 24M. So hopefully it is slight at the wavefront.

The addition of some P-1 error checking is great news BTW.
One thing to check is whether the GPU memory is exhausted (this results in a terrible slowdown of the GPU).

On smaller exponents, a "residue" is smaller. It appears that each buffer allocated on the GPU carries some overhead. Thus if you pass, say, -maxAlloc 15G on the R7 (which has 16G of physical memory), GpuOwl will think it stays under 15G with its allocations, but more than 16G actually gets allocated on the GPU. Maybe start with -maxAlloc 10G and see how it goes. If you have some way to see the amount of free memory on the GPU, you can measure the "over-allocation" ratio and verify that you stay under full memory.
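A toy model of the over-allocation, assuming the driver rounds each buffer up to a fixed granule (that mechanism and all the numbers below are my guesses; the real per-buffer overhead is driver-specific):

```python
GRANULE = 2 * 2**20          # assumed 2 MiB driver allocation granularity
buf_size = int(5.5 * 2**20)  # ~5.5 MB buffer, plausible for a small exponent
n_bufs = 2730                # enough buffers to nominally approach 15 GB

nominal = n_bufs * buf_size  # what the software thinks it allocated
rounded = (buf_size + GRANULE - 1) // GRANULE * GRANULE  # driver rounds up
actual = n_bufs * rounded    # what the GPU actually commits
print(f"nominal {nominal / 2**30:.1f} GiB, actual {actual / 2**30:.1f} GiB")
```

With these made-up numbers the overshoot is ~9%; the point is only that the ratio grows as buffers shrink, which is why small exponents are the problem case.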

PS:
I tested with this:
./gpuowl -prp 6972649 -b1 1500000 -b2 10000000 -maxAlloc 10G
and I see about 100us/it on R7.

Last fiddled with by preda on 2020-10-11 at 22:35
2020-10-12, 07:04   #85
preda

And a 116-bit factor:
Code:
? factor(88196068168080719419045580163465679 - 1)
%28 =
[ 2 1]
[ 3 1]
[ 3119 1]
[ 3257 1]
[ 106781 1]
[ 2006659 1]
[ 67281289 1]
[100369781 1]

? log(88196068168080719419045580163465679)/log(2)
%29 = 116.08626956720425079258812889488565495

PS: I'm reworking P2 ATM
2020-10-12, 16:45   #86
kriesel

Thanks. I overlooked the format change from gpuowl V6 to V7 for -maxAlloc.
Code:
-maxAlloc <size>   : limit GPU memory usage to size, which is a value with suffix M for MB and G for GB.
                     e.g. -maxAlloc 2048M or -maxAlloc 3.5G
-maxAlloc 14000 was fine in V6.11-380. In V7 it causes gpuowl to use ~15 GB in stage 1 of P-1 on a Radeon VII, then drop to a few GB in stage 2.
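For reference, the documented syntax is easy to parse; a sketch matching the quoted help text (the bare-number case is an assumption -- v6 treated it as MB, and I don't know what v7 does with it):

```python
def parse_size(s: str) -> int:
    """Parse a -maxAlloc value per the help text above: a number with an
    optional M (MB) or G (GB) suffix. Bare-number handling is an assumption."""
    s = s.strip()
    units = {"M": 2**20, "G": 2**30}
    suffix = s[-1].upper()
    if suffix in units:
        return int(float(s[:-1]) * units[suffix])
    return int(s)  # assumed: no suffix means bytes
```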

After M25964951 P-1 stage 2 concluded, I stopped and resumed the PRP to correct config.txt; GPU RAM usage dropped from 2+ GB to 161 MB, as if P2 is not freeing memory when it is done. All GPU RAM figures are from the GPU-Z utility.

Last fiddled with by kriesel on 2020-10-12 at 16:47
2020-10-14, 01:33   #87
preda

Quote:
Originally Posted by kriesel:
Thanks. I overlooked the format change from gpuowl V6 to V7 for -maxAlloc.
Code:
-maxAlloc <size>   : limit GPU memory usage to size, which is a value with suffix M for MB and G for GB.
                     e.g. -maxAlloc 2048M or -maxAlloc 3.5G
-maxAlloc 14000 was fine in V6.11-380. In V7 it causes gpuowl to use ~15 GB in stage 1 of P-1 on a Radeon VII, then drop to a few GB in stage 2.

After M25964951 P-1 stage 2 concluded, I stopped and resumed the PRP to correct config.txt; GPU RAM usage dropped from 2+ GB to 161 MB, as if P2 is not freeing memory when it is done. All GPU RAM figures are from the GPU-Z utility.
I don't think this concerns the -maxAlloc format change, which didn't change much and works the same way as in v6.x.

The problem I see is that, even when the software (GpuOwl) stays nicely under the requested maxAlloc, the GPU actually uses a lot more memory than the allocated size (like 25% more) when many small buffers are allocated. The software can't predict this behavior, so the fallback is for the user to artificially lower -maxAlloc to counter the GPU driver's "over-allocation". This problem is significant only for small exponents (up to tens of millions) and is not present at wavefront exponents.

Last fiddled with by preda on 2020-10-14 at 01:34
2020-10-14, 01:38   #88
preda

New P2

Quote:
Originally Posted by preda:
PS: I'm reworking P2 ATM
A new implementation of P2 has been merged, with three main changes in behavior:

1. it uses all the memory you can throw at it (up to -maxAlloc, of course) to speed up,
2. it is significantly faster (~15%), by using more buffers and better pairing,
3. it also works fine (i.e. fast enough) in low-memory situations.

And as usual, this is also an opportunity to introduce new bugs -- please make sure P2 does detect some factors before starting serious P-1.
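A toy illustration of the pairing idea (my reconstruction of standard stage-2 pairing, not the merged code): with D=210, primes near a block center c = k*D sit at offsets j coprime to D, and when both c-j and c+j are prime, one multiplication covers both -- the "doubles" vs "singles" counts seen in the P2 plan line of the earlier log.

```python
from math import gcd

# Toy reconstruction of D=210 stage-2 prime pairing (not gpuowl's code).
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n**0.5) + 1):
        if n % d == 0:
            return False
    return True

D = 210
OFFSETS = [j for j in range(1, D // 2) if gcd(j, D) == 1]  # 24 offsets

def classify_block(k: int):
    """Count paired ("double") and unpaired ("single") primes around c = k*D."""
    c = k * D
    doubles = sum(1 for j in OFFSETS if is_prime(c - j) and is_prime(c + j))
    singles = sum(1 for j in OFFSETS if is_prime(c - j) != is_prime(c + j))
    return doubles, singles
```

Every prime in (c - D/2, c + D/2) is covered as 2*doubles + singles multiplicand slots, and each double saves one multiplication over handling its two primes separately -- my reading of where the ~15% comes from.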