#78
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
12452₈ Posts
Quote:
Note, one of the awkward things about/during P2 is that there is no ETA (for the more likely NF case, or the less likely F case). The % complete is inconsistent between a stop and a resume. The P2 iterations apparently don't all get saved at a stop and resume. It's natural there would be some "rough edges" after a substantial rewrite.
Code:
2020-10-08 15:55:17 condorella/rx480 100352873 P2 GCD : no factor
2020-10-08 15:56:41 condorella/rx480 100352873 P2 12392205/100000000 ( 8%); 24578 muls, 5434 us/mul
2020-10-08 15:58:54 condorella/rx480 100352873 P2 12854205/100000000 ( 8%); 24522 muls, 5422 us/mul
2020-10-08 16:00:17 condorella/rx480 100352873 P2 Starting GCD
2020-10-08 16:01:07 condorella/rx480 100352873 P2 13316205/100000000 ( 9%); 24402 muls, 5468 us/mul
2020-10-08 16:01:56 condorella/rx480 100352873 P2 GCD : no factor
2020-10-08 16:03:20 condorella/rx480 100352873 P2 13778205/100000000 ( 9%); 24318 muls, 5473 us/mul
2020-10-08 16:03:47 condorella/rx480 100352873 P2 13822095/100000000 ( 9%); 2352 muls, 11091 us/mul
2020-10-08 16:03:47 condorella/rx480 100352873 P2 Released memory lock 'memlock-0'
2020-10-08 16:03:47 condorella/rx480 Exiting because "stop requested"
2020-10-08 16:03:47 condorella/rx480 Bye
Terminate batch job (Y/N)? n
>gpuowl-win
2020-10-08 16:04:29 gpuowl v7.0-18-g69c2b85
2020-10-08 16:04:29 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -proof 9 -use NO_ASM
2020-10-08 16:04:29 device 0, unique id ''
2020-10-08 16:04:29 condorella/rx480 100352873 FFT: 5.50M 1K:11:256 (17.40 bpw)
2020-10-08 16:04:33 condorella/rx480 100352873 OpenCL args "-DEXP=100352873u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DAMDGPU=1 -DCARRY64=1 -DCARRYM64=1 -DWEIGHT_STEP_MINUS_1=0x8.3d478387094c8p-4 -DIWEIGHT_STEP_MINUS_1=-0xa.e0996ab7cf31p-5 -DNO_ASM=1 -cl-unsafe-math-optimizations -cl-std=CL2.0 -cl-finite-math-only "
2020-10-08 16:04:37 condorella/rx480 100352873 OpenCL compilation in 4.68 s
2020-10-08 16:04:37 condorella/rx480 100352873 maxAlloc: 7.3 GB
2020-10-08 16:04:37 condorella/rx480 100352873 Space for 327 B1 buffers (available mem 7184.4 MB, buf size 22.0 MB)
2020-10-08 16:04:38 condorella/rx480 100352873 B1=5000000 (7214911 bits)
2020-10-08 16:04:41 condorella/rx480 100352873 OK 7215000 loaded: blockSize 500, 80feeffcdb81f502
2020-10-08 16:04:41 condorella/rx480 100352873 validating proof residues for power 9
2020-10-08 16:04:44 condorella/rx480 100352873 Proof using power 9
2020-10-08 16:04:50 condorella/rx480 100352873 OK 7216000 7.19%; 5657 us/it; ETA 6d 02:22; 902affbabbb0173f
2020-10-08 16:04:50 condorella/rx480 100352873 P2 (5000000,100000000) will continue from B2=13142955
2020-10-08 16:04:51 condorella/rx480 100352873 P2 B1=5000000, B2=100000000, D=210: 4903534 primes in [13142955, 100000005], selected 4301057 (87.7%) (602477 doubles + 3698580 singles)
2020-10-08 16:04:51 condorella/rx480 100352873 P2 B1=5000000, B2=100000000, D=210 from B2=13142955 : 413605 blocks starting at 62586
2020-10-08 16:04:51 condorella/rx480 100352873 P2 Aquired memory lock 'memlock-0'
2020-10-08 16:04:51 condorella/rx480 100352873 P2 Allocated 24 P2 buffers
2020-10-08 16:04:52 condorella/rx480 100352873 P2 Setup 24 P2 buffers in 919.4 ms
2020-10-08 16:04:52 condorella/rx480 100352873 P2 13143165/100000000 ( 0%); 8 muls, 6456 us/mul
2020-10-08 16:07:05 condorella/rx480 100352873 P2 13605165/100000000 ( 1%); 24363 muls, 5435 us/mul
#79
"Mihai Preda"
Apr 2015
55B₁₆ Posts
And a factor:
Code:
? factor(183358639670121039844487732119-1)
%14 =
[        2 1]
[        3 1]
[        7 1]
[       11 1]
[      739 1]
[     1301 1]
[    48271 1]
[ 85261691 1]
[100299191 1]
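As a quick sanity check on the factor above (a from-scratch sketch in plain Python, not part of GpuOwl; the Mersenne exponent is not stated in the post, so the code infers it from the two large primes dividing f-1):

```python
from math import prod

f = 183358639670121039844487732119

# The exponent p must be one of the two large primes dividing f-1,
# since any factor f of 2^p - 1 satisfies p | f-1. Identify which one.
candidates = [85261691, 100299191]
p = next(q for q in candidates if pow(2, q, f) == 1)  # f divides 2^p - 1

# Structural constraints on factors of 2^p - 1 (p an odd prime):
# f = 2*k*p + 1 and f ≡ ±1 (mod 8).
assert f % (2 * p) == 1
assert f % 8 in (1, 7)

# f - 1 matches the pari factorization above; this smoothness is
# exactly what made the factor reachable by P-1.
factors = [2, 3, 7, 11, 739, 1301, 48271, 85261691, 100299191]
assert prod(factors) + 1 == f
```

The `pow(2, p, f) == 1` test is the definitive check that f really divides the Mersenne number, and runs instantly even for wavefront-sized exponents.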
#80
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5418₁₀ Posts
Sweet, 92+ bits.

Gpuowl-win v7.0-26-g8e6a1d1 seems to address the worktodo rename issue. It handled P-1 & PRP on several low Mersenne primes. However, cpu usage was unusually high and gpu usage unusually low during P-1 activity, and time estimates show corresponding changes. Possibly it relates to a Radeon VII on a Celeron G1840 system, a fast-gpu/slow-cpu combo. I've not seen it with v6.11-380 running there, though, in LLDC, PRP, or P-1. This is on Win10.

The effect would not be as visible on the RX480, because there the gpu is slower, the cpus are faster, and GPU-Z monitoring for AMD via remote desktop does not work on Win7.

Last fiddled with by kriesel on 2020-10-09 at 03:20
#81
"Mihai Preda"
Apr 2015
1371₁₀ Posts
Quote:
#82
"Mihai Preda"
Apr 2015
3×457 Posts
My broken GPU generates about 20 errors per PRP test. Those are handled just fine by the error check for PRP, but the situation is different for P-1.
I tried to harden the P1 first stage. I reimplemented the "fold" step, which is at the core of the new P1 implementation (the "fold" operation aggregates the many P1 buffers into a single residue, used for saving or at the end of P1), to make it less exposed to GPU errors. And I added the Jacobi check for P1, which happens (it seems so) to work nicely with the new P1 algorithm.

While there, I also added residue==0 detection for PRP, which was a long-requested feature. This simply flags any res64==0 as suspicious and does an early check when possible. I also added back a more-frequent res64 display in the log (now that I read it more frequently from the GPU side); I do not expect this to affect performance.

P1: Jacobi check
A Jacobi check is done at the start (on-load) of P1, and every 1M iterations while P1 is ongoing. The Jacobi check is a CPU operation, about as slow as a GCD (let's say 35s of CPU at the wavefront). *If* the Jacobi check fails (which is rare), a rather tricky rollback over savefiles is attempted, to start anew from an earlier point hopefully not affected (we'll see what bugs hide in there).

P2: what about error-hardening of P2 (i.e. the second stage of P-1)? I think the Jacobi check is not applicable to P2; OTOH I also think that computation errors are not critical in P2, and I'll explain why. For P1 we need the precisely-correct final P1 result; even the slightest error during P1 would make both the P1 GCD and all of the following P2 useless. The situation is different (IMO) for P2: an error during P2 affects only the prime factors that happened to be P2-accumulated between the last P2 GCD and the location of the error. The multiplications that take place after the error are not affected by it. Thus, in P2, an error "erases" just a few stage-two primes (depending on how often the P2 GCD is done), without the "catastrophic" consequence of nullifying the whole P2.

There is one exception: if the P2 accumulator becomes zero (due to an error) it would remain stuck there, but this special value can be detected and handled (e.g. by resetting the accumulator to 1).
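To illustrate the idea behind the Jacobi check (a minimal from-scratch sketch, not GpuOwl's implementation; the exponent and stage-1 exponent E here are toy values): stage 1 produces X = 3^E mod M for a known E, and because the Jacobi symbol is multiplicative, (X|M) must equal (3|M)^E. The CPU can recompute the symbol of X cheaply and compare it against that prediction. A random corruption flips the symbol about half the time, so repeating the check every 1M iterations catches most errors.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, via the standard binary algorithm."""
    assert n > 0 and n % 2 == 1
    a %= n
    result = 1
    while a:
        while a % 2 == 0:          # pull out factors of 2, flipping the sign
            a //= 2                # when n ≡ 3, 5 (mod 8)
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                # quadratic reciprocity step
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

# Toy stage 1: X = 3^E mod M, with E standing in for the real stage-1
# exponent (the product of prime powers up to B1); p is kept small here.
p = 11213                          # a Mersenne prime exponent, for speed
M = 2**p - 1
E = 2**20 * 3**5 * 5**3 * 7**2

X = pow(3, E, M)                   # what the GPU would hand back
j3 = jacobi(3, M)
expected = 1 if (j3 == 1 or E % 2 == 0) else -1   # (3/M)^E
assert jacobi(X, M) == expected    # the Jacobi consistency check
```

The check is one-sided: a mismatch proves corruption, while a match only says the residue is consistent with probability ~1/2 per independent error, which is why it is repeated periodically.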
#83
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
12452₈ Posts
Quote:
The effect was much reduced at 20996011, and seems reduced at 24M as well. So, hopefully it is slight at the wavefront.

The addition of some P-1 error checking is great news, BTW.

Last fiddled with by kriesel on 2020-10-11 at 21:05
#84
"Mihai Preda"
Apr 2015
3·457 Posts
Quote:
On smaller exponents, a "residue" is smaller. It appears that for each buffer allocated on the GPU there is some overhead. Thus if you pass, let's say, -maxAlloc 15G on the Radeon VII (which has 16G of physical memory), GpuOwl will think it stays under 15G with its allocations, but more than 16G actually gets allocated on the GPU. Maybe start with -maxAlloc 10G and see how it goes. If you have some way to see the amount of free memory on the GPU, you can measure the "over-allocation" ratio and verify you stay under full memory.

PS: I tested with this:
Code:
./gpuowl -prp 6972649 -b1 1500000 -b2 10000000 -maxAlloc 10G
and I see about 100us/it on the R7.

Last fiddled with by preda on 2020-10-11 at 22:35
#85
"Mihai Preda"
Apr 2015
55B₁₆ Posts
And a 116-bit factor:
Code:
? factor(88196068168080719419045580163465679 - 1)
%28 =
[        2 1]
[        3 1]
[     3119 1]
[     3257 1]
[   106781 1]
[  2006659 1]
[ 67281289 1]
[100369781 1]

? log(88196068168080719419045580163465679)/log(2)
%29 = 116.08626956720425079258812889488565495

PS: I'm reworking P2 ATM.
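For context on why this factor was reachable (a hedged sketch; the exponent and the B1/B2 actually used are not stated in the post, so the code infers them): P-1 finds f when every prime-power factor of f-1, apart from the exponent itself, is at most B1, except possibly one prime between B1 and B2. From the factorization one can read off the minimal bounds that would have sufficed.

```python
from math import prod

f = 88196068168080719419045580163465679
factors = [2, 3, 3119, 3257, 106781, 2006659, 67281289, 100369781]
assert prod(factors) + 1 == f          # matches the pari output above

# Infer the Mersenne exponent p among the large primes dividing f-1:
# only the true p satisfies 2^p ≡ 1 (mod f).
p = next(q for q in factors if q > 10**7 and pow(2, q, f) == 1)

# Minimal bounds that would have caught this factor (all exponents in the
# factorization are 1 here, so prime powers reduce to primes):
rest = sorted(q for q in factors if q != p)
b1_needed, b2_needed = rest[-2], rest[-1]   # largest goes to stage 2
```

With the inferred p, `b1_needed` and `b2_needed` show how deep each stage had to reach; the largest remaining prime only needed to fall under B2, not B1, which is the whole point of running stage 2.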
#86
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2×3²×7×43 Posts
Thanks. I overlooked the format change from gpuowl v6 to v7 for -maxAlloc:
Code:
-maxAlloc <size> : limit GPU memory usage to size, which is a value with suffix M for MB and G for GB.
e.g. -maxAlloc 2048M or -maxAlloc 3.5G
On stopping and resuming M25964951 PRP after P-1 stage 2 had concluded (to correct the config.txt), gpu ram usage dropped from 2+GB to 161MB, as if P2 does not free its memory when it is done. All gpu ram figures are from the GPU-Z utility.

Last fiddled with by kriesel on 2020-10-12 at 16:47
#87
"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
The problem I see is that, even when the software (GpuOwl) stays nicely under the requested maxAlloc, the GPU actually uses a lot more memory than the allocated size (like 25% more) when many small buffers are allocated. This behavior can't be predicted by the software, so the fallback is for the user to artificially lower -maxAlloc to counter the "over-allocation" by the GPU driver. This problem is significant only for small exponents, up to tens of millions, and is not present at wavefront exponents.

Last fiddled with by preda on 2020-10-14 at 01:34
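A back-of-the-envelope way to pick a safer -maxAlloc under this behavior (purely illustrative; the 25% per-buffer over-allocation figure comes from the observation above, and `effective_usage_bytes` / `safe_max_alloc` are hypothetical helper names, not GpuOwl options):

```python
GB = 1 << 30

def effective_usage_bytes(n_buffers, buf_bytes, overhead=0.25):
    """Memory the driver really consumes for n_buffers of buf_bytes each,
    assuming a fixed fractional per-buffer over-allocation."""
    return int(n_buffers * buf_bytes * (1 + overhead))

def safe_max_alloc(physical_bytes, overhead=0.25, headroom=0.05):
    """-maxAlloc value (in bytes) that keeps the driver's inflated usage
    below physical GPU memory, with a little headroom."""
    return int(physical_bytes * (1 - headroom) / (1 + overhead))

# Radeon VII: 16 GB physical. -maxAlloc 15G can overflow; this model
# suggests requesting roughly 12 GB instead.
suggested = safe_max_alloc(16 * GB)
```

If a free-memory readout is available (e.g. via GPU-Z as in the posts above), the measured over-allocation ratio can replace the assumed 25%.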
#88
"Mihai Preda"
Apr 2015
2533₈ Posts
A new implementation of P2 has been merged, with three main changes in behavior:
1. it uses all the memory you can throw at it (up to -maxAlloc, of course) to speed up,
2. it is significantly faster (about 15%), by using more buffers and better pairing,
3. it also works fine (i.e. fast enough) in low-memory situations.

And as usual, this is also an opportunity to introduce new bugs -- please make sure P2 does detect some factors before starting serious P-1.
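For readers curious what "pairing" means here (a counting model only, not GpuOwl's implementation; `count_pairs` and the small bounds are illustrative): with a wheel size D such as the D=210 seen in the logs earlier in the thread, two stage-2 primes of the form k·D − r and k·D + r can be accumulated with a single multiplication (a "double"), while an unpaired prime costs one on its own (a "single") — compare the "602477 doubles + 3698580 singles" log line above.

```python
def primes_between(lo, hi):
    """Plain sieve of Eratosthenes, returning the primes in [lo, hi]."""
    sieve = bytearray([1]) * (hi + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(hi ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [n for n in range(lo, hi + 1) if sieve[n]]

def count_pairs(b1, b2, D=210):
    """Count 'doubles' (a prime p and its mirror 2kD - p around the nearest
    multiple of D, both prime) versus 'singles' among primes in (b1, b2]."""
    primes = set(primes_between(b1 + 1, b2))
    done = set()
    doubles = singles = 0
    for p in sorted(primes):
        if p in done:
            continue
        done.add(p)
        partner = 2 * round(p / D) * D - p  # mirror of p around nearest k*D
        if partner != p and partner in primes and partner not in done:
            done.add(partner)
            doubles += 1
        else:
            singles += 1
    return doubles, singles

doubles, singles = count_pairs(100, 1000)
```

More buffers let the implementation keep more relative positions r live at once, which is one way "more buffers and better pairing" converts singles into doubles and saves multiplications.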