mersenneforum.org  

2019-09-22, 14:53   #1398
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts

Quote (originally posted by preda):
Yes that seems pretty broken. I'm not sure why yet; I did push a new commit -- could you try it and tell me how it works? (pls check both with/without -time)

There's no need to wait 10 minutes -- if it doesn't do the usual progress, or does not react to Ctrl-C, it's broken.
No problem, I was going through paper mail while it ran. I retried the previous commit with -yield but without -time: similar behavior. Eight minutes, zero iterations. It did respond to Ctrl-C, though. I will try to make and run the latest commit after breakfast.
Attached thumbnail: yield cpu stalls gpu.png (83.3 KB)

2019-09-22, 16:35   #1399
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts

Win7 X64 Pro, NVIDIA GTX1080Ti, gpuowl-win v6.11-6-g02fd645, M226M P-1 stage 2 continuation:


No -time:

Without -yield: operates normally on the GPU but fully occupies a CPU core (in this case a hyperthread on one of the Xeon E5520 packages); a round took 9 minutes 24 seconds.

With -yield: zero CPU use after 12 core-seconds of initialization, but also zero GPU load per GPU-Z, so probably zero progress.


With -time:
Without -yield: operates normally on the GPU but fully occupies a CPU core (again a hyperthread on one of the Xeon E5520 packages); a round took 9 minutes 34.5 seconds, so the -time overhead appears to be about 10.5 s / 564 s ≈ 1.9%.
Code:
2019-09-22 11:27:32 226000127 P2 1628/2880: setup 4280 ms; 11400 us/prime, 51335 primes
2019-09-22 11:27:32 36.80% tailFusedMulDelta :   4118 us/call x 51335 calls
2019-09-22 11:27:32 33.56% carryFused     :   3547 us/call x 54355 calls
2019-09-22 11:27:32  7.10% fftMiddleIn    :    750 us/call x 54355 calls
2019-09-22 11:27:32  7.05% fftMiddleOut   :    745 us/call x 54355 calls
2019-09-22 11:27:32  6.63% transposeW     :    701 us/call x 54355 calls
2019-09-22 11:27:32  6.56% transposeH     :    693 us/call x 54355 calls
2019-09-22 11:27:32  1.58% fftH           :   1507 us/call x  6040 calls
2019-09-22 11:27:32  0.72% multiply       :   1371 us/call x  3020 calls
2019-09-22 11:27:32 Total time 574.506 s
With -yield, the GPU again quickly goes idle.
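As a sanity check on the -time numbers, here is a small Python sketch (my own, not part of gpuowl) that parses the profile lines above, recomputes each kernel's share as us/call × calls ÷ total time, and derives the -time overhead from the two round times (9 min 24 s = 564 s without -time, 574.506 s with it). The regex assumes the exact line layout shown in this log, with timestamps stripped; other gpuowl versions may format the profile differently.

```python
import re

# Profile lines copied from the -time output above (timestamps stripped).
PROFILE = """\
36.80% tailFusedMulDelta :   4118 us/call x 51335 calls
33.56% carryFused     :   3547 us/call x 54355 calls
 7.10% fftMiddleIn    :    750 us/call x 54355 calls
 7.05% fftMiddleOut   :    745 us/call x 54355 calls
 6.63% transposeW     :    701 us/call x 54355 calls
 6.56% transposeH     :    693 us/call x 54355 calls
 1.58% fftH           :   1507 us/call x  6040 calls
 0.72% multiply       :   1371 us/call x  3020 calls"""

TOTAL_S = 574.506           # "Total time" line, seconds
ROUND_WITHOUT_TIME_S = 564  # 9 min 24 s, same kind of round measured without -time

# percent, kernel name, us/call, call count
LINE = re.compile(r"\s*([\d.]+)%\s+(\w+)\s*:\s*(\d+) us/call x\s+(\d+) calls")


def kernel_seconds(text):
    """Map kernel name -> total seconds (us/call * calls)."""
    out = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            _, name, us_per_call, calls = m.groups()
            out[name] = int(us_per_call) * int(calls) / 1e6
    return out


secs = kernel_seconds(PROFILE)
shares = {name: 100 * s / TOTAL_S for name, s in secs.items()}
overhead_pct = 100 * (TOTAL_S - ROUND_WITHOUT_TIME_S) / ROUND_WITHOUT_TIME_S

for name, pct in shares.items():
    print(f"{name:18s} {pct:6.2f}%")
print(f"sum of kernels: {sum(secs.values()):.2f} s of {TOTAL_S} s")
print(f"-time overhead: {overhead_pct:.2f}%")
```

The recomputed shares agree with the printed percentages to within rounding, the per-kernel totals add up to within about 0.04 s of the reported 574.506 s, and the overhead comes out at roughly 1.86%.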

2019-09-22, 16:42   #1400
xx005fs
"Eric"
Jan 2018
USA
22·53 Posts

Quote (originally posted by preda):
I increased the sleep time on yield to attempt to reduce CPU usage more. Could you try again please? (with the newest revision)
Looks like there's no change in throughput or CPU load when running PRP: still around 87% of one core in use.

2019-09-22, 18:57   #1401
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts

Separate system (dual Xeon E5-2670, Win7 X64 Pro, NVIDIA GTX1080, gpuowl-win v6.11-6-g02fd645, M228M P-1): similar behavior.

2019-09-22, 20:42   #1402
preda
"Mihai Preda"
Apr 2015
3×457 Posts

Quote (originally posted by kriesel):
Separate system (dual Xeon E5-2670, Win7 X64 Pro, NVIDIA GTX1080, gpuowl-win v6.11-6-g02fd645, M228M P-1): similar behavior.
I made one more change (added a queue flush before waiting in yield); please let me know whether this fixes it.
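For readers wondering why a missing flush could leave the GPU completely idle: in OpenCL, clFlush hands queued commands to the device, while merely polling an event's status (clGetEventInfo with CL_EVENT_COMMAND_EXECUTION_STATUS) is not guaranteed to submit anything. A yield-style wait that polls and sleeps, instead of blocking in clFinish, plausibly needs an explicit flush first. The Python toy model below illustrates that idea under an assumed driver that holds commands on the host until flushed; ToyQueue and yield_wait are hypothetical names, and this is not how gpuowl is actually written.

```python
import time


class ToyQueue:
    """Hypothetical stand-in for a command queue whose driver keeps enqueued
    commands on the host until they are flushed (the role of clFlush)."""

    def __init__(self, seconds_per_command):
        self.seconds_per_command = seconds_per_command
        self.pending = 0                   # enqueued, not yet handed to the "GPU"
        self.finish_time = float("-inf")   # when submitted work completes

    def enqueue(self, n=1):
        self.pending += n

    def flush(self):
        # Hand all pending commands to the device; they run back to back.
        if self.pending:
            start = max(time.monotonic(), self.finish_time)
            self.finish_time = start + self.pending * self.seconds_per_command
            self.pending = 0

    def done(self):
        # Like polling an event's status: never blocks, never submits work.
        return self.pending == 0 and time.monotonic() >= self.finish_time


def yield_wait(queue, timeout_s, poll_s=0.001, flush_first=True):
    """Wait for the queue to drain by checking, then sleeping, rather than
    spinning. Returns True if the work finished, False on timeout."""
    if flush_first:
        queue.flush()                  # make sure the device actually has the work
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if queue.done():
            return True
        time.sleep(poll_s)             # give the CPU core back to other programs
    return queue.done()


q = ToyQueue(seconds_per_command=0.002)
q.enqueue(5)
print(yield_wait(q, timeout_s=0.05, flush_first=False))  # False: nothing was submitted
print(yield_wait(q, timeout_s=1.0))                      # True: flushed, then finished
```

Without the flush the toy wait burns no CPU but also never sees the work finish, which resembles the "zero CPU, zero GPU load" symptom reported above.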

2019-09-23, 00:07   #1403
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts
gpuowl-win v6.11-9-g9ae3189

Quote (originally posted by preda):
I made one more change (added a queue flush before waiting in yield); please let me know whether this fixes it.
Much better. It runs the GPU hard, and after the initial startup (which takes several CPU core-seconds) it uses about one more CPU core-second per GPU minute on the dual Xeon E5-2670 system.
Code:
C:\Users\ken\Documents\v6.11-9-g9ae3189>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield
2019-09-22 17:42:39 gpuowl v6.11-9-g9ae3189
2019-09-22 17:42:39 Note: no config.txt file found
2019-09-22 17:42:39 config: -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield
2019-09-22 17:42:39 228000037 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.53 bits/word
2019-09-22 17:42:40 OpenCL args "-DEXP=228000037u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xb.12354e6de8db8p-3 -DIWEIGHT_STEP=0xb.8fc56ff3fadcp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-22 17:42:40

2019-09-22 17:42:40 OpenCL compilation in 22 ms
2019-09-22 17:42:44 228000037 P1 B1=1840000, B2=42320000; 2654010 bits; starting at 1083301
2019-09-22 17:44:16 228000037 P1  1090000  41.07%; 13745 us/sq; ETA 0d 05:58; 646ebd24b9141139
2019-09-22 17:46:33 228000037 P1  1100000  41.45%; 13754 us/sq; ETA 0d 05:56; 5b076380f84fa1f8
2019-09-22 17:48:52 228000037 P1  1110000  41.82%; 13821 us/sq; ETA 0d 05:56; 49cac9f30cafb667
2019-09-22 17:51:09 228000037 P1  1120000  42.20%; 13768 us/sq; ETA 0d 05:52; 49039a105d434d61
2019-09-22 17:53:28 228000037 P1  1130000  42.58%; 13831 us/sq; ETA 0d 05:51; aed916597692a26e
2019-09-22 17:55:45 228000037 P1  1140000  42.95%; 13763 us/sq; ETA 0d 05:47; 0a39a801f50514e8
2019-09-22 17:58:04 228000037 P1  1150000  43.33%; 13877 us/sq; ETA 0d 05:48; a69b4685a5d5e8ed
2019-09-22 18:00:22 228000037 P1  1160000  43.71%; 13764 us/sq; ETA 0d 05:43; 8ba2709ae1589129
2019-09-22 18:02:39 228000037 P1  1170000  44.08%; 13760 us/sq; ETA 0d 05:40; f69bffc29181eec2
2019-09-22 18:04:58 228000037 P1  1180000  44.46%; 13826 us/sq; ETA 0d 05:40; e55aa4dce17619d2
2019-09-22 18:07:15 228000037 P1  1190000  44.84%; 13767 us/sq; ETA 0d 05:36; bd8a0062f3e8109b
2019-09-22 18:09:33 228000037 P1  1200000  45.21%; 13823 us/sq; ETA 0d 05:35; 15f4486494abaf74
2019-09-22 18:11:51 228000037 P1  1210000  45.59%; 13767 us/sq; ETA 0d 05:31; a652297a1008f956
2019-09-22 18:14:10 228000037 P1  1220000  45.97%; 13842 us/sq; ETA 0d 05:31; 78094c385b32ceac
2019-09-22 18:14:16 Stopping, please wait..
2019-09-22 18:14:17 Exiting because "stop requested"
2019-09-22 18:14:17 Bye
Terminate batch job (Y/N)? n

C:\Users\ken\Documents\v6.11-9-g9ae3189>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield -time
2019-09-22 18:14:40 gpuowl v6.11-9-g9ae3189
2019-09-22 18:14:40 Note: no config.txt file found
2019-09-22 18:14:40 config: -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield -time
2019-09-22 18:14:40 228000037 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.53 bits/word
2019-09-22 18:14:40 OpenCL args "-DEXP=228000037u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xb.12354e6de8db8p-3 -DIWEIGHT_STEP=0xb.8fc56ff3fadcp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-22 18:14:40

2019-09-22 18:14:40 OpenCL compilation in 25 ms
2019-09-22 18:14:45 228000037 P1 B1=1840000, B2=42320000; 2654010 bits; starting at 1220501
2019-09-22 18:16:57 228000037 P1  1230000  46.34%; 13941 us/sq; ETA 0d 05:31; d10c1a457f57634c
2019-09-22 18:16:57 36.96% tailFused      :   5058 us/call x  9499 calls
2019-09-22 18:16:57 17.03% carryFused     :   4762 us/call x  4650 calls
2019-09-22 18:16:57 16.21% carryFusedMul  :   4347 us/call x  4848 calls
2019-09-22 18:16:57  7.52% transposeW     :   1029 us/call x  9499 calls
2019-09-22 18:16:57  7.47% transposeH     :   1022 us/call x  9499 calls
2019-09-22 18:16:57  7.41% fftMiddleIn    :   1014 us/call x  9499 calls
2019-09-22 18:16:57  7.39% fftMiddleOut   :   1011 us/call x  9499 calls
2019-09-22 18:16:57 Total time 129.985 s
Similar results on the 226M P-1 run on a GTX1080Ti on another system.
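The percentages and ETAs in the P-1 stage 1 lines above can be reproduced from the 2654010-bit stage 1 exponent, assuming one squaring per bit and ETA = remaining bits × current us/sq, rounded to the nearest minute. That rule is inferred from the logged numbers, not taken from gpuowl's source, and p1_progress is a hypothetical helper.

```python
def p1_progress(done_bits, total_bits, us_per_sq):
    """Return (percent done, ETA as 'Nd hh:mm') for a P-1 stage 1 log line,
    assuming one squaring per remaining bit at the current us/sq."""
    pct = 100 * done_bits / total_bits
    remaining_s = (total_bits - done_bits) * us_per_sq / 1e6
    minutes = round(remaining_s / 60)
    days, minutes = divmod(minutes, 24 * 60)
    hours, minutes = divmod(minutes, 60)
    return f"{pct:.2f}%", f"{days}d {hours:02d}:{minutes:02d}"


# Values from the 17:44:16 line: 1090000 of 2654010 bits at 13745 us/sq.
print(p1_progress(1090000, 2654010, 13745))
```

This matches the logged "41.07%; ... ETA 0d 05:58" for that line. By similar arithmetic, about one CPU core-second per GPU minute is roughly 1/60 ≈ 1.7% of one core.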
Attached file: v6.11-9-g9ae3189.7z (418.2 KB)

Last fiddled with by kriesel on 2019-09-23 at 00:12

2019-09-23, 01:20   #1404
xx005fs
"Eric"
Jan 2018
USA
22·53 Posts

On Windows, the yield option works perfectly for PRP, dropping my CPU usage from about 5.5% of 16 threads to almost nothing. Speed drops slightly, from around 860 us/it to 880 us/it, which is small enough that the CPU time freed up compensates for it. Thanks, Preda, for addressing this bug (the blame surely lies with Nvidia).
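To put rough numbers on that trade-off, here is a tiny Python sketch, assuming "almost nothing" means essentially zero CPU after the change; yield_tradeoff is a hypothetical helper for illustration.

```python
def yield_tradeoff(us_before, us_after, cpu_pct_of_all, threads):
    """Relative GPU slowdown (%) and CPU freed, in hardware threads."""
    slowdown_pct = 100 * (us_after - us_before) / us_before
    threads_freed = cpu_pct_of_all / 100 * threads
    return slowdown_pct, threads_freed


slow, freed = yield_tradeoff(860, 880, 5.5, 16)
print(f"GPU slowdown {slow:.1f}%, CPU freed ~{freed:.2f} threads")
```

That is about a 2.3% GPU slowdown in exchange for roughly 0.88 of a hardware thread, consistent with the earlier report of around 87% of one core in use without the fix.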

Last fiddled with by xx005fs on 2019-09-23 at 01:21

2019-09-23, 14:34   #1405
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1010100111101₂ Posts

PRP on a GTX1080Ti with gpuowl v6.11-9 and -yield seems to be within 2% of the GPU throughput of v6.7-4 (which saturates a CPU core). The observed prime95 throughput penalty from v6.7's CPU use was about 0.5% (2% of one of the 4 workers), thanks to hyperthreading somewhat mitigating the impact. These figures are very approximate. A more accurate check would run about an hour in each condition, after discarding the first 10 minutes or so to allow thermal stabilization.
Code:
2019-09-23 09:10:52 gpuowl v6.7-4-g278407a
2019-09-23 09:10:53 Note: no config.txt file found
2019-09-23 09:10:53 config: -device 0 -use ORIG_X2 -maxAlloc 10240 -user kriesel -cpu dodo-gtx1080ti
2019-09-23 09:10:53 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word
2019-09-23 09:10:53 using short carry kernels
2019-09-23 09:10:53 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-23 09:10:53

2019-09-23 09:10:53 OpenCL compilation in 97 ms
2019-09-23 09:10:55 87005279.owl loaded: k 25172000, block 500, res64 2736c9728212e62e
2019-09-23 09:11:03 87005279 OK 25173000 28.93%; 3406 us/sq; ETA 2d 10:30; 8f25ad724e654078 (check 2.09s)
2019-09-23 09:12:36 87005279    25200000 28.96%; 3448 us/sq; ETA 2d 11:12; 7670ca7fa4cba9de
2019-09-23 09:15:32 87005279 OK 25250000 29.02%; 3472 us/sq; ETA 2d 11:34; 1d799dd231b858fc (check 2.11s)
2019-09-23 09:18:27 87005279    25300000 29.08%; 3513 us/sq; ETA 2d 12:12; 2ec8f55bc1a420aa
2019-09-23 09:21:07 Stopping, please wait..
2019-09-23 09:21:09 87005279 OK 25345500 29.13%; 3515 us/sq; ETA 2d 12:12; b879e7272e09c388 (check 2.12s)
2019-09-23 09:21:10 Exiting because "stop requested"
2019-09-23 09:21:10 Bye
Code:
2019-09-23 09:23:09 gpuowl v6.11-9-g9ae3189
2019-09-23 09:23:09 Note: no config.txt file found
2019-09-23 09:23:09 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -yield
2019-09-23 09:23:09 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word
2019-09-23 09:23:10 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-23 09:23:10

2019-09-23 09:23:10 OpenCL compilation in 25 ms
2019-09-23 09:23:19 87005279 OK 25346500  29.13%; 3487 us/sq; ETA 2d 11:43; 2cdfabbcb0e97413 (check 2.15s)
2019-09-23 09:23:32 87005279    25350000  29.14%; 3501 us/sq; ETA 2d 11:58; 5921518eec88bf66
2019-09-23 09:26:28 87005279    25400000  29.19%; 3532 us/sq; ETA 2d 12:26; d6307af21b7c7f77
2019-09-23 09:29:26 87005279    25450000  29.25%; 3555 us/sq; ETA 2d 12:47; f9570edb50396289
2019-09-23 09:32:26 87005279 OK 25500000  29.31%; 3559 us/sq; ETA 2d 12:48; 076dfe1049b7bc9e (check 2.12s)
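A rough check of the "within 2%" figure, using only the us/sq values printed in the two logs above (my own arithmetic). The runs are short and both are still drifting upward, so as noted this is very approximate.

```python
from statistics import mean

# us/sq values from the two PRP logs above (same exponent, same GPU).
v67_us = [3406, 3448, 3472, 3513, 3515]    # v6.7-4, no -yield
v611_us = [3487, 3501, 3532, 3555, 3559]   # v6.11-9 with -yield


def pct_slower(base, other):
    """How much longer, in percent, the mean time per squaring is in `other`."""
    return 100 * (mean(other) - mean(base)) / mean(base)


gpu_penalty = pct_slower(v67_us, v611_us)
prime95_penalty = 2 / 4  # 2% on one of 4 equal workers, as % of the total

print(f"v6.11 -yield vs v6.7: {gpu_penalty:+.1f}% time per squaring")
print(f"prime95 penalty from v6.7 CPU use: {prime95_penalty:.1f}%")
```

This gives about +1.6% time per squaring for v6.11 with -yield, in line with "within 2%".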

Last fiddled with by kriesel on 2019-09-23 at 14:43

2019-09-30, 10:38   #1406
paulunderwood
Sep 2002
Database er0rr
EB2₁₆ Posts

Have there been any developments on getting gpuOwl to crunch Wagstaff numbers?

2019-10-01, 10:55   #1407
preda
"Mihai Preda"
Apr 2015
3×457 Posts

Quote (originally posted by paulunderwood):
Have there been any developments on getting gpuOwl to crunch Wagstaff numbers?
No progress, at least not from me, sorry... (limited time available, and I would like to get the "PRP proof" (VDF) to a proof of concept first)

2019-10-01, 16:40   #1408
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
12475₈ Posts
Feature wish list update attempt

Quote (originally posted by preda):
No progress, at least not from me, sorry... (limited time available, and I would like to get the "PRP proof" (VDF) to a proof of concept first)
Items 2 and 4 from https://www.mersenneforum.org/showpo...postcount=1331 also remain unimplemented wish list items.
I think those would be straightforward to implement. (The numbering below is arbitrary.)
  1. I think SELROC would appreciate the automation of gputo72 fitted bounds for P-1, as would I. Manually looking up or computing bounds and entering them for each P-1 entry is a bit cumbersome. See https://www.mersenneforum.org/showpo...7&postcount=23
  2. Converting a problematic worktodo entry from active to a skipped comment, then continuing with any following active entries, would maintain full throughput in many cases. Terminating when there's an issue with the current worktodo entry reduces throughput, whether the entry is for a PRP run, a P-1 run, or a future Wagstaff run.
  3. Proof of computing the PRP via VDF is intriguing. It has a separate thread at https://www.mersenneforum.org/showthread.php?t=24654
  4. Robert Gerbicz described a method for verifying TF work. Links to that, and to discussion of possibly adapting the method to P-1, are included in a post on the P-1 error rate: https://www.mersenneforum.org/showpo...37&postcount=3.
  5. Wagstaff computation seems to me a significant development effort, based on reading the comments of Woltman and Mayer on how to proceed.
  6. P-1 has little in the way of error checking. Part of that is inherent in the computation: the Gerbicz check does not apply. The Jacobi check could be applied to some parts of it, while in large parts it is quite unproductive. See https://www.mersenneforum.org/showthread.php?p=490415 and https://www.mersenneforum.org/showthread.php?t=23470
  7. There appear to be some small opportunities for increased efficiency in P-1. See https://www.mersenneforum.org/showpo...3&postcount=11
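To illustrate item 6, here is a Python sketch of how a Jacobi check could apply to P-1 stage 1. It is my own illustration, not code from gpuowl or prime95. For M = 2^p - 1 with p odd, M ≡ 3 (mod 4) and M ≡ 1 (mod 3), so quadratic reciprocity gives (3/M) = -1, and a correct stage 1 residue x = 3^E mod M must have (x/M) = (-1)^E. The construction E = 2p · ∏ q^k ≤ B1 is an assumed, common choice; the helper names are hypothetical. As far as I can tell, stage 2's accumulated products have no such easily predicted symbol, and an error that happens to preserve the symbol (roughly half of random errors) goes undetected.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, via quadratic reciprocity."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    a %= n
    result = 1
    while a:
        while a % 2 == 0:            # (2/n) = -1 iff n = 3 or 5 mod 8
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a                  # reciprocity: flip iff both are 3 mod 4
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0


def small_primes(limit):
    """Primes <= limit by a simple sieve."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i in range(limit + 1) if sieve[i]]


def stage1_exponent(p, b1):
    """Assumed stage 1 exponent: 2p times the largest power of each prime <= B1."""
    e = 2 * p
    for q in small_primes(b1):
        qk = q
        while qk * q <= b1:
            qk *= q
        e *= qk
    return e


def jacobi_ok(x, e, m):
    """Check that (x/M) = (-1)^E, as it must be if x = 3^E mod M."""
    return jacobi(x, m) == (-1) ** (e % 2)


p, b1 = 31, 1000
m = 2 ** p - 1
e = stage1_exponent(p, b1)
x = pow(3, e, m)                # what a correct stage 1 would produce
print(jacobi_ok(x, e, m))       # True
bad = (x * 3) % m               # simulated error that flips the symbol
print(jacobi_ok(bad, e, m))     # False
```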
Thanks for all your efforts. I'm happy to test nearly whatever you add next, within my available OS and gpu limits.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.