[QUOTE=preda;526270]Yes that seems pretty broken. I'm not sure why yet; I did push a new commit -- could you try it and tell me how it works? (pls check both with/without -time)
There's no need to wait 10 minutes -- if it doesn't do the usual progress, or does not react to Ctrl-C, it's broken.[/QUOTE]
No problem, I was going through paper mail while it ran. Retried the previous commit with -yield but without -time; similar behavior: eight minutes, zero iterations. It did respond to CTRL-C though. Will try to build and run the latest commit after breakfast. |
Win7 X64 Pro, NVIDIA GTX1080Ti, gpuowl-win v6.11-6-g02fd645, M226m P-1 stage 2 continuation,
No -time: without -yield, it operates normally on the gpu but fully occupies a cpu core (in this case a hyperthread on one of the Xeon E5520 packages); a round took 9 minutes 24 seconds. With -yield, zero cpu after 12 core-seconds of initialization, but also zero gpu load per GPU-Z, so probably zero progress.
With -time: without -yield, it operates normally on the gpu but fully occupies a cpu core (in this case a hyperthread on one of the Xeon E5520 packages); a round took 9 minutes 34.5 seconds, so -time overhead appears to be ~10 seconds / 564 seconds =~ 1.8%.
[CODE]
2019-09-22 11:27:32 226000127 P2 1628/2880: setup 4280 ms; 11400 us/prime, 51335 primes
2019-09-22 11:27:32 36.80% tailFusedMulDelta : 4118 us/call x 51335 calls
2019-09-22 11:27:32 33.56% carryFused : 3547 us/call x 54355 calls
2019-09-22 11:27:32 7.10% fftMiddleIn : 750 us/call x 54355 calls
2019-09-22 11:27:32 7.05% fftMiddleOut : 745 us/call x 54355 calls
2019-09-22 11:27:32 6.63% transposeW : 701 us/call x 54355 calls
2019-09-22 11:27:32 6.56% transposeH : 693 us/call x 54355 calls
2019-09-22 11:27:32 1.58% fftH : 1507 us/call x 6040 calls
2019-09-22 11:27:32 0.72% multiply : 1371 us/call x 3020 calls
2019-09-22 11:27:32 Total time 574.506 s
[/CODE]
With -yield, the gpu again quickly goes idle. |
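The -time overhead estimate in the post above can be reproduced from the two round durations it quotes. This is only a small arithmetic sketch in Python; the function name is illustrative and not part of gpuowl.

```python
def overhead_percent(baseline_s: float, measured_s: float) -> float:
    """Relative slowdown of measured_s versus baseline_s, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100.0


# Round times quoted in the post: 9 min 24 s without -time,
# 9 min 34.5 s with -time (both without -yield).
baseline = 9 * 60 + 24       # 564 s
with_time = 9 * 60 + 34.5    # 574.5 s

print(f"-time overhead: {overhead_percent(baseline, with_time):.2f}%")
```

A single round per condition is a small sample, so the result is best read as a rough figure.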
[QUOTE=preda;526271]I increased the sleep time on yield to attempt to reduce CPU usage more. Could you try again please? (with the newest revision)[/QUOTE]
Looks like there's no change in throughput or CPU load when running PRP. Still around 87% used on 1 core. |
Separate system, dual xeon e5-2670, Win7 X64 Pro, NVIDIA GTX1080, gpuowl-win v6.11-6-g02fd645, M228m P-1, similar behavior.
|
[QUOTE=kriesel;526305]Separate system, dual xeon e5-2670, Win7 X64 Pro, NVIDIA GTX1080, gpuowl-win v6.11-6-g02fd645, M228m P-1, similar behavior.[/QUOTE]
I made one more change (added a queue flush before waiting in yield); please let me know whether this fixes it. |
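One plausible reading of why a flush matters here: if queued commands are never submitted to the device, a sleep-and-poll wait can stall with the gpu idle, which would match the zero gpu load reported earlier. The toy model below is a sketch under that assumption. `FakeQueue` and `wait_with_yield` are hypothetical names, this is not the OpenCL API, and it is not gpuowl's actual code.

```python
import time


class FakeQueue:
    """Toy stand-in for a GPU command queue (hypothetical, not OpenCL)."""

    def __init__(self):
        self.pending = []    # enqueued, but not yet submitted to the device
        self.submitted = []  # handed to the device, waiting to execute

    def enqueue(self, name):
        self.pending.append(name)

    def flush(self):
        """Submit everything that has been enqueued so far."""
        self.submitted.extend(self.pending)
        self.pending.clear()

    def device_step(self):
        """Simulate the device finishing one submitted command."""
        if self.submitted:
            self.submitted.pop(0)

    def idle(self):
        return not self.pending and not self.submitted


def wait_with_yield(queue, flush_first, sleep_s=0.0, max_polls=100):
    """Sleep-and-poll until the queue drains.

    Returns the number of polls needed, or None if it never drained.
    """
    if flush_first:
        queue.flush()
    for polls in range(1, max_polls + 1):
        queue.device_step()
        if queue.idle():
            return polls
        time.sleep(sleep_s)  # yield the CPU instead of spinning
    return None


for flush_first in (False, True):
    q = FakeQueue()
    for kernel in ("carryFused", "fftMiddleIn", "tailFused"):
        q.enqueue(kernel)
    print(flush_first, wait_with_yield(q, flush_first, max_polls=10))
```

In this model the wait without a flush never completes, because the device never sees the work, while the flushed wait drains in one poll per command.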
gpuowl-win v6.11-9-g9ae3189
[QUOTE=preda;526314]I made one more change (added a queue flush before waiting in yield) please let me know whether this fixes it.[/QUOTE]
Much better. Runs the gpu hard, and after the initial startup takes several cpu core seconds, there's about one more cpu core second used per gpu minute, on the dual Xeon E5-2670 system.
[CODE]
C:\Users\ken\Documents\v6.11-9-g9ae3189>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield
2019-09-22 17:42:39 gpuowl v6.11-9-g9ae3189
2019-09-22 17:42:39 Note: no config.txt file found
2019-09-22 17:42:39 config: -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield
2019-09-22 17:42:39 228000037 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.53 bits/word
2019-09-22 17:42:40 OpenCL args "-DEXP=228000037u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xb.12354e6de8db8p-3 -DIWEIGHT_STEP=0xb.8fc56ff3fadcp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-22 17:42:40
2019-09-22 17:42:40 OpenCL compilation in 22 ms
2019-09-22 17:42:44 228000037 P1 B1=1840000, B2=42320000; 2654010 bits; starting at 1083301
2019-09-22 17:44:16 228000037 P1 1090000 41.07%; 13745 us/sq; ETA 0d 05:58; 646ebd24b9141139
2019-09-22 17:46:33 228000037 P1 1100000 41.45%; 13754 us/sq; ETA 0d 05:56; 5b076380f84fa1f8
2019-09-22 17:48:52 228000037 P1 1110000 41.82%; 13821 us/sq; ETA 0d 05:56; 49cac9f30cafb667
2019-09-22 17:51:09 228000037 P1 1120000 42.20%; 13768 us/sq; ETA 0d 05:52; 49039a105d434d61
2019-09-22 17:53:28 228000037 P1 1130000 42.58%; 13831 us/sq; ETA 0d 05:51; aed916597692a26e
2019-09-22 17:55:45 228000037 P1 1140000 42.95%; 13763 us/sq; ETA 0d 05:47; 0a39a801f50514e8
2019-09-22 17:58:04 228000037 P1 1150000 43.33%; 13877 us/sq; ETA 0d 05:48; a69b4685a5d5e8ed
2019-09-22 18:00:22 228000037 P1 1160000 43.71%; 13764 us/sq; ETA 0d 05:43; 8ba2709ae1589129
2019-09-22 18:02:39 228000037 P1 1170000 44.08%; 13760 us/sq; ETA 0d 05:40; f69bffc29181eec2
2019-09-22 18:04:58 228000037 P1 1180000 44.46%; 13826 us/sq; ETA 0d 05:40; e55aa4dce17619d2
2019-09-22 18:07:15 228000037 P1 1190000 44.84%; 13767 us/sq; ETA 0d 05:36; bd8a0062f3e8109b
2019-09-22 18:09:33 228000037 P1 1200000 45.21%; 13823 us/sq; ETA 0d 05:35; 15f4486494abaf74
2019-09-22 18:11:51 228000037 P1 1210000 45.59%; 13767 us/sq; ETA 0d 05:31; a652297a1008f956
2019-09-22 18:14:10 228000037 P1 1220000 45.97%; 13842 us/sq; ETA 0d 05:31; 78094c385b32ceac
2019-09-22 18:14:16 Stopping, please wait..
2019-09-22 18:14:17 Exiting because "stop requested"
2019-09-22 18:14:17 Bye
Terminate batch job (Y/N)? n

C:\Users\ken\Documents\v6.11-9-g9ae3189>gpuowl-win -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield -time
2019-09-22 18:14:40 gpuowl v6.11-9-g9ae3189
2019-09-22 18:14:40 Note: no config.txt file found
2019-09-22 18:14:40 config: -device 0 -use ORIG_X2 -user kriesel -cpu emu/gtx1080 -maxAlloc 8000 -yield -time
2019-09-22 18:14:40 228000037 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.53 bits/word
2019-09-22 18:14:40 OpenCL args "-DEXP=228000037u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xb.12354e6de8db8p-3 -DIWEIGHT_STEP=0xb.8fc56ff3fadcp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-22 18:14:40
2019-09-22 18:14:40 OpenCL compilation in 25 ms
2019-09-22 18:14:45 228000037 P1 B1=1840000, B2=42320000; 2654010 bits; starting at 1220501
2019-09-22 18:16:57 228000037 P1 1230000 46.34%; 13941 us/sq; ETA 0d 05:31; d10c1a457f57634c
2019-09-22 18:16:57 36.96% tailFused : 5058 us/call x 9499 calls
2019-09-22 18:16:57 17.03% carryFused : 4762 us/call x 4650 calls
2019-09-22 18:16:57 16.21% carryFusedMul : 4347 us/call x 4848 calls
2019-09-22 18:16:57 7.52% transposeW : 1029 us/call x 9499 calls
2019-09-22 18:16:57 7.47% transposeH : 1022 us/call x 9499 calls
2019-09-22 18:16:57 7.41% fftMiddleIn : 1014 us/call x 9499 calls
2019-09-22 18:16:57 7.39% fftMiddleOut : 1011 us/call x 9499 calls
2019-09-22 18:16:57 Total time 129.985 s
[/CODE]
Similar results on the 226M P-1 run on a GTX1080Ti on another system. |
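The -time kernel summary in the log above can be cross-checked by multiplying each kernel's us/call by its call count and comparing the sum with the reported total. A minimal Python sketch follows; the regular expression is fitted to the lines quoted in this post only and is an assumption about the format, not a documented gpuowl interface.

```python
import re

# Kernel summary copied from the -yield -time run above (timestamps dropped).
PROFILE = """\
36.96% tailFused : 5058 us/call x 9499 calls
17.03% carryFused : 4762 us/call x 4650 calls
16.21% carryFusedMul : 4347 us/call x 4848 calls
7.52% transposeW : 1029 us/call x 9499 calls
7.47% transposeH : 1022 us/call x 9499 calls
7.41% fftMiddleIn : 1014 us/call x 9499 calls
7.39% fftMiddleOut : 1011 us/call x 9499 calls
"""

LINE_RE = re.compile(r"(\d+\.\d+)% (\w+) : (\d+) us/call x (\d+) calls")


def kernel_seconds(text):
    """Return {kernel name: total seconds}, computed as us/call * calls."""
    return {
        name: int(us) * int(calls) / 1e6
        for _pct, name, us, calls in LINE_RE.findall(text)
    }


totals = kernel_seconds(PROFILE)
total_s = sum(totals.values())
for name, secs in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {secs:8.3f} s  {100 * secs / total_s:5.2f}%")
print(f"sum of kernels: {total_s:.3f} s (log reports 129.985 s)")
```

Because the pattern is applied with `findall`, it also works on a log that has lost its line breaks.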
On Windows, the yield option works perfectly for PRP, dropping my CPU usage from about 5.5% of 16 threads down to almost nothing. The speed does drop from around 860us/it to 880us/it, but that is insignificant, and my CPU should work more efficiently, which compensates for it. Thanks Preda for addressing this bug (the blame surely lies with Nvidia).
|
PRP on GTX1080Ti on gpuowl V6.11-9 with -yield seems to be within 2% of gpu throughput of v6.7-4 (which saturates a cpu core). Observed prime95 throughput penalty with v6.7's cpu use was about 0.5% (2% of one of the 4 workers), thanks to hyperthreading mitigating the impact somewhat. These figures are very approximate. A more accurate check would use about an hour in each condition after ignoring the initial startup of 10 minutes or so for thermal stabilization.
[CODE]
2019-09-23 09:10:52 gpuowl v6.7-4-g278407a
2019-09-23 09:10:53 Note: no config.txt file found
2019-09-23 09:10:53 config: -device 0 -use ORIG_X2 -maxAlloc 10240 -user kriesel -cpu dodo-gtx1080ti
2019-09-23 09:10:53 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word
2019-09-23 09:10:53 using short carry kernels
2019-09-23 09:10:53 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-23 09:10:53
2019-09-23 09:10:53 OpenCL compilation in 97 ms
2019-09-23 09:10:55 87005279.owl loaded: k 25172000, block 500, res64 2736c9728212e62e
2019-09-23 09:11:03 87005279 OK 25173000 28.93%; 3406 us/sq; ETA 2d 10:30; 8f25ad724e654078 (check 2.09s)
2019-09-23 09:12:36 87005279 25200000 28.96%; 3448 us/sq; ETA 2d 11:12; 7670ca7fa4cba9de
2019-09-23 09:15:32 87005279 OK 25250000 29.02%; 3472 us/sq; ETA 2d 11:34; 1d799dd231b858fc (check 2.11s)
2019-09-23 09:18:27 87005279 25300000 29.08%; 3513 us/sq; ETA 2d 12:12; 2ec8f55bc1a420aa
2019-09-23 09:21:07 Stopping, please wait..
2019-09-23 09:21:09 87005279 OK 25345500 29.13%; 3515 us/sq; ETA 2d 12:12; b879e7272e09c388 (check 2.12s)
2019-09-23 09:21:10 Exiting because "stop requested"
2019-09-23 09:21:10 Bye
[/CODE]
[CODE]
2019-09-23 09:23:09 gpuowl v6.11-9-g9ae3189
2019-09-23 09:23:09 Note: no config.txt file found
2019-09-23 09:23:09 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -yield
2019-09-23 09:23:09 87005279 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.59 bits/word
2019-09-23 09:23:10 OpenCL args "-DEXP=87005279u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xa.97d8cd06772f8p-3 -DIWEIGHT_STEP=0xc.1551b6b1158dp-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-23 09:23:10
2019-09-23 09:23:10 OpenCL compilation in 25 ms
2019-09-23 09:23:19 87005279 OK 25346500 29.13%; 3487 us/sq; ETA 2d 11:43; 2cdfabbcb0e97413 (check 2.15s)
2019-09-23 09:23:32 87005279 25350000 29.14%; 3501 us/sq; ETA 2d 11:58; 5921518eec88bf66
2019-09-23 09:26:28 87005279 25400000 29.19%; 3532 us/sq; ETA 2d 12:26; d6307af21b7c7f77
2019-09-23 09:29:26 87005279 25450000 29.25%; 3555 us/sq; ETA 2d 12:47; f9570edb50396289
2019-09-23 09:32:26 87005279 OK 25500000 29.31%; 3559 us/sq; ETA 2d 12:48; 076dfe1049b7bc9e (check 2.12s)
[/CODE] |
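The "within 2%" comparison can be checked roughly against the us/sq figures in the two logs above. The Python sketch below simply averages the five reported values from each run; as the post notes, these short samples include warm-up, so the result is indicative only. The variable and function names are illustrative.

```python
from statistics import mean

# us/sq values copied from the two logs above, in order of appearance.
v6_7_4 = [3406, 3448, 3472, 3513, 3515]          # v6.7-4, no -yield option
v6_11_9_yield = [3487, 3501, 3532, 3555, 3559]   # v6.11-9 with -yield


def slowdown_percent(old_us, new_us):
    """Percent increase in mean time per squaring from old_us to new_us."""
    return (mean(new_us) / mean(old_us) - 1.0) * 100.0


print(f"v6.7-4 mean:         {mean(v6_7_4):.1f} us/sq")
print(f"v6.11-9 -yield mean: {mean(v6_11_9_yield):.1f} us/sq")
print(f"slowdown:            {slowdown_percent(v6_7_4, v6_11_9_yield):.2f}%")
```

Both runs show us/sq creeping upward over time, which is consistent with the suggestion to discard the first ten minutes or so before comparing.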
Have there been any developments on getting gpuOwl to crunch Wagstaff numbers?
|
[QUOTE=paulunderwood;526970]Have there been any developments on getting gpuOwl to crunch Wagstaff numbers?[/QUOTE]
No progress, at least not from me, sorry... (limited time available, and I would like to do the "PRP proof" (VDF) to a proof of concept first) |
Feature wish list update attempt
[QUOTE=preda;527060]No progress, at least not from me, sorry... (limited time available, and I would like to do the "PRP proof" (VDF) to a proof of concept first)[/QUOTE]Items 2 and 4 from [URL]https://www.mersenneforum.org/showpost.php?p=525330&postcount=1331[/URL] also remain unimplemented wish list items.
I think those would be straightforward to implement. (The numbering below is arbitrary.)
[LIST=1]
[*]I think SELROC would appreciate the automation of gputo72 fitted bounds for P-1, as would I. Manually looking up or computing bounds and entering them for each P-1 entry is a bit cumbersome. See [URL]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/URL]
[*]Converting a problem worktodo entry from active to a comment that's skipped and continuing computation with any following active entries would enable continuing full throughput in many cases. Terminating when there's an issue with the current worktodo entry reduces throughput, whether it's due to an entry for a PRP run, P-1 run, or future Wagstaff capability run.
[*]Proof of computing the PRP via VDF is intriguing. It has a separate thread at [URL]https://www.mersenneforum.org/showthread.php?t=24654[/URL]
[*]A method of verification of TF work performance was described by Robert Gerbicz. Links to that and to discussion of possible adaptation of the method to P-1 are included in a post on P-1 error rate [URL]https://www.mersenneforum.org/showpost.php?p=509937&postcount=3[/URL].
[*]Wagstaff computation seems to me a significant development effort, based on reading the comments of Woltman and Mayer on how to proceed.
[*]P-1 has little in the way of error checking. Part of that is by the nature of the computation; the Gerbicz check does not apply. There are parts of it to which the Jacobi check could be applied, and large parts in which it is quite unproductive. See [URL]https://www.mersenneforum.org/showthread.php?p=490415[/URL] and [URL]https://www.mersenneforum.org/showthread.php?t=23470[/URL]
[*]There appear to be some small opportunities for increased efficiency in P-1. See [URL]https://www.mersenneforum.org/showpost.php?p=515863&postcount=11[/URL]
[/LIST]
Thanks for all your efforts. I'm happy to test nearly whatever you add next, within my available OS and gpu limits. |
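The second wish-list item (turn a problem worktodo entry into a skipped comment and carry on with the next active entry) could look roughly like the Python sketch below. Everything here is hypothetical: the function names, the placeholder entries, and the use of "#" as a comment marker are assumptions for illustration, not gpuowl's actual worktodo handling.

```python
def next_active(lines, marker="#"):
    """Index of the first non-blank, non-comment entry, or None if none."""
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped and not stripped.startswith(marker):
            return i
    return None


def comment_out_entry(lines, bad_index, marker="#"):
    """Return a copy of lines with lines[bad_index] turned into a comment."""
    out = list(lines)
    out[bad_index] = f"{marker} {out[bad_index]}"
    return out


# Placeholder entries; real worktodo lines are not reproduced here.
worktodo = ["ENTRY_A (placeholder)", "ENTRY_B (placeholder)"]

# Suppose the first active entry turns out to be unusable:
bad = next_active(worktodo)
worktodo = comment_out_entry(worktodo, bad)

# Work continues with the following entry instead of terminating.
print(worktodo)
print("next active index:", next_active(worktodo))
```

Keeping the failed entry in the file as a comment, rather than deleting it, leaves a record the user can inspect and re-enable later.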