Gpuowl v6.7-4-278407a Windows build
1 Attachment(s)
There's a copy of the help output text and the license along with the stripped executable in the 7z file. The main feature addition is P-1 on NVIDIA. Note there's no P-1 save file capability yet, so any P-1 restart is from the beginning.
|
1 Attachment(s)
[QUOTE=preda;525342]most recent commit now, tagged v6.8: v6.8-0-gab732a0
[/QUOTE] msys2 / MINGW64 make did not like that one at all, producing quite a shower of messages about it. |
On a Win7 Pro x64 system with 12GB of RAM, a 36GB page file, and a 4GB GPU (GTX 1050 Ti), gpuowl 6.7-4 paged heavily for an hour before abandoning its attempt to start stage 2 of a 50M P-1 run:
[CODE]2019-09-06 23:33:12 50001781 610000 96.08%; 12509 us/sq; ETA 0d 00:05; 73d8a20f091dd815
2019-09-06 23:35:17 50001781 620000 97.66%; 12520 us/sq; ETA 0d 00:03; c2adf07b3ea6c52c
2019-09-06 23:37:22 50001781 630000 99.23%; 12502 us/sq; ETA 0d 00:01; 7fc58198458ecd8a
2019-09-07 00:54:35 Exception gpu_error: OUT_OF_HOST_MEMORY clCreateBuffer at clwrap.cpp:273 makeBuf _
2019-09-07 00:54:36 Bye[/CODE]
The command line was gpuowl-win -user kriesel -cpu condorette -use ORIG_X2 -device 0, and the worktodo entry was B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2. The same system and GPU have run CUDAPm1 to completion, both stages, on exponents up to 384M. |
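As a rough cross-check of the stage 1 log above, the sketch below parses the three progress lines and compares the total iteration count they imply with the usual estimate that the stage 1 exponent (the product of prime powers up to B1) has about B1/ln 2 bits. It assumes that the percentage column is simply iteration divided by total stage 1 iterations; that reading of the gpuowl log is an assumption, not something stated in the thread. The helper names are illustrative only.

```python
import math
import re

# Log lines and worktodo entry copied from the post above.
LOG_LINES = [
    "2019-09-06 23:33:12 50001781 610000 96.08%; 12509 us/sq; ETA 0d 00:05; 73d8a20f091dd815",
    "2019-09-06 23:35:17 50001781 620000 97.66%; 12520 us/sq; ETA 0d 00:03; c2adf07b3ea6c52c",
    "2019-09-06 23:37:22 50001781 630000 99.23%; 12502 us/sq; ETA 0d 00:01; 7fc58198458ecd8a",
]
WORKTODO = "B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2"

# Assumed layout: date, time, exponent, iteration, percent, microseconds per squaring.
PATTERN = re.compile(
    r"^\S+\s+\S+\s+(?P<exponent>\d+)\s+(?P<iteration>\d+)\s+"
    r"(?P<percent>\d+(?:\.\d+)?)%;\s+(?P<us_per_sq>\d+)\s+us/sq;"
)


def parse_line(line):
    """Extract the numeric fields from one progress line."""
    m = PATTERN.match(line)
    if m is None:
        raise ValueError("unrecognised log line: " + line)
    return {
        "exponent": int(m.group("exponent")),
        "iteration": int(m.group("iteration")),
        "percent": float(m.group("percent")),
        "us_per_sq": int(m.group("us_per_sq")),
    }


def implied_total(record):
    """Total iterations implied by 'iteration is percent% of the total' (assumption)."""
    return record["iteration"] / (record["percent"] / 100.0)


records = [parse_line(line) for line in LOG_LINES]
mean_total = sum(implied_total(r) for r in records) / len(records)
mean_us = sum(r["us_per_sq"] for r in records) / len(records)

# Estimate: log2(lcm(1..B1)) is roughly B1 / ln 2.
b1 = int(re.search(r"B1=(\d+)", WORKTODO).group(1))
expected_bits = b1 / math.log(2)

# Stage 1 wall time if every squaring took the mean logged time.
hours = mean_total * mean_us / 1e6 / 3600.0

print(f"implied stage 1 iterations: {mean_total:.0f}")
print(f"B1 / ln 2 estimate:         {expected_bits:.0f}")
print(f"approx. stage 1 time:       {hours:.2f} h")
```

The two counts agree to well under one percent, which is consistent with stage 1 being essentially complete when the stage 2 allocation failed.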
[QUOTE=kriesel;525362]msys2 / MINGW64 make did not like that one at all, producing quite a shower of messages about it.[/QUOTE]
Please retry (I attempted a fix) |
[QUOTE=kriesel;525370]On a Win7 Pro x64 system with 12GB of ram, 36GB page file, and gpu with 4GB (GTX1050Ti) gpuowl 6.7-4 produced mad paging for an hour before terminating its attempt to start stage 2 on a 50M P-1 run :[CODE]2019-09-06 23:33:12 50001781 610000 96.08%; 12509 us/sq; ETA 0d 00:05; 73d8a20f091dd815
2019-09-06 23:35:17 50001781 620000 97.66%; 12520 us/sq; ETA 0d 00:03; c2adf07b3ea6c52c
2019-09-06 23:37:22 50001781 630000 99.23%; 12502 us/sq; ETA 0d 00:01; 7fc58198458ecd8a
2019-09-07 00:54:35 Exception gpu_error: OUT_OF_HOST_MEMORY clCreateBuffer at clwrap.cpp:273 makeBuf _
2019-09-07 00:54:36 Bye[/CODE]cmd line was gpuowl-win -user kriesel -cpu condorette -use ORIG_X2 -device 0, worktodo entry B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2. The same system and gpu have run CUDAPm1 to completion both stages on up to 384M exponents.[/QUOTE]
Did you not specify -maxAlloc? |
[QUOTE=preda;525378]Did you not specify -maxAlloc?[/QUOTE]Correct, -maxAlloc not used on that run, which went to over 11GB peak working set and hit as high as 8000 page faults/sec. A retry with -maxAlloc 3072 started shortly after the earlier post has made it to stage 2 and so far has peak working set 182MB.
|
[QUOTE=preda;525377]Please retry (I attempted a fix)[/QUOTE]Thanks, much better; clean compile on gpuowl v6.8-2-g0f3059b in Win7Pro x64 with msys2/mingw64 (indicated as 20180531 in Control Panel/Programs/Programs and Features; not sure if that indicates any subsequent updates).
|
Sorry if this is the wrong forum thread, but can you explain this case: on a GTX 1050 Ti I got nearly 315 GHz-days of credit in 24 hours on a TF task. I got almost the same amount of GHz-days for one LL or PRP task, which takes 15-16 days of computing on the same 1050 Ti GPU. So that is 315 GHz-days in 1 day versus 16 days. Why such a big difference?
|
There are some other factors (ahem...) at play too, but the simplest way to put it is that LL/PRP and TF work use different compute capabilities of the GPU core. TF is mostly 32-bit integer (INT32), while the FFT multiplication in LL/PRP uses mostly 64-bit double-precision floating point (FP64) arithmetic. On Nvidia consumer cards, the FP64 capability has been deliberately limited to protect their datacenter range of cards, so much so that for every FP64 operation the card can do 32 FP32 operations in parallel in the same time. On the Pascal cards (GTX10xx) INT32 is somewhat slower than FP32, which is why the ratio in work done per day isn't quite 1:32. I don't know the exact figure though. In the Turing generation of cards (GTX16xx and RTX20xx), INT32 is now as fast as FP32, so the difference between TF and LL work in GHz-d/d is several times bigger.
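The figures from the question give a rough observed ratio for the GTX 1050 Ti. The sketch below only redoes that arithmetic: about 315 GHz-days of TF credit in one day, against about 315 GHz-days of LL/PRP credit spread over 15 to 16 days. The function name is illustrative, and the result reflects one user's reported numbers, not a benchmark.

```python
def ghz_days_per_day(credit_ghz_days, elapsed_days):
    """Throughput as credited GHz-days earned per day of wall time."""
    return credit_ghz_days / elapsed_days


# Numbers reported in the question above.
TF_CREDIT, TF_DAYS = 315.0, 1.0
LL_CREDIT, LL_DAYS_RANGE = 315.0, (15.0, 16.0)

tf_rate = ghz_days_per_day(TF_CREDIT, TF_DAYS)
ll_rates = [ghz_days_per_day(LL_CREDIT, days) for days in LL_DAYS_RANGE]
ratios = [tf_rate / rate for rate in ll_rates]

for days, rate, ratio in zip(LL_DAYS_RANGE, ll_rates, ratios):
    print(f"LL/PRP over {days:.0f} days: {rate:.1f} GHz-d/d, TF is {ratio:.0f}x faster")
```

On these numbers TF earns credit roughly 15 to 16 times faster than LL/PRP on that card, which is below the nominal 32:1 FP32 to FP64 ratio, as the post suggests.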
For example, my Ryzen 5 3600 can do more LL work per day in mprime than my RTX 2080 (non-Super) can in CUDALucas. So that's why I only do trial factoring on the card. This is also the reason why the Radeon VII is such a beast at PRP work, because there, the FP64 to FP32 ratio was only limited to 1:4. On AMD consumer cards in general I think the ratio has mostly been 1:16, please correct me because I'm sure I'm wrong. |
[QUOTE=nomead;525442]On AMD consumer cards in general I think the ratio has mostly been 1:16, please correct me because I'm sure I'm wrong.[/QUOTE]This is why Preda was discouraged, back around gpuowl v3.8, from implementing an NVIDIA-compatible gpuowl; the GTX10xx cards seemed slow to him, correctly, at 1:32. There is wide variation across the NVIDIA line, and some older NVIDIA GPUs have much better DP performance ratios. See [url]https://www.mersenneforum.org/showpost.php?p=490612&postcount=3[/url]
|
gpuowl P-1 stage 2 terminology questions
What is a block?
What is a round?
What determines how many rounds are required for stage 2?
What Brent-Suyama extension exponent is used? Does it vary?
What determines the maximum exponent that gpuowl can complete in P-1 stage 1 or stage 2, other than run time or available FFT lengths? (No doubt available GPU RAM is a constraint, but without more info, that does not enable computing or predicting the maximum exponent per GPU model from GPU specifications. "Just try it" is an unsatisfying answer when run times may be weeks or longer, depending on GPU model and exponent.) |