What variables determine max P-1 exponent for two stages
Several scenarios, different observed limits on exponent in gpuowl P-1, several differences, some matches
A) Windows 7 Pro, dual old Xeon 4-core processors, 24GB system ram, NVIDIA GTX1080Ti with 11gb gpu ram, gpuowl v6.7-4-g278407a, -maxAlloc [B]10240[/B]
Observed maximum exponent for completing two stages with GPU72 bounds, at least 514M but less than 520M, testing 517M next
From the gpuowl.log,
2019-11-10 01:45:55 520000009 5580000 99.94%; 26842 us/sq; ETA 0d 00:01; 1baeff7353d9f091
2019-11-10 01:47:23 Not enough GPU memory, will skip stage2. Please wait for stage1 GCD

B) Colab, and so Ubuntu linux on VM, 2 cores Xeon, KVM hypervisor, ~12.7GB system ram, NVIDIA Tesla K80 with 12gb gpu ram, gpuowl: Fan Ming's build for Colab, -maxAlloc [B]10240[/B] [URL]https://www.mersenneforum.org/showpost.php?p=528390&postcount=379[/URL]
Observed maximum exponent for completing two stages with GPU72 bounds, at least 564M, upper bound TBD, testing 665M now

C) different Windows 7 Pro system, dual old Xeon 6-core processors, 12GB system ram, AMD RX480 with 8GB gpu ram, Gpuowl V6.6-5-667954b, -maxAlloc not an available option in that version
Observed maximum exponent for completing two stages with GPU72 bounds, at least 500M (gpu-z indicated 3.7GB used in stage 2), upper bound TBD, testing 530M now

D) AMD RX550 with 4GB gpu ram, Gpuowl v6.7-4-g278407a,
Observed maximum exponent for completing two stages with GPU72 bounds, at least 150M, upper bound 224M, testing 180M now

E) AMD RX550 with 2GB gpu ram, gpuowl-v6.10-9-g54cba1d -maxAlloc 1900
Observed maximum exponent for completing two stages with GPU72 bounds, at least 24M, upper bound TBD

F) NVIDIA GTX1060 3GB, gpuowl v6.9, -maxAlloc 3000
Observed maximum exponent for completing two stages with GPU72 bounds, <24M, not useful for GIMPS wavefront

G) NVIDIA GTX1080 with 8GB gpu ram,
Observed maximum exponent for completing two stages with GPU72 bounds, at least 499M, upper bound TBD

Any ideas why the difference in observed limit, especially for the same -maxAlloc? I'd really appreciate Preda's thoughts on what determines that.
There's clearly more to it than only gpu ram. If maxAlloc controlled max exponent, the limits on scenarios A and B would be very similar or identical, and they're not.
Re unknown linux system ram amount, how do I ask the Colab VM's linux that? After a web search, tried
!lshw -short
!sudo lshw -short
!sudo dmidecode -t memory
and got
bash: lshw: command not found
sudo: lshw: command not found
sudo: dmidecode: command not found
finally got it with
!cat /proc/meminfo
since top is not part of the gpuowl scripts |
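For anyone else wanting the Colab VM's system ram as a single number rather than the full /proc/meminfo dump, here is a minimal Python sketch. It assumes the usual "Name: value kB" layout of /proc/meminfo; the helper names parse_meminfo and total_ram_gib are made up for illustration and are not part of gpuowl or Colab.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field name: integer value} dict.

    Values are in kB for most fields; a few (such as HugePages_Total)
    are plain counts with no unit.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that does not look like "Name: <integer> [unit]".
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def total_ram_gib(path="/proc/meminfo"):
    """Return MemTotal converted from kB to GiB (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemTotal"] / 1024**2
```

In a Colab cell, print(total_ram_gib()) should report the same total that the MemTotal line of !cat /proc/meminfo shows, just converted to GiB.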
Dept of corrections
[QUOTE=kriesel;532301]Several scenarios, different observed limits on exponent in gpuowl P-1, several differences, some matches
A) Windows 7 Pro, dual old Xeon 4-core processors, 24GB system ram, NVIDIA GTX1080Ti with 11gb gpu ram, gpuowl v6.7-4-g278407a, -maxAlloc 10240 Observed maximum exponent for completing two stages with GPU72 bounds, at least 510M but less than [B]514[/B]M, testing [B]511[/B]M next. [/QUOTE]The FFT length transition from 28672K to 36864K is sizable, and the help output gives 510.47M as the maximum exponent for the 28672K length. |
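To make the FFT length point concrete, here is a small Python sketch of the bits-per-word arithmetic. It assumes an FFT length given in K means K*1024 words, which matches the "89796247 FFT 5120K ... 17.13 bits/word" line that gpuowl prints. The function name is made up for illustration; how gpuowl actually picks the per-length maximum exponent is not shown in this thread.

```python
def bits_per_word(exponent, fft_k):
    """Average bits stored per FFT word for a given exponent and FFT length in K."""
    return exponent / (fft_k * 1024)


# Value taken from a gpuowl log line in this thread: 5120K FFT at p=89796247.
logged = bits_per_word(89796247, 5120)

# 510.47M is the help-output maximum quoted above for the 28672K length.
at_limit = bits_per_word(510_470_000, 28672)

# The same exponent on the next length up, 36864K, is packed far less densely.
next_length = bits_per_word(510_470_000, 36864)

print(f"{logged:.2f} {at_limit:.2f} {next_length:.2f}")
```

By this arithmetic the 28672K limit sits at roughly 17.4 bits/word, and an exponent just past it drops to about 13.5 bits/word on 36864K. Each stage 2 buffer then grows by the ratio 36864/28672, about 29%, which would fit the two-stage limit on scenario A landing right at the transition.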
New gpuOwl version
A new version of gpuOwl is ready with some fairly decent speed and power improvements. The key change was merging the transpose and middle steps into one kernel. This reduces memory bandwidth requirements by 33%.
On my test Radeon VII system running a 5M FFT, the one-instance timing was 950us and the two-instance timing was 1720us. With the new code the timing is 839us, along with a reduction of 17 watts.
For Linux, download the source from preda's github gpuowl page. Here is a link to the Windows executable: [url]https://www.dropbox.com/s/w9qnxd02qlt3sof/gpuowl-win.exe?dl=0[/url]
To activate the new code, add "-use MERGED_MIDDLE" to the command line. Do not run 2 instances; it will reduce throughput.
The new code is still undergoing some minor tuning and cleanup, especially with regard to architectures other than Radeon VII. It needs QA on P-1 tasks.
Bonus: I added the -cleanup command line argument to delete PRP save files at the end of the run for composite results. |
[QUOTE=Prime95;532379]A new version of gpuOwl is ready with some fairly decent speed and power improvements. ...[/QUOTE] A very nice speed up from 990us to 860us :tu: |
[QUOTE=paulunderwood;532381]A very nice speed up from 990us to 860us :tu:[/QUOTE]Or 1720/2=860, to 839, 2.5% higher throughput. There's also around 6% single-instance speedup for p=89796247, fft 5M, gtx1080, Win7 Pro, gpuowl v6.11-71-g7e02b07, and perhaps a little more to come. (That commit took different -use input)
|
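The throughput comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic; it assumes, as the post does, that two instances at 1720us each are equivalent to one result stream at 860us per iteration. The function name is made up for illustration.

```python
def throughput_gain(old_us, new_us):
    """Fractional throughput gain when time per iteration drops from old_us to new_us."""
    return old_us / new_us - 1.0


# Two instances at 1720us each deliver one iteration per 860us overall.
two_instance_effective_us = 1720 / 2

gain_vs_two_instances = throughput_gain(two_instance_effective_us, 839)
gain_vs_one_instance = throughput_gain(950, 839)

print(f"{gain_vs_two_instances:.1%} {gain_vs_one_instance:.1%}")
```

So the merged kernel is about 2.5% ahead of the old two-instance setup and about 13% ahead of the old single instance on those Radeon VII numbers.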
[QUOTE=Prime95;532379]A new version of gpuOwl is ready with some fairly decent speed and power improvements. ...[/QUOTE] Awesome! I will test on my Titan V, which is severely bound by memory bandwidth, and I'll also experiment with the P100s on Colab with this version when I have time.
UPDATE 1: Significant speed up with my Vega 64, going from 2100us/it to 1870us/it. Not bad at all.
UPDATE 2: I'm getting the following error on my Nvidia GPUs. This is on Windows 10 1909 with an Nvidia Titan V running driver version 440.97, using George's executable:
[CODE]OpenCL compilation error -11 (args -DEXP=94205039u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2daa303b0ff18p-3 -DIWEIGHT_STEP=0xf.a6a9923a8d87p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DMERGED_MIDDLE=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-08 14:43:55 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^
2019-12-08 14:43:55 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 14:43:55 Bye[/CODE] |
[QUOTE=Prime95;532379]A new version of gpuOwl is ready with some fairly decent speed and power improvements. ...
For Linux, download the source from preda's github gpuowl page. Here is a link to the Windows executable: [URL]https://www.dropbox.com/s/w9qnxd02qlt3sof/gpuowl-win.exe?dl=0[/URL] [/QUOTE]
[CODE]C:\Users\ken\Documents\gwtest>gpuowl-win -time -iters 10000 -use NO_ASM
2019-12-08 17:07:54 gpuowl v6.11-64-g01a9778-dirty
2019-12-08 17:07:54 Note: no config.txt file found
2019-12-08 17:07:54 config: -time -iters 10000 -use NO_ASM
2019-12-08 17:07:54 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-08 17:07:56 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 17:07:56 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-08 17:07:56 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^
2019-12-08 17:07:56 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 17:07:56 Bye

C:\Users\ken\Documents\gwtest>gpuowl-win -time -iters 10000 -use NO_ASM,MERGED_MIDDLE
2019-12-08 17:07:56 gpuowl v6.11-64-g01a9778-dirty
2019-12-08 17:07:57 Note: no config.txt file found
2019-12-08 17:07:57 config: -time -iters 10000 -use NO_ASM,MERGED_MIDDLE
2019-12-08 17:07:57 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-08 17:07:58 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-08 17:07:58 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -DMERGED_MIDDLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-08 17:07:58 <kernel>:1117:11: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
atom_add(&localSum, sum);
^
<kernel>:1120:39: error: must specify '#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable' before using this atomic operation
if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }
^
2019-12-08 17:07:59 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:226 build
2019-12-08 17:07:59 Bye[/CODE]
Preda had made a pragma fix for his latest commit, 6.11-71-g7e02b07, which apparently didn't make it into prime95's Windows build yet. |
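For readers wondering what the compiler is asking for: the NVIDIA OpenCL compiler wants the extension pragma to appear in the kernel source before the first 64-bit atom_add. The Python sketch below only illustrates that idea on a kernel source string. The pragma text is copied from the error message above; the helper ensure_int64_atomics is hypothetical and is not Preda's actual fix, which lives in the gpuowl source.

```python
PRAGMA = "#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable"


def ensure_int64_atomics(kernel_source):
    """Prepend the int64 atomics pragma if the source uses atom_add without it.

    Hypothetical helper for illustration only; a real fix would simply put
    the pragma near the top of the OpenCL source file.
    """
    if "atom_add(" in kernel_source and PRAGMA not in kernel_source:
        return PRAGMA + "\n" + kernel_source
    return kernel_source


# The two statements the compiler complained about, wrapped as a source fragment.
failing_fragment = (
    "atom_add(&localSum, sum);\n"
    "if (get_local_id(0) == 0) { atom_add(&out[0], localSum); }\n"
)
patched = ensure_int64_atomics(failing_fragment)
```

Since gpuowl compiles its OpenCL source at run time on the target gpu, a source that lacks the pragma fails on the NVIDIA cards reported here, while the Radeon VII and Vega 64 runs above evidently got past compilation.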
[QUOTE=kriesel;532389]Preda had made a pragma fix for his latest commit, 6.11-71-g7e02b07, which apparently didn't make it into prime95's Windows build yet.[/QUOTE] I see. Are there any build instructions for Windows, then? It would be great if I could figure out how to build on Windows myself :D |
[QUOTE=kriesel;532389]Preda had made a pragma fix for his latest commit, 6.11-71-g7e02b07, which apparently didn't make it into prime95's Windows build yet.[/QUOTE]
I'm waiting for one last merge before making another Windows executable. |
[QUOTE=xx005fs;532390]I see. Any build instruction for windows then? It would be great if I can figure out how to build on windows myself :D[/QUOTE]Probably best to let Preda and Prime95 get back into sync first.
But in general, for relatively recent gpuowl versions, on Windows:
Do steps 1 through 4 of kracker's instructions at [URL]https://www.mersenneforum.org/showpost.php?p=483209&postcount=356[/URL] (The AMD APP SDK 3.0 link has gone dead. See for example [URL]https://github.com/fireice-uk/xmr-stak/issues/1511[/URL] or [URL]https://en.wikipedia.org/wiki/AMD_APP_SDK[/URL])
Install git on msys2.
This may not be the whole story for setting up for compiles.
In an msys2 cmd prompt box from here on:
# to refresh a git working folder:
git pull [URL]https://github.com/preda/gpuowl[/URL]
# or to clone into a new folder that has not been a git folder before:
git clone [URL]https://github.com/preda/gpuowl[/URL]
cd gpuowl
make gpuowl-win.exe
To use the executable, switch to an NT command prompt box. It won't run in the msys2 context. Msys2 is a linux-like environment. The executable is a Windows executable. It's a sort of cross-compile.
I usually run gpuowl-win.exe -h immediately, both to save the help output and to verify that the newly compiled program is working well enough to identify gpus on the build box. Since it's OpenCL based, it's the same build whether used on AMD or NVIDIA gpus. |
[QUOTE=kriesel;532394]
To use the executable, switch to an NT command prompt box. It won't run in the msys2 context.[/QUOTE] It'll run; you just have to tell the shell to look in the "current" folder, for example "./gpuowl-win.exe". |