[QUOTE=nomead;532534]The advantage of benchmarking on Linux is that the results are more predictable, it's less likely that the OS starts indexing or going through updates or scanning for viruses in the background.[/QUOTE]What's the anticipated mechanism for a [B]cpu[/B]-intensive other process like indexing or virus scanning to have impact on a [B]gpu[/B]-intensive process with minimal cpu use such as gpuowl? Secondary effects from disk access contention among multiple processes despite caching strategies? I routinely benchmark gpus with prime95 occupying cpu cores fully, and writing save files periodically, and other GIMPS apps running on other gpus and doing their writes, and repeatability is pretty good.
[QUOTE=Prime95;532692]On the off chance it is an OpenCL compile issue, go to tailFused and change the declaration of lds to size SMALL_HEIGHT*2 rather than SMALL_HEIGHT*complicated_expression.[/QUOTE]
It comes after the OpenCL compilation is already done. And no, unfortunately that didn't fix it. Some FFT sizes with WIDTH=4096u also fail, with different error messages of course. There it clearly occurs during OpenCL compilation: [CODE]2019-12-12 09:29:21 ptxas error : Entry function 'carryFusedMul' uses too much shared data (0x10008 bytes, 0xc000 max)
ptxas error : Entry function 'carryFused' uses too much shared data (0x10008 bytes, 0xc000 max)
ptxas error : Entry function 'fftP' uses too much shared data (0x10008 bytes, 0xc000 max)
ptxas error : Entry function 'fftW' uses too much shared data (0x10008 bytes, 0xc000 max)
2019-12-12 09:29:21 Exception gpu_error: clBuildProgram at clwrap.cpp:234 build[/CODE] (Note that this is also with just -use NO_ASM.) I don't use git yet, so I just downloaded the whole zip from github, gwoltman2/gpuowl, at the latest commit, labeled b9c39f9. Maybe some day...
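For reference, the figures in that ptxas message decode to a roughly 16 KiB overshoot. A quick sanity check on the numbers (note: the SMALL_HEIGHT value of 2048 below is my assumption for this FFT size, not something taken from the log):

```python
# Decode the shared-memory figures from the ptxas error above.
used = 0x10008    # bytes of shared (LDS) data the kernel requested
limit = 0xc000    # per-kernel shared-memory limit ptxas reports

print(used)           # 65544
print(limit)          # 49152
print(used - limit)   # 16392 bytes over the limit

# If the lds buffer holds double2 elements (16 bytes each), then with an
# assumed SMALL_HEIGHT of 2048 even the reduced SMALL_HEIGHT*2 sizing
# needs 2048 * 2 * 16 = 65536 bytes, still above the 48 KiB limit --
# consistent with the suggested change not helping here.
print(2048 * 2 * 16)  # 65536
```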
[QUOTE=kriesel;532696]What's the anticipated mechanism for a [B]cpu[/B]-intensive other process like indexing or virus scanning to have impact on a [B]gpu[/B]-intensive process with minimal cpu use such as gpuowl? Secondary effects from disk access contention among multiple processes despite caching strategies? I routinely benchmark gpus with prime95 occupying cpu cores fully, and writing save files periodically, and other GIMPS apps running on other gpus and doing their writes, and repeatability is pretty good.[/QUOTE]
Yes, mostly disk I/O and related interrupts. Antivirus programs and Windows indexing in particular really thrash the disk; even with NVMe drives the impact is noticeable. I've found that with the very short work I'm doing on mfaktc at the moment (double checking 2 to 64 bits in the 3G-4G range, about 0.07 seconds per exponent!) mprime affects mfaktc, and vice versa. Not so much with exponents that take at least a second per factoring attempt. And earlier I was talking about Win10 affecting Prime95, with no GPU involved there... And I'd hardly call gpuowl "minimal cpu use": even with -yield it takes about 80% of one core on my Linux machine, but luckily it's happy with a hyperthreaded core, so it doesn't affect mprime.
[QUOTE=nomead;532699]I'd hardly call gpuowl "minimal cpu use", even with -yield it takes about 80% of one core on my Linux machine, but luckily it's happy with a hyperthreaded core, so it doesn't affect mprime.[/QUOTE]How slow a core is that? Here:
Case 1: gpuowl 6.11-9 PRP on 5M FFT, Win10 Pro x64, dual Xeon E5-2697 v2, 24 real cores, 48 counting hyperthreading; Task Manager reports about 0.25% cpu use for it, ~12% of one hyperthread.
Case 2: gpuowl 6.6, P-1 stage 2 on 530M, Win7 Pro x64, dual Xeon E5645, 12 real cores total, no hyperthreading; Task Manager reports 0% cpu use for it at its 1% resolution. Accumulated cpu time indicates 2.45 cpu core-hours in 98 elapsed hours (1176 core-hours available), ~0.21% of available cpu, ~2.5% of one core, and v6.6 does not have the -yield option. When it barely shows up in Task Manager, or registers 0%, I call that minimal.
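A quick sanity check on the case-2 arithmetic (all numbers from the post above):

```python
# Case 2: 2.45 cpu core-hours consumed over 98 elapsed hours on 12 cores.
cpu_hours = 2.45
elapsed = 98
cores = 12

available = elapsed * cores                      # total core-hours on offer
print(available)                                 # 1176
print(round(100 * cpu_hours / available, 2))     # ~0.21 (% of all available cpu)
print(round(100 * cpu_hours / elapsed, 1))       # 2.5 (% of one core)
```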
[QUOTE=kriesel;532702]How slow a core is that? Here:
Case 1: gpuowl 6.11-9 PRP on 5M FFT, Win10 Pro x64, dual Xeon E5-2697 v2, 24 real cores, 48 counting hyperthreading; Task Manager reports about 0.25% cpu use for it, ~12% of one hyperthread.
Case 2: gpuowl 6.6, P-1 stage 2 on 530M, Win7 Pro x64, dual Xeon E5645, 12 real cores total, no hyperthreading; Task Manager reports 0% cpu use for it at its 1% resolution. Accumulated cpu time indicates 2.45 cpu core-hours in 98 elapsed hours (1176 core-hours available), ~0.21% of available cpu, ~2.5% of one core, and v6.6 does not have the -yield option. When it barely shows up in Task Manager, or registers 0%, I call that minimal.[/QUOTE] On a Ryzen 5 3600, running at about 3.9 GHz. So 6 cores, 12 threads. Example: [CODE]  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28683 sam       30  10  725228  72764   7472 S 587.0  0.4   7528:08 mprime
 1561 sam       20   0 5090652 142732 104232 R  79.4  0.9   0:09.32 gpuowl[/CODE]
My case 1 a couple posts earlier was for the process using a Radeon VII (other gpu in the box is an RX550 2GB);
case 2 was for an RX480 (the other gpu there is an RX550 4GB). Minimal cpu usage in either case of GPU saturation.
[QUOTE=Prime95;532692]On the off chance it is an OpenCL compile issue, go to tailFused and change the declaration of lds to size SMALL_HEIGHT*2 rather than SMALL_HEIGHT*complicated_expression.[/QUOTE]
I think I know what the Nvidia issue is; we hit it in the past and Cheng Sun found the solution: [url]https://github.com/preda/gpuowl/commit/c48d46fdbcba6c490c439aa9b07eb4c40bcacae0[/url] It concerns unaligned access to LDS, which seems to be an issue on Nvidia (only).

A different problem is the erratic behavior of the ROCm OpenCL compiler/optimizer, which has been in a dire state for years with a tendency of getting worse. It's extremely frustrating to debug and work around "black box" bugs in the ROCm optimizer. I now have a recent ROCm installed, but I'm compiling using the libamdocl64.so from an older ROCm version, which was simply generating code with better performance than all the following ROCm versions.

The recent T2_SHUFFLE changes fix the situation on the newest ROCm, bringing it to parity with the old lib I was using, but introduce a performance regression on the old ROCm. That's fine, still an improvement although a bit non-intuitive: if I get the same performance, it's probably better to get it with the recent ROCm than with the old one. OTOH I tried to apply, after the new T2_SHUFFLE variants, some trivial changes to the LDS in line with Sun's change mentioned above (to fix the Nvidia error), and suddenly carryFused's compilation became much worse for no apparent reason -- ROCm strikes again. In this confusing situation I'm going to wait a bit for more clarity before merging the new T2_SHUFFLE variants.
[QUOTE=nomead;532698]
I don't use git yet, so I just downloaded the whole zip from github, gwoltman2/gpuowl and the latest commit labeled b9c39f9. Maybe some day...[/QUOTE] What? Spelling, perhaps? From github: We couldn’t find any repositories matching 'gwoltman2/gpuowl' |
gpuowl 6.11-83-ge270393 middle and shuffle tune on xfx Radeon VII
Improvement over baseline > 17%. Good job![CODE]Gpuowl version and commit   v6.11-83-ge270393
GPU model                   XFX Radeon VII
GPU clock                   capped at ~1400
Host OS                     Win 10 x64 Pro
Notes                       the gpu now seems to have stabilized to a low error rate, none seen in more than a day
Exponent timed              89796247
Computation type            PRP (of PRP, P-1 stage 1, P-1 stage 2)
FFT length                  5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config file entries         -time -iters 20000 -device 1 -user kriesel -cpu roa/radeonvii

varying tuning -use options, in chronological order
us/sq  -use options
1103   NO_ASM   warmup, end user interaction, stabilize
1110   NO_ASM   baseline

In benchmarking (highlight fastest time in bold)
1240   NO_ASM,MERGED_MIDDLE,WORKINGIN
1233   NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
 982   NO_ASM,MERGED_MIDDLE,WORKINGIN1
 978   NO_ASM,MERGED_MIDDLE,WORKINGIN1A
 973   NO_ASM,MERGED_MIDDLE,WORKINGIN2
 976   NO_ASM,MERGED_MIDDLE,WORKINGIN3
1014   NO_ASM,MERGED_MIDDLE,WORKINGIN4
[B]946[/B]   NO_ASM,MERGED_MIDDLE,WORKINGIN5

Out benchmarking (highlight fastest time in bold)
1105   NO_ASM,MERGED_MIDDLE,WORKINGOUT
1087   NO_ASM,MERGED_MIDDLE,WORKINGOUT0
 994   NO_ASM,MERGED_MIDDLE,WORKINGOUT1
 994   NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
1056   NO_ASM,MERGED_MIDDLE,WORKINGOUT2
[B]960[/B]   NO_ASM,MERGED_MIDDLE,WORKINGOUT3
 991   NO_ASM,MERGED_MIDDLE,WORKINGOUT4
1003   NO_ASM,MERGED_MIDDLE,WORKINGOUT5
repeatability +-3.5/1106.5 = +-0.32%
best 946, base 1106.5, ratio 1.170

1113   NO_ASM
Fastest WORKINGIN, fastest WORKINGOUT combination:
 953   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3
 953   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_WIDTH
[B]946[/B]   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_MIDDLE
 954   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT
 947   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE
[B]946[/B]   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_WIDTH,T2_SHUFFLE_MIDDLE,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
 955   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE
 947   NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_MIDDLE
best 946, base 1113, ratio 1.177[/CODE]
[QUOTE=tServo;532735]What? Spelling, perhaps?
From github: We couldn’t find any repositories matching 'gwoltman2/gpuowl'[/QUOTE] Dunno: [URL="https://github.com/gwoltman2/gpuowl"]https://github.com/gwoltman2/gpuowl[/URL]
[QUOTE=preda;532712]A different problem is the erratic behavior of the ROCm OpenCL compiler/optimizer, which has been in a dire state for years with a tendency of getting worse. It's extremely frustrating to debug and work around "black box" bugs in the ROCm optimizer.[/QUOTE]
I also worked in the fix for unaligned access to local data, and thanks to the compiler's optimizer lost all the benefits I was seeing in the T2_SHUFFLE options. These savings are significant: 808us vs 836us. At this point I don't know if the gain is coming from shuffling T2 values instead of T values or if it is coming from the optimizer making different decisions. Preda is correct about how frustrating this is. The difference in the source code is very minor, yet the optimizer produces wildly different results. What triggers the optimizer into making good decisions? I spent yesterday looking at assembly output and making minor source tweaks and still haven't figured it out.