New FFT size! finally
GpuOwl has acquired a new FFT size: 5000K.
i.e. 625 * 4096 * 2, which is about 22% more bit-space than the old 4M FFT and should be enough at the wavefront for... at least 1 year, I say :)
Iterations at the new FFT size are about 50% slower than at the 4M FFT. As a data point, on my Vega64 I get 2.55 ms/it.
This new GpuOwl bumps the version number to 2.x (from 1.x previously). The development of the new FFT, which is the first NPOT (non-power-of-2) FFT in GpuOwl, required some preliminary cleanups, which had the side effect that all the other transforms have been removed: M61, FFT-8M, etc.
But: the previous version (v 1.x) with everything is still available in the V1 branch on github: [URL]https://github.com/preda/gpuowl/tree/V1[/URL]
Also, in time, v 2.x may [re-]acquire additional transforms if there is demand.
[QUOTE]
gpuOwL v2.0-6ab41b2 GPU Mersenne primality checker
Command line options:
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-tail fused|split : selects tail kernels variant (default 'fused').
-carry short|medium|long : selects carry propagation (default 'short').
Longer carry accomodates lower bits-per-word but may be slower.
* 'short' is good for bits-per-word >= 13,
* 'medium' is good for bits-per-word >= 6,
* 'long' is good for bits-per-word >= 2.
Also note that 'short' and 'medium' use fused carry kernels, while 'long' uses split carry kernels.
-device <N> : select specific device among:
0 : gfx900-64x1630-@83:0.0 Vega [Radeon RX Vega]
[/QUOTE] |
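For readers who want to check the size arithmetic in the announcement, here is a minimal Python sketch. It only restates the numbers from the post (625 * 4096 * 2 words versus the old 4M FFT); the variable names are made up for illustration and nothing here comes from the GpuOwl source.

```python
# Sketch: size of the new FFT relative to the old 4M FFT.
# All names below are illustrative, not GpuOwl identifiers.
K = 1024

new_words = 625 * 4096 * 2   # word count of the new transform, as given in the post
old_words = 4 * K * K        # word count of the old 4M transform

new_size_k = new_words // K          # size expressed in units of 1024 words
gain = new_words / old_words - 1     # relative increase in bit-space

print(f"new FFT: {new_size_k}K words")
print(f"extra bit-space vs 4M: {gain:.1%}")
```

The ratio depends only on the word counts, so it holds for any fixed bits-per-word limit.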
[QUOTE=preda;479583]A wild guess: maybe the PCI extender has something to do with the GPU-freeze? (maybe you could try to move the freezing-GPU to a direct PCI slot and see if the behavior changes?)[/QUOTE]
Couldn't any added cable run, combined with more contact-to-contact junctions (plug<>jack), cause problems like driver time-outs and sync problems in general? |
[QUOTE=kladner;479590]Couldn't any added cable run, combined with more contact-to-contact junctions (plug<>jack), cause problems like driver time-outs and sync problems in general?[/QUOTE]
I'm vaguely aware that there are two main types of riser, powered and unpowered. Unpowered are more likely to have power distribution issues, powered takes a 6 pin from the PSU. |
Prime95 has 5120k rather than the 5000k you use. How would these compare? As far as I was aware it is generally best to have as much as possible to still be powers of 2.
|
[QUOTE=henryzz;479606]Prime95 has 5120k rather than the 5000k you use. How would these compare? As far as I was aware it is generally best to have as much as possible to still be powers of 2.[/QUOTE]
I don't know the details of how Prime95 implements the FFT, or what works best there. A priori I would also expect powers of 2 to be a good thing on the CPU. But on the GPU there are different restrictions on how the code is structured: many "threads" (hundreds), each with a small number of registers for storing the data, avoiding memory access and basically just shuffling the values between the threads. For this situation, what I found works well is the "matrix FFT algorithm", where each thread does a small FFT, shuffles, and repeats. Thus, for me it is good to have a high power of a base as the FFT size. In particular, I compute FFT 625 (which is 5^4) and 4096 (8^4). For 625, there are 125 threads, each doing FFT-5 repeatedly (with shuffling interspersed). OTOH there may be a better way to do it on the GPU that I don't know of. |
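To make the "small FFT, shuffle, repeat" idea concrete, here is a rough CPU-side sketch in Python. It is not GpuOwl's OpenCL code and makes no claim about how the kernels are laid out; it only assumes the standard decomposition of a length n1*n2 transform into small transforms plus twiddle factors, applied recursively for a length that is a power of one base. The function names `dft` and `fft_power_of_base` are hypothetical, and the naive `dft` merely stands in for the small per-thread FFT-5 or FFT-8.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT: stands in for the small FFT a single thread would do."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_power_of_base(x, base):
    """DFT of a sequence whose length is a power of `base` (illustrative helper)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == base:
        return dft(x)
    if n % base != 0:
        raise ValueError("length must be a power of base")
    n1, n2 = base, n // base
    # View x as an n1 x n2 matrix (index = n2*row + col) and run one
    # small length-`base` DFT down each column.
    cols = [dft([x[n2 * r + c] for r in range(n1)]) for c in range(n2)]
    out = [0j] * n
    for k1 in range(n1):
        # Apply the twiddle factors, then transform the remaining length-n2
        # sequence the same way (n2 is a smaller power of the same base).
        seq = [cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
               for c in range(n2)]
        row = fft_power_of_base(seq, base)
        for k2 in range(n2):
            out[k1 + n1 * k2] = row[k2]  # the "shuffle": results land transposed
    return out


# Length 625 = 5^4, built only from length-5 transforms.
signal = [complex(i % 7, i % 3) for i in range(625)]
fast = fft_power_of_base(signal, 5)
reference = dft(signal)
max_error = max(abs(a - b) for a, b in zip(fast, reference))
print(f"max difference from the naive DFT: {max_error:.2e}")
```

The same function with `base=8` handles lengths such as 4096 = 8^4. On a GPU the gathers and the transposed writes would correspond to the shuffling between threads described above, but that mapping is only an analogy here.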
Ok that makes sense. It is just a case of working out what range of bases is best for the gpu then.
|
[QUOTE=preda;479585]GpuOwl has acquired a new FFT size: 5000K.
i.e. 625 * 4096 * 2 which is about 22% more bit-space than the old 4M FFT, and should be enough at the wavefront for... at least 1year I say :) The performance at the new FFT size is about 50% slower compared to 4M FFT. As a data-point, on my Vega64 I get 2.55 ms/it. This new GpuOwl bumped the version number to 2.x (from 1.x previously). The development of the new FFT, which is the first NPOT (non-power-of-2) in GpuOwl, required some preliminary cleanups which had the side effect that all the other transforms have been removed: M61, FFT-8M, etc. But: the previous version (v 1.x) with everything is still available in the V1 branch on github: [URL]https://github.com/preda/gpuowl/tree/V1[/URL] Also, in time, v 2.x may [re-]acquire additional transforms if there is demand.[/QUOTE]
Outstanding to have ~5M available. (I thought you'd been working hard on something like that, since you've been quiet here lately.) Good progress. About 6 months ago I estimated, for primality testing, a need for support
[CODE](Mid 2018) to 90,000,000
(Mid 2019) to 98,000,000
(Mid 2022) to 122,000,000
(Mid 2027) to 162,000,000[/CODE]
Having 3 choices for carry control seems interesting. I will start testing V2 after a Windows executable becomes available.
Reading your summary of options, I see that no -size option is listed. Are 2M, 4M, and 8M all absent from V2? What's next, a 6M or a 10M? Integration of 2M, 4M, 5M, 8M in one executable? My vote is for the integration into one executable, before other new FFT lengths or other features.
In my testing on RX550 (low profile 2GB), V1.9-74f1a38 -fft M61 provided 7.5% exponent coverage above 4M DP, at a cost of 74% longer run time (faster than 8M DP legacy but only by 11.5%, and not for far), and is eclipsed by the new 5M FFT in both speed and range. There seemed to be a hard limit on -fft M61 at or very near 20 bits/word (~83886080).
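As a quick check on the 20 bits/word figure, the following Python sketch recomputes bits per word for the two exponents in the logs that follow. It assumes bits/word is simply the exponent divided by the FFT word count (1024 * 2048 * 2, as printed in the log header); `bits_per_word` is a made-up helper, not a GpuOwl function.

```python
# Sketch: bits per word for a 4M FFT, assuming bits/word = exponent / word count.
FFT_4M_WORDS = 1024 * 2048 * 2   # "FFT 4M (1024 * 2048 * 2)" in the log header


def bits_per_word(exponent, fft_words=FFT_4M_WORDS):
    """Average number of bits of the exponent carried per FFT word."""
    return exponent / fft_words


# Exponent that corresponds to exactly 20 bits/word at 4M.
limit_exponent = 20 * FFT_4M_WORDS

for exponent in (83947901, 83864023):
    side = "above" if exponent > limit_exponent else "below"
    print(f"{exponent}: {bits_per_word(exponent):.2f} bits/word "
          f"({side} {limit_exponent})")
```

The two test exponents sit just on either side of 20 * 4M = 83886080, which matches the one run that reported an error and the one that did not.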
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation in 8517 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=83947901u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Warning: high word size of 20.01 bits may result in errors
PRP-3: FFT 4M (1024 * 2048 * 2) of 83947901 (20.01 bits/word) [2018-02-05 16:47:34 Central Standard Time]
Starting at iteration 0
OK 0 / 83947901 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [16:47:45]
EE 1000 / 83947901 [ 0.00%], 18.94 ms/it [18.91, 18.97] CV 0.2%, check 10.89s; ETA 18d 09:35; 69aed4a865e74756 [16:48:15]
Stopping, please wait..
OK 500 / 83947901 [ 0.00%], 18.97 ms/it; ETA 18d 10:19; a2cbe8ea11556bbc [16:48:35] (1 errors)
Bye
[/CODE]
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation in 8517 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=83864023u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Warning: high word size of 19.99 bits may result in errors
PRP-3: FFT 4M (1024 * 2048 * 2) of 83864023 (19.99 bits/word) [2018-02-05 16:50:52 Central Standard Time]
Starting at iteration 0
OK 0 / 83864023 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [16:51:03]
OK 1000 / 83864023 [ 0.00%], 18.94 ms/it [18.91, 18.97] CV 0.2%, check 10.87s; ETA 18d 09:09; ef9a5a11abf7143f [16:51:32]
OK 5000 / 83864023 [ 0.01%], 18.92 ms/it [18.91, 18.94] CV 0.1%, check 10.94s; ETA 18d 08:41; ee62e9f17bd600ef [16:52:59]
OK 10000 / 83864023 [ 0.01%], 19.06 ms/it [18.91, 20.34] CV 2.4%, check 10.94s; ETA 18d 12:01; 4f11f3877f5769ff [16:54:45]
OK 20000 / 83864023 [ 0.02%], 18.97 ms/it [18.91, 19.94] CV 1.2%, check 10.95s; ETA 18d 09:45; 01fa40ea8a0cbf97 [16:58:06]
OK 40000 / 83864023 [ 0.05%], 18.99 ms/it [18.91, 19.94] CV 1.2%, check 10.92s; ETA 18d 10:06; 4c8ee33e140d542b [17:04:36]
OK 60000 / 83864023 [ 0.07%], 19.01 ms/it [18.88, 19.94] CV 1.5%, check 10.94s; ETA 18d 10:33; 319cca2509d59467 [17:11:08]
OK 80000 / 83864023 [ 0.10%], 18.99 ms/it [18.91, 20.37] CV 1.7%, check 10.95s; ETA 18d 09:54; 8b7984d3f3ea335a [17:17:38]
OK 100000 / 83864023 [ 0.12%], 18.99 ms/it [18.91, 20.37] CV 1.5%, check 10.94s; ETA 18d 09:46; aaa50bb7f76ceff0 [17:24:09]
Stopping, please wait..
OK 144000 / 83864023 [ 0.17%], 18.99 ms/it [18.91, 20.15] CV 1.4%, check 10.94s; ETA 18d 09:42; 6b4ecb1038572b7d [17:38:16][/CODE] |
2 Attachment(s)
[QUOTE=M344587487;479605]I'm vaguely aware that there are two main types of riser, powered and unpowered. Unpowered are more likely to have power distribution issues, powered takes a 6 pin from the PSU.[/QUOTE]
Thanks for your input (and kladner and preda too). It's a powered riser, like the two that are running NVIDIA gpus on other systems without any analogous issue. (Tiny pcie 1x pc goes in the motherboard connector, USB cable carries data, 16x connector PC attaches to GPU, separate SATA power cable adapter brings (most?) power to the gpu.)
Recent use in 3 modes, after some system restarts, driver removal / reinstall, and the addition of a second RX550, indicates there's something more subtle going on. An RX550 is affected even if there's no riser involved on it, plugged straight into an x16 slot on the motherboard. Actually, the sensor data loss predates the introduction of a riser onto the system.
Two RX550s in the same system (one of them the original that I saw the sensor data loss issue with, in an x16 motherboard slot; the other a newer card, on the riser) both displayed full sensor data on the local console. Great, I thought, maybe it's solved. Went back to using Windows remote desktop, and got the first image captured: sensors dropping out upon Win RD use.
Several experiments later, I conclude that Windows 7 remote desktop, AMD RX550 cards, MSI drivers shipped with the cards, and GPU-Z 2.7.0 seem to be an incompatible combination, causing sensor data to drop out for the duration of the remote desktop session (captured shortly after, using VNC remote). It's definitely not a gpuOwL issue, since it affects sensors when mfakto and cllucas are what is running on the cards.
And after getting some sleep, I discover that a VNC remote session instead of Windows remote desktop is even worse: it handles sensor data ok but bogs down both gpuOwL V1.9 and Mfakto 0.15pre6 progress terribly (orders of magnitude slower; looks hung when viewing the console session or log file).
mfakto:
[CODE]
date time | class compl | time ETA | GhzD/day sieve wait
Feb 08 01:34 | 1495 32.4% | 261.51 1d23h | 184.91 10045 0.00%
Feb 08 01:38 | 1507 32.5% | 261.67 1d23h | 184.79 10045 0.00% windows remote desktop
Feb 08 05:14 | 1512 32.6% | 12944 96d22h | 3.74 10045 0.00% <--vnc remote desktop
Feb 08 05:18 | 1515 32.7% | 260.54 1d22h | 185.59 10045 0.00% windows remote desktop
Feb 08 05:23 | 1516 32.8% | 260.87 1d22h | 185.36 10045 0.00%
[/CODE]
gpuowl:
[CODE]
OK 10500000 / 77231809 [13.60%], 10.86 ms/it [10.76, 11.76] CV 1.7%, check 7.07s; ETA 8d 09:15; f750ded934f0f571 [00:16:59]
OK 10550000 / 77231809 [13.66%], 10.85 ms/it [10.76, 11.48] CV 1.5%, check 7.13s; ETA 8d 08:59; 885573b8be067842 [00:26:09]
OK 10600000 / 77231809 [13.72%], 18.41 ms/it [10.76, 343.65] CV 186.7%, check 21.26s; ETA 14d 04:47; 21dd27b9d0bd1a89 [00:41:51]
OK 10650000 / 77231809 [13.79%], 43.52 ms/it [37.84, 543.91] CV 116.1%, check 21.08s; ETA 33d 12:53; 847b2e63fe97793d [01:18:28]
OK 10700000 / 77231809 [13.85%], 284.94 ms/it [10.98, 25364.74] CV 889.1%, check 7.65s; ETA 219d 09:57; d56fa42939327b03 [05:16:03]
OK 10750000 / 77231809 [13.92%], 11.46 ms/it [10.67, 12.57] CV 4.9%, check 6.97s; ETA 8d 19:33; 63a75d0f0c981062 [05:25:42]
OK 10800000 / 77231809 [13.98%], 10.75 ms/it [10.67, 11.45] CV 1.6%, check 7.52s; ETA 8d 06:27; 654cf68bce82d0dc [05:34:48]
[/CODE]
There may be lingering traces of NVIDIA drivers or some other system software problem. I may resort to an OS reinstall or some serious driver removal efforts as a test. I'm also skeptical about MSI live update and the downloadable latest driver at the moment. I'm in the habit of leaving multiple remote desktop sessions running a long time from a single laptop. Maybe, at least for this one AMD-gpu-worker system, that should change, at least for now. |
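To put the slowdown in the gpuowl log into perspective, here is a small Python sketch that re-derives an ETA from the iteration counter and the ms/it figure. It assumes the ETA is simply remaining iterations times the average time per iteration; that is a guess at the arithmetic, not a statement about how gpuowl computes it, and `estimate_eta` is a hypothetical helper.

```python
# Sketch: ETA from progress and ms/it, assuming
# ETA = (exponent - iteration) * ms_per_it. Not taken from the gpuowl source.
def estimate_eta(exponent, iteration, ms_per_it):
    """Return the estimated remaining time as (days, hours, minutes)."""
    seconds = int((exponent - iteration) * ms_per_it / 1000)
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    return days, hours, rest // 60


# Values copied from two lines of the log above.
normal = estimate_eta(77231809, 10500000, 10.86)    # before the VNC session
slowed = estimate_eta(77231809, 10700000, 284.94)   # during the VNC session

print("normal: %dd %02d:%02d" % normal)
print("slowed: %dd %02d:%02d" % slowed)
```

Both results land within a few minutes of the ETAs printed in the log; the small gap is expected because the logged ms/it is rounded to two decimals.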
[QUOTE=kladner;479590]Couldn't any added cable run, combined with more contact-to-contact junctions (plug<>jack), cause problems like driver time-outs and sync problems in general?[/QUOTE]
I don't know. I do know that with NVIDIA cards, there are TDRs on systems with no risers present. I think that was also the case with the system with one RX550, no riser. TDRs are a known problem in the gpu developer community, relating to kernel runtimes, judging by NVIDIA's and AMD's forums, as I recall. Maybe multiple causes increasing total frequency, though. |
[QUOTE=preda;479583]A wild guess: maybe the PCI extender has something to do with the GPU-freeze? (maybe you could try to move the freezing-GPU to a direct PCI slot and see if the behavior changes?)[/QUOTE]
I may try that. I've been avoiding the lowest x16 slot, near the 6-pin GPU power lead from the power supply, which I intend to fill with an RX480 when I find one at a decent price. (It looks like the best price/performance within the 150W power limit in the AMD benchmarks at mersenne.ca.) The RX550 is only ~50W, so it runs off the PCIe connector (rated up to 75W on a motherboard; most 1x/16x extenders are rated to 60W); the RX550 has no other connector for power than the card edge. I have an inexpensive unpowered x16 extender ordered, which will permit multiple tests:
1) Does unpowered vs powered matter for the mere 50W draw of an RX550?
2) Does a wider connection affect the few % of performance difference seen between my two RX550s of different design?
3) Does it make BSOD, TDR, or hang/slowdown worse, or seem not to matter? |
[QUOTE=kriesel;479625]I don't know. I do know that with NVIDIA cards, there are TDRs on systems with no risers present. I think that was also the case with the system with one RX550, no riser. TDRs are a known problem in the gpu developer community, relating to kernel runtimes, judging by NVIDIA's and AMD's forums, as I recall. Maybe multiple causes increasing total frequency, though.[/QUOTE]
I am relating via a fairly crude analogy. I have seen monitor extension cables produce ghosting on screen. However, those memories of mine may date back to totally analog (VGA) days. I am not sure of the results in the DVI/HDMI/DisplayPort universe. |