[QUOTE=SELROC;491399]I am computing 85M-86M so 4M would fail (?), while 16M should slow down.
Looking forward to 5M (before automatic work fetch).[/QUOTE]
Or run v2.0, which has only 5000K, while you wait. It should work up to around 93M. It's around 5.25 ms/iter on an RX 480.
[CODE]gpuOwL v2.0- GPU Mersenne primality checker
Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
Note: using long carry and fused tail kernels
OpenCL compilation in 1544 ms, with " -DEXP=83871259u -I. -cl-fast-relaxed-math -cl-kernel-arg-info "
PRP-3: FFT 5000K (625 * 4096 * 2) of 83871259 (16.38 bits/word)
[2018-07-05 16:49:02 Central Daylight Time][/CODE]
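As a sanity check (my own arithmetic, not part of the log above): the 16.38 bits/word figure is just the exponent divided by the number of FFT words.

```python
# Back-of-envelope check of the "bits per word" figure from the gpuOwL log:
# an FFT of 5000K words holding an 83,871,259-bit Mersenne number.
fft_words = 625 * 4096 * 2          # 5,120,000 words ("5000K")
exponent = 83871259                 # bits in the number being tested
bits_per_word = exponent / fft_words
print(f"{bits_per_word:.2f} bits/word")  # -> 16.38, matching the log
```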
[QUOTE=kriesel;491414]Or run v2.0, which has only 5000K, while you wait. It should work up to around 93M. It's around 5.25 ms/iter on an RX 480.
[CODE]gpuOwL v2.0- GPU Mersenne primality checker
Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
Note: using long carry and fused tail kernels
OpenCL compilation in 1544 ms, with " -DEXP=83871259u -I. -cl-fast-relaxed-math -cl-kernel-arg-info "
PRP-3: FFT 5000K (625 * 4096 * 2) of 83871259 (16.38 bits/word)
[2018-07-05 16:49:02 Central Daylight Time][/CODE][/QUOTE]
I was tempted to try exactly that, but I think the old checkpoint format is incompatible with the new one...
[QUOTE=SELROC;491419]This is an attempt that I was tempted to make, but I think the old checkpoint format is incompatible with the new one...[/QUOTE]
Just thinking about some possibilities:
a) save a safety duplicate copy of the checkpoint file, then try the current exponent on v2.0
b) run the current exponent at 8M to completion in 3.x
c) switch to a different exponent that can run on v2.0 from the start at the 5000K FFT length
d) wait on the current exponent until Preda provides a 5M length in 3.x
[QUOTE=kriesel;491433]Just thinking about some possibilities:
a) save a safety duplicate copy of the checkpoint file, then try the current exponent on v2.0
b) run the current exponent at 8M to completion in 3.x
c) switch to a different exponent that can run on v2.0 from the start at the 5000K FFT length
d) wait on the current exponent until Preda provides a 5M length in 3.x[/QUOTE]
Currently doing option b :-)
[QUOTE=SELROC;491435]Currently doing option b :-)[/QUOTE]
gpuOwl memory consumption (smemstat -m):
[CODE]  PID    Swap     USS     PSS     RSS  D  User  Command
 1230   0.000  57.246  69.139  116.152  ↑  sel   ./openowl
 1301   0.000  55.887  67.527  113.723  ↑  sel   ./openowl
 1278   0.000  52.070  64.017  111.398  ↑  sel   ./openowl
 1253   0.000  51.504  63.448  110.934  ↑  sel   ./openowl
 1326   0.000  43.871  55.805  103.395  ↑  sel   ./openowl
 4078   0.000   0.246   0.445    2.848  ↑  root  smemstat -m
Note: Memory reported in units of megabytes.[/CODE]
openOwl news v3.3
Hi, I'm happy to bring some exciting news: I have upgraded openOwl's FFT framework to make incorporating some NPOT sizes easier, and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.
What is good is that two of these sizes are particularly useful: 5M for wavefront PRP (80M -- 96M), and 20M for "100 million digits" PRP. The speeds (on my Vega64: stock, air, 1400MHz, 150W) are roughly 2.5 ms/it for the 5M FFT and 9.77 ms/it for the 20M FFT.

The FFT size is by default chosen automatically based on the exponent, but it can also be "forced" on the command line with -fft, e.g. "-fft 8M", "-fft +1", or "-fft -1" (the last two use the next higher/lower size relative to the auto-selected one).

Another piece of news is that openOwl now uses a "rolling offset", which means that it dynamically changes the offset when an error is encountered. This trick allows continuing an exponent at the very upper edge of a given FFT size, where numerical errors are present. In my observations the benefit is small, allowing an exponent-size extension of less than 0.5%.

The "-block" command line argument sets the "block size" of the GEC (Gerbicz Error Checking). The values accepted now are 100, 200, 400. An error check is done every block^2 iterations (thus, 10K iterations for -block 100, and 160K iterations for -block 400). So a smaller block detects errors earlier, because it checks more often; the drawback is the cost, -block 100 having an overhead of roughly 3%, while -block 400 has an overhead of roughly 0.75%. The default is -block 200: overhead 1.5%, a check every 40K iterations. The block size can only be set when starting a new exponent; it is fixed afterwards for that exponent.

Bugs are expected.
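A small sketch of the -block arithmetic described above (my own illustration, not code from openOwl): the check interval grows with the square of the block size.

```python
# With GEC block size B, an error check runs every B^2 iterations;
# the quoted overhead figures shrink roughly as B grows.
for block in (100, 200, 400):
    check_every = block * block          # iterations between GEC checks
    print(f"-block {block}: check every {check_every} iterations")
# -block 100: check every 10000 iterations
# -block 200: check every 40000 iterations
# -block 400: check every 160000 iterations
```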
[QUOTE=preda;491680]I have upgraded openOwl's FFT framework to make incorporating some NPOT sizes easier; and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.[/QUOTE]
Is it difficult to do a factor-3 FFT? That would give you 3M & 6M (although 3M wouldn't be useful for PRP tests, as there are no candidates available for it).
[QUOTE=preda;491680]Hi, I'm happy to bring some exciting news: I have upgraded openOwl's FFT framework to make incorporating some NPOT sizes easier; and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.
[...][/QUOTE]
How do those timings compare to the power-of-2 FFTs?
[QUOTE=axn;491687]Is it difficult to do a factor 3 FFT? That would give you 3M & 6M (although 3M wouldn't be useful for PRP tests as there are no candidates available for it).[/QUOTE]
It *should* be easy to replace the "5" with either 3 or 3*3. But a 6M FFT is not really "hot" yet (it may become useful later, when the wavefront moves past the 5M FFT, but that may take years). OTOH 9 would allow an 18M FFT, which might be fastest for 100M digits.
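To illustrate (my own sketch, not openOwl code; the helper name is hypothetical): with one odd "middle" factor m on top of a power-of-two framework, the reachable FFT lengths are m*2^k, so factors 1 and 5 give today's list, and adding 3 and 9 would bring in 3M, 6M, 9M, 18M, and so on.

```python
# Enumerate FFT sizes (in units of M words) reachable as m * 2^k for
# each available odd "middle step" factor m, up to a size limit.
def fft_sizes_M(middles, max_M=20):
    out = set()
    for m in middles:
        size = m
        while size <= max_M:
            out.add(size)
            size *= 2
    # sizes below 3M aren't interesting for current exponents
    return sorted(s for s in out if s >= 3)

print(fft_sizes_M((1, 5)))          # -> [4, 5, 8, 10, 16, 20]
print(fft_sizes_M((1, 3, 5, 9)))    # adds 3, 6, 9, 12, 18
```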
[QUOTE=preda;491680]Hi, I'm happy to bring some exciting news: I have upgraded openOwl's FFT framework to make incorporating some NPOT sizes easier; and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.
[...][/QUOTE]
Currently testing the latest version. It selected the 5M FFT on the current exponent (85M). The timing is 4-5 ms/it. Waiting for completion.
[QUOTE=henryzz;491689]How do those timings compare to the power of 2 ffts?[/QUOTE]
(Roughly) 16M FFT: 7.8 ms/it; 20M FFT: 9.8 ms/it. So it scales mostly linearly with the FFT size, which is about the best I could hope for. In fact, under 10 ms/it for a 100M-digit PRP is not a bad baseline.
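Checking that claim with the quoted numbers (my own arithmetic, not from the post): the timing ratio is very close to the size ratio.

```python
# Compare the 16M -> 20M timing ratio against pure linear scaling in FFT size.
t16, t20 = 7.8, 9.8        # ms/iteration, as quoted above
print(t20 / t16)           # ~1.256
print(20 / 16)             # 1.25 -- near-linear scaling with size
```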