[QUOTE=kriesel;491728]f) benchmark in the various V3.x lengths, and start over in, or switch midstream to, the fastest suitable V3.3 fft length.
[/QUOTE] Because there are very big jumps between the supported FFT sizes (4-5-8-10-16-20), and because these particular NPOTs are quite good performance-wise, no "performance inversion" is expected at all, i.e. a situation where a larger FFT would be faster than a smaller one. So the fastest FFT is the smallest FFT that can handle the exponent. Of course anybody can experiment, e.g. using -fft +/-1 to verify this. (The situation is different in CUDA, where the FFT sizes are "compact" (small jumps) and some sizes are particularly inefficient, so sometimes it makes sense to go to a slightly larger FFT than strictly needed.) |
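The rule "the fastest FFT is the smallest FFT that can handle the exponent" can be sketched as below. This is only an illustration, not gpuOwL's actual selection code: the function name `smallest_fft`, the single `max_bits_per_word` threshold, and the reading of "1M" as 2^20 words are all assumptions. In practice the usable bits/word limit depends on the FFT size and the implementation.

```python
# Supported FFT sizes in units of M words, as listed in the post above.
SUPPORTED_FFT_M = (4, 5, 8, 10, 16, 20)


def smallest_fft(exponent, max_bits_per_word, words_per_m=2**20):
    """Return the smallest supported FFT size (in M words) whose
    bits/word stays within max_bits_per_word, or None if none fits.

    max_bits_per_word is a hypothetical single threshold; a real
    implementation would likely use a limit that varies with FFT size.
    """
    for size_m in SUPPORTED_FFT_M:
        bits = exponent / (size_m * words_per_m)
        if bits <= max_bits_per_word:
            return size_m
    return None


if __name__ == "__main__":
    # With an assumed limit of 20 bits/word, an 85M exponent does not
    # fit in 4M (about 20.27 bits/word) and falls through to 5M.
    print(smallest_fft(85_000_000, 20.0))
```

Since no performance inversion is expected among these sizes, a linear scan from the smallest size upward is all the selection needs.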
[QUOTE=kriesel;491719]Preda's air cooled Vega 64, as indicated in his prior posts.
Asking about V1.9/2 compatibility to V3.3 was a question to Preda, to address SELROC's previously posted concern about compatibility. A clear statement on compatibility from Preda, who wrote and tested the code, would settle it, in my opinion. Or are you saying, SELROC, that you've tested with v1.9 or 2.x checkpoint files and the exponent restarts from iteration 0 in V3.3? I've thought your statements about it up to now were questions or doubts, not test results.[/QUOTE]
[QUOTE=kriesel;491728]That's sufficiently different from what I read and recalled, that I went back through this thread looking for what I apparently missed. I did not find such test results stated as such, in pages 22-44 of this thread. Perhaps it's in another thread? Or something. Anyway, thanks for the clear summary just now. And in English, since my grasp of Italian is approximately zero. So, now, with V3.3's additional fft lengths, in addition to a-d in [URL]http://www.mersenneforum.org/showpost.php?p=491433&postcount=465[/URL] (well, b&c seem still applicable) there's e) benchmark and compute whether it's quicker overall to finish an existing exponent in V1.9/2.0 or start over in V3.3 and (depending on whether the various V3.x fft lengths are compatible and can be changed on the fly) f) benchmark in the various V3.x lengths, and start over in, or switch midstream to, the fastest suitable V3.3 fft length. SELROC, would you assemble and post a similar timings table vs. version and fft length for your RX580?[/QUOTE]
This is what I think: as long as you select the same FFT size, v3.3 is faster than previous versions. The 4M size is too small for an 85M exponent; I tested it and it failed, with bits/word > 20. The 5M size is the fastest for 85M-86M exponents. |
In V1.9, -fft M61 -size 4M can handle exponents up to ~83.87M (some limit <83871433).
On an RX550, it is ~7.4-9.2% slower than V2.0 5000K DP; on an RX480, 4M M61 is faster than V2.0 5000K DP by about 6.3% (Windows 7 x64, same GPU, driver, etc.).
[CODE]gpuOwL v2.0- GPU Mersenne primality checker
Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
Note: using long carry and fused tail kernels
OpenCL compilation in 1598 ms, with " -DEXP=83871259u -I. -cl-fast-relaxed-math -cl-kernel-arg-info "
PRP-3: FFT 5000K (625 * 4096 * 2) of 83871259 (16.38 bits/word) [2018-07-11 09:20:32 Central Daylight Time]
OK 75500000 / 83871259 [90.02%], 5.25 ms/it [5.18, 5.81], check 4.76s; ETA 0d 12:12; 68406df4ababed55 [00:45:22]
OK 76000000 / 83871259 [90.61%], 5.25 ms/it [5.18, 5.72], check 4.80s; ETA 0d 11:28; 26f19477440580e1 [01:29:10]
OK 76500000 / 83871259 [91.21%], 5.25 ms/it [5.19, 5.82], check 4.79s; ETA 0d 10:45; e9bf0d6048a79a9b [02:13:00]
[/CODE]
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon (TM) RX 480 Graphics 36 @28:0.0, Ellesmere 1266MHz
OpenCL compilation in 9860 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=83870041u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
Warning: high word size of 20.00 bits may result in errors
Note: using long carry kernels
PRP-3: FFT 4M (1024 * 2048 * 2) of 83870041 (20.00 bits/word) [2018-07-13 13:41:07 Central Daylight Time]
OK 3710000 / 83870041 [ 4.42%], 4.92 ms/it [4.87, 5.34] CV 2.1%, check 4.09s; ETA 4d 13:33; 1683dd3b253ee628 [13:44:44]
OK 3720000 / 83870041 [ 4.44%], 4.93 ms/it [4.88, 5.35] CV 2.2%, check 3.89s; ETA 4d 13:41; d286de5a16debe4d [13:45:38]
OK 3740000 / 83870041 [ 4.46%], 4.92 ms/it [4.87, 5.36] CV 2.1%, check 4.08s; ETA 4d 13:35; b84105343fdda8a3 [13:47:20]
OK 3760000 / 83870041 [ 4.48%], 4.93 ms/it [4.87, 5.36] CV 2.3%, check 3.88s; ETA 4d 13:42; 9c0b1d4ddcf1bd2c [13:49:03]
OK 3780000 / 83870041 [ 4.51%], 4.93 ms/it [4.88, 5.29] CV 2.1%, check 4.03s; ETA 4d 13:46; 44b788f4b55511a2 [13:50:45]
OK 3800000 / 83870041 [ 4.53%], 4.93 ms/it [4.88, 5.28] CV 2.0%, check 3.90s; ETA 4d 13:37; 6b319d3fa95c5f7a [13:52:28]
OK 3850000 / 83870041 [ 4.59%], 4.95 ms/it [4.89, 5.81] CV 2.8%, check 3.91s; ETA 4d 13:56; 9fbf9129fb4516a8 [13:56:39]
OK 3900000 / 83870041 [ 4.65%], 4.99 ms/it [4.88, 6.78] CV 5.4%, check 3.96s; ETA 4d 14:55; 08250aa4bd6bf9c1 [14:00:53]
OK 3950000 / 83870041 [ 4.71%], 4.97 ms/it [4.88, 5.98] CV 3.3%, check 3.98s; ETA 4d 14:14; b6958fdc5d218140 [14:05:05]
OK 4000000 / 83870041 [ 4.77%], 4.93 ms/it [4.88, 5.39] CV 2.2%, check 3.88s; ETA 4d 13:27; 079471495f27e70b [14:09:15]
OK 4050000 / 83870041 [ 4.83%], 4.96 ms/it [4.87, 6.11] CV 3.8%, check 3.90s; ETA 4d 14:03; 8b6870aa4b408e7f [14:13:28]
OK 4100000 / 83870041 [ 4.89%], 4.98 ms/it [4.88, 5.53] CV 3.2%, check 3.90s; ETA 4d 14:16; 4cb590ea14c87340 [14:17:40]
OK 4150000 / 83870041 [ 4.95%], 4.93 ms/it [4.88, 5.38] CV 2.2%, check 4.09s; ETA 4d 13:12; ba6740911ef7be07 [14:21:51]
[/CODE]
(see also [URL]http://www.mersenneforum.org/showpost.php?p=484694&postcount=370[/URL]) |
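The "bits/word" figures in the two logs above are simply the exponent divided by the number of FFT words, and the RX480 speed difference can be checked from the ms/it columns. The sketch below reproduces both; the function names are hypothetical, and the 5.25 and 4.93 ms/it inputs are typical values read off the logs, which come from runs of slightly different exponents at different points of progress.

```python
def bits_per_word(exponent, fft_words):
    """Average number of exponent bits carried by each FFT word."""
    return exponent / fft_words


def time_saving(fast_ms, slow_ms):
    """Fraction of per-iteration time saved by the faster run."""
    return 1.0 - fast_ms / slow_ms


if __name__ == "__main__":
    # FFT dimensions exactly as printed in the log headers.
    print(f"{bits_per_word(83871259, 625 * 4096 * 2):.2f}")   # v2.0, 5000K
    print(f"{bits_per_word(83870041, 1024 * 2048 * 2):.2f}")  # v1.9, 4M
    # v1.9 4M M61 at 4.93 ms/it versus v2.0 5000K DP at 5.25 ms/it.
    print(f"{100 * time_saving(4.93, 5.25):.1f}% less time per iteration")
    print(f"{100 * (5.25 / 4.93 - 1):.1f}% more iterations per second")
```

The two percentages bracket the "about 6.3%" quoted in the post; which one applies depends on whether the comparison is made in time per iteration or in throughput.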
[QUOTE=kriesel;488170]SELROC, what gpu model(s) are you wanting to display stats for?
gpuOwL runs on Intel and AMD but not currently NVIDIA. Here are some things to try for gpu monitoring related to gpuOwL on linux: [URL]http://www.rkblog.rk.edu.pl/w/p/monitoring-amd-intel-and-nvidia-graphics-card-usage-under-linux/[/URL] [URL]https://support.amd.com/en-us/kb-articles/Pages/AMDSystemMonitor.aspx[/URL] is Windows 7 specific[/QUOTE] Never mind about the AMD system monitor utility. It is V1.0.0.9 from 2012 and reliably produced only an error message box on my AMD gpu system in Win 7 x64. |
[QUOTE=SELROC;491738]This is what I think: as long as you select the same FFT size, v3.3 is faster than previous versions.
The 4M size is too small for an 85M exponent; I tested it and it failed, with bits/word > 20. The 5M size is the fastest for 85M-86M exponents.[/QUOTE] Started a 150M exponent, gpuowl selected 8M FFT, timing is 6-7 ms/it, ETA 11d. |
GPU memory usage
I made a little theoretical estimation of the amount of GPU memory used by openowl.
When working with an FFT of size N (i.e. N words), the total amount of GPU memory is [a bit more than] 8*8*N bytes. For example when doing a 20M FFT, this comes to about 1.25GB, and for a 5M FFT about 320MB. |
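The estimate above is easy to reproduce. The sketch below only restates the 8*8*N rule from the post; the function name is hypothetical, "M" is taken as 2^20 words, and the sizes are printed in binary units (MiB/GiB), which is what the 1.25GB and 320MB figures correspond to. Actual usage is, as stated, a bit more than this lower bound.

```python
def estimate_gpu_bytes(fft_words):
    """Theoretical lower bound on GPU memory: 8 * 8 * N bytes."""
    return 8 * 8 * fft_words


if __name__ == "__main__":
    M = 2**20  # assumed meaning of "M" in FFT sizes such as 20M
    for size_m in (5, 20):
        total = estimate_gpu_bytes(size_m * M)
        print(f"{size_m}M FFT: {total / 2**20:.0f} MiB "
              f"({total / 2**30:.4f} GiB)")
```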
[QUOTE=preda;491781]I made a little theoretical estimation of the amount of GPU memory used by openowl.
When working with an FFT of size N (i.e. N words), the total amount of GPU memory is [a bit more than] 8*8*N bytes. For example when doing a 20M FFT, this comes to about 1.25GB, and for a 5M FFT about 320MB.[/QUOTE] Hi Mihai, I still have to find a GPU monitoring tool for Debian. Right now I am only monitoring temperatures of the GPU. Lm-sensors is so slow that it is almost unusable for adequate monitoring, and anyway it does not show GPU RAM. |
Hardware: Asus Radeon RX580 8G (Ellesmere)
[QUOTE=SELROC;491778]Started a 150M exponent, gpuowl selected 8M FFT, timing is 6-7 ms/it, ETA 11d.[/QUOTE] Started yet another exponent, at 300M this time: FFT 20M, timing 20-21 ms/it, ETA 69d. |
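The reported ETAs follow directly from the per-iteration timings, since the iteration count of a PRP test is essentially the exponent itself (the logs earlier in the thread count iterations up to the exponent). A rough sketch, with a hypothetical function name and ignoring the periodic check overhead:

```python
def eta_days(exponent, ms_per_iteration):
    """Approximate run time in days for a full test of the exponent,
    assuming one iteration per exponent bit and no check overhead."""
    seconds = exponent * ms_per_iteration / 1000.0
    return seconds / 86400.0


if __name__ == "__main__":
    # 150M exponent on the 8M FFT at 6-7 ms/it.
    print(f"{eta_days(150_000_000, 6.0):.1f} to "
          f"{eta_days(150_000_000, 7.0):.1f} days")
    # 300M exponent on the 20M FFT at 20-21 ms/it.
    print(f"{eta_days(300_000_000, 20.0):.1f} to "
          f"{eta_days(300_000_000, 21.0):.1f} days")
```

Both reported ETAs (11d and 69d) fall inside the ranges this gives.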
[QUOTE=SELROC;491785]Hardware: Asus Radeon RX580 8G (Ellesmere)
Started yet another exponent, at 300M this time: FFT 20M, timing 20-21 ms/it, ETA 69d.[/QUOTE] On 20M it may be worth doing slightly higher exponents, around 332M, which reach into the "100M digits" domain. You can get such exponents from the "manual assignments" page, "first time 100M digits PRP". |
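To see where the "100M digits" domain begins, the number of decimal digits of 2^p - 1 can be computed as floor(p * log10(2)) + 1. A small sketch (the function name is hypothetical, and the floating-point result could be off by one for exponents that land extremely close to a digit boundary):

```python
import math


def mersenne_digits(p):
    """Decimal digit count of 2**p - 1 for p >= 1."""
    return math.floor(p * math.log10(2)) + 1


if __name__ == "__main__":
    for p in (300_000_000, 332_000_000, 332_200_000):
        print(p, mersenne_digits(p))
```

By this formula an exponent has to be a little above 332.19M to cross 100 million digits, so a 300M exponent is well short of it.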
[QUOTE=SELROC;491783]Hi Mihai, I still have to find a GPU monitoring tool for Debian. Right now I am only monitoring temperatures of the GPU.
Lm-sensors is so slow that it is almost unusable for adequate monitoring, and anyway it does not show GPU RAM.[/QUOTE] I don't have a good solution myself. If you use ROCm, it may be an idea to submit a feature request to rocm-smi. I think some information about allocated GPU RAM can be gleaned from clinfo. That's why my memory info is "theoretical", not reported from the GPU. |
V3.3 openowl build fail on Windows; help?
1 Attachment(s)
In msys2/mingw64 (there does not seem to be a make available there), see the attachment for warnings/errors.