
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

ewmayer 2020-06-17 20:23

[QUOTE=PhilF;548229]I'll stress the fans. They aren't that hard to change. :tire:[/QUOTE]

There's also noise to consider - in my case the pictured build is maybe 6 feet away from the dining room table. Before I bought the large desk fan I was having to run the fans of the 2 rightmost GPUs high enough that their resulting high-pitched whine rose well-above white-noise level, which was an annoying intrusion on mealtime conversation. And in case of failure, the $15 desk fan is a whole lot easier to replace than one of the GPU ones. Does take up just a bit more room, admittedly. :)

paulunderwood 2020-06-20 07:23

I just upgraded to ROCm-3.5.1 and clinfo thereafter said 0 devices found. Now trying to go back to 3.3. :rant:

After more shenanigans, i.e. [C]apt-get autoremove rocm*[/C], changing the apt sources file, and [C]apt-get install rocm-dev3.3.0[/C], I got clinfo to return to reason. I changed the gpuOwl Makefile and recompiled.

2 instances at 5.5M FFT with sclk 4
3.5.0: 1312 µs/it
3.3.0: 1248 µs/it

A huge difference!
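In relative terms, that gap is about a 5% per-iteration regression; a quick back-of-the-envelope check in Python (my numbers copied from the timings above, not gpuowl output):

```python
# Per-iteration slowdown of ROCm 3.5.0 relative to 3.3.0,
# using the 5.5M-FFT, 2-instance timings quoted above.
t_350 = 1312  # us/it under ROCm 3.5.0
t_330 = 1248  # us/it under ROCm 3.3.0

slowdown_pct = (t_350 - t_330) / t_330 * 100
print(f"ROCm 3.5.0 is {slowdown_pct:.1f}% slower per iteration")  # -> 5.1%
```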

preda 2020-06-20 08:22

[QUOTE=paulunderwood;548571]I just upgraded to ROCm-3.5.1 and clinfo thereafter said 0 devices found. Now trying to go back to 3.3. :rant:

After more shenanigans, i.e. [C]apt-get autoremove rocm*[/C], changing the apt sources file, and [C]apt-get install rocm-dev3.3.0[/C], I got clinfo to return to reason. I changed the gpuOwl Makefile and recompiled.

2 instances at 5.5M FFT with sclk 4
3.5.0: 1312 µs/it
3.3.0: 1248 µs/it

A huge difference![/QUOTE]

I took the liberty of posting your datapoint on the ROCm issue I opened about this
[url]https://github.com/RadeonOpenCompute/ROCm/issues/1124[/url]

in the hope that it'll help the ROCm team focus a bit better on performance regressions.

ewmayer 2020-06-20 21:12

I ran into some very interesting 1-job-or-2 timing results a few days ago when testing George's MIDDLE_MUL = 13,14,15 optimizations targeting FFT lengths 6.5,7,7.5M. Kriesel had asked me to do an LL-TC of [url=https://www.mersenne.org/report_exponent/?exp_lo=145500007&full=1]an expo needing 8M FFT length[/url], so I played around with the latest build to look for optimal run params. I'd always wanted to see how bad 8M was, due to the lack of an optimized MIDDLE_MUL = 16, relative to the above smaller lengths.

I suspended one of my 2 side-by-side PRPs@5.5M on my Haswell system's R7, leaving just 1 running, so the first set of data reflects 2-job run timings. I used the Mlucas max_p values, which are ~1% below the gpuowl ones, i.e. they trigger some higher-accuracy settings but not the max: all runs used -DMM_CHAIN=1u, -DMM2_CHAIN=(1 @5,5.5,6.5,8M and 2 @6,7,7.5M), and -DMAX_ACCURACY=1.

This was using gpuowl v6.11-321-gf485359-dirty, I checked that 5.5-7.5M use MIDDLE=11-15, 8M used MIDDLE=8, as expected:

5.0M: 96517061 LL 10000 0.01%; 1146 us/it; ETA 1d 06:43; aa8d1c560f892242
5.5M: 105943729 LL 10000 0.01%; 1283 us/it; ETA 1d 13:45; f084ea07a2af855a
6.0M: 115351079 LL 10000 0.01%; 1355 us/it; ETA 1d 19:25; 49d490b9dd9af6db
6.5M: 124740743 LL 10000 0.01%; 2081 us/it; ETA 3d 00:06; 29bc4fc7dee72e54
7.0M: 134113981 LL 10000 0.01%; 2157 us/it; ETA 3d 08:22; 85a3bb21b2955ca8
7.5M: 143472103 LL 10000 0.01%; 2163 us/it; ETA 3d 14:12; 132a96c40d1d8a61
8.0M: 152816089 LL 10000 0.01%; 1849 us/it; ETA 3d 06:28; 079c9d38bd21940e

So 5-6M look good, and 8M looks good at just a smidge over 4/3 times the 6M timing, but 6.5-7.5M are ugly ... in fact, had I been able to run that 145.5M LL-DC at 7.5M as I briefly hoped (it's less than 1% above the gpuowl 7.5M limit), it would have been slower than at 8M! I sent the above to George; he said his timings for the same range were nicely monotonic, but [b]with just 1 job running[/b]. That was apparently how he did his tunings when testing his code optimizations.

So I redid my timings with no other tasks using the GPU, and here things get very interesting:

Per-iter times at 5-8M FFT were 630,686,756,860,903,934,1253 us/iter ... these are more along the lines I expected, assuming the build I'm using, gpuowl v6.11-321-gf485359-dirty, postdates George's MIDDLE_MUL=13-15 checkins: a nice trend up through 7.5M, but 8M very bad.

I double-checked the 8M timing, it is indeed 1250 us/iter for 1-job and only jumps to 1850 us/iter with my 5.5M-FFT PRP sharing cycles. IOW, running the aforementioned 8M-needing LL test as the only job on the R7 would be somewhat wall-clock faster - 2 days versus 3 - but would have been a disaster in terms of total system throughput, relative to 2-job mode.

Based on the wild contrast between the 1-job and 2-job numbers, we may need to give FFT-length-specific #instances-to-use guidance starting at or just beyond 6M.

Prime95 2020-06-20 21:46

[QUOTE=ewmayer;548648]5.0M: 96517061 LL 10000 0.01%; 1146 us/it; ETA 1d 06:43; aa8d1c560f892242
5.5M: 105943729 LL 10000 0.01%; 1283 us/it; ETA 1d 13:45; f084ea07a2af855a
6.0M: 115351079 LL 10000 0.01%; 1355 us/it; ETA 1d 19:25; 49d490b9dd9af6db
6.5M: 124740743 LL 10000 0.01%; 2081 us/it; ETA 3d 00:06; 29bc4fc7dee72e54
7.0M: 134113981 LL 10000 0.01%; 2157 us/it; ETA 3d 08:22; 85a3bb21b2955ca8
7.5M: 143472103 LL 10000 0.01%; 2163 us/it; ETA 3d 14:12; 132a96c40d1d8a61
8.0M: 152816089 LL 10000 0.01%; 1849 us/it; ETA 3d 06:28; 079c9d38bd21940e[/QUOTE]

You need to rerun this. This time make sure that the still-running 5.5M job's timings do not change during each of the above tests. I've often seen nutty timings. Just yesterday, one of my 5M FFTs was running at 1060us while the other 5M FFT was running at 1380us. Later one test ended and the next began - both jobs then changed to identical timings. Same total throughput.

kriesel 2020-06-20 22:46

[QUOTE=Prime95;548654]I've often seen nutty timings.[/QUOTE]I've seen unbalanced, unpredictable "gpu timesharing" on other gpu models including NVIDIA, and on other applications too, especially when the running tasks become more disparate.
App A (exponent A, worktype A) alone on the gpu is one throughput measure: A iters/sec.
App B (exponent B, worktype B) alone on the gpu is another: B iters/sec.
If the fft lengths are different, maybe we should compare iters x length / sec.
App A's throughput in tandem is C iters/sec or whatever.
App B's throughput in tandem is D iters/sec, measured during the same time period as C. That's important, since the blend can fluctuate back and forth over time in some cases.
EXPECT one task's throughput to change when its competition for resources on the gpu changes.
Combined throughput of a tandem session relative to solo runs is C/A + D/B, which should be expressed in consistent units and so is dimensionless overall; compare it to unity.
Continue tandem if above unity; switch to solo if not.
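The rule above can be sketched in a few lines of Python (the function names and sample numbers are mine, purely illustrative):

```python
def fft_normalized(iters_per_sec, fft_length_m, ref_length_m=5.0):
    """Scale a throughput figure by FFT length so jobs at different
    lengths are compared in the same units (e.g. 5M-equivalent iters/sec)."""
    return iters_per_sec * (fft_length_m / ref_length_m)

def keep_tandem(a_solo, b_solo, c_tandem, d_tandem):
    """C/A + D/B in consistent units is dimensionless; tandem wins if > 1."""
    return c_tandem / a_solo + d_tandem / b_solo > 1.0

# Made-up example: each job does 1000 iters/sec solo but 600 iters/sec in
# tandem, so combined relative throughput is 0.6 + 0.6 = 1.2 -> keep tandem.
print(keep_tandem(a_solo=1000, b_solo=1000, c_tandem=600, d_tandem=600))  # True
```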

ewmayer 2020-06-21 01:27

[QUOTE=Prime95;548654]You need to rerun this. This time make sure that the still-running 5.5M job's timings do not change during each of the above tests. I've often seen nutty timings. Just yesterday, one of my 5M FFTs was running at 1060us while the other 5M FFT was running at 1380us. Later one test ended and the next began - both jobs then changed to identical timings. Same total throughput.[/QUOTE]

Good point. I suspended my ongoing 8M LL-TC, leaving just the 5.5M PRP background job running, then re-ran each of the LL-test timing runs with '-iters 10000' deleted, letting each do 1M iters in order to get a decent timing sample. Here are the results, all timings in us/iter. The 'Timing' col is that of each 5-8M LL run; '5.5M' is that of the background 5.5M-FFT PRP run during the LL timing run.

To nondimensionalize things in terms of total throughput for the combo of both runs, I inverted all the per-iter timings to convert to iter/sec, multiplied each resulting datum by (FFT-length/5M), added the 2 resulting data for each row to get an unnormalized total-throughput number, then normalized by dividing all these col-3 numbers by the one in row 1. Thus in row 1 we invert the 2 per-iter timings to get 883 and 783 iter/sec, multiply each of those by (FFT-length/5M) to get 883 and 861 5M-equivalent iter/sec, and add those 2 to get 1744 5M-equivalent iter/sec, which is then normalized to 1.00 and used to normalize the other rows as well. The rightmost '1-job' column is my timings for the left-column-described runs with nothing else competing for R7 cycles (630,686,756,860,903,934,1253 us/iter @5-8M), inverted to get iter/sec and multiplied by (FFT-length/5M)/1744, in order to provide an apples-to-apples comparison between 1-job and 2-job throughput. In the rightmost 2 cols, larger numbers mean better total throughput:
[code] Timing 5.5M: 5M-equiv: 1-job:
../gpuowl -fft 5M -ll 96517061 1133 1277 1.000 0.910
../gpuowl -fft 5.5M -ll 105943729 1272 1232 1.008 0.919
../gpuowl -fft 6M -ll 115351079 1354 1303 0.992 0.910
../gpuowl -fft 6.5M -ll 124740743 2066 1018 0.980 0.867
../gpuowl -fft 7M -ll 134113981 2145 1032 0.985 0.889
../gpuowl -fft 7.5M -ll 143472103 2150 1080 0.984 0.921
../gpuowl -fft 8M -ll 152816089 1833 1500 0.921 0.732[/code]
Thus total-throughput numbers now give a consistent picture of 8M being worse than the smaller FFT lengths, but 2-job run mode remains consistently better across all these lengths, and is dramatically better than 1-job at 8M.
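For anyone who wants to check the arithmetic, here's a short Python script (mine, not gpuowl output) that reproduces the '5M-equiv' column from the raw 2-job timings in the table above:

```python
# (FFT length in M, LL-job us/it, background 5.5M-PRP us/it), from the table above.
runs = [
    (5.0, 1133, 1277),
    (5.5, 1272, 1232),
    (6.0, 1354, 1303),
    (6.5, 2066, 1018),
    (7.0, 2145, 1032),
    (7.5, 2150, 1080),
    (8.0, 1833, 1500),
]

def equiv_rate(us_per_iter, fft_m):
    # Invert us/iter to iter/sec, then scale to 5M-equivalent iterations.
    return (1e6 / us_per_iter) * (fft_m / 5.0)

totals = [equiv_rate(t_ll, fft) + equiv_rate(t_prp, 5.5) for fft, t_ll, t_prp in runs]
base = totals[0]  # ~1744 5M-equivalent iter/sec, normalized to 1.000
for (fft, _, _), tot in zip(runs, totals):
    print(f"{fft}M: {tot / base:.3f}")
# prints 1.000, 1.008, 0.992, 0.980, 0.985, 0.984, 0.921 -- the '5M-equiv' column
```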

kriesel 2020-06-21 02:47

[QUOTE=ewmayer;548668]Thus total-throughput numbers now give a consistent picture of 8M being worse than the smaller FFT lengths, but 2-job run mode remains consistently better across all these lengths, and is dramatically better than 1-job at 8M.[/QUOTE]Try running 8 & 8 tandem. Also per the attachment at [URL="https://mersenneforum.org/showpost.php?p=547248&postcount=2272"]post 2272 [/URL]the GhzD/day roll-off continues to very high fft length (on Windows 10).

ewmayer 2020-06-21 04:13

[QUOTE=kriesel;548675]Try running 8 & 8 tandem. Also per the attachment at [URL="https://mersenneforum.org/showpost.php?p=547248&postcount=2272"]post 2272 [/URL]the GhzD/day roll-off continues to very high fft length (on Windows 10).[/QUOTE]

If you want more data, there's another way for you to get them, Mr. fellow-R7-user, hint, hint.

Jan S 2020-06-22 18:14

Small question about P-1:

When I use gpuowl for P-1 factoring, does stage 2 use system RAM or video RAM?

Uncwilly 2020-06-22 18:37

gpuOwl is all video card based, IIRC.



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.