mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

2020-06-17, 20:23   #2344
ewmayer

Quote:
Originally Posted by PhilF
I'll stress the fans. They aren't that hard to change.
There's also noise to consider; in my case the pictured build is maybe 6 feet away from the dining room table. Before I bought the large desk fan I had to run the fans of the 2 rightmost GPUs high enough that their high-pitched whine rose well above white-noise level, which was an annoying intrusion on mealtime conversation. And in case of failure, the $15 desk fan is a whole lot easier to replace than one of the GPU fans. It does take up a bit more room, admittedly. :)

Last fiddled with by ewmayer on 2020-06-17 at 20:24
2020-06-20, 07:23   #2345
paulunderwood

I just upgraded to ROCm-3.5.1 and clinfo thereafter reported 0 devices found. Now trying to go back to 3.3.

After more shenanigans (apt-get autoremove rocm*, changing the apt sources file, and apt-get install rocm-dev3.3.0) I got clinfo to return to reason. I changed the gpuOwl Makefile and recompiled.

2 instances at 5.5M FFT with sclk 4:
3.5.0: 1312 µs/it
3.3.0: 1248 µs/it

A huge difference!
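
For scale, here is a minimal Python sketch of the arithmetic behind that remark, using just the two per-iteration timings quoted above:
Code:
# Quantify the ROCm 3.5.0 vs 3.3.0 difference from the timings above.
t_350 = 1312e-6  # seconds per iteration under ROCm 3.5.0
t_330 = 1248e-6  # seconds per iteration under ROCm 3.3.0

slowdown = t_350 / t_330 - 1.0
print(f"3.5.0 is {slowdown:.1%} slower per iteration than 3.3.0")  # ~5.1%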

Last fiddled with by paulunderwood on 2020-06-20 at 07:57
2020-06-20, 08:22   #2346
preda

Quote:
Originally Posted by paulunderwood
I just upgraded to ROCm-3.5.1 and clinfo thereafter reported 0 devices found. Now trying to go back to 3.3.

After more shenanigans (apt-get autoremove rocm*, changing the apt sources file, and apt-get install rocm-dev3.3.0) I got clinfo to return to reason. I changed the gpuOwl Makefile and recompiled.

2 instances at 5.5M FFT with sclk 4:
3.5.0: 1312 µs/it
3.3.0: 1248 µs/it

A huge difference!
I took the liberty of posting your data point on the ROCm issue I opened about this,
https://github.com/RadeonOpenCompute/ROCm/issues/1124
in the hope that it'll help the ROCm team focus a bit better on performance regressions.
2020-06-20, 21:12   #2347
ewmayer

I ran into some very interesting 1-job-vs-2-job timing results a few days ago while testing George's MIDDLE_MUL = 13,14,15 optimizations targeting FFT lengths 6.5, 7 and 7.5M. Kriesel had asked me to do an LL-TC of an exponent needing 8M FFT length, so I played around with the latest build to look for optimal run parameters. I had always wanted to see how badly 8M fares, due to the lack of an optimized MIDDLE_MUL = 16, relative to the smaller lengths above.

I suspended one of my 2 side-by-side PRPs @ 5.5M on my Haswell system's R7, leaving just 1 running, so the first set of data reflects 2-job run timings. I used the Mlucas max_p values, which are ~1% below the gpuowl ones, i.e. they trigger some higher-accuracy settings but not the max: all runs used -DMM_CHAIN=1u -DMM2_CHAIN=(1 @ 5,5.5,6.5,8M and 2 @ 6,7,7.5M) -DMAX_ACCURACY=1.

This was using gpuowl v6.11-321-gf485359-dirty; I checked that 5.5-7.5M used MIDDLE=11-15 and 8M used MIDDLE=8, as expected:

5.0M: 96517061 LL 10000 0.01%; 1146 us/it; ETA 1d 06:43; aa8d1c560f892242
5.5M: 105943729 LL 10000 0.01%; 1283 us/it; ETA 1d 13:45; f084ea07a2af855a
6.0M: 115351079 LL 10000 0.01%; 1355 us/it; ETA 1d 19:25; 49d490b9dd9af6db
6.5M: 124740743 LL 10000 0.01%; 2081 us/it; ETA 3d 00:06; 29bc4fc7dee72e54
7.0M: 134113981 LL 10000 0.01%; 2157 us/it; ETA 3d 08:22; 85a3bb21b2955ca8
7.5M: 143472103 LL 10000 0.01%; 2163 us/it; ETA 3d 14:12; 132a96c40d1d8a61
8.0M: 152816089 LL 10000 0.01%; 1849 us/it; ETA 3d 06:28; 079c9d38bd21940e

So 5-6M look good, and 8M looks good at just a smidge over 4/3 times the 6M timing, but 6.5-7.5M are ugly ... in fact, had I been able to run that 145.5M LL-DC at 7.5M as I briefly hoped (it's less than 1% above the gpuowl 7.5M limit), it would have been slower than at 8M! I sent the above to George; he said his timings for the same range were nicely monotonic, but with just 1 job running, which was apparently how he did his tunings when testing his code optimizations.

So I redid my timings with no other tasks using the GPU, and here things get very interesting:

Per-iteration times at 5-8M FFT were 630, 686, 756, 860, 903, 934, 1253 us/iter ... these are more along the lines I expected, assuming the build I'm using, gpuowl v6.11-321-gf485359-dirty, postdates George's MIDDLE_MUL=13-15 check-ins: a nice trend up through 7.5M, but 8M very bad.

I double-checked the 8M timing: it is indeed ~1250 us/iter for 1 job and jumps to ~1850 us/iter with my 5.5M-FFT PRP sharing cycles. IOW, running the aforementioned 8M-needing LL test as the only job on the R7 would be somewhat faster in wall-clock terms (2 days versus 3), but would have been a disaster in terms of total system throughput relative to 2-job mode.
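
A minimal Python sketch of that wall-clock arithmetic, using the exponent and per-iteration timings above (an LL test needs roughly p iterations):
Code:
# Wall-clock estimate for the 8M-FFT LL test of M(152816089), 1-job vs 2-job mode.
p = 152816089        # exponent; an LL test runs ~p squarings
t_solo   = 1253e-6   # seconds/iter with the 8M job alone on the R7
t_shared = 1849e-6   # seconds/iter with the 5.5M PRP sharing the GPU

def days(t_per_iter):
    # total wall-clock days for ~p iterations at t_per_iter seconds each
    return p * t_per_iter / 86400

print(f"solo:   {days(t_solo):.1f} days")    # ~2.2 days
print(f"shared: {days(t_shared):.1f} days")  # ~3.3 days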

Based on the wild contrast between the 1-job and 2-job numbers, we may need to give FFT-length-specific guidance on how many instances to run, starting at or just beyond 6M.

Last fiddled with by ewmayer on 2020-06-20 at 21:16
2020-06-20, 21:46   #2348
Prime95

Quote:
Originally Posted by ewmayer
5.0M: 96517061 LL 10000 0.01%; 1146 us/it; ETA 1d 06:43; aa8d1c560f892242
5.5M: 105943729 LL 10000 0.01%; 1283 us/it; ETA 1d 13:45; f084ea07a2af855a
6.0M: 115351079 LL 10000 0.01%; 1355 us/it; ETA 1d 19:25; 49d490b9dd9af6db
6.5M: 124740743 LL 10000 0.01%; 2081 us/it; ETA 3d 00:06; 29bc4fc7dee72e54
7.0M: 134113981 LL 10000 0.01%; 2157 us/it; ETA 3d 08:22; 85a3bb21b2955ca8
7.5M: 143472103 LL 10000 0.01%; 2163 us/it; ETA 3d 14:12; 132a96c40d1d8a61
8.0M: 152816089 LL 10000 0.01%; 1849 us/it; ETA 3d 06:28; 079c9d38bd21940e
You need to rerun this. This time, make sure that the still-running 5.5M job's timings do not change during each of the above tests. I've often seen nutty timings. Just yesterday, one of my 5M FFTs was running at 1060 us while the other 5M FFT was running at 1380 us. Later, one test ended and the next began; both jobs then changed to identical timings. Same total throughput.
2020-06-20, 22:46   #2349
kriesel

Quote:
Originally Posted by Prime95
I've often seen nutty timings.
I've seen unbalanced, unpredictable "gpu timesharing" on other gpu models, including NVIDIA, and with other applications too, especially when the tasks sharing the gpu become more disparate. To compare fairly (a worked sketch follows below):
App A (exponent A, worktype A) alone on the gpu gives one throughput measure: A iters/sec.
App B (exponent B, worktype B) alone on the gpu gives another: B iters/sec.
If the fft lengths differ, we should probably compare iters x length / sec instead.
App A's throughput while running in tandem is C iters/sec (or whatever unit).
App B's throughput while running in tandem is D iters/sec, measured during the same time period as C. That's important, since the blend can fluctuate back and forth over time in some cases.
EXPECT one task's throughput to change when its competition for resources on the gpu changes.
The combined relative throughput of a tandem session is C/A + D/B; with each ratio expressed in consistent units the sum is dimensionless, so compare it to unity.
Continue tandem if it is above unity; switch to solo if not.
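
A minimal Python sketch of that tandem-vs-solo bookkeeping; the four rates are hypothetical, and in practice each would be a measured iters/sec (scaled by fft length if the lengths differ):
Code:
# Tandem-vs-solo throughput comparison, per the recipe above.
# All rates in iters/sec; scale each by its fft length first if the lengths differ.
A = 1000.0   # app A solo (hypothetical)
B =  800.0   # app B solo (hypothetical)
C =  600.0   # app A in tandem (hypothetical; same measurement window as D)
D =  550.0   # app B in tandem (hypothetical)

relative = C / A + D / B   # dimensionless; 1.0 means tandem matches solo throughput
print(f"combined relative throughput: {relative:.2f}")   # 1.29 with these numbers
print("keep running tandem" if relative > 1.0 else "switch to solo")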

Last fiddled with by kriesel on 2020-06-20 at 23:02
2020-06-21, 01:27   #2350
ewmayer

Quote:
Originally Posted by Prime95
You need to rerun this. This time, make sure that the still-running 5.5M job's timings do not change during each of the above tests. I've often seen nutty timings. Just yesterday, one of my 5M FFTs was running at 1060 us while the other 5M FFT was running at 1380 us. Later, one test ended and the next began; both jobs then changed to identical timings. Same total throughput.
Good point. I suspended my ongoing 8M LL-TC, leaving just the 5.5M-FFT PRP background job running, then re-ran each of the LL timing runs with '-iters 10000' deleted, letting each do 1M iterations in order to get a decent timing sample. Here are the results, all timings in us/iter. The 'Timing' column is that of each 5-8M LL run; the '5.5M' column is that of the background 5.5M-FFT PRP during that LL run. To nondimensionalize things in terms of total throughput for the combination of both runs, I inverted all the per-iter timings to convert them to iter/sec, multiplied each resulting datum by (FFT length / 5M), added the 2 resulting data in each row to get an unnormalized total-throughput number, and then normalized by dividing all of these column-3 numbers by the one in row 1. Thus in row 1 we invert the 2 per-iter timings to get 883 and 783 iter/sec, multiply each of those by (FFT length / 5M) to get 883 and 861 5M-equivalent iter/sec, and add those 2 to get 1744 5M-equivalent iter/sec, which is then normalized to 1.000 and used to normalize the other rows as well. The rightmost '1-job' column comes from my timings for the left-column runs with nothing else competing for R7 cycles (630, 686, 756, 860, 903, 934, 1253 us/iter @ 5-8M), inverted to iter/sec and multiplied by (FFT length / 5M)/1744, in order to provide an apples-to-apples comparison between 1-job and 2-job throughput. In the rightmost 2 columns, larger numbers mean better total throughput:
Code:
                                   Timing  5.5M  5M-equiv  1-job
../gpuowl -fft 5M -ll 96517061       1133  1277     1.000  0.910
../gpuowl -fft 5.5M -ll 105943729    1272  1232     1.008  0.919
../gpuowl -fft 6M -ll 115351079      1354  1303     0.992  0.910
../gpuowl -fft 6.5M -ll 124740743    2066  1018     0.980  0.867
../gpuowl -fft 7M -ll 134113981      2145  1032     0.985  0.889
../gpuowl -fft 7.5M -ll 143472103    2150  1080     0.984  0.921
../gpuowl -fft 8M -ll 152816089      1833  1500     0.921  0.732
Thus total-throughput numbers now give a consistent picture of 8M being worse than the smaller FFT lengths, but 2-job run mode remains consistently better across all these lengths, and is dramatically better than 1-job at 8M.
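
To make the normalization above easy to check or reuse, here is a minimal Python sketch that reproduces the 5M-equiv and 1-job columns from the raw per-iteration timings (all numbers copied from the table and text above):
Code:
# Reproduce the "5M-equiv" and "1-job" columns from the per-iter timings above.
# Each row: (FFT length in M, 2-job LL timing, companion 5.5M PRP timing,
#            1-job LL timing), all in us/iter.
rows = [
    (5.0, 1133, 1277,  630),
    (5.5, 1272, 1232,  686),
    (6.0, 1354, 1303,  756),
    (6.5, 2066, 1018,  860),
    (7.0, 2145, 1032,  903),
    (7.5, 2150, 1080,  934),
    (8.0, 1833, 1500, 1253),
]

def equiv(us_per_iter, fft):
    # Convert us/iter to 5M-FFT-equivalent iterations per second.
    return (1e6 / us_per_iter) * (fft / 5.0)

base = equiv(rows[0][1], 5.0) + equiv(rows[0][2], 5.5)   # ~1744, the row-1 total
for fft, t_ll, t_prp, t_solo in rows:
    two_job = (equiv(t_ll, fft) + equiv(t_prp, 5.5)) / base
    one_job = equiv(t_solo, fft) / base
    print(f"{fft:>4}M  2-job: {two_job:.3f}  1-job: {one_job:.3f}")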
2020-06-21, 02:47   #2351
kriesel

Quote:
Originally Posted by ewmayer
Thus total-throughput numbers now give a consistent picture of 8M being worse than the smaller FFT lengths, but 2-job run mode remains consistently better across all these lengths, and is dramatically better than 1-job at 8M.
Try running 8M & 8M in tandem. Also, per the attachment at post 2272, the GHzD/day roll-off continues to very high fft lengths (on Windows 10).

Last fiddled with by kriesel on 2020-06-21 at 02:48
2020-06-21, 04:13   #2352
ewmayer

Quote:
Originally Posted by kriesel
Try running 8M & 8M in tandem. Also, per the attachment at post 2272, the GHzD/day roll-off continues to very high fft lengths (on Windows 10).
You want more data? There's another way for you to get them, Mr. fellow-R7-user ... hint, hint.
2020-06-22, 18:14   #2353
Jan S

Small question - P-1

When I use gpuowl for P-1 factoring, does stage 2 use system RAM or video RAM?
2020-06-22, 18:37   #2354
Uncwilly

gpuOwl is all video card based, IIRC.