2020-06-17, 20:23  #2344
∂^{2}ω=0
Sep 2002
República de California
2^{2}·7·11·37 Posts 
There's also noise to consider: in my case the pictured build is maybe 6 feet away from the dining room table. Before I bought the large desk fan I was having to run the fans of the 2 rightmost GPUs high enough that their resulting high-pitched whine rose well above white-noise level, which was an annoying intrusion on mealtime conversation. And in case of failure, the $15 desk fan is a whole lot easier to replace than one of the GPU ones. Does take up just a bit more room, admittedly. :)
Last fiddled with by ewmayer on 2020-06-17 at 20:24
2020-06-20, 07:23  #2345
Sep 2002
Database er0rr
2^{2}×7^{2}×17 Posts 
I just upgraded to ROCm 3.5.1 and clinfo thereafter said 0 devices found. Now trying to go back to 3.3.
After more shenanigans, i.e. apt-get autoremove rocm*, changing the apt sources file and apt-get install rocm-dev3.3.0, I got my clinfo to return to reason. I changed the gpuOwl Makefile and recompiled.
2 instances at 5.5M FFT with sclk 4:
3.5.0: 1312 µs/it
3.3.0: 1248 µs/it
A huge difference!
Last fiddled with by paulunderwood on 2020-06-20 at 07:57
2020-06-20, 08:22  #2346
"Mihai Preda"
Apr 2015
480_{16} Posts 
Quote:
https://github.com/RadeonOpenCompute/ROCm/issues/1124 in the hope that'll help the ROCm team focus a bit better on performance regressions. 

2020-06-20, 21:12  #2347
∂^{2}ω=0
Sep 2002
República de California
2^{2}·7·11·37 Posts 
I ran into some very interesting 1-job-or-2 timing results a few days ago when testing George's MIDDLE_MUL = 13,14,15 optimizations targeting FFT lengths 6.5,7,7.5M. Kriesel had asked me to do an LLTC of an expo needing 8M FFT length, so I played around with the latest build to look for optimal run params. I always wanted to see how bad 8M was due to lack of an optimized MIDDLE_MUL = 16 relative to the above smaller ones.
I suspended one of my 2 side-by-side PRPs@5.5M on my Haswell system's R7, leaving just 1 running, so the first set of data reflects 2-job run timings. I used the Mlucas max_p values, which are ~1% below the gpuowl ones, i.e. they trigger some higher-accuracy settings but not the max: all runs ran at -DMM_CHAIN=1u -DMM2_CHAIN=(1 @5,5.5,6.5,8M and 2 @6,7,7.5M) -DMAX_ACCURACY=1. This was using gpuowl v6.11-321-gf485359-dirty; I checked that 5.5-7.5M use MIDDLE=11-15 and 8M uses MIDDLE=8, as expected:
5.0M:  96517061 LL 10000 0.01%; 1146 us/it; ETA 1d 06:43; aa8d1c560f892242
5.5M: 105943729 LL 10000 0.01%; 1283 us/it; ETA 1d 13:45; f084ea07a2af855a
6.0M: 115351079 LL 10000 0.01%; 1355 us/it; ETA 1d 19:25; 49d490b9dd9af6db
6.5M: 124740743 LL 10000 0.01%; 2081 us/it; ETA 3d 00:06; 29bc4fc7dee72e54
7.0M: 134113981 LL 10000 0.01%; 2157 us/it; ETA 3d 08:22; 85a3bb21b2955ca8
7.5M: 143472103 LL 10000 0.01%; 2163 us/it; ETA 3d 14:12; 132a96c40d1d8a61
8.0M: 152816089 LL 10000 0.01%; 1849 us/it; ETA 3d 06:28; 079c9d38bd21940e
So 5-6M look good, 8M looks good at just a smidge over 4/3 times the 6M timing, but 6.5-7.5M are ugly ... in fact had I been able to run that 145.5M LLDC at 7.5M as I briefly hoped (it's less than 1% above the gpuowl 8M limit) it would have been slower than at 8M!
Sent the above to George; he said his timings for the same range were nicely monotonic, but with just 1 job running. That was apparently how he did his tunings when testing his code optimizations. So I redid my timings with no other tasks using the GPU, and here things get very interesting:
Per-iter times at 5-8M FFT were 630, 686, 756, 860, 903, 934, 1253 us/iter ... these are more along the lines I expected, assuming the build I'm using, gpuowl v6.11-321-gf485359-dirty, post-dates George's MIDDLE_MUL=13-15 check-ins: up through 7.5M a nice trend, 8M very bad. I double-checked the 8M timing; it is indeed 1250 us/iter for 1-job and only jumps to 1850 us/iter with my 5.5M-FFT PRP sharing cycles.
IOW, running the aforementioned 8M-needing LL test as the only job on the R7 would be somewhat wall-clock faster (2 days versus 3) but would have been a disaster in terms of total system throughput, relative to 2-job mode. Based on the wild contrast between the 1-job and 2-job numbers, we may need to give FFT-length-specific #-instances-to-use guidance starting at or just beyond 6M.
Last fiddled with by ewmayer on 2020-06-20 at 21:16
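One rough way to see the contrast in the two sets of numbers above is to divide each 2-job timing by the 1-job timing at the same FFT length. The Python sketch below does only that; the dictionary and function names are made up for illustration, and the figures are copied from the timings quoted in this post.

```python
# Per-iteration timings in us/it, keyed by FFT length in M, as quoted above.
# two_job: LL test sharing the GPU with a 5.5M PRP; one_job: LL test alone.
two_job = {5.0: 1146, 5.5: 1283, 6.0: 1355, 6.5: 2081, 7.0: 2157, 7.5: 2163, 8.0: 1849}
one_job = {5.0: 630, 5.5: 686, 6.0: 756, 6.5: 860, 7.0: 903, 7.5: 934, 8.0: 1253}


def slowdown(fft_len):
    """Ratio of the 2-job timing to the 1-job timing at one FFT length."""
    return two_job[fft_len] / one_job[fft_len]


for fft_len in sorted(two_job):
    print(f"{fft_len:.1f}M: {slowdown(fft_len):.2f}x slower when sharing the GPU")
```

The ratio on its own does not settle which mode gives more total throughput, since that also depends on how fast the companion 5.5M job runs while sharing; it only shows that the penalty for sharing is far from uniform across FFT lengths.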
2020-06-20, 21:46  #2348
P90 years forever!
Aug 2002
Yeehaw, FL
7,013 Posts 
Quote:


2020-06-20, 22:46  #2349
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1071_{16} Posts 
I've seen unbalanced, unpredictable "gpu timesharing" on other gpu models including NVIDIA, and with other applications too, especially when the tasks running are more disparate.
App A, exponent A, worktype A alone on the gpu is one throughput measure; call it A iters/sec.
App B, exponent B, worktype B alone on the gpu is another; call it B iters/sec.
If the fft lengths are different, maybe we should compare iters x length / sec.
App A etc. throughput per unit time in tandem is C iters/sec or whatever.
App B etc. throughput per unit time in tandem is D iters/sec or whatever, measured during the same time period as C. That's important since the blend can fluctuate back and forth over time in some cases. EXPECT one task's throughput to change when its competition for resources on the gpu changes.
Combined relative throughput of a tandem session is C/A + D/B of a solo run, with each ratio expressed in consistent units so that the total is dimensionless; compare it to unity. Continue tandem if above unity, switch to solo if not.
Last fiddled with by kriesel on 2020-06-20 at 23:02
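The C/A + D/B rule above is simple enough to write down as a short Python sketch. The function names and the sample timings are hypothetical and not measurements from this thread; the sketch assumes you start from us/it figures as gpuowl reports them and convert them to iters/sec first.

```python
def us_per_iter_to_rate(us_per_iter):
    """Convert a per-iteration time in microseconds to iterations per second."""
    return 1e6 / us_per_iter


def combined_relative_throughput(solo_a, tandem_a, solo_b, tandem_b):
    """Return C/A + D/B, where A, B are solo rates and C, D are tandem rates.

    Each task's solo and tandem rates must use the same units.
    """
    return tandem_a / solo_a + tandem_b / solo_b


def should_run_tandem(solo_a, tandem_a, solo_b, tandem_b):
    """Tandem wins only if the combined relative throughput exceeds unity."""
    return combined_relative_throughput(solo_a, tandem_a, solo_b, tandem_b) > 1.0


# Hypothetical timings in us/it, chosen only to illustrate the arithmetic.
rate_a_solo = us_per_iter_to_rate(1000.0)
rate_a_tandem = us_per_iter_to_rate(1800.0)
rate_b_solo = us_per_iter_to_rate(1200.0)
rate_b_tandem = us_per_iter_to_rate(2000.0)

score = combined_relative_throughput(rate_a_solo, rate_a_tandem, rate_b_solo, rate_b_tandem)
print(f"combined relative throughput: {score:.3f}")
print("run in tandem" if score > 1.0 else "run solo")
```

Because each ratio compares a task only with itself, switching from iters/sec to iters x length / sec should leave the result unchanged as long as both rates for a given task are scaled the same way. The tandem rates C and D still need to come from the same time window.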
2020-06-21, 01:27  #2350
∂^{2}ω=0
Sep 2002
República de California
2^{2}×7×11×37 Posts 
Quote:
Code:
                                     Timing  5.5M:  5M-equiv:  1-job:
../gpuowl -fft 5M -ll 96517061        1133   1277     1.000    0.910
../gpuowl -fft 5.5M -ll 105943729     1272   1232     1.008    0.919
../gpuowl -fft 6M -ll 115351079       1354   1303     0.992    0.910
../gpuowl -fft 6.5M -ll 124740743     2066   1018     0.980    0.867
../gpuowl -fft 7M -ll 134113981       2145   1032     0.985    0.889
../gpuowl -fft 7.5M -ll 143472103     2150   1080     0.984    0.921
../gpuowl -fft 8M -ll 152816089       1833   1500     0.921    0.732

2020-06-21, 02:47  #2351
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
3·23·61 Posts 
Quote:
Last fiddled with by kriesel on 2020-06-21 at 02:48

2020-06-21, 04:13  #2352
∂^{2}ω=0
Sep 2002
República de California
11396_{10} Posts 
Quote:


2020-06-22, 18:14  #2353
Oct 2018
Slovakia
3·23 Posts 
Small question: P-1
When I use gpuOwl for P-1 factoring, does stage 2 use system RAM or video RAM?

2020-06-22, 18:37  #2354
6809 > 6502
"""""""""""""""""""
Aug 2003
101×103 Posts
2^{6}·131 Posts 
gpuOwl is all video card based, IIRC.
