mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2014-11-06, 22:09   #1266
Bdot
 
 
Nov 2010
Germany


Quote:
Originally Posted by kracker View Post
Also, I ran "regular" TF with the same ini for an hour, no crashes. Running --perftest crashes, "2" works...
Can you try regular TF with time per class over 10 seconds (DC range, or higher bitlevels)?

If "2" works, would "5" or "8" work as well? And can you please post the results of such a run with the critical tests reduced to a working number?

Would increasing TdrDelay help? The perftest dumps the whole test onto the GPU at once - something that the "normal" TF does not do.
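For reference, TdrDelay is the Windows registry value that controls how long the driver lets a GPU kernel run before the timeout-detection-and-recovery reset kicks in (2 seconds by default). A .reg fragment like the following raises it to 30 seconds (0x1e is just an example value); a reboot is needed for it to take effect:

```reg
Windows Registry Editor Version 5.00

; Raise the GPU timeout from the 2-second default to 30 seconds
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000001e
```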

Quote:
Originally Posted by kracker View Post
That's the max width I have.
Sometimes, size matters.
Attached Thumbnails
size.png (130.0 KB)

Last fiddled with by Bdot on 2014-11-06 at 22:22
Old 2014-11-06, 22:38   #1267
NickOfTime
 
Apr 2014


Quote:
Originally Posted by Bdot View Post
Before I saw the test results of your x16 vs. x8 PCIe cards, I would have said that the bus etc. does not have any influence on mfakto. But seeing the x16 card consistently a few percent ahead of its x8 counterpart suggests it does make a difference.

Do I understand it right that each instance, when running alone would give ~725GHz, but when starting the other instance the speed drops to ~450GHz per card?

In this case I'd say this is AMD's PowerTune technology in action. Maybe you can use Catalyst Control Center to set the power target a few percent higher and watch whether the GHz-d/d output increases accordingly? But be careful if you do that over a longer period of time: the additional heat generation can be significant. I don't have a good explanation why the speed would drop below the 0.14 level, though. Maybe the 15-bit kernels have a bigger share of simple instructions that do not generate so much heat?
Edit: which settings did you use for this parallel TF test? Your files suggest you should be using something similar to m-gs-128-32.ini or m-gs-fulltest.ini for maximum performance.
OK, I removed the GTX 690 that was in slot 2 and that Windows had disabled due to a driver error, bringing both cards back to x16 at the same temps. That fixed the overall performance issues...

72-bit, 128/32, GCN: 600 each concurrently, 625 individually.
72-bit, 128/32, GCN3: 730 each concurrently, 740 individually. Some black square artifacts on screen...

74-bit, 128/32, GCN3, full TF run concurrently:
no factor for M70384891 from 2^73 to 2^74 [mfakto 0.15pre5-Win cl_barrett32_76_gs_2]
tf(): total time spent: 53m 11.196s (735.87 GHz-days / day)
no factor for M70384891 from 2^73 to 2^74 [mfakto 0.15pre5-Win cl_barrett32_76_gs_2]
tf(): total time spent: 53m 15.578s (734.86 GHz-days / day)

Though with GCN3 I get some black square artifacts flickering on screen; I'll try updating Catalyst from 14.3 to 14.9 and see if they still appear...

Edit: the artifacts go away with 14.9, and concurrent processing stabilizes; Windows/screen response is as if the CPU is busy (or it could be closing GPU-Z) and then set to idle...

Date Time    | class Pct   | time  ETA    | GHz-d/day Sieve Wait
Nov 06 16:51 | 464   9.8%  | 3.775 54m29s | 647.99    80181 0.00%
Nov 06 16:51 | 465   9.9%  | 3.859 55m38s | 633.88    80181 0.00%
IDLE
Nov 06 16:52 | 468   10.0% | 3.415 49m11s | 716.30    80181 0.00%
Nov 06 16:52 | 473   10.1% | 3.384 48m40s | 722.86    80181 0.00%
Nov 06 16:52 | 476   10.2% | 3.392 48m44s | 721.15    80181 0.00%

Date Time    | class Pct   | time  ETA    | GHz-d/day Sieve Wait
Nov 06 16:51 | 564   12.2% | 3.805 53m28s | 642.88    80181 0.00%
Nov 06 16:52 | 569   12.3% | 3.592 50m24s | 681.00    80181 0.00%
IDLE
Nov 06 16:52 | 576   12.4% | 3.384 47m26s | 722.86    80181 0.00%
Nov 06 16:52 | 581   12.5% | 3.393 47m30s | 720.94    80181 0.00%
Nov 06 16:52 | 585   12.6% | 3.379 47m15s | 723.93    80181 0.00%

Edit2: Yep, definitely faster with the CPU idle (or at least not running P-1)... A 10% overclock brings it up to 810 GHz-d/day each...

Edit3: yep, it was Prime95 running P-1 on 4 cores that reduced performance; after switching to BOINC (Primeboinca) with 8 cores @ 100%, the GPUs still have max performance...

Last fiddled with by NickOfTime on 2014-11-06 at 23:19
Old 2014-11-06, 22:45   #1268
Stef42
 
Feb 2012
the Netherlands


Quote:
Originally Posted by Bdot View Post
Thank you. These results also show signs of reaching the power limit. In the GCN2 test, cl_barrett15_82 is even faster than cl_barrett15_73/74 - this can only happen because cl_barrett15_82 was run before the others and was executed at higher clock speed.

Please also check out the Catalyst Control Center's overclocking department (Graphics Overdrive). The safe method is to downclock the core, say from 1070 MHz (is that your clock speed?) to 1000 MHz, and see what single tests (like "mfakto -i m-cpu-GCN2.ini --perftest") return. If the results are almost unchanged, then you had reached the power limit. If they drop by ~6%, then you had not. The other way to test it would be to increase the power target to +10% and see if the results are better.

In case it is not the power target but the GPU temperature, you can try to manually set the fan speed to 80% and check if that yields an improvement. If so, then probably the cooler is not correctly seated (or poor thermal paste does not transfer the heat quickly enough).

Another indication of throttling are the fulltest's GPU sieve results. The very first test with the smallest sieve comes out fastest. Normally, way bigger sieve sizes are needed for best performance. It's just that the first test ran at full speed for a longer time.

On the other hand, by simple scaling of my HD7950's clock and number of compute elements, you should come out 11% ahead of me. In the GCN and GCN2 tests, this can be observed for the more difficult tasks, like 4000M. For the 2M TF, my card is 20% ahead. It almost looks like your card refuses to go faster than 500GHz-days/day - no matter which kernel. For 2M, they all have the same speed!
Thanks for analysing so closely. I'm kinda confused right now. First of all: the cards run in a 15°C ambient room.
One GPU is at 79°C, the other at 66°C. The hotter one runs at 1070 MHz (top card), the other at 1000 MHz (bottom).
MSI Afterburner shows no sign of automatic downclocking, though I'm not sure it would detect it.
With TF from 71 to 72 bits on a 70M exponent I can get as high as 430 GHz-d/day. CPU bottleneck? I'm using the latest official drivers.
Old 2014-11-06, 23:56   #1269
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA


Quote:
Originally Posted by Bdot View Post
Would increasing TdrDelay help? The perftest dumps the whole test onto the GPU at once - something that the "normal" TF does not do.
Thank you, that seems to work!

Quote:
Originally Posted by Bdot View Post
Sometimes, size matters.

Running tests now.... FYI they look quite similar to the 260X....
Old 2014-11-07, 03:54   #1270
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand


Quote:
Originally Posted by Stef42 View Post
One GPU is 79C, the other 66C. The hotter one runs at 1070mhz (Top), the other at 1000mhz (buttom).
The top/bottom difference may be more important than the clock difference. Try changing the clocks, or swapping the cards if you can, and see what happens. OTOH, 80°C is not "hot" at all for an HD card (my HD7970, air cooled, is always at this temperature during the day, when the ambient rises to around 28°C; it has been like that for years and is still working).
Old 2014-11-07, 18:44   #1271
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA


Attached Files
File Type: zip APU_6550D.zip (47.2 KB, 71 views)
File Type: zip R9_285_Tonga.zip (53.3 KB, 81 views)
Old 2014-11-07, 19:45   #1272
Bdot
 
 
Nov 2010
Germany


Quote:
Originally Posted by Stef42 View Post
MSI Afterburner shows no sign of automatic downclocking. I'm not sure if it will detect it.
That is an important question: who would detect it? For my 7950, when I set the power target too low, all tools still show the nominal clock speed; however, mfakto slows down. That is why I wrote the hocus-pocus about trying different power limits in CCC.
Old 2014-11-07, 21:24   #1273
Bdot
 
 
Nov 2010
Germany


Quote:
Originally Posted by kracker View Post
These results, again, change a few things - thank you so much. I'll start with the APU results as they are a bit more consistent.

GPU sieving on the VLIW architecture is inefficient - CPU sieving (SievePrimes=25000) is 25% faster than GPU sieving (GPUSievePrimes=80181). The GPU sieve test suggests the optimal GPUSievePrimes value should be around 45000, but see below.

The most important lesson here is that I can drop the stand-alone GPU-sieve-speed test. While it may be interesting in some ways, the "sweet spot" of GPUSieveSize vs. SievePrimes that it tries to find is of no value. I assume it is because of a relatively small cache (or rather the cache being shared with the CPU) that running only the sieve kernel is more efficient than the interlocked sieving and TF of real tests. Therefore "real" tests prefer a smaller GPUSieveSize - 64 MBit seems to be the optimum.

After all, who cares how fast the GPU sieving alone is - the combined speed is what counts. I'll create a "default" perftest that only measures the combined speed and skips all the other stuff.


The R285 results are a bit confusing, especially when comparing them to your R260x results. First of all, R285 has the slow int32 and slow DP multiplication, so I have to move Tonga into the GCN category - same as Bonaire. However, compared to Bonaire, it seems to have bigger/faster/more efficient caches as the maximum speed is achieved when maxing out GPUSieveSize(128) and GPUSieveProcessSize(32). Bonaire definitely needs GPUSieveProcessSize=24 along with GPUSieveSize=126. The optimal kernel selection also differs sometimes ...
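Putting the values from this discussion together, a tuned ini fragment for a Tonga card might look like the following. The option names are the ones used in mfakto.ini; the numbers are those found above, and, as noted, other GPUs (e.g. Bonaire) want different ones:

```ini
; Illustrative mfakto.ini fragment for R9 285 (Tonga), per the discussion above
SieveOnGPU=1
GPUSievePrimes=80181
GPUSieveSize=128
GPUSieveProcessSize=32

; Bonaire would instead want:
;GPUSieveSize=126
;GPUSieveProcessSize=24
```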

Just comparing compute units and clock speed, R285 to R260x should be a factor of 1.82. For small exponents we see 1.9 - 1.95, for larger exponents even 2.4 - 2.55. This may be related to the improved memory interface of the 285. I have not seen real 15GB/s transfer speed from CPU to GPU before ...
Old 2014-11-07, 21:52   #1274
kracker
 
 
"Mr. Meeseeks"
Jan 2012
California, USA


Quote:
Originally Posted by Bdot View Post
... The optimal kernel selection also differs sometimes ...
Maybe a "personal" kernel ordering would be best at this point? I.e., do a quick test of all the kernels on first launch, write the ordering to a file, and read from that file on later launches? Just an idea that I have...

Last fiddled with by kracker on 2014-11-07 at 21:52
Old 2014-11-16, 17:13   #1275
AK76
 
Sep 2014


I changed my computer and my results are much better.
My new platform is an Asus Z97-AR with an i7-4790K and 2×8 GB 2400 MHz DDR3.
My old motherboard+CPU limited my 290 card so much...
Attached Files
File Type: zip R9 290.zip (52.5 KB, 83 views)
Old 2014-11-16, 20:30   #1276
AK76
 
Sep 2014


One question: I noticed that the barrett32 kernels are faster than barrett15. How do I change the mfakto settings to use the "32" kernels?