mersenneforum.org  

Old 2014-09-09, 08:42   #1200
Bdot
 
Nov 2010
Germany

3×199 Posts

Quote:
Originally Posted by Bdot
I need to see if the same kernel is selected as before ... Maybe I did something wrong with the kernel precedence for APUs ... I'll check your logs and come back to that separately.

The correct kernel was selected, and everything else also seems to be OK. More detailed testing has shown that a performance improvement for GCN has adverse effects on VLIW5.

Thanks again, Jayder, for pointing out this issue to me. I'm not yet sure how to address this, though ...

This is in line with another observation regarding the use of double precision: on my HD7950 it improves performance by 5%, while on the HD7850 performance drops by 7%. It looks like a lot of device-dependent #ifdefs need to go into the kernel files, which I have tried to avoid so far (the IntelHD bugs were the first to require that). I may also need to create a separate device class for the high-end GCNs because of their faster DP performance.

Thank you all for your offers to test ... with these additional changes coming, I think it makes no sense to send out a test version right now.

I'll come back to you ...
Old 2014-09-09, 13:23   #1201
VictordeHolland
 
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

Quote:
Originally Posted by Bdot
Finally I managed to create a 74-bit kernel that helps straighten out the performance of mfakto when the factor sizes increase (it moves out the big drop one more bit). My HD7950@1100MHz now runs 100M candidates:

bits : GHz-days/day
67-68: 448
68-69: 476
69-70: 459
70-71: 416
71-72: 417
72-73: 418
73-74: 408 <== the new one, was 361 before
74-82: 361

Attempts to achieve this using a new 5x16-bit kernel or an improved Montgomery kernel yielded slow results. The solution is a "4x15-bit + 1x16-bit" kernel ...

Great news, well done!
Old 2014-10-07, 13:52   #1202
AK76
 
Sep 2014

19₁₀ Posts

Quote:
Originally Posted by Bdot
My HD7950@1100MHz now runs 100M candidate.

Does 100M mean 2^100M or 100M digits?
Old 2014-10-08, 06:14   #1203
snme2pm1
 
"Graham uses ISO 8601"
Mar 2014
AU, Sydney

3⁵ Posts

Quote:
Originally Posted by AK76
Does 100M mean 2^100M or 100M digits?

The production rates quoted for that HD7950 at 1100MHz are not far from the rates of my HD7950 at 1000MHz working on exponents in the 118 million range (i.e. 2^p - 1 with p around 118 million).
I've been watching this space in anticipation of a new release for 74-bit exploration, but I can probably only use it out of hours, since I've never been able to adequately work around the lack of responsiveness.
Old 2014-10-08, 07:42   #1204
Bdot
 
Nov 2010
Germany

3·199 Posts

Quote:
Originally Posted by AK76
Does 100M mean 2^100M or 100M digits?

I used an exponent of around 100M for that test (2^p - 1 with p around 100 million).

Quote:
Originally Posted by snme2pm1
The production rates quoted for that HD7950 at 1100MHz are not far from the rates of my HD7950 at 1000MHz working on exponents in the 118 million range (i.e. 2^p - 1 with p around 118 million).
I've been watching this space in anticipation of a new release for 74-bit exploration, but I can probably only use it out of hours, since I've never been able to adequately work around the lack of responsiveness.

There are a few parameters you can tweak for better responsiveness: a low but non-zero FlushInterval (3, 2, 1), a lower GPUSieveSize and a lower GPUSieveProcessSize should each help. I'd try them in this order to get the most responsiveness gain per unit of performance lost.
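
To make the tuning order concrete, here is a minimal sketch of what the first step could look like, assuming the settings live in mfakto.ini and that `#` starts a comment there. Only the FlushInterval value comes from the suggestion above; no specific values are given for the other two keys because the right numbers depend on what you are currently running.

```ini
# Sketch only: adjust one setting at a time and re-check responsiveness.

# Step 1: low but non-zero FlushInterval. Start at 3, then try 2, then 1.
FlushInterval=3

# Step 2 (only if still sluggish): set GPUSieveSize lower than your current value.
# Step 3 (only if still sluggish): set GPUSieveProcessSize lower than your current value.
```

Changing one parameter per run makes it easier to see which change bought the responsiveness and how much throughput it cost.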

The new release will have to wait a bit longer as I currently have very little time to work on it (and there are still a few things to do).

Last fiddled with by Bdot on 2014-10-08 at 07:46 Reason: FlushInterval: start with 3 downwards
Old 2014-10-09, 17:42   #1205
AK76
 
Sep 2014

19 Posts

On my ATI R9 290 I use FlushInterval=0. Values of 3, 2 and 1 work much worse than 0.

GPUSieveSize=5 or 6

GPUSieveProcessSize=16

GPUSievePrimes= between 30000 and 80000

For example, today I ran a 70M exponent at bits 71-72: rate 2500M/s, 430 GHz-days/day.

Soon I will test 100M candidates at different bit levels, to compare my GPU's performance with Bdot's 7950.
Old 2014-10-09, 19:01   #1206
VictordeHolland
 
"Victor de Hollander"
Aug 2011
the Netherlands

2³×3×7² Posts

Quote:
Originally Posted by AK76
On my ATI R9 290 I use FlushInterval=0. Values of 3, 2 and 1 work much worse than 0.

Less responsive or less throughput (or both)?
Old 2014-10-09, 20:27   #1207
AK76
 
Sep 2014

19 Posts

Less throughput.
Old 2014-10-10, 22:27   #1208
Bdot
 
Nov 2010
Germany

3×199 Posts

Ah, thanks for that clarification!

Yes, of course, the best throughput is achieved when the GPU is not shared with anything, especially not 3D games or screen updates in general.

My suggestion was meant towards a more responsive system at the cost of as little as possible throughput.

Regarding your performance measurements: throughput should scale linearly with the GPU clock speed (or shader clock). Memory clock has very little influence.
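
As a rough illustration of that linear-scaling rule, here is a small Python sketch. The function name `scale_throughput` is made up for this example, and the estimate assumes the same GPU model and settings, with memory clock ignored as described above. The input figure is the 67-68 bit rate quoted earlier in the thread for the HD7950 at 1100MHz.

```python
def scale_throughput(ghz_days_per_day: float,
                     clock_mhz: float,
                     target_clock_mhz: float) -> float:
    """Estimate throughput at another core clock, assuming purely linear scaling."""
    if clock_mhz <= 0 or target_clock_mhz <= 0:
        raise ValueError("clock speeds must be positive")
    return ghz_days_per_day * target_clock_mhz / clock_mhz


if __name__ == "__main__":
    # HD7950 @ 1100 MHz, 67-68 bits: 448 GHz-days/day (figure quoted in the thread).
    estimate = scale_throughput(448.0, 1100.0, 1000.0)
    print(f"Estimated at 1000 MHz: {estimate:.0f} GHz-days/day")
```

Normalising results to a common clock this way makes comparisons between two cards of the same model a bit fairer; it says nothing about differences between architectures.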
Old 2014-10-11, 08:10   #1209
snme2pm1
 
"Graham uses ISO 8601"
Mar 2014
AU, Sydney

F3₁₆ Posts

Quote:
Originally Posted by Bdot
My suggestion was meant towards a more responsive system at the cost of as little as possible throughput.

I've read that over several times, but can't convince myself that it cannot be entirely misunderstood.
I recognise that the writer's first language is not necessarily English.
I suspect that various minor changes of wording might have conveyed the intended meaning better.
One such modification would be to state the intention as a more responsive system at the cost of as little throughput reduction as possible.
Old 2014-10-11, 12:31   #1210
Bdot
 
Nov 2010
Germany

1125₈ Posts

Oh ... I see. Thank you for trying to extract what I really meant (I wish all forum members did that consistently). I probably wanted to say something like "the responsiveness improvement should cost you as little performance as possible" ...

And you're totally right about English not being my first language. It was actually the third I started to learn.

Thinking about all this again, it would probably be much easier for me to aim for as little throughput as possible than what I have been trying over the past few years.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.