mersenneforum.org  

Old 2012-03-27, 17:56   #1706
axn
 

Quote:
Originally Posted by BigBrother View Post
Well, the card is now inserted into a PCI-E 2.0 x16 slot, and my brain-surgery skills allowed me to fix a bent pin on the CPU socket, so my memory is back to dual channel again.

One instance of mfaktc is now taking about 70% GPU instead of the 74% I reported yesterday, and nVidia's Visual Profiler shows transfer rates of 6 GB/s instead of 3 GB/s, but since the amount of data to transfer is relatively small, there's no earth-shattering improvement. I could re-run yesterday's benchmark if James would like me to.
Improved memory should have a more pronounced impact on CUDALucas.
Old 2012-03-27, 20:20   #1707
Dubslow

Quote:
Originally Posted by James Heinrich View Post
You can if you click the zoom in/out links I just added.
Would it be possible to somehow overlay the current TF bounds (http://www.mersenne.org/various/math.php, plus 3 bits) on top of the chart? It would be so pretty

Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have highest weighting, 4xx next highest, and then everything else a lower weighting.

Edit: Perhaps a mod should move all the posts relating to James' new page to a separate "TF vs. LL" thread in this forum?

Last fiddled with by Dubslow on 2012-03-27 at 20:22
Old 2012-03-27, 20:48   #1708
James Heinrich
 

Quote:
Originally Posted by Dubslow View Post
Would it be possible to somehow overlay the current TF bounds (http://www.mersenne.org/various/math.php, plus 3 bits) on top of the chart? It would be so pretty
I've put the current PrimeNet CPU-TF limits on the chart as orange.

Quote:
Originally Posted by Dubslow View Post
Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have highest weighting, 4xx next highest, and then everything else a lower weighting.
Breakeven points don't depend on a GPU's absolute performance, only on the relative performance of mfaktc vs. CUDALucas for that compute version. Relative CUDALucas-vs-mfaktc performance may mean some cards need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: 3.0, 2.1, 2.0, 1.3.
Old 2012-03-27, 21:09   #1709
Dubslow

Quote:
Originally Posted by James Heinrich View Post
I've put the current PrimeNet CPU-TF limits on the chart as orange.
...pretty
Quote:
Originally Posted by James Heinrich View Post
Breakeven points don't depend on a GPU's absolute performance, only on the relative performance of mfaktc vs. CUDALucas for that compute version. Relative CUDALucas-vs-mfaktc performance may mean some cards need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: 3.0, 2.1, 2.0, 1.3.
Heh, I didn't realize it was the same numbers, but now I do. 2.1 has slightly more conservative TF bounds, but otherwise matches up fairly well with 2.0. Now, perhaps this should be put in the GPU272 forum, but I think that project should retain PrimeNet's TF bounds, unless you, James, can modify the assignment rules (somehow I doubt that). If we do that, then +3 bits is the conservative goal, and +4 bits would be aggressive TF bounds. I vote for the former, because many of the cyan cells are in fact well above 200, and it is GIMPS, not GIMFS, as petrw1 has pointed out elsewhere.
Old 2012-03-27, 21:42   #1710
rcv
 

Quote:
Originally Posted by msft View Post
I swear I searched NVIDIA's Web site on Thursday and Friday after the announcement, but found no new technical details. [Just pointers to new drivers.]

The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow. And mfaktc does use C-style shifts in the inner loop. I still suspect there may be an occupancy issue that is halving the performance.

@BigBrother: Are you up to running the nvvp profiler?

@msft: Thanks for the link!!!
Old 2012-03-28, 00:51   #1711
Prime95

Quote:
Originally Posted by rcv View Post
The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow.
I'd say they are about 20 times slower than they should be!! 32-bit muls are much faster than shift lefts! Repeated adds are much faster than small shift lefts. Algorithms may have to change to avoid shift rights.
Old 2012-03-28, 03:09   #1712
Prime95

I also noticed that type conversion is dreadfully slow. Try to minimize these.

Does anyone know if add.cc runs on 168 cores, or does it get restricted to 32 or, even worse, 8 cores?

Could bfe (bit field extract) be used as a replacement for the slow shift right?

In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?

Last fiddled with by Prime95 on 2012-03-28 at 03:11
Old 2012-03-28, 04:13   #1713
rcv
 

After years of coaching developers to change their multiplies into shifts, NVIDIA may now be coaching developers to change their shifts back into multiplies. Even a shift right (by a constant) might be performed by a mul.hi instruction. How ironic.

Quote:
Originally Posted by Prime95 View Post
In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?
I've not done it, but I've read that NVIDIA provides a disassembler that shows the honest-and-true (post PTX) machine code that is executed. But, as far as I know, NVIDIA doesn't document the low-level machine code.

I suppose that would answer questions such as "does the bit-field-extract PTX macro generate a single instruction or a series of instructions?"
Old 2012-03-28, 04:44   #1714
LaurV

Quote:
Originally Posted by James Heinrich View Post
I've put the current PrimeNet CPU-TF limits on the chart as orange.

Breakeven points don't depend on a GPU's absolute performance, only on the relative performance of mfaktc vs. CUDALucas for that compute version. Relative CUDALucas-vs-mfaktc performance may mean some cards need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: 3.0, 2.1, 2.0, 1.3.
Man, those graphics are wonderful! And they perfectly match my cards and my calculations (despite the fact that I never submitted results to your site). Kotgw!
Old 2012-03-28, 19:52   #1715
apsen
 

Quote:
Originally Posted by James Heinrich View Post
But there's only 4 patterns there, one for each compute version: 3.0, 2.1, 2.0, 1.3.
What are the numbers in the cells?
Old 2012-03-28, 20:16   #1716
James Heinrich
 

Quote:
Originally Posted by apsen View Post
What are the numbers in the cells?
As I (probably poorly) tried to explain in the text above the graph: 100 means that the time spent on TF, combined with the probability of finding a factor, gives an equal chance of clearing an exponent either by TF to that bit level or by L-L'ing it. 200 is double that, which factors in that 2x L-L tests are needed; it does not factor in the smaller amounts of triple-checking and P-1 testing that a factor might save. My interpretation is that TF should be done to the 200 mark, or a little higher. Since "200" will rarely fall exactly on an integer bit level (the actual breakeven point for "100" is in the rightmost column), TF to the rounded-up bit level is appropriate when 2 L-Ls would be saved. If only 1 L-L would be saved, then TF to 1 bit level less (half the TF effort).