#1706 | |
Jun 2003
2×3×7×11² Posts
Quote:
#1707 |
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
1110000110101₂ Posts
Would it be possible to somehow overlay the current TF bounds (http://www.mersenne.org/various/math.php, plus 3 bits) on top of the chart? It would be so pretty.
Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have the highest weighting, the 4xx the next highest, and everything else a lower weighting.
Edit: Perhaps a mod should move all the posts relating to James' new page to a separate "TF vs. LL" thread in this forum?
Last fiddled with by Dubslow on 2012-03-27 at 20:22
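One way to build the "overall" chart asked about above is a throughput-weighted mean of the per-GPU breakeven bit levels. This is a minimal sketch only: the function name, the breakeven values, and the weights are hypothetical and do not come from James' page.

```python
def weighted_breakeven(gpus):
    """Throughput-weighted mean of per-GPU breakeven bit levels.

    gpus: list of (breakeven_bits, weight) pairs, where weight is any
    non-negative measure of how much TF work that GPU family contributes.
    """
    total_weight = sum(weight for _, weight in gpus)
    if total_weight <= 0:
        raise ValueError("at least one GPU needs a positive weight")
    return sum(bits * weight for bits, weight in gpus) / total_weight


# Hypothetical numbers, for illustration only.
example = [
    (74.0, 3.0),  # "5xx"-class card, highest weighting
    (73.0, 2.0),  # "4xx"-class card, next highest
    (72.0, 1.0),  # everything else, lowest weighting
]
overall = weighted_breakeven(example)
```

The hard part remains choosing the weights; the arithmetic itself is just a weighted mean.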
#1708 | ||
"James Heinrich"
May 2004
ex-Northern Ontario
11·311 Posts |
Quote:
Quote:
#1709 | ||
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
16065₈ Posts
Quote:
Quote:
#1710 | |
Dec 2011
10001111₂ Posts
Quote:
The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow. And mfaktc does use C-style shifts in the inner loop. I still suspect there may be an occupancy issue that is halving the performance.
@BigBrother: Are you up to running the nvvp profiler?
@msft: Thanks for the link!!!
#1711 |
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts |
I'd say they are about 20 times slower than they should be!! 32-bit muls are much faster than shift lefts! Repeated adds are much faster than small shift lefts. Algorithms may have to change to avoid shift rights.
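The replacements mentioned above (a multiply or repeated adds in place of a left shift) rest on simple identities for 32-bit unsigned values. The sketch below only checks that arithmetic in Python, with helper names invented for illustration; it says nothing about how fast any of these forms run on a GPU.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound


def shl(x, k):
    """Reference: plain left shift, truncated to 32 bits."""
    return (x << k) & MASK32


def shl_by_mul(x, k):
    """x << k written as a 32-bit multiply by 2**k (0 <= k <= 31)."""
    return (x * (1 << k)) & MASK32


def shl_by_adds(x, k):
    """x << k written as k doublings, each one an addition."""
    for _ in range(k):
        x = (x + x) & MASK32
    return x
```

For small shift counts the add form needs only one or two instructions, which is presumably why repeated adds can win for small left shifts.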
#1712 |
P90 years forever!
Aug 2002
Yeehaw, FL
16561₈ Posts
I also noticed that type conversion is dreadfully slow. Try to minimize these.
Does anyone know if add.cc runs on 168 cores, or does it get restricted to 32 or, even worse, 8 cores?
Could bfe (bit field extract) be used as a replacement for the slow shift right?
In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?
Last fiddled with by Prime95 on 2012-03-28 at 03:11
#1713 | |
Dec 2011
217₈ Posts
After years of coaching developers to change their multiplies to shifts, NVIDIA may now be coaching developers to change their shifts back to multiplies. Even a shift right (by a constant) might be performed by a mul.hi instruction. How ironic.
Quote:
I suppose that would answer questions such as "does the bit-field-extract PTX macro generate a single instruction or a series of instructions?" |
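To make the two shift-right alternatives discussed here concrete, the sketch below checks in Python that a right shift by a constant equals both the high half of a multiply and a bit-field extract. The helpers are simplified models written for this illustration (for instance, `bfe_u32` ignores the clamping rules of the real PTX instruction), and whether either form is a single hardware instruction is exactly the open question above.

```python
MASK32 = 0xFFFFFFFF


def mul_hi_u32(a, b):
    """Upper 32 bits of the 64-bit product of two 32-bit unsigned values."""
    return ((a & MASK32) * (b & MASK32)) >> 32


def shr_by_mul_hi(x, k):
    """x >> k as the high half of x * 2**(32 - k); needs 1 <= k <= 31
    so that the multiplier still fits in 32 bits."""
    if not 1 <= k <= 31:
        raise ValueError("k must be in 1..31")
    return mul_hi_u32(x, 1 << (32 - k))


def bfe_u32(x, pos, length):
    """Simplified bit-field extract: `length` bits of x starting at bit
    `pos`, assuming pos + length <= 32."""
    return ((x & MASK32) >> pos) & ((1 << length) - 1)


def shr_by_bfe(x, k):
    """x >> k as an extract of the top (32 - k) bits."""
    return bfe_u32(x, k, 32 - k)
```

The mul.hi form is exact because floor(x · 2^(32−k) / 2^32) equals floor(x / 2^k) for any 32-bit unsigned x.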
#1714 | |
Romulan Interpreter
Jun 2011
Thailand
3²×29×37 Posts
Quote:
). Kotgw!
#1716 |
"James Heinrich"
May 2004
ex-Northern Ontario
11·311 Posts |
As I (probably poorly) tried to explain in the text above the graph, 100 means that the time spent on TF, combined with the probability of finding a factor, gives an equal chance of clearing an exponent whether you TF to that bit level or L-L it. 200 is double 100, which accounts for the fact that 2 L-L tests are needed. It does not account for the smaller amounts of triple-checks and P-1 testing that might be saved with a factor. My interpretation is that TF should be done to the 200 mark, or a little bit higher. Since "200" will rarely fall exactly on an integer bitlevel (the actual breakeven point for "100" is in the rightmost column), TF to the rounded-up-from-that is appropriate when 2 L-Ls would be saved. If only 1 L-L would be saved, then TF to 1 bitlevel less (half the TF effort).
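One possible reading of the 100/200 scale, as a sketch: the score is the TF time divided by the expected L-L time saved, scaled so that 100 is breakeven against a single L-L test. The post gives no formula, so the function names, the formula, and the sample numbers below are all assumptions made for illustration.

```python
def tf_score(tf_hours, factor_probability, ll_hours):
    """Assumed score: 100 when the TF cost equals the expected saving of
    one L-L test (factor_probability * ll_hours)."""
    return 100.0 * tf_hours / (factor_probability * ll_hours)


def worth_doing(score, ll_tests_saved=2):
    """TF pays off while the score stays at or below 100 per L-L saved,
    i.e. up to the 200 mark when a factor saves two L-L tests."""
    return score <= 100.0 * ll_tests_saved


# Hypothetical inputs: 2 hours of TF at this bit level, a 1-in-72 chance
# of finding a factor, and 100 hours for one L-L test.
score = tf_score(2.0, 1.0 / 72.0, 100.0)  # about 144
```

With these made-up numbers the score lands between 100 and 200, so under this reading the bit level is worth doing when a factor would save two L-L tests but not when it would save only one.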