[QUOTE=BigBrother;294415]Well, the card is now inserted into a PCI-E 2.0 x16 slot, and my brain surgery skills allowed me to fix a bent pin on the CPU socket, so my memory is back to dual channel again. :cool:
One instance of mfaktc now takes about 70% GPU instead of the 74% I reported yesterday, and nVidia's Visual Profiler shows transfer rates of 6 GB/s instead of 3 GB/s, but since the amount of data to transfer is relatively small, there's no earth-shattering improvement. I could run the same benchmark I did yesterday again if James would like me to.[/QUOTE] Improved memory should have a more pronounced impact on CUDALucas.
[QUOTE=James Heinrich;294396]You can if you click the zoom in/out links I just added. :smile:[/QUOTE]
Would it be possible to somehow overlay the current TF bounds ([url]http://www.mersenne.org/various/math.php[/url], plus 3 bits) on top of the chart? It would be so pretty :smile: Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have the highest weighting, 4xx the next highest, and then everything else a lower weighting. Edit: Perhaps a mod should move all the posts relating to James' new page to a separate "TF vs. LL" thread in this forum?
[QUOTE=Dubslow;294429]Would it be possible to somehow overlay the current TF bounds ([url]http://www.mersenne.org/various/math.php[/url], plus 3 bits) on top of the chart? It would be so pretty :smile:[/quote]I've put the current PrimeNet CPU-TF limits on the chart as orange.
[QUOTE=Dubslow;294429]Also, is it possible to make an "overall" chart that averages the breakeven points for all the GPUs? You'd have to figure out a way to weight the throughput of each GPU relative to the others; the 5xx would have the highest weighting, 4xx the next highest, and then everything else a lower weighting.[/QUOTE]Breakeven points are not dependent on GPU absolute performance, only on the relative performance between mfaktc and CUDALucas for that compute version. Relative CUDALucas vs. mfaktc performance may mean they need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: [url=http://mersenne-aries.sili.net/cudalucas.php?model=467]3.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=7]2.1[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=13]2.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=15]1.3[/url].
[QUOTE=James Heinrich;294435]I've put the current PrimeNet CPU-TF limits on the chart as orange.[/quote]...pretty :smile:
[QUOTE=James Heinrich;294435] Breakeven points are not dependent on GPU absolute performance, only on the relative performance between mfaktc and CUDALucas for that compute version. Relative CUDALucas vs. mfaktc performance may mean they need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: [url=http://mersenne-aries.sili.net/cudalucas.php?model=467]3.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=7]2.1[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=13]2.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=15]1.3[/url].[/QUOTE] Heh, I didn't realize it was the same numbers, but now I do. 2.1 has slightly more conservative TF bounds, but otherwise matches up fairly well with 2.0. Now, perhaps this should be put in the GPU272 forum, but I think that project should retain PrimeNet's TF bounds, unless you, James, can modify assignment rules (somehow I doubt that). If we do that, then +3 bits is the conservative goal, and +4 bits would be aggressive TF bounds. I vote for the former, because many of the cyan cells are in fact well above 200, and it is GIMPS, not GIMFS, as petrw1 has pointed out elsewhere.
[QUOTE=msft;294375][URL]http://forums.nvidia.com/index.php?showtopic=225312&st=20&p=1387312&#entry1387312[/URL]
Is this the answer?[/QUOTE] I swear I searched NVIDIA's Web site on Thursday and Friday after the announcement, but found no new technical details. [Just pointers to new drivers.] The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow. And mfaktc does use C-style shifts in the inner loop. I still suspect there may be an occupancy issue that is halving the performance. @BigBrother: Are you up to running the [FONT=Courier New]nvvp[/FONT] profiler? @msft: Thanks for the link!!!
[QUOTE=rcv;294438]The new toolkit and docs explain a lot. In addition to the things we knew were slower on the 680, the docs reveal that shift instructions are way slow.[/QUOTE]
I'd say they are about 20 times slower than they should be!! 32-bit muls are much faster than shift lefts! Repeated adds are much faster than small shift lefts. Algorithms may have to change to avoid shift rights.
I also noticed that type conversion is dreadfully slow. Try to minimize these.
Does anyone know if add.cc runs on 168 cores or does it get restricted to 32, or even worse, 8 cores?? Could bfe (bit field extract) be used as a replacement for the slow shift right? In general, how does one know which PTX instructions map to actual hardware instructions? If an instruction is emulated, how does one see which instructions are used to emulate the PTX instruction?
After years of coaching developers to change their multiplies to shifts, NVIDIA may now be coaching developers to change their shifts back to multiplies. Even a shift right (by a constant) might be performed by a mul.hi instruction. How ironic.
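The mul.hi replacement mentioned above can be made concrete: for a 32-bit unsigned value, a right shift by a constant k is equivalent to a widening multiply by 2^(32-k) followed by taking the high 32 bits of the 64-bit product. A minimal sketch in Python (the function name is illustrative; this only demonstrates the arithmetic identity, not actual GPU code):

```python
def shr_via_mulhi(x: int, k: int) -> int:
    """Emulate (x >> k) for 32-bit unsigned x and 0 < k < 32 using a
    widening multiply and the high 32 bits of the result -- the
    mul.hi trick discussed above."""
    assert 0 <= x < 2**32 and 0 < k < 32
    multiplier = 1 << (32 - k)      # the constant 2^(32-k)
    return (x * multiplier) >> 32   # high word of the 32x32 -> 64 product

# Sanity check against the plain shift:
for x in (0, 1, 0xDEADBEEF, 0xFFFFFFFF):
    for k in (1, 5, 31):
        assert shr_via_mulhi(x, k) == x >> k
print("ok")
```

The identity holds exactly because x * 2^(32-k) is just x shifted left by (32-k) bits, so discarding the low 32 bits leaves x >> k.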
[QUOTE=Prime95;294466]In general, how does one know which PTX instructions map to actual hardware instructions? If it's emulated, how does one see which instructions are used to emulate the PTX instruction?[/QUOTE] I've not done it, but I've read that NVIDIA provides a disassembler that shows the honest-and-true (post PTX) machine code that is executed. But, as far as I know, NVIDIA doesn't document the low-level machine code. I suppose that would answer questions such as "does the bit-field-extract PTX macro generate a single instruction or a series of instructions?" |
[QUOTE=James Heinrich;294435]I've put the current PrimeNet CPU-TF limits on the chart as orange.
Breakeven points are not dependent on GPU absolute performance, only on the relative performance between mfaktc and CUDALucas for that compute version. Relative CUDALucas vs. mfaktc performance may mean they need around 1 extra TF bit level for the breakeven point. But there are only 4 patterns there, one for each compute version: [URL="http://mersenne-aries.sili.net/cudalucas.php?model=467"]3.0[/URL], [URL="http://mersenne-aries.sili.net/cudalucas.php?model=7"]2.1[/URL], [URL="http://mersenne-aries.sili.net/cudalucas.php?model=13"]2.0[/URL], [URL="http://mersenne-aries.sili.net/cudalucas.php?model=15"]1.3[/URL].[/QUOTE] Man, those graphics are wonderful! And they perfectly match my cards and my calculations (despite the fact that I never submitted results to your site :smile:). Kotgw!
[QUOTE=James Heinrich;294435]But there's only 4 patterns there, one for each compute version: [url=http://mersenne-aries.sili.net/cudalucas.php?model=467]3.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=7]2.1[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=13]2.0[/url], [url=http://mersenne-aries.sili.net/cudalucas.php?model=15]1.3[/url].[/QUOTE]
What are the numbers in the cells?
[QUOTE=apsen;294534]What are the numbers in the cells?[/QUOTE]As I (probably poorly) tried to explain in the text above the graph, 100 means that the time spent on TF, weighted by the probability of finding a factor, gives an equal chance of clearing an exponent either by TF to that bit level or by L-L'ing it. 200 is double that, which factors in the fact that 2x L-L tests are needed. It does not factor in the lesser amounts of triple-checks and P-1 testing that might be saved by a factor. My interpretation is that TF should be done to the 200 mark, or a little bit higher. Since "200" will rarely fall exactly on an integer bit level (the actual breakeven point for "100" is in the rightmost column), TF to the rounded-up bit level is appropriate when 2 L-Ls would be saved. If only 1 L-L would be saved, then TF to 1 bit level less (half the TF effort).
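The breakeven logic described above can be sketched numerically. Assuming (per the mersenne.org math page linked earlier) that the chance of a factor in one additional bit level is roughly 1/bit_level, TF one level deeper pays off when its cost is less than the expected L-L time saved. The timings below are made-up illustrative values, not measurements from the chart:

```python
def tf_worthwhile(tf_hours: float, ll_hours: float,
                  bit_level: int, ll_tests_saved: int = 2) -> bool:
    """True if trial factoring one more bit level is expected to save
    more L-L time than it costs. The 1/bit_level factor probability is
    the rough GIMPS estimate; ll_tests_saved=2 reflects the first test
    plus double-check, matching the '200' mark above."""
    p_factor = 1.0 / bit_level                      # rough chance of a factor
    expected_saving = p_factor * ll_hours * ll_tests_saved
    return tf_hours < expected_saving

# Hypothetical example: TF of bit level 72 costs 1 h, an L-L test 80 h.
print(tf_worthwhile(1.0, 80.0, 72))   # expected saving ~2.2 h > 1 h cost: True
```

If only one L-L would be saved (the exponent already has a first-time result), passing ll_tests_saved=1 halves the expected saving, matching the "TF to 1 bit level less" rule of thumb.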