#12

Banned
"Luigi"
Aug 2002
Team Italia
1001100000001₂ Posts

#13

P90 years forever!
Aug 2002
Yeehaw, FL
20127₈ Posts

Maybe. It should make no difference to mfaktc/mmff, which rely on integer multiplies. It might make a big difference to CUDALucas. Or it might make little difference if CUDALucas is mostly bottlenecked on memory bandwidth. Benchmarks would be nice.

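For readers wondering what "relies on integer multiplies" means in practice, here is a minimal CUDA sketch (illustrative only, not mfaktc's actual source; the names mul_32x32_64 and demo are hypothetical) of the kind of building block trial factoring is assembled from: a full 32x32 -> 64-bit product using the standard __umulhi intrinsic.

[CODE]
#include <cstdio>

// Illustrative only: a 32x32 -> 64-bit unsigned multiply built from the
// integer units that trial-factoring code depends on.
// __umulhi() returns the high 32 bits of the 64-bit product.
__device__ void mul_32x32_64(unsigned int a, unsigned int b,
                             unsigned int *lo, unsigned int *hi)
{
    *lo = a * b;            // low 32 bits of the product
    *hi = __umulhi(a, b);   // high 32 bits of the product
}

__global__ void demo(unsigned int a, unsigned int b, unsigned int *out)
{
    unsigned int lo, hi;
    mul_32x32_64(a, b, &lo, &hi);
    out[0] = lo;
    out[1] = hi;
}

int main()
{
    unsigned int *d_out, h_out[2];
    cudaMalloc(&d_out, 2 * sizeof(unsigned int));
    demo<<<1, 1>>>(0xFFFFFFFFu, 0xFFFFFFFFu, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("hi=%08x lo=%08x\n", h_out[1], h_out[0]);  // expect fffffffe 00000001
    cudaFree(d_out);
    return 0;
}
[/CODE]

Multi-word squarings and modular reductions in trial-factoring kernels are stacked out of primitives like this, which is why the FP64 changes are largely irrelevant to them.
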
#14

"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
142₈ Posts

Changes from the 680 generation of some relevance to various GPU efforts here:

"Space and bandwidth for both the register file and the L2 cache have been greatly increased for GK110. At the SMX level GK110 has 256KB of register file space, composed of 65K 32-bit registers, as compared to 128KB of such space (32K registers) on GF100. Bandwidth to those register files has in turn been doubled, allowing GK110 to read from those register files faster than ever before. As for the L2 cache, it has received a very similar treatment. GK110 uses an L2 cache up to 1.5MB, twice as big as GF110; and that L2 cache bandwidth has also been doubled."

And memory access patterns outside of those handled well by the other caches and memory now have a cache home:

"...it's also worth noting that NVIDIA has reworked their texture cache to be more useful for compute. On GF100 the 12KB texture cache was just that, a texture cache, only available to the texture units. As it turns out, clever programmers were using the texture cache as another data cache by mapping normal data as texture data, so NVIDIA has promoted the texture cache to a larger, more capable cache on GK110. Now measuring 48KB in size, in compute mode the texture cache becomes a read-only cache, specializing in unaligned memory access patterns."

New low-level instructions are introduced:

"NVIDIA has added a number of new instructions and operations to GK110 to further improve performance. New shuffle instructions allow for threads within a warp to share (i.e. shuffle) data without going to shared memory, making the process much faster than the old load/share/store method. Meanwhile atomic operations have also been overhauled, with NVIDIA both speeding up the execution speed of atomic operations and adding some FP64 operations that were previously only available for FP32 data."

And, very significantly, Compute Capability 3.5 adds the ability for GPU kernels to launch other GPU kernels! For many compute tasks the latency of host-GPU-host-GPU transfers is a big issue, further complicated by interactions between compute scheduling and host communication. "Dynamic Parallelism", as introduced on GK110, means that work units can be generated and dispatched by processes running entirely on the GPU, with correspondingly much lower overhead. For certain tasks this will make a big difference.

And, yeah, there's the significant detail of FP64 speed back to a more reasonable 1/3 of SP speed if you downclock to stay within thermal limits. Practically speaking this means > 1 Tflop/s DP on a single card, with 6GB of 288GB/s memory and a huge register file. So this can indeed be very fast for some tasks.

-Ethan

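And a bare-bones sketch of the dynamic parallelism feature described above, with hypothetical kernel names, just to show the shape of it: the parent kernel launches a child grid directly from device code, so no host round trip is needed to dispatch follow-on work. It has to be built for compute capability 3.5+ with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt.

[CODE]
#include <cstdio>

// Hypothetical child kernel: just fills a buffer.
__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = i * 2;
}

// Parent kernel: with dynamic parallelism, a kernel can launch another
// kernel itself, so follow-on work units are generated and dispatched
// entirely on the GPU. No device-side sync is issued here: a parent grid
// is not considered complete until all of its child grids have finished.
__global__ void parentKernel(int *data, int n)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}

int main()
{
    const int n = 1024;
    int *d_data, h_last = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    parentKernel<<<1, 1>>>(d_data, n);
    cudaDeviceSynchronize();   // waits for the parent and, with it, the child
    cudaMemcpy(&h_last, d_data + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    printf("last element = %d\n", h_last);   // expect 2046
    cudaFree(d_data);
    return 0;
}
[/CODE]
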
#15

"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
2·7² Posts

It should also be remembered that the GK110 design was informed by contract requirements for a certain customer order that ate the first ~19000 boards off the line:
http://www.olcf.ornl.gov/titan/

And there are several academic papers out which examine GK110 performance characteristics and optimization demands:
http://scholar.google.com/scholar?q=...&as_sdt=0%2C48

#16

"Oliver"
Mar 2005
Germany
5×223 Posts

Quote:

Oliver

#17

Jul 2009
Tokyo
2·5·61 Posts

https://developer.nvidia.com/sites/d...erformance.pdf
FFT Performance

#18

P90 years forever!
Aug 2002
Yeehaw, FL
20127₈ Posts

http://www.anandtech.com/show/6774/n...nce-unveiled/3
Looks like 2.5x faster than a GTX 580 (for CUDALucas).

Last fiddled with by Prime95 on 2013-02-21 at 15:34

#19

(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts

GK110 apparently has full-precision FMA in DP; is that enough to make it reasonable to do double-precision arithmetic with FP rather than integer operations, or does the smaller number of DP units still bite you?
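For what it's worth, the property being asked about (a correctly rounded double-precision FMA) is what makes error-free FP products possible, which is the usual building block when multi-word arithmetic is carried in doubles. A minimal CUDA sketch of that transformation (a sketch only, not how CUDALucas or any other GIMPS code is written; two_prod and demo are hypothetical names):

[CODE]
#include <cstdio>
#include <math.h>

// Error-free product (the classic "TwoProd" transformation): with a
// correctly rounded FMA, lo is exactly the rounding error of hi = a*b,
// so the pair (hi, lo) represents the product a*b exactly.
__host__ __device__ void two_prod(double a, double b, double *hi, double *lo)
{
    *hi = a * b;
    *lo = fma(a, b, -(*hi));   // exact: a*b - round(a*b)
}

__global__ void demo(double a, double b, double *out)
{
    two_prod(a, b, &out[0], &out[1]);
}

int main()
{
    double h[2], *d;
    cudaMalloc(&d, 2 * sizeof(double));
    // (2^27 + 1)^2 does not fit in a 53-bit mantissa, so the low word is nonzero.
    demo<<<1, 1>>>(134217729.0, 134217729.0, d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("hi = %.17g, lo = %.17g\n", h[0], h[1]);   // lo should be exactly 1
    cudaFree(d);
    return 0;
}
[/CODE]

Whether the FP route actually beats integer operations on GK110 then comes down to DP throughput versus memory bandwidth, which is what the benchmarks mentioned elsewhere in the thread speak to.
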
#20

Apr 2012
Berlin Germany
110011₂ Posts

My 680 will be up for sale in 2 weeks.
(Serious.)

#21

Mar 2003
Melbourne
5×103 Posts

Check out the FFT DP benchmark:
http://www.anandtech.com/show/6774/n...nce-unveiled/3

This card looks to be the one to use for LL testing.

-- Craig

#22

If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts

Yeah... I read the whole article after George pointed us to it.

It's interesting... This SKU almost seems like it's targeted more at Compute than Gaming. Gamers are bemoaning the $1000 MSRP, but at approximately 1/7th the price of a Tesla K20X I suspect many will be more than happy to forgo ECC memory.

I can't wait to see some benchmarks from those here who have the budget (and/or the connections) to see how our favorite programs perform on this....
