mersenneforum.org  

2013-02-20, 19:34   #12
ET_
Banned
"Luigi"
Aug 2002
Team Italia
1001100000001₂ Posts

Quote:
Originally Posted by sdbardwick
And they de-Nerfed the FP64 abilities.
Titan: 1/3 FP32
680: 1/24 FP32
580: 1/8 FP32
Now, that's interesting...

Luigi
2013-02-20, 21:42   #13
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
20127₈ Posts

Quote:
Originally Posted by ET_
Now, that's interesting...
Maybe. It should make no difference to mfaktc/mmff, which rely on integer multiplies. It might make a big difference to CUDALucas. Or it might make little difference if CUDALucas is mostly bottlenecked on memory bandwidth. Benchmarks would be nice.
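
Whether faster FP64 units help depends on which limit a kernel hits first. A rough roofline-style estimate makes the trade-off concrete. The sketch below is only illustrative: the peak and bandwidth values are the round figures quoted elsewhere in this thread for Titan (about 1 TFLOP/s DP and 288 GB/s), and the arithmetic intensities are hypothetical placeholders, not measurements of CUDALucas.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gbs


# Round figures quoted in this thread for Titan; treat them as assumptions.
PEAK_DP_GFLOPS = 1000.0
BANDWIDTH_GBS = 288.0

if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK_DP_GFLOPS, BANDWIDTH_GBS):.2f} FLOP/byte")
    # Hypothetical intensities, chosen only to show both regimes.
    for intensity in (0.5, 2.0, 8.0):
        bound = attainable_gflops(PEAK_DP_GFLOPS, BANDWIDTH_GBS, intensity)
        regime = "memory-bound" if bound < PEAK_DP_GFLOPS else "compute-bound"
        print(f"{intensity:4.1f} FLOP/byte -> {bound:7.1f} GFLOP/s ({regime})")
```

With these figures the ridge point sits near 3.5 FLOP/byte. A kernel below that intensity is limited by memory bandwidth, so extra FP64 throughput would not show up in its run time.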
2013-02-20, 22:47   #14
Ethan (EO)
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
142₈ Posts

Changes from the 680 generation of some relevance to various gpu efforts here:

"Space and bandwidth for both the register file and the L2 cache have been greatly increased for GK110. At the SMX level GK110 has 256KB of register file space, composed of 65K 32bit registers, as compared to 128KB of such space (32K registers) on GF100. Bandwidth to those register files has in turn been doubled, allowing GK110 to read from those register files faster than ever before. As for the L2 cache, it has received a very similar treatment. GK110 uses an L2 cache up to 1.5MB, twice as big as GF110; and that L2 cache bandwidth has also been doubled."

Memory access patterns that the other caches and memory handle poorly now have a cache of their own:

"...it’s also worth noting that NVIDIA has reworked their texture cache to be more useful for compute. On GF100 the 12KB texture cache was just that, a texture cache, only available to the texture units. As it turns out, clever programmers were using the texture cache as another data cache by mapping normal data as texture data, so NVIDIA has promoted the texture cache to a larger, more capable cache on GK110. Now measuring 48KB in size, in compute mode the texture cache becomes a read-only cache, specializing in unaligned memory access patterns."

New low level instructions are introduced:

"NVIDIA has added a number of new instructions and operations to GK110 to further improve performance. New shuffle instructions allow for threads within a warp to share (i.e. shuffle) data without going to shared memory, making the process much faster than the old load/share/store method. Meanwhile atomic operations have also been overhauled, with NVIDIA both speeding up the execution speed of atomic operations and adding some FP64 operations that were previously only available for FP32 data."

And, very significantly, Compute Capability 3.5 adds the ability for GPU kernels to launch other GPU kernels! For many compute tasks the latency of host-GPU-host-GPU transfers is a big issue, further complicated by interactions between compute scheduling and host communication. "Dynamic Parallelism", as introduced on the GK110, means that work units can be generated and dispatched by processes running entirely on the GPU, with correspondingly much lower overhead. For certain tasks this will make a big difference.
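
Returning to the shuffle instructions in the quote above, the pattern is easiest to see in a reduction. The following is a pure-Python simulation of the data movement, not GPU code. `shfl_down` is a hypothetical stand-in that mimics a shuffle-down step: each lane reads a register value from the lane `delta` positions above it, and lanes whose source would fall outside the warp keep their own value.

```python
WARP_SIZE = 32


def shfl_down(values, delta):
    """Simulate one shuffle-down step across a warp.

    Lane i receives the value held by lane i + delta; lanes whose source
    would fall outside the warp keep their own value.
    """
    return [values[i + delta] if i + delta < len(values) else values[i]
            for i in range(len(values))]


def warp_reduce_sum(values):
    """Tree-reduce one value per lane; lane 0 ends up holding the total."""
    assert len(values) == WARP_SIZE
    lanes = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(lanes, offset)
        lanes = [own + got for own, got in zip(lanes, received)]
        offset //= 2
    return lanes[0]


if __name__ == "__main__":
    print(warp_reduce_sum(range(32)))  # prints 496
```

Five exchange steps cover all 32 lanes, and no step writes to or reads from shared memory, which is the saving the quoted article describes.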

And, yeah, there's the significant detail of FP64 speed being back to a more reasonable 1/3 of SP speed, if you downclock to stay within thermal limits. Practically speaking this means > 1 TFLOP/s DP on a single card, with 6 GB of 288 GB/s memory and a huge register file. So this can indeed be very fast for some tasks.
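
As a back-of-the-envelope check on the figure of more than 1 TFLOP/s DP: peak throughput is roughly cores × FP64 ratio × 2 FLOPs per fused multiply-add × clock. The core count (2688) and the clocks below are not stated in this thread. They are assumptions taken from the published GTX Titan specifications, and the lower clock in particular is a hypothetical downclocked value.

```python
from fractions import Fraction


def peak_dp_gflops(cuda_cores, fp64_ratio, clock_ghz):
    """Peak FP64 GFLOP/s, counting one fused multiply-add as 2 FLOPs."""
    dp_lanes = cuda_cores * fp64_ratio  # FP64 results per clock
    return float(dp_lanes * 2) * clock_ghz


# FP64:FP32 ratios quoted earlier in the thread.
FP64_RATIO = {
    "Titan": Fraction(1, 3),
    "680": Fraction(1, 24),
    "580": Fraction(1, 8),
}

# ASSUMPTIONS (not from the thread): 2688 CUDA cores; clocks in GHz.
TITAN_CORES = 2688
ASSUMED_CLOCKS_GHZ = (0.837, 0.725)  # nominal base clock, hypothetical downclock

if __name__ == "__main__":
    for clock in ASSUMED_CLOCKS_GHZ:
        gflops = peak_dp_gflops(TITAN_CORES, FP64_RATIO["Titan"], clock)
        print(f"{clock:.3f} GHz -> {gflops:.0f} GFLOP/s DP")
    # FP64 rate per unit of FP32 throughput, Titan relative to the 680.
    print(FP64_RATIO["Titan"] / FP64_RATIO["680"])
```

Under these assumptions both clocks land above 1 TFLOP/s, which is consistent with the claim, and the 1/3 ratio is eight times the 680's 1/24.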


-Ethan
2013-02-20, 22:58   #15
Ethan (EO)
"Ethan O'Connor"
Oct 2002
GIMPS since Jan 1996
2·7² Posts

It should also be remembered that the GK110 design was informed by contract requirements for a certain customer order that ate the first ~19000 boards off the line:

http://www.olcf.ornl.gov/titan/

And there are several academic papers out which examine GK110 performance characteristics and optimization demands:

http://scholar.google.com/scholar?q=...&as_sdt=0%2C48
2013-02-21, 09:48   #16
TheJudger
"Oliver"
Mar 2005
Germany
5×223 Posts

Quote:
Originally Posted by Ethan (EO)
Changes from the 680 generation of some relevance to various gpu efforts here:

"Space and bandwidth for both the register file and the L2 cache have been greatly increased for GK110. At the SMX level GK110 has 256KB of register file space, composed of 65K 32bit registers, as compared to 128KB of such space (32K registers) on GF100. Bandwidth to those register files has in turn been doubled, allowing GK110 to read from those register files faster than ever before. As for the L2 cache, it has received a very similar treatment. GK110 uses an L2 cache up to 1.5MB, twice as big as GF110; and that L2 cache bandwidth has also been doubled."
BUT an SMX in Kepler consists of 192 cores, while an SM on Fermi has only 32 cores. So twice the size and twice the bandwidth to the register file for SIX times the compute cores...

Oliver
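
The point above can be put in per-core terms using only the figures in the quote and the core counts given in the reply. The quoted "65K" and "32K" are taken here as 65,536 and 32,768 registers, which is an assumption about the rounding.

```python
def registers_per_core(registers_per_sm, cores_per_sm):
    """Average number of 32-bit registers available per compute core."""
    return registers_per_sm / cores_per_sm


# Register counts from the quoted article, core counts from the reply above.
KEPLER_SMX = {"registers": 65536, "cores": 192}
FERMI_SM = {"registers": 32768, "cores": 32}

if __name__ == "__main__":
    kepler = registers_per_core(KEPLER_SMX["registers"], KEPLER_SMX["cores"])
    fermi = registers_per_core(FERMI_SM["registers"], FERMI_SM["cores"])
    print(f"Kepler SMX: {kepler:.0f} registers per core")
    print(f"Fermi SM:   {fermi:.0f} registers per core")
    print(f"Kepler/Fermi: {kepler / fermi:.2f}")
```

Per multiprocessor the register file doubled, but averaged over cores it drops to about a third of what Fermi offered.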
2013-02-21, 12:30   #17
msft
Jul 2009
Tokyo
2·5·61 Posts

https://developer.nvidia.com/sites/d...erformance.pdf
FFT Performance
2013-02-21, 15:20   #18
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
20127₈ Posts

http://www.anandtech.com/show/6774/n...nce-unveiled/3

Looks like 2.5x faster than a GTX 580 (for CUDALucas)

Last fiddled with by Prime95 on 2013-02-21 at 15:34
2013-02-21, 15:31   #19
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
2×7×461 Posts

GK110 apparently has full-precision FMA in DP; is that enough to make it reasonable to do double-precision arithmetic with FP rather than integer operations, or does the smaller number of DP units still bite you?
2013-02-21, 17:17   #20
Redarm
Apr 2012
Berlin Germany
110011₂ Posts

My 680 will be for sale in 2 weeks (serious).
2013-02-21, 21:15   #21
nucleon
Mar 2003
Melbourne
5×103 Posts

Check out the FFT dp benchmark:

http://www.anandtech.com/show/6774/n...nce-unveiled/3

This card looks to be the one to use for LL testing.

-- Craig
2013-02-21, 21:45   #22
chalsall
If I May
"Chris Halsall"
Sep 2002
Barbados
2·11²·47 Posts

Quote:
Originally Posted by nucleon
This card looks to be the one to use for LL testing.
Yeah... I read the whole article after George pointed us to it.

It's interesting... This SKU almost seems like it's targeted more to Compute than Gaming. Gamers are bemoaning the $1000 MSRP, but at approximately 1/7th the price of a Tesla K20X I suspect many will be more than happy to forgo ECC memory.

I can't wait to see some benchmarks from those here who have the budget (and/or the connections) to see how our favorite programs perform on this...
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
