2019-03-20, 10:05   #23
LaurV
Romulan Interpreter
 
"name field"
Jun 2011
Thailand

41·251 Posts

As usual, a lot of blah-blah and false claims....
2019-03-20, 10:11   #24
axn
 
Jun 2003

12530₈ Posts

Quote:
Originally Posted by diep
Edit: Nvidia's own website doesn't list the card's FP64 capabilities, yet on Wikipedia I see it listed as not being lobotomized, delivering 6.1 Tflops.
https://en.wikipedia.org/wiki/List_o...ocessing_units

That's confusing news, then.

Someone here who happens to have one ought to try running some FP64 on it...

The 1.1 Tflops claim:

https://www.reddit.com/r/nvidia/comm...ision_monster/
See https://www.mersenne.ca/cudalucas.php for CudaLucas (CuLu) performance. The Titan V is the fastest "consumer"-grade card (only same-generation Tesla/Quadro cards are faster).
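
If someone wants to measure it directly, here is a minimal CUDA sketch for timing raw FP64 throughput: each thread runs a chain of register-resident double-precision FMAs, and one FMA counts as two flops. The grid size and iteration count are arbitrary choices and the printed figure is a rough lower bound, for illustration only:

Code:
#include <cstdio>
#include <cuda_runtime.h>

constexpr int ITER = 100000;  // FMAs per thread; arbitrary

__global__ void fp64_fma(double *out, double x) {
    double a = x + threadIdx.x;      // thread-dependent seed
    const double b = 1.000001;
    #pragma unroll 16
    for (int i = 0; i < ITER; i++)
        a = fma(a, b, 0.5);          // dependent FP64 fused multiply-add
    out[blockIdx.x * blockDim.x + threadIdx.x] = a;  // keep the result live
}

int main() {
    const int blocks = 1024, threads = 256;
    double *out;
    cudaMalloc(&out, blocks * threads * sizeof(double));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    fp64_fma<<<blocks, threads>>>(out, 1.0);          // warm-up
    cudaEventRecord(t0);
    fp64_fma<<<blocks, threads>>>(out, 1.0);          // timed run
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    double flops = 2.0 * ITER * blocks * (double)threads / (ms * 1e-3);
    printf("FP64 FMA throughput: %.2f Gflops\n", flops / 1e9);
    cudaFree(out);
    return 0;
}

A card with the claimed 6.1 Tflops should print a few thousand Gflops; a 1:32-rate consumer card will land far lower.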
2019-03-20, 10:54   #25
preda
 
"Mihai Preda"
Apr 2015

2²×3×11² Posts

Quote:
Originally Posted by diep
How do you want to do this?

Do you have C code to show this or pseudo-code of a working FFT that doesn't need this?
If you feel brave, feel free to peruse GpuOwl's implementation, which is extremely compact; the OpenCL part (the kernels) clocks in at only 1300 LOC in total: https://github.com/preda/gpuowl/blob/master/gpuowl.cl

It should be possible to do a 1:1 conversion to CUDA if desired, or to use it only as inspiration or a starting point.
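
To give a feel for how mechanical such a conversion is, here is the usual OpenCL-to-CUDA translation table with a toy kernel; the kernel is illustrative only and is not taken from gpuowl.cl:

Code:
// Common 1:1 substitutions when porting OpenCL kernels to CUDA:
//   __kernel                      -> __global__
//   __global double *             -> double *   (plain device pointer)
//   __local                       -> __shared__
//   get_global_id(0)              -> blockIdx.x * blockDim.x + threadIdx.x
//   get_local_id(0)               -> threadIdx.x
//   barrier(CLK_LOCAL_MEM_FENCE)  -> __syncthreads()
//
// A toy OpenCL kernel
//   __kernel void add(__global double *a, __global const double *b) {
//       uint i = get_global_id(0);
//       a[i] += b[i];
//   }
// becomes, in CUDA:
__global__ void add(double *a, const double *b) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] += b[i];
}

The launch configuration changes the same way: OpenCL's global/local work sizes map onto CUDA's grid and block dimensions.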
2019-03-20, 12:02   #26
nomead
 
"Sam Laur"
Dec 2018
Turku, Finland

475₈ Posts

See https://docs.nvidia.com/cuda/turing-...ide/index.html

1.4.1.4. Integer Arithmetic
Similar to Volta, the Turing SM includes dedicated FP32 and INT32 cores. This enables simultaneous execution of FP32 and INT32 operations. Applications can interleave pointer arithmetic with floating-point computations. For example, each iteration of a pipelined loop could update addresses and load data for the next iteration while simultaneously processing the current iteration at full FP32 throughput.

I'm not talking about tensor cores here; those just do an FP16 4×4 matrix multiply into an FP32 result matrix. Probably useless for our purposes, but who knows.
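
As a minimal illustration of that pattern (my sketch, not code from the tuning guide), even a plain grid-stride loop mixes INT32 index updates with FP32 math that the two pipes can overlap:

Code:
// The index/stride updates issue on the INT32 pipe while the FP32 pipe
// executes the fused multiply-add on the current element.
__global__ void axpy(float *y, const float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32: initial index
    int stride = gridDim.x * blockDim.x;            // INT32: grid stride
    for (; i < n; i += stride) {                    // INT32: address update
        y[i] = fmaf(a, x[i], y[i]);                 // FP32: y = a*x + y
    }
}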
2019-03-20, 13:03   #27
diep
 
Sep 2006
The Netherlands

3·269 Posts

Quote:
Originally Posted by nomead
See https://docs.nvidia.com/cuda/turing-...ide/index.html

1.4.1.4. Integer Arithmetic
Similar to Volta, the Turing SM includes dedicated FP32 and INT32 cores. This enables simultaneous execution of FP32 and INT32 operations. Applications can interleave pointer arithmetic with floating-point computations. For example, each iteration of a pipelined loop could update addresses and load data for the next iteration while simultaneously processing the current iteration at full FP32 throughput.

I'm not talking about tensor cores here; those just do an FP16 4×4 matrix multiply into an FP32 result matrix. Probably useless for our purposes, but who knows.
Those tensor cores deliver roughly 100 Tflops, so for some clever CRT-based FP16 FFT there should be possibilities. 60-100 Tflops is really a lot.

The definition of 'executing at the same time' on a GPU is fuzzy, because even GPU generations from 12 or more years ago take a long time to move an instruction through the execution units, so there is a nonstop state of 'executing at the same time'. Furthermore, if I run 8-20 warps of 32 CUDA cores on a 128-CUDA-core SIMD (9000 and 1000 series) or a 192-CUDA-core SIMD (Kepler), then warps obviously already get executed 'at the same time'. If the claim now is that the same kernel with n warps can reach a higher IPC per SIMD by mixing integer and floating-point instructions, that's implicitly more of the same.
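
For reference, here is a minimal sketch of driving the tensor cores through CUDA's WMMA API (the hardware op is a small FP16 matrix multiply, exposed to the programmer as 16×16×16 tiles accumulating into FP32). It shows only the mechanics of one tile per warp, nothing like a CRT-based FFT:

Code:
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile: D = A*B, FP16 inputs, FP32 accumulate.
// Launch as wmma_tile<<<1, 32>>>(A, B, D); requires sm_70 or newer.
__global__ void wmma_tile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);   // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);     // tensor-core multiply-accumulate
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

Whether FP16 inputs with FP32 accumulation leave enough precision for a CRT-style transform is exactly the open question.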