mersenneforum.org  

Old 2010-10-30, 17:46   #12
fivemack
(loop (#_fork))
 
 
Feb 2006
Cambridge, England


Quote:
Originally Posted by jasonp View Post
Are you sure the current release of OpenCL will generate efficient enough code for any GPU? IIRC, as of a few months ago the AMD OpenCL implementation had severe performance problems because it could not use the high-speed local memory on ATI GPUs.
I believe that the most current AMD OpenCL does use high-speed local memory on 5000-series GPUs. The local memory on 4000-series GPUs had such severe restrictions on writes (I think you could read from anywhere, but only write to a very narrow slice 'owned' by the subprocessor your thread was running on) that I'm surprised it was usable from assembly language, let alone as a compiler target.

Last fiddled with by fivemack on 2010-10-30 at 17:46
Old 2010-10-30, 21:36   #13
fivemack
(loop (#_fork))
 
 
Feb 2006
Cambridge, England


I've downloaded the OpenCL FFT example that Apple offers at

http://developer.apple.com/library/m...ion/Intro.html

and tried to run it on my iMac (with a 4850 GPU).

This was quite far from a glorious success: it failed to compile (giving a warning about an uninitialised variable), and when I initialised the variable the program ran at a peak speed of 11 GFLOP/s, stalled the Mac's user interface entirely while it was running, and gave the wrong answers.
Old 2010-11-04, 14:22   #14
Dresdenboy
 
 
Apr 2003
Berlin, Germany


FWIW, AMD offers an OpenCL Webinar series: http://developer.amd.com/zones/OpenC...LWebinars.aspx

Another point in OpenCL's favour will be the OpenCL-capable IGPs integrated with the CPU, as in AMD's Llano (est. 400 shaders, 400-500 GFLOPS), Ontario/Zacate (lower power, est. 80 shaders) and Intel's Sandy Bridge with up to 12 EUs.

Last fiddled with by Dresdenboy on 2010-11-04 at 14:24
Old 2010-11-21, 15:19   #15
otutusaus
 
Nov 2010
Ann Arbor, MI


You might find this link about OpenCL implementations of FFT algorithms interesting:
http://www.bealto.com/gpu-fft.html
Old 2010-12-27, 06:30   #16
Commaster
 
Jun 2010
Kiev, Ukraine


Something useful? http://developer.amd.com/gpu/appmath...s/default.aspx
Old 2010-12-27, 23:50   #17
cheesehead
 
 
"Richard B. Woods"
Aug 2002
Wisconsin USA


Quote:
Originally Posted by Commaster View Post
It mentions only single-precision, not double-precision, FFTs.

Floating-point arithmetic results in rounding errors. Since our GIMPS work requires exact integer answers, the implementation of FFTs in floating-point requires the allocation of several "guard bits" in each FP number so that at the end of the FFT, the rounding error does not add up to enough to make the final integer answer wrong.

In single-precision floating-point, the number of bits left for holding the actual data values, after allocating the guard bits, is so small that it kills any speed advantage FP has over all-integer. Only double-precision FP allows enough bits for data in each FP number to preserve a speed advantage.
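To make the guard-bit argument concrete, here is a small sketch (Python with NumPy; the 16-bit word size and the transform length are illustrative choices, not actual GIMPS parameters) that squares a large integer via a zero-padded double-precision FFT and checks that the worst rounding error stays far below 0.5, so rounding recovers the exact integer convolution:

```python
import numpy as np

BASE_BITS = 16   # data bits per FFT word -- illustrative, not a GIMPS parameter
N = 256          # number of words -- illustrative

rng = np.random.default_rng(1)
digits = rng.integers(0, 1 << BASE_BITS, size=N)

# Acyclic convolution of the digit vector with itself, via a zero-padded FFT.
a = np.zeros(2 * N)
a[:N] = digits
prod = np.fft.ifft(np.fft.fft(a) ** 2).real
rounded = np.rint(prod).astype(np.int64)
max_err = float(np.max(np.abs(prod - rounded)))

# Reassemble the integers and compare against exact integer arithmetic.
x = sum(int(d) << (BASE_BITS * i) for i, d in enumerate(digits))
y = sum(int(c) << (BASE_BITS * i) for i, c in enumerate(rounded))

print(max_err < 0.5, y == x * x)
```

Here the convolution terms reach roughly 2^40, leaving ample guard room under the 53-bit double mantissa. In single precision (24-bit mantissa) those terms would not even be representable exactly, let alone leave room for accumulated FFT rounding error, which is exactly the point above.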
Old 2011-06-06, 16:17   #18
diep
 
 
Sep 2006
The Netherlands


Quote:
Originally Posted by cheesehead View Post
It mentions only single-precision, not double-precision, FFTs.

Floating-point arithmetic results in rounding errors. Since our GIMPS work requires exact integer answers, the implementation of FFTs in floating-point requires the allocation of several "guard bits" in each FP number so that at the end of the FFT, the rounding error does not add up to enough to make the final integer answer wrong.

In single-precision floating-point, the number of bits left for holding the actual data values, after allocating the guard bits, is so small that it kills any speed advantage FP has over all-integer. Only double-precision FP allows enough bits for data in each FP number to preserve a speed advantage.
Reacting to an old posting...

You are correct. However, double versus single precision is not the biggest problem, as he shows. As he correctly concludes, the problem is the RAM rewrites once the transforms grow a bit bigger, say to the size we need for LL now (or for Vrba-Reix, which is nearly the same as LL).

To put it in cycles: computing a single limb on today's AMD Radeon HD 6000 series should take around 16 cycles. That would involve reading 2 doubles and writing 2 doubles, so 4 × 8 = 32 bytes of memory traffic.

Each compute unit has 64 PEs and not much cache, and the GPRs cannot hold much data either, so the number of bytes read from and written to the GPU's RAM (the same goes for Nvidia) is the major concern.

For example, for the HD 6970 I have here it works out as follows: on paper the GPU delivers more, but when I measure (sure, it also serves as a video card, but we cannot assume laboratory conditions, can we?) I get about 140-145 GB/s of bandwidth to the device RAM.

Let's take 140 GB/s.

With 1536 PEs clocked at 880 MHz, those 32 bytes per limb translate to:

32 bytes × 1536 PEs × 0.88 GHz / 140 GB/s ≈ 309 cycles

So the RAM is the problem. Cooley-Tukey-style transforms are simply not going to work, as they require repeated rewrites from and to RAM, which already slows things down too much.
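The arithmetic above can be written out explicitly (a trivial sketch using the HD 6970 figures quoted in this post):

```python
# Cycles of memory traffic per limb implied by bandwidth alone,
# using the HD 6970 numbers from the post.
bytes_per_limb = 4 * 8        # read 2 doubles, write 2 doubles
pes = 1536                    # processing elements
clock_ghz = 0.88              # 880 MHz
bandwidth_gbs = 140.0         # measured device-RAM bandwidth

cycles = bytes_per_limb * pes * clock_ghz / bandwidth_gbs
print(round(cycles))  # ~309 cycles, versus ~16 cycles of compute per limb
```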

Now one can combine several steps of the algorithm. However, that is bound by the amount of cache available per compute unit, be it shared cache, be it GPRs.

Each combination is a power of two. So, for example, 32 KB of shared local cache means you can store at most 2^11 values (complex doubles of 16 bytes each).
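As a sanity check on that 2^11 figure (assuming one stored value is a complex double of 16 bytes, which is my reading of the post):

```python
import math

local_mem_bytes = 32 * 1024   # shared local cache per compute unit
bytes_per_value = 2 * 8       # one complex double: real + imaginary parts

values = local_mem_bytes // bytes_per_value
print(values, int(math.log2(values)))  # 2048 values, i.e. 2^11
```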

Note these are all theoretical numbers.

Last fiddled with by diep on 2011-06-06 at 16:19



