#12 | fivemack (Cambridge, England; joined Feb 2006)
I believe that the current AMD OpenCL implementation does use high-speed local memory on 5000-series GPUs. The local memory on 4000-series GPUs had such severe constraints on writing (I think you could read from anywhere, but write only to a very narrow slice 'owned' by the subprocessor your thread runs on) that I'm surprised it was usable from assembly language, let alone as a compiler target.
Last fiddled with by fivemack on 2010-10-30 at 17:46
#13 | fivemack (Cambridge, England; joined Feb 2006)
I've downloaded the OpenCL FFT example that Apple offers at http://developer.apple.com/library/m...ion/Intro.html and tried to run it on my iMac (with a 4850 GPU). It was far from a glorious success: the example failed to compile (giving a warning about an uninitialised variable), and once I initialised the variable, the program ran at a peak of 11 GFLOPS, stalled the Mac's user interface entirely while it was running, and gave the wrong answers.
#14 | Dresdenboy (Berlin, Germany; joined Apr 2003)
FWIW, AMD offers an OpenCL webinar series: http://developer.amd.com/zones/OpenC...LWebinars.aspx

Another point in OpenCL's favour will be the OpenCL-capable IGPs integrated with the CPU, as in AMD's Llano (an estimated 400 shaders, 400-500 GFLOPS), Ontario/Zacate (lower power, an estimated 80 shaders) and Intel's Sandy Bridge with up to 12 EUs.

Last fiddled with by Dresdenboy on 2010-11-04 at 14:24
#15 | (Ann Arbor, MI; joined Nov 2010)
You might find this link on OpenCL implementations of FFT algorithms interesting:
http://www.bealto.com/gpu-fft.html
#16 | (Kiev, Ukraine; joined Jun 2010)
Something useful? http://developer.amd.com/gpu/appmath...s/default.aspx
#17 | Richard B. Woods (Wisconsin, USA; joined Aug 2002)
Quote:
Floating-point arithmetic introduces rounding errors. Since our GIMPS work requires exact integer answers, a floating-point FFT implementation must reserve several "guard bits" in each FP number so that, by the end of the FFT, the accumulated rounding error is not large enough to make the final integer answer wrong. In single-precision floating point, the number of bits left for the actual data values, after the guard bits are allocated, is so small that it kills any speed advantage FP has over all-integer arithmetic. Only double-precision FP leaves enough data bits per FP number to preserve a speed advantage.
#18 | diep (The Netherlands; joined Sep 2006)
Quote:
You are correct. However, double versus single precision is not the biggest problem, as he proves. As he correctly concludes, the problem is the RAM rewrites once the transforms grow a bit bigger, say the size we need for LL now (or Vrba-Reix, which is nearly identical to LL).

To put it in cycles: computing a single limb on today's AMD Radeon HD 6000 series should take around 16 cycles. That involves reading 2 doubles and writing 2 doubles, so 4 * 8 = 32 bytes. Each compute unit has 64 PEs and not much cache per compute unit, and the GPRs cannot carry much data, so the number of bytes read from and written to the GPU's RAM (ditto for Nvidia) is the major concern.

For example, on the HD 6970 I have here: the GPU delivers something on paper, but when I measure (sure, it also serves as a video card, but we cannot assume laboratory conditions, can we?) I can get 140-145 GB/s of bandwidth to device RAM. Take 140 GB/s. With 1536 PEs clocked at 880 MHz, those 32 bytes per limb translate to: 32 bytes * 1536 PEs * 0.88 GHz / 140 GB/s = 309 cycles.

So the RAM is the problem. Cooley-Tukey type FFTs are simply not going to work, as they require rewrites from and to RAM, which already slows things down too much. One can combine several steps of the algorithm, but that is bounded by the amount of cache available per compute unit, be it shared cache or GPRs. Each combination is a power of 2, so for example 32 KB of shared local cache means you can store a maximum of 2^11 complex doubles.

Note these are all theoretical numbers.

Last fiddled with by diep on 2011-06-06 at 16:19
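The 309-cycle figure above can be reproduced in a few lines. The constants (1536 PEs, 880 MHz, a measured 140 GB/s) are the ones quoted in the post; the variable names are mine.

```python
# Back-of-the-envelope cost, per limb, of streaming FFT data through
# device RAM on an HD 6970, using the figures quoted in the post above.
BYTES_PER_LIMB = 4 * 8        # read 2 doubles + write 2 doubles
NUM_PES = 1536                # processing elements on the HD 6970
CLOCK_GHZ = 0.88              # 880 MHz core clock
MEASURED_BW_GBPS = 140.0      # measured bandwidth, not the paper figure

# Cycles of memory traffic per limb if every PE processes one limb:
cycles_per_limb = BYTES_PER_LIMB * NUM_PES * CLOCK_GHZ / MEASURED_BW_GBPS
print(round(cycles_per_limb))  # → 309, versus ~16 cycles of compute
```

The ~20x gap between the 16 compute cycles and the 309 memory cycles is exactly why the post concludes that naive Cooley-Tukey passes through device RAM cannot work and FFT steps must be fused within on-chip cache.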