mersenneforum.org  

Old 2009-12-19, 13:43   #1
Brain
 
Dec 2009
Peine, Germany

331 Posts
The ATI GPU Computing thread

This thread shall collect all knowledge and current information about GPU computing on ATI graphics cards. I'd prefer info about the current standards: DirectX 11 and ATI Stream technology (DirectCompute, OpenCL).

The intention is to get some Prime95 work done on the ATI HD5000 generation. Although this forum contains a lot of info on double-precision maths for Nvidia cards (CUDA), there is hardly anything for ATI. All I have found is this:

AMD Core Math Library for Graphic Processors (ACML-GPU):
http://developer.amd.com/gpu/acmlgpu/pages/default.aspx

Share your knowledge. I'm looking forward to your posts.
Old 2009-12-19, 16:27   #2
diep
 
Sep 2006
The Netherlands

675₁₀ Posts

Quote:
Originally Posted by Brain
This thread shall collect all knowledge and current information about GPU computing on ATI graphics cards. I'd prefer info about the current standards: DirectX 11 and ATI Stream technology (DirectCompute, OpenCL).

The intention is to get some Prime95 work done on the ATI HD5000 generation. Although this forum contains a lot of info on double-precision maths for Nvidia cards (CUDA), there is hardly anything for ATI. All I have found is this:

AMD Core Math Library for Graphic Processors (ACML-GPU):
http://developer.amd.com/gpu/acmlgpu/pages/default.aspx

Share your knowledge. I'm looking forward to your posts.
Hi, very interesting. First, download the libraries needed to write code for the cards, to see whether you can actually get them, and try to find every file you need before starting the project. Some years ago software support from AMD was a big problem. Ignore OpenCL (too slow) and ignore DirectX; you want to program the chip directly.

I wrote down on paper (not electronically, on the net) a small idea for getting something going quickly on these GPUs. A full 32-bit floating-point DWT implementation should be factors better still, though. Please note it is important to dig up additional information on the significands: the registers might calculate with more bits of precision than you would guess.

The whole trick to getting these GPUs fast will be modifying the algorithm used for the FFT (before moving on to the DWT).

Actually, a first implementation of the FFT is more interesting, as it can then be used in other libraries as well. It should be possible to select an algorithm where, with some trickery, the memory controller is avoided as much as possible.

A straightforward FFT is of course 2n log 2n operations. However, it runs through the array from left to right, which squeezes the maximum bandwidth out of the RAM. These GPUs are so fast nowadays that the main concern is staying out of device RAM and optimizing for maximum use of the register file / tiny cache.

So it is worthwhile to look at somewhat more 'inefficient' algorithms to get the FFT done, under the condition that more of each 'log n' pass stays relatively close by in each calculation. In short: is it possible to sneakily exchange the order in which the actual instructions get executed, perhaps doing one slow correction pass later on?
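
For illustration, here is a minimal, untuned CPU-side C sketch of one such reordering: the classic four-step (Bailey) split of a length N = N1·N2 FFT into small sub-transforms plus a twiddle pass. The sizes, the naive O(n²) sub-DFT, and all names here are assumptions made up for this sketch, not ATI-specific or optimized code.

[code]
/* Four-step FFT sketch: length-N2 sub-DFTs, a twiddle pass, then
 * length-N1 sub-DFTs.  Each sub-transform touches a small working set,
 * which is what you would try to keep inside a GPU's register file /
 * local memory instead of streaming the whole array from device RAM. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define TWO_PI_F 6.28318530717958647692f

/* O(n^2) reference DFT; stands in for a small in-register kernel. */
static void naive_dft(const float complex *in, float complex *out,
                      int n, int stride)
{
    for (int k = 0; k < n; k++) {
        float complex acc = 0;
        for (int j = 0; j < n; j++)
            acc += in[j * stride] * cexpf(-TWO_PI_F * I * j * k / n);
        out[k] = acc;
    }
}

/* x: input of length N1*N2, X: output in natural order. */
static void four_step_fft(const float complex *x, float complex *X,
                          int N1, int N2)
{
    int N = N1 * N2;
    float complex Y[N1][N2];                     /* intermediate matrix */

    /* Step 1: length-N2 DFT of each strided "row" x[n1 + N1*n2].       */
    for (int n1 = 0; n1 < N1; n1++)
        naive_dft(&x[n1], Y[n1], N2, N1);

    /* Step 2: twiddle by w_N^(n1*k2).                                   */
    for (int n1 = 0; n1 < N1; n1++)
        for (int k2 = 0; k2 < N2; k2++)
            Y[n1][k2] *= cexpf(-TWO_PI_F * I * n1 * k2 / N);

    /* Step 3: length-N1 DFT down each column k2; write X[k2 + N2*k1].   */
    for (int k2 = 0; k2 < N2; k2++) {
        float complex col_in[N1], col_out[N1];
        for (int n1 = 0; n1 < N1; n1++)
            col_in[n1] = Y[n1][k2];
        naive_dft(col_in, col_out, N1, 1);
        for (int k1 = 0; k1 < N1; k1++)
            X[k2 + N2 * k1] = col_out[k1];
    }
}

int main(void)
{
    enum { N1 = 4, N2 = 8, N = N1 * N2 };
    float complex x[N], X[N];

    /* A pure cosine at bin 3: the energy should land in bins 3 and N-3. */
    for (int n = 0; n < N; n++)
        x[n] = cosf(TWO_PI_F * 3 * n / N);

    four_step_fft(x, X, N1, N2);
    for (int k = 0; k < N; k++)
        printf("%2d  %8.3f\n", k, cabsf(X[k]));
    return 0;
}
[/code]

The point of the split is that each sub-DFT only touches N1 or N2 elements at a time; that small working set, rather than the full array, is what you would try to keep in registers or the tiny cache.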

Furthermore, the AMD chips have 5 execution units ("stream cores") in each full-blown core. How many of those can do a multiplication? If it's just 1, consider that some integer-type transform might be a possibility, as otherwise 3 out of 5 execution units sit idle nonstop. With a different type of transform it might be possible to get all 5 of them working at least some of the time.

The AMD GPU is the world champion at executing as many instructions as possible per cycle, so give it code where it can do that.

For the X2 versions you want to run 2 independent transforms; the memory is not shared between the GPUs.

For a single GPU: 1600 stream cores × 0.95 GHz = 1.5 Tflop, hands down.

That's really a lot you know.

Oh, note that each vector is also 128 bits, so in single precision you soon get to 6 Tflop there.
So it is also all about constraints: how many stream cores in each penta-core can execute an instruction at the same time.

All this is really important information, because the claim isn't that they do 6 Tflop. Why not?
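
To spell out the arithmetic behind those figures (taking the numbers above at face value rather than as vendor specifications), a tiny C sketch:

[code]
/* Back-of-the-envelope for the claims above.  These are the figures
 * quoted in this post, not official specs; the "x4 per core" line is
 * exactly the assumption fivemack corrects in the next post.          */
#include <stdio.h>

int main(void)
{
    double stream_cores = 1600.0;   /* 320 five-wide cores x 5 lanes   */
    double clock_ghz    = 0.95;

    /* 1 operation per stream core per cycle.                          */
    double base_tflop = stream_cores * clock_ghz / 1000.0;

    /* If each stream core additionally handled a 128-bit (4-wide)
     * single-precision vector per cycle.                              */
    double vec_tflop = base_tflop * 4.0;

    printf("1 op/core/cycle            : %.2f Tflop\n", base_tflop);
    printf("if 4-wide vectors per core : %.2f Tflop\n", vec_tflop);
    return 0;
}
[/code]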

Last fiddled with by diep on 2009-12-19 at 16:36
Old 2009-12-19, 16:46   #3
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

2×5²×127 Posts

The five-wide cores aren't each processing 128-bit vectors; code written with 128-bit vectors gets translated to 32-bit ops on four lanes.
Old 2009-12-19, 16:55   #4
diep
 
Sep 2006
The Netherlands

3³×5² Posts

Quote:
Originally Posted by fivemack
The five-wide cores aren't each processing 128-bit vectors; code written with 128-bit vectors gets translated to 32-bit ops on four lanes.
Most useful.

How is the 5th core involved?

Last fiddled with by diep on 2009-12-19 at 16:55
Old 2009-12-19, 18:12   #5
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

14316₈ Posts

I don't have an ATI card; I bought a GeForce GTX275 after reading the AMD developer forums for a few weeks and despairing at AMD's apparent inability to realise that offering 'double-precision support in compiler' as a driver update expected eight months after the cards are released is not sensible.

http://developer.amd.com/gpu/ATIStre...s/default.aspx has a "Documentation" section at the bottom which links to

http://developer.amd.com/gpu_assets/...chitecture.pdf

(the machine-code of the VLIW units) and

http://developer.amd.com/gpu_assets/..._Processor.pdf

(the assembly language used by AMD's assembler to construct VLIW machine-code)

The lanes are called x, y, z, w and Trans; the Trans lane is the only one that can do integer<->float conversions, integer multiply, or sin/cos/sqrt/rsqrt/exp/log. The register set is modelled as consisting of 128-bit registers with x, y, z and w components, and (for example) the y lane can only write to the y component of a register; this is a consequence of the x, y, z and w components of the register set being stored as four separate banks of memory.

XYZW (and not Trans) can do double-precision multiply, comparison, and convert-single-to-double; to do a single DP addition you have to use two lanes, and they have to be 0+1 or 2+3 (or both) of an instruction bundle. See page 9-45 of R700-Family for an example. To do a DP multiply or FMA on R700 hardware uses all four lanes.

I don't know if this has changed on R800 hardware; the R800 manual doesn't seem to be available yet. Maybe I'm just being naive in expecting to get the instruction-set manual before the hardware.
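
As a rough illustration of what those lane rules imply for peak rates, here is a small C sketch. It assumes the commonly quoted RV770 (Radeon HD 4870) layout of 160 five-wide VLIW units at 750 MHz and that every lane can issue a single-precision MAD each cycle; treat those as back-of-the-envelope assumptions, not numbers taken from the manuals linked above.

[code]
/* Peak-rate consequence of the R700 double-precision lane rules,
 * assuming an RV770-like part: 160 five-wide VLIW units at 750 MHz.   */
#include <stdio.h>

int main(void)
{
    double vliw_units = 160.0;  /* 800 "stream processors" / 5 lanes   */
    double clock_ghz  = 0.75;

    /* Single precision: assume all 5 lanes can do a MAD = 2 flops.    */
    double sp_gflops = vliw_units * 5.0 * 2.0 * clock_ghz;

    /* Double precision: a DP FMA ties up all four xyzw lanes, so each
     * VLIW unit retires at most 1 DP FMA = 2 flops per cycle.         */
    double dp_gflops = vliw_units * 1.0 * 2.0 * clock_ghz;

    printf("SP peak ~ %.0f Gflop/s, DP peak ~ %.0f Gflop/s (%.0f:1)\n",
           sp_gflops, dp_gflops, sp_gflops / dp_gflops);
    return 0;
}
[/code]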

Last fiddled with by fivemack on 2009-12-19 at 18:19
Old 2009-12-19, 18:21   #6
diep
 
Sep 2006
The Netherlands

3³×5² Posts

Quote:
Originally Posted by fivemack
I don't have an ATI card; I bought a GeForce GTX275 after reading the AMD developer forums for a few weeks and despairing at AMD's apparent inability to realise that offering 'double-precision support' as a driver update eight months after the cards were released is not sensible.

http://developer.amd.com/gpu/ATIStre...s/default.aspx has a "Documentation" section at the bottom which links to

http://developer.amd.com/gpu_assets/...chitecture.pdf

(the machine-code of the VLIW units) and

http://developer.amd.com/gpu_assets/..._Processor.pdf

(the assembly language used by AMD's assembler to construct VLIW machine-code)

The lanes are called x, y, z, w and Trans; the Trans lane is the only one that can do integer<->float conversions, integer multiply, or sin/cos/sqrt/rsqrt/exp/log.

XYZW (and not Trans) can do double-precision multiply, comparison, and convert-single-to-double; to do a single DP addition you have to use two lanes, and they have to be 0+1 or 2+3 (or both) of an instruction bundle. See page 9-45 of R700-Family for an example. To do a DP multiply or FMA on R700 hardware uses all four lanes.

I don't know if this has changed on R800 hardware; the R800 manual doesn't seem to be available yet. Maybe I'm just being naive in expecting to get the instruction-set manual before the hardware.
Look, it is wishful thinking that you can do double precision on GPUs: they are forever 32-bit floating-point devices. So a transform must be in 32 bits.

On AMD cards you can get 50% out of the card. At Nvidia only the Tesla card is interesting; the rest is crap.
Maybe you get 25% out of a Tesla, by the way.

That might change with Fermi.
In turn, I assume, each will have the fastest GPU.

Both have ugly support (understatement).

One of the problems with Nvidia is that there is no good description of the stream cores: WHICH hardware instructions they support, WHAT the latencies are, how ugly branches are, and which CMOV instruction is there (if any; I assume none for now, in contrast to AMD).

With AMD cards you know which hardware instructions the chip has; you don't know the latencies, however.

The software support is a big problem there.

Please note that if the manufacturers finally produce those details, we will know a lot more and can finally produce something that gets the maximum out of the cards. There is probably a good reason to keep it quiet, namely that they are both overstating the potential of their cards by some *factors*.

We are guessing AMD overstates by a factor of 2 and Nvidia by a factor of 4, because the fastest codes that run on those cards with some sort of realistic instruction mix only reach that level of performance after lots of toying around.

Hopefully Fermi will improve things a lot with better caches!

Last fiddled with by diep on 2009-12-19 at 18:26
Old 2009-12-19, 18:44   #7
fivemack
(loop (#_fork))
 
Feb 2006
Cambridge, England

6350₁₀ Posts

Please explain your comments about Tesla: as far as I'm aware, and as far as I can tell Nvidia confirms this if asked, the Tesla 10-series are the same GPUs as on the GeForce 285 cards, with a slightly higher clock rate, significantly more RAM on board, and available in different-shaped boxes.
Old 2009-12-19, 18:54   #8
diep
 
Sep 2006
The Netherlands

2A3₁₆ Posts

Quote:
Originally Posted by fivemack
Please explain your comments about Tesla: as far as I'm aware, and as far as I can tell Nvidia confirms this if asked, the Tesla 10-series are the same GPUs as on the GeForce 285 cards, with a slightly higher clock rate, significantly more RAM on board, and available in different-shaped boxes.
That's the theory.

In practice, google around to see what people actually *achieved*.

The big difference between consumer GPUs and Tesla is usually in the memory controller.

Isn't there a limitation on RAM allocation on consumer GPUs, for example, which Tesla doesn't have? A while ago some big bandwidth throttling was known on consumer GPUs that Tesla didn't have (so as for any claim about RAM bandwidth: forget it on a consumer GPU).

Look, you can claim anything about your hardware if you don't release the details.

I can also write fantastic stories about something whose details I keep secret forever, under all circumstances. In fact, I once approached Nvidia about a simulator for a very big worldwide software project that was considering doing a pilot project at Nvidia.

In fact, I didn't even need to speak further with the persons involved, as Nvidia's response there was totally negative.
Note that this was about Tesla, in fact, not cheapo consumer GPUs.

That should tell you something about whether they have something to hide.

A chance to earn half a billion and they don't even take it seriously, you know.
All that was needed was some promises to support a pilot project.

You can't set up a pilot project for software without technical details. You are competing there with all HPC hardware, where software teams get the maximum out of what is technically possible on the CPUs. I'm not even going to *try* to write software, unfunded, for hardware without information.

Vincent

Last fiddled with by diep on 2009-12-19 at 19:00