#826
"James Heinrich"
May 2004
ex-Northern Ontario
#827
Sep 2006
The Netherlands

Quote:
A high-clocked Phenom II with DDR3 will of course have considerably more memory bandwidth than an i3, and the latest i7 models sit above both. In all these cases the memory controller is on the chip, so the higher the CPU is clocked, the more bandwidth you get both to RAM and over PCIe to the GPU. An Intel i7 at 3.x GHz with a bunch of low-latency DIMMs is a great choice; after that comes a cheap AMD system with DDR3. Mainboard quality really matters here. All of these CPUs have enough cores in principle to feed your GPU. Feeding the GPU needs a form of memory prefetching; older AMD GPUs couldn't do that, but the 6000 series can, and so can NVIDIA. Of course TheJudger's code runs only on NVIDIA.

Quote:

SLI is usually for gamers, and on some cards it has the disadvantage that you need to attach monitors to the video cards in order to use them for GPGPU, which seems silly. I read an online report and a helpdesk answer to that effect, which is a silly thing. Hopefully they have fixed this already.

Quote:

Of course you can never beat second-hand offers on eBay. Just don't believe the advertised power draw: TheJudger's GPGPU code is quite efficient, and it will eat a lot more power than NVIDIA advertises. Think 500 watts for the GTX 590 and 405+ watts for the GTX 580. If you buy new hardware, get a 608 MHz version though; the 668 MHz versions of any manufacturer might be a tad unreliable :) Price-wise eBay always wins. Yet if you intend to run 24/7 for a few years, the power costs of those video cards are immense, at least in first-world nations. I know from a few Chinese researchers that the opposite is the case for them: they only care about hardware costs, power is free for them :)

Vincent

Last fiddled with by diep on 2011-05-03 at 17:12
#828
Sep 2006
The Netherlands
580 over a 570:

The 570 has 480 CUDA cores at a default shader clock of 1464 MHz; the 580 has 512 CUDA cores at 1544 MHz. Since it is the same generation of GPU you can do the math yourself, and TF, unlike FFT codes, does not depend on memory speed. So on paper the 580 is faster by 512*1.544 / (480*1.464), which is about 12.5%.

A GTX 590 should have roughly the same power consumption as the 580 (maybe slightly more, but I doubt it), yet it has 1024 CUDA cores at between 1.2 GHz and 1.4 GHz. By default the graphics clock is 608 MHz, so the shader clock is 1216 MHz. You can do the math on how much faster that is: roughly 1.6x a 580, so not yet a factor 2 but getting close.

Last fiddled with by diep on 2011-05-03 at 17:24
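A couple of lines of Python reproduce the paper throughput ratios above (cores times shader clock, using exactly the figures quoted in the post):

```python
# Back-of-envelope TF throughput comparison: CUDA cores x shader clock (MHz).
# Card figures are the ones stated in the post.
cards = {
    "GTX 570": (480, 1464),
    "GTX 580": (512, 1544),
    "GTX 590": (1024, 1216),  # dual GPU: 608 MHz graphics clock -> 1216 MHz shader
}

throughput = {name: cores * mhz for name, (cores, mhz) in cards.items()}

ratio_580_570 = throughput["GTX 580"] / throughput["GTX 570"]
ratio_590_580 = throughput["GTX 590"] / throughput["GTX 580"]

print(f"580 vs 570: {ratio_580_570:.3f}")  # ~1.125, i.e. about 12.5% faster
print(f"590 vs 580: {ratio_590_580:.3f}")  # ~1.575, not quite a factor 2
```

This is only a paper estimate, as the post says; real TF throughput also depends on instruction mix and driver/toolkit versions.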
#830
Sep 2006
The Netherlands

Quote:
In all cases it is the same chip, of course, so the GTX 590 and the GTX 570 use the same die. GPUs where a bunch of cores didn't work had 32 of them turned off and were sold as a GTX 570; basically one compute unit was disabled, I guess, as internally the chip is split into compute units. Note that "compute unit" is the OpenCL term; NVIDIA has its own terminology and calls them SMs (streaming multiprocessors). In manycores it is easy to turn off cores; on CPUs that is far more complex because of cache coherency, which a GPU simply doesn't have. So turning off a compute unit, which is 64 cores for AMD and 32 for NVIDIA, is very easy to do. That's why those GPUs can have such large dies, whereas for CPUs it is very expensive to produce such huge chips. They therefore stacked two GPUs onto a single board and called that the GTX 590.

Now for GPGPU things are different, as it depends on the capabilities of the languages and the available instructions and how fast those are; but for gamers the GTX 580 was basically hammered by AMD's 6990. So I guess the GTX 590 is a desperate attempt by NVIDIA there, and it has kind of failed already: in most game benchmarks I saw, the 6990 hammers the GTX 590 completely. Yet for GPGPU, having 1024 CUDA cores is a blessing.

What it means for FFT is not so clear to me. The fact that the 6990 hammers it away is probably because NVIDIA lobotomized the bandwidth; they had to, as two memory buses of 384 bits each on one card is just too much. AMD had optimized for 256 bits and had already designed the chip to fit two GPUs on one board, so they must have optimized it far more. That is not a design mistake by NVIDIA or AMD; realize that AMD released some six months later than NVIDIA, and a newer chip gives better performance, of course.

I wrote down some stuff on paper now for a fast FFT, and unlike Thall I'm going to release the code. It would run fast both on AMD and on NVIDIA, though I program it for AMD in OpenCL.
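The binning described above is easy to sanity-check in numbers. Assuming the counts stated in the post (32 CUDA cores per NVIDIA SM, a 16-SM full die), disabling a single SM turns the 580's 512 cores into exactly the 570's 480:

```python
# Binning arithmetic, per the core counts given in the post.
CORES_PER_SM = 32   # NVIDIA CUDA cores per SM
CORES_PER_CU = 64   # AMD cores per compute unit, as stated above

gtx580_cores = 16 * CORES_PER_SM   # full die
gtx570_cores = 15 * CORES_PER_SM   # one SM fused off

print(gtx580_cores)  # 512
print(gtx570_cores)  # 480
```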
It's not clear to me, however, how to run faster on AMD's 6990 there, as every compute unit must execute the same instructions. NVIDIA doesn't have this problem: each compute unit ('SM') can be steered independently, so porting such code to the GTX 590 might be easier than to the 6990. AMD is supposed to release a new version of their SDK (I'm not sure the upcoming version already supports it) in which each compute unit can also be steered independently, just like on NVIDIA. In that case the AMD 6990 will be fastest for the FFT I'm going to build. Right now it's not clear to me whether a 580 with some additional carry tricks in CUDA might reach the same speed as the AMD 6970, or vice versa. The obviously lobotomized RAM bandwidth of the GTX 590 will make it lose to the 6990 there, as the games already indicate. In games the quality of the drivers matters as well, so the last word hasn't been spoken there yet.

With TF there is no discussion about what is fastest: that's NVIDIA, and bandwidth to the internal device RAM is not relevant there, as TheJudger doesn't use it at all. What does matter is bandwidth from the CPU over PCIe to the GPU, and there are huge differences there. For example, my setup here is low-clocked with old DDR2 RAM, so that's roughly 2.2 GB/s to the GPU and 2.0 GB/s back (and yes, the card also serves as a video card while benchmarking this). TheJudger's setup is a lot faster there, factors more per second. Now I don't want to eat CPU time at all; I want every program I build to be standalone inside the GPU, so that no bandwidth from CPU to GPU is needed. Yet you'll have to deal with that problem for now. If you start several instances of the program and feed each one to a different GPU, it should be easy to get working, I guess.
#831
Sep 2006
The Netherlands
Oh, maybe I wasn't clear, for those interested in the difference between the dual-GPU cards and the single-GPU cards.
On the single-GPU cards any compute unit can read from the device RAM. It's slow, yet it is possible; the bandwidth to that RAM is 140 GB/s or so. Let's ignore the theoretical specs of 170 to 190 GB/s, which I never reached on any card; you'll probably need a setup like TheJudger's to achieve that :) I have no test setups here, just old second-hand stuff and a new video card.

On the dual-GPU cards the two GPUs do not share any memory with each other: each GPU is served by its own GDDR5 RAM. I guess even the GDS (global data store, 64 KB or so, as defined in the OpenCL specs and by now implemented by AMD and NVIDIA) is not shared. Sure, a link between them is possible; I forgot its speed, but I remember some link ran at 7 GB/s between the two GPUs, which is a factor 20+ too slow to consider using.

Now for TF none of this is needed. It just calculates and crunches inside the compute units and in fact doesn't even use the shared RAM it has there. mfaktc is checking whether a factor candidate divides the Mersenne number for an exponent, that's all: a simple calculation, yet it eats a bunch of cycles. So the speed at which the cores can process that calculation is the speed of your GPU. RAM is not involved at all, not even the SRAM; just the register files and the execution units.

For FFT, and for games, it is another story. Let's skip games; we don't play games, we just toy with numbers! FFT needs bandwidth to RAM and also shared RAM. On the GTX 590 you can of course run two instances of your code on two different FFTs, so no shared memory between the two GPUs is needed then. On the 6990 you cannot do that, for a simple reason: it doesn't support running multiple kernels at the same time. So there is no sharing of information between the two GPUs in that case either. We'll therefore have to wait for a new SDK from AMD that supports this before the 6990's full potential can be used.
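The trial-factoring check described above, deciding whether a candidate q divides the Mersenne number 2^p - 1, can be sketched in a few lines of plain Python. This is only an illustration of the idea, not mfaktc's actual GPU code, which does the same arithmetic with hand-tuned multi-word integer operations:

```python
def divides_mersenne(q, p):
    """True if q divides 2**p - 1. Three-argument pow does modular
    exponentiation, so the huge Mersenne number is never formed."""
    return pow(2, p, q) == 1

# Candidate factors of 2**p - 1 (p prime) have the form 2*k*p + 1,
# which is why TF only has to test a thin arithmetic progression.
p = 11
for k in range(1, 10):
    q = 2 * k * p + 1
    if divides_mersenne(q, p):
        print(f"{q} divides 2**{p} - 1")  # finds 23 and 89 (2047 = 23 * 89)
```

As the post says, each candidate is an independent, RAM-free computation, which is exactly what makes TF such a good fit for thousands of GPU cores.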
In the meantime I'll go cook an FFT now, porting some CPU code I wrote in 2006 to the GPU. That will run horribly at first, but I calculated that after optimization it should be roughly a tad faster than the timing Thall posted here. Note that Thall used NVIDIA's existing CUDA FFT library, whereas I have my own code here using a different transform and no library at all; I'm using an integer transform. It's not clear to me how fast the GTX 590 would be when running two of those transforms. Its disappointing benchmarks indicate it has a real bandwidth problem, yet for trial factoring that doesn't matter, as TF doesn't need that RAM at all. It seems that for each type of code there is a GPU that is fast at it.

Last fiddled with by diep on 2011-05-03 at 18:00
#832
Sep 2006
The Netherlands
Because of the disappointing results in games, maybe the price of the GTX 590 will drop soon, and next year it will be dirt cheap second-hand. Who knows? The kids are merciless there :) I remember a few years ago some dual-GPU 8800 version was also dirt cheap on eBay once the next generation was out.

In this case you must however take into account that the next generation of GPUs at 22 nm will be fantastically faster than today's generation. Moving from 40 nm to 22 nm is a huge difference; an effective factor 4 speed increase for GPGPU is easy to predict. I hope AMD will also improve its integer execution units (the multiplication part: top bits and 32 x 32 bit integers); that would effectively speed up an AMD GPU by a factor 16 for integers compared to today's cards. NVIDIA can toy some in SSE2 style, executing vectors on a single CUDA core; that would speed them up a factor 8 for the next-generation chip.

Vincent
#833
"Mike"
Aug 2002

Quote:
#834
Sep 2006
The Netherlands
I'm not an expert in all this; I can ask someone at a forum (www.lostcircuits.com). From what I understand, you need at least PCIe 8x for the video cards. The video cards all claim to be 16x, but in reality no mainboard you can buy in a shop yet supports real full-blown 16x; the 16x slots in practice all deliver 8x performance.

Further, if we are talking about memory bandwidth being the bottleneck, the CPU speed matters. A 3.6 GHz CPU with DDR3 is always going to be much faster than a 2.3 GHz chip, say at least 50% faster in bandwidth. None of this is a problem if you intend to run just one GPU or so, but if you want to run a bunch, it will matter. In itself it's easy to stack up a bunch of GPUs with riser cards.

CPUs are outdated now for anything involving prime numbers; there is no way they can ever catch up with GPUs anymore. So all these GPU issues are very relevant. From a support and ease-of-use perspective I'd argue that the GPU manufacturers are not taking GPGPU very seriously: it's total amateurism and clumsiness, with lots of hidden technical facts.

Last fiddled with by diep on 2011-05-03 at 19:46
#835
Sep 2006
The Netherlands
Please note, if you are interested in running the upcoming FFTs as well, that bandwidth to and from the GPU doesn't matter there; only internal bandwidth will matter.

It is not clear to me which GPU will be fastest for the upcoming FFT, and then I'm not even considering the eBay prices of older-generation GPUs. I intend to use 32-bit calculations there; AMD is weak at those, and so is NVIDIA. The modulo I do runs on simple 32-bit instructions, so that part is 4x faster on AMD than on NVIDIA. Yet I'm guessing all that doesn't matter. The internal bandwidth of the device RAM (the GDDR5 of the GPU) will determine most of it, I'd guess, and AMD's slightly faster calculations might be undone completely by worse load balancing. Where the GTX 580 has 16 SMs (compute units), which makes load balancing easy, with the 24 of the Radeon HD 6970 I don't yet see how to load balance well, since the FFT size is a power of 2.

Vincent
#836
Dec 2010
Monticello
I wouldn't say LL tests on the current generation of CPUs are dead yet; only TF is, unless someone forgot to tell me about a huge performance jump there. I don't see one, unless the GPU can do many LLs in parallel. Remember that testing M(100M) for primality with LL involves O(100M) serial steps.

And if you think the CPU manufacturers aren't watching the GPUs, you can win the inaccurate Delphic oracle award in five years. Those guys *have* to be thinking about the problem, and we will see CPU performance increase substantially.

Honestly, I don't feel that NVIDIA has been neglecting us; the CUDA toolkit and specs are out there in the open, and any amateur can program to them. I do understand why they don't want to get locked into a particular instruction set with a bunch of binaries; supporting x86 instructions is a major thorn in the CPU manufacturers' sides. I can't speak to ATI/AMD, except to say that I had heard (15 years ago) that ATI didn't feel community support of various kinds was all that important. My experience working there (on Rage5 test, at the former Tseng Labs site) was that open communication even within the company wasn't a high priority.

Finally, it's worth noting that the CPU side has had significant speedups as the FFTs have been generalized, via irrational-base weighting, to sizes that more closely match the number sizes involved.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1676 | 2021-06-30 21:23 |
| The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
| gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
| mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
| World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |