mersenneforum.org  

Old 2011-05-28, 15:56   #23
ldesnogu
Jan 2008
France
597₁₀ Posts

Quote:
Originally Posted by diep View Post
Another huge difference is that the PS3 has 6 PEs available, clocked at 3.17 GHz and executing a single instruction per cycle, whereas the supercomputer Cell 2 chip has 8 PEs available. This is a huge performance difference.
Still spreading wrong information? How many times should people provide you with links that you don't bother reading? That's intellectual dishonesty and it makes all your other points doubtful.

@chris2be8: The EPFL team published an article about their ECM record factorization. I guess that even if you could get their code, you couldn't achieve such a feat, since they used a cluster of PS3s. Anyway, you can try to get in touch with Arjen K. Lenstra and ask for more information.
Old 2011-05-28, 16:44   #24
diep
Sep 2006
The Netherlands
3×269 Posts

Quote:
Originally Posted by ldesnogu View Post
Still spreading wrong information? How many times should people provide you with links that you don't bother reading? That's intellectual dishonesty and it makes all your other points doubtful.

@chris2be8: The EPFL team published an article about their ECM record factorization. I guess that even if you could get their code, you couldn't achieve such a feat, since they used a cluster of PS3s. Anyway, you can try to get in touch with Arjen K. Lenstra and ask for more information.
Dude, there are two Cell CPUs; you must not confuse them. The links provided are the 2009 PDFs, while the PS3 saw the light in 2005.

Do you see the difference between 2005 and the PowerXCell 8i from around 2009?

So if the guy claims he uses PS3s, keep in mind that most are in reality using rackmounts with the newer Cell CPUs to 'emulate' that.

And it's 8 vs. 6 cores. A huge difference.

The first Cell CPUs produced were crap.

Vincent
Old 2011-05-28, 16:47   #25
diep
Sep 2006
The Netherlands
3×269 Posts

P.s. please note that the later Cell chips still have huge latencies, such as 7 cycles for double precision. Programming around that well is of course possible, but it takes exactly the same effort as getting software to work on GPUs, which are a lot cheaper per Tflop.

Last fiddled with by diep on 2011-05-28 at 16:47
Old 2011-05-28, 17:32   #26
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
2·17·347 Posts

Quote:
Originally Posted by diep View Post
So if the guy claims he uses PS3s, keep in mind that most are in reality using rackmounts with the newer Cell CPUs to 'emulate' that.
If "the guy" to whom you refer is Arjen Lenstra, I suggest that you should do a little investigation before you make such claims.

Arjen's cluster is indeed made of honest-to-$DEITY PS3s: the game machines. They are technically rack-mounted in the sense that they sit on shelves fitted into racks. There's a nice picture of his cluster here: http://lacal.epfl.ch/

There's also a picture of the RSA-129 check...

Paul
Old 2011-05-28, 18:13   #27
ldesnogu
Jan 2008
France
3×199 Posts

Quote:
Originally Posted by diep View Post
Dude, there are two Cell CPUs; you must not confuse them. The links provided are the 2009 PDFs, while the PS3 saw the light in 2005.

Do you see the difference between 2005 and the PowerXCell 8i from around 2009?
If you had taken one minute of your time to skim through the document, you'd have found the pipeline dispatch and latency table with two sets of columns: one for the Cell/BE (the one in the PS3) and one for the PowerXCell 8i. See Appendix B.

Quote:
So if the guy claims he uses PS3s, keep in mind that most are in reality using rackmounts with the newer Cell CPUs to 'emulate' that.
And now you doubt an article co-signed by Arjen Lenstra and Peter Montgomery. Funny.
Old 2011-05-29, 17:44   #28
diep
Sep 2006
The Netherlands
3×269 Posts

Quote:
Originally Posted by ldesnogu View Post
If you had taken one minute of your time to skim through the document, you'd have found the pipeline dispatch and latency table with two sets of columns: one for the Cell/BE (the one in the PS3) and one for the PowerXCell 8i. See Appendix B.

And now you doubt an article co-signed by Arjen Lenstra and Peter Montgomery. Funny.
By today's standards these papers are a joke.

Let me give you a few facts. At top500.org they rate a single core of the Cell at 12.8 Gflops. That is the multiply-add case: 1 double-precision instruction per cycle, counted as 2 flops, times 2 doubles per vector:

3.2 GHz × 2 × 2 = 12.8 Gflops
http://www.top500.org/system/10377
That is in fact already the PowerXCell 8i.

If I open a paper of Lenstra's:

http://eprint.iacr.org/2010/338.pdf
This publication refers to 2010, so I guess it was written in 2010 or 2011.

Page 3, paragraph 2 shows they represent things using 16×16-bit multiplies with 32-bit products. Now, I don't know whether integers run faster on it than double precision, so let's assume integers run at 2 instructions per cycle per PE.

So they have 6 PEs available × 3.2 GHz × 2 IPC × 4 ints per SIMD vector:

6 × 3.2 GHz × 2 × 4 = 153.6 G multiplies per second.
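Both of these peak-rate figures are pure clock arithmetic. A minimal sketch (using the post's own clock, IPC, and SIMD-width assumptions, not measured values) reproduces them:

```python
def peak_gops(units, clock_ghz, ops_per_cycle, lanes):
    """Theoretical peak rate in billions of operations per second."""
    return units * clock_ghz * ops_per_cycle * lanes

# One SPE, double precision: 1 FMA per cycle counted as 2 flops, 2 doubles per vector.
spe_dp = peak_gops(1, 3.2, 2, 2)    # 12.8 Gflops, the top500.org figure

# Six SPEs, 16x16-bit integer multiplies: assumed 2 IPC, 4 ints per vector.
cell_int = peak_gops(6, 3.2, 2, 4)  # 153.6 G multiplies per second

print(spe_dp, cell_int)
```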

The GPUs, even the previous-generation GPUs from AMD, deliver 1.351 Tflops (by my definition of a flop; they count multiply-adds themselves, so by their count it's 2.7 Tflops for the HD 6970; note that on paper the old generation is a tad faster, yet harder to program for),

and they can in fact do a multiply-add on 16×16-bit multiplications.

That's nearly 10× faster than the Cell.

The 6990 is in fact 20× faster.

The HD 5870 you can buy on eBay for $90, and you could already get those easily last year, long before this paper was written.

Note also that 24×24-bit multiplication is available (it requires 2 instructions for the top bits).
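The "nearly 10×" figure is just the ratio of the two quoted peak rates, taking the post's numbers at face value:

```python
cell_int_gops = 153.6    # estimated Cell integer-multiply peak, 6 SPEs
hd6970_gflops = 1351.0   # AMD HD 6970 single-counted peak (1.351 Tflops)

ratio = hd6970_gflops / cell_int_gops
print(round(ratio, 1))   # roughly 8.8, i.e. "nearly 10x"
```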

The idea from IBM to produce them was genius, yet it took them too long, which renders the chips obsolete.

The Cell processor is something in between an old CPU and today's GPUs. Therefore, if you can program the Cell you can also program GPUs. In fact the latter is easier, as getting the full throughput from a Cell is tougher than from today's generations of GPUs.

So this PS3 is old junk, and I don't know why such big guys are busy with it.

The article indicates, however, that they never actually ran on the PS3, but on one of their supercomputer blades.

See how the article was written. A very dubious word construction is, for example, the formulation at page 2, bottom left: "It is conceivable ...."

So he actually never toyed with the PS3, as in the next paragraph the blades appear. Yet if something officially doesn't exist, I guess you can't run on it, so you use some vague PS3 photo. We saw some NCSA cover-ups that way even in computer chess, so I guess for prime numbers it'll be even easier. Note that the previous blades were around $2k a piece if you bought a bunch of them. Sure, they can put them online for $20k, but no one ever paid that AFAIK. Of course $2k was attractive at the time.

He does mention that there is an even and an odd pipeline, noting that in theory the chip would be able to execute 2 instructions per cycle. You must also analyze very carefully how he writes that down, as it too indicates he sees it as a theory.

Probably that happens with simple integers and not with double-precision floating point, which is probably why they emulate things using integers rather than floating point.

The thing indeed has no branch prediction (the old Cell; the newer ones are supposed to be a tad more clever there, but I never tested those).

So the programming model needed to get optimal performance out of them is a lot harder than for GPUs, while they're roughly a factor 10 slower than a GPU.

Note that at Nvidia the single-precision integers are indeed slower, yet you can multiply more bits at once; this is where Nvidia beats the Cell big time, so Fermi will effectively also be nearly a factor 10 faster than the Cell.

Again: when IBM launched the Cell idea it was a great idea, yet they took too long to produce it. Years and years of delays. When it arrived it was too late. A hybrid form in between a CPU and a GPU.

The reason they sold some, I guess, is that the Western world was asleep with respect to GPUs. They sat and waited. I guess they have woken up now.

AFAIK the initial Cell PEs for the PS3 were clocked at 3.17 GHz, and in the article they mention 3.192 GHz, so I'm not sure whether Lenstra refers to the same type of chip. Maybe the blades are clocked at 3.192 GHz?

Three years' difference is a lot. Just google "i7" and tell me how many variations there are of it.

Last fiddled with by diep on 2011-05-29 at 18:02
Old 2011-05-29, 19:53   #29
xilman
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
2E16₁₆ Posts

Quote:
Originally Posted by diep View Post
By today's standards these papers are a joke.

Let me give you a few facts.
Let me give you a few facts.

Arjen built the PS3 cluster several years ago when the PS3 price-performance was actually quite good. The cluster is still producing good results not because it is state of the art but because it has been fully paid for, it has achieved all that it was intended to do and because it still works. For the moment, the only cost to use it is the electrical power and a relatively small amount of attention.


Quote:
Originally Posted by diep View Post
That's 10x faster nearly than CELL.
Some more facts for you.

Arjen and his team are well aware of this figure. When I spoke with him and Joppe Bos at Eurocrypt last year he said that they already had code which was at least ten times faster on a GPU than on a PS3. The problem they then had was that the latency was appalling. This is in the context of running ECM to a stage one limit of a billion or few. A PS3 could run 8 curves a day (if I remember the numbers correctly) whereas a GPU could run 30 thousand curves a year (again, if I remember correctly). However, it took the PS3 a day before any answers appeared and it would take the GPU a year before its output was ready. They weren't prepared to wait that long.
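The trade-off Paul describes is easy to make concrete. Taking his admittedly half-remembered figures at face value, the GPU wins on curves per year but loses badly on time to first result:

```python
# Rough throughput-vs-latency comparison for batched ECM stage 1,
# using the half-remembered figures quoted in the post above.
ps3_curves_per_day = 8
gpu_curves_per_year = 30_000

ps3_curves_per_year = ps3_curves_per_day * 365              # 2920 curves/year
throughput_gain = gpu_curves_per_year / ps3_curves_per_year  # roughly 10x

ps3_latency_days = 1      # first PS3 results after about a day
gpu_latency_days = 365    # first GPU results after about a year
latency_penalty = gpu_latency_days / ps3_latency_days        # 365x

print(ps3_curves_per_year, round(throughput_gain, 1), latency_penalty)
```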

This information prompted me to pass on some of my investigations into low-latency arithmetic and stimulated me into working on the subject again. I really must restart on that project as I'd also like to run ECM on GPUs.

Paul
Old 2011-05-29, 20:03   #30
diep
Sep 2006
The Netherlands
3·269 Posts

Quote:
Originally Posted by xilman View Post
Let me give you a few facts.

Arjen built the PS3 cluster several years ago when the PS3 price-performance was actually quite good. The cluster is still producing good results not because it is state of the art but because it has been fully paid for, it has achieved all that it was intended to do and because it still works. For the moment, the only cost to use it is the electrical power and a relatively small amount of attention.


Some more facts for you.

Arjen and his team are well aware of this figure. When I spoke with him and Joppe Bos at Eurocrypt last year he said that they already had code which was at least ten times faster on a GPU than on a PS3. The problem they then had was that the latency was appalling. This is in the context of running ECM to a stage one limit of a billion or few. A PS3 could run 8 curves a day (if I remember the numbers correctly) whereas a GPU could run 30 thousand curves a year (again, if I remember correctly). However, it took the PS3 a day before any answers appeared and it would take the GPU a year before its output was ready. They weren't prepared to wait that long.

This information prompted me to pass on some of my investigations into low-latency arithmetic and stimulated me into working on the subject again. I really must restart on that project as I'd also like to run ECM on GPUs.

Paul
With those budgets it's not so complicated to make a kick-butt GPU program.

What you want to do is design your own card (not CPU): take the default board from AMD or Nvidia, put some additional SRAM on each card plus a hub from it directly to the network, and use a better cooler and a lot more GDDR5 RAM.

So the latency to the device RAM will be a tad worse, yet the latency to the other nodes will be kick-butt, of course.

This is not rocket science. It'll be a tad more expensive than GPUs from the shop, of course, and you need to print at least 500-1000 of each card to get the price down a lot.

So each modified GPU will be about 150 euros more expensive than when you buy it in the shop.

The modification, even if you have a commercial guy do it, would not cost more than a couple of hundred thousand.

Yet all sorts of things that now require a lot of clever programming will then become a lot easier.

Please note that in the first place, just like with the Cell, you should forget the word latency and focus upon throughput :)

Regards,
Vincent

p.s. After having had that 'cluster' for a few years, is it still merely 'conceivable' that a modification will allow them to use that 7th core after all those years? How can we take this seriously?

p.s.2 A 256-bit multiplication using 'schoolboy' with 16×16-bit limb products takes 256 multiplications: an effective output of 4 bits per instruction. Maybe the 80386 outputs more bits per cycle there? How can we take this paper seriously if you need 256 multiplications for a single 256-bit product? In the Soviet Union they used those wooden abacuses to multiply faster than that...
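The 256-multiplication count comes from schoolboy multiplication with 16-bit limbs: two 256-bit operands are 16 limbs each, and every limb of one is multiplied by every limb of the other, 16 × 16 = 256 single-limb products. A minimal sketch (illustrative only, not the paper's actual code):

```python
LIMB_BITS = 16
MASK = (1 << LIMB_BITS) - 1

def to_limbs(x, n):
    """Split x into n little-endian 16-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]

def schoolboy_mul(a, b):
    """O(n^2) limb multiplication; returns (product_limbs, limb_mul_count)."""
    result = [0] * (len(a) + len(b))
    muls = 0
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = result[i + j] + ai * bj + carry  # one 16x16 -> 32-bit product
            result[i + j] = t & MASK
            carry = t >> LIMB_BITS
            muls += 1
        result[i + len(b)] = carry               # top limb of this row
    return result, muls

def from_limbs(limbs):
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))
```

With 16-limb operands this reports exactly 256 limb products, and the result matches Python's built-in big-integer multiply.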

Last fiddled with by diep on 2011-05-29 at 20:08
Old 2011-05-29, 22:13   #31
Christenson
Dec 2010
Monticello
1795₁₀ Posts

I'm impressed by the latencies... a day to a year is huge...

As I see it, the problem is that the GPU programming budget is about 20 guys working casually... jasonp, msft, TheJudger, George Woltman, whoever wrote msieve/NFS@Home, some helpers, and that's about it. That doesn't compare even to a group of 10 programmers working on the problem full time. And honestly, the math is decently ferocious, so you can't just throw $$ at the problem; you need people with math training.

Is there a reasonably simple explanation of why GPU latencies (or even PS3 ones) on ECM are so high? (Glossing over the fact that I don't really understand an elliptic curve; I'm still working on MPQS.)

****idly wonders if Sony could be convinced to let a few trusted math programmers work on the PS4 under a restricted license (no developing games, and all code comes through you, so it's enforceable)... thinking otherwise, they are going to get taken to the cleaners by the GPU cards...****
Old 2011-05-29, 23:34   #32
jasonp
Tribal Bullet
Oct 2004
3565₁₀ Posts

(Msieve is written primarily by me, along with help here and there from quite a few folks. NFS@Home uses the lattice siever from GGNFS with modifications from frmky and debrouxl, and almost-unmodified msieve for the postprocessing.)

I don't see where Sony would have any incentive to support non-gaming applications on any model of Cell processor. The BlueGene series of processors has all the programming infrastructure it needs, courtesy of IBM, and the console market needs the economy of scale that comes from running games. Honestly, if Prime95 were immediately ported to the PS3, how many consoles would that actually sell?

When the Cell processor was first introduced, there were grandiose visions that the Cell would be ubiquitous, your microwave would have one and it would network automatically with other Cell processors around it. That warranted opening up the platform, but now it looks like the chip will only be used in the consumer space for consoles.
Old 2011-05-30, 02:44   #33
Christenson
Dec 2010
Monticello
5×359 Posts

Recognizing that gaming consoles require economies of scale, so lots of chips need to be sold, doesn't necessarily mean a tidy profit can't be made from some other serious work. At this point, even if all the prime-hunting programs in the world worked on a PS3 right now, unless those things were dirt cheap, like around $100, I don't think it would sell a single unit; the performance isn't there. I predict that in 5 years all phases of prime and factor hunting will be almost exclusively on GPUs. I can name one Nvidia GPU that CUDA sold (my GT 440), and another I have my eye on...

Now, the question is: on the PS4, or possibly with a GPU-style coprocessor card (now that the HDCP-protected digital video mode is in place, letting the card take a ride in the back of my PC while still keeping good copy protection), would it make sense for Sony to let entities other than gaming companies do some programming? But I'm betting with Jason (even without his post) that no such thing is in the works, especially not given the history of the PS3 (and the Sony rootkit) and the recent hacking troubles with PlayStation Network. The attitude and the karma just aren't there, even if Sony gets the worst of both worlds (only runs games, supports viruses, and brings loads of bad publicity).

In an admittedly grandiose (and possibly impossible) vision of mine, a secure OS resides on the GPU card (I don't regard even Linux as really secure; there's too much kernel-mode code) and the PC frame supplies the peripherals.