#23
Jan 2008
France
597₁₀ Posts
@chris2be8: The EPFL team published an article about their ECM record factorization. I suspect that even if you could get their code you couldn't repeat such a feat, since they used a cluster of PS3s. Anyway, you could try getting in touch with Arjen K. Lenstra to ask for more information.
#24
Sep 2006
The Netherlands
3×269 Posts
Do you see the difference between 2005 and the PowerXCell 8i from around 2009? So if the guy claims he uses PS3s, keep in mind that most people are in reality using the rackmounts with the newer CELL CPUs to 'emulate' that. And it's 8 vs 6 cores. Huge difference. The first CELL CPUs produced were crap.

Vincent
#25
Sep 2006
The Netherlands
3×269 Posts
p.s. please note that even the later CELL chips have huge latencies, like 7 cycles for double precision. Programming around that well is of course possible, but it is exactly the same effort as getting software to work on GPUs, which are a lot cheaper per Tflop.
Last fiddled with by diep on 2011-05-28 at 16:47
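To illustrate the kind of latency-oriented programming this alludes to, here is a minimal C sketch (illustrative only, not Cell/SPU code; the 7-cycle figure is diep's): a single dependent accumulator chain stalls on the floating-point latency every iteration, while several independent accumulators keep the pipeline full.

[CODE]
#include <stddef.h>

/* Latency-bound: every add waits on the previous one,
   so throughput is limited by the FP-add latency. */
double dot_single_acc(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Throughput-bound: four independent accumulator chains let the
   hardware overlap multiply-adds instead of stalling on latency. */
double dot_four_acc(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)      /* tail */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
[/CODE]

On an in-order core the compiler will not always do this reordering for you, which is part of the careful programming being referred to here.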
#26
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2·17·347 Posts
Arjen's cluster is indeed made out of honest-to-$DEITY PS3's. The game machines. They are technically rack-mounted in the sense that they sit on shelves fitted into racks.

There's a nice picture of his cluster here: http://lacal.epfl.ch/

There's also a picture of the RSA-129 check...

Paul
#27
Jan 2008
France
3×199 Posts
#28
Sep 2006
The Netherlands
3×269 Posts
Let me give you a few facts. At top500.org they rate a single core of the CELL as 12.8 Gflops. That's for multiply-adds: 1 instruction per cycle, double precision, times 2 doubles per vector: 3.2 GHz * 2 * 2 = 12.8. http://www.top500.org/system/10377 That is in fact already the PowerXCell 8i.

If I open a paper by Lenstra: http://eprint.iacr.org/2010/338.pdf This publication refers to 2010, so I guess it was written in 2010 or 2011. Page 3, paragraph 2, shows they represent things using a 16x16 == 32-bit representation. Now I don't know whether integers run faster on it than double precision; let's assume integers run at 2 instructions per cycle per PE. So they have 6 PEs available * 3.2 GHz * 2 IPC * 4 ints per SIMD vector = 153.6 G operations per second.

The GPUs, even the previous generation from AMD, deliver 1.351 Tflops (that's my definition of a flop; they count a multiply-add themselves, so in that case it's 2.7 Tflops for the HD6970; note that on paper the old generation is a tad faster, yet harder to program for) and can in fact do a multiply-add on 16x16-bit multiplications. That's nearly 10x faster than CELL. The 6990 is in fact 20x faster. The HD5870 you can buy on eBay for $90, and you could already get those easily last year, long before this paper was written. Note also that 24x24 bits is available (it requires 2 instructions for the top bits).

The idea from IBM to produce them was genius, yet it took them too long to produce them, which renders them obsolete. The CELL processor is something in between an old CPU and today's GPUs. Therefore if you can program the CELL you can also program GPUs. In fact that's easier, as getting the full throughput out of a CELL is tougher than out of today's generations of GPUs. So this PS3 is old junk and I don't know why such big guys are busy with it.

The article indicates, however, that they never actually ran on the PS3, but on one of their supercomputer blades. See how the article was written. A very dubious word construction is, for example, the formulation at page 2, bottom left: "It is conceivable ...." So he actually never toyed with the PS3, as in the next paragraph the blades are there. Yet if something officially doesn't exist I guess you can't run on 'em, so you use some vague PS3 photo. We saw some NCSA cover-ups even in computer chess that way, so I guess for prime numbers it'll be even easier. Note that the previous blades were around $2k a piece if you buy a bunch of 'em. Sure, they can put 'em online for $20k, but no one ever paid that AFAIK. Of course $2k was attractive at the time.

He does mention that there is an even and an odd pipeline, suggesting that in theory the chip would be able to execute 2 instructions per cycle. The way he writes it down you must also analyze very carefully, as it indicates he sees it as a theory. It probably happens with simple integers and not with double precision floating point, which is probably why they emulate things using integers rather than floating point. The thing indeed has no branch prediction (the old CELL; the newer ones are supposed to be a tad more clever there, but I never tested those). So the programming model to get optimal performance out of them is a lot harder than for GPUs, while they're roughly a factor 10 slower than a GPU.
Note that at Nvidia it is true that single precision integers are slower, yet you can multiply more bits at once; this is where Nvidia beats the CELL big time, so Fermi will effectively also be nearly a factor 10 faster than CELL.

Again: when IBM launched the CELL idea it was a great idea, yet they took too long to produce them. Years and years of delays. When they arrived it was too late. A hybrid form in between a CPU and a GPU. The reason they sold some, I guess, is that the western world was asleep with respect to GPUs. They sat and waited. I guess they have woken up now.

AFAIK the initial CELL PEs for the PS3 were clocked at 3.17 GHz, and in the article they mention 3.192 GHz, so I'm not sure whether Lenstra refers to the same type of chip. Maybe the blades are clocked at 3.192 GHz? 3 years' difference is a lot. Just google "i7" and tell me how many variations there are of it.

Last fiddled with by diep on 2011-05-29 at 18:02
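To make the peak-rate arithmetic in this post easy to check, here it is written out as a tiny C program; the clock speed, vector widths and the 2-instructions-per-cycle integer figure are the assumptions quoted in the post above, not measurements.

[CODE]
#include <stdio.h>

int main(void)
{
    double clock_ghz = 3.2;

    /* One SPE, double precision: 1 multiply-add per cycle on a
       2-wide double vector = 2 flops * 2 lanes per cycle. */
    double spe_dp_gflops = clock_ghz * 2.0 * 2.0;          /* 12.8 */

    /* Six SPEs on a PS3, 32-bit integer SIMD: assume 2 instructions
       per cycle and 4 ints per 128-bit vector (diep's assumption). */
    double ps3_int_gops = 6.0 * clock_ghz * 2.0 * 4.0;     /* 153.6 */

    printf("per-SPE double precision : %.1f Gflop/s\n", spe_dp_gflops);
    printf("PS3 (6 SPEs) integer SIMD: %.1f Gop/s\n", ps3_int_gops);
    /* Compare with the ~1.35 Tflop/s quoted above for an HD 6970,
       i.e. roughly an order of magnitude more raw throughput. */
    return 0;
}
[/CODE]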
#29
Bamboozled!
"๐บ๐๐ท๐ท๐ญ"
May 2003
Down not across
2E16₁₆ Posts
Arjen built the PS3 cluster several years ago when the PS3 price-performance was actually quite good. The cluster is still producing good results not because it is state of the art but because it has been fully paid for, it has achieved all that it was intended to do, and it still works. For the moment, the only cost to use it is the electrical power and a relatively small amount of attention.

Some more facts for you. Arjen and his team are well aware of this figure. When I spoke with him and Joppe Bos at Eurocrypt last year he said that they already had code which was at least ten times faster on a GPU than on a PS3. The problem they then had was that the latency was appalling. This is in the context of running ECM to a stage one limit of a billion or few. A PS3 could run 8 curves a day (if I remember the numbers correctly) whereas a GPU could run 30 thousand curves a year (again, if I remember correctly). However, it took the PS3 a day before any answers appeared, whereas it would take the GPU a year before its output was ready. They weren't prepared to wait that long.

This information prompted me to pass on some of my investigations into low-latency arithmetic and stimulated me into working on the subject again. I really must restart that project as I'd also like to run ECM on GPUs.

Paul
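Putting Paul's (explicitly half-remembered) figures side by side makes the throughput-versus-latency trade-off concrete; a small C sketch using only the numbers quoted above:

[CODE]
#include <stdio.h>

int main(void)
{
    /* Figures as quoted above (both flagged "if I remember correctly"). */
    double ps3_curves_per_day  = 8.0;
    double gpu_curves_per_year = 30000.0;

    double ps3_curves_per_year = ps3_curves_per_day * 365.0;   /* ~2920 */
    double throughput_ratio    = gpu_curves_per_year / ps3_curves_per_year;

    printf("PS3: %.0f curves/year, first results after ~1 day\n",
           ps3_curves_per_year);
    printf("GPU: %.0f curves/year, first results after ~1 year\n",
           gpu_curves_per_year);
    printf("GPU throughput advantage ~%.0fx, latency penalty ~365x\n",
           throughput_ratio);
    return 0;
}
[/CODE]

So, on these numbers, the GPU wins roughly 10x on curves per year while losing roughly 365x on time to first result, which is exactly the trade-off described.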
#30
Sep 2006
The Netherlands
3·269 Posts
What you want to do is design your own card (not a CPU). Take the default card from AMD or Nvidia, put some additional SRAM on each card and a hub from it directly to the network, and use a better cooler and a lot more GDDR5 RAM. The latency to device RAM will be a tad worse, yet the latency to the other nodes will of course kick butt. This is not rocket science. It'll be a tad more expensive than GPUs from the shop, of course, and you need to produce at least 500-1000 of each card to get the price down a lot. So each modified GPU would be about 150 euro more expensive than buying one in the shop. The modification, even if you have a commercial guy do it, would not cost more than a couple of hundred thousand. Yet all sorts of things that now require a lot of clever programming would become a lot easier.

Please note that in the first place, just like with the CELL, forget the word latency. Focus upon throughput :)

Regards,
Vincent

p.s. after having had that 'cluster' for a few years, it's still only 'conceivable' that a modification will allow 'em to use that 7th core after all those years? How can we take this seriously?

p.s.2 doing 256-bit calculations with 'schoolboy' multiplication is 256 multiplications. That's an effective output of 4 bits per instruction. Maybe the 80386 outputs more bits per cycle there? How to take this paper seriously if you need 256 multiplications just to multiply 1 limb? In the Soviet Union they used those wooden abacuses to multiply faster than that...

Last fiddled with by diep on 2011-05-29 at 20:08
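For the p.s.2 point: with 16-bit limbs a 256-bit operand is 16 limbs, so schoolbook multiplication needs 16 × 16 = 256 limb products. A minimal C sketch of that counting argument (plain portable C, not the SPU code from the paper):

[CODE]
#include <stdio.h>
#include <stdint.h>

#define LIMBS 16   /* 16 limbs of 16 bits = 256 bits */

/* Schoolbook multiply: r (32 limbs) = a * b (16 limbs each).
   The double loop performs exactly LIMBS*LIMBS = 256 16x16->32-bit
   multiplications, which is the count objected to above. */
static void mul256_schoolbook(uint16_t r[2 * LIMBS],
                              const uint16_t a[LIMBS],
                              const uint16_t b[LIMBS])
{
    for (int i = 0; i < 2 * LIMBS; i++)
        r[i] = 0;

    for (int i = 0; i < LIMBS; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < LIMBS; j++) {
            uint32_t t = (uint32_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint16_t)t;
            carry    = t >> 16;
        }
        r[i + LIMBS] = (uint16_t)carry;
    }
}

int main(void)
{
    /* a = 0xFFFF, b = 2; remaining limbs are zero-initialized. */
    uint16_t a[LIMBS] = { 0xFFFF }, b[LIMBS] = { 2 }, r[2 * LIMBS];
    mul256_schoolbook(r, a, b);
    printf("low two limbs of 0xFFFF * 2: 0x%04X%04X\n", r[1], r[0]);
    return 0;   /* prints 0x0001FFFE */
}
[/CODE]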
#31
Dec 2010
Monticello
1795₁₀ Posts
I'm impressed with the latencies....a day to a week is huge....
As I see it, the problem is that the GPU programming budget is about 20 guys working casually....jasonp, msft, the judger, George Woltman, whoever wrote msieve/NFS@Home, some helpers, and that's about it... It doesn't compare even to a group of 10 programmers working on the problem full time. And honestly, the math is decently ferocious, so you can't just throw $$ at the problem...you need people with math training.

Is there a reasonably simple explanation of why GPU latencies (or even PS/3 ones) on ECM are so high? (Glossing over the fact that I don't really understand an elliptic curve; I am still working on MPQS.)

****idly wonders if SONY could be convinced to let a few trusted math programmers work on the PS/4 under a restricted license (don't develop games and all code comes through you, so it's enforceable).....thinking otherwise, they are gonna get taken to the cleaners by the GPU cards...*****
#32
Tribal Bullet
Oct 2004
3565₁₀ Posts
(Msieve is written primarily by me, along with help here and there from quite a few folks. NFS@Home uses the lattice sieve from GGNFS with modifications from frmky and debrouxl, and uses almost-unmodified msieve for the postprocessing)
I don't see where Sony would have any incentive to develop non-gaming applications on any model of Cell processor. The BlueGene series of processors has all the programming infrastructure it needs courtesy of IBM, and the console market needs the economy of scale that comes from running games. Honestly, if Prime95 were immediately ported to a PS3, how many consoles would that actually sell?

When the Cell processor was first introduced, there were grandiose visions that the Cell would be ubiquitous: your microwave would have one and it would network automatically with other Cell processors around it. That warranted opening up the platform, but now it looks like the chip will only be used in the consumer space for consoles.
#33
Dec 2010
Monticello
5×359 Posts
Recognizing that gaming consoles require economies of scale, so lots of chips need to be sold, doesn't necessarily mean a tidy profit can't be made from some other serious work. At this point, even if all the prime-hunting programs in the world worked on a PS/3 right now, unless those things were dirt cheap, like around $100, I don't think it would sell a single unit...the performance isn't there. I predict that in 5 years all phases of prime and factor hunting will be almost exclusively on GPUs. I can name one NVIDIA GPU sale driven by CUDA (my GT440), and another I have my eye on...
Now, the question is: on the PS/4, or possibly with a GPU-style coprocessor card (now that HDCP-protected digital video is in place, letting the card ride in the back of my PC while still having good copy protection), would it make sense for Sony to let entities other than gaming companies do some programming? But I'm betting with Jason (even without his post) that no such thing is in the works, especially given the history of the PS/3 (and the Sony rootkit) and the recent hacking troubles with the PlayStation Network...the attitude and the karma just aren't there...even if Sony ends up with the worst of both worlds (only runs games, supports viruses, and brings loads of bad publicity).

In an admittedly grandiose (and possibly impossible) vision of mine, a secure OS resides on the GPU card (I don't regard even Linux as really secure; there's too much kernel-mode code) and the PC frame supplies the peripherals.