Cell
As is well known on the forum, LL testing is a sequential algorithm.
The FFT used in each iteration is, however, parallelisable. Leaving aside the single/double precision issue, the main PowerPC processor in the Cell chip could spread the FFT across the 8 subsidiary cells to speed up the FFT part of the math. The cells are designed to talk to each other and cooperate. The main CPU has 32K L1 and 512K L2. Each of the 8 cells around it has 256K of cache. There is an onboard memory/IO controller, namely Rambus XDR @ 3.2 GHz and FlexIO @ 6.4 GHz. Anyway, even if it's no good for LL testing, maybe it could be used as a way to do trial factoring quickly? Or do the same limitations apply? |
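To make the "sequential" point concrete, here is a minimal sketch of the Lucas-Lehmer test using plain Python integers. It is not how Prime95, GLucas or MLucas implement it (they do the squaring with a floating-point FFT); the function name is just illustrative. Each iteration needs the previous value of `s`, so the only place left to parallelise is inside the big squaring itself.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime, for an odd prime exponent p."""
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value
    for _ in range(p - 2):    # p - 2 iterations, each depending on the last
        # The squaring below is the step real clients perform with an FFT.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13):
        print(p, lucas_lehmer(p))
```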
A bigger problem is that each "cell" only has 256KB of local memory, and they [I]do not[/I] share memory addresses.
Information transfer between units is accomplished using DMA, and would be a major bottleneck. The cells are designed to work on tasks that don't require any interprocessor communication. |
[QUOTE=ColdFury]A bigger problem is that each "cell" only has 256KB of local memory[/QUOTE]
Would they be able to factor well? |
[QUOTE=Uncwilly]Would they be able to factor well?[/QUOTE]
Depends on how amenable the factoring code is to vectorization. |
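As a rough illustration of why trial factoring might vectorize, here is a small sketch assuming the usual facts about Mersenne factors: any factor q of 2^p - 1 (p an odd prime) has the form q = 2kp + 1 with q ≡ ±1 (mod 8). The function name and the `max_k` bound are made up for this example and are not taken from any GIMPS client. Each candidate is checked independently of the others, which is what makes the work data-parallel in principle.

```python
from typing import Optional


def find_small_factor(p: int, max_k: int) -> Optional[int]:
    """Look for a factor q = 2*k*p + 1 of 2**p - 1 with 1 <= k <= max_k."""
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        # Factors of a Mersenne number must be 1 or 7 modulo 8.
        if q % 8 not in (1, 7):
            continue
        # q divides 2**p - 1 exactly when 2**p is congruent to 1 mod q.
        # Every candidate is tested on its own; no result feeds the next.
        if pow(2, p, q) == 1:
            return q
    return None


if __name__ == "__main__":
    print(find_small_factor(11, 10))   # 2**11 - 1 = 2047 = 23 * 89
    print(find_small_factor(29, 10))
```

How well this maps onto the SPEs would still depend on how fast they can do the modular exponentiation in integer or single-precision arithmetic, which the thread does not settle.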
Fwiw, another article from today talks about single/double precision abilities ([url]http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318[/url]).
What sounds relevant to this discussion is on page 4 ([url]http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=4[/url]). [quote]Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops when the DP FP capability of the PPE is also taken into consideration.[/quote] According to ([url]http://mersenneforum.org/showthread.php?t=2718[/url]) one P90 year is 1.04e15 floating-point operations. Taking the upper figure, one 8 SPE CELL could do one P90 year every 1.04e15 / 30e9 seconds, or about every 9.6 hours. This translates to 2.49 P90 years per day. In the last week PrimeNet did an average of 1483 P90 years per day. In order to equal PrimeNet's output it'd only take 595 CELLs. It's highly likely I made some sort of error in that calculation, but if these processors are going to be so widespread as to be in our Playstations, doesn't it seem possible that we might get a few in PrimeNet? And if so they could make quite a contribution. |
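The arithmetic in that estimate can be checked with a few lines. This is only a sketch using the numbers quoted in the post (1.04e15 operations per P90 year, 1483 P90 years per day for PrimeNet, 25 to 30 GFlops peak); the function names are illustrative, and peak throughput is assumed to be fully sustained, which real LL code would not achieve.

```python
FLOP_PER_P90_YEAR = 1.04e15   # figure quoted from the linked forum thread
SECONDS_PER_DAY = 86_400


def hours_per_p90_year(flops_per_second: float) -> float:
    """Hours needed to complete one P90 year of work at the given rate."""
    return FLOP_PER_P90_YEAR / flops_per_second / 3600


def p90_years_per_day(flops_per_second: float) -> float:
    """P90 years completed per day at the given sustained rate."""
    return flops_per_second * SECONDS_PER_DAY / FLOP_PER_P90_YEAR


def cells_to_match(primenet_p90_per_day: float, flops_per_second: float) -> float:
    """Number of processors needed to equal PrimeNet's daily output."""
    return primenet_p90_per_day / p90_years_per_day(flops_per_second)


if __name__ == "__main__":
    for gflops in (25, 30):
        rate = gflops * 1e9
        print(gflops,
              round(hours_per_p90_year(rate), 1),
              round(p90_years_per_day(rate), 2),
              round(cells_to_match(1483, rate)))
```

The 2.49 P90 years per day and 595 processors correspond to the 30 GFlops end of the range; at 25 GFlops the same formulas give about 11.6 hours per P90 year, 2.08 P90 years per day and roughly 714 processors.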
It looks like within 2005, PCs with the Cell processor could be completing 10-million-digit numbers in 1-2 days instead of a month. If someone wants to make some real money, they need to put these things, complete with memory, on a PCI card so you can pop 5 or 6 in a PC.
Imagine being able to complete 25-120 ten-million-digit numbers per PC per month... SIGN ME UP :banana: :banana: :banana: |
[QUOTE=Paulie]Unfortunately Cell is geared to single precision SIMD. GIMPS needs double precision.[/QUOTE]
GIMPS uses double precision (floating point), but I don't think it is necessary; it could have used integer storage. One of the Lucas clients (was that G or M?) uses single precision, doesn't it? |
Not to rain on your parade, but the CELL is a PowerPC CPU, not an x86 CPU. In other words, it will not run Prime95. It should run GLucas or MLucas. Also, according to the article, [b]'Moreover, these SP operations are not fully IEEE754 compliant in terms of rounding modes'[/b] and [b]'the SPE’s double precision unit is fully IEEE854 compliant'[/b]. Since IEEE854 is a generalization of IEEE754, DP FP might be IEEE754 compliant, but I don't know. I'm not an expert on FFTs, but I have to assume that the current versions of GLucas and MLucas assume IEEE754 compliance.
The current 2.5 GHz PowerPC 970 (aka G5) is around 19 GFLOPS for a single CPU, whereas the CELL (with 8 SPEs) is around 25-30 GFLOPS. That might sound impressive, but even on the G5, GLucas/MLucas run at about half the speed of Prime95 on a similarly clocked P4. There are a number of reasons for this. One is that GLucas and MLucas are not coded in assembler; they have some assembler macros, but not much. Prime95 can take advantage of SSE and SSE2 on x86, but AltiVec on PPC is useless since it only supports single precision. |
If the Cell processor benefits significantly from a hand-optimized FFT routine, then all the better. Such a routine would have great benefits, not just for Mersenne, but for all math programs that make use of it. Since quite a few of the TOP 1000 number crunchers run FFT-dependent algorithms, such an optimized routine could win some fame.
|
The entire FFT algorithm's dataset would need to fit in the cell's 256K memory. These are very simple devices. There is no memory virtualization, and the cells do not share a common memory space, like normal co-processors. This means no swapping or any other tricks.
|
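Taking the claim above at face value, a back-of-the-envelope sketch shows how far a 10-million-digit test is from fitting in 256K. The figure of about 18 bits of the number stored per double-precision FFT word is an assumption made for illustration, not something stated in this thread; the function name is likewise illustrative.

```python
import math

LOCAL_STORE_BYTES = 256 * 1024   # per-cell memory quoted in the thread
BYTES_PER_DOUBLE = 8
BITS_PER_WORD = 18               # ASSUMPTION: bits of the number per FFT word


def fft_working_set_bytes(decimal_digits: int,
                          bits_per_word: int = BITS_PER_WORD) -> int:
    """Rough size of one double-precision FFT array for a number this large."""
    # Bits needed to hold a number with this many decimal digits.
    bits = math.ceil(decimal_digits / math.log10(2))
    words = math.ceil(bits / bits_per_word)
    return words * BYTES_PER_DOUBLE


if __name__ == "__main__":
    size = fft_working_set_bytes(10_000_000)
    print(size, size / LOCAL_STORE_BYTES)
```

Under that assumption a single FFT array is on the order of 14 MB, dozens of times larger than one cell's local memory, so the data would have to be streamed in pieces over DMA rather than held in place.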
Cell Architecture Explained
Also have a look at:
[URL=http://www.blachford.info/computer/Cells/Cell0.html]http://www.blachford.info/computer/Cells/Cell0.html[/URL] Tony |