GIMPS awaiting a major transition
L.S.,
The more I read and think about it, the more I realize that the GIMPS project has to prepare itself for the new IBM/Sony/Toshiba Cell processor. There are nice new functionalities, opportunities and higher speeds awaiting us, but also threats that need to be considered and dealt with. The Cell processor has 8 FPUs aboard, each of them capable of outrunning an Intel Pentium IV. Also, grid computing and networking are the basis for the design of the chip, very different from today's networks, where no processor is aware of other processors. The current Prime95 program is hand-optimized for the CISC Pentium processor, and to some extent the Athlon. With the Cell processor being RISC based there will be less need to work with a hand-optimized program. Hence people can much more easily write their own LL test program and run it outside the GIMPS project. This could jeopardize the GIMPS rule that you would have to share the EFF prize. Most members joined even before there was a monetary award, and to prevent double work a server coordinating the hand-out will still be welcome to everyone willing to test, but one could work ahead of the server range more easily. I think the GIMPS project could benefit from cooperation with the chip designers if we were able to write the fastest FFT implementation for the Cell processor. I would guess, though, that they are working on that themselves. Also, more competition from other projects can be expected when network-oriented processors become available. Projects offering money for making your processor available for all kinds of commercial and scientific work can be expected to pop up everywhere, competing for resources. The future for sure is interesting. |
I think you're referring to this:
[url]http://story.news.yahoo.com/news?tmpl=story&ncid=738&e=1&u=/ap/20050208/ap_on_hi_te/cell_processor[/url] "Cell's designers say their chip, built from the start with the burgeoning world of rich media and broadband networks in mind, can deliver 10 times the performance of today's PC processors." However, the article then goes on to cite seemingly contradictory figures: "Cell appears to have an advantage in the number of transistors — 234 million compared with 125 million for today's latest Pentium 4 chips. Traditional chip makers, however, have regularly doubled their number of transistors every 12 to 18 months." "Cell is said to run at clock speeds greater than 4 gigahertz, which would top the 3.8 GHz of Intel's current top-speed chip." If Cell was 10 times faster, wouldn't we expect 1.25 billion transistors and 38 GHz? |
[QUOTE=jinydu]"Cell is said to run at clock speeds greater than 4 gigahertz, which would top the 3.8 GHz of Intel's current top-speed chip."
If Cell was 10 times faster, wouldn't we expect 1.25 billion transistors and 38 GHz?[/QUOTE] There are other, and perhaps more efficient ways, of utilizing millions of transistors than turning them into ridiculously long pipelines. John "Hannibal" Stokes of Ars Technica has written an [URL=http://arstechnica.com/paedia/m/moore/moore-1.html]excellent article[/URL] about the essentials of CPU design on the background of Moore's Law. Highly recommended reading for hardware freaks. regards, Leif. |
Speaking of Hannibal, he just posted his own article about the Cell processor here: [url]http://arstechnica.com/articles/paedia/cpu/cell-1.ars[/url]
Part 2 coming soon. |
[QUOTE=tha]L.S.,
The Cell processor has 8 FPUs aboard, each of them capable of outrunning an Intel Pentium IV. Also, grid computing and networking are the basis for the design of the chip, very different from today's networks, where no processor is aware of other processors. [/QUOTE] Unfortunately Cell is geared to single precision SIMD. GIMPS needs double precision. |
[QUOTE=leifbk]There are other, and perhaps more efficient ways, of utilizing millions of transistors than turning them into ridiculously long pipelines. John "Hannibal" Stokes of Ars Technica has written an [URL=http://arstechnica.com/paedia/m/moore/moore-1.html]excellent article[/URL] about the essentials of CPU design on the background of Moore's Law. Highly recommended reading for hardware freaks.
regards, Leif.[/QUOTE] Indeed. There are a number of improvements that would greatly benefit the domain of computational number theory without lengthening pipelines and without adding a lot more gates:
(1) Much larger L1 and L2 caches and lower latency to main memory
(2) More gates devoted to integer multiplication and division. The Pentium IV takes 37 cycles to do a division!
(3) A scatter/gather capability
etc. |
The Intel P4 can do 8 single precision ops using SSE2 (using two SSE2 registers), so the Cell processor isn't ahead of it, yet.
As Paulie mentioned, GIMPS needs double precision, so it doesn't help. Are there ways of using single precision math for GIMPS? If so, are they practical/usable? |
[CODE]the Cell processor being RISC based there will be less need to work with a handoptimized program.[/CODE]
Non sequitur. Anyways, the Cell processors are all single-precision, and even if they weren't, you couldn't run just one LL test on them. The processors are designed to all work independently. You can't share an FFT across them. You'd have to run independent LL tests on each. |
The program could be written to feed the units in order, one after another (1 2 3 4 5 6 7 8, 1 2 3 4 5 6 7 8). That would actually add quite a lot of performance: units 2-8 do the work while unit 1 keeps running the program.
|
[QUOTE=ColdFury]You'd have to run independent LL tests on each.[/QUOTE]
Exactly. If your PC happens to have 8 cells in it then you can run 8 LL tests in the time you'd usually run 1. It just so happens that LL tests aren't suited to it but I doubt Sony/IBM/etc. had us in mind when they started the design. If what the various articles are saying is true then our nice and cheap personal computers will have several of these in them and will also be able to call on any we have lying around in our PDAs, microwaves, playstation 3s and whatever other technical gizmos we have lying around. I'm sure a lot of people on these forums would love the ability to do 2 LL tests while the other 6 cells are crunching their Doom 4 or whatever other ways we can find to waste our CPU cycles. |
Actually, if it were able to do that, I would suggest running it in a line, like I suggested earlier, but that would take a lot of code. Since we can't, due to the single precision, it doesn't really help us unless we find a way to use two of them together.
|
Cell
As is well known on the forum, LL testing is a sequential algorithm.
The FFT used in each iteration is, however, parallelisable. Leaving aside the single/double precision issue, the main PowerPC processor in the Cell chip could spread the FFT across the 8 subsidiary cells to get a fast answer to the FFT part of the math. The cells are designed to talk to each other and cooperate. The main CPU has 32K L1 and 512K L2. Each of the 8 cells around it has 256K of cache. There is an onboard memory/IO controller, namely Rambus XDR @ 3.2 GHz and FlexIO @ 6.4 GHz. Anyway, even if it's no good for LL testing, maybe it could be used as a way to do trial factoring quickly? Or do the same limitations apply? |
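For readers who haven't seen it written out, here is a minimal sketch of why the LL test is sequential. It is plain textbook Lucas-Lehmer in Python using built-in big integers, not anything taken from Prime95, Glucas or Mlucas; the function name is made up for illustration. Each pass through the loop needs the result of the previous one, so only the squaring inside a single iteration (the part an FFT speeds up) can be spread over several units.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime (p an odd prime)."""
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value of the LL sequence
    for _ in range(p - 2):    # p - 2 iterations, each one needs the previous s
        s = (s * s - 2) % m   # the big squaring is the step an FFT accelerates
    return s == 0


if __name__ == "__main__":
    for p in (3, 5, 7, 11, 13, 17, 19, 23, 31):
        print(p, lucas_lehmer(p))
```

Real clients replace `s * s` with an FFT-based squaring on floating point data, which is exactly where the precision and memory questions in this thread come from. |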
A bigger problem is that each "cell" only has 256KB of local memory, and they [I]do not[/I] share memory addresses.
Information transfer between units is accomplished using DMA, and would be a major bottleneck. The cells are designed to work on tasks that don't require any interprocessor communication. |
[QUOTE=ColdFury]A bigger problem is that each "cell" only has 256KB of local memory[/QUOTE]
Would they be able to factor well? |
[QUOTE=Uncwilly]Would they be able to factor well?[/QUOTE]
Depends on how amenable the factoring code is to vectorization. |
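As a rough illustration of what trial factoring looks like, here is a small Python sketch. It relies only on two standard facts about factors q of 2^p - 1 for an odd prime p: q = 2kp + 1 for some k, and q is 1 or 7 mod 8. The function name and the `max_k` limit are made up for this example, and this is not how the GIMPS factoring code is actually organised. The point is that every candidate is tested independently with integer arithmetic only, which is why the question of vectorizing it comes up at all.

```python
def trial_factor(p: int, max_k: int = 100_000):
    """Return the smallest q = 2*k*p + 1 (k <= max_k) dividing 2**p - 1, else None."""
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):   # factors of Mersenne numbers are +-1 mod 8
            continue
        if pow(2, p, q) == 1:     # q divides 2**p - 1 exactly when 2**p == 1 (mod q)
            return q
    return None


if __name__ == "__main__":
    for p in (11, 23, 29, 13):
        print(p, trial_factor(p, max_k=100))
```

Since the candidates do not depend on each other, different ranges of k could in principle be handed to different units. Note that for very small p the function can return 2^p - 1 itself when that number is prime. |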
Fwiw, another article from today talks about single/double precision abilities ([url]http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318[/url]).
What sounds relevant to this discussion is on page 4 ([url]http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=4[/url]). [quote]Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops when the DP FP capability of the PPE is also taken into consideration.[/quote] According to ([url]http://mersenneforum.org/showthread.php?t=2718[/url]) one P90 year is 1.04e15 Flops. Taking the upper figure of 30 GFlops, one 8 SPE CELL could do one P90 year every 1.04e15 / 30e9 seconds, or about every 9.6 hours. This translates to 2.49 P90 years per day. In the last week PrimeNet did an average of 1483 P90 years per day. In order to equal PrimeNet's output it'd only take 595 CELLs. It's highly likely I made some sort of error in that calculation, but if these processors are going to be so widespread as to be in our PlayStations, doesn't it seem possible that we might get a few in PrimeNet? And if so they could make quite a contribution. |
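The back-of-the-envelope calculation in the post above can be replayed in a few lines of Python. The inputs are the ones quoted in the post (1.04e15 Flops per P90 year, 1483 P90 years per day for PrimeNet, 25 to 30 GFlops peak); the function name is made up, and the sketch assumes the chip sustains its peak figure, which real code never does.

```python
P90_YEAR_FLOPS = 1.04e15            # flops in one P90 year, as quoted in the post
PRIMENET_P90_YEARS_PER_DAY = 1483   # PrimeNet weekly average, as quoted in the post
SECONDS_PER_DAY = 86_400


def p90_years_per_day(gflops: float) -> float:
    """P90 years completed per day at a sustained rate of `gflops` GFlops."""
    seconds_per_p90_year = P90_YEAR_FLOPS / (gflops * 1e9)
    return SECONDS_PER_DAY / seconds_per_p90_year


if __name__ == "__main__":
    for gflops in (25, 30):         # both ends of the quoted 25~30 GFlops range
        rate = p90_years_per_day(gflops)
        hours = 24 / rate
        cells = PRIMENET_P90_YEARS_PER_DAY / rate
        print(f"{gflops} GFlops: one P90 year every {hours:.1f} h, "
              f"{rate:.2f} P90 years/day, {cells:.0f} Cells to match PrimeNet")
```

With these inputs the 30 GFlops case gives about 2.49 P90 years per day and roughly 595 Cells, while the 25 GFlops case gives about 2.08 per day and a little over 700 Cells. |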
It looks like within 2005, PCs with the Cell processor could be completing 10 million digit numbers in 1-2 days instead of a month. If someone wants to make some real money, they need to put these things, complete with memory, on a PCI card so you can pop 5 or 6 in a PC.
Imagine being able to complete 25-120 ten-million-digit numbers per PC per month... SIGN ME UP :banana: :banana: :banana: |
[QUOTE=Paulie]Unfortunately Cell is geared to single precision SIMD. GIMPS needs double precision.[/QUOTE]
GIMPS uses double precision (floating point), but I don't think it is necessary; it could have used integer storage. One of the Lucas clients (was that G or M?) uses single precision, doesn't it? |
Not to rain on your parade, but the CELL is a PowerPC CPU, not an x86 CPU. In other words, it will not run Prime95. It should run GLucas or MLucas. Also, according to the article, [b]'Moreover, these SP operations are not fully IEEE754 compliant in terms of rounding modes'[/b] and [b]'the SPE’s double precision unit is fully IEEE854 compliant'[/b]. Since IEEE854 is a generalization of IEEE754, DP FP might be IEEE754 compliant, but I don't know. I'm not an expert on FFTs, but I have to assume that the current versions of GLucas and MLucas assume IEEE754 compliance.
The current 2.5 GHz PowerPC 970 (aka G5) is around 19 GFLOPS for a single CPU, whereas the CELL (with 8 SPEs) is around 25-30 GFLOPS. That might sound impressive, but even on the G5, GLucas/MLucas run at about half the speed of Prime95 on a similarly clocked P4. There are a number of reasons for this. One is that GLucas and MLucas are not coded in assembler; they have some assembler macros, but not much. Prime95 can take advantage of SSE and SSE2 on x86, but AltiVec on PPC is useless since it only supports single precision. |
If the Cell processor significantly benefits from a hand-optimized FFT routine, then all the better. Such a routine would have great benefits, not just for Mersenne, but for all math programs that make use of it. As quite a few of the TOP 1000 number crunchers are used to run FFT-dependent algorithms, such an optimized routine could win some fame.
|
The entire FFT algorithm's dataset would need to fit in the cell's 256K memory. These are very simple devices. There is no memory virtualization, and the cells do not share a common memory space, like normal co-processors. This means no swapping or any other tricks.
|
Cell Architecture Explained
Also have a look at:
[URL=http://www.blachford.info/computer/Cells/Cell0.html]http://www.blachford.info/computer/Cells/Cell0.html[/URL] Tony |
Hey guys, just to let you all know, I might be getting an early shipment of the Cell chips when they are finalized for PC use, and I will put them up for auction on eBay. I will keep everyone posted on the situation.
PS: It's nice to know people in North America and Asia :showoff: |
[QUOTE=dsouza123]The Intel P4 can do 8 single precision ops using SSE2, (using two SSE2 registers), so the cell processor isn't ahead of it, yet.
As paulie mentioned GIMPS needs double precision so it doesn't help.[/QUOTE]That's 8 single precision ops per 2 cycles, or 4 SP ops per cycle. As with SSE2 in general, the P4 FPU handles only one 64-bit half per unit (FP Add/Mul) per cycle (but in parallel across different pipeline stages). [B]@all: [/B] Let's discuss whether buyers of PS3s (not those who'd buy them specifically for Prime95) would see some use in letting the machine do some calculations while they aren't using it. |
About this single precision stuff: couldn't we just increase the FFT length to improve accuracy?
Even if we wanted to do this, we would have to approach Sony for the DRM keys to unlock the PS3 hardware. I don't think Sony will do this, because they make all their money on the software. Sony would want over 10 USD for each copy of P95 for PS3. |
[QUOTE=Dresdenboy]Let's discuss, if the buyers of PS3s (not those, who'd buy them intentionally for Prime95) would see some use in letting the machine do some calculations while they aren't using it.[/QUOTE]
People don't normally leave their consoles running unless they are playing, so that means P95 wouldn't get much effective running time. Taking into account the difficulties involved, it is probably not worth investing the time to port P95 to these machines. That is, unless for some reason they start substituting Intel/AMD processors in general purpose PCs, which doesn't look very likely. Anyway, just my 2 cents... |
Not really. In the early era of the PS2 there were Linux distros for it. They're still available, but only the Japanese versions... wait, here:
[url]http://www.us.playstation.com/peripherals.aspx?id=SCPH-97047[/url] [url]http://blackrhino.xrhino.com/main.php?page=home[/url] Oh, there is a free distro, I think. Interesting: net booting [url]http://playstation2-linux.com/projects/diskless[/url] |
[QUOTE=E_tron]About this single precision stuff: couldn't we just increase the FFT length to improve accuracy?
Even if we wanted to do this, we would have to approach Sony for the DRM keys to unlock the PS3 hardware. I don't think Sony will do this, because they make all their money on the software. Sony would want over 10 USD for each copy of P95 for PS3.[/QUOTE] Someone posted about this once, and using single precision would bloat the FFT to totally unreasonable sizes. The FFTs can't fit in the Cell's memory anyways. People don't realize the cells don't have MMUs and use DMA to perform memory transfers. Imagine running Prime95 with 256 KB of RAM and paging everything to and from the hard drive. |
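To put a rough number on the "bloat" claim, here is a crude Python sketch. It uses a deliberately pessimistic rule of thumb that is an assumption of this example, not Prime95's actual error criterion: with N words of b bits each, a squaring can produce values of about 2b + log2(N) bits, and those must fit in the mantissa. The function names and the test exponent (about 33.2 million bits, i.e. roughly ten million decimal digits) are chosen only for illustration.

```python
def max_bits_per_word(mantissa_bits: int, log2_n: int) -> int:
    """Worst-case rule of thumb (an assumption): 2*b + log2(N) <= mantissa bits."""
    return max(0, (mantissa_bits - log2_n) // 2)


def smallest_fft_log2(exponent: int, mantissa_bits: int):
    """Smallest log2(N) whose N words can hold `exponent` bits, or None if none can."""
    for log2_n in range(1, mantissa_bits + 1):
        capacity = (1 << log2_n) * max_bits_per_word(mantissa_bits, log2_n)
        if capacity >= exponent:
            return log2_n
    return None


if __name__ == "__main__":
    exponent = 33_219_281   # roughly a ten-million-digit Mersenne candidate
    print("double (53-bit mantissa):", smallest_fft_log2(exponent, 53))
    print("single (24-bit mantissa):", smallest_fft_log2(exponent, 24))
```

Under this bound a double precision FFT of 2^21 words is enough, while single precision never gets there: as N grows, the usable bits per word shrink faster than the word count rises. Real implementations squeeze out somewhat more bits per word than a worst-case bound suggests, so treat this only as an indication of the trend. |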
[QUOTE=ColdFury]Someone posted about this once and using single-precision would bloat the FFT to totally unreasonable sizes.
The FFTs can't fit in the Cell's memory anyways. People don't realize they don't have MMUs and use DMA to perform memory transfer. Imagine running Prime95 with 256 KB of RAM and paging everything to and from the hard drive.[/QUOTE]A Cell SPE (like the PPE) can do double precision math, but at a slower rate than SP (something like a factor of 10, IIRC). It is much better for the FFT to work with such slow double precision, with enough mantissa bits for the calculation, than to do a huge FFT that can use only a few bits per SP number. The next thing is: why would the FFT have to fit into Cell's memory? It usually doesn't fit into the caches of a K7, K8, P4 or Pentium-M CPU either. Instead, Cell has a dual channel XDR memory controller delivering 25 GB/s. That's a hell of a lot more than what we get with the MCT of a K8 (although that already reaches ~98% of the max bandwidth of 6103 MiB/s for 2xDDR400 RAM) or with DDR2 on a newer P4 board. FFTs can be calculated in parallel very well if the interconnect bandwidth is high enough. And the algorithms are very straightforward: while executing the first instruction you could already say what will happen 1000 instructions later. An FFT algorithm for a certain size has a fixed pattern of how it is executed and when and where it reads and stores data. The perfect job for an SPE on Cell. Even the fact that the local memory is not a cache is not as bad as it may seem, since it has low latency (6 cycles, because it's SRAM like in a cache) and its behaviour is predictable (unlike a cache), since it does nothing on its own. It's like a cache without logic. And because of the access patterns mentioned above, you can easily load the data 6 cycles before it is used. Even the times when the memory's data has to be exchanged will be small thanks to the EIB. The SPEs can also access the L2 and external XDR memory. The 256 kB local SRAM should be good for possibly up to 14 levels of the Prime95 FFT (it also needs space for code and some tables). 
Some links (although already mentioned in some threads): [URL=http://anandtech.com/cpuchipsets/showdoc.aspx?i=2379&p=1]Understanding the Cell Microprocessor[/URL] [URL=http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318]ISSCC 2005: The Cell Microprocessor[/URL] [URL=http://arstechnica.com/articles/paedia/cpu/cell-1.ars]Introducing the IBM/Sony/Toshiba Cell Processor — Part I[/URL] [URL=http://arstechnica.com/articles/paedia/cpu/cell-2.ars]Introducing the IBM/Sony/Toshiba Cell Processor — Part II[/URL] |
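One way to arrive at a figure like the "up to 14 levels" mentioned in the post above is to count how many double precision complex points fit into an SPE's local store. The Python sketch below does only that. The 16 bytes per point and the 64 KB set aside for code and tables are assumptions made for this example, not numbers from the Cell documentation or from Prime95.

```python
LOCAL_STORE_BYTES = 256 * 1024   # per-SPE local memory, as stated in the thread
BYTES_PER_COMPLEX_DOUBLE = 16    # assumption: two 8-byte doubles per FFT point


def fft_levels_in_store(store_bytes: int, reserved_bytes: int = 0) -> int:
    """Radix-2 levels of the largest power-of-two FFT block that fits in the store."""
    points = (store_bytes - reserved_bytes) // BYTES_PER_COMPLEX_DOUBLE
    return points.bit_length() - 1 if points >= 1 else 0


if __name__ == "__main__":
    print("whole store:", fft_levels_in_store(LOCAL_STORE_BYTES))
    # hypothetical 64 KB reserved for code and twiddle tables
    print("64 KB reserved:", fft_levels_in_store(LOCAL_STORE_BYTES, 64 * 1024))
```

With the whole store used for data this gives 2^14 points, i.e. 14 levels; reserving space for code and tables drops it to 13, which is consistent with the "possibly up to" wording. |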
In addition to Dresdenboy's comments I'd like to point you to the excellent AnandTech article [URL=http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2379]Understanding the Cell Microprocessor[/URL].
The article covers the implications of Cell's cacheless in-order architecture. Tau |
An addition regarding DP capabilities:
David Wang wrote (2nd link in my earlier posting): "Given this estimate, the peak DP FP throughput of an 8 SPE CELL processor is approximately 25~30 GFlops when the DP FP capability of the PPE is also taken into consideration." Let's look at a Netburst CPU at 3.4 GHz as an example: 6.8 GFlops. What is left to say is: Cell (and similar MPUs) should currently give the best bang for the buck for LLR testing or even TF. No FPGA, GPU or general purpose CPU could currently deliver more, because of high price, lack of universality or limited FP throughput. |
I wonder what this will give...
A viral processor that builds itself. 50 nm :geek: [url]http://www.spectrum.ieee.org/WEBONLY/publicfeature/nov03/1103bio.html[/url] |
There is a [URL=http://research.scea.com/research/html/CellGDC05/index.html]Cell Presentation from GDC 2005[/URL] online, which sheds further light on the capabilities of this class of MPUs.
IMO the [URL=http://research.scea.com/research/html/CellGDC05/16.html]Cell's SPE FP and other capabilities[/URL] look even more useful for algorithms like FFTs than before. |
Other processing possibilities.
AGEIA Technologies Inc PhysX chip, a dedicated Physics Processing Unit (PPU). Expect to see PPU-enabled systems and boards in time for the 2005 Christmas buying season. Native hardware support of the NovodeX physics engine. 2 Terabits/second of bandwidth is presently contemplated for internal memories facilitating data movement to/from the FPE. The internal memory structure has no "set associativity" limitations. The PPU provides a library of common linear algebra and physics related algorithms implemented using the DME and FPE. However, application-specific or custom algorithms may also be defined within the PPU for execution by the DME and FPE. Xbox 2/Xbox Next/Xenon/Xbox 360 (which name?): "Xenon's CPU has three 3.0 GHz PowerPC cores. Each core is capable of two instructions per cycle and has an L1 cache with 32 KB for data and 32 KB for instructions. The three cores share 1 MB of L2 cache." Expected to ship before the end of 2005 (?), in two versions, one with a hard drive. |
Would it be possible / feasible to run LL tests on graphics hardware? The latest graphics chips from nV and ATi have a big processing power, the question is: would it be possible to utilize it in such a way?
Any thoughts on that :question: |
[QUOTE=Cruelty]Would it be possible / feasible to run LL tests on graphics hardware? The latest graphics chips from nV and ATi have a big processing power, the question is: would it be possible to utilize it in such a way?
Any thoughts on that :question:[/QUOTE] This question about GPUs has been asked about a thousand times in these forums. Unfortunately, the answer remains no. These GPUs will do floating point math, but only single-precision. Prime95 requires double-precision. |
Cell : DP
Do you know this presentation of the Cell architecture?
There is a slide talking about the FP: [URL=http://www.research.scea.com/research/html/CellGDC05/16.html]Slide 15[/URL] Does "2-way DP" mean double precision? But slide 17 says "SIMD float only". So, single or double precision floats? Tony |
Cell: Single or Double Floating Point?
[QUOTE=Paulie]Unfortunately Cell is geared to single precision SIMD. GIMPS needs double precision.[/QUOTE] What about this paper, which talks about double precision: [URL=http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/CD03DF9DB5C3FB9187256FC000745CF1/$file/ISSCC-20.3-Cell_Mult.pdf]ISSCC[/URL]?
Also, look at: [URL=http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell]IBM[/URL]. Tony |
[QUOTE=T.Rex]Does "2-way DP" mean double precision?[/quote]I think so.
[quote]But slide 17 says "SIMD float only".[/QUOTE]Read that slide's table right-to-left. |
[QUOTE=T.Rex]Does "2-way DP" mean double precision?[/QUOTE]Yes.
[QUOTE=T.Rex]But slide 17 says "SIMD float only".[/QUOTE]But not in the "SPE" column (Cell), which states "SIMD int, float, double". "VU" seems to be just some DSP's vector unit, which is being compared to Cell's SPEs. The double precision capabilities of Cell (the SPEs' more than the PPE's) have already been discussed in this thread. Just read above. |
While reading through an article in a Linux magazine, where they show the possibilities of running Linux on Cell, I found an interesting document mentioned: [b]Unleashing the power: A programming example of large FFTs on Cell[/b] The original source is here: [url]http://www.power.org/news/events/barcelona/[/url] And the document is here: [url]http://www.power.org/news/events/barcelona/11_chow.pdf[/url] It is about single precision FFTs, but that doesn't matter, since it covers nearly all the important factors that would be of interest for implementing an LL test on Cell. They say their FFT implementation is already close to being compute-bound, so this would be even more the case when using double precision. |
Will the SSE3 instructions bail out Intel or does the AMD implementation of SSE3 keep AMD in the lead for serious number crunching?
|
[QUOTE=JHagerson]Will the SSE3 instructions bail out Intel or does the AMD implementation of SSE3 keep AMD in the lead for serious number crunching?[/QUOTE]The new SSE3 instructions are useful for complex number math, but the FADD/FMUL throughput of the instructions that matter for Prime95 stays about the same, so this wouldn't help. Additionally, Prime95 already avoids the need for horizontal operations by doing the complex math "by hand" in 2 separate sets of calculations going on in the lower and upper halves of the SSE2 registers.
However, here is the data I collected from the respective optimization manuals for the register-to-register variants of these instructions. Intel didn't give numbers for the case where memory operands are involved and delivered from the lowest-latency cache (as is often the case in Prime95). For the K8, these instructions have a 1 cycle (HADDPx/HSUBPx) or 2 cycles (ADDSUBPx, MOVxDUP) longer latency in that case.
[code]Prescott/Nocona:
Instruction(s)         Latency/      involved Units
                       Throughput
ADDSUBPD/ADDSUBPS       5 / 2        FP_ADD
HADDPD/HADDPS          13 / 4        FP_ADD,FP_MISC
HSUBPD/HSUBPS          13 / 4        FP_ADD,FP_MISC
MOVDDUP xmm1, xmm2      4 / 2        FP_MOVE
MOVSHDUP xmm1, xmm2     6 / 2        FP_MOVE
MOVSLDUP xmm1, xmm2     6 / 2        FP_MOVE

K8 Stepping E:
Instruction(s)         Latency/      involved Units
                       Throughput
ADDSUBPD/ADDSUBPS       5 / 2        FADD
HADDPD/HADDPS           5 / 2        FADD
HSUBPD/HSUBPS           5 / 2        FMUL (maybe for parallel execution)
MOVDDUP xmm1, xmm2      2 / 2        FMUL
MOVSHDUP xmm1, xmm2     3 / 2        FMUL
MOVSLDUP xmm1, xmm2     3 / 2        FADD[/code]
Throughput is given as "cycles between instruction issue". As you can see, SSE3 wouldn't change the situation. |
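The "by hand" complex math mentioned in the post above can be pictured with a small Python sketch. It is only a loose model: two-element lists stand in for the lower and upper halves of an SSE2 register, real and imaginary parts are kept in separate "registers", and the function name is invented for this example. It is not taken from Prime95's assembler source. Every operation works lane by lane, so nothing like HADDPD or ADDSUBPD is needed.

```python
def complex_mul_lanes(a_re, a_im, b_re, b_im):
    """Multiply complex numbers lane by lane, with real and imaginary parts split.

    Each argument models one register; lane i holds part of the i-th complex
    number. Only "vertical" multiplies, adds and subtracts are used.
    """
    out_re = [ar * br - ai * bi for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    out_im = [ar * bi + ai * br for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im)]
    return out_re, out_im


if __name__ == "__main__":
    # two independent products: (1+2j)*(2+0.5j) and (3-1j)*(-1+4j)
    re, im = complex_mul_lanes([1.0, 3.0], [2.0, -1.0], [2.0, -1.0], [0.5, 4.0])
    print(re, im)
    print((1 + 2j) * (2 + 0.5j), (3 - 1j) * (-1 + 4j))   # cross-check
```

Because both lanes always carry independent data, the horizontal SSE3 instructions in the table have nothing to add to a layout like this. |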