Upcoming Prime95 monsters (processors)
This week was definitely Conroe's week as far as processors are concerned. This CPU, which will arrive later this year, should boost Prime95 performance per clock thanks to full-width 128-bit SSE execution with a throughput of 1/cycle and a bigger and better cache subsystem.
It looks like AMD's next core with an improved FPU will arrive no earlier than 2007. For a start, here is a nice article on Realworldtech: [url]http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144&p=1[/url] |
Will Merom/Conroe/Woodcrest (mobile/desktop/server) have the extra SSE2 registers like the Athlon64/Opteron? (Are the extra SSE2 registers (AMD) for 64-bit mode only, or also available in a 32-bit OS?) Is the 128-bit SSE execution also for SSE2? Is it only multiple data, i.e. 2 quad words (64-bit data) or 4 dwords (32-bit data) etc., or is there a 128-bit data type? What are the L2 cache sizes? |
[QUOTE=dsouza123]Will Merom/Conroe/Woodcrest (mobile/desktop/server) have the extra SSE2 registers like the Athlon64/Opteron?[/QUOTE] Yes, they include the x64 stuff.
[QUOTE=dsouza123](Are the extra SSE2 registers (AMD) for 64-bit mode only, or also available in a 32-bit OS?)[/QUOTE] I assume there won't be any exception here.
[QUOTE=dsouza123]Is the 128-bit SSE execution also for SSE2?[/QUOTE] It applies to the whole lot of SSEn implementations; otherwise they would have wasted resources.
[QUOTE=dsouza123]Is it only multiple data, i.e. 2 quad words (64-bit data) or 4 dwords (32-bit data) etc., or is there a 128-bit data type?[/QUOTE] Maybe in SSE4. But so far it will just be compatible with SSEn for n up to 3, as these extensions are implemented on existing architectures.
[QUOTE=dsouza123]What are the L2 cache sizes?[/QUOTE] Conroe has a shared 4 MB L2 cache. If a task on core 1 needs more cache than the task on core 2, then the first task will also be able to utilize more of the L2 cache. Also, the L1-to-L1 connections between the cores are better, and the 64-bit implementation will surely be better than on Prescott. This could also mean faster-running 64-bit TF code. |
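On the "multiple data" question above, a minimal sketch may help. In SSE2 the same 128 bits are treated as packed lanes (for example 2 x 64-bit doubles or 4 x 32-bit dwords), and arithmetic works lane by lane rather than on one 128-bit number. Python's standard `struct` module stands in for an XMM register here; the variable names are purely illustrative.

```python
import struct

# 16 bytes stand in for one 128-bit XMM register (illustration only).
reg = struct.pack("<2d", 1.5, -2.0)        # two packed 64-bit doubles

# The same 128 bits viewed through different lane layouts.
as_doubles = struct.unpack("<2d", reg)     # 2 x 64-bit floating point
as_qwords = struct.unpack("<2Q", reg)      # 2 x 64-bit integers (quad words)
as_dwords = struct.unpack("<4I", reg)      # 4 x 32-bit integers (dwords)

print(len(reg) * 8, "bits")
print(as_doubles)
print([hex(q) for q in as_qwords])
print([hex(d) for d in as_dwords])
```

The bytes never change; only the interpretation does, which is all the "2 quad words or 4 dwords" distinction amounts to.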
[QUOTE=Dresdenboy]This week was definitely Conroe's week as far as processors are concerned. This CPU, which will arrive later this year, should boost Prime95 performance per clock thanks to full-width 128-bit SSE execution with a throughput of 1/cycle and a bigger and better cache subsystem.
It looks like AMD's next core with an improved FPU will arrive no earlier than 2007. For a start, here is a nice article on Realworldtech: [url]http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144&p=1[/url][/QUOTE] Quoted from the article: [QUOTE]...However, the bottom line is that we expect the Core microarchitecture to provide a 20-40% performance boost over the prior generation products, and more in certain cases. At the same time, [COLOR="Red"]power consumption will drop dramatically for the desktop and server devices, in the range of 30-40% and possibly more[/COLOR]. As a result, the performance/watt will improve substantially for Intel...[/QUOTE] Very attractive to GIMPS farmers :w00t: |
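As a rough back-of-the-envelope check on the quoted figures, the sketch below combines the 20-40% performance boost with the 30-40% power drop. It assumes both numbers apply to the same workload, which the article does not state, and the function name is made up for illustration.

```python
def perf_per_watt_gain(perf_boost, power_drop):
    """Performance/watt relative to the prior generation (1.0 = unchanged).

    perf_boost and power_drop are fractions, e.g. 0.20 for 20%.
    """
    return (1.0 + perf_boost) / (1.0 - power_drop)

# Pessimistic and optimistic ends of the ranges quoted from the article.
low = perf_per_watt_gain(0.20, 0.30)
high = perf_per_watt_gain(0.40, 0.40)
print(f"{low:.2f}x to {high:.2f}x performance/watt")
```

Under that assumption the quoted ranges work out to somewhere around 1.7x to 2.3x the performance/watt of the prior generation.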
[QUOTE=Dresdenboy]For a start a nice article on Realworldtech:
[url]http://www.realworldtech.com/page.cfm?ArticleID=RWT030906143144&p=1[/url][/QUOTE] A nice article. While the full 128-bit SSEn implementation with FADD and FMUL on separate ports looks very, very promising, this will likely shift the GIMPS bottleneck to another part of the CPU. For example, if the latency on add/mul is high, then the bottleneck will become "register pressure" (not enough SSE2 registers to schedule independent floating point operations). If the add/mul latency is reasonable, then the bottleneck will move to how fast data can be stored and loaded -- L1 and L2 cache latency & bandwidth may be the bottleneck. In any event, it will be interesting to read more and get some benchmarks in the coming months! |
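The register pressure argument above can be sketched with a crude estimate. To keep a pipelined unit that accepts one operation per cycle busy, roughly latency x issue rate independent operations must be in flight, and each needs a register for its result. The latencies below are hypothetical, not measured Conroe figures, and the model ignores registers needed for inputs and temporaries.

```python
XMM_REGS_32BIT = 8     # XMM registers available in 32-bit mode
XMM_REGS_64BIT = 16    # XMM registers available in 64-bit mode

def results_in_flight(add_latency, mul_latency, issue_per_cycle=1):
    """Independent results needed to keep both the add and mul units busy."""
    return (add_latency + mul_latency) * issue_per_cycle

# Hypothetical (add, mul) latencies in cycles.
for add_lat, mul_lat in [(3, 5), (4, 6)]:
    need = results_in_flight(add_lat, mul_lat)
    print(f"add={add_lat} mul={mul_lat}: {need} results in flight, "
          f"fits 8 regs: {need <= XMM_REGS_32BIT}, "
          f"fits 16 regs: {need <= XMM_REGS_64BIT}")
```

Even in this optimistic model a small rise in latency pushes the requirement past the 8 registers of 32-bit mode while still fitting in the 16 of 64-bit mode.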
Yes, I saw them mention that SSE instructions would now take 1 clock cycle instead of the 2 cycles on average before. I figured that should give GIMPS a nice speed boost assuming that another bottleneck didn't get hit really fast.
It will be interesting to see. |
[QUOTE]For example, if the latency on add/mul is high, then the bottleneck will become "register pressure" (not enough SSE2 registers to schedule independent floating point operations).[/QUOTE]
All the more reason to write an AMD64/EM64T version! |
[QUOTE=Jeff Gilchrist]I saw them mention that SSE instructions would now take 1 clock cycle instead of the 2 cycles on average before. [/QUOTE]
Just to clarify, the "1 clock cycle" figure is the maximum throughput in a pipelined architecture. Latency refers to how long a single add or mul operation takes. The doubling in maximum throughput is definitely good news but won't result in a doubling of Prime95 speed. BTW, AMD has typically been a clock or two faster in latency, with the AMD64 and P4 equal in throughput. |
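A toy cycle count illustrates the latency versus throughput distinction. It is only a sketch: the 4-cycle latency is a made-up value, and real code mixes dependent and independent operations.

```python
def cycles_dependent_chain(n_ops, latency):
    """Each op needs the previous result, so only latency matters."""
    return n_ops * latency

def cycles_independent(n_ops, latency, issue_interval):
    """Independent ops: a new one starts every issue_interval cycles."""
    return latency + (n_ops - 1) * issue_interval

N = 100
LATENCY = 4    # hypothetical latency in cycles

old = cycles_independent(N, LATENCY, issue_interval=2)   # one op per 2 cycles
new = cycles_independent(N, LATENCY, issue_interval=1)   # one op per cycle
chain = cycles_dependent_chain(N, LATENCY)               # same in both cases

print("independent ops:", old, "->", new, "cycles")
print("dependent chain:", chain, "cycles either way")
```

Fully independent work gets close to the 2x gain, a fully dependent chain gets none, and real code lands somewhere in between.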
[QUOTE=ColdFury]All the more reason to write an AMD64/EM64T version![/QUOTE]
Uh, raise your hand if you are running 64-bit Windows.... I don't see many hands raised :rant: |
[QUOTE=Prime95]Uh, raise your hand if you are running 64-bit Windows.... I don't see many hands raised :rant:[/QUOTE]
True, but lucky people could at least run mprime on Linux. :flex: |
[QUOTE=ColdFury]True, but lucky people could at least run mprime on Linux. :flex:[/QUOTE]
Not until binutils is upgraded to support 64-bit COFF object files |