mersenneforum.org  

Old 2003-08-07, 21:06   #23
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

3⁵×31 Posts

Quote:
Originally Posted by ewmayer
Please correct me if I'm wrong, but I believe the SSE2 allows any combination of 2 fadd/fmul ops per cycle.
The P4's max throughput is one fadd and one fmul per cycle.
Old 2003-08-07, 23:04   #24
ewmayer
2ω=0
 
 
Sep 2002
República de California

10110101111110₂ Posts

Quote:
Originally Posted by Prime95
Quote:
Originally Posted by ewmayer
Please correct me if I'm wrong, but I believe the SSE2 allows any combination of 2 fadd/fmul ops per cycle.
The P4's max throughput is one fadd and one fmul per cycle.
So, for the P4's SSE2 operations that work on pairs of 64-bit doubles: which ones (involving add or mul) are available, and what are their latency and pipelining characteristics?

This theme highlights one of the other reasons an integer transform is attractive - most CPUs can do 2 integer adds per cycle, so an integer transform optimized to minimize the number of muls should be able to come close in speed to a floating-point one, assuming integer muls and modulo operations are decently fast.
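
To make the "decently fast modulo" condition concrete, here is a minimal sketch of arithmetic modulo a Mersenne prime, where the reduction needs only shifts, masks and adds. The modulus 2^61 - 1 and all function names are assumptions chosen for illustration; nothing in this thread fixes a particular modulus.

```python
# Illustrative sketch only: the modulus and the helper names are assumptions.
P_BITS = 61
Q = (1 << P_BITS) - 1  # the Mersenne prime 2**61 - 1


def reduce_mersenne(x: int) -> int:
    """Reduce 0 <= x < 2**122 modulo Q using shifts, masks and adds."""
    # Because 2**61 == 1 (mod Q), the high bits can be folded onto the low bits.
    x = (x & Q) + (x >> P_BITS)  # now x <= 2**62 - 2
    x = (x & Q) + (x >> P_BITS)  # now x <= Q
    return 0 if x == Q else x


def add_mod(a: int, b: int) -> int:
    """Modular add for residues 0 <= a, b < Q."""
    return reduce_mersenne(a + b)


def mul_mod(a: int, b: int) -> int:
    """Modular multiply for residues 0 <= a, b < Q (product fits in 122 bits)."""
    return reduce_mersenne(a * b)
```

In a real transform the double-width product would come from the hardware's 64x64-bit multiply; Python's arbitrary-precision integers stand in for that here.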
Old 2003-08-07, 23:43   #25
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

3⁵×31 Posts

An SSE2 add has a latency of 4 clocks and one can be started every other clock.

An SSE2 mul has a latency of 6 clocks and one can be started every other clock.
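
As a rough worked example of what these figures imply, the sketch below models a pipelined unit with a fixed latency and issue interval. The function names and the simplified model (no stalls, no port conflicts) are assumptions; only the latency and issue numbers come from the two statements above.

```python
# Simplified timing model; names and the no-stall assumption are illustrative.
SSE2_ADD = {"latency": 4, "issue_interval": 2}
SSE2_MUL = {"latency": 6, "issue_interval": 2}


def cycles_independent(n_ops: int, latency: int, issue_interval: int) -> int:
    """Clock at which the last of n independent ops completes (first issues at clock 0)."""
    if n_ops == 0:
        return 0
    return (n_ops - 1) * issue_interval + latency


def cycles_dependent(n_ops: int, latency: int) -> int:
    """Clock at which a chain completes when each op needs the previous result."""
    return n_ops * latency


def peak_flops_per_clock(units, lanes: int = 2) -> float:
    """Peak double-precision results per clock; each packed op covers `lanes` doubles."""
    return sum(lanes / u["issue_interval"] for u in units)
```

Under this model, eight independent packed adds complete at clock 18, while a chain of eight dependent adds needs 32 clocks, and the add and mul units together peak at 2 results per clock, which matches the "one fadd and one fmul per cycle" figure quoted earlier in the thread.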
Old 2003-08-10, 14:18   #26
Dresdenboy
 
 
Apr 2003
Berlin, Germany

19² Posts

Quote:
Originally Posted by ewmayer
This theme highlights one of the other reasons an integer transform is attractive - most CPUs can do 2 integer adds per cycle, so an integer transform optimized to minimize the number of muls should be able to come close in speed to a floating-point one, assuming integer muls and modulo operations are decently fast.
I'm working on that, and my first step is implementing an FGT (fast Galois transform). The Opteron has a fast 64-bit mul and can do at most 3 integer adds per cycle. Unfortunately, there is a limit of at most 3 macro-ops issued per cycle across all units. With SSE2 code, it would look like this:

[code:1]
...
addpd xmm3, xmm2
addpd xmm4, xmm2
mulpd xmm1, xmm0
...
The issued Ops:
(I'm ignoring the internally renamed registers here; xmmX is shortened to rX)
n+0: [fadd r3_lo, r2_lo] [fadd r3_hi, r2_hi] [-unused-]
n+1: [fadd r4_lo, r2_lo] [fadd r4_hi, r2_hi] [-unused-]
n+2: [fmul r1_lo, r0_lo] [fmul r1_hi, r0_hi] [-unused-]
(Here I'm not sure whether the decoder can split a double-issued instruction - if it can, the CPU would issue the 3 ops in 2 cycles.)

The executed Ops:
m+0: [fadd r3_lo, r2_lo] [fmul r1_lo, r0_lo] [-unused-]
m+1: [fadd r3_hi, r2_hi] [fmul r1_hi, r0_hi] [-unused-]
m+2: [fadd r4_lo, r2_lo] [-unused-] [-unused-]
m+3: [fadd r4_hi, r2_hi] [-unused-] [-unused-]
[/code:1]
The free slots in this example show that there won't be enough of them to run a full integer transform (modulo operations etc.) in parallel with the FP code, but we could trade FP slots for integer ones if that transform lets us do fewer FP calculations (due to more bits per input word or something else).
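
The slot counts can be tallied directly from the two tables in the code listing. This is only a bookkeeping sketch: it assumes each table row is one clock with three slots, exactly as written in the listing, and the variable and function names are made up for illustration.

```python
# Each row is one clock; None marks an "-unused-" slot from the listing.
issued = [
    ["fadd r3_lo, r2_lo", "fadd r3_hi, r2_hi", None],
    ["fadd r4_lo, r2_lo", "fadd r4_hi, r2_hi", None],
    ["fmul r1_lo, r0_lo", "fmul r1_hi, r0_hi", None],
]
executed = [
    ["fadd r3_lo, r2_lo", "fmul r1_lo, r0_lo", None],
    ["fadd r3_hi, r2_hi", "fmul r1_hi, r0_hi", None],
    ["fadd r4_lo, r2_lo", None, None],
    ["fadd r4_hi, r2_hi", None, None],
]


def total_slots(table) -> int:
    """Number of slots in the table (rows times slots per row)."""
    return sum(len(row) for row in table)


def free_slots(table) -> int:
    """Number of slots left unused in the table."""
    return sum(slot is None for row in table for slot in row)


for name, table in (("issued", issued), ("executed", executed)):
    print(f"{name}: {free_slots(table)} of {total_slots(table)} slots free")
```

For this three-instruction fragment that gives 3 of 9 issue slots and 6 of 12 execution slots free.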
Old 2003-08-10, 17:41   #27
QuintLeo
 
 
Oct 2002
Lost in the hills of Iowa

2⁶×7 Posts

As a thought, would it be possible (on the Opteron/Athlon64, and possibly on older CPUs) to keep the FP pipeline full and then use any "left over" instructions to run the Integer pipeline(s) for trial factoring on another exponent at the same time?
Old 2003-08-10, 19:14   #28
Dresdenboy
 
 
Apr 2003
Berlin, Germany

19² Posts

Quote:
Originally Posted by QuintLeo
As a thought, would it be possible (on the Opteron/Athlon64, and possibly on older CPUs) to keep the FP pipeline full and then use any "left over" instructions to run the Integer pipeline(s) for trial factoring on another exponent at the same time?
This would be possible in a limited way. The problem is fitting the loops of another program into the loops of the main code. Either we do it ourselves (because we know what the loops will look like, and can use some automation) or we use SMT (what Intel calls "Hyperthreading"). SMT is not available in Opteron 1, so we would have to do it "by hand".

A parallel transform is a bit easier to fit in, since it has to handle similar data sets and the structure of the whole algorithm is similar - but not identical.
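
A toy illustration of the "some automation" idea: merge the instruction list of a secondary loop into the main loop body at a fixed ratio. This is a hypothetical sketch, not a real scheduler; the function name, the ratio parameter and the sample mnemonics are assumptions, and it ignores register pressure, dependencies and differing loop trip counts.

```python
def interleave(main_ops, extra_ops, main_per_extra):
    """Merge two op lists: after every `main_per_extra` main ops, emit one extra op."""
    merged = []
    extra = iter(extra_ops)
    for i, op in enumerate(main_ops, start=1):
        merged.append(op)
        if i % main_per_extra == 0:
            nxt = next(extra, None)
            if nxt is not None:
                merged.append(nxt)
    # Any extra ops that did not fit are appended after the main body.
    merged.extend(extra)
    return merged


# Hypothetical FP loop body and integer ops from a second task.
fp_body = ["addpd xmm3, xmm2", "addpd xmm4, xmm2", "mulpd xmm1, xmm0", "addpd xmm5, xmm2"]
int_body = ["imul rax, rbx", "add rcx, rax"]

for line in interleave(fp_body, int_body, main_per_extra=2):
    print(line)
```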
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.