#23
P90 years forever!
Aug 2002
Yeehaw, FL
3⁵×31 Posts
Quote:
#24
∂²ω=0
Sep 2002
República de California
10110101111110₂ Posts
Quote:
This theme highlights one of the other reasons an integer transform is attractive: most CPUs can do 2 integer adds per cycle, so an integer transform optimized to minimize the number of muls should be able to come close in speed to a floating-point one, assuming integer muls and modulo operations are decently fast.
#25
P90 years forever!
Aug 2002
Yeehaw, FL
3⁵×31 Posts
An SSE2 add has a latency of 4 clocks and one can be started every other clock.
An SSE2 mul has a latency of 6 clocks and one can be started every other clock.
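To see what those two figures imply together, here is a rough model. It assumes a single fully pipelined unit with the latency and issue interval quoted above and ignores everything else (decode, loads, register pressure). The function names `chains_needed` and `cycles` are made up for this illustration.

```python
import math

def chains_needed(latency, issue_interval):
    """Independent dependency chains needed to keep one pipelined unit busy."""
    return math.ceil(latency / issue_interval)

def cycles(n_ops, latency, issue_interval, chains):
    """Clocks to finish n_ops dealt round-robin over `chains` independent chains.

    Simplified model (an assumption, not a hardware spec): an op starts once
    the previous op in its own chain has finished AND the unit can accept
    a new op.
    """
    chain_ready = [0] * chains  # clock at which each chain's last result is ready
    unit_free = 0               # next clock at which the unit accepts an op
    finish = 0
    for i in range(n_ops):
        c = i % chains
        start = max(chain_ready[c], unit_free)
        unit_free = start + issue_interval
        chain_ready[c] = start + latency
        finish = max(finish, chain_ready[c])
    return finish

# SSE2 add: latency 4, one started every 2 clocks
print(chains_needed(4, 2), cycles(8, 4, 2, 1), cycles(8, 4, 2, 2))
# SSE2 mul: latency 6, one started every 2 clocks
print(chains_needed(6, 2), cycles(9, 6, 2, 1), cycles(9, 6, 2, 3))
```

Under this model, eight dependent adds take 32 clocks but only 18 when split into two independent chains, and the multiplier needs three independent chains before the issue rate, rather than the latency, becomes the limit.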
#26
Apr 2003
Berlin, Germany
192 Posts
Quote:
[code:1]
...
addpd xmm3, xmm2
addpd xmm4, xmm2
mulpd xmm1, xmm0
...

The issued Ops:
(I'm not considering the internally mapped regs here; xmmX is shortened to rX)
n+0: [fadd r3_lo, r2_lo] [fadd r3_hi, r2_hi] [-unused-]
n+1: [fadd r4_lo, r2_lo] [fadd r4_hi, r2_hi] [-unused-]
n+2: [fmul r1_lo, r0_lo] [fmul r1_hi, r0_hi] [-unused-]
(Here I'm not sure whether the decoder can split a double-issued instruction; if it can, the CPU would issue the 3 Ops in 2 cycles.)

The executed Ops:
m+0: [fadd r3_lo, r2_lo] [fmul r1_lo, r0_lo] [-unused-]
m+1: [fadd r3_hi, r2_hi] [fmul r1_hi, r0_hi] [-unused-]
m+2: [fadd r4_lo, r2_lo] [-unused-] [-unused-]
m+3: [fadd r4_hi, r2_hi] [-unused-] [-unused-]
[/code:1]
The free slots in this example show that they won't be enough to run a full integer transform in parallel with the FP code (modulo etc.), but we could trade FP slots for integer ones if that transform lets us do fewer FP calculations (due to more bits per input word or something else).
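The executed-ops table above can be reproduced with a toy model. This is only a sketch under assumptions read off that table: every packed SSE2 instruction splits into a lo and a hi 64-bit op, there is one FADD pipe and one FMUL pipe that each take one op per cycle in program order, and the third pipe has nothing to do. Dependencies and latencies are ignored, and `schedule` is a made-up helper name, not part of any real tool.

```python
def schedule(instructions):
    """Return one row per cycle: [FADD pipe, FMUL pipe, third pipe].

    instructions: list of (mnemonic, dst, src), e.g. ("addpd", "xmm3", "xmm2").
    Toy model only: no dependencies, no latencies, in-order per pipe.
    """
    pipe_of = {"addpd": "fadd", "mulpd": "fmul"}
    queues = {"fadd": [], "fmul": []}
    for mnemonic, dst, src in instructions:
        op = pipe_of[mnemonic]
        d, s = dst.replace("xmm", "r"), src.replace("xmm", "r")
        # each 128-bit packed op becomes two 64-bit ops (lo half, then hi half)
        for half in ("lo", "hi"):
            queues[op].append(f"{op} {d}_{half}, {s}_{half}")
    n_cycles = max(len(q) for q in queues.values())
    rows = []
    for m in range(n_cycles):
        row = [queues[p][m] if m < len(queues[p]) else "-unused-"
               for p in ("fadd", "fmul")]
        row.append("-unused-")  # third pipe: no work in this example
        rows.append(row)
    return rows

rows = schedule([
    ("addpd", "xmm3", "xmm2"),
    ("addpd", "xmm4", "xmm2"),
    ("mulpd", "xmm1", "xmm0"),
])
for m, row in enumerate(rows):
    print(f"m+{m}: " + " ".join(f"[{slot}]" for slot in row))
```

With these three instructions the model fills 6 of the 12 available slots over four cycles, which matches the table; the remaining 6 are the free slots the paragraph above refers to.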
#27
Oct 2002
Lost in the hills of Iowa
2⁶×7 Posts
As a thought, would it be possible (on the Opteron/Athlon64, and possibly on older CPUs) to keep the FP pipeline full and then use any "left over" instructions to run the Integer pipeline(s) for trial factoring on another exponent at the same time?
#28
Apr 2003
Berlin, Germany
192 Posts
Quote:
A parallel transform is a bit easier to fit, since it has to handle similar data sets and the overall structure of the algorithm is similar, though not identical.