![]() |
|
|
#1 |
|
Apr 2003
Berlin, Germany
1011010012 Posts |
My idea is to get a more equal distribution of additions and multiplications. That's surely not possible for every bit of code but should be for the most parts. Current code has much less muls than adds (1:4 or so). So if we'd trade adds for muls by using a different transform we could make better use of the available functional units in the FPU. That would help P4's (SSE2) and Athlons - although the first have a somewhat higher mul latency AFAIR. But the possible throughput is 1 add and 1 mul per cycle.
Matthias |
|
|
|
|
|
#2 | ||
|
∂2ω=0
Sep 2002
República de California
103×113 Posts |
Is it known that FHT can use fewer adds than FFT? I haven't seen any references to this effect. Remember, one can also code the FFT to use a better balance of adds and muls (and for certain odd radices this leads to a modest lowering of the number of adds, which are the rate-limiting operation as far as the FPU is concerned), but for the power-of-2 radices which form the bulk of the work, it appears that the algorithms that have the fewest muls also have the minimum number of adds (except w.r.to the complex "twiddle" muls, where using e.g. a Karatsuba-style mul would definitely increase the add count). I'm not sure why this is so, but perhaps it has to do with the symmetry properties of the FFT.
A recent e-mail exchange between George and myself: Quote:
Quote:
|
||
|
|
|
|
|
#3 |
|
Apr 2003
Berlin, Germany
192 Posts |
I found some interesting papers about FHT compared to other FFT algorithms and also one about better use of FMA in FFTs (which would be fine for Itanium I think). I think they help a bit comparing the FHT and others to current implementations of DWT. The mers.tar.gz package on mersenne.org's freeware page includes an implementation of a DWT using FHT instead of a FFT.
"FFT algorithms for multply-add architectures" http://www.km.sjf.stuba.sk/Aplimat/P...Ueberhuber.pdf The next paper has a table containing the number of FP muls, adds, integer muls, adds, shifts and mem usage. "Comparative analysis of FFT algorithms in sequential and parallel form" http://www.isip.msstate.edu/conferen...paper_pdsp.pdf "Comparative analysis of FFT algorithms" http://www.dcs.fmph.uniba.sk/~mnagy/rozpoz/paper_v7.pdf |
|
|
|
|
|
#4 |
|
∂2ω=0
Sep 2002
República de California
103·113 Posts |
Thanks for the links, Dresden - I'll get back to you after I've had some time to read and digest them.
|
|
|
|
|
|
#5 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
2×53×71 Posts |
I'm not an expert, but after looking at an FHT implmentation on-line it looks like my hopes for doing three levels of the FFT in the 8 Intel FPU registers won't work out. It seems the "butterfly" operation is sufficiently different than the current radix-4 FFT now in use that you are still limited to two levels of the FFT.
Unfortunately, the pain of implementing an FHT in assembly language is so high, I need to convince myself that there will be a significant payoff before proceeding. The biggest gain I know of for Athlon systems will come from changing the amount of work done in pass 1 and pass 2. Currently a 1M FFT does 12 levels in pass 1 and 8 levels in pass 2. This design decision came about so that pass 2 could be done within the Pentium's tiny 8K L1 cache. |
|
|
|
|
|
#6 |
|
Apr 2003
Berlin, Germany
192 Posts |
I'm currently working on pearl scripts which modify prime95 source to test some peephole optimizations like replacing fadd st,st with fmul st, XMM_TWO (should better be a locally placed constant for creating smaller op codes). I wrote more details in the thread about Prime95 v23.6 (http://mersenneforum.org/viewtopic.p...2&start=24).
It will be a bit more complex to align the code to 8byte decoder windows because I have to estimate the opcode sizes without disassembling - but should work for most ops. And a much more complex task (besides changing the algorithm or arrangement of data in memory and such) would be to simulate SMT by mixing code paths which work on different data sets but have different distribution of FP ops. For example mixing fadd-blocks with fmul blocks. But that would only easily be possible for AMD64 or other CPUs where the algorithms use only half of available registers. BTW there is an 4xOpteron box available on sourceforge compile farm for compiling and testing. There the upper 8 SSE2 regs could be used just for that - and the additional int regs too. But it would need some difficult scheduling by hand - especially in case of loops... DDB |
|
|
|
|
|
#7 | |
|
Apr 2003
Berlin, Germany
5518 Posts |
Quote:
|
|
|
|
|
|
|
#8 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
752610 Posts |
I looked at the FHT implementation at http://www.jjj.de/fxt/#fxt
I glanced at your article and from the diagram it looks like the three-input butterflies will not get in the way of doing 3 levels of the FFT in the 8 FPU registers. More study is required! |
|
|
|
|
|
#9 |
|
Apr 2003
Berlin, Germany
192 Posts |
Maybe it would help to store a value for some clocks in cache - modern architectures have store to load forwarding. It needs additional instruction slots but a fp op could use it as mem operand.
On Opteron you'd have 16 regs, but that's not of use for current user base. For 64bit CPUs in general I'm currently thinking about some calculation blocks using only integer units for doing fp ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so in an efficient way. It needs at least 6 and more instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel and final normalization only needs to be done for one result. I'll make a test of this and will also look how many opportunities can be found where some work can be substituted by "emulated" fp calculations. It won't help in blocks which are only used for 0.5% of calculations. ;) BTW would the change of the pass 1/pass 2 distribution of levels be possible by using some automation? |
|
|
|
|
|
#10 | |
|
"Richard B. Woods"
Aug 2002
Wisconsin USA
11110000011002 Posts |
Quote:
|
|
|
|
|
|
|
#11 | |
|
∂2ω=0
Sep 2002
República de California
103×113 Posts |
Quote:
|
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Do normal adults give themselves an allowance? (...to fast or not to fast - there is no question!) | jasong | jasong | 35 | 2016-12-11 00:57 |
| Intel GPU usable? | tha | Hardware | 4 | 2015-07-28 15:31 |
| earthquakes: a usable power source ? | science_man_88 | Lounge | 42 | 2011-03-27 00:52 |
| Graphic Card usable for Prime? | Riza | Hardware | 11 | 2006-11-09 11:46 |
| Fast Odd Discrete Fourier Transform (ODFT) | dsouza123 | Miscellaneous Math | 1 | 2005-11-13 21:37 |