mersenneforum.org > Great Internet Mersenne Prime Search > Math
2003-07-29, 12:26   #1
Dresdenboy (Apr 2003, Berlin, Germany)

Is the Fast Hartley Transform usable in DWT?

My idea is to get a more even distribution of additions and multiplications. That's surely not possible for every bit of code, but it should be for most parts. The current code has far fewer muls than adds (a ratio of about 1:4). So if we traded adds for muls by using a different transform, we could make better use of the available functional units in the FPU. That would help P4s (SSE2) and Athlons - although the former have a somewhat higher mul latency, AFAIR. But the achievable throughput is 1 add and 1 mul per cycle.
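As a back-of-the-envelope illustration of that imbalance, here is a sketch (mine, not taken from the actual prime95 code) of the textbook real-operation counts for a radix-2 complex FFT; optimized real-data codes then eliminate many of the muls but few of the adds, which is how ratios like 1:4 arise:

```python
import math

def radix2_fft_real_ops(n):
    """Textbook radix-2 Cooley-Tukey counts, ignoring trivial-twiddle
    savings: n/2 * log2(n) butterflies, each doing one complex mul
    (4 real muls + 2 real adds) and one complex add/sub pair (4 real adds).
    Returns (real muls, real adds)."""
    butterflies = (n // 2) * int(math.log2(n))
    return 4 * butterflies, 6 * butterflies

muls, adds = radix2_fft_real_ops(1 << 20)
print(muls, adds)  # muls : adds = 2 : 3 before any mul-saving tricks
```

Every standard trick (trivial twiddles, real-data symmetry, higher radices) pushes the mul count down from this 2:3 starting point while leaving the add count nearly untouched, so adds end up the bottleneck.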

Matthias

2003-07-29, 16:10   #2
ewmayer (Sep 2002, República de California)

Is it known that the FHT can use fewer adds than the FFT? I haven't seen any references to this effect. Remember, one can also code the FFT to use a better balance of adds and muls (and for certain odd radices this leads to a modest lowering of the number of adds, which are the rate-limiting operation as far as the FPU is concerned). But for the power-of-2 radices, which form the bulk of the work, it appears that the algorithms with the fewest muls also have the minimum number of adds (except w.r.t. the complex "twiddle" muls, where using e.g. a Karatsuba-style mul would definitely increase the add count). I'm not sure why this is so, but perhaps it has to do with the symmetry properties of the FFT.
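To make the twiddle trade-off concrete, here is a small sketch (illustrative only, not from any GIMPS code) of the schoolbook complex multiply versus the 3-mul/5-add Gauss/Karatsuba-style variant mentioned above:

```python
def cmul_school(ar, ai, br, bi):
    # Schoolbook complex multiply: 4 muls, 2 adds.
    return ar * br - ai * bi, ar * bi + ai * br

def cmul_3mul(ar, ai, br, bi):
    # Gauss/Karatsuba-style variant: 3 muls, 5 adds.
    t = ar * (br + bi)
    return t - bi * (ar + ai), t + br * (ai - ar)
```

Both return (-5, 10) for (1+2i)(3+4i), but the 3-mul form trades one mul for three adds - the wrong direction here, since adds are already the rate-limiting operation.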

A recent e-mail exchange between George and myself:

Quote:
Originally Posted by Prime95
Have you ever considered an FHT? If I understand things correctly, this is
an all real data transform. Since on the Intel chip you want to do as much
work as possible while the data is in registers, I think the FHT can do 3
levels of the FFT while a complex data FFT can only do two levels of the
FFT (using Intel's 8 registers to hold the data). One major concern is
that the cos+sin multipliers won't provide as many opportunities to use
tricks to eliminate some muls and adds. I haven't thought about what the
caching consequences would be - if any.
Quote:
Originally Posted by ewmayer
I've not looked at the FHT in any serious way, since I've been spoiled
by the register-rich nature of the architectures I've targeted with my code.
There might be a payoff for the x86-style chips, but since these all seem
to be moving toward supporting SSE2 and having more registers
(if only for the 64-bit SSE2 doubles), I wonder whether it would be worth
spending all that time investigating and coding a brand new type of
transform. Your call, obviously, but I don't expect much (if any) payoff
from using FHT in my code, on its target platforms.

2003-07-30, 10:05   #3
Dresdenboy (Apr 2003, Berlin, Germany)

I found some interesting papers comparing the FHT to other FFT algorithms, and also one about making better use of FMA in FFTs (which should be good for Itanium, I think). I think they help in comparing the FHT and other algorithms to current implementations of the DWT. The mers.tar.gz package on mersenne.org's freeware page includes an implementation of a DWT using an FHT instead of an FFT.

"FFT algorithms for multiply-add architectures"
http://www.km.sjf.stuba.sk/Aplimat/P...Ueberhuber.pdf

The next paper has a table containing the numbers of FP muls and adds, integer muls, adds and shifts, and memory usage.
"Comparative analysis of FFT algorithms in sequential and parallel form"
http://www.isip.msstate.edu/conferen...paper_pdsp.pdf

"Comparative analysis of FFT algorithms"
http://www.dcs.fmph.uniba.sk/~mnagy/rozpoz/paper_v7.pdf
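As a rough sketch of the FMA idea behind the first paper (my own illustration built on the standard radix-2 butterfly, not the paper's actual scheme): each half of the twiddle multiply becomes a multiply-add pair, so a chip that fuses a*b+c into one instruction (e.g. Itanium) retires the butterfly in fewer issued ops.

```python
def fma(a, b, c):
    # Stand-in for a hardware fused multiply-add (one instruction on FMA chips).
    return a * b + c

def radix2_butterfly_fma(ar, ai, br, bi, c, s):
    """Radix-2 DIT butterfly with twiddle w = c + i*s, arranged so the
    twiddle multiply is two FMAs plus two muls instead of four muls
    and two separate adds."""
    tr = fma(-s, bi, c * br)   # Re(w*b)
    ti = fma( s, br, c * bi)   # Im(w*b)
    return (ar + tr, ai + ti), (ar - tr, ai - ti)
```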

2003-07-30, 17:03   #4
ewmayer (Sep 2002, República de California)

Thanks for the links, Dresden - I'll get back to you after I've had some time to read and digest them.

2003-07-30, 19:35   #5
Prime95 (Aug 2002, Yeehaw, FL)

I'm not an expert, but after looking at an FHT implementation online, it looks like my hopes of doing three levels of the FFT in the 8 Intel FPU registers won't work out. It seems the "butterfly" operation is sufficiently different from the radix-4 FFT now in use that you are still limited to two levels of the FFT.

Unfortunately, the pain of implementing an FHT in assembly language is so high, I need to convince myself that there will be a significant payoff before proceeding.

The biggest gain I know of for Athlon systems will come from changing the amount of work done in pass 1 and pass 2. Currently a 1M FFT does 12 levels in pass 1 and 8 levels in pass 2. This design decision came about so that pass 2 could be done within the Pentium's tiny 8K L1 cache.
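The cache arithmetic behind that split can be checked in a few lines (a sketch; the 8 bytes per double and the block-size model are my assumptions, not prime95 internals):

```python
def pass2_block_bytes(pass2_levels, bytes_per_double=8):
    # One pass-2 block works on 2**pass2_levels doubles at a time.
    return (1 << pass2_levels) * bytes_per_double

# A 1M = 2**20 FFT split 12 + 8: pass 2 works on 256 doubles = 2 KB,
# which fits comfortably inside the original Pentium's 8 KB L1 data cache.
print(pass2_block_bytes(8))  # 2048
```

On an Athlon with a 64 KB L1, the same arithmetic says pass 2 could afford several more levels per block, which is exactly the rebalancing being suggested.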

2003-07-31, 08:04   #6
Dresdenboy (Apr 2003, Berlin, Germany)

I'm currently working on Perl scripts that modify the prime95 source to test some peephole optimizations, like replacing fadd st,st with fmul st, XMM_TWO (which should rather be a locally placed constant, to produce smaller opcodes). I wrote up more details in the thread about Prime95 v23.6 (http://mersenneforum.org/viewtopic.p...2&start=24).

Aligning the code to 8-byte decoder windows will be a bit more complex, because I have to estimate the opcode sizes without disassembling - but that should work for most ops.

A much more complex task (besides changing the algorithm, the arrangement of data in memory, and such) would be to simulate SMT by mixing code paths that work on different data sets but have different distributions of FP ops - for example, mixing fadd blocks with fmul blocks. But that would only be easy on AMD64 or other CPUs where the algorithms use only half of the available registers. BTW, there is a 4xOpteron box available on the SourceForge compile farm for compiling and testing. There, the upper 8 SSE2 regs could be used just for that - and the additional integer regs too. But it would need some difficult scheduling by hand, especially in the case of loops...

DDB

2003-07-31, 08:27   #7
Dresdenboy (Apr 2003, Berlin, Germany)

Quote:
Originally Posted by Prime95
I'm not an expert, but after looking at an FHT implementation online, it looks like my hopes of doing three levels of the FFT in the 8 Intel FPU registers won't work out. It seems the "butterfly" operation is sufficiently different from the radix-4 FFT now in use that you are still limited to two levels of the FFT.
Did you read the article "Calculating the FHT in Hardware" (on http://www.faginfamily.net/barry/) or something else?

2003-07-31, 17:46   #8
Prime95 (Aug 2002, Yeehaw, FL)

I looked at the FHT implementation at http://www.jjj.de/fxt/#fxt

I glanced at your article, and from the diagram it looks like the three-input butterflies will not get in the way of doing 3 levels of the FFT in the 8 FPU registers. More study is required!

2003-08-01, 08:05   #9
Dresdenboy (Apr 2003, Berlin, Germany)

Maybe it would help to store a value in cache for a few clocks - modern architectures have store-to-load forwarding. It needs additional instruction slots, but an FP op could use it as a memory operand.

On the Opteron you'd have 16 regs, but that's of no use to the current user base.

For 64-bit CPUs in general, I'm currently thinking about some calculation blocks that use only the integer units to do FP ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so on in an efficient way. It takes at least 6 instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel, and the final normalization only needs to be done for one result. I'll test this and also look at how many opportunities there are where some work could be replaced by "emulated" FP calculations. It won't help in blocks that are used for only 0.5% of the calculations. ;)
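As a toy illustration of what such "emulated" FP involves (my sketch: positive normal doubles only, truncating instead of IEEE rounding - nothing like a production path):

```python
import struct

def fp_add_via_int(a, b):
    """Integer-only add of two positive, normal doubles: unpack the
    IEEE-754 bits, align the mantissas, add, renormalize, repack.
    Truncates the low bits instead of rounding."""
    ia = struct.unpack('<Q', struct.pack('<d', a))[0]
    ib = struct.unpack('<Q', struct.pack('<d', b))[0]
    ea, ma = (ia >> 52) & 0x7FF, (ia & ((1 << 52) - 1)) | (1 << 52)
    eb, mb = (ib >> 52) & 0x7FF, (ib & ((1 << 52) - 1)) | (1 << 52)
    if ea < eb:                      # make a the larger-exponent operand
        ea, ma, eb, mb = eb, mb, ea, ma
    m = ma + (mb >> (ea - eb))       # align smaller mantissa, then add
    if m >> 53:                      # carry out of the top bit: renormalize
        m >>= 1
        ea += 1
    out = (ea << 52) | (m & ((1 << 52) - 1))
    return struct.unpack('<d', struct.pack('<Q', out))[0]
```

Even this stripped-down version burns several integer ops per operand on unpacking and alignment, which is exactly the per-operand setup cost described above.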

BTW, would changing the pass 1/pass 2 distribution of levels be possible with some automation?

2003-08-01, 13:50   #10
cheesehead "Richard B. Woods" (Aug 2002, Wisconsin USA)

Quote:
Originally Posted by Dresdenboy
For 64-bit CPUs in general, I'm currently thinking about some calculation blocks that use only the integer units to do FP ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so on in an efficient way. It takes at least 6 instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel, and the final normalization only needs to be done for one result. I'll test this and also look at how many opportunities there are where some work could be replaced by "emulated" FP calculations.
Integer ops to emulate floating-point ops which are emulating integer ops, with each emulation being a speed-up -- I love it! I just love it!!! I envy you guys.

2003-08-01, 15:37   #11
ewmayer (Sep 2002, República de California)

Quote:
Originally Posted by Dresdenboy
For 64-bit CPUs in general, I'm currently thinking about some calculation blocks that use only the integer units to do FP ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so on in an efficient way. It takes at least 6 instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel, and the final normalization only needs to be done for one result. I'll test this and also look at how many opportunities there are where some work could be replaced by "emulated" FP calculations. It won't help in blocks that are used for only 0.5% of the calculations.
If you're going to go to the trouble of heavy integer coding, you'd probably be better off using those integer ops to do an all-integer transform in parallel with the floating-point transform. If your integer transform is done modulo a b-bit prime, you can increase the size of your transform inputs by around b/2 bits over what is possible for an all-floating-point transform.
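A minimal sketch of such an all-integer transform (a naive O(n^2) number-theoretic transform over a toy 7-bit prime, just to show the structure - a real implementation would use a fast algorithm and a much larger prime):

```python
def ntt(a, root, p):
    # Naive number-theoretic transform mod p: a DFT in which a primitive
    # n-th root of unity mod p plays the role of exp(-2*pi*i/n).
    n = len(a)
    return [sum(a[j] * pow(root, i * j, p) for j in range(n)) % p
            for i in range(n)]

def intt(A, root, p):
    # Inverse transform: use the inverse root and scale by 1/n mod p.
    n = len(A)
    inv_n = pow(n, -1, p)
    return [x * inv_n % p for x in ntt(A, pow(root, -1, p), p)]

p = 97                  # toy prime; p - 1 = 96 is divisible by n = 8
root = next(r for r in range(2, p)
            if pow(r, 8, p) == 1 and pow(r, 4, p) != 1)
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(intt(ntt(data, root, p), root, p))  # round-trips to the input
```

Because everything is exact modular arithmetic, there is no round-off error to budget for, which is where the extra input bits come from.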