Is the Fast Hartley Transform usable in DWT?
My idea is to get a more even distribution of additions and multiplications. That's surely not possible for every bit of code, but it should be for most parts. The current code has far fewer muls than adds (about 1:4). So if we traded adds for muls by using a different transform, we could make better use of the available functional units in the FPU. That would help P4s (SSE2) and Athlons - although the former have a somewhat higher mul latency, AFAIR. But the possible throughput is 1 add and 1 mul per cycle.
Matthias |
Is it known that the FHT uses fewer adds than the FFT? I haven't seen any references to that effect. Remember, one can also code the FFT to use a better balance of adds and muls (and for certain odd radices this leads to a modest lowering of the number of adds, which are the rate-limiting operation as far as the FPU is concerned), but for the power-of-2 radices, which form the bulk of the work, it appears that the algorithms with the fewest muls also have the minimum number of adds (except w.r.t. the complex "twiddle" muls, where using e.g. a Karatsuba-style mul would definitely increase the add count). I'm not sure why this is so, but perhaps it has to do with the symmetry properties of the FFT.
A recent e-mail exchange between George and myself: [quote="Prime95"]Have you ever considered an FHT? If I understand things correctly, this is an all real data transform. Since on the Intel chip you want to do as much work as possible while the data is in registers, I think the FHT can do 3 levels of the FFT while a complex data FFT can only do two levels of the FFT (using Intel's 8 registers to hold the data). One major concern is that the cos+sin multipliers won't provide as many opportunities to use tricks to eliminate some muls and adds. I haven't thought about what the caching consequences would be - if any.[/quote] [quote="ewmayer"]I've not looked at the FHT in any serious way, since I've been spoiled by the register-rich nature of the architectures I've targeted with my code. There might be a payoff for the x86-style chips, but since these all seem to be moving toward supporting SSE2 and having more registers (if only for the 64-bit SSE2 doubles), I wonder whether it would be worth spending all that time investigating and coding a brand new type of transform. Your call, obviously, but I don't expect much (if any) payoff from using FHT in my code, on its target platforms.[/quote] |
I found some interesting papers comparing the FHT to other FFT algorithms, and also one about better use of FMA in FFTs (which I think would be good for Itanium). I think they help a bit in comparing the FHT and others to current DWT implementations. The mers.tar.gz package on mersenne.org's freeware page includes an implementation of a DWT using an FHT instead of an FFT.
"FFT algorithms for multiply-add architectures" [url]http://www.km.sjf.stuba.sk/Aplimat/Proceedings/Communications/Franchetti-Kaltenberger-Ueberhuber.pdf[/url] The next paper has a table containing the number of FP muls, adds, integer muls, adds, shifts and memory usage. "Comparative analysis of FFT algorithms in sequential and parallel form" [url]http://www.isip.msstate.edu/conferences/dsp96/paper_pdsp.pdf[/url] "Comparative analysis of FFT algorithms" [url]http://www.dcs.fmph.uniba.sk/~mnagy/rozpoz/paper_v7.pdf[/url] |
Thanks for the links, Dresden - I'll get back to you after I've had some time to read and digest them.
|
I'm not an expert, but after looking at an FHT implementation online, it looks like my hopes for doing three levels of the FFT in the 8 Intel FPU registers won't work out. It seems the "butterfly" operation is sufficiently different from the current radix-4 FFT now in use that you are still limited to two levels of the FFT.
Unfortunately, the pain of implementing an FHT in assembly language is so high that I need to convince myself there will be a significant payoff before proceeding. The biggest gain I know of for Athlon systems will come from changing the amount of work done in pass 1 and pass 2. Currently a 1M FFT does 12 levels in pass 1 and 8 levels in pass 2. This design decision came about so that pass 2 could be done within the Pentium's tiny 8K L1 cache. |
I'm currently working on Perl scripts which modify the prime95 source to test some peephole optimizations, like replacing [i]fadd st,st[/i] with [i]fmul st, XMM_TWO[/i] (the constant should better be placed locally, to create smaller opcodes). I wrote more details in the thread about Prime95 v23.6 ([url]http://mersenneforum.org/viewtopic.php?t=862&start=24[/url]).
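The kind of trade that peephole performs can be illustrated in C (a toy sketch; `XMM_TWO` from the post above is just a stored constant 2.0):

```c
/* Two mathematically equivalent ways to double a value. The first
 * occupies the FPU's adder, the second its multiplier. On a CPU that
 * can issue one add and one mul per cycle, moving work from the
 * saturated adder to the idle multiplier (the "fadd st,st" ->
 * "fmul st, XMM_TWO" trade) can raise total throughput. */
static double double_by_add(double x) { return x + x; }
static double double_by_mul(double x) { return 2.0 * x; }
```

For IEEE doubles the two forms give bit-identical results, so the substitution is safe wherever a doubling occurs.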
It will be a bit more complex to align the code to 8-byte decoder windows, because I have to estimate the opcode sizes without disassembling - but it should work for most ops. A much more complex task (besides changing the algorithm or the arrangement of data in memory and such) would be to simulate SMT by mixing code paths which work on different data sets but have a different distribution of FP ops - for example, mixing fadd blocks with fmul blocks. But that would only easily be possible for AMD64 or other CPUs where the algorithms use only half of the available registers. BTW, there is a 4x Opteron box available on the SourceForge compile farm for compiling and testing. There the upper 8 SSE2 regs could be used just for that - and the additional integer regs too. But it would need some difficult scheduling by hand - especially in the case of loops... DDB |
[quote="Prime95"]I'm not an expert, but after looking at an FHT implementation online, it looks like my hopes for doing three levels of the FFT in the 8 Intel FPU registers won't work out. It seems the "butterfly" operation is sufficiently different from the current radix-4 FFT now in use that you are still limited to two levels of the FFT.[/quote]
Did you read the article "Calculating the FHT in Hardware" (on [url]http://www.faginfamily.net/barry/[/url]) or something else? |
I looked at the FHT implementation at http://www.jjj.de/fxt/#fxt
I glanced at your article and from the diagram it looks like the three-input butterflies will not get in the way of doing 3 levels of the FFT in the 8 FPU registers. More study is required! |
Maybe it would help to store a value in cache for a few clocks - modern architectures have store-to-load forwarding. It needs additional instruction slots, but an FP op could then use the value as a memory operand.
On Opteron you'd have 16 regs, but that's of no use for the current user base. For 64-bit CPUs in general, I'm currently thinking about some calculation blocks that use only the integer units for doing FP ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so on in an efficient way. It takes at least six instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel, and the final normalization only needs to be done for one result. I'll run a test of this and will also look at how many opportunities can be found where some work can be substituted by "emulated" FP calculations. It won't help in blocks which are only used for 0.5% of the calculations. ;) BTW, would changing the pass 1/pass 2 distribution of levels be possible using some automation? |
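A minimal sketch of such an integer-only FP add, to make the cost concrete. This simplified version assumes both inputs are positive, finite, normalized doubles, and it truncates instead of rounding to nearest - the sign handling, rounding, and special-case checks a full version needs are exactly where the six-plus instructions per operand go:

```c
#include <stdint.h>
#include <string.h>

/* Add two doubles using only integer operations. Simplified sketch:
 * assumes positive, finite, normalized inputs; truncates instead of
 * rounding; no overflow handling. */
static double int_fadd(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, 8);
    memcpy(&ub, &b, 8);
    /* For positive doubles, the bit patterns order like the values. */
    if (ua < ub) { uint64_t t = ua; ua = ub; ub = t; }

    int64_t  ea = (int64_t)(ua >> 52);              /* biased exponents   */
    int64_t  eb = (int64_t)(ub >> 52);
    uint64_t ma = (ua & 0xFFFFFFFFFFFFFULL) | (1ULL << 52); /* implicit 1 */
    uint64_t mb = (ub & 0xFFFFFFFFFFFFFULL) | (1ULL << 52);

    int64_t shift = ea - eb;                        /* align mantissas    */
    if (shift > 52) {                               /* b vanishes entirely */
        double big; memcpy(&big, &ua, 8); return big;
    }
    uint64_t m = ma + (mb >> shift);
    if (m >> 53) {                                  /* carry: renormalize */
        m >>= 1;
        ea++;
    }
    uint64_t ur = ((uint64_t)ea << 52) | (m & 0xFFFFFFFFFFFFFULL);
    double r;
    memcpy(&r, &ur, 8);
    return r;
}
```

Even in this stripped-down form there are two shifts, two masks, an add, a compare and a renormalization step - which is why the approach only pays off when several such adds share the per-operand preparation work in parallel.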
[quote="Dresdenboy"]For 64-bit CPUs in general, I'm currently thinking about some calculation blocks that use only the integer units for doing FP ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so on in an efficient way. It takes at least six instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel, and the final normalization only needs to be done for one result. I'll run a test of this and will also look at how many opportunities can be found where some work can be substituted by "emulated" FP calculations.[/quote]
Integer ops to emulate floating-point ops which are emulating integer ops, with each emulation being a speed-up -- [i]I love it ! I just love it !!![/i] I envy you guys. |
[quote="Dresdenboy"]For 64-bit CPUs in general, I'm currently thinking about some calculation blocks that use only the integer units for doing FP ops. It's a hard task to do all the mantissa shifting, sign checking, renormalization and so on in an efficient way. It takes at least six instructions just to prepare one operand. But for adding 4 doubles this can be done in parallel, and the final normalization only needs to be done for one result. I'll run a test of this and will also look at how many opportunities can be found where some work can be substituted by "emulated" FP calculations. It won't help in blocks which are only used for 0.5% of the calculations.[/quote]
If you're going to go to the trouble of heavy integer coding, you'd probably be better off using those integer ops to do an all-integer transform in parallel with the floating-point transform. If your integer transform is done modulo a b-bit prime, you can increase the size of your transform inputs by around b/2 bits over what is possible with an all-floating transform. |
[quote="ewmayer"]If you're going to go to the trouble of heavy integer coding, you'd probably be better off using those integer ops to do an all-integer transform in parallel with the floating-point transform. If your integer transform is done modulo a b-bit prime, you can increase the size of your transform inputs by around b/2 bits over what is possible with an all-floating transform.[/quote]
I read what you wrote in the forum about this possibility (it was also mentioned in the F24 paper). Actually this would be the best solution, although it would be very difficult to code. That's why I thought I should try some smaller code improvements first, which already help without needing too much development time. I see the biggest difficulty in making the code run in parallel, as on an SMT-capable CPU: loops, function calls and such things would have to be combined carefully. SMT-capable CPUs could be good in this case, but they could also just make it worse, because the threads can't track each other's progress to avoid cache thrashing and other side effects of such huge datasets. Did you already write, or try to write, an implementation of such a "hybrid" transform? Matthias |
[quote="Dresdenboy"]Did you already write, or try to write, an implementation of such a "hybrid" transform?[/quote]
Yes - about 5 years ago I coded a mod-M61^2 (i.e. complex modular arithmetic modulo M61) Fast Galois Transform in parallel with a floating FFT and did some benchmarks on an Alpha (I think it was a 21164). This was all in Fortran, though, and the compiler was doing a miserable job interleaving floating and integer ops, so it was really slow. But it did confirm that I was able to use roughly 30-31 bits more per input than for an all-floating FFT. At that point I didn't have time to pursue it further in any serious way, and there were still technical issues (esp. related to data access patterns for the non-power-of-2 case and efficient carry propagation) that needed to be solved. I'm pretty sure I now know how to do the mixed float/int carry step (which involves error correction of the floating outputs using the modular ones) efficiently, and am working on the data access stuff. More on this as things develop. Also, now that my code is all in C it'll be much easier to write efficient macros and such. |
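For reference, part of what makes M61 = 2^61 - 1 attractive for such a transform is that the reduction after a multiply needs no division at all: since 2^61 ≡ 1 (mod M61), the 122-bit product folds back with a shift, a mask and an add. A sketch, assuming a compiler with the GCC/Clang `unsigned __int128` extension:

```c
#include <stdint.h>

#define M61 ((1ULL << 61) - 1)

/* Multiply modulo q = 2^61 - 1. The product splits as hi*2^61 + lo,
 * and since 2^61 == 1 (mod q) this is congruent to hi + lo. That sum
 * is below 2^62, so a single conditional subtract finishes the job. */
static uint64_t mul_mod_m61(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    uint64_t lo = (uint64_t)p & M61;
    uint64_t hi = (uint64_t)(p >> 61);
    uint64_t r  = lo + hi;
    if (r >= M61) r -= M61;
    return r;
}
```

The same fold works for any Mersenne modulus 2^p - 1, which is why Mersenne primes are the natural choice for the FGT's base field.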
[quote="ewmayer"]I'm pretty sure I now know how to do the mixed float/int carry step (which involves error correction of the floating outputs using the modular ones) efficiently, and am working on the data access stuff. More on this as things develop. Also, now that my code is all in C it'll be much easier to write efficient macros and such.[/quote]
Do you know how many instructions the 21264's reorder buffer can hold? If the compiler did a bad job interleaving the code, only a huge reorder buffer would help create the parallelism we want. But it's surely easier to change the compiler or use some coding tricks than to wait for a new CPU ;) I created an open source project for optimizing the code of other open source projects for the AMD64 platform. Maybe by providing a specially optimized client we could win over some of the enthusiasts who will buy such a platform. I don't know many facts about distributed RSA cracking, but I'm sure many people joined just because they knew their machine is good at that task. And maybe it's possible to create a fast integer convolution which could help other projects. I think the AMD Core Math Library is optimized to the max, but not for some special cases (if the FPU code is perfectly optimized, nobody would check whether the integer units could help). Matthias |
[quote="Dresdenboy"]Do you know how many instructions the 21264's reorder buffer can hold? If the compiler did a bad job interleaving the code, only a huge reorder buffer would help create the parallelism we want. But it's surely easier to change the compiler or use some coding tricks than to wait for a new CPU ;)[/quote]
I don't know that figure - try a Google search and see if anything turns up. There are two reasons my previous attempt is likely not at all indicative of what is achievable in practice: 1) I ran on a 21164, which is 8x slower at integer mul than the 21264; 2) I expect a C compiler to do a much better job with the integer instructions, since Fortran compiler technology (at least in the past 20-30 years) has been driven mainly by scientific computing, which is dominated by floating-point math. And C at least provides a reasonable way to include ASM macros, so as a last resort one can bypass the compiler that way if it's doing a poor job with crucial parts of the computation. [quote]And maybe it's possible to create a fast integer convolution which could help other projects. I think the AMD Core Math Library is optimized to the max, but not for some special cases (if the FPU code is perfectly optimized, nobody would check whether the integer units could help).[/quote] The person affiliated with GIMPS who probably has the most experience with actual hardware implementation of integer transforms is Jason Papadopoulos - here are a couple of related links: http://www.mail-archive.com/mersenne%40base.com/msg07322.html http://www.mail-archive.com/mersenne%40base.com/msg07320.html |
[quote="ewmayer"]I don't know that figure - try a google search and see if anything turns up. There are two reasons my previous attempt is likely not at all indicative of what is achievable in practice:
1) I ran on a 21164, which is 8x slower at integer mul than the 21264;[/quote]This is surely important in this case.[quote="ewmayer"]2) I expect a C compiler to do a much better job with the integer instructions, since Fortran compiler technology (at least in the past 20-30 years) has been driven mainly by scientific computing, which is dominated by floating-point math. And C at least provides a reasonable way to include ASM macros, so as a last resort one can bypass the compiler that way if it's doing a poor job with crucial parts of the computation.[/quote]Compiler vendors that offer combined C and Fortran compilers use the same low-level optimizers for both. But that won't help with language-related things like assembler macros and such.[quote="ewmayer"]The person affiliated with GIMPS who probably has the most experience with actual hardware implementation of integer transforms is Jason Papadopoulos - here are a couple of related links: [/quote]Thanks for the links. I think I already stumbled across these messages earlier during my web research. I had already gotten the impression that he is the right person while browsing digests, newsgroups and papers ;) |
Now I have a starting point (a C implementation) for the FGT. I will try to add weighting to it according to Crandall's papers. Then some optimized mod operations will follow.
BTW, one drawback of the x86 architecture is that imul with a double-width result has fixed target registers (edx:eax or rdx:rax). I hope OoO execution allows a dense imul block with some interspersed instructions to move the results "out of the way". |
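In C, that scheduling can be left to the compiler: a widening 64x64 -> 128-bit multiply written via the GCC/Clang `unsigned __int128` extension compiles on x86-64 to the one-operand wide multiply, and the compiler itself emits the moves that free rdx:rax before the next multiply can issue. A sketch:

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product. On x86-64 the widening multiply's
 * result is pinned to the rdx:rax register pair; writing it this way
 * lets the compiler schedule the moves that copy the two halves "out
 * of the way" so a following multiply can start. */
static void mul64wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
```

This is the building block an FGT's modular multiplies would sit on top of.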
[quote="Dresdenboy"]Now I have a starting point (a C implementation) for the FGT. I will try to add weighting to it according to Crandall's papers. Then some optimized mod operations will follow.[/quote]
For a Mersenne-mod DWT modulo a Mersenne prime q = 2^p-1 (as in the FGT), the DWT weights are powers of 2, i.e. the DWT weighting/unweighting of each digit can be effected via a leftward shift (with a shift count between 0 and p-1 places), followed by a mod-q operation (which needs just integer ands, shifts and adds). |
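Concretely, multiplying a digit x (with 0 <= x < q) by the weight 2^s modulo q = 2^p - 1 amounts to a left rotation within a p-bit field, since 2^p ≡ 1 (mod q): the top s bits that shift out of the field wrap around to the bottom. A sketch for p < 64:

```c
#include <stdint.h>

/* Compute (x * 2^s) mod q, q = 2^p - 1, for 0 <= x < q and 0 < p < 64.
 * Since 2^p == 1 (mod q), this is a p-bit left rotation: the shift-out
 * bits (recovered by x >> (p - s)) re-enter at the low end. No divide,
 * just shifts, a mask and an or. */
static uint64_t shift_mod_mp(uint64_t x, unsigned s, unsigned p)
{
    uint64_t q = (1ULL << p) - 1;
    s %= p;                       /* shifts themselves wrap mod p */
    return ((x << s) & q) | (x >> (p - s));
}
```

For example, with p = 5 (q = 31), weighting x = 3 by 2^2 gives 12 = (3*4) mod 31, exactly as the direct modular multiply would.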