mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Math (https://www.mersenneforum.org/forumdisplay.php?f=8)
-   -   Is the Fast Hartley Transform usable in DWT? (https://www.mersenneforum.org/showthread.php?t=896)

Dresdenboy 2003-08-02 20:43

[quote="ewmayer"]If you're going to go to the trouble of heavy integer coding, you'd probably be better off by using those integer ops to do an all-integer transform in parallel with the floating-point transform. If your integer transform is done modulo a b-bit prime, you can increase the size of your transform inputs by around b/2 bits over what is possible for an all-floating transform.[/quote]

I read what you wrote in the forum about this possibility (it was also mentioned in the F24 paper). This would actually be the best solution, although it would be very difficult to code. That's why I thought I should first try some smaller code improvements that already help without requiring too much development time.

I see the main difficulty in making the code run in parallel, e.g. on an SMT-capable CPU. Loops, function calls and the like would have to be combined carefully. SMT-capable CPUs could work well here, but they could also make things worse, because the threads can't track each other's progress to avoid cache thrashing and other side effects of such huge data sets.

Did you already write, or try to write, an implementation of such a "hybrid" transform?

Matthias

ewmayer 2003-08-03 01:30

[quote="Dresdenboy"]Did you already write or try to write an implementation of such an "hybrid" transform?[/quote]

Yes - about 5 years ago I coded a mod-M61^2 (i.e. complex modular arithmetic modulo M61) Fast Galois Transform in parallel with a floating FFT and did some benchmarks on an Alpha (I think it was a 21164). This was all in Fortran, though, and the compiler was doing a miserable job interleaving floating and integer ops, so it was really slow. But it did confirm that I was able to use roughly 30-31 bits more per input than for an all-floating FFT. At that point I didn't have time to pursue it further in any serious way, and there were still technical issues (esp. related to data access patterns for the non-power-of-2 case and efficient carry propagation) that needed to be solved. I'm pretty sure I now know how to do the mixed float/int carry step (which involves error correction of the floating outputs using the modular ones) efficiently, and am working on the data access stuff. More on this as things develop. Also, now that my code is all in C it'll be much easier to write efficient macros and such.
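For readers unfamiliar with the arithmetic involved, here is a minimal sketch (my illustration, not ewmayer's actual code) of the scalar residue arithmetic modulo M61 = 2^61 - 1 underlying such a mod-M61^2 transform; the special form of a Mersenne prime means reduction needs no division, just shifts, masks and adds:

```c
#include <stdint.h>

#define M61 ((1ULL << 61) - 1)  /* the Mersenne prime 2^61 - 1 */

/* Reduce a 64-bit value modulo M61: since 2^61 == 1 (mod M61),
   the bits above position 60 simply fold back into the low bits. */
static uint64_t reduce61(uint64_t x)
{
    x = (x & M61) + (x >> 61);  /* fold top bits back in */
    if (x >= M61) x -= M61;     /* at most one final subtraction */
    return x;
}

/* Multiply two residues mod M61 via a 128-bit intermediate product
   (unsigned __int128 is a GCC/Clang extension, not standard C). */
static uint64_t mulmod61(uint64_t a, uint64_t b)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    uint64_t lo = (uint64_t)p & M61;  /* low 61 bits */
    uint64_t hi = (uint64_t)(p >> 61);
    return reduce61(lo + hi);
}
```

Complex (mod-M61^2) arithmetic is then built from pairs of such residues, exactly as complex floating arithmetic is built from pairs of doubles.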

Dresdenboy 2003-08-03 15:13

[quote="ewmayer"]I'm pretty sure I now know how to do the mixed float/int carry step (which involves error correction of the floating outputs using the modular ones) efficiently, and am working on the data access stuff. More on this as things develop. Also, now that my code is all in C it'll be much easier to write efficient macros and such.[/quote]

Do you know how many instructions the 21264's reorder buffer can hold? If the compiler did a bad job interleaving the code, only a huge reorder buffer would help create the parallelism we want. But it's surely easier to change the compiler or use some coding tricks than to wait for a new CPU ;)

I created an open source project for optimizing the code of other open source projects for the AMD64 platform. Maybe by providing a specially optimized client we could win over some of the enthusiasts who will buy such a platform.

I don't know many details about the distributed RSA cracking efforts, but I'm sure many people joined just because they knew their machine was good at that task.

And maybe it's possible to create a fast integer convolution that could help other projects as well. I think the AMD Core Math Library is optimized to the max, but not for special cases like this (if the FPU code is perfectly optimized, nobody looks into whether the integer unit could help).

Matthias

ewmayer 2003-08-03 23:42

[quote="Dresdenboy"]Do you know how many instructions the 21264's reorder buffer can hold? If the compiler did a bad job interleaving the code, only a huge reorder buffer would help create the parallelism we want. But it's surely easier to change the compiler or use some coding tricks than to wait for a new CPU ;)[/quote]

I don't know that figure - try a google search and see if anything turns up. There are two reasons my previous attempt is likely not at all indicative of what is achievable in practice:

1) I ran on a 21164, which is 8x slower at integer mul than the 21264;

2) I expect a C compiler to do a much better job with the integer instructions, since Fortran compiler technology (at least in the past 20-30 years) has been driven mainly by scientific computing, which is dominated by floating-point math. And C at least provides a reasonable way to include ASM macros, so as a last resort one can bypass the compiler that way if it's doing a poor job with crucial parts of the computation.

[quote]And maybe it's possible to create a fast integer convolution which could help in other projects. I think the AMD core math library is optimized to the max but not for some special cases (if FPU code is perfectly optimized nobody would look if the integer unit could help).[/quote]

The person affiliated with GIMPS who probably has the most experience with actual hardware implementation of integer transforms is Jason Papadopoulos - here are a couple of related links:

http://www.mail-archive.com/mersenne%40base.com/msg07322.html

http://www.mail-archive.com/mersenne%40base.com/msg07320.html

Dresdenboy 2003-08-04 14:02

[quote="ewmayer"]I don't know that figure - try a google search and see if anything turns up. There are two reasons my previous attempt is likely not at all indicative of what is achievable in practice:

1) I ran on a 21164, which is 8x slower at integer mul than the 21264;[/quote]

This is surely important in this case.

[quote="ewmayer"]2) I expect a C compiler to do a much better job with the integer instructions, since Fortran compiler technology (at least in the past 20-30 years) has been driven mainly by scientific computing, which is dominated by floating-point math. And C at least provides a reasonable way to include ASM macros, so as a last resort one can bypass the compiler that way if it's doing a poor job with crucial parts of the computation.[/quote]

Vendors that offer both C and Fortran compilers use the same low-level optimizer for both. But that won't help with language-specific features such as assembler macros.

[quote="ewmayer"]The person affiliated with GIMPS who probably has the most experience with actual hardware implementation of integer transforms is Jason Papadopoulos - here are a couple of related links:[/quote]

Thanks for the links. I think I already stumbled across these messages earlier during my web research. I had already gotten the impression that he is the right person while browsing digests, newsgroups and various papers ;)

Dresdenboy 2003-08-12 13:43

Now I have a starting point (a C implementation) for the FGT. I will try to add weighting to it according to Crandall's papers. Then some optimized mod operations will follow.

BTW, one drawback of the x86 architecture is that imul with a double-width result has fixed target registers (edx:eax, or rdx:rax in 64-bit mode). I hope the OOO execution allows a dense block of imuls with a few instructions in between to move the results "out of the way".
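In C the double-width product can be written portably enough that the compiler emits that single MUL itself and handles the rdx:rax shuffling via register allocation, so hand-scheduling is only needed if the generated code turns out poor (a sketch; `unsigned __int128` is a GCC/Clang extension, not standard C):

```c
#include <stdint.h>

/* 64x64 -> 128-bit widening multiply. On x86-64, GCC and Clang
   compile the __int128 product to a single MUL instruction whose
   double-width result lands in the fixed rdx:rax pair; the register
   allocator then moves hi/lo "out of the way" automatically. */
static void mul64wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
```

Inspecting the assembly output of a loop over such calls shows whether the out-of-order window is being fed a dense enough stream of multiplies.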

ewmayer 2003-08-12 19:09

[quote="Dresdenboy"]Now I have some starting point (C implementation) for FGT. I will try to add weighting to it according to Crandalls papers. Then some optimized mod operations will follow.[/quote]

For a Mersenne-mod DWT modulo a Mersenne prime q = 2^p-1 (as in the FGT), the DWT weights are powers of 2, i.e. the DWT weighting/unweighting of each digit can be effected via a leftward shift (with a shift count between 0 and p-1 places), followed by a mod-q reduction (which needs just integer ANDs, shifts and adds).
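A sketch of that step (my illustration, not actual Mlucas code): since 2^p ≡ 1 (mod q), multiplying a residue by the weight 2^s is just a p-bit left rotation, needing only two shifts, an AND and an OR:

```c
#include <stdint.h>

/* Multiply x (a residue mod q = 2^p - 1, with 0 <= x < q and p < 64)
   by the DWT weight 2^s. Because 2^p == 1 (mod q), shifting left by s
   and reducing mod q amounts to rotating the p low bits left by s:
   bits pushed past position p-1 wrap around to the bottom. */
static uint64_t weight_pow2(uint64_t x, unsigned s, unsigned p)
{
    uint64_t q = (1ULL << p) - 1;
    s %= p;                                  /* 2^p wraps to 2^0 */
    return ((x << s) & q) | (x >> (p - s));  /* p-bit rotate left */
}
```

Unweighting is the same operation with shift count p - s, i.e. multiplication by the inverse weight 2^(p-s) ≡ 2^(-s) (mod q).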

