mersenneforum.org > Great Internet Mersenne Prime Search > Math
2003-08-02, 20:43   #12
Dresdenboy (Apr 2003, Berlin, Germany)
Quote:
Originally Posted by ewmayer
If you're going to go to the trouble of heavy integer coding, you'd probably be better off using those integer ops to do an all-integer transform in parallel with the floating-point transform. If your integer transform is done modulo a b-bit prime, you can increase the size of your transform inputs by around b/2 bits over what is possible for an all-floating transform.
I read what you've written in the forum about this possibility (it was also mentioned in the F24 paper). This would actually be the best solution, although it would be very difficult to code. That's why I thought I should try some smaller code improvements that already help without needing too much development time.

I see the most difficulty in making the code run in parallel, e.g. on an SMT-capable CPU. Loops, function calls and the like would have to be combined carefully. SMT-capable CPUs could work well here, but they could also make things worse, because the threads can't track each other's progress to avoid cache thrashing and other side effects with such huge datasets.

Did you already write, or try to write, an implementation of such a "hybrid" transform?

Matthias
2003-08-03, 01:30   #13
ewmayer (Sep 2002, República de California)

Quote:
Originally Posted by Dresdenboy
Did you already write, or try to write, an implementation of such a "hybrid" transform?
Yes - about 5 years ago I coded a mod-M61^2 (i.e. complex modular arithmetic modulo M61) Fast Galois Transform in parallel with a floating FFT and did some benchmarks on an Alpha (I think it was a 21164). This was all in Fortran, though, and the compiler was doing a miserable job interleaving floating and integer ops, so it was really slow. But it did confirm that I was able to use roughly 30-31 more bits per input than for an all-floating FFT. At that point I didn't have time to pursue it further in any serious way, and there were still technical issues (especially related to data access patterns for the non-power-of-2 case and efficient carry propagation) that needed to be solved. I'm pretty sure I now know how to do the mixed float/int carry step (which involves error correction of the floating outputs using the modular ones) efficiently, and am working on the data access stuff. More on this as things develop. Also, now that my code is all in C, it'll be much easier to write efficient macros and such.
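The core arithmetic of such a mod-M61^2 transform can be sketched in C. This is a minimal illustration of my own (the names `mod_m61` and `gmul_m61` and the reduction strategy are assumptions, not ewmayer's actual code): since 2^61 ≡ 1 (mod M61), a product modulo the Mersenne prime M61 = 2^61 - 1 reduces with only shifts, ANDs and adds.

```c
#include <stdint.h>

/* Hypothetical sketch, not ewmayer's code: Gaussian-integer ("complex")
   multiplication with both components reduced mod M61 = 2^61 - 1. */

#define M61 ((uint64_t)0x1FFFFFFFFFFFFFFFULL)   /* 2^61 - 1 */

/* Reduce a value of up to 123 bits mod M61: since 2^61 == 1 (mod M61),
   fold the bits above position 61 back onto the low 61 bits. */
static uint64_t mod_m61(unsigned __int128 x) {
    uint64_t s = (uint64_t)(x & M61) + (uint64_t)(x >> 61); /* < 2^63  */
    s = (s & M61) + (s >> 61);                              /* < 2*M61 */
    return (s >= M61) ? s - M61 : s;
}

typedef struct { uint64_t re, im; } gint61;  /* element of GF(M61^2) */

/* (a.re + i*a.im) * (b.re + i*b.im), both parts reduced mod M61 */
static gint61 gmul_m61(gint61 a, gint61 b) {
    uint64_t rr = mod_m61((unsigned __int128)a.re * b.re);
    uint64_t ii = mod_m61((unsigned __int128)a.im * b.im);
    unsigned __int128 cross = (unsigned __int128)a.re * b.im
                            + (unsigned __int128)a.im * b.re;
    gint61 c;
    c.re = (rr >= ii) ? rr - ii : rr + M61 - ii;  /* rr - ii (mod M61) */
    c.im = mod_m61(cross);
    return c;
}
```

A full FGT would build its butterflies from `gmul_m61` plus modular add/sub; the point is that no division ever occurs in the reduction.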
2003-08-03, 15:13   #14
Dresdenboy

Quote:
Originally Posted by ewmayer
I'm pretty sure I now know how to do the mixed float/int carry step (which involves error correction of the floating outputs using the modular ones) efficiently, and am working on the data access stuff. More on this as things develop. Also, now that my code is all in C, it'll be much easier to write efficient macros and such.
Do you know how many instructions the 21264's reorder buffer can hold? If the compiler did a bad job interleaving the code, only a huge reorder buffer would help create the parallelism we want. But it's surely easier to change the compiler or use some coding tricks than to wait for a new CPU ;)

I created an open source project for optimizing the code of other open source projects for the AMD64 platform. Maybe by providing a specially optimized client we could win over some of the enthusiasts who will buy such a platform.

I don't know many facts about the distributed RSA-cracking efforts, but I'm sure many people joined just because they knew their machines were good at that task.

And maybe it's possible to create a fast integer convolution that could help in other projects. I think the AMD core math library is optimized to the max, but not for some special cases (if the FPU code is perfectly optimized, nobody would check whether the integer unit could help).

Matthias
2003-08-03, 23:42   #15
ewmayer

Quote:
Originally Posted by Dresdenboy
Do you know how many instructions the 21264's reorder buffer can hold? If the compiler did a bad job interleaving the code, only a huge reorder buffer would help create the parallelism we want. But it's surely easier to change the compiler or use some coding tricks than to wait for a new CPU ;)
I don't know that figure - try a Google search and see if anything turns up. There are two reasons my previous attempt is likely not at all indicative of what is achievable in practice:

1) I ran on a 21164, which is 8x slower at integer mul than the 21264;

2) I expect a C compiler to do a much better job with the integer instructions, since Fortran compiler technology (at least in the past 20-30 years) has been driven mainly by scientific computing, which is dominated by floating-point math. And C at least provides a reasonable way to include ASM macros, so as a last resort one can bypass the compiler that way if it's doing a poor job with crucial parts of the computation.
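For example (my own illustration, not from ewmayer's code; the macro name is an assumption): a 64x64 -> 128-bit multiply macro that trusts the compiler via the common `__int128` extension on most targets, but drops to GCC-style inline assembly on x86-64, where `mulq` delivers the full product in rdx:rax.

```c
#include <stdint.h>

/* Illustrative ASM-macro pattern (assumed name, GCC/Clang extensions):
   full 64x64 -> 128-bit unsigned multiply, with an inline-asm bypass for
   compilers that schedule the widening multiply poorly on x86-64. */
#if defined(__x86_64__) && defined(__GNUC__)
#define MUL_LOHI64(a, b, lo, hi) \
    __asm__("mulq %3" : "=a"(lo), "=d"(hi) \
                      : "a"((uint64_t)(a)), "rm"((uint64_t)(b)))
#else
#define MUL_LOHI64(a, b, lo, hi) do {                              \
    unsigned __int128 _t = (unsigned __int128)(a) * (uint64_t)(b); \
    (lo) = (uint64_t)_t;                                           \
    (hi) = (uint64_t)(_t >> 64);                                   \
} while (0)
#endif
```

Both branches compute the same thing; the asm branch simply pins the operands to the registers `mulq` requires, bypassing the compiler for that one crucial operation.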

Quote:
And maybe it's possible to create a fast integer convolution that could help in other projects. I think the AMD core math library is optimized to the max, but not for some special cases (if the FPU code is perfectly optimized, nobody would check whether the integer unit could help).
The person affiliated with GIMPS who probably has the most experience with actual hardware implementation of integer transforms is Jason Papadopoulos - here are a couple of related links:

http://www.mail-archive.com/mersenne%40base.com/msg07322.html

http://www.mail-archive.com/mersenne%40base.com/msg07320.html
2003-08-04, 14:02   #16
Dresdenboy

Quote:
Originally Posted by ewmayer
I don't know that figure - try a google search and see if anything turns up. There are two reasons my previous attempt is likely not at all indicative of what is achievable in practice:

1) I ran on a 21164, which is 8x slower at integer mul than the 21264;
This is surely important in this case.
Quote:
Originally Posted by ewmayer
2) I expect a C compiler to do a much better job with the integer instructions, since Fortran compiler technology (at least in the past 20-30 years) has been driven mainly by scientific computing, which is dominated by floating-point math. And C at least provides a reasonable way to include ASM macros, so as a last resort one can bypass the compiler that way if it's doing a poor job with crucial parts of the computation.
Compiler vendors that offer both C and Fortran compilers use the same low-level optimizers for both. But that won't help with language-related things like assembler macros and such.
Quote:
Originally Posted by ewmayer
The person affiliated with GIMPS who probably has the most experience with actual hardware implementation of integer transforms is Jason Papadopoulos - here are a couple of related links:
Thanks for the links. I think I already stumbled across these messages earlier during my web research. I had already gotten the impression that he is the right person while browsing digests, newsgroups and some papers ;)
2003-08-12, 13:43   #17
Dresdenboy
Now I have a starting point (a C implementation) for the FGT. I will try to add weighting to it according to Crandall's papers. Then some optimized mod operations will follow.

BTW, one drawback of the x86 architecture is that imul with a double-width result has fixed target registers (edx:eax, or rdx:rax in 64-bit mode). I hope out-of-order execution allows a dense imul block, with some extra instructions to move the results "out of the way".
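To illustrate the constraint (my own sketch, not code from the thread): in C one writes independent widening multiplies and lets the compiler insert the register moves; on x86-64 each iteration below becomes a MUL into rdx:rax followed by MOVs of the result out of the way, and out-of-order execution can then overlap those moves with the next multiply.

```c
#include <stdint.h>

/* Four independent widening multiplies.  On x86-64 the compiler emits,
   roughly:  mov a[i] -> rax ; mul b[i] ; mov rax/rdx -> scratch regs
   for each product, because MUL's destination pair rdx:rax is fixed. */
static void mul4_hi(const uint64_t a[4], const uint64_t b[4], uint64_t hi[4]) {
    for (int i = 0; i < 4; i++)
        hi[i] = (uint64_t)(((unsigned __int128)a[i] * b[i]) >> 64);
}
```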
2003-08-12, 19:09   #18
ewmayer
Quote:
Originally Posted by Dresdenboy
Now I have a starting point (a C implementation) for the FGT. I will try to add weighting to it according to Crandall's papers. Then some optimized mod operations will follow.
For a Mersenne-mod DWT modulo a Mersenne prime q = 2^p - 1 (as in the FGT), the DWT weights are powers of 2, i.e. the DWT weighting/unweighting of each digit can be effected via a leftward shift (with a shift count between 0 and p-1 places), followed by a mod-q operation (which needs just integer ANDs, shifts and adds).
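A minimal sketch of that weighting step (my own code with an assumed function name; p is kept small and generic for illustration): multiplying a reduced digit x by the weight 2^w modulo q = 2^p - 1 is a left shift with the overflowing bits folded back down, since 2^p ≡ 1 (mod q).

```c
#include <stdint.h>

/* x * 2^w mod (2^p - 1), for 0 < p < 64 and x already reduced mod q.
   No multiply or divide: shift left, then fold the bits >= p back down. */
static uint64_t mul_pow2_mod_mersenne(uint64_t x, unsigned w, unsigned p) {
    const uint64_t q = ((uint64_t)1 << p) - 1;
    w %= p;                                           /* 2^p == 1 (mod q) */
    unsigned __int128 t = (unsigned __int128)x << w;  /* up to 2p-1 bits  */
    uint64_t r = (uint64_t)(t & q) + (uint64_t)(t >> p);
    return (r >= q) ? r - q : r;
}
```

The unweighting step is the same operation with shift count p - w, i.e. multiplication by the inverse power of 2.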

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.