mersenneforum.org  

2003-03-10, 09:46   #1
Cyclamen Persicum

Mar 2003
121_8 Posts

Other Mersenne progs, why so slow?

I'm not a GIMPS member, because the
probability of being the winner is zero point zero,
zero, and zeros repeating...
I have tried other programs available on the Internet, but they are as slow as continental drift.
So I wrote my own, but it is still 10 to 100 times slower.
I don't know why. What is the secret of Prime95?
It seems the 39th Mersenne prime was discovered without any SSE
in only 1.5 months, but my program would need 41 years to check it!!!
I use two FFTs for squaring, plus a modular reduction,
which costs almost nothing for a number of the form 11111111...b.
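
To show what I mean by the cheap reduction, here is a toy C sketch (p = 31, with plain integer squaring standing in for the FFTs - an illustration only, not my actual program):

[code]
#include <stdint.h>
#include <stdio.h>

#define P 31                      /* toy exponent: 2^31 - 1 is prime */
#define M ((1ULL << P) - 1)       /* the Mersenne number 2^P - 1     */

/* Since 2^P == 1 (mod M), reducing mod M is just folding the high
 * bits back onto the low bits - no division needed.                 */
static uint64_t mod_mersenne(uint64_t x)
{
    x = (x & M) + (x >> P);       /* now x < 2^32                    */
    x = (x & M) + (x >> P);       /* now x <= M                      */
    return x == M ? 0 : x;
}

int main(void)
{
    /* Lucas-Lehmer: s_0 = 4, s_{k+1} = s_k^2 - 2 (mod M), P-2 steps */
    uint64_t s = 4;
    for (int i = 0; i < P - 2; i++)
        s = mod_mersenne(mod_mersenne(s * s) + M - 2);
    printf("LL residue for 2^%d-1: %llu (0 means prime)\n",
           P, (unsigned long long)s);
    return 0;
}
[/code]
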
2003-03-10, 18:00   #2
cheesehead

"Richard B. Woods"
Aug 2002
Wisconsin USA
1111000001100_2 Posts

The "secret" of Prime95 is twofold -- hardware and software.

Intel Pentium CPUs execute floating-point (FP) instructions more rapidly and with higher precision than most other manufacturers' CPUs. They can pipeline as fast as one FP instruction completion per clock cycle, whereas many others can execute FP no faster than two, three, or more cycles per instruction completed.

In addition, Intel FP units have 80-bit internal registers, longer than the 64-bit registers common on other brands. With proper programming to keep computations inside those internal registers as long as possible before the eventual conversion to a 64-bit external result, Intel CPUs can produce 16 more bits of numerical result per FP instruction than other brands.
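
A quick way to see those extra bits, assuming a C compiler where long double maps to the x87 80-bit format (true of most x86 compilers, though not guaranteed by the standard):

[code]
#include <float.h>
#include <stdio.h>

int main(void)
{
    /* On x87, long double carries a 64-bit significand versus the
     * 53 bits of an ordinary IEEE-754 double.                       */
    printf("double significand bits:      %d\n", DBL_MANT_DIG);
    printf("long double significand bits: %d\n", LDBL_MANT_DIG);

    /* 1 + 2^-60 survives in 80-bit extended precision but rounds
     * back to exactly 1.0 in 64-bit double precision.               */
    long double xl = 1.0L + 0x1p-60L;
    double      xd = 1.0  + 0x1p-60;
    printf("extended: 1 + 2^-60 > 1? %s\n", xl > 1.0L ? "yes" : "no");
    printf("double:   1 + 2^-60 > 1? %s\n", xd > 1.0  ? "yes" : "no");
    return 0;
}
[/code]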

George Woltman has intensively studied the details of Intel CPUs and worked for years to maximize the throughput of Prime95 software, programming the most speed-critical modules in assembler language in order to carefully interleave machine instructions so as to keep the various parts of the CPU as busy as possible churning out results with minimum wasted time.

(I'm quite envious of George's work because I've done similar software optimization for speed, in the past, on older computers, but by the time I discovered what George was up to with Prime95, he was waaay ahead of where I'd have started. :) )
2003-03-11, 16:42   #3
ewmayer
2ω=0

Sep 2002
República de California
2²·3·7·139 Posts

Quote:
Originally Posted by cheesehead
The "secret" of Prime95 is twofold -- hardware and software.

Intel Pentium CPUs execute floating-point (FP) instructions more rapidly and with higher precision than most other manufacturers' CPUs. They can pipeline as fast as one FP instruction completion per clock cycle, whereas many others can execute FP no faster than two, three, or more cycles per instruction completed.
Pretty much all RISC CPUs (Alpha, MIPS, SPARC, PowerPC) can also do one pipelined add and one mul per cycle. No advantage for Intel there, except that it's by far the most common CPU out there, which makes it worthwhile for someone with the time and skill to write large quantities of very highly optimized assembly code for it. Because of the x86's CISC nature and dumb stack-based FPU register model, it seems necessary to write ASM to get anywhere near the theoretical maximum throughput out of the chip (especially with FPU-intensive, data-nonlocal algorithms like the FFT), but once that's done, the throughput is as good as most any high-end RISC chip can do. The advantage of the RISCs is that they can get close to maximal performance starting from high-level code, i.e. you don't need to spend a lot of time writing ASM to get good performance out of them.

In fact George's code is no better on a per-cycle basis than other top-quality LL test codes (Guillermo Ballester Valor's Glucas and my Mlucas) - on an Alpha EV6 I get even better per-cycle performance than George's code does on the P4 - but P4s are way cheaper and more numerous, and kudos to George for getting such good performance out of a chip with smaller caches and lower bandwidth to main memory. Also, the speed demons at Intel have been able to ratchet up clock speeds much faster than the RISC manufacturers.

Quote:
In addition, Intel FP units have 80-bit internal registers, longer than the 64-bit registers common on other brands. With proper programming to keep computations inside those internal registers as long as possible before the eventual conversion to a 64-bit external result, Intel CPUs can produce 16 more bits of numerical result per FP instruction than other brands.
Yes, but because of the x86's small register set and the fact that the FFTs we do are so large, this gives only a small accuracy enhancement - perhaps 1-2% over an IEEE64-compliant CPU.

Quote:
George Woltman has intensively studied the details of Intel CPUs and worked for years to maximize the throughput of Prime95 software, programming the most speed-critical modules in assembler language in order to carefully interleave machine instructions so as to keep the various parts of the CPU as busy as possible churning out results with minimum wasted time.
No argument there - George has squeezed remarkable performance out of these cheap commodity CPUs.

Quote:
(I'm quite envious of George's work because I've done similar software optimization for speed, in the past, on older computers, but by the time I discovered what George was up to with Prime95, he was waaay ahead of where I'd have started. :) )
Yes, but you surely learned lots of new stuff along the way, so it has its rewards even if your code doesn't discover the next Mersenne prime. :)
2003-03-11, 17:59   #4
cperciva

Oct 2002
43 Posts

Quote:
Originally Posted by ewmayer
Quote:
Originally Posted by cheesehead
In addition, Intel FP units have 80-bit internal registers, longer than the 64-bit registers common on other brands. With proper programming to keep computations inside those internal registers as long as possible before the eventual conversion to a 64-bit external result, Intel CPUs can produce 16 more bits of numerical result per FP instruction than other brands.
Yes, but because of the x86's small register set and the fact that the FFTs we do are so large, this gives only a small accuracy enhancement - perhaps 1-2% over an IEEE64-compliant CPU.
Are you sure? It looks to me like it should give an advantage of at least 1.5 bits.
2003-03-11, 18:26   #5
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL
5×29×53 Posts

Quote:
Originally Posted by cperciva
Are you sure? It looks to me like it should give an advantage of at least 1.5 bits.
Ignoring SSE2, which is standard 64-bit FP, the x86 code does all loads and stores in the 64-bit format. The reason for this is that the 80-bit load and store operations are very slow. Thus, the FFT code does not benefit greatly from the 80-bit internal format.
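
A toy illustration of the consequence (not Prime95's actual code): once a value is stored to a 64-bit double in memory, the extra register bits are rounded away, so data that round-trips through memory between FFT passes sees ordinary double-precision error:

[code]
#include <stdio.h>

int main(void)
{
    /* A value representable in the x87 80-bit format but not in a
     * 64-bit double (assumes long double is the 80-bit type).       */
    long double r = 1.0L + 0x1p-60L;

    /* volatile forces a real 64-bit store and reload - the same
     * thing that happens to FFT data spilled between passes.        */
    volatile double spilled = (double)r;

    printf("kept in register:   1 + 2^-60 %s 1\n", r > 1.0L ? ">" : "==");
    printf("after 64-bit store: value     %s 1\n",
           (long double)spilled > 1.0L ? ">" : "==");
    return 0;
}
[/code]
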
2003-03-11, 23:35   #6
cperciva

Oct 2002
43 Posts

Quote:
Originally Posted by Prime95
Quote:
Originally Posted by cperciva
Are you sure? It looks to me like it should give an advantage of at least 1.5 bits.
Ignoring SSE2, which is standard 64-bit FP, the x86 code does all loads and stores in the 64-bit format. The reason for this is that the 80-bit load and store operations are very slow. Thus, the FFT code does not benefit greatly from the 80-bit internal format.
I don't know exactly how your FFT is implemented, but assuming that you calculate (a,b) <- (a+bw,a-bw) entirely in registers, the maximum error drops from B^2*N*(n*3.943... + O(1))*eps to B^2*N*(n*1.707... + O(1))*eps. I can't be so precise about the average error, but it should be similar.

The critical point is that complex multiplication and addition introduce up to sqrt(5)+1 ulp of error, while loading and storing introduce only 1 ulp. (Unavoidable inaccuracies in the trig data contribute the remaining sqrt(1/2) ulp.)
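
To get a feel for the gap, here are the two bounds evaluated with the O(1) terms dropped and made-up but plausible parameters (N = 2^20 points, ~20-bit words, IEEE double roundoff - assumptions for illustration only):

[code]
#include <math.h>
#include <stdio.h>

int main(void)
{
    double n   = 20.0;              /* n = log2(N)                   */
    double N   = ldexp(1.0, 20);    /* 2^20 FFT points               */
    double B   = ldexp(1.0, 20);    /* ~20 bits per word             */
    double eps = ldexp(1.0, -53);   /* double-precision roundoff     */

    /* B^2*N*(n*3.943)*eps vs B^2*N*(n*1.707)*eps, O(1) dropped      */
    double bound64 = B * B * N * (n * 3.943) * eps; /* 64-bit stores */
    double bound80 = B * B * N * (n * 1.707) * eps; /* 80-bit regs   */

    printf("log2(max error), 64-bit: %.2f\n", log2(bound64));
    printf("log2(max error), 80-bit: %.2f\n", log2(bound80));
    printf("gap: %.2f bits\n", log2(bound64 / bound80)); /* ~1.21    */
    return 0;
}
[/code]

The leading constants alone account for about 1.2 of those bits.
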
2003-03-11, 23:41   #7
cperciva

Oct 2002
43 Posts

(If anyone hasn't already read it, my paper on FFT error bounds has *finally* been published in Math. Comp.)
2003-03-12, 00:42   #8
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL
5·29·53 Posts

Quote:
Originally Posted by cperciva
I don't know exactly how your FFT is implemented, but assuming that you calculate (a,b) <- (a+bw,a-bw) entirely in registers, the maximum error drops from B^2*N*(n*3.943... + O(1))*eps to B^2*N*(n*1.707... + O(1))*eps. I can't be so precise about the average error, but it should be similar.
Obviously, I've not studied the error accumulation as much as you have. Prime95 uses a radix 4 implementation. Correct me if I'm wrong: When using 80-bit FP, you can pretty much assume all the error is introduced by the store instruction that occurs every 2 "levels" of the FFT. Thus, during a 1M FFT squaring each FFT data value is stored 21 times (10 in the forward FFT, 10 in the inverse FFT, and 1 in the carry propagation).
When using 64-bit FP, each radix-4 "unit" does a complex multiply (4 muls, 2 adds) and two complex adds. That is 2 muls and 3 adds for each data value. In the 1M FFT squaring that means there are about 105 opportunities for error to accumulate.

Are you saying this 5-to-1 ratio should result in an extra 1.5 bits of precision? Does that match the 1% to 2% larger exponents the x86 FFTs support?
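
Spelling out that counting, back-of-envelope only:

[code]
#include <stdio.h>

int main(void)
{
    int levels        = 20;             /* 1M-point FFT: log2(2^20)   */
    int passes        = levels / 2;     /* radix-4 does 2 levels/pass */
    int stores        = 2 * passes + 1; /* fwd + inv FFT + carry = 21 */
    int ops_per_value = 2 + 3;          /* 2 muls + 3 adds per value  */

    printf("80-bit registers: ~%d rounding stores per value\n", stores);
    printf("64-bit FP:        ~%d rounding ops per value\n",
           stores * ops_per_value);     /* ~105, the 5-to-1 ratio     */
    return 0;
}
[/code]
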
2003-03-12, 00:54   #9
cperciva

Oct 2002
101011_2 Posts

It's somewhat more complicated than that; the (not perfectly accurate) trig data contributes some error, and the error introduced in complex multiplication can be bounded somewhat more closely than a naive operation count would indicate.

But to answer your second question, in the 20M range you're packing about 20 bits per double, so reducing the error by one bit would mean a change of 2.5% in the size of numbers for which you can use a given FFT length.
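
The arithmetic behind that 2.5%, spelled out (this assumes the B^2 dependence from the bounds above):

[code]
#include <stdio.h>

int main(void)
{
    /* The error bound grows as B^2, so one less bit of error allows
     * half a bit more per double.  At ~20 bits per double:          */
    double bits_per_double = 20.0;
    double gain = 0.5 / bits_per_double;
    printf("relative change: %.1f%%\n", gain * 100.0);  /* 2.5%      */
    return 0;
}
[/code]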

That said, all my experience is with worst-case bounds; in practice the errors are much smaller. Indeed, I could construct values for which the Prime95 multiplication would fail -- but that doesn't matter, since double-checking would catch those anyway.
2003-03-12, 01:55   #10
ewmayer
2ω=0

Sep 2002
República de California
11676_10 Posts

Quote:
Originally Posted by cperciva
(If anyone hasn't already read it, my paper on FFT error bounds has *finally* been published in Math. Comp.)
Colin, could you kindly provide a link to your paper?

Now that you mention it, I see that the paper by Crandall, Papadopoulos, and me on F24 has also appeared, at long last - just over 3 years after we submitted the initial draft:

http://www.ams.org/journal-getitem?pii=S0025-5718-02-01479-5

The LaTeX manuscript can be downloaded for free from

http://hogranch.com/mayer/F24.rev3.tex

Math. Comp. is very slow and, at least in our experience, has poor timeliness-of-review controls - one reviewer did nothing with the manuscript for over a year, without MC doing anything to remind them of their responsibility as a reviewer. But grousing about tardiness aside, that paper also has some words about errors in floating-point FFTs, with an approach that is more heuristic in nature and geared more toward most-likely error behavior than Colin's. The error estimates thus derived seem to match real-world experience quite closely.
2003-03-12, 02:05   #11
cperciva

Oct 2002
43 Posts

http://www.ams.org/journal-getitem?pii=S0025-5718-02-01419-9

Or, for those of you without a subscription, a preprint is at http://www.sfu.ca/~cperciva/mpaper.ps

I refer to your F24 paper in a section discussing previous work on error bounds -- Crandall sent me a copy a couple of years ago.