mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Athlon 64, prime95, and the L2 cache (https://www.mersenneforum.org/showthread.php?t=2373)

Prime95 2004-04-21 04:16

Athlon 64, prime95, and the L2 cache
 
I've been messing with more Opteron timings and came across something interesting.

On a P4, Prime95 processes data in chunks about 1/2 the size of the L2 cache. As it processes this data, it prefetches from main memory into the L2 cache the next data chunk.

You would think the Opteron with its larger 1MB L2 cache would have an advantage here. Unfortunately, prefetch on the Opteron loads the data into the L1 cache. This causes data being worked on to get kicked out of the L1 cache and ends up negating most of the advantages of prefetching. There is no way to effectively use the large L2 cache to prefetch data!

Now the good news. On my 2.0 GHz P4, reading 8 MB of data and operating on it takes 26 million clocks. On the 1.4 GHz Opteron the same code takes 11 million clocks. The penalty for accessing main memory is much less.

On the P4, this main memory penalty has led to a strategy that minimizes the number of passes over main memory at all costs. Currently prime95 employs a two pass strategy operating out of the L2 cache. The Opteron should benefit from switching to a three pass strategy where the data occupies 32KB (1/2 of the L1 cache). Total gain should be in the 5-15% range, but is a lot of work....

Cyclamen Persicum 2004-04-21 15:51

Hey, was perfectly you-Prime95 who wrote PRP.EXE ?
It works two times faster than my prog with 256Kbit number.
900s vs. 1900s =(((
This is due to such kinda cach tricks and ruses...
On the other hand, my Borland-Delphi-Turbo-Pascal prog
works merely two times slower =))), that is good.

Dresdenboy 2004-04-21 19:12

[QUOTE=Prime95]The Opteron should benefit from switching to a three pass strategy where the data occupies 32KB (1/2 of the L1 cache). Total gain should be in the 5-15% range, but is a lot of work....[/QUOTE]
If a controlled prefetching into L2 is not possible, then the L2 cache could at least be used indirectly (RAM->L1->L2->L1). Is there a place for this in the current data access strategy of Prime95?

Prime95 2004-04-22 03:19

[QUOTE=Dresdenboy]If a controlled prefetching into L2 is not possible, then the L2 cache could at least be used indirectly (RAM->L1->L2->L1). Is there a place for this in the current data access strategy of Prime95?[/QUOTE]

Yes, I think, maybe. The ideal scenario should have a final pass using 16KB or 32KB FFT data chunks leaving room for sin/cos data and cache lines prefetching data without thrashing the cache. The middle pass should operate on 512KB of data at a time, and calls pass 3 on while the data is in the L2 cache. Pass 1 operates out of main memory.

These shenanigans may not be necessary though. Since memory access is faster than a P4, it might be that prefetching from RAM->L1 will hide the memory latencies.

Prime95 2004-04-22 03:21

[QUOTE=Cyclamen Persicum]Hey, was perfectly you-Prime95 who wrote PRP.EXE ?
It works two times faster than my prog with 256Kbit number.
900s vs. 1900s =(([/QUOTE]

That is an impressive accomplishment for high level language programming.

Cyclamen Persicum 2004-04-22 13:02

[i]That is an impressive accomplishment for high level language programming.[/i]

The secret is that I call fftw3.dll from fftw.org, and after performing an inverse FFT all I need is to take care about carries:

Carry:=0;
for i:=1 to 2*n do begin
zi:=z[i]*norm+Carry;
rz:=round(zi);
if abs(zi-rz)>0.4 then Round_off_Error;
c[i]:=rz;
Carry:=rz shr 16;
end;

Since fftw is "fastest in the world" and uses SSE2, and FFT itself devour the majority of time so carry prolongation and other procedures eat nothing comparing to FFT, it is still unclear why your PRP.EXE
is incredibly fast...

Tell me please, what kinda modular reduction do you use for
2^n+k and k*2^n+1? Is it classical base 2^n reduction or new WDT-like?
Do you use full ranged zero-padded FFT or half ranged?
And why k*2^n+1 are much faster than 2^n+k (I mean just a single power, not a proof)??? I cannot see any difference.

Prime95 2004-04-22 14:51

PRP uses a traditional full-length zero-padded FFT for the k*2^n+/-1 case. It uses a classical reduction - divide top half of result by k and add/sub it from the bottom half.

PRP is faster than FFTW for several reasons. 1) Careful attention is paid to caching issues. 2) All assembly code. 3) PRP being specialized code can perform a few tricks - like the FFT result can be in bit-reversed order - and
the inverse FFT can start before the forward FFT finishes.

The 2^n+/-c code is slow because a classical reduction was not written for this case. Thus, PRP uses a very slow general purpose reduction.

Cyclamen Persicum 2004-04-23 16:04

Thank you very much for your answer. You are the greatest programmer of all times and among all people indeed. I appreciate your attention.

Currently I have written two progs which deal with long arithmetic:
[url]http://persicum.front.ru/parsec_24.zip[/url]
[url]http://persicum.front.ru/purgen_06.zip[/url]

The first is fast enough RSA encoder, until one may write even faster by usage of x86, MMX, SSE2;
The second is a probable prime generator. It is interesting that it supports Compound Mersennes such as 2^1024 - 2^256 + 2^224 + 2^128 + 2^64 - 2^32 - 1. Although it uses FFT and optionally can attach external FFTW DLL, it is several times slower than PRP.EXE. I think due to ignoring cache issues. It pulls a long number through the cache several times until it to be squared:
1) Padding of doubles;
2) Fourier transform
3) Complex squaring
4) Inverse Fourier transform
5) Carries prolongation

And PRP.EXE contrives do all this job via just one pass along a long number...

[i]The 2^n+/-c code is slow because a classical reduction was not written for this case. Thus, PRP uses a very slow general purpose reduction.[/i]

Please, write fast classical reduction for 2^n +/- k as soon as possible!!!
You know it is very-very easy: just multiply upper half-chunk by k and add/subtract it to lower half-chunk. This kind of primes maybe the most important among all *probable* primes!

Tell me please what is the fastest general purpose reduction in the case of using FFT? I currently use Montgomery via TWO full-length FFT multiplications. For n^2 case Montgomery requires only ONE long multiplication, but I don't know what about n log n case.

Prime95 2004-04-23 17:50

[QUOTE=Cyclamen Persicum]
Please, write fast classical reduction for 2^n +/- k as soon as possible!!!

Tell me please what is the fastest general purpose reduction in the case of using FFT?[/QUOTE]

I'm looking at improving the FFT routines to handle k*2^n+/-c quickly. The k*2^n+/-1 case can probably be sped up by 40% (or more). The k*2^n+/-c case should triple in speed. I also want to incorporate Jean Penne's changes that lets LLR use IBDWTs.

IIRC, the general purpose reduction that PRP uses takes three multiplies for each input bit. Multiply #1 computes x^2. Multiply #2 multiplies by the precomputed 1/n value. The integer part of the result is discarded. The fractional part is then multiplied by n to get the remainder.

ThomRuley 2004-04-23 18:38

A little off subject, maybe, but I was curious what optimizations are being attempted for Athlon64 processors. I just added one to my farm and was thinking that these should really do well with Prime95. Are there any plans to take advantage of the the 64-bit part of the processor?

Cyclamen Persicum 2004-04-24 11:00

[i]I'm looking at improving the FFT routines...[/i]

How will I know about new version of PRP.EXE


All times are UTC. The time now is 10:21.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.