mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

mdettweiler 2009-12-15 17:04

[quote=TheJudger;198902]Hi,

my first attempt to reduce the number of sieve initialisations to one per group failed badly... :(

The good news: the 2nd attempt works fine and took only ~10 lines of code :)

sieving the first 2000 primes:
TF M66xxxxxx to 2^60: 17 seconds (too much overhead for sieve initialisation...)

sieving the first 4000 primes:
TF M66xxxxxx to 2^64: 220 seconds :)[/quote]
Just curious, how does this compare to a modern CPU doing the same task? Based on what I recall from the last time I did TF work, this looks pretty fast.

ATH 2009-12-16 00:37

[QUOTE=mdettweiler;198925]Just curious, how does this compare to a modern CPU doing the same task? Based on what I recall from the last time I did TF work, this looks pretty fast.[/QUOTE]

It is very fast. I tested on 1 core (3 other cores idle) of a Core2Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369sec
M669xxxxx: 1 to 2^64: 1406sec

So 1 core of a 1.5-year-old CPU is over 6 times slower.

mdettweiler 2009-12-16 01:55

[quote=ATH;198970]It is very fast. I tested on 1 core (3 other cores idle) of a Core2Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369sec
M669xxxxx: 1 to 2^64: 1406sec

So 1 core of a 1.5-year-old CPU is over 6 times slower.[/quote]
Wow! :shock: Now I really wish I had a real GPU in my computer... :ermm:

Uncwilly 2009-12-16 02:00

Can this do numbers in the 100M digit range (332,190,000+)? Can they be taken up to the 77 bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core, TF on two, and an LL on the fourth, while doing more TFs on the GPU.

TheJudger 2009-12-18 10:51

Hi ATH,

[QUOTE=ATH;198970]It is very fast. I tested on 1 core (3 other cores idle) of a Core2Quad (Yorkfield) Q9450 at 2.66 GHz with WinXP x64 and Prime95 25.11.2 64-bit:

M660xxxxx: 1 to 2^64: 1369sec
M669xxxxx: 1 to 2^64: 1406sec

So 1 core of a 1.5-year-old CPU is over 6 times slower.[/QUOTE]

I'm wondering why M669xxxxx is slower than M660xxxxx; M669xxxxx should have FEWER factor candidates.

Can you run a benchmark from 2^64 to 2^65 as well?
There might be a slowdown in P95 for factors above 2^64.
My CUDA code has only one code path: factors up to 2^70 (maybe 2^71, I have to check; I'd say there is a 90% chance that 2^71 will work without modifications), so there should be no slowdown when moving to factors > 2^64. :)

TheJudger 2009-12-18 10:58

Hi Uncwilly,

[QUOTE=Uncwilly;198973]Can this do numbers in the 100M digit range (332,190,000+)? Can they be taken up to the 77 bit level with it?

If so, this might influence my decision to buy a new desktop. I would get a real screamer and do P-1 on one core and TF on 2 and a LL on the 4th, while doing other TF's on the GPU.[/QUOTE]

Right now I have a soft limit for the exponent of 2^31 - 1. I just haven't thought about exponents above that limit yet.

77 bits? No. Right now I'm using 3 32-bit integers (24 bits of data + 8 bits of carry each), which gives me 72 bits overall. I'm using 24 bits per int because current GPUs can do a 24x24 bit multiply 4 times faster than 32x32...

I could do 4x 24 bits, too. But at the moment I just want to have a working version which runs effectively from 2^64 to 2^70 (or 2^71).

The CUDA docs note that upcoming GPUs can do a 32x32 bit multiply much faster than current ones...

TheJudger 2009-12-19 22:30

No code improvement, just some more benchmarks.

M66362159
sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds

There is one factor in the range from 2^65 to 2^66, so Prime95 might stop early there?
63205291599831307391 is a factor of M66362159

TF to 2^64 does one less precalculation step, so one more long division is needed for each factor candidate. That is why it is slower than TF from 2^64 to 2^65.

ATH 2009-12-20 22:22

[QUOTE=TheJudger;199326]M66362159
sieving the first 4000 primes:
TF to 2^64: 220 seconds
TF from 2^64 to 2^65: 212 seconds
TF from 2^65 to 2^66: 423 seconds
[/QUOTE]

1 core (3 other cores idle) of a Core2Quad (Yorkfield) Q9450 at 2.66 GHz with Windows XP x64 and Prime95 25.11.2 64-bit:

M66362189:
TF to 2^64: 1373 sec
TF 2^64-2^65: 1741 sec
TF 2^65-2^66: 3889 sec
TF 2^66-2^67: 7733 sec

TheJudger 2009-12-22 10:20

Hi!

Thank you for the benchmarks, ATH :)

A little speed increase from several minimal improvements. :)
M66362159
sieving the first 4000 primes:
TF to 2^64: 205 seconds (needs further testing, quick tests showed no problems)
TF from 2^64 to 2^65: 205 seconds
TF from 2^65 to 2^66: 406 seconds

Some code cleanups as well.
And finally a user interface (command line).

Next steps are:
- some more cleanups (remove unused code, ...)
- some more description of the compile-time options
- testing the code on different platforms on Mersenne numbers with known factors (checking that my code finds those factors as well). Here you all can help me. Preferred are people who know how to compile/run CUDA code.

TheJudger 2009-12-24 00:35

Hi,

Good news and bad news! (don't be afraid, it is not really bad)

Good news #1:
A quick test on my 8800 GTX (G80) ran fine; it produced the same results as my GTX 275 (GT200b). The G80 is the oldest CUDA-capable chip, with compute capability 1.0. If the code works on the G80 it should work on all CUDA-enabled products. :)

Good news #2:
I've increased the raw speed of the GPU code.
My GTX 275 can now test ~53.1M factor candidates per second (M66362159, tested with 65 bit factors). Performance before the latests optimisation was ~36.6M factor candidates per second. The "trick" is to hack the ptx code (lets say it is like assembler code on CPU) and replace one instruction. The nvidia compiler has no intrinsic for [u]mul24hi while it exists in the ptx code. (24x24 multiply is faster as mentioned before)

Bad news #1:
The "ptx hack" is ugly!!!
I have to check some compilers...
There is a patch to enable some more intrinsics but I was not able to build the compiler. :(

Bad news #2:
My siever is too slow. Before the latest optimisation a single core of a Core 2 running at 3 GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, since the GPU code is faster.
I have to think about the possibilities:
(1) speed up the siever by writing better code (I'm not sure if I can do this).
If "Fermi" is only twice as fast as the GT200 chip (due to the fact that it has roughly double the number of shaders) and has no other improvements, I need to speed up the siever by another factor of 2.
(2) write a multithreaded siever. I think I can do this, but I'm not really happy with this solution.
(3) put the siever on the GPU. I'm not sure whether this would work...
(4) newer GPUs are capable of running several "kernels" at the same time. With some modifications to the code it should be possible to have several instances of the application running at the same time. If the GPU is too fast for one CPU core, just start another test on a different exponent on a 2nd core, ...

Personally, I prefer (4).
Any comments?

-----
New benchmark numbers (still on M66362159)
8800 GTX, 3.00 GHz Core 2 Duo, sieving up to 37831 (the 4000th odd prime):
TF to 2^64: 283s

GTX 275, Core 2 Duo overclocked to 4 GHz, sieving up to 3581 (the 500th odd prime).
In this configuration the siever already causes some stalls on the GPU.
TF to 2^64: 190s
TF from 2^64 to 2^65: 190s
TF from 2^65 to 2^66: 379s

The raw "factor testing speed" of the 8800 GTX is exactly half that of the GTX 275, so imagine the times if the siever were fast enough...

axn 2009-12-24 01:41

[QUOTE=TheJudger;199739]Bad news #2:
My siever is too slow. Before the latest optimisation a single core of a Core 2 running at 3 GHz was sufficient to feed the GPU (GTX 275) with new factor candidates to test. Now it is too slow, since the GPU code is faster.[/QUOTE]

I am assuming you have a preliminary sieve to get the TF candidates. Why not just lower the sieve limit for that one?

In fact, the ideal scenario would involve the program running a benchmark at runtime to pick the optimal sieve limit.

