mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Karl M Johnson 2010-06-10 12:44

Thanks for all the suggestions.
I finally managed to benchmark Prime95 and mfaktc.


Specs:
Core2Quad Q6600 @ 3.51 GHz
4 * 1 GB DDR2 @ 780 MHz
GTX 285 @ 711/1548/2500

Prime95 x64 25.11 build 2 (1 core used, all options default)
mfaktc x64 0.07p1 (NumStreams=3, other options default)

exp=73708469 bit_min=64 bit_max=65


Prime95: 20 minutes
mfaktc: 82668 ms

Like, wow :tu:
That's some speed.

TheJudger 2010-06-10 13:06

Hi Karl,

you might want to try bigger ranges with mfaktc.
E.g. TF M73708469 from 2^64 to 2^67 at once.
This [B]might[/B] increase the speedup (compared to P95) by another 5%.

Oliver

Karl M Johnson 2010-06-10 17:12

This new assignment took 574.85 seconds to finish in mfaktc.
ETA for Prime95 is 5h41m.

Now, the speedup:

exp=73708469 bit_min=64 bit_max=65
mfaktc is 14.63x faster than Prime95

exp=73708469 bit_min=64 bit_max=67
mfaktc is 35.59x faster than Prime95


You were right, Oliver: mfaktc was even faster at these limits.
Well, it could be Prime95's ETA algorithm.
But, still, mfaktc is faster :wink:
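Checking those ratios against the raw timings quoted above (a small sketch; note the Prime95 "20 minutes" figure is rounded, which accounts for the gap to the reported 14.63x):

```python
# Sanity-check the reported speedups from the timings in this thread.
# The Prime95 ETAs are the assumed inputs here; mfaktc times are as posted.

p95_eta_64_65 = 20 * 60                 # Prime95 ETA, 2^64..2^65 (rounded to 20 min)
mfaktc_64_65 = 82.668                   # mfaktc, same range (82668 ms)

p95_eta_64_67 = 5 * 3600 + 41 * 60      # Prime95 ETA, 2^64..2^67 -> 20460 s
mfaktc_64_67 = 574.85                   # mfaktc, same range

print(f"{p95_eta_64_65 / mfaktc_64_65:.2f}x")   # ~14.52x vs the reported 14.63x
print(f"{p95_eta_64_67 / mfaktc_64_67:.2f}x")   # 35.59x, matching the post
```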

chalsall 2010-06-10 19:27

[QUOTE=Karl M Johnson;218052]Well, it could be Prime95's ETA algorithm.[/QUOTE]

Do *not* trust Prime95's ETA algorithm for factoring.

kjaget 2010-06-11 11:31

1 Attachment(s)
Here's the windows port of 0.08. As always, please report any problems you find to this thread and I'll check them out ASAP.

[ATTACH]5312[/ATTACH]

TheJudger 2010-06-11 11:52

Hi Karl,

[QUOTE=Karl M Johnson;218052]This new exponent took 574.85 seconds to get finished on mfaktc.
ETA for Prime95 is 5h41m.

Now, the speedup:

exp=73708469 bit_min=64 bit_max=65
mfact is 14.63x faster than Prime95

exp=73708469 bit_min=64 bit_max=67
mfaktc is 35.59x faster than Prime95[/QUOTE]

I think the ETA of Prime95 is wrong.
From 2^64 to 2^67 there are seven times more candidates than from 2^64 to 2^65. Comparing mfaktc's runtimes: it needs 574.85s / 82.67s = 6.95x longer for the 7x bigger job. Of course this depends on the exponent and range.
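Oliver's estimate is easy to verify: the number of candidates in a bit range is proportional to the width of the range, and the measured runtimes scale almost exactly with it (a quick sketch using the timings from this thread):

```python
# The 2^64..2^67 range holds seven times as many factor candidates as
# 2^64..2^65, and the measured mfaktc runtimes scale almost 1:1 with that.

candidates_big = 2**67 - 2**64      # candidates in 2^64..2^67
candidates_small = 2**65 - 2**64    # candidates in 2^64..2^65
ratio = candidates_big // candidates_small
print(ratio)                        # 7

runtime_ratio = 574.85 / 82.668     # mfaktc runtimes quoted above
print(f"{runtime_ratio:.2f}")       # ~6.95
```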
I think the CPU code (siever) limits your throughput. The throughput should increase once you start 2 instances of mfaktc (in different directories) simultaneously. Of course this lowers the speedup compared to Prime95 but it will yield a higher throughput!

Oliver

amphoria 2010-06-22 19:27

mfaktc v0.08 for Win-64
 
I had to recompile the code so that THREADS_PER_GRID was correct for a GTX 465. After getting it to compile I ran the SELF_TEST and confirmed that all the expected factors were found.

However, when I run it on a real candidate all I get is the following:

[QUOTE]
mfaktc v0.08Winx64

Compiletime Options
THREADS_PER_GRID 1081344
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 25000
SievePrimesAdjust 1
NumStreams 4
WorkFile worktodo.txt
Checkpoints enabled

CUDA device info
name: GeForce GTX 465
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 11 (352 shader cores)
clock rate: 810MHz

got assignment: exp=332203901 bit_min=68 bit_max=77

tf(332203901, 68, 77);
k_min = 444227030520
k_max = 227444239813169
Using GPU kernel "95bit_mul32"
class 0: tested 105633251328 candidates in 2243766ms (47078550/sec) (avg. wait: 11115usec)
avg. wait > 500usec, increasing SievePrimes to 27000
class 4: tested 104929296384 candidates in 2252921ms (46574778/sec) (avg. wait: 11116usec)
avg. wait > 500usec, increasing SievePrimes to 29000
class 15: tested 104284815360 candidates in 2264227ms (46057579/sec) (avg. wait: 11118usec)
avg. wait > 500usec, increasing SievePrimes to 31000
class 16: tested 103688994816 candidates in 2279172ms (45494150/sec) (avg. wait: 11116usec)
avg. wait > 500usec, increasing SievePrimes to 33000
class 19: tested 103137509376 candidates in 2293794ms (44963719/sec) (avg. wait: 11117usec)
avg. wait > 500usec, increasing SievePrimes to 35000
[/QUOTE]

Basically it keeps increasing the value of SievePrimes, but the avg. wait is not decreasing, so it is always going to be > 500usec. What am I doing wrong?
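For context on the k_min/k_max lines in that log: candidate factors q of the Mersenne number M(p) = 2^p - 1 have the form q = 2kp + 1, so the k bounds follow from dividing the bit-level limits by 2p. A quick sketch (mfaktc additionally aligns k_min to its class grid, so the logged k_min sits slightly below the exact floor):

```python
# Reconstruct the k range for TF of M(332203901) from 2^68 to 2^77.
# Factors have the form q = 2*k*p + 1, so k runs roughly from
# 2^68/(2p) to 2^77/(2p).

p = 332203901                  # exponent from amphoria's log

k_min = 2**68 // (2 * p)       # a few hundred above the logged 444227030520,
                               # which mfaktc rounds down to a class boundary
k_max = 2**77 // (2 * p)       # matches the logged value exactly

print(k_min, k_max)
```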

TheJudger 2010-06-23 08:07

Hello,

[QUOTE=amphoria;219552]Basically it is increasing the value of SievePrimes but the avg wait is not decreasing so it is always going to be > 500usec. What am I doing wrong?[/QUOTE]

The short answer: you're using Windows. :razz:

Honestly, I have no clue what happens on Windows; it seems it doesn't affect all Windows systems. A friend of mine tested it on Windows 7 Ultimate 64-bit with the binary from Kevin on his GTX 275 and it ran fine, while Kevin (kjaget) noticed this problem on his Windows machine, too.
Did you try to adjust NumStreams?
Does this happen with the recent binary from Kevin, too? The value of THREADS_PER_GRID is not optimal for your GPU in the default setting but this is not a real problem, just a (very small) performance penalty.

Oliver

amphoria 2010-06-23 16:45

[QUOTE=TheJudger;219610]Hello,



The short answer: you're using Windows. :razz:

Honestly, I have no clue what happens on Windows; it seems it doesn't affect all Windows systems. A friend of mine tested it on Windows 7 Ultimate 64-bit with the binary from Kevin on his GTX 275 and it ran fine, while Kevin (kjaget) noticed this problem on his Windows machine, too.
Did you try to adjust NumStreams?
Does this happen with the recent binary from Kevin, too? The value of THREADS_PER_GRID is not optimal for your GPU in the default setting but this is not a real problem, just a (very small) performance penalty.

Oliver[/QUOTE]

I tried adjusting NumStreams to 5 and also 4 with this run. Note that I changed THREADS_PER_GRID to 33 << 15, hence the need to recompile Kevin's code.

Did Kevin manage to work around it? I guess I could try it under Linux in a VM.

TheJudger 2010-06-24 07:40

Hello amphoria,

As mentioned, you don't need to adjust THREADS_PER_GRID: without an "optimal" setting there is only a (very small) performance penalty, but the code runs fine!

---
Kevin: just an idea, perhaps we should check the resolution of the gettimeofday() emulation on Windows... Perhaps it gets confused by SpeedStep/Turbomode/Cool&Quiet/whatever.
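One quick way to probe that, sketched here with Python's time.time() standing in for the gettimeofday() emulation (min_timer_step_usec is a hypothetical helper, not mfaktc code): sample the clock in a tight loop and record the smallest nonzero step it ever takes.

```python
# Probe the effective resolution of a wall-clock timer by sampling it in
# a tight loop and keeping the smallest nonzero step observed. A coarse
# Windows timer emulation would show steps far above 1 usec.

import time

def min_timer_step_usec(samples=200000):
    prev = time.time()
    smallest = None
    for _ in range(samples):
        now = time.time()
        step = (now - prev) * 1e6   # step in microseconds
        if step > 0 and (smallest is None or step < smallest):
            smallest = step
        prev = now
    return smallest

print(f"smallest observed step: {min_timer_step_usec():.1f} usec")
```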

Oliver

kjaget 2010-06-24 20:48

Recent versions (anything which added a variable number of streams) give expected results running on a GTX 275: 77M+ candidates/sec once SievePrimes adjusts.

I do see the whole "increasing SievePrimes doesn't lower average wait" behavior with large exponents running small bit ranges, but since those runs take less than a minute I'm not really losing that much time in real terms (percentage-wise it hurts, but it's only a few seconds).

I never figured out what was going on in that case, but one guess was that Windows schedules streams differently from Linux. Maybe faster GPUs need more streams ready to run than the program can provide, so it wastes a lot of time running X streams when the GPU could handle X+2 or whatever. That would kill throughput, since some of the GPU would always be idle.

I haven't looked at the code for this in a while, but would it be worth hacking up some test code with a huge number of streams (20+) to see if the behavior changes?
