mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2010-06-10, 12:44   #276
Karl M Johnson
Thanks for all the suggestions.
I finally managed to benchmark Prime95 and mfaktc.


Specs:
Core2Quad Q6600 @ 3.51 GHz
4 × 1 GB DDR2 @ 780 MHz
GTX 285 @ 711/1548/2500

Prime95 x64 25.11 build 2 (one core used, all other options default)
mfaktc x64 0.07p1 (NumStreams=3, other options default)

exp=73708469 bit_min=64 bit_max=65


Prime95: 20 minutes
mfaktc: 82,668 ms

Like, wow
That's some speed.
Old 2010-06-10, 13:06   #277
TheJudger

Hi Karl,

You might want to try bigger ranges with mfaktc, e.g. TF M73708469 from 2^64 to 2^67 at once.
This might increase the speedup (compared to Prime95) by another 5%.
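For example, a bigger range can be queued as a single worktodo.txt entry. Assuming the Prime95-style Factor= syntax that mfaktc reads (check your version's README for the exact format it expects), the line might look like:

```
Factor=73708469,64,67
```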

Oliver
Old 2010-06-10, 17:12   #278
Karl M Johnson

This new exponent took 574.85 seconds to finish in mfaktc.
ETA for Prime95 is 5h41m.

Now, the speedup:

exp=73708469 bit_min=64 bit_max=65
mfaktc is 14.63x faster than Prime95

exp=73708469 bit_min=64 bit_max=67
mfaktc is 35.59x faster than Prime95


You were right, Oliver: mfaktc was even faster on these limits.
Well, it could be Prime95's ETA algorithm.
But still, mfaktc is faster.
Old 2010-06-10, 19:27   #279
chalsall

Quote:
Originally Posted by Karl M Johnson View Post
Well, it could be Prime95's ETA algorithm.
Do *not* trust Prime95's ETA algorithm for factoring.
Old 2010-06-11, 11:31   #280
kjaget

Here's the Windows port of 0.08. As always, please report any problems you find in this thread and I'll check them out ASAP.

mfaktc-0.08-win-x64.zip
Old 2010-06-11, 11:52   #281
TheJudger

Hi Karl,

Quote:
Originally Posted by Karl M Johnson View Post
This new exponent took 574.85 seconds to get finished on mfaktc.
ETA for Prime95 is 5h41m.

Now, the speedup:

exp=73708469 bit_min=64 bit_max=65
mfact is 14.63x faster than Prime95

exp=73708469 bit_min=64 bit_max=67
mfaktc is 35.59x faster than Prime95
I think the ETA of Prime95 is wrong.
From 2^64 to 2^67 there are seven times more candidates than from 2^64 to 2^65. Comparing mfaktc's runtimes, it needs 574.85s / 82.67s = 6.95x longer for the 7 times bigger range. Of course this depends on the exponent and range.
I think the CPU code (the siever) limits your throughput. Throughput should increase once you start 2 instances of mfaktc (in different directories) simultaneously. Of course this lowers the speedup compared to Prime95, but it will yield higher total throughput!

Oliver

Last fiddled with by TheJudger on 2010-06-11 at 11:52
Old 2010-06-22, 19:27   #282
amphoria

mfaktc v0.08 for Win-64

I had to recompile the code so that THREADS_PER_GRID was correct for a GTX 465. After getting it to compile I ran the SELF_TEST and confirmed that all the expected factors were found.

However, when I run it on a real candidate all I get is the following:

Quote:
mfaktc v0.08Winx64

Compiletime Options
THREADS_PER_GRID 1081344
THREADS_PER_BLOCK 256
SIEVE_SIZE_LIMIT 32kiB
SIEVE_SIZE 230945bits
VERBOSE_TIMING disabled
SELFTEST disabled
MORE_CLASSES disabled

Runtime Options
SievePrimes 25000
SievePrimesAdjust 1
NumStreams 4
WorkFile worktodo.txt
Checkpoints enabled

CUDA device info
name: GeForce GTX 465
compute capabilities: 2.0
maximum threads per block: 1024
number of multiprocessors: 11 (352 shader cores)
clock rate: 810MHz

got assignment: exp=332203901 bit_min=68 bit_max=77

tf(332203901, 68, 77);
k_min = 444227030520
k_max = 227444239813169
Using GPU kernel "95bit_mul32"
class 0: tested 105633251328 candidates in 2243766ms (47078550/sec) (avg. wait: 11115usec)
avg. wait > 500usec, increasing SievePrimes to 27000
class 4: tested 104929296384 candidates in 2252921ms (46574778/sec) (avg. wait: 11116usec)
avg. wait > 500usec, increasing SievePrimes to 29000
class 15: tested 104284815360 candidates in 2264227ms (46057579/sec) (avg. wait: 11118usec)
avg. wait > 500usec, increasing SievePrimes to 31000
class 16: tested 103688994816 candidates in 2279172ms (45494150/sec) (avg. wait: 11116usec)
avg. wait > 500usec, increasing SievePrimes to 33000
class 19: tested 103137509376 candidates in 2293794ms (44963719/sec) (avg. wait: 11117usec)
avg. wait > 500usec, increasing SievePrimes to 35000
Basically it keeps increasing the value of SievePrimes, but the avg. wait is not decreasing, so it is always going to be > 500 usec. What am I doing wrong?
Old 2010-06-23, 08:07   #283
TheJudger

Hello,

Quote:
Originally Posted by amphoria View Post
Basically it is increasing the value of SievePrimes but the avg wait is not decreasing so it is always going to be > 500usec. What am I doing wrong?
The short answer: you're using Windows.

Actually I've no clue what happens on Windows; it seems it doesn't affect all Windows systems. A friend of mine tested it on Windows 7 Ultimate 64-bit with the binary from Kevin on his GTX 275 and it ran fine, while Kevin (kjaget) noticed this problem on his Windows machine, too.
Did you try to adjust NumStreams?
Does this happen with the recent binary from Kevin, too? The value of THREADS_PER_GRID is not optimal for your GPU in the default setting, but this is not a real problem, just a (very small) performance penalty.

Oliver

Last fiddled with by TheJudger on 2010-06-23 at 08:47
Old 2010-06-23, 16:45   #284
amphoria

Quote:
Originally Posted by TheJudger View Post
Did you try to adjust NumStreams?
Does this happen with the recent binary from Kevin, too?
I tried adjusting NumStreams to 5 and also 4 with this run. Note that I changed THREADS_PER_GRID to 33 << 15, hence the need to recompile Kevin's code.

Did Kevin manage to work around it? I guess I could try it under Linux in a VM.
Old 2010-06-24, 07:40   #285
TheJudger

Hello amphoria,

As mentioned, you don't need to adjust THREADS_PER_GRID: without an "optimal" setting there is only a small performance penalty, but the code runs fine!

---
Kevin: just an idea, perhaps we should check the resolution of the gettimeofday() emulation on Windows... Perhaps it gets confused by SpeedStep/Turbomode/Cool&Quiet/whatever.

Oliver
Old 2010-06-24, 20:48   #286
kjaget

Recent versions (anything which added a variable number of streams) give expected results on a GTX 275: 77M+ candidates/sec once SievePrimes adjusts.

I do see the whole "increasing SievePrimes doesn't lower average wait" behavior with large exponents running small bit ranges, but since those runs take less than a minute I'm not losing much time in real terms (percentage-wise it hurts, but it's only a few seconds).

I never figured out what was going on in that case, but one guess was that Windows behaves differently from Linux in how it schedules streams. Maybe faster GPUs need more streams ready to run than the program can provide, so it's wasting a lot of time running X streams when the GPU could handle X+2 or whatever. That would kill throughput since some of the GPU would always be idle.

I haven't looked at the code for this in a while, but would it be worth it to try and hack up some test code with a huge number of streams (20+) and see if the behavior changes?