#276
Mar 2010
3·137 Posts
Thanks for all the suggestions.
I finally managed to benchmark Prime95 against mfaktc.

Specs:
Core2Quad Q6600 @ 3.51 GHz
4 × 1 GB DDR2 @ 780 MHz
GTX 285 @ 711/1548/2500
Prime95 x64 25.11 build 2 (it used 1 core, all options default)
mfaktc x64 0.07p1 (NumStreams=3, others default)

exp=73708469, bit_min=64, bit_max=65
Prime95: 20 minutes
mfaktc: 82668 ms

Like, wow. That's some speed!
#277
"Oliver"
Mar 2005
Germany
11·101 Posts
Hi Karl,
you might want to try bigger ranges with mfaktc, e.g. TF M73708469 from 2^64 to 2^67 at once. This might increase the speedup (compared to P95) by another 5%.

Oliver
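Oliver's suggestion translates into a single assignment covering all three bit levels at once. A hypothetical worktodo entry in the usual GIMPS `Factor=` convention (the exact file name and accepted syntax vary between mfaktc versions, so check your version's README):

```
Factor=73708469,64,67
```

instead of three separate entries for 64-65, 65-66, and 66-67; mfaktc then works the whole range in one run.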
#278
Mar 2010
411₁₀ Posts
This new exponent took 574.85 seconds to finish on mfaktc. ETA for Prime95 is 5h41m. Now, the speedup:

exp=73708469, bit_min=64, bit_max=65: mfaktc is 14.63x faster than Prime95
exp=73708469, bit_min=64, bit_max=67: mfaktc is 35.59x faster than Prime95

You were right, Oliver: mfaktc was even faster at these limits. Well, it could be Prime95's ETA algorithm. But still, mfaktc is faster!
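The quoted speedups can be checked from the raw times in the thread; this small Python sketch redoes the arithmetic (5h41m = 20460 s):

```python
# Re-deriving the speedup figures from the times quoted in the post.
mfaktc_1bit = 82.668                 # mfaktc, 2^64..2^65, seconds
mfaktc_3bit = 574.85                 # mfaktc, 2^64..2^67, seconds
p95_eta_3bit = 5 * 3600 + 41 * 60    # Prime95 ETA of 5h41m = 20460 s

print(round(p95_eta_3bit / mfaktc_3bit, 2))   # → 35.59, as quoted
# The quoted 14.63x for the one-bit range implies Prime95 actually took
# about 14.63 * 82.668 ≈ 1209 s, i.e. "20 minutes" above is rounded.
print(round(14.63 * mfaktc_1bit))             # → 1209
```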
#279
If I May
"Chris Halsall"
Sep 2002
Barbados
2·67·73 Posts
#280
Jun 2005
81₁₆ Posts
Here's the Windows port of 0.08. As always, please report any problems you find in this thread and I'll check them out ASAP.

mfaktc-0.08-win-x64.zip
#281
"Oliver"
Mar 2005
Germany
457₁₆ Posts
Hi Karl,
Quote:
From 2^64 to 2^67 there are seven times more candidates than from 2^64 to 2^65. Comparing the runtimes of mfaktc, it needs 574.85 s / 82.67 s = 6.95x longer for the 7x bigger job. Of course this depends on the exponent and range. I think the CPU code (the siever) limits your throughput. The throughput should increase once you start 2 instances of mfaktc (in different directories) simultaneously. Of course this lowers the speedup compared to Prime95, but it will yield a higher throughput!

Oliver

Last fiddled with by TheJudger on 2010-06-11 at 11:52
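Oliver's seven-times figure follows directly from the interval sizes, since 2^67 - 2^64 = 7 · 2^64. A quick Python check of both the candidate ratio and the measured runtime ratio:

```python
# The interval [2^64, 2^67) holds seven times as many factor candidates
# as [2^64, 2^65), because 2^67 - 2^64 = 7 * 2^64.
small = 2**65 - 2**64
big = 2**67 - 2**64
print(big // small)                # → 7

# mfaktc's runtimes from the thread scale almost linearly with that count:
print(round(574.85 / 82.67, 2))    # → 6.95, close to the ideal 7
```

The near-linear scaling is what suggests the GPU, not the CPU siever, was the bottleneck for the wider range.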
#282
"Dave"
Sep 2005
UK
101011011000₂ Posts
I had to recompile the code so that THREADS_PER_GRID was correct for a GTX 465. After getting it to compile I ran the SELF_TEST and confirmed that all the expected factors were found.
However, when I run it on a real candidate, all I get is the following:

Quote:
#283
"Oliver"
Mar 2005
Germany
2127₈ Posts
Hello,
Quote:
Actually I've no clue what happens on Windows; it seems like it doesn't affect all Windows systems. A friend of mine tested it on Windows 7 Ultimate 64-bit with the binary from Kevin on his GTX 275 and it ran fine, while Kevin (kjaget) noticed this problem on his Windows machine, too. Did you try to adjust NumStreams? Does this happen with the recent binary from Kevin, too? The value of THREADS_PER_GRID is not optimal for your GPU in the default setting, but this is not a real problem, just a (very small) performance penalty.

Oliver

Last fiddled with by TheJudger on 2010-06-23 at 08:47
#284
"Dave"
Sep 2005
UK
AD8₁₆ Posts
Quote:
Did Kevin manage to work around it? I guess I could try it under Linux in a VM.
#285
"Oliver"
Mar 2005
Germany
11×101 Posts
Hello amphoria,
As I said, you don't need to adjust THREADS_PER_GRID; without an "optimal" setting there is only a small performance penalty, but the code runs fine!

---

Kevin: just an idea, but perhaps we should check the resolution of the gettimeofday() emulation on Windows... Perhaps it gets confused by SpeedStep/Turbo Mode/Cool'n'Quiet/whatever.

Oliver
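One way to probe what Oliver suggests is to measure a clock's effective tick directly. This is a hypothetical sketch in Python rather than the C used by mfaktc, but the idea carries over: read the clock in a tight loop and record the smallest non-zero step it can report.

```python
# Probe a timer's effective resolution by finding the smallest non-zero
# step between consecutive readings. A coarse timer (as some Windows
# gettimeofday() emulations were) shows steps in the millisecond range;
# a fine one shows microseconds or better.
import time

def timer_resolution(clock=time.time, samples=100):
    """Smallest observed non-zero difference between consecutive readings."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 <= t0:          # spin until the clock visibly advances
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest

print(timer_resolution())        # e.g. microseconds on Linux; can be much
                                 # coarser on old Windows timer emulations
```

If timing calls only ever advance in coarse steps, any per-stream wait measurement built on them would be noisy, which could plausibly explain Windows-only oddities.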
#286
Jun 2005
3×43 Posts
Recent versions (anything that added a variable number of streams) give expected results running a GTX 275: 77+M candidates/sec once sieve_primes adjusts.

I do see the whole "increasing sieve primes doesn't lower average wait" issue with large exponents running small bit ranges, but since those runs take less than a minute I'm not really losing much time in real terms (percentage-wise it hurts, but it's only a few seconds).

I never figured out what was going on in that case, but one guess was that Windows behaves differently from Linux in how it schedules streams. Maybe faster GPUs need more streams ready to run than the program can provide, so it's wasting a lot of time running X streams when the GPU could handle X+2 or whatever. That would kill throughput since some of the GPU would always be idle. I haven't looked at the code for this in a while, but would it be worth it to try and hack up some test code with a huge number of streams (20+) and see if the behavior changes?
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1676 | 2021-06-30 21:23 |
| The P-1 factoring CUDA program | firejuggler | GPU Computing | 753 | 2020-12-12 18:07 |
| gr-mfaktc: a CUDA program for generalized repunits prefactoring | MrRepunit | GPU Computing | 32 | 2020-11-11 19:56 |
| mfaktc 0.21 - CUDA runtime wrong | keisentraut | Software | 2 | 2020-08-18 07:03 |
| World's second-dumbest CUDA program | fivemack | Programming | 112 | 2015-02-12 22:51 |