mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

ixfd64 2013-01-07 18:03

Perhaps a little off-topic, but how feasible is GPU P-1 factoring?

firejuggler 2013-01-07 18:23

It is possible, but probably not efficient.

kladner 2013-01-07 19:43

[QUOTE=chalsall;323942]For those who suddenly find themselves with spare CPU capacity available (and have some memory available), please consider doing some P-1 work.[/QUOTE]

Five P-1s online since last night, Sir! :cool: I'm holding back one core to see how responsive things are, though if anything it's a bit better than it was with 6x mfaktc 0.19 running.

I have been experimenting with the GPUSieve settings. On the GTX 570, GPUSieveSize=128 improved the Time value from ~5.9s to ~5.8s, and GHz-D/D from 419 to 429. GPUSieveProcessSize=8 caused a very slight improvement, and I returned it to the default of 16.

On the GTX 460, GPUSieveSize=128 essentially doubled the time and halved the GHz-D/D. IIRC, going to GPUSieveSize=32 had a similar effect. However, GPUSieveProcessSize=8 reduced the Time from 12.05s to 11.9s. GHz-D/D went from 207 to 209.

chalsall 2013-01-07 20:21

[QUOTE=kladner;323956]Five P-1s online since last night, Sir! :cool: I'm holding back one core to see how responsive things are, though if anything it's a bit better than it was with 6x mfaktc 0.19 running.[/quote]

Thanks kladner.

It's a little funny how much we find we have to fight Augustus ourselves simply to come to terms.....

We throw our hands out hoping for impact, but we don't actually hope for impact.

Kinda weird....

kracker 2013-01-07 23:42

[QUOTE=chalsall;323942]
For those who suddenly find themselves with spare CPU capacity available (and have some memory available), please consider doing some P-1 work.
[/QUOTE]

Hmm, which do you think is more useful to the project, DC or P-1?
And this is probably an unanswerable or stupid question: if P-1, what is the "minimum" memory required/recommended for what is being given out now?

James Heinrich 2013-01-07 23:48

[QUOTE=kracker;323979]P-1, what is the "minimum" memory required/recommended for what is being given out now?[/QUOTE]Assuming what's given out now is somewhere around 60M, my [url=http://mersenne.ca/prob.php?exponent=60000011&guess_saved_tests=2]P-1 probability calculator[/url] says that around 512MB is "minimum", 1.5GB is "good" and 12GB is "max".

kracker 2013-01-08 00:02

[QUOTE=James Heinrich;323980]Assuming what's given out now is somewhere around 60M, my [URL="http://mersenne.ca/prob.php?exponent=60000011&guess_saved_tests=2"]P-1 probability calculator[/URL] says that around 512MB is "minimum", 1.5GB is "good" and 12GB is "max".[/QUOTE]

Hmm, there is a "max"? didn't know that.
Anyways, thanks, good to know :smile:

Dubslow 2013-01-08 00:05

[QUOTE=kracker;323982]Hmm, there is a "max"? didn't know that.
Anyways, thanks, good to know :smile:[/QUOTE]

Well, sort of. If there's enough memory, Prime95 will do all of stage 2 in one pass ("processing 480 of 480 relative primes"). If there's 10 times [i]that[/i] amount of memory, then it will use more relative primes, but the gains are minimal at best. Heck, after 3-4 GiB, the gains are minimal.

kracker 2013-01-08 00:10

[QUOTE=Dubslow;323984]Well, sort of. If there's enough memory, Prime95 will do all of stage 2 in one pass ("processing 480 of 480 relative primes"). If there's 10 times [I]that[/I] amount of memory, then it will use more relative primes, but the gains are minimal at best. Heck, after 3-4 GiB, the gains are minimal.[/QUOTE]

I see. So it's not really a literal "max" it's just "over this, gains almost useless unless.."

Dubslow 2013-01-08 00:12

[QUOTE=kracker;323986]I see. So it's not really a literal "max" it's just "over this, gains almost useless unless.."[/QUOTE]

Well... sort of. :smile: Like I said, the "gains almost useless" point is (quite a bit) lower than the max he mentions; the max refers to the memory required to process the "standard" amount of relative primes in one pass (where "standard" is a hand-waving over-simplification).

Xyzzy 2013-01-08 00:46

[QUOTE]So it's not really a literal "max" it's just "over this, gains almost useless unless.."[/QUOTE][url]http://www.mersenneforum.org/showpost.php?p=282335&postcount=10[/url]

James Heinrich 2013-01-08 00:56

[QUOTE=kracker;323986]I see. So it's not really a literal "max" it's just "over this, gains almost useless unless.."[/QUOTE]For any given bounds, there is a certain amount of RAM required to run a given number of relative primes at once. Normally Prime95 runs several passes with as many RPs as it has RAM for at once, to complete a full set of 480 relative primes. I don't believe Prime95 will let you run P-1 if you don't have enough RAM to run at least 8 RPs at once (hence the "minimum" value). Each pass has some (small) overhead, so fewer passes means a bit (slightly) faster. The "maximum" value represents running all 480 RPs in one pass. Under certain semi-rare conditions, Prime95 will select a number of relative primes other than 480, but that's the "normal" value.

However, the P-1 bounds are partially selected based on the amount of RAM available, so a machine with 512MB allocated and another with 20GB allocated won't pick the same bounds for the same exponent. The one with more RAM will pick higher bounds, run a little [i]slower[/i], but have a higher chance of a factor. If they were forced to use the same bounds, the more-RAM machine would run the assignment slightly faster.

Dubslow 2013-01-08 01:40

[QUOTE=James Heinrich;323994]run a little [i]slower[/i], but have a higher chance of a factor.[/QUOTE]

...the end result being that you get more factors per cpu time. :smile:

ixfd64 2013-01-08 06:59

[QUOTE=ixfd64;319724]I've set up my CUDA environment, but I get the following errors when I try to compile mfaktc 0.19: [see attachment]

Anyone know what I'm doing wrong?

Edit: I've changed the item type to CUDA C/C++ and the platform to VC90, and I've also installed Visual C++ 2008. However, it's still complaining of an issue with the "atomicInc" function. Anyone know how to resolve this?[/QUOTE]

OK, I've decided to try compiling mfaktc again. The error went away after I changed the code generation parameter to "compute_11,sm_11" as suggested. However, I'm getting a bunch of new errors:

[QUOTE]1>------ Build started: Project: mfaktc_0.20, Configuration: Debug Win32 ------
1> Compiling CUDA source file tf_96bit_base_math.cu...
1>
1> C:\Users\danny\Desktop\mfaktc-0.20\src>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\nvcc.exe" -gencode=arch=compute_11,code=\"sm_11,compute_11\" --use-local-env --cl-version 2008 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -G --keep-dir "Debug" -maxrregcount=0 --machine 32 --compile -g -DWIN32 -D_DEBUG -D_WINDOWS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o "Debug\tf_96bit_base_math.cu.obj" "C:\Users\danny\Desktop\mfaktc-0.20\src\tf_96bit_base_math.cu"
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(21): error : identifier "int96" is undefined
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(21): error : identifier "int96" is undefined
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(33): error : incomplete type is not allowed
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(33): error : identifier "int96" is undefined
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(33): error : identifier "a" is undefined
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(35): error : expected a ";"
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(170): warning : parsing restarts here after previous syntax error
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(194): error : expected a declaration
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(281): warning : parsing restarts here after previous syntax error
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(304): error : expected a declaration
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(342): warning : parsing restarts here after previous syntax error
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(384): error : expected a declaration
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(418): warning : parsing restarts here after previous syntax error
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(453): error : expected a declaration
1>C:/Users/danny/Desktop/mfaktc-0.20/src/tf_96bit_base_math.cu(21): warning : function "cmp_ge_96" was declared but never referenced
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\extras\visual_studio_integration\MSBuildExtensions\CUDA 5.0.targets(592,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\bin\nvcc.exe" -gencode=arch=compute_11,code=\"sm_11,compute_11\" --use-local-env --cl-version 2008 -ccbin "c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0\include" -G --keep-dir "Debug" -maxrregcount=0 --machine 32 --compile -g -DWIN32 -D_DEBUG -D_WINDOWS -Xcompiler "/EHsc /W3 /nologo /Od /Zi /RTC1 /MDd " -o "Debug\tf_96bit_base_math.cu.obj" "C:\Users\danny\Desktop\mfaktc-0.20\src\tf_96bit_base_math.cu"" exited with code 2.
========== Build: 0 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========[/QUOTE]

Anyone know what I'm doing wrong? For the record, I'm using the CUDA 5.0 toolkit.

Dubslow 2013-01-08 07:02

Looks like you're missing a header file.

James Heinrich 2013-01-08 13:45

[QUOTE=James Heinrich;323926]Now that everyone has access to v0.20, I'd like to ask for a new round of benchmarks from everyone so I can update my [url=http://www.mersenne.ca/mfaktc.php#benchmark]GPU-TF benchmark page[/url].[/QUOTE]Thanks for the 10 benchmarks I've received so far. Unfortunately they've all been in the GTX 5xx series (550, 560, 570, 580). I'd be very interested in benchmarks from people with 400- and 600-series cards, please.

TheJudger 2013-01-08 15:23

Hi,

[QUOTE=ixfd64;324018]OK, I've decided to try compiling mfaktc again. The error went away after I changed the code generation parameter to "compute_11,sm_11" as suggested. However, I'm getting a bunch of new errors:



Anyone know what I'm doing wrong? For the record, I'm using the CUDA 5.0 toolkit.[/QUOTE]

You should enable code generation for newer GPU types, too; this code will run faster on those cards.
The problem is that you're trying to compile a file which is just part of another file, so you can't compile tf_96bit_base_math.cu standalone. It contains common code which is used by other .cu files (tf_96bit.cu, tf_barrett96.cu and tf_barrett96_gs.cu). Take a look at the Makefile and you'll see those dependencies. I'm not using the Microsoft IDE, so I have no project file for you; I'm using GNU Make on Windows, too.

Oliver

P.S. I plan to upgrade my Windows to CUDA 5.0 within the next few days so I can provide CUDA 5.0 executables, too.

TObject 2013-01-08 20:19

[QUOTE=TheJudger;323840]As usual: finish your current assignment and upgrade to mfaktc 0.20 after that.[/QUOTE]

It will take me a couple of months to complete some of my longer running assignments. So I would like to check again, is fiddling with checkpoint files strongly discouraged?

Thank you

TObject 2013-01-08 20:45

BTW, I just tried running 0.19 and 0.20 side by side: no problem; everything appears operational (albeit with a slight [about 5% at first glance] loss of overall efficiency).

The 0.20 is insanely fast.

Thank you very much.

LaurV 2013-01-09 07:59

[QUOTE=TObject;324071]The 0.20 is insanely fast.[/QUOTE]

Well, not really (LaurV grumpy now! :razz:)

Everybody seems to miss the fact that on the old 0.18 [B]you had to run more instances to max the GPU[/B]. Comparing the new one with the old one "side by side" is like comparing plums with mangoes: when they are green they look exactly the same except the size. Mangoes are 4 times bigger.

With the old version I was able to get 340-360 GHzDays/day from a single card, running 3 or 4 instances, using 3 (non-HT) or 2 (HT) overclocked CPU cores respectively.

With the new one I am able to get 390-410 GHzDays/Day for the same exponent range and the same bit levels, with NO CPU participation.

You cannot get more from a card than it can give, beyond small optimizations. Mfaktc is now a (brilliant) mature product; small future optimizations may make it a bit better and a bit faster, but you shouldn't expect future versions to be 100 times faster. Or 10 times faster. Or even 3 times faster! Which I did not expect from 0.20, of course. It is just using the card better, for a small surplus of speed.

Of course, if you max the card (like 97-100% busy) with a single instance, then such a run would be "insanely fast", theoretically 2-4 times faster than a single instance of the old version. It's the same as using more cores in P95 to LL/DC the same exponent: the time per iteration halves, or is 3-4 times shorter (and the LL test faster), depending on how many cores you use.

Put 3-4 instances of the new mfaktc on the same card, and you will see that the times are comparable. The old one was losing time on CPU/GPU communication, which is "solved" by GPU sieving in the new version. That is where the "additional" speed comes from (plus other small things :razz:).

[B]The biggest advantage[/B] of the new version (as I repeatedly said in the past when we were talking about what I want and what we should expect from newer versions) is that [B]IT LEAVES YOUR CPU FREE[/B], besides the fact that it is "a little bit" faster :razz:.

For me, this (freeing the CPU) is [URL="http://en.wikipedia.org/wiki/Manna"]manna from heaven[/URL]! (As everybody knows, my systems are all CPU-bottlenecked.) Now I can run P-1, or LL, or DC, or aliquots with the CPU, which I could not before. THIS IS THE BIG ADVANTAGE. For which I bow again to the people who made this possible. :bow:

Dubslow 2013-01-09 08:37

[QUOTE=LaurV;324114]
For me, this (letting the CPU free) is the [URL="http://en.wikipedia.org/wiki/Manna"]manna from the heaven[/URL]! (as everybody knows, my systems are all CPU-bottle-necked). Now I can run P-1, or LL, or DC or aliquots, with the CPU, which before I could not. THIS IS THE BIG ADVANTAGE. For which I bow again to the people who made this possible. :bow:[/QUOTE]

I just stopped running mfaktc for months so I could do the other things :smile:

By the way, how can I make it more responsive? I set GPUSieveProcessSize=8 and GPUSieveSize=16, and now it's useable, but still somewhat laggy. I don't really want to reduce GPUSieveSize further, so will doing something like reducing sieve primes help? Or will that cause an even bigger performance hit than reducing GPUSieveSize further?

lycorn 2013-01-09 11:03

mfaktc 0.20 and the small exponents
 
I've been running 0.20 on 60-61M exponents and am quite pleased with it. Faster, and "CPU-free", which is definitely a plus.
Nevertheless, I was surprised when I tried to run it on small exponents (by small I mean 2.6M exponents), from 62 to 64 bits. The GHz-d/d went down from ~258 to 40-45, even using the "LessClasses" version. It is way slower than 0.19 for this type of work.
Is there any setting I should look into, or is it just the way it is?

ET_ 2013-01-09 12:31

Thank you Oliver and George :smile:

Now, my question.

I am actually running 1 mfaktc 0.19 and 1 cudalucas (DC) on my GTX 275, and 2 mfaktc 0.19 and 1 cudalucas (DC+LL) on my GTX580.

If I run mfaktc 0.20, how much GPU can be used for Cudalucas?

Luigi

James Heinrich 2013-01-09 13:00

[QUOTE=lycorn;324124]I was surprised when I tried to run it on small exponents (by small I mean 2.6M exponents), from 62 to 64 bits. The GHz-d/d went down from ~258 to 40-45, even using the "LessClasses" version. It is way slower than 0.19 for this type of work.
Is there any setting I should look into, or is it just the way it is?[/QUOTE]GPU sieving is not enabled below 2[sup]64[/sup]. Not only that, but it uses older, less-optimized kernels that are inherently slower. I have been doing a lot of <2[sup]64[/sup] TF above 1000M and the best I can come up with is about 140GHd/d from my GTX 570, and that's using 6 CPU cores to boot. In comparison, running a single GPU-sieving instance in a normal range I can get about 420Ghd/d. So yes, it's pretty inefficient.

Experiment with GridSize (my exponents run in 10 seconds or less, so GridSize=0 made a big improvement for me), use v0.20-LESS_CLASSES 64-bit (in CPU-sieving cases, 64-bit is faster; for GPU-sieving 32-bit is faster).

If you look at the mfaktc v0.20 .plan, there's now a line for improved support below 2[sup]64[/sup], but in talking to Oliver it's apparently non-trivial, so don't hold your breath.

James Heinrich 2013-01-09 13:20

[QUOTE=ET_;324132]If I run mfaktc 0.20, how much GPU can be used for Cudalucas?[/QUOTE]On your GTX 275: you can't -- GPU sieving isn't supported below CC 2.0 (that GPU is CC 1.3).

On any supported GPU: try and see. There's no controllable load-sharing, it's just a competition for GPU resources, whether it's mfaktc+CUDALucas, or multiple instances of mfaktc. You'll likely get somewhere around a 50:50 balance, or it may be biased towards one program or the other, depending on how the code flows. Easiest way to answer is try and see.

ET_ 2013-01-09 13:30

[QUOTE=James Heinrich;324137]On your GTX 275: you can't -- GPU sieving isn't supported below CC 2.0 (that GPU is CC 1.3).

On any supported GPU: try and see. There's no controllable load-sharing, it's just a competition for GPU resources, whether it's mfaktc+CUDALucas, or multiple instances of mfaktc. You'll likely get somewhere around a 50:50 balance, or it may be biased towards one program or the other, depending on how the code flows. Easiest way to answer is try and see.[/QUOTE]

I did it with mmff, and noticed that one instance of mmff nearly blocks every other program running on the GPU... I was wondering if the same behavior may be expected from the new mfaktc 0.20

Luigi

Dubslow 2013-01-09 13:39

[QUOTE=ET_;324139]I did it with mmff, and noticed that one instance of mmff nearly blocks every other program running on the GPU... I was wondering if the same behavior may be expected from the new mfaktc 0.20

Luigi[/QUOTE]

Presumably. AFAIK, the code is very similar -- TheJudger took Prime95's sieve code (in turn built on work by rcv, bsquared, and axn IIRC), and Prime95 took TheJudger's TF code. :smile:

Aramis Wyler 2013-01-09 16:57

A couple reference points.
 
I will put up benchmarks for my GTX480 when I get home, but I'll need to set the sieve back to default. I had gotten an extra 1.3 ghzdays/day by setting it down to 70000. When I get home from work I'll reset the default, run 5 numbers and post the results to your form.

It's currently doing 374.85 ghzdays/day. It's going about 25 ghz days/day faster than it did when I was running 4 instances of .19 (one per cpu core) because my cpu couldn't keep up with it.

sonjohan 2013-01-09 18:18

Is there a way to suppress the newline-posting after every 5 seconds?
For the benchmark (which asks for wall clock time), it would be quite useful to only see 1st and last line of the output.
It was possible in the previous version, but I don't know whether it's still possible.

kladner 2013-01-09 18:23

[QUOTE=sonjohan;324177]Is there a way to suppress the newline-posting after every 5 seconds?
For the benchmark (which asks for wall clock time), it would be quite useful to only see 1st and last line of the output.
It was possible in the previous version, but I don't know whether it's still possible.[/QUOTE]

Would this be what you're looking for? (From mfaktc.ini)
[CODE]# possible values for PrintMode:
# 0: print a new line for each finished class
# 1: overwrite the current line (more compact output)
#
# Default: PrintMode=0

PrintMode=0
[/CODE]

TheJudger 2013-01-09 21:11

[QUOTE=lycorn;324124]I've been running 0.20 on 60-61M exponents and am quite pleased with it. Faster, and "CPU-free", which is definitely a plus.
Nevertheless, I was surprised when I tried to run it on small exponents (by small I mean 2.6M exponents), from 62 to 64 bits. The GHz-d/d went down from ~258 to 40-45, even using the "LessClasses" version. It is way slower than 0.19 for this type of work.
Is there any setting I should look into, or is it just the way it is?[/QUOTE]

Well, below 2[SUP]64[/SUP] mfaktc 0.20 should perform very similarly to 0.19. I didn't touch the (old) kernels which can handle those numbers.

[QUOTE=ET_;324139]I did it with mmff, and noticed that one instance of mmff nearly blocks every other program running on the GPU... I was wondering if the same behavior may be expected from the new mfaktc 0.20

Luigi[/QUOTE]

All current GPUs can run only one application at a time (timesharing). CC 2.0 or newer GPUs can handle multiple kernels started from exactly one application (process) at the same time; when they come from different applications they will be serialized. Only CC 3.5 can run kernels from different applications concurrently; Nvidia calls this "Hyper-Q". Currently only the GK110 chip is CC 3.5, and they sell it as the Tesla K20 at a high price.

If you want to mix mfaktc and cudalucas you can run half of your time cudalucas and the remaining time mfaktc.

Oliver

Uncwilly 2013-01-10 00:58

1 Attachment(s)
[QUOTE=LaurV;323933]Time to make Uncwilly happy...[/QUOTE]
I noticed some significant effort mystically showing up on some exponents in the last day or so.

kracker 2013-01-10 01:05

[QUOTE=Uncwilly;324215]I noticed some significant effort mystically showing up on some exponents in the last day or so.[/QUOTE]

:bump2:

lycorn 2013-01-10 11:10

[QUOTE=James Heinrich;324134]GPU sieving is not enabled below 2[sup]64[/sup]. Not only that, but it uses older, less-optimized kernels that are inherently slower. [/QUOTE]

OK, that explains part of the problem (most of it, actually). Also, I was running the 32-bit version, as I was expecting the sieve to be run on the GPU. The SievePrimesMin parameter was at the default 5000.
Running the 64-bit app and setting SievePrimes to 2000 provided the same throughput as 0.19, as expected. The GHz-d/d were roughly half of what is obtained when testing mainstream exponents.
That said, I don't think I'll be testing small exponents anymore (at least until some new version pops up).

ixfd64 2013-01-10 17:27

I have two more suggestions for the documentation:

1. For people who aren't familiar with console applications, it would be useful for them to know that pressing Ctrl-C terminates the program smoothly.
2. I'm surprised there is no mention of GPU to 72. :smile:

chalsall 2013-01-10 18:01

[QUOTE=ixfd64;324289]2. I'm surprised there is no mention of GPU to 72. :smile:[/QUOTE]

Oliver asked me to provide some language. Unfortunately something came up which took my mind off the deliverable in time. Next release.

swl551 2013-01-11 04:42

0.20 unstable at gpu/OC levels that were fine with 0.19
 
GTX 570 and 0.19: I could run 4 instances on one card clocked at 1000mV, 900MHz core. Average combined throughput was 480 GHz-days per day. Never crashed...

0.20 has forced a drop down to 988mV (default) and 845MHz core to stay reliable, reducing throughput to only 420 GHz-days per day. Confirmed on 3 different 570s in different PCs. The oddest thing is that after mfaktc crashes, the GPU core clock will NOT go over 405MHz regardless of what I do with Afterburner. I have to reboot to allow the card to return to factory clock speed. This is a condition I have never seen before (below-factory clocks).

I recognize all the benefits of 0.20, so this is not 0.20 vs 0.19. The question is specifically why 0.20 shows instability where 0.19 did not.

thanks

Scott

Dubslow 2013-01-11 04:48

[QUOTE=swl551;324366]The question is specifically why 0.20 shows instability where 0.19 did not.[/QUOTE]

Possibly simply because the GPU is under more stress now. Depending on the CPU behind those 4 instances, that might not have been enough to truly saturate the card, where now 0.20 can do that thanks to the GPU sieving. What's the Eq. GHz with 0.20 at factory clock, vs. the Eq. GHz with 0.19 at factory clock/4 instances?

LaurV 2013-01-11 05:32

Plus one for what Dubslow says. Same story as with P95 SSE versus P95 AVX: the older version could stand tremendous overclocks (over 4.5GHz for an i7-2600K), but for the latest AVX versions, which use the CPU better and squeeze it harder, producing a lot more heat, I had to reduce the clock to get stable results, even with my water-cooling racks.

ixfd64 2013-01-11 05:33

As I mentioned earlier, mfaktc 0.20 is about three times as fast as 0.19 on my GTX 555. Jobs that previously took 100 minutes to complete now finish in just a little over half an hour. But even more surprising is that the average rate skyrocketed from 100M/s to around 933M/s.

I know the number of candidates per second doesn't matter, but the figures I'm getting are quite... shocking. Is this normal?

LaurV 2013-01-11 05:38

[QUOTE=ixfd64;324370]Is this normal?[/QUOTE]
Definitely yes. But take the enthusiasm with a grain of salt; see my post #2057.

TheJudger 2013-01-11 10:29

Hi,

[QUOTE=ixfd64;324370]As I mentioned earlier, mfaktc 0.20 is about three times as fast as 0.19 on my GTX 555. Jobs that previously took 100 minutes to complete now finish in just a little over half an hour. But even more surprising is that the average rate skyrocketed from 100M/s to around 933M/s.

I know the number of candidates per second doesn't matter, but the figures I'm getting are quite... shocking. Is this normal?[/QUOTE]

So you managed to edit mfaktc.ini, but did you read it?
[CODE]# Keep in mind that "number of candidates (M/G)" and "rate (M/s)" are NOT
# compareable between CPU- and GPU-sieving. When sieving is done on GPU
# those number count all factor candidates prior to sieving while CPU
# sieving counts the numbers after the sieving process.
#
[/CODE]

swl551 2013-01-11 12:44

[QUOTE=Dubslow;324367]Possibly simply because the GPU is under more stress now. Depending on the CPU behind those 4 instances, that might not have been enough to truly saturate the card, where now 0.20 can do that thanks to the GPU sieving. What's the Eq. GHz with 0.20 at factory clock, vs. the Eq. GHz with 0.19 at factory clock/4 instances?[/QUOTE]

At factory GPU defaults, with my i7-3770K @ 4.2GHz (unchanged throughout all my work):
0.19 with 4 instances = 103 GHz-days/day each = 412 per day
0.20 with 1 instance = 368 GHz-days/day

Based on the GPU processor % utilization, it is not under more stress:

With 4 instances of 0.19 it remains at an immutable 99%.
With 0.20 it moves between 97% to 99%.

Remember one of the symptoms here is that after the crash the GPU cannot be returned to factory clock speeds until a reboot is performed. It always gets "stuck" at a dismal 405mhz.

Based on my 3 different 570 (all different manufacturers/models) across 3 PCs I cannot believe I am the only person seeing this. I'm confident I'm not the only person to use overclocking either.


Also of note: in 0.19, if I OC'd past acceptable limits, the instances would hang but the processes were still in memory and visible on the screen. With 0.20 they just disappear: the process is not in memory, and I don't get the Windows "the display driver has stopped responding" message.



Thanks
Scott

James Heinrich 2013-01-11 13:11

[QUOTE=swl551;324383]0.19 with 4 instances = 103 GHz-days/day each = 412 per day
0.20 with 1 instance = 368 GHz-days/day[/QUOTE]Don't forget that your i7 contributes about 10GHd/d per core to the v0.19 numbers, so with that factored in, throughput is close to the same.

I think in your case you've been providing ample CPU support to the GPU and so throughput won't increase all that much with v0.20. What was your SievePrimes value on v0.19? I'd suspect above 100000.

Are you running 0.20 32-bit or 64-bit? 32-bit has higher performance for GPU-sieving.

What assignment are you getting 368Ghd/d on? Your GPU clocks seem pretty high (I believe you said stock for the card is 845MHz; stock for the base GTX 570 is 732MHz), and yet at "only" 800MHz on my GTX 570 I easily get 420GHd/d.

swl551 2013-01-11 14:00

[QUOTE=James Heinrich;324389]Don't forget that your i7 contributes about 10GHd/d per core to the v0.19 numbers, so with that factored in throughput is close to the same.

I think in your case you've been providing ample CPU support to the GPU and so throughput won't increase all that much with v0.20. What was your SievePrimes value on v0.19? I'd suspect above 100000.

Are you running 0.20 32-bit or 64-bit? 32-bit has higher performance for GPU-sieving.

What assignment are you getting 368Ghd/d on? Your GPU clocks seem pretty high (I believe you said stock for the card is 845MHz; stock for the base GTX 570 is 732MHz), and yet at "only" 800MHz on my GTX 570 I easily get 420GHd/d.[/QUOTE]

Remember, I'm trying to get an evaluation of [U]why 0.20 crashes at the same clock speeds 0.19 worked fine at[/U]. I'm not trying to compare performance, but I'm documenting it here just in case it is useful.

Aramis Wyler 2013-01-11 14:12

Maybe sieving just generates more heat than factoring. Another possibility is that the sieving creates a fluctuation in usage that didn't exist when the GPU was being fed by the CPU, and that waveform is a little less stable than the consistent feed.

I suspect that unless your throughput on .20 plus the new throughput on the CPU is less than the throughput from .19, though, it's a fairly moot point.

TheJudger 2013-01-11 15:11

Hi Scott,

[QUOTE=swl551;324383]Remember one of the symptoms here is that after the crash the GPU cannot be returned to factory clock speeds until a reboot is performed. It always gets "stuck" at a dismal 405mhz.

Based on my 3 different 570 (all different manufacturers/models) across 3 PCs I cannot believe I am the only person seeing this. I'm confident I'm not the only person to use overclocking either.[/QUOTE]

When sieving is done on the [B]C[/B]PU, mfaktc mostly uses only the GPU-internal registers. When sieving is done on the [B]G[/B]PU, mfaktc puts some stress on the shared memory inside the GPU, too.
I'm not sure if this is related to the issues you're seeing.

Can you try CUDALucas and/or other applications at your desired OC speeds/voltages?

Oliver

Ralf Recker 2013-01-11 15:16

[QUOTE=swl551;324366]The oddest thing is that after mfaktc crashes, the GPU core clock will NOT go over 405MHz regardless of what I do with Afterburner. I have to reboot to allow the card to return to factory clock speed. This is a condition I have never seen before (below-factory clocks)
[/QUOTE]

This is the standard behaviour of recent NVIDIA drivers.

[QUOTE=swl551;324383]Remember one of the symptoms here is that after the crash the GPU cannot be returned to factory clock speeds until a reboot is performed. It always gets "stuck" at a dismal 405mhz.[/QUOTE]

Ditto.

IIRC this behaviour was introduced with the 260.xy or 270.xy driver versions.

apsen 2013-01-11 15:16

[QUOTE=swl551;324366]GTX 570 and 0.19: I could run 4 instances on one card clocked at 1000mV, 900MHz core. Average combined throughput was 480 GHz-days per day. Never crashed...

0.20 has forced a drop down to 988mV (default) and 845MHz core to stay reliable, reducing throughput to only 420 GHz-days per day. Confirmed on 3 different 570s in different PCs. The oddest thing is that after mfaktc crashes, the GPU core clock will NOT go over 405MHz regardless of what I do with Afterburner. I have to reboot to allow the card to return to factory clock speed. This is a condition I have never seen before (below-factory clocks).

I recognize all the benefits of 0.20, so this is not 0.20 vs 0.19. The question is specifically why 0.20 shows instability where 0.19 did not.

thanks

Scott[/QUOTE]

0.20 uses additional parts of the GPU. What I can tell is that with 3 instances of 0.19 saturating the GPU, the games I play were not affected at all. I could run TF and play games at the same time. But with 0.20 it is impossible - my FPS drops to something like 5.

Process Explorer shows you each GPU engine's load separately, and I could see that 0.19 was stressing one GPU engine while my game was stressing a different GPU engine. The new 0.20 stresses both of them. Unfortunately Process Explorer only shows the engines by their numbers, so I can't figure out what those engines are... But since 0.20 requires a minimum of CC 2.0, I guess it's some block that was added in that architecture.

Thanks,
Andriy

TheJudger 2013-01-11 15:59

[QUOTE=apsen;324404][...]
But 0.20 requiring minimum CC 2.0 I guess it's some block that was added in that architecture.[/QUOTE]

I could enable sieving on CC 1.x GPUs easily... but the performance is horrible.
What prevents GPU sieving from running is this code in src/mfaktc.c:
[CODE]  if((mystuff.compcapa_major == 1) && mystuff.gpu_sieving)
  {
    printf("Sorry, GPU sieving is not supported on devices with compute capability 1.x!\n");
    printf("disable GPU sieving in mfaktc.ini (set SieveOnGPU to 0).\n");
    return 1;
  }
[/CODE]

The code is nearly identical for CC 1.x and CC 2.0; the differences are in the functions ___clz() and ___popcnt() in src/tf_barrett96_gs.cu. For CC 2.0 there is a simple PTX instruction for each of clz and popcnt, but for CC 1.x they have to be emulated.

Oliver

James Heinrich 2013-01-11 16:03

[QUOTE=TheJudger;324407]I can enable sieving on CC 1.x GPU easily... but the performance is horrible.[/QUOTE]I'm just curious if you can quantify "horrible"? Presumably it works, but is slower than CPU-sieving even on a slow CPU? By how much?

swl551 2013-01-11 16:14

[QUOTE=TheJudger;324402]Hi Scott,



when sieving is done on the [B]C[/B]PU, mfaktc mostly uses only the GPU-internal registers. When sieving is done on the [B]G[/B]PU, mfaktc puts some stress on the shared memory inside the GPU, too.
I'm not sure if this is related to the issues you're seeing.

Can you try CUDALucas and/or other applications at your desired OC speeds/voltages?

Oliver[/QUOTE]
CUDALucas will NOT run at the high OC rates I ran 0.19 at. I learned that instantly with CuLu. The answers regarding CUDALucas's sensitivity to GPU execution errors made sense. No one had stated that 0.20 has similar constraints. Maybe that is what we are uncovering here.

kladner 2013-01-11 16:31

I can't address the current situation directly, except to say that I throttled back a bit on both the 570 and the 460 with 0.20. I had a few signs of instability, but some of that may have been related to the CPU now running 6x P-1 workers. Since I have made multiple adjustments without fully evaluating each one, I can't say for sure. (I can't be sure things are fully stable now, but I did not wake up to a BSOD this morning as I did yesterday.)

As to the nVidia driver getting stuck at 405 MHz, I saw that when I first started running mfaktc on the 460. I think that was with V 0.17. While experimenting with batch files to get things going, I stopped and started mfaktc repeatedly. After 2-3 restarts the GPU clock would hang at 405 MHz, and I would have to reboot to clear it.

ixfd64 2013-01-11 17:00

[QUOTE=TheJudger;324380]Hi,



so you managed to edit the mfaktc.ini, but did you read it?
[CODE]# Keep in mind that "number of candidates (M/G)" and "rate (M/s)" are NOT
# compareable between CPU- and GPU-sieving. When sieving is done on GPU
# those number count all factor candidates prior to sieving while CPU
# sieving counts the numbers after the sieving process.
#
[/CODE][/QUOTE]

I'm aware of that; I just didn't think it would make such a big difference. :o

Dubslow 2013-01-11 17:20

[QUOTE=swl551;324410]CudaLucas will NOT run at the high OC rates I ran 0.19 on. I learned that instantly with CuLu. The answers regarding CulLus sensitivity to GPU excution errors made sense. No one has stated 0.20 has the similar constraints. Maybe that is what we are uncovering here.[/QUOTE]

Well, that's what he said: 0.20 stresses memory where 0.19 does not. CUDALucas' sensitivity is in the memory, and that's what's different between the two versions, so if CUDALucas fails at those higher clocks, then there's the issue.

Thanks for that information TheJudger -- very good to know.

swl551 2013-01-11 17:38

[QUOTE=Dubslow;324419]Well that's what he said, is that 0.20 stresses memory where 0.19 does not. CUDALucas' sensitivity is in the memory, and that's what's different between the two versions, so if CUDALucas fails at those higher clocks, then there's the issue.

Thanks for that information TheJudger -- very good to know.[/QUOTE]

Yes, I am agreeing with the scenario. :surrender

TheJudger 2013-01-11 21:09

Scott,

I'm pretty sure that you had problems with mfaktc 0.19 at your OC clock/voltage, too, but I *guess* that with mfaktc errors have a very high chance of staying silent:[LIST][*]once started there is no memory allocation; everything is static after startup[*]it reads 12 bytes per factor candidate from memory and then runs in registers[*]in very, very, very rare cases some data is written to memory (only when a factor was found); this can mean billions of FCs with no data written to memory[/LIST]
The selftest usually won't catch this either: each test case typically tests 10-15M FCs, but at the end it is only checked whether the known factor was found or not; the other millions of results are [B]not[/B] verified. The selftest usually doesn't stress the GPU very hard.

Oliver

swl551 2013-01-11 21:20

[QUOTE=TheJudger;324438]Scott,


I'm pretty sure that you had problems with mfaktc 0.19 at your OC clock/voltage, too, but I *guess* that with mfaktc errors have a very high chance of staying silent:[LIST][*]once started there is no memory allocation; everything is static after startup[*]it reads 12 bytes per factor candidate from memory and then runs in registers[*]in very, very, very rare cases some data is written to memory (only when a factor was found); this can mean billions of FCs with no data written to memory[/LIST]The selftest usually won't catch this either: each test case typically tests 10-15M FCs, but at the end it is only checked whether the known factor was found or not; the other millions of results are [B]not[/B] verified. The selftest usually doesn't stress the GPU very hard.

Oliver[/QUOTE]
Again, thank you. I no longer feel there is a code defect causing the problem.

TheJudger 2013-01-12 12:09

Hi James,

[QUOTE=James Heinrich;324408]I'm just curious if you can quantify "horrible"? Presumably it works, but is slower than CPU-sieving even on a slow CPU? By how much?[/QUOTE]

Currently I can't give exact numbers because my GTX 275 has retired (a new GTX 680 for my main rig; the GTX 470 moved to the secondary rig, replacing the GTX 275). For mfaktc the GTX 680 is not a very smart decision[SUP]*1[/SUP]: same speed as the GTX 470 (but less electrical energy and noise). But its main purpose is gaming, and in that case the 680 is not the worst decision. :smile:
From memory, GPU sieving on the GTX 275 was half the speed of (GTX 275 + one i7 (Nehalem series) core @ 3.5GHz); the CPU kept the GPU busy easily.
I want to set up a test rig with the 275, so I can provide exact numbers someday.

[SUP]*1[/SUP]I now have permanent access to a Kepler-based GPU; perhaps I can tweak the code a little, but this is no promise.

Oliver

James Heinrich 2013-01-12 12:31

[QUOTE=TheJudger;324483]For mfaktc the GTX 680 is not a very smart decission[SUP]*1[/SUP][/QUOTE]I thought I remembered you having a GTX 680... can you run a benchmark and submit it on [url=http://mersenne.ca/mfaktc.php#benchmark]my site[/url], please?
I currently don't have any data on how CC 3.0 performs on v0.20-32 with GPU sieving. I've updated the chart to reflect the new performance level of CC 2.0 and 2.1 (thanks to everyone who submitted benchmarks!), and performance is much more consistent than it was with CPU-sieving before. But nobody with a CC 3.0 card has submitted a benchmark yet. :sad:
So until someone does (preferably several someones), the relative performance of all CC 3.0 GPUs (e.g. GTX 6xx) is probably inaccurate.

TheJudger 2013-01-12 14:47

Hi James,
[LIST][*]stock GTX 680 (1008MHz, avg. Boost 1058MHz, actual clock around 1080MHz (this is no OC!))[*]M66362159 from 2[SUP]70[/SUP] to 2[SUP]71[/SUP][*]mfaktc 0.20, Win32, default settings[/LIST]
[CODE]Jan 12 15:21 | 4617 100.0% | 1.085 0m00s | 298.90 82485 n.a.%
no factor for M66362159 from 2^70 to 2^71 [mfaktc 0.20 barrett76_mul32_gs]
tf(): total time spent: 17m 21.457s[/CODE]

OK?


Oliver

James Heinrich 2013-01-12 15:05

Thanks. I've revised my GFLOPS-to-GHzd/d ratio from 15.0 in v0.19 to 12.0 as a conservative estimate last week, and now, with your benchmark, to 11.0 (for comparison: CC 2.0 = 3.6, CC 2.1 = 5.3, CC 1.x = 14.0).
I'll refine this as more users (hopefully) submit benchmarks on my site.
According to these numbers, a GTX 470 is actually still about 7% [i]faster[/i] than a GTX 680. But 9% lower power consumption, plus higher gaming performance still counts for something. :smile:

TheJudger 2013-01-12 15:53

[QUOTE=James Heinrich;324492]
[...] a GTX 470 is actually still about 7% [i]faster[/i] than a GTX 680. But 9% lower power consumption, plus higher gaming performance still counts for something. :smile:[/QUOTE]

Well, 9% lower TDP, but the real power consumption is [B]much[/B] lower.
Those Teslas can measure power consumption directly. Comparing a Tesla M2075 (GF110, 448 cores @ 1150MHz) with a Tesla K10 (2x GK104, 1536 cores @ 745MHz): the K10 has ~70% of the throughput per GPU compared to the M2075, but its power consumption is less than half (~70W per GPU vs. 170W).

Oliver

bcp19 2013-01-12 16:16

Had an interesting situation with the new .20. I have a GTX 560 in a Core2Quad Q6600. Running the 32-bit program by itself on core 4, it holds a steady 210-220 GHzd/d, but if I start P95 and run DC on cores 1 and 3, mfaktc starts varying wildly between 140 and 200 GHzd/d. I switched to the 64-bit program to see what would happen, and even with P95 running it stays fairly steady at 210-220 GHzd/d.

James Heinrich 2013-01-12 17:27

[QUOTE=bcp19;324497]Running the 32 bit program by itself on core 4[/QUOTE]Is there any reason to lock v0.20 to any particular core? What if you let it run on whichever core it chooses, do you still get the wild variance?

bcp19 2013-01-15 04:55

[QUOTE=James Heinrich;324501]Is there any reason to lock v0.20 to any particular core? What if you let it run on whichever core it chooses, do you still get the wild variance?[/QUOTE]
Holdover from the old days of .19 and below. I know the Quad is very finicky when pushed too far with P95... with cores 1/3 running DC/LL I can do x DC/LL every y days, but if I saturate the computer with 4 DC/LL, I only get 1.5x in the same y days. I will try to remember to check your suggestion when I am at that system tomorrow.

Xyzzy 2013-01-16 22:11

Do all video cards, when used for the primary/secondary display, lag severely when gpu sieving is enabled?

Dubslow 2013-01-16 22:15

[QUOTE=Xyzzy;324966]Do all video cards, when used for the primary/secondary display, lag severely when gpu sieving is enabled?[/QUOTE]

Mine does, that's for sure. I only run mfaktc (or Msieve) when I'm not using it.

ixfd64 2013-01-16 22:27

Mine is a GTX 555, and it lags as well.

Xyzzy 2013-01-16 22:32

It would be convenient to be able to click a button and pause activity, and maybe have the activity resume automatically in an hour. Or maybe have mfaktc run at a lower priority or do a "PauseWhileRunning" deal.

Is this something that could be easily coded?

We have no problem stopping the program but we frequently forget to restart it.

We are currently running on a GT 430 but we have a GTX 660Ti to install tomorrow.

James Heinrich 2013-01-16 22:50

[QUOTE=Xyzzy;324971]we have a GTX 660Ti to install tomorrow.[/QUOTE]A [url=http://www.mersenne.ca/mfaktc.php#benchmark]benchmark[/url] would be appreciated. :smile:

Prime95 2013-01-16 23:32

[QUOTE=Xyzzy;324966]Do all video cards, when used for the primary/secondary display, lag severely when gpu sieving is enabled?[/QUOTE]

Knock down the GPUSieveSize parameter until you get acceptable video response time.

Xyzzy 2013-01-17 00:01

[QUOTE]Knock down the GPUSieveSize parameter until you get acceptable video response time.[/QUOTE]We changed "GPUSieveSize" from the default 32 to 4 (!) and the system is much more responsive now. On the GT 430 the estimated GHz-d/day dropped from ~50 to ~45, but we figure that letting it run most of the time, rather than on and off, might give better overall throughput. (The GHz-d/day is also much more variable per output line, possibly because the GPU is being allowed to do other work?)

By setting "GPUSieveSize" to the lowest value are we messing anything up, or do we need to balance any other settings?

:mike:

kracker 2013-01-17 00:57

I would be curious as to how much a 660 Ti spits out too :smile:

Dubslow 2013-01-17 01:08

[QUOTE=Prime95;324976]Knock down the GPUSieveSize parameter until you get acceptable video response time.[/QUOTE]

(560) I had reduced mine from 64 to 16, but that wasn't really good enough; throughput went from ~215 to ~205. I'm running it with 4 now and the lag is acceptable, though throughput has dropped to ~185. But I suppose Xyzzy's logic still applies.

Batalov 2013-01-17 01:36

[QUOTE=Xyzzy;324971]It would be convenient to be able to click a button and pause activity, and maybe have the activity resume automatically in an hour. Or maybe have mfaktc run at a lower priority or do a "PauseWhileRunning" deal.

Is this something that could be easily coded?
[/QUOTE]
If you open the DOS window where a program is running and [COLOR=lemonchiffon][SPOILER][COLOR=lemonchiffon]select[/COLOR][/SPOILER][/COLOR] part of the text, the program will suspend (not immediately, but after the current class is done); it is a side-effect of DOS, I guess. I instructed my son to do that before he plays his games. When he is done, he opens the window again and right-clicks in it once, which releases the selection.

Another way to do it is with [URL="http://en.wikipedia.org/wiki/Process_Explorer"]Process Explorer[/URL].

Dubslow 2013-01-17 01:37

[QUOTE=Batalov;324993]If you open the DOS window where a program is running and [COLOR=lemonchiffon][SPOILER][COLOR=lemonchiffon]select[/COLOR][/SPOILER][/COLOR] part of the text, the program will suspend (not immediately, but after the current class will be done); it is a side-effect of DOS, I guess. I instructed my son to do that before he plays his games. When he is done - open the window again and right-click in it once, which will release the selection.

Another way to do is with [URL="http://en.wikipedia.org/wiki/Process_Explorer"]Process Explorer[/URL].[/QUOTE]

What about those of us who use various forms of GNU/Linux (or BSD)? :judge:


[COLOR=green]SB: That's too-oo easy! [B]killall -STOP mfaktc[/B] and later [B]killall -CONT mfaktc [/B][/COLOR]
[COLOR=green]Linux presents no problem! It is Windows that is a problem![/COLOR]

James Heinrich 2013-01-17 02:04

[QUOTE=kracker;324986]I would be curious as to how much a 660 Ti spits out too :smile:[/QUOTE][url=http://mersenne.ca/mfaktc.php]223.6 GHz-days/day[/url] (based on estimates from a single CC 3.0 benchmark).

James Heinrich 2013-01-17 02:08

[QUOTE=Batalov;324993]If you open the DOS window where a program is running and [COLOR=lemonchiffon][SPOILER][COLOR=lemonchiffon]select[/COLOR][/SPOILER][/COLOR] part of the text, the program will suspend ... open the window again and right-click in it once, which will release the selection.[/QUOTE]Much easier: use the handy "Pause" key on your keyboard. It pauses immediately and resumes on almost any key press (other than Pause and modifier keys like Shift/Alt/Ctrl). Not sure if the same applies to Linux, but it does work outside DOS and Windows (such as during POST).

Batalov 2013-01-17 02:29

Well, it doesn't pause [I]immediately[/I] -- only as promptly as selection does. The signal is not transferred to the GPU, which will always finish the current class before snoozing (unless you kill it).

Interestingly, putty.exe sends a "Pause" keypress to the ssh'd host as Ctrl-Z (though of course it won't unpause later on any key). That's convenient if you use PuTTY... even though Ctrl-Z always works anyway.

Plain Linux surely disregards this key (and its evil brother "Ctrl-Pause"), but like any key or key combination it is probably configurable.

LaurV 2013-01-17 04:33

[QUOTE=Xyzzy;324971]It would be convenient to be able to click a button and pause activity, and maybe have the activity resume automatically in an hour. Or maybe have mfaktc run at a lower priority or do a "PauseWhileRunning" deal.

Is this something that could be easily coded?

We have no problem stopping the program but we frequently forget to restart it.

We are currently running on a GT 430 but we have a GTX 660Ti to install tomorrow.[/QUOTE]
Ahhhh... I waited so long for this, but I expected it to come from Dubslow, so I could tell him back: "Use Windows!" :yucky:
There is a "Pause/Break" key which (temporarily) stops mfaktc until you press the space key.

[edit: grrr did not see there is a new page]

kjaget 2013-01-17 14:25

[QUOTE=Dubslow;324994]What about those of us who use various forms of GNU/Linux (or BSD)? :judge:


[COLOR=green]SB: That's too-oo easy! [B]killall -STOP mfaktc[/B] and later [B]killall -CONT mfaktc [/B][/COLOR]
[COLOR=green]Linux presents no problem! It is Windows that is a problem![/COLOR][/QUOTE]

Control-S in the mfaktc window works under Windows, so you'd think it would work under Linux too. Control-Q (or any other key in Win) to continue.

firejuggler 2013-01-17 14:37

I always assume
control= DanceMyPuppet

apsen 2013-01-17 15:12

My display generally does not lag on the GTX 560 Ti, except in the one game I play. So I start the game with a batch file:

[CODE]
pssuspend mfaktc-win-32
pssuspend -r mfaktc-win-64
xvm-stat.exe
pssuspend -r mfaktc-win-32
pssuspend mfaktc-win-64
[/CODE]

So while I play the game, the CPU-sieving copy runs; otherwise the GPU-sieving copy does.

Dubslow 2013-01-17 16:29

[QUOTE=Dubslow;324994]
[COLOR=green]SB: That's too-oo easy! [B]killall -STOP mfaktc[/B] and later [B]killall -CONT mfaktc [/B][/COLOR]
[COLOR=green]Linux presents no problem! It is Windows that is a problem![/COLOR][/QUOTE]

Well duh. :smile: The problem, of course, is remembering to type `fg` every time I leave the computer, though as above, I just leave it running now with a much-crippled sieve size.

For a single process I prefer ^Z/fg, though for pausing a bunch of lasieve processes I have two aliases, one each for -STOP and -CONT, inside a "for p in `pidof lasieve`; do kill <mode> $p; done" loop. Quite handy things, aliases. :wink:

Xyzzy 2013-01-17 21:05

2 Attachment(s)
[QUOTE][url=http://mersenne.ca/mfaktc.php]223.6 GHz-days/day[/url] (based on estimates from a single CC 3.0 benchmark).[/QUOTE]We just finished installing the GTX 660Ti so we ran a benchmark. (This is with the 32-bit executable on 64-bit Windows 7, with all of the parameters set to default.)

[CODE]no factor for M61160719 from 2^70 to 2^71 [mfaktc 0.20 barrett76_mul32_gs]
tf(): total time spent: 22m 15.685s[/CODE]
The lag with the default settings on the GT 430 was 3-4 seconds. Now the lag is maybe half a second at worst. We will still probably dial back the "GPUSieveSize" setting a bit.

Our next goal is to get the GT 430 installed alongside the GTX 660Ti and use the GT 430 as a dedicated GPU factoring card.

TheJudger 2013-01-17 22:31

Hi Mike,

[QUOTE=Xyzzy;324979]We changed "GPUSieveSize" from the default 32 to 4 (!) and the system is much more responsive now. On the GT 430 the estimated GHz-d/day dropped from ~50 to ~45 but we figure letting it run most of the time, rather than on and off and on and off, it might have better throughput. (The GHz-d/day is also much more variable per output line, possibly because the GPU is being allowed to do other work?)

By setting "GPUSieveSize" to the lowest value are we messing anything up, or do we need to balance any other settings?

:mike:[/QUOTE]

The settings in mfaktc.ini *should* not be able to screw anything up (except for performance). And this is [B]not[/B] a request to try to screw it up...
With the GHz-d/day measurement it is easy to compare throughput.

Oliver

garo 2013-01-17 22:59

If you reduce your GPUSieveSize, try reducing the GPUSieveProcessSize to 8 and optionally the GPUSievePrimes a bit. I got better throughput by reducing the GPUSieveProcessSize.

lycorn 2013-01-19 18:32

[QUOTE=garo;325072] I got better throughput by reducing the GPUSieveProcessSize.[/QUOTE]

Me too (GTX 560 Ti). But I raised the GPUSieveSize from the default 64 to 128.
GPUSieveProcessSize is currently at 8, down from the default 16. That's the setting that appears to work best on my system, at least for the mainstream exponents (GPUto72 tasks).

ixfd64 2013-01-19 20:01

I reduced GPUSieveProcessSize from 16 to 8, and the time per iteration (factoring from 71 to 73 bits in the 60M range) dropped by about 0.15 seconds. It's not a huge difference, but I guess it all adds up. For the record, this was on my GTX 555.

Xyzzy 2013-01-19 22:02

1 Attachment(s)
We are using our GTX 660Ti as our primary display card. We stop factoring for games but we have found that by sacrificing a (small?) portion of the card's throughput we are able to use the computer for any other task, including 1080p videos, without any lag whatsoever.

[CODE]# Minimum: GPUSieveSize=4
# Maximum: GPUSieveSize=128
# Default: GPUSieveSize=64

GPUSieveSize=4

# Minimum: GPUSieveProcessSize=8
# Maximum: GPUSieveProcessSize=32
# Default: GPUSieveProcessSize=16

GPUSieveProcessSize=8[/CODE]
We have the GT 430 running as well, with the default settings, but it is not hooked up to anything. The GT 430 is pretty slow but it only takes up one slot, only uses around 30 watts and it does not require special power connections.

We are now running the 64-bit binaries. The performance hit for doing so does not seem to be very much.

We purposely purchased the [URL="http://usa.asus.com/Graphics_Cards/NVIDIA_Series/GTX660_TIDC22GD5/"]slowest[/URL] 660Ti card that Asus makes. We have read that in some cases the more highly (factory-)overclocked cards are more likely to produce faulty calculations. By running our card at a lower load and temperature, it will possibly be more reliable. Certainly, the fact that it in no way affects our desktop experience means we are willing to let it run continuously, which in the long term might result in greater overall throughput than if we had to pause it here and there.

FWIW, our system, factoring on both video cards and running 4 instances of P-1 factoring on an i7 3770 CPU, draws 258 watts. (This does not count our display or speakers.)

Disclaimer: We are [URL="http://www.mersenneforum.org/showthread.php?t=16871"]very sensitive to lag[/URL] and it irks us greatly. (We are also severely impaired by flickering lights, like fluorescent lights.)

swl551 2013-01-19 22:18

Setting GPUSieveSize=128 on my GTX 570

Observing this conversation, I played around with GPUSieveSize and GPUSieveProcessSize (for the latter, everything other than the default decreased throughput).

Setting [B]GPUSieveSize=128[/B] increased my GHz-days from 412 to 422.

I confirmed the increased throughput on both of my GTX 570s running 0.20 (Win 7 64-bit and Win 7 32-bit PCs).

Batalov 2013-01-19 22:24

[CODE]GPUSieveSize=4
...
GPUSieveProcessSize=8[/CODE]
It appears that GPUSieveProcessSize was meant to be a fraction of GPUSieveSize (an integer fraction, according to the source, or else the value is rejected). If it is larger, then its size probably doesn't matter.

Aillas 2013-01-20 14:31

Hi,

could someone please make a build of mfaktc 0.20 for CUDA 4.0?

Thanks a lot.

Xyzzy 2013-01-20 18:19

[QUOTE]Observing this conversation I played around with

GPUSieveSize
and
GPUSieveProcessSize[/QUOTE]FWIW, we played around with the values to find the most productive combo, and for both of our cards that combo was "GPUSieveSize=128" and "GPUSieveProcessSize=8".

YMMV

Chuck 2013-01-20 19:20

[QUOTE=Xyzzy;325305]FWIW, we played around with the values to find the most productive combo, and for both of our cards that combo was "GPUSieveSize=128" and "GPUSieveProcessSize=8".

YMMV[/QUOTE]

Same here with GTX 580. Those two changes increased my GPU utilization from 98% to 99% and raised the GHz-d/day from 431 to 435. Extremely minor video lag which doesn't bother me.

Aramis Wyler 2013-01-20 20:21

Same here as well on a GTX 480. Increasing the GPUSieveSize from 64 to 128 increased GHz-days from ~388.4 (wobbly) to a locked-on 395.00. Changing the GPUSieveProcessSize from 16 to 32 dropped GHz-days to 295, and changing it to 8 put us back to ~395, but it was wobbly. I set it back to the default 16.

I am tempted to muck with the GPUSievePrimes number again with these new settings. I had gained about 3 GHz-days by dropping it from 82486 to 70000 (the GPU uses 69941 at that setting).

These numbers are for doing TF in the 61M range.



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.