mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   GPU Noob-Experiences/Questions (https://www.mersenneforum.org/showthread.php?t=16136)

kladner 2011-10-14 16:51

GPU Noob-Experiences/Questions
 
Hi,
I'm picking up from
[URL="http://www.mersenneforum.org/showthread.php?t=15545"]Best 4XX series GPU[/URL]
where I've taken the thread far afield from the original topic. All of this is running on a GTX460 and a 1090T. At this point I am running mfaktc under 32 bit XP, and cudalucas in Win7 64 bit.

1) Huzzah! An mfaktc run found my first factor ever last night.:fusion:

2) I did a trial of running a second instance of mfaktc. It was a short run: 2^68 to 2^69. For the duration, I shut down P95 and gave each instance of mfaktc its own core affinity. This did get GPU usage maxed out, but at the cost of considerable display lag....no surprise there. From the README I see that reducing NumStreams from 3 to 2 could improve this.

The question is, which is a more effective use of overall processing capacity?

Currently, I have P95 set to run on 5 cores with the sixth given over to feeding mfaktc. Until recently I had all of the P95 workers set to "Whatever makes the most sense". After reading of a need for more P-1 factoring I set two workers for that. I do note that two workers just started new assignments and are running, or just ran, P-1 factoring with the "just ran" worker going into Primality Testing. The workers I set for P-1 are still running LL assignments from the previous settings.

I let P95/64 have lots of memory (1370MB) with no day/night difference. As I understand things, this is good for P-1, Stage 2.

All that said, would PrimeNet benefit more from 2 instances of mfaktc and 4 P95/64 workers? (Assuming that I could reduce display lag to a comfortable level.) Or is 1-mfaktc, 5-P95 a better balance? I have to say that I'm more inclined to the 1 and 5 scenario because I run cudalucas in Win7. While I know that cudalucas does not really need the 6th CPU core, keeping P64 limited to 5 cores avoids the conflict with Photoshop. It is also easier for me to keep the same number of workers in P95 and P64, since they both work on the same assignments, depending on which OS I boot.

Side note to the above. With the current settings, on a 2^70-71 run, mfaktc reduces SievePrimes to 5000 (This is always the case). Avg.Rate runs in the upper 70's to mid 80's. Avg.Wait runs in the 60's with occasional higher or lower numbers.

I have gathered that the SievePrimes and Avg.Wait numbers are not optimal, but I'm not sure what to do about it. If it would be helpful I will post my mfaktc.ini and any other system details requested.

I've probably bored everyone enough for one post.:sirrobin: I would greatly appreciate any comments or suggestions on any of the elements of all this rambling.

Dubslow 2011-10-14 21:08

What about avg. rate and sieve primes with two instances going? If you can't get rid of the display lag (or don't like 4 cores P95) then I'm out of ideas for getting those numbers up with one instance.

kladner 2011-10-14 21:53

[QUOTE=Dubslow;274529]What about avg. rate and sieve primes with two instances going? [/QUOTE]

I'll have another go at it and get back to you. My rough impression is that there wasn't that much difference from a single instance.

Regarding finding a first factor, I actually discovered that this box had found one previously doing P-1. Still, getting that from mfaktc gave me a sense of accomplishment.

kladner 2011-10-14 22:08

1 Attachment(s)
Here is a quick and dirty side by side. I haven't tried tweaking to reduce lag, yet. This was just a couple of minutes after Sieve Primes settled to 5000. Avg.Rate did go up noticeably. Avg.Wait bounces around.

EDIT: This was with NumStreams=10. I had gone there because README says Windows systems need more than Linux. The following post will show results for NumStreams=2.

kladner 2011-10-14 22:41

1 Attachment(s)
These are run with NumStreams=2. Lag is still noticeable, but somewhat better. NumStreams=3 gave similar results, but the lag was still pretty bad.

I don't find the lag surprising, given a GPU usage between 97-99%. I guess I'll have to decide if it is too intrusive.

The other issue is still whether cutting P95 to 4 cores to run mfaktc x2 is the most productive for GIMPS.

Dubslow 2011-10-14 23:45

[QUOTE=kladner;274539]The other issue is still whether cutting P95 to 4 cores to run mfaktc x2 is the most productive for GIMPS.[/QUOTE]

I can't really answer that; as far as GHz-days, the answer is obvious, but you weren't asking that.

It does seem that two cores do saturate your GPU, though with the obvious lag problem. Does each instance go as fast or almost as fast as if you just had one? I tried looking through the old thread, but avg. rate isn't the best measure (time per class is) and the older assignments were different bit lengths.

kladner 2011-10-15 02:35

1 Attachment(s)
I've been experimenting with the NumStreams, and the effect on mfaktc of running P95. The difference between NumStreams 2 & 3 is pretty marked once Sieve Primes drops to 5000. I didn't capture that in the NS3 set, but you can see it with NS2.

Next, I'll show what starting P95 workers doing primality testing did to things.

kladner 2011-10-15 02:55

Adding P95 to mfaktc, 2 instances
 
1 Attachment(s)
The red lines are approximate for a worker being started. All of this has been with NumStreams=2. That seems to be the most effective setting, at least with this rig.

Granted, these are not totally pristine tests. There are 3 Firefox windows, with 30 tabs between them, Thunderbird email and Photoshop 6 running. They (and Norton Inet Security) might be consuming 8-10% CPU time.

But it's kind of dramatic what happens next, when P95 with 4 workers running is stopped.

kladner 2011-10-15 03:22

1 Attachment(s)
Here, P95 is stopped from 4 workers, then a minute or so later the 2 mfaktc's are shut down. It seems that 2 mfaktc's actually want more than two CPU cores. There's activity still on the first four cores when P95 stops. That quiets down considerably when mfaktc is stopped. Interestingly, though, when mfaktc x2 is running, each instance is always showing 16-17% CPU. But the overall CPU usage drops a lot more than 32-34% when they stop. I'm guessing they may be stimulating some System activity that doesn't show up on their usage tabs.
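The 16-17% per instance is roughly what one fully busy core looks like on a six-core CPU, assuming the per-process figure is reported as a share of all cores combined. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def one_core_share(num_cores: int) -> float:
    """Percent of total CPU that a single saturated core represents.

    Assumption: per-process usage is shown relative to all cores combined.
    """
    return 100.0 / num_cores

# Hex-core 1090T: one mfaktc instance pinned to one core
print(f"one instance:  {one_core_share(6):.1f}% of total CPU")
# Two instances, each saturating its own core
print(f"two instances: {2 * one_core_share(6):.1f}% of total CPU")
```

Anything beyond that combined share that disappears when mfaktc stops would have to come from other processes or system activity.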

Dubslow - I do understand about total processed output--the GHz-days. I guess I'm just looking to pick up slack if one part or another of the whole GIMPS process is getting less attention. And the display does get pretty herky-jerky with mfaktc hitting 99% on the GPU.

I haven't gotten back to try 1 instance vs 2. I guess to be really scientific, both should be running the same exponent. But
59288869,70,71 and
59288753,70,71 seem pretty close together.

I'll try that while I've still got some run-time on those two.

kladner 2011-10-15 03:42

1 Attachment(s)
Crap. 59288869,70,71 finished too soon after its Sieve Primes bottomed out. There wasn't time for 59288753,70,71 to get to the same stage. However, #1's Avg.Wait jumped into the 100-200 range when Instance 2 came online. The screen shot is of number 2, showing the same kind of Avg.Wait as number 1 had, running by itself.

mfaktc is happier with company?:beer:

Dubslow 2011-10-15 06:33

I'd say you can write off the extra CPU usage without P95 running as just system stuff, not necessarily related to mfaktc. It does seem happier with two, but we already knew that it was a CPU-limited process. As for what you should work on, I'd say the optimal solution is one mfaktc instance, 3 P95 workers, and CUDALucas simultaneously to use the rest of the GPU, though from what I understand you haven't gotten CL and mf- working in the same OS yet. In my experience, with either CUDALucas or mfaktc keeping my GPU (also a 460) at ~90%, I am still able to play TF2 without any lagginess, so hopefully, if you get them working at the same time, your screen won't be crapped out.

(Also from the one post I would have said NS3. The time per class is slightly lower.)

kladner 2011-10-15 06:51

I'll work on getting cudalucas going in XP, or mfaktc in Win7. The combination you suggest might be just the ticket. P95/64 might still be able to run 5 workers, since CL doesn't need much CPU.

Meanwhile, I just lined up mfaktc with plenty to work on while I'm sleeping.

Dubslow 2011-10-15 06:54

[QUOTE=kladner;274574]I'll work on getting cudalucas going in XP, or mfaktc in Win7. The combination you suggest might be just the ticket. P95/64 might still be able to run 5 workers, since CL doesn't need much CPU.

Meanwhile, I just lined up mfaktc with plenty to work on while I'm sleeping.[/QUOTE]

Sleep? What's that?

(And I meant 5, but didn't remember you have a hex core :P)

kladner 2011-10-15 07:01

[QUOTE=Dubslow;274575]Sleep? What's that?[/QUOTE]

Something I yearn for more than I get.

kjaget 2011-10-15 13:16

I'm still barely awake so this might not make sense, but have you tried 3 instances of mfaktc yet? To me, if sieve primes is going to 5000 it means the GPU is outrunning the CPU. If that happens in two instances you're maxing out 2 CPUs and the GPU is still waiting for them (this is also hinted at by the fact that you can run two instances at nearly the same avg rate as one, so by adding a second CPU you're doing twice as much work). By adding a 3rd CPU running, you should see the GPU finally saturate and the average rate of each worker go down - but you'd still be doing more work overall.

At this point, you can't use average rate anymore since by sieving you'll remove some candidates and do less work per class. Instead you have to look at the time per class. Run the same or similar exponents on all instances. Let them run for a bit and settle down, and then look at the run time per class.

To get a rough idea of the work you're doing in each case, sum up the inverse of the run times. For example, running 1 instance, each class takes 5.5 seconds, or .182 classes/sec. Running two instances, the times are 5.6 and 5.8 seconds - add up the inverse of these and you complete .351 classes a second, almost a 2x speedup. Repeat for 3 or even 4 instances and continue the process.

Eventually you'll max out the GPU, and individual instances will take longer to run but the overall throughput will be the same (minus some overhead). This will give you a real idea of how many CPUs are required to max the GPU, because right now I still don't think you're there.
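The inverse-sum estimate described above can be sketched in a few lines of Python. The function name is made up for illustration; the per-class times are the example figures from this post, not measurements.

```python
def classes_per_second(times_per_class):
    """Combined throughput: sum of 1/time over all running mfaktc instances.

    times_per_class holds one per-class run time (in seconds) per instance.
    """
    return sum(1.0 / t for t in times_per_class)

one = classes_per_second([5.5])        # single instance
two = classes_per_second([5.6, 5.8])   # two instances side by side

print(f"1 instance:  {one:.3f} classes/sec")
print(f"2 instances: {two:.3f} classes/sec")
print(f"speedup:     {two / one:.2f}x")
```

To extend the comparison, pass three or four per-class times taken once SievePrimes has settled, and stop adding instances when the total no longer rises.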

Note there will be some non-linear scaling. There's some overhead associated with running mfaktc, so you'll never get exactly 2x speedup from a second core. On the other hand, if your CPU(s) are faster than the GPU, the CPUs can do some extra sieving to remove candidates before the GPU even has to test them, which means you get a bit of super-linear speedup.

You can also download GPU-Z. This has a GPU load monitor. For my setup, running 1 CPU loads the GPU 91-92%, so adding a second CPU in my case doesn't make sense - I'd only get another 5% or so out of my GPU. My guess is that you're at something like 40-45% with one CPU core, so doubling or even tripling up may make sense.

As for what helps the project the most, a rough estimate is that you'll find a factor 1% of the bit levels you test (i.e. testing an exponent from 70 - 72 has a ~2% chance of finding a factor). This accounts for how rare factors are plus the fact that P-1 factoring has eliminated some of the ones you would have found in the ranges that matter. Each factor found eliminates the need for 2 full LL tests. So if your adding a CPU lets your GPU factor 50 numbers in the time it takes that CPU to run 1 LL test (100 factors to save 2 LL tests) it's a win to dedicate that CPU to GPU factoring.

GPU LL tests don't make sense by this math, even though they are quicker than CPU ones on the whole. But if you do the math, they're so much better at factoring than anything else that it's just hard to beat.

kladner 2011-10-15 16:11

Thanks kjaget! Dubslow has been trying to get these concepts across to me with mixed success. I will give a third mfaktc instance a run. I have been monitoring with GPU-Z, and your estimate of usage is pretty much right on for the GTX460.

"Someone go back and get a shit-load of dimes!" (uh.....TF assignments!)

Would I be correct in taking your logic another step and thinking that remaining CPU cores might be providing the best service doing P-1 factoring?

This is going to take a couple of different arrangements, as I actually do expect the GPU to pump pixels to the display on occasion.:smile: I am imagining an overnight setup which gives most everything (half the CPU and all the GPU) to TF. When I'm trying to use the machine for other things it seems that it's only really usable with one mfaktc running. I haven't gotten to Dubslow's suggestion of running one mfaktc and one cudalucas to see how that affects the performance.

On the other hand, I [U][I]suppose [/I][/U]I could try getting the onboard AMD HD4290 to work beside the GTX460 at the mundane task of display driving. I do know that the nVidia does not like AMD's OpenCL driver, though I don't know if that matters under the circumstances. My inclination would be to take that out after reinstalling the AMD drivers.

At that point I might have to do different setups for XP-32 and Win7-64. One of the reasons I put Win7 on this build was to give Photoshop the benefit of 8 gigs of RAM. Cudalucas seems to play nice with Pshop, even though a single instance runs the GPU at 96-98%. Multiple mfaktc's are not so considerate.:no:

kladner 2011-10-15 18:05

mfaktc x3
 
1 Attachment(s)
So here's what I get with 3 mfaktc's, and P95 stopped.

kladner 2011-10-15 18:13

mfaktc x3 + P95 x3 (LL)
 
1 Attachment(s)
The title says it all. I brought the P95 workers in at roughly halfway up the mfaktc boxes. Of course there was some lag starting workers one at a time. P95 didn't seem to make much difference, unless I glanced at the wrong numbers.

EDIT: It is interesting that the GPU memory controller is doing next to nothing. It runs 30-40% with cudalucas, if I remember correctly.

kladner 2011-10-15 19:46

Batch files
 
Carrying on from a discussion (here: [URL="http://www.mersenneforum.org/showthread.php?t=15545"]Best 4XX series GPU[/URL] 2nd page) with Christenson about using batch files to run mfaktc and cudalucas.

Here are a few links on batch commands:

[url]http://aumha.org/a/batches.php[/url]

[url]http://www.computerhope.com/batch.htm#02[/url]
This is the Commands page in a more extensive batch tutorial. I've not seen as many commands listed, so far, as I expect there to be. My impression was that just about any DOS commands would work at a Command Prompt, thus in batch files.

DOS commands:

[url]http://www.computerhope.com/msdos.htm#02[/url]

Microsoft page on the subject:

[url]http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/batch.mspx?mfr=true[/url]

I haven't looked into these very much, yet. I'm thinking that a batch file could draw from a folder of assignments and feed them into the command line that starts cudalucas, among other things.

kladner 2011-10-15 20:53

Batch files and CMD commands
 
This is the sort of exhaustive list, with explanations and arguments/switches given, with command line examples.

[url]http://ss64.com/nt/[/url]

Dubslow 2011-10-15 23:14

Hmm. I thought 2 mfaktc's already maxed out your GPU to 99%? In that case 3 mfaktc's would only speed you up by the use of the third core... in other words, not much. The pics you posted seem to have a different class size (Candidates are 4xxM not 5xxM) so I'm not sure how good the comparison is. And yes, I have found CUDALucas in general to be much nicer about actually processing graphics.

kladner 2011-10-16 01:39

kjaget suggested that I try going further, and it has been worth an afternoon's experimentation. I think the first group of three would be those shown in the screen shots, as I left another three running when we went out for dinner. These are the first two triads in my Results list from the first two runs with three instances. I'm really not sure how to analyze this:

Manual testing 59334211 NF 2011-10-16 01:06 0.4 no factor for M59334211 from 2^70 to 2^71 [mfaktc 0.17-Win barrett79_mul32] 4.0302

Manual testing 59333887 NF 2011-10-16 01:05 0.4 no factor for M59333887 from 2^70 to 2^71 [mfaktc 0.17-Win barrett79_mul32] 4.0302

Manual testing 59333831 NF 2011-10-16 01:04 0.4 no factor for M59333831 from 2^70 to 2^71 [mfaktc 0.17-Win barrett79_mul32] 4.0302

Manual testing 59334133 NF 2011-10-15 20:55 0.2 no factor for M59334133 from 2^70 to 2^71 [mfaktc 0.17-Win barrett79_mul32] 4.0302

Manual testing 59333873 NF 2011-10-15 20:55 0.2 no factor for M59333873 from 2^70 to 2^71 [mfaktc 0.17-Win barrett79_mul32] 4.0302

Manual testing 59333821 NF 2011-10-15 20:55 0.2 no factor for M59333821 from 2^70 to 2^71 [mfaktc 0.17-Win barrett79_mul32] 4.0302

In retrospect, I could have sorted the assignments to get more similar groups of three, or looked more closely at the dozen or so exponents I had on tap when I loaded the first three. And just to remind, the second set also had three LL's running in P95, FWIW. All I can say is that there is a different pattern among the columns, but I don't know how to interpret the changes.

Christenson 2011-10-16 02:24

[QUOTE=kladner;274662]This is the sort of exhaustive list, with explanations and arguments/switches given, with command line examples.

[URL]http://ss64.com/nt/[/URL][/QUOTE]

This is quite a bit more documentation than I have at my fingertips, thank you. Indeed, a batch file issues command lines, which can of course include CUDALucas....but I felt that a short, quick tutorial would be more helpful, since the batch files we will use are really simple.

If you want to do something fancy, I suggest perl (my favorite) or python, as these are full programming languages, and at least perl (and probably python) will be portable across various operating systems. Perl has excellent string handling, too, having the mother of all regular expression handlers built in.

kladner 2011-10-16 02:55

quoth Christenson:
"If you want to do something fancy, I suggest perl (my favorite) or python, as these are full programming languages, and at least perl (and probably python) will be portable across various operating systems."

I had not thought on such a large scale, though if it were within my capacities I might try. I honestly did not have a clear concept of how to carry out (seemingly simple) sequential list reading to feed in successive exponents to cudalucas. That's why I went off in search of command lists. Would I have to buy something to attempt working with perl?

kjaget 2011-10-16 12:35

Looking at the 2 vs 3 CPU mfaktc cases, you're getting 20-25% more TF throughput using 2 CPUs. This is based strictly on the time per class in each case, since that's directly related to the total run time per exponent.

You're getting 2 classes done in about 6.1 seconds with the 2 CPU case, and 3 in 7.5 seconds in the 3 CPU case. So the time increases a little bit while you get a 50% increase in TF throughput. And since sieve_primes is > 5000, you're finally close to maxing out the GPU (each CPU has a bit of extra time to wait for the GPU so it can sieve out more than the minimum number of prime candidates before passing them to the GPU for testing).

Christenson 2011-10-16 14:15

[QUOTE=kladner;274699]quoth Christenson:
"If you want to do something fancy, I suggest perl (my favorite) or python, as these are full programming languages, and at least perl (and probably python) will be portable across various operating systems."

I had not thought on such a large scale, though if it were within my capacities I might try. I honestly did not have a clear concept of how to carry out (seemingly simple) sequential list reading to feed in successive exponents to cudalucas. That's why I went off in search of command lists. Would I have to buy something to attempt working with perl?[/QUOTE]

For a simple job such as this, a plain batch file is sufficient. Just list out the commands in order (noting that I'm making up exponents from whole cloth)
[CODE]
CUDALucas<version>.exe -t25000037
CUDALucas<version>.exe -t25000091
[/CODE]

Now, if you are interested in perl (pathologically eclectic rubbish lister), perl itself can be downloaded and is free. But you might want to buy the O'Reilly press books on the subject -- that is certainly what I did.

If you wanted to work in C, you could also take a look at the parser in mfaktc, re-purpose it to issue cudaLucas commands (system("command line")), and to eliminate those commands from worktodo.txt when it got a positive result from the command line returning, assuming your computer was still operating some 40 or more hours later when the exponent finished.
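For the "read a list and feed successive exponents" idea, here is a minimal Python sketch, offered as an illustration rather than a tested recipe. The executable name `CUDALucas.exe`, the list file `exponents.txt` and all function names are assumptions; the `-t<exponent>` form is simply copied from the batch lines above.

```python
import subprocess

def build_commands(exponents, exe="CUDALucas.exe"):
    """One command line per exponent, using the -t<exponent> form shown above."""
    return [[exe, f"-t{p}"] for p in exponents]

def read_exponents(path):
    """Read one exponent per line, skipping blank lines and # comments."""
    with open(path) as fh:
        stripped = (line.strip() for line in fh)
        return [int(s) for s in stripped if s and not s.startswith("#")]

def run_queue(path, exe="CUDALucas.exe"):
    """Run the exponents one after another; each call blocks until it ends."""
    for cmd in build_commands(read_exponents(path), exe):
        print("starting:", " ".join(cmd))
        result = subprocess.run(cmd)
        # Report the exit code instead of assuming a clean exit.
        print("finished with exit code", result.returncode)

# Hypothetical usage, with one exponent per line in the file:
# run_queue("exponents.txt")
```

This does the same job as stacking lines in a batch file; the only gain is that the exponent list lives in its own text file.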

kladner 2011-10-16 16:02

[QUOTE=kjaget;274734]Looking at the 2 vs 3 CPU mfaktc cases, you're getting 20-25% more TF throughput using 2 CPUs. This is based strictly on the time per class in each case, since that's directly related to the total run time per exponent.[/QUOTE]

Isn't that a 20-25% increase when using three? That's how I read your second paragraph. That is, with 2 a "composite" time per class is 6.1/2=3.05. With 3 the "composite" is 7.5/3=2.5. 3.05/2.5=1.22, or a 22% increase.

Is that manipulating the numbers correctly, bearing in mind that it's only a rough approximation?
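The same arithmetic as a short Python sketch, using the rough 6.1 and 7.5 second figures quoted above (the function name is made up for illustration):

```python
def composite_time_per_class(batch_seconds, instances):
    """Effective seconds per class when `instances` classes finish together."""
    return batch_seconds / instances

two = composite_time_per_class(6.1, 2)    # 2 mfaktc instances
three = composite_time_per_class(7.5, 3)  # 3 mfaktc instances

print(f"2 instances: {two:.2f} s per class")
print(f"3 instances: {three:.2f} s per class")
print(f"throughput ratio, 3 vs 2: {two / three:.2f}")
```

A ratio above 1 means the three-instance setup gets through classes faster overall.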

kladner 2011-10-16 16:15

[QUOTE=Christenson;274738]For a simple job such as this, a plain batch file is sufficient. Just list out the commands in order (noting that I'm making up exponents from whole cloth)
[CODE]
CUDALucas<version>.exe -t25000037
CUDALucas<version>.exe -t25000091
[/CODE][/QUOTE]

OK. So the basic approach would be just stacking additional command lines with different exponents. I assume what would happen then is that one instance of CL would terminate, and a new one start with a new exponent.

I've been thinking of adding a Pause command at the end just so the "DOS box" will stay open when the work completes. At least I think that's what it would do. As it is, with mfaktc, the prompt closes when there are no more lines in worktodo.txt. I have yet to run a cudalucas assignment to completion, but I think the same thing would happen. I don't suppose it would serve any real purpose, except to be able to see the final screen.

davieddy 2011-10-16 16:21

[QUOTE=kjaget;274609]
As for what helps the project the most, a rough estimate is that you'll find a factor 1% of the bit levels you test (i.e. testing an exponent from 70 - 72 has a ~2% chance of finding a factor). This accounts for how rare factors are plus the fact that P-1 factoring has eliminated some of the ones you would have found in the ranges that matter. Each factor found eliminates the need for 2 full LL tests. So if your adding a CPU lets your GPU factor 50 numbers in the time it takes that CPU to run 1 LL test (100 factors to save 2 LL tests) it's a win to dedicate that CPU to GPU factoring.
[/QUOTE]

Sort of!
I hope this is a clarification of the last sentence:

"So if your adding a CPU lets your GPU [B]trial [/B]factor >50 [B]extra [/B]numbers [B]from 70-72 [/B]in the time it takes that CPU to run 2 LL tests, it's a win to dedicate that CPU to GPU factoring."

I think this same reasoning applies to the highest bit level worth going up to, does it not?

David
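The break-even reasoning can be written out as a small Python sketch. The 1% chance per bit level and the 2 LL tests saved per factor are the rough estimates quoted above, not measured values, and the function name is made up for illustration.

```python
def expected_ll_tests_saved(exponents, bit_levels,
                            p_factor_per_bit=0.01, ll_tests_per_factor=2):
    """Expected LL tests avoided by trial factoring a batch of exponents.

    p_factor_per_bit and ll_tests_per_factor are the rule-of-thumb figures
    from the quoted post (assumptions, not project statistics).
    """
    expected_factors = exponents * bit_levels * p_factor_per_bit
    return expected_factors * ll_tests_per_factor

# 50 exponents taken from 2^70 to 2^72 (2 bit levels each)
print(expected_ll_tests_saved(50, 2))
```

Under those assumptions, 50 exponents taken from 70 to 72 save about 2 LL tests, so the extra core pays off if it enables more than 50 such runs in the time it would need for 2 LL tests of its own.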

Dubslow 2011-10-16 18:49

[QUOTE=kladner;274744]
I've been thinking of adding a Pause command at the end just so the "DOS box" will stay open when the work completes. At least I think that's what it would do. As it is, with mfaktc, the prompt closes when there are no more lines in worktodo.txt. I have yet to run a cudalucas assignment to completion, but I think the same thing would happen. I don't suppose it would serve any real purpose, except to be able to see the final screen.[/QUOTE]
Well, sort of. In Win7, when CUDALucas finishes, Windows thinks "it's stopped responding and is searching for a solution to the problem". Not sure why it doesn't exit cleanly, but either way you're right in that the prompt goes away.

kladner 2011-10-16 20:37

1 Attachment(s)
Quote Dubslow: "...Windows thinks "it's stopped responding and is searching for a solution to the problem."

I guess that's technically correct, though if the program has shut down it's not really the same as sitting there frozen. Who knows what lurks in the "mind" of Windows? Maybe it sees the shutdown as a crash.

More experiments:
(See attached txt file)
I did a series of partial runs with 3 instances of mfaktc, making NumStreams=1,2,&3 in sequence.

NS=1 caused the SievePrimes to go up substantially, but also noticeably increased the Time values. Usability for other tasks was reasonably good.

NS=2 is at least moderately usable for other tasks, though the screen lag is visible. The time values are near enough to those for NS=3 that I don't see any point in going up there with the other drawbacks involved.

NS=3 is definitely the worst for screen refresh. While I was putting the text file together in notepad, when it got long enough to scroll, it took a couple of seconds for the refresh to ripple down through the text. It also did not seem to improve performance that much, if at all.


Adding a Pause command to the batch files did keep the prompt open, so that the total estimated time could be read. Once the shortest run ended, the other two quickly dropped SievePrimes to 5000, but there were only a few minutes between them, so I guess that is insignificant in the long haul.

Just to summarize, on a 1090T system with a (Gigabyte) GTX460, the best trade-off between throughput and screen lag seems to be 3 instances running, set to NS=2.

Dubslow 2011-10-16 22:44

That's true, but given the lag and the number of cores it takes up, you still might be better off with two mfaktc's. Your choice.

kladner 2011-10-16 23:18

"Your choice"

Still working on that.

I haven't yet tried the cudalucas+mfaktc arrangement. I can do it, now that I've gotten CL to start in XP-32. That's on a similar priority with getting mfaktc running in Win7-64. I am interested in relative performance between the two systems.

vsuite 2011-10-17 01:26

[QUOTE=kladner;274499]{...}All of this is running on a GTX460 and a 1090T. At this point I am running mfaktc under 32 bit XP, and cudalucas in Win7 64 bit.

{...}

All that said, would PrimeNet benefit more from 2 instances of mfaktc and 4 P95/64 workers? (Assuming that I could reduce display lag to a comfortable level.) Or is 1-mfaktc, 5-P95 a better balance? I have to say that I'm more inclined to the 1 and 5 scenario because I run cudalucas in Win7. While I know that cudalucas does not really need the 6th CPU core, keeping P64 limited to 5 cores avoids the conflict with Photoshop. It is also easier for me to keep the same number of workers in P95 and P64, since they both work on the same assignments, depending on which OS I boot. {...}
[/QUOTE]

1. prime95, mfaktc and CudaLucas can all run on Windows XP 32bit, Windows Vista 32&64 bit, and Windows 7 32 and 64 bit. However, IIRC there might be some differences between some of the 32 and 64 bit programs [someone correct me if I'm wrong].
2. I would suggest that you have separate 32 and 64 bit versions of the programs for when you log on in Windows XP 32 bit and Windows 7 64 bit, and let these separate versions of the programs run on their own separate data. That is, do not continue LL, P-1 or TF using the 32 bit software on data which started processing with the 64 bit software, and vice versa. Work on the 64 bit dataset with the 64 bit software only. Then every week or whatever time frame, manually submit your CudaLucas and mfaktc results to mersenne.org and obtain new data for worktodo.txt; prime95 will automatically submit its results and obtain new worktodo.
3. Max out the GPU insofar as is possible without making your system unresponsive. So use two or three concurrent running mfaktc (ie directories mfaktc32-1, mfaktc32-2, mfaktc32-3). If you have a slow GPU (eg GT220), you would have to limit yourself to one mfaktc. With a GTX460-1024MB under 32 bit Windows XP, it takes two full threads of a 2.4GHz Core2 Quad, ie 25%+25% CPU. With a GTX460-768MB under 64 bit Windows 7, it takes two 2/3 threads of a 2.6GHz AMD X2, ie approx 33%+33% CPU (each assigned via processor affinity to a separate core, and I assign Prime95 to two threads LL with each thread getting maybe 16% CPU, and the machine is not unresponsive). Alternatively I run 1 CudaLucas thread along with the same two Prime95 threads, and Prime95 obviously speeds up significantly.
4. Assign (all) the other free threads to Prime95. Once mfaktc is properly tuned, and the system is responsive with it, Prime95 acts as a background task, releases processing power readily, and does not seem to cause problems for me.
5. Check whether one or more of your Prime95 LL or P-1 threads slows down when the 2 mfaktc threads are running. This may be due to them being memory bound under your working conditions; if so, you might consider switching them to TF, or choosing a different balance of LL and P-1.

kladner 2011-10-17 02:18

@vsuite -Without quoting or getting into too many details, I just want to say thanks for the in-depth response. It will take me a bit to digest and integrate it, but I really appreciate that you took the effort. I am particularly intrigued by your advice in points 1 and 2, though the tuning tips in 3-5 are things I will pay attention to as well. This is the first time I've seen the idea raised that there might be differences in data handling between 32 and 64 bit versions -- aside from reported better performance under 64 bit, which I tend to expect.....all other things being equal. Emphasis on [U]"things I've seen.[/U]" There is so much stuff in these various fora that there is no telling what has been discussed at one time or another. But your points are a first for me.

I set up trying to run the same exponents in both XP-32 and Win7-64 because I spend irregular time in one or the other. I thought that I would keep one set of data moving forward, and avoid leaving a second set to languish. There is the added constraint that I run Photoshop in the 64 bit environment, and had to find a way to make P64 get along with it. Untweaked, P64 cripples Photoshop for some reason, if it is running on all the cores. I haven't yet tested how Pshop gets along with mfaktc.

In any case, you've given me food for much thought. And that's a big part of why I participate here.

kladner 2011-10-17 02:39

Latest experiment
 
1 Attachment(s)
I currently have the following running in XP32:

2 mfaktc with dedicated CPU cores to feed them.
1 cudalucas assigned to Core0/Worker1, where it seems to make little impact on P95.
P95 running 3 LL's and 1 P-1 stage 2 using 1222MB of RAM.

I can't say that I recommend this combination if you are trying to do very much else with the computer. Redraws and general responsiveness are very sluggish.

On the brighter side, the GPU and CPU are both running flat out, and temps and fan noise are both very reasonable. CUDALucas seems to be cranking out the iterations, though I can't tell just how fast. The two mfaktc's have good numbers, I think.

A fun experiment. Too bad it's only really doable if I want to leave the computer to itself. As it is, it is frustratingly slow to respond, and the screen is irritatingly flickery on things like the Asus system monitor (PC Probe) and on the GPU-Z Sensors tab. Perhaps, as Dubslow suggested, one mfaktc and one CL would strike a better balance.

I think I ought to go mess around on the 64 bit side, too, and see how that responds. But it will be easier if I let the mfaktc's run their courses before I reboot or try anything else.

mfaktc's just quit. Results:
no factor for M59384987 from 2^70 to 2^71
tf(): total time spent: 2h 3m 24.461s

no factor for M59385163 from 2^70 to 2^71
tf(): total time spent: 1h 58m 31.862s

The times seem pretty normal to me for those numbers, within my very limited range of experience. If I'm wrong, please say so. It would improve my frame of reference in such things.

Just to keep it plain, this is a
PhenomII x6 1090T Black Edition, running at ~3400MHz=17x200MHz, or 200MHz over stock
8GB Corsair Vengeance RAM (2x4GB) @ DDR1600, 9-9-9-24-41-1T
Gigabyte GTX460 1GB GDDR5, 256bit bus, factory OC to GPU 715MHz, RAM 900MHz
Asus M4A89GTD Pro/USB3 board

Now CL alone is keeping the GPU at 99%, 1C cooler and 3% less fan. I have to suppose that CL wasn't getting much GPU lovin' while the 2 mfaktc beasts were gorging themselves. Responsiveness, while better, is nowhere near snappy. The redraw ripples are faster, but still visible.

Dubslow 2011-10-17 06:39

Weird. When I use the -c flag I get timings.

As for those mfaktc numbers, those are pretty reasonable; it takes me somewhere between 2-2.5 hrs to do one 70 to 72 run, so 1 hr composite for 70-71 seems about right (also with a GTX460).
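For reference, the "composite" figure can be worked out from the two tf() totals posted above. This is only a rough Python sketch, assuming both instances ran side by side the whole time (so the wall-clock time is the longer of the two runs). The function names are made up for illustration and are not part of mfaktc.

```python
import re

def parse_tf_time(text):
    """Convert an mfaktc-style total such as '2h 3m 24.461s' to seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m ([\d.]+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised time string: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def composite_seconds(run_times):
    """Wall-clock seconds per exponent when all runs execute concurrently."""
    return max(run_times) / len(run_times)

# The two totals reported for the 2^70 to 2^71 runs.
runs = [parse_tf_time("2h 3m 24.461s"), parse_tf_time("1h 58m 31.862s")]
composite = composite_seconds(runs)
print(f"about {composite / 60:.1f} minutes per exponent")
```

With those two totals the composite comes out at a little over an hour per exponent, which is in line with the estimate above.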

Was that not-as-bad-lagginess with cudalucas with one or two mfaktc threads?

Also, GIMPS is a slow moving project, in the sense that many assignments go months or years before finishing, and everything you're doing is well ahead of the LL wavefront. Don't worry about letting those TF assignments sit still for a while. (If they're in the 53M range, then don't let them sit for more than a month or so.)

Christenson 2011-10-17 12:52

[QUOTE=kladner;274800]@vsuite -Without quoting or getting into too many details, I just want to say thanks for the in-depth response. It will take me a bit to digest and integrate it, but I really appreciate that you took the effort. I am particularly intrigued by your advice in points 1 and 2, though the tuning tips in 3-5 are things I will pay attention to as well. This is the first time I've seen the idea raised that there might be differences in data handling between 32 and 64 bit versions -- aside from reported better performance under 64 bit, which I tend to expect.....all other things being equal. Emphasis on [U]"things I've seen.[/U]" There is so much stuff in these various fora that there is no telling what has been discussed at one time or another. But your points are a first for me.

I set up trying to run the same exponents in both XP-32 and Win7-64 because I spend irregular time in one or the other. I thought that I would keep one set of data moving forward, and avoid leaving a second set to languish. There is the added constraint that I run Photoshop in the 64 bit environment, and had to find a way to make P64 get along with it. Untweaked, P64 cripples Photoshop for some reason, if it is running on all the cores. I haven't yet tested how Pshop gets along with mfaktc.

In any case, you've given me food for much thought. And that's a big part of why I participate here.[/QUOTE]

Kladner, mfaktc0.17 deliberately won't pick up a checkpoint file from a different OS. The developer is considering dropping that. Would you be willing to test for us?

kjaget 2011-10-17 12:58

[QUOTE=kladner;274742]Isn't that a 20-25% increase when using three?[/QUOTE]

Yeah, I was thinking 3 CPUs and mistakenly wrote 2... 3 CPUs are definitely generating more throughput.

[QUOTE] That's how I read your second paragraph. That is, with 2 a "composite" time per class is 6.1/2=3.05. With 3 the "composite" is 7.5/3=2.5. 3.05/2.5=1.22, or a 22% increase.

Is that manipulating the numbers correctly, bearing in mind that it's only a rough approximation?[/QUOTE]

Yep, that's exactly the reasoning I used.
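The same arithmetic as a small Python sketch, in case anyone wants to plug in their own numbers. The 6.1 and 7.5 are the per-class times quoted above. The function name is just for illustration, and like the original estimate it ignores anything else that changes between the two setups.

```python
def composite_time(time_per_class, instances):
    """Effective time per class when several instances run in parallel."""
    return time_per_class / instances

two_instances = composite_time(6.1, 2)    # 6.1 s per class, 2 instances
three_instances = composite_time(7.5, 3)  # 7.5 s per class, 3 instances

# Throughput gain of three instances over two, as a fraction.
gain = two_instances / three_instances - 1
print(f"two: {two_instances:.2f}  three: {three_instances:.2f}  gain: {gain:.0%}")
```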

kjaget 2011-10-17 13:09

[QUOTE=kladner;274803]On the brighter side, the GPU and CPU are both running flat out, and temps and fan noise are both very reasonable. CUDALucas seems to be cranking out the iterations, though I can't tell just how fast. The two mfaktc's have good numbers, I think.[/QUOTE]

CUDALucas creates checkpoint files every so often. Like Prime 95, it backs up the old one and creates a new one. Looking at the time stamps on each file and knowing how many iterations between each write will let you calculate a per-iteration time. I forget the exact file names, but I think they are t<exponent number> and c<exponent number>. Should be obvious which is which if you do a dir /od to sort by date.
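A rough Python sketch of that calculation, for anyone who would rather not do it by hand. It assumes you have identified the older and newer checkpoint files and already know how many iterations separate two writes. Both are assumptions here, and the file names in the comment at the end are placeholders, not real CUDALucas output names.

```python
import os

def checkpoint_gap_seconds(older_file, newer_file):
    """Seconds between the last-modified times of two checkpoint files."""
    return os.path.getmtime(newer_file) - os.path.getmtime(older_file)

def ms_per_iteration(elapsed_seconds, iterations_between_writes):
    """Average milliseconds per iteration over one checkpoint interval."""
    if iterations_between_writes <= 0:
        raise ValueError("iterations_between_writes must be positive")
    return 1000.0 * elapsed_seconds / iterations_between_writes

# Made-up numbers: 150 seconds between writes, 10000 iterations per write.
print(ms_per_iteration(150.0, 10000))

# With real files (placeholder names; substitute your own):
# gap = checkpoint_gap_seconds("older_checkpoint", "newer_checkpoint")
# print(ms_per_iteration(gap, 10000))
```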

[QUOTE]A fun experiment. Too bad it's only really doable if I want to leave the computer to itself. As it is, it is frustratingly slow to respond, and the screen is irritatingly flickery on things like the Asus system monitor (PC Probe) and on the GPU-Z Sensors tab. Perhaps, as Dubslow suggested, one mfaktc and one CL would strike a better balance.[/QUOTE]

Any luck playing with grid size in mfaktc.ini? That helped one of my systems, granted with a much slower card.

kladner 2011-10-17 14:05

Testing
 
[QUOTE=Christenson;274860]Kladner, mfaktc0.17 deliberately won't pick up a checkpoint file from a different OS. The developer is considering dropping that. Would you be willing to test for us?[/QUOTE]

Sure. Would that mean having the 64 and 32 bit versions cohabit in the same folders? Thus far, I have 2 sets of three folders. I had not even thought of the possibility, mainly because most mfaktc runs are short enough to wait out to completion.

Let me know what to do and I'll give it a shot.

kladner 2011-10-17 14:22

1 Attachment(s)
[QUOTE=kjaget;274862]CUDALucas creates checkpoint files every so often. Like Prime 95, it backs up the old one and creates a new one. Looking at the time stamps on each file and knowing how many iterations between each write will let you calculate a per-iteration time. I forget the exact file names, but I think they are t<exponent number> and c<exponent number>. Should be obvious which is which if you do a dir /od to sort by date.[/QUOTE]

c & t are the file prefixes. How do I know the iterations/write? I have -c10000 set currently. Are the checkpoints written on the same schedule?


[QUOTE=kjaget;274862]Any luck playing with grid size in mfaktc.ini? That helped one of my systems, granted with a much slower card.[/QUOTE]

I had not thought of that parameter. I'm now running a set of 3 instances, with GridSize=2. Responsiveness is better, and the Time values are all right at 7.5. So performance in mfaktc does not seem to have suffered.

Thanks for the suggestion. Screenshot of that run attached.

What about CPUStreams=x? It is currently at what I think is the default of 10. How might that affect responsiveness and performance, if at all?

EDIT: from the LL FAQ thread, courtesy of ATH. Add the -t switch to the CL command line for timing display.

TheJudger 2011-10-17 16:25

Hi,

[QUOTE=kladner;274871]Sure. Would that mean having the 64 and 32 bit versions cohabit in the same folders? Thus far, I have 2 sets of three folders. I had not even thought of the possibility, mainly because most mfaktc runs are short enough to wait out to completion.

Let me know what to do and I'll give it a shot.[/QUOTE]

mfaktc never differentiated between 32 and 64 bit. You can have both in the same directory (with different names for the executable, of course). You can use checkpoint files from 32 bit for a 64 bit version and vice versa.

mfaktc does differentiate between Windows and Linux versions (version string of mfaktc changes). I'll likely remove this check (which was added for safety) in upcoming versions.

mfaktc does differentiate between different version numbers and refuses to load the checkpoint file from foreign versions (e.g. mfaktc 0.16 vs. 0.17). Again this check is for safety. [B]If[/B] there is a computational bug in some versions it is easy to identify which runs were affected.

Oliver

TheJudger 2011-10-17 16:29

[QUOTE=kladner;274873]What about CPUStreams=x? It is currently at what I think is the default of 10. How might that affect responsiveness and performance, if at all?[/QUOTE]

defaults are:[CODE]
NumStreams=3
CPUStreams=3[/CODE]

10 is usually a waste of memory.

Oliver

kladner 2011-10-17 17:38

[QUOTE=TheJudger;274889]Hi,



mfaktc never differentiated between 32 and 64 bit. You can have both in the same directory (with different names for the executable, of course). You can use checkpoint files from 32 bit for a 64 bit version and vice versa.

SNIP.....

mfaktc does differentiate between different version numbers and refuses to load the checkpoint file from foreign versions (e.g. mfaktc 0.16 vs. 0.17). Again this check is for safety. [B]If[/B] there is a computational bug in some versions it is easy to identify which runs were affected.

Oliver[/QUOTE]

Hi Oliver,

Thanks very much for the explanation. It is particularly good to know about the version # detection. Does this extend to 'a,b,c' sub-versions?

How would worktodo and results files be handled? Would mfaktc.txt be used in common?

Lots of questions, I know.
Thanks again.

EDIT: Thanks also for the recommended value for CPUStreams.

kladner 2011-10-17 20:57

1 Attachment(s)
In pursuit of various suggestions I now have 3x mfaktc running in Win7-64. The executable is mfaktc171apsen.cuda40.sm_multi.win64
NumStreams=2
CPUStreams=3
GridSize=2


This setup is moderately responsive. It seems to be a bit faster than 32bit, but I could be misremembering as I write this.

EDIT: Yes. The time values seem to be around .5 less than in 32bit.

ATH 2011-10-18 01:09

I haven't used my GPU much until now, but now I started testing Cudalucas.

I have a Geforce GTX 460 and with 1 Cudalucas it runs at 95-96% GPU load and with 100% fan speed the temperature is about 69 degrees C (156 F), does that sound normal?

I'm just wondering if GPUs can hold up to being used 24/7 as well as CPUs? Anyone had problems with their cards or fans on the cards after awhile?

delta_t 2011-10-18 01:52

[QUOTE=ATH;274946]I haven't used my GPU much until now, but now I started testing Cudalucas.

I have a Geforce GTX 460 and with 1 Cudalucas it runs at 95-96% GPU load and with 100% fan speed the temperature is about 69 degrees C (156 F), does that sound normal?

I'm just wondering if GPUs can hold up to being used 24/7 as well as CPUs? Anyone had problems with their cards or fans on the cards after awhile?[/QUOTE]

What you are seeing is normal for CUDALucas, and your temperatures for the GPU seem fair. I don't think you're running too hot for it. In terms of running 24/7 I've been running my GTX 460M for a couple months without any problem. Others have been running theirs nearly continuously at least since July. Check with Xyzzy; I believe he's been running several GPUs continuously for a while now.

Chuck 2011-10-18 01:53

[QUOTE=ATH;274946]I haven't used my GPU much until now, but now I started testing Cudalucas.

I have a Geforce GTX 460 and with 1 Cudalucas it runs at 95-96% GPU load and with 100% fan speed the temperature is about 69 degrees C (156 F), does that sound normal?

I'm just wondering if GPUs can hold up to being used 24/7 as well as CPUs? Anyone had problems with their cards or fans on the cards after awhile?[/QUOTE]

I have had my GTX 580 running 24 X 7 since I bought it eight months ago. It is designed to run HOT and does so at 85 degrees C with fan at max and load of 83% with two instances of mfaktc. It has some sort of vapor chamber cooling. I was originally running SETI on it but switched back to Prime95 mfaktc some months back.

It is mounted vertically in a Maingear computer that exhausts the heat out the top of the case (so the fan is blowing UP). I have had no problems and EVGA warrants it for life even if overclocked (I did have to register it and submit a copy of the sales receipt). It has a factory overclock of 797 MHz and I have pushed it to 850 MHz, but I normally run it at the native OC rate.

The Core-i7 970 CPU is water cooled and stays at a more comfortable 69 degrees.

kladner 2011-10-18 02:07

[QUOTE=ATH;274946]I have a Geforce GTX 460 and with 1 Cudalucas it runs at 95-96% GPU load and with 100% fan speed the temperature is about 69 degrees C (156 F), does that sound normal?[/QUOTE]

When I first got my GTX460 about a week ago it got quite hot running FurMark (well duh!) and the two fans revved up pretty high. I see that I said then here [URL]http://www.mersenneforum.org/showthread.php?t=15545[/URL] that I got it up to 80% on the fans, 3450RPM at 79C. The weather was warmer then, and I put the case fans as low as they'd go to get there. Since, things have cooled down, and I put in a stronger case intake fan among other tweaks. Still, I just tried the same trick and could not hit 70C. Since this one was brand new, I kind of suspect that there was some heatsink compound break-in going on. I can't get the fans on mine above the low 50%'s now, but the room is cooler, too. It does seem that the fan control on yours is pretty aggressive. But I think these chips have pretty high temp ratings.

You might look at getting more air through the case. When I put this thing in everything got warmer.

[QUOTE=ATH;274946]I'm just wondering if GPUs can hold up to being used 24/7 as well as CPUs? Anyone had problems with their cards or fans on the cards after awhile?[/QUOTE]

Can't say about that. I haven't had it very long.

ATH 2011-10-18 02:22

Why don't you let the fan run at 100%? Isn't the best strategy to keep temperature as low as possible? (within limits).

I have set up profile policies in Nvidia Control Panel that set fanspeed to 100% (3540 RPM) when temperature exceeds 60 C.

kladner 2011-10-18 02:38

Set affinity from command line
 
Back in the thread I referenced in my previous post, I was going back and forth with Christenson about batch files for starting mfaktc and cudalucas. Somewhere in there I said, "Now if there were just a way to set affinity from a batch file call. It's a pain to do it manually every time."

Some digging later, I've found a free utility (GPL license) StartAffinity here-
[URL]http://www.adsciengineering.com/StartAffinity/[/URL]

So far, it has worked as advertised in my XP-32 setup, with great simplicity. I put it in my System32 folder, but I think it would work in the app folder, too. Here's the batch file I added it to:

h:
cd \mfaktc\mfaktc-0.17
mfaktc-win-32.exe
pause

which became:

h:
cd \mfaktc\mfaktc-0.17
startaffinity mfaktc-win-32.exe 1
pause

The "1" at the end of the third line is specifying that mfaktc be started to run on core 1 (starting with 0 as the first). I did a similar thing for a second instance running on core 5 (the sixth core on the PhenomII).

My only complaint is that it acted a little strange about the Pause command --it just kind of ran over it. I may take that out if I can't get it to behave. If you don't care about the prompt window shutting down when the program runs out of work you don't need the Pause anyway.

This takes a whole extra step out of starting up mfaktc.

EDIT: The two runs completed, and the windows stayed open. I was able to exit via Ctrl-C>y>Enter as usual. This is not the only app which sets affinity. It's just the first I found which works. It is said that you can do it with the Start command, but details are a little hard to come by. I am really jazzed by finding this. Now that I'm settling towards running mfaktc x2, I'm going to build a batch file that does it all in one double-click. Woohoo!

kladner 2011-10-18 02:58

[QUOTE=ATH;274954]Why don't you let the fan run at 100%? Isn't the best strategy to keep temperature as low as possible? (within limits).

I have set up profile policies in Nvidia Control Panel that set fanspeed to 100% (3540 RPM) when temperature exceeds 60 C.[/QUOTE]

I guess I see it as striking a balance. In a perfect world where fans were silent and never wore out, running at 100% would be OK. But I've worn out lots of fans, including little ones on video cards. And I don't have confidence in being able to replace special ones on a card. Noise is not really an issue with the vid card. It's one case fan in particular that can really scream if I let it. But mostly I don't need to, especially in mid-October in Chicago. I look forward to slowing all the fans down as things get cooler still.

On my Gigabyte GTX460, as set up, the fans don't even speed up from the 40% minimum until it hits 60C. Your nVidia Control Panel must be different from mine. No fan controls there. I do have a Gigabyte utility, but I think it only allows Auto or Manual fan control.

kladner 2011-10-18 03:06

[QUOTE=Dubslow;274781]That's true, but given the lag and the number of cores it takes up, you still might be better off with two mfaktc's. Your choice.[/QUOTE]

Getting back to this issue, I'm thinking that indeed 2 dedicated CPU cores are enough, and 3 instances of mfaktc are too much. Usability suffers. There are some issues even with 2x mfaktc.

Granted, with 2, SievePrimes locks down at 5000 and the Wait times are short. But Time per class is running 6.1-6.3, and avg.rate is in the upper 80s.

kladner 2011-10-18 04:22

This is different
 
I'm now running mfaktc x2 in Win7-64bit. The same scenario in XP-32 dived to SievePrimes 5000, and ran Waits under 100 mostly. In 64bit, it's holding SP at 25000 or more, with multi-hundred Waits. Times per class are 5.5 or a little more. Avg.Rates are mostly around 80. I think this is about the best I've seen it running.

StartAffinity works here in 64bit, too.

nucleon 2011-10-18 11:07

[QUOTE=kladner;274963]I'm now running mfaktc x2 in Win7-64bit. The same scenario in XP-32 dived to SievePrimes 5000, and ran Waits under 100 mostly. In 64bit, it's holding SP at 25000 or more, with multi-hundred Waits. Times per class are 5.5 or a little more. Avg.Rates are mostly around 80. I think this is about the best I've seen it running.

StartAffinity works here in 64bit, too.[/QUOTE]

One doesn't need special programs to configure affinity for windows programs.

Inbuilt command 'start' which is invoked via cmd.exe does the trick. i.e.:

cmd.exe /C start /low /affinity XXX mfaktc-win-64.exe

Where XXX is a hexadecimal code for the cpu core. i.e. core0 = 1, core1=2, core2=4, core3=8, core4=10, core5=20, core6=40, core7=80 etc...

If you want an exe to run on multiple cores, then you add the numbers together. i.e. core0+core1 = 3 etc... (Although you wouldn't want to do this for mfaktc - I'm just showing how the numbering scheme makes sense)

The /low switch is for user-interactive programs to have higher priority so mfaktc doesn't slow normal usage down.
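A minimal Python sketch of the numbering scheme described above, for anyone who wants to generate the code rather than work it out by hand. Cores are counted from 0 as in the list above, and the helper name is hypothetical. It only computes the hex value and does not launch anything.

```python
def affinity_mask(cores):
    """Return the hex code for `start /affinity` given 0-based core numbers."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers start at 0")
        mask |= 1 << core  # each core owns one bit of the mask
    return format(mask, "X")

for cores in ([0], [4], [5], [0, 1]):
    print(cores, "->", affinity_mask(cores))
```

Adding the codes together and setting the bits are the same operation here because no two cores share a bit.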

-- Craig

ATH 2011-10-18 13:45

[QUOTE=kladner;274956]On my Gigabyte GTX460, as set up, the fans don't even speed up from the 40% minimum until it hits 60C. Your nVidia Control Panel must be different from mine. No fan controls there. I do have a Gigabyte utility, but I think it only allows Auto or Manual fan control.[/QUOTE]

I forgot that I installed Nvidia System tools which adds more features to Nvidia Control Panel:
[URL="http://www.nvidia.com/object/nvidia_system_tools_6.06.html"]http://www.nvidia.com/object/nvidia_system_tools_6.06.html[/URL]

kladner 2011-10-18 15:27

[QUOTE=nucleon;274979]One doesn't need special programs to configure affinity for windows programs.

Inbuilt command 'start' which is invoked via cmd.exe does the trick. i.e.:

cmd.exe /C start /low /affinity XXX mfaktc-win-64.exe

Where XXX is a hexadecimal code for the cpu core. i.e. core0 = 1, core1=2, core2=4, core3=8, core4=10, core5=20, core6=40, core7=80 etc...
-- Craig[/QUOTE]

Thanks for explaining that Craig. I had read that Start was supposed to do that, but had not figured out the number codes, amongst the discussions of Affinity Masks.

I'm still a little puzzled by it. If multiple cores are specified by adding codes together, couldn't "10" mean either "Core 4", or "Cores 1 and 3"?

Also, while I know "what" hex is in general terms, I'm fuzzy on the "why and how" of its use. I've just now figured out the numeric sequence in the codes above, I think. It's using the core number as a power of 2, and mapping the result to hexadecimal, right? So core4 is 2^4=16, which is 10 in hex? That is, the 10's place in decimal represents 16s in hex.

I appreciate that you are trying to get these things across, and I'm trying to get a handle on them. It's just that things like hex are for me foggy realms of which I'm aware, but have a hard time navigating in.

The /low switch might be just the ticket for avoiding display lag, if it applies to GPU usage as well as CPU. If that's true I might be able to run 3 instances of mfaktc without suffering difficult levels of display lag.

Thanks again. I'll give this a try.

kladner 2011-10-18 15:32

[QUOTE=ATH;274997]I forgot that I installed Nvidia System tools which adds more features to Nvidia Control Panel:
[URL]http://www.nvidia.com/object/nvidia_system_tools_6.06.html[/URL][/QUOTE]

Thanks for that ATH. I'll have a look.

kladner 2011-10-18 16:56

Start command, /affinity
 
In response to nucleon's advice, I set up this command line in a batch file:

cmd.exe /C start /low /AFFINITY 20 mfaktc-win-32.exe

running under XP-32.

It returns an Invalid Switch error for /AFFINITY.

I haven't gone to Win7-64 to try it there, yet. I have not been able to find mention of /AFFINITY in the list given by "cmd /?".

Am I missing something?

EDIT: I am interested in working this out for educational purposes. I am doing OK with the free open source StartAffinity. But if nothing else the /low switch shows promise for better system responsiveness.

Edit2: I've tried it now in 64bit, from a shortcut and from a batch file, with and without quotes around the path and mfaktc executable, and all I get is a prompt in the correct directory.

axn 2011-10-18 18:21

[QUOTE=kladner;275002]I'm still a little puzzled by it. If multiple cores are specified by adding codes together, couldn't "10" mean either "Core 4", or "Cores 1 and 3"?

Also, while I know "what" hex is in general terms, I'm fuzzy on the "why and how" of its use. I've just now figured out the numeric sequence in the codes above, I think. It's using the core number as a power of 2, and mapping the result to hexadecimal, right? So core4 is 2^4=16, which is 10 in hex? That is, the 10's place in decimal represents 16s in hex.[/QUOTE]

You've got it.:smile:

bcp19 2011-10-18 20:19

[QUOTE=kladner;275002]Thanks for explaining that Craig. I had read that Start was supposed to do that, but had not figured out the number codes, amongst the discussions of Affinity Masks.

I'm still a little puzzled by it. If multiple cores are specified by adding codes together, couldn't "10" mean either "Core 4", or "Cores 1 and 3"?

[/QUOTE]

You're confusing hex and base 10 there... Hex goes 0-9 then a-f, so:
1=core 0
2=core 1
3=core 0 + 1
4=core 2
5=core 0 + 2
6=core 1 + 2
7=core 0+1+2
8=core 3
9=core 0 + 3
a=core 1 + 3
b=core 0 + 1 + 3
c=core 2 + 3
d=core 0 + 2+ 3
e=core 1 + 2 + 3
f=core 0 + 1 + 2 + 3
10=core 4
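Going the other way, here is a small Python sketch that decodes a hex code back into core numbers, which shows why hex "10" cannot be confused with "cores 1 and 3". The function name is made up for illustration.

```python
def cores_in_mask(hex_mask):
    """List the 0-based cores selected by a hexadecimal affinity code."""
    value = int(hex_mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]

print(cores_in_mask("10"))  # hex 10 is sixteen: a single bit
print(cores_in_mask("a"))   # hex a is ten: two bits
print(cores_in_mask("f"))
```

Hex 10 decodes to core 4 only, while cores 1 and 3 together are hex a, so each combination of cores has exactly one code.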

kladner 2011-10-18 21:06

[QUOTE=bcp19;275024]You're confusing hex and base 10 there... Hex goes 0-9 then a-f, so:
[/QUOTE]

Thanks bcp19. I'll try to wrap my mind around that.

Dubslow 2011-10-18 22:11

You were right, before. 10 represents 16, i.e. it's the number after 0F (=15-dec) (representing core0123 as bcp19 said). Then the addition as described by nucleon makes sense. (So core1+core3 = 10-dec = 0A-hex.) As for the switch not working, try 020 for core5 (002 would be core1); nucleon might have meant that the switch has to be exactly three hex digits long.

Dubslow 2011-10-18 22:19

[QUOTE=ATH;274946]I haven't used my GPU much until now, but now I started testing Cudalucas.

I have a Geforce GTX 460 and with 1 Cudalucas it runs at 95-96% GPU load and with 100% fan speed the temperature is about 69 degrees C (156 F), does that sound normal?

I'm just wondering if GPUs can hold up to being used 24/7 as well as CPUs? Anyone had problems with their cards or fans on the cards after awhile?[/QUOTE]My 460 is in the low 90's for load, 60C at 51% fan.

kladner 2011-10-19 03:37

Quote Dubslow: "As for the switch not working, try 020 for core5 (002 would be core1); nucleon might have meant that the switch has to be exactly three hex digits long."

Good point. I'll give it a try.

LaurV 2011-10-19 05:46

cmd itself does not have an /affinity switch; it is a switch of the "start" command in win64, so it should be present only on vista or xp64. No idea if present on win7/64, as they completely changed the command interpreter's logic for win7, and I have no computer on hand to test it right now.

The command given in the posts above starts "start" with the options that follow (open a command prompt and type "cmd /?" and see /c switch). You could equivalently write it as

cmd.exe /C "start /low /AFFINITY 20 mfaktc-win-32.exe"

or

cmd.exe /C "what_to_do"

what follows are parameters of "what to do".

As I have already checked, it appears /affinity option is present in Vista and the 64-bit version of XP. It is *not there* for the 32-bit version of XP.

delta_t 2011-10-19 07:33

[QUOTE=LaurV;275060]No idea if present on win7/64, as they completely changed the command interpreter's logic for win7, and I have no computer on hand to test it right now. [/QUOTE]

It is present in Win7/64.

I just made a batch file with exactly what's below to automatically start mfaktc to run on core6.
cmd.exe /C start /low /affinity 40 mfaktc-win-64.exe

You can verify that it worked by checking Task Manager, go to Processes tab, right-click the mfaktc process, and check the Set Affinity and Set Priority.

LaurV 2011-10-19 08:35

Thanks for checking it.
Meanwhile I found the best solution to do this in XP32 without running a third-party (possibly unsafe) application. You have to install Microsoft's [URL="http://technet.microsoft.com/en-us/sysinternals/bb897553.aspx"]psexec utility[/URL] (it is not part of the standard "dos" command prompt; it comes with the separate Sysinternals package). It has a "-a" switch that works perfectly for me.

ATH 2011-10-19 10:29

Just to add yet another possibility to the mix :) Imagecfg.exe from Microsoft Application Compatibility Toolkit:
[URL="http://www.microsoft.com/download/en/details.aspx?id=7352"]http://www.microsoft.com/download/en/details.aspx?id=7352[/URL]

[URL="http://www.robpol86.com/index.php/ImageCFG"]http://www.robpol86.com/index.php/ImageCFG[/URL]

can permanently change affinity for an .exe file:

imagecfg.exe -a 1 mfaktc-win-32.exe

Then mfaktc-win-32.exe will run on the first core every time you use it until you change it again with imagecfg.exe.

The codes for the -a switch are the same as discussed above for start /affinity, but note that the number is in decimal unless you add 0x in front: "imagecfg.exe -a 10" is 10 in decimal or "0xa" in hex, while "imagecfg.exe -a 0x10" is 16 in decimal.
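To make the decimal versus hex point concrete, here is a small Python sketch that reads a value by the rule just described (decimal unless it starts with 0x) and lists the cores it would select. It only mimics that rule for illustration. It is not imagecfg's own parser, and the function names are hypothetical.

```python
def parse_mask(text):
    """Read a mask as decimal unless it carries a 0x prefix."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)

def set_bits(mask):
    """0-based core numbers whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

for text in ("10", "0xa", "0x10"):
    value = parse_mask(text)
    print(text, "=", value, "-> cores", set_bits(value))
```

So a bare "10" selects cores 1 and 3, the same as "0xa", while "0x10" selects core 4 alone.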

kladner 2011-10-19 18:35

Many thanks to LaurV, delta_t, and ATH for the added info and explanations. I will try to work with them as time permits. I do understand the concerns about 3rd party apps, and will try to substitute one of the suggested MS solutions when I get the chance.

kladner 2011-10-19 18:52

Same mfaktc work in 32 and 64 bit versions
 
1 Attachment(s)
[QUOTE=Christenson;274860]Kladner, mfaktc0.17 deliberately won't pick up a checkpoint file from a different OS. The developer is considering dropping that. Would you be willing to test for us?[/QUOTE]
(Also in response to TheJudger's comments following the above on Page 2 of this string.)


I have been working on this and keeping a journal, which is attached. In the course of doing so, I have also been trying to achieve the most streamlined, fewer-manual-steps arrangement for the day/night different setups I have mentioned previously.

That is:
Day: P95/64 -2 "Makes most sense", 3 P-1, 1 mfaktc
Night: P64 -2 "Makes most sense", 2 P-1, 2 mfaktc

I only mention P64 for the Night setup because everything seems to run better there, and of course, all 8GB of RAM are available.

kladner 2011-10-20 21:08

Set Affinity
 
Thanks to everyone who offered advice on this topic. I now have it worked out in 64bit Win7. The command lines (in batch files) for the two instances I run are:

cmd.exe /c "start /low /affinity 0x10 mfaktc-win-64.exe"

This sets the affinity for CPU 4, on the 0-5 scale.

cmd.exe /c "start /low /affinity 0x20 mfaktc-win-64.exe"

This sets the affinity for CPU 5, on the 0-5 scale.

kladner 2011-10-20 23:17

Set Affinity
 
I have to add that the /low switch for the Start command seems to have greatly improved system responsiveness, while still turning in Time per Class values from 4.9xx-5.0xx on 70-71 runs.

TheJudger 2011-10-21 18:20

Hi!

[QUOTE=kladner;274897]Thanks very much for the explanation. It is particularly good to know about the version # detection. Does this extend to 'a,b,c' sub-versions?[/QUOTE]

The checkpoint file will be accepted if and only if the version string is identical. So mfaktc 0.16p1 will reject a checkpoint file from mfaktc 0.16.
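The rule amounts to an exact string comparison. A tiny Python illustration follows. This is not mfaktc's actual code, and the function name and the bare "0.16" / "0.16p1" strings are stand-ins for whatever the real version string looks like.

```python
def checkpoint_accepted(checkpoint_version, running_version):
    """Accept a checkpoint only when the two version strings match exactly."""
    return checkpoint_version == running_version

print(checkpoint_accepted("0.16", "0.16"))    # identical strings
print(checkpoint_accepted("0.16", "0.16p1"))  # sub-version differs
print(checkpoint_accepted("0.16", "0.17"))    # version differs
```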

[QUOTE=kladner;274897]How would worktodo and results files be handled? Would mfaktc.txt be used in common?[/QUOTE]

worktodo and result files do [B]not[/B] depend on the mfaktc version.

Oliver

kladner 2011-10-21 18:51

Thanks, Oliver. It all seems to be working out fine.

EDIT: With a little more thought I might have seen that the question on worktodo and results was unnecessary.

Quote: "mfaktc 0.16p1 will reject a checkpoint file from mfaktc 0.16."

Thanks for that, too.

kladner 2011-10-23 21:24

[QUOTE=kladner;275209]The command lines (in batch files) for the two instances I run are:

cmd.exe /c "start /low /affinity 0x10 mfaktc-win-64.exe"

This sets the affinity for CPU 4, on the 0-5 scale.

cmd.exe /c "start /low /affinity 0x20 mfaktc-win-64.exe"

This sets the affinity for CPU 5, on the 0-5 scale.[/QUOTE]

I've been looking at the switches for cmd and start. One of the things I wanted was for the command window to stay open when mfaktc (or cudalucas) finished. In the cmd + start scenario, the Pause command didn't yield the desired results.

However, the following does just what I want when run in a batch file:
[INDENT]cmd.exe /k "start /b /low /affinity 0x20 mfaktc-win-64.exe"
[/INDENT]The changes are replacing the cmd "/c" with "/k", and adding the start command switch "/b". I'm not completely sure that both are necessary, but it works like this so I'm leaving it alone.

The entire batch file is now[INDENT]e:
cd \mfaktc_32-64\mfaktc-0.17_32-64
cmd.exe /k "start /b /low /affinity 0x20 mfaktc-win-64.exe"
[/INDENT]That is:[INDENT]Change to working drive
Change to working directory
Run cmd & keep running after command, let Start run mfaktc in same window (/b), low priority, core 5 (the 6th core).

[/INDENT]
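For anyone generating these lines for several cores, here is a hedged Python sketch that just builds the command text used in the batch files above. It does not run anything, the function name is made up, and tying /b to /k simply mirrors the combination that worked above rather than any documented requirement.

```python
def start_command(executable, core, keep_window=True):
    """Build the cmd.exe line that pins `executable` to one 0-based core."""
    mask = 1 << core  # one bit per core, written in hex with a 0x prefix
    cmd_switch = "/k" if keep_window else "/c"
    start_switches = "/b /low" if keep_window else "/low"
    return (
        f'cmd.exe {cmd_switch} '
        f'"start {start_switches} /affinity 0x{mask:X} {executable}"'
    )

print(start_command("mfaktc-win-64.exe", 5))
print(start_command("mfaktc-win-64.exe", 4, keep_window=False))
```

The first call reproduces the /k line from the batch file above, and the second reproduces the earlier /c form for core 4.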

bcp19 2011-10-24 00:09

I was looking at newer video cards (my 450 seems kinda dated) and wanting to stay NVIDIA for the programs I have running here, I looked at their GTX line. For the GTX 550 there are 3 options, an ASUS GTX 550 for $130, a Gigabyte Technology GTX 550 for $142 and an NVIDIA GTX 550 for $200. By comparing all 3, it seems they all have the NVIDIA chip, so the question begs, which is the better choice, the $130, $142 or the $200 one? Is there an obvious difference in computing capability or is this a case of 'paying for the name'?

kladner 2011-10-24 01:47

[QUOTE=bcp19;275464] By comparing all 3, it seems they all have the NVIDIA chip, so the question begs, which is the better choice, the $130, $142 or the $200 one? Is there an obvious difference in computing capability or is this a case of 'paying for the name'?[/QUOTE]

I'd suggest seeing if you can find comparative reviews for the different card brands. Unless one of the manufacturers overclocked the chip, the computing capacity would be the same for the same chip. As to other value, I can't say.

On a quick look at Google for the chipset number I came up with an Anandtech article that compares different nV chipsets from the 5xx and 4xx series, perhaps not terribly favorably. I have not read the whole thing. They also discuss price points. The Suggested Retail on the GTX550Ti is $149. The SR on its big brother, the GTX560Ti is $249, but the 560 has twice as many CUDA processors and a wider memory bus.

[URL]http://www.anandtech.com/show/4221/nvidias-gtx-550-ti-coming-up-short-at-150[/URL]

I know another $100 might be hard to come by. But this article is from mid March. You can likely find the 560 for less by now. If you can stretch your price envelope, the processing power/$ is better for the 560 v 550.

As to one brand card versus another with the same chips, I would not expect a great difference between the Gigabyte and Asus offerings. The nVidia would be the "reference board," built just the way the chipmaker wanted it. Whether that is better, I can't say. That much higher a price does look like "paying for the name".

My own approach was to find the best deal I could on the chipset I decided on (GTX460), which happened to be a Gigabyte. So far so good on that. I happened on a good promotional deal from Microcenter: price reduction + rebate. New Egg certainly does similar things.

EDIT:
Take a look at New Egg, Microcenter, or (your favorite online retailer). If you can filter for just the 500 series cards, and then sort by price, compare the low end price you can get on a 560 with what you can get a 550 for. Don't forget to look for rebates. I see now that the Ti chip is more powerful than the plain 550 or 560. The plain 560 looks to be a higher-clocked version of the GTX 460, which is a good performer as is.

See the difference here:

[URL]http://www.tomshardware.com/reviews/geforce-gtx-560-amp-edition-gtx-560-directcu-ii-top,2944.html[/URL]

Christenson 2011-10-24 01:57

[QUOTE=bcp19;275464]I was looking at newer video cards (my 450 seems kinda dated) and wanting to stay NVIDIA for the programs I have running here, I looked at their GTX line. For the GTX 550 there are 3 options, an ASUS GTX 550 for $130, a Gigabyte Technology GTX 550 for $142 and an NVIDIA GTX 550 for $200. By comparing all 3, it seems they all have the NVIDIA chip, so the question begs, which is the better choice, the $130, $142 or the $200 one? Is there an obvious difference in computing capability or is this a case of 'paying for the name'?[/QUOTE]

While the 450 is perhaps a bit dated, it's still a perfectly good card that runs circles around CPUs at TF, and otherwise does good work. We hope you'll keep it in service somehow; also, make sure you have the power supply capability for your new configuration -- more GPU power = more watts from the power supply!!!!

delta_t 2011-10-24 04:29

[QUOTE=kladner;275440][INDENT]cmd.exe /k "start /b /low /affinity 0x20 mfaktc-win-64.exe"
[/INDENT]The changes are replacing the cmd "/c" with "/k", and adding the start command switch "/b". I'm not completely sure that both are necessary, but it works like this so I'm leaving it alone.[/QUOTE]

If you use /k you will need /b; otherwise two command windows will open.

delta_t 2011-10-24 04:38

[QUOTE=bcp19;275464]Is there an obvious difference in computing capability or is this a case of 'paying for the name'?[/QUOTE]

The Nvidia chips will be the same. Besides 'paying for the name' on some of them, price differences could be reflected in the cooler (reference design vs. custom), but the higher priced ones could have a factory overclocked chip. Look for catchphrase terms like "superclocked" or "extreme" or something to that effect, and then check whether the core clock frequency is higher than the usual default (for the GTX 550Ti the default core clock frequency is 900MHz).

kladner 2011-10-24 05:05

[QUOTE=delta_t;275482]If you use /k you will need /b; otherwise two command windows will open.[/QUOTE]

Thanks. I wasn't sure about that. In the end, the combination does give me what I wanted: the mfaktc window staying open to show the final output.

nucleon 2011-10-25 09:42

[QUOTE=delta_t;275484]The Nvidia chips will be the same. Besides 'paying for the name' on some of them, price differences could be reflected in the cooler (reference design vs. custom), but the higher priced ones could have a factory overclocked chip. Look for catchphrase terms like "superclocked" or "extreme" or something to that effect, and then check whether the core clock frequency is higher than the usual default (for the GTX 550Ti the default core clock frequency is 900MHz).[/QUOTE]

Also it might be confusing to see which card is faster and the relative speed of each one (factory overclocks really make it harder).

Sorry I don't know the original source of this info, but in short if you want to compare, go to: [url]http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#GeForce_500_Series[/url]

Grab the first number in the core config, say for GTX560Ti it's 384, and GTX580 is 512.

Then look at the clock speeds 822MHz (GTX560Ti) 772MHz (GTX580).

Multiply them:
GTX560Ti) 315,648
GTX580) 395,264

So as long as they have enough CPU resources behind them, the GTX580 is roughly 25% faster with those clock speeds.

Some of the factory overclocked GTX560Ti cards come awfully close to the default GTX580 card at a much lower price.

Do the numbers for GTX590, and it doesn't look all that great. (Personally I'd prefer 2x 560Ti cards for less $$)

This comparison method only applies when comparing Nvidia's 400 and 500 series cards. BTW, guide only!
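The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the cores-times-clock rule of thumb using the two stock figures quoted in this post. The dictionary and function names are made up for illustration.

```python
# Figures quoted in the post: (first number of the core config, stock core clock in MHz).
CARDS = {
    "GTX560Ti": (384, 822),
    "GTX580": (512, 772),
}


def raw_score(cores, clock_mhz):
    """Rule-of-thumb score: number of cores multiplied by core clock."""
    return cores * clock_mhz


scores = {name: raw_score(*spec) for name, spec in CARDS.items()}
ratio = scores["GTX580"] / scores["GTX560Ti"]

for name, score in scores.items():
    print(f"{name}: {score:,}")
print(f"GTX580 vs GTX560Ti: {100 * (ratio - 1):.0f}% faster")
```

To try a factory overclocked card, change its clock value in the table. The score says nothing about how well a particular program uses the cores, so treat it as a guide only.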

-- Craig

nucleon 2011-10-25 10:01

[QUOTE=kladner;275223]I have to add that the /low switch for the Start command seems to have greatly improved system responsiveness, while still turning in Time per Class values from 4.9xx-5.0xx on 70-71 runs.[/QUOTE]

Awesome.

Glad I was able to give you some ideas to play with.

-- Craig

TheJudger 2011-10-25 11:11

[QUOTE=nucleon;275655]Then look at the clock speeds 822MHz (GTX560Ti) 772MHz (GTX580).

Multiply them:
GTX560Ti) 315,648
GTX580) 395,264

So as long as they have enough CPU resources behind them, the GTX580 is roughly 25% faster with those clock speeds.[/QUOTE]

For mfaktc this is not true: mfaktc runs more efficiently on CC 2.0 (GTX 580) than on CC 2.1 (GTX 560Ti).

A stock GTX 580 has [B]50+% more throughput[/B] than a stock GTX 560Ti for mfaktc.

Oliver

garo 2011-10-26 20:26

That begs the question. Why?

Christenson 2011-10-27 04:07

[QUOTE=garo;275850]That begs the question. Why?[/QUOTE]
Probably because CC2.1 adds a feature that slows things down. Also, a GTX580 *ought* to be faster than a GTX560, assuming larger part numbers mean more capable cards.

kladner 2011-10-27 04:13

[QUOTE=garo;275850]That begs the question. Why?[/QUOTE]

First item up on a Google of GTX 560 vs Gtx 580:

[url]http://www.hwcompare.com/9133/geforce-gtx-560-ti-vs-geforce-gtx-580/[/url]

More cores, greater memory bandwidth.

garo 2011-10-27 22:08

Well it's kinda obvious why the 580 is more powerful than the 560. It's a bigger number innit, love? The why was for mfaktc running more efficiently on one than the other.

Christenson 2011-10-28 00:50

I think Oliver meant simply more quickly...his measure was TF jobs clock time, not energy or computations per core or anything more sophisticated.

TheJudger 2011-10-28 20:32

[QUOTE=garo;275850]That begs the question. Why?[/QUOTE]

[QUOTE=Christenson;275925]Probably because CC2.1 adds a feature that slows things down. Also, a GTX580 *ought* to be faster than a GTX560, assuming larger part numbers mean more capable cards.[/QUOTE]

The [I]feature[/I] is called ILP (instruction level parallelism) for CC 2.1

[QUOTE=Christenson;276059]I think Oliver meant simply more quickly...his measure was TF jobs clock time, not energy or computations per core or anything more sophisticated.[/QUOTE]

I was thinking about <number of GPU cores> * <GPU core clock rate> / <number of candidates per second>
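As a rough illustration of that measure, the sketch below plugs in the stock core counts and clocks quoted earlier in the thread. The throughput numbers are an assumption: the GTX 560Ti is normalized to 1.0 and the GTX 580 is set to exactly 1.5, the lower bound of the "50+%" figure. They are not measured candidates-per-second values, and the function name is made up.

```python
def core_cycles_per_candidate(cores, clock_mhz, candidates_per_sec):
    """<number of GPU cores> * <GPU core clock rate> / <candidates per second>."""
    return cores * clock_mhz * 1e6 / candidates_per_sec


# Assumed relative throughput: GTX 560Ti = 1.0, GTX 580 = 1.5 (lower bound of "50+%").
gtx560ti = core_cycles_per_candidate(384, 822, 1.0)
gtx580 = core_cycles_per_candidate(512, 772, 1.5)

# Only the ratio is meaningful, because the throughput units are arbitrary.
overhead = gtx560ti / gtx580
print(f"GTX 560Ti spends about {100 * (overhead - 1):.0f}% more core-cycles per candidate")
```

Under these assumptions the GTX 560Ti needs roughly 20% more core-cycles for each candidate, which is the efficiency gap being described.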

CC 2.1 is slower because it needs ILP to utilize the full number of GPU cores. My current kernels have many dependent instructions (such as add with carry).
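To show what a dependent instruction chain looks like, here is a small Python sketch of a multi-word add with carry. It is not taken from the mfaktc kernels. The 32-bit limb size and the function names are assumptions chosen for illustration. The point is only that each step needs the carry from the step before it, so the steps cannot be issued side by side.

```python
LIMB_BITS = 32
LIMB_MASK = (1 << LIMB_BITS) - 1


def add_with_carry(a_limbs, b_limbs):
    """Add two equal-length little-endian lists of 32-bit limbs.

    Every iteration consumes the carry produced by the previous one,
    which forms a chain of dependent operations.
    """
    result = []
    carry = 0
    for a, b in zip(a_limbs, b_limbs):
        total = a + b + carry            # depends on the previous limb's carry
        result.append(total & LIMB_MASK)
        carry = total >> LIMB_BITS
    return result, carry


def add_independent(a_limbs, b_limbs):
    """Unrelated 32-bit additions: no step waits on another's result."""
    return [(a + b) & LIMB_MASK for a, b in zip(a_limbs, b_limbs)]


# The carry ripples through all three limbs, from the lowest to the highest.
print(add_with_carry([0xFFFFFFFF, 0xFFFFFFFF, 0], [1, 0, 0]))  # ([0, 0, 1], 0)
```

Code shaped like the second function offers the kind of independent work that ILP can exploit. Code shaped like the first does not.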

Oliver

garo 2011-10-28 21:41

Ah the feature that is a bug until you write a new version and then it's not!

TheJudger 2011-10-28 22:46

[QUOTE=garo;276179]Ah the feature that is a bug until you write a new version and then it's not![/QUOTE]

They just save some transistors (power budget, die size, ...).

You might want to read this:[LIST][*][url]http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2[/url][*][url]http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/16[/url][/LIST]
I'll wait for the next-generation GPUs and then I'll decide whether to give it a try or not.

Oliver

Christenson 2011-10-28 23:07

[QUOTE=garo;276179]Ah the feature that is a bug until you write a new version and then it's not![/QUOTE]
:featurebug:
:leaving:

nucleon 2011-10-29 00:21

[QUOTE=TheJudger;276173]The [I]feature[/I] is called ILP (instruction level parallelism) for CC 2.1

CC 2.1 is slower because it needs ILP to utilize the full number of GPU cores. My current kernels have many dependent instructions (such as add with carry).

Oliver[/QUOTE]

Thanks Oliver. I guess I'll wait for more ILP friendly code :)

Looking at the results from my setup here, the actual performance difference is 16%.

So my 2x GTX560Ti is 16% faster than 1x GTX580 (actual), but based on my previous calculations above it should be 75%.

The SievePrimes value differs between the 2 cards, so the above figures are a guide only.

-- Craig

dbaugh 2011-10-29 06:00

CUDA and display
 
If I am using my graphics card for mfaktc how is it that the GPU plays nice by also driving the display? I have heard that CPU onboard graphics is disabled if a video card is detected. Is there a way to trick the computer into using onboard graphics and leave all the GPU for math? Or does the GPU doing double duty not affect TF throughput?

Systems I have seen advertised with multiple C2075 Teslas often have a lower end GPU installed as well. Will mfaktc automatically take full advantage of the power of such a system or is it tailored for GTX 590 and lower?

- David


All times are UTC.