mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

KyleAskine 2011-11-26 12:36

[QUOTE=flashjh;279944]What did you do to fix it?[/QUOTE]

I didn't. It locked up again. As far as I can tell, I cannot get a second instance of mfakto to run on my Linux box without ending up with defunct threads that only a reboot gets rid of. I know for a fact that heat is not an issue, so I am stumped.

On my Windows box it seems that I can, though.

bcp19 2011-11-26 14:12

[QUOTE=flashjh;279942]How are you starting multiple instances of mfakto and how are you specifying a single core or multiple cores on your CPU for each instance? Each time I start mfakto no matter how I run it, I see usage across all four cores.[/QUOTE]

Just so you know: 1) I'm running Windows, and 2) I use bat files:

g:
cd\mfakto
cmd.exe /k "start /b /low /affinity 0x08 mfakto.exe"

and

g:
cd\mfakto-1
cmd.exe /k "start /b /low /affinity 0x04 mfakto.exe"

Dubslow 2011-11-26 15:27

If you don't want to bother with batch files, you can also set mfakto to run on a specific core by using the Task Manager. Go to the process list (not the application list) and right click on mfakto and select 'Affinity'. It will have the same effect as using the command line options above.

To run multiple instances, you need to have one subfolder for each instance, each subfolder with its own executable and its own worktodo file.
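As a rough illustration of the one-folder-per-instance layout, here is a minimal Python sketch that only builds the text of batch files like the ones bcp19 posted. The helper name `instance_batch` is made up for this example; the drive letter, folder names and masks are copied from that post and would need adjusting for another machine.

```python
def instance_batch(drive, folder, mask, exe="mfakto.exe"):
    """Return the text of a batch file that starts one pinned instance.

    Hypothetical helper: it mirrors the three-line batch files quoted in
    this thread and does not launch anything itself.
    """
    if mask <= 0:
        raise ValueError("affinity mask must select at least one core")
    lines = [
        f"{drive}:",                # switch to the drive holding the folder
        f"cd\\{folder}",            # each instance lives in its own folder
        f'cmd.exe /k "start /b /low /affinity 0x{mask:02X} {exe}"',
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Folder names and masks as in the batch files posted earlier.
    for folder, mask in [("mfakto", 0x08), ("mfakto-1", 0x04)]:
        print(instance_batch("g", folder, mask))
```

Each folder still needs its own copy of the executable and its own worktodo file; the sketch does not create those.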

flashjh 2011-11-26 18:55

[QUOTE=bcp19;279983]Just so you know: 1) I'm running Windows, and 2) I use bat files:

g:
cd\mfakto
cmd.exe /k "start /b /low /affinity 0x08 mfakto.exe"

and

g:
cd\mfakto-1
cmd.exe /k "start /b /low /affinity 0x04 mfakto.exe"[/QUOTE]

Ok, that did it -- thank you. I looked up the start affinity stuff just to make sure I 'got it'. The actual breakdown for the affinity is as follows:

CPU3  CPU2  CPU1  CPU0    Bin    Hex
OFF   OFF   OFF   ON    = 0001 = 1
OFF   OFF   ON    OFF   = 0010 = 2
OFF   OFF   ON    ON    = 0011 = 3
OFF   ON    OFF   OFF   = 0100 = 4
OFF   ON    OFF   ON    = 0101 = 5
OFF   ON    ON    OFF   = 0110 = 6
OFF   ON    ON    ON    = 0111 = 7
ON    OFF   OFF   OFF   = 1000 = 8
ON    OFF   OFF   ON    = 1001 = 9
ON    OFF   ON    OFF   = 1010 = A
ON    OFF   ON    ON    = 1011 = B
ON    ON    OFF   OFF   = 1100 = C
ON    ON    OFF   ON    = 1101 = D
ON    ON    ON    OFF   = 1110 = E
ON    ON    ON    ON    = 1111 = F
I also modified the batch file line a bit to point each instance at its own GPU with -d. I run two instances with two cores each to max everything out:

cmd.exe /k "start /b /low /affinity 3 mfakto.exe -d 11"
cmd.exe /k "start /b /low /affinity C mfakto.exe -d 12"

First one uses cores one and two with my 1st GPU, second uses cores three and four with my 2nd GPU.
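For machines with more than four cores the table gets long, so here is a small Python sketch of the same bit arithmetic: CPU0 is bit 0, CPU1 is bit 1, and so on. The function names `affinity_mask` and `cores_in_mask` are made up for this example.

```python
def affinity_mask(cores):
    """Return the hex mask (no 0x prefix) for zero-based CPU indices.

    CPU0 is the lowest bit, matching the table above.
    """
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core index must be non-negative")
        mask |= 1 << core          # switch on the bit for this core
    return format(mask, "X")


def cores_in_mask(hex_mask):
    """Inverse helper: list the zero-based CPU indices a hex mask selects."""
    value = int(hex_mask, 16)      # accepts "C" as well as "0x0C"
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    print(affinity_mask([0, 1]))   # CPU0 + CPU1, the first instance above
    print(affinity_mask([2, 3]))   # CPU2 + CPU3, the second instance above
    print(cores_in_mask("0x08"))   # the mask from bcp19's first batch file
```

Note that the masks count from CPU0, so "cores one and two" above are CPU0 and CPU1 in the table.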

I now can get 130 M/s each with no Prime95. I'm going to play with the sieve a bit to see what I can do. Once I start Prime95 it all goes back to 120 M/s. I guess I need more cores, time for another upgrade.

flashjh 2011-11-27 01:00

[QUOTE=Bdot;279809]Actually, in order to get both the CPU and the GPU to (almost) full load, I have to run 3 mfakto instances and 3 prime95 threads on my quad-core Phenom. This way, the 3 mfakto instances add up to almost 2 CPU cores (with peaks to ~220%). Two of the prime95 threads advance at normal speed, the third is just taking what's left over (~5-10% CPU, i.e. rather crawling along). I don't pin mfakto to any core, I let Windows7 choose.[/QUOTE]

So, what hardware are you using for the whole system? What are your total M/s with the three instances?


[QUOTE=Bdot;279809]I think I'll add a raw performance measurement mode to mfakto, detailing the pure kernel runtime per kernel. This way it would be easier to compare the cards, also to NV/mfaktc. Until then, use tools like GPU-Z, or "aticonfig --od-getclocks | grep load" to find out how much room the GPU still has. I unfortunately have access to only 2 different ATI cards, one of them bound to 11.11 :-([/QUOTE]

Why is it bound to 11.11?


[QUOTE=Bdot;279809]The latter.[/QUOTE]

I always get higher μs times when my CPU is too busy like with Prime95. What exactly is the ave. wait time telling us?

flashjh 2011-11-27 01:20

[QUOTE=Dubslow;279984]If you don't want to bother with batch files, you can also set mfakto to run on a specific core by using the Task Manager. Go to the process list (not the application list) and right click on mfakto and select 'Affinity'. It will have the same effect as using the command line options above.

To run multiple instances, you need to have one subfolder for each instance, each subfolder with its own executable and its own worktodo file.[/QUOTE]

Alright, thanks for the tip. I discovered that I can use your suggestion to set the affinity of the instances while they're running so I don't have to open and close them over and over -- once I find a good setting, I can update the batch files to use my new settings. Thanks!

Dubslow 2011-11-27 01:47

[QUOTE=flashjh;280040]So, what hardware are you using for the whole system? What are your total M/s with the three instances?[/QUOTE]

Keep in mind that M/s is not necessarily the best comparison, because M/s changes depending on SievePrimes without affecting actual throughput. Time per class for a similar assignment is a better metric. See below:

[QUOTE=flashjh;280040]
I always get higher μs times when my CPU is too busy like with Prime95. What exactly is the ave. wait time telling us?[/QUOTE]

Going with the above, SievePrimes determines how much work is done on the CPU before candidates are sent to the GPU. Essentially, the CPU eliminates ('sieves out') factor candidates that are not prime. The higher SievePrimes is, the more work the CPU does and the more candidates are eliminated as composite. The candidates that survive the sieve are tested on the GPU to see whether they are factors. Avg. wait tells how long the CPU must wait for the GPU before doing more sieving. If it is less than ~100 μs, then the CPU is overloaded and outpaced by the GPU. To rectify this, decrease SievePrimes (which shifts more work to the GPU rather than the CPU) or run more than one mfakto instance. If avg. wait is greater than 1000 or 2000 μs, then the process is bottlenecked by the GPU; the CPU is doing a lot of waiting. Fix this by increasing SievePrimes (which shifts more work to the CPU) or by running fewer instances.
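To make the sieving step concrete, here is a toy Python sketch. It is not mfakto's actual sieve, only an illustration under some background assumptions that are not from this thread: any factor q of 2^p - 1 (p an odd prime) has the form q = 2kp + 1 and satisfies q ≡ ±1 (mod 8), and the sieve then discards candidates divisible by one of the first few odd primes. The function names and parameters are made up for the example.

```python
def small_odd_primes(count):
    """First `count` odd primes by trial division (fine for a toy example)."""
    primes = []
    n = 3
    while len(primes) < count:
        if all(n % p for p in primes if p * p <= n):
            primes.append(n)
        n += 2
    return primes


def surviving_candidates(exponent, k_max, sieve_primes):
    """Count candidates q = 2*k*exponent + 1 (k = 1..k_max) left after sieving.

    `sieve_primes` plays the role of SievePrimes here: the number of small
    primes used to throw out composite candidates on the CPU.
    """
    primes = small_odd_primes(sieve_primes)
    survivors = 0
    for k in range(1, k_max + 1):
        q = 2 * k * exponent + 1
        if q % 8 not in (1, 7):
            continue               # cannot divide 2^exponent - 1
        if any(q % p == 0 and q != p for p in primes):
            continue               # composite, so never sent to the "GPU"
        survivors += 1
    return survivors


if __name__ == "__main__":
    # Arbitrary example exponent; more sieve primes leave fewer candidates.
    for count in (10, 100, 1000):
        print(count, surviving_candidates(1000003, 10000, count))
```

The survivor count can only go down as `sieve_primes` goes up, which is the trade-off described above: more CPU work per candidate, fewer candidates left for the GPU.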

Most people will generally find avg. wait to be very low, i.e. mfakto and mfaktc tend to be bottlenecked by the CPU. This is because most GPUs are an order of magnitude (or more) more powerful than most CPUs. For instance, one core of my i7-2600k (a high-end processor) can just barely keep pace with a GTX 460 (a mid-range GPU). With SievePrimes=5000 (the lowest it will go) I get average wait times around 100-150 μs (and every once in a while it drops below 100).

Another sign of CPU overload is a GPU load below 90-95%; GPU-Z is a good program for checking that. If you're at less than 90% load, check your SievePrimes and avg. wait and consider running more than one instance.
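The avg. wait rules of thumb above can be written down as a small Python sketch. The thresholds (100 μs and 1000 μs) are just the rough figures from this post, and the function name `diagnose_avg_wait` is made up for the example; treat the result as a hint, not a measurement.

```python
def diagnose_avg_wait(avg_wait_us, low_us=100, high_us=1000):
    """Classify an avg. wait reading (in microseconds) by the rules above.

    low_us and high_us are the approximate thresholds from the post,
    not values defined by mfakto itself.
    """
    if avg_wait_us < 0:
        raise ValueError("avg. wait cannot be negative")
    if avg_wait_us < low_us:
        # The CPU cannot sieve fast enough to keep the GPU busy.
        return "cpu-limited: lower SievePrimes or run another instance"
    if avg_wait_us > high_us:
        # The CPU spends most of its time waiting for the GPU.
        return "gpu-limited: raise SievePrimes or run fewer instances"
    return "roughly balanced"


if __name__ == "__main__":
    for reading in (0, 150, 2500):
        print(reading, "->", diagnose_avg_wait(reading))
```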

Dubslow 2011-11-27 02:01

I just went back and looked at your hardware; as far as I know 5870s are better than a 460. The processor definitely isn't the best, though it is overclocked. I would not be surprised if you needed two threads per GPU to fully saturate them, and since you have two cards (a [i]lot[/i] of firepower) you might need all four CPU cores to saturate the GPUs. With the above post and your newfound affinity skills, run a bit more testing and see what numbers you get.

flashjh 2011-11-27 02:02

[QUOTE=Dubslow;280045]Keep in mind that M/s is not necessarily the best comparison, because M/s changes depending on SievePrimes without affecting actual throughput. Time per class for a similar assignment is a better metric. See below:



Going with the above, SievePrimes determines how much work is done on the CPU before candidates are sent to the GPU. Essentially, the CPU eliminates ('sieves out') factor candidates that are not prime. The higher SievePrimes is, the more work the CPU does and the more candidates are eliminated as composite. The candidates that survive the sieve are tested on the GPU to see whether they are factors. Avg. wait tells how long the CPU must wait for the GPU before doing more sieving. If it is less than ~100 μs, then the CPU is overloaded and outpaced by the GPU. To rectify this, decrease SievePrimes (which shifts more work to the GPU rather than the CPU) or run more than one mfakto instance. If avg. wait is greater than 1000 or 2000 μs, then the process is bottlenecked by the GPU; the CPU is doing a lot of waiting. Fix this by increasing SievePrimes (which shifts more work to the CPU) or by running fewer instances.

Most people will generally find avg. wait to be very low, i.e. mfakto and mfaktc tend to be bottlenecked by the CPU. This is because most GPUs are an order of magnitude more powerful than most CPUs. For instance, one core of my i7-2600k (a high-end processor) can just barely keep pace with a GTX 460 (a low-to-mid-range GPU). With SievePrimes=5000 (the lowest it will go) I get average wait times around 100-150 μs.

Another sign of CPU overload is a GPU load below 90-95%; GPU-Z is a good program for checking that. If you're at less than 90% load, check your SievePrimes and avg. wait and consider running more than one instance.[/QUOTE]

Thanks for the explanation... I have another question then. I understand that on my system the two 5870s outclass my X9650. But I keep SievePrimes at 5000 on both mfakto instances because otherwise the M/s drops way down. From what you're saying, though, I may be out of balance, correct? So how do I find the right balance between mfakto instances and CPU load?

It seems that no matter what I try, I can't get my GPUs above 70%. Also, as I have it now my avg. wait times are always 0 except for a rare jump to 20-200 μs. I must be doing something wrong, but I'm not sure what to adjust first. More instances with a higher SievePrimes, maybe?

How do I determine/get maximum throughput?

flashjh 2011-11-27 02:07

[QUOTE=Dubslow;280046]I just went back and looked at your hardware; as far as I know 5870s are better than a 460. The processor definitely isn't the best, though it is overclocked. I would not be surprised if you needed two threads per GPU to fully saturate them, and since you have two cards (a [I]lot[/I] of firepower) you might need all four CPU cores to saturate the GPUs. With the above post and your newfound affinity skills, run a bit more testing and see what numbers you get.[/QUOTE]

We cross-posted a bit, but I'll expand my question just a bit with your info here.

So, does shooting for the lowest ETA on each TF give the best throughput, or is there something more to look at? How do you maximize throughput?

[QUOTE=Dubslow;280046]I would not be surprised if you needed two threads per GPU to fully saturate them[/QUOTE]

BTW - You are correct! I had all but discovered that my CPU cannot max out both GPUs at the same time. Even all four cores don't seem to put a GPU at 100%. When I run two instances of mfakto I need two cores each to get 65% on each GPU. I'll keep testing. Thanks again for the help.

Dubslow 2011-11-27 02:08

Maximum throughput is best judged by time per class. (You'll notice that if you increase SievePrimes and M/s drops a lot, time per class should suffer less than M/s does; it will still get somewhat worse, however, because you are shifting more work to the CPU.) When you say both mfakto instances, do you mean one for each card? If you mean two for just one card and SievePrimes is already as low as possible (5000), then there's nothing you can do except run a third instance, if your CPU isn't maxed out already. You're right, the avg. wait times suggest that mfakto is limited by the CPU; since you can't decrease SievePrimes any further, the only fix is more instances.

Edit: Whoops, cross posting. In mfaktc at least, one of the columns printed to the CLI is "Time per class". If you can't find that, then ETA+classes complete is the next best way to determine overall throughput. If you're comparing two instances with the same SievePrimes, then you can compare M/s. If they're different SievePrimes you need to look at time per class.
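As a minimal Python sketch of that comparison: convert each instance's time per class into classes per hour and add the instances up. This assumes all instances are working on similar assignments (same exponent range and bit level), as mentioned above; the function name and the sample timings are made up for illustration and are not measurements.

```python
def classes_per_hour(seconds_per_class):
    """Combined classes per hour for several instances.

    `seconds_per_class` holds one time-per-class reading per running
    instance; the rates simply add up across instances.
    """
    if any(t <= 0 for t in seconds_per_class):
        raise ValueError("time per class must be positive")
    return sum(3600.0 / t for t in seconds_per_class)


if __name__ == "__main__":
    # Hypothetical readings: one instance alone vs. two slower instances.
    one_instance = classes_per_hour([10.0])
    two_instances = classes_per_hour([12.0, 12.0])
    print(one_instance, two_instances)
```

With these made-up numbers, two instances at 12 s per class each finish more classes per hour than one instance at 10 s, even though each individual instance looks slower.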

Edit 2: Can you run two mfakto instances on one GPU, set the affinities (to make sure it's only using two cores of your CPU) and then post their SievePrimes and avg. wait? It sounds like even this won't be pretty...

Edit 3: I will come back in ten minutes to avoid more cross posting.
Edit 4: Just read your last post in more detail. 2 cores gets one GPU to 65% load? I would still be interested in the numbers.


All times are UTC.
