mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

bcp19 2011-11-26 01:22

[QUOTE=flashjh;279858]From what I can tell you need two or more instances to max out the GPUs. If you have a slower CPU, you might max it out before you max the GPU. Like in my case, my GPUs sit at 60% each with two instances, but my CPU doesn't have much more to throw at it, sitting at 85%. My wait times are quite low, mostly 0, but up to 200µs. But, when I change sieve to anything other than 5000, my M/s drops way down. Could be the difference between ATI & nVidia?[/QUOTE]

I had a 6770 in my i5 2400, and with 2 mfakto instances running I was only seeing ~70% GPU load and I believe 100 M/s combined, and I really didn't want to use up yet another core to try to max it out. The 560 Ti I replaced the 6770 with works with 1 core at 165 M/s and ~65% load. If I start a second mfaktc I can get up to 95% and roughly 230 M/s. The 560 is almost maxed with 1 core: 1 core gives me 175 M/s and 2 only gets it up to 200 M/s combined.

flashjh 2011-11-26 05:59

4670
 
So I have a 4670 in my older computer. I tested it out after finally getting the ATI 11.9 driver installed with the AGP card. It all works - I get about 12.2 M/s with the CPU @ ~85%. Prime95 can run one LL test with no impact on the M/s, but it takes the CPU to 100%.

Two questions -

First, what exactly does changing the sieved primes do? On my faster machine any number over 5000 slows things down. I haven't messed with it too much on the slower machine, but 5000 or 200000 gives the same result on the slow system.

Second, what is the comparison between a GPU and a CPU system? So I get 240 M/s out of the fast machine and 12.2 M/s out of the slow one. But how do I compare whether I'm better off letting Prime95 do the work or letting the GPU handle it?

BTW - I get a little bit better performance out of the 'mfakto_cl_barrett79' kernel. I don't know how many have reported for HD4xxx cards, but I figured I'd let you know since the .ini file still says to report. The 'mfakto_cl_71' kernel gives between 10 - 11 M/s; the 'mfakto_cl_barrett79' kernel gives me the 12.2 M/s.

flashjh 2011-11-26 06:03

[QUOTE=bcp19;279895]I had a 6770 in my i5 2400 and with 2 mfakto running I was only seeing ~70% GPU load and I believe 100 M/s combined and I really didn't want to use up yet another core to try and max it out, but the 560 Ti I replaced the 6770 with works with 1 core at 165 M/s and ~65% load. If I start a second Mfaktc I can get up to 95% and roughly 230 M/s. The 560 is almost maxed with 1 core, 1 core gives me 175 M/s and 2 only gets it up to 200 M/s combined.[/QUOTE]

How are you starting multiple instances of mfakto and how are you specifying a single core or multiple cores on your CPU for each instance? Each time I start mfakto no matter how I run it, I see usage across all four cores.

flashjh 2011-11-26 06:09

[QUOTE=KyleAskine;279876]Alright, I just started a second instance of mfakto on my linux (5870) box. I now have two instances running at around 30000 primes sieved and 100 M/s.[/QUOTE]

What did you do to fix it?

KyleAskine 2011-11-26 12:36

[QUOTE=flashjh;279944]What did you do to fix it?[/QUOTE]

I didn't. It locked up again. As far as I can tell, I cannot get a second instance of mfakto to run on my linux box without getting defunct threads that I need to reboot to get rid of. I know for a fact that heat is not an issue, so I am stumped.

On my Windows box it seems that I can, though.

bcp19 2011-11-26 14:12

[QUOTE=flashjh;279942]How are you starting multiple instances of mfakto and how are you specifying a single core or multiple cores on your CPU for each instance? Each time I start mfakto no matter how I run it, I see usage across all four cores.[/QUOTE]

Just so you know, 1) I'm running windows, 2) bat files:

g:
cd\mfakto
cmd.exe /k "start /b /low /affinity 0x08 mfakto.exe"

and

g:
cd\mfakto-1
cmd.exe /k "start /b /low /affinity 0x04 mfakto.exe"

Dubslow 2011-11-26 15:27

If you don't want to bother with batch files, you can also set mfakto to run on a specific core by using the Task Manager. Go to the process list (not the application list), right click on mfakto and select 'Set Affinity'. It will have the same effect as using the command line options above.

To run multiple instances, you need to have one subfolder for each instance, each subfolder with its own executable and its own worktodo file.

flashjh 2011-11-26 18:55

[QUOTE=bcp19;279983]Just so you know, 1) I'm running windows, 2) bat files:

g:
cd\mfakto
cmd.exe /k "start /b /low /affinity 0x08 mfakto.exe"

and

g:
cd\mfakto-1
cmd.exe /k "start /b /low /affinity 0x04 mfakto.exe"[/QUOTE]

Ok, that did it -- thank you. I looked up the start affinity stuff just to make sure I 'got it'. The actual breakdown for the affinity is as follows:

[CODE]CPU3 CPU2 CPU1 CPU0    Bin   Hex
OFF  OFF  OFF  ON   = 0001 =  1
OFF  OFF  ON   OFF  = 0010 =  2
OFF  OFF  ON   ON   = 0011 =  3
OFF  ON   OFF  OFF  = 0100 =  4
OFF  ON   OFF  ON   = 0101 =  5
OFF  ON   ON   OFF  = 0110 =  6
OFF  ON   ON   ON   = 0111 =  7
ON   OFF  OFF  OFF  = 1000 =  8
ON   OFF  OFF  ON   = 1001 =  9
ON   OFF  ON   OFF  = 1010 =  A
ON   OFF  ON   ON   = 1011 =  B
ON   ON   OFF  OFF  = 1100 =  C
ON   ON   OFF  ON   = 1101 =  D
ON   ON   ON   OFF  = 1110 =  E
ON   ON   ON   ON   = 1111 =  F[/CODE]
I also modified the batch file line a bit to add affinity for the GPU. So I use two instances, two cores each to max everything:

cmd.exe /k "start /b /low /affinity 3 mfakto.exe -d 11"
cmd.exe /k "start /b /low /affinity C mfakto.exe -d 12"

First one uses cores one and two with my 1st GPU, second uses cores three and four with my 2nd GPU.
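Those masks are easy to double-check in a few lines of Python; `affinity_mask` is just an illustrative helper (not part of mfakto or Windows) that builds the hex value `start /affinity` expects:

```python
# Hypothetical helper: build the hex mask that "start /affinity" expects
# from a list of zero-based core indices. Core i contributes bit 2**i.
def affinity_mask(cores):
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask

print(f"{affinity_mask([0, 1]):X}")  # cores 0+1 -> 3
print(f"{affinity_mask([2, 3]):X}")  # cores 2+3 -> C
print(f"{affinity_mask([3]):X}")     # core 3 alone -> 8 (bcp19's 0x08)
```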

I now can get 130 M/s each with no Prime95. I'm going to play with the sieve a bit to see what I can do. Once I start Prime95 it all goes back to 120 M/s. I guess I need more cores, time for another upgrade.

flashjh 2011-11-27 01:00

[QUOTE=Bdot;279809]Actually, in order to get both the CPU and the GPU to (almost) full load, I have to run 3 mfakto instances and 3 prime95 threads on my quad-core Phenom. This way, the 3 mfakto instances add up to almost 2 CPU cores (with peaks to ~220%). Two of the prime95 threads advance at normal speed, the third is just taking what's left over (~5-10% CPU, i.e. rather crawling along). I don't pin mfakto to any core, I let Windows7 choose.[/QUOTE]

So, what hardware are you using for the whole system? What are your total M/s with the three instances?


[QUOTE=Bdot;279809]I think I'll add a raw performance measurement mode to mfakto, detailing the pure kernel runtime per kernel. This way it would be easier to compare the cards, also to NV/mfaktc. Until then, use tools like GPU-Z, or "aticonfig --od-getclocks | grep load" to find out how much room the GPU still has. I unfortunately have access to only 2 different ATI cards, one of them bound to 11.11 :-([/QUOTE]

Why is it bound to 11.11?


[QUOTE=Bdot;279809]The latter.[/QUOTE]

I always get higher μs times when my CPU is too busy like with Prime95. What exactly is the ave. wait time telling us?

flashjh 2011-11-27 01:20

[QUOTE=Dubslow;279984]If you don't want to bother with batch files, you can also set mfakto to run on a specific core by using the Task Manager. Go to the process list (not the application list) and right click on mfakto and select 'Affinity'. It will have the same effect as using the command line options above.

To run multiple instances, you need to have one subfolder for each instance, each subfolder with its own executable and its own worktodo file.[/QUOTE]

Alright, thanks for the tip. I discovered that I can use your suggestion to set the affinity of the instances while they're running so I don't have to open and close them over and over -- once I find a good setting, I can update the batch files to use my new settings. Thanks!

Dubslow 2011-11-27 01:47

[QUOTE=flashjh;280040]So, what hardware are you using for the whole system? What are your total M/s with the three instances?[/QUOTE]

Keep in mind that M/s is not necessarily the best comparison, because M/s changes depending on SievePrimes without affecting actual throughput. Time per class for a similar assignment is a better metric. See below:

[QUOTE=flashjh;280040]
I always get higher μs times when my CPU is too busy like with Prime95. What exactly is the ave. wait time telling us?[/QUOTE]

Going with the above, SievePrimes determines how much work is done on the CPU before being sent to the GPU. Essentially, the CPU eliminates ('sieves') out factor candidates that are not prime. The higher SievePrimes is, the more work the CPU does, and the more candidates are eliminated as being composite. The candidates that are not eliminated by the sieve are tested for being a factor on the GPU. Avg. wait tells how long the CPU must wait for the GPU before doing more sieving. If it is less than ~100 μs, then the CPU is being overloaded and out-powered by the GPU. To rectify this, decrease SievePrimes (which shifts more work to the GPU rather than CPU) or run more than one mfakto instance. If Avg. wait is greater than 1000 or 2000 μs, then the process is bottlenecked by the GPU; the CPU is doing a lot of waiting. Fix this by increasing SievePrimes (which shifts more work to the CPU) or run fewer instances.

Most people will generally find avg. wait to be very low, i.e. mfakto/c tend to be bottlenecked by the CPU. This is due to the fact that most GPUs are an order of magnitude (or more) more powerful than most CPUs. For instance, one core of my i7-2600k (high-end processor) can just barely keep pace with a GTX 460 (mid-range GPU). With SievePrimes=5000 (the lowest it will go) I get average wait times around 100-150 μs (and every once in a while it drops below 100).

Another way to check for CPU overload is if the load on the GPU is less than 90-95%. GPU-Z is a good program for that. If you're at less than 90% load, check your SievePrimes and Avg. wait and consider running more than one thread.
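The division of labor described above can be sketched in a few lines of plain Python (no OpenCL; `find_factors` is an illustrative name, and the real programs sieve far deeper and apply further filters to the candidates):

```python
# Toy version of mfakto's pipeline: candidate factors of 2^p - 1 have the
# form f = 2*k*p + 1. The CPU "sieves" out candidates divisible by small
# primes; survivors get the real test 2^p mod f == 1 (the GPU's job).
def find_factors(p, k_limit, sieve_primes):
    factors = []
    for k in range(1, k_limit + 1):
        f = 2 * k * p + 1
        if any(f % q == 0 for q in sieve_primes):   # CPU side: cheap sieve
            continue
        if pow(2, p, f) == 1:                       # GPU side: f | 2^p - 1
            factors.append(f)
    return factors

# M29 = 2^29 - 1 = 233 * 1103 * 2089, all factors of the form 2*k*29 + 1
print(find_factors(29, 100, [3, 5, 7, 11, 13]))     # -> [233, 1103, 2089]
```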

Dubslow 2011-11-27 02:01

I just went back and looked at your hardware; as far as I know 5870s are better than a 460. The processor definitely isn't the best, though it is overclocked. I would not be surprised if you needed two threads per GPU to fully saturate them, and since you have two cards (a [i]lot[/i] of firepower) you might need all four CPU cores to saturate the GPUs. With the above post and your newfound affinity skills, run a bit more testing and see what numbers you get.

flashjh 2011-11-27 02:02

[QUOTE=Dubslow;280045]Keep in mind that M/s is not necessarily the best comparison, because M/s changes depending on SievePrimes without affecting actual throughput. Time per class for a similar assignment is a better metric. See below:



Going with the above, SievePrimes determines how much work is done on the CPU before being sent to the GPU. Essentially, the CPU eliminates ('sieves') out factor candidates that are not prime. The higher SievePrimes is, the more work the CPU does, and the more candidates are eliminated as being composite. The candidates that are not eliminated by the sieve are tested for being a factor on the GPU. Avg. wait tells how long the CPU must wait for the GPU before doing more sieving. If it is less than ~100 μs, then the CPU is being overloaded and out-powered by the GPU. To rectify this, decrease SievePrimes (which shifts more work to the GPU rather than CPU) or run more than one mfakto instance. If Avg. wait is greater than 1000 or 2000 μs, then the process is bottlenecked by the GPU; the CPU is doing a lot of waiting. Fix this by increasing SievePrimes (which shifts more work to the CPU) or run less instances.

Most people will generally find avg. wait to be very low, i.e. mfakto/c tend to be bottlenecked by the CPU. This is due to the fact that most GPUs are an order of magnitude more powerful than most CPUs. For instance, one core of my i7-2600k (high-end processor) can just barely keep pace with a GTX 460 (low-mid range GPU). With SievePrimes=5000 (the lowest it will go) I get average wait times around 100-150 μs.

Another way to check for CPU overload is if the load on the GPU is less than 90-95%. GPU-Z is a good program for that. If you're at less than 90% load, check your SievePrimes and Avg. wait and consider running more than one thread.[/QUOTE]

Thanks for the explanation... I have another question then. So, I understand on my system that the two 5870s outclass my X9650. But, I keep the SievePrimes at 5000 on both mfakto instances because otherwise the M/s drop way down. From what you're saying though, I may be out of balance, correct? So how to find the right balance between mfakto instances and CPU load?

It seems no matter what I try I can't get my GPUs above 70%. Also, as I have it now my ave. wait times are always 0 except a rare jump to 20 - 200 μs. I must be doing something wrong but I'm not sure what to adjust first. More instances with higher sieve maybe?

How do I determine/get maximum throughput?

flashjh 2011-11-27 02:07

[QUOTE=Dubslow;280046]I just went back and looked at your hardware; as far as I know 5870s are better than a 460. The processor definitely isn't the best, though it is overclocked. I would not be surprised if you needed two threads per GPU to fully saturate them, and since you have two cards (a [I]lot[/I] of firepower) you might need all four CPU cores to saturate the GPUs. With the above post and your newfound affinity skills, run a bit more testing and see what numbers you get.[/QUOTE]

We cross-posted a bit, but I'll expand my question just a bit with your info here.

So, is shooting for the lowest ETA on each TF the best throughput, or is there something more to look at? How do you maximize throughput?

[QUOTE=Dubslow;280046]I would not be surprised if you needed two threads per GPU to fully saturate them[/QUOTE]

BTW - You are correct! I all but confirmed that my CPU cannot max out the GPUs at the same time. Even all four cores don't seem to put a GPU at 100%. When I run two instances of mfakto I need two cores each to get 65% on each GPU. I'll keep testing. Thanks again for the help.

Dubslow 2011-11-27 02:08

Maximum throughput is determined by time per class. (You'll notice that if you increase SievePrimes and M/s drops a lot, time per class should drop less than M/s; it still will drop however because you are shifting more work to the CPU.) When you say both mfakto instances, you mean one for each card? If you mean two for just one card and SievePrimes is already as low as possible (5000) then there's nothing you can do except run a third instance, if your CPU isn't maxed out already. You're right, the avg. wait times suggest that mfakto is limited by the CPU; since you can't decrease SievePrimes anymore, the only fix is more instances.

Edit: Whoops, cross posting. In mfaktc at least, one of the columns printed to the CLI is "Time per class". If you can't find that, then ETA+classes complete is the next best way to determine overall throughput. If you're comparing two instances with the same SievePrimes, then you can compare M/s. If they're different SievePrimes you need to look at time per class.

Edit 2: Can you run two mfakto instances on one GPU, set the affinities (to make sure it's only using two cores of your CPU) and then post their SievePrimes and avg. wait? It sounds like even this won't be pretty...

Edit 3: I will come back in ten minutes to avoid more cross posting.
Edit 4: Just read your last post in more detail. 2 cores gets one GPU to 65% load? I would still be interested in the numbers.

flashjh 2011-11-27 02:34

[QUOTE=Dubslow;280050]Maximum throughput is determined by time per class. (You'll notice that if you increase SievePrimes and M/s drops a lot, time per class should drop less than M/s; it still will drop however because you are shifting more work to the CPU.) When you say both mfakto instances, you mean one for each card? If you mean two for just one card and SievePrimes is already as low as possible (5000) then there's nothing you can do except run a third instance, if your CPU isn't maxed out already. You're right, the avg. wait times suggest that mfakto is limited by the CPU; since you can't decrease SievePrimes anymore, the only fix is more instances.

Edit: Whoops, cross posting. In mfaktc at least, one of the columns printed to the CLI is "Time per class". If you can't find that, then ETA+classes complete is the next best way to determine overall throughput. If you're comparing two instances with the same SievePrimes, then you can compare M/s. If they're different SievePrimes you need to look at time per class.

Edit 2: Can you run two mfakto instances on one GPU, set the affinities (to make sure it's only using two cores of your CPU) and then post their SievePrimes and avg. wait? It sounds like even this won't be pretty...

Edit 3: I will come back in ten minutes to avoid more cross posting.
Edit 4: Just read your last post in more detail. 2 cores gets one GPU to 65% load? I would still be interested in the numbers.[/QUOTE]

Sorry for the cross post. Anyhow, I've got two instances on one GPU with 2 cores. Results: ~5.2 sec per status line, with between 1 and 5 classes completed each time. Avg. wait is about 3900 μs now. M/s is 40 on each instance; I was getting about 125 M/s before. GPU load is 41%. SievePrimes auto-adjusted itself on both instances from 5000 to 200,000. What do you think?

Dubslow 2011-11-27 02:50

The autoadjust is screwing you up. Turn it off and see what you get with SievePrimes=5000, and then try 10,000 since it's trying so hard to get it higher. Do you not have a time per class column?

flashjh 2011-11-27 02:55

[QUOTE=Dubslow;280052]The autoadjust is screwing you up. Turn it off and see what you get with SievePrimes=5000, and then try 10,000 since it's trying so hard to get it higher. Do you not have a time per class column?[/QUOTE]

Ok, I'll shut it off... no, mfakto does not have a time per class column.

Dubslow 2011-11-27 02:57

[QUOTE=flashjh;280053]Ok, I'll shut it off... no, mfakto does not have a time per class column.[/QUOTE]

Hmm... Bdot? Am I remembering mfaktc wrong? (I do not have access to my comp ATM)

flashjh 2011-11-27 03:11

1 Attachment(s)
[QUOTE=Dubslow;280054]Hmm... Bdot? Am I remembering mfaktc wrong? (I do not have access to my comp ATM)[/QUOTE]

Attached a screenshot of 5000.

Also, I forced 5000 and 10000. 5000 with 2 instances both on the same GPU and the same two CPU cores puts GPU at 62%. If I change it to 10000 the GPU drops to around 58%. CPU cores are just shy of 100% either way.

Dubslow 2011-11-27 03:19

Hmm. That is really odd that the load is at 62% even though avg. wait is so high. Bdot? Maybe you should try letting the CPU sleep?

flashjh 2011-11-27 03:30

1 Attachment(s)
[QUOTE=Dubslow;280058]Hmm. That is really odd that the load is at 62% even though avg. wait is so high. Bdot? Maybe you should try letting the CPU sleep?[/QUOTE]

If I let it sleep it probably won't wake up again.

After testing everything, I think it's clear that my CPU and board architecture just can't keep up with the GPUs. Not that it's a bad thing, but I think that for optimum mfakto it's two cores per instance, one instance for each GPU with sieve at 5000 -- thoughts?

I attached a screen shot of one of the instances with that setup. CPU sits at about 85%, GPUs are at about 64%.

TheJudger 2011-11-27 12:57

[QUOTE=Dubslow;280045]Keep in mind that M/s is not necessarily the best comparison, because M/s changes depending on SievePrimes without affecting actual throughput. Time per class for a similar assignment is a better metric.[/QUOTE]

Correct!

[QUOTE=Dubslow;280045]Going with the above, SievePrimes determines how much work is done on the CPU before being sent to the GPU. Essentially, the CPU eliminates ('sieves') out factor candidates that are not prime. The higher SievePrimes is, the more work the CPU does, and the more candidates are eliminated as being composite. The candidates that are not eliminated by the sieve are tested for being a factor on the GPU. Avg. wait tells how long the CPU must wait for the GPU before doing more sieving. If it is less than ~100 μs, then the CPU is being overloaded and out-powered by the GPU. To rectify this, decrease SievePrimes (which shifts more work to the GPU rather than CPU) or run more than one mfakto instance. If Avg. wait is greater than 1000 or 2000 μs, then the process is bottlenecked by the GPU; the CPU is doing a lot of waiting. Fix this by increasing SievePrimes (which shifts more work to the CPU) or run less instances.[/QUOTE]

Again correct (and well written)!

[QUOTE=Dubslow;280050]Edit: Whoops, cross posting. In mfaktc at least, one of the columns printed to the CLI is "Time per class". If you can't find that, then ETA+classes complete is the next best way to determine overall throughput. If you're comparing two instances with the same SievePrimes, then you can compare M/s. If they're different SievePrimes you need to look at time per class.
[/QUOTE]

Same SievePrimes and the same exponent are needed for a perfect comparison. To be honest, it is usually OK if the exponents are around the same size.

Oliver

diamonddave 2011-11-27 15:15

Correct me if I'm wrong, but I don't think the actual siever is multithreaded, so you won't sieve deeper or more efficiently if you assign more than one CPU core per instance.

So more than 1 CPU per mfakto should not help. Try running 1 or 2 more instances of mfakto instead.

I'll take my system as an example:

[CODE]Instances  Exponent  SievePrimes  M/s per instance  ETA     Avg. Wait  M/s system  Exp tested per hour
1          26.8M     9000         160M              26m40s  500us      160M        2.25
2          26.8M     26000        120M              33m50s  500us      240M        3.55
3          26.8M     56000        80M               44m15s  600us      240M        4.06[/CODE]

Although I would get more Ghz/Day by running a 3rd instance, Prime95 actually get to have my last 2 cores.
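The "Exp tested per hour" column follows directly from the others: n parallel instances, each finishing an assignment in the listed ETA, complete n*60/ETA_minutes exponents per hour. A quick check in Python (illustrative only; `exponents_per_hour` is not an mfakto function):

```python
# Reproduce diamonddave's throughput column: n parallel instances, each
# taking eta_minutes per exponent, finish n * 60 / eta_minutes per hour.
def exponents_per_hour(instances, eta_minutes):
    return instances * 60.0 / eta_minutes

for n, eta in [(1, 26 + 40 / 60), (2, 33 + 50 / 60), (3, 44 + 15 / 60)]:
    print(n, round(exponents_per_hour(n, eta), 2))
# 1 instance: 2.25/hour; 2: ~3.55/hour; 3: ~4.07/hour (table rounding)
```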

flashjh 2011-11-27 16:33

[QUOTE=diamonddave;280091]Correct me if I'm wrong, but I don't think the actual siever is multithreaded, you won't sieve deeper or more efficiently if you assign more than 1 CPU per instance.

So more than 1 CPU per mfakto should not help. Try having 1 or 2 more instance of mfacto running.

I'll take my system as an example:

[CODE]Instances  Exponent  SievePrimes  M/s per instance  ETA     Avg. Wait  M/s system  Exp tested per hour
1          26.8M     9000         160M              26m40s  500us      160M        2.25
2          26.8M     26000        120M              33m50s  500us      240M        3.55
3          26.8M     56000        80M               44m15s  600us      240M        4.06[/CODE]

Although I would get more Ghz/Day by running a 3rd instance, Prime95 actually get to have my last 2 cores.[/QUOTE]

I can't speak to the multithreaded siever, but if I only assign 1 core per instance, it way underpowers the GPU.

Thanks everyone for the inputs. I was able to max the CPU and push the GPUs fairly hard with three instances, but in the end my CPU doesn't have the ability to maximize the GPUs.

All in all I can run two instances for about 240 M/s along with Prime95. I'm happy with that. mfakto doesn't have a time per class, per se, but with these settings I get the best throughput and fastest times.

Dubslow 2011-11-27 22:13

For whatever reason, the Windows task scheduler seems to be able to put both cores to use on a single thread, and it seems he's gotten the most effective results this way.

Bdot 2011-11-28 00:33

[QUOTE=flashjh;280059]If I let it sleep it probably won't wake up again.

After testing everything, I think it's clear that my CPU and board architechture just can't keep up with the GPUs. Not that is a bad thing, but I think that for optimum mfakto it's two cores per instance, one instance for each GPU with sieve at 5000 -- thoughts?

I attached a screen shot of one of the instances with that setup. CPU sits at about 85%, GPUs are at about 64%.[/QUOTE]

The "time" column is the time per class. Each printed line is a class that was tested. Jumps in the class number mean there were classes that do not need to be tested (because there are no primes in that class as all are divisible by 11, for instance).

For good throughput I suggest running two instances per GPU, without limiting the affinity. The scheduler should sort out what is available. The siever is single-threaded, but as mfakto (rather, the OpenCL runtime) uses a background thread to drive the GPU, you'd never get best performance when setting the affinity to just one core. Driving the GPU and preparing the next set of factor candidates would be serialized on one core, adding big delays (also seen as high wait times). When allowing two cores per instance, the cores will not be fully loaded all the time, but sometimes it is just better to let the two threads per instance run in parallel. Fixing SievePrimes in this mode should not be necessary, but setting it to 5000 may leave some room for one or two prime95 threads.

I guess you'd get even more throughput (tests per day) when running 3 instances per GPU, not limiting the affinity, SievePrimesAutoAdjust, no prime95.

BTW, when just testing this, you don't need to create extra directories. You can use the -tf parameter and specify different exponents for each instance in the same directory:

[CODE]start mfakto -d 11 -tf 50017789 69 70
start mfakto -d 11 -tf 50019539 69 70
start mfakto -d 11 -tf 50024621 69 70
start mfakto -d 12 -tf 50030767 69 70
start mfakto -d 12 -tf 50031103 69 70
start mfakto -d 12 -tf 50034529 69 70[/CODE]

They will share mfakto.ini and results.txt, but not use any worktodo file.


Regarding the CPU-sleep: mfakto always puts the CPU to sleep when the CPU is waiting for the GPU - the AllowSleep parameter is not used in mfakto. But testing a busy-wait like the one you can choose in mfaktc is on my test wish list ...


And regarding the HD6870 machine that is bound to 11.11: This is not my box, and is mainly used for gaming ... :smile: On mission-critical systems like this you can't just replace a driver and risk stability issues :grin:

Dubslow 2011-11-28 04:34

Ooh that post explains a lot. Is that bit about thread design the same for mfaktc?

bcp19 2011-11-28 06:29

[QUOTE=flashjh;280097]I can't speak to the multithreaded siever, but if I only assign 1 core per instance, it way underpowers the GPU.

Thanks everyone for the inputs. I was able to max the CPU and push the GPUs fairly hard with three instances, but in the end my CPU doesn't have the ability to maximize the GPUs.

All-in-all I can run two instances for about 240 M/s along with Prime95. I'm happy with that. mfakto doesn't have a time per class, per se, but with these settings I get the best throughout and fastest times.[/QUOTE]

One other thing you can try, which I did on my system, is adjusting your NumStreams and CPUStreams. I noticed with one core and 3/3 streams that I'd have a way-low avg. wait, then way high, repeating. I changed it to 4/4 and avg. wait stayed low. When I went to 2 instances, I changed it to 5/5. Two instances will now max out both my 560 and 560 Ti; the changes to streams and locking the sieve seemed to help a lot. The 560 was doing 175 M/s with 1 and 200 M/s with 2 mfaktc running, though this did not max the GPU, but setting #1 at a 5k sieve and #2 at a 60k sieve with the 5/5 streams, I got a combined 238 M/s and 99% GPU load.

Bdot 2011-11-28 07:34

[QUOTE=Dubslow;280165]Ooh that post explains a lot. Is that bit about thread design the same for mfaktc?[/QUOTE]

No, CUDA does not start another thread, mfaktc is single-threaded. As there is certainly some parallelism required for CUDA as well, I assume the NV driver is hiding that completely.

kjaget 2011-11-28 16:50

Seems I responded too soon, but hopefully this is still good info.

[QUOTE=flashjh;280047]Thanks for the explanation... I have another question then. So, I understand on my system that the two 5870s outclass my X9650. But, I keep the SievePrimes at 5000 on both mfakto instances because otherwise the M/s drop way down.[/QUOTE]

Again, this is not a good measure of performance. M/s drops because the CPU is filtering out some useless work that the GPU is otherwise forced to do when you lower the SievePrimes value. The higher M/s comes from forcing the GPU to test candidates that would have been identified as composite and skipped if you let the CPU sieve a bit more.

Look at average time per class instead.

[QUOTE]How do I determine/get maximum throughput?[/QUOTE]

Turn on sieve primes adjust or whatever it's called. Run 1 instance, let it stabilize, and see what per-class time it settles on. Take the inverse of this - this is the number of classes per second, which directly correlates to exponents per unit time, which is a measure of throughput.

Run two instances of the same exponent, let the per-class times stabilize. Do (1/time_1) + (1/time_2) - this is the number of classes both combined instances will do per unit time. If it's higher than the 1-CPU instance, you're doing more work overall with 2 instances.

Repeat with 3 instances. Do (1/time_1)+(1/time_2)+(1/time_3). See if this is higher than the other two cases.

Eventually you'll run into the situation where you've loaded up on the GPU and so adding more CPUs gives the same throughput (actually probably a bit less since there's some overhead switching between the various runs). That will tell you when to stop adding CPUs. Probably a bit before that, because it's really unlikely that the GPU requires an exact multiple of CPUs to max it out. You'll probably find something like 2 CPUs gets you 90% of the performance of 3 CPUs because 2 almost but not quite max out the GPU. In that case, it's a question of whether that last 10% of performance is worth giving up a CPU which could be doing LL tests instead.
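This bookkeeping is easy to script; `combined_rate` is an illustrative name and the per-class times below are invented numbers for the sake of the example, not measurements:

```python
# Sum of inverse per-class times = total classes per second for a set of
# concurrent instances; compare this value across configurations.
def combined_rate(times_per_class_sec):
    return sum(1.0 / t for t in times_per_class_sec)

one = combined_rate([5.0])                  # 0.20 classes/s
two = combined_rate([7.0, 7.0])             # ~0.29 classes/s: worth it
three = combined_rate([10.0, 10.0, 10.0])   # 0.30 classes/s: barely more
print(one, two, three)
```

If the three-instance rate is only marginally higher than the two-instance rate, that third core is probably better spent on an LL test, exactly as described above.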

KyleAskine 2011-11-29 10:20

mfakto freezing with multiple instances in linux - fixed??
 
I think I may have fixed my earlier problem. I think my NumStreams was set way too high (it was at 5, I don't know why). I lowered it to two, and it has been running two instances for half an hour just fine, which is way longer than it has ever run in the past. I will let it run all day today, and if it goes well, I might bump it up to three instances!

So I guess the lesson is (or may be, if it actually works): more instances --> fewer streams.

KyleAskine 2011-11-29 11:27

Spoke too soon! It crashed again. I may try lowering vectors to two when I get home, but I am officially stumped.

Bdot 2011-11-30 09:38

[QUOTE=KyleAskine;280378]Spoke too soon! It crashed again. I may try lowering vectors to two when I get home, but I am officially stumped.[/QUOTE]
Lowering vectors and lowering GridSize will have the same effect: reduced runtime per kernel but running more kernels. However, lowering GridSize will (almost) keep the efficiency while lowering vectors will reduce it much more. Still, you can test it of course.

Anyway, I don't think any of these will be a permanent solution. Did you already check /var/log/messages? The kernel usually logs something when a hang occurs ...

Can you try downclocking the GPU(s)? In case downclocking helps, this could hint at some hardware issue as the GPU does not live up to its specs.

KyleAskine 2011-11-30 16:49

[QUOTE=Bdot;280514]
Anyway, I don't think any of these will be a permanent solution. Did you already check /var/log/messages? The kernel usually logs something when a hang occurs ...
[/QUOTE]

The most recent hang:

[CODE]Nov 29 17:24:09 kyleserv kernel: [40370.931288] [fglrx] ASIC hang happened
Nov 29 17:24:09 kyleserv kernel: [40370.931296] Pid: 5923, comm: mfakto Tainted: P 2.6.32-5-amd64 #1
Nov 29 17:24:09 kyleserv kernel: [40370.931297] Call Trace:
Nov 29 17:24:09 kyleserv kernel: [40370.931434] [<ffffffffa01a000c>] ? firegl_hardwareHangRecovery+0x1c/0x50 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931483] [<ffffffffa0223344>] ? _ZN18mmEnginesContainer9timestampEP26_QS_MM_TIMESTAMP_PACKET_INP27_QS_MM_TIMESTAMP_PACKET_OUT+0x184/0x1c0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931528] [<ffffffffa02393a0>] ? _ZN7PM4Ring9PM4submitEPPjb+0xb0/0x1d0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931560] [<ffffffffa01bc222>] ? firegl_trace+0x72/0x1e0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931603] [<ffffffffa022e6b3>] ? _ZN8mmEngine9timestampEv+0x63/0x90 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931644] [<ffffffffa0223330>] ? _ZN18mmEnginesContainer9timestampEP26_QS_MM_TIMESTAMP_PACKET_INP27_QS_MM_TIMESTAMP_PACKET_OUT+0x170/0x1c0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931677] [<ffffffffa01bebf0>] ? firegl_cmmqs_TSExpired+0x0/0xd0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931716] [<ffffffffa0203e3a>] ? IsThreadTSExpired+0xca/0x110 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931748] [<ffffffffa01bec4b>] ? firegl_cmmqs_TSExpired+0x5b/0xd0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931778] [<ffffffffa01937a3>] ? KCL_WAIT_Add_Exclusive+0x6c/0x74 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931809] [<ffffffffa01aa9fa>] ? irqmgr_wrap_wait_for_hifreq_interrupt_ex+0xba/0x3d0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931840] [<ffffffffa01a8d7b>] ? MCIL_SuspendThread+0xdb/0x120 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931882] [<ffffffffa020d662>] ? _ZN2OS13suspendThreadEj+0x22/0x30 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931922] [<ffffffffa020635f>] ? CMMQSWaitOnTsSignal+0xaf/0xd0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931963] [<ffffffffa0215a72>] ? _Z8uCWDDEQCmjjPvjS_+0xc32/0x10c0 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.931995] [<ffffffffa01be732>] ? firegl_cmmqs_CWDDE_32+0x332/0x440 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.932027] [<ffffffffa01bd060>] ? firegl_cmmqs_CWDDE32+0x70/0x100 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.932059] [<ffffffffa01bcff0>] ? firegl_cmmqs_CWDDE32+0x0/0x100 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.932088] [<ffffffffa019bc18>] ? firegl_ioctl+0x1e8/0x250 [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.932092] [<ffffffff81041298>] ? pick_next_task_fair+0xca/0xd6
Nov 29 17:24:09 kyleserv kernel: [40370.932121] [<ffffffffa01921f6>] ? ip_firegl_unlocked_ioctl+0x9/0xd [fglrx]
Nov 29 17:24:09 kyleserv kernel: [40370.932125] [<ffffffff810fab66>] ? vfs_ioctl+0x21/0x6c
Nov 29 17:24:09 kyleserv kernel: [40370.932127] [<ffffffff810fb0b4>] ? do_vfs_ioctl+0x48d/0x4cb
Nov 29 17:24:09 kyleserv kernel: [40370.932130] [<ffffffff810740cc>] ? sys_futex+0x113/0x131
Nov 29 17:24:09 kyleserv kernel: [40370.932131] [<ffffffff810fb143>] ? sys_ioctl+0x51/0x70
Nov 29 17:24:09 kyleserv kernel: [40370.932134] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Nov 29 17:24:09 kyleserv kernel: [40370.932137] pubdev:0xffffffffa03f3d90, num of device:1 , name:fglrx, major 8, minor 88.
Nov 29 17:24:09 kyleserv kernel: [40370.932139] device 0 : 0xffff88022d8e0000 .
Nov 29 17:24:09 kyleserv kernel: [40370.932140] Asic ID:0x6898, revision:0x2, MMIOReg:0xffffc90011940000.
Nov 29 17:24:09 kyleserv kernel: [40370.932142] FB phys addr: 0xd0000000, MC :0xf00000000, Total FB size :0x40000000.
Nov 29 17:24:09 kyleserv kernel: [40370.932144] gart table MC:0xf0f8fd000, Physical:0xdf8fd000, size:0x402000.
Nov 29 17:24:09 kyleserv kernel: [40370.932145] mc_node :FB, total 1 zones
Nov 29 17:24:09 kyleserv kernel: [40370.932147] MC start:0xf00000000, Physical:0xd0000000, size:0xfd00000.
Nov 29 17:24:09 kyleserv kernel: [40370.932149] Mapped heap -- Offset:0x0, size:0xf8fd000, reference count:18, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932151] Mapped heap -- Offset:0x0, size:0x1000000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932152] Mapped heap -- Offset:0xf8fd000, size:0x403000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932154] mc_node :INV_FB, total 1 zones
Nov 29 17:24:09 kyleserv kernel: [40370.932155] MC start:0xf0fd00000, Physical:0xdfd00000, size:0x30300000.
Nov 29 17:24:09 kyleserv kernel: [40370.932157] Mapped heap -- Offset:0x302f4000, size:0xc000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932158] mc_node :GART_USWC, total 2 zones
Nov 29 17:24:09 kyleserv kernel: [40370.932160] MC start:0x40100000, Physical:0x0, size:0x50000000.
Nov 29 17:24:09 kyleserv kernel: [40370.932162] Mapped heap -- Offset:0x0, size:0x2000000, reference count:22, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932163] mc_node :GART_CACHEABLE, total 3 zones
Nov 29 17:24:09 kyleserv kernel: [40370.932164] MC start:0x10400000, Physical:0x0, size:0x2fd00000.
Nov 29 17:24:09 kyleserv kernel: [40370.932166] Mapped heap -- Offset:0x1c00000, size:0x200000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932168] Mapped heap -- Offset:0x1a00000, size:0x200000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932170] Mapped heap -- Offset:0xb00000, size:0xf00000, reference count:5, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932172] Mapped heap -- Offset:0x200000, size:0x900000, reference count:8, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932174] Mapped heap -- Offset:0x0, size:0x200000, reference count:10, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932176] Mapped heap -- Offset:0xef000, size:0x11000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932177] Mapped heap -- Offset:0x282000, size:0x281000, reference count:1, mapping count:0,
Nov 29 17:24:09 kyleserv kernel: [40370.932180] GRBM : 0x3828, SRBM : 0x200000c0 .
Nov 29 17:24:09 kyleserv kernel: [40370.932183] CP_RB_BASE : 0x401000, CP_RB_RPTR : 0x36a80 , CP_RB_WPTR :0x36a80.
Nov 29 17:24:09 kyleserv kernel: [40370.932186] CP_IB1_BUFSZ:0x0, CP_IB1_BASE_HI:0x0, CP_IB1_BASE_LO:0x40489000.
Nov 29 17:24:09 kyleserv kernel: [40370.932188] last submit IB buffer -- MC :0x40489000,phys:0x223c16000.
Nov 29 17:24:09 kyleserv kernel: [40370.932189] Dump the trace queue.
Nov 29 17:24:09 kyleserv kernel: [40370.932190] End of dump[/CODE]

Bdot 2011-11-30 23:28

[QUOTE=KyleAskine;280537]The most recent hang:

[CODE]Nov 29 17:24:09 kyleserv kernel: [40370.931288] [fglrx] ASIC hang happened
[/CODE][/QUOTE]

This appears when a kernel locks the GPU for more than around one or two seconds. This should never happen with mfakto, especially on fast cards like yours.
Try reducing the memory clock to the minimum (almost zero effect on mfakto anyway), and also reduce the GPU clock by 5 or 10% for a test.

KyleAskine 2011-11-30 23:48

[QUOTE=Bdot;280582]This appears when a kernel locks the GPU for more than around one or two seconds. This should never happen with mfakto, especially on fast cards like yours.
Try reducing the memory clock to the minimum (almost zero effect on mfakto anyway), and also reduce the GPU clock by 5 or 10% for a test.[/QUOTE]

Core clock is now 750 and mem clock is 900.

KyleAskine 2011-12-01 03:48

[QUOTE=KyleAskine;280586]Core clock is now 750 and mem clock is 900.[/QUOTE]

Crashed again in the same way.

Bdot 2011-12-01 09:14

[QUOTE=KyleAskine;280606]Crashed again in the same way.[/QUOTE]
Well, send it back to AMD for replacement ... or at least open a case with them, but I guess it's a H/W issue then.

KyleAskine 2011-12-01 11:47

[QUOTE=Bdot;280645]Well, send it back to AMD for replacement ... or at least open a case with them, but I guess it's a H/W issue then.[/QUOTE]

Thanks a bunch for your help!!

Very seriously, I am very grateful not only for your help, but for mfakto in general.

I suspect that if everything were correctly optimized, we might be able to pull as much or more GHz/$ out of AMD.

KyleAskine 2011-12-02 12:58

For real this time...
 
I fixed my issue!

I started running two instances of mfakto last night at around 6pm, and as of this morning at 7am they are still running!

The secret as far as I can tell:

<facepalm>

Run each instance in a real separate terminal window.

</facepalm>

What I was doing: running everything in one screen session, which allowed me to control that session through putty.

I also reinstalled the 11.9 drivers, so it is possible that this is not the reason, but I suspect it is.

Bdot 2011-12-02 14:19

[QUOTE=KyleAskine;280777]I fixed my issue!

I started running two instances of mfakto last night at around 6pm, and as of this morning at 7am they are still running!

The secret as far as I can tell:

<facepalm>

Run each instance in a real separate terminal window.

</facepalm>

What I was doing: running everything in one screen session, which allowed me to control that session through putty.

I also reinstalled the 11.9 drivers, so it is possible that this is not the reason, but I suspect it is.[/QUOTE]

Interesting ... and quite strange. When I hear that, my guess would rather be the driver re-installation than the session separation ... if you want to lock up your system one more time, you can try the old session mode with the reinstalled drivers, just to be sure about that.

I also have some good news: I found a rather small code change that makes mfakto work again on 11.10 and 11.11, with almost no performance impact (pass the data to the kernel wrapped in a cl_uint8). A little testing over the weekend (if time permits), and I can release it next week.

flashjh 2011-12-02 14:27

[QUOTE=Bdot;280787]Interesting ... and quite strange. When I hear that, my guess would rather be the driver re-installation than the session separation ... if you want to lock up your system one more time, you can try the old session-mode with the reinstalled drivers, just to be sure on that.

I also have some good news: I found a rather small code change that makes mfakto work again on 11.10 and 11.11, with almost no performance impact (pass the data to the kernel wrapped in a cl_uint8). A little testing over the weekend (if time permits), and I can release it next week.[/QUOTE]

Awesome! Thanks for your work on this.

KyleAskine 2011-12-02 22:30

[QUOTE=Bdot;280787]Interesting ... and quite strange. When I hear that, my guess would rather be the driver re-installation than the session separation ... if you want to lock up your system one more time, you can try the old session-mode with the reinstalled drivers, just to be sure on that.[/QUOTE]

I will do that, but probably not until next week, because I may not have a couple hours to myself until then to make sure I am on it when it locks up :smile:

Thanks again for all your work!

KyleAskine 2011-12-06 01:31

[QUOTE=KyleAskine;280838]I will do that, but probably not until next week, because I may not have a couple hours to myself until then to make sure I am on it when it locks up :smile:

Thanks again for all your work![/QUOTE]

Well, I have run the usual culprits (2x mfakto, 1x mprime, 1x top) in a screen session for a few hours now with no crashes. So I guess it was the drivers. I will keep it running overnight for observation :smile:

Be sure to tell everyone using linux to use 11.9 :smile:

Bdot 2011-12-06 13:30

[QUOTE=KyleAskine;281165]

Be sure to tell everyone using linux to use 11.9 :smile:[/QUOTE]

And I thought I did:

Readme:
[code]
Install Catalyst driver, version >= 11.7

Catalyst driver 11.9 uses up to one CPU core less than its predecessors:
11.9 strongly recommended.
[/code][URL="http://mersennewiki.org/index.php/Mfakto"]Wiki[/URL]:
[code]
AMD Catalyst drivers 11.4 or higher, 11.9 recommended
[/code]Note: the minimum versions disagree because I only recently found out that Catalyst versions as low as 11.4 work, and I updated the Wiki accordingly ... but if nobody reads it anyway ...

KyleAskine 2011-12-06 15:18

[QUOTE=Bdot;281225]And I thought I did:

Readme:
[code]
Install Catalyst driver, version >= 11.7

Catalyst driver 11.9 uses up to one CPU core less than its predecessors:
11.9 strongly recommended.
[/code][URL="http://mersennewiki.org/index.php/Mfakto"]Wiki[/URL]:
[code]
AMD Catalyst drivers 11.4 or higher, 11.9 recommended
[/code]Note: the minimum versions disagree because I only recently found out that Catalyst versions as low as 11.4 work, and I updated the Wiki accordingly ... but if nobody reads it anyway ...[/QUOTE]

:redface::redface::redface:

Bdot 2011-12-06 22:18

[QUOTE=KyleAskine;281237]:redface::redface::redface:[/QUOTE]
:grin: :grin: :grin:

Anyway, testing the new mfakto version on Windows with 11.9 and 11.11 went well and everything seemed fine. Today I updated my Linux box to 11.11, and guess what I get within minutes ...
[code]
[fglrx] ASIC hang happened
[/code]The same thing as you described, even at the lowest frequency settings. And of course I did not manage to go back to 11.9 right away - at the moment it seems as if I have no OpenCL driver at all ... Development with OpenCL is not exactly fun :rant:

Bdot 2011-12-07 14:45

[QUOTE=Bdot;281310]:grin: :grin: :grin:

Anyway, testing the new mfakto version on Windows with 11.9 and 11.11 went well and everything seemed fine. Today I updated my Linux box to 11.11, and guess what I get within minutes ...
[code]
[fglrx] ASIC hang happened
[/code][/QUOTE]

I found AMD has updated the docs to their APPSDK of August:
[code]
For a successful build and correct operation of individually downloaded samples when using the Linux
Catalyst 11.11 drivers or later and the Linux AMD APP SDK 2.5, delete the old 2.5 runtime
files from /opt/AMDAPP/ .
To do this for 32-bit Linux OS:
a. Go to /opt/AMDAPP/lib/x86.
b. Delete the libamdocl32.so and the libOpenCL.so.1 files.
c. Create a symlink under /usr/lib using the command
# ln -s libOpenCL.so.1 libOpenCL.so
To do this for 64-bit Linux OS:
a. Go to /opt/AMDAPP/lib/x86.
b. Delete the libamdocl32.so and the libOpenCL.so.1 files.
c. Go to /opt/AMDAPP/lib/x86_64.
d. Delete the libamdocl64.so and the libOpenCL.so.1 files.
e. Create a symlink under /usr/lib using the command
# ln -s libOpenCL.so.1 libOpenCL.so
f. Create a symlink under /usr/lib64 using the command
# ln -s libOpenCL.so.1 libOpenCL.so
[/code]3 mfakto instances running now for over 2 hours without any hang, so this seems to be solved! Now I can actually go and do the final changes and tests for mfakto 0.10 release.

KyleAskine 2011-12-10 17:21

[QUOTE=Bdot;281388]Now I can actually go and do the final changes and tests for mfakto 0.10 release.[/QUOTE]

Any ETA? I am going to have to wipe one of my computers soon, and I am installing a new GPU in another. I would prefer to wait until the new mfakto is out so I can go right to the latest drivers in both cases.

Thanks for all of your work!

Bdot 2011-12-11 19:37

[QUOTE=KyleAskine;281787]Any ETA? [/QUOTE]
Well, not too soon. I'm incorporating some mfaktc changes, and, more of an issue, I started bigger restructuring changes that I should have postponed to the next version.

So certainly not before next weekend, and due to the seasonal increase in the importance of family, it may take even longer :smile:.

However, I can send you a testing version for win/64 that was stable* before I started my restructuring. It does not have a lot of changes from 0.09, but runs on 11.11. Send me a pm with your email if you're interested.

* safe to be used to submit results.

KyleAskine 2011-12-11 21:33

[QUOTE=Bdot;281865]Well, not too soon. I'm incorporating some mfaktc changes, and, more of an issue, I started bigger restructuring changes that I should have postponed to the next version.

So certainly not before next weekend, and due to the seasonal increase in the importance of family, it may take even longer :smile:.

However, I can send you a testing version for win/64 that was stable* before I started my restructuring. It does not have a lot of changes from 0.09, but runs on 11.11. Send me a pm with your email if you're interested.

* safe to be used to submit results.[/QUOTE]

I can hold out and just load 11.9

Thanks again for everything you do for us!

flashjh 2011-12-13 19:09

Computer config question -- help.
 
Hey everyone,

Looking for some guidance/help. I was running a QX9650 with two HD 5870s. I haven't actually added up the GHz/Days with that setup, but it ran very well.

Anyway, since I couldn't max out the GPUs with the CPU, I decided to upgrade my system.

I bought a dual CPU Opteron 6272 (32 cores total) with the Asus KGPE-D16 motherboard. The system works great but now I can't get mfakto to push my GPUs for anything (like at least 8 times slower). I tried the newest pre-release with 11.11 and then reverted back to the older mfakto with 11.9; same result.

I know it's a new CPU and all, but with two full 16x PCI-e slots I figured the 5870 would be screaming on this thing.

Like I said, everything works great besides that but I got this setup for mfakto.

Any help is appreciated. Since it isn't working, I'm going to sell it off and buy something better suited for mfakto -- what does everyone suggest as the fastest single-system mfakto crusher? Thanks!

diamonddave 2011-12-13 19:21

[QUOTE=flashjh;282070]Hey everyone,

Looking for some guidance/help. I was running a QX9650 with two HD 5870s. I haven't actually added up the GHz/Days with that setup, but it ran very well.

Anyway, since I couldn't max out the GPUs with the CPU, I decided to upgrade my system.

I bought a dual CPU Opteron 6272 (32 cores total) with the Asus KGPE-D16 motherboard. The system works great but now I can't get mfakto to push my GPUs for anything (like at least 8 times slower). I tried the newest pre-release with 11.11 and then reverted back to the older mfakto with 11.9; same result.

I know it's a new CPU and all, but with two full 16x PCI-e slots I figured the 5870 would be screaming on this thing.

Like I said, everything works great besides that but I got this setup for mfakto.

Any help is appreciated. Since it isn't working, I'm going to sell it off and buy something better suited for mfakto -- what does everyone suggest as the fastest single-system mfakto crusher? Thanks![/QUOTE]

This looks like a Bulldozer CPU setup?

Try looking in the "Hardware > Launch of the AMD FX 8 Core processor" thread -- a less than stellar review! :sad:

Dubslow 2011-12-13 22:43

Yes, the problem with bulldozers is that there is only one FPU for every two ALUs, so a module is not truly two cores. OTOH, I remember reading that mfakt* is supposed to run on the ALUs...

The other note is that Windows and Linux ATM are [i]horrible[/i] at process scheduling on Bulldozers, because they aren't aware of the architecture's oddities. Windows 8 at least should see a major performance increase on Bulldozer because it will understand the module architecture and will thus schedule processes better.


Edit: Conclusion is for now at least, not Bulldozer.

flashjh 2011-12-13 22:50

[QUOTE=Dubslow;282100]Yes, the problem with bulldozers is that there is only one FPU for every two ALUs, so a module is not truly two cores. OTOH, I remember reading that mfakt* is supposed to run on the ALUs...

The other note is that Windows and Linux ATM are [i]horrible[/i] at process scheduling on Bulldozers, because they aren't aware of the architecture's oddities. Windows 8 at least should see a major performance increase on Bulldozer because it will understand the module architecture and will thus schedule processes better.


Edit: Conclusion is for now at least, not Bulldozer.[/QUOTE]

That makes me wonder if it's worth holding on to or not? Anyway, I'm putting my Intel system back together for now. What a shame.

Dubslow 2011-12-13 22:55

[QUOTE=flashjh;282101]That makes me wonder if it's worth holding on to or not? Anyway, I'm putting my Intel system back together for now. What a shame.[/QUOTE]

I have no idea. Considering how much it must have cost, holding on to it is probably better than throwing it away, unless you intended to sell it.

flashjh 2011-12-13 23:05

[QUOTE=Dubslow;282105]I have no idea. Considering how much it must have cost, holding on to it is probably better than throwing it away, unless you intended to sell it.[/QUOTE]

Well, I would either sell everything or set it up for P-1/LL processing. It does a really amazing job for that. I just want to concentrate on TF for now and it's horrible for that.

I'll probably list it for what I paid and use it until then. I just need another power supply that won't arrive until Friday.

For max TFing, what system do you recommend?

KyleAskine 2011-12-13 23:15

[QUOTE=flashjh;282106]Well, I would either sell everything or set it up for P-1/LL processing. It does a really amazing job for that. I just want to concentrate on TF for now and it's horrible for that.

I'll probably list it for what I paid and use it until then. I just need another power supply that won't arrive until Friday.

For max TFing, what system do you recommend?[/QUOTE]

My i5-2500K overclocked to 4.6 GHz saturates my 2x6970 at around 35000 primes sieved. Just one instance per card.

You don't need to go super expensive to do this.

[URL="http://i.imgur.com/7vRKE.jpg"]http://i.imgur.com/7vRKE.jpg[/URL] <-- Pic of me saturating my cards
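
For anyone trying to reproduce this: the sieve depth is the SievePrimes entry in mfakto.ini (the value below is Kyle's figure for his particular CPU/GPU combination, not a universal optimum; the parameter name should be checked against your README):

```ini
# How many small primes to sieve out on the CPU before candidates are
# handed to the GPU; raising this loads the CPU more and the GPU less.
SievePrimes=35000
```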

Dubslow 2011-12-13 23:20

It's all about the GPU and a CPU that can feed it: either a many-core Phenom or a Sandy Bridge. Sandy Bridge is better in general, but there's nothing specific to mfakt* that gives any of the older processors an advantage over the others. I'd recommend Sandy Bridge just because it has shown remarkable computing ability across the board. For reference, one core of my 2600K is almost capable of keeping pace with a GTX 460 on its own. Not sure how nVidia cards compare, though.

Edit: I was in the middle of typing this when my roommates told me I'd been trolled continuously for the past month. Sorry about the cross post.

Edit2: P-1 and LL are always appreciated.

flashjh 2011-12-14 01:39

Info and 11.12 drivers
 
Thanks for the inputs, so far. I know I don't have to spend a fortune, and my current system was doing very well. But, since I was thinking of upgrading anyway I figured I would go for a really good one (or so I thought). I had read a lot about the 8 Core AMDs, but I didn't see too much on the new Opterons, so I thought I would give it a shot -- whoops.

For the folks in the top 10, I would be curious to know what hardware you're using, for my future upgrade considerations -- thanks.

BDOT - So, I upgraded to the 11.12 drivers a minute ago when I was reinstalling everything. I ran self test for vector sizes 1|2|4|8 and they all passed with the pre-release 2 v.10.

bcp19 2011-12-14 03:09

[QUOTE=flashjh;282123]Thanks for the inputs, so far. I know I don't have to spend a fortune, and my current system was doing very well. But, since I was thinking of upgrading anyway I figured I would go for a really good one (or so I thought). I had read a lot about the 8 Core AMDs, but I didn't see too much on the new Opterons, so I thought I would give it a shot -- whoops.

For the folks in the top 10, I would be curious to know what hardware you're using, for my future upgrade considerations -- thanks.

BDOT - So, I upgraded to the 11.12 drivers a minute ago when I was reinstalling everything. I ran self test for vector sizes 1|2|4|8 and they all passed with the pre-release 2 v.10.[/QUOTE]

If you go to [URL]http://mersenneforum.org/showthread.php?t=16284[/URL] you can see the CPUs/GPUs I am running (they keep me in the top 5). The 560 is slightly OC'd (the EVGA runs at 1701 MHz vs. 1645 MHz stock for nVidia). Still tinkering with the 2500, so it's not OC'd yet.

Dubslow 2011-12-14 17:49

Nucleon and xyzzy both have 3-4 GPU powerhouses, mostly consisting of GTX 580s. I'm not sure if anybody in either the PN top 10 or the GPU272 top ten uses AMD cards/mfakto... That's a worthwhile question.

Dubslow 2011-12-14 18:50

Edit: Here's a link to their hardware lists. xyzzy is two posts after nucleon.
[url]http://mersenneforum.org/showthread.php?t=16050#post275729[/url]


Wow. I hit the edit button, and while I was editing the 1 hr limit expired. New post now. Check out the post times.

KyleAskine 2011-12-14 19:29

[QUOTE=Dubslow;282209]Nucleon and xyzzy both have 3-4 GPU powerhouses, mostly consisting of GTX 580s. I'm not sure if anybody in either the PN top 10 or the GPU272 top ten uses AMD cards/mfakto... That's a worthwhile question.[/QUOTE]

I suspect I am the largest AMD-only person on the network (though I would absolutely love to be proved wrong).

I have 2x6970 and 1x5870. I finally got everything running at what I suspect is maximum efficiency (for me) a couple days ago.

Dubslow 2011-12-14 19:37

btw, props to you on the breadth first thing. You have by far the most factors with much less work.

Bdot 2011-12-14 20:19

[QUOTE=flashjh;282123] I had read a lot about the 8 Core AMDs, but I didn't see too much on the new Opterons, so I thought I would give it a shot -- whoops.
[/QUOTE]
Did you try adding more instances? The "8 times slower" is certainly disturbing. I read that the first boards (among them the review ones!) really sucked with bulldozer. In the meantime, some do a better job. However, Red Hat 6.2 is the only OS so far to really make use of bulldozer. Windows 8 may add that, but it's not certain yet.

Also, I need to get a newer gcc that can issue the right instruction scheduling for bulldozer (would be a special mfakto edition :smile: ).

In other words, the software side may improve the picture quite a bit within the next couple of weeks. But I sincerely doubt it will improve 8 times.

[QUOTE=flashjh;282123]
BDOT - So, I upgraded to the 11.12 drivers a minute ago when I was reinstalling everything. I ran self test for vector sizes 1|2|4|8 and they all passed with the pre-release 2 v.10.[/QUOTE]

Thanks for this test! If you still have mfakto 0.09 around (with the later [URL="http://www.mersenneforum.org/attachment.php?attachmentid=7276&d=1320939297"]kernel[/URL] to fix the mul24/mad24 issues in 11.10), could you give that a try on 11.12 and run a selftest?

flashjh 2011-12-15 00:20

[QUOTE=Bdot;282236]Did you try adding more instances? The "8 times slower" is certainly disturbing. I read that the first boards (among them the review ones!) really sucked with bulldozer. In the meantime some do a better job. However, RedHat 6.2 is the only OS so far to really make use of bulldozer. Windows 8 may add that, but it's not sure yet.[/QUOTE]

I did try more instances - up to 8 (four per GPU). I also tried 6 on one GPU. I may not be an expert at maximizing throughput, but none of the combinations was any good. Maybe I'll try Red Hat once I get my new power supply. I don't know much about Linux, though... haven't used it in a while.

[QUOTE]
Also, I need to get a newer gcc that can issue the right instruction scheduling for bulldozer (would be a special mfakto edition :smile: ).

In other words, the software side may improve the picture quite a bit within the next couple of weeks. But I sincerely doubt it will improve 8 times.
[/QUOTE]

Do they have a newer compiler for bulldozer?

[QUOTE]Thanks for this test! If you still have mfakto 0.09 around (with the later [URL="http://www.mersenneforum.org/attachment.php?attachmentid=7276&d=1320939297"]kernel[/URL] to fix the mul24/mad24 issues in 11.10), could you give that a try on 11.12 and run a selftest?[/QUOTE]

I wasn't able to get the self tests to work. I used the 0.09 mfakto with the kernel from your post to fix the 11.10 problems. Vector size 1 failed immediately; 2|4|8 finished, but they all failed. I emailed you a zip file with the outputs from all four runs so you can see them.


Also, do you have a 32-bit version that works with 11.12? I would like to test my 32-bit machine with 11.12 for throughput.

aaronhaviland 2011-12-15 00:23

[QUOTE=KyleAskine;282231]I suspect I am the largest AMD-only person on the network (though I would absolutely love to be proved wrong).

I have 2x6970 and 1x5870. I finally got everything running at what I suspect is maximum efficiency (for me) a couple days ago.[/QUOTE]

I suspect I'm not far behind: 2x6870 and 1x6950 supported by a Phenom II X6 1075T... haven't been running it 24/7 but with winter here, I will likely do so. (I've also got a GTX460/Core i5... but I'm currently doing something else with that)

I've actually found that when running 2 clients per card (my most efficient setup - it maximises use of the 6 CPU cores as well as the GPUs) I appear to be hitting some bandwidth restriction, as feeding all three cards at the same time causes a bottleneck somewhere (determined by experimenting with different configurations). I know my PCIe slots are all running at x8 speed, but that should still be fast enough.

flashjh 2011-12-16 22:07

AMD Bulldozer Update
 
Good reading and hopefully good news for Bulldozer CPUs... now I just need to get my power supply in so I can test everything!

[URL]http://support.microsoft.com/kb/2592546[/URL]

Dubslow 2011-12-16 22:54

nucleon...
Here's hoping you see this

flashjh 2011-12-17 05:08

Update
 
[QUOTE=flashjh;282486]Good reading and hopefully good news for Bulldozer CPUs... now I just need to get my power supply in so I can test everything!

[URL]http://support.microsoft.com/kb/2592546[/URL][/QUOTE]

The KB download is gone... MS & AMD said it's not ready and many people were getting BSOD with this update. Oh well, at least they're working on it. Looks like AMD and MS need to release two updates to fix the problem and it won't be until 1st Qtr '12.

[URL]http://www.brightsideofnews.com/news/2011/12/16/microsoft-releases-amd-bulldozer-patch-by-mistake2c-incomplete-download.aspx[/URL]

nucleon 2011-12-18 10:55

How can I put this politely... I can't recommend bulldozer at all atm. I can't go past a 2500K for value for money. The bulldozer I have, I'm pretty much keen on throwing out or giving to some poor uni student who needs a pc.

The other thing hurting bulldozers is the 16 KB L1 cache. Every other platform seems to be 32 KB - and hence everything is optimized for that.

I'm working on getting all the info to do a doco post of my setup - stay tuned.

-- Craig

nucleon 2011-12-18 11:23

...on the comment of Windows scheduler. The patch is still coming.

It was due to come out then pulled due to issues in testing apparently.

-- Craig

nucleon 2011-12-18 12:05

Anyone have mfakto working on one of the newer AMD APUs?

I have an E350 here and I had no luck. Tried 11.9 drivers - GPU usage was only <5%. So I think something is up. Tried 11.12 drivers - no luck there either; exe drops out with errors.

-- Craig

Bdot 2011-12-18 18:49

[QUOTE=nucleon;282670]Anyone have mfakto working on one of the newer AMD APUs?

I have an E350 here and I had no luck. Tried 11.9 drivers - GPU usage was only <5%. So I think something is up. Tried 11.12 drivers - no luck there either; exe drops out with errors.

-- Craig[/QUOTE]
Drivers 11.10 and above will work only with mfakto 0.10 (to be released next week).

In the last week of this year I may have the chance to test an E350 as well. I had hoped it would work right away, but maybe more work is needed. Can you give a few details - did it not find the GPU (and therefore failed to use it, just running on the CPU)? Maybe you could post the output of when it starts, and also the output of clinfo. But I guess it's easier to wait for 0.10 than downgrade the drivers again ...

If you pm me your email address I can send you the RC of 0.10.

nucleon 2011-12-19 08:35

I think it did find the GPU - GPU-Z showed a slight load increase (say 2% to 6%), and it dropped back down after I closed it.

Sorry, but I'm not going to get much time with it to do further testing. I gotta hand it back.

This was a win7 home premium install.

-- Craig

Bdot 2011-12-19 22:50

Version 0.10 release
 
1 Attachment(s)
mfakto version 0.10 is ready, with the following changes:

[LIST]
[*]workaround for compatibility with Catalyst 11.10 and above (tested up to 11.12)
[*]checkpoints now keep a backup (.bu) that is automatically read if the .cpt file is corrupt
[*]mfakto now allows reading checkpoints from any version of mfaktc and mfakto, as long as the other parameters match and the checksum is OK
[*]split the mul24 kernel into two bit ranges, 0-64 and 61-72, allowing for some optimizations: 10-20% performance improvement for 70-bit assignments using the mul24 kernel
[I]Please check for your GPU whether barrett is still faster or whether it makes sense to switch to the mul24 kernel. On HD 5770 and HD 6870, barrett is still a bit ahead ...[/I]
[*]merged mfaktc 0.18 features:
[LIST]
[*]inifile parameter CheckpointDelay (see below)
[*]extended selftest (-st2)
[*]changes to the factor-found result line as discussed in the mersenne forum
[/LIST]
[*]writing checkpoints can now be limited:
[LIST]
[*]set CheckpointDelay <s> to write a checkpoint only if at least s seconds have passed since the last checkpoint
[*]set Checkpoint <n>, n>1, to write a checkpoint only after n classes have been tested (set n=1 to enable CheckpointDelay)
[/LIST]
[*]added commandline parameter -i|--inifile <file> to load <file> as the inifile (default: mfakto.ini), allowing multiple instances of mfakto in the same directory
[*]added ResultsFile parameter to the inifile (same reason)
[*]... and a few less important things (see README)
[/LIST]
[I]Attention: If you use Catalyst 11.10 or above with this version, there is an incompatibility with AMD APP SDK 2.4 leading to BSODs (Windows) or mfakto/X-Server hangs (Linux).[/I] Make sure to uninstall the APP SDK - it is no longer needed since Catalyst 11.9 (Windows) or 11.11 (Linux). If you need the APP SDK, install version 2.5.
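To make the new options concrete, here is a hypothetical inifile fragment combining them (the parameter names are from the changelog above; the values and the results filename are just examples):

```ini
# write a checkpoint after every class (n=1 enables CheckpointDelay) ...
Checkpoint=1
# ... but only if at least 300 seconds have passed since the last one
CheckpointDelay=300
# per-instance results file, so two instances can share a directory
# (filename is just an example)
ResultsFile=results2.txt
```

A second instance in the same directory could then be started with its own inifile via the new switch, e.g. mfakto -i mfakto2.ini.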

Source package follows en suite, Linux version tomorrow.

Bdot 2011-12-19 22:54

mfakto 0.10 sources
 
1 Attachment(s)
mfakto 0.10 sources

If anyone feels like creating a Makefile.win for the sources, I'd be happy to move away from VS10 ...

KyleAskine 2011-12-20 01:08

6950 Results
 
What is the MUL24 kernel?

In .09 I was using the mfakto_cl_71 for my 6950s with the shaders unlocked (so basically 6970s). I was getting around 140 M/s

With mfakto_cl_barrett79 I was getting around 120 M/s, so barrett was around 15-20% slower.

With .10 I seem to be getting around 120 M/s with both, so the mfakto_cl_71 seems to have gotten slower for me. The barrett kernel still runs at the same speed.

I am installing something ATM, but I can check again when I am done if you would like. Let me know if there are any screenshots or output files I can send you if that would help.

Edit: I just confirmed that at the same load, I now run around 20% slower with the new version.

therealwebs 2011-12-20 06:17

I'm running both mfakto win 0.09 and 0.10 on different PCs. I've noticed that mfakto 0.10 (x64) seems to crash fairly regularly. I'm using cat 11.12 with 2x5870s. Since I'm running remote, I haven't been able to monitor the circumstance of the crashes. Event viewer doesn't have anything helpful to add at the moment. I'll update if I can find a set of circumstances that cause the crash.

Bdot 2011-12-20 08:24

[QUOTE=KyleAskine;282848]What is the MUL24 kernel?

In .09 I was using the mfakto_cl_71 for my 6950s with the shaders unlocked (so basically 6970s). I was getting around 140 M/s

With mfakto_cl_barrett79 I was getting around 120 M/s, so barrett was around 15-20% slower.

With .10 I seem to be getting around 120 M/s with both, so the mfakto_cl_71 seems to have gotten slower for me. The barrett kernel still runs at the same speed.

I am installing something ATM, but I can check again when I am done if you would like. Let me know if there are any screenshots or output files I can send you if that would help.

Edit: I just confirmed that at the same load, I now run around 20% slower with the new version.[/QUOTE]

mul24 kernel is the kernel "mfakto_cl_71".

What type of CPU do you have? I've reduced the sieve size to fit most CPUs' 32 kB L1 cache. If you have a CPU with a 64 kB L1 cache, then the siever might be slower ... I've lost my Phenom machine (again), therefore I could not test that. As most Intel CPUs have just 32 kB of L1 data cache, I found the optimum sieve size to be ~24 kB for those. If you have a 64 kB-L1-cache machine, I can send you a special version, and I'll note for the next version to either adjust that automatically or make it configurable.

Also, for bulldozer, I can create a 12kiB-siever-version.

Can you confirm that you still see the line
Using GPU kernel "mfakto_cl_71" if you select that kernel to be run?

And can you see a difference in GPU utilization?

Bdot 2011-12-20 08:27

[QUOTE=therealwebs;282875]I'm running both mfakto win 0.09 and 0.10 on different PCs. I've noticed that mfakto 0.10 (x64) seems to crash fairly regularly. I'm using cat 11.12 with 2x5870s. Since I'm running remote, I haven't been able to monitor the circumstance of the crashes. Event viewer doesn't have anything helpful to add at the moment. I'll update if I can find a set of circumstances that cause the crash.[/QUOTE]
Make sure to not have AMD APP SDK 2.4 on your box.

TheJudger 2011-12-20 11:17

[QUOTE=Bdot;282880]Also, for bulldozer, I can create a 12kiB-siever-version.
[/QUOTE]

Well, I've built an mfaktc executable for nucleon's Bulldozer with a smaller sieve. It helps a little bit, but my sieve code runs really badly on Bulldozer - per clock, something like 1/4 to 1/3 the speed of a current Intel CPU. :sad:

Oliver

KyleAskine 2011-12-20 11:22

[QUOTE=Bdot;282880]mul24 kernel is the kernel "mfakto_cl_71".

What type of CPU do you have? I've reduced the sieve size to fit most CPUs' 32 kB L1 cache. If you have a CPU with a 64 kB L1 cache, then the siever might be slower ... I've lost my Phenom machine (again), therefore I could not test that. As most Intel CPUs have just 32 kB of L1 data cache, I found the optimum sieve size to be ~24 kB for those. If you have a 64 kB-L1-cache machine, I can send you a special version, and I'll note for the next version to either adjust that automatically or make it configurable.

Also, for bulldozer, I can create a 12kiB-siever-version.

Can you confirm that you still see the line
Using GPU kernel "mfakto_cl_71" if you select that kernel to be run?

And can you see a difference in GPU utilization?[/QUOTE]

I have an i5-2500k.

I don't think it is a siever issue... my utilization is the same (around 90%) with both .09 and .10.

I confirmed that it does say that it is using mfakto_cl_71.

Bdot 2011-12-20 15:28

[QUOTE=TheJudger;282897]Well, I've built an mfaktc executable for nucleon's Bulldozer with a smaller sieve. It helps a little bit, but my sieve code runs really badly on Bulldozer - per clock, something like 1/4 to 1/3 the speed of a current Intel CPU. :sad:

Oliver[/QUOTE]
Yes, I've seen that reducing the sieve size any further dramatically reduces speed. In that respect, the Phenoms (64 kiB L1) should be best at sieving, if they get a 60 kiB siever ...

Bdot 2011-12-20 15:31

[QUOTE=KyleAskine;282898]I have an i5-2500k.

I don't think it is a siever issue... my utilization is the same (around 90%) with both .09 and .10.

I confirmed that it does say that it is using mfakto_cl_71.[/QUOTE]

That is really sad, and it seems to depend on your GPUs - mfakto_cl_71 in v0.10 on my box is faster than in v0.09 ...

Can you please pm me your email address? I'd like to send you something to test ...

therealwebs 2011-12-20 17:52

yep, don't have APP SDK 2.4 installed AFAIK. i wanted to install 2.6, but the download link was corrupted so i'm using 2.5.

in terms of stability, mfakto hasn't crashed in the last 10 or so hours. this is coinciding with changing my usage pattern from 2 instances+1 instance to running only 1 instance on each card (so 1+1). from a resource standpoint, i'm using 3 cores of my i5 to feed the cards and 1 core to run prime95. if i allow 2 cores of primes to run, i get a major throughput hit in mfakto.

thanks for this version! i didn't want to have to do a driver rollback to run this on my main machine :)

Bdot 2011-12-20 23:45

[QUOTE=therealwebs;282946]yep, don't have APP SDK 2.4 installed AFAIK. i wanted to install 2.6, but the download link was corrupted so i'm using 2.5.

in terms of stability, mfakto hasn't crashed in the last 10 or so hours. this is coinciding with changing my usage pattern from 2 instances+1 instance to running only 1 instance on each card (so 1+1). from a resource standpoint, i'm using 3 cores of my i5 to feed the cards and 1 core to run prime95. if i allow 2 cores of primes to run, i get a major throughput hit in mfakto.

thanks for this version! i didn't want to have to do a driver rollback to run this on my main machine :)[/QUOTE]
If you could enable userdumper or some other tool to get a crash dump when it aborts next time, that would be really helpful. But of course I hope it does not crash again ;-)

And another note: the aforementioned performance issue seems resolved. kyleaskine and flashjh are helping me test it, so I'll probably release a fix for it tomorrow - together with the linux binary.

BigBrother 2011-12-21 11:45

When using CheckpointDelay=0 and PrintMode=1, the first column (class) of the output is always overwritten by the text 'CP written.', which makes it impossible to see which class is being tested.

Bdot 2011-12-21 13:48

1 Attachment(s)
[QUOTE=BigBrother;283030]When using CheckpointDelay=0 and PrintMode=1, the first column (class) of the output is always overwritten by the text 'CP written.', which makes it impossible to see which class is being tested.[/QUOTE]
Hehe, that's a use-case that was not intended ... now that mfakto can delay writing the checkpoints, the idea is that they are written only occasionally.

I'll think of some better way to tell that a checkpoint was written.

Thanks for the report.


Here's the fix for the performance issues. It just contains 2 kernel files that need to replace original files from the 0.10 package.

Bdot 2011-12-21 13:55

mfakto 0.10 - Linux version
 
1 Attachment(s)
Here comes the linux version of mfakto 0.10. It has the performance issues resolved, but is otherwise unchanged (also 32kiB sieve limit).

Brain 2011-12-21 18:52

[QUOTE=Bdot;283036]Here's the fix for the performance issues. It just contains 2 kernel files that need to replace original files from the 0.10 package.[/QUOTE]
Wouldn't it be easier to integrate the patch files into the former 0.10 bundles as well? I'm holding off on my GPU guide update and would like to minimize the download URL count...

Bdot 2011-12-22 10:06

1 Attachment(s)
[QUOTE=Brain;283073]Wouldn't it be easier to integrate the patch files into the former 0.10 bundles as well? I'm holding off on my GPU guide update and would like to minimize the download URL count...[/QUOTE]

As you wish ... (of course you're right ...)

henryzz 2011-12-23 18:30

In yafu it is recommended to use a 64 kB sieve on AMD CPUs and a 32 kB sieve on Intel because of Intel's smaller L1 cache. Bulldozer goes down to a 16 kB data cache, so it might want an even smaller one.

Bdot 2011-12-24 10:59

[QUOTE=henryzz;283342]In yafu it is recommended to use a 64 kB sieve on AMD CPUs and a 32 kB sieve on Intel because of Intel's smaller L1 cache. Bulldozer goes down to a 16 kB data cache, so it might want an even smaller one.[/QUOTE]

With MORE_CLASSES, mfakt[co] uses a sieve size that is a multiple of ~12k (13*17*19*23 bits), which results in an optimum of ~24k for Intel and ~60k for AMD (12k for Bulldozer). flashjh, did you give the different sieve-size versions a try on your Phenom to confirm?

I guess the next version of mfakto will have sieve size configurable ...

BTW, I had a chance to quickly test the E350 with mfakto. Windows 7 64-bit and Catalyst 11.12 installed, and there's nothing more that is needed - the GPU is detected right away. However, it may not really be worth the effort: ~7M/s was the peak.

I'll test a little more though.
CPU load (mfakto, SievePrimes 200k): ~17%
GPU load : 85-95%
no measurable increase in power consumption
M52 50xx xxx (2^69 - 2^70): 6.8M/s avg.

flashjh 2011-12-24 19:31

[QUOTE=Bdot;283417]With MORE_CLASSES, mfakt[co] uses a sieve size that is a multiple of ~12k (13*17*19*23 bits), which results in an optimum of ~24k for Intel and ~60k for AMD (12k for Bulldozer). flashjh, did you give the different sieve-size versions a try on your Phenom to confirm?

I guess the next version of mfakto will have sieve size configurable ...

BTW, I had a chance to quickly test the E350 with mfakto. Windows 7 64-bit and Catalyst 11.12 installed, and there's nothing more that is needed - the GPU is detected right away. However, it may not really be worth the effort: ~7M/s was the peak.

I'll test a little more though.
CPU load (mfakto, SievePrimes 200k): ~17%
GPU load : 85-95%
no measurable increase in power consumption
M52 50xx xxx (2^69 - 2^70): 6.8M/s avg.[/QUOTE]

I just finished testing two 5870s with a Phenom x6 1055T. I ran 4 instances, 2 per GPU. All instances are running 70-72 with no stages. SievePrimes is set to autoadjust.

32k 64 bit exe:
GPU 1 runs ~20.6 sec per class with SievePrimes at ~28000.
GPU 2 runs ~22.0 sec per class with SievePrimes at ~36000.

64k 64 bit exe:
GPU 1 runs ~20.0 sec per class with SievePrimes at ~41000.
GPU 2 runs ~20.5 sec per class with SievePrimes at ~54000.

Average CPU wait time for all instances is between 200 and 400 µs.

Usage:
CPU: 75%
GPU 1: 73%
GPU 2: 85%


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.