For the sake of clarity, all the information in the post above is premised on Prime95's affinity being set to CPUs 2-7 (only) in Task Manager. That's the only affinity setting that's currently operational.
Rodrigo
[QUOTE=Rodrigo;322793]For the sake of clarity, all the information in the post above is premised on Prime95's affinity being set to CPUs 2-7 (only) in Task Manager. That's the only affinity setting that's currently operational.
Rodrigo[/QUOTE] Hi Rodrigo,

mfakto uses AMD's OpenCL implementation, which is designed in a multithreaded way. Basically, there is the user thread, which in mfakto's case does the CPU-based sieving and issues the OpenCL API calls that schedule the GPU work. AMD's OpenCL then uses a different thread that takes the parameters of those API calls, drives the GPU and collects the results, which are passed back to the user thread. Binding the whole mfakto process to a single core makes the sieving compete with the background thread that drives the GPU, resulting in low GPU utilization. An affinity mask of 0x3 instead of 0x1 could improve the situation a lot because it allows both hyperthreads of the same physical core to be used, but I've seen cases where this does not work. Not limiting mfakto at all may be the better solution in your case.

CUDA runs the GPU-related work in the same thread that issued the API calls, which is why mfaktc works perfectly fine when bound to a single CPU core. mfakto, however, works best if at least one CPU core is "unused" so that it can quickly take over the task of driving the GPU and passing the data back and forth. For this task, an unused hyperthread is fully sufficient. A core that runs Prime95 at low priority, however, is not optimal, because even the low-priority job will finish its time slice (typically 15 ms) before switching to mfakto and serving the GPU. On average you lose about half a time slice before the next job is scheduled on the GPU.

Making things worse, the mfakto main thread times how long each GPU call takes in order to adjust SievePrimes for optimal resource usage. If there is no CPU core available to drive the GPU (i.e. to run the OpenCL background thread), then the main thread includes this wait time in the GPU part of the equation. There's just no way to separate the GPU run time from the time between the API call and the moment the GPU actually starts.
This means that mfakto will adjust SievePrimes higher when there is no CPU to serve the background thread. Higher SievePrimes in turn further reduces GPU utilization ... Therefore, if you plan to run something on every "logical core", a.k.a. hyperthread, it is necessary to fix SievePrimes at some value that you will need to find out for yourself; mfakto's auto-adjust does not work on a fully loaded system.

To bind mfakto to a certain core, it is better to use the mfakto.ini variable SieveCPUMask: [code]
# Allow to set the CPU affinity of the siever thread. It is a bit-mask,
# specified in decimal (sorry): e.g. 0=no limit, 5=CPU0+CPU2, 15=CPU0-3,
# 18446744073709551615=CPU0-63 (max value)
#
# Default: SieveCPUMask=0

SieveCPUMask=0
[/code]This will bind the user thread (the sieving, i.e. the most CPU-intensive part) to a certain set of CPUs, but leave the OpenCL threads free to run on any CPU. That way, their "mini-jobs" to drive the GPU will be completed fastest.

On an i7 (4 physical, 8 logical cores) I usually run 3 Prime95 threads and 4 mfakto instances, leaving one hyperthread "unused". This way I can leave mfakto on auto-adjust. On a Phenom X4 (4 cores, no hyper-threading) I can run 3 mfakto instances on auto-adjust and no Prime95, or I can use a fixed SievePrimes on 3 instances of mfakto and run 2 threads of Prime95. After careful balancing I managed to make the 3 mfakto processes share 2 cores ... quite some experimenting.

Bdot
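[Editorial aside: the decimal bit-mask convention in the ini comment above is easy to get wrong. The following short Python sketch — a hypothetical helper, not part of mfakto — decodes such a mask into the CPU numbers it allows, using the sample values from the ini comment:]

```python
# Hypothetical helper (not part of mfakto): decode a SieveCPUMask-style
# decimal bit-mask into the list of CPU indices it allows.

def mask_to_cpus(mask: int) -> list[int]:
    """Return the CPU indices whose bits are set in `mask`."""
    cpus = []
    bit = 0
    while mask:
        if mask & 1:
            cpus.append(bit)
        mask >>= 1
        bit += 1
    return cpus

print(mask_to_cpus(5))    # 5  = CPU0+CPU2 -> [0, 2]
print(mask_to_cpus(15))   # 15 = CPU0-3    -> [0, 1, 2, 3]
print(mask_to_cpus(0x3))  # both hyperthreads of the first physical core -> [0, 1]
```

So `SieveCPUMask=5` would pin the siever thread to CPU0 and CPU2, while `SieveCPUMask=0` (the default) leaves it unrestricted.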
Hi Bdot,
Thank you very much for the details! Despite my lack of expertise, I was (I think!) actually able to follow almost everything you said. There is one point where I need clarification, though. First you wrote: [QUOTE=Bdot;322947][B]Binding the whole mfakto process to a single core [/B]will make the sieving compete with the background thread that drives the GPU, resulting in low GPU utilization. 0x3 instead of 0x1 could improve the situation a lot because it allows two hyperthreads of the same physical core to be used. But I've seen cases when this does not work. Not limiting mfakto in your case may be the better solution.[/QUOTE] But then later on you wrote: [QUOTE=Bdot;322947][B]In order to bind mfakto to a certain core[/B], it is better to use the mfakto.ini variable SieveCPUMask[/QUOTE][emphasis added in both quotes] Let me see if I understand this. Would it be correct to say (1) that you recommend that mfakto not be bound to a particular CPU core, but (2) that if one does want to bind mfakto to a particular core, then using SieveCPUMask is the way to do it?

Rodrigo
[QUOTE=Rodrigo;322991]Hi Bdot,
Thank you very much for the details! Despite my lack of expertise, I was (I think!) actually able to follow almost everything you said there. [...] Let me see if I understand this. Would it be correct to say (1) that you recommend that mfakto not be bound to a particular CPU core, but then (2) that if one wants to bind mfakto to a particular core, then using SieveCPUMask is the way to do it? Rodrigo[/QUOTE] Hi Rodrigo,

Yes, that is basically what I wanted to say. And the reason why SieveCPUMask is better is that it binds only the sieve thread to the specified CPU(s), not the whole process. Using the /affinity switch of cmd/start will limit all mfakto threads to the specified CPU(s), including the background thread that drives the GPU.

Bdot
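[Editorial aside: the thread-versus-process distinction Bdot describes can be demonstrated outside mfakto. Below is a Linux-only Python sketch using `os.sched_setaffinity`, where pid 0 means the calling thread; it pins a single worker thread while the rest of the process keeps its default affinity. This is merely an analogue of what SieveCPUMask does, not mfakto's actual implementation:]

```python
# Linux-only sketch: pin one thread of a process to CPU0 (akin to binding
# just the siever thread), leaving every other thread free to run anywhere.
import os
import threading

def sieve_worker():
    # sched_setaffinity with pid 0 affects only the calling thread on Linux.
    os.sched_setaffinity(0, {0})
    print("worker thread CPUs:", sorted(os.sched_getaffinity(0)))  # [0]

t = threading.Thread(target=sieve_worker)
t.start()
t.join()

# The main thread (standing in for the OpenCL background thread) is
# unaffected and keeps the process-default CPU set:
print("main thread CPUs:", sorted(os.sched_getaffinity(0)))
```

Binding the whole process instead (as `start /affinity` does on Windows) constrains every thread at once, including the one that drives the GPU, which is exactly the problem described above.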
[QUOTE=Bdot;323063]yes, that is basically what I wanted to say. And the reason why using SieveCPUMask is better is that this will only bind the sieve thread to the specified CPU(s), not the whole process. Using the /affinity switch for cmd/start will limit all mfakto threads to the specified CPU(s), including the background thread driving the GPU.[/QUOTE]
Thanks very much, Bdot -- maybe there's hope for me yet. :smile: In that case, we will keep things simple and not bind mfakto to a particular core. Now, on to another question. I noticed that in your earlier post you said: [QUOTE]On an i7 (4 physical, 8 logical cores) I usually run 3 prime95 threads and 4 mfakto instances, leaving one hyperthread "unused". [/QUOTE] It's amazing that you can run not only 3 instances of P95, but 4 of mfakto in addition! Here is what I would like to know: did you use any special commands or settings to make it so that (a) Prime95 runs on three hyperthreads and (b) mfakto runs on four others? I'm thinking that all I would need to do is set the CPU affinity for Prime95 in Task Manager to three specific threads, and then mfakto will find its own way and work around them. If you are doing any LL work, what kinds of per-iteration times are you getting by using a single thread for each exponent? And with four mfakto instances, what time/class values or average M/s rates are you getting per instance? (This way I will be able to compare and see how effectively I'm setting things up.)

Rodrigo
[QUOTE=Rodrigo;323086]Thanks very much, Bdot -- maybe there's hope for me yet. :smile:
It's amazing that you can run, not only 3 instances of P95, but 4 of mfakto in addition! Here is what I would like to know: did you use any special commands or settings to make it so that (a) Prime95 would run on three hyperthreads, and (b) mfakto would run on four other ones? I'm thinking that all I would need to do is to set the CPU affinity for Prime95 in Task Manager to three specific threads, and then mfakto will find its own way and work around them. If you are doing any LL work, what kinds of per-iteration times are you getting by using a single thread for each exponent? And with four mfakto instances, what time/class values or average M/s rates are you getting per instance? (This way I will be able to compare and see how effectively I'm setting things up.) Rodrigo[/QUOTE] I did not set any affinity in Task Manager or anything like that. I made sure that Prime95 uses different physical cores (in the Test->Worker Windows screen I specified CPU #1, #3 and #5 for workers #1, #2 and #3, respectively). If you then start the 4 mfakto instances, they'll run on the other hyperthreads. They do influence Prime95, but I don't have the numbers here -- I'll post them later.
Hi Bdot,
This is good, it will help me to decide how to allocate the various cores/hyperthreads. Currently, running a single mfakto instance, I'm getting TF time/class values of around 3 seconds. What sorts of time/class values (per instance) are you getting with 4 mfakto instances? Could I actually quadruple my TF output :surprised and still be able to use the PC for work?

Rodrigo
1 Attachment(s)
[QUOTE=Rodrigo;323455]Hi Bdot,
This is good, it will help me to decide how to allocate the various cores/hyperthreads. Currently, running a single mfakto instance, I'm getting TF time/class values of around 3 seconds. What sorts of time/class values (per instance) are you getting with 4 mfakto instances? Could I actually quadruple my TF output :surprised and still be able to use the PC for work? Rodrigo[/QUOTE] Hi Rodrigo, :smile:

I'd not expect quadruple, but double may be within reach (depending on your GPU). In the attached screenshot (post-processed to fit the 1600x1200 requirement) you'll see Prime95 running alone at 11 ms/iteration on double checks. After starting the 4 mfakto instances, the per-iteration time increases to 20 ms. The system is an i7 2600 (3.4 GHz) with an HD 6870 @ 960 MHz. It is memory-bandwidth limited: running 4 LL tests increases the iteration times to ~14 ms. BTW, this machine achieves the best PrimeNet credit when running just one Prime95 thread: the higher SievePrimes values more than compensate (GHz-days-wise) for the loss in LL throughput.

If you set up your machine like that, expect it to respond more slowly. Whether you can still use it for work depends on the type of work and the type of GPU -- both CPU and GPU run near 100% utilization. You'll need to test it to find out ... Fewer mfakto instances or a lower GridSize may improve responsiveness.

Bdot
Extremely informative post, Bdot -- thank you! :tu:
This will be a huge help as I look for the most GIMPS-productive settings that will still allow normal work on my new PC.

Rodrigo
[QUOTE=Bdot;323470]For the attached screen (post-processed to fit the 1600x1200 requirement) you'll see prime95 running alone @ 11ms/it for DCs. After starting the 4 mfakto instances, the per iteration time increases to 20ms. The system is an i7 2600 (3.4GHz) with an HD 6870 @ 960MHz. It is memory-bandwidth limited: running 4 LL tests increases the iteration times to ~14ms.
BTW, this machine achieves the best primenet credit when running just one prime95 thread. The higher SievePrimes values more than compensate for the loss in LL-throughput (GHz-days wise).[/QUOTE] Hi Bdot, This may sound like a silly question, but here goes anyway: Suppose that you were to close down 3 of your 4 instances of mfakto (regardless of what you did with the P95 workers). What would be the effect on the time/class of the one remaining mfakto instance?

Rodrigo
[QUOTE=Rodrigo;323502]Hi Bdot,
This may sound like a silly question, but here goes anyway: Suppose that you were to close down 3 of your 4 instances of mfakto (regardless of what you did with the P95 workers). What would be the effect on the time/class of the one remaining mfakto instance? Rodrigo[/QUOTE] Hi Rodrigo,

Provided the 4 instances have auto-adjusted to fully load CPU and GPU, stopping 3 of them has mainly these additive effects: [LIST][*]The remaining instance has been using about 25% of the GPU while 4 instances were running. This will stay so for a few classes: it will only be able to use slightly above 25% of the GPU, due to the (relatively) high SievePrimes, so the time per class will be reduced only slightly.[*]The remaining instance no longer needs to share a physical core. This means that sieving will be almost twice as fast, doubling the GPU utilization and cutting the time per class in half.[*]With each finished class, mfakto will lower SievePrimes until the GPU utilization is close to 100% or the minimum SievePrimes value is reached. This can reduce the time per class by a further ~10%.[/LIST]Adding up, my estimate (rather a WAG) is that the GPU load will drop from 100% to ~55-60%, with the time per class of the remaining instance dropping to ~45-50% of before. Then GPU load and throughput will slowly increase. I guess the time per class will stabilize at 35-40% of before, if a single CPU core can fully load the GPU. This guess can be off quite a bit, depending on the SievePrimes difference. At lower SievePrimes, more composite factor candidates (FCs) will be tested by the GPU: with SievePrimes at 5000, 75% of all FCs in a class are sieved out; at 60000, the ratio is 80%. Testing 25 instead of 20 FCs is 25% more effort ... These 25% (in this example), plus possibly some better GPU utilization, are the gains that you can expect when switching from one to multiple instances.
Excerpt from the [I]mfakto --perftest[/I] output: [code]
SievePrimes:    256   2000   5000  10000  20000  40000  60000  80000 100000 200000 500000 1000000
Sieved out:  63.63% 72.35% 74.97% 76.63% 78.08% 79.35% 80.02% 80.47% 80.81% 81.78% 82.91%  83.68%
[/code]

Bdot
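[Editorial aside: the "25% more effort" arithmetic above is worth making explicit. A quick Python check, using the 75% / 80% sieve-out figures from the post:]

```python
# Check Bdot's arithmetic: if SievePrimes=5000 sieves out 75% of the factor
# candidates (FCs) in a class and SievePrimes=60000 sieves out 80%, how much
# more GPU work does the lower setting cause per 100 candidates?

def surviving_fcs(sieved_out_pct: float, total: int = 100) -> float:
    """Number of FCs left for the GPU after sieving, per `total` candidates."""
    return total * (1 - sieved_out_pct / 100)

low  = surviving_fcs(75.0)   # FCs surviving at SievePrimes=5000
high = surviving_fcs(80.0)   # FCs surviving at SievePrimes=60000
extra = (low - high) / high  # relative extra GPU effort at the lower setting
print(f"{low:.0f} vs {high:.0f} FCs -> {extra:.0%} more GPU work")
# -> 25 vs 20 FCs -> 25% more GPU work
```

This matches the post's figure: the higher SievePrimes sustained by running multiple instances saves the GPU about a quarter of its per-class workload.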