Mfaktc sieveprimes=5000 OK?
Hi,
I'm playing with the GPU and mfaktc again, and I've noticed that if I adjust SievePrimes down from the default 25000 to 5000, I get a significant speedup. Is there any downside to reducing SievePrimes? Can this cause factors to be missed? |
Hi,
[QUOTE=NBtarheel_33;273723]I'm playing with the GPU and mfaktc again, and I've noticed that if I adjust SievePrimes down from the default 25000 to 5000, I get a significant speedup. Is there any downside to reducing SievePrimes? Can this cause factors to be missed?[/QUOTE] Unless there is an undiscovered bug, the value of SievePrimes cannot cause missed factors. Significant speedup... how much faster is it? Keep in mind that the avg. rate doesn't really matter; what matters is the time per class/assignment. [B]Perhaps[/B] you want to dedicate another core to mfaktc. SievePrimes=5000 usually tells you that not enough CPU resources are available. Just start another copy of mfaktc in a [B]separate directory[/B] working on different exponents. Oliver |
What is avg. rate for, then, if it doesn't really matter? I notice the same thing: the speedup is in avg. rate. (The auto-adjust doesn't raise it from 5000, if that matters)
|
Oliver, correct me if I'm wrong, but I think Average Rate refers to the rate at which the GPU takes candidate factors (output from the prime sieve running on the CPU, which at SievePrimes = 5,000 is doing a relatively worse job of eliminating numbers that can't be factors) and determines whether those candidates are or are not factors.
Classes, on the other hand, represent a fixed division of the population of all possible factors, so you want those to fly by as fast as possible. A low SievePrimes indicates that the CPU core is saturated, and you *may* want to add another CPU core to the job by running a second copy of mfaktc on a different core. (But possibly not, as that other core could be doing P-1 or LL tests instead.) |
Mfaktc on my system always adjusts sieveprimes down to 5000 after a few seconds of operation. I tried shutting off the automatic adjustment and using higher values, but 5000 always gives me the best performance (least amount of total time spent).
I am running two instances of mfaktc, two LL tests, one DC and one P-1 on a six-core system (core-i7, hyperthreading off). Chuck |
[QUOTE=Chuck;273843]Mfaktc on my system always adjusts sieveprimes down to 5000 after a few seconds of operation. I tried shutting off the automatic adjustment and using higher values, but 5000 always gives me the best performance (least amount of total time spent).
I am running two instances of mfaktc, two LL tests, one DC and one P-1 on a six-core system (core-i7, hyperthreading off). Chuck[/QUOTE] Assuming you have set up the instances of mfaktc to run on two separate cores using your operating system, that means you have a very nice GPU that your CPU isn't quite keeping up with, so the low sieveprimes reduces the CPU load per class, and makes the GPU, with the spare cycles available, do a little more work. |
[QUOTE=Christenson;273919]Assuming you have set up the instances of mfaktc to run on two separate cores using your operating system, that means you have a very nice GPU that your CPU isn't quite keeping up with, so the low sieveprimes reduces the CPU load per class, and makes the GPU, with the spare cycles available, do a little more work.[/QUOTE]
It is a GTX580 Black Ops (797 MHz default clock) Chuck |
[QUOTE=Chuck;273922]It is a GTX580 Black Ops (797 MHz default clock)
Chuck[/QUOTE] QED, a *very* nice GPU!:smile: |
[QUOTE=Chuck;273922]It is a GTX580 Black Ops (797 MHz default clock)
Chuck[/QUOTE] daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaannnnnnnnnnnnnnnnnnnnnnnngggggggggggggg |
[QUOTE=Chuck;273843]Mfaktc on my system always adjusts sieveprimes down to 5000 after a few seconds of operation. I tried shutting off the automatic adjustment and using higher values, but 5000 always gives me the best performance (least amount of total time spent).[/QUOTE]
This might be telling you that there's a benefit to running a 3rd instance of mfaktc to keep the card fed with data? |
[QUOTE=kjaget;274007]This might be telling you that there's a benefit to running a 3rd instance of mfaktc to keep the card fed with data?[/QUOTE]
No, it's running at 85% GPU load. Chuck |
[QUOTE=kjaget;274007]This might be telling you that there's a benefit to running a 3rd instance of mfaktc to keep the card fed with data?[/QUOTE]
Indeed, Mr chuck could do this...remember I said the GPU was distinctly outrunning the CPU...and he would extract a bit more throughput from the GPU, but it would be at the expense of whatever the third core would otherwise do, and probably raise the temperature some. Better balance might be to run a copy of CudaLucas, which uses very little CPU, instead, or have the #3 core do P-1. Remember that only 10-20% of exponents will be knocked out by the new capabilities of TF, leaving 80-90% still needing those boring old LL tests. But the numbers chuck is throwing out are no slouch...even if they are not at the exact peak of optimum GIMPS TF performance. |
I'm quite happy with his throughput ... :big grin:
|
[QUOTE=Christenson;274051]Indeed, Mr chuck could do this...remember I said the GPU was distinctly outrunning the CPU...and he would extract a bit more throughput from the GPU, but it would be at the expense of whatever the third core would otherwise do, and probably raise the temperature some.
Better balance might be to run a copy of CudaLucas, which uses very little CPU, instead, or have the #3 core do P-1. Remember that only 10-20% of exponents will be knocked out by the new capabilities of TF, leaving 80-90% still needing those boring old LL tests. But the numbers chuck is throwing out are no slouch...even if they are not at the exact peak of optimum GIMPS TF performance.[/QUOTE] On the GPU I am doing TFs in the 26M range 67->68 for ckdo, and in the 600M range from 64->67. The lower range takes about 15 min per check, and the upper one about 30 sec. I am using the MORE_CLASSES-disabled version for the higher range, which has helped a lot, cutting the time from 45 sec to 30 sec BUT increasing the GPU load and temperature. It runs 24 hours/day, so that's all the load I am going to put on it.

I'm not the smartest one in the room here; I just have money to spend on a fast computer and am retired, so have unlimited time. I never could figure out how to work CudaLucas. As I remember, it doesn't take its work from a worktodo file, and I wouldn't know how to set up a batch file. mfaktc was as complicated a thing as I could manage...

The CPU is a core-i7 970 OCed to 3675 MHz, six cores with hyperthreading turned off. Two cores do mfaktc with the GPU, two cores LL, one core DC and one core P-1. |
I can make you smarter, just keep reading what I tell you:
[url]http://mersenneforum.org/showthread.php?t=15545[/url], towards the end of the thread, gives a quick intro to batch files and a sample command line for CUDALucas. On Windows there's not much to them, and you don't need much either -- it was pointed out to me that the simplest batch file (just one line) is the command line associated with one of those "shortcuts" on the desktop. WordPad is a fine editor for Windows batch files; so is Notepad.

Then get yourself a copy of CUDALucas. There may be some nice pointers in the "putting it all together" thread. If not, you'll need to wade backwards through the CUDALucas thread, but this is admittedly painful. Run said copy, and ask more questions here or somewhere, as we are WAAAYYY off-topic.

Oh, and don't run Microsoft Internet Explorer unless you are absolutely required to. That's the first rule of anti-virus. |
[QUOTE=Dubslow;273735]What is avg. rate then if it doesn't really matter? I notice the same thing, where the speed up is in avg. rate. (the auto adjust doesn't up it from 5000 if that matters)[/QUOTE]
Right. I was getting about 107M/sec with SievePrimes=5000, and only 70-80M/sec with SievePrimes=25000. |
[QUOTE=Chuck;274009]No, it's running at 85% GPU load.
Chuck[/QUOTE] Where can I get a hold of this figure? Does it show up in Task Manager on Windows, or top on Linux? |
[QUOTE=TheJudger;273724]Hi,
Unless there is an undiscovered bug, the value of SievePrimes cannot cause missed factors. Significant speedup... how much faster is it? Keep in mind that the avg. rate doesn't really matter; what matters is the time per class/assignment. [B]Perhaps[/B] you want to dedicate another core to mfaktc. SievePrimes=5000 usually tells you that not enough CPU resources are available. Just start another copy of mfaktc in a [B]separate directory[/B] working on different exponents. Oliver[/QUOTE] How much CPU should typically be allocated to each instance of mfaktc? I was working with an 8-core Nehalem Xeon system @ 2.66 GHz with 2 Fermi GPUs. At first, I had Prime95 TFing on all 8 cores, as well as 2 instances of mfaktc going on the GPUs. I noticed that three of the CPU cores were slower than the others, so I stopped their workers and ran 5 TF CPU cores and two instances of mfaktc. Things picked up nicely at this point. I also experimented with a third instance of mfaktc; it slowed the GPUs down into the 65-70M/sec range. If we have X GPUs, should we run exactly X copies of mfaktc, or does it make more efficient use of the GPUs to run more? |
[QUOTE=NBtarheel_33;274086]How much CPU should typically be allocated to each instance of mfaktc? I was working with an 8-core Nehalem Xeon system @ 2.66 GHz with 2 Fermi GPUs. At first, I had Prime95 TFing on all 8 cores, as well as 2 instances of mfaktc going on the GPUs. I noticed that three of the CPU cores were slower than the others, so I stopped their workers, and ran 5 TF CPU cores, and two instances of mfaktc. Things picked up nicely at this point. I also experimented with a third instance of mfaktc; it slowed the GPUs down into the 65-70M/sec range.
If we have X GPUs, should we run exactly X copies of mfaktc, or does it make more efficient use of the GPUs to run more?[/QUOTE] Nice system...but I suspect your GPU cards are running circles around your CPU cores at TF....you will need to run *at least* one instance of mfaktc to keep each GPU busy with sieved output....maybe two or three; see Chuck's discussion of two cores not quite keeping up with the sieving on his GPU. |
[QUOTE=NBtarheel_33;274084]Where can I get a hold of this figure? Does it show up in Task Manager on Windows, or top on Linux?[/QUOTE]
I am using the EVGA Precision utility, which allows for overclocking but also displays GPU usage statistics. You could also use MSI Afterburner, which has a different skin but performs the same functions. Chuck |
[QUOTE=Chuck;274009]No, it's running at 85% GPU load.
Chuck[/QUOTE] Which tells me that adding a 3rd instance might bring that up to 100% load. Right now two CPU cores can't keep up, so the GPU is idle 15% of the time. You'd just have to decide whether 100% of one CPU core is worth trading for an extra 15% of a GPU. And as you mentioned, that includes not only performance trade-offs but heat, noise and so on.

You might also consider doing a higher bit depth for the 600M range. It's possible that there's so much overhead in running such a small bit level on a fast card that adding one more bit might not slow the process down that much. I don't know the answers to any of these questions, but it might be worth experimenting. |
[QUOTE=kjaget;274119]Which tells me that adding a 3rd instance might bring that up to 100% load. Right now 2 CPU cores can't keep up so the GPU is idle for 15% of the time. You'd just have to balance whether 100% of one CPU core is worth trading for an extra 15% of a GPU core. And as you mentioned, that includes not only performance trade-offs, but heat, noise and so on.
You might also consider doing a higher bit depth for the 600M range. It's possible that there's so much overhead in running such a small bit level on a fast card that adding one more bit might not slow the process down that much. I don't know the answers to any of these questions, but it might be worth experimenting.[/QUOTE] This is good enough for me. Eventually I will get to higher bit levels in the 600M range. |
[QUOTE=Chuck;274111]I am using the EVGA precision utility which allows for overclocking, but also displays GPU usage statistics. You could also use MSI Afterburner, a different skin but performs the same functions.
Chuck[/QUOTE] Or GPU-Z |
[QUOTE=NBtarheel_33;274086]How much CPU should typically be allocated to each instance of mfaktc? I was working with an 8-core Nehalem Xeon system @ 2.66 GHz with 2 Fermi GPUs. At first, I had Prime95 TFing on all 8 cores, as well as 2 instances of mfaktc going on the GPUs. I noticed that three of the CPU cores were slower than the others, so I stopped their workers, and ran 5 TF CPU cores, and two instances of mfaktc. Things picked up nicely at this point. I also experimented with a third instance of mfaktc; it slowed the GPUs down into the 65-70M/sec range.
If we have X GPUs, should we run exactly X copies of mfaktc, or does it make more efficient use of the GPUs to run more?[/QUOTE] It's not mfaktc threads per GPU, it's CPU cores per GPU, and it's just an accident that it's one mfaktc thread per CPU core. Rephrased: one CPU core may or may not be enough to saturate the GPU. If not, run a second core on the same GPU; in practice you do this by running a second mfaktc instance. So figure out how many cores are necessary to saturate the GPU, then run that number of mfaktc's.

Now, the other MAJOR piece of advice: please Please PLEASE set each mfaktc thread to run on a [I]specific core[/I]. When I ran three cores of P95 with one mfaktc, without setting mfaktc to a specific core, the P95 workers got about 80% efficiency and the fourth core got 50% efficiency. I figured out (with help on the forum) how to do that, and both P95 and mfaktc performance went up dramatically. The Windows or Linux scheduler isn't designed to handle such specific things. In Windows, you can set the "affinity" of each process under the Processes tab of Task Manager by right-clicking on whatever process you want.

Please try this before putting more than one CPU core per GPU. I'd think one Nehalem core would be pretty good for a GPU, unless it's like a 590 or something. Note: you'll need to know which cores P95 is using to avoid setting the wrong affinity. |
[QUOTE=Dubslow;274199]Or GPU-Z[/QUOTE]
Right, that's a better choice since it is a monitoring utility. I have it on the desktop and forgot about it since I haven't used it for a few months. Chuck |
[QUOTE=Dubslow;274200]It's not mfaktc threads per GPU, it's CPU cores per GPU, and it's just an accident that it's one mfaktc thread per CPU core.
Rephrase that: One CPU core may or may not be enough to saturate the GPU. If not, run a second core on the same GPU. In order to actually do this, you run a second mfaktc instance. So figure out how many cores are necessary to saturate the GPU, then run that number of mfaktc's. Now, the other MAJOR piece of advice: please Please PLEASE set each mfaktc thread to run on a [I]specific core[/I]. When I ran three cores P95 with one mfaktc, the P95's got about 80% efficiency, and the fourth core got 50% efficiency, without setting mfaktc to a specific core. I figured out (with help on the forum) how to do that, and both P95 and mfaktc performance went up dramatically. The Windows or Linux scheduler isn't designed to handle such specific things. In Windows, you can set the "affinity" of each process under the Process tab on the Task Manager by right clicking on whatever process you want. Please try this before putting more than one CPU core per GPU. I'd think one Nehalem core would be pretty good for a GPU, unless it's like a 590 or something. Note: You'll need to know which cores P95 is using to avoid setting the wrong affinity.[/QUOTE] I have tried fiddling with the CPU affinities, but any changes I make have ended up slowing things down. |
Hmm. My gut-reaction guess would be that without setting a CPU, the scheduler gives mfaktc as much time on as many cores as it can use, but setting the affinity then limits it to the amount of work that one core can do. This would mean that mfaktc uses more than one core's worth of power to keep up with the 580, and without the affinities set, the scheduler lets it have that time. Have you tried setting mfaktc to two or three cores?
Let me do that again after reviewing your hardware: you have two separate instances of mfaktc to saturate just one 580, and four simultaneous P95 threads? Okay. Do you use the "best cpu" setting in P95, or do you set it to use specific CPUs? If you set mfaktc affinities without P95 doing the same, the scheduler could still create interference between the two. If P95 is already set, then try one mfaktc instance with affinity set to two cores (obviously the ones not in use by P95). If that doesn't help, try the two instances you currently use and set them each to one core. Beyond that, I'm not really sure. What CPU affinity settings have you already tried? |
[QUOTE=Chuck;274242]I have tried fiddling with the CPU affinities, but any changes I make have ended up slowing things down.[/QUOTE]
Some things to be aware of, but you may already know:

1) P95 tries to run at lowest priority, and has some intelligence about cores and threads.

2) mfaktc, as currently written, is one instance, one thread. When it starts communicating automatically, it will gain a communications thread, but it is otherwise not aware of processes or threads except on the GPU side. So each instance runs at whatever priority it inherits from the window it starts in, unless you fool with it in Task Manager.

3) mfaktc is modelled as: sieve on CPU feeds factor candidates to test on GPU. That means that increasing CPU performance to feed more or better factor candidates takes CPU away from P95. |
[QUOTE=Dubslow;274200]Now, the other MAJOR piece of advice: please Please PLEASE set each mfaktc thread to run on a [I]specific core[/I]. When I ran three cores P95 with one mfaktc, the P95's got about 80% efficiency, and the fourth core got 50% efficiency, without setting mfaktc to a specific core. I figured out (with help on the forum) how to do that, and both P95 and mfaktc performance went up dramatically. The Windows or Linux scheduler isn't designed to handle such specific things. In Windows, you can set the "affinity" of each process under the Process tab on the Task Manager by right clicking on whatever process you want. [/QUOTE]
This needs to be placed somewhere on the GPU FAQ. I just came upon this piece of advice; tried it, and it is indeed very important. |
2^?
what is the max exponent i can use in mfaktc ? |
Bigger than you need. Theoretically unlimited. As the exponent goes higher, fewer candidates have to be tested for each bit level, so the amount of work that needs to be done for the same bit level is lower. Practically, there might be a hardware cap somewhere at 64, or 80, or 96 bits; who cares? It would not make much sense to go over 10G.
The range below 1G is handled by Primenet, the higher ranges by OBD projects. Aside from strange experiments, there will be totally NO NEED to go higher than 10G. A better question would be "what is the maximum bit level" for a factor. That is somewhere at 92 or 96 bits, again bigger than one needs with the currently available hardware (speeds). |
I think, as currently designed, mfaktc has a hard limit of a 32-bit exponent (2^32-1). There might be other internal limitations that reduce this even further. Try running mfaktc with 4294967291, the largest 32-bit prime, and see if it accepts it.
|
[QUOTE=NormanRKN;306164]2^?
what is the max exponent i can use in mfaktc ?[/QUOTE] [QUOTE=LaurV;306166]Bigger than you need. Theoretically unlimited. As the exponent goes higher, fewer candidates have to be tested for each bit level, so the amount of work that needs to be done for the same bit level is lower. Practically, there might be a hardware cap somewhere at 64, or 80, or 96 bits; who cares? It would not make much sense to go over 10G. The range below 1G is handled by Primenet, the higher ranges by OBD projects. Aside from strange experiments, there will be totally NO NEED to go higher than 10G. A better question would be "what is the maximum bit level" for a factor. That is somewhere at 92 or 96 bits, again bigger than one needs with the currently available hardware (speeds).[/QUOTE] [QUOTE=axn;306167]I think, as currently designed, mfaktc has a hard limit of a 32-bit exponent (2^32-1). There might be other internal limitations that reduce this even further. Try running mfaktc with 4294967291, the largest 32-bit prime, and see if it accepts it.[/QUOTE] I found a limit on the size of k in the valid assignment checker. 
[code]int ret = 1;

if(exp < 1000000)
  {ret = 0; if(verbosity >= 1) printf("WARNING: exponents < 1000000 are not supported!\n");}
else if(!isprime(exp))
  {ret = 0; if(verbosity >= 1) printf("WARNING: exponent is not prime!\n");}
else if(bit_min < 1)
  {ret = 0; if(verbosity >= 1) printf("WARNING: bit_min < 1 doesn't make sense!\n");}
else if(bit_min > 94)
  {ret = 0; if(verbosity >= 1) printf("WARNING: bit_min > 94 is not supported!\n");}
else if(bit_min >= bit_max)
  {ret = 0; if(verbosity >= 1) printf("WARNING: bit_min >= bit_max doesn't make sense!\n");}
else if(bit_max > 95)
  {ret = 0; if(verbosity >= 1) printf("WARNING: bit_max > 95 is not supported!\n");}
else if(((double)(bit_max-1) - (log((double)exp) / log(2.0F))) > 63.9F)
  /* this leaves enough room so k_min/k_max won't overflow in tf_XX() */
  {ret = 0; if(verbosity >= 1) printf("WARNING: k_max > 2^63.9 is not supported!\n");}[/code] With those limits, the maximum exponent is somewhere around [URL="http://www.wolframalpha.com/input/?i=solve%202%5E95%3D2*2%5E63.9*p&t=crmtb01"]1.15081*10[sup]9[/sup][/URL], or just north of 1G, or just over a quarter of axn's limit. [url]http://www.wolframalpha.com/input/?i=primes+near+1.15081*10%5E9[/url] Edit: It's probably not that hard to modify the code to go higher, but I'll let axn/TheJudger/other knowledgeable persons speak to that. |
I didn't want to go into so much detail and confuse the man. :D
Oliver said a few times that the [URL="http://www.mersenneforum.org/showpost.php?p=251781&postcount=575"]exponent is limited to 32 bits[/URL]; does it matter? For higher, use Factor5 or so. I was one of the guys asking long ago about factoring OBD exponents, and got a reply from Uncwilly about Factor5; the discussion is somewhere around. edit @Dubslow: the size of k does not limit the size of the exponent. When the exponent gets higher, k gets smaller. The factor candidate has to fit the Barrett/whatever kernel when squared mod the factor. The exponent is just a string of bits which says how many times you do the square-shift trick. |
[QUOTE=LaurV;306169]
edit @Dubslow: size of k does not limit the size of the expo. When expo gets higher, k gets smaller. The factor candidate has to fit the barret/whatever/kernel when squared mod factor. The expo is just a string of bits which says how many times you do the square-shift trick.[/QUOTE] :doh!: Yeah, I screwed up. Too lazy to care though. |