[QUOTE=Bdot;342463]
Please check whether VectorSize=2 or VectorSize=3 helps when GPU sieving. This would not be very fortunate, as it leaves ~25% of the vector units unused, but it may still be better than having to spill registers to global memory ...[/QUOTE]
Sorry for the late reply. VectorSize=3 works with CPU sieving but not with GPU sieving... |
[QUOTE=kracker;343601]Sorry for the late reply. VectorSize 3 works on cpu sieving but not on gpu sieving...[/QUOTE]
Bug confirmed: since 3 is a bad choice for the CPU sieve kernels, I omitted it from the GPU sieve kernels, but without adding a check for it.

How did VectorSize=2 do? |
[QUOTE=Jayder;343478]Will the "best fit" GPUSievePrimes value remain constant? Or will it change as you change bit level, exponent, or possibly kernel? For example, if a GPUSievePrimes value of 52000 works best on a 332M exponent going from 2^69 to 2^70, will 52000 also be the best for a 65M exponent and 2^73 to 2^74? What about an 8M exponent and 2^60 to 2^61? Etc.

Are there any good strategies for finding the best value, other than intelligent trial and error? I searched for an answer in the mfaktc thread, but it is kind of a massive thread. I also considered experimenting and finding out for myself, but it would take a very long time to find out on my slow GPU. Hopefully I am not missing an answer that is staring me in the face.[/QUOTE]
I have not tested that yet, but here's my theory:

Sieving has a constant speed regardless of kernel, factor size, etc. Its speed depends only on the GPUSieve* parameters. The optimal GPUSievePrimes is the value that best balances the run time of the sieve kernel against that of the trial factoring kernel. Better (i.e. longer) sieving means less (i.e. shorter) trial factoring, in a non-linear dependency.

Therefore, anything that changes the speed of trial factoring will also change the optimal GPUSievePrimes. If testing takes longer (because a slower kernel needs to be used, or because the exponent has more bits), then GPUSievePrimes can be a bit higher. However, the differences should be rather small, and so are the achievable improvements.

I'll test that and add more details to my theoretical explanation soon. I'm also working on an automatic optimizer for those and other variables, so that the manual change-and-test cycle will no longer be needed. |
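Bdot's theory above can be illustrated with a toy cost model. This is only a sketch under loud assumptions, not how mfakto measures or tunes anything: sieve time is assumed to grow linearly with GPUSievePrimes, the fraction of candidates surviving the sieve is approximated with Mertens' third theorem, and every function name and cost constant below is hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649


def survivors_fraction(n_primes):
    """Rough fraction of candidates left after sieving with the first n primes.

    Assumption: Mertens' third theorem, prod_{p <= x} (1 - 1/p) ~ exp(-gamma) / ln(x),
    with x approximated by the n-th prime, p_n ~ n * ln(n).
    """
    x = n_primes * math.log(n_primes)
    return math.exp(-EULER_GAMMA) / math.log(x)


def total_time(n_primes, sieve_cost_per_prime, tf_cost_per_candidate, candidates=1.0e6):
    """Toy run time: linear sieve cost plus trial factoring of the survivors."""
    sieve = sieve_cost_per_prime * n_primes
    tf = tf_cost_per_candidate * candidates * survivors_fraction(n_primes)
    return sieve + tf


def best_sieve_primes(sieve_cost_per_prime, tf_cost_per_candidate,
                      lo=1000, hi=1_000_000, step=1000):
    """Scan a grid of sieve depths and return the one with the lowest toy run time."""
    return min(
        range(lo, hi + 1, step),
        key=lambda n: total_time(n, sieve_cost_per_prime, tf_cost_per_candidate),
    )


if __name__ == "__main__":
    # Hypothetical cost units: the "slow" kernel takes twice as long per candidate.
    print("fast TF kernel:", best_sieve_primes(1.0, 14.0))
    print("slow TF kernel:", best_sieve_primes(1.0, 28.0))
```

In this model, doubling the cost per trial-factoring candidate pushes the optimum to a higher sieve depth, which matches the direction Bdot predicts. The absolute numbers mean nothing; only real timing runs can give the best value for a given card.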
I've been testing the new version of mfakto on my 7770, and all I can say is -- wow!! :fusion:

Single instances of TF (70 --> 71) on an M672xxxxx went from 71 minutes (71.56 GHz-days/day) with 0.12 (x64) to 36 minutes (142.23 GHz-days/day) with 0.13 (x64). Thank you [B]Bdot[/B]... and thank you [B]kracker[/B] for alerting me to the new version! Oh, and Prime95 performance apparently is no longer taking a hit of any kind. Fantastic job! :tu:

One curiosity I noticed: the results for the same TF on the 0.13 x32 version were 35 minutes and 144.67 GHz-days/day, [B]better[/B] than for the x64 variety. (Nothing else changed in the system environment.) Is that expected? This wasn't the case with 0.12, where the x32 build (86 minutes, 59.61 GHz-days/day) wasn't quite as productive as the x64.

Rodrigo |
UPDATE:
Just finished a run of mfakto 0.13 x32 in two instances on the 7770 (CPU: Core i7-3770, Windows 7 Home Premium x64, three Prime95 workers running). One instance completed at a rate of 73.89 GHz-days/day; the other, at 73.92. (This time it wasn't the exact same exponent that was factored, but two consecutive ones.)

Prime95 is still virtually, if not completely, unaffected, but there no longer seems to be much benefit to running more than one instance of mfakto. (With version 0.12 I could run three instances of TF and reach ~140 GHz-days/day, although not consistently, and the average was closer to 130.)

Another thing: this time (with the two instances running) there was a severe lag whenever I moved the cursor or tried to reposition a window. I have verified that this does not happen when running a single instance of 0.13.

Hope that this info is useful.

Rodrigo |
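As a quick sanity check of the figures Rodrigo reports in the two posts above, the sketch below converts each run time and rate into GHz-days of work and compares the rates. The only inputs are the numbers from those posts; the helper name `work_done` is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def work_done(rate_ghz_days_per_day, minutes):
    """GHz-days of work completed at the given rate over the given run time."""
    return rate_ghz_days_per_day * minutes / MINUTES_PER_DAY


# (version, build) -> (run time in minutes, GHz-days/day), as reported above
runs = {
    ("0.12", "x64"): (71, 71.56),
    ("0.12", "x32"): (86, 59.61),
    ("0.13", "x64"): (36, 142.23),
    ("0.13", "x32"): (35, 144.67),
}

# The same assignment should be worth about the same credit in every run.
work = {key: work_done(rate, minutes) for key, (minutes, rate) in runs.items()}

# Rate ratios between builds and versions.
speedup_x64 = runs[("0.13", "x64")][1] / runs[("0.12", "x64")][1]
x32_gain = runs[("0.13", "x32")][1] / runs[("0.13", "x64")][1] - 1

# Two simultaneous 0.13 x32 instances versus the single-instance rate.
two_instances = 73.89 + 73.92
two_instance_gain = two_instances / runs[("0.13", "x32")][1] - 1

if __name__ == "__main__":
    for key, value in work.items():
        print(key, f"{value:.2f} GHz-days")
    print(f"0.12 -> 0.13 (x64) speedup: {speedup_x64:.2f}x")
    print(f"x32 over x64 on 0.13: {x32_gain:.1%}")
    print(f"two instances over one: {two_instance_gain:.1%}")
```

Every run works out to roughly 3.5 GHz-days, so the reported times and rates are consistent with one another. The 0.13 x64 build is just under twice as fast as 0.12 x64, the x32 build is ahead of x64 by a bit under 2%, and two instances together gain only about 2% over one. The last comparison is rough, since the two-instance run used different exponents.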
0.13 moved the sieving part to the GPU, which was previously done on the CPU. So, unless I'm hugely mistaken, when using GPU sieving you will only ever need to run one instance to reach full GPU load. The CPU will remain unused, because it's not doing work anymore. This is all expected behaviour. Pretty great, right?

If it's still lagging with one instance, I think GPUSieveSize is the setting you'll want to lower. It will make everything a lot more responsive, and I didn't see any reduction in GHzD when I lowered it.

As for why x32 seems better/the same as x64, I am not sure. I will leave that for somebody more knowledgeable to answer. But the difference between the two was only one minute; is it abnormal for it to fluctuate that greatly? You weren't using the GPU for something else?

Off topic: I discovered something that maybe others don't know about: Ctrl+S will pause/unpause the worker (any command-line function, it seems). I use this now instead of Ctrl+C. Hopefully I'm not breaking anything. |
[QUOTE=Jayder;344367]As for why x32 seems better/the same as x64, I am not sure.[/QUOTE]
If it is anything like the CUDA version, it's because with 32-bit vs. 64-bit memory addresses there's half the data to keep track of. The CPU sieve picked up some speed from using 64-bit code to offset this in older versions, but with all that happening in the GPU on the newer versions there's no [net] benefit to 64-bit code. See [url]http://mersenneforum.org/showpost.php?p=323678&postcount=1981[/url] for more detail. |
[QUOTE=Jayder;344367]0.13 moved the sieving part to the GPU, which was previously done on the CPU. So, unless I'm hugely mistaken, when using GPU sieving you will only need to ever run one instance to reach full GPU load. The CPU will remain unused, because it's not doing work anymore. This is all expected behaviour. Pretty great, right?

If it's still lagging with one instance, I think GPUSieveSize is the setting you'll want to lower. It will make everything a lot more responsive, and I didn't see any reduction in GHzD when I lowered it.

As for why x32 seems better/the same as x64, I am not sure. I will leave that for somebody more knowledgeable to answer. But the difference between the two was only one minute; is it abnormal for it to fluctuate that greatly? You weren't using the GPU for something else?

Off topic: I discovered something that maybe others don't know about: Ctrl+S will pause/unpause the worker (any command-line function, it seems). I use this now instead of Ctrl+C. Hopefully I'm not breaking anything.[/QUOTE]
Yeah, it IS pretty great! Less "paperwork" that way, keeping track of who's doing what on the computer.

No, there was nothing else going on with the GPU during any of the mfakto runs. But @kjaget's explanation makes sense.

Nice find about Ctrl+S, by the way. I'll start using that instead of Ctrl+C when I simply need it to pause.

Rodrigo |
[QUOTE=kjaget;344372]If it is anything like the CUDA version, it's because with 32-bit vs. 64-bit memory addresses there's half the data to keep track of.
The CPU sieve picked up some speed from using 64-bit code to offset this in older versions, but with all that happening in the GPU on the newer versions there's no [net] benefit to 64-bit code. See [URL]http://mersenneforum.org/showpost.php?p=323678&postcount=1981[/URL] for more detail.[/QUOTE]
Thanks for the explanation! I should be doing my work on the 32-bit version, then, to squeeze that little extra output from it.

Rodrigo |
[QUOTE=Bdot;343638]Bug confirmed - as 3 is a bad choice for the CPU sieve kernels, I omitted it from the GPU sieve kernels, without checks.
How did VectorSize=2 do?[/QUOTE]
Slower.

[QUOTE=Rodrigo;344357]UPDATE: Just finished a run of mfakto 0.13 x32 in two instances on the 7770 (CPU: Core i7-3770, Windows 7 Home Premium x64, three Prime95 workers running). One instance completed at a rate of 73.89 GHz-days/day; the other, at 73.92. (This time it wasn't the exact same exponent that was factored, but two consecutive ones.) Prime95 still virtually if not completely unaffected, but there seems to be almost no benefit any longer to running more than one instance of mfakto. (With version 0.12 I could run three instances of TF and reach ~140 GHz-days/day, although not consistently and the average was closer to 130.) Another thing: this time (with the two instances running) there was a severe lag whenever I moved the cursor or tried to reposition a window. I have verified that this does not happen when running the single instance of 0.13. Hope that this info is useful. Rodrigo[/QUOTE]
Is this in the LL range, ~65M? On my 7770 (not overclocked) I am getting around 160 GHz-days/day. |
[QUOTE=kracker;344384]
Is this in the LL range ~65M? On my 7770 non OC'ed I am getting around 160 GHz-days/day.[/QUOTE]
The exponents were in the 67M range (TF). My 7770 is as it came, with no adjustments made to it. The only tweak I've made to mfakto 0.13 is to change the VectorSize value from the default 4 down to 2, as suggested by the program itself the first time I ran it.

Have you made any other adjustments to increase the output?

Rodrigo |