mersenneforum.org  

2013-06-16, 23:15   #815
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
2³×271 Posts

Quote:
Originally Posted by Bdot
Please check whether VectorSize=2 or VectorSize=3 helps when GPU sieving. This would not be very fortunate, as it leaves ~25% of the vector units unused, but it may still be better than having to spill registers to global memory ...
Sorry for the late reply. VectorSize 3 works with CPU sieving but not with GPU sieving...
Attached thumbnail: v3f.jpg (202.1 KB)

2013-06-17, 12:34   #816
Bdot
Nov 2010
Germany
255₁₆ Posts

Quote:
Originally Posted by kracker
Sorry for the late reply. VectorSize 3 works with CPU sieving but not with GPU sieving...
Bug confirmed - as 3 is a bad choice for the CPU sieve kernels, I omitted it from the GPU sieve kernels, without checks.

How did VectorSize=2 do?

2013-06-17, 12:48   #817
Bdot
Nov 2010
Germany
3·199 Posts

Quote:
Originally Posted by Jayder
Will the "best fit" GPUSievePrimes value remain constant? Or will it change as you change bit level, exponent, or possibly kernel? For example, if a GPUSievePrimes value of 52000 works best on a 332M exponent going from 2^69 to 2^70, will 52000 also be the best for a 65M exponent and 2^73 to 2^74? What about an 8M exponent and 2^60 to 2^61? Etc.

Are there any good strategies for finding the best value? Other than intelligent trial and error.

I searched for an answer in the mfaktc thread, but it is kind of a massive thread. I also considered experimenting and finding out for myself, but it would take a very long time to find out on my slow GPU. Hopefully I am not missing an answer that is staring me in the face.
I have not tested that yet, but here's my theory:

Sieving has a constant speed for different kernels, factor sizes etc. Its speed depends only on the GPUSieve* parameters.

The optimal GPUSievePrimes value depends on finding the best balance between the run time of the sieve kernel and the run time of the trial factoring kernel. Better (i.e. longer) sieving means less (i.e. shorter) trial factoring, but the relationship is non-linear. Therefore, anything that changes the speed of trial factoring will also change the optimal GPUSievePrimes. If testing takes longer (because a slower kernel needs to be used, or because the exponent has more bits), then GPUSievePrimes can be a bit higher.

However, the differences should be rather small, and so are the achievable improvements. I'll test that and add more details to my theoretical explanation soon.

I'm also working on an automatic optimizer for those and other variables, so that no more manual change-and-test cycle will be needed.
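
The trade-off described above can be pictured with a toy model. This is only a sketch under assumptions that are not from the post: sieve time is taken to grow linearly with the number of sieve primes, the fraction of candidates surviving the sieve is taken to be the product of (1 - 1/p) over the odd sieve primes, and the cost constants are made-up numbers in arbitrary units. It is not how mfakto measures or chooses anything.

```python
def first_odd_primes(n):
    """Return the first n odd primes using a simple sieve of Eratosthenes."""
    limit = 32
    while True:
        flags = bytearray([1]) * (limit + 1)
        flags[0:2] = b"\x00\x00"
        for i in range(2, int(limit ** 0.5) + 1):
            if flags[i]:
                # Clear every multiple of i starting at i*i.
                flags[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
        odd = [p for p in range(3, limit + 1) if flags[p]]
        if len(odd) >= n:
            return odd[:n]
        limit *= 2  # not enough primes yet, sieve a larger range


def total_time(n_primes, primes, sieve_cost_per_prime, tf_cost_unsieved):
    """Toy cost: linear sieve cost plus TF cost for the surviving candidates."""
    survive = 1.0
    for p in primes[:n_primes]:
        survive *= 1.0 - 1.0 / p
    return n_primes * sieve_cost_per_prime + survive * tf_cost_unsieved


def best_sieve_primes(choices, tf_cost_unsieved, sieve_cost_per_prime=1.0):
    """Pick the candidate value with the smallest modelled total time."""
    primes = first_odd_primes(max(choices))
    return min(
        choices,
        key=lambda n: total_time(n, primes, sieve_cost_per_prime, tf_cost_unsieved),
    )


choices = list(range(1000, 20001, 1000))
fast_tf = best_sieve_primes(choices, tf_cost_unsieved=5e5)  # hypothetical fast TF kernel
slow_tf = best_sieve_primes(choices, tf_cost_unsieved=2e6)  # hypothetical slow TF kernel
print(fast_tf, slow_tf)
```

In this model a slower trial factoring kernel shifts the best value upwards, which matches the direction of the argument above. The size of the shift depends entirely on the invented constants, so only the direction should be read from it.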

2013-06-25, 02:19   #818
Rodrigo
Jun 2010
Pennsylvania
2·467 Posts

I've been testing the new version of mfakto on my 7770, and all I can say is -- wow!!

Single instances of TF (70 --> 71) on an M672xxxxx went from 71 minutes (71.56 GHz-days/day) with 0.12 (x64), to 36 minutes (142.23 GHz-days/day) with 0.13 (x64). Thank you Bdot... and thank you kracker for alerting me to the new version!
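
As a quick sanity check on these figures: run time multiplied by throughput should give about the same GHz-days credit for both versions, since the assignment is the same. The helper name below is ad hoc and the numbers are the ones quoted above.

```python
MINUTES_PER_DAY = 24 * 60


def credit_ghz_days(minutes, ghz_days_per_day):
    """Credit earned by a run: fraction of a day times the daily rate."""
    return minutes / MINUTES_PER_DAY * ghz_days_per_day


old = credit_ghz_days(71, 71.56)    # mfakto 0.12 x64
new = credit_ghz_days(36, 142.23)   # mfakto 0.13 x64
speedup = 142.23 / 71.56

print(f"{old:.2f} vs {new:.2f} GHz-days, speedup {speedup:.2f}x")
```

Both runs come out at roughly 3.5 GHz-days, so the two measurements agree with each other, and 0.13 is just under twice as fast as 0.12 on this card.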

Oh, and Prime95 performance apparently is no longer taking a hit of any kind. Fantastic job!

One curiosity I noticed: the results for the same TF on the 0.13 x32 version were 35 minutes and 144.67 GHz-days/day, better than for the x64 variety. (Nothing else changed in the system environment.) Is that expected? This wasn't the case with 0.12 x32 (86 minutes, 59.61 GHz-days/day), which wasn't quite as productive as the x64.

Rodrigo

2013-06-25, 05:27   #819
Rodrigo
Jun 2010
Pennsylvania
1646₈ Posts

UPDATE:

Just finished a run of mfakto 0.13 x32 in two instances on the 7770 (CPU: Core i7-3770, Windows 7 Home Premium x64, three Prime95 workers running).

One instance completed at a rate of 73.89 GHz-days/day; the other, at 73.92. (This time it wasn't the exact same exponent that was factored, but two consecutive ones.) Prime95 still virtually if not completely unaffected, but there seems to be almost no benefit any longer to running more than one instance of mfakto. (With version 0.12 I could run three instances of TF and reach ~140 GHz-days/day, although not consistently and the average was closer to 130.)
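
To put a number on "almost no benefit", the two-instance rates can be added up and compared with the single-instance 0.13 x32 figure from post #818. This is only a rough comparison: the exponents differ slightly between the runs, and run-to-run variation is not accounted for.

```python
single_instance = 144.67            # one 0.13 x32 instance, GHz-days/day (post #818)
two_instances = [73.89, 73.92]      # two concurrent 0.13 x32 instances

combined = sum(two_instances)
gain_percent = (combined / single_instance - 1) * 100

print(f"combined {combined:.2f} GHz-days/day, gain {gain_percent:.1f}%")
```

The gain from the second instance is about 2%, which is small enough that it may well be noise.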

Another thing: this time (with the two instances running) there was a severe lag whenever I moved the cursor or tried to reposition a window. I have verified that this does not happen when running the single instance of 0.13.

Hope that this info is useful.

Rodrigo

Last fiddled with by Rodrigo on 2013-06-25 at 05:35 Reason: update to the update

2013-06-25, 09:52   #820
Jayder
Dec 2012
116₁₆ Posts

0.13 moved the sieving part to the GPU, which was previously done on the CPU. So, unless I'm hugely mistaken, when using GPU sieving you will only need to ever run one instance to reach full GPU load. The CPU will remain unused, because it's not doing work anymore. This is all expected behaviour. Pretty great, right?

If it's still lagging with one instance, I think GPUSieveSize is the setting you'll want to lower. It will make everything a lot more responsive, and I didn't see any reduction in GHzD when I lowered it.
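
For anyone who prefers to script the change, here is a small sketch that rewrites one setting in ini-style text. It assumes the settings are plain `Key=Value` lines, as in mfakto.ini. The sample text and the values 64 and 32 are placeholders rather than recommendations or defaults, and `set_ini_value` is a hypothetical helper, not part of mfakto.

```python
import re


def set_ini_value(ini_text, key, value):
    """Replace the value of the first uncommented `key=value` line, or append one."""
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=).*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}", ini_text, count=1
    )
    if count == 0:
        # Key not present: add it at the end of the file.
        new_text = ini_text.rstrip("\n") + f"\n{key}={value}\n"
    return new_text


# Hypothetical excerpt; real files contain more settings and comments.
sample = "# hypothetical excerpt\nGPUSievePrimes=52000\nGPUSieveSize=64\n"
updated = set_ini_value(sample, "GPUSieveSize", 32)
print(updated)
```

The valid range for GPUSieveSize should be taken from the comments in the ini file itself, and the GHz-days/day figure should be compared before and after the change.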

As for why x32 seems better/the same as x64, I am not sure. I will leave that for somebody more knowledgeable to answer. But the difference between the two was only one minute; is it abnormal for it to fluctuate that greatly? You weren't using the GPU for something else?


Off topic: I discovered something that maybe others don't know about: Ctrl+S will pause/unpause the worker (any command-line function, it seems). I use this now instead of Ctrl+C. Hopefully I'm not breaking anything.

2013-06-25, 13:03   #821
kjaget
Jun 2005
129₁₀ Posts

Quote:
Originally Posted by Jayder
As for why x32 seems better/the same as x64, I am not sure.
If it is anything like the CUDA version, it's because with 32-bit vs. 64-bit memory addresses there's half the data to keep track of.

The CPU sieve picked up some speed from using 64-bit code to offset this in older versions, but with all that happening in the GPU on the newer versions there's no [net] benefit to 64-bit code.

See http://mersenneforum.org/showpost.ph...postcount=1981 for more detail.

Last fiddled with by kjaget on 2013-06-25 at 13:08

2013-06-25, 15:28   #822
Rodrigo
Jun 2010
Pennsylvania
2×467 Posts

Quote:
Originally Posted by Jayder
0.13 moved the sieving part to the GPU, which was previously done on the CPU. So, unless I'm hugely mistaken, when using GPU sieving you will only need to ever run one instance to reach full GPU load. The CPU will remain unused, because it's not doing work anymore. This is all expected behaviour. Pretty great, right?

If it's still lagging with one instance, I think GPUSieveSize is the setting you'll want to lower. It will make everything a lot more responsive, and I didn't see any reduction in GHzD when I lowered it.

As for why x32 seems better/the same as x64, I am not sure. I will leave that for somebody more knowledgeable to answer. But the difference between the two was only one minute; is it abnormal for it to fluctuate that greatly? You weren't using the GPU for something else?


Off topic: I discovered something that maybe others don't know about: Ctrl+S will pause/unpause the worker (any command-line function, it seems). I use this now instead of Ctrl+C. Hopefully I'm not breaking anything.
Yeah, it IS pretty great! Less "paperwork" that way, keeping track of who's doing what on the computer.

No, there was nothing else going on with the GPU during any of the mfakto runs. But @kjaget's explanation makes sense.

Nice find about Ctrl+S, by the way. I'll start using that instead of Ctrl+C when I simply need it to pause.

Rodrigo

2013-06-25, 15:35   #823
Rodrigo
Jun 2010
Pennsylvania
2·467 Posts

Quote:
Originally Posted by kjaget
If it is anything like the CUDA version, it's because with 32-bit vs. 64-bit memory addresses there's half the data to keep track of.

The CPU sieve picked up some speed from using 64-bit code to offset this in older versions, but with all that happening in the GPU on the newer versions there's no [net] benefit to 64-bit code.

See http://mersenneforum.org/showpost.ph...postcount=1981 for more detail.
Thanks for the explanation! I should be doing my work on the 32-bit version, then, to squeeze that little extra output from it.

Rodrigo

2013-06-25, 16:24   #824
kracker
"Mr. Meeseeks"
Jan 2012
California, USA
2³×271 Posts

Quote:
Originally Posted by Bdot
Bug confirmed - as 3 is a bad choice for the CPU sieve kernels, I omitted it from the GPU sieve kernels, without checks.

How did VectorSize=2 do?
Slower.
Quote:
Originally Posted by Rodrigo
UPDATE:

Just finished a run of mfakto 0.13 x32 in two instances on the 7770 (CPU: Core i7-3770, Windows 7 Home Premium x64, three Prime95 workers running).

One instance completed at a rate of 73.89 GHz-days/day; the other, at 73.92. (This time it wasn't the exact same exponent that was factored, but two consecutive ones.) Prime95 still virtually if not completely unaffected, but there seems to be almost no benefit any longer to running more than one instance of mfakto. (With version 0.12 I could run three instances of TF and reach ~140 GHz-days/day, although not consistently and the average was closer to 130.)

Another thing: this time (with the two instances running) there was a severe lag whenever I moved the cursor or tried to reposition a window. I have verified that this does not happen when running the single instance of 0.13.

Hope that this info is useful.

Rodrigo
Is this in the LL range (~65M)? On my non-overclocked 7770 I am getting around 160 GHz-days/day.

2013-06-25, 17:13   #825
Rodrigo
Jun 2010
Pennsylvania
2·467 Posts

Quote:
Originally Posted by kracker
Is this in the LL range (~65M)? On my non-overclocked 7770 I am getting around 160 GHz-days/day.
The exponents were in the 67M range (TF).

My 7770 is as it came, no adjustments made to it. The only tweak I've made to mfakto 0.13 is to change the VectorSize value from the default 4 down to 2, as suggested by the program itself the first time I ran it.

Have you made any other adjustments to increase the output?

Rodrigo

Last fiddled with by Rodrigo on 2013-06-25 at 17:14

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.