mersenneforum.org  

2011-06-19, 11:34   #34
Colt45ws
Jun 2010
171₁₀ Posts

No, 55000
I have to amend my previous post: I must have made a mistake when keeping track of which GridSize I was running. 3, 2, and 1 are identical, maybe leaning almost imperceptibly towards 2. Then come 4 and 0.

CPU load is around 13%, or about 65% of a single core.
I'm running an i7-920 @ 4 GHz.
Attached Files
File Type: txt GridSize.txt (11.6 KB, 289 views)

2011-06-19, 15:31   #35
Christenson
Dec 2010
Monticello
5×359 Posts

mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.

2011-06-19, 17:17   #36
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT)
13035₈ Posts

Quote:
Originally Posted by Christenson View Post
mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.
Can I suggest that if you can get receiving assignments working faster than both then you should. It is fine to only report results occasionally but running out of work is bad.

2011-06-19, 18:15   #37
davieddy
"Lucan"
Dec 2006
England
1100101001010₂ Posts

Quote:
Originally Posted by henryzz View Post
but running out of work is bad.
That happens to be one of my favourite occupations.

But if picking the low-hanging fruit floats your boat,
go ahead with Breadth First.
OTOH if you get bored with finding new factors (or "getting work"),
try making it as easy for us CPU-bound, patient,
LL-testing prime searchers as possible.

TFing from X to X+1 bits is 1/7th of the effort of TFing from X to X+3.*

David

*Open to correction, but you get the idea.
1+2+4 = 7
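
The footnote's arithmetic can be sketched in a few lines of C, assuming (as is usual for trial factoring) that each bit level costs about twice as much as the one before. The function name `tf_effort_units` is made up for illustration; it is not part of mfaktc or mfakto.

```c
#include <assert.h>
#include <stdint.h>

/* Relative trial-factoring effort for going from bit level `from` to `to`,
 * measured in units of the first level (from -> from+1).
 * Assumption: each bit level costs twice the previous one. */
uint64_t tf_effort_units(unsigned from, unsigned to)
{
    assert(to >= from && to - from < 64);
    uint64_t total = 0;
    for (unsigned k = 0; k < to - from; ++k)
        total += (uint64_t)1 << k;   /* 1 + 2 + 4 + ... */
    return total;
}
```

Under that assumption `tf_effort_units(70, 71)` is 1 and `tf_effort_units(70, 73)` is 7, which is where the 1/7th comes from.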

Last fiddled with by davieddy on 2011-06-19 at 18:52

2011-06-19, 18:41   #38
Christenson
Dec 2010
Monticello
5·359 Posts

Henry:
Once automatic reporting begins to work, it will come all at once... I'm having issues learning my tools (Eclipse) right now and just have to sit and work at it, then add the mutex and thread management stuff and call the appropriate parts of P95.

As for you CPU-bound, LL-testing types (which, incidentally, includes myself), don't worry. The way I look at it is that TF and P-1 both have as their goal making as many LL tests as possible unnecessary. Odds of finding a factor for a given exponent, for the current bit level of 70, are about 1/70. Supposing the GPUs are 128 times faster than the CPUs, we can do 7 extra bit levels (128 = 2^7, and each level doubles the work), which will factor about 10% of the candidates that wouldn't have been factored by CPU. This helps, but the real speed-up in finding M48 and beyond will be in freed-up CPUs not doing TF and in the GPU LL tests.
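
A rough sketch of that estimate, under the same rules of thumb as the post: the chance of a factor at bit level b is about 1/b, and each extra bit level doubles the work. The helper names are hypothetical and not taken from any GIMPS software.

```c
#include <assert.h>

/* Number of extra bit levels a given speed-up buys, assuming each
 * level costs twice the previous one: floor(log2(speedup)). */
unsigned extra_bit_levels(unsigned speedup)
{
    assert(speedup >= 1);
    unsigned levels = 0;
    while (speedup > 1) {
        speedup >>= 1;
        ++levels;
    }
    return levels;
}

/* Heuristic chance of finding at least one factor when going from
 * bit level `from` to bit level `to`, assuming a 1/b chance at level b. */
double factor_chance(unsigned from, unsigned to)
{
    double none = 1.0;   /* chance that no level finds a factor */
    for (unsigned b = from + 1; b <= to; ++b)
        none *= 1.0 - 1.0 / b;
    return 1.0 - none;
}
```

With these assumptions `extra_bit_levels(128)` is 7, and `factor_chance(70, 77)` comes out at roughly 0.09 (the product telescopes to 1 - 70/77), in line with the "about 10%" above.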

2011-06-19, 18:50   #39
Bdot
Nov 2010
Germany
3·199 Posts

Quote:
Originally Posted by henryzz View Post
Can I suggest that if you can get receiving assignments working faster than both then you should. It is fine to only report results occasionally but running out of work is bad.
Hehe, if receiving the assignment takes longer than the task itself, then we don't need to optimize the GPU kernels anymore ...

2011-06-19, 21:45   #40
Ken_g6
Jan 2005
Caught in a sieve
394₁₀ Posts

Quote:
Originally Posted by Christenson View Post
mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.
Don't forget about this thread. It looks like some work has been done on this kind of sieving this year!

2011-06-23, 02:07   #41
Bdot
Nov 2010
Germany
1001010101₂ Posts

This missing carry flag is driving me nuts ...

Does anyone have a better idea for the carry propagation?

Code:
typedef struct _int96_t
{
  uint d0, d1, d2;
} int96_t;

void sub_96(int96_t *res, int96_t a, int96_t b)
/* res = a - b; a must be greater than or equal to b! */
{
  uint carry = (b.d0 > a.d0);  /* borrow out of the low word */

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
  /* d1 borrowed if it wrapped past a.d1, or stayed equal with a borrow coming in */
  res->d2 = a.d2 - b.d2 - (((res->d1 > a.d1) || ((res->d1 == a.d1) && carry)) ? 1 : 0);
}
I also need this for an int192 (6x32 bit). Then the above logic would become quite lengthy ... Do I really need to use something like this:

Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);

  carry = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry);
  res->d2 = a.d2 - b.d2 - (carry ? 1 : 0);

  carry = (res->d2 > a.d2) || ((res->d2 == a.d2) && carry);
  res->d3 = a.d3 - b.d3 - (carry ? 1 : 0);

  carry = (res->d3 > a.d3) || ((res->d3 == a.d3) && carry);
  res->d4 = a.d4 - b.d4 - (carry ? 1 : 0);
 
...
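
One way to keep the 192-bit version short is to put the limbs in an array and carry the borrow through a loop. The following is a plain C sketch rather than OpenCL kernel code; the `LIMBS` constant and the `sub_limbs` name are assumptions for illustration, and on a GPU the loop would presumably be unrolled by the compiler or by hand.

```c
#include <assert.h>
#include <stdint.h>

#define LIMBS 6   /* 6 x 32 bit = 192 bit; use 3 for the 96-bit case */

/* res = a - b over LIMBS 32-bit limbs, least significant limb first.
 * Returns the final borrow: 0 if a >= b, 1 if the result wrapped. */
uint32_t sub_limbs(uint32_t res[LIMBS], const uint32_t a[LIMBS],
                   const uint32_t b[LIMBS])
{
    uint32_t borrow = 0;
    for (int i = 0; i < LIMBS; ++i) {
        uint32_t d = a[i] - b[i] - borrow;
        /* same test as in the post: the limb borrowed if it wrapped past
         * a[i], or stayed equal to a[i] with a borrow coming in */
        borrow = (d > a[i]) | ((d == a[i]) & borrow);
        res[i] = d;
    }
    assert(borrow <= 1);
    return borrow;
}
```

The borrow is computed before the limb is stored, so `res` may point at the same array as `a` or `b`.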

Last fiddled with by Bdot on 2011-06-23 at 02:08

2011-06-23, 04:21   #42
Ken_g6
Jan 2005
Caught in a sieve
110001010₂ Posts

Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.
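
A minimal sketch of what the 4x24-bit idea could look like for subtraction, in plain C rather than OpenCL. The type and function names are made up for illustration; the assumption is that each 24-bit limb sits in a 32-bit word with the upper 8 bits zero, so an underflow shows up in bit 31 and can be read out with a shift.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_MASK 0xFFFFFFu   /* low 24 bits */

/* 96-bit value as 4 limbs of 24 bits, each stored in a 32-bit word
 * (upper 8 bits zero), least significant limb first. */
typedef struct { uint32_t d[4]; } int96_24_t;

/* res = a - b; returns the final borrow (0 if a >= b). */
uint32_t sub_96_24(int96_24_t *res, const int96_24_t *a, const int96_24_t *b)
{
    uint32_t borrow = 0;
    for (int i = 0; i < 4; ++i) {
        assert(a->d[i] <= LIMB_MASK && b->d[i] <= LIMB_MASK);
        /* if the limb underflows, the 32-bit result wraps and bit 31 is set */
        uint32_t t = a->d[i] - b->d[i] - borrow;
        borrow = t >> 31;
        res->d[i] = t & LIMB_MASK;
    }
    return borrow;
}
```

No comparisons or conditionals are needed here; the borrow is just the top bit of the wrapped difference.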

2011-06-23, 09:14   #43
Bdot
Nov 2010
Germany
3·199 Posts

Quote:
Originally Posted by Ken_g6 View Post
Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.
Sure, the 24-bit stuff works quite well. I just wanted to get a 32-bit kernel running in order to compare exactly that.

BTW, conditional loads are not slow (1st cycle: eval condition and prepare the two possible load values, 2nd cycle: load it), they run at full speed. Only branches having a different control flow have that big penalty, which consists of executing both branches plus some overhead to mask out one of the executions.

2011-06-23, 09:44   #44
ldesnogu
Jan 2008
France
17·31 Posts

Warning: I don't know anything about OpenCL...

Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? If so, I would have written:

Code:
uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - carry;
res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));
and:

Code:
uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - carry;

carry = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
res->d2 = a.d2 - b.d2 - carry;
...
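
If the comparison chain is still a concern, another way to get the borrow without `?:` is to do each limb in a 64-bit temporary and read the borrow out of its top bit. This swaps in a different technique from the one posted above, and whether it is any faster on a given GPU is an open question (64-bit integer operations may well be emulated). It is a plain C sketch with `uint32_t` standing in for `uint`, and the name `sub_96_wide` is illustrative only.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96_t;

/* res = a - b; a must be greater than or equal to b. */
void sub_96_wide(int96_t *res, int96_t a, int96_t b)
{
    uint64_t t;

    /* a negative difference wraps modulo 2^64, which sets bit 63 */
    t = (uint64_t)a.d0 - b.d0;
    res->d0 = (uint32_t)t;

    t = (uint64_t)a.d1 - b.d1 - (t >> 63);   /* t >> 63 is the borrow */
    res->d1 = (uint32_t)t;

    t = (uint64_t)a.d2 - b.d2 - (t >> 63);
    res->d2 = (uint32_t)t;

    assert((t >> 63) == 0);   /* holds whenever a >= b, as promised */
}
```

The same three-line pattern repeats unchanged for d3, d4 and d5 in a 192-bit version.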
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.