2011-06-19, 11:34  #34
Jun 2010
17_{10} Posts 
No, 55000
I have to amend my previous post: I must have made a mistake when I was keeping track of which GridSize I was running. 3, 2, and 1 are identical, maybe leaning towards 2 almost imperceptibly. Then come 4 and 0. CPU load is around 13%, or about 65% of a single core. I'm running an i7-920 @ 4GHz.
2011-06-19, 15:31  #35
Dec 2010
Monticello
5×359 Posts 
mfaktc/mfakto certainly needs a GPU-based siever... I have to complete a different project (automatic assignment handling) first before I can think about taking it on.

2011-06-19, 17:17  #36
Just call me Henry
"David"
Sep 2007
Cambridge (GMT)
13035_{8} Posts 
Can I suggest that if you can get receiving assignments working faster than both, then you should. It is fine to only report results occasionally, but running out of work is bad.

2011-06-19, 18:15  #37
"Lucan"
Dec 2006
England
1100101001010_{2} Posts 
That happens to be one of my favourite occupations.
But if picking the low-hanging fruit floats your boat, go ahead with Breadth First. OTOH if you get bored with finding new factors (or "getting work"), try making it as easy as possible for us CPU-bound, patient, LL-testing prime searchers. TFing X to X+1 is 1/7th of the X to X+3 effort.*
David
*Open to correction, but you get the idea: 1+2+4 = 7.
Last fiddled with by davieddy on 2011-06-19 at 18:52
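The 1+2+4 = 7 arithmetic above can be sketched in a few lines of C. This is only an illustration under the assumption that each bit level costs twice the previous one; the function name `tf_relative_cost` is made up for this example and is not part of mfaktc or mfakto.

```c
#include <stdint.h>

/* Relative cost of trial factoring from bit level `from` up to `to`,
 * measured in units of the first single level (from -> from+1).
 * Assumption: every level costs twice as much as the one before it. */
uint64_t tf_relative_cost(unsigned from, unsigned to)
{
    uint64_t cost = 0;
    uint64_t level_cost = 1;          /* cost of the level from -> from+1 */

    for (unsigned b = from; b < to; ++b) {
        cost += level_cost;           /* add the level b -> b+1 */
        level_cost *= 2;              /* the next level is twice as expensive */
    }
    return cost;
}
```

With these assumptions, `tf_relative_cost(70, 73)` returns 7, so the first level alone is 1/7th of the three-level effort.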
2011-06-19, 18:41  #38
Dec 2010
Monticello
5·359 Posts 
Henry:
Once automatic reporting begins to work, it will come all at once... I'm having issues with learning my tools (Eclipse) right now, I just have to sit and work at it... then add the mutex and thread management stuff and call the appropriate parts of P95.
As for you CPU-bound, LL-testing types (which, incidentally, includes myself), don't worry. The way I look at it is that TF and P-1 both have as their goal making as many LL tests as possible unnecessary. Odds of finding a factor for a given exponent, for the current bit level of 70, are about 1/70. Supposing the GPUs are 128 times faster than the CPUs, then we can do 7 extra bit levels (128 = 2^7, and each level costs about twice the one before), which will factor about 10% (roughly 7 × 1/70) of the candidates that wouldn't have been factored by CPU. This helps, but the real speedup in finding M48 and beyond will be in freed-up CPUs not doing TF and in the GPU LL tests.
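The estimate above can be checked with a rough C sketch. It assumes the rule of thumb from the post (about a 1/b chance of a factor at bit level b) and simply adds the per-level chances, which is an approximation; `extra_bit_levels` and `factor_chance` are hypothetical helper names.

```c
#include <stdint.h>

/* Whole number of extra bit levels a `speedup`-fold faster device can afford,
 * assuming the work doubles with every level: floor(log2(speedup)). */
unsigned extra_bit_levels(uint64_t speedup)
{
    unsigned n = 0;
    while (speedup >= 2) {
        speedup /= 2;
        ++n;
    }
    return n;
}

/* Approximate chance of finding a factor in the `levels` bit levels above
 * `from`, using about 1/b per level b and summing the individual chances. */
double factor_chance(unsigned from, unsigned levels)
{
    double p = 0.0;
    for (unsigned b = from + 1; b <= from + levels; ++b)
        p += 1.0 / b;
    return p;
}
```

For a 128-fold speedup this gives 7 extra levels, and `factor_chance(70, 7)` comes out a little under 0.10, in line with the "about 10%" figure.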
2011-06-19, 18:50  #39
Nov 2010
Germany
3·199 Posts 
Hehe, if receiving the assignment takes longer than the task itself, then we don't need to optimize the GPU kernels anymore ...

2011-06-19, 21:45  #40
Jan 2005
Caught in a sieve
394_{10} Posts 
Quote:


2011-06-23, 02:07  #41
Nov 2010
Germany
1001010101_{2} Posts 
This missing carry flag is driving me nuts ...
Has anyone a better idea for the carry propagation:
Code:
typedef struct _int96_t { uint d0, d1, d2; } int96_t;

void sub_96(int96_t *res, int96_t a, int96_t b)
/* a must be greater or equal b!
   res = a - b */
{
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
  res->d2 = a.d2 - b.d2 - (((res->d1 > a.d1) || ((res->d1 == a.d1) && carry)) ? 1 : 0);
}
Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
  carry   = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry);
  res->d2 = a.d2 - b.d2 - (carry ? 1 : 0);
  carry   = (res->d2 > a.d2) || ((res->d2 == a.d2) && carry);
  res->d3 = a.d3 - b.d3 - (carry ? 1 : 0);
  carry   = (res->d3 > a.d3) || ((res->d3 == a.d3) && carry);
  res->d4 = a.d4 - b.d4 - (carry ? 1 : 0);
  ...
Last fiddled with by Bdot on 2011-06-23 at 02:08
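For anyone who wants to check the borrow logic of sub_96 on a CPU, here is a plain C port next to a reference version that uses 64-bit intermediates. It is only a sketch: `uint32_t` stands in for OpenCL's `uint`, and `sub_96_ref` and `sub_96_matches_ref` are names invented for this example.

```c
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96_t;   /* d0 is the least significant word */

/* res = a - b, requires a >= b. Same borrow logic as the post, in plain C. */
void sub_96(int96_t *res, int96_t a, int96_t b)
{
    uint32_t borrow = (b.d0 > a.d0);

    res->d0 = a.d0 - b.d0;
    res->d1 = a.d1 - b.d1 - borrow;
    borrow  = (res->d1 > a.d1) || ((res->d1 == a.d1) && borrow);
    res->d2 = a.d2 - b.d2 - borrow;
}

/* Reference: the same subtraction with 64-bit intermediates.
 * Bit 63 of t is set exactly when the word subtraction went negative. */
void sub_96_ref(int96_t *res, int96_t a, int96_t b)
{
    uint64_t t = (uint64_t)a.d0 - b.d0;
    res->d0 = (uint32_t)t;
    t = (uint64_t)a.d1 - b.d1 - (t >> 63);
    res->d1 = (uint32_t)t;
    t = (uint64_t)a.d2 - b.d2 - (t >> 63);
    res->d2 = (uint32_t)t;
}

/* Returns 1 if both versions produce the same result for a - b. */
int sub_96_matches_ref(int96_t a, int96_t b)
{
    int96_t x, y;
    sub_96(&x, a, b);
    sub_96_ref(&y, a, b);
    return x.d0 == y.d0 && x.d1 == y.d1 && x.d2 == y.d2;
}
```

The interesting inputs are those where the middle word of b is 0xFFFFFFFF and the low word borrows, because that is the only case in which the `res->d1 == a.d1` term decides the borrow.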
2011-06-23, 04:21  #42
Jan 2005
Caught in a sieve
110001010_{2} Posts 
Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.
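A minimal C sketch of what the 4x24-bit idea could look like for subtraction. The type `int96_24_t`, the function `sub_96_24` and the `LIMB_MASK` constant are hypothetical names for this illustration, and it assumes every limb of both inputs already fits in 24 bits.

```c
#include <stdint.h>

#define LIMB_MASK 0xFFFFFFu   /* low 24 bits */

/* 96-bit number stored as four 24-bit limbs, d0 least significant.
 * Each limb sits in a 32-bit word with its top 8 bits clear. */
typedef struct { uint32_t d0, d1, d2, d3; } int96_24_t;

/* res = a - b, requires a >= b. Because the limbs use only 24 bits, a word
 * subtraction that goes negative wraps around and sets bit 31, so the borrow
 * can be read off with a shift instead of a comparison. */
void sub_96_24(int96_24_t *res, int96_24_t a, int96_24_t b)
{
    uint32_t t;

    t = a.d0 - b.d0;
    res->d0 = t & LIMB_MASK;
    t = a.d1 - b.d1 - (t >> 31);   /* t >> 31 is the borrow from the limb below */
    res->d1 = t & LIMB_MASK;
    t = a.d2 - b.d2 - (t >> 31);
    res->d2 = t & LIMB_MASK;
    t = a.d3 - b.d3 - (t >> 31);
    res->d3 = t & LIMB_MASK;
}
```

The price is one extra limb and a mask per limb; whether that beats the 3x32 layout on a given GPU would have to be measured.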

2011-06-23, 09:14  #43
Nov 2010
Germany
3·199 Posts 
Quote:
BTW, conditional loads are not slow (1st cycle: eval condition and prepare the two possible load values, 2nd cycle: load it), they run at full speed. Only branches having a different control flow have that big penalty, which consists of executing both branches plus some overhead to mask out one of the executions. 

2011-06-23, 09:44  #44
Jan 2008
France
17·31 Posts 
Warning: I don't know anything about OpenCL...
Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? If so, then I would have written:
Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - carry;
  res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));
Code:
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - carry;
  carry   = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
  res->d2 = a.d2 - b.d2 - carry;
  ...
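The same branch-free pattern extends to any number of words with a loop. The sketch below is plain C, where relational operators always yield 0 or 1; as far as I know OpenCL C behaves the same way for scalar comparisons, while vector comparisons yield all bits set (-1) for true, so the trick would need adjusting for vector types. The name `sub_limbs` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* res = a - b over n 32-bit limbs, least significant limb first.
 * Returns the final borrow: 0 if a >= b, 1 if the result wrapped around.
 * The borrow is computed with |, & and comparisons only, no branches. */
uint32_t sub_limbs(uint32_t *res, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t borrow = 0;

    for (size_t i = 0; i < n; ++i) {
        uint32_t ai = a[i];                 /* read first so res may alias a */
        uint32_t d  = ai - b[i] - borrow;

        /* A borrow occurred if the result grew, or if it stayed the same
         * while a borrow came in (that is, b[i] + borrow was exactly 2^32). */
        borrow = (uint32_t)(d > ai) | ((uint32_t)(d == ai) & borrow);
        res[i] = d;
    }
    return borrow;
}
```

Calling it with n = 3 reproduces the three-word case from the thread, and n = 5 covers the longer variant that ends in "...".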