#34 |
Jun 2010
17 Posts |
No, 55000
I have to amend my previous post: I must have made a mistake when I was keeping track of which GridSize I was running. 3, 2, and 1 are identical, maybe leaning almost imperceptibly towards 2. Then 4 and 0. CPU load is around 13%, or about 65% of a single core. I'm running an i7-920 @ 4GHz |
#35 |
Dec 2010
Monticello
11100000011₂ Posts |
mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.
#36 |
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
2⁵·11·17 Posts |
Can I suggest that if you can get receiving assignments working faster than both then you should. It is fine to only report results occasionally but running out of work is bad.
#37 |
"Lucan"
Dec 2006
England
2·3·13·83 Posts |
That happens to be one of my favourite occupations.
But if picking the low-hanging fruit floats your boat, go ahead with Breadth First. OTOH if you get bored with finding new factors (or "getting work"), try making it as easy as possible for us CPU-bound, patient, LL-testing prime searchers. TFing from X to X+1 takes 1/7th of the effort of TFing from X to X+3.*

David

*Open to correction, but you get the idea. 1+2+4 = 7

Last fiddled with by davieddy on 2011-06-19 at 18:52 |
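The 1+2+4 = 7 arithmetic above can be sketched in a few lines of C. This assumes, as the post implies, that each additional bit level costs twice as much as the previous one. The function name `tf_relative_cost` is made up for illustration and is not part of mfaktc, mfakto, or Prime95.

```c
#include <assert.h>
#include <stdint.h>

/* Relative cost of trial factoring `levels` consecutive bit levels,
 * measured in units of the first level (X to X+1).
 * Assumption: every level costs twice as much as the one before it,
 * so the total is 1 + 2 + 4 + ... = 2^levels - 1. */
uint64_t tf_relative_cost(unsigned levels)
{
    uint64_t total = 0;
    uint64_t cost = 1;            /* cost of the current level */

    assert(levels < 64);          /* keep the sum inside 64 bits */
    for (unsigned i = 0; i < levels; ++i) {
        total += cost;
        cost *= 2;                /* next bit level is twice as expensive */
    }
    return total;
}
```

Under that assumption, `tf_relative_cost(3)` is 7, so the first bit level accounts for 1/7th of the work of going from X to X+3.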
#38 |
Dec 2010
Monticello
703₁₆ Posts |
Henry:
Once automatic reporting begins to work, it will come all at once....I'm having issues with learning my tools (Eclipse) right now, just have to sit and work at it...then add the mutex and thread management stuff and call the appropriate parts of P95.

As for you CPU-bound, LL-testing types (which, incidentally, includes myself), don't worry. The way I look at it is that TF and P-1 both have as their goal making as many LL tests as possible unnecessary. Odds of finding a factor for a given exponent, for the current bit level of 70, are about 1/70. Supposing the GPUs are 128 times faster than the CPUs (128 = 2^7), then we can do 7 extra bit levels, which will factor about 10% of the candidates that wouldn't have been factored by CPU.

This helps, but the real speed-up in finding M48 and beyond will be in freed-up CPUs not doing TF and in the GPU LL tests. |
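The estimate above can be reproduced with a small C sketch. It takes the post's own rules of thumb as assumptions: a speed-up of 2^k buys k extra bit levels, and the chance of finding a factor at bit level b is about 1/b. The function names are hypothetical and the result is a back-of-the-envelope figure, not a statement about actual GIMPS statistics.

```c
#include <assert.h>

/* Extra bit levels affordable with a given speed-up, taken as
 * floor(log2(speedup)); this is the rule of thumb used in the post. */
unsigned extra_bit_levels(unsigned speedup)
{
    unsigned levels = 0;

    assert(speedup >= 1);
    while (speedup > 1) {
        speedup /= 2;
        ++levels;
    }
    return levels;
}

/* Approximate fraction of candidates factored by the extra levels,
 * assuming odds of about 1/b at bit level b and simply adding them up
 * for b = from_bits+1 .. from_bits+levels. */
double extra_factor_fraction(unsigned from_bits, unsigned levels)
{
    double sum = 0.0;

    for (unsigned b = from_bits + 1; b <= from_bits + levels; ++b)
        sum += 1.0 / b;
    return sum;
}
```

With a speed-up of 128 this gives 7 extra levels, and summing 1/71 through 1/77 gives roughly 0.095, which matches the "about 10%" figure.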
#39 |
Nov 2010
Germany
3·199 Posts |
Hehe, if receiving the assignment takes longer than the task itself, then we don't need to optimize the GPU kernels anymore ...
#40 | |
Jan 2005
Caught in a sieve
5·79 Posts |
Quote:
#41 |
Nov 2010
Germany
3·199 Posts |
This missing carry flag is driving me nuts ...
Does anyone have a better idea for the carry propagation:

Code:

    typedef struct _int96_t
    {
      uint d0, d1, d2;
    } int96_t;

    void sub_96(int96_t *res, int96_t a, int96_t b)
    /* a must be greater or equal b!
       res = a - b */
    {
      uint carry = (b.d0 > a.d0);

      res->d0 = a.d0 - b.d0;
      res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
      res->d2 = a.d2 - b.d2 - (((res->d1 > a.d1) || ((res->d1 == a.d1) && carry)) ? 1 : 0);
    }

Code:

    uint carry = (b.d0 > a.d0);

    res->d0 = a.d0 - b.d0;
    res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
    carry   = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry);
    res->d2 = a.d2 - b.d2 - (carry ? 1 : 0);
    carry   = (res->d2 > a.d2) || ((res->d2 == a.d2) && carry);
    res->d3 = a.d3 - b.d3 - (carry ? 1 : 0);
    carry   = (res->d3 > a.d3) || ((res->d3 == a.d3) && carry);
    res->d4 = a.d4 - b.d4 - (carry ? 1 : 0);
    ...

Last fiddled with by Bdot on 2011-06-23 at 02:08 |
#42 |
Jan 2005
Caught in a sieve
110001011₂ Posts |
Getting the carries would be a lot simpler if the number were 4x24-bit numbers instead of 3x32. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, when 24-bit multiplies are faster than 32-bit ones.
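A minimal sketch of the 4x24-bit idea, written as plain host-side C rather than OpenCL so it can be compiled and checked anywhere. The type name `int96x24_t` and the function `sub_96x24` are made up for illustration and do not come from mfakto. Each limb is assumed to hold a value below 2^24 in a 32-bit word, so a borrow shows up in the top bit of the raw difference and can be extracted with a shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a 96-bit value as four 24-bit limbs,
 * least significant first, each stored in a 32-bit word. */
typedef struct { uint32_t d0, d1, d2, d3; } int96x24_t;

/* res = a - b, requires a >= b and every limb < 2^24.
 * Because the limbs leave 8 spare bits, a negative difference wraps
 * around and sets bit 31, so the borrow is just (t >> 31). */
int96x24_t sub_96x24(int96x24_t a, int96x24_t b)
{
    int96x24_t res;
    uint32_t t;

    t = a.d0 - b.d0;
    res.d0 = t & 0xFFFFFFu;
    t = a.d1 - b.d1 - (t >> 31);
    res.d1 = t & 0xFFFFFFu;
    t = a.d2 - b.d2 - (t >> 31);
    res.d2 = t & 0xFFFFFFu;
    t = a.d3 - b.d3 - (t >> 31);
    res.d3 = t & 0xFFFFFFu;

    assert((t >> 31) == 0);   /* a final borrow would mean a < b */
    return res;
}
```

No comparisons or conditionals are needed here. Whether this is actually faster on a given GPU would have to be measured.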
#43 | |
Nov 2010
Germany
3×199 Posts |
Quote:
BTW, conditional loads are not slow (1st cycle: eval condition and prepare the two possible load values, 2nd cycle: load it), they run at full speed. Only branches having a different control flow have that big penalty, which consists of executing both branches plus some overhead to mask out one of the executions. |
#44 |
Jan 2008
France
13·43 Posts |
Warning: I don't know anything about OpenCL...
Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? If so, I would have written:

Code:

    uint carry = (b.d0 > a.d0);

    res->d0 = a.d0 - b.d0;
    res->d1 = a.d1 - b.d1 - carry;
    res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));

Code:

    uint carry = (b.d0 > a.d0);

    res->d0 = a.d0 - b.d0;
    res->d1 = a.d1 - b.d1 - carry;
    carry   = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
    res->d2 = a.d2 - b.d2 - carry;
    ...
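One way to check the branchless version is to compare it on the host against a reference that uses 64-bit intermediates. The sketch below is plain C with `uint32_t` standing in for OpenCL's `uint`. `sub_96_ref` and `same_96` are made-up helper names for this check, not part of any posted code.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96_t;

/* Branchless variant from the post: relies on comparisons yielding 0 or 1. */
void sub_96(int96_t *res, int96_t a, int96_t b)
{
    uint32_t carry = (b.d0 > a.d0);

    res->d0 = a.d0 - b.d0;
    res->d1 = a.d1 - b.d1 - carry;
    carry   = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
    res->d2 = a.d2 - b.d2 - carry;
}

/* Reference: each limb is subtracted in 64 bits, so a borrow
 * appears in bit 32 of the wrapped difference. */
void sub_96_ref(int96_t *res, int96_t a, int96_t b)
{
    uint64_t t = (uint64_t)a.d0 - b.d0;

    res->d0 = (uint32_t)t;
    t = (uint64_t)a.d1 - b.d1 - ((t >> 32) & 1);
    res->d1 = (uint32_t)t;
    t = (uint64_t)a.d2 - b.d2 - ((t >> 32) & 1);
    res->d2 = (uint32_t)t;

    assert(((t >> 32) & 1) == 0);   /* a must be greater or equal b */
}

/* Returns 1 if both 96-bit values are equal limb by limb. */
int same_96(int96_t x, int96_t y)
{
    return x.d0 == y.d0 && x.d1 == y.d1 && x.d2 == y.d2;
}
```

A useful corner case is a = 2^64 with b.d0 = 1 and b.d1 = 0xFFFFFFFF. The middle limb then wraps to exactly a.d1, which is the case the `(res->d1 == a.d1) & carry` term exists for.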