No, 55000
I have to make an amendment to my previous post; I must have made a mistake when keeping track of which GridSize I was running. 3, 2, and 1 are identical, maybe leaning towards 2 almost imperceptibly; then come 4 and 0. CPU load is around 13%, or about 65% of a single core. I'm running an i7-920 @ 4GHz |
mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.
|
[QUOTE=Christenson;264155]mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.[/QUOTE]
Can I suggest that if you can get receiving assignments working sooner than both parts together, then you should. It is fine to only report results occasionally, but running out of work is bad. |
[QUOTE=henryzz;264166]but running out of work is bad.[/QUOTE]
That happens to be one of my favourite occupations. But if picking the low-hanging fruit floats your boat, go ahead with Breadth First. OTOH, if you get bored with finding new factors (or "getting work"), try making it as easy for us CPU-bound, patient, LL-testing prime searchers as possible. TFing from X to X+1 bits is 1/7th of the effort of going from X to X+3.*
David
*Open to correction, but you get the idea: each bit level doubles the effort, so 1+2+4 = 7. |
Henry:
Once automatic reporting begins to work, it will come all at once. I'm having issues with learning my tools (Eclipse) right now; I just have to sit and work at it, then add the mutex and thread management stuff and call the appropriate parts of P95. As for you CPU-bound, LL-testing types (which, incidentally, includes myself), don't worry. The way I look at it, TF and P-1 both have as their goal making as many LL tests as possible unnecessary. The odds of finding a factor for a given exponent, at the current bit level of 70, are about 1/70. Supposing the GPUs are 128 times faster than the CPUs, we can do 7 extra bit levels, which will factor about 10% of the candidates that wouldn't have been factored by CPU. This helps, but the real speed-up in finding M48 and beyond will be in freed-up CPUs not doing TF and in GPU LL tests. |
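As a side note, that "about 10%" can be sanity-checked with the common heuristic that a Mersenne number has a factor between 2^(k-1) and 2^k with probability roughly 1/k. A throwaway snippet (plain C, purely illustrative, not taken from mfaktc or Prime95):
[code]
#include <stdio.h>

int main(void)
{
    /* 128x faster = 2^7, i.e. 7 extra bit levels beyond the current 70.
       Heuristic: a factor between 2^(k-1) and 2^k appears with probability ~1/k. */
    double p = 0.0;
    int k;

    for (k = 71; k <= 77; k++)
        p += 1.0 / k;

    printf("chance of a factor in bits 71..77: %.1f%%\n", 100.0 * p); /* ~9.5% */
    return 0;
}
[/code]
Summing 1/71 through 1/77 gives roughly 9.5%, which matches the "about 10%" estimate. |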
[QUOTE=henryzz;264166]Can I suggest that if you can get receiving assignments working sooner than both parts together, then you should. It is fine to only report results occasionally, but running out of work is bad.[/QUOTE]
Hehe, if receiving the assignment takes longer than the task itself, then we don't need to optimize the GPU kernels anymore ... |
[QUOTE=Christenson;264155]mfaktc/mfakto certainly needs a GPU-based siever....I have to complete a different project (automatic assignment handling) first before I can think about taking it on.[/QUOTE]
Don't forget about [thread=11900]this thread[/thread]. It looks like some work has been done on this kind of sieving this year! :smile: |
This missing carry flag is driving me nuts ...
Does anyone have a better idea for the carry propagation?
[code]
typedef struct _int96_t
{
  uint d0, d1, d2;
} int96_t;

void sub_96(int96_t *res, int96_t a, int96_t b) /* a must be greater or equal b! res = a - b */
{
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
[B]  res->d2 = a.d2 - b.d2 - (((res->d1 > a.d1) || ((res->d1 == a.d1) && carry)) ? 1 : 0);[/B]
}
[/code]
I also need this for an int192 (6x32 bit); then the above logic would become quite lengthy ... Do I really need to use something like this?
[code]
uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - (carry ? 1 : 0);
carry   = (res->d1 > a.d1) || ((res->d1 == a.d1) && carry);
res->d2 = a.d2 - b.d2 - (carry ? 1 : 0);
carry   = (res->d2 > a.d2) || ((res->d2 == a.d2) && carry);
res->d3 = a.d3 - b.d3 - (carry ? 1 : 0);
carry   = (res->d3 > a.d3) || ((res->d3 == a.d3) && carry);
res->d4 = a.d4 - b.d4 - (carry ? 1 : 0);
...
[/code] |
Getting the carries would be a lot simpler if the numbers were stored as 4x24-bit limbs instead of 3x32-bit ones. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, on hardware where 24-bit multiplies are faster than 32-bit ones.
|
[QUOTE=Ken_g6;264468]Getting the carries would be a lot simpler if the numbers were stored as 4x24-bit limbs instead of 3x32-bit ones. Bit shifts could be used instead of conditionals, and conditionals on AMD are slow. This would also seem to allow for easier multiplication, on hardware where 24-bit multiplies are faster than 32-bit ones.[/QUOTE]
Sure, the 24-bit stuff works quite well. I just wanted to get a 32-bit kernel running in order to compare exactly that. BTW, conditional loads are not slow (1st cycle: evaluate the condition and prepare the two possible values, 2nd cycle: load it); they run at full speed. Only divergent branches carry that big penalty, which consists of executing both branches plus some overhead to mask out one of the executions. |
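To illustrate the 4x24-bit idea from the post above, here is a minimal sketch of the same subtraction with the borrow extracted by shifts instead of comparisons and ?:. The type and function names (int96_4x24_t, sub_96_4x24) are made up for illustration and are not the actual mfakto types:
[code]
/* Sketch only: each limb holds a 24-bit value in a 32-bit word, a >= b assumed. */
typedef struct _int96_4x24_t
{
  uint d0, d1, d2, d3; /* 24 bits used per limb */
} int96_4x24_t;

void sub_96_4x24(int96_4x24_t *res, int96_4x24_t a, int96_4x24_t b) /* res = a - b */
{
  uint t;

  t       = a.d0 - b.d0;             /* on underflow the result wraps above 2^31 ... */
  res->d0 = t & 0xFFFFFF;
  t       = a.d1 - b.d1 - (t >> 31); /* ... so the borrow is just the top bit */
  res->d1 = t & 0xFFFFFF;
  t       = a.d2 - b.d2 - (t >> 31);
  res->d2 = t & 0xFFFFFF;
  t       = a.d3 - b.d3 - (t >> 31);
  res->d3 = t & 0xFFFFFF;
}
[/code]
Because each limb uses only 24 of the 32 bits, an underflow pushes the 32-bit difference above 2^31, so a single shift recovers the borrow and a mask restores the limb; no conditionals are needed. |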
Warning: I don't know anything about OpenCL...
Why do you use ||, && and ?: at all? Doesn't OpenCL say a comparison result is either 0 or 1? If so, then I would have written:
[code]
uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - carry;
res->d2 = a.d2 - b.d2 - ((res->d1 > a.d1) | ((res->d1 == a.d1) & carry));
[/code]
and:
[code]
uint carry = (b.d0 > a.d0);

res->d0 = a.d0 - b.d0;
res->d1 = a.d1 - b.d1 - carry;
carry   = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
res->d2 = a.d2 - b.d2 - carry;
...
[/code]
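Following that suggestion through for the int192 (6x32 bit) case asked about above, the whole borrow chain stays branch-free. The struct layout and names (d0..d5, sub_192) are only assumed here, mirroring the int96_t from the earlier post rather than any actual mfakto code:
[code]
/* Sketch only: assumes a >= b and that scalar comparisons evaluate to 0 or 1. */
typedef struct _int192_t
{
  uint d0, d1, d2, d3, d4, d5;
} int192_t;

void sub_192(int192_t *res, int192_t a, int192_t b) /* res = a - b, requires a >= b */
{
  uint carry = (b.d0 > a.d0);

  res->d0 = a.d0 - b.d0;
  res->d1 = a.d1 - b.d1 - carry;
  carry   = (res->d1 > a.d1) | ((res->d1 == a.d1) & carry);
  res->d2 = a.d2 - b.d2 - carry;
  carry   = (res->d2 > a.d2) | ((res->d2 == a.d2) & carry);
  res->d3 = a.d3 - b.d3 - carry;
  carry   = (res->d3 > a.d3) | ((res->d3 == a.d3) & carry);
  res->d4 = a.d4 - b.d4 - carry;
  carry   = (res->d4 > a.d4) | ((res->d4 == a.d4) & carry);
  res->d5 = a.d5 - b.d5 - carry;
}
[/code]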