mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfakto: an OpenCL program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=15646)

ixfd64 2019-04-11 19:15

1 Attachment(s)
I've encountered a weird issue when running mfakto on a Mac Pro. This particular machine has dual AMD FirePro D700 GPUs, but the screen lags badly even when I run just one instance. The lag remains even when I tell mfakto to use the other GPU via the [c]-d 1[/c] switch. In both cases, Activity Monitor always shows the first GPU being in use. This is probably more of a Mac issue, but does anyone know why this is happening?

In the meantime, rebuilding mfakto using OpenCL 1.2 seems to resolve the issue where mfakto does not correctly detect certain clock speeds.

ixfd64 2019-04-12 23:43

1 Attachment(s)
Sorry for the double post, but I resolved the issue. It turns out devices 0 and 1 both map to one GPU while device 2 maps to the other.

I'm also pretty confident the macOS build is working correctly and am about to put it in "production" mode. To build mfakto for macOS:
[LIST=1]
[*]Check out the latest mfakto code. I'm using Preda's development branch: [url]https://github.com/preda/mfakto[/url]
[*]Extract the attached archive into the [C]src[/C] folder, overwriting the existing files.
[*]Compile the application using the macOS makefile: [C]make -f Makefile.macOS[/C]
[/LIST]
[B]Update:[/B] the changes have been merged into Preda's development branch.

bayanne 2019-04-13 09:10

Very interesting; I will try it out shortly, although at the moment I only have an old iMac running 10.13.3 with a Radeon HD 4850.

c10ck3r 2019-04-15 19:18

For mfakt(c,o):
If P-1 has been performed, would a speedup be expected if k-values that are B1-smooth were sieved out prior to testing? My reason for asking: looking at, for example, the 40M-range factoring effort page, most exponents have B1 > ~5e5, while the k-value for 72-73 bits (the lowest incomplete range) goes to ~1.2e14, which suggests to me that many of these candidates could be removed. I think the "classes" mechanism does this for primes up to 11, if I am interpreting it correctly, but would further speedups occur? Would it make sense to do this only when taking an exponent up several bit levels?
TIA

R. Gerbicz 2019-04-15 21:57

[QUOTE=c10ck3r;513787]For mfakt(c,o):
If P-1 has been performed, would there be a speedup expected if k-values which were B1-smooth were sieved prior to testing? My reason for asking is that looking at for example the 40M range factoring effort page, most exponents have B1>~5e5, while the k-value for 72-73 bits (lowest incomplete range) goes to ~1.2e14, which to me suggests that many of these candidates could be removed.[/QUOTE]

We could use your idea at the current wavefront as well.
Interesting idea! The classes are a different subject.
You can also apply the second stage's bound here.
Also use round(log2(r^e)) - round(log2(r^(e-1))) in the table updates (if you are doing successive updates), similar to SIQS methods.

Quick and dirty (simplified) code, setting just a moderate stage-two bound b2 = 10^7:
[CODE]
fun(b1,b2,b,it)=
{
  \\ count how many of "it" random b-bit n have largest prime factor <= b2
  \\ and second-largest distinct prime factor <= b1
  hit=0;
  for(h=1,it,
    n=2+random(2^b-2);  \\ avoid n=0,1, which would break the indexing below
    u=factor(n); o=matsize(u)[1];
    if(o>1, p1=u[o-1,1], p1=1);
    p2=u[o,1];
    if(p1<=b1 && p2<=b2, hit++));
  return(hit)
}

fun(5*10^5,10^7,46,10000)
[/CODE]

the output (random result):
[CODE]
(23:51) gp > fun(5*10^5,10^7,46,10000)
%25 = 3104
[/CODE]

So the saving is roughly 31-5 = 26% (the ~31% of candidates eliminated, less roughly 5% for the additional sieving).
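For readers without PARI/GP, here is a Python port of the same Monte Carlo estimate (it mirrors the simplified GP code above, including its use of distinct prime factors; trial division keeps it practical only for smaller b than the 46 bits used above):

```python
import random

def distinct_prime_factors(n):
    """Ascending list of the distinct prime factors of n (trial division)."""
    ps = []
    d = 2
    while d * d <= n:
        if n % d == 0:
            ps.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        ps.append(n)
    return ps

def fun(b1, b2, b, it, seed=0):
    """Count random b-bit n whose largest prime factor is <= b2 and
    whose second-largest distinct prime factor is <= b1."""
    rng = random.Random(seed)
    hit = 0
    for _ in range(it):
        n = rng.randrange(2, 2 ** b)
        ps = distinct_prime_factors(n)
        p2 = ps[-1]                              # largest prime factor
        p1 = ps[-2] if len(ps) > 1 else 1        # second-largest, or 1
        if p1 <= b1 and p2 <= b2:
            hit += 1
    return hit
```

For example, `fun(5*10**5, 10**7, 30, 2000)` gives an estimate in the same spirit as the GP run, at a smaller bit size for speed.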

One real(!) bottleneck in the sieve is that you need 6-7 bits for the size of the cofactor ((q-1)/2/p),
and one bit to store whether you have already used a prime in (b1,b2], because you are allowed at most one prime (with exponent = 1) in (b1,b2].
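For illustration only (this is a sketch of the bookkeeping described above, not mfakto's actual memory layout), the per-candidate state fits in one byte, e.g. 7 bits for the cofactor size plus 1 flag bit:

```python
def pack(cofactor_bits, used_b1b2_prime):
    """Pack a 7-bit cofactor size and a 1-bit 'used a (b1,b2] prime' flag."""
    assert 0 <= cofactor_bits < 128
    return (cofactor_bits << 1) | (1 if used_b1b2_prime else 0)

def unpack(byte):
    """Recover (cofactor size, flag) from a packed byte."""
    return byte >> 1, bool(byte & 1)
```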

I'm not sure whether I will code it, but these are the basics of the mentioned (and improved) idea.

preda 2019-04-17 09:11

But can this be implemented for any gain on a GPU? The current sieve is extremely simple and fast, and the current trial of a K that passed the sieve is extremely fast as well (with thousands being done in parallel).

Do you need to factor K-1 for the check?

[QUOTE=R. Gerbicz;513799]We could use your idea at the current wavefront as well.
Interesting idea! The classes are a different subject.
...
I'm not sure whether I will code it, but these are the basics of the mentioned (and improved) idea.[/QUOTE]

R. Gerbicz 2019-04-17 10:44

[QUOTE=preda;513938]But can this be implemented for any gain on a GPU? the current sieve is extremely simple and fast, and the current trial of a K that passed the sieve is extremely fast as well (and thousands being done in parallel).

Do you need to factor K-1 for the check?[/QUOTE]

The point is that we don't factor q-1 = 2*k*p, i.e. we don't factor the k values directly, one by one.
We use a sieve to find those k values that have all prime-power divisors <= b1 and at most one prime factor in (b1,b2].
You can do this because the k values also lie in an arithmetic progression (within a fixed residue class),
just as in the sieving for q = 2*k*p+1. There is a little more overhead here because you have to deal with prime powers too, not just primes.
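A minimal Python sketch of such a sieve over the progression (toy sizes; it assumes the progression's step is coprime to each sieving prime, and it is an illustration of the idea, not mfakto code). It uses the round(log2(p^e)) - round(log2(p^(e-1))) table update from the earlier post:

```python
import math

def smooth_bits(k0, step, count, b1):
    """Sieve the arithmetic progression k = k0 + t*step, t = 0..count-1,
    accumulating for each k the rounded-log2 contribution of every prime
    power p^e (p <= b1) dividing it.  A k is b1-smooth exactly when its
    entry reaches round(log2(k))."""
    kmax = k0 + (count - 1) * step
    bits = [0] * count
    # simple sieve of Eratosthenes for the primes up to b1
    is_p = bytearray([1]) * (b1 + 1)
    is_p[0:2] = b"\x00\x00"
    for i in range(2, int(b1 ** 0.5) + 1):
        if is_p[i]:
            is_p[i * i :: i] = bytearray(len(is_p[i * i :: i]))
    for p in range(2, b1 + 1):
        if not is_p[p] or step % p == 0:
            continue  # assumption: the class step is coprime to p
        q, e = p, 1
        while q <= kmax:
            # contribution of going from p^(e-1) to p^e
            add = round(e * math.log2(p)) - round((e - 1) * math.log2(p))
            # first t with k0 + t*step ≡ 0 (mod q)
            t0 = (-k0 * pow(step, -1, q)) % q
            for t in range(t0, count, q):
                bits[t] += add
            q *= p
            e += 1
    return bits
```

For k = 1 + 2t with t < 50 and b1 = 50, the entry for k = 45 = 3^2*5 reaches round(log2(45)) = 5, flagging it as smooth, while a prime such as k = 97 stays at 0.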

kriesel 2019-04-17 17:14

Posted as part of #44 of Concepts in GIMPS trial factoring at [URL]https://www.mersenneforum.org/showpost.php?p=508523&postcount=6[/URL]
[QUOTE]regarding possibly eliminating the 20 to 25% overlap of P-1 and TF. If doing so had no overhead, this may speed up the total per-exponent effort by <~0.03 x 0.25 / 2, or [B]<~0.375%[/B]. Not large, but worth exploring. Its effect is diminished from there since current practice is to opportunistically seek a quickly found factor by TF in the lower bit levels, then P-1 after all TF is done. See also #31, regarding P-1 being run before the last, highest TF level, which may offer an opportunity for saving [B]<~0.19%[/B] if the final bit level of TF, which is comparable in effort to all preceding levels combined, is performed only on k values not also covered by the preceding P-1.[/QUOTE]Not a revolutionary speedup, but seems worth investigating and perhaps implementing.

c10ck3r 2019-04-19 06:22

[QUOTE=kriesel;513964]Posted as part of #44 of Concepts in GIMPS trial factoring at [URL]https://www.mersenneforum.org/showpost.php?p=508523&postcount=6[/URL]
Not a revolutionary speedup, but seems worth investigating and perhaps implementing.[/QUOTE]
Thanks for linking to this! I hadn't seen that post previously. I would suggest that we not think of the speedup only in terms of overall effort (at least not on this forum), given that so many are interested in finding a factor even for numbers we already know to be composite through LL/DC testing: see the petrw1 side project. A 15% speedup (after sieving costs) would be a welcome help for this. If it could be applied across all exponents (it can't, due to size or lack of P-1), and we assume it causes 15% of the exponents to be factored to an additional bit depth, we get 0.15 * 21,300,000 * 1/~72 = ~44,400 more factors found. Alternatively, it could be thought of as increasing the expected number of factors found per month from ~535 to ~615.
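The back-of-envelope above can be checked in a couple of lines (the inputs are the post's own rough figures, not measured values):

```python
# ~21.3M exponents in scope, 15% of them taken one bit level deeper,
# ~1/72 chance of a factor per extra bit level
exponents = 21_300_000
extra_level_fraction = 0.15
factor_chance_per_level = 1 / 72
extra_factors = exponents * extra_level_fraction * factor_chance_per_level
print(round(extra_factors))  # 44375, i.e. the ~44,400 of the post
```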

kriesel 2019-04-19 12:41

[QUOTE=c10ck3r;514090]Thanks for linking to this! ... Alternatively, it could be thought to increase the expected # factors found/month from ~535 to ~615.[/QUOTE]I'm in favor of applying whichever metric makes sense in context, which in this case could be as high as ~15% in TF or as low as ~0.15% in overall effort. All improvements worth a developer's time are good; when they have multiple applications, so much the better. The part I quoted in post 1515 implies that Primenet's assignment ordering of TF bit levels and P-1 may need to be changed to take best advantage for overall throughput. TF bit-level limits or P-1 bounds optimization versus exponent may also be affected.

kriesel 2019-04-19 13:49

[QUOTE=c10ck3r;514090]Thanks for linking to this! ... A 15% speedup (after sieving costs) would be a welcome help for this- if it could be applied across all exponents (it can't due to size/lack of P-1), and we assume that it causes 15% of the exponents to be factored to an additional bitdepth, we get 0.15*21300000*1/~72= ~44,400 more factors found. Alternatively, it could be thought to increase the expected # factors found/month from ~535 to ~615.[/QUOTE]You're welcome. And thanks for your comments & perspective. I tend to think in GIMPS-primary-goal mode, marching the wavefronts forward to find more Mersenne primes, and not so much about any side projects or other considerations. Your post has prompted me to think and write some more about the implications. Please review the newly extended #44 concept content.

