
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

GP2 2019-01-04 04:01

[QUOTE=preda;504858]- even without any P-1 done, the rate of factors found by PRP-1 is not huge, let's say about 5% depending on the level of TF[/QUOTE]

But this is certainly no worse than the usual rate for finding factors by ordinary P−1.

Remember that as you increase bit-length (for TF) and non-smoothness of k (for P−1), the difficulty goes up exponentially. So there is a relatively narrow transition zone between trivially easy factors that were found long ago and impossibly hard factors that we'll never find.
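To make the smoothness point concrete, here is a toy P-1 stage 1 in Python (an illustrative sketch, not any project's code; `pm1_stage1` is a made-up helper). A factor f = 2kp + 1 of 2^p - 1 turns up exactly when f - 1 divides the accumulated exponent, i.e. when k is B1-smooth. For M(29), the factor 233 has 233 - 1 = 2^3 * 29, so even a tiny B1 = 8 finds it:

```python
from math import gcd

def pm1_stage1(p, B1):
    """Toy P-1 stage 1 for 2^p - 1. Any factor f satisfies f = 2*k*p + 1,
    so p itself is always included in the powering exponent."""
    N = (1 << p) - 1
    E = p
    for q in range(2, B1 + 1):
        if all(q % d for d in range(2, q)):   # q is prime (fine for tiny B1)
            qe = q
            while qe * q <= B1:               # largest power of q <= B1
                qe *= q
            E *= qe
    x = pow(3, 2 * E, N)                      # 3^(2E) mod N
    return gcd(x - 1, N)                      # catches f iff (f-1) | 2E

print(pm1_stage1(29, 8))  # 233: 232 = 2^3 * 29 divides 2E once p is included
```

The other factors of M(29) are missed at this bound: 1103 - 1 = 2 * 19 * 29 needs the prime 19, and 2089 - 1 = 2^3 * 3^2 * 29 needs 3^2, so each becomes reachable only once B1 passes that largest required prime power. That is the narrow transition zone in miniature.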

Maybe you got discouraged too easily, and should reconsider.

preda 2019-01-04 04:13

[QUOTE=GP2;504861]But this is certainly no worse than the usual rate for finding factors by ordinary P−1.

Remember that as you increase bit-length (for TF) and non-smoothness of k (for P−1), the difficulty goes up exponentially. So there is a relatively narrow transition zone between trivially easy factors that were found long ago and impossibly hard factors that we'll never find.

Maybe you got discouraged too easily, and should reconsider.[/QUOTE]

Yes, I'm open to reconsidering.

But P-1 can be, and is, done anyway, just outside of GpuOwl. I think it's most commonly done on CPUs. So not having P-1 in gpuowl doesn't mean it doesn't get done, only that it gets done in a different way.

kriesel 2019-01-04 04:20

[QUOTE=preda;504865]Yes, I'm open to reconsidering.

But P-1 can be, and is, done anyway, just outside of GpuOwl. I think it's most commonly done on CPUs. So not having P-1 in gpuowl doesn't mean it doesn't get done, only that it gets done in a different way.[/QUOTE]
A similar argument could be applied to PRP. That wouldn't leave much for gpuOwl to do.
gpuowl v4-5 is the only route I know of to do GIMPS-useful P-1 on OpenCL.
For NVIDIA there's CUDAPm1, and on Intel CPUs, prime95/mprime.

LaurV 2019-01-04 07:43

@Mihai, you could split it in two, if it's not difficult for you to maintain both applications, or keep it as an option/command-line switch. GPU P-1 is faster than CPU and some may prefer to do it separately. I use both cudaLucas and cudaPM1, and I am happy that they are separate applications.

This is by no means intended to push you; it's just that I would feel kinda sorry to see you abandon the P-1 stuff... In fact, it seems that you are one of the very few OpenCL "experts" remaining here, and I would be quite happy to see you take over mfakto too :razz: (especially now that Bdot seems to be abandoning it; see the discussion in the mfakto thread).


(hehe, you know that saying, "whip the horse that pulls the hardest"?)

SELROC 2019-01-04 08:24

I think we are up to speed with GPU manual testing, as primenet.py works pretty well.

preda 2019-01-04 11:54

[QUOTE=LaurV;504885]@Mihai, you could split it in two, if it's not difficult for you to maintain both applications, or keep it as an option/command-line switch. GPU P-1 is faster than CPU and some may prefer to do it separately. I use both cudaLucas and cudaPM1, and I am happy that they are separate applications.

This is by no means intended to push you; it's just that I would feel kinda sorry to see you abandon the P-1 stuff... In fact, it seems that you are one of the very few OpenCL "experts" remaining here, and I would be quite happy to see you take over mfakto too :razz: (especially now that Bdot seems to be abandoning it; see the discussion in the mfakto thread).


(hehe, you know that saying, "whip the horse that pulls the hardest"?)[/QUOTE]

:)
OK, I would like to have a standalone OpenCL P-1 tester. The problem is that while stage 1 of P-1 is simple, and I've already implemented it, stage 2 is more complex and I'd need to learn how to do it before I can implement it.

PRP-1 was not a good replacement for P-1, because it did a full, slow PRP even if somebody only wanted a quick P-1.

PRP-1 was very good for this use-case: somebody actually wanted to do the PRP of an exponent without any P-1 done previously.

Brainstorming, I'm thinking of this: I could do a standalone "variant" P-1 that uses the ideas from PRP-1. Stage 1 is the same in both classic P-1 and "variant P-1"; the difference is in stage 2. The output of the "variant P-1" would be either a factor found (rarely, ~5%) or a pair (base, residue) that can be continued into a PRP.

Let's take an example: a 90M exponent without any P-1 done. Say somebody wants to do P-1(B1=1M, B2=20M) on it. Doing it the "variant P-1" way would require:
- about 1.44M squarings for stage-1 (same as classic P-1, ~ B1 * 1.44)
- about 20M squarings plus about 1M muls for stage 2 to B2=20M, done the "PRP-1 way"

(This, I think, is competitive with classic P-1, although the difference, which is in stage 2, is small.)

Now, for a no-factor-found P-1, the gain might come from saving the final residue of the "variant P-1" and, when a PRP is desired in the future for the same exponent, starting it from there. In the example, the PRP would only need 70M (90M - 20M) iterations instead of 90M.

This may be a gain, but it puts a lot of burden on the server, which would need to store the full residue of the P-1 to allow continuing a PRP from there. So there is storage and implementation work to do.
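The operation counts above can be reproduced with a rough back-of-the-envelope helper (`variant_p1_cost` is a hypothetical name; the constants are estimates from the post, not benchmarks):

```python
from math import log

def variant_p1_cost(p, B1, B2):
    """Rough operation counts for the 'variant P-1' idea (estimates only)."""
    # Stage 1: the exponent is ~ p * lcm(1..B1), whose bit length is about
    # B1/ln(2) ~= 1.44*B1, i.e. that many squarings.
    stage1_squarings = round(1.44 * B1)
    # Stage 2 'PRP-1 way': ~B2 squarings, plus roughly one extra mul per
    # prime in (B1, B2] (prime counts estimated as x/ln(x)).
    stage2_muls = round(B2 / log(B2) - B1 / log(B1))
    stage2_ops = B2 + stage2_muls
    # If no factor is found and the residue is kept, a later PRP of the same
    # exponent can start from it, skipping ~B2 of its p iterations.
    prp_iters_saved = B2
    return stage1_squarings, stage2_ops, prp_iters_saved
```

For p = 90M, B1 = 1M, B2 = 20M this reproduces the figures in the post: about 1.44M stage-1 squarings, about 20M squarings plus ~1M muls in stage 2, and a follow-on PRP shortened from 90M to ~70M iterations.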

kriesel 2019-01-04 15:13

[QUOTE=preda;504907]:)
OK, I would like to have a standalone OpenCL P-1 tester. The problem is that while stage 1 of P-1 is simple, and I've already implemented it, stage 2 is more complex and I'd need to learn how to do it before I can implement it.

PRP-1 was not a good replacement for P-1, because it did a full, slow PRP even if somebody only wanted a quick P-1.

PRP-1 was very good for this use-case: somebody actually wanted to do the PRP of an exponent without any P-1 done previously.

Brainstorming, I'm thinking of this: I could do a standalone "variant" P-1 that uses the ideas from PRP-1. Stage 1 is the same in both classic P-1 and "variant P-1"; the difference is in stage 2. The output of the "variant P-1" would be either a factor found (rarely, ~5%) or a pair (base, residue) that can be continued into a PRP.

Let's take an example: a 90M exponent without any P-1 done. Say somebody wants to do P-1(B1=1M, B2=20M) on it. Doing it the "variant P-1" way would require:
- about 1.44M squarings for stage-1 (same as classic P-1, ~ B1 * 1.44)
- about 20M squarings plus about 1M muls for stage 2 to B2=20M, done the "PRP-1 way"

(This, I think, is competitive with classic P-1, although the difference, which is in stage 2, is small.)

Now, for a no-factor-found P-1, the gain might come from saving the final residue of the "variant P-1" and, when a PRP is desired in the future for the same exponent, starting it from there. In the example, the PRP would only need 70M (90M - 20M) iterations instead of 90M.

This may be a gain, but it puts a lot of burden on the server, which would need to store the full residue of the P-1 to allow continuing a PRP from there. So there is storage and implementation work to do.[/QUOTE]
In the CUDAPm1 source code, there's a notation for where "stage 3" would be added. This is an extension of stage 1 to higher bounds than initially run. By analogy, a "stage 4" extension of B2 could also be possible.

Conventional P-1 does a gcd at the end of stage 1 and another at the end of stage 2. There are other possibilities. One could go about halfway through stage 1 and do a gcd. If a factor is found, exit early. If not, continue to the full B1 and gcd again. Same for stage 2.
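That intermediate-gcd scheme is a few lines of Python (an illustrative sketch, not any program's actual code; `stage1_with_midpoint_gcd` is a made-up name). Using the small example M(29) = 233 x 1103 x 2089, whose factor 233 has 233 - 1 = 2^3 x 29:

```python
from math import gcd

def stage1_with_midpoint_gcd(N, prime_powers, base=3):
    # Power the base by each prime power in turn (the stage-1 accumulation),
    # taking one extra gcd halfway through for a chance at an early exit.
    x = base % N
    half = len(prime_powers) // 2
    for i, q in enumerate(prime_powers):
        x = pow(x, q, N)
        if i == half:
            g = gcd(x - 1, N)
            if 1 < g < N:
                return g             # factor found at the midpoint checkpoint
    g = gcd(x - 1, N)
    return g if 1 < g < N else None  # the usual end-of-stage-1 gcd

# prime powers for a tiny B1, with the Mersenne exponent 29 appended
print(stage1_with_midpoint_gcd((1 << 29) - 1, [16, 3, 5, 7, 29]))  # 233
```

In this run the midpoint gcd finds nothing, because 233 needs the trailing 29 in the list; but whenever a factor's order is covered by the first half of the prime powers, the second half of the stage is skipped entirely, which is the payoff of the extra gcd.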
As I recall, prime95 offers the option to save the full P-1 residue files locally for extension to higher bounds, perhaps by some other program.

Consider the case where an OpenCL P-1-only program exists, but the server does not (yet?) accept the P-1 residue files, or the user lacks the bandwidth to upload them (still stuck at 128 kbit/sec here; fiber is probably another year out), and primality testing makes no further use of those computed powers of 3, nor does most extension of P-1 bounds. That's about the same as the current prime95, CUDAPm1 and CUDALucas status quo and typical usage. Saving the full-residue files should be optional. Those who want to keep them for backups, proof of work, debugging, bounds extension, or for use by other programs can. Those who just want P-1 factoring answers and conserved disk space will probably be the majority. Volunteers working to factor further can share the P-1 full-residue files among themselves without necessarily involving the primenet server. (Google Drive?)

Making double use of the powering of 3, as in PRP-1, was very creative, and that efficiency is desirable, but it is not an immediate or global requirement.
P-1 code that "merely" does P-1 efficiently and reliably on OpenCL is a good thing, an advance over the status quo. Any performance advantage, reusability of the full P-1 residue, etc., is a bonus.

There are possibilities for using the Jacobi symbol for error checking, which to my knowledge no other P-1 code has. It may be too expensive computationally to be a net productivity gain, but it might be useful as a user-controllable option. People sometimes get no factors for a long time and wonder if something is wrong. Nobody knows what the error rate of current or past P-1 factoring is. I suppose it could be estimated by linear interpolation from primality-test run times, assuming errors occur at a constant rate over time; ~2% x 2.5% = ~0.05%. I think the complexities of CUDAPm1 make its error rate higher; the bug list is nontrivial.
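One way such a Jacobi check could work for P-1 (a sketch of the idea, not anyone's actual implementation): a stage-1 residue 3^(2E) mod N is a perfect square mod N, so its Jacobi symbol with respect to N must be +1; a random corruption flips the symbol about half the time, so recomputing it periodically catches many errors at modest cost.

```python
def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0, via the standard binary algorithm."""
    a %= n
    t = 1
    while a:
        while a % 2 == 0:            # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):      # (2/n) = -1 when n = 3, 5 (mod 8)
                t = -t
        a, n = n, a                  # quadratic reciprocity swap
        if a % 4 == 3 and n % 4 == 3:
            t = -t
        a %= n
    return t if n == 1 else 0

p, E = 31, 510510          # illustrative numbers only
N = (1 << p) - 1
r = pow(3, 2 * E, N)       # stage-1 style residue: the exponent is even
assert jacobi(r, N) == 1   # a correct residue is a square mod N -> symbol +1
# N = 2^p - 1 is 3 (mod 4), so Jacobi(-1, N) = -1: a sign-flip style
# corruption of the residue is detected by the symbol going to -1.
assert jacobi(N - r, N) == -1
```

The check costs one Jacobi computation on the full residue (roughly a gcd-sized operation), which is cheap next to the FFT multiplications but not free, matching the "maybe a user-controllable option" framing above.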

kriesel 2019-01-04 15:59

[QUOTE=preda;504858]In recent updates to GpuOwl I dropped the PRP-1 feature (which allowed doing a normal P-1 stage 1 before the PRP, and a P-1 stage 2 in parallel with the PRP).

This was because:
- most exponents at the wavefront (all?) already have TF to high bits and some P-1 already done.
- even without any P-1 done, the rate of factors found by PRP-1 is not huge, let's say about 5% depending on the level of TF

Thus the benefit of PRP-1 was marginal. To make good use of it, it would need to be run on larger exponents (not at the wavefront) that have no P-1 and lower TF.

Dropping PRP-1 removes the dependency on the GMP library.

If anybody has PRP-1 work ongoing (i.e. with B1 != 0), they should finish it before upgrading because GpuOwl v6 can't do PRP-1.[/QUOTE]
How does V6 compare in PRP speed to previous versions? Any data, any gpu model, any OS, any fft length, anyone?

preda 2019-01-04 21:50

[QUOTE=kriesel;504932]How does V6 compare in PRP speed to previous versions? Any data, any gpu model, any OS, any fft length, anyone?[/QUOTE]

Should be exactly the same speed...

kriesel 2019-01-04 22:13

-time crashes on Windows after collecting data (slowly)
 
Ran v3.8 on an RX480 in Windows 7, AMD Adrenalin driver 18.10.2, with the -time option. The GHz-day/day figures and GPU utilization are quite low compared to running without the -time option (~76 GHz-day/day). The following is what it ran in two sessions, each terminating in an application crash. Next up, a try of the equivalent in v5.[CODE]2019-01-04 15:32:48 condorella-rx480 gpuowl-OpenCL 3.8-91c52fa
2019-01-04 15:32:48 condorella-rx480 FFT 9216K: Width 1024 (256x4), Height 512 (64x8), Middle 9; 16.27 bits/word
2019-01-04 15:32:48 condorella-rx480 Note: using short carry kernels
2019-01-04 15:32:51 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2019-01-04 15:32:55 condorella-rx480 OpenCL compilation in 3921 ms, with "-DEXP=153500033u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-01-04 15:32:56 condorella-rx480 PRP M(153500033), FFT 9216K, 16.27 bits/word, 1045 GHz-day
2019-01-04 15:33:55 condorella-rx480 OK loaded: 146771000/153500033, blockSize 500, 6d17e08673529e2f
2019-01-04 15:34:14 condorella-rx480 OK initial check: 6d17e08673529e2f
2019-01-04 15:35:24 condorella-rx480 OK 146772000/153500033 [95.62%], 36.65 ms/it [35.55, 37.74] (16.0 GHz-day/day); ETA 2d 20:30; 8daf006dbb9b251c (check 19.69s) (saved)
2019-01-04 15:35:24 condorella-rx480 15.2% tailFused : 2605 [ 1999, 5101] us/call x 2499 calls
2019-01-04 15:35:24 condorella-rx480 15.1% carryFused : 3225 [ 2686, 7314] us/call x 1996 calls
2019-01-04 15:35:24 condorella-rx480 13.2% transposeW : 1613 [ 999, 4683] us/call x 3503 calls
2019-01-04 15:35:24 condorella-rx480 12.7% fftMiddleIn : 1552 [ 991, 4890] us/call x 3503 calls
2019-01-04 15:35:24 condorella-rx480 11.3% fftMiddleOut : 1608 [ 999, 5096] us/call x 3001 calls
2019-01-04 15:35:24 condorella-rx480 11.2% transposeH : 1590 [ 995, 4536] us/call x 3001 calls
2019-01-04 15:35:24 condorella-rx480 7.0% fftP : 1980 [ 1000, 4885] us/call x 1507 calls
2019-01-04 15:35:24 condorella-rx480 4.3% carryA : 1813 [ 1000, 4295] us/call x 1002 calls
2019-01-04 15:35:24 condorella-rx480 4.1% mulFused : 3452 [ 3000, 6524] us/call x 502 calls
2019-01-04 15:35:24 condorella-rx480 3.8% fftW : 1607 [ 993, 4640] us/call x 1005 calls
2019-01-04 15:35:24 condorella-rx480 2.1% carryB : 887 [ 0, 3332] us/call x 1005 calls
2019-01-04 15:35:24 condorella-rx480
2019-01-04 15:40:12 condorella-rx480 146780000/153500033 [95.62%], 35.99 ms/it [31.15, 37.90] (16.3 GHz-day/day); ETA 2d 19:12; 3c480a9a9b10ec2a
2019-01-04 15:40:12 condorella-rx480 26.4% carryFused : 3242 [ 2680, 7270] us/call x 7984 calls
2019-01-04 15:40:12 condorella-rx480 21.0% tailFused : 2577 [ 1998, 5352] us/call x 8000 calls
2019-01-04 15:40:12 condorella-rx480 13.7% transposeW : 1672 [ 998, 4898] us/call x 8032 calls
2019-01-04 15:40:12 condorella-rx480 13.3% fftMiddleOut : 1625 [ 993, 5161] us/call x 8016 calls
2019-01-04 15:40:12 condorella-rx480 12.8% transposeH : 1572 [ 994, 4640] us/call x 8016 calls
2019-01-04 15:40:12 condorella-rx480 12.6% fftMiddleIn : 1538 [ 985, 5001] us/call x 8032 calls
2019-01-04 15:40:12 condorella-rx480 0.1% fftP : 2188 [ 1179, 9406] us/call x 48 calls
2019-01-04 15:49:23 condorella-rx480 gpuowl-OpenCL 3.8-91c52fa
2019-01-04 15:49:23 condorella-rx480 FFT 9216K: Width 1024 (256x4), Height 512 (64x8), Middle 9; 16.27 bits/word
2019-01-04 15:49:23 condorella-rx480 Note: using short carry kernels
2019-01-04 15:49:26 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2019-01-04 15:49:30 condorella-rx480 OpenCL compilation in 3897 ms, with "-DEXP=153500033u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-01-04 15:49:31 condorella-rx480 PRP M(153500033), FFT 9216K, 16.27 bits/word, 1045 GHz-day
2019-01-04 15:50:32 condorella-rx480 OK loaded: 146772000/153500033, blockSize 500, 8daf006dbb9b251c
2019-01-04 15:50:50 condorella-rx480 OK initial check: 8daf006dbb9b251c
2019-01-04 15:51:36 condorella-rx480 OK 146773000/153500033 [95.62%], 27.31 ms/it [18.28, 36.34] (21.5 GHz-day/day); ETA 2d 03:02; a98af4a2b6a710da (check 17.96s) (saved)
2019-01-04 15:51:36 condorella-rx480 15.0% tailFused : 2543 [ 1999, 5639] us/call x 2499 calls
2019-01-04 15:51:36 condorella-rx480 14.9% carryFused : 3159 [ 2624, 8031] us/call x 1996 calls
2019-01-04 15:51:36 condorella-rx480 13.2% transposeW : 1596 [ 996, 4933] us/call x 3503 calls
2019-01-04 15:51:36 condorella-rx480 12.5% fftMiddleIn : 1505 [ 986, 5055] us/call x 3503 calls
2019-01-04 15:51:36 condorella-rx480 11.4% fftMiddleOut : 1601 [ 993, 5165] us/call x 3001 calls
2019-01-04 15:51:36 condorella-rx480 10.8% transposeH : 1530 [ 990, 4827] us/call x 3001 calls
2019-01-04 15:51:36 condorella-rx480 7.1% fftP : 2003 [ 1162, 7116] us/call x 1507 calls
2019-01-04 15:51:36 condorella-rx480 4.7% carryA : 2005 [ 1197, 6959] us/call x 1002 calls
2019-01-04 15:51:36 condorella-rx480 4.2% mulFused : 3544 [ 2986, 7226] us/call x 502 calls
2019-01-04 15:51:36 condorella-rx480 4.0% fftW : 1667 [ 996, 5241] us/call x 1005 calls
2019-01-04 15:51:36 condorella-rx480 2.1% carryB : 874 [ 0, 3383] us/call x 1005 calls
2019-01-04 15:51:36 condorella-rx480
2019-01-04 15:55:50 condorella-rx480 146780000/153500033 [95.62%], 36.41 ms/it [33.72, 38.19] (16.2 GHz-day/day); ETA 2d 19:58; 3c480a9a9b10ec2a
[/CODE]

kriesel 2019-01-04 22:35

v5 -time output on RX480, 79M PRP DC, Win
 
This is a no-P-1 run, just PRP. The -time output does not handle the 0-calls cases as gracefully as it could; probably an easy fix there. No crashes in 10 minutes. ETA with -time: ~3 weeks; without it: ~4 days. This is an assigned PRP DC. Unfortunately the first test was also with offset zero.[CODE]C:\msys64\home\ken\gpuowl-compile\v5.0-9c13870>openowl -time
2019-01-04 16:19:46 gpuowl 5.0-9c13870
2019-01-04 16:19:46 -time
2019-01-04 16:19:46 79055077 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 16.75 bits/word
2019-01-04 16:19:46 using short carry kernels
2019-01-04 16:19:49 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2019-01-04 16:19:52 OpenCL compilation in 2627 ms, with "-DEXP=79055077u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0
"
2019-01-04 16:19:53 79055077.owl not found, starting from the beginning.
2019-01-04 16:20:40 79055077 OK 800 0.00%; 28.15 ms/sq, 0 MULs; ETA 25d 18:05; 3aa268928b9c975c (check 11.28s)
2019-01-04 16:20:41 nan% carryFused : 2446 us/call x 1568 calls
2019-01-04 16:20:41 nan% carryFusedMul : nan us/call x 0 calls
2019-01-04 16:20:41 nan% fftP : 1755 us/call x 98 calls
2019-01-04 16:20:41 nan% fftW : 953 us/call x 64 calls
2019-01-04 16:20:41 nan% fftH : 1340 us/call x 100 calls
2019-01-04 16:20:41 nan% fftMiddleIn : 1337 us/call x 1666 calls
2019-01-04 16:20:41 nan% fftMiddleOut : 1021 us/call x 1632 calls
2019-01-04 16:20:41 nan% carryA : 1781 us/call x 64 calls
2019-01-04 16:20:41 nan% carryM : nan us/call x 0 calls
2019-01-04 16:20:41 nan% carryB : 438 us/call x 64 calls
2019-01-04 16:20:41 nan% transposeW : 1053 us/call x 1666 calls
2019-01-04 16:20:41 nan% transposeH : 1600 us/call x 1632 calls
2019-01-04 16:20:41 nan% transposeIn : 1000 us/call x 4 calls
2019-01-04 16:20:41 nan% transposeOut : 0 us/call x 1 calls
2019-01-04 16:20:41 nan% square : nan us/call x 0 calls
2019-01-04 16:20:41 nan% multiply : 1939 us/call x 33 calls
2019-01-04 16:20:41 nan% multiplySub : nan us/call x 0 calls
2019-01-04 16:20:41 nan% tailFused : 2114 us/call x 1599 calls
2019-01-04 16:20:41 nan% readResidue : 1000 us/call x 2 calls
2019-01-04 16:20:41 nan% isNotZero : 9000 us/call x 1 calls
2019-01-04 16:20:41 nan% isEqual : 0 us/call x 1 calls
2019-01-04 16:20:41
2019-01-04 16:24:16 79055077 10000 0.01%; 23.40 ms/sq, 0 MULs; ETA 21d 09:47; fa9ad651bc910bc8
2019-01-04 16:24:16 nan% carryFused : 2100 us/call x 9177 calls
2019-01-04 16:24:16 nan% carryFusedMul : nan us/call x 0 calls
2019-01-04 16:24:16 nan% fftP : 1174 us/call x 69 calls
2019-01-04 16:24:16 nan% fftW : 848 us/call x 46 calls
2019-01-04 16:24:16 nan% fftH : 1130 us/call x 69 calls
2019-01-04 16:24:16 nan% fftMiddleIn : 1201 us/call x 9246 calls
2019-01-04 16:24:16 nan% fftMiddleOut : 960 us/call x 9223 calls
2019-01-04 16:24:16 nan% carryA : 1174 us/call x 46 calls
2019-01-04 16:24:16 nan% carryM : nan us/call x 0 calls
2019-01-04 16:24:16 nan% carryB : 435 us/call x 46 calls
2019-01-04 16:24:16 nan% transposeW : 983 us/call x 9246 calls
2019-01-04 16:24:16 nan% transposeH : 1342 us/call x 9223 calls
2019-01-04 16:24:16 nan% transposeIn : nan us/call x 0 calls
2019-01-04 16:24:16 nan% transposeOut : nan us/call x 0 calls
2019-01-04 16:24:16 nan% square : nan us/call x 0 calls
2019-01-04 16:24:16 nan% multiply : 2652 us/call x 23 calls
2019-01-04 16:24:16 nan% multiplySub : nan us/call x 0 calls
2019-01-04 16:24:16 nan% tailFused : 1850 us/call x 9200 calls
2019-01-04 16:24:16 nan% readResidue : 0 us/call x 1 calls
2019-01-04 16:24:16 nan% isNotZero : nan us/call x 0 calls
2019-01-04 16:24:16 nan% isEqual : nan us/call x 0 calls[/CODE]

