[QUOTE=preda;504858]- even without any P-1 done, the rate of factors found by PRP-1 is not huge, let's say about 5% depending on the level of TF[/QUOTE]
But this is certainly no worse than the usual rate for finding factors by ordinary P−1. Remember that as you increase bit-length (for TF) and non-smoothness of k (for P−1), the difficulty goes up exponentially. So there is a relatively narrow transition zone between trivially easy factors that were found long ago and impossibly hard factors that we'll never find. Maybe you got discouraged too easily, and should reconsider.
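To make the "narrow transition zone" concrete: a common heuristic (an assumption here, stated for illustration) is that a Mersenne number has a factor between 2^b and 2^(b+1) with probability roughly 1/b. A quick sketch of the expected yield per TF bit level:

```python
# Expected TF yield under the heuristic (assumed here) that a Mersenne
# number has a factor in [2^b, 2^(b+1)) with probability ~ 1/b.
def expected_yield(bit_lo, bit_hi):
    """Expected factors found when TFing from 2^bit_lo up to 2^bit_hi."""
    return sum(1.0 / b for b in range(bit_lo, bit_hi))

print(f"{expected_yield(74, 75):.4f}")  # one more bit at 74 bits: ~1.4%
print(f"{expected_yield(65, 75):.4f}")  # ten bit levels: ~14%
```

Each extra bit costs roughly twice the work of the previous one for a nearly flat ~1/b yield, which is why the practical TF depth (and, analogously, the practical P-1 bounds) sits in a narrow band.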
[QUOTE=GP2;504861]But this is certainly no worse than the usual rate for finding factors by ordinary P−1.
Remember that as you increase bit-length (for TF) and non-smoothness of k (for P−1), the difficulty goes up exponentially. So there is a relatively narrow transition zone between trivially easy factors that were found long ago and impossibly hard factors that we'll never find. Maybe you got discouraged too easily, and should reconsider.[/QUOTE] Yes, I'm open to reconsidering. But P-1 can be, and is, done anyway. It's done outside of GpuOwl; I think it's most commonly done on CPUs. So not having P-1 in GpuOwl does not mean it does not get done, only that it's done in a different way.
[QUOTE=preda;504865]Yes, I'm open to reconsidering.
But P-1 can be, and is, done anyway. It's done outside of GpuOwl; I think it's most commonly done on CPUs. So not having P-1 in GpuOwl does not mean it does not get done, only that it's done in a different way.[/QUOTE] A similar argument could be applied to PRP; that wouldn't leave much for GpuOwl to do. GpuOwl v4-5 is the only route I know of to do P-1 useful for GIMPS on OpenCL. For NVIDIA there's CUDAPm1, and on Intel CPUs, prime95/mprime.
@Mihai, you could split it in two, if it is not difficult for you to maintain both applications, or keep it as an option/command-line switch. GPU P-1 is faster than CPU P-1, and some may prefer to do it separately. I use both CUDALucas and CUDAPm1, and I am happy that they are separate applications.
This is by no means meant to push you; it is just that I would feel kind of sorry to see you abandoning the P-1 stuff... In fact it seems that you are one of the very few OpenCL "experts" remaining here, and I would be quite happy to see you taking over mfakto too :razz: (especially now, with Bdot seemingly abandoning it; see the discussion in the mfakto thread). (hehe, you know that saying, 'beat the horse that pulls the hardest'?)
I think we are up to speed with GPU manual testing, as primenet.py works pretty well.
[QUOTE=LaurV;504885]@Mihai, you could split it in two, if not difficult for you to maintain both applications, or keep it as an option/command line switch. GPU P-1 is faster than CPU and some may prefer to do it separate. I use both cudaLucas and cudaPM1, and I am happy of the fact that they are separate applications.
This is by no means meant to push you; it is just that I would feel kind of sorry to see you abandoning the P-1 stuff... In fact it seems that you are one of the very few OpenCL "experts" remaining here, and I would be quite happy to see you taking over mfakto too :razz: (especially now, with Bdot seemingly abandoning it; see the discussion in the mfakto thread). (hehe, you know that saying, 'beat the horse that pulls the hardest'?)[/QUOTE] :) OK, I would like to have a standalone OpenCL P-1 tester. The problem is that while first-stage P-1 is simple, and I've implemented it already, the second stage of P-1 is more complex and I'd need to learn how to do it before I can implement it.

PRP-1 was not a good replacement for P-1, because it did a full slow PRP even if somebody only wanted a quick P-1. PRP-1 was very good for one use-case: somebody actually wanting to do the PRP of an exponent without any P-1 done previously.

Brainstorming, I'm thinking of this: I could do a stand-alone "variant" P-1 that uses the ideas from PRP-1. The first stage is always the same between the two (classic P-1 and "variant P-1"); the difference is in the second stage. The output of the "variant P-1" would be either a factor found (rarely, ~5%) or a pair (base, residue) that can be continued into a PRP.

Let's take an example: a 90M exponent without any P-1 done. Say somebody wants to do P-1(B1=1M, B2=20M) on it. Doing it the "variant P-1" way would require:
- about 1.44M squarings for stage 1 (same as classic P-1, ~B1 * 1.44)
- about 20M squarings plus about 1M muls for stage 2 to B2=20M, done the "PRP-1 way" (this, I think, is competitive with classic P-1, although the difference in stage 2 is small)

Now, for a no-factor-found P-1, the gain might come from saving the final residue of the "variant P-1" and, when a PRP is desired in the future for the same exponent, starting it from there.
For the example, the PRP, instead of doing 90M iterations, would only need 70M (90M - 20M) iterations. This may be a gain, but it puts a lot of burden on the server, which would need to store the full residue of the P-1 to allow a PRP to continue from there. Thus, storage and implementation work to do.
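To make the arithmetic above concrete, a tiny back-of-envelope cost model (a sketch under stated assumptions: the 1.44 factor is ~1/ln 2, the bit length of the stage-1 exponent per unit of B1, and the ~1M stage-2 multiplies are the post's estimate, roughly the number of primes between B1 and B2):

```python
# Back-of-envelope cost model for the "variant P-1" example. Assumptions:
# stage 1 needs ~1.44*B1 squarings (1.44 ~= 1/ln 2); "PRP-1 style"
# stage 2 needs ~B2 squarings plus ~1M extra multiplies (the post's
# figure); a later PRP can skip the B2 squarings already performed.
p  = 90_000_000        # exponent to test
B1 =  1_000_000        # stage-1 bound
B2 = 20_000_000        # stage-2 bound

stage1_squarings = round(1.44 * B1)    # ~1.44M squarings
stage2_squarings = B2                  # squarings done the "PRP-1 way"
stage2_muls      = 1_000_000           # extra multiplies in stage 2

pm1_total   = stage1_squarings + stage2_squarings + stage2_muls
prp_full    = p                        # PRP started from scratch
prp_resumed = p - B2                   # PRP resumed from the saved residue

print(f"variant P-1 cost: {pm1_total:,} operations")
print(f"PRP from scratch: {prp_full:,} vs resumed: {prp_resumed:,} squarings")
```

The model shows why the server-side storage question matters: the resumed PRP saves B2 = 20M squarings, more than the entire cost of the P-1 itself.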
[QUOTE=preda;504907]:)
OK, I would like to have a standalone OpenCL P-1 tester. The problem is that while first-stage P-1 is simple, and I've implemented it already, the second stage of P-1 is more complex and I'd need to learn how to do it before I can implement it.

PRP-1 was not a good replacement for P-1, because it did a full slow PRP even if somebody only wanted a quick P-1. PRP-1 was very good for one use-case: somebody actually wanting to do the PRP of an exponent without any P-1 done previously.

Brainstorming, I'm thinking of this: I could do a stand-alone "variant" P-1 that uses the ideas from PRP-1. The first stage is always the same between the two (classic P-1 and "variant P-1"); the difference is in the second stage. The output of the "variant P-1" would be either a factor found (rarely, ~5%) or a pair (base, residue) that can be continued into a PRP.

Let's take an example: a 90M exponent without any P-1 done. Say somebody wants to do P-1(B1=1M, B2=20M) on it. Doing it the "variant P-1" way would require:
- about 1.44M squarings for stage 1 (same as classic P-1, ~B1 * 1.44)
- about 20M squarings plus about 1M muls for stage 2 to B2=20M, done the "PRP-1 way" (this, I think, is competitive with classic P-1, although the difference in stage 2 is small)

Now, for a no-factor-found P-1, the gain might come from saving the final residue of the "variant P-1" and, when a PRP is desired in the future for the same exponent, starting it from there. For the example, the PRP, instead of doing 90M iterations, would only need 70M (90M - 20M) iterations. This may be a gain, but it puts a lot of burden on the server, which would need to store the full residue of the P-1 to allow a PRP to continue from there. Thus, storage and implementation work to do.[/QUOTE] In the CUDAPm1 source code, there's a notation for where "stage 3" would be added. This is an extension of stage 1 to higher bounds than initially run. By analogy, a "stage 4" extension of B2 could also be possible.
Conventional P-1 does a gcd at the end of stage 1 and another at the end of stage 2. There are other possibilities. One could go ~halfway through stage 1 and do a gcd; if a factor is found, exit early; if not, continue to the full B1 and gcd again. Same for stage 2.

As I recall, prime95 offers the option to save the full P-1 residue files locally for extension to higher bounds, perhaps by some other program.

In the case where an OpenCL P-1-only program exists, but the server does not (yet?) accept the P-1 residue files, or the user hasn't the bandwidth to upload them (still stuck at 128 kbit/sec here; fiber is probably another year out), and primality testing makes no further use of those computed powers of 3, nor does most extension of P-1 bounds higher, that's about the same as the current prime95, CUDAPm1, and CUDALucas status quo and typical usage.

Saving the full-residue files should be optional. Those who want to keep them for backups, proof of work, debugging, bounds extension, or use by other programs can. Those who just want P-1 factoring answers and conserved disk space will probably be the majority. Volunteers working to factor further can share the P-1 full residue files among themselves without necessarily involving the PrimeNet server. (Google Drive?)

Making double use of the powering of 3 as in PRP-1 was very creative, and that efficiency is desirable, but it is not an immediate or global requirement. P-1 code that "merely" does P-1 efficiently and reliably on OpenCL is a good thing, an advance over the status quo. Any performance advantage, reusability of the full P-1 residue, etc., is a bonus.

There are possibilities for using the Jacobi symbol for error checking, which to my knowledge no other P-1 code has. It may be too expensive computationally to be a net productivity gain, but it might be useful as a user-controllable option. People sometimes get no factors for a long time and wonder if something is wrong.
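A sketch of how such a Jacobi check could work (my assumptions, not existing gpuowl code): stage 1 computes r = 3^E mod N, where E is the product of all prime powers up to B1 and therefore contains a power of 2; hence r is a perfect square mod N, so its Jacobi symbol (r|N) must be +1, and a computed -1 flags a hardware or software error.

```python
# Jacobi-symbol consistency check for a P-1 stage-1 residue (a sketch
# under the assumptions stated above, demonstrated on a tiny exponent).
import math

def jacobi(a, n):
    """Jacobi symbol (a|n) for odd n > 0 (standard binary algorithm)."""
    assert n > 0 and n % 2 == 1
    a %= n
    t = 1
    while a:
        while a % 2 == 0:             # pull out factors of 2
            a //= 2
            if n % 8 in (3, 5):
                t = -t
        a, n = n, a                   # quadratic reciprocity step
        if a % 4 == 3 and n % 4 == 3:
            t = -t
        a %= n
    return t if n == 1 else 0

# Tiny demo on M31 = 2^31 - 1 with B1 = 1000.
p, B1 = 31, 1000
N = (1 << p) - 1
E = 1
for q in range(2, B1 + 1):
    if all(q % d for d in range(2, math.isqrt(q) + 1)):  # q is prime
        E *= q ** int(math.log(B1, q))   # largest power of q <= B1
r = pow(3, E, N)                         # the stage-1 residue
assert E % 2 == 0                        # E includes 2^9 = 512
assert jacobi(r, N) == 1                 # check passes for a good residue
```

The check is only a net win if the Jacobi symbol of the multi-million-digit residue can be computed cheaply enough (e.g. on the CPU in the background), which is exactly the cost question raised above.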
Nobody knows what the error rate of current or past P-1 factoring is. I suppose it could be estimated from primality-test error rates, scaled linearly by run time and assuming errors occur at a constant rate over time: ~2% x 2.5% = ~0.05%. I think the complexities of CUDAPm1 make its error rate higher; its bug list is nontrivial.
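Spelling out that back-of-envelope estimate (both inputs are the post's assumed figures, not measurements):

```python
# The post's estimate: if ~2% of primality-test runs hit an error, and a
# P-1 run takes ~2.5% as long, then errors occurring at a constant rate
# per unit time give a per-P-1 error rate of about 0.05%.
prp_error_rate = 0.02    # assumed: ~2% of primality tests err
pm1_time_ratio = 0.025   # assumed: P-1 run time / primality-test run time
pm1_error_rate = prp_error_rate * pm1_time_ratio
print(f"{pm1_error_rate:.2%}")  # → 0.05%
```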
[QUOTE=preda;504858]In recent updates to GpuOwl I dropped the PRP-1 feature (which allowed doing a normal P-1 first stage before the PRP, and a P-1 second stage in parallel with the PRP).
This was because:
- most exponents at the wavefront (all?) already have TF to high bits and some P-1 already done.
- even without any P-1 done, the rate of factors found by PRP-1 is not huge, let's say about 5% depending on the level of TF

Thus the benefit of PRP-1 was marginal. To make good use of it, it would need to be run on larger exponents (not at the wavefront) that have no P-1 and lower TF. Dropping PRP-1 removes the dependency on the GMP library. If anybody has PRP-1 work ongoing (i.e. with B1 != 0), they should finish it before upgrading, because GpuOwl v6 can't do PRP-1.[/QUOTE] How does V6 compare in PRP speed to previous versions? Any data, any GPU model, any OS, any FFT length, anyone?
[QUOTE=kriesel;504932]How does V6 compare in PRP speed to previous versions? Any data, any GPU model, any OS, any FFT length, anyone?[/QUOTE]
Should be exactly the same speed...
-time crashes on Windows after collecting data (slowly)
Ran v3.8 on an RX480 in Windows 7, AMD Adrenalin driver 18.10.2, with the -time option. The GHz-day/day figures and GPU utilization are quite low (~16 GHz-day/day) compared to ~76 GHz-day/day without the -time option. The following is what it ran, in two sessions each terminating in an application crash. Next up, a try of the equivalent in v5.[CODE]
2019-01-04 15:32:48 condorella-rx480 gpuowl-OpenCL 3.8-91c52fa
2019-01-04 15:32:48 condorella-rx480 FFT 9216K: Width 1024 (256x4), Height 512 (64x8), Middle 9; 16.27 bits/word
2019-01-04 15:32:48 condorella-rx480 Note: using short carry kernels
2019-01-04 15:32:51 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2019-01-04 15:32:55 condorella-rx480 OpenCL compilation in 3921 ms, with "-DEXP=153500033u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-01-04 15:32:56 condorella-rx480 PRP M(153500033), FFT 9216K, 16.27 bits/word, 1045 GHz-day
2019-01-04 15:33:55 condorella-rx480 OK loaded: 146771000/153500033, blockSize 500, 6d17e08673529e2f
2019-01-04 15:34:14 condorella-rx480 OK initial check: 6d17e08673529e2f
2019-01-04 15:35:24 condorella-rx480 OK 146772000/153500033 [95.62%], 36.65 ms/it [35.55, 37.74] (16.0 GHz-day/day); ETA 2d 20:30; 8daf006dbb9b251c (check 19.69s) (saved)
2019-01-04 15:35:24 condorella-rx480 15.2% tailFused : 2605 [ 1999, 5101] us/call x 2499 calls
2019-01-04 15:35:24 condorella-rx480 15.1% carryFused : 3225 [ 2686, 7314] us/call x 1996 calls
2019-01-04 15:35:24 condorella-rx480 13.2% transposeW : 1613 [ 999, 4683] us/call x 3503 calls
2019-01-04 15:35:24 condorella-rx480 12.7% fftMiddleIn : 1552 [ 991, 4890] us/call x 3503 calls
2019-01-04 15:35:24 condorella-rx480 11.3% fftMiddleOut : 1608 [ 999, 5096] us/call x 3001 calls
2019-01-04 15:35:24 condorella-rx480 11.2% transposeH : 1590 [ 995, 4536] us/call x 3001 calls
2019-01-04 15:35:24 condorella-rx480 7.0% fftP : 1980 [ 1000, 4885] us/call x 1507 calls
2019-01-04 15:35:24 condorella-rx480 4.3% carryA : 1813 [ 1000, 4295] us/call x 1002 calls
2019-01-04 15:35:24 condorella-rx480 4.1% mulFused : 3452 [ 3000, 6524] us/call x 502 calls
2019-01-04 15:35:24 condorella-rx480 3.8% fftW : 1607 [ 993, 4640] us/call x 1005 calls
2019-01-04 15:35:24 condorella-rx480 2.1% carryB : 887 [ 0, 3332] us/call x 1005 calls
2019-01-04 15:35:24 condorella-rx480
2019-01-04 15:40:12 condorella-rx480 146780000/153500033 [95.62%], 35.99 ms/it [31.15, 37.90] (16.3 GHz-day/day); ETA 2d 19:12; 3c480a9a9b10ec2a
2019-01-04 15:40:12 condorella-rx480 26.4% carryFused : 3242 [ 2680, 7270] us/call x 7984 calls
2019-01-04 15:40:12 condorella-rx480 21.0% tailFused : 2577 [ 1998, 5352] us/call x 8000 calls
2019-01-04 15:40:12 condorella-rx480 13.7% transposeW : 1672 [ 998, 4898] us/call x 8032 calls
2019-01-04 15:40:12 condorella-rx480 13.3% fftMiddleOut : 1625 [ 993, 5161] us/call x 8016 calls
2019-01-04 15:40:12 condorella-rx480 12.8% transposeH : 1572 [ 994, 4640] us/call x 8016 calls
2019-01-04 15:40:12 condorella-rx480 12.6% fftMiddleIn : 1538 [ 985, 5001] us/call x 8032 calls
2019-01-04 15:40:12 condorella-rx480 0.1% fftP : 2188 [ 1179, 9406] us/call x 48 calls
2019-01-04 15:49:23 condorella-rx480 gpuowl-OpenCL 3.8-91c52fa
2019-01-04 15:49:23 condorella-rx480 FFT 9216K: Width 1024 (256x4), Height 512 (64x8), Middle 9; 16.27 bits/word
2019-01-04 15:49:23 condorella-rx480 Note: using short carry kernels
2019-01-04 15:49:26 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2019-01-04 15:49:30 condorella-rx480 OpenCL compilation in 3897 ms, with "-DEXP=153500033u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-01-04 15:49:31 condorella-rx480 PRP M(153500033), FFT 9216K, 16.27 bits/word, 1045 GHz-day
2019-01-04 15:50:32 condorella-rx480 OK loaded: 146772000/153500033, blockSize 500, 8daf006dbb9b251c
2019-01-04 15:50:50 condorella-rx480 OK initial check: 8daf006dbb9b251c
2019-01-04 15:51:36 condorella-rx480 OK 146773000/153500033 [95.62%], 27.31 ms/it [18.28, 36.34] (21.5 GHz-day/day); ETA 2d 03:02; a98af4a2b6a710da (check 17.96s) (saved)
2019-01-04 15:51:36 condorella-rx480 15.0% tailFused : 2543 [ 1999, 5639] us/call x 2499 calls
2019-01-04 15:51:36 condorella-rx480 14.9% carryFused : 3159 [ 2624, 8031] us/call x 1996 calls
2019-01-04 15:51:36 condorella-rx480 13.2% transposeW : 1596 [ 996, 4933] us/call x 3503 calls
2019-01-04 15:51:36 condorella-rx480 12.5% fftMiddleIn : 1505 [ 986, 5055] us/call x 3503 calls
2019-01-04 15:51:36 condorella-rx480 11.4% fftMiddleOut : 1601 [ 993, 5165] us/call x 3001 calls
2019-01-04 15:51:36 condorella-rx480 10.8% transposeH : 1530 [ 990, 4827] us/call x 3001 calls
2019-01-04 15:51:36 condorella-rx480 7.1% fftP : 2003 [ 1162, 7116] us/call x 1507 calls
2019-01-04 15:51:36 condorella-rx480 4.7% carryA : 2005 [ 1197, 6959] us/call x 1002 calls
2019-01-04 15:51:36 condorella-rx480 4.2% mulFused : 3544 [ 2986, 7226] us/call x 502 calls
2019-01-04 15:51:36 condorella-rx480 4.0% fftW : 1667 [ 996, 5241] us/call x 1005 calls
2019-01-04 15:51:36 condorella-rx480 2.1% carryB : 874 [ 0, 3383] us/call x 1005 calls
2019-01-04 15:51:36 condorella-rx480
2019-01-04 15:55:50 condorella-rx480 146780000/153500033 [95.62%], 36.41 ms/it [33.72, 38.19] (16.2 GHz-day/day); ETA 2d 19:58; 3c480a9a9b10ec2a
[/CODE]
v5 -time output on RX480, 79M PRP DC, Win
This is a no-P-1 run, just PRP. The -time output does not handle the zero-calls cases as gracefully as it could; probably an easy fix there. No crashes in 10 minutes. ETA with -time ~3 weeks, without it ~4 days. This is an assigned PRP DC. Unfortunately the first test was also with offset zero.[CODE]
C:\msys64\home\ken\gpuowl-compile\v5.0-9c13870>openowl -time
2019-01-04 16:19:46 gpuowl 5.0-9c13870
2019-01-04 16:19:46 -time
2019-01-04 16:19:46 79055077 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 16.75 bits/word
2019-01-04 16:19:46 using short carry kernels
2019-01-04 16:19:49 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2019-01-04 16:19:52 OpenCL compilation in 2627 ms, with "-DEXP=79055077u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2019-01-04 16:19:53 79055077.owl not found, starting from the beginning.
2019-01-04 16:20:40 79055077 OK 800 0.00%; 28.15 ms/sq, 0 MULs; ETA 25d 18:05; 3aa268928b9c975c (check 11.28s)
2019-01-04 16:20:41 nan% carryFused : 2446 us/call x 1568 calls
2019-01-04 16:20:41 nan% carryFusedMul : nan us/call x 0 calls
2019-01-04 16:20:41 nan% fftP : 1755 us/call x 98 calls
2019-01-04 16:20:41 nan% fftW : 953 us/call x 64 calls
2019-01-04 16:20:41 nan% fftH : 1340 us/call x 100 calls
2019-01-04 16:20:41 nan% fftMiddleIn : 1337 us/call x 1666 calls
2019-01-04 16:20:41 nan% fftMiddleOut : 1021 us/call x 1632 calls
2019-01-04 16:20:41 nan% carryA : 1781 us/call x 64 calls
2019-01-04 16:20:41 nan% carryM : nan us/call x 0 calls
2019-01-04 16:20:41 nan% carryB : 438 us/call x 64 calls
2019-01-04 16:20:41 nan% transposeW : 1053 us/call x 1666 calls
2019-01-04 16:20:41 nan% transposeH : 1600 us/call x 1632 calls
2019-01-04 16:20:41 nan% transposeIn : 1000 us/call x 4 calls
2019-01-04 16:20:41 nan% transposeOut : 0 us/call x 1 calls
2019-01-04 16:20:41 nan% square : nan us/call x 0 calls
2019-01-04 16:20:41 nan% multiply : 1939 us/call x 33 calls
2019-01-04 16:20:41 nan% multiplySub : nan us/call x 0 calls
2019-01-04 16:20:41 nan% tailFused : 2114 us/call x 1599 calls
2019-01-04 16:20:41 nan% readResidue : 1000 us/call x 2 calls
2019-01-04 16:20:41 nan% isNotZero : 9000 us/call x 1 calls
2019-01-04 16:20:41 nan% isEqual : 0 us/call x 1 calls
2019-01-04 16:20:41
2019-01-04 16:24:16 79055077 10000 0.01%; 23.40 ms/sq, 0 MULs; ETA 21d 09:47; fa9ad651bc910bc8
2019-01-04 16:24:16 nan% carryFused : 2100 us/call x 9177 calls
2019-01-04 16:24:16 nan% carryFusedMul : nan us/call x 0 calls
2019-01-04 16:24:16 nan% fftP : 1174 us/call x 69 calls
2019-01-04 16:24:16 nan% fftW : 848 us/call x 46 calls
2019-01-04 16:24:16 nan% fftH : 1130 us/call x 69 calls
2019-01-04 16:24:16 nan% fftMiddleIn : 1201 us/call x 9246 calls
2019-01-04 16:24:16 nan% fftMiddleOut : 960 us/call x 9223 calls
2019-01-04 16:24:16 nan% carryA : 1174 us/call x 46 calls
2019-01-04 16:24:16 nan% carryM : nan us/call x 0 calls
2019-01-04 16:24:16 nan% carryB : 435 us/call x 46 calls
2019-01-04 16:24:16 nan% transposeW : 983 us/call x 9246 calls
2019-01-04 16:24:16 nan% transposeH : 1342 us/call x 9223 calls
2019-01-04 16:24:16 nan% transposeIn : nan us/call x 0 calls
2019-01-04 16:24:16 nan% transposeOut : nan us/call x 0 calls
2019-01-04 16:24:16 nan% square : nan us/call x 0 calls
2019-01-04 16:24:16 nan% multiply : 2652 us/call x 23 calls
2019-01-04 16:24:16 nan% multiplySub : nan us/call x 0 calls
2019-01-04 16:24:16 nan% tailFused : 1850 us/call x 9200 calls
2019-01-04 16:24:16 nan% readResidue : 0 us/call x 1 calls
2019-01-04 16:24:16 nan% isNotZero : nan us/call x 0 calls
2019-01-04 16:24:16 nan% isEqual : nan us/call x 0 calls
[/CODE]
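The nan us/call entries come from kernels with zero calls, and the nan% column from a division against an empty or zero total. One plausible guard for these cases, as an illustrative sketch (Python here for brevity; gpuowl itself is C++/OpenCL, and this is not its actual code):

```python
# Sketch of the -time report fix suggested above: skip kernels with no
# calls, and avoid dividing by a zero grand total, instead of printing
# nan. (Illustrative only; kernel names/values mimic the log above.)
def timing_report(kernels):
    """kernels: list of (name, total_us, calls) tuples."""
    grand_total = sum(t for _, t, c in kernels if c > 0)
    lines = []
    for name, total_us, calls in kernels:
        if calls == 0:
            continue                    # was printing "nan% ... nan us/call"
        share = 100.0 * total_us / grand_total if grand_total else 0.0
        lines.append(f"{share:5.1f}% {name:<14}: {total_us // calls} us/call x {calls} calls")
    return lines

for line in timing_report([("carryFused", 2446 * 1568, 1568),
                           ("carryFusedMul", 0, 0),
                           ("fftP", 1755 * 98, 98)]):
    print(line)
```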