[QUOTE=kriesel;479017]
Gerbicz check should catch round off errors, right? [/QUOTE] Yes, Gerbicz check should catch almost any/all errors, including round-off errors. (that's why GpuOwl can get away with not tracking maximum round-off-error anymore). |
[QUOTE=preda;479021]Yes, Gerbicz check should catch almost any/all errors, including round-off errors. (that's why GpuOwl can get away with not tracking maximum round-off-error anymore).[/QUOTE]
Ah, yes, I hadn't considered that aspect of it. So in PRP mode you set your maxp values based on where the frequency of Gerbicz-check-based backtracking becomes high enough that the extra cost is comparable to running at the next-higher FFT length? Of course in your case that would mean really pushing the limit, since you don't have an FFT length (say) 10% larger/slower. |
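For anyone following along who hasn't seen how the check works, here is a minimal Python sketch of a Gerbicz-style check on a base-3 PRP squaring chain. It is an illustration under assumptions, not GpuOwl's code: the function name `gerbicz_prp`, the block size, the `corrupt_at` fault-injection parameter and the tiny exponent are all made up for the example. The idea is to keep a running product d of every block-th residue. Squaring each stored residue `block` times yields the next stored residue, so after each block the new product must equal 3 * d_old^(2^block) mod N.

```python
def gerbicz_prp(p, block=8, n_blocks=4, corrupt_at=None):
    """Square 3 repeatedly mod N = 2^p - 1, verifying each block Gerbicz-style.

    corrupt_at is a hypothetical fault injector: after that iteration the
    residue is bumped by 1 to mimic a round-off or hardware error.
    Returns (ok, iteration_of_last_check).
    """
    n = (1 << p) - 1
    x = 3   # x_i = 3^(2^i) mod n
    d = 3   # running product x_0 * x_block * x_2block * ...
    for i in range(1, block * n_blocks + 1):
        x = x * x % n
        if i == corrupt_at:
            x = (x + 1) % n          # simulated error
        if i % block == 0:
            d_prev, d = d, d * x % n
            # If every squaring in this block was correct, then
            # d == 3 * d_prev^(2^block) (mod n).
            expected = 3 * pow(d_prev, 1 << block, n) % n
            if d != expected:
                return False, i      # a real run would roll back here
    return True, block * n_blocks


if __name__ == "__main__":
    print(gerbicz_prp(31))                 # clean run
    print(gerbicz_prp(31, corrupt_at=11))  # error injected in the second block
```

Note that verifying a block costs about `block` extra squarings in this sketch, so a real implementation would presumably check far less often than every block and amortize the cost; how often GpuOwl does so is not shown here.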
testing the error cliff at 8M -DP
Is it ok to alternate -legacy or not, and -fft DP with -fft M61, on a single exponent? It seems to accept it fine, in V1.9, continuing the exponent.
[QUOTE=preda;478922]This exponent is too large for 8M FFT (it has 18.35 bits/word).[/QUOTE]
All the following are from gpuOwL v1.9-74f1a38 and run with 8M fft length, except for 76812401 running with 4M. All are with -DP transform except #9 below as described there. All is on a Radeon RX550 (2GB) not driving a display. Gerbicz check is automatic and should catch round off errors and other types.

1 Complete PRP run of 76812401, started with build 94aa58f; switched to 74f1a38 at iteration 25373000, not run with -legacy at all
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation in 2147 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=76812401u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 76812401 (18.31 bits/word)
-verbosity 2 starting at iteration 53729500; subsequently, cv typically close to 2%, 1.7-1.9% usual, rare values 2.5, 2.6, 18, 33., completed with no errors flagged, 12ms/iter
[CODE]
2
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation in 2527 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=154000001u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u -DFP_DP=1 -save-temps=df/DP_8M"
PRP-3: FFT 8M (2048 * 2048 * 2) of 154000001 (18.36 bits/word)
no errors in first 300k iterations, cv nicely low at 0.2%, then many errors before 350k iterations, could not advance past 332500 iterations, run terminated after 39 errors.
app stop & restart did not advance it any further, nor did system restart.
Believed to be due to too high an exponent for the fft length and transform type combination.
Note, the above was not run with -legacy option.

3 (exponent at CUDALucas 8M fft length upper limit, started without -legacy option)
gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df
PRP-3: FFT 8M (2048 * 2048 * 2) of 149447533 (17.82 bits/word)
400000 iterations, max cv 0.3%, no errors indicated
stopped and restarted with -legacy option added, iteration time dropped from 27.37 to 21.36 ms/iter; no errors when iteration 900000 reached; max cv 1.5% until 900000 reported 11.6%
-legacy has consistently higher cv, 1.-1.5% typical, in this run for the same exponent on the same hardware

4 gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 152000239 (18.12 bits/word)
1,452,000 iterations, max cv 2.1%, no errors indicated, 21.43 ms/iter

5 gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 152500021 (18.18 bits/word)
200,000 iterations, max cv 2.2% until 11.8 at 200,000, 16.5% following, then settled down to 1.2% or less for the rest of a million iterations; no errors indicated, 21.44 ms/iter
[/CODE]
(new below here, not posted previously)

6 gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 153000031 (18.24 bits/word)
no errors in 1.25 million iterations, max cv 1.4% except for 11.6% at 1.1 million, 21.37ms/iter

7 gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 153500033 (18.30 bits/word)
no errors in 1.1 million iterations, max cv 1.5% except for 14.2% at 60,000 iterations during interactive use, 21.34ms/iter

8 gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 153800111 (18.33 bits/word)
1,000,000 iterations, max cv 1.4% except 34.4% at 10,000 iterations, 21.34ms/iter

9 gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 153900001 (18.35 bits/word)
1,588,000 iterations (1.03%), cv 1.1-1.25 typ 2.1% max, no errors indicated, 21.40ms/iter
switched to no legacy, continued, at 27.27 msec/iteration, until a persistent error near 2.5 million iterations. App stop and restart advanced the iteration counter from 2496000 to 2498500. Another stop/restart did not advance it at all. Restart with -legacy handled the trouble spot as well as speeding it up again. Running briefly in all four combinations of -fft DP or M61 and -legacy or not, but mostly -legacy (and default -DP), it has reached 3.84 million iterations.
(end) |
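The bits/word figures in the logs above are simply the exponent divided by the number of FFT words, where the word count is 2^LOG_NWORDS (22 for the 4M runs, 23 for the 8M runs, as shown in the compile options). A small Python sketch for reproducing them follows; the function name `bits_per_word` is only illustrative.

```python
def bits_per_word(exponent, log_nwords):
    """Average number of bits stored per FFT word.

    log_nwords is the -DLOG_NWORDS value from the log (22 for 4M, 23 for 8M).
    """
    return exponent / (1 << log_nwords)


if __name__ == "__main__":
    # (exponent, LOG_NWORDS) pairs taken from the runs listed above
    runs = [(76812401, 22), (149447533, 23), (152000239, 23),
            (153900001, 23), (154000001, 23)]
    for exponent, log_nwords in runs:
        print(f"{exponent:>10}  {bits_per_word(exponent, log_nwords):.2f} bits/word")
```

Sorting the runs by this number puts the one that failed (154000001, not run with -legacy) at the top.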
[QUOTE=kriesel;479243]Is it ok to alternate -legacy or not, and -fft DP with -fft M61, on a single exponent? It seems to accept it fine, in V1.9, continuing the exponent.
[/QUOTE] Yes: GpuOwl now saves "compacted bits", which means that the E bits (for exponent E) are in a compact representation which is independent of:
- the number of words (4M or 8M fft)
- whether the words are "balanced" (DP) or not (M61)

So it is possible to change the algorithm or size in the middle. |
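To make the idea concrete, here is a rough Python sketch of why an E-bit residue can be moved between word counts and between balanced and unbalanced words. It assumes the usual variable-width layout in which word i starts at bit ceil(i*E/N); whether GpuOwl's save file uses exactly this layout is an assumption, and all function names are invented for the example.

```python
def bit_offset(E, N, i):
    """Start bit of word i when E bits are spread over N words: ceil(i*E/N)."""
    return -(-i * E // N)


def residue_to_words(r, E, N):
    """Split an E-bit residue into N unbalanced (non-negative) words."""
    words = []
    for i in range(N):
        lo, hi = bit_offset(E, N, i), bit_offset(E, N, i + 1)
        words.append((r >> lo) & ((1 << (hi - lo)) - 1))
    return words


def balance(words, E):
    """Convert to balanced (signed) words by pushing carries upward."""
    N, out, carry = len(words), [], 0
    for i, w in enumerate(words):
        width = bit_offset(E, N, i + 1) - bit_offset(E, N, i)
        w += carry
        carry = 1 if w >= (1 << (width - 1)) else 0
        out.append(w - (carry << width))
    # The last carry wraps to word 0 because 2^E = 1 (mod 2^E - 1).
    # This sketch does not re-normalize word 0 afterwards.
    out[0] += carry
    return out


def words_to_residue(words, E):
    """Recombine words (balanced or not) into the residue mod 2^E - 1."""
    N = len(words)
    total = sum(w << bit_offset(E, N, i) for i, w in enumerate(words))
    return total % ((1 << E) - 1)


if __name__ == "__main__":
    E = 89
    r = pow(3, 1 << 20, (1 << E) - 1)
    compact = words_to_residue(balance(residue_to_words(r, E, 8), E), E)
    # Same residue, re-expanded at twice the word count:
    print(words_to_residue(residue_to_words(compact, E, 16), E) == r)
```

The compact form here is just the integer below 2^E - 1, so nothing about the FFT size or the word representation survives in it.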
1 Attachment(s)
[QUOTE=preda;471318]I recently understood how to implement a "Fast Galois Transform" (FGT) which is simply complex arithmetic with integers modulo some number M.
I had hope that this integer-only transform may be faster on the GPU because it does not use double-precision floating point (which is slow on commodity GPUs). So I had fun and implemented FGT modulo M(31)=2^31-1 and modulo M(61)=2^61-1. Unfortunately the hoped performance gain was not there, but it was a very cool exercise nevertheless.

Anyway, now it's possible to select among these 4 transforms:
-fft DP : the old double precision floating point
-fft SP : simple precision FP
-fft M61 : FGT(M61)
-fft M31 : FGT(M31)

Of these, SP is very fast but also useless at 2M FFT-size and up (it may prove useful for something at lower FFT sizes). M31 has about 5 bits-per-word usable at 4M FFT size. It's not much use by itself, but can be tested. M61 has deeper word bits than DP. So it can be used for real work. Unfortunately it's also slower than DP. Part of the slowness may be from poor compiler optimizations and that aspect may improve in the future, hopefully. ... [/QUOTE]

What are the best easily available estimates or data for the min and max exponents versus fft length for the M61 transform? (Or equivalently, bits/word min and max for each fft length when using M61?)

M61 may be slower than DP _for the same fft length_, but it's faster, at least on the RX550 in V1.9 build 74f1a38 on Windows 7 Pro, than being pushed to the next larger length of DP, in the range permitted by M61's greater bits/word. See the attached table, with relative speeds from recent testing here. This means that primality and double checking with M61 could give quicker runs for some exponent ranges by allowing use of a smaller FFT length in M61 than in DP; approximately the effect of intermediate fft lengths, but with existing code.

There may be a similar, modest, effect of extending a given fft length to higher exponents, going on between default DP and legacy DP, perhaps due to the deeper carry length of legacy. Legacy DP is also faster at all fft lengths, on the RX550.

-legacy appears to have little or no effect on -fft M61. Does it apply only to DP?

Few things bring more clarity to how much I don't yet know than making a new spreadsheet and discovering how few cells I can fill in, or even how long it takes to realize what rows and columns to include or questions to address with it. (Although as I recall, being married was also pretty effective, at showing me how much I didn't know, about many things.) |
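As a side note on what "complex arithmetic with integers modulo M" looks like for M61, here is a small Python sketch. It is not taken from GpuOwl; the names `reduce61`, `cmul` and `cpow` are invented for illustration. Reduction mod 2^61-1 needs only shifts and adds because 2^61 is congruent to 1, and the pair (2^30, 2^30) turns out to be a primitive 8th root of unity, since 2^31 squares to 2 mod M61.

```python
M61 = (1 << 61) - 1


def reduce61(x):
    """Reduce 0 <= x < 2^122 modulo 2^61 - 1 with two folds (2^61 = 1 mod M61)."""
    x = (x & M61) + (x >> 61)
    x = (x & M61) + (x >> 61)
    return 0 if x == M61 else x


def cmul(u, v):
    """(a + b*i) * (c + d*i) with all components taken mod M61."""
    a, b = u
    c, d = v
    re = reduce61(reduce61(a * c) + M61 - reduce61(b * d))
    im = reduce61(reduce61(a * d) + reduce61(b * c))
    return re, im


def cpow(u, k):
    """u**k by plain repeated multiplication (fine for small k)."""
    result = (1, 0)
    for _ in range(k):
        result = cmul(result, u)
    return result


if __name__ == "__main__":
    w8 = (1 << 30, 1 << 30)   # (1 + i) / sqrt(2), with 1/sqrt(2) = 2^30 mod M61
    print(cpow(w8, 2))        # the element i
    print(cpow(w8, 8))        # back to 1
```

Every intermediate product here is an exact integer, which is why there is no round-off error to track with this transform. The cost is 61-bit by 61-bit multiplies, which GPUs do not have natively.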
gpuowl v1.9-74f1a38 notes
Worktodo entries are apparently case sensitive.
PRP= records are parsed and run; prp= records are ignored.

Sometimes verbosity 2 does not add its additional output. This has been seen in entire runs where EE records were seen immediately, and also in the following run, second line beginning OK:
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation in 7690 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77231809u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFGT_61=1 -DLOG_ROOT2=49u "
PRP-3: FFT 4M (1024 * 2048 * 2) of 77231809 (18.41 bits/word)
Starting at iteration 21500
OK 21500 / 77231809 [ 0.03%], 0.00 ms/it; ETA 0d 00:00; 5bd2a3aec7d2916b [05:29:02]
OK 22000 / 77231809 [ 0.03%], 28.52 ms/it; ETA 25d 11:35; cde99a5f448708c8 [05:29:26]
OK 24000 / 77231809 [ 0.03%], 18.93 ms/it [18.91, 18.97] CV 0.2%, check 10.83s; ETA 16d 21:58; 3790afa2fa84a448 [05:30:15]
OK 25000 / 77231809 [ 0.03%], 18.94 ms/it [18.91, 18.97] CV 0.2%, check 10.70s; ETA 16d 22:08; e2e98b9012b3f744 [05:30:45][/CODE]
It would be good if GpuOwL logged all the calling options, and whatever defaults or ini file settings were used, such as -fft DP or M61, -legacy or current kernels, along with the fft size. |
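For anyone who wants to track CV values or spot lines where the verbosity 2 fields are missing, here is a rough Python sketch that parses the OK lines in the format shown above. The regular expression is inferred only from these sample lines, so it is an assumption rather than a description of GpuOwl's output format. EE lines are not covered because none appear in the sample.

```python
import re

# Pattern inferred from the sample log lines; the bracketed timing range,
# CV and check time are optional because some lines omit them.
OK_LINE = re.compile(
    r"^OK\s+(?P<iteration>\d+) / (?P<exponent>\d+) \[\s*(?P<percent>[\d.]+)%\], "
    r"(?P<ms_per_it>[\d.]+) ms/it"
    r"(?: \[(?P<ms_min>[\d.]+), (?P<ms_max>[\d.]+)\] CV (?P<cv>[\d.]+)%, "
    r"check (?P<check_s>[\d.]+)s)?"
    r"; ETA (?P<eta>\d+d \d\d:\d\d); (?P<res64>[0-9a-f]{16}) \[(?P<clock>[\d:]+)\]\s*$"
)


def parse_ok_line(line):
    """Return a dict of fields for an OK line, or None if the line doesn't match."""
    m = OK_LINE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["iteration"] = int(rec["iteration"])
    rec["exponent"] = int(rec["exponent"])
    for key in ("percent", "ms_per_it", "ms_min", "ms_max", "cv", "check_s"):
        if rec[key] is not None:
            rec[key] = float(rec[key])
    return rec


if __name__ == "__main__":
    sample = ("OK 24000 / 77231809 [ 0.03%], 18.93 ms/it [18.91, 18.97] CV 0.2%, "
              "check 10.83s; ETA 16d 21:58; 3790afa2fa84a448 [05:30:15]")
    rec = parse_ok_line(sample)
    print(rec["iteration"], rec["ms_per_it"], rec["cv"])
```

A line whose `cv` field comes back as `None` is one where the extra verbosity 2 output did not appear.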
[QUOTE=preda;479270]Yes: GpuOwl now saves "compacted bits", which means that the E bits (for exponent E) are in a compact representation which is independent of:
- the number of words (4M or 8M fft)
- whether the words are "balanced" (DP) or not (M61)

So it is possible to change the algorithm or size in the middle.[/QUOTE]
This suggests some possibilities, such as having it switch transforms if it runs into error trouble, and testing multiple approaches for speed and reliability, on the fly or at startup or resumption of an exponent. It also bodes well for being able to finish an exponent started by one software version, with another that brings new features that may be faster. |
[QUOTE=Madpoo;478937] Having a good selection of FFT sizes is pretty cool because Prime95 does that whole thing of doing a test on exponents near a boundary to see if the smaller FFT is doing okay or not after however many iterations and switching to the next larger one up if not.
Although, I have a sneaking suspicion there may be more bad results on exponents around those FFT boundaries compared to the ratio of bad results smack dab in the middle of a range. Just a hunch though, I haven't crunched the #'s and with Prime95 results it can be hard to squeeze out the FFT size it used for the test.[/QUOTE] CUDALucas includes fft length in its results record. It's one value, not necessarily reflecting how it may have switched up and down during a run, for varying error values, so it's limited certainty, but may be worth a look at. As I recall somewhat fuzzily, you reported a while ago it was producing a disproportionate percentage of the false prime reports, or bad mismatching residues, or both. Be cautious of results that say they were before CUDALucas V2.051, or really 2.06 May 5 2017 beta. There were invalid-residue checks missing from the CUDALucas Windows executable before 2.06beta, and I think both linux and Windows before 2.051. There was a time early in the development of the opencl port, clLucas, that its results were being reported as CUDALucas instead (before primenet manual reporting was extended for clLucas). Until the mid September 2013 release of clLucas v1.01, as I interpret the cllucas forum posts, cllucas even called itself CUDALucas in its own output. Might be worth a look separately by application. In your "abundant spare time". ;) It would be useful to the maintainers, and to the more conscientious users, to know which apps at their current release are still more error prone, and if it relates to exponent values and fft lengths. |
Distinguishing gpus, dual RX550s, dueling drivers, pcie x16 vs x1
1 Attachment(s)
[QUOTE=preda;478186]
For the device identification problem, I'll keep thinking of a solution (other than variable device order id).[/QUOTE]
I've decided NVIDIA and AMD in the same system is more trouble than it's worth. So I've removed the old NVS295 card, which used to drive the display, and installed a second, 4GB, RX550, on which the display is now running. (Some cleanup yet to do on that driver situation.)

There are two things that distinguish the cards from each other in GPU-Z reports, beyond PCI slot id and width in this instance: GPU ram, and the ratio between fan speed and percentage. The one shown on the left of the attached image is a full height 4GB, added recently; on the right is the 2GB low profile card with small diameter fast fans, on which my previous tests were performed.
4 GB 1455 rpm / 26% x 100% =~ 5596 rpm max
2 GB 2778 rpm / 30% x 100% =~ 9260 rpm max

GpuOwL v1.9, or perhaps Windows 7, does not cope well with running on the 4GB card in the current situation. GpuOwL will run on the 4GB gpu for a few lines of output and then hang. BSODs seem to be way up in frequency and correlate with running gpuowl 1.9 on the 4GB card. I'm working on resolving that, and running gpuowl on the 2GB card for now.

When running gpuOwL on the same card running the display, and using the local display rather than remote access, the screen seemed sluggish; I'm not aware of any option in gpuOwL equivalent to the -polite option in CUDALucas, which gives display operations a turn now and then (with frequency settable). I'm also periodically seeing event log entries similar to the NVIDIA driver timeout but for AMD; looking into that. Approaches described in [URL]https://answers.microsoft.com/en-us/windows/forum/windows_10-hardware-winpc/display-driver-amdkmdap-stopped-responding-and-has/3021f6da-6289-4ccc-bab1-2d48b3d83424?auth=1[/URL] have not resolved it.
The good news is the 4GB card on a 1x/16x extender from an 8x slot was only a few percent slower than the same exponent on the 2GB card in a 16x slot. I'll reevaluate that speed and also reliability after a full-width extender arrives. An extender is necessary for one of the cards because they are dual-width and the two connectors are one width apart on the motherboard. (I've used 1x/16x extenders with Quadro 2000s on other systems. A difference is those were 1x motherboard connectors.)
[CODE]gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @3:0.0, gfx804 1203MHz
OpenCL compilation in 2269 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77231809u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 77231809 (18.41 bits/word)
[2018-02-07 10:43:06 Central Standard Time]
Starting at iteration 6280000
OK 6280000 / 77231809 [ 8.13%], 0.00 ms/it; ETA 0d 00:00; 6cc2113b9ef2a2af [10:43:14]
OK 6281000 / 77231809 [ 8.13%], 11.16 ms/it [11.08, 11.24] CV 1.0%, check 7.28s; ETA 9d 03:55; fd61aed64b9339a9 [10:43:32]
OK 6285000 / 77231809 [ 8.14%], 11.12 ms/it [11.08, 11.25] CV 0.5%, check 7.22s; ETA 9d 03:06; 1796833389325313 [10:44:24]
OK 6290000 / 77231809 [ 8.14%], 11.22 ms/it [11.09, 11.97] CV 2.4%, check 7.21s; ETA 9d 05:05; 1bb7d709088356f5 [10:45:27]
[/CODE]
Above is the newly installed device 0 4GB card on 1x extender, until hang or BSOD; switching gpuOwL to device 1 2GB card below.
[CODE]
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz
OpenCL compilation in 2199 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=77231809u -DWIDTH=1024u -DHEIGHT=2048u -DLOG_NWORDS=22u -DFP_DP=1 "
PRP-3: FFT 4M (1024 * 2048 * 2) of 77231809 (18.41 bits/word)
[2018-02-07 10:56:49 Central Standard Time]
Starting at iteration 6290000
OK 6290000 / 77231809 [ 8.14%], 0.00 ms/it; ETA 0d 00:00; 1bb7d709088356f5 [10:56:56]
OK 6291000 / 77231809 [ 8.15%], 10.79 ms/it [10.79, 10.79] CV 0.0%, check 7.07s; ETA 8d 20:42; ed282746a277cda4 [10:57:14]
OK 6295000 / 77231809 [ 8.15%], 10.79 ms/it [10.76, 10.83] CV 0.2%, check 7.11s; ETA 8d 20:33; 008170d81773c0a0 [10:58:05]
OK 6300000 / 77231809 [ 8.16%], 10.89 ms/it [10.76, 11.70] CV 2.6%, check 7.18s; ETA 8d 22:32; 5765572276d5c1bf [10:59:06]
OK 6310000 / 77231809 [ 8.17%], 10.84 ms/it [10.76, 11.73] CV 2.0%, check 7.05s; ETA 8d 21:37; 49a6afa09e399ddd [11:01:02]
OK 6320000 / 77231809 [ 8.18%], 10.83 ms/it [10.76, 11.70] CV 1.9%, check 7.05s; ETA 8d 21:20; 5fa8d6fe1231d966 [11:02:57]
OK 6340000 / 77231809 [ 8.21%], 10.85 ms/it [10.76, 11.70] CV 2.1%, check 7.14s; ETA 8d 21:42; 9e11d36ece31e955 [11:06:41]
OK 6360000 / 77231809 [ 8.23%], 10.88 ms/it [10.76, 11.73] CV 2.4%, check 6.99s; ETA 8d 22:06; be9cfcd73b639a9f [11:10:26]
OK 6380000 / 77231809 [ 8.26%], 10.83 ms/it [10.76, 11.70] CV 1.9%, check 6.97s; ETA 8d 21:09; 00fd79a4d399f936 [11:14:09]
OK 6400000 / 77231809 [ 8.29%], 10.85 ms/it [10.76, 11.70] CV 2.1%, check 7.13s; ETA 8d 21:27; 293981d022ec08ac [11:17:53]
OK 6450000 / 77231809 [ 8.35%], 10.85 ms/it [10.76, 11.73] CV 2.1%, check 7.07s; ETA 8d 21:21; eb8bf801263bd009 [11:27:03]
[/CODE]
GpuOwL 1.9 ran reliably for days on the 2GB low profile RX550, except for power outages, before addition of the second RX550, and appears so far to be doing so with the second RX550 present, except for system BSODs and power outages. |
Thanks for the detailed analysis. I concur with your observations:
- M61 does have higher bit capacity at the same FFT size (than DP). I don't know exactly how much, maybe something around 21 bits-per-word (?).
- M61 is surprisingly slow. It may be faster than "double the FFT" at DP though.
- M61 always uses "long carry", thus the -legacy option doesn't affect it as you say.

[QUOTE=kriesel;479272]What are the best easily available estimates or data for the min and max exponents versus fft length for the M61 transform? (Or equivalently, bits/word min and max for each fft length when using M61?)

M61 may be slower than DP _for the same fft length_, but it's faster, at least on the RX550 in V1.9 build 74f1a38 on Windows 7 Pro, than being pushed to the next larger length of DP, in the range permitted by M61's greater bits/word. See the attached table, with relative speeds from recent testing here. This means that primality and double checking with M61 could give quicker runs for some exponent ranges by allowing use of a smaller FFT length in M61 than in DP; approximately the effect of intermediate fft lengths, but with existing code.

There may be a similar, modest, effect of extending a given fft length to higher exponents, going on between default DP and legacy DP, perhaps due to the deeper carry length of legacy. Legacy DP is also faster at all fft lengths, on the RX550.

-legacy appears to have little or no effect on -fft M61. Does it apply only to DP?

Few things bring more clarity to how much I don't yet know than making a new spreadsheet and discovering how few cells I can fill in, or even how long it takes to realize what rows and columns to include or questions to address with it. (Although as I recall, being married was also pretty effective, at showing me how much I didn't know, about many things.)[/QUOTE] |
[QUOTE=kriesel;479557]
Gpuowl v1.9, or perhaps Windows 7, does not handle well, running on the 4GB card in the current situation. GpuOwL will run on the 4GB gpu for a few lines of output and then hang. BSODs seem to be way up in frequency and correlate with running gpuowl 1.9 on the 4GB card. I'm working on resolving that, and running gpuowl on the 2GB card for now. [/QUOTE] A wild guess: maybe the PCI extender has something to do with the GPU-freeze? (maybe you could try to move the freezing-GPU to a direct PCI slot and see if the behavior changes?) |