[QUOTE=kriesel;533104]A few theories of what may have happened:
1) An error in the hardware due to clock rates that are too high for the impeccable accuracy required by the inherent relative lack of P-1 computation error checks (most likely)
2) A software issue
3) A mammoth factor found that exceeds the allowed lengths of gpuowl's output formats, perhaps a composite of several factors (least likely)
4) Something else I haven't thought of
5) Some combination
I'll post an update after either the retest, or the availability of a new commit with longer output limits that could be run on the old saved files.[/QUOTE]
I've confirmed that a rerun from start of the exponent that gave a mammoth factor output the first time, with the more conservative clocks, has stage 1 P-1 res64s diverging from the first run beginning at 1540000<n<=1550000 iterations, or about 24% of the way through stage 1. So it's looking like #1, hardware error (attributable in turn to pilot error) at the moment. |
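For anyone wanting to repeat this kind of comparison, here is a minimal sketch of locating the first checkpoint where two runs disagree. It assumes the res64 values have already been pulled out of each run's log into a dict keyed by iteration count; the function name and the sample residues are made up for illustration and are not gpuowl output.

```python
def first_divergence(run_a, run_b):
    """Return the first iteration at which two runs report different res64 values.

    Each run is a dict mapping iteration count -> res64 hex string.
    Only iterations logged by both runs are compared.
    """
    for n in sorted(run_a.keys() & run_b.keys()):
        if run_a[n].lower() != run_b[n].lower():
            return n
    return None


# Hypothetical residues, for illustration only (not from any real run).
first_run = {
    1530000: "a1b2c3d4e5f60718",
    1540000: "0123456789abcdef",
    1550000: "deadbeefdeadbeef",
}
rerun = {
    1530000: "a1b2c3d4e5f60718",
    1540000: "0123456789abcdef",
    1550000: "feedfacefeedface",
}

print(first_divergence(first_run, rerun))
```

With checkpoints every 10000 iterations, a result of n only says the error happened somewhere in the interval ending at n, which is why the range above is written as 1540000<n<=1550000.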
[QUOTE=mrh;533135]I didn't actually notice until I started getting text messages that my card was running hot.
Without MERGED_MIDDLE, 6.11-90 is 1068 us/it, but power draw is 12W more than 6.11-84, and temp is much higher.[/QUOTE]
How hot? Running hot or fast raises the error rate. Eventually the error rate becomes high enough that the error-free period is shorter than the duration of a P-1 stage or two. P-1 can forgive some errors (they amount to using a different value for the base than 3 in the powering), but others are fatal to finding the correct factor. From the draft cudapm1 readme file, some test candidates with known-good results:
[CODE]
Run CUDAPm1 on some exponents with known factors that should be found,
and see whether you find them. Easiest way is to select from the following
list, exponents at or near the size you plan to run, and put them in the
worktodo file. The bounds necessary to find factors vary by exponent.
CUDAPm1's automatic parameter selection will be enough to find most but not all.

Exponent   Min B1   Min B2      fft length  notes
4444091    7        2,557       256k
10000831   29,173   492,251     ?
24000577   1        281,339     ?
50001781   94,709   4,067,587   2688k
51558151   5,953    2,034,041   2880k
54447193   1,181    682,009     3072k
58610467   70,843   694,201     3200k
61012769   10,273   1,572,097   3360k
81229789   6,709    11,282,221  4704K
100000081  1,289    7,554,653   5600K
120002191  1,563    3,109,391   7168K
150000713  15,131   2,294,519   8640K
200000183  953      1,138,061   11200K
200001187  204,983  207,821     11200K
200003173  4,651    229,813     11200K
249500221  4        2.58951e+9  14336K
249500501  307      167,381     14336K
290001377  2,551    34,354,769  16384K

PFactor=1,2,4444091,-1,70,2
PFactor=1,2,10000831,-1,68,2
PFactor=1,2,24000577,-1,70,2
PFactor=1,2,50001781,-1,74,2
PFactor=1,2,51558151,-1,74,2
PFactor=1,2,54447193,-1,74,2
PFactor=1,2,58610467,-1,74,2
PFactor=1,2,61012769,-1,74,2
PFactor=1,2,81229789,-1,75,2
PFactor=1,2,100000081,-1,76,2
Pfactor=1,2,120002191,-1,75,2
Pfactor=1,2,150000713,-1,75,2
Pfactor=1,2,200001187,-1,75,2
PFactor=1,2,249500501,-1,75,2
PFactor=1,2,290001377,-1,75,2

Exponent   Factor (may be composite)  Prime factors
4444091    1809798096458971047321927127 = 8888183 x 319974553 x 636358278473
10000831   646560662529991467527
24000577   13504596665207
50001781   4392938042637898431087689 = 3 x 182851 x 8008229
51558151   755277543419074012358186647
54447193   17261184235049628259201
58610467   69057033982979789260999
61012769   2018028590362685212673
81229789   355078783674010195200030259699844128700274440385857 = 488121804389130135740149369 x 727438890213848757119753
100000081  3441393510714285782119
120002191  100835659918276033441
150000713  1447762785107694357647
200000183  849003842550205126847
200001187  3050161780881530584679
200003173  14652109287435525414352647642348599 = 4320552944485007 x 3391257895852957657
249500221  5168661482381201657
249500501  3571511465549660434777661921959439 = 11607130072256471 x 307699788260867209
290001377  10645243382592701071676802590718709559 = 1436135993277492383 x 7412420155488583273
           or 90944796249039267769901814723364335322839708522092302667497 = * 170370076089478747961 * 371696926552024067119 * 1436135993277492383

Feel free to pick your own. Evaluate them at their equivalent of
http://www.mersenne.ca/exponent/249500501
[/CODE] |
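A reported factor can be sanity-checked offline before trusting a test run. The sketch below, in plain Python with hypothetical helper names, checks one entry from the list: f divides 2^p - 1 exactly when 2^p is congruent to 1 modulo f, and writing f - 1 = 2kp shows which primes the bounds have to cover. For the 24000577 entry, k works out to 281339, which matches the listed Min B2.

```python
def divides_mersenne(p, f):
    """True if f divides 2**p - 1, i.e. 2**p is congruent to 1 modulo f."""
    return pow(2, p, f) == 1


def prime_factors(n):
    """Prime factors of n by trial division (fine for small k, slow for large k)."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


# Exponent and factor copied from the readme list above.
p, f = 24000577, 13504596665207

# Factors of a Mersenne number with prime exponent p have the form 2*k*p + 1.
k, remainder = divmod(f - 1, 2 * p)

print(divides_mersenne(p, f), remainder, k, prime_factors(k))
```

P-1 finds f when every prime factor of k is at most B1, except possibly the largest, which only has to be at most B2. That is why an entry can have a tiny Min B1 and a much larger Min B2.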
Oh, not that kinda hot. I alert if the edge temp is over 75C. Normally I run with one of:
/opt/rocm/bin/rocm-smi --setfan 155 --setsclk 5
/opt/rocm/bin/rocm-smi --setfan 120 --setsclk 4
/opt/rocm/bin/rocm-smi --setfan 100 --setsclk 3
Which for me keeps the temp stable between 65 and 72C, depending on ambient. The settings above correspond to gpuowl using around 200W, 150W, 120W. These get selected based on a few factors, like solar output, time of day (electric costs), and indoor temp (not good to add heat if the A/C is running). Running conservatively like this, I've never had a PRP error, that I know of. I only rarely run P-1 with gpuowl, because I only have one VII card and it slows down the 24x7 PRP. |
[QUOTE=mrh;533135]FWIW, I went back to v6.11-84-geda9b17 which is a lot faster than v6.11-90-g2f94ace for me:
1008 us/it vs 1524 us/it
Both using -use FMA_X2,MERGED_MIDDLE with --setsclk 3.[/QUOTE]
If you can tell us your GPU and which new feature caused worse timings, we may be able to make the default settings better for you (and others with the same GPU). |
[QUOTE]mrh ( mailto:*** ) has reported this post:
This is the reason that the user gave: [B]Will do. Can’t get back to it until tomorrow. I’m using the Radeon VII.[/B] This message has been sent to all moderators of this forum, or all administrators if there are no moderators.[/QUOTE]
We are pretty sure that he intended to reply, not to report the post to mods. |
[QUOTE=Batalov;533148]We are pretty sure that he intended to reply, not to report the post to mods.[/QUOTE]
Doh! From my phone, and I can't see. Sorry! |
[QUOTE=kriesel;533136]I've confirmed that a rerun from start of the exponent that gave a mammoth factor output the first time, with the more conservative clocks, has stage 1 P-1 res64s diverging from the first run beginning at 1540000<n<=1550000 iterations, or about 24% of the way through stage 1.
So it's looking like #1, hardware error (attributable in turn to pilot error) at the moment.[/QUOTE]
The rerun at 1000 MHz RAM clock, 1400 MHz GPU clock went to 0x0 res64 repeatedly, beginning after 14 hours, at 86% completion of P-1 stage 1, with about 2:15 to go. It blindly continued on for hours and into stage 2. I will attempt a third run with nominal power limit, sub-nominal GPU clock (1400 MHz) and below-nominal RAM clock (950 MHz). It's not heat now; the GPU hot spot is 71C. 5M PRP time is 956 us/it. I must say I am disappointed by the reliability of the XFX Radeon VII.
Preda, if you haven't yet, please add res64 checks to P-1 stage 1. Periodic permanent save files to retreat to would be good, also. |
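To illustrate the kind of check being asked for, here is a rough sketch of spotting the first all-zero res64 in a sequence of checkpoints, so a run could be stopped instead of carrying on into stage 2. This is not gpuowl code; the function name and the sample values are hypothetical, and the residues are assumed to be available as hex strings.

```python
def first_zero_residue(checkpoints):
    """Return the iteration of the first all-zero res64, or None if there is none.

    checkpoints is an iterable of (iteration, res64_hex) pairs in run order.
    """
    for iteration, res64 in checkpoints:
        if int(res64, 16) == 0:
            return iteration
    return None


# Hypothetical checkpoints, for illustration only.
sample = [
    (100000, "3c5a9e1f00b2d7a4"),
    (110000, "0000000000000000"),
    (120000, "0x0"),
]

print(first_zero_residue(sample))
```

A zero residue is only the most obvious symptom. Errors that leave a plausible-looking nonzero residue need a comparison against a second run or an earlier save file.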
In a recent commit I enabled MERGED_MIDDLE by default. You can add
-use NO_MERGED_MIDDLE on the command line or in config.txt to get the old behavior. On ROCm 2.10 using MERGED_MIDDLE is more than 15% faster. |
It is warmer today here in Britain and I got a couple of Gerbicz errors on a number. So I am now running
[CODE]sh pp.sh 1 1160 830 1050 4 [/CODE]
and the fan at 130 (which is now much quieter). sensors say:
[CODE]amdgpu-pci-0300
Adapter: PCI adapter
vddgfx:       +0.90 V
fan1:        2466 RPM  (min =    0 RPM, max = 3850 RPM)
edge:         +69.0°C  (crit = +100.0°C, hyst = -273.1°C)
                       (emerg = +105.0°C)
junction:     +89.0°C  (crit = +110.0°C, hyst = -273.1°C)
                       (emerg = +115.0°C)
mem:          +75.0°C  (crit = +94.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:      216.00 W  (cap = 250.00 W)
[/CODE]
I'm now getting 794 us/it for FFT 5632K. No great hardship. It does seem the Radeon VII card is sensitive to ambient temperature. |
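For keeping an eye on this automatically, a small sketch of pulling the current temperatures out of sensors-style text and comparing them with chosen limits may help. It assumes the text looks like the output quoted above; the helper names and the limit values are illustrative choices, not part of lm-sensors or gpuowl.

```python
import re


def parse_temps(sensors_text):
    """Map each sensor label to its current temperature in degrees C."""
    temps = {}
    for line in sensors_text.splitlines():
        # Matches lines such as "edge: +69.0°C (crit = ...)"; continuation
        # lines starting with "(" and non-temperature lines are skipped.
        m = re.match(r"\s*(\w+):\s*\+?(-?\d+(?:\.\d+)?)\s*°C", line)
        if m:
            temps[m.group(1)] = float(m.group(2))
    return temps


def over_limit(temps, limits):
    """Return the labels whose temperature exceeds the given limit."""
    return [name for name, limit in limits.items()
            if temps.get(name, float("-inf")) > limit]


SAMPLE = """\
amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +0.90 V
fan1: 2466 RPM (min = 0 RPM, max = 3850 RPM)
edge: +69.0°C (crit = +100.0°C, hyst = -273.1°C)
 (emerg = +105.0°C)
junction: +89.0°C (crit = +110.0°C, hyst = -273.1°C)
 (emerg = +115.0°C)
mem: +75.0°C (crit = +94.0°C, hyst = -273.1°C)
 (emerg = +99.0°C)
power1: 216.00 W (cap = 250.00 W)
"""

temps = parse_temps(SAMPLE)
# 75 C on the edge sensor is the alert level mentioned earlier in the thread.
print(temps, over_limit(temps, {"edge": 75.0}))
```

In practice the text would come from running sensors periodically, and the limits would be whatever keeps your own card error-free.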
Timing drop in P-1 of V6.6 stage 2
After running steadily for days at ~153 ms/mul, stage 2 dropped to less than a third of that for the last few hours. This was v6.6-5-g667954b on an RX480 and Windows 7.
[CODE]2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
2019-12-19 04:36:41 Round 175 of 180: init 5.86 s; 45.46 ms/mul; 24506 muls
2019-12-19 04:55:12 Round 176 of 180: init 5.78 s; 45.56 ms/mul; 24254 muls
2019-12-19 05:13:44 Round 177 of 180: init 5.56 s; 45.50 ms/mul; 24307 muls
2019-12-19 05:32:13 Round 178 of 180: init 6.12 s; 45.46 ms/mul; 24268 muls
2019-12-19 05:50:51 Round 179 of 180: init 5.53 s; 45.47 ms/mul; 24476 muls
2019-12-19 06:09:29 Round 180 of 180: init 6.05 s; 45.49 ms/mul; 24441 muls
2019-12-19 06:22:22 530000039 P-1 final GCD: no factor[/CODE] |
I have no idea, sorry. Could the 2x speed-up be real? Or maybe some error situation was reached in which everything is fast? I don't know.
[QUOTE=kriesel;533210]After running steadily for days at ~153 ms/mul, stage 2 dropped to less than a third of that for the last few hours. This was v6.6-5-g667954b on an RX480 and Windows 7.
[CODE]2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
2019-12-19 04:36:41 Round 175 of 180: init 5.86 s; 45.46 ms/mul; 24506 muls
2019-12-19 04:55:12 Round 176 of 180: init 5.78 s; 45.56 ms/mul; 24254 muls
2019-12-19 05:13:44 Round 177 of 180: init 5.56 s; 45.50 ms/mul; 24307 muls
2019-12-19 05:32:13 Round 178 of 180: init 6.12 s; 45.46 ms/mul; 24268 muls
2019-12-19 05:50:51 Round 179 of 180: init 5.53 s; 45.47 ms/mul; 24476 muls
2019-12-19 06:09:29 Round 180 of 180: init 6.05 s; 45.49 ms/mul; 24441 muls
2019-12-19 06:22:22 530000039 P-1 final GCD: no factor[/CODE][/QUOTE] |
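One way to test whether the speed-up is real is to compare the wall-clock gap between consecutive log lines with the time the line itself implies (init plus muls times ms/mul). The sketch below does that for a few of the quoted lines. It assumes each line is written when its round finishes, which is a guess about the logging rather than something confirmed here, and the regular expression is written only against the format shown in this post.

```python
import re
from datetime import datetime

LINE = re.compile(
    r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) Round (\d+) of \d+: "
    r"init\s+([\d.]+) s;\s+([\d.]+) ms/mul;\s+(\d+) muls"
)


def check_rounds(log_text):
    """For each round after the first, return (round, wall_seconds, predicted_seconds).

    wall_seconds is the gap since the previous log line; predicted_seconds is
    init + muls * ms/mul taken from the line itself.
    """
    rows = []
    prev = None
    for m in LINE.finditer(log_text):
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if prev is not None:
            wall = (stamp - prev).total_seconds()
            predicted = float(m.group(3)) + int(m.group(5)) * float(m.group(4)) / 1000
            rows.append((int(m.group(2)), wall, round(predicted, 1)))
        prev = stamp
    return rows


LOG = """\
2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
"""

for row in check_rounds(LOG):
    print(row)
```

For these lines the two numbers agree to within about a second, so the reported ms/mul is consistent with the elapsed time: the rounds really did finish faster. That says nothing about whether the multiplications were still producing correct results.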