mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-12-17 23:03

[QUOTE=kriesel;533104]A few theories of what may have happened:
1) An error in the hardware due to clock rates that are too high for the impeccable accuracy required by the inherent relative lack of P-1 computation error checks (most likely)
2) A software issue
3) A mammoth factor found that exceeds the allowed lengths of gpuowl's output formats, perhaps a composite of several factors (least likely)
4) Something else I haven't thought of
5) Some combination

I'll post an update after either the retest, or the availability of a new commit with longer output limits that could be run on the old saved files.[/QUOTE]I've confirmed that a rerun from start of the exponent that gave a mammoth factor output the first time, with the more conservative clocks, has stage 1 P-1 res64s diverging from the first run beginning at 1540000<n<=1550000 iterations, or about 24% of the way through stage 1.
So it's looking like #1, hardware error (attributable in turn to pilot error) at the moment.
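
The comparison described above is easy to automate when both runs log res64 at the same iteration interval. A minimal Python sketch, assuming the residues have already been pulled out of the two logs into dicts keyed by iteration. The function name and the hex values are made up for illustration; only the 1540000 and 1550000 iteration bounds come from the post.

```python
def first_divergence(run_a, run_b):
    """Return (last_matching_iteration, first_differing_iteration), or None.

    run_a and run_b map iteration number -> res64 hex string, as logged by
    two runs of the same exponent at the same reporting interval.
    """
    last_match = 0
    for it in sorted(set(run_a) & set(run_b)):
        if run_a[it].lower() != run_b[it].lower():
            return (last_match, it)
        last_match = it
    return None


# Placeholder residues, NOT real gpuowl output.
first_run = {1530000: "a1b2c3d4e5f60718", 1540000: "0f1e2d3c4b5a6978",
             1550000: "deadbeefcafef00d", 1560000: "0123456789abcdef"}
rerun = {1530000: "a1b2c3d4e5f60718", 1540000: "0f1e2d3c4b5a6978",
         1550000: "1122334455667788", 1560000: "8877665544332211"}

print(first_divergence(first_run, rerun))
```

A divergence only shows that at least one of the two runs went wrong inside that interval; it does not say which one.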

kriesel 2019-12-17 23:16

[QUOTE=mrh;533135]I didn't actually notice until I started getting text messages that my card was running hot.
Without MERGED_MIDDLE, 6.11-90 is 1068 us/it, but power draw is 12W more than 6.11-84, and temp is much higher.[/QUOTE]How hot?
Running hot or fast raises the error rate. Eventually the error rate gets high enough that the error-free period is shorter than the duration of a P-1 stage or two. P-1 can forgive some errors (they amount to using a different base than 3 in the powering), but others are fatal to finding the correct factor.
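
To illustrate why the base does not have to be 3: stage 1 finds a prime factor q whenever q-1 divides the stage 1 exponent E, because then a^E = 1 (mod q) for any base a coprime to q. Below is a toy Python sketch on M29 = 2^29 - 1 = 233 x 1103 x 2089. It is the textbook method with B1 = 9 picked for the example, not gpuowl's code, and it only shows base independence; it does not model where in the powering a hardware error lands.

```python
from math import gcd


def small_primes(limit):
    """Primes <= limit by trial division (fine for toy bounds)."""
    return [n for n in range(2, limit + 1)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]


def pm1_stage1(p, b1, base=3):
    """Textbook P-1 stage 1 on M_p = 2^p - 1; returns gcd(base^E - 1, M_p)."""
    n = (1 << p) - 1
    e = 2 * p  # factors of M_p have the form 2*k*p + 1
    for q in small_primes(b1):
        qk = q
        while qk * q <= b1:
            qk *= q
        e *= qk  # highest power of q not exceeding B1
    return gcd(pow(base, e, n) - 1, n)


for a in (3, 5, 7):
    print(a, pm1_stage1(29, 9, base=a))
```

Each result is a multiple of 233 x 2089 = 486737, a composite factor, whichever base is used: 233 - 1 = 2^3 x 29 and 2089 - 1 = 2^3 x 3^2 x 29 both divide E. Finding 1103 is not guaranteed at B1 = 9, because 1103 - 1 = 2 x 19 x 29 needs B1 >= 19.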


From the draft cudapm1 readme file, some test candidates with known-good results:[CODE] Run CUDAPm1 on some exponents with known factors that should be found, and
see whether you find them. Easiest way is to select from the following list,
exponents at or near the size you plan to run, and put them in the worktodo
file. The bounds necessary to find factors vary by exponent. CUDAPm1's
automatic parameter selection will be enough to find most but not all.

Exponent   Min B1   Min B2      fft length  notes
4444091    7        2,557       256k
10000831   29,173   492,251     ?
24000577   1        281,339     ?
50001781   94,709   4,067,587   2688k
51558151   5,953    2,034,041   2880k
54447193   1,181    682,009     3072k
58610467   70,843   694,201     3200k
61012769   10,273   1,572,097   3360k
81229789   6,709    11,282,221  4704K
100000081  1,289    7,554,653   5600K
120002191  1,563    3,109,391   7168K
150000713  15,131   2,294,519   8640K
200000183  953      1,138,061   11200K
200001187  204,983  207,821     11200K
200003173  4,651    229,813     11200K
249500221  4        2.58951e+9  14336K
249500501  307      167,381     14336K
290001377  2,551    34,354,769  16384K

PFactor=1,2,4444091,-1,70,2
PFactor=1,2,10000831,-1,68,2
PFactor=1,2,24000577,-1,70,2
PFactor=1,2,50001781,-1,74,2
PFactor=1,2,51558151,-1,74,2
PFactor=1,2,54447193,-1,74,2
PFactor=1,2,58610467,-1,74,2
PFactor=1,2,61012769,-1,74,2
PFactor=1,2,81229789,-1,75,2
PFactor=1,2,100000081,-1,76,2
Pfactor=1,2,120002191,-1,75,2
Pfactor=1,2,150000713,-1,75,2
Pfactor=1,2,200001187,-1,75,2
PFactor=1,2,249500501,-1,75,2
PFactor=1,2,290001377,-1,75,2

Exponent   Factor (may be composite) [= Prime factors]
4444091    1809798096458971047321927127 = 8888183 x 319974553 x 636358278473
10000831   646560662529991467527
24000577   13504596665207
50001781   4392938042637898431087689 = 3 x 182851 x 8008229
51558151   755277543419074012358186647
54447193   17261184235049628259201
58610467   69057033982979789260999
61012769   2018028590362685212673
81229789   355078783674010195200030259699844128700274440385857
           = 488121804389130135740149369 x 727438890213848757119753
100000081  3441393510714285782119
120002191  100835659918276033441
150000713  1447762785107694357647
200000183  849003842550205126847
200001187  3050161780881530584679
200003173  14652109287435525414352647642348599
           = 4320552944485007 x 3391257895852957657
249500221  5168661482381201657
249500501  3571511465549660434777661921959439
           = 11607130072256471 x 307699788260867209
290001377  10645243382592701071676802590718709559
           = 1436135993277492383 x 7412420155488583273
           or 90944796249039267769901814723364335322839708522092302667497
           = 170370076089478747961 * 371696926552024067119 * 1436135993277492383

Feel free to pick your own.
Evaluate them at their equivalent of
http://www.mersenne.ca/exponent/249500501[/CODE]
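
Any factor a run reports, prime or composite, can be checked locally before looking it up: f divides 2^p - 1 exactly when 2^p = 1 (mod f). A short Python sketch of that check, using the first row of the factor table above as input. The function name is made up for this example.

```python
def divides_mersenne(p, f):
    """True if f divides 2^p - 1, checked without building the huge number."""
    return pow(2, p, f) == 1


# First row of the factor table: one prime factor, then the composite product.
print(divides_mersenne(4444091, 8888183))
print(divides_mersenne(4444091, 1809798096458971047321927127))
```

Three-argument pow does modular exponentiation, so this stays fast even for exponents in the hundreds of millions.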

mrh 2019-12-17 23:48

Oh, not that kinda hot. I alert if the edge temp is over 75C. Normally I run with one of:

/opt/rocm/bin/rocm-smi --setfan 155 --setsclk 5
/opt/rocm/bin/rocm-smi --setfan 120 --setsclk 4
/opt/rocm/bin/rocm-smi --setfan 100 --setsclk 3

For me these keep the temp stable between 65 and 72C, depending on ambient. The settings above correspond to gpuowl using around 200W, 150W, and 120W respectively. I pick between them based on a few factors, like solar output, time of day (electric costs), and indoor temp (not good to add heat if the A/C is running).

Running conservatively like this, I've never had a PRP error that I know of. I only rarely run P-1 with gpuowl, because I only have one VII card and it slows down the 24x7 PRP.

Prime95 2019-12-18 00:38

[QUOTE=mrh;533135]FWIW, I went back to v6.11-84-geda9b17 which is a lot faster than v6.11-90-g2f94ace for me:

1008 us/it vs 1524 us/it

Both using -use FMA_X2,MERGED_MIDDLE with --setsclk 3.[/QUOTE]

If you can tell us your GPU and which new feature caused worse timings, we may be able to make the default settings better for you (and others with the same GPU).

Batalov 2019-12-18 02:43

[QUOTE]mrh ( mailto:*** ) has reported this post:

This is the reason that the user gave:
[B]Will do. Can’t get back to it until tomorrow. I’m using the Radeon VII.[/B]

This message has been sent to all moderators of this forum, or all administrators if there are no moderators.[/QUOTE]We are pretty sure that he intended to reply, not to report the post to mods.

mrh 2019-12-18 03:41

[QUOTE=Batalov;533148]We are pretty sure that he intended to reply, not to report the post to mods.[/QUOTE]

Doh! From my phone, and I can't see. Sorry!

kriesel 2019-12-18 13:12

[QUOTE=kriesel;533136]I've confirmed that a rerun from start of the exponent that gave a mammoth factor output the first time, with the more conservative clocks, has stage 1 P-1 res64s diverging from the first run beginning at 1540000<n<=1550000 iterations, or about 24% of the way through stage 1.
So it's looking like #1, hardware error (attributable in turn to pilot error) at the moment.[/QUOTE]The rerun at 1000 MHz RAM clock and 1400 MHz GPU clock went to 0x0 res64 repeatedly, beginning after 14 hours, at 86% completion of P-1 stage 1, with about 2:15 to go. It blindly continued on for hours and into stage 2.

Will attempt a third run with nominal power limit, sub-nominal GPU clock (1400 MHz) and below-nominal RAM clock (950 MHz). It's not heat now; GPU hot spot is 71C. 5M PRP time is 956 us/it. I must say I am disappointed by the reliability of the XFX Radeon VII.

Preda, if you haven't yet, please add res64 checks to P-1 stage 1. Periodic permanent save files to retreat to would be good, also.
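
A zero res64 is a cheap thing to test for: once the full residue is zero, every later squaring stays zero, and a healthy run has only about a 2^-64 chance of showing 64 zero low bits. A minimal Python sketch of the kind of check being requested, run here over a list of logged checkpoints. This is not gpuowl code; the function name, the checkpoint format and all values are hypothetical.

```python
def last_good_before_zero(checkpoints):
    """Find where a run went to a zero residue.

    checkpoints: list of (iteration, res64 as int) in run order.
    Returns (last_good_iteration, first_zero_iteration), or None if no
    zero res64 appears. last_good_iteration is None if the very first
    checkpoint is already zero.
    """
    last_good = None
    for iteration, res64 in checkpoints:
        if res64 == 0:
            return (last_good, iteration)
        last_good = iteration
    return None


# Made-up checkpoints for illustration only.
log = [(5500000, 0x3C4D5E6F708192A3), (5510000, 0x91A2B3C4D5E6F708),
       (5520000, 0x0), (5530000, 0x0)]

print(last_good_before_zero(log))
```

With periodic save files kept, the first element of the result is the save to retreat to instead of restarting from iteration 0.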

preda 2019-12-18 23:42

In a recent commit I enabled MERGED_MIDDLE by default. You can add
-use NO_MERGED_MIDDLE
on the command line or in config.txt to get the old behavior.

On ROCm 2.10, using MERGED_MIDDLE is more than 15% faster.

paulunderwood 2019-12-19 11:28

It is warmer today here in Britain and I got a couple of Gerbicz errors on a number. So I am now running

[CODE]sh pp.sh
1 1160 830 1050 4
[/CODE]

and the fan at 130 (which is now much quieter). sensors say:

[CODE]amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +0.90 V
fan1: 2466 RPM (min = 0 RPM, max = 3850 RPM)
edge: +69.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +89.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +75.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 216.00 W (cap = 250.00 W)
[/CODE]

I'm now getting 794 us/it for FFT 5632K. No great hardship. It does seem the Radeon VII card is sensitive to ambient temperature.
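
For anyone who wants to alert on temperature rather than watch it, the sensors text above is simple to parse. A rough Python sketch, assuming the output looks like the sample in this post. The 75C edge limit is the alert threshold mentioned earlier in the thread; the function names are made up.

```python
import re

SAMPLE = """\
amdgpu-pci-0300
Adapter: PCI adapter
vddgfx: +0.90 V
fan1: 2466 RPM (min = 0 RPM, max = 3850 RPM)
edge: +69.0°C (crit = +100.0°C, hyst = -273.1°C)
(emerg = +105.0°C)
junction: +89.0°C (crit = +110.0°C, hyst = -273.1°C)
(emerg = +115.0°C)
mem: +75.0°C (crit = +94.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 216.00 W (cap = 250.00 W)
"""

# "name: +NN.N°C" at the start of a line; the "(emerg = ...)" lines do not match.
TEMP_LINE = re.compile(r"^[ \t]*(\w+):[ \t]*([+-]?\d+(?:\.\d+)?)°C", re.MULTILINE)


def read_temps(text):
    """Map sensor name -> current temperature in degrees C."""
    return {name: float(value) for name, value in TEMP_LINE.findall(text)}


def over_limit(temps, limits):
    """Names of sensors whose reading exceeds the given limit."""
    return sorted(name for name, limit in limits.items()
                  if temps.get(name, float("-inf")) > limit)


temps = read_temps(SAMPLE)
print(temps)
print(over_limit(temps, {"edge": 75.0}))
```

On the sample this reads edge 69.0, junction 89.0 and mem 75.0, and nothing is over the 75C edge limit.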

kriesel 2019-12-19 13:58

Timing drop in P-1 of V6.6 stage 2
 
After running steadily for days at ~153 ms/mul, stage 2 dropped to less than a third of that for the last few hours. This was v6.6-5-g667954b on an RX480 and Windows 7.

[CODE]2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
2019-12-19 04:36:41 Round 175 of 180: init 5.86 s; 45.46 ms/mul; 24506 muls
2019-12-19 04:55:12 Round 176 of 180: init 5.78 s; 45.56 ms/mul; 24254 muls
2019-12-19 05:13:44 Round 177 of 180: init 5.56 s; 45.50 ms/mul; 24307 muls
2019-12-19 05:32:13 Round 178 of 180: init 6.12 s; 45.46 ms/mul; 24268 muls
2019-12-19 05:50:51 Round 179 of 180: init 5.53 s; 45.47 ms/mul; 24476 muls
2019-12-19 06:09:29 Round 180 of 180: init 6.05 s; 45.49 ms/mul; 24441 muls
2019-12-19 06:22:22 530000039 P-1 final GCD: no factor[/CODE]
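
One sanity check on these numbers is to compare the logged ms/mul against the wall-clock gap between consecutive log lines. A Python sketch using the first four lines above. It assumes each line is written when its round finishes, which is an inference from the timestamps and not something documented here.

```python
import re
from datetime import datetime

LOG = """\
2019-12-19 02:16:16 Round 171 of 180: init 16.18 s; 153.09 ms/mul; 24346 muls
2019-12-19 03:18:38 Round 172 of 180: init 16.14 s; 152.69 ms/mul; 24398 muls
2019-12-19 03:59:29 Round 173 of 180: init 17.56 s; 99.85 ms/mul; 24374 muls
2019-12-19 04:18:01 Round 174 of 180: init 5.14 s; 45.54 ms/mul; 24312 muls
"""

ROUND = re.compile(
    r"^(\S+ \S+) Round (\d+) of \d+: init ([\d.]+) s; ([\d.]+) ms/mul; (\d+) muls",
    re.MULTILINE)


def parse_rounds(text):
    """Turn stage 2 round lines into dicts with typed fields."""
    rounds = []
    for stamp, rnd, init_s, ms_mul, muls in ROUND.findall(text):
        rounds.append({
            "time": datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
            "round": int(rnd),
            "init_s": float(init_s),
            "ms_per_mul": float(ms_mul),
            "muls": int(muls),
        })
    return rounds


def compare_with_clock(rounds):
    """List (round, wall_seconds, logged_seconds) for each round after the first."""
    out = []
    for prev, cur in zip(rounds, rounds[1:]):
        wall = (cur["time"] - prev["time"]).total_seconds()
        logged = cur["init_s"] + cur["muls"] * cur["ms_per_mul"] / 1000.0
        out.append((cur["round"], wall, round(logged, 1)))
    return out


for row in compare_with_clock(parse_rounds(LOG)):
    print(row)
```

For rounds 172 to 174 the two figures agree to within about a second, so the drop in ms/mul matches elapsed time and is not just a reporting glitch. It says nothing about whether the arithmetic in those rounds was correct.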

preda 2019-12-20 14:26

I have no idea, sorry. Could the speed-up (more than 3x by the log) be real? Or maybe some error situation was reached in which everything is fast? I don't know.



All times are UTC.