mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-10-02 15:04

And...
 
Nonzero pseudorandomly selected shift for gpuowl PRP would be useful. It would make life easier for uncwilly et al in the double, triple, quad checking effort, and gpuowl results could be checked with gpuowl.
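A rough sketch (plain Python, not gpuowl's actual arithmetic) of why a nonzero shift still double-checks: mod Mp = 2^p - 1, multiplying by 2 is a cyclic rotation of the p-bit residue, so a run started from a rotated 3 can be unrotated at the end and compared bit-for-bit against a zero-shift run.

```python
def prp_residue(p, iters, shift=0):
    """PRP-3 squarings mod Mp = 2^p - 1, with the residue stored
    rotated left by `shift` bits (multiplying by 2 mod Mp cyclically
    rotates the p-bit residue)."""
    Mp = (1 << p) - 1
    x = (3 << shift) % Mp            # start from 3, pre-rotated
    total = shift                    # accumulated rotation, mod p
    for _ in range(iters):
        x = (x * x) % Mp             # one PRP squaring step
        total = (total * 2) % p      # the rotation doubles each squaring
    # undo the rotation: 2^p == 1 (mod Mp), so 2^(-total) == 2^(p - total)
    return (x * pow(2, (p - total) % p, Mp)) % Mp

p, iters = 89, 1000
reference = prp_residue(p, iters)    # zero-shift run
assert all(prp_residue(p, iters, s) == reference for s in (1, 17, 42))
print("shifted runs match the zero-shift residue")
```

Different shifts exercise different carry/rounding patterns in the FFT arithmetic while still producing comparable residues, which is what makes same-program double checks with distinct shifts meaningful.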

axn 2019-10-02 15:12

[QUOTE=kriesel;527172]Nonzero pseudorandomly selected shift for gpuowl PRP would be useful. It would make life easier for uncwilly et al in the double, triple, quad checking effort, and gpuowl results could be checked with gpuowl.[/QUOTE]

I believe untrusted software cannot be doublechecked by other untrusted software, so at least one test must be done by P95/mprime. This is regardless of shift count.
/IIRC

LaurV 2019-10-02 15:39

Nope. Different shifts can DC an exponent, even if the same program was used. See my own DC history.

Which is not very good, because it can easily be abused: in CudaLucas, for example, there is no checksum, CRC, secret key, etc. This was discussed many times in the past, but the current state has its advantages, and I personally would not like it changed. I would prefer a "short list" of "trusted" users who won't abuse it (and of course, I must be the first on the list :razz:, a mismatch in my self-DC-ed work is yet to be found, hehe). But this is not easy to implement. Even with a "short list" of users, you could still abuse the system by reporting (fake) work in the name of another user and lowering his credibility (is "denigrate" a word? ha, it seems it is!).

kriesel 2019-10-02 16:11

[QUOTE=axn;527173]I believe untrusted software cannot be doublechecked by another untrusted software, so at least one test must be done by P95/mprime. This is regardless of shift count.
/IIRC[/QUOTE]Interesting position. Yet CUDALucas on a gpu, having neither security codes, Jacobi check, nor Gerbicz check, went 18 for 18 good in a batch of strategic double checks I ran, while 8 of the 18 illegal sumout first tests in the batch (which I think were from prime95) were mismatches and subsequently verified bad by triple check. Gpuowl with the Gerbicz check may be technically untrusted, yet considerably more reliable than some prime95/mprime installations. The highest doublechecked exponents are mixed. [url]https://www.mersenne.org/report_ll/?exp_lo=600000000&exp_hi=999999999&exp_date=&end_date=&user_only=0&exfirst=1&dispdate=1&exbad=1&exfactor=1&B1=[/url]

Mprime/prime95 are limited to certain fft lengths, and hence maximum exponents, depending on cpu capability: at least FMA3 is required to exceed 596M, and only AVX512 can go beyond 920M to the mersenne.org 1G limit. Gpuowl (3.3G), Mlucas (~4.3G) and CUDALucas (2.1G) can far exceed that. The cpu-dependent limitation of prime95 affects P-1 as well as PRP and LL. (Running exponents above 10[SUP]9[/SUP] is to be discouraged, since they are very slow, and there is currently no online site like mersenne.org or mersenne.ca at which to coordinate effort or submit any such results.)
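For scale, the "bits/word" figure gpuowl logs is just the exponent spread over the fft words; a quick check against the 16.98 bits/word it reports for a 5120K fft at these exponents:

```python
# gpuowl's "bits/word" = exponent / fft length (fft length given in K,
# i.e. multiples of 1024 words). Larger exponents need longer ffts to
# keep bits/word within what double-precision arithmetic can carry.
def bits_per_word(exponent, fft_k):
    return exponent / (fft_k * 1024)

print(round(bits_per_word(89_048_803, 5120), 2))   # 16.98, as gpuowl logs
```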
My FMA3 hardware is scarce and AVX512 hardware nonexistent. But I have several gpus capable of large exponents.

Perhaps someday George (working with Mihai?) will produce special builds of gpuowl, in Windows and linux flavors, that include the security code and are considered trusted.

Prime95 2019-10-12 01:10

Mysterious slowdown
 
I have an unexplained gpuowl slowdown. The card is running 2 gpuowl instances. Two PRP tests completed within 7 seconds of each other. Upon starting the next tests, a 6% slowdown was observed. I stopped the tests and resumed them (with a 30+ second stagger) and speeds are back to normal. Here are the two log files:

[CODE]2019-10-11 15:59:54 radeon6.2 89048789 88800000 99.72%; 1718 us/sq; ETA 0d 00:07; 98485251fa66b1a7
2019-10-11 16:01:20 radeon6.2 89048789 88850000 99.78%; 1718 us/sq; ETA 0d 00:06; 890e528a5883bd6f
2019-10-11 16:02:46 radeon6.2 89048789 88900000 99.83%; 1715 us/sq; ETA 0d 00:04; 22e741d0d93ca955
2019-10-11 16:04:12 radeon6.2 89048789 88950000 99.89%; 1718 us/sq; ETA 0d 00:03; ad1698b105f02f48
2019-10-11 16:05:40 radeon6.2 89048789 OK 89000000 99.94%; 1718 us/sq; ETA 0d 00:01; 3c2df17632340d45 (check 2.40s)
2019-10-11 16:07:04 radeon6.2 CC 89048789 / 89048789, 45fcf6f11b116fYY
2019-10-11 16:07:06 radeon6.2 89048789 OK 89049000 100.00%; 1718 us/sq; ETA 0d 00:00; c1ec7cf569dc62YY (check 2.13s)
2019-10-11 16:07:06 radeon6.2 {"exponent":"89048789", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-84-g30c0508"}, "timestamp":"2019-10-11 20:07:06 UTC", "user":"gw2", "computer":"radeon6.2", "aid":"3F98B6BAF4453D8B86F66870E33ED5DF", "fft-length":5242880, "res64":"45fcf6f11b116fYY", "residue-type":1}
2019-10-11 16:07:06 radeon6.2 89048803 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.98 bits/word
2019-10-11 16:07:06 radeon6.2 using short carry kernels
2019-10-11 16:07:06 radeon6.2 OpenCL args "-DEXP=89048803u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.02ba3352d6a7ap+0 -DIWEIGHT_STEP=0x1.fa9a51aca2cfdp-1 -DWEIGHT_BIGSTEP=0x1.306fe0a31b715p+0 -DIWEIGHT_BIGSTEP=0x1.ae89f995ad3adp-1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-10-11 16:07:09 radeon6.2 OpenCL compilation in 2760 ms
2019-10-11 16:07:10 radeon6.2 89048803.owl not found, starting from the beginning.
2019-10-11 16:07:16 radeon6.2 89048803 OK 2000 0.00%; 987 us/sq; ETA 1d 00:25; 5c53bf84b606b38c (check 1.38s)
2019-10-11 16:08:42 radeon6.2 89048803 50000 0.06%; 1798 us/sq; ETA 1d 20:27; 0bce9df7b774451e
2019-10-11 16:10:14 radeon6.2 89048803 100000 0.11%; 1823 us/sq; ETA 1d 21:02; 6a7d31c61f3cae1f
2019-10-11 16:11:45 radeon6.2 89048803 150000 0.17%; 1821 us/sq; ETA 1d 20:58; 338e4f4e278beb78
[/CODE]

and

[CODE]2019-10-11 16:00:02 radeon6.1 89048411 88800000 99.72%; 1717 us/sq; ETA 0d 00:07; 0d9cd8ae231e6238
2019-10-11 16:01:28 radeon6.1 89048411 88850000 99.78%; 1719 us/sq; ETA 0d 00:06; 2c4b47d2e1394951
2019-10-11 16:02:54 radeon6.1 89048411 88900000 99.83%; 1721 us/sq; ETA 0d 00:04; 783b5263315e1130
2019-10-11 16:04:20 radeon6.1 89048411 88950000 99.89%; 1718 us/sq; ETA 0d 00:03; ca770a9a3e2a1db9
2019-10-11 16:05:48 radeon6.1 89048411 OK 89000000 99.94%; 1714 us/sq; ETA 0d 00:01; 8e609e2b77d4fa96 (check 2.47s)
2019-10-11 16:07:10 radeon6.1 CC 89048411 / 89048411, a0a0e9062a5434ZZ
2019-10-11 16:07:13 radeon6.1 89048411 OK 89049000 100.00%; 1682 us/sq; ETA 0d 00:00; 355a39c82bb90aZZ (check 2.20s)
2019-10-11 16:07:13 radeon6.1 {"exponent":"89048411", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.5-84-g30c0508"}, "timestamp":"2019-10-11 20:07:13 UTC", "user":"gw2", "computer":"radeon6.1", "aid":"227598F5DFD8F69D5CB01A83AFF90933", "fft-length":5242880, "res64":"a0a0e9062a5434ZZ", "residue-type":1}
2019-10-11 16:07:13 radeon6.1 89048419 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 16.98 bits/word
2019-10-11 16:07:13 radeon6.1 using short carry kernels
2019-10-11 16:07:13 radeon6.1 OpenCL args "-DEXP=89048419u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.02bd9028ab4b4p+0 -DIWEIGHT_STEP=0x1.fa93bc3216fp-1 -DWEIGHT_BIGSTEP=0x1.306fe0a31b715p+0 -DIWEIGHT_BIGSTEP=0x1.ae89f995ad3adp-1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-10-11 16:07:16 radeon6.1 OpenCL compilation in 2630 ms
2019-10-11 16:07:17 radeon6.1 89048419.owl not found, starting from the beginning.
2019-10-11 16:07:26 radeon6.1 89048419 OK 2000 0.00%; 1817 us/sq; ETA 1d 20:57; 991b5af4f773d55c (check 2.24s)
2019-10-11 16:08:53 radeon6.1 89048419 50000 0.06%; 1822 us/sq; ETA 1d 21:03; 953d0916398fa0ad
2019-10-11 16:10:24 radeon6.1 89048419 100000 0.11%; 1820 us/sq; ETA 1d 20:58; 8499cbc26e58f8d4
2019-10-11 16:11:55 radeon6.1 89048419 150000 0.17%; 1822 us/sq; ETA 1d 21:00; 111910b7676c7555
2019-10-11 16:13:26 radeon6.1 89048419 200000 0.22%; 1823 us/sq; ETA 1d 21:00; 3497e36e9c9e0b4b
[/CODE]
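The slowdown shows up in the us/sq column; a small parsing sketch (just to quantify it, using two representative lines from the logs above):

```python
import re

# Pull the "us/sq" timings out of gpuowl log lines: one from before
# the restart (1718 us/sq) and one from after (1823 us/sq).
log = """\
2019-10-11 16:04:12 radeon6.2 89048789 88950000 99.89%; 1718 us/sq; ETA 0d 00:03; ad1698b105f02f48
2019-10-11 16:10:14 radeon6.2 89048803 100000 0.11%; 1823 us/sq; ETA 1d 21:02; 6a7d31c61f3cae1f
"""

before, after = (int(m) for m in re.findall(r"(\d+) us/sq", log))
print(f"slowdown: {(after - before) / before:.1%}")   # ~6.1%
```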


My guess is that the 2 PRP tests somehow allocated unaligned memory or the various weights, sin/cos, and FFT allocs were a "bad" distance apart.

My concern is that I might experience similar problems on reboots. Any thoughts preda?

kriesel 2019-10-12 02:12

What is single-instance iteration time on the same gpu?

Prime95 2019-10-12 03:59

[QUOTE=kriesel;527783]What is single-instance iteration time on the same gpu?[/QUOTE]

907 us.

Running two instances at 1718 us gives better throughput (but uses more electricity).
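The trade-off in numbers (a back-of-envelope check, nothing more):

```python
# Total squaring throughput: one instance at 907 us/sq vs two
# instances running concurrently at 1718 us/sq each.
single = 1 / 907            # squarings per microsecond, one instance
dual   = 2 / 1718           # combined rate of two instances
print(f"two-instance throughput gain: {dual / single - 1:.1%}")   # ~5.6%
```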

preda 2019-10-12 07:07

[QUOTE=Prime95;527780][...] My guess is that the 2 PRP tests somehow allocated unaligned memory or the various weights, sin/cos, and FFT allocs were a "bad" distance apart.

My concern is that I might experience similar problems on reboots. Any thoughts preda?[/QUOTE]

The initial buffer setup (done CPU-side) should take much less than 10 s (on the order of 1 s?), so I don't expect the buffer memory allocations to be interleaved.

Some kernels are more memory-heavy and some more compute-heavy. Maybe if, by chance, they hit a bad phase where the kernels from the two instances are memory-bound at the same time (or compute-bound at the same time), that would produce lower performance. I have no idea, though, whether such a phase pattern is stable. But all this is just guessing -- I don't have much experience with running two instances in parallel.

Prime95 2019-10-20 05:25

ROCm 2.9 warning: expect a 4+% slowdown if you "upgrade" from ROCm 2.5 and are running one instance of gpuowl. I saw times go from ~909 us to ~949 us.

The good news is my 2-instance timings dropped from ~1729 us to ~1723 us.

Prime95 2019-10-21 01:54

[QUOTE=Prime95;527780]
My guess is that the 2 PRP tests somehow allocated unaligned memory or the various weights, sin/cos, and FFT allocs were a "bad" distance apart.

My concern is that I might experience similar problems on reboots. Any thoughts preda?[/QUOTE]


Interestingly, the 10% slowdown happens every time 2 tests end and the next two begin. Of 5 GPUs, this is the only one that exhibits a slowdown.

Weird.

kriesel 2019-10-21 04:34

[QUOTE=Prime95;528458]Interestingly, the 10% slowdown happens every time 2 tests end and the next two begin. Of 5 GPUs, this is the only one that exhibits a slowdown.

Weird.[/QUOTE]That implies the runs are essentially synchronized. Judging by [url]https://www.mersenneforum.org/showpost.php?p=527780&postcount=1413[/url] a minor desynch is sufficient.

I've seen throughput advantages to staggering multiple runs of other applications. (Sometimes requiring considerable desynch; up to an hour for CUDAPm1.)
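A minimal launcher sketch of that staggering idea; the gpuowl path, the -device flags, and the 60 s gap are assumptions for illustration, not values from this thread:

```python
import subprocess
import time

def launch_staggered(cmds, stagger_seconds):
    """Start each command, sleeping between launches so the instances'
    kernel phases don't line up; then wait for all and return exit codes."""
    procs = []
    for cmd in cmds:
        procs.append(subprocess.Popen(cmd))
        time.sleep(stagger_seconds)  # desynchronize the next instance
    return [p.wait() for p in procs]

# e.g. launch_staggered([["./gpuowl", "-device", "0"],
#                        ["./gpuowl", "-device", "1"]], 60)
```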

Presumably you've already looked for possible differences among the gpus (model, BIOS version), supporting system, and workloads.

