mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

SELROC 2019-07-05 07:27

Note: I have a script that quickly recovers after a power loss.


[url]https://github.com/valeriob01/Mersenne-gpu-computing-node/commit/e90de7d656c60ddbb9eac294977ef4ec01174485[/url]

SELROC 2019-07-08 04:26

Current gpuowl typical performance numbers for 89M exponents:


RX580: 3849 us/sq - ETA 4d 2h
Vega64: 2010 us/sq - ETA 2d
RadeonVII: 910 us/sq - ETA 22h 35m
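
As a rough cross-check of these figures: a PRP test needs about one squaring per bit of the exponent, so the ETA is roughly the exponent times the per-squaring time. A minimal Python sketch, assuming an exponent of exactly 89,000,000 (the post only says "89M", so the results will differ a little from the ETAs quoted above); the function names are made up for illustration.

```python
def eta_seconds(exponent, us_per_sq):
    """Approximate PRP run time: about `exponent` squarings at `us_per_sq` microseconds each."""
    return exponent * us_per_sq / 1_000_000


def format_eta(seconds):
    """Render a duration in seconds as 'Nd HHh MMm' (seconds are dropped)."""
    minutes = int(seconds // 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}h {mins:02d}m"


if __name__ == "__main__":
    # us/sq values as posted above; 89_000_000 is an assumed exponent
    for card, us in (("RX580", 3849), ("Vega64", 2010), ("RadeonVII", 910)):
        print(card, format_eta(eta_seconds(89_000_000, us)))
```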

GP2 2019-07-08 16:50

mprime can do k*b^n+c

How feasible would it be, in principle, to adapt gpuOwL to be more flexible? Wagstaff in particular: (2^p+1)/3

paulunderwood 2019-07-08 20:10

[QUOTE=GP2;521019]mprime can do k*b^n+c

How feasible would it be, in principle, to adapt gpuOwL to be more flexible? Wagstaff in particular: (2^p+1)/3[/QUOTE]

:goodposting:

Working mod 2^p+1 is almost as easy as 2^p-1. Then a final division by 3 to get mod (2^p+1)/3. E.g.:

[code]
p=127;Mod(3,2^p+1)^2^p
Mod(9, 170141183460469231731687303715884105729)
[/code]

Gerbicz error checking can be done too.

GP2 2019-07-08 20:34

[QUOTE=paulunderwood;521034]:goodposting:

Working mod 2^p+1 is almost as easy as 2^p-1. Then a final division by 3 to get mod (2^p+1)/3.[/QUOTE]

In mprime, a type-5 residue for Wagstaff simply calculates [c]3^(2^p) mod (2^p + 1)[/c]. So I don't think you need to do a division.

Probably type-4 would also be applicable to Wagstaff? Or perhaps type-2, which is similar to type-4 except using N−1 instead of N+1.

[CODE]
2: SPRP variant, N is PRP if a^((N-1)/2) = +/-1 mod N
4: SPRP variant. N is PRP if a^((N+1)/2) = +/-a mod N
[/CODE]
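
To make these residue definitions concrete, here is a small Python sketch using plain big-integer `pow`. It only illustrates the arithmetic for Mersenne N = 2^p - 1 plus the Wagstaff type-5 expression quoted above; it is not how mprime or gpuOwL compute anything, and the function names are invented for this example.

```python
def residue_type1(p, a=3):
    """Fermat PRP residue a^(N-1) mod N for N = 2^p - 1; it is 1 when N is a PRP."""
    n = (1 << p) - 1
    return pow(a, n - 1, n)


def is_prp_type2(p, a=3):
    """Type 2: N = 2^p - 1 is PRP if a^((N-1)/2) = +/-1 mod N."""
    n = (1 << p) - 1
    return pow(a, (n - 1) // 2, n) in (1, n - 1)


def is_prp_type4(p, a=3):
    """Type 4: N = 2^p - 1 is PRP if a^((N+1)/2) = +/-a mod N."""
    n = (1 << p) - 1
    return pow(a, (n + 1) // 2, n) in (a, n - a)


def wagstaff_type5(p, a=3):
    """The Wagstaff expression from the post: a^(2^p) mod (2^p + 1)."""
    return pow(a, 1 << p, (1 << p) + 1)


if __name__ == "__main__":
    print(residue_type1(127), is_prp_type2(127), is_prp_type4(127))
    print(wagstaff_type5(127))  # compare with the PARI/GP result quoted earlier in the thread
```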

kriesel 2019-07-08 22:22

gpuowl priorities
 
I'd like to see some P-1 related gpuowl fixes and extensions before Mihai tackles another endeavor such as extension to Wagstaff prp.

P-1 -time [URL]https://www.mersenneforum.org/showpost.php?p=517911&postcount=1211[/URL]

P-1 fail on 8GB RX480 [URL]https://www.mersenneforum.org/showpost.php?p=517853&postcount=1208[/URL]
Mihai replied in late May (post 1210) about planning to revisit P-1 memory management.

P-1 save and resume [URL]https://www.mersenneforum.org/showpost.php?p=517846&postcount=1206[/URL]

As things stand, I'm unable to successfully run v6.x gpuowl P-1 on AMD or NVIDIA.

SELROC 2019-07-09 08:17

ROCm 2.6 is out. Performance is similar to 2.5

SELROC 2019-07-09 09:00

[QUOTE=SELROC;520775]Note: I have a script that quickly recovers after a power loss.


[URL]https://github.com/valeriob01/Mersenne-gpu-computing-node/commit/e90de7d656c60ddbb9eac294977ef4ec01174485[/URL][/QUOTE]




Proposal for improvement of gpuowl checkpoint recovery: what the script does can be done in gpuowl with a few lines. If the checkpoint is invalid, load *-prev.owl, and overwrite the last checkpoint file.

GP2 2019-07-09 19:03

[QUOTE=kriesel;521053]I'd like to see some P-1 related gpuowl fixes and extensions before Mihai tackles another endeavor such as extension to Wagstaff prp.[/QUOTE]

It's up to him to decide what he spends his time and effort doing. I was thinking that there might be some relatively trivial modification.

Like I mentioned earlier, the Wagstaff PRP calculation for type 5 is [c]3^(2^p) mod (2^p + 1)[/c] whereas for Mersenne (where type 1 and type 5 are the same thing), it's [c]3^(2^p − 2) mod (2^p − 1)[/c]. I don't know if there is a similarly simple modification for type 4 or type 2 residues.

Since gpuOwL is a GitHub project, theoretically someone else could make the modification, possibly even forking from an earlier version that still used Mersenne type 1 residues.

kriesel 2019-07-10 00:51

[QUOTE=GP2;521118]It's up to him to decide what he spends his time and effort doing. [/QUOTE]Of course. He volunteers his time, according to his talents and interests, like many others. None of us has a claim on him or each other, or authority to select one path versus another for him. To his credit, he sometimes accepts or asks for input from the user community. And if we users summarize outstanding issues or new desires, it can make him more efficient. Win-win.

[QUOTE]I was thinking that there might be some relatively trivial modification.[/QUOTE]It seems to me that the power difference is trivial, but the mod difference is less so. mod 2[SUP]p[/SUP]-1 result fits in p bits, and can be done rapidly in binary by adding the quotient to the remainder displaced rightward by p bits; mod 2[SUP]p[/SUP]+1 can't. Seems like p+1 bits storage and subtract quotient after a p bit right shift would be in order. That in turn implies borrows rather than carries as in the existing code. But all that is from thinking in untransformed integer binary operand terms.
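
In plain integer terms, the two reductions described above can be sketched as follows. This is a Python illustration of the untransformed-binary view only (not the FFT-based arithmetic gpuowl actually uses), and the function names are made up.

```python
def mod_2p_minus_1(x, p):
    """Reduce x >= 0 modulo 2^p - 1.

    Since 2^p = 1 (mod 2^p - 1), the part above bit p is simply added
    to the low p bits; carries are folded back in the same way.
    """
    m = (1 << p) - 1
    while x > m:
        x = (x & m) + (x >> p)
    return 0 if x == m else x


def mod_2p_plus_1(x, p):
    """Reduce x >= 0 modulo 2^p + 1.

    Since 2^p = -1 (mod 2^p + 1), successive p-bit chunks are added with
    alternating sign, so the running value can go negative (a borrow).
    The result needs p+1 bits, because the value 2^p itself can occur.
    """
    n = (1 << p) + 1
    mask = (1 << p) - 1
    acc, sign = 0, 1
    while x:
        acc += sign * (x & mask)
        sign = -sign
        x >>= p
    while acc < 0:      # fix up a borrow
        acc += n
    while acc >= n:     # fix up an overflow
        acc -= n
    return acc
```

For example, with p = 5 the value 2^5 = 32 reduces to 1 mod 31 but stays 32 mod 33, which is why the second case needs the extra bit.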

[QUOTE]
Like I mentioned earlier, the Wagstaff PRP calculation for type 5 is [c]3^(2^p) mod (2^p + 1)[/c] whereas for Mersenne (where type 1 and type 5 are the same thing), it's [c]3^(2^p − 2) mod (2^p − 1)[/c]. I don't know if there is a similarly simple modification for type 4 or type 2 residues.

Since gpuOwL is a GitHub project, theoretically someone else could make the modification, possibly even forking from an earlier version that still used Mersenne type 1 residues.[/QUOTE]Which would be ~gpuowl v1.5 to 3.9. [URL]https://www.mersenneforum.org/showpost.php?p=519603&postcount=15[/URL]
There are other ways to do Wagstaff, to ~920M, though maybe not as high a p as you'd like to go to if you're thinking of taking the new Mersenne conjecture testing further.
There are also other ways to do P-1 factoring on Mersennes, although not above ~432.5M in CUDAPm1 in practice, or ~920M in mprime/prime95, and not on OpenCL at all.

SELROC 2019-07-10 08:42

[QUOTE=SELROC;521088]ROCm 2.6 is out. Performance is similar to 2.5[/QUOTE]


ROCm version 2.6 has no Navi10 support; that has to wait for Linux 5.3 in September.

amdgpu-pro has support for Navi10.

preda 2019-07-10 12:39

residue-type 1 is back
 
Back by popular demand: residue-type 1. (in the most recent commit)

This means that GpuOwl's residue is now aligned with mprime's, and GpuOwl can be used to double-check mprime PRP results.

kriesel 2019-07-10 17:34

1 Attachment(s)
[QUOTE=preda;521194]Back by popular demand: residue-type 1. (in the most recent commit)[/QUOTE]Built for Windows, tried on RX480.
-h works; -? does not unless a worktodo.txt exists.
[CODE]>gpuowl-win -h
2019-07-10 10:29:22 gpuowl v6.5-84-g30c0508

Command line options:

-dir <folder> : specify work directory (containing worktodo.txt, results.txt, config.txt, gpuowl.log)
-user <name> : specify the user name.
-cpu <name> : specify the hardware name.
-time : display kernel profiling information.
-fft <size> : specify FFT size, such as: 5000K, 4M, +2, -1.
-block <value> : PRP GEC block size. Default 1000. Smaller block is slower but detects errors sooner.
-log <step> : log every <step> iterations, default 20000. Multiple of 10000.
-carry long|short : force carry type. Short carry may be faster, but requires high bits/word.
-B1 : P-1 B1 bound, default 500000
-B2 : P-1 B2 bound, default B1 * 30
-rB2 : ratio of B2 to B1. Default 30, used only if B2 is not explicitly set
-prp <exponent> : run a single PRP test and exit, ignoring worktodo.txt
-pm1 <exponent> : run a single P-1 test and exit, ignoring worktodo.txt
-results <file> : name of results file, default 'results.txt'
-iters <N> : run next PRP test for <N> iterations and exit. Multiple of 10000.
-use NEW_FFT8,OLD_FFT5,NEW_FFT10: comma separated list of defines, see the #if tests in gpuowl.cl (used for perf tuning).
-device <N> : select a specific device:
0 : Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
1 : gfx804-8x1203-@3:0.0 Radeon 550 Series

FFT Configurations:
FFT 8K [ 0.01M - 0.18M] 64-64
FFT 32K [ 0.05M - 0.68M] 64-256 256-64
FFT 64K [ 0.10M - 1.34M] 64-512 512-64
FFT 128K [ 0.20M - 2.63M] 1K-64 64-1K 256-256
FFT 192K [ 0.29M - 3.91M] 64-256-6
FFT 224K [ 0.34M - 4.54M] 64-256-7
FFT 256K [ 0.39M - 5.18M] 64-2K 256-512 512-256 2K-64
FFT 288K [ 0.44M - 5.81M] 64-256-9
FFT 320K [ 0.49M - 6.44M] 64-256-10
FFT 352K [ 0.54M - 7.06M] 64-256-11
FFT 384K [ 0.59M - 7.69M] 64-256-12 64-512-6
FFT 448K [ 0.69M - 8.94M] 64-512-7
FFT 512K [ 0.79M - 10.18M] 1K-256 256-1K 512-512 4K-64
FFT 576K [ 0.88M - 11.42M] 64-512-9
FFT 640K [ 0.98M - 12.66M] 64-512-10
FFT 704K [ 1.08M - 13.89M] 64-512-11
FFT 768K [ 1.18M - 15.12M] 64-512-12 64-1K-6 256-256-6
FFT 896K [ 1.38M - 17.57M] 64-1K-7 256-256-7
FFT 1M [ 1.57M - 20.02M] 1K-512 256-2K 512-1K 2K-256
FFT 1152K [ 1.77M - 22.45M] 64-1K-9 256-256-9
FFT 1280K [ 1.97M - 24.88M] 64-1K-10 256-256-10
FFT 1408K [ 2.16M - 27.31M] 64-1K-11 256-256-11
FFT 1536K [ 2.36M - 29.72M] 64-1K-12 64-2K-6 256-256-12 256-512-6 512-256-6
FFT 1792K [ 2.75M - 34.54M] 64-2K-7 256-512-7 512-256-7
FFT 2M [ 3.15M - 39.34M] 1K-1K 512-2K 2K-512 4K-256
FFT 2304K [ 3.54M - 44.13M] 64-2K-9 256-512-9 512-256-9
FFT 2560K [ 3.93M - 48.90M] 64-2K-10 256-512-10 512-256-10
FFT 2816K [ 4.33M - 53.66M] 64-2K-11 256-512-11 512-256-11
FFT 3M [ 4.72M - 58.41M] 1K-256-6 64-2K-12 256-512-12 256-1K-6 512-256-12 512-512-6
FFT 3584K [ 5.51M - 67.87M] 1K-256-7 256-1K-7 512-512-7
FFT 4M [ 6.29M - 77.30M] 1K-2K 2K-1K 4K-512
FFT 4608K [ 7.08M - 86.70M] 1K-256-9 256-1K-9 512-512-9
FFT 5M [ 7.86M - 96.07M] 1K-256-10 256-1K-10 512-512-10
FFT 5632K [ 8.65M - 105.41M] 1K-256-11 256-1K-11 512-512-11
FFT 6M [ 9.44M - 114.74M] 1K-256-12 1K-512-6 256-1K-12 256-2K-6 512-512-12 512-1K-6 2K-256-6
FFT 7M [ 11.01M - 133.32M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 151.83M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 170.28M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
FFT 10M [ 15.73M - 188.68M] 1K-512-10 256-2K-10 512-1K-10 2K-256-10
FFT 11M [ 17.30M - 207.02M] 1K-512-11 256-2K-11 512-1K-11 2K-256-11
FFT 12M [ 18.87M - 225.32M] 1K-512-12 1K-1K-6 256-2K-12 512-1K-12 512-2K-6 2K-256-12 2K-512-6 4K-256-6
FFT 14M [ 22.02M - 261.80M] 1K-1K-7 512-2K-7 2K-512-7 4K-256-7
FFT 16M [ 25.17M - 298.13M] 4K-2K
FFT 18M [ 28.31M - 334.34M] 1K-1K-9 512-2K-9 2K-512-9 4K-256-9
FFT 20M [ 31.46M - 370.44M] 1K-1K-10 512-2K-10 2K-512-10 4K-256-10
FFT 22M [ 34.60M - 406.43M] 1K-1K-11 512-2K-11 2K-512-11 4K-256-11
FFT 24M [ 37.75M - 442.34M] 1K-1K-12 1K-2K-6 512-2K-12 2K-512-12 2K-1K-6 4K-256-12 4K-512-6
FFT 28M [ 44.04M - 513.91M] 1K-2K-7 2K-1K-7 4K-512-7
FFT 36M [ 56.62M - 656.22M] 1K-2K-9 2K-1K-9 4K-512-9
FFT 40M [ 62.91M - 727.03M] 1K-2K-10 2K-1K-10 4K-512-10
FFT 44M [ 69.21M - 797.64M] 1K-2K-11 2K-1K-11 4K-512-11
FFT 48M [ 75.50M - 868.07M] 1K-2K-12 2K-1K-12 2K-2K-6 4K-512-12 4K-1K-6
FFT 56M [ 88.08M - 1008.44M] 2K-2K-7 4K-1K-7
FFT 72M [113.25M - 1287.53M] 2K-2K-9 4K-1K-9
FFT 80M [125.83M - 1426.38M] 2K-2K-10 4K-1K-10
FFT 88M [138.41M - 1564.83M] 2K-2K-11 4K-1K-11
FFT 96M [150.99M - 1702.92M] 2K-2K-12 4K-1K-12 4K-2K-6
FFT 112M [176.16M - 1978.12M] 4K-2K-7
FFT 144M [226.49M - 2525.23M] 4K-2K-9
FFT 160M [251.66M - 2797.39M] 4K-2K-10
FFT 176M [276.82M - 3068.76M] 4K-2K-11
FFT 192M [301.99M - 3339.40M] 4K-2K-12
2019-07-10 10:29:30 Exiting because "help"
2019-07-10 10:29:30 Bye

>gpuowl-win -?
2019-07-10 10:29:43 gpuowl v6.5-84-g30c0508
2019-07-10 10:29:43 Note: no config.txt file found
2019-07-10 10:29:43 config: -?
2019-07-10 10:29:43 Can't open 'worktodo.txt' (mode 'rb')
2019-07-10 10:29:43 Bye
[/CODE] A quick test on a small known Mersenne prime passes; the P-1 attempt on 47.8M fails.
[CODE]>gpuowl-win -device 0 -use ORIG_X2
2019-07-10 11:06:54 gpuowl v6.5-84-g30c0508
2019-07-10 11:06:54 Note: no config.txt file found
2019-07-10 11:06:54 config: -device 0 -use ORIG_X2
2019-07-10 11:06:54 1257787 FFT 64K: Width 8x8, Height 64x8; 19.19 bits/word
2019-07-10 11:06:54 using short carry kernels
2019-07-10 11:07:02 OpenCL args "-DEXP=1257787u -DWIDTH=64u -DSMALL_HEIGHT=512u -DMIDDLE=1u -DWEIGHT_STEP=0xe.00d75658c47c8p-3 -DIWEIGHT_STEP=0x9.2405b0b5f2d88p-4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-07-10 11:07:06 OpenCL compilation in 4069 ms
2019-07-10 11:07:06 1257787.owl not found, starting from the beginning.
2019-07-10 11:07:07 1257787 OK 2000 0.16%; 207 us/sq; ETA 0d 00:04; 46c7ab6803e1a365 (check 0.23s)
2019-07-10 11:07:11 1257787 20000 1.59%; 210 us/sq; ETA 0d 00:04; de7035c3244acc9b
2019-07-10 11:07:15 1257787 40000 3.18%; 210 us/sq; ETA 0d 00:04; 8e655f023b66fde1
2019-07-10 11:07:19 1257787 60000 4.77%; 210 us/sq; ETA 0d 00:04; e62c225bd51c0bf1
2019-07-10 11:07:23 1257787 80000 6.36%; 210 us/sq; ETA 0d 00:04; 2a37fdc214c2e7c0
2019-07-10 11:07:27 1257787 100000 7.95%; 210 us/sq; ETA 0d 00:04; 09f25999ff3326ca
...
2019-07-10 11:11:14 1257787 1180000 93.80%; 210 us/sq; ETA 0d 00:00; cfea93b53dd3f424
2019-07-10 11:11:19 1257787 1200000 95.39%; 210 us/sq; ETA 0d 00:00; 5a5b25f08d9912e4
2019-07-10 11:11:23 1257787 1220000 96.98%; 210 us/sq; ETA 0d 00:00; 66d4bd30b2ea4a7d
2019-07-10 11:11:27 1257787 1240000 98.57%; 209 us/sq; ETA 0d 00:00; 5c19c247002fe45c
2019-07-10 11:11:31 PP 1257787 / 1257787, 0000000000000001
2019-07-10 11:11:31 1257787 OK 1258000 100.00%; 211 us/sq; ETA 0d 00:00; f4d273818ecfa167 (check 0.23s)
2019-07-10 11:11:31 {"exponent":"1257787", "worktype":"PRP-3", "status":"P", "program":{"name":"gpuowl", "version":"v6.5-84-g30c0508"}, "timestamp":"2019-07-10 16:11:31 UTC", "aid":"0", "fft-length":65536, "res64":"0000000000000001", "residue-type":1}
2019-07-10 11:11:31 47840659 FFT 2560K: Width 8x8, Height 256x8, Middle 10; 18.25 bits/word
2019-07-10 11:11:31 using short carry kernels
2019-07-10 11:11:31 OpenCL args "-DEXP=47840659u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=10u -DWEIGHT_STEP=0xd.74e0985678c5p-3 -DIWEIGHT_STEP=0x9.8318a69b48b5p-4 -DWEIGHT_BIGSTEP=0xc.5672a115506d8p-3 -DIWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-4 -DORIG_X2=1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-07-10 11:11:35 OpenCL compilation in 4221 ms
2019-07-10 11:11:36 47840659 P-1 GPU RAM fits 374 stage2 buffers @ 20.0 MB each
2019-07-10 11:11:36 47840659 P-1 using 360 stage2 buffers (8 rounds)
2019-07-10 11:11:36 P-1 (B1=440000, B2=8800000, D=30030): primes 553065, expanded 560752, doubles 96745 (left 362408), singles 359575, total 456320 (83%)
2019-07-10 11:11:36 47840659 P-1 stage2: 279 blocks starting at block 15 (456320 selected)
2019-07-10 11:11:36 47840659 P-1 starting stage1
2019-07-10 11:12:33 47840659 10000 1.58%; 5632 us/sq; ETA 0d 00:59; c56022afd3a79281
2019-07-10 11:13:29 47840659 20000 3.15%; 5632 us/sq; ETA 0d 00:58; 57fe8bc3f01d07de
2019-07-10 11:14:25 47840659 30000 4.73%; 5630 us/sq; ETA 0d 00:57; 1c811f9a541e4c93
...
2019-07-10 12:04:10 47840659 560000 88.21%; 5632 us/sq; ETA 0d 00:07; e537d47dbf93bbb4
2019-07-10 12:05:06 47840659 570000 89.78%; 5634 us/sq; ETA 0d 00:06; f379136e785f8c92
2019-07-10 12:06:03 47840659 580000 91.36%; 5635 us/sq; ETA 0d 00:05; eb7fec5d09ff1974
2019-07-10 12:06:59 47840659 590000 92.93%; 5631 us/sq; ETA 0d 00:04; d0ea9a92d7208708
2019-07-10 12:07:55 47840659 600000 94.51%; 5630 us/sq; ETA 0d 00:03; 0247296cf97caff4
2019-07-10 12:08:52 47840659 610000 96.08%; 5630 us/sq; ETA 0d 00:02; 55ab076cedf5dee2
2019-07-10 12:09:48 47840659 620000 97.66%; 5630 us/sq; ETA 0d 00:01; 9af5dced9077c32a
2019-07-10 12:10:44 47840659 630000 99.23%; 5635 us/sq; ETA 0d 00:00; 7716930f904d8987
2019-07-10 12:11:12 P-1 stage2 too little memory 6983 MB for 360 buffers of 20971520 b
2019-07-10 12:11:52 Exiting because "P-1 not enough memory"
2019-07-10 12:11:52 Bye[/CODE]Windows build zip file attached. The readme.md included is modified somewhat, for the following changes:
minor spell check and grammar check
worktodo.txt entry additional examples, including P-1 and no-aid forms

SELROC 2019-07-11 05:23

[QUOTE=SELROC;521090]Proposal for improvement of gpuowl checkpoint recovery: what the script does can be done in gpuowl with a few lines. If the checkpoint is invalid, load *-prev.owl, and overwrite the last checkpoint file.[/QUOTE]


PS: This proposal is meant to handle the case where a power loss happens while the checkpoint is being written, which leaves the checkpoint file invalid. In this case the file is never closed properly. On reboot a filesystem check is done, which truncates the checkpoint file to length zero.
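
The fallback described here can be sketched in a few lines of Python. This is only an illustration of the idea, not gpuowl code: the file names `<exponent>.owl` and `<exponent>-prev.owl` follow the names mentioned in this thread, "invalid" is reduced to "missing or zero length" (the truncation case above), and `recover_checkpoint` is a hypothetical helper.

```python
import os
import shutil


def usable(path):
    """Treat a checkpoint as usable only if it exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def recover_checkpoint(exponent, workdir="."):
    """Return the path of a usable checkpoint, or None if there is none.

    If the latest checkpoint is unusable but the previous one is fine,
    the previous one is copied over the latest, as proposed above.
    """
    current = os.path.join(workdir, f"{exponent}.owl")
    previous = os.path.join(workdir, f"{exponent}-prev.owl")
    if usable(current):
        return current
    if usable(previous):
        shutil.copyfile(previous, current)
        return current
    return None
```

A real implementation would want a stronger validity test than the file size, for instance a checksum stored in the file, but that goes beyond what is described in the post.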

Prime95 2019-07-11 18:41

Happy me.

I tried to return the XFX Radeon VII for a replacement. Amazon was out of stock so they simply refunded the purchase. I ordered an Asrock Radeon VII instead. Installed today and preliminary results look great. First, the stock voltage is 50mV less. A memory overclock of 15% with a 40mV undervolting gives no errors during a short test. 0.85ms / iteration at 5M FFT! Longer testing required of course.

A question for the Linux gurus:

I use "crontab -e" to run mprime at boot. This does not work for gpuowl which must run as root. I tried "sudo crontab -e", but either messed up the entries or this does not work as I expected. What is the recommended way for root to start gpuowl at boot?

paulunderwood 2019-07-11 19:17

[QUOTE=Prime95;521326]Happy me.

I tried to return the XFX Radeon VII for a replacement. Amazon was out of stock so they simply refunded the purchase. I ordered an Asrock Radeon VII instead. Installed today and preliminary results look great. First, the stock voltage is 50mV less. A memory overclock of 15% with a 40mV undervolting gives no errors during a short test. 0.85ms / iteration at 5M FFT! Longer testing required of course.

A question for the Linux gurus:

I use "crontab -e" to run mprime at boot. This does not work for gpuowl which must run as root. I tried "sudo crontab -e", but either messed up the entries or this does not work as I expected. What is the recommended way for root to start gpuowl at boot?[/QUOTE]

[c]sudo whoami[/c] should tell you something. Then [c]sudo crontab -u root -e[/c]. You might want to start a shell script that first configures the card and then launches gpuowl. In crontab have something like:

[code]
@reboot /bin/bash /path-to/mystartupgpuowl.sh
[/code]

And in that shell script have:
[code]
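# Apply the card settings (clocks, voltage) first
/opt/rocm/bin/rocm-smi --load path-to-and-config-file
# Start gpuowl in the background, but only if the cd succeeded
cd /home/george/gpuowl && ./gpuowl &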
[/code]

preda 2019-07-11 21:33

[QUOTE=GP2;521118]I was thinking that there might be some relatively trivial modification.

Like I mentioned earlier, the Wagstaff PRP calculation for type 5 is [c]3^(2^p) mod (2^p + 1)[/c] whereas for Mersenne (where type 1 and type 5 are the same thing), it's [c]3^(2^p − 2) mod (2^p − 1)[/c]. I don't know if there is a similarly simple modification for type 4 or type 2 residues.
[/QUOTE]

This is what I don't know (maybe somebody could enlighten me) about the implementation of "mod 2^p+1":

In the Mersenne case, we want a cyclic convolution. The simple weighting that is done before/after the FFT achieves that.

For the "mod 2^p+1", we want a negacyclic convolution. Can this be achieved through a similar weighting (with different weights)? Or is something more involved needed?

To add a bit more detail: in the Mersenne case, the weights are real. If for 2^p+1 we need weighting with complex weights, this changes the implementation significantly because the FFT input is not real anymore.

Prime95 2019-07-12 00:13

[QUOTE=preda;521342]This is what I don't know (maybe somebody could enlighten me) about the implementation of "mod 2^p+1"[/QUOTE]

There are two weights. You still have the real weights to distribute the p bits uniformly over the FFTLEN words.

You also need to apply complex roots-of-minus-one to "trick" the FFT into doing a negacyclic convolution instead of a cyclic convolution. You don't need any extra FFT memory, but you do need a modified first pass that takes real inputs and produces weighted complex FFT'ed outputs. Not easy, but not hard either.

Next you need a new, simpler second pass that scraps all the Hermitian symmetry computations before the point-wise squaring.
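
The effect of the complex roots-of-minus-one weighting can be checked on a toy example. The Python sketch below uses a naive O(n^2) DFT as a stand-in for a real FFT, ignores the IBDWT weights and the carry step entirely, and compares the weighted transform result against a direct negacyclic squaring. All names are invented for the illustration; it is not how prime95 or gpuowl implement it.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform; the inverse includes the 1/n scaling."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for idx, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * idx * k / n)
        out.append(acc / n if inverse else acc)
    return out


def negacyclic_square_weighted(x):
    """Square x modulo X^n + 1 using a cyclic (DFT) convolution plus weights."""
    n = len(x)
    # w[k] = exp(i*pi*k/n): products that wrap past index n pick up a factor -1
    w = [cmath.exp(1j * cmath.pi * k / n) for k in range(n)]
    spectrum = dft([xk * wk for xk, wk in zip(x, w)])
    product = dft([s * s for s in spectrum], inverse=True)
    return [round((c / wk).real) for c, wk in zip(product, w)]


def negacyclic_square_direct(x):
    """Schoolbook squaring modulo X^n + 1: wrapped terms are subtracted."""
    n = len(x)
    out = [0] * n
    for a in range(n):
        for b in range(n):
            if a + b < n:
                out[a + b] += x[a] * x[b]
            else:
                out[a + b - n] -= x[a] * x[b]
    return out
```

Here the weighted input is a full length-n complex vector; avoiding that extra cost is what the modified first pass mentioned above is for.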

SELROC 2019-07-12 06:03

[QUOTE=Prime95;521326]Happy me.

I tried to return the XFX Radeon VII for a replacement. Amazon was out of stock so they simply refunded the purchase. I ordered an Asrock Radeon VII instead. Installed today and preliminary results look great. First, the stock voltage is 50mV less. A memory overclock of 15% with a 40mV undervolting gives no errors during a short test. 0.85ms / iteration at 5M FFT! Longer testing required of course.

A question for the Linux gurus:

I use "crontab -e" to run mprime at boot. This does not work for gpuowl which must run as root. I tried "sudo crontab -e", but either messed up the entries or this does not work as I expected. What is the recommended way for root to start gpuowl at boot?[/QUOTE]


I think you have to write a systemd service file.
Something like this: gpuowl.service


[Unit]
Description=GpuOwl
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/home/george/gpuowl <arguments>
Restart=on-failure
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=150seconds
StandardOutput=syslog
NotifyAccess=main
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

SELROC 2019-07-14 07:04

[QUOTE=Prime95;521326]Happy me.

I tried to return the XFX Radeon VII for a replacement. Amazon was out of stock so they simply refunded the purchase. I ordered an Asrock Radeon VII instead. Installed today and preliminary results look great. First, the stock voltage is 50mV less. A memory overclock of 15% with a 40mV undervolting gives no errors during a short test. 0.85ms / iteration at 5M FFT! Longer testing required of course.

A question for the Linux gurus:

I use "crontab -e" to run mprime at boot. This does not work for gpuowl which must run as root. I tried "sudo crontab -e", but either messed up the entries or this does not work as I expected. What is the recommended way for root to start gpuowl at boot?[/QUOTE]

[QUOTE=SELROC;521383]I think you have to write a systemd service file.
Something like this: gpuowl.service


[Unit]
Description=GpuOwl
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/home/george/gpuowl <arguments>
Restart=on-failure
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=150seconds
StandardOutput=syslog
NotifyAccess=main
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target[/QUOTE]




Here is a good guide:


[url]https://www.digitalocean.com/community/tutorials/understanding-systemd-units-and-unit-files[/url]

SELROC 2019-07-14 07:17

Checkpoint file management
 
It seems that mfakto manages its checkpoint files: after a result is computed, the checkpoints are removed.
Also, if a checkpoint is invalid, mfakto renames it (to mark it as bad) and loads the previous checkpoint.

SELROC 2019-07-15 07:23

[QUOTE=SELROC;521383]I think you have to write a systemd service file.
Something like this: gpuowl.service


[Unit]
Description=GpuOwl
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/home/george/gpuowl <arguments>
Restart=on-failure
RestartSec=1minute
WatchdogSec=20minutes
TimeoutStopSec=150seconds
StandardOutput=syslog
NotifyAccess=main
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target[/QUOTE]


It is important to use SIGINT instead of SIGQUIT.


SIGINT behaves like Control-C in the terminal: it lets gpuowl save a checkpoint before stopping.
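
The general pattern behind this can be sketched in Python. It is an illustration only, not gpuowl's code: the handler merely sets a flag, and the work loop notices the flag and writes a checkpoint before returning. A SIGQUIT that the program does not handle would instead terminate it on the spot (by default with a core dump), with no chance to save. All names here are hypothetical.

```python
import signal


class StopRequest:
    """Records that SIGINT (Control-C) arrived, so the work loop can stop cleanly."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        self.requested = True


def install(stop):
    """Route SIGINT to the flag instead of Python's default KeyboardInterrupt."""
    signal.signal(signal.SIGINT, stop.handle)


def run(total_iterations, save_checkpoint, stop):
    """Iterate until done or until a stop is requested, then save a checkpoint."""
    done = 0
    while done < total_iterations and not stop.requested:
        done += 1                  # stand-in for one squaring
    save_checkpoint(done)          # always leave a checkpoint behind on exit
    return done
```

With the systemd unit quoted above, `KillSignal=SIGINT` is what makes a service stop go through this kind of path.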

Bulldozer 2019-07-16 00:15

I'm having a problem running gpuowl on my laptop. It has an integrated GPU (Intel HD 620) and an AMD Radeon R5 530. When I run this program, it always runs on my HD 620 and I get a bunch of errors. It never runs on my R5 530. I tried re-installing both of the drivers, re-installing Windows, and setting the program to high-performance mode in Radeon Settings. None of these worked. I hope for an answer.

kriesel 2019-07-16 01:15

[QUOTE=Bulldozer;521694]I'm having a problem running gpuowl on my laptop. It has an integrated GPU (Intel HD 620) and an AMD Radeon R5 530. When I run this program, it always runs on my HD 620 and I get a bunch of errors. It never runs on my R5 530. I tried re-installing both of the drivers, re-installing Windows, and setting the program to high-performance mode in Radeon Settings. None of these worked. I hope for an answer.[/QUOTE]
Regardless of which device number you specify? Both the AMD and Intel OpenCL drivers are ok? (I've seen installing software for one knock the other out.)

See [URL]https://www.mersenneforum.org/showpost.php?p=488474&postcount=6[/URL] for utilities to check that OpenCL is seeing both devices, etc.

ATH 2019-07-19 22:22

So I got my Radeon VII but I'm a bit lost, it has been many many years since I had an AMD card and it was way before using GPUs for any calculations, and I'm also new to gpuowl.

I installed the newest drivers: Adrenalin 2019 19.7.2. I had "gpuowl-win7-x64-v6.5-c48d46f.7z" from [URL="https://mersenneforum.org/showpost.php?p=516704&postcount=1171"]post #1171[/URL] on my hard drive already from 2 months ago, I think I got it to confirm that OpenCL really worked on my RTX 2080 which it did.

Now when I run it with -device 1 (Radeon VII) it only writes the first few lines but never gets to the "OpenCL compilation in ..." line and it never starts running.

[QUOTE]2019-07-19 23:38:23 config: -device 1
2019-07-19 23:38:23 80293033 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.02 bits/word
2019-07-19 23:38:23 using short carry kernels[/QUOTE]

When I use -device 0 it works fine and runs on my RTX 2080.


I tried downloading "gpuowl-win-v6.5-84-g30c0508.7z" from [URL="https://mersenneforum.org/showpost.php?p=521225&postcount=1274"]post #1274[/URL] but it does not start at all on either card:

[CODE]2019-07-20 00:05:56 config: -device 1
2019-07-20 00:05:56 80293033 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.02 bits/word
2019-07-20 00:05:56 using short carry kernels
2019-07-20 00:05:56 OpenCL args "-DEXP=80293033u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xf.d1f3073e091p-3 -DIWEIGHT_STEP=0x8.17498299a4db8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-07-20 00:05:56 OpenCL compilation error -11 (args -DEXP=80293033u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xf.d1f3073e091p-3 -DIWEIGHT_STEP=0x8.17498299a4db8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-07-20 00:05:56 C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: error: implicit declaration of function '__asm' is invalid in C99
X2(u[0], u[2]);
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:174:2: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: error: expected ')'
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:174:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: note: to match this '('
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:174:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: error: expected ')'
X2(u[0], u[2]);
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:175:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: note: to match this '('
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:175:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:198:3: error: expected ')'
X2_mul_t4(u[1], u[3]);
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:180:35: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:198:3: note: to match this '('
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:180:7: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:198
2019-07-20 00:05:56 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build
2019-07-20 00:05:56 Bye[/CODE]


Are there any more Windows executables collected somewhere?

ewmayer 2019-07-19 23:27

[QUOTE=preda;521342]This is what I don't know (maybe somebody could enlighten me) about the implementation of "mod 2^p+1":

In the Mersenne case, we want a cyclic convolution. The simple weighting that is done before/after the FFT achieves that.[/quote]
No, the FFT is inherently cyclic-convolutional ... the IBDWT weighting allows us to use a prime-length "bit folding" boundary in conjunction with an underlying polynomial-multiply which most naturally lends itself to a bitness which is highly composite, by way of being a multiple of the transform length.

[quote]For the "mod 2^p+1", we want a negacyclic convolution. Can this be achieved through a similar weighting (with different weights)? Or is something more involved needed?

To add a bit more detail: in the Mersenne case, the weights are real. If for 2^p+1 we need weighting with complex weights, this changes the implementation significantly because the FFT input is not real anymore.[/QUOTE]
As George noted, for (mod 2^p+1) you need 2 distinct weightings: the IBDWT one to allow for a prime-length bit-folding, and the standard acyclic-effecting weighting, which for a length-n transform uses the first n complex (2*n)th roots of unity. That needs a complex FFT algorithm; for a length-n real input vector you can use a length-(n/2) complex FFT. Noting that the [j]th and [j+n/2]th acyclic weights (call them 'awt') are related by awt[j+n/2] = I*awt[j], you can see that in this context it makes sense to group pairs of real inputs together not via the usual (x[j],x[j+1])-treated-as-a-complex-datum scheme but rather in (x[j],x[j+n/2]) pairs, since applying the acyclic-weights turns those 2 reals into (awt[j]*x[j],I*awt[j]*x[j+n/2]), i.e. we can pull out the shared complex acyclic-multiplier awt[j] = exp(2*Pi*I*j/(2*n)) to get a weighted complex input awt[j]*(x[j] + I*x[j+n/2]). This is the so-called "right-angle transform" trick. Crandall & Fagin recapped it (since it wasn't new) in the Fermat-mod section of the same 1994 paper where they introduced the Mersenne-mod IBDWT.
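
The pairing described above can be tried out on a small vector. The following Python sketch packs (x[j], x[j+n/2]) into one complex value, applies awt[j] = exp(2*Pi*I*j/(2*n)), squares through a length-(n/2) transform, and unpacks the real and imaginary parts. It uses a naive DFT in place of an FFT, leaves out the IBDWT weights and carries, and assumes n is even; the function names are invented for the example.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform; the inverse includes the 1/n scaling."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for idx, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * idx * k / n)
        out.append(acc / n if inverse else acc)
    return out


def negacyclic_square_right_angle(x):
    """Square real x modulo X^n + 1 with a complex transform of length n/2."""
    n = len(x)
    half = n // 2
    # awt[k] = exp(2*pi*i*k/(2n)) for k < n/2; awt[k + n/2] would be i*awt[k]
    awt = [cmath.exp(1j * cmath.pi * k / n) for k in range(half)]
    packed = [awt[k] * complex(x[k], x[k + half]) for k in range(half)]
    spectrum = dft(packed)
    product = dft([s * s for s in spectrum], inverse=True)
    unweighted = [c / wk for c, wk in zip(product, awt)]
    # Real parts are outputs 0..n/2-1, imaginary parts are outputs n/2..n-1
    return [round(u.real) for u in unweighted] + [round(u.imag) for u in unweighted]


def negacyclic_square_direct(x):
    """Schoolbook squaring modulo X^n + 1, used as the reference result."""
    n = len(x)
    out = [0] * n
    for a in range(n):
        for b in range(n):
            if a + b < n:
                out[a + b] += x[a] * x[b]
            else:
                out[a + b - n] -= x[a] * x[b]
    return out
```

The packing works because, modulo X^(n/2) - I, the real polynomial x collapses to the half-length complex polynomial with coefficients x[j] + I*x[j+n/2], and the weights turn multiplication modulo X^(n/2) - I into an ordinary cyclic convolution.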

preda 2019-07-20 00:50

Try running it with "-use NO_ASM"
(if it works, you can put that option in config.txt )

The __asm() errors are because you're running with a driver (Adrenalin/Windows) that does not support assembly. ROCm/Linux works fine with __asm(). Anyway, assembly support is not mandatory; you just need to disable it with -use NO_ASM. I haven't yet found a way (in OpenCL) to automatically detect __asm support.

[QUOTE=ATH;521950]So I got my Radeon VII but I'm a bit lost, it has been many many years since I had an AMD card and it was way before using GPUs for any calculations, and I'm also new to gpuowl.

I installed the newest drivers: Adrenalin 2019 19.7.2. I had "gpuowl-win7-x64-v6.5-c48d46f.7z" from [URL="https://mersenneforum.org/showpost.php?p=516704&postcount=1171"]post #1171[/URL] on my hard drive already from 2 months ago, I think I got it to confirm that OpenCL really worked on my RTX 2080 which it did.

Now when I run it with -device 1 (Radeon VII) it only writes the first few lines but never gets to the "OpenCL compilation in ..." line and it never starts running.



When I use -device 0 it works fine and runs on my RTX 2080.


I tried downloading "gpuowl-win-v6.5-84-g30c0508.7z" from [URL="https://mersenneforum.org/showpost.php?p=521225&postcount=1274"]post #1274[/URL] but it does not start at all on either card:

[CODE]2019-07-20 00:05:56 config: -device 1
2019-07-20 00:05:56 80293033 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.02 bits/word
2019-07-20 00:05:56 using short carry kernels
2019-07-20 00:05:56 OpenCL args "-DEXP=80293033u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xf.d1f3073e091p-3 -DIWEIGHT_STEP=0x8.17498299a4db8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-07-20 00:05:56 OpenCL compilation error -11 (args -DEXP=80293033u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xf.d1f3073e091p-3 -DIWEIGHT_STEP=0x8.17498299a4db8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-07-20 00:05:56 C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: error: implicit declaration of function '__asm' is invalid in C99
X2(u[0], u[2]);
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:174:2: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: error: expected ')'
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:174:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: note: to match this '('
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:174:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: error: expected ')'
X2(u[0], u[2]);
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:175:35: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:197:3: note: to match this '('
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:175:7: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:198:3: error: expected ')'
X2_mul_t4(u[1], u[3]);
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:180:35: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:198:3: note: to match this '('
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:180:7: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
C:\Users\ATH\AppData\Local\Temp\\OCL7076T0.cl:198
2019-07-20 00:05:56 Exception 9gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:215 build
2019-07-20 00:05:56 Bye[/CODE]


Are there any more Windows executables collected somewhere?[/QUOTE]

preda 2019-07-20 00:51

Thank you, this is useful!

[QUOTE=ewmayer;521952]No, the FFT is inherently cyclic-convolutional ... the IBDWT weighting allows us to use a prime-length "bit folding" boundary in conjunction with an underlying polynomial-multiply which most naturally lends itself to a bitness which is highly composite, by way of being a multiple of the transform length.


As George noted, for (mod 2^p+1) you need 2 distinct weightings: the IBDWT one to allow for a prime-length bit-folding, and the standard acyclic-effecting weighting, which for a length-n transform uses the first n complex (2*n)th roots of unity. That needs a complex FFT algorithm; for length-n real input vector you can use a length-(n/2) complex FFT. Noting that the [j]th and [j+n/2]th acyclic weights (call them 'awt') are related by awt[j+n/2] = I*awt[j], you can see that in this context it makes sense to group pairs of real inputs together not via the usual (x[j],x[j+1])-treated-as-a-complex-datum scheme but rather in (x[j],x[j+n/2]) pairs, since applying the acyclic-weights turns those 2 reals into (awt[j]*x[j],I*awt[j]*x[j+n/2]), i.e. we can pull out the shared complex acyclic-multiplier awt[j] = exp(I*2*Pi*j/(2*n)) to get a weighted complex input awt[j]*(x[j] + I*x[j+n/2]). This is the so-called "right-angle transform" trick. Crandall & Fagin recapped it (since it wasn't new) in the Fermat-mod section of the same 1994 paper where they introduced the Mersenne-mod IBDWT.[/QUOTE]

ATH 2019-07-20 01:07

Thanks. I assume there is no Windows driver where it works?

Now there are no errors but it does not actually start calculating, the card is not being used at all.

[CODE]
gpuowl-win.exe -device 1 -use NO_ASM
2019-07-20 03:01:43 gpuowl v6.5-84-g30c0508
2019-07-20 03:01:43 Note: no config.txt file found
2019-07-20 03:01:43 config: -device 1 -use NO_ASM
2019-07-20 03:01:43 80293033 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.02 bits/word
2019-07-20 03:01:43 using short carry kernels
2019-07-20 03:01:43 OpenCL args "-DEXP=80293033u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xf.d1f3073e091p-3 -DIWEIGHT_STEP=0x8.17498299a4db8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"[/CODE]


I was afraid I was being too optimistic trying to run Nvidia and AMD cards in the same computer.

Does anyone else have any Windows binaries? Only Kriesel has posted binaries of the latest versions in this thread.

kriesel 2019-07-20 08:51

[QUOTE=preda;521955]Try running it with "-use NO_ASM"
(if it works, you can put that option in config.txt )

The __asm() errors are because you're running with a driver (Adrenalin/windows) that does not support assembly. ROCm/linux works fine with __asm(). Anyway, assembly support is not mandatory, you just need to disable it with -use NO_ASM . I haven't found a way yet (in OpenCL) to automatically detect __asm support.[/QUOTE]
FWIW, the example run of gpuowl v6.5-84-g30c0508 in [URL]https://www.mersenneforum.org/showpost.php?p=521225&postcount=1274[/URL] was on an RX480, Win7 x64, Adrenalin 18.10.2 driver with -use ORIG_X2, after the advice of Prime95 to -use FMA_X2 at [URL]https://www.mersenneforum.org/showpost.php?p=517932&postcount=1213[/URL], plus subsequent experimentation for performance [URL]https://www.mersenneforum.org/showpost.php?p=517961&postcount=1217[/URL]
No data on Radeon VII here yet.

Hadn't seen NO_ASM back when I made the -use list at [URL]https://www.mersenneforum.org/showpost.php?p=517999&postcount=1222[/URL] but I see it there in the gpuowl.cl code of v6.5-76-g1ca08e2-dirty

SELROC 2019-07-20 09:22

[QUOTE=preda;521955]Try running it with "-use NO_ASM"
(if it works, you can put that option in config.txt )

The __asm() errors are because you're running with a driver (Adrenalin/windows) that does not support assembly. ROCm/linux works fine with __asm(). Anyway, assembly support is not mandatory, you just need to disable it with -use NO_ASM . I haven't found a way yet (in OpenCL) to automatically detect __asm support.[/QUOTE]


There should be a way to detect which driver is in use. amdgpu-pro doesn't support __asm(), so I set -use NO_ASM.

preda 2019-07-20 11:22

Normally the next log line would be something like:
"OpenCL compilation in 2195 ms"
So it seems in your case it's stuck at the OpenCL compilation step. I'm sorry but I don't really know why, and unfortunately I can't repro. (I would be happy to have a fix if the problem is on gpuowl's side)

[QUOTE=ATH;521958]Thanks. I assume there is no Windows driver where it works?

Now there are no errors but it does not actually start calculating, the card is not being used at all.

[CODE]
gpuowl-win.exe -device 1 -use NO_ASM
2019-07-20 03:01:43 gpuowl v6.5-84-g30c0508
2019-07-20 03:01:43 Note: no config.txt file found
2019-07-20 03:01:43 config: -device 1 -use NO_ASM
2019-07-20 03:01:43 80293033 FFT 4608K: Width 256x4, Height 64x4, Middle 9; 17.02 bits/word
2019-07-20 03:01:43 using short carry kernels
2019-07-20 03:01:43 OpenCL args "-DEXP=80293033u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=9u -DWEIGHT_STEP=0xf.d1f3073e091p-3 -DIWEIGHT_STEP=0x8.17498299a4db8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DNO_ASM=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"[/CODE]


I was afraid I was being too optimistic trying to run Nvidia and AMD cards in the same computer.

Does anyone else have any Windows binaries? Only Kriesel has posted binaries of the latest versions in this thread.[/QUOTE]

SELROC 2019-07-20 12:44

[QUOTE=SELROC;521978]there should be a way to detect which driver is in use, amdgpu-pro doesn't support __asm() and I set -use NO_ASM.[/QUOTE]


It is possible with a script: the "vendor" field should change accordingly for Nvidia, and the driver named in the "configuration: driver=amdgpu latency=0" line should change with the driver in use:


# lshw -class video
*-display
description: VGA compatible controller
product: Ellesmere [Radeon RX 470/480]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:01:00.0
version: e7
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: iomemory:220-21f iomemory:210-20f irq:126 memory:2200000000-23ffffffff memory:2100000000-21001fffff ioport:e000(size=256) memory:f7e00000-f7e3ffff memory:f7e40000-f7e5ffff
*-display
description: VGA compatible controller
product: Intel Corporation
vendor: Intel Corporation
physical id: 2
bus info: pci@0000:00:02.0
version: 04
width: 64 bits
clock: 33MHz
capabilities: pciexpress msi pm vga_controller bus_master cap_list rom
configuration: driver=i915 latency=0
resources: iomemory:2f0-2ef iomemory:2f0-2ef irq:125 memory:2ffe000000-2ffeffffff memory:2fe0000000-2fefffffff ioport:f000(size=64) memory:c0000-dffff
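A rough sketch of such a script, in Python, that pulls the vendor and kernel driver out of `lshw -class video` text like the above. The function name and the trimmed sample are made up for illustration. Note the assumption being tested: the kernel driver name alone may not tell a ROCm install from an amdgpu-pro one, so this is only a starting point for choosing -use NO_ASM.

```python
import re

def parse_lshw_video(text):
    """Return one dict per '*-display' section with its vendor and kernel driver."""
    devices = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("*-display"):
            devices.append({"vendor": None, "driver": None})
        elif devices and line.startswith("vendor:"):
            devices[-1]["vendor"] = line.split(":", 1)[1].strip()
        elif devices and line.startswith("configuration:"):
            match = re.search(r"driver=(\S+)", line)
            if match:
                devices[-1]["driver"] = match.group(1)
    return devices

# Trimmed copy of the lshw output posted above (only the fields the parser reads).
SAMPLE = """\
*-display
description: VGA compatible controller
product: Ellesmere [Radeon RX 470/480]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
configuration: driver=amdgpu latency=0
*-display
description: VGA compatible controller
product: Intel Corporation
vendor: Intel Corporation
configuration: driver=i915 latency=0
"""

for dev in parse_lshw_video(SAMPLE):
    print(dev["vendor"], "->", dev["driver"])
```

On a live system the text would come from running lshw (for example via `subprocess.run`), which typically needs root to show every field.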

kriesel 2019-07-20 14:41

[QUOTE=ATH;521958]
I was afraid I was being too optimistic trying to run Nvidia and AMD card in the same computer.[/QUOTE]
AH!
Divide and conquer.
Use multiple cl test and info utilities and device manager to check how functional the AMD opencl and driver installation is. Sometimes one will claim all's fine and others will show issues. I've seen one vendor's opencl install hose another's. (In that case it was an NVIDIA or AMD SDK install disabling the opencl use of the Intel igp, until the SDKs were removed and the Intel opencl reinstalled.)

You could try a temporary complete removal of the NVIDIA driver followed by removal and reinstall of the AMD driver. Also sometimes an additional second reboot is needed after a graphics driver install.

SELROC 2019-07-20 14:41

[QUOTE=preda;521955]Try running it with "-use NO_ASM"
(if it works, you can put that option in config.txt )

The __asm() errors are because you're running with a driver (Adrenalin/windows) that does not support assembly. ROCm/linux works fine with __asm(). Anyway, assembly support is not mandatory, you just need to disable it with -use NO_ASM . I haven't found a way yet (in OpenCL) to automatically detect __asm support.[/QUOTE]


Hi Mihai, I found a C++ library:


[url]https://github.com/ThePhD/infoware[/url]


Example: [url]https://github.com/ThePhD/infoware/blob/master/examples/gpu.cpp[/url]


reactions?

SELROC 2019-07-21 06:42

[QUOTE=preda;521955]Try running it with "-use NO_ASM"
(if it works, you can put that option in config.txt )

The __asm() errors are because you're running with a driver (Adrenalin/windows) that does not support assembly. ROCm/linux works fine with __asm(). Anyway, assembly support is not mandatory, you just need to disable it with -use NO_ASM . I haven't found a way yet (in OpenCL) to automatically detect __asm support.[/QUOTE]


This header file detects platform:
[url]https://github.com/hendrix2897/platform-detect/blob/master/PlatformDetect.h[/url]

maxzor 2019-07-23 19:37

Hello and thank you for the program.
How much of it depends on CPU performance?
Will it be significantly slower running on a Radeon VII with a pentium II, i5 2500 or R7 1800x (or 3600) ?
I am about to compile in linux soon.
I have a 1800x, and setup Radeon VII for gpuOwl and Nvidia 1050ti for the lesser stuff, any experience in balancing load between two gpus appreciated!

SELROC 2019-07-31 05:41

[QUOTE=maxzor;522179]Hello and thank you for the program.
How much of it depends on CPU performance?
Will it be significantly slower running on a Radeon VII with a pentium II, i5 2500 or R7 1800x (or 3600) ?
I am about to compile in linux soon.
I have a 1800x, and setup Radeon VII for gpuOwl and Nvidia 1050ti for the lesser stuff, any experience in balancing load between two gpus appreciated![/QUOTE]


Hello, gpuowl is independent from CPU performance, because the computation is done on the GPU all the time.
To compile on linux you need a few libraries preinstalled, libgmp-dev and probably g++-8 depending on your linux distribution.


Any finding on how to run gpuowl on Nvidia is welcome.

xx005fs 2019-08-11 07:13

R9 390x and R9 280x performance?
 
Hi all, a friend of mine recently upgraded his computer and is selling me his old R9 280x for a cheap price. Does anyone have performance numbers for gpuowl on the 280x at the current wavefront? Also, is buying a used R9 390x worth it for gpuowl? Benchmarks at the current wavefront would be welcome. Thanks

preda 2019-08-11 08:59

[QUOTE=xx005fs;523498]Hi all, I have a friend recently who upgraded his computer and is giving me his old R9 280x to me for a cheap price. Does anyone have performance number on gpuowl with the 280x on the current wavefront? At the same time, is buying used R9 390x worth it for gpuowl, and some benchmarks at current wavefront will be welcomed. Thanks[/QUOTE]

I never had an R9 280x, so I don't know the perf numbers. On Linux, ROCm may not support the 280x because it is too old, but amdgpu-pro should(?) support it. I wouldn't recommend buying an R9 390x for gpuowl: the best thing to buy is a Radeon VII (expensive though), which makes up for its price in power cost savings IMO.

xx005fs 2019-08-11 15:32

[QUOTE=preda;523499]I didn't have a R9 280x, so I don't know perf numbers. On Linux, ROCm may not support 280x as too old, but amdgpu-pro should(?) support it. I wouldn't recommend buying R9 390x for gpuowl: the best thing to buy is Radeon VII (expensive though), but makes up the price in power cost savings IMO.[/QUOTE]

Thanks for the suggestions. The LL benchmark values on mersenne.ca seem to have been run with clLucas and are vastly different from gpuowl speeds (apart from the Radeon VII run), and I see the R9 390x surprisingly placed above Vega 56, so I just wanted confirmation of the performance hierarchy. Also, it seems that Radeon VIIs are out of stock nearly everywhere and the price has gone up from $600 to $700, which is definitely out of my current budget range. I guess I'll just wait a while until AMD or Nvidia releases a true "professional" GPU with a high double-precision to single-precision ratio.

kriesel 2019-08-11 17:11

[QUOTE=xx005fs;523522]Also, it seems that Radeon viis are out of stock nearly everywhere and the price has gone up from 600$ to 700, which is definitely out of my current budget range. I guess I'll just wait for a while till sometime AMD or Nvidia release a true "professional" GPU with high double precision to single precision ratio.[/QUOTE]
If you think the Radeon VII is expensive, check out the prices of current existing NVIDIA Tesla or Quadro or AMD MI product lines, which have the higher dp capability of a pro gpu. I found an MI25 once, used, for around $3000.

preda 2019-09-03 13:47

P-1 bug in GpuOwl
 
Hi, I just realized that P-1 stage2 in GpuOwl was broken since v6.5-51-gefc3c9f
P-1 should be fixed starting with v6.6 (pending more validation)
If somebody had the bad luck of doing P-1 with an affected version, the stage2 part of the "no factor" results is not valid. (stage1 is good though). Any factor-found results are good, too.

Independently, I recently changed the memory allocation in stage2, which was a problem in the past (reported by Ken) (this was how I realized there's a bug).

While this shows lack of testing on my part, it's also a reminder to self-validate: please do a couple of P-1 runs on known results (these can be found in the folder test-pm1 in the GpuOwl source) before starting serious P-1 work.

kriesel 2019-09-03 21:35

[QUOTE=preda;525076]Hi, I just realized that P-1 stage2 in GpuOwl was broken since v6.5-51-gefc3c9f
P-1 should be fixed starting with v6.6 (pending more validation)
If somebody had the bad luck of doing P-1 with an affected version, the stage2 part of the "no factor" results is not valid. (stage1 is good though). Any factor-found results are good, too.

Independently, I recently changed the memory allocation in stage2, which was a problem in the past (reported by Ken) (this was how I realized there's a bug).

While this shows lack of testing on my part, it's also an advice to self-validate: please do a couple of P-1 on known results (that can be found in the folder test-pm1 in GpuOwl source) before starting serious P-1 work.[/QUOTE]
Great; progress I've been waiting for.
What sort of validation do you have in mind?
We all should validate each specific installation combination, and on most gpu applications, also benchmark and tune, specific to the system, gpu model, planned work type and exponent range.

I've made a successful 4M exponent P-1 run in gpuowl on Win7 and RX480 and a 50M p using -use NO_ASM on gpuowl-win v6.6-5-667954b, although ORIG_X2 might have been better.
Bounds and therefore runtime were excessive on the 4M. It found the product of all 3 known factors for it.
[CODE]{"exponent":"4444091", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.6-5-g667954b"}, "timestamp":"2019-09-03 20:26:34 UTC", "fft-length":229376, "B1":500000, "B2":15000000, "factors":["1809798096458971047321927127"]}
{"exponent":"50001781", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.6-5-g667954b"}, "timestamp":"2019-09-03 21:12:48 UTC", "fft-length":2883584, "B1":95000, "B2":4100000, "factors":["4392938042637898431087689"]}
[/CODE]
Do you have any guidance on what is feasible for a given amount of installed GPU RAM (2GB, 4GB, 8GB, etc.)?
Interestingly, GPU-Z v2.24.0 showed 0% gpu load indicated throughout stage 2, but showed 100% load during stage 1 and during V6.5 or v3.8 PRP3. GPU-Z shows other oddities on this system when accessed by RDP which is almost always.
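One cheap way to self-validate a factor-found record independently of gpuowl is to check that the reported factor really divides 2^p - 1. A minimal Python sketch follows; the helper name is made up, and the two (exponent, factor) pairs are simply copied from the result lines above.

```python
def divides_mersenne(p, f):
    """True if f divides 2^p - 1, i.e. 2^p is congruent to 1 modulo f."""
    return pow(2, p, f) == 1

# (exponent, reported factor) pairs copied from the JSON result lines above.
reported = [
    (4444091, 1809798096458971047321927127),
    (50001781, 4392938042637898431087689),
]

for p, f in reported:
    print(p, f, "divides" if divides_mersenne(p, f) else "DOES NOT divide")
```

This only confirms factors that were found; it says nothing about whether a "no factor" run covered its bounds correctly, which is what the known-result tests are for.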

kriesel 2019-09-03 22:15

P-1 bounds in GpuOwL
 
Please consider, rather than the fixed B1 bound and derived B2 bound, having the default similar to the GPUto72 bounds or a fit to them. See [url]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/url]

kriesel 2019-09-03 22:25

[QUOTE=kriesel;525105]I've made a successful 4M exponent P-1 run in gpuowl on Win7 and RX480 and a 50M p using -use NO_ASM on gpuowl-win v6.6-5-667954b, although ORIG_X2 might have been better.
[CODE]{"exponent":"4444091", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.6-5-g667954b"}, "timestamp":"2019-09-03 20:26:34 UTC", "fft-length":229376, "B1":500000, "B2":15000000, "factors":["1809798096458971047321927127"]}
{"exponent":"50001781", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.6-5-g667954b"}, "timestamp":"2019-09-03 21:12:48 UTC", "fft-length":2883584, "B1":95000, "B2":4100000, "factors":["4392938042637898431087689"]}
[/CODE][/QUOTE]
The news is not as good on NVIDIA. This was an attempt on a GTX 1080 Ti.
[CODE]>gpuowl-win -device 0 -use ORIG_X2 -B1 6000 -B2 2100000 -pm1 51558151
2019-09-03 17:16:17 gpuowl v6.6-5-g667954b
2019-09-03 17:16:17 Note: no config.txt file found
2019-09-03 17:16:17 config: -device 0 -use ORIG_X2 -B1 6000 -B2 2100000 -pm1 51558151
2019-09-03 17:16:17 51558151 FFT 2816K: Width 8x8, Height 256x8, Middle 11; 17.88 bits/word
2019-09-03 17:16:17 using short carry kernels
2019-09-03 17:16:17 OpenCL args "-DEXP=51558151u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=11u -DWEIGHT_STEP=0x8.b1cf5f16b2fap-3 -DIWEIGHT_STEP=0xe.b8c9efc21a378p-4 -DWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-3 -DIWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-03 17:16:24

2019-09-03 17:16:24 OpenCL compilation in 6895 ms
2019-09-03 17:16:25 51558151 P-1 starting stage1
2019-09-03 17:18:01 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-09-03 17:18:01 Bye

>gpuowl-win -device 0 -use ORIG_X2 -B1 2000 -B2 8000000 -pm1 100000081
2019-09-03 17:18:01 gpuowl v6.6-5-g667954b
2019-09-03 17:18:01 Note: no config.txt file found
2019-09-03 17:18:01 config: -device 0 -use ORIG_X2 -B1 2000 -B2 8000000 -pm1 100000081
2019-09-03 17:18:01 100000081 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word
2019-09-03 17:18:01 using short carry kernels
2019-09-03 17:18:02 OpenCL args "-DEXP=100000081u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a5067a8c5cb2p-3 -DIWEIGHT_STEP=0xa.1f74af2719fap-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-03 17:18:05

2019-09-03 17:18:05 OpenCL compilation in 3541 ms
2019-09-03 17:18:07 100000081 P-1 starting stage1
2019-09-03 17:20:57 Exception 9gpu_error: INVALID_VALUE clGetDeviceInfo(id, what, bufSize, buf, NULL) at clwrap.cpp:98 getInfo
2019-09-03 17:20:57 Bye[/CODE]

kriesel 2019-09-04 00:02

cmd line options; p-1 save files
 
I note there is -user and -cpu but no -aid command line option.
Is there a save file provision for P-1 runs so that they can be stopped and later continued, or is a run in progress lost if it is halted?

kriesel 2019-09-04 01:08

gpuowl reports both target bounds with stage 1 factor found
 
See [M]100002337[/M]. Factor was found in stage 1. B1 and B2 were included in the report. It is customary to report B1 and B2 as the B1 value when the factor is found in stage 1 and stage 2 is not fully performed. That way, the recorded bounds reflect actual factoring limits completed, and processing credit is properly computed, not overestimated.
[CODE]2019-09-03 19:29:07 100002337 1170000 97.69%; 4859 us/sq; ETA 0d 00:02; a2c219ce3eac5f47
2019-09-03 19:29:55 100002337 1180000 98.52%; 4848 us/sq; ETA 0d 00:01; f8dd78f4927d7326
2019-09-03 19:30:44 100002337 1190000 99.36%; 4863 us/sq; ETA 0d 00:01; df1df66b8d7a209c
2019-09-03 19:31:21 P-1 stage2 using 160 buffers of 44.0 MB each
2019-09-03 19:31:22 P-1 (B1=830000, B2=17430000, D=30030): primes 1050980, expanded 1071560, doubles 177259 (left 703338), singles 696462, total 873721 (83%)
2019-09-03 19:31:22 100002337 P-1 stage2: 553 blocks starting at block 28 (873721 selected)
2019-09-03 19:35:48 Round 1 of 18: init 4.62 s; 5.37 ms/mul; 48637 muls
2019-09-03 19:35:48 100002337 P-1 [B]stage1 GCD[/B]: 2393819567666978656303937
2019-09-03 19:35:48 {"exponent":"100002337", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.6-5-g667954b"}, "timestamp":"2019-09-04 00:35:48 UTC", "fft-length":5767168, "B1":830000, "B2":[B]17430000[/B], "factors":["2393819567666978656303937"]}
2019-09-03 19:35:48 Bye[/CODE]

preda 2019-09-04 15:30

[QUOTE=kriesel;525123]See [M]100002337[/M]. Factor was found in stage 1. B1 and B2 were included in the report. It is customary to report B1 and B2 as the B1 value when the factor is found in stage 1 and stage 2 is not fully performed. That way, the recorded bounds reflect actual factoring limits completed, and processing credit is properly computed, not overestimated.[/QUOTE]

OK, I think I fixed this in a recent commit: if a factor is found in stage1, do not include B2 in the report.
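To illustrate the reporting rule, here is a small Python sketch that builds a result record and leaves out B2 when the factor came from stage 1. It is not GpuOwl's code: the function and its parameters are hypothetical, the field names are copied from the result lines posted in this thread, and the program, timestamp and fft-length fields are omitted for brevity.

```python
import json

def pm1_result(exponent, b1, b2, factor=None, found_in_stage1=False):
    """Build a P-1 result record; B2 is dropped when stage 2 was not needed."""
    record = {
        "exponent": str(exponent),
        "worktype": "PM1",
        "status": "F" if factor else "NF",
        "B1": b1,
    }
    if not found_in_stage1:
        record["B2"] = b2  # only report B2 when stage 2 actually contributed
    if factor:
        record["factors"] = [str(factor)]
    return json.dumps(record)

# Stage 1 factor: only B1 is reported.
print(pm1_result(100002337, 830000, 17430000,
                 factor=2393819567666978656303937, found_in_stage1=True))
# Both stages completed without a factor: B1 and B2 are reported.
print(pm1_result(6972593, 60000, 780000))
```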

Prime95 2019-09-04 15:34

[QUOTE=preda;525076]Hi, I just realized that P-1 stage2 in GpuOwl was broken since v6.5-51-gefc3c9f[/QUOTE]

Preda is being kind in not mentioning this is my fault. I made several small optimizations in gpuowl and not knowing the code very well assumed my Gerbicz PRP testing would catch any bugs. However, I made a typo in a code path only used by P-1.

Sorry about that.

preda 2019-09-04 15:38

[QUOTE=kriesel;525113]The news is not as good on NVIDIA. This was an attempt on a GTX 1080 Ti.
[/QUOTE]

I was using an AMD-specific extension to get the amount of free RAM on the GPU. I updated the code to also work when said extension is not available. For such a situation (no free GPU RAM info) I added a flag which specifies a limit on the GPU RAM GpuOwl can use:
-maxAlloc <size in MB>

Feel free to try again on NVIDIA, to see what problem we hit next.

BTW, -maxAlloc also works in general, hard-limiting the amount of GPU memory used by a GpuOwl instance.
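A back-of-the-envelope Python sketch of what such a limit might mean for stage 2. It assumes each stage 2 buffer holds one 8-byte double per FFT word, which matches the "160 buffers of 44.0 MB each" logged earlier for FFT length 5767168, and that the buffer count is simply the limit divided by the buffer size. How GpuOwl actually picks the count is not stated in this thread, and the function name is invented.

```python
def stage2_buffers(fft_length, max_alloc_mb, bytes_per_word=8):
    """Return (buffer size in MB, number of whole buffers that fit in the limit)."""
    buf_bytes = fft_length * bytes_per_word
    buf_mb = buf_bytes / 2**20
    count = (max_alloc_mb * 2**20) // buf_bytes
    return buf_mb, count

# FFT length from the M100002337 log; 7040 MB is just 160 * 44 MB, chosen to
# reproduce the logged buffer count under the assumptions above.
print(stage2_buffers(5767168, 7040))
```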

kriesel 2019-09-04 15:50

memory query etc.
 
[QUOTE=preda;525155]I was using an AMD-specific extension to get the amount of free RAM on the GPU. I updated the code to also work when said extension is not available. For such a situation (no free GPU RAM info) I added a flag which specifies a limit on the GPU RAM GpuOwl can use:
-maxAlloc <size in MB>

Feel free to try again on NVIDIA, to see what problem we hit next.

BTW, -maxAlloc also works in general, hard-limiting the amount of GPU memory used by a GpuOwl instance.[/QUOTE]
Will give it a try. It's very surprising that querying available memory is available in OpenGL but not a standard part of OpenCL. Seems like a glaring omission for performance. [URL]https://stackoverflow.com/questions/3568115/how-do-i-determine-available-device-memory-in-opencl[/URL]
GPU-Z and nvidia-smi have ways of querying gpu memory usage.

I have a few command-line runs of gpuowl P-1 going on my RX480 as benchmarks versus widely spaced exponent values and tests of how high it can go.
Presumably the new commit's -maxAlloc value should leave some space uncommitted, perhaps a GB, for other usage of the gpu ram.

Thanks to you and George for your efforts, which also unavoidably result in a few bugs escaping.

I confirmed by a quick test near the beginning that a halted P-1 run starts over from the beginning; there's no save file. With run time of more than a day on an RX480 for M300M, that's a drawback.

PontiacGTX 2019-09-04 17:19

[QUOTE=preda;493560]Yep, that means that the LLVM GCN backend does not know how to translate some operation to GCN. In this case it may be about a 128-bit SUB.

Are you using ROCm or amdgpu-pro? if ROCm, which version?

If you're on most recent ROCm (i.e. 1.8.2), we should let them know ("ROCm issues") about it.[/QUOTE]
I know this is a bit old, but I have no other reference regarding this same problem. Can you somehow update the LLVM version that OpenCL uses to build the kernel in Windows while using the same OpenCL APP SDK 2.0?

PontiacGTX 2019-09-04 17:25

Well, the problem happens when I try to use a uint64_t type in a kernel.

ixfd64 2019-09-04 18:05

[QUOTE=preda;525076]Hi, I just realized that P-1 stage2 in GpuOwl was broken since v6.5-51-gefc3c9f
P-1 should be fixed starting with v6.6 (pending more validation)
If somebody had the bad luck of doing P-1 with an affected version, the stage2 part of the "no factor" results is not valid. (stage1 is good though). Any factor-found results are good, too.

Independently, I recently changed the memory allocation in stage2, which was a problem in the past (reported by Ken) (this was how I realized there's a bug).

While this shows lack of testing on my part, it's also an advice to self-validate: please do a couple of P-1 on known results (that can be found in the folder test-pm1 in GpuOwl source) before starting serious P-1 work.[/QUOTE]

George, any idea how many potentially bad P-1 results need to be redone?

Prime95 2019-09-04 19:34

[QUOTE=ixfd64;525176]George, any idea how many potentially bad P-1 results need to be redone?[/QUOTE]

Madpoo found about 50.

kriesel 2019-09-04 20:21

[QUOTE=Prime95;525185]Madpoo found about 50.[/QUOTE]
Perhaps he and uncwilly could put together a thread for cleaning that up.

pinhodecarlos 2019-09-04 20:51

How much memory do those P-1 tasks need? Can they be run without having a GIMPS account? I can dedicate a couple of cores to the cause.

preda 2019-09-04 21:32

[QUOTE=PontiacGTX;525171]well the problem happens when I try to use an uint64_t type in a kernel[/QUOTE]

uint64_t is not a type in OpenCL. Use "unsigned long" instead, which is guaranteed (in OpenCL) to be 64 bits.

preda 2019-09-04 21:37

[QUOTE=pinhodecarlos;525191]How much memory do those P-1 tasks need? Can they be run without having a GIMPS account? I can dedicate a couple of cores to the cause.[/QUOTE]

In general P-1 stage2 is faster with more RAM available. With GpuOwl I'd recommend a GPU with 8GB or more. I need to do a more detailed write-up on the memory use and performance impact.

kriesel 2019-09-04 22:24

First V6.x P-1 success on NVIDIA, a good-news/bad-news story
 
2 Attachment(s)
On Windows 7 Pro x64, NVIDIA GTX 1080 Ti:

By command line option specification...

Successful stage 1 factor find:
[CODE]{"exponent":"4444091", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.7-2-g7cf95d0"}, "timestamp":"2019-09-04 19:13:31 UTC", "fft-length":229376, "B1":40000, "factors":["1809798096458971047321927127"]}[/CODE]
Successful 2 stages no factor run:
[CODE]{"exponent":"6972593", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.7-2-g7cf95d0"}, "timestamp":"2019-09-04 19:33:17 UTC", "fft-length":360448, "B1":60000, "B2":780000}[/CODE]
By worktodo file:
The 50M run appeared to succeed, but the program crashed before the stage 2 gcd completed and after the next worktodo line, a 100M task, had already begun. It was apparently in the midst of updating the worktodo file; there's a worktodo.bak containing the line [CODE]B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2
[/CODE] and a worktodo.txt without it; no indication in gpuowl.log or console whether the factor was found or not. A flush to log and console immediately upon stage completion might be good to preserve whatever result it achieved. (Nice result you got there; it would be a shame if anything happened to it.) It is reproducible from the command line also.

Issue also occurred for cmd line input [CODE]gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 24036583 -B1 220000 -B2 3960000
[/CODE] so it appears to be parameter or memory size related; max gpu usage was ~1.9GB during that run.

Many cases, summarized very briefly, run from the command line options as shown; exponent up to 10M ok, 11M and up a problem. (Meanwhile on AMD v6.6 happily runs 200M and probably higher.)
[CODE]:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 4444091 -B1 40000 -B2 480000 factor in stage 1
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 6972593 -B1 60000 -B2 780000 ok no factor
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 50001781 -B1 830000 -B2 17430000 fails near stage 2 gcd

:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 24036583 -B1 220000 -B2 3960000 fails near stage 2 gcd
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 24000001 -B1 220000 -B2 3960000 factor in stage 1
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 13466917 -B1 110000 -B2 1760000 fails near stage 2 gcd
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 13466951 -B1 110000 -B2 1760000 factor in stage 1
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 10000019 -B1 80000 -B2 1200000 factor in stage 1

:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 10000139 -B1 80000 -B2 1200000 completes 2 stages no factor, peak 9183MiB
:gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 12000073 -B1 110000 -B2 1650000 peak 7495MiB, fails near stage 2 gcd

gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 11000159 -B1 100000 -B2 1500000 peak 10218MiB, fails near stage 2 gcd[/CODE]
Generally: CUDAPm1 reports how many relative primes were processed together per pass in stage 2, and the e value used for the Brent-Suyama extension. I don't see either of these in gpuowl's output. Prime95, as I recall, shows them, and includes e in the result record if it is larger than 2.
Gpuowl uses us/sq in P-1 stage 1, but ms/mul in stage 2. The timings are comparable.

preda 2019-09-05 11:24

[QUOTE=kriesel;525198]
50M appeared to run successfully, but the program crashed before stage 2 gcd completed and after the next worktodo line has already begun, a 100M task. It was apparently in the midst of updating the worktodo file; there's a worktodo.bak containing the line [CODE]B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2
[/CODE] and a worktodo.txt without it; no indication in gpuowl.log or console whether the factor was found or not. A flush to log and console immediately upon stage completion might be good to preserve whatever result it achieved. (Nice result you got there; it would be a shame if anything happened to it.) It is reproducible from the command line also.

Issue also occurred for cmd line input [CODE]gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 24036583 -B1 220000 -B2 3960000
[/CODE] so it appears to be parameter or memory size related; max gpu usage was ~1.9GB during that run.
[/QUOTE]

What it looks like happened there is:
- P-1 stage2 completed, and GCD was started on CPU in the background,
- worktodo.txt was correctly updated
- at this point, the program is supposed to just stay there and wait for the GCD to finish. Instead it crashed.

So, there was no result to write (or flush) yet. I did run your example (-pm1 24036583 -B1 220000 -B2 3960000) on my AMD gpu, no crash. Is your crash reproducible? happens every single time? Only on Nvidia? what is the last message in the log, before the crash?

I see in the screen-shot with the crash, there is the next task starting. But you can also reproduce it with -pm1 on the command line, where there is not any "next" task, right?

Did stage2 find any successful factor -- maybe when trying a known-factor that should be found in stage2 ?

What is the "additional information" on the windows crash window?

With command line "-pm1", if there is a crash, is there a delay between the last line written to output, and the crash? (i.e. does the crash occur when the GCD (on CPU) comes back?)

Could you ever run stage2 to completion on windows without a crash? (i.e. maybe the cause is Windows, not Nvidia?)

preda 2019-09-05 12:26

Ken: crash at end of stage2 might be fixed now, please retry.

kriesel 2019-09-05 16:19

[QUOTE=preda;525223]What it looks like happened there is:
- P-1 stage2 completed, and GCD was started on CPU in the background,
- worktodo.txt was correctly updated
- at this point, the program is supposed to just stay there and wait for the GCD to finish. Instead it crashed.

So, there was no result to write (or flush) yet. I did run your example (-pm1 24036583 -B1 220000 -B2 3960000) on my AMD gpu, no crash. Is your crash reproducible? happens every single time? Only on Nvidia? what is the last message in the log, before the crash?

I see in the screen-shot with the crash, there is the next task starting. But you can also reproduce it with -pm1 on the command line, where there is not any "next" task, right?[/QUOTE]Yes, both worktodo and command line cases crash and were provided. I didn't duplicate exactly, but explored at what p the issue occurs vs. doesn't.[QUOTE]
Did stage2 find any successful factor -- maybe when trying a known-factor that should be found in stage2 ?[/QUOTE]For p=50001781, k= 43927 815717 583124 = 2[SUP]2[/SUP] × 29 × 983 × 94709 × 4 067587 so[QUOTE]{"exponent":"50001781", "worktype":"PM1", "status":"F", "program":{"name":"gpuowl", "version":"v6.6-5-g667954b"}, "timestamp":"2019-09-03 21:12:48 UTC", "fft-length":2883584, "B1":95000, "B2":4100000, "factors":["4392938042637898431087689"]}[/QUOTE] on AMD RX480 and Windows 7, but on different system with NVIDIA GTX 1080 Ti and Windows 7 with somewhat different bounds same exponent it gave this in the log:
[CODE]2019-09-04 12:38:12 Note: no config.txt file found
2019-09-04 12:38:12 config: -device 0 -use ORIG_X2 -maxAlloc 10240
2019-09-04 12:38:12 50001781 FFT 2816K: Width 8x8, Height 256x8, Middle 11; 17.34 bits/word
2019-09-04 12:38:12 using short carry kernels
2019-09-04 12:38:13 OpenCL args "-DEXP=50001781u -DWIDTH=64u -DSMALL_HEIGHT=2048u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a3abda189b648p-3 -DIWEIGHT_STEP=0xa.208a4c5c1d58p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-04 12:38:18

2019-09-04 12:38:18 OpenCL compilation in 5272 ms
2019-09-04 12:38:19 50001781 P-1 starting stage1
2019-09-04 12:38:47 50001781 10000 1.58%; 2804 us/sq; ETA 0d 00:29; 044528bffde3983b
2019-09-04 12:39:15 50001781 20000 3.15%; 2816 us/sq; ETA 0d 00:29; 415fbce27a9baab2
2019-09-04 12:39:44 50001781 30000 4.73%; 2824 us/sq; ETA 0d 00:28; 21f570c38ee14685
2019-09-04 12:40:12 50001781 40000 6.30%; 2835 us/sq; ETA 0d 00:28; 4bdecfdeb836f41e
2019-09-04 12:40:41 50001781 50000 7.88%; 2838 us/sq; ETA 0d 00:28; 95d006c6a337a424
2019-09-04 12:41:09 50001781 60000 9.45%; 2853 us/sq; ETA 0d 00:27; 81c56d44009699ea
2019-09-04 12:41:38 50001781 70000 11.03%; 2859 us/sq; ETA 0d 00:27; be957ab31eaa093a
2019-09-04 12:42:07 50001781 80000 12.60%; 2858 us/sq; ETA 0d 00:26; 453bc86fd585b610
2019-09-04 12:42:35 50001781 90000 14.18%; 2858 us/sq; ETA 0d 00:26; d4859abbb88f685e
2019-09-04 12:43:04 50001781 100000 15.75%; 2866 us/sq; ETA 0d 00:26; 2acc32952e6a8413
2019-09-04 12:43:33 50001781 110000 17.33%; 2870 us/sq; ETA 0d 00:25; 1a465e1dd2daf5e7
2019-09-04 12:44:01 50001781 120000 18.90%; 2870 us/sq; ETA 0d 00:25; ad8304e19075e6d2
2019-09-04 12:44:30 50001781 130000 20.48%; 2869 us/sq; ETA 0d 00:24; 9e83c080c8678aa4
2019-09-04 12:44:59 50001781 140000 22.05%; 2870 us/sq; ETA 0d 00:24; 437a7a3a4b6cbf1c
2019-09-04 12:45:28 50001781 150000 23.63%; 2869 us/sq; ETA 0d 00:23; e0f5c2193df0bafc
2019-09-04 12:45:56 50001781 160000 25.20%; 2870 us/sq; ETA 0d 00:23; fd62439dd6f046a9
2019-09-04 12:46:25 50001781 170000 26.78%; 2870 us/sq; ETA 0d 00:22; 852da47da8286b37
2019-09-04 12:46:54 50001781 180000 28.35%; 2870 us/sq; ETA 0d 00:22; e1cce4b69a7acf51
2019-09-04 12:47:23 50001781 190000 29.93%; 2869 us/sq; ETA 0d 00:21; c3b8528c7dd22b9a
2019-09-04 12:47:51 50001781 200000 31.50%; 2870 us/sq; ETA 0d 00:21; c8635b0ffaea5d2c
2019-09-04 12:48:20 50001781 210000 33.08%; 2870 us/sq; ETA 0d 00:20; d1b17f46043060f7
2019-09-04 12:48:49 50001781 220000 34.65%; 2870 us/sq; ETA 0d 00:20; d901653e490557cf
2019-09-04 12:49:18 50001781 230000 36.23%; 2870 us/sq; ETA 0d 00:19; 7952ed49512db373
2019-09-04 12:49:47 50001781 240000 37.80%; 2872 us/sq; ETA 0d 00:19; 4ce2439936761fbb
2019-09-04 12:50:15 50001781 250000 39.38%; 2870 us/sq; ETA 0d 00:18; 0ce0df8abef61983
2019-09-04 12:50:44 50001781 260000 40.95%; 2867 us/sq; ETA 0d 00:18; 007b9e0a50ad8d5a
2019-09-04 12:51:13 50001781 270000 42.53%; 2874 us/sq; ETA 0d 00:17; 74cfe20ed732099c
2019-09-04 12:51:42 50001781 280000 44.10%; 2870 us/sq; ETA 0d 00:17; 682a467ea133a2cb
2019-09-04 12:52:10 50001781 290000 45.68%; 2870 us/sq; ETA 0d 00:16; 4eef8363ac560c71
2019-09-04 12:52:39 50001781 300000 47.25%; 2869 us/sq; ETA 0d 00:16; 539a790a21e18704
2019-09-04 12:53:08 50001781 310000 48.83%; 2872 us/sq; ETA 0d 00:16; 9f35af2bf71dfaa3
2019-09-04 12:53:37 50001781 320000 50.40%; 2870 us/sq; ETA 0d 00:15; f41e03524a1b3550
2019-09-04 12:54:05 50001781 330000 51.98%; 2870 us/sq; ETA 0d 00:15; 41e0191cbdb2c2e2
2019-09-04 12:54:34 50001781 340000 53.55%; 2869 us/sq; ETA 0d 00:14; fc1568cbec210f4a
2019-09-04 12:55:03 50001781 350000 55.13%; 2869 us/sq; ETA 0d 00:14; d2f37138d933edd7
2019-09-04 12:55:32 50001781 360000 56.70%; 2872 us/sq; ETA 0d 00:13; 65f321d7053de9a7
2019-09-04 12:56:01 50001781 370000 58.28%; 2870 us/sq; ETA 0d 00:13; 4dbbb77f22fb8d41
2019-09-04 12:56:29 50001781 380000 59.85%; 2872 us/sq; ETA 0d 00:12; f3ba9e252fc12f01
2019-09-04 12:56:58 50001781 390000 61.43%; 2872 us/sq; ETA 0d 00:12; 53c045418c81bd8b
2019-09-04 12:57:27 50001781 400000 63.00%; 2869 us/sq; ETA 0d 00:11; e52f23bff88b0a9a
2019-09-04 12:57:56 50001781 410000 64.58%; 2870 us/sq; ETA 0d 00:11; d287c9bbab002582
2019-09-04 12:58:24 50001781 420000 66.15%; 2870 us/sq; ETA 0d 00:10; aea3464b6dfe3a8f
2019-09-04 12:58:53 50001781 430000 67.73%; 2870 us/sq; ETA 0d 00:10; 5d1edc8ceca74739
2019-09-04 12:59:22 50001781 440000 69.30%; 2869 us/sq; ETA 0d 00:09; ea09dc56f88da2d5
2019-09-04 12:59:51 50001781 450000 70.88%; 2870 us/sq; ETA 0d 00:09; b8f65c44da16de4a
2019-09-04 13:00:19 50001781 460000 72.45%; 2870 us/sq; ETA 0d 00:08; 3d736562f926feef
2019-09-04 13:00:48 50001781 470000 74.03%; 2872 us/sq; ETA 0d 00:08; f3d3c48e37f43911
2019-09-04 13:01:17 50001781 480000 75.61%; 2869 us/sq; ETA 0d 00:07; bba070d88e98da6c
2019-09-04 13:01:46 50001781 490000 77.18%; 2872 us/sq; ETA 0d 00:07; cf8f1fc9f1b625f2
2019-09-04 13:02:14 50001781 500000 78.76%; 2870 us/sq; ETA 0d 00:06; d8f71fe45cd6f826
2019-09-04 13:02:43 50001781 510000 80.33%; 2869 us/sq; ETA 0d 00:06; 98741e7599457544
2019-09-04 13:03:12 50001781 520000 81.91%; 2870 us/sq; ETA 0d 00:05; 52abcb428b1911ce
2019-09-04 13:03:41 50001781 530000 83.48%; 2872 us/sq; ETA 0d 00:05; 090130b256f94a74
2019-09-04 13:04:10 50001781 540000 85.06%; 2869 us/sq; ETA 0d 00:05; 126d786c4aac7519
2019-09-04 13:04:38 50001781 550000 86.63%; 2869 us/sq; ETA 0d 00:04; 791bb2e18fe5c2c3
2019-09-04 13:05:07 50001781 560000 88.21%; 2870 us/sq; ETA 0d 00:04; f5d36f0483684169
2019-09-04 13:05:36 50001781 570000 89.78%; 2870 us/sq; ETA 0d 00:03; 810578cfd7c6080e
2019-09-04 13:06:05 50001781 580000 91.36%; 2870 us/sq; ETA 0d 00:03; b40d29da4dcc0e70
2019-09-04 13:06:33 50001781 590000 92.93%; 2872 us/sq; ETA 0d 00:02; 2a9a2764b6e6b770
2019-09-04 13:07:02 50001781 600000 94.51%; 2869 us/sq; ETA 0d 00:02; 0aa6b86e8621d47d
2019-09-04 13:07:31 50001781 610000 96.08%; 2872 us/sq; ETA 0d 00:01; 73d8a20f091dd815
2019-09-04 13:08:00 50001781 620000 97.66%; 2872 us/sq; ETA 0d 00:01; c2adf07b3ea6c52c
2019-09-04 13:08:28 50001781 630000 99.23%; 2869 us/sq; ETA 0d 00:00; 7fc58198458ecd8a
2019-09-04 13:08:43 P-1 stage2 using 291 buffers of 22.0 MB each
2019-09-04 13:08:43 P-1 (B1=440000, B2=8360000, D=30030): primes 525451, expanded 529413, doubles 91707 (left 343777), singles 342037, total 433744 (83%)
2019-09-04 13:08:43 50001781 P-1 stage2: 264 blocks starting at block 15 (433744 selected)
2019-09-04 13:11:06 Round 1 of 9: init 5.40 s; 3.14 ms/mul; 43851 muls
2019-09-04 13:11:06 50001781 P-1 stage1 GCD: no factor
2019-09-04 13:13:28 Round 2 of 9: init 5.01 s; 3.14 ms/mul; 43611 muls
2019-09-04 13:15:51 Round 3 of 9: init 4.99 s; 3.14 ms/mul; 43957 muls
2019-09-04 13:18:13 Round 4 of 9: init 5.02 s; 3.14 ms/mul; 43828 muls
2019-09-04 13:20:36 Round 5 of 9: init 4.99 s; 3.14 ms/mul; 43849 muls
2019-09-04 13:22:59 Round 6 of 9: init 5.04 s; 3.14 ms/mul; 43914 muls
2019-09-04 13:25:21 Round 7 of 9: init 5.01 s; 3.14 ms/mul; 43766 muls
2019-09-04 13:27:43 Round 8 of 9: init 5.09 s; 3.14 ms/mul; 43653 muls
2019-09-04 13:30:06 Round 9 of 9: init 5.07 s; 3.14 ms/mul; 43909 muls
2019-09-04 13:30:07 100002499 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word
2019-09-04 13:30:07 using short carry kernels
2019-09-04 14:13:12 Note: no config.txt file found
[/CODE]That 43 minute gap at the end of the log is where the crash occurred.[QUOTE]

What is the "additional information" on the windows crash window?[/QUOTE]Re "additional information" fields 1 through 4 best I could find with a bit of web searching was the not encouraging [URL]https://stackoverflow.com/questions/12922961/what-do-the-details-of-an-appcrash-message-mean[/URL]

[QUOTE]

With command line "-pm1", if there is a crash, is there a delay between the last line written to output, and the crash? (i.e. does the crash occur when the GCD (on CPU) comes back?)[/QUOTE]A rerun of the rather faster 11M case from .bat file, same parameters as before, with task manager cpu usage observed for the process gave a successful completion just now. I didn't change anything from the 11M run that failed yesterday.[CODE]Thu 09/05/2019 10:15:18.53 C:\Users\ken\Documents\gpuowl-win-v6.7-2-g7cf95d0>gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 11000159 -B1 100000 -B2 1500000
2019-09-05 10:15:18 gpuowl v6.7-2-g7cf95d0
2019-09-05 10:15:18 Note: no config.txt file found
2019-09-05 10:15:18 config: -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 11000159 -B1 100000 -B2 1500000
2019-09-05 10:15:18 11000159 FFT 576K: Width 8x8, Height 64x8, Middle 9; 18.65 bits/word
2019-09-05 10:15:18 using short carry kernels
2019-09-05 10:15:19 OpenCL args "-DEXP=11000159u -DWIDTH=64u -DSMALL_HEIGHT=512u -DMIDDLE=9u -DWEIGHT_STEP=0xa.327adcb358cc8p-3 -DIWEIGHT_STEP=0xc.8d6f66d928aap-4 -DWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-3 -DIWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-05 10:15:19

2019-09-05 10:15:19 OpenCL compilation in 124 ms
2019-09-05 10:15:19 11000159 P-1 starting stage1
2019-09-05 10:15:26 11000159 10000 6.93%; 684 us/sq; ETA 0d 00:02; 857dc3f55c49f7f9
2019-09-05 10:15:33 11000159 20000 13.85%; 678 us/sq; ETA 0d 00:01; 8c796ecf1913c37c
2019-09-05 10:15:39 11000159 30000 20.78%; 679 us/sq; ETA 0d 00:01; f57a3dec737f6b42
2019-09-05 10:15:46 11000159 40000 27.71%; 678 us/sq; ETA 0d 00:01; d1c262f051139c4c
2019-09-05 10:15:53 11000159 50000 34.63%; 686 us/sq; ETA 0d 00:01; 7685ea05184d048c
2019-09-05 10:16:00 11000159 60000 41.56%; 684 us/sq; ETA 0d 00:01; 567e68b51cd96072
2019-09-05 10:16:07 11000159 70000 48.48%; 684 us/sq; ETA 0d 00:01; 77e00f0b94aabab3
2019-09-05 10:16:14 11000159 80000 55.41%; 684 us/sq; ETA 0d 00:01; 6c257069b38766c0
2019-09-05 10:16:21 11000159 90000 62.34%; 684 us/sq; ETA 0d 00:01; b4861281d5a4562a
2019-09-05 10:16:27 11000159 100000 69.26%; 684 us/sq; ETA 0d 00:01; 27ab55b3171901c7
2019-09-05 10:16:34 11000159 110000 76.19%; 684 us/sq; ETA 0d 00:00; 9a4e427a3913db26
2019-09-05 10:16:41 11000159 120000 83.12%; 684 us/sq; ETA 0d 00:00; e0cb66fc9cf30198
2019-09-05 10:16:48 11000159 130000 90.04%; 684 us/sq; ETA 0d 00:00; 7aac556735c65680
2019-09-05 10:16:55 11000159 140000 96.97%; 684 us/sq; ETA 0d 00:00; a366518040c11593
2019-09-05 10:16:58 P-1 stage2 using 1646 buffers of 4.5 MB each
2019-09-05 10:16:58 P-1 (B1=100000, B2=1500000, D=30030): primes 104563, expanded 104563, doubles 19996 (left 64571), singles 64571, total 84567 (81%)
2019-09-05 10:16:58 11000159 P-1 stage2: 48 blocks starting at block 3 (84567 selected)
2019-09-05 10:17:40 Round 1 of 1: init 8.28 s; 0.70 ms/mul; 47901 muls
2019-09-05 10:17:40 11000159 P-1 stage1 GCD: no factor
2019-09-05 10:17:41 11000159 P-1 final GCD: no factor
2019-09-05 10:17:41 {"exponent":"11000159", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.7-2-g7cf95d0"}, "timestamp":"2019-09-05 15:17:41 UTC", "fft-length":589824, "B1":100000, "B2":1500000}
2019-09-05 10:17:41 Bye[/CODE]So I tried [CODE]gpuowl-win -device 0 -use ORIG_X2 -maxAlloc 10240 -pm1 24000577 -B1 220000 -B2 3960000[/CODE] and saw no notable delay between the round 18 of 18 console line output and the appcrash; certainly much closer than the stage1gcd took. Cpu usage profile on the Windows 7 GTX1080Ti run was a bit more (~5 seconds more) than one core average during stage 1, rising to 2 cores during stage 2 and stage1gcd overlap, returning to 1 core usually during stage2. This is on an 8-core hyperthreaded system, dual xeon e5520; approx 12:54 cpu shown in task manager to the point of crash. (Unfortunately the process entry disappears from task manager when the crash occurs.)
Note, in a duplicating run of 24000577 to same bounds on Win7 x64 and AMD RX550 running now, I do not see the equivalent cpu overhead; 15 minutes in to a 1 hour stage one estimated duration, it's showing 0% cpu usage and 8 seconds accumulated on that process; that is the same system as contains the RX480 and is dual Xeon E5645.[QUOTE]
Could you ever run stage2 to completion on windows without a crash? (i.e. maybe the cause is Windows, not Nvidia?)[/QUOTE]Yes, see the known-prime case provided earlier p=6972593 and now 11M on Windows and NVIDIA GTX1080Ti. Also larger exponents I've run on Win7 and AMD including p~200M.

Thanks for the quick turnaround to post 1324, one hour after your questions posed. Will give it a try.
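
For anyone who wants to check the known-factor case above by hand: the factor reported for p=50001781 should equal 2kp+1 for the quoted k, and under the textbook P-1 conditions (every prime power of k at most B1 for stage 1, or all but one prime at most B1 and that last prime at most B2 for stage 2) it should only turn up in stage 2 for either set of bounds used. A minimal Python sketch that ignores the Brent-Suyama extension; the helper names are made up for illustration and are not part of gpuowl:

```python
from collections import Counter

def prime_factors(n):
    """Trial-division factorization; adequate for a k of this size."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def classify(p, q, b1, b2):
    """Say which P-1 stage should find factor q of 2^p - 1 (textbook conditions only)."""
    k, rem = divmod(q - 1, 2 * p)
    if rem:
        raise ValueError("q is not of the form 2*k*p + 1")
    powers = sorted((f ** e, e) for f, e in Counter(prime_factors(k)).items())
    if not powers or powers[-1][0] <= b1:
        return "stage 1"
    top, top_exponent = powers[-1]
    rest_smooth = all(value <= b1 for value, _ in powers[:-1])
    if rest_smooth and top_exponent == 1 and top <= b2:
        return "stage 2"
    return "not found"

p = 50001781
q = 4392938042637898431087689            # factor from the RX480 result record
print((q - 1) // (2 * p))                # k
print(pow(2, p, q) == 1)                 # True exactly when q divides 2^p - 1
print(classify(p, q, 95000, 4100000))    # bounds of the RX480 run
print(classify(p, q, 440000, 8360000))   # bounds of the GTX 1080 Ti run
```

This is consistent with the "stage1 GCD: no factor" line in the GTX 1080 Ti log: the largest prime in k, 4067587, is above B1 in both runs.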

kriesel 2019-09-05 18:55

[QUOTE=preda;525225]Ken: crash at end of stage2 might be fixed now, please retry.[/QUOTE]Better on v6.7-3-745faae; running multiple test cases from worktodo now on GTX1080Ti.
It's hit an odd case, when a worktodo line is ignored, it gets confused about exponent, reporting a result for an exponent it completed but substituting the [B]10M[/B] exponent value of one it's still running, at the console, in the log, and in [B]results.txt[/B] in the record with the fftlength, B1, and B2 of the [B]24M[/B] exponent of a Mersenne prime used as a high-confidence no-factor case test:
[CODE]2019-09-05 13:25:02 24036583 260000 81.88%; 1366 us/sq; ETA 0d 00:01; 8ce7bc8b431645fa
2019-09-05 13:25:16 24036583 270000 85.03%; 1366 us/sq; ETA 0d 00:01; 204263c8f6c0b028
2019-09-05 13:25:30 24036583 280000 88.18%; 1368 us/sq; ETA 0d 00:01; 70804d872c97c737
2019-09-05 13:25:43 24036583 290000 91.32%; 1374 us/sq; ETA 0d 00:01; 1ad360e96af9200b
2019-09-05 13:25:57 24036583 300000 94.47%; 1373 us/sq; ETA 0d 00:00; 44e5901c72dbac9f
2019-09-05 13:26:11 24036583 310000 97.62%; 1371 us/sq; ETA 0d 00:00; 7f422aba9a518aa5
2019-09-05 13:26:21 P-1 stage2 using 156 buffers of 10.0 MB each
2019-09-05 13:26:22 P-1 (B1=220000, B2=3960000, D=30030): primes 260946, expanded 262000, doubles 47491 (left 166492), singles 165964, total 213455 (82%)
2019-09-05 13:26:22 24036583 P-1 stage2: 126 blocks starting at block 7 (213455 selected)
2019-09-05 13:26:40 Round 1 of 18: init 1.45 s; 1.49 ms/mul; 11487 muls
2019-09-05 13:26:59 Round 2 of 18: init 1.28 s; 1.49 ms/mul; 11549 muls
2019-09-05 13:26:59 [B]24036583[/B] P-1 stage1 GCD: no factor
2019-09-05 13:27:17 Round 3 of 18: init 1.31 s; 1.49 ms/mul; 11388 muls
2019-09-05 13:27:35 Round 4 of 18: init 1.31 s; 1.49 ms/mul; 11496 muls
2019-09-05 13:27:54 Round 5 of 18: init 1.31 s; 1.49 ms/mul; 11508 muls
2019-09-05 13:28:12 Round 6 of 18: init 1.31 s; 1.49 ms/mul; 11644 muls
2019-09-05 13:28:31 Round 7 of 18: init 1.33 s; 1.49 ms/mul; 11554 muls
2019-09-05 13:28:49 Round 8 of 18: init 1.31 s; 1.49 ms/mul; 11629 muls
2019-09-05 13:29:08 Round 9 of 18: init 1.31 s; 1.49 ms/mul; 11486 muls
2019-09-05 13:29:26 Round 10 of 18: init 1.33 s; 1.49 ms/mul; 11571 muls
2019-09-05 13:29:45 Round 11 of 18: init 1.29 s; 1.49 ms/mul; 11589 muls
2019-09-05 13:30:04 Round 12 of 18: init 1.33 s; 1.49 ms/mul; 11554 muls
2019-09-05 13:30:22 Round 13 of 18: init 1.29 s; 1.49 ms/mul; 11597 muls
2019-09-05 13:30:41 Round 14 of 18: init 1.33 s; 1.49 ms/mul; 11544 muls
2019-09-05 13:30:59 Round 15 of 18: init 1.33 s; 1.49 ms/mul; 11607 muls
2019-09-05 13:31:18 Round 16 of 18: init 1.31 s; 1.49 ms/mul; 11527 muls
2019-09-05 13:31:36 Round 17 of 18: init 1.34 s; 1.49 ms/mul; 11631 muls
2019-09-05 13:31:55 Round 18 of 18: init 1.33 s; 1.49 ms/mul; 11716 muls
2019-09-05 13:31:55 worktodo.txt: ":B1=460000,B2=8740000;PFactor=0,1,2,[B]51558151,-1,73,2" ignored[/B]
2019-09-05 13:31:55 [B]100000081 [/B]FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word
2019-09-05 13:31:56 using short carry kernels
2019-09-05 13:31:56 OpenCL args "-DEXP=100000081u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xc.a5067a8c5cb2p-3 -DIWEIGHT_STEP=0xa.1f74af2719fap-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-05 13:32:00

2019-09-05 13:32:00 OpenCL compilation in 3650 ms
2019-09-05 13:32:01 100000081 P-1 starting stage1
2019-09-05 13:32:21 100000081 P-1 final GCD: no factor
2019-09-05 13:32:21 {"exponent":"[B]100000081[/B]", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.7-3-g745faae"}, "timestamp":"2019-09-05 18:32:21 UTC", "aid":"0", "fft-length":1310720, "B1":780000, "B2":17380000}
2019-09-05 13:32:39 [B]100000081[/B] 10000 0.89%; 3804 us/sq; ETA 0d 01:11; 3201fb08e5351d29
2019-09-05 13:33:17 100000081 20000 1.78%; 3799 us/sq; ETA 0d 01:10; 195100e088564810[/CODE]FYI on v6.7-2-7cf95d0 I was able to duplicate the stage2 crash on a 4GB AMD RX480 with[CODE]gpuowl-win -device 1 -user kriesel -cpu condorella/rx550 -use ORIG_X2 -pm1 24000577 -B1 220000 -B2 3960000
[/CODE]I also note stage2 P-1 lacks ETA and res64 values in console output and the log. One of the ways to tell in CUDAPm1 that something has gone awry in stage 2 is the res64 values are bad values, repeating, or cycling among very few values.

kriesel 2019-09-05 20:04

[QUOTE=kriesel;525272]Better on v6.7-3-745faae; running multiple test cases from worktodo now on GTX1080Ti.
It's hit an odd case, when a worktodo line is ignored, it gets confused about exponent, reporting a result for an exponent it completed but substituting the [B]10M[/B] exponent value of one it's still running, at the console, in the log, and in [B]results.txt[/B] in the record with the fftlength, B1, and B2 of the [B]24M[/B] exponent of a Mersenne prime used as a high-confidence no-factor case test[/QUOTE]Actually, the 24M exponent NF result appeared with the 24M fft length but the 10M exponent, B1 and B2 values. And it appears to happen whether there was an ignored worktodo line or a normal one.
Stage1 of the current line runs in parallel with the stage2 gcd of the previous line, and can have differing exponent, B1, B2, fft length. Result record and console and log output get a mix of variables when the stage 2 gcd completes; what is output contains the unfortunate blend

currentexponent, previousfftlength, previous factor & factor-state, currentB1, currentB2.

A temporary workaround, I think, is to use command line options instead of worktodo.txt.
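
The blend described above is what one would expect if the background GCD reads shared "current task" state when it finishes instead of the values it started with. A minimal Python sketch of the usual remedy, handing the GCD job its own immutable copy of the task; this only illustrates the idea with toy numbers, is not gpuowl's C++ code, and all names in it are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from math import gcd

@dataclass(frozen=True)
class Task:
    exponent: int
    fft_length: int
    b1: int
    b2: int

def background_gcd(task, value):
    """Runs on a CPU thread and builds the record only from the task it was handed."""
    modulus = (1 << task.exponent) - 1
    g = gcd(value, modulus)
    status = "F" if 1 < g < modulus else "NF"
    return {"exponent": task.exponent, "status": status,
            "fft-length": task.fft_length, "B1": task.b1, "B2": task.b2}

previous = Task(exponent=11, fft_length=4, b1=5, b2=30)   # toy values; 2^11 - 1 = 23 * 89
current = Task(exponent=13, fft_length=8, b1=7, b2=40)    # toy values

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(background_gcd, previous, 23 * 5)
    active = current              # stage 1 of the next task would start here
    record = pending.result()     # still describes the previous task only
print(record)
```

Because the record is built from the frozen task passed at submit time, starting the next worktodo line cannot change the exponent, bounds, or FFT length that get reported.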

Uncwilly 2019-09-05 20:29

[QUOTE=kriesel;525189]Perhaps he and uncwilly could put together a thread for cleaning that up.[/QUOTE]
If Aaron will provide the list, I will tend it over at the DC & TC thread. Or another in the Marin's Mersenne-aries sub.

preda 2019-09-05 21:24

[QUOTE=kriesel;525272]it gets confused about exponent, reporting a result for an exponent it completed but substituting the [B]10M[/B] exponent value of one it's still running[/QUOTE]

Yes, thanks, hopefully fixed now.

AlsXZ 2019-09-06 14:32

Hello! I just started playing with gpuOwL and have some questions. As I understand it, the software is optimized for AMD GPUs. I'm trying to run it on an Nvidia GeForce GTX 1050 Ti on Linux. It works only with -use NO_ASM (without it I get errors), and I got these results for a 92M exponent: ETA 16.5 days and 15590 us/sq. What do you think: is that a reasonable speed for this GPU, or is something going wrong?

kriesel 2019-09-06 14:36

[QUOTE=preda;525286]Yes, thanks, hopefully fixed now.[/QUOTE]
Testing v6.7-4-g278407a P-1, looks promising so far.
Sure would be nice if it also had these features (without breaking anything for PRP of course):
[LIST][*]save files for P-1 & resume from them. Higher exponents or slower gpus can take days or weeks. Probably save under different extensions than for PRP and versus stage.[*]By default, setting P-1 bounds automatically for a fit to gputo72 bounds based on exponent, rather than the current fixed B1 and B2 defaults. See [URL]https://www.mersenneforum.org/showpost.php?p=525112&postcount=1306[/URL] The pdf attachment at [URL]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/URL] gives simple power fits for B1 and B2. Rounding would be good. Rounding up might be good. Exponent-scaled bounds seem straightforward to implement. It would make running P-1 more automatic, less user involvement.[*]less cpu usage on NVIDIA in P-1. See near end of [URL]https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325[/URL] for a description of considerably higher cpu usage on NVIDIA than on AMD in gpuowl P-1, an additional core continuously.. I have no idea why it does that.[*]If there's a constraint on stage 1 or stage 2 feasible exponent, due to program limits or gpu memory or whatever, log to console and log file what the issue is and skip a worktodo item that exceeds the limit and try to continue with the next worktodo entry. It would in some cases be possible to run stage 1 and not stage 2.[/LIST]
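
As an illustration of the exponent-scaled bounds idea, a power law can be fitted on log-log axes and the result rounded up. A minimal Python sketch, fitted to the (exponent, B1, B2) triples that appear in the command lines and worktodo entries posted in this thread; it is not the fit from the pdf attachment, and [c]default_bounds[/c] and the 10000 rounding step are assumptions made for the example:

```python
import math

# (exponent, B1, B2) as used in test cases posted in this thread.
SAMPLES = [
    (4444091, 40000, 480000),
    (6972593, 60000, 780000),
    (10000019, 80000, 1200000),
    (11000159, 100000, 1500000),
    (12000073, 110000, 1650000),
    (13466917, 110000, 1760000),
    (24036583, 220000, 3960000),
    (50001781, 440000, 8360000),
    (51558151, 460000, 8740000),
]

def power_fit(xs, ys):
    """Least-squares fit of y = a * x**b on log-log axes; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    b = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
         / sum((x - mx) ** 2 for x in lx))
    return math.exp(my - b * mx), b

def round_up(value, step=10000):
    """Round up to the next multiple of step."""
    return int(math.ceil(value / step) * step)

EXPONENTS = [s[0] for s in SAMPLES]
FIT_B1 = power_fit(EXPONENTS, [s[1] for s in SAMPLES])
FIT_B2 = power_fit(EXPONENTS, [s[2] for s in SAMPLES])

def default_bounds(p):
    """Hypothetical default: evaluate both fits at p and round up."""
    (a1, e1), (a2, e2) = FIT_B1, FIT_B2
    return round_up(a1 * p ** e1), round_up(a2 * p ** e2)

print(FIT_B1, FIT_B2)
print(default_bounds(24036583))
```

On these samples B1 grows roughly linearly with the exponent and B2 somewhat faster, so a two-parameter fit per bound already tracks the values used here.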

kriesel 2019-09-06 15:09

[QUOTE=AlsXZ;525329]Hello! I just started playing with gpuOwL and have some questions. As I understand it, the software is optimized for AMD GPUs. I'm trying to run it on an Nvidia GeForce GTX 1050 Ti on Linux. It works only with -use NO_ASM (without it I get errors), and I got these results for a 92M exponent: ETA 16.5 days and 15590 us/sq. What do you think: is that a reasonable speed for this GPU, or is something going wrong?[/QUOTE]Times don't sound drastically wrong to me without really checking. I get about 9 ms/iter in CUDALucas on 51M on a GTX 1050 Ti so would expect around 17 ms/iter CUDALucas at 92M. Try some other -use choices, like ORIG_X2. Try different ffts -fft +1 etc. The fastest fft is generally but not always the default, in my experience, on AMD or NVIDIA. See the last attachment at [url]https://www.mersenneforum.org/showpost.php?p=488535&postcount=2[/url] for examples. (My experience is on Windows.) Run a PRP double check to confirm your setup's reliability. Have fun!
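
The arithmetic behind those numbers can be sketched quickly. A PRP test of M(p) takes about p squarings, so the ETA follows from the per-iteration time; the scaling estimate additionally assumes cost grows like p*log(p), which is only a rough FFT rule of thumb and ignores FFT length steps. Both function names are made up for this example:

```python
import math

def eta_days(exponent, us_per_sq):
    """Approximate run time in days: about `exponent` squarings at us_per_sq each."""
    return exponent * us_per_sq * 1e-6 / 86400.0

def scaled_ms(ms_ref, p_ref, p):
    """Scale a per-iteration time from p_ref to p, assuming cost ~ p * log(p)."""
    return ms_ref * (p / p_ref) * (math.log(p) / math.log(p_ref))

print(round(eta_days(92_000_000, 15590), 1))        # reported: ETA 16.5 days
print(round(scaled_ms(9.0, 51_000_000, 92_000_000), 1))
```

15590 us/sq at 92M works out to about 16.6 days, in line with the reported ETA, and scaling 9 ms/iter from 51M lands close to the 17 ms/iter estimate.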

preda 2019-09-06 15:13

[QUOTE=AlsXZ;525329]Hello! I just started playing with gpuOwL and have some questions. As I understand it, the software is optimized for AMD GPUs. I'm trying to run it on an Nvidia GeForce GTX 1050 Ti on Linux. It works only with -use NO_ASM (without it I get errors), and I got these results for a 92M exponent: ETA 16.5 days and 15590 us/sq. What do you think: is that a reasonable speed for this GPU, or is something going wrong?[/QUOTE]

I don't know how fast GTX1050ti is with cudaLucas, that would give a comparison point. GpuOwl is indeed optimized for AMD.

preda 2019-09-06 15:24

[QUOTE=kriesel;525330][LIST][*]save files for P-1 & resume from them. Higher exponents or slower gpus can take days or weeks. Probably save under different extensions than for PRP and versus stage.
- yes, looking into this.
[*]By default, setting P-1 bounds automatically for a fit to gputo72 bounds based on exponent, rather than the current fixed B1 and B2 defaults. See [URL]https://www.mersenneforum.org/showpost.php?p=525112&postcount=1306[/URL] The pdf attachment at [URL]https://www.mersenneforum.org/showpost.php?p=522257&postcount=23[/URL] gives simple power fits for B1 and B2. Rounding would be good. Rounding up might be good. Exponent-scaled bounds seem straightforward to implement. It would make running P-1 more automatic, less user involvement.

- yes it makes sense to provide a sensible default for the bounds
[*]less cpu usage on NVIDIA in P-1. See near end of [URL]https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325[/URL] for a description of considerably higher cpu usage on NVIDIA than on AMD in gpuowl P-1, an additional core continuously.. I have no idea why it does that.

- I have a suspicion why is that: Nvidia chose the "busy-wait" by default for CUDA. i.e. spin a CPU core 100% just to gain a tiny bit of latency when reading events from the GPU. A very wasteful default IMO. Anyway, with CUDA this default can be configured to "non-busy" wait, but I have no idea how to do that for OpenCL-on-Nvidia.
[*]If there's a constraint on stage 1 or stage 2 feasible exponent, due to program limits or gpu memory or whatever, log to console and log file what the issue is and skip a worktodo item that exceeds the limit and try to continue with the next worktodo entry. It would in some cases be possible to run stage 1 and not stage 2.

- a problem is: if that line is not done, it should remain in worktodo.txt. But if it stays in worktodo.txt, GpuOwl will always attempt it first. (i.e. the way worktodo.txt is handled now would need to be extended, and it's not clear to me how)[/LIST][/QUOTE]

answered inline

preda 2019-09-06 15:33

Recent changes
 
Some recent changes to savefiles:

- instead of putting all the .owl files in the current directory, now a sub-folder with the name equal to the exponent (e.g. "95123123") is created per exponent, and the savefiles put in that folder.

So, if upgrading with an ongoing exponent, it would be good to move the savefile to this folder so that GpuOwl finds it and continues, instead of starting from the beginning because the folder is empty.

- the "previous" savefile is now named 95123123-old.owl instead of 95123123-prev.owl
- the "temporary/new" savefile is now 95123123-new.owl instead of 95123123-temp.owl
- there are no persistent checkpoints created anymore (before they were created every 20M iters)
- if the normal checkpoint 95123123.owl can't be loaded, the 95123123-old.owl is automatically attempted

The "folder per exponent" is a bit in preparation for P-1 savefiles.
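
For those upgrading mid-run, the move described above could be scripted. A minimal Python sketch, assuming the savefiles keep the exponent in their names inside the new folder; it moves only the main and "previous" files and leaves any temporary file alone. The function names are hypothetical, and the fallback here checks only that a file exists, whereas GpuOwl falls back when the normal checkpoint can't be loaded:

```python
from pathlib import Path

def migrate_savefiles(workdir, exponent):
    """Move old-layout savefiles into the per-exponent folder, renaming -prev to -old."""
    workdir = Path(workdir)
    folder = workdir / str(exponent)
    folder.mkdir(exist_ok=True)
    renames = {
        f"{exponent}.owl": f"{exponent}.owl",
        f"{exponent}-prev.owl": f"{exponent}-old.owl",
    }
    moved = []
    for old_name, new_name in renames.items():
        source = workdir / old_name
        if source.is_file():
            source.rename(folder / new_name)
            moved.append(new_name)
    return moved

def pick_checkpoint(workdir, exponent):
    """Return the first savefile present, trying the normal one before the -old one."""
    folder = Path(workdir) / str(exponent)
    for name in (f"{exponent}.owl", f"{exponent}-old.owl"):
        candidate = folder / name
        if candidate.is_file():
            return candidate
    return None
```

Usage would be something like [c]migrate_savefiles(".", 95123123)[/c] run once in the gpuowl directory before starting the new version.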

kriesel 2019-09-06 15:38

[QUOTE=preda;525335]answered inline[/QUOTE]
Thanks for the update.

On the last one, gpuowl could convert the problematic worktodo entry to a comment. Then it's still there for the user to refer to and react to, and readily and efficiently skipped next time by gpuowl.
Suppose a line which can be run in stage 1 P-1 but not stage 2 is present.
It could run stage one, then convert the line to a comment.

Or if an exponent is mistyped, to too many digits for gpuowl to handle, comment out that worktodo line and go on to the next instead of halting.

A single leading character is already enough to cause gpuowl to not run a worktodo line.
; or # for example.
Although it does complain a bit.
Mfaktc, mfakto, and CUDAPm1 support as comment identifiers
//
\\
#
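
The comment-out idea could look roughly like this. A minimal Python sketch, assuming the line layout of the PFactor entry quoted earlier in the thread (exponent in the fourth field) and treating a leading //, \\, # or ; as a comment marker; the function names and the wording of the inserted comment are made up, and none of this is gpuowl's actual worktodo handling:

```python
COMMENT_PREFIXES = ("//", "\\\\", "#", ";")   # "\\\\" is two literal backslashes

def is_comment(line):
    """True for blank lines and lines starting with a comment marker."""
    stripped = line.strip()
    return not stripped or stripped.startswith(COMMENT_PREFIXES)

def parse_entry(line):
    """Parse 'B1=..,B2=..;PFactor=0,1,2,<exponent>,...' into a dict."""
    bounds = {}
    body = line.strip()
    if ";" in body:
        prefix, body = body.split(";", 1)
        for item in prefix.split(","):
            key, value = item.split("=")
            bounds[key] = int(value)
    kind, fields = body.split("=", 1)
    return {"kind": kind, "exponent": int(fields.split(",")[3]), **bounds}

def skip_entry(lines, index, reason):
    """Return a copy of lines with entry `index` turned into a comment."""
    out = list(lines)
    out[index] = f"# skipped ({reason}): {lines[index]}"
    return out

worktodo = ["B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2"]
print(parse_entry(worktodo[0]))
print(skip_entry(worktodo, 0, "stage 2 not feasible"))
```

The skipped line stays in the file for the user to see, and the next pass over worktodo.txt steps past it like any other comment.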

kriesel 2019-09-06 15:43

[QUOTE=preda;525336]Some recent changes to savefiles:
...
- there are no persistent checkpoints created anymore (before they were created every 20M iters)
...
The "folder per exponent" is a bit in preparation for P-1 savefiles.[/QUOTE]I'm sorry to see the checkpoints go. If/when a mersenne prime is found with gpuowl, they allow rapid parallel confirmation. Other applications give the user a choice to save them or not.
At what commit or version did this change occur?

Glad to hear you're working toward P-1 save files.

preda 2019-09-06 15:57

[QUOTE=kriesel;525339]I'm sorry to see the checkpoints go. If/when a mersenne prime is found with gpuowl, they allow rapid parallel confirmation. Other applications give the user a choice to save them or not.
At what commit or version did this change occur?
[/QUOTE]

most recent commit now, tagged v6.8: v6.8-0-gab732a0

About prime confirmation, that would most likely be done with LL tests. For rapid confirmation of PRP, the VDF proof system is much better.

PontiacGTX 2019-09-06 16:19

[QUOTE=preda;525193]uint64_t is not a type in OpenCL. Use "unsigned long" instead, which is guaranteed (in OpenCL) to be 64bits.[/QUOTE]
Thanks. I was trying to figure out why I couldn't use unsigned long long in an OpenCL kernel; I didn't know OpenCL uses unsigned long as its 64-bit integer type.

kriesel 2019-09-06 16:29

[QUOTE=preda;525335]answered inline[/QUOTE][QUOTE][COLOR=Blue]less cpu usage on NVIDIA in P-1. See near end of [URL="https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325"]https://www.mersenneforum.org/showpo...postcount=1325[/URL] for a description of considerably higher cpu usage on NVIDIA than on AMD in gpuowl P-1, an additional core continuously.. I have no idea why it does that.[/COLOR]

- I have a suspicion why is that: Nvidia chose the "busy-wait" by default for CUDA. i.e. spin a CPU core 100% just to gain a tiny bit of latency when reading events from the GPU. A very wasteful default IMO. Anyway, with CUDA this default can be configured to "non-busy" wait, but I have no idea how to do that for OpenCL-on-Nvidia.[/QUOTE]Impact varies. On a system capable of and configured for hyperthreading it's not too bad. On a system where hyperthreading is either not implemented or not enabled (such as turning it off in the BIOS), the impact can be considerable. Some of my systems are old processors with fast new gpus. These need to run multiple cores per worker in mprime/prime95, to avoid exponent expiration before completion. It's also often more efficient to run multiple cores per worker; there's only so much cache to go around. Without hyperthreading, cpu intensive activities like P-1 gcd or the spin-wait stop an entire worker, multiple cores, completely, for the duration even while they use only one of those cores.

Re mitigating the 100% core with OpenCL, looks like NVIDIA introduced the issue around driver 270, shrugged and left it up to the developers:
[URL]https://github.com/openmm/openmm/issues/1541[/URL] mentions getting it to 10% or 4%;
[URL]https://devtalk.nvidia.com/default/topic/978985/cuda-programming-and-performance/opencl-busy-wait-still-not-fixed/2[/URL]
[URL]https://devtalk.nvidia.com/default/topic/494659/execute-kernels-without-100-cpu-busy-wait-/[/URL]
I'm generally at driver 378 or above; the latest gpus require some 4xx.
It's always something or other.

kriesel 2019-09-06 16:33

[QUOTE=preda;525342]most recent commit now, tagged v6.8: v6.8-0-gab732a0[/QUOTE]Thanks. Will build and try later. Do you regard 6.7-4 as suitable for general use?[QUOTE]About prime confirmation, that would most likely be done with LL tests. For rapid confirmation of PRP, the VDF proof system is much better.[/QUOTE]It's likely that in the case of a PRP-found prime, both PRP and multiple LL on different software and systems would be done. Historically 3 or more confirmation runs are done by different persons.

Are you planning to also implement VDF?

preda 2019-09-06 21:46

[QUOTE=kriesel;525347]Thanks. Will build and try later. Do you regard 6.7-4 as suitable for general use?[/QUOTE] Yes I think it's fine.

[QUOTE]
Are you planning to also implement VDF?[/QUOTE] I would love to. It's a bigger chunk of work, and also requires wider agreement (on data, data formats) in order to allow implementing independent verifiers. (the verifiers would need to be independent to some degree in order to guarantee "no cheating" by the verifier)

kriesel 2019-09-06 22:41

Gpuowl v6.7-4-278407a Windows build
 
1 Attachment(s)
There's a copy of the help output text and the license along with the stripped executable in the 7z file. The main feature addition is P-1 on NVIDIA. Note there's no P-1 save file capability yet, so any P-1 restart is from the beginning.

kriesel 2019-09-07 00:30

1 Attachment(s)
[QUOTE=preda;525342]most recent commit now, tagged v6.8: v6.8-0-gab732a0
[/QUOTE]
msys2 / MINGW64 make did not like that one at all, producing quite a shower of messages about it.

kriesel 2019-09-07 11:51

On a Win7 Pro x64 system with 12GB of ram, 36GB page file, and gpu with 4GB (GTX1050Ti) gpuowl 6.7-4 produced mad paging for an hour before terminating its attempt to start stage 2 on a 50M P-1 run :[CODE]2019-09-06 23:33:12 50001781 610000 96.08%; 12509 us/sq; ETA 0d 00:05; 73d8a20f091dd815
2019-09-06 23:35:17 50001781 620000 97.66%; 12520 us/sq; ETA 0d 00:03; c2adf07b3ea6c52c
2019-09-06 23:37:22 50001781 630000 99.23%; 12502 us/sq; ETA 0d 00:01; 7fc58198458ecd8a
2019-09-07 00:54:35 Exception gpu_error: OUT_OF_HOST_MEMORY clCreateBuffer at clwrap.cpp:273 makeBuf
_
2019-09-07 00:54:36 Bye[/CODE]cmd line was gpuowl-win -user kriesel -cpu condorette -use ORIG_X2 -device 0, worktodo entry B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2. The same system and gpu have run CUDAPm1 to completion both stages on up to 384M exponents.

preda 2019-09-07 13:38

[QUOTE=kriesel;525362]msys2 / MINGW64 make did not like that one at all, producing quite a shower of messages about it.[/QUOTE]

Please retry (I attempted a fix)

preda 2019-09-07 13:41

[QUOTE=kriesel;525370]On a Win7 Pro x64 system with 12GB of ram, 36GB page file, and gpu with 4GB (GTX1050Ti) gpuowl 6.7-4 produced mad paging for an hour before terminating its attempt to start stage 2 on a 50M P-1 run :[CODE]2019-09-06 23:33:12 50001781 610000 96.08%; 12509 us/sq; ETA 0d 00:05; 73d8a20f091dd815
2019-09-06 23:35:17 50001781 620000 97.66%; 12520 us/sq; ETA 0d 00:03; c2adf07b3ea6c52c
2019-09-06 23:37:22 50001781 630000 99.23%; 12502 us/sq; ETA 0d 00:01; 7fc58198458ecd8a
2019-09-07 00:54:35 Exception gpu_error: OUT_OF_HOST_MEMORY clCreateBuffer at clwrap.cpp:273 makeBuf
_
2019-09-07 00:54:36 Bye[/CODE]cmd line was gpuowl-win -user kriesel -cpu condorette -use ORIG_X2 -device 0, worktodo entry B1=440000,B2=8360000;PFactor=0,1,2,50001781,-1,73,2. The same system and gpu have run CUDAPm1 to completion both stages on up to 384M exponents.[/QUOTE]

Did you not specify -maxAlloc?

kriesel 2019-09-07 14:03

[QUOTE=preda;525378]Did you not specify -maxAlloc?[/QUOTE]Correct, -maxAlloc not used on that run, which went to over 11GB peak working set and hit as high as 8000 page faults/sec. A retry with -maxAlloc 3072 started shortly after the earlier post has made it to stage 2 and so far has peak working set 182MB.

kriesel 2019-09-07 14:21

[QUOTE=preda;525377]Please retry (I attempted a fix)[/QUOTE]Thanks, much better; clean compile on gpuowl v6.8-2-g0f3059b in Win7Pro x64 with msys2/mingw64 (indicated as 20180531 in Control Panel/Programs/Programs and Features; not sure if that indicates any subsequent updates).

AlsXZ 2019-09-08 05:43

Sorry if I chose the wrong forum thread, but can you explain this case: on a 1050 Ti I got nearly 315 GHz-days in 24 hours on a TF task. I got almost the same number of GHz-days for one LL or PRP task, which takes 15-16 days of computing on the same 1050 Ti GPU. So 315 GHz-days takes either 1 day or 16 days. Why such a big difference?

nomead 2019-09-08 06:45

There are some other factors (ahem...) at play too, but the simplest way to put it is that LL/PRP and TF work uses different compute capabilities of the GPU core. TF is mostly 32-bit integer (INT32) while the FFT multiplication in LL/PRP uses mostly 64-bit double floating point (FP64) arithmetic. And in Nvidia consumer cards, the FP64 capability has been deliberately limited to protect their datacenter range of cards. So much so, that for every FP64 operation, the card can do 32 FP32 operations in the same time in parallel. On the Pascal cards (GTX10xx) INT32 is somewhat slower than FP32, so that's why the ratio in work done per day isn't quite 1:32. I don't know the exact figure though. But in the Turing generation of cards (GTX16xx and RTX20xx) INT32 is now as fast as FP32, so the difference between TF and LL work GHz-d/d is several times bigger.

For example, my Ryzen 5 3600 can do more LL work per day in mprime than my RTX 2080 (non-Super) can in CUDALucas. So that's why I only do trial factoring on the card.

This is also the reason why the Radeon VII is such a beast at PRP work, because there, the FP64 to FP32 ratio was only limited to 1:4. On AMD consumer cards in general I think the ratio has mostly been 1:16, please correct me because I'm sure I'm wrong.

kriesel 2019-09-08 16:39

[QUOTE=nomead;525442]On AMD consumer cards in general I think the ratio has mostly been 1:16, please correct me because I'm sure I'm wrong.[/QUOTE]This is why Preda was discouraged back around v3.8 gpuowl, on implementing an NVIDIA compatible gpuowl; GTX10xx correctly seemed slow to him, at 1:32. There is wide variation in the NVIDIA line. Some older NVIDIA gpus have much better DP performance ratios. See [url]https://www.mersenneforum.org/showpost.php?p=490612&postcount=3[/url]

kriesel 2019-09-08 17:41

gpuowl P-1 stage 2 terminology questions
 
What is a block?
What is a round?
What determines how many rounds are required for stage 2?
What Brent Suyama extension exponent is used? Does it vary?

What determines the maximum exponent that gpuowl can complete in P-1 stage 1 or stage 2, other than run time or available fft lengths? (No doubt available gpu ram is a constraint, but without more info, that does not enable computing or predicting max exponent per gpu model based on gpu specifications. "Just try it" is an unsatisfying answer when run times may be weeks or longer, depending on gpu model and exponent)

kriesel 2019-09-08 18:05

relative performance of Gpuowl P-1 and CUDAPm1
 
On the same GTX1080Ti,

CUDAPm1 V0.20, exponent 300M: 1.2 days to complete both stages. (Includes stage 2 gcd time. Empirical fit is time proportional to exponent to the 1.986 power.)

Gpuowl V6.7-4, exponent 298M: 1.068 days to complete both stages. (Not charged for stage 2 gcd time because that occurs in parallel with another exponent's stage 1 run if more work is queued and running from worktodo.txt. Empirical fit is time proportional to exponent to the 1.83 power.)
(Fits are likely to bend upward to ~2.1 power at higher exponent as more data are acquired at higher exponents. That's how it went with LL and PRP applications.)
Extrapolating the gpuowl result to 300M at p[SUP]2[/SUP] would give ~1.082 days, [B]9.8% faster[/B] than CUDAPm1 on the same hardware and exponent.
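
As a rough check of that extrapolation, here is a minimal Python sketch. The helper name scale_time is made up for illustration, and the p^2 scaling is the assumption stated above, not a measured fit.

```python
def scale_time(days, p_from, p_to, power=2.0):
    """Scale a measured run time from exponent p_from to p_to,
    assuming run time is proportional to exponent**power."""
    return days * (p_to / p_from) ** power

gpuowl_300m = scale_time(1.068, 298e6, 300e6)  # gpuowl measured at 298M
cudapm1_300m = 1.2                              # CUDAPm1 measured at 300M
saving = 1.0 - gpuowl_300m / cudapm1_300m       # fraction of run time saved

print(f"gpuowl at 300M: ~{gpuowl_300m:.3f} days")
print(f"run time saved vs CUDAPm1: {saving:.1%}")
```

Here "faster" is taken to mean the fraction of CUDAPm1's run time that is saved.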

Gpuowl may be able to run higher P-1 2-stage exponents than CUDAPm1 on the same hardware. That is under test now.

Gpuowl will not run on a test gpu Quadro 2000 (compute capability 2.1, opencl 1.1/1.2), producing a shower of cl compile errors relating to atomics. I think it requires at least OpenCL 1.2 and therefore a CUDA compute capability above 2.x. [B]An explicit test for opencl version[/B] by gpuowl might be a good thing. ("Gpuowl requires OpenCL 1.2 support for atomics, which this gpu does not support. Exiting now." or some such helpful message.)

kriesel 2019-09-08 20:15

Gpuowl v 6.7 on 4GB GTX1050Ti with -maxAlloc 3072 bailed on stage 2 of a 223M exponent P-1 run, indicating not enough memory. [url]https://www.mersenne.org/M223000051[/url]. CUDAPm1 can run up to 384M on that same gpu.

preda 2019-09-09 12:47

[QUOTE=kriesel;525491]What is a block?
What is a round?
What determines how many rounds are required for stage 2?
What Brent Suyama extension exponent is used? Does it vary?

What determines the maximum exponent that gpuowl can complete in P-1 stage 1 or stage 2, other than run time or available fft lengths? (No doubt available gpu ram is a constraint, but without more info, that does not enable computing or predicting max exponent per gpu model based on gpu specifications. "Just try it" is an unsatisfying answer when run times may be weeks or longer, depending on gpu model and exponent)[/QUOTE]

I'll try to answer some of those questions, but I realize this is a bit obscure.

In stage1 we compute Base = 3^powerSmooth(B1).

the "Brent-Suyama exponent" is E=2, meaning:

In stage2 we multiply with
Base^(a^2) - Base^(b^2) == Base^(b^2) * (Base^(a^2 - b^2) - 1), and
Base^(a^2 - b^2) == Base^((a - b)*(a + b))

We try to cover all the primes in the range [B1, B2] with numbers of the form a-b or a+b as above, where "a" is a multiple of D: a==k*D.

In GpuOwl, D is a primorial D = 2*3*5*7*11*13 == 30030.
When D has such a form, it turns out that all the primes can be covered with values (a+b) or (a-b) where b<D/2 and b is relatively prime to D (i.e. shares none of D's prime factors), so the number of possible values of b needed to cover any prime is:
J = 1*2*4*6*10*12 / 2 == 2880
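
The count can be checked by brute force. The following is only an illustrative Python sketch, not GpuOwl code; the helper name cover is hypothetical. It counts the residues b and shows how a number coprime to D (in particular any prime above 13) is written as k*D + b or k*D - b.

```python
from math import gcd

D = 2 * 3 * 5 * 7 * 11 * 13   # 30030

# Residues b with 0 < b < D/2 that share no prime factor with D.
b_values = [b for b in range(1, D // 2) if gcd(b, D) == 1]
J = len(b_values)
print(J)  # 2880

def cover(p):
    """Return (k, b, sign) with p == k*D + sign*b and 0 < b < D/2.
    Intended for p coprime to D, e.g. any prime larger than 13."""
    k = (p + D // 2) // D      # index of the nearest multiple of D
    r = p - k * D              # signed offset, -D/2 <= r < D/2
    return k, abs(r), (1 if r > 0 else -1)
```

Since b and D-b are either both coprime to D or both not, only half of the phi(D) = 5760 coprime residues are needed, which is where the division by 2 comes from.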

The abstract idea is to precompute all the 2880 values of Base^(b^2), and next iterate k with a=k*D over regions of size D. The range of "k" is given by the need to cover the range [B1,B2] with intervals of size D. Such intervals of size D are called "Blocks", and thus the number of Blocks is roughly equal to (B2 - B1)/30030.

Now, to precompute the 2880 values we'd need 2880 "buffers" (in the PRP sense), which are about 40MB in size each. 2880 * 40MB is a bit too much for the GPU memory, so we divide 2880 into a number of *rounds*.

We compute a number of buffers N that fit in GPU RAM, and N divides 2880. Now Rounds == 2880/N.

So:
- Blocks is (roughly) given by (B2 - B1)/D, (D==30030)
- Rounds is given by: 2880 / (nb. of buffers that fit in RAM)
- E=2 (Brent-Suyama exponent)

The above design works well with large amounts of RAM, where the RAM can fit 200-400 buffers (8GB-16GB with 40MB buffers) or more, such that the number of rounds stays small. If there's too little RAM, let's say enough to fit 4 buffers, then the number of rounds 2880/4 would be too large and a lot of time would be spent just switching from one round to the next. GpuOwl refuses to run stage2 if at least 24 buffers can't be fit (120 rounds).
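
The bookkeeping described above can be sketched in a few lines of Python. This is a simplified illustration of the description only, not GpuOwl's actual code; the function names are hypothetical, and real versions may pick the number of buffers differently.

```python
D = 30030          # block size (primorial)
J = 2880           # number of precomputed values Base^(b^2) needed in total
MIN_BUFFERS = 24   # below this, stage 2 is refused (would exceed 120 rounds)

def rounds_for(buffers_that_fit):
    """Pick the largest N <= buffers_that_fit that divides J.
    Returns (N, rounds) with rounds == J // N."""
    if buffers_that_fit < MIN_BUFFERS:
        raise ValueError("too few buffers for stage 2")
    limit = min(buffers_that_fit, J)
    n = max(d for d in range(1, limit + 1) if J % d == 0)
    return n, J // n

def rough_blocks(b1, b2):
    """Approximate number of size-D blocks needed to cover [b1, b2]."""
    return (b2 - b1) / D

print(rounds_for(24))    # (24, 120)
print(rounds_for(400))   # (360, 8)
print(rough_blocks(2_400_000, 55_200_000))
```

With room for 400 buffers, the largest divisor of 2880 that fits is 360, giving 8 rounds; with only 24 buffers the round count is already 120.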

preda 2019-09-09 12:57

[QUOTE=kriesel;525514]Gpuowl v 6.7 on 4GB GTX1050Ti with -maxAlloc 3072 bailed on stage 2 of a 223M exponent P-1 run, indicating not enough memory. [url]https://www.mersenne.org/M223000051[/url]. CUDAPm1 can run up to 384M on that same gpu.[/QUOTE]

I think this is because of GpuOwl's protection to have at least 24 buffers in stage2; otherwise the number of rounds would be large and wasteful. You could increase the -maxAlloc to e.g. 3900 or 4000, and test with a very low B1=1000 (to not waste time before stage2 starts), maybe it would start stage2. You can estimate the buffer size from the FFT size used (FFT-size * 8).
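
A minimal Python sketch of that estimate, assuming the FFT size is counted in words of 8 bytes each and that -maxAlloc is given in MB. The function names and the 5M-word FFT in the example are hypothetical; the real allocation also has overheads that this ignores.

```python
def buffer_mib(fft_words):
    """Estimated size of one stage-2 buffer in MiB (FFT size * 8 bytes)."""
    return fft_words * 8 / 2**20

def buffers_that_fit(max_alloc_mib, fft_words):
    """Upper bound on the number of whole buffers within max_alloc_mib."""
    return int(max_alloc_mib // buffer_mib(fft_words))

fft_words = 5 * 2**20                       # a 5M-word FFT, for illustration
print(buffer_mib(fft_words))                # 40.0
print(buffers_that_fit(3072, fft_words))    # 76
```

If the resulting count falls below 24, stage 2 would be refused under the protection described above.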

kriesel 2019-09-09 14:27

[QUOTE=preda;525573]I'll try to answer some of those questions, but I realize this is a bit obscure.
...
The above design works well with large amounts of RAM, where the RAM can fit 200-400 buffers (8GB-16GB with 40MB buffers) or more, such that the number of rounds stays small. If there's too little RAM, let's say enough to fit 4 buffers, then the number of rounds 2880/4 would be too large and a lot of time would be spent just switching from one round to the next. GpuOwl refuses to run stage2 if at least 24 buffers can't be fit (120 rounds).[/QUOTE]
Thanks for the detailed explanation.
So, # of relative primes/round ~ 2880/rounds. I've seen at least the following numbers of rounds in stage 2 in initial testing of gpuowl 6.6 & 6.7 P-1: 9 13 18 40 41 45 72 90 92 144 261. 13 41 92 261 don't evenly divide 2880.

On an 11GB GTX1080Ti in gpuowl V6.7-4-278407a I got the following:
[CODE]...
2019-09-06 16:50:52 298000033 3400000 98.20%; 12685 us/sq; ETA 0d 00:13; ce82eeba45fe1874
2019-09-06 16:52:59 298000033 3410000 98.49%; 12680 us/sq; ETA 0d 00:11; 15dcd7efd36a1a41
2019-09-06 16:55:06 298000033 3420000 98.78%; 12678 us/sq; ETA 0d 00:09; 62e9338608a7eaba
2019-09-06 16:57:12 298000033 3430000 99.06%; 12667 us/sq; ETA 0d 00:07; da9050a0970d90ae
2019-09-06 16:59:19 298000033 3440000 99.35%; 12673 us/sq; ETA 0d 00:05; fefb68a0f511ba43
2019-09-06 17:01:26 298000033 3450000 99.64%; 12683 us/sq; ETA 0d 00:03; 9361be5349669b91
2019-09-06 17:03:33 298000033 3460000 99.93%; 12693 us/sq; ETA 0d 00:01; 57dd39e62c29b825
2019-09-06 17:04:04 P-1 stage2 using 11 buffers of 144.0 MB each
2019-09-06 17:04:05 P-1 (B1=2400000, B2=55200000, D=30030): primes 3117147, expanded 3208391, doubles 496457 (left 2156356), singles 2124233, total 2620690 (84%)
2019-09-06 17:04:05 298000033 P-1 stage2: 1759 blocks starting at block 80 (2620690 selected)
2019-09-06 17:07:09 Round 1 of 261: init 1.19 s; 18.34 ms/mul; 9981 muls
2019-09-06 17:10:14 Round 2 of 261: init 1.33 s; 18.34 ms/mul; 10017 muls
2019-09-06 17:13:21 Round 3 of 261: init 1.56 s; 18.28 ms/mul; 10111 muls
2019-09-06 17:16:28 Round 4 of 261: init 1.64 s; 18.23 ms/mul; 10171 muls
...
2019-09-07 06:27:14 Round 260 of 261: init 1.67 s; 18.59 ms/mul; 10046 muls
2019-09-07 06:30:22 Round 261 of 261: init 1.93 s; 18.59 ms/mul; 10008 muls
2019-09-07 06:30:24 406000003 FFT 24576K: Width 256x4, Height 256x4, Middle 12; 16.13 bits/word
2019-09-07 06:30:24 using short carry kernels
2019-09-07 06:30:24 OpenCL args "-DEXP=406000003u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=12u -DWEIGHT_STEP=0xe.974d6f9929278p-3 -DIWEIGHT_STEP=0x8.c5c3982c20ae8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-07 06:30:28
2019-09-07 06:30:28 OpenCL compilation in 3666 ms
2019-09-07 06:30:34 406000003 P-1 starting stage1
2019-09-07 06:33:24 406000003 10000 0.21%; 16863 us/sq; ETA 0d 21:47; 4b5360aace601e72
2019-09-07 06:36:13 406000003 20000 0.43%; 16894 us/sq; ETA 0d 21:47; 6d8775dbd7676169
2019-09-07 06:39:02 406000003 30000 0.64%; 16902 us/sq; ETA 0d 21:45; 0339219ee80c94e3
2019-09-07 06:40:39 298000033 P-1 final GCD: no factor
2019-09-07 06:40:39 {"exponent":"298000033", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.7-4-g278407a"}, "timestamp":"2019-09-07 11:40:39 UTC", "user":"kriesel", "computer":"dodo-gtx1080ti", "aid":"0", "fft-length":18874368, "B1":2400000, "B2":55200000}
2019-09-07 06:41:51 406000003 40000 0.86%; 16904 us/sq; ETA 0d 21:42; e23d26ca2c9c1976
[/CODE]At what version was the 120 round limit instituted?
This example above is a considerable exception to the 120 rounds limit described. It's also seemingly not using much of the gpu ram or the 10240MB maxAlloc. 11 x 144 = 1584MB.
11 buffers x 261 rounds = 2871[B] not 2880[/B]. I'm familiar from running lots of CUDAPm1 with the last round being a smaller "runt" round at times, when # of buffers does not exactly divide the total count. How is that handled in gpuowl?
On a gpuowl V6.7-4-278407a RX550 (4GB; -maxAlloc 3072 I think) run on 100002769 it used 144 rounds.
I have no issue with higher number of rounds, having run CUDAPm1 to 480 or 960 (NRP down to 1); just thought you would want to know.
(Such runs in CUDAPm1 are not normally recommended, because its selection of bounds usually is too low when memory is that limiting. Bounds too low don't retire the primenet task.)

I have gpus with as little as 1GB, but they are not an issue since they're opencl 1.1/1.2 and won't run gpuowl at all due to errors on atomic uint during opencl compile at launch. But I also have 1.5, 2, 2.5, 3, 4, & 5.25 GB gpus. One of the 2GB gpus is an RX550 which can't run CUDAPm1, while many of the other low-ram gpus can.


All times are UTC.