mersenneforum.org

gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2018-11-01 04:48

pthread
 
[QUOTE=tServo;499215]kriesel,

Reading the old thread from Victor you referenced, I'm thinking the correct thread library name might be "libwinpthread" .[/QUOTE]
Both pthread and libwinpthread are present on my system. For what it's worth, make with -pthread seems to have worked in V5.0.

[CODE]
C:\>dir/s libwinpthread*.*

Directory of C:\msys64\mingw64\bin

10/29/2018 01:23 AM 57,829 libwinpthread-1.dll
1 File(s) 57,829 bytes

Directory of C:\msys64\mingw64\share\licenses

10/30/2018 01:00 PM <DIR> libwinpthread
0 File(s) 0 bytes

Directory of C:\msys64\mingw64\x86_64-w64-mingw32\lib

10/29/2018 01:23 AM 69,858 libwinpthread.a
10/29/2018 01:23 AM 88,994 libwinpthread.dll.a
2 File(s) 158,852 bytes

Directory of C:\Users\ken\Documents\clLucas_x64_1.04

09/11/2015 01:38 AM 56,978 libwinpthread-1.dll
1 File(s) 56,978 bytes

Total Files Listed:
4 File(s) 273,659 bytes

C:\>dir/s pthread*.*

Directory of C:\msys64\mingw64\x86_64-w64-mingw32\include

10/29/2018 01:23 AM 34,696 pthread.h
10/29/2018 01:23 AM 3,449 pthread_compat.h
10/29/2018 01:23 AM 1,304 pthread_signal.h
10/29/2018 01:23 AM 2,979 pthread_time.h
10/29/2018 01:23 AM 5,379 pthread_unistd.h
5 File(s) 47,807 bytes

Total Files Listed:
5 File(s) 47,807 bytes[/CODE]

preda 2018-11-01 07:43

[QUOTE=kriesel;499222]As always, documentation.[/QUOTE]
If there are specific documentation needs (or pain points), it's easier for me to address them one-by-one.

[QUOTE]Which versions' save files can be continued with which versions?[/QUOTE]
v5.0 can read its own savefile version (8) and the previous one (7). Because I don't track these neatly in a table, I don't know exactly which gpuowl versions they correspond to. Somebody moving from v3 to v5 would probably be affected, but all they have to do is finish the old exponent with the old version. The "header" of a savefile can be inspected with "head -1 89204567.owl", which prints only the very first line.
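For anyone without "head" at hand (e.g. on plain Windows), the same check can be sketched in Python. This is only an illustration: the function name savefile_header is made up here, and nothing is assumed about what the header line contains, only that it ends at the first newline.

```python
# Minimal sketch of `head -1 <exponent>.owl`: return only the first line.
# `savefile_header` is a hypothetical helper name, not part of gpuowl.

def savefile_header(path):
    """Return the first line of a savefile as text, without the line ending."""
    # Binary mode, because whatever follows the header line may not be text.
    with open(path, "rb") as f:
        first_line = f.readline()
    return first_line.rstrip(b"\r\n").decode("ascii", errors="replace")

# Usage (needs an existing savefile, so it is left commented out):
# print(savefile_header("89204567.owl"))
```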

[QUOTE]Some radix-3 transforms, and maybe 7 if it helps speed.
6M and 12M in particular.
It's a particularly long jump between 20M and 36M, so adding 24M or 32M or both would be good.
Similarly between 40M and 72M, 48M or 64M or both.[/QUOTE]
I'll keep this in mind.

[QUOTE]Nonzero offset, pseudorandom at start time.[/QUOTE]
I'm not convinced of the benefit. There would be significant work involved, and the source code would be more complex. I'd also need to think about the interaction with a "full" base (when B1 is not 0, base != 3); there is a chance that it wouldn't work with a "full" base, or that it would be more expensive.

[QUOTE]A result output for stage one of P-1. There currently is none (at least if both B1 and B2 were specified).[/QUOTE]
Yes, I thought about this myself in the past, but following some discussion it seems that a single "compound" result at the end, encompassing both PRP and P-1, is preferred. I'm neutral on this choice, but I don't see the lack of a separate P-1 first-stage result as a problem, for these reasons:
- the probability of finding a factor there is small, e.g. around 2%.
- IF a factor is found, a result is written on the spot and the task ends.
So all that's missing is the "negative P-1 first stage" result, not a big deal IMO.

[QUOTE]Closer following of spelling and grammar. beginnig -> beginning[/QUOTE]
Thanks, will fix.

[QUOTE]
1 mul but 2 or more muls (justify with a space for the singular to preserve alignment)

Investigate or explain how a mul time in V5.0 can be negative or positive.[/QUOTE]
OK, this is part of how the time for MUL vs. SQ is derived (which was answering my own question about how fast the MUL is). It is somewhat experimental, and will very likely be changed or dropped.

I basically have access to the total time, and I know the number of SQs and MULs that produced that time. Combining multiple such "lines" with different numbers of MULs allows estimating the time ratio between the two.

OTOH if there is time variation from causes independent of the number of SQs/MULs (e.g. GPU throttling), most of this variation will be (wrongly) attributed to the MULs, because there are far fewer of them than SQs, which makes their estimate more "flexible". Anyway, this is an experiment waiting for a proper end.
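To illustrate how such an estimate can come out negative, here is a small sketch. It is not gpuowl's actual code: it assumes the model total = nSQ * tSQ + nMUL * tMUL per log line and does a plain least-squares fit over several lines. The function name fit_sq_mul and all the numbers are invented for the example.

```python
# Sketch: estimate per-SQ and per-MUL time from several log "lines".
# Assumed model (not taken from the gpuowl source):
#     total_ms = n_sq * t_sq + n_mul * t_mul
# All figures below are invented for illustration.

def fit_sq_mul(observations):
    """Least-squares fit of (t_sq, t_mul) from (n_sq, n_mul, total_ms) tuples."""
    a = b = c = d = e = 0.0
    for n_sq, n_mul, total in observations:
        a += n_sq * n_sq
        b += n_sq * n_mul
        c += n_mul * n_mul
        d += n_sq * total
        e += n_mul * total
    det = a * c - b * b  # determinant of the 2x2 normal equations
    if det == 0:
        raise ValueError("need lines with different SQ/MUL mixes")
    return (d * c - b * e) / det, (a * e - b * d) / det

# Three lines generated with t_sq = 0.5 ms and t_mul = 0.6 ms, no noise.
clean = [(10000, 20, 5012.0), (10000, 35, 5021.0), (10000, 50, 5030.0)]

# Same lines, but the first one lost 30 ms to something unrelated
# to the SQ/MUL counts (e.g. throttling).
throttled = [(10000, 20, 5042.0), (10000, 35, 5021.0), (10000, 50, 5030.0)]

print("clean:     t_sq=%.4f  t_mul=%.4f" % fit_sq_mul(clean))
print("throttled: t_sq=%.4f  t_mul=%.4f" % fit_sq_mul(throttled))
```

In the throttled case the SQ estimate barely moves while the MUL estimate absorbs almost all of the disturbance and turns negative, which matches the behaviour described above.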

preda 2018-11-01 07:50

[QUOTE=kriesel;499222]
Investigate or explain how a mul time in V5.0 can be negative or positive.[/QUOTE]
But, the big news here is that you are able to run v5.0! :)
(I'm glad the compilation finally worked)

preda 2018-11-01 08:16

PRP-1 validation
 
1 Attachment(s)
For anybody wanting to experiment with PRP-1, I would recommend running a couple of validation runs before starting serious work.

Validation consists of taking an exponent with known factors (maybe somebody has links to such lists on the forum?). Usually this would be an exponent that was factored with P-1, but a TF factor could work as well.

For the known factor, work out a B1/B2 pair that covers it, create a PRP-1 assignment from that, run it, and verify that it finds the expected factor.

There can be variations, such as:
- test "first stage" only (B1 covers the factor), or first+second stage.
- do multiple stops/restarts: does it still find the factor?
etc.

I attach a table of P-1 factors; I don't remember where I found it, but it most likely was posted by James somewhere on the forum.
An example from that table:

86014009,262147231459344118478999,78,4967,78167

This means the factor can be covered with B1=4967 and B2=78167, and any larger bounds should work as well. (The third field, 78, appears to be the size of the factor in bits.)

A PRP-1 validation assignment could be:

B1=20000,B2=100000;86014009

OR, testing first-stage:
B1=80000;86014009
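Picking bounds for a known factor can be sketched as follows. This is an illustration under stated assumptions, not gpuowl code: the helper names (factorize, sufficient_bounds, covers, assignment) are made up, q is assumed to be a prime factor of 2^p - 1, and the model is the usual one where stage 1 needs every prime power of k = (q-1)/(2p) to be at most B1 and stage 2 allows one extra prime up to B2. Only the assignment string format is taken from the examples above.

```python
# Sketch: which B1/B2 are enough for P-1 to find a known factor q of 2^p - 1?
# Assumptions: q is prime; stage 1 covers all prime powers <= B1 (plus 2*p),
# stage 2 covers one additional prime in (B1, B2]. Helper names are hypothetical.

def factorize(n):
    """Trial-division factorization: returns [(prime, exponent), ...]."""
    out = []
    d = 2
    while d * d <= n:
        if n % d == 0:
            e = 0
            while n % d == 0:
                n //= d
                e += 1
            out.append((d, e))
        d += 1
    if n > 1:
        out.append((n, 1))  # leftover prime
    return out

def sufficient_bounds(p, q):
    """Return (B1, B2) sufficient to find q, or None if q does not divide 2^p - 1."""
    if pow(2, p, q) != 1 or (q - 1) % (2 * p) != 0:
        return None
    k = (q - 1) // (2 * p)
    powers = sorted((d ** e, e) for d, e in factorize(k))
    if not powers:
        return (1, 1)
    top, top_exp = powers[-1]
    if top_exp > 1:
        return (top, top)  # largest piece is a prime power: stage 1 must reach it
    b1 = powers[-2][0] if len(powers) > 1 else 1
    return (b1, top)

def covers(p, q, b1, b2=None):
    """True if the given bounds are enough under the model above."""
    need = sufficient_bounds(p, q)
    if need is None:
        return False
    b2 = b1 if b2 is None else max(b1, b2)
    return need[0] <= b1 and need[1] <= b2

def assignment(p, b1, b2=None):
    """Format a PRP-1 assignment line as in the examples above."""
    bounds = "B1=%d" % b1 if b2 is None else "B1=%d,B2=%d" % (b1, b2)
    return "%s;%d" % (bounds, p)

# The example row from the table:
p, q = 86014009, 262147231459344118478999
print(sufficient_bounds(p, q))
print(assignment(p, 20000, 100000), covers(p, q, 20000, 100000))
print(assignment(p, 80000), covers(p, q, 80000))
```

The bounds returned are sufficient rather than strictly minimal, because the order of the base modulo q can be a proper divisor of q-1.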

SELROC 2018-11-01 09:12

2 Attachment(s)
[QUOTE=preda;499232]For anybody wanting to experiment with PRP-1, I would recommend running a couple of validation runs before starting serious work.[/QUOTE]

I just did a couple of quick tests on v5.0, and I am a bit confused.

2^859433 - 1 is a known Mersenne prime:
[URL]https://www.mersenne.org/report_exponent/?exp_lo=859433[/URL]

*Test1 on amdgpu-pro:
[URL]https://www.mersenneforum.org/attachment.php?attachmentid=19201&stc=1&d=1541063294[/URL]

*Test2 on rocm:
[URL]https://www.mersenneforum.org/attachment.php?attachmentid=19202&stc=1&d=1541063326[/URL]

The result is "C" (composite) in both cases.

BTW, in this case rocm is slower than amdgpu-pro. Rocm is faster on large exponents.

preda 2018-11-01 11:32

Thanks, that's a genuine error. I'll fix ASAP (24h).

The final residue shows the computation is fine, it's just the logic for deciding prime/not-prime at the end that's broken. Will fix.

[QUOTE=SELROC;499235]I just did a couple of quick tests on v5.0, I am a bit confused.[/QUOTE]

SELROC 2018-11-01 12:08

[QUOTE=preda;499241]Thanks, that's a genuine error. I'll fix ASAP (24h).

The final residue shows the computation is fine, it's just the logic for deciding prime/not-prime at the end that's broken. Will fix.[/QUOTE]

The following log line shows a difference between the exponent and the iteration count: 859433 vs. 859600

2018-11-01 09:22:55 0 859433 10000/859600 [ 1.16%], 0.49 ms/it; ETA 0d 00:07; 21bc9a2e362200a7

Is that normal?

kriesel 2018-11-01 12:21

[QUOTE=preda;499232]For anybody wanting to experiment with PRP-1, I would recommend running a couple of validation runs before starting serious work.[/QUOTE]
From my draft rewrite of the CUDAPm1 readme file, a list over a wider exponent range:

[CODE] Run CUDAPm1 on some exponents with known factors that should be found, and
see whether you find them. Easiest way is to select from the following list,
exponents at or near the size you plan to run, and put them in the worktodo
file. The bounds necessary to find factors vary by exponent. CUDAPm1's
automatic parameter selection will be enough to find most but not all.

 Exponent    Min B1       Min B2  FFT length  Notes
  4444091         7        2,557        256K
 50001781    94,709    4,067,587       2688K
 51558151     5,953    2,034,041       2880K
 54447193     1,181      682,009       3072K
 58610467    70,843      694,201       3200K
 61012769    10,273    1,572,097       3360K
 81229789     6,709   11,282,221       4704K
100000081     1,289    7,554,653       5600K
120002191     1,563    3,109,391       7168K
150000713    15,131    2,294,519       8640K
200000183       953    1,138,061      11200K
200001187   204,983      207,821      11200K
200003173     4,651      229,813      11200K
249500221         4   2.58951e+9      14336K  big bounds, much memory & time
249500501       307      167,381      14336K
290001377     2,551   34,354,769      16384K  takes days[/CODE]

kriesel 2018-11-01 13:03

[QUOTE=preda;499228]If there are specific documentation needs (or pain points), it's easier for me to address them one-by-one.

v5.0 can read its own savefile version (8) and the previous (7). Because I don't track these neatly in a table, I don't know exactly to which version they correspond. Probably somebody moving from v3 to v5 would be affected, but all he has to do would be to finish the old exponent with the old version. The "header" of a savefile can be explored with "head -1 89204567.owl" which prints the very first line only.

(re variable offset)
I'm not convinced of the benefit. There would be significant work involved, and the source-code would be more complex. I'd need to think about the interaction with a "full" base (when B1 is not 0, base != 3), there is a chance that it wouldn't work with a "full" base, or that it would be more expensive.

(re B1 no factor result line)
Yes, I was thinking about this myself in the past, but, and following some discussion, it seems that a single "compound" result at the end, encompassing both PRP & P-1, is preferred. I'm neutral on this choice, but I don't see not-having a separate P-1-first-stage result as a problem, for this reason:
- the probability of finding a factor there is small, e.g. around 2%.
- IF a factor is found, there will be a result written on the spot and the task ends.
So all there's missing are "negative P-1 first stage", not a big deal IMO.

OK, this is part of how the time for MUL vs. SQ is derived (which was answering my own question about how fast is the MUL). It is somewhat experimental, and will very likely be changed or dropped.

I basically have access to the total time, and I know the number of SQs and MULs that produced that time. Combining multiple such "lines" with different MULs allows to estimate the time ratio between the two.

OTOH if there is time variation from causes independent of the number of SQ/MUL (e.g. from the GPU throttling), most of this variation will be (wrongly) allocated to the MULs because they are much fewer than the SQ thus more "flexible". Anyway, this is an experiment waiting for a proper end.[/QUOTE]
Having the documentation scattered across dozens or hundreds of posts in the thread is a chronic pain point. I encourage following the CUDALucas model of a fairly comprehensive readme.txt, and updating it as regularly as the 0.1 releases are made. Even stating what's not known (e.g. that exponent limits per FFT length are uncertain) is useful.
A table for save file compatibility, even if sparsely populated, would be ideal. I'm still running V1.9 a bit. V3.8 can continue from V1.9's save files in my experience.
A list of which code versions write which save file version would be a good start.

TF availability documentation had been a problem.
Save file format description would be useful for the occasional coder.

Documentation that distributes with the code is best.
Documentation is like code for the user.
Re the details of spelling and grammar and formatting, I'd be willing to work with you on it.

Re a "B1, no factor found" result line: that would be useful in the case where a run is terminated after B1, without performing B2 or PRP. I just finished B1 to 10^6 on p=48500017, which already has two PRPs done, one of them by me with a previous version of gpuowl (zero offset), before PRP-1 capability existed. PRP-1 could be used to do stage-one P-1 only, on AMD GPUs with OpenCL.

Supporting nonzero offset is useful in that it ensures PRP tests are useful without the user having to check for a previous gpuowl or other zero-offset run. And there may be a zero-offset run under way in a previous version of gpuowl. Storing the same way (in offset- and transform-independent form) is useful, although separately recording the offset and continuing to completion with the same offset as was used earlier could also be useful.

Running a full PRP-1 just to get B2 done seems not worth it, especially since it will duplicate a zero-offset PRP. Have you considered a P-1 only mode or version? There are people running P-1 deeper on Mersenne exponents that have already been primality tested once or twice.

kriesel 2018-11-01 13:10

[QUOTE=SELROC;499242]The following log line shows a difference between exponent and iterations number: 859433 vs. 859600

2018-11-01 09:22:55 0 859433 10000/859600 [ 1.16%], 0.49 ms/it; ETA 0d 00:07; 21bc9a2e362200a7

is that normal ?[/QUOTE]

I think so. As I understand it, the computation needs to be carried past iteration p, to the next multiple of the block size, so that the final error check can be done against it. Yes, it looks odd, because in earlier versions it was displayed differently.
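As a quick check of that explanation, here is a sketch of the rounding. The block size is an assumption on my part: both 200 and 400 reproduce the 859600 from the log line, so the log alone doesn't say which one V5.0 uses. The function name is made up.

```python
# Sketch: round the exponent up to the next multiple of the check block size.
# The block sizes tried here (200, 400) are assumptions, not read from the source.

def iterations_to_run(exponent, block_size):
    """Smallest multiple of block_size that is >= exponent."""
    return -(-exponent // block_size) * block_size  # ceiling division

for block in (200, 400):
    print(block, iterations_to_run(859433, block))

# Progress figure as shown in the log line (10000 of 859600 iterations done):
print("%.2f%%" % (100.0 * 10000 / iterations_to_run(859433, 400)))
```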

Good catch on the C on a prime. We should check a few known primes, probably at every major release if not minor release. It's also an argument for double checking with different software or at least different offset, every exponent.

A list of tests to be done on every major release (which would necessarily get updated as program features change) would be a good thing. I wonder what Preda's testing consists of.

kriesel 2018-11-01 13:27

V5.0 crashes on too-small exponent
 
I suggest bounds-checking the worktodo exponent value and, if it is out of bounds, issuing a polite message and log entry, skipping that worktodo entry, and continuing with the next one.

Also, documentation of bounds would be good.
[CODE]C:\msys64\home\ken\gpuowl-compile\v5.0>openowl.exe -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:22:21 gpuowl 5.0-f604bb1
2018-11-01 08:22:21 condorella-rx480 -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:22:21 condorella-rx480 107 FFT 512K: Width 64x8, Height 64x8; 0.00 bits/word
2018-11-01 08:22:21 condorella-rx480 using long carry kernels
2018-11-01 08:22:22 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2018-11-01 08:22:25 condorella-rx480 OpenCL compilation in 3291 ms, with "-DEXP=107u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2018-11-01 08:22:25 condorella-rx480 107.owl not found, starting from the beginnig.
Assertion failed!

Program: C:\msys64\home\ken\gpuowl-compile\v5.0\openowl.exe
File: state.cpp, Line 124

Expression: w >= 0 && w < (1 << len)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.[/CODE]
It also crashes on assorted other exponents, up to at least 216091.
[CODE]C:\msys64\home\ken\gpuowl-compile\v5.0>openowl.exe -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:43:30 gpuowl 5.0-f604bb1
2018-11-01 08:43:30 condorella-rx480 -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:43:30 condorella-rx480 216091 FFT 512K: Width 64x8, Height 64x8; 0.41 bits/word
2018-11-01 08:43:30 condorella-rx480 using long carry kernels
2018-11-01 08:43:31 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2018-11-01 08:43:34 condorella-rx480 OpenCL compilation in 3322 ms, with "-DEXP=216091u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2018-11-01 08:43:34 condorella-rx480 216091.owl not found, starting from the beginnig.
2018-11-01 08:43:34 condorella-rx480 powerSmooth(216091, 2000) has 2916 bits
Assertion failed!

Program: C:\msys64\home\ken\gpuowl-compile\v5.0\openowl.exe
File: state.cpp, Line 24

Expression: 0 <= w && w < (1 << nBits)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
C:\msys64\home\ken\gpuowl-compile\v5.0>[/CODE]
It makes sense that it would fail, since that is less than 1 bit/word. Enforcing some liberal bounds would be a good feature.
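The kind of check I have in mind could look like the sketch below. It is only an illustration: the function names are made up, the 512K FFT size is taken from the log above, and the 1 bit/word lower limit is my guess at a liberal bound, not a documented gpuowl limit.

```python
# Sketch of a worktodo bounds check: skip exponents that give too few bits/word.
# FFT size 512K is from the log above; MIN_BITS_PER_WORD is an assumed limit.

FFT_WORDS = 512 * 1024
MIN_BITS_PER_WORD = 1.0

def bits_per_word(exponent, fft_words=FFT_WORDS):
    """Average number of exponent bits carried per FFT word."""
    return exponent / fft_words

def filter_worktodo(exponents, fft_words=FFT_WORDS, min_bits=MIN_BITS_PER_WORD):
    """Return the exponents within bounds; report and skip the others."""
    accepted = []
    for exponent in exponents:
        bpw = bits_per_word(exponent, fft_words)
        if bpw < min_bits:
            print("skipping %d: %.2f bits/word is below the minimum of %.2f"
                  % (exponent, bpw, min_bits))
            continue
        accepted.append(exponent)
    return accepted

print(filter_worktodo([107, 216091, 859433]))
```

The two skipped exponents give 0.00 and 0.41 bits/word, matching the figures in the logs above.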


All times are UTC.