pthread
[QUOTE=tServo;499215]kriesel,
Reading the old thread from Victor you referenced, I'm thinking the correct thread library name might be "libwinpthread".[/QUOTE]
Both pthread and libwinpthread are present on my system. For what it's worth, make with -pthread seems to have worked in v5.0.
[CODE]C:\>dir/s libwinpthread*.*

 Directory of C:\msys64\mingw64\bin

10/29/2018  01:23 AM            57,829 libwinpthread-1.dll
               1 File(s)         57,829 bytes

 Directory of C:\msys64\mingw64\share\licenses

10/30/2018  01:00 PM    <DIR>          libwinpthread
               0 File(s)              0 bytes

 Directory of C:\msys64\mingw64\x86_64-w64-mingw32\lib

10/29/2018  01:23 AM            69,858 libwinpthread.a
10/29/2018  01:23 AM            88,994 libwinpthread.dll.a
               2 File(s)        158,852 bytes

 Directory of C:\Users\ken\Documents\clLucas_x64_1.04

09/11/2015  01:38 AM            56,978 libwinpthread-1.dll
               1 File(s)         56,978 bytes

     Total Files Listed:
               4 File(s)        273,659 bytes

C:\>dir/s pthread*.*

 Directory of C:\msys64\mingw64\x86_64-w64-mingw32\include

10/29/2018  01:23 AM            34,696 pthread.h
10/29/2018  01:23 AM             3,449 pthread_compat.h
10/29/2018  01:23 AM             1,304 pthread_signal.h
10/29/2018  01:23 AM             2,979 pthread_time.h
10/29/2018  01:23 AM             5,379 pthread_unistd.h
               5 File(s)         47,807 bytes

     Total Files Listed:
               5 File(s)         47,807 bytes[/CODE]
[QUOTE=kriesel;499222]As always, documentation.[/QUOTE]
If there are specific documentation needs (or pain points), it's easier for me to address them one-by-one.

[QUOTE]Which versions' save files can be continued with which versions?[/QUOTE]
v5.0 can read its own savefile version (8) and the previous one (7). Because I don't track these neatly in a table, I don't know exactly which releases they correspond to. Probably somebody moving from v3 to v5 would be affected, but all they'd have to do is finish the old exponent with the old version. The "header" of a savefile can be inspected with "head -1 89204567.owl", which prints only the very first line.

[QUOTE]Some radix-3 transforms, and maybe 7 if it helps speed. 6M and 12M in particular. It's a particularly long jump between 20M and 36M, so adding 24M or 32M or both would be good. Similarly between 40M and 72M, 48M or 64M or both.[/QUOTE]
I'll keep this in mind.

[QUOTE]Nonzero offset, pseudorandom at start time.[/QUOTE]
I'm not convinced of the benefit. There would be significant work involved, and the source code would become more complex. I'd also need to think about the interaction with a "full" base (when B1 is not 0, base != 3); there is a chance that it wouldn't work with a "full" base, or that it would be more expensive.

[QUOTE]A result output for stage one of P-1. There currently is none (at least if both B1 and B2 were specified).[/QUOTE]
Yes, I was thinking about this myself in the past, but following some discussion it seems that a single "compound" result at the end, encompassing both PRP and P-1, is preferred. I'm neutral on this choice, but I don't see the lack of a separate P-1 first-stage result as a problem, for these reasons:
- the probability of finding a factor there is small, e.g. around 2%.
- IF a factor is found, a result is written on the spot and the task ends.
So all that's missing are "negative P-1 first stage" results, not a big deal IMO.

[QUOTE]Closer following of spelling and grammar. beginnig -> beginning[/QUOTE]
Thanks, will fix.
[QUOTE]1 mul but 2 or more muls (justify with a space for the singular to preserve alignment)

Investigate or explain how a mul time in V5.0 can be negative or positive.[/QUOTE]
OK, this is part of how the time for MUL vs. SQ is derived (which was answering my own question about how fast the MUL is). It is somewhat experimental, and will very likely be changed or dropped. Basically I have access to the total time, and I know the number of SQs and MULs that produced that time. Combining multiple such "lines" with different MUL counts allows estimating the time ratio between the two. OTOH, if there is time variation from causes independent of the number of SQs/MULs (e.g. from the GPU throttling), most of this variation will be (wrongly) attributed to the MULs, because they are much fewer than the SQs and thus more "flexible". Anyway, this is an experiment waiting for a proper end.
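The estimation described above can be sketched as a two-parameter least-squares fit (this is an illustration with hypothetical numbers, not gpuowl's actual code): each progress "line" contributes a total time together with the counts of squarings (SQ) and multiplications (MUL) it covers, and combining several lines separates the two per-operation costs.

```python
# Separate per-SQ and per-MUL times from aggregate timings via least
# squares (normal equations for the model t = n_sq*t_sq + n_mul*t_mul).
# The observations below are hypothetical: (num_SQ, num_MUL, total_ms).
lines = [
    (10000, 12, 4912.0),
    (10000,  3, 4903.0),
    (10000, 25, 4925.0),
    (10000,  0, 4900.0),
]

s11 = sum(sq * sq for sq, mul, t in lines)
s12 = sum(sq * mul for sq, mul, t in lines)
s22 = sum(mul * mul for sq, mul, t in lines)
b1 = sum(sq * t for sq, mul, t in lines)
b2 = sum(mul * t for sq, mul, t in lines)

# Cramer's rule on the 2x2 normal-equation system.
det = s11 * s22 - s12 * s12
t_sq = (s22 * b1 - s12 * b2) / det
t_mul = (s11 * b2 - s12 * b1) / det
print(f"SQ ~ {t_sq:.4f} ms, MUL ~ {t_mul:.4f} ms")
```

Because the MUL counts are tiny compared to the SQ counts, any unmodeled noise (e.g. GPU throttling) gets absorbed mostly by t_mul, which matches the observation above that the derived MUL time can even go negative.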
[QUOTE=kriesel;499222]
Investigate or explain how a mul time in V5.0 can be negative or positive.[/QUOTE]
But the big news here is that you are able to run v5.0! :) (I'm glad the compilation finally worked)
PRP-1 validation
1 Attachment(s)
For anybody wanting to experiment with PRP-1, I would recommend running a couple of validation runs before starting serious work.
Validation would consist of taking an exponent with known factors (maybe somebody has links to such lists on the forum?); usually this would be an exponent that was factored with P-1, but it could be TF as well. For the known factor, work out a pair B1/B2 that would cover it, create a PRP-1 assignment from that, run it, and verify that it does find the expected factor. There can be variations, such as:
- test "first stage" only (B1 covers the factor), or first+second stage.
- do multiple stop/restarts; does it still find the factor? etc.

I attach a table of P-1 factors; I don't remember where I found it, but it most likely was posted by James somewhere on the forum. An example from that table:

86014009,262147231459344118478999,78,4967,78167

means this factor can be covered with B1=4967 and B2=78167, but any values larger than that should work. A PRP-1 validation assignment could be:

B1=20000,B2=100000;86014009

OR, testing first stage only:

B1=80000;86014009
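The "covered by B1/B2" condition can be checked directly, without running the program: every factor f of 2^p-1 has the form f = 2kp+1, and P-1 finds it when k factors into primes at most B1 with at most one prime in (B1, B2]. A plain-Python sketch (independent of gpuowl) using the table row quoted above:

```python
# Check that the example factor from the table is covered by the listed
# bounds. Any factor f of 2^p - 1 satisfies f = 2*k*p + 1; stage 1
# covers the prime factors of k up to B1, stage 2 one extra prime up to B2.
p = 86014009
f = 262147231459344118478999
B1, B2 = 4967, 78167

assert (f - 1) % (2 * p) == 0  # guaranteed for factors of Mersenne numbers
k = (f - 1) // (2 * p)

# Trial-divide k by everything up to B2 (small enough to be fast here).
factors = []
d = 2
while d <= B2 and d * d <= k:
    while k % d == 0:
        factors.append(d)
        k //= d
    d += 1
if 1 < k <= B2:  # the leftover cofactor, if any, is prime at this point
    factors.append(k)
    k = 1

assert k == 1                     # k is built entirely from primes <= B2
assert sorted(factors)[-2] <= B1  # ...and at most one of them exceeds B1
print(sorted(factors))
```

If the assertions hold, a run with bounds at or above (B1, B2) should report the factor; that is exactly what the validation run then confirms end to end.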
2 Attachment(s)
[QUOTE=preda;499232]For anybody wanting to experiment with PRP-1, I would recommend running a couple of validation runs before starting serious work.
Validation would consist of taking an exponent with known factors [...][/QUOTE]
I just did a couple of quick tests on v5.0, and I am a bit confused. 859433 is a prime: [URL]https://www.mersenne.org/report_exponent/?exp_lo=859433[/URL]

*Test1 on amdgpu-pro: [URL]https://www.mersenneforum.org/attachment.php?attachmentid=19201&stc=1&d=1541063294[/URL]
*Test2 on rocm: [URL]https://www.mersenneforum.org/attachment.php?attachmentid=19202&stc=1&d=1541063326[/URL]

The result is "C" in both cases. BTW, in this case rocm is slower than amdgpu-pro. Rocm is faster on large exponents.
Thanks, that's a genuine error. I'll fix ASAP (24h).
The final residue shows the computation is fine; it's just the logic for deciding prime/not-prime at the end that's broken. Will fix.
[QUOTE=SELROC;499235]I just did a couple of quick tests on v5.0, I am a bit confused. 859433 is a prime. [...][/QUOTE]
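For reference, the prime/not-prime decision in question reduces to a single comparison: a base-3 Fermat PRP test declares Mp a probable prime when 3^(2^p) ≡ 9 (mod Mp), since 3^(Mp+1) = 9·3^(Mp-1) and Fermat gives 3^(Mp-1) ≡ 1 for prime Mp. A toy sketch with small exponents (illustrative only, not gpuowl's code):

```python
# Base-3 PRP test for Mersenne numbers: square 3 repeatedly p times
# (i.e. compute 3^(2^p) mod Mp); a final residue of 9 indicates a
# probable prime.
def is_prp3(p: int) -> bool:
    m = (1 << p) - 1
    x = 3
    for _ in range(p):
        x = x * x % m
    return x == 9

print(is_prp3(7))   # M7 = 127 is prime
print(is_prp3(11))  # M11 = 2047 = 23 * 89 is composite
```

So a correct final residue with a wrong verdict, as seen here, means the squaring chain is fine and only this last equality check (or its reporting) is at fault.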
[QUOTE=preda;499241]Thanks, that's a genuine error. I'll fix ASAP (24h).
The final residue shows the computation is fine, it's just the logic for deciding prime/not-prime at the end that's broken. Will fix.[/QUOTE]
The following log line shows a difference between the exponent and the iterations number: 859433 vs. 859600.
[CODE]2018-11-01 09:22:55 0 859433 10000/859600 [ 1.16%], 0.49 ms/it; ETA 0d 00:07; 21bc9a2e362200a7[/CODE]
Is that normal?
[QUOTE=preda;499232]For anybody wanting to experiment with PRP-1, I would recommend running a couple of validation runs before starting serious work.
Validation would consist of taking an exponent with known factors [...][/QUOTE]
From my draft rewrite of the CUDAPm1 readme file, a list over a wider exponent range:
[CODE]Run CUDAPm1 on some exponents with known factors that should be found, and see whether you find them. The easiest way is to select from the following list exponents at or near the size you plan to run, and put them in the worktodo file. The bounds necessary to find the factors vary by exponent. CUDAPm1's automatic parameter selection will be enough to find most but not all.
Exponent     Min B1      Min B2  fft length  notes
  4444091         7       2,557        256K
 50001781    94,709   4,067,587       2688K
 51558151     5,953   2,034,041       2880K
 54447193     1,181     682,009       3072K
 58610467    70,843     694,201       3200K
 61012769    10,273   1,572,097       3360K
 81229789     6,709  11,282,221       4704K
100000081     1,289   7,554,653       5600K
120002191     1,563   3,109,391       7168K
150000713    15,131   2,294,519       8640K
200000183       953   1,138,061      11200K
200001187   204,983     207,821      11200K
200003173     4,651     229,813      11200K
249500221         4  2.58951e+9      14336K  big bounds, much memory & time
249500501       307     167,381      14336K
290001377     2,551  34,354,769      16384K  takes days[/CODE]
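Rows like the ones above can be turned into validation worktodo entries mechanically, using the `B1=...,B2=...;exponent` assignment syntax shown earlier in the thread. A sketch; the margin factor is my own choice, not something from either program:

```python
# Build a PRP-1 validation worktodo line from a table row of
# (exponent, min_B1, min_B2), padding the minimal bounds with a margin
# so the run comfortably covers the known factor.
def validation_line(exponent: int, min_b1: int, min_b2: int,
                    margin: float = 1.5) -> str:
    b1 = int(min_b1 * margin)
    b2 = int(min_b2 * margin)
    return f"B1={b1},B2={b2};{exponent}"

print(validation_line(4444091, 7, 2557))
```

Any bounds at or above the table's minima should work, so the exact margin is not critical; a larger margin just costs a little extra runtime.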
[QUOTE=preda;499228]If there are specific documentation needs (or pain points), it's easier for me to address them one-by-one.
v5.0 can read its own savefile version (8) and the previous (7). [...]

Anyway, this is an experiment waiting for a proper end.[/QUOTE]
Having the documentation scattered over dozens or hundreds of posts in a thread is a chronic pain point. I encourage following the CUDALucas model of a fairly comprehensive readme.txt, updated as regularly as the 0.1 releases are made. Even stating what's not known (e.g. that exponent limits per fft length are uncertain) is useful.

A table of save file compatibility, even if sparsely populated, would be ideal. I'm still running v1.9 a bit; v3.8 can continue from v1.9's save files in my experience. A list of which code versions make which save file version would be a good start. TF availability documentation has been a problem. A save file format description would be useful for the occasional coder. Documentation that is distributed with the code is best. Documentation is like code for the user. Re the details of spelling, grammar and formatting, I'd be willing to work with you on it.

Re a B1 no-factor-found result line: that would be useful in the case where a run is terminated after B1, without performing B2 or PRP. I just finished B1 to 10^6 on p=48500017, which already has two PRPs done, one of them by me with a previous version of gpuowl, before PRP-1 capability existed, with zero offset. PRP-1 could be used to do stage-one P-1 only, on AMD gpus with OpenCL.

Supporting nonzero offset is useful in that it ensures PRP tests are useful without the user having to check for a previous gpuowl (or other zero-offset) run. And there may be a zero-offset run under way in a previous version of gpuowl. Storing the residue the same way (in offset- and transform-independent form) is useful, although separately recording the offset, and continuing to completion with the same offset as was used earlier, could be useful too.

Running a full PRP-1 just to get B2 done seems not worth it, especially since it will duplicate a zero-offset PRP. Have you considered a P-1 only mode or version?
There are people running P-1 deeper on Mersenne exponents that have already been primality tested once or twice.
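What a stage-1-only P-1 mode computes can be sketched in a few lines (a toy illustration, not gpuowl's code): raise 3 to 2·p times the product of all prime powers up to B1, then take a gcd with Mp. With the small exponent p=29, whose factor 233 = 2·4·29+1 has a 2-smooth k:

```python
# Toy P-1 stage 1 on M29 = 2^29 - 1 = 233 * 1103 * 2089.
# Stage 1 computes x = 3^(2*p*E) mod Mp, where E is the product of all
# maximal prime powers <= B1, then checks gcd(x - 1, Mp) for a factor.
from math import gcd

def stage1_exponent(b1: int) -> int:
    """Product of maximal prime powers q^e <= b1 (naive; fine for tiny b1)."""
    e = 1
    for q in range(2, b1 + 1):
        if all(q % d for d in range(2, q)):  # q is prime
            qe = q
            while qe * q <= b1:
                qe *= q
            e *= qe
    return e

p = 29
m = (1 << p) - 1
B1 = 7
E = stage1_exponent(B1)   # 4 * 3 * 5 * 7 = 420
x = pow(3, 2 * p * E, m)
g = gcd(x - 1, m)
print(g)
```

Here the gcd picks up the factor 233, because 233 - 1 = 2^3 · 29 divides 2·p·E = 2^3 · 3 · 5 · 7 · 29; a real implementation differs mainly in doing the modular squarings with the FFT arithmetic and far larger B1.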
[QUOTE=SELROC;499242]The following log line shows a difference between exponent and iterations number: 859433 vs. 859600
2018-11-01 09:22:55 0 859433 10000/859600 [ 1.16%], 0.49 ms/it; ETA 0d 00:07; 21bc9a2e362200a7

is that normal?[/QUOTE]
I think so. As I understand it, the computation needs to be carried past iteration p, to the next multiple of the block size, so the final error check can be done against it. It looks odd because in earlier versions it was displayed differently.

Good catch on the "C" on a prime. We should check a few known primes, probably at every major release if not every minor one. It's also an argument for double-checking every exponent with different software, or at least a different offset. A list of tests to be done on every major release (which would necessarily get updated as program features change) would be a good thing. I wonder what Preda's testing consists of.
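The iteration count in the log is consistent with rounding the exponent up to the next multiple of the error-check block size. A sketch; the block size of 400 is inferred from the numbers in this log line, not confirmed from the source:

```python
# Round the exponent up to the next multiple of the error-check block
# size; the final consistency check runs against this padded count.
def total_iterations(exponent: int, block: int = 400) -> int:
    return -(-exponent // block) * block   # ceiling division

print(total_iterations(859433))  # 859600, matching the log line above
```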
V5.0 crashes on too-small exponent
I suggest bounds testing the worktodo exponent value and, if it is out of bounds, issuing a polite message and log entry, skipping that worktodo entry, and continuing to the next one.
Also, documentation of the bounds would be good.
[CODE]C:\msys64\home\ken\gpuowl-compile\v5.0>openowl.exe -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:22:21 gpuowl 5.0-f604bb1
2018-11-01 08:22:21 condorella-rx480 -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:22:21 condorella-rx480 107 FFT 512K: Width 64x8, Height 64x8; 0.00 bits/word
2018-11-01 08:22:21 condorella-rx480 using long carry kernels
2018-11-01 08:22:22 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2018-11-01 08:22:25 condorella-rx480 OpenCL compilation in 3291 ms, with "-DEXP=107u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2018-11-01 08:22:25 condorella-rx480 107.owl not found, starting from the beginnig.
Assertion failed!

Program: C:\msys64\home\ken\gpuowl-compile\v5.0\openowl.exe
File: state.cpp, Line 124

Expression: w >= 0 && w < (1 << len)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.[/CODE]
It also crashes on assorted other exponents, up to at least 216091.
[CODE]C:\msys64\home\ken\gpuowl-compile\v5.0>openowl.exe -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:43:30 gpuowl 5.0-f604bb1
2018-11-01 08:43:30 condorella-rx480 -user kriesel -cpu condorella-rx480 -device 0
2018-11-01 08:43:30 condorella-rx480 216091 FFT 512K: Width 64x8, Height 64x8; 0.41 bits/word
2018-11-01 08:43:30 condorella-rx480 using long carry kernels
2018-11-01 08:43:31 condorella-rx480 Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
2018-11-01 08:43:34 condorella-rx480 OpenCL compilation in 3322 ms, with "-DEXP=216091u -DWIDTH=512u -DSMALL_HEIGHT=512u -DMIDDLE=1u -I. -cl-fast-relaxed-math -cl-std=CL2.0 "
2018-11-01 08:43:34 condorella-rx480 216091.owl not found, starting from the beginnig.
2018-11-01 08:43:34 condorella-rx480 powerSmooth(216091, 2000) has 2916 bits
Assertion failed!

Program: C:\msys64\home\ken\gpuowl-compile\v5.0\openowl.exe
File: state.cpp, Line 24
Expression: 0 <= w && w < (1 << nBits)

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

C:\msys64\home\ken\gpuowl-compile\v5.0>[/CODE]
Makes sense that it would fail, since it's less than 1 bit/word. Enforcing some liberal bounds would be a good feature.
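The "0.00 bits/word" line points at the root cause: a tiny exponent spread over a 512K-word FFT leaves essentially zero bits per word. The liberal bounds check suggested above could look something like this sketch (the thresholds and the function are illustrative guesses, not gpuowl's actual limits or code):

```python
# Illustrative sanity check of a worktodo exponent against an FFT size:
# reject entries whose bits-per-word ratio is implausible, with a polite
# message, instead of crashing later in the carry logic.
def check_exponent(exponent: int, fft_size: int,
                   min_bpw: float = 1.0, max_bpw: float = 20.0) -> bool:
    bpw = exponent / fft_size
    if not (min_bpw <= bpw <= max_bpw):
        print(f"skipping {exponent}: {bpw:.2f} bits/word is out of range")
        return False
    return True

print(check_exponent(107, 512 * 1024))        # rejected, as in the log above
print(check_exponent(77936867, 4096 * 1024))  # ~18.6 bits/word: plausible
```

The exponent 77936867 in the usage line is just an example of a wavefront-sized value; the real upper threshold would come from per-FFT-length accuracy limits, which the thread notes are themselves not yet documented.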