mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

kriesel 2019-01-27 15:33

Very early excessive roundoff and quit on CUDAPm1 v0.22
 
CUDAPm1 v0.22 hit excessive roundoff error with its own automatically chosen fft and threads settings, and terminated, on multiple attempts for each of 4 of 4 test exponents. CUDAPm1 v0.20 had no trouble with the same assignments on the same gpu and host system; the first two completed, and the third is nearing completion.
[CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:13:33.15 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 128.
Using up to 7857M GPU memory.
Selected B1=1920000, B2=45600000, 4.08% chance of finding a factor
Starting stage 1 P-1, M187000019, B1 = 1920000, B2 = 45600000, fft length = 10368K
Doing 2770223 iterations
Iteration = 100, err = 0.49959 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:13:55.57 [/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:20:31.18 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 32.
Using up to 7884M GPU memory.
Selected B1=2540000, B2=62865000, 4.22% chance of finding a factor
Starting stage 1 P-1, M249000043, B1 = 2540000, B2 = 62865000, fft length = 13824K
Doing 3664015 iterations
Iteration = 100, err = 0.49955 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:21:06.43 [/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:23:26.18 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 512.
Using up to 7776M GPU memory.
Selected B1=2895000, B2=70927500, 4.08% chance of finding a factor
Starting stage 1 P-1, M302000059, B1 = 2895000, B2 = 70927500, fft length = 18432K
Doing 4176850 iterations
Iteration = 100, err = 0.49925 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:24:10.37 [/CODE][CODE]batch wrapper reports (re)launch at Wed 01/23/2019 19:26:56.20 reset count 2 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 512, mult 32, norm2 1024.
Using up to 7776M GPU memory.
Selected B1=3600000, B2=84600000, 4.05% chance of finding a factor
Starting stage 1 P-1, M369000029, B1 = 3600000, B2 = 84600000, fft length = 20736K
Doing 5192497 iterations
Iteration = 100, err = 0.49578 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exit at Wed 01/23/2019 19:27:59.83 [/CODE]CUDAPm1 v0.20 completed these on the same gpu and system during the same system boot
[URL]https://www.mersenne.org/report_exponent/?exp_lo=187000019&exp_hi=&full=1[/URL]
[URL]https://www.mersenne.org/report_exponent/?exp_lo=249000043&exp_hi=&full=1[/URL]
CUDAPm1 v0.20 on M302000059 is less than a day away from completing stage 2 and is looking good so far.

The only differences in the respective cudapm1.ini files were:
[CODE]SaveAllCheckpoints=1 @0.20, 0 @0.22
Threads=1024 @0.20, default (256)@0.22
[/CODE]Making the CUDAPm1 v0.22 ini file match the v0.20 ini file did not resolve the early excessive roundoff issue. (Tested only on the M187m exponent)


M187m fft length 10368k on v0.22:
v0.22 threads file entry 10368 512 32 128 1.8490

M187m fft length 10368k on v0.20:
v0.20 threads file entry 10368 32 32 32 14.4726

An M187m attempt running now on v0.22 with v0.20's threads counts appears to have avoided the early excessive roundoff error.
Edit: whoops, no: a bad zero residue was produced and detected, and the program stopped.
[CODE]batch wrapper reports (re)launch at Sun 01/27/2019 9:50:14.25 reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 2097152
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 7995M of 8192M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 7857M GPU memory.
Selected B1=1920000, B2=45600000, 4.08% chance of finding a factor
Starting stage 1 P-1, M187000019, B1 = 1920000, B2 = 45600000, fft length = 10368K
Doing 2770223 iterations
Iteration 100000 *** Current residue matches known bad value, aborting
M187000019, 0x0000000000000000, n = 10368K, CUDAPm1 v0.22 err = 0.00000 (19:46 real, 11.8584 ms/iter, ETA 8:47:44)
Estimated time spent so far: 19:46

batch wrapper reports exit at Sun 01/27/2019 10:10:21.76 [/CODE]
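For readers wondering what the err values in these logs measure: programs like CUDAPm1 multiply huge numbers with floating-point FFTs, every coefficient of the product should be an integer, and the reported roundoff error is the largest distance of any computed coefficient from the nearest integer. Once it approaches 0.5 a rounded coefficient can be off by a whole unit, hence the quit at 0.40. A minimal Python sketch of the idea (function name and digit base are my own choices, and it squares a small generic integer rather than a Mersenne residue):

```python
import numpy as np

def square_via_fft(x, base=2**16):
    """Square a positive integer via FFT convolution of its digits,
    returning (x*x, max roundoff error of the convolution)."""
    digits = []                        # little-endian base-2^16 digits
    while x:
        digits.append(x % base)
        x //= base
    size = 2 * len(digits)             # room for all product digits
    fa = np.fft.rfft(np.asarray(digits, dtype=np.float64), size)
    coeffs = np.fft.irfft(fa * fa, size)
    rounded = np.round(coeffs)
    err = float(np.max(np.abs(coeffs - rounded)))
    result, carry = 0, 0               # propagate carries, rebuild integer
    for i, c in enumerate(int(v) for v in rounded):
        c += carry
        result += (c % base) * base ** i
        carry = c // base
    return result + carry * base ** size, err
```

A quit like the logged err = 0.49959 corresponds to this check failing: the chosen fft length left too few spare bits per digit for the convolution sums to stay near integers.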

petrw1 2019-03-12 18:02

Anyone running this on a GTX-980 or similar?

Reliable?
Recommended?
Expected thruput (i.e. GhzDays/Day)?

Thanks

kriesel 2019-03-12 18:58

[QUOTE=petrw1;510678]Anyone running this on a GTX-980 or similar?

Reliable?
Recommended?
Expected thruput (i.e. GhzDays/Day)?

Thanks[/QUOTE]
I've run it on a variety of GTX 10x0 cards and others. Throughput will be similar to running CUDALucas on the gpu; much reduced from TF GhzD/day, by around a factor of 15 I think. CUDAPm1's math is mostly similar to LL or PRP. (It is DP performance dependent.) The gcd part of CUDAPm1 runs on a core of the cpu, which can result in a temporary stall of a prime95 or mprime worker, although hyperthreading may prevent that. Judging by [URL]https://www.mersenne.ca/cudalucas.php[/URL], performance would be about 1.1 times that of a GTX1060. It's probably capable of running P-1 stage 2 for exponents up to 350M or so. See [URL]https://www.mersenneforum.org/showthread.php?p=489180#post489180[/URL] for more info. Note that CUDAPm1 was described by its author as alpha software.
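Since the thread assumes familiarity with the method: stage 1 of P-1 raises a base (3 for GIMPS) to E = the product of all prime powers up to B1, modulo N, then takes gcd(base^E - 1, N); any prime factor q of N whose q-1 is B1-smooth divides that gcd. A minimal Python sketch on a small synthetic semiprime (my own toy numbers; real Mersenne P-1 works mod 2^p-1 with FFT arithmetic, a stage 2, and vastly larger bounds):

```python
from math import gcd

def pminus1_stage1(N, B1, base=3):
    """Toy P-1 stage 1: gcd(base^E - 1, N) for E = product of
    all prime powers q^k <= B1."""
    sieve = [True] * (B1 + 1)          # sieve of primes up to B1
    sieve[0:2] = [False, False]
    for i in range(2, int(B1 ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    x = base
    for q in (i for i, is_p in enumerate(sieve) if is_p):
        qk = q
        while qk * q <= B1:            # largest power of q still <= B1
            qk *= q
        x = pow(x, qk, N)              # accumulate the exponent prime by prime
    return gcd(x - 1, N)
```

For example, with N = 1009 * 2027: 1009 - 1 = 2^4 * 3^2 * 7 is 16-smooth, while 2027 - 1 = 2 * 1013 is not, so stage 1 with B1 = 16 pulls out the factor 1009.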

Cubox 2019-03-17 04:03

[QUOTE=kriesel;498660]
Possibly Cubox might help in some way. [URL]https://www.mersenneforum.org/showpost.php?p=481663&postcount=552[/URL]
[/QUOTE]

Sorry, studies kept me busy and I completely forgot about the project.
I might give it another try, but no promises.

Are there any areas that need a look that I could start on? Something I can't really break, if possible :P

Cubox 2019-03-17 04:15

Checking the source very quickly I see this: [url]https://github.com/ah42/cuda-p1/blob/ffe52b53c3c2955f1750574e62d80eba3ed6d455/CUDAPm1.cu#L2635[/url]

This is the kind of rust I find fun to polish

kriesel 2019-03-17 15:05

[QUOTE=Cubox;510952]Checking the source very quickly I see this: [URL]https://github.com/ah42/cuda-p1/blob/ffe52b53c3c2955f1750574e62d80eba3ed6d455/CUDAPm1.cu#L2635[/URL]

This is the kind of rust I find fun to polish[/QUOTE]
Good one, I hadn't seen that. Welcome back.

There's lots to pick from. See posts 676 & 677 of this thread, and the full bug and wish list (attachment in post 3 of [URL]https://www.mersenneforum.org/showthread.php?t=23389[/URL])
My personal preference would be addition of ISO style date and time to one-second or better precision to screen output, elapsed time on gcds, optional logging of stdout and stderr, and progress on 676 & 677. See also 653; those silent halts in stage 2 can be a bit maddening. Aaron implemented some and deferred some of what I offered in [URL]https://www.mersenneforum.org/showpost.php?p=499685&postcount=612[/URL]
First step would be to get a development and build environment going, and confirm the ability to build a small CUDA example and then either v0.20 (requires gmp I think) or preferably v0.22 (requires MPIR as I recall) before attempting further changes.

Cubox 2019-03-17 17:01

Noted.

Would it be OK to try to rearrange the source code a bit?
Right now the main .cu file is huge and difficult to navigate, so I'd like to clean it up a bit:
moving "math" functions into separate files, and leaving startup and config/argument-list parsing in the main file. I also saw references to LL in the code; maybe some unused code could be removed.

I'm not going to touch any function that does math, such as the stage2 function, since I don't want to break anything. But the source might be easier to navigate if it's cleaner.

Cubox 2019-03-17 20:22

Also, the binary won't run without the cudart and cufft DLLs present. It won't even give an error message on the command line.

If nothing can be done to print an error (if Windows won't start the binary, we can't do anything about it), it should be documented for people wanting to try it themselves. Or maybe package the DLLs with the binaries? Same for the ini file.
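The suggestion above (print a readable error, or ship the DLLs alongside) could also be handled by a small launcher that checks for the runtime DLLs before starting the program instead of failing silently. A hypothetical sketch; the DLL names are examples for a CUDA 10.x build and must match whatever toolkit the binary was actually linked against:

```python
import os
import subprocess
import sys

# Example names for a CUDA 10.x build; adjust to the toolkit actually used.
REQUIRED_DLLS = ["cudart64_101.dll", "cufft64_10.dll"]

def missing_dlls(exe_dir):
    """Return the required DLLs not present beside the executable."""
    return [d for d in REQUIRED_DLLS
            if not os.path.isfile(os.path.join(exe_dir, d))]

def launch(exe_path, *args):
    """Start the program only if its runtime DLLs are present."""
    missing = missing_dlls(os.path.dirname(os.path.abspath(exe_path)))
    if missing:
        sys.exit("Cannot start %s: missing %s" % (exe_path, ", ".join(missing)))
    return subprocess.call([exe_path, *args])
```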

I will try to get a build environment going. Hopefully I can build v0.22 without Visual Studio (no point in trying v0.20). Are GitHub pull requests alright for submitting changes?

Cheers

kriesel 2019-03-17 20:29

[QUOTE=Cubox;510972]Noted.

Would it be OK to try to rearrange the source code a bit?
Right now the main .cu file is huge and difficult to navigate, I'd like to clean up a bit.
Moving "math" functions into separate files, leaving startup and config/argument list parsing into the main file. I saw references to LL in the code, maybe remove unused code.

I'm not going to touch any function that does math, such as the stage2 function. Don't want to break anything. But it might be more easy to navigate into the source if it's more clean.[/QUOTE]
What blocks of code do you think can be removed?

I wouldn't bet on those "references to LL" being a reliable indicator of what may be unused and deletable, or there being much that can be entirely stripped. For example, I think part of a comment section at CUDAPm1.cu ~line 1399 " * End LL/GPU Functions, Begin Utility/CPU Functions *" just didn't get changed to "End P-1/GPU Functions...".
Or parse.c ~line 544: case NO_TEST_EQUAL: printf("doesn't begin with Test= or DoubleCheck=.\n");break;
The message should be about "doesn't begin with PFactor="
There are lots of references to the variables LL_tests or LL_saved. Those are legitimate variables and code for P-1, involved in determining what are good P-1 bounds to run.
(I'd rather state here a thing or two that might already be obvious to you, than have you miss something and use your time inefficiently as a result. And I've probably left a lot out too so be careful.)

The CUDAPm1 program was probably created from a complete copy of some earlier version of CUDALucas. Some comments, print statements, etc., did not get changed along with the rest of the code. Similarly, the editing of the readme file and ini file were not completed then.

Will you be working mainly in linux or Windows? It will be much easier for me to test Windows executables, CUDA 5.0 - 8.0 range only currently.

Re github use, please take that up with Aaron Haviland, as it's his repo.
The necessity for the dlls, ini file etc can be addressed in the readme and may already be.

Cubox 2019-03-17 21:37

[QUOTE=kriesel;510991]What blocks of code do you think can be removed?

I wouldn't bet on those "references to LL" being a reliable indicator of what may be unused and deletable, or there being much that can be entirely stripped. For example, I think part of a comment section at CUDAPm1.cu ~line 1399 " * End LL/GPU Functions, Begin Utility/CPU Functions *" just didn't get changed to "End P-1/GPU Functions...".
Or parse.c ~line 544: case NO_TEST_EQUAL: printf("doesn't begin with Test= or DoubleCheck=.\n");break;
The message should be about "doesn't begin with PFactor="
There are lots of references to the variables LL_tests or LL_saved. Those are legitimate variables and code for P-1, involved in determining what are good P-1 bounds to run.
(I'd rather state here a thing or two that might already be obvious to you, than have you miss something and use your time inefficiently as a result. And I've probably left a lot out too so be careful.)
[/QUOTE]

I have not identified any code at all yet, but I am willing to go and check if I can find any.
I will obviously not remove anything before making sure it is not necessary.

[QUOTE=kriesel;510991]
Will you be working mainly in linux or Windows? It will be much easier for me to test Windows executables, CUDA 5.0 - 8.0 range only currently.
[/QUOTE]

Windows, latest (Enterprise edition). With a GTX 1070, so 6.1. Don't have anything else.

kriesel 2019-03-17 22:13

[QUOTE=Cubox;510994]I have not identified any code at all yet, but I am willing to go and check if I can find any.
I will obviously not remove anything before making sure it is not necessary.

Windows, latest (Enterprise edition). With a GTX 1070, so 6.1. Don't have anything else.[/QUOTE]Excellent. Nice card. When you get to that point, please build for CC 2.0 and up if practical. Something I found confusing early on, and still find inconvenient, is the difference between Compute Capability, CUDA level, and the multiple ways of expressing driver versions. I can run anything from CUDA4.0 to CUDA8.0 on a Compute Capability 2.0 gpu such as Quadro 4000, if I'm careful, but not if too high a driver version is installed, such as 24.21.14.1195 (411.95), which was ok with CC3.0 but not 2.0. Too high a driver level can make a low Compute Capability card disappear as far as CUDA apps are concerned.
I haven't found a good CC/CUDA/driver min and max limits scorecard for what works with what yet. Wading through specs and SDK documentation to find it is tedious. And then there is the whole model name/ family name connection.

Cubox 2019-03-23 03:33

I'm still working on having the project build, but close.

In the meantime, is it OK to ask some (sometimes basic) questions about P-1 and how it is implemented for GIMPS?
It's not directly related to the project, so feel free to tell me where to ask elsewhere.

Cubox 2019-03-23 04:22

Build successful!

[url]https://cubox.me/files/gimps/CUDAPm1-23032019.zip[/url]

Built with CUDA 10.1, which I think is newer than what the release binary on GitHub uses.

I'll see about compute capabilities

Cubox 2019-03-23 04:40

I am seeing big gains in speed with my binary compiled for compute capability 6.1, versus the binary from GitHub, which is compiled for compute capability 3.5:

[C]Iteration 5000 M61408363, 0xb18fee1c5cbbc536, n = 3360K, CUDAPm1 v0.22 err = 0.17188 (0:27 real, 5.3726 ms/iter, ETA 1:17:04)
Iteration 10000 M61408363, 0x717c9ea7258d4438, n = 3360K, CUDAPm1 v0.22 err = 0.16406 (0:26 real, 5.3408 ms/iter, ETA 1:16:10)
Iteration 15000 M61408363, 0xdf4aaa1700855aac, n = 3360K, CUDAPm1 v0.22 err = 0.17188 (0:27 real, 5.3603 ms/iter, ETA 1:16:00)[/C]

VS

[C]Iteration 5000 M61408363, 0xb18fee1c5cbbc536, n = 3360K, CUDAPm1 v0.22 err = 0.17969 (0:21 real, 4.0929 ms/iter, ETA 58:43)
Iteration 10000 M61408363, 0x717c9ea7258d4438, n = 3360K, CUDAPm1 v0.22 err = 0.17188 (0:20 real, 4.0922 ms/iter, ETA 58:22)
Iteration 15000 M61408363, 0xdf4aaa1700855aac, n = 3360K, CUDAPm1 v0.22 err = 0.16602 (0:20 real, 4.0929 ms/iter, ETA 58:02)[/C]

The 6.1 binary: [url]https://cubox.me/files/gimps/CUDAPm1.exe[/url]

Cubox 2019-03-23 04:56

The CUDA compiler refuses to build any binary below CC 3.0, so 2.0 is not possible without getting an earlier CUDA SDK version.

Here is a binary for 3.0: [url]https://cubox.me/files/gimps/CC3.0-CUDAPm1.exe[/url]

kriesel 2019-03-23 13:51

[QUOTE=Cubox;511483]The CUDA compiler refuses to make any binary under CC3.0, so 2.0 is not possible without getting an earlier CUDA SDK version.[/QUOTE]Congrats on getting a build to work. Sounds like you're making good progress.
I think you would need a CUDA 8 or earlier SDK for CC2.0 output. The 10.1 SDK won't target CC2.0, but have you tried multiple CC levels & PTX output? Supposedly multiple versions can be included, so a variety of gpus can have what's optimal for them available, I think from the same exe file. Then from the CUDA 8 SDK, CC2.0-6.x or so could be covered in one exe. About half my gpus are CC2.x.

In a CUDA 9.2 SDK mfaktc makefile, it looks like
[CODE]# generate code for various compute capabilities
# not available in cuda9.2 NVCCFLAGS += --generate-code arch=compute_11,code=sm_11 # CC 1.1, 1.2 and 1.3 GPUs will use this code (1.0 is not possible for mfaktc)
# not available in cuda9.2 NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code, one code fits all!
NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc
NVCCFLAGS += --generate-code arch=compute_50,code=sm_50 # CC 5.x GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_52,code=sm_52 # CC 5.2 GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61 # CC 6.1+ GPUs will use this code, GTX 10xx for example
NVCCFLAGS += --generate-code arch=compute_70,code=sm_70 # CC 7.x GPUs will use this code
[/CODE]

Cubox 2019-03-23 18:50

I thought you could only have one CC level per binary. I'll try multiple levels. I was going to make one binary for each CC level...

If I get CUDA 8 working, I can do one binary that covers 2.0-6.x like you mentioned, and then one from the latest CUDA (10.1 right now) that covers 3.0 up to the max.

Cubox 2019-03-23 19:17

The build links I post can go dead at any time.

Here is my latest build, using CUDA 10.1, with all compute capabilities from 3.0 to 7.0 (I used the list kriesel took from mfaktc).

Changed the version name in the code to v0.22.notstable, since it is not tested at all, but it has no changes from the 0.22 code (yet).

It should behave the same: [url]https://cubox.dev/files/gimps/CUDAPm1.exe[/url]

kriesel 2019-03-23 20:37

[QUOTE=Cubox;511547]The build links I post can go dead at any time.

Here is my latest build, using CUDA 10.1, with all compute capabilities from 3.0 to 7.0 (I used the list kriesel took from mfaktc).

Changed the version name in the code to v0.22.notstable, since it is not tested at all, but it has no changes from the 0.22 code (yet).

It should behave the same: [URL]https://cubox.dev/files/gimps/CUDAPm1.exe[/URL][/QUOTE]
I see a couple of your recent build links are already 404s.
My CUDA 9.2 example from an edited mfaktc makefile ended at cc 7.0. I couldn't tell whether you added cc7.5 for your CUDA 10.1 build.
From the NVIDIA "Turing Compatibility Guide" available at the documentation url for the CUDA10 SDK, [CODE]nvcc.exe -ccbin "C:\vs2010\VC\bin"
-Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT"
-gencode=arch=compute_50,code=sm_50
-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_60,code=sm_60
-gencode=arch=compute_61,code=sm_61
-gencode=arch=compute_70,code=sm_70
-gencode=arch=compute_75,code=sm_75
-gencode=arch=compute_75,code=compute_75
--compile -o "Release\mykernel.cu.obj" "mykernel.cu"[/CODE]Note the last -gencode line, which at a casual glance looks like duplication, but isn't. As the Turing compatibility guide explains (and previous ones also), code=compute_xx is PTX (future-proofing), while code=sm_xx is cubin, which is compute-capability specific. (There's a just-in-time final compile of PTX to cubin, if I've understood correctly.)

Some do these compiles from batch files, for example,
[CODE]nvcc -ccbin "C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin" -cubin -DWIN32 -D_CONSOLE -D_MBCS -Xcompiler /EHsc,/W3,/nologo,/Wp64,/O2,/Zi,/MT -I"C:\CUDA\include" -I./ -I"C:\Program Files\NVIDIA Corporation\NVIDIA SDK 10\NVIDIA CUDA SDK\common\inc" %1

:I use it from command line like this
: runnvcccubin.bat file_name.cu

:You may need to change the Visual Studio and NVIDIA SDK directories to make it work in your environment.

:https://devtalk.nvidia.com/default/topic/368105/cuda-occupancy-calculator-helps-pick-optimal-thread-block-size/[/CODE]Not sure if that's helpful; see the included url for context.

Best reference I've found yet for correlating CC, CUDA, gpu family, gpu model etc is [URL]https://en.wikipedia.org/wiki/CUDA[/URL]

Cubox 2019-03-23 21:31

Rebuilt the binary (same URL) with these CCs:
[C] -gencode=arch=compute_50,code=sm_50
-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_60,code=sm_60
-gencode=arch=compute_61,code=sm_61
-gencode=arch=compute_70,code=sm_70
-gencode=arch=compute_75,code=sm_75
-gencode=arch=compute_75,code=compute_75[/C]

pepi37 2019-03-23 21:42

Card: GeForce 1050 Ti
OS: Win7 x64
Latest Nvidia drivers


I needed to download cufft64_10.dll to make CUDAPm1 work.
The DLL is 110 MB.

James Heinrich 2019-03-23 22:04

You can find all the CUDA DLLs you need at
[url]https://download.mersenne.ca/CUDA-DLLs[/url]

The DLLs are confusingly named, but you're looking for the CUDA v10.1 versions:
[url]https://download.mersenne.ca/CUDA-DLLs/CUDA-10.1[/url]

kriesel 2019-03-23 22:24

[QUOTE=Cubox;511578]Rebuilt the binary (same URL) with these CCs:
[C] -gencode=arch=compute_50,code=sm_50
-gencode=arch=compute_52,code=sm_52
-gencode=arch=compute_60,code=sm_60
-gencode=arch=compute_61,code=sm_61
-gencode=arch=compute_70,code=sm_70
-gencode=arch=compute_75,code=sm_75
-gencode=arch=compute_75,code=compute_75[/C][/QUOTE]
I didn't mean for you to drop 3.0 and 3.5 from the 10.1 build; that was just a quick verbatim copy/paste from the reference.
Do you plan to do a CUDA 8 SDK build going back to cc2.0?
Something like [CODE]#CUDA SDK 8.0 cc 2.0-6.1 +PTX, requires compatible drivers and dlls
# CC 1.x unsupported; other omitted CC steps 2.1, 3.2, 3.7, 5.3, 6.0, 6.2, 7.2
NVCCFLAGS += --generate-code arch=compute_20,code=sm_20 # CC 2.x GPUs will use this code; cc2.0 has 32 alu lanes (int & single precision)
# NVCCFLAGS += --generate-code arch=compute_21,code=sm_21 # CC 2.1 GPUs could use this code in SP apps like mfaktc; cc2.1 has 48 alu lanes
NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc
NVCCFLAGS += --generate-code arch=compute_50,code=sm_50 # CC 5.x GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_52,code=sm_52 # CC 5.2 GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61 # CC 6.1+ GPUs will use this code, GTX 10xx for example
NVCCFLAGS += --generate-code arch=compute_61,code=compute_61 # future-proof with PTX, eg CC 7.x+ GPUs will use this code[/CODE][CODE]#CUDA SDK 10.x cc 3.0-7.5 +PTX, requires compatible drivers and dlls
# CC 2.x and lower unsupported; other omitted CC steps 3.2, 3.7, 5.3, 6.0, 6.2, 7.2
NVCCFLAGS += --generate-code arch=compute_30,code=sm_30 # all CC 3.x GPUs _COULD_ use this code
NVCCFLAGS += --generate-code arch=compute_35,code=sm_35 # but CC 3.5 (3.2?) _CAN_ use funnel shift which is useful for mfaktc
NVCCFLAGS += --generate-code arch=compute_50,code=sm_50 # CC 5.x GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_52,code=sm_52 # CC 5.2 GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_61,code=sm_61 # CC 6.1+ GPUs will use this code, GTX 10xx for example
NVCCFLAGS += --generate-code arch=compute_70,code=sm_70 # CC 7. GPUs will use this code
NVCCFLAGS += --generate-code arch=compute_75,code=sm_75 # CC 7.5+ GPUs will use this code, RTX 20xx for example
NVCCFLAGS += --generate-code arch=compute_75,code=compute_75 # future-proof with PTX, eg CC 8.x+ GPUs will use this code[/CODE]An older cudapm1 make file I had here went all the way back to cc1.3, but I think there are few if any of those cards left.

Cubox 2019-03-25 15:07

Added CC 3.0 and 3.5 to the latest build (same URL).

Flags are:
[C]compute_30,sm_30
compute_35,sm_35
compute_50,sm_50
compute_52,sm_52
compute_60,sm_60
compute_61,sm_61
compute_70,sm_70
compute_75,sm_75
compute_75,compute_75[/C]

[url]https://cubox.dev/files/gimps/[/url] contains the latest exe, with the DLLs required to run the program and the ini file to use.

Regarding CUDA 8: it does not support VS2017, only VS2015. It's a lot of work to support, and I'd rather spend the time actually editing the code.

masser 2019-04-04 04:07

Many thanks to all of the developers of CUDAPm1. I have v0.22 running an assignment on my dinky GT 1030. It's my first P-1 assignment in years.

GhettoChild 2019-04-23 16:42

Is this where I post my errors?
 
[CODE]
CUDAPm1 v0.20
------- DEVICE 0 -------
name GeForce GTX 770
Compatibility 3.0
clockRate (MHz) 1202
memClockRate (MHz) 3505
totalGlobalMem zu
totalConstMem zu
l2CacheSize 524288
sharedMemPerBlock zu
regsPerBlock 65536
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 8
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 3961M of 4096M GPU memory free.
Index 88
No GeForce GTX 770 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDALucas -cufftbench 9216 9216 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 256, norm2 128.
Using up to 4536M GPU memory.
WARNING: There may not be enough GPU memory for stage 2!
Selected B1=1515000, B2=45071250, 5.11% chance of finding a factor
Starting stage 1 P-1, M150000713, B1 = 1515000, B2 = 45071250, fft length = 9216
K
Doing 2186688 iterations
Iteration 400000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02441 (1:51:01 real, 16.6531 ms/iter, ETA 8:15:53)
Iteration 800000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02441 (1:51:05 real, 16.6628 ms/iter, ETA 6:25:06)
Iteration 1200000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02588 (1:50:59 real, 16.6478 ms/iter, ETA 4:33:46)
Iteration 1600000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02588 (1:51:04 real, 16.6604 ms/iter, ETA 2:42:54)
Iteration 2000000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02539 (1:51:01 real, 16.6536 ms/iter, ETA 51:49)
M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 10:06:59
Starting stage 1 gcd.
M150000713 Stage 1 found no factor (P-1, B1=1515000, B2=45071250, e=2, n=9216K C
UDAPm1 v0.20)
Starting stage 2.
Using b1 = 1515000, b2 = 45071250, d = 4620, e = 2, nrp = 51
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(3356)
: cudaSafeCall() Runtime API error 2: out of memory.

CUDA reports 3949M of 4096M GPU memory free.
Index 96
No GeForce GTX 770 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDALucas -cufftbench 11200 11200 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 256, norm2 128.
Using up to 4637M GPU memory.
WARNING: There may not be enough GPU memory for stage 2!
Selected B1=2075000, B2=68993750, 5.91% chance of finding a factor
Starting stage 1 P-1, M200001187, B1 = 2075000, B2 = 68993750, fft length = 1120
0K
Doing 2994040 iterations
Iteration 400000 M200001187, 0x****************, n = 11200K, CUDAPm1 v0.20 err =
0.23438 (2:23:49 real, 21.5717 ms/iter, ETA 15:32:37)
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(1130)
: cudaSafeCall() Runtime API error 30: unknown error.
[/CODE]

I had no other major apps running at the time. Admittedly I have only 3GB of system ram versus 4GB of GPU ram on the GTX 770. The second crash happened the moment I opened a single Incognito Chrome tab, with no other Chrome windows or tabs open at all, and navigated to this mersenneforum to post about the first crash. The PC was not actively doing anything else or running any other major program.

kriesel 2019-04-23 19:46

[QUOTE=GhettoChild;514477][CODE]
CUDAPm1 v0.20
------- DEVICE 0 -------
name GeForce GTX 770
Compatibility 3.0
clockRate (MHz) 1202
memClockRate (MHz) 3505
totalGlobalMem zu
totalConstMem zu
l2CacheSize 524288
sharedMemPerBlock zu
regsPerBlock 65536
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 8
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 3961M of 4096M GPU memory free.
Index 88
No GeForce GTX 770 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDALucas -cufftbench 9216 9216 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 256, norm2 128.
Using up to 4536M GPU memory.
WARNING: There may not be enough GPU memory for stage 2!
Selected B1=1515000, B2=45071250, 5.11% chance of finding a factor
Starting stage 1 P-1, M150000713, B1 = 1515000, B2 = 45071250, fft length = 9216
K
Doing 2186688 iterations
Iteration 400000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02441 (1:51:01 real, 16.6531 ms/iter, ETA 8:15:53)
Iteration 800000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02441 (1:51:05 real, 16.6628 ms/iter, ETA 6:25:06)
Iteration 1200000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02588 (1:50:59 real, 16.6478 ms/iter, ETA 4:33:46)
Iteration 1600000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02588 (1:51:04 real, 16.6604 ms/iter, ETA 2:42:54)
Iteration 2000000 M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20 err =
0.02539 (1:51:01 real, 16.6536 ms/iter, ETA 51:49)
M150000713, 0x****************, n = 9216K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 10:06:59
Starting stage 1 gcd.
M150000713 Stage 1 found no factor (P-1, B1=1515000, B2=45071250, e=2, n=9216K C
UDAPm1 v0.20)
Starting stage 2.
Using b1 = 1515000, b2 = 45071250, d = 4620, e = 2, nrp = 51
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(3356)
: cudaSafeCall() Runtime API error 2: out of memory.

CUDA reports 3949M of 4096M GPU memory free.
Index 96
No GeForce GTX 770 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDALucas -cufftbench 11200 11200 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 256, norm2 128.
Using up to 4637M GPU memory.
WARNING: There may not be enough GPU memory for stage 2!
Selected B1=2075000, B2=68993750, 5.91% chance of finding a factor
Starting stage 1 P-1, M200001187, B1 = 2075000, B2 = 68993750, fft length = 11200K
Doing 2994040 iterations
Iteration 400000 M200001187, 0x****************, n = 11200K, CUDAPm1 v0.20 err = 0.23438 (2:23:49 real, 21.5717 ms/iter, ETA 15:32:37)
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(1130) : cudaSafeCall() Runtime API error 30: unknown error.
[/CODE]I had no other major apps running at the time. Admittedly I have only 3GB of system RAM vs 4GB of GPU RAM on a GTX 770. The 2nd crash happened the moment I opened a single Incognito Chrome tab, with no other Chrome windows or tabs open at all, having only navigated to this mersenneforum to post about the first crash. The PC was not doing anything else or actively running any other major program.[/QUOTE]
Re the question in your post title, yes.
It's not necessary to mask P-1 interim residues. And masking them might conceal symptoms, like known-bad or repeating or cycling residues.
Take the following quoted line of your output very seriously. In such a case, CUDAPm1 v0.20 is known to run for hours or days, uselessly producing unchanging stage 2 interim residues. I think the memory crunch is a bit more severe in v0.22, although that version contains some bug fixes, so you could give it a try. You could also try dialing back the exponent, to perhaps fit in your small system RAM. I have no CUDAPm1 experience with 3GB system RAM, or with GPU RAM larger than system RAM.

[QUOTE]WARNING: There may not be enough GPU memory for stage 2![/QUOTE] See also CUDAPm1 issues 46 and 71 in the attachment at [URL="http://www.mersenneforum.org/showpost.php?p=488534&postcount=3"]http://www.mersenneforum.org/showpost.php?p=4885[B]3[/B]4&postcount=3[/URL]
Runtime API error 30 is typically the NVIDIA driver timeout and recovery issue in Windows. See CUDALucas issue 1 in the attachment at [URL="http://www.mersenneforum.org/showpost.php?p=488524&postcount=3"]http://www.mersenneforum.org/showpost.php?p=4885[B]2[/B]4&postcount=3[/URL]

For a possible way of recovering, see batch wrapper files and DEVCON [URL]http://www.mersenneforum.org/showpost.php?p=488513&postcount=10[/URL]
Good luck.
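For context, the "batch wrapper" in the logs above is a relaunch loop with a bounded reset count. A minimal Python equivalent might look like the following sketch; the function name and the max-reset policy are my own, not the actual wrapper:

```python
import subprocess
import time

def run_with_resets(cmd, max_resets=3):
    """Relaunch cmd until it exits cleanly or the reset budget is spent."""
    for attempt in range(1, max_resets + 1):
        # Log each (re)launch with a timestamp, like the wrapper's console lines.
        print(f"(re)launch attempt {attempt} of max {max_resets} at {time.ctime()}")
        rc = subprocess.run(cmd).returncode
        if rc == 0:
            return rc
    return rc
```

The same idea works as a Windows batch for loop or a DEVCON-based reset script, per the linked post.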

GhettoChild 2019-04-24 03:41

Off-topic from my errors: how do I specify an FFT size to use per test in the worktodo file? I know on the command line you just put "[B]-f[/B] [I]FFT_LENGTH[/I][B]k[/B]". I have not seen anyone specify it in the worktodo file; it would allow more automated scripting.

Thank you for all your help.

kriesel 2019-04-24 03:59

[QUOTE=GhettoChild;514523]Off-topic from my errors: how do I specify an FFT size to use per test in the worktodo file? I know on the command line you just put "[B]-f[/B] [I]FFT_LENGTH[/I][B]k[/B]". I have not seen anyone specify it in the worktodo file; it would allow more automated scripting.

Thank you for all your help.[/QUOTE]
Read the ini file's comments.
I usually don't bother to specify; I just let the program pick, and then it can adjust when it hits excess roundoff error. If you specify a length, it will halt instead of adjusting the fft length to get around the error.

henryzz 2019-04-24 10:34

Is that on a 32-bit system by any chance? I can't see any other reason someone would have only 3GB of RAM these days.

GhettoChild 2019-04-24 13:01

[QUOTE=kriesel;514524]Read the ini file's comments.
I usually don't bother to specify, just let the program pick, and then it can adjust according to excess roundoff error. If you specify a length, it will halt instead of adjusting fft length to get around the error.[/QUOTE]
Last time I tried doing what is listed in the ini file, the program stated the line was unreadable and skipped it. The ini file instructions might only work for CUDALucas? Also, I don't know where in the worktodo line the FFT needs to be specified; the position of the variable might affect whether the program accepts it. There are 7 variables per line.

@henryzz:
It's 64-bit; I put that on everything the CPU permits except my tablet since that breaks license & driver support. I just can't afford more ram. It's a DDR2 PC. :rant: RAM that old in Montreal, QC, Canada costs a fortune. The entire PC is a collection of donated parts. I was shocked to learn it costs $15-$20CAD just for a 2" PCI-e 6-pin to 8-pin adaptor here. Another problem, UPS batteries don't exist in stores here; but that's a whole other rant unrelated to this forum.

Got this error just now, the moment I clicked post in the quick reply box. The display went black for a second or two as well. Just posting for reference; I can live with it if the issue is just not enough PC/GPU RAM.

[CODE]
CUDA reports 3961M of 4096M GPU memory free.
Index 101
No GeForce GTX 770 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDALucas -cufftbench 14112 14112 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 256, norm2 128.
Using up to 4851M GPU memory.
WARNING: There may not be enough GPU memory for stage 2!
Selected B1=2565000, B2=90416250, 6.5% chance of finding a factor
Starting stage 1 P-1, M249500501, B1 = 2565000, B2 = 90416250, fft length = 14112K
Doing 3699899 iterations
Iteration 400000 M249500501, 0xf4e102b03fc12715, n = 14112K, CUDAPm1 v0.20 err = 0.25293 (3:01:37 real, 27.2433 ms/iter, ETA 24:58:20)
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(1130) : cudaSafeCall() Runtime API error 30: unknown error.

[/CODE]

hansl 2019-05-09 03:30

Is it possible that CudaPm1 could support finding Fermat factors? I am wondering if it would be useful for fully factoring F12?

LaurV 2019-05-09 12:50

It could, but from the amount of the ECM done to F12, you may not expect to find a factor of it by P-1 in the next few thousand years...

hansl 2019-05-11 22:16

[QUOTE=LaurV;516221]It could, but from the amount of the ECM done to F12, you may not expect to find a factor of it by P-1 in the next few thousand years...[/QUOTE]
Ok, the magnitude of these sort of tasks is starting to sink in a bit.

Anyways, I'd still like to try running this program (more for its intended purpose than F12 now).

I tried running the release 0.22 on linux, but I have CUDA 10.1 installed, so it just spits this message out:
[code]
./CUDAPm1-0.22-cuda10-linux: error while loading shared libraries: libcufft.so.10.0: cannot open shared object file: No such file or directory
[/code]

The cuda install has these files/symlinks under /usr/local/cuda/lib64:
[code]
libcufft.so
libcufft.so.10
libcufft.so.10.1.105
[/code]

Would it be safe/reliable to create symlinks "libcufft.so.10.0" to the actual 10.1 file?

Assuming 10.0 installs have a similar symlink for 10.0 -> 10, maybe the next release could be improved to support more minor versions by looking for just "xxx.10", with no minor version suffix?
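For what it's worth, the version-tolerant lookup suggested here could be sketched like this (a hypothetical helper, not part of CUDAPm1; the real resolution is done by the dynamic linker against the library soname):

```python
import glob
import os

def find_cufft(libdir, major=10):
    """Return libcufft candidates for the given major version, newest first.

    Matches any minor version (libcufft.so.10, .10.1.105, ...) instead of
    hard-coding a single minor release like libcufft.so.10.0.
    """
    pattern = os.path.join(libdir, f"libcufft.so.{major}*")
    return sorted(glob.glob(pattern), reverse=True)
```

e.g. `find_cufft("/usr/local/cuda/lib64")` would pick up the 10.1.105 library listed above.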

Or am I better off just attempting a fresh build of my own?

kriesel 2019-06-06 16:27

ambitious fft length limit
 
CUDAPm1 v0.20 has its threshold for the 21952k fft length set a bit too high.
[CODE]Device GeForce GTX 1060 3GB
Compatibility 6.1
clockRate (MHz) 1771
memClockRate (MHz) 4004

fft max exp ms/iter
...

21952 392070229 47.6967
23040 411074273 47.8943[/CODE]Attempts to run M392000107 quickly ran into excessive roundoff issues. Forcing it to the next higher fft length, which has very little speed penalty in this case, seems to address it.[CODE]Using threads: norm1 256, mult 512, norm2 1024.
Using up to 2572M GPU memory.
Selected B1=2960000, B2=41440000, 3.52% chance of finding a factor
Starting stage 1 P-1, M392000107, B1 = 2960000, B2 = 41440000, fft length = 21952K
Doing 4269810 iterations
Iteration = 5600, err = 0.41016 >= 0.40, quitting.
Estimated time spent so far: 0:00

Using threads: norm1 256, mult 512, norm2 1024.
Using up to 2744M GPU memory.
Selected B1=3075000, B2=55350000, 3.72% chance of finding a factor
Starting stage 1 P-1, M392000107, B1 = 3075000, B2 = 55350000, fft length = 21952K
Doing 4435766 iterations
Iteration = 1400, err = 0.47754 >= 0.40, quitting.
Estimated time spent so far: 0:00

Using threads: norm1 256, mult 512, norm2 1024.
Using up to 2744M GPU memory.
Selected B1=3075000, B2=55350000, 3.72% chance of finding a factor
Starting stage 1 P-1, M392000107, B1 = 3075000, B2 = 55350000, fft length = 21952K
Doing 4435766 iterations
Iteration = 1400, err = 0.47754 >= 0.40, quitting.
Estimated time spent so far: 0:00

Using threads: norm1 256, mult 128, norm2 128.
Using up to 2700M GPU memory.
Selected B1=2960000, B2=41440000, 3.52% chance of finding a factor
Starting stage 1 P-1, M392000107, B1 = 2960000, B2 = 41440000, fft length = 23040K
Doing 4269810 iterations
SIGINT caught, writing checkpoint.
Estimated time spent so far: 12:29

CUDAPm1 v0.20
------- DEVICE 0 -------
name GeForce GTX 1060 3GB
Compatibility 6.1
clockRate (MHz) 1771
memClockRate (MHz) 4004
totalGlobalMem zu
totalConstMem zu
l2CacheSize 1572864
sharedMemPerBlock zu
regsPerBlock 65536
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 9
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 2927M of 3072M GPU memory free.
Using threads: norm1 256, mult 128, norm2 128.
Using up to 2700M GPU memory.
Selected B1=2960000, B2=41440000, 3.52% chance of finding a factor
Using B1 = 2960000 from savefile.
Continuing stage 1 from a partial result of M392000107 fft length = 23040K, iteration = 15601[/CODE]

kriesel 2019-06-07 01:50

v0.22 by comparison
 
[CODE]batch wrapper reports Starting cudaPm1-0.22-cuda8.exe on GeForceGTX10603GB at Thu 06/06/2019 17:58:01.61
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1060 3GB
Compatibility 6.1
clockRate (MHz) 1771
memClockRate (MHz) 4004
totalGlobalMem 3221225472
totalConstMem 65536
l2CacheSize 1572864
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 9
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 2927M of 3072M GPU memory free.
Using threads: norm1 512, mult 32, norm2 32.
Using up to 2700M GPU memory.
Selected B1=3395000, B2=42437500, 3.6% chance of finding a factor
Starting stage 1 P-1, M392000107, B1 = 3395000, B2 = 42437500, fft length = [B]23040K[/B]
Doing 4898441 iterations
Iteration = 100, err = 0.49584 >= 0.40, quitting.
Estimated time spent so far: 0:00

batch wrapper reports exiting at Thu 06/06/2019 17:59:00.04 [/CODE]

GhettoChild 2019-06-11 02:18

Advice please?
 
5 Attachment(s)
So I'm using a 4GB eVGA GTX 770 Classified (Kepler). I was originally unable to run any version of CUDAPm1 higher than CUDA 5.5 on this Win7 HP 64-bit PC, currently set to default clock speeds. I kept getting "[I][B]the program can't start because api-ms-win-crt-runtime-l1-1-0.dll is missing[/B][/I]"; same message if I tried to launch GeForce Experience. I tried installing Service Pack 1 but that didn't help; waste of time & space. Then I installed [U][I]Microsoft KB2999226[/I][/U]. That seems to have solved it; I can now run versions higher than CUDA 5.5.

So according to this GPU's specs, I should be able to run CUDA 10. For the most part CUDA 8 seems to run OK, but I can't understand why CUDA 10, in both CUDAPm1 v0.21 & v0.22, just fails right from the start. Attached are my lengthy screen output logs.

kriesel 2019-06-11 04:14

[CODE]fft size = 30870K, ave time = 61.7556 msec, max-ave = 0.08239
C:/Users/Aaron/Documents/Visual Studio 2017/Projects/CUDAPm1/CUDAPm1/bench.cu(202) : cudaSafeCall() Runtime API error 30: unknown error.
[/CODE]Try splitting up your -cufftbench. (Save the results of the first run under a different filename before making the second, or the second will overwrite/delete the first.)
-cufftbench 1 16384
-cufftbench 16384 32768
I've never had to go that low, but gpu models vary. API error 30 seems to be associated with the Windows TDR (timeout detection and recovery). You may want to try increasing the timeout period in the registry settings (the TdrDelay value under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers). Not using the system interactively during benchmarking is recommended.

[CODE]I:\GIMPS GPU Computing\CUDAPm1>cudapm1_win64_20181114_CUDA_10_v021.exe -threadbench 1 32768 2>&1
CUDAPm1 v0.21
can't parse options[/CODE]CUDALucas syntax for threadbench does not work in CUDAPm1.
CUDAPm1 -cufftbench 4608 4608 2
for threadbench at 4608k, 2 passes.
I usually wrap that in a batch file with a for loop over just the fft lengths listed in the fft file, using parameter substitution.
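The same wrapping can be sketched outside a batch file; here is a hypothetical Python version (the fft lengths and the executable name are placeholders; substitute the lengths from your gpu's fft file):

```python
import subprocess

# Placeholder lengths in K; take the real ones from your "<gpu name> fft.txt".
FFT_LENGTHS_K = [4608, 5120, 5760, 6144, 6912]

def threadbench_all(lengths, passes=2, exe="./CUDAPm1", dry_run=True):
    """Run (or, when dry_run, just print) one -cufftbench threadbench per length."""
    cmds = [[exe, "-cufftbench", str(n), str(n), str(passes)] for n in lengths]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds
```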

The zero values for benchmark timings are a problem.
The 0x0000000000000001 res64 values are a problem.

I see you're running CUDAPm1 v0.21. I have no experience with that version. But much will be similar to v0.20, which you could try. Or v0.22.

For some background info see [URL]https://www.mersenneforum.org/showthread.php?t=23389[/URL]
If CUDA8 runs well, on your setup, use it.

GhettoChild 2019-06-11 20:53

I've used an assortment of versions, and the uploads are both v0.21 and v0.22. v0.20 CUDA5.5 never gave these problems on the identical scripted tests. I run the same script of work as a "Self Test" batch file on each version the first time, before I start new work. CUDA8 v0.22 seems to work, as it passes most of the tests; I'm only having an issue at stage 2 of M120002191 (fft 6912K), where it just crashes without status. Then no version of CUDA10 passes any tests; I either get zeros ending with a 1 for residues, or an error code. CUDA10 v0.22 did complete the cufftbench once without issues from 1-32768: "totalavg: 488.4344 totallen: 8133237 totaltime: 16651.6445". I couldn't get a pass again after that one. The drive these are running on does have less than 4GB free space, but it's an extra drive, not the main OS drive.

Regarding your suggestion about splitting the cufftbench: this is all one batch file that I run. I don't know of a way to make the batch file pause between commands for input to continue, and splitting the work into two batch files equally defeats the purpose of lengthy unattended scripting.

Don't higher CUDA suites mean higher performance? Isn't CUDA10 faster than CUDA8? That's why I'm trying to get it to work. Also, although CUDA8 seems to work, it crashed at stage 2 of M120002191 (fft 6912K), while v0.20 CUDA5.5 completed that exponent and larger without issue.

kriesel 2019-06-11 21:33

[QUOTE=GhettoChild;519139]I'm only having an issue at Stage 2 M120002191 FFT 6912K it just crashes without status. [/QUOTE]Sometimes a relaunch from that point (or 2 or 3 relaunches) will coax it to finish; it depends on getting different roundoff behavior from one attempt to the next. I've hit rough spots that simply would not run on certain cards. Even a different card of the same gpu model could run what one card couldn't; the gpu BIOS versions differed. Higher and lower exponents worked OK, but around 85M would not run to completion on a certain Quadro 2000. There was another case around 171M as I recall. The stage 2 silent halt was described by the author as occurring due to excessive roundoff error. Sometimes a retry produces less roundoff error.
[QUOTE]
With your suggestion about splitting the cufftbench, this is all a batch file that I run. I don't know of a way to make the batch file pause between commands for input to continue; as well splitting the work into two batch files equally defeats the purpose of lengthy-unattended scripting.[/QUOTE]MODIFY the batch file, to split the cufftbench task into pieces. Instead of "cudapm1 -cufftbench 1 32768 2 >>logfile" in the batch file,

[CODE]cudapm1 -cufftbench 1 16384 2 >>logfile
rename "(whatever) fft.txt" "(whatever) fft save.txt"
if errorlevel 1 goto exit
rem the preceding check detects rename failure and bails, preventing loss of the first cufftbench pass
cudapm1 -cufftbench 16384 32768 2 >>logfile
echo Merge the two fft files via a text editor in a separate window, then press a key here
pause

rem (threadbench portion etc. of the batch file)

:exit
rem (end of batch file)[/CODE]
[URL]https://www.computerhope.com/pausehlp.htm[/URL]

[QUOTE]Doesn't higher CUDA suites mean higher performance? Isn't CUDA10 faster than CUDA8? That's why I'm trying to get it to work. Also despite CUDA8 seems to work, it crashed at Stage 2 M120002191 FFT 6912K but v0.20 CUDA5.5 completed that exponent and larger without issue.[/QUOTE]No. If it won't run, CUDA10's performance is zero. I don't recall hearing of any significant performance advantage to CUDA10 on older cards. CUDA10 is necessary for compatibility with some new cards that have a higher compute capability level than CUDA 8 or CUDA 9.x supports; a card that requires CUDA10 can't run on CUDA8. Continuing downward, CUDA8 or above is required for the GTX10xx family. I've often seen CUDA levels higher than a card requires give LESS performance than older CUDA levels also supported on the same hardware. For example, CUDA 5 or 6 is generally faster than CUDA 8 in CUDALucas on a GTX480, which is a compute capability 2.0 card. Even NVIDIA makes no claims of performance improvements at CUDA10.x for the fft library, which is what counts for CUDAPm1 and CUDALucas performance; see [URL]https://developer.nvidia.com/cuda-toolkit/whatsnew[/URL] All that stuff they list as improved there? Not used in CUDALucas or CUDAPm1.
See also [URL]https://www.mersenneforum.org/showpost.php?p=519147&postcount=8[/URL] and its attachment.

kriesel 2019-06-11 22:34

I've also noticed in some cases (small ram gpus) higher CUDA level means slightly lower maximum benchmarkable fft length. Perhaps a couple percent lower.

GhettoChild 2019-06-14 22:24

2 Attachment(s)
I was able to complete v0.22 CUDA8 up to M200001187; iteration 32 had error 0.4074, so that's an acceptable test for me for now. Increasing the unused memory parameter from the default 100 to 300 allowed it to reach this far. The CUDA10 fft bench completed and came out slower than CUDA8, as you stated; however, all other test exponents and selftests resulted in residues of zeros ending in a 1.

kriesel 2019-07-05 19:46

Dual instances, 9% throughput gain on gtx1080Ti for DC-range exponent P-1
 
In a nutshell, a GTX1080 Ti gpu, with 11GB, goes underutilized in at least two ways with one CUDAPm1 v0.20 instance. During stage 2, memory usage is ~4.5GB. During stage 1 gcd and stage 2 gcd, gpu memory is still occupied but the gcd is done on a cpu core and the gpu cores sit idle. For p~48M, on my test system, that's ~10% of the time that the gpu cores are idle.

Initial testing indicates that running two instances with a slight stagger gives about 9.5% higher aggregate throughput. Presumably this is due to one instance making full use of the gpu cores while the other is waiting on the cpu core to perform a gcd. After each instance has completed a couple of exponents through both stages, the stagger appears stable. I've initiated logging in GPU-Z to check whether any gpu idle or low utilization percentage occurs. So far, in almost an hour of logging, there's only about 20 seconds of gpu core idle indication.

Peak gpu memory usage is ~8593MB when stage 2 of two instances coincide, indicating 3 instances probably would not fit without memory contention. Since cudapm1 queries available gpu ram at the beginning and determines bounds for later use based on that, and NRP values, it might run into insufficient-memory problems with 3 or more instances. I think the case for any potential incremental throughput from a third instance is weak.

(All testing done in Windows 7, dual-E5520-Xeon system. Effect would be proportionally smaller with a proportionally faster cpu core.)
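A back-of-the-envelope model is consistent with that figure: if a fraction f of single-instance wall time is gpu-idle gcd work, a second staggered instance can at best raise gpu-limited throughput by f/(1-f). This is my own sketch, not anything CUDAPm1 computes:

```python
def overlap_gain(idle_frac):
    """Upper bound on throughput gain from hiding a gpu-idle fraction
    (e.g. cpu-side gcd time) behind a second staggered instance."""
    if not 0.0 <= idle_frac < 1.0:
        raise ValueError("idle fraction must be in [0, 1)")
    return idle_frac / (1.0 - idle_frac)

# ~10% gcd idle time bounds the gain at about 11%,
# bracketing the ~9.5% observed.
print(round(100 * overlap_gain(0.10), 1))
```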

petrw1 2019-07-06 04:34

[QUOTE=kriesel;520820]In a nutshell, a GTX1080 Ti gpu, [/QUOTE]

Did I miss it, or did you post the actual time it would take to complete a P-1?
For example, a current P-1 in the 85M range to recommended bounds?

Thanks

kriesel 2019-07-06 13:02

[QUOTE=petrw1;520852]Did I miss it or did you post the actual time it would take to complete a P-1.
For example a current P-1 in the 85M ranges to recommended bounds?

Thanks[/QUOTE]On the gtx1080 Ti, p~48M (DC wavefront), sums of stage 1 and stage 2 timings in console output:
one instance ~38:46 baseline
two instances ~70:08, throughput ~1/35:04, [B]~10.56%[/B] faster than single instance after I adjusted the stagger between the two instances (by stopping one for two minutes, then resuming it) to eliminate a brief 20-30 second recurring period of gpu cores idle.

For single-instance run times, nrp, bounds selected by the program etc, versus gpu model and exponent, see the attachments at [URL]https://www.mersenneforum.org/showpost.php?p=498673&postcount=9[/URL] and the couple of posts preceding.

On the gtx1080 Ti, a single instance at 90M is about 2.5 hours; the run time scaling is about p[SUP]2.05[/SUP].

Note, I don't think running multiple instances will work well at p>~61M on the 11GB GTX1080Ti because of stage 2 memory requirements. A 16GB gpu should be ok to p~89M.
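Those two data points can be sanity-checked with the stated p[SUP]2.05[/SUP] scaling (my own arithmetic, not program output):

```python
def scaled_runtime_minutes(t1_minutes, p1, p2, k=2.05):
    """Extrapolate run time from exponent p1 to p2 assuming time ~ p**k."""
    return t1_minutes * (p2 / p1) ** k

# 38:46 at p~48M extrapolates to roughly 2.3-2.4 hours at 90M,
# in line with the ~2.5 hours quoted above.
est = scaled_runtime_minutes(38 + 46 / 60, 48e6, 90e6)
print(round(est / 60, 2))
```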

masser 2019-07-06 15:55

[QUOTE=kriesel;520820]... goes underutilized in at least two ways with one CUDAPm1 v0.20 instance...[/QUOTE]

Perhaps I missed something, but why v0.20, instead of v0.22?

ixfd64 2019-07-06 20:44

I believe the 0.22 fork has some regressions.

LaurV 2019-08-14 09:13

Cudapm1 does not run on an RTX2080Ti on Win7. All tests are OK: the "-selftest" passes (all 5 factors are found in seconds; the test is supposed to take 16 seconds, but it is much faster on this card), and the -cufftbench runs (for both fft and threads) work well and write the correct files.

However, when "-selftest2" is run, or when a "real task" is done, the program stops with no GPU activity. For -selftest2, the stop occurs when the first GCD is called; the CPU shows 5% activity (one core of 20 is busy) but there is no progress and no output (the GCD in question should take no more than 100 milliseconds to half a second). For a real test case, the stop occurs exactly after the FFT, B1 and B2 are selected (and printed on screen); there is no CPU or GPU occupancy, but the GPU is "hooked" somehow, because the clock (in GPU-Z) stays high rather than dropping to 50MHz or so as when the card is idle. In all these situations, the only possible exit is killing the process (ctrl+c will show the sigint message, but it never exits).


Edit: this is valid for all versions I could download from [URL="https://download.mersenne.ca/CUDAPm1/"]James' mirror[/URL] (i.e. including the latest ones). Is anyone running this on RTX cards?

masser 2019-08-14 14:25

Did you try adjusting the UnusedMem setting in the .ini file? I only have a weak GPU, but I was having a lot of stalls until I turned up this value to about 20% of the GPU's memory.

hansl 2019-08-21 15:37

[QUOTE=hansl;516516]
Would it be safe/reliable to create symlinks "libcufft.so.10.0" to the actual 10.1 file?
[/QUOTE]
This was from a few months ago, but I just got around to trying it, and it definitely doesn't work to symlink/rename 10.1 to 10.0.

I was able to build for 10.1 though, so it's running now.

One question: It did some benchmarks where it looks like the best result was:
[code]
fft size = 5120K, ave time = 0.8334 msec, Norm1 threads 512, Norm2 threads 1024
[/code]

However during the actual Pm1 I get:
[code]
Iteration 5000 M[redacted], 0x[redacted], n = 5120K, CUDAPm1 v0.22 err = 0.14844 (0:50 real, 10.1213 ms/iter, ETA 3:33:22)
[/code]

I guess I was expecting the ms/iter to roughly match the msec from the benchmark, or does it not really work that way? Currently the difference is a factor of 12.14x

This is on a GTX 1660 6GB (non-Ti)

kriesel 2019-08-22 23:33

[QUOTE=hansl;524152]I guess I was expecting the ms/iter to roughly match the msec from the benchmark, or does it not really work that way? Currently the difference is a factor of 12.14x[/QUOTE]The match is fairly close in CUDAPm1 [B]v0.20[/B], and not in [B]v0.22[/B].
With modern gpus it's hard to get a close match because clock speeds fluctuate, system activity varies, etc.

c10ck3r 2019-08-27 19:30

Any guidance on how to correct error "device_number >= device_count" when using CUDAPm1 for the first time (0.22)?
TIA

kriesel 2019-08-27 23:02

[QUOTE=c10ck3r;524682]Any guidance on how to correct error "device_number >= device_count" when using CUDAPm1 for the first time (0.22)?
TIA[/QUOTE]
How many gpus are in the system? The first one is device number 0.
If that's not it, have a look further in the getting started guide

[url]https://www.mersenneforum.org/showpost.php?p=489051&postcount=4[/url]

c10ck3r 2019-08-28 00:06

[QUOTE=kriesel;524690]How many gpus are in the system? The first one is device number 0.
If that's not it, have a look further in the getting started guide

[URL]https://www.mersenneforum.org/showpost.php?p=489051&postcount=4[/URL][/QUOTE]


Just 1, and device_number is set to 0. I downloaded all the .dll files last week; perhaps one of them is causing the issue, since the error also shows '(This is probably a driver problem)'?
GTX1050 for reference, I have the following drivers all in the folder containing CUDAPm1:
cudart32_101
cudart64_31_9
cudart64_101
cufft64_10
cufft64_31_9
cufftw64_10

hansl 2019-08-28 15:39

[QUOTE=c10ck3r;524695]Just 1, and device_number is set to 0. I downloaded all .dll files last week- perhaps one of them is causing the issue, since the error also shows '(This is probably a driver problem)'?
GTX1050 for reference, I have the following drivers all in the folder containing CUDAPm1:
cudart32_101
cudart64_31_9
cudart64_101
cufft64_10
cufft64_31_9
cufftw64_10[/QUOTE]
Do you have the latest nvidia drivers installed? Nvidia control panel recognizes it, etc?

kriesel 2019-08-28 18:54

[QUOTE=c10ck3r;524695]Just 1, and device_number is set to 0. I downloaded all .dll files last week- perhaps one of them is causing the issue, since the error also shows '(This is probably a driver problem)'?
GTX1050 for reference, I have the following drivers all in the folder containing CUDAPm1:
cudart32_101
cudart64_31_9
cudart64_101
cufft64_10
cufft64_31_9
cufftw64_10[/QUOTE]
Which CUDA version of CUDAPm1 are you trying to run? On what OS, 32 or 64-bit? (Likely 64 if reasonably modern hardware.) CUDAPm1 needs a capable gpu, a suitable driver for the gpu, and cudart and cufft dlls that match both the CUDA version for which the CUDAPm1 executable was compiled and its bitness.
You have the two extremes, very new and very old, plus a couple of outliers: cudart32_101, which is 32-bit, and cufftw, which is not needed. The cudart64_101 version does not match cufft64_10 (v10.1 vs. v10.0).

If you run nvidia-smi to get details about the gpu, what does it tell you? See [url]https://www.mersenneforum.org/showpost.php?p=490744&postcount=15[/url]
Have you run any other CUDA software on it? if so, what versions worked then?

A GTX1050 would need CUDA8 dlls to run mfaktc, but should run somewhat older CUDA-level software such as CUDALucas or CUDAPm1 OK. I mostly run the later-dated CUDA5.5 or 5.0 CUDAPm1 builds, never 3.2 or older. See [url]https://download.mersenne.ca/CUDAPm1/old-experimental[/url]

c10ck3r 2019-08-29 06:36

[QUOTE=kriesel;524749]Which CUDA version CUDApm1 are you trying to run? On what OS, 32 or 64-bit? (Likely 64 if reasonably modern hardware). CUDApm1 needs a capable gpu, a suitable driver for the gpu, and cudart and cudafft dlls that match the CUDA version for which the CUDAPm1 executable was compiled and also the bitness.
[...]

A GTX1050 would need CUDA8 dlls to run mfaktc, but should run somewhat older CUDA level software such as CUDALucas or CUDAPM1 ok. I mostly run the later dates of CUDA5.5 or 5.0 CUDAPm1. Never 3.2 or older though. See [URL]https://download.mersenne.ca/CUDAPm1/old-experimental[/URL][/QUOTE]
Switching to 5.5 fixed it, thank you!

kriesel 2019-08-29 13:03

[QUOTE=c10ck3r;524776]Switching to 5.5 fixed it, thank you![/QUOTE]Sweet. You're welcome. What size exponents do you plan to run? See
[url]https://www.mersenneforum.org/showthread.php?p=489365#post489365[/url] and following posts for an idea of exponent limits on other gpu models.
Please provide any success or failure info versus exponent sizes tried, and I'll add it.
Also, whether your GTX1050 is a 2GB or 3GB unit.

c10ck3r 2019-09-01 02:47

[QUOTE=kriesel;524790]Sweet. You're welcome. What size exponents do you plan to run? See
[URL]https://www.mersenneforum.org/showthread.php?p=489365#post489365[/URL] and following posts for an idea of exponent limits on other gpu models.
Please provide any success or failure info versus exponent sizes tried, and I'll add it.
Also, whether your GTX1050 is a 2GB or 3GB unit.[/QUOTE]
Tested it out with a 94M exponent, B1=855k, B2=~18.2M, e=2, on the 2GB model. 9.4451 GHz-days; I don't have an exact duration, but at the beginning it gave an ETA of 6.75-7.5 hrs, not sure if that was the stage 1 ETA only.
I did, however, find out that I can't run either mfaktc or CUDAPm1 at the same time as 4x P-1 in P95, even though I've got 32GB RAM and a Threadripper. It froze my computer within about 5 minutes of starting both programs. Yikes lol


Edit: Is it possible to do stage 1 only with CUDAPm1 and then perform stage 2 in P95? Just thinking that, since I have quite a bit of RAM available, this could allow me to focus on some deeper P-1 runs.

kriesel 2019-09-01 12:29

[QUOTE=c10ck3r;524957]Tested it out with a 94M exponent, B1=855k, B2=~18.2M, e=2. 2GB model. 9.4451 Ghz-Days, don't have an exact duration, but at the beginning it gave ETA of 6.75-7.5hrs, not sure if that was Stage 1 ETA only.
I did, however, find out that I can't run either mfaktc or CUDAPm1 at the same time as 4x P-1 in P95, even though I've got 32GB RAM and a Threadripper. Froze my computer within about 5 minutes of starting with both programs. Yikes lol

Edit: Is it possible to do Stage 1 only with CUDAPm1 and then perform Stage 2 in P95? Just thinking since I have quite a bit of RAM available, this could allow me to focus on some deeper runs of P-1[/QUOTE]CUDAPm1 gives an ETA per stage only, yes. See the Polite directive in CUDAPm1.ini for system responsiveness.

On prime95 I mix PRP and P-1. It seems to like one P-1 per cpu package better, although that could be system dependent. P-1 likes lots of memory.

Re starting with stage 1 on CUDAPm1 and finishing with stage 2 on prime95 for the same exponent, I think that would require some software development, if it's possible at all. See [URL]https://www.mersenneforum.org/showpost.php?p=524873&postcount=24[/URL]

storm5510 2019-11-15 14:32

[I]I will put this here hoping someone will see it...[/I]

On occasion, [I]CUDAPm1[/I] will stop running and exit after completing stage 1. This behavior could suggest that it does not detect enough RAM onboard my GPU for stage 2. It is a GTX 1080 with 8 GB of RAM. Below is how I ran it, by command line:

[CODE]cudapm1 98181383 -b1 710000 -b2 13490000[/CODE]I got these bounds from [B]James Heinrich's[/B] "P-1 Probability Calculator" page on [URL]https://www.mersenne.ca[/URL].

I am running it again with [I]Prime95[/I]. Is it possible this B2 is too large to run on my GPU?

kriesel 2019-11-15 18:17

[QUOTE=storm5510;530661][I]I will put this here hoping someone will see it...[/I]

On occasion, [I]CUDAPm1[/I] will stop running and exit after completing stage 1. This behavior could suggest that it does not detect enough RAM onboard my GPU for stage 2. It is a GTX 1080 with 8 GB of RAM. Below is how I ran it, by command line:

[CODE]cudapm1 98181383 -b1 710000 -b2 13490000[/CODE]I got these bounds from [B]James Heinrich's[/B] "P-1 Probability Calculator" page on [URL]https://www.mersenne.ca[/URL].

I am running it again with [I]Prime95[/I]. Is it possible this B2 is too large to run on my GPU?[/QUOTE]Did it produce a message stating it completed the stage 1 gcd? If so, you could report the stage one result.

Normally I would expect that exponent to run to completion on that model gpu easily (or even ~triple that exponent). But odd things happen sometimes in CUDAPm1. Sometimes a resumption after a stop will do the job. I've hit spots (on the exponent number line) that one gpu can't finish but another gpu of the same or larger gpu ram size can; in one case, even the same model with a different gpu BIOS rev. Quadro 2000s hit issues around 85.5M and 171M exponents. (That model is not recommended for P-1 because the feasible or program-selected bounds are too low.)

There's another failure mode the author has described: an excessive roundoff error in stage 2 produces a very quiet exit, with no error message to say why.
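For intuition about what that "err" number measures: these programs carry integer data in floating-point FFTs, and "err" is the largest distance of any convolution output from an exact integer. Once it nears 0.5, rounding back to integers becomes ambiguous, which is why CUDAPm1 gives up at 0.40. A toy Python sketch (a plain recursive FFT, nothing like CUDAPm1's actual kernels) showing how the error grows with word size:

```python
import cmath
import random

def fft(a, invert=False):
    """Recursive radix-2 FFT; the inverse leaves out the 1/n scaling."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def convolution_err(da, db):
    """Multiply two digit vectors via FFT and return the largest
    distance of any output coefficient from an exact integer."""
    n = 1
    while n < len(da) + len(db):
        n *= 2
    fa = fft([complex(x) for x in da] + [0j] * (n - len(da)))
    fb = fft([complex(x) for x in db] + [0j] * (n - len(db)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return max(abs(v.real / n - round(v.real / n)) for v in prod)

random.seed(1)
small = [random.randrange(1 << 8) for _ in range(64)]   # 8-bit words
large = [random.randrange(1 << 18) for _ in range(64)]  # 18-bit words
print(convolution_err(small, small))  # tiny
print(convolution_err(large, large))  # orders of magnitude larger
```

With 8-bit words the outputs sit extremely close to integers; packing 18 bits per word makes the error orders of magnitude worse, which is the same effect that blows up when an fft length is too small for the exponent.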
For more info, see [URL]https://www.mersenneforum.org/showpost.php?p=489365&postcount=7[/URL] Limits
[URL]https://www.mersenneforum.org/showpost.php?p=498673&postcount=9[/URL] Limit and run time detail by gpu model
[URL]https://www.mersenneforum.org/showpost.php?p=509937&postcount=3[/URL] Errors in general
[URL]https://www.mersenneforum.org/showpost.php?p=501182&postcount=7[/URL] P-1 progress
[URL]http://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL] CUDAPm1 bug and wish list

The most recent versions of Gpuowl can also run on NVIDIA and do P-1 including save files. There may be bugs in it that prevent it from completing P-1 on some exponents, but I haven't found them yet, in admittedly much less sampling of exponents on gpuowl than on CUDAPm1. On large-gpu-ram models like the GTX1080, gpuowl appears capable of going to higher exponents.
See [URL]https://www.mersenneforum.org/showpost.php?p=525955&postcount=17[/URL]
Run times would be about 30% longer on a GTX1080 than on a GTX1080Ti.
Zip file for Windows gpuowl v6.11-9 at [URL]https://www.mersenneforum.org/showpost.php?p=526331&postcount=1403[/URL]

In any event, have fun!

storm5510 2019-11-15 22:48

[QUOTE=kriesel;530681]Did it produce a message stating it completed the stage 1 gcd? If so, you could report the stage one result.
...
In any event, have fun![/QUOTE]

Thank you for the reply! :smile:

The roundoff error never exceeded 0.035; I believe this is what is displayed on every line as "err". Also, it completed the GCD before it dropped out, but there was no results file.

Windows 10, which I am running, has a very detailed Task Manager. It displays information about a GPU, if one is present. At the time, I didn't think the GPU's RAM usage was excessive. It is possible I was looking at stage 1 though.

[I]gpuOwl[/I]: I have seen posts about it going back a while, but I was under the impression it was a Linux-only project. I will give it a try. It uses OpenCL; GPU-Z says I have that capability, but I have never run anything which uses it.

Microsoft has continually updated Windows 10. What I have is v1903 plus some maintenance updates. After a point, the older versions of [I]CUDAPm1[/I] and [I]CUDALucas[/I] simply would not start. I replaced them with newer ones I found in James' archive on [I]mersenne.org[/I]. The newer ones did not seem to have any problems, until today.

It will take me a while to digest everything in your links. I appreciate the effort.

[B][I]Edit[/I][/B]

I tried gpuOwl. It gives me all the below and then exits. It would help to see a config file and worktodo example.


[CODE]2019-11-15 18:01:36 gpuowl v6.11-9-g9ae3189
2019-11-15 18:01:36 Note: no config.txt file found
2019-11-15 18:01:36 config: -pm1 98181383
2019-11-15 18:01:36 98181383 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.02 bits/word
2019-11-15 18:01:37 OpenCL args "-DEXP=98181383u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xf.bbe27b81b7e38p-3 -DIWEIGHT_STEP=0x8.22a2337ec7b7p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-11-15 18:01:37 OpenCL compilation error -11 (args -DEXP=98181383u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0xf.bbe27b81b7e38p-3 -DIWEIGHT_STEP=0x8.22a2337ec7b7p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-11-15 18:01:37 <kernel>:197:3: error: invalid output constraint '=v' in asm
X2(u[0], u[2]);
^
<kernel>:174:37: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
<kernel>:197:3: error: invalid output constraint '=v' in asm
<kernel>:175:37: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
<kernel>:198:3: error: invalid output constraint '=v' in asm
X2_mul_t4(u[1], u[3]);
^
<kernel>:180:37: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (t.x) : "v" (b.x), "v" (t.x)); \
^
<kernel>:198:3: error: invalid output constraint '=v' in asm
<kernel>:181:37: note: expanded from macro 'X2_mul_t4'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.y), "v" (b.y)); \
^
<kernel>:199:3: error: invalid output constraint '=v' in asm
X2(u[0], u[1]);
^
<kernel>:174:37: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
<kernel>:199:3: error: invalid output constraint '=v' in asm
<kernel>:175:37: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
<kernel>:200:3: error: invalid output constraint '=v' in asm
X2(u[2], u[3]);
^
<kernel>:174:37: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.x) : "v" (t.x), "v" (b.x)); \
^
<kernel>:200:3: error: invalid output constraint '=v' in asm
<kernel>:175:37: note: expanded from macro 'X2'
__asm( "v_add_f64 %0, %1, -%2\n" : "=v" (b.y) : "v" (t.y), "v" (b.y)); \
^
<kernel>:266:3: error: invalid output constraint '=v' in a2019-11-15 18:01:37 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:229 build
2019-11-15 18:01:37 Bye
[/CODE]

kracker 2019-11-15 23:34

[QUOTE=storm5510;530703]I tried gpuOwl. It gives me all the below and then exits. It would help to see a config file and worktodo example.

[CODE]2019-11-15 18:01:37 <kernel>:197:3: error: invalid output constraint '=v' in asm
...
2019-11-15 18:01:37 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:229 build[/CODE][/QUOTE]

Use -use ORIG_X2 either in the command line or config.txt - honestly that should be the default when running under windows...

storm5510 2019-11-16 00:10

[QUOTE=kracker;530709]Use -use ORIG_X2 either in the command line or config.txt - honestly that should be the default when running under windows...[/QUOTE]

There is no example configuration text in the archive, so I do not know how to format it. The same applies to a worktodo file.

[B]Edit: Please disregard. With some experimentation, I have it running.[/B]
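For reference, the layout I ended up with: per kracker's tip above, gpuowl's config.txt simply holds command-line flags (a minimal sketch, assuming v6.11 behavior; only the -use flag here is the one suggested in this thread):

```
-use ORIG_X2
```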

kriesel 2019-11-16 16:04

[QUOTE=storm5510;530703]I tried gpuOwl. It gives me all the below and then exits. It would help to see a config file and worktodo example.

[CODE]2019-11-15 18:01:37 <kernel>:197:3: error: invalid output constraint '=v' in asm
...
2019-11-15 18:01:37 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:229 build[/CODE][/QUOTE]In my experience CUDAPm1 can silently exit in stage 2 without ever displaying an excessive roundoff error. But that does not mean one didn't occur, only that no message reporting it was shown. Sometimes specifying a higher fft length than it selects for itself and running again will get it through the bad spot.
For the gpuowl issue you had, try using -use ORIG_X2 or -use FMA_X2 in the gpuowl command line. It's defaulting to INLINE_X2 and not handling it well. ORIG or FMA may be faster; try and see.
GpuOwl has always been developed on Linux. However, going back to the very earliest versions, it has also been ported to Windows; mostly kracker and I have posted Windows-compiled builds. (Hope I haven't slighted anyone there.)
The bigger change was when it was also made to work on NVIDIA, even though Preda generally owns only AMD gpus for development testing. I have even tested it on Intel IGPs a couple of times.
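Putting that together with the flags from your log, the invocation would look something like this (executable name may differ by build; -pm1 and -use are the flags shown in your log and above):

```
gpuowl -pm1 98181383 -use ORIG_X2
```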

storm5510 2019-11-16 16:34

[QUOTE=kriesel;530754]In my experience CUDAPm1 can silently exit in stage 2 without ever displaying an excessive roundoff error. ... Sometimes specifying a higher fft length than it selects for itself and running again will get it through the bad spot.[/QUOTE]

I did not know it was possible to specify a higher FFT length for CUDAPm1, and I do not know how. The newer release from this year, which I have been running, has been extremely reliable except for what I wrote of above.

As for gpuOwl, I did not have the entire package, just an update. I found the rest in [B]James Heinrich's[/B] archive on [I]mersenne.ca[/I]. I recall writing this in another topic. No matter. Despite having to do some experimentation, it runs quite well now. I successfully guessed a [I]config.txt[/I] layout. It still refuses my [I]worktodo.txt[/I] files, though; incorrect syntax, I believe it says.

kriesel 2019-11-16 17:58

[QUOTE=storm5510;530760]I did not know it was possible to specify a higher FFT for CUDAPm1, and I do not know how.
...
As for gpuOwl... It still refuses my [I]worktodo.txt[/I] files though. Incorrect syntax, I believe it says.[/QUOTE]
Specify the fft length on the command line with the -f option. For example:
[CODE]cudapm1 -f 4608k[/CODE]

When in doubt, use -h to see the options. And note, despite what that says, -r does nothing.

For worktodo syntax for any of the common GIMPS applications, see [URL]https://www.mersenneforum.org/showpost.php?p=522098&postcount=22[/URL]

storm5510 2019-11-16 23:27

[QUOTE=kriesel;530770]specify fft length on the command line, with -f option. For example,
cudapm1 -f 4608k

When in doubt, use -h to see the options. And note, despite what that says, -r does nothing.

For worktodo syntax for any of the common GIMPS applications, see [URL]https://www.mersenneforum.org/showpost.php?p=522098&postcount=22[/URL][/QUOTE]

Thank you!

I was not sure about the command-line format, as I have never run it this way before. The "-h" switch displayed a lot of other things, but not much about the basics. However, I have it running.

For what I was running, it decided on 5760 for an FFT size. I believe this was what it was using before when it quietly stopped. The command line insisted on multiples of 1024, so I set it to 6144. Now it is wait and see.

I hope you do not mind: I copied all the text from your link above and saved it here. I have been looking for something like this for [U]years[/U]!

:smile:

aaronhaviland 2019-11-23 16:36

*tap*tap* is this thing still on?

I freed up some time from other hobbies recently, and am thinking about getting back involved with this. It will take me a while to get my development environment back up to speed, but I figured I'd throw this out there if there's any interest.

Of note, I'm completely out of touch with the current state of all things factoring related, so if there's anything I should be aware of before I dive back in, please let me know!

Thanks, and happy factoring :)

(On a side note: I think my GPU is dying. I only get reliable results if I underclock it. This is sad for me, but it also gives me a reliable source of hardware errors to test with that I hadn't had before, so who knows, maybe it can help me make things more robust.)

storm5510 2020-09-08 12:28

[QUOTE=aaronhaviland;531324]*tap*tap* is this thing still on?[/QUOTE]

*Tap *tap... Nope. It's dead. :smile:

James Heinrich 2020-09-08 12:34

[QUOTE=storm5510;556408]*Tap *tap... Nope. It's dead. :smile:[/QUOTE]But [url=https://www.mersenneforum.org/forumdisplay.php?f=171]gpuowl[/url] is very much alive, and is the current go-to program for all non-TF GPU worktypes; for TF, stick with [URL="https://www.mersenneforum.org/showthread.php?t=12827"]mfaktc[/URL]/[URL="https://www.mersenneforum.org/showthread.php?t=15646"]mfakto[/URL].

storm5510 2020-09-08 15:36

[QUOTE=James Heinrich;556409][B]But [URL="https://www.mersenneforum.org/forumdisplay.php?f=171"]gpuowl[/URL] is very much alive[/B], and the current go-to program for all non-TF GPU worktypes, other than TF for which stick with [URL="https://www.mersenneforum.org/showthread.php?t=12827"]mfaktc[/URL]/[URL="https://www.mersenneforum.org/showthread.php?t=15646"]mfakto[/URL].[/QUOTE]

I meant to add that. Sorry! :grin:

kriesel 2020-09-08 16:15

There are some NVIDIA gpus which cannot run gpuowl but can run CUDAPm1 or CUDALucas. They tend to be limited by gpu ram and by CUDAPm1 issues. See [URL]https://www.mersenneforum.org/showthread.php?t=23389[/URL].

Addition of the Jacobi check to CUDALucas would be a big plus, and likely to be useful in confirming the next Mersenne prime discovery.

Implementing error checks in P-1 (either gpuowl or CUDAPm1 or anything else) is more complicated, and also needed. (In the LL series, the correct Jacobi symbol is known. In P-1 computations, it must be both calculated and checked.) See also [URL]https://www.mersenneforum.org/showthread.php?t=24168[/URL] for some ideas, particularly [URL]https://www.mersenneforum.org/showpost.php?p=515641&postcount=10[/URL]
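Implementing such checks matters because P-1 stage 1 is one long modular powering followed by a single gcd, so an undetected arithmetic error anywhere silently ruins the result. A toy Python sketch of stage 1 on a tiny Mersenne number (schoolbook arithmetic, not a production FFT; M67 is used because its factor 193707721, found by F. N. Cole, has a 3000-smooth p-1):

```python
from math import gcd, isqrt, log

def primes_upto(n):
    """Simple sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, isqrt(n) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(2, n + 1) if sieve[i]]

def pm1_stage1(p, B1):
    """P-1 stage 1 on M_p = 2^p - 1: raise 3 to a B1-smooth exponent
    mod M_p, then take one gcd.  Factors of M_p are 1 (mod 2p), so 2p
    is folded into the exponent for free."""
    M = (1 << p) - 1
    x = pow(3, 2 * p, M)
    for q in primes_upto(B1):
        x = pow(x, q ** int(log(B1, q)), M)  # highest power of q <= B1
    g = gcd(x - 1, M)
    return g if g > 1 else None  # g == M would mean both factors hit at once

print(pm1_stage1(67, 3000))  # recovers a factor of M67 (Cole's 193707721 is within these bounds)
```

The single gcd at the end is exactly why silent errors hurt: one bad squaring anywhere in the loop and the gcd comes back 1 with no indication anything went wrong.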

But yeah, no reply in the thread for about 10 months means this may have already missed Aaron Haviland's window of availability.

Dylan14 2020-12-12 18:07

1 Attachment(s)
I have made a PKGBUILD for this software. You can find it here: [url]https://aur.archlinux.org/packages/cudapm1/[/url].

Also, attached is a CUDA 11.1 Linux build of cudapm1, compiled on an Arch Linux system with a GTX 1660 ti.

