mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

James Heinrich 2018-08-29 23:48

[QUOTE=kriesel;494893]My greedy cell provider is another story[/QUOTE]Wow, sounds like you live in the backwoods of Elbonia. Or Canada.

kriesel 2018-08-30 05:41

[QUOTE=James Heinrich;494895]Wow, sounds like you live in the backwoods of Elbonia. Or Canada.[/QUOTE]
About ten miles from the state capitol building and a land-grant research university with over a billion dollars in annual funding. But there are lakes in the way.

Xyzzy 2018-08-30 16:12

We would give our left testicle for decent Internet connectivity.

:mike:

chalsall 2018-08-30 18:20

[QUOTE=Xyzzy;494946]We would give our left testicle for decent Internet connectivity.[/QUOTE]

I would give my right testicle for a good library.

henryzz 2018-09-02 21:40

[QUOTE=kriesel;494883]Thanks for the tip. The earliest version available there is VS 2013. (I'd hoped to be able to get back to VS2010.)

After multiple failed download attempts via my crappy slow costly ISP (768k/128k DSL; a 4.8GB, 14-hour download projected if things were working well; actual 1.5GB max per attempt; multiple days elapsed), the utility contractor working in my neighborhood to install fiber put an end to it by cutting the neighborhood's telco voice/DSL cable. Driving to another location got the 4.8GB ISO download on the first try in under 3 hours. With such slow and unreliable internet, I tend to go for a full install image that can be put on a local file server: download once, reuse locally. Crappy-slow-costly-ISP was contacted within 10 minutes of the start of the outage, took an hour of phone time to generate a trouble ticket, projected that repair would begin after a week of no service, and claimed they would process a bill credit. The service cut was on the first day of the billing cycle. I've already received a bill for a full month's service not received or receivable, beginning the day the cable was cut, and the bill did not include the promised credit for the outage. The DSL in this neighborhood runs from the nearest village, miles away, preventing high speed, instead of from the nearest hut, a half mile away, which could probably provide 25 Mbps.[/QUOTE]

The earliest community edition was 2013. It was called the express edition before then. 2010 express is downloadable via that link.

kriesel 2018-10-20 02:16

[QUOTE=James Heinrich;494889]I've had ISP troubles like that in the past (it once took an ISP 6 weeks of no internet before they fixed whatever was broken), so I can sympathize. I'm happy to be on 250Mbps service now (4.8GB ISO should complete in under 3 mins). I hope your fiber install is completed soon.[/QUOTE]
Unfortunately, the Oct deadline has passed without a sufficient number of neighbors signing up, so the schedule for fiber install for my neighborhood has been delayed by nominally 6 months, to next June. And the rate of signup (I'm tracking via their website) looks ominously slow for even that delayed schedule. If/when it happens, they're offering 300, 400, and 1000Mbps.

James Heinrich 2018-10-20 02:28

[QUOTE=kriesel;498319]If/when it happens, they're offering 300, 400, and 1000Mbps.[/QUOTE]If I went with fibre in my "village" of 0.4M population, I can get fibre "to my neighbourhood" and get up to 100/10 service for $80/mo. If I lived in the 5.0M population city up the road, the exact same price (with the same company) would get me 1000/750 service (and an extra $10/mo makes it 1500/940). Yay for small(ish) towns. :ick:

kriesel 2018-10-20 06:25

[QUOTE=James Heinrich;498320]If I went with fibre in my "village" of 0.4M population, I can get fibre "to my neighbourhood" and get up to 100/10 service for $80/mo. If I lived in the 5.0M population city up the road, the exact same price (with the same company) would get me 1000/750 service (and an extra $10/mo makes it 1500/940). Yay for small(ish) towns. :ick:[/QUOTE]
If/when it comes, 300/300 is $40/mo here, plus $7.95 for modem rent, less than I'm paying for slow slow DSL when the mandatory POTS is added in along with a menu of fees and taxes. For $85/mo, a DVR and 125+ channels too. Top end is 1000/400 plus tv plus phone at $145/mo. If/when it's actually installed in my neighborhood, it's fiber to just inside the home. This is in a small 40 year old development near farm fields, nature preserve, and lake, 3 miles from a village of 10,000, and 10 miles from the state capital population ~200K, where at UW-Madison every dorm room has at least 100/100, probably gigabit since they include 802.11ac wireless. But Wisconsin overall is among the very worst states in the US for connection speed.

[url]https://www.wpr.org/wisconsin-broadband-speed-among-worst-nation[/url]
UW-Madison peers with local internet providers at up to 10G if not higher. I've met and worked with Paul N., listed at [url]https://kb.wisc.edu/ns/page.php?id=6636[/url].

aaronhaviland 2018-10-23 22:05

[QUOTE=kriesel;494317]Hi,


I stumbled on this a while back, noted it, forgot about it, and recently had another look. Has anyone compiled and run this? If so, how did it compare to the sourceforge version, which is what's mirrored at mersenne.ca?

[URL]https://github.com/ah42/cuda-p1[/URL][/QUOTE]

Oh hey, that's me! I coincidentally just started looking in this direction again because I finally decided to get a new GPU (I'm still using the GTX 660 I got in 2012 lol. RTX 2070 should be here next week...)

I completely forgot that I had that repo...

The code I have on github should be an improvement based on the code from Sourceforge, but it's been so long, I don't remember what I actually improved.

Looking at it now, it doesn't look like SF code has been updated since I forked it in 2013. I know I've been away for a while, but is that really the most recent P-1 GPU code?

aaronhaviland 2018-10-23 23:17

[QUOTE=aaronhaviland;498613]The code I have on github should be an improvement based on the code from Sourceforge, but it's been so long, I don't remember what I actually improved.[/QUOTE]

I just found this draft of a message I had intended to send to the original authors regarding my changes.
[INDENT][INDENT]I've been playing around with the code for a few months and have come across a few bugs, added features, or tweaked things (mostly making some things more to my liking, or trying to make things work better in my specific environment).

BTW, this is PM'd and not posted publicly because I wanted to contribute, and not hijack the program. A fork is a fork, but I'm trying to be a friendly fork :)

You can browse my changes here: [url]https://github.com/ah42/cuda-p1/commits/master[/url]

My environment is 64-bit Linux, so I may have broken things for Win builds. My Makefile changes are specific to working in Ubuntu with nvidia-cuda-toolkit installed via apt-get (I've got PPAs for cuda 5.5 and 6.0)

Quick summary of some changes thus far:
- (feature) Parse Pminus1 lines in worktodo.txt
- (bug) integer overflow caused infinite loop entering stage2 with certain combinations of b1/b2
- (bug) Bypass some code if exiting with an error
- (bug) memory leaks in stage2
- (bug) estimate "tf'd-to" level based on P95 defaults, instead of using an across-the-board default value if one is not specified
- (bug) Makefile: Future proof binaries by building PTX for latest known arch.
- (bug) Set minimum B1/B2. Smaller M()'s were computing poor probability and defaulting to B1=0
- (bug) Pre-computed Dickman's values were not giving results that agreed with the probability calculator at mersenne.ca. Replaced with a function to compute at runtime.
- reduce the number of available thread combinations to speed up benchmarking. (No impact on any of my devices: 460, 660. Excluded ranges were never chosen by any device.)
- Added quick-running tests (low b1/b2) to Makefile to verify compiled binary finds factors in stage 1 and stage 2 (make test)

A lot of other changes are just reformatting, or tweaking things for my own happiness (i.e. changing the B1/B2/e selection routines for different probabilities, sending the entire codebase through eclipse to reformat in a consistent coding style)

TODO:
- split device code from host code into separate files, providing a clean separation between the two realms. This also makes it much easier to change the compiler/options for the host code. This sort of system was used for the seti@home CUDA binary, and I found it provided a very clean codebase and worked quite well.
- try to clean up some of the kernels. I'd rather do this before the above, however with all the code being in one large .cu, it makes it more difficult.
- continue tuning/adjusting the bounds-calculation algorithms. It looks like this code was just lifted from elsewhere (mprime?) and wedged into place.[/INDENT][/INDENT]
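As an aside, the runtime Dickman-rho computation mentioned in the change list above can be sketched numerically: rho(u) = 1 for 0 <= u <= 1 and satisfies the delay differential equation rho'(u) = -rho(u-1)/u. A minimal illustrative sketch (the function name is invented; this is not the cuda-p1 code):

```python
# Numerical sketch of Dickman's rho, used in P-1 probability estimates:
# rho(u) approximates the probability that a random integer n has no
# prime factor exceeding n^(1/u). Midpoint-rule ODE stepping, for
# illustration only -- not the cuda-p1 implementation.

def dickman_rho(u, h=1e-4):
    """Approximate rho(u) by stepping rho'(t) = -rho(t-1)/t on a grid."""
    if u <= 1:
        return 1.0
    steps = int(round(u / h))
    grid = [1.0] * (steps + 1)      # grid[i] ~ rho(i*h); rho = 1 on [0, 1]
    start = int(round(1.0 / h))
    for i in range(start, steps):
        t = i * h
        tm = t + h / 2              # midpoint of the step [t, t+h]
        j = int((tm - 1.0) / h)     # grid index near rho(tm - 1)
        grid[i + 1] = grid[i] - h * grid[j] / tm
    return grid[steps]

# On [1, 2] the exact value is rho(u) = 1 - ln(u), a handy sanity check.
print(round(dickman_rho(2.0), 4))
```

For 1 <= u <= 2 the equation integrates in closed form to 1 - ln(u), which makes the accuracy of the stepping easy to verify.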

kriesel 2018-10-24 04:27

[QUOTE=aaronhaviland;498613]Oh hey, that's me! I coincidentally just started looking in this direction again because I finally decided to get a new GPU (I'm still using the GTX 660 I got in 2012 lol. RTX 2070 should be here next week...)

I completely forgot that I had that repo...

The code I have on github should be an improvement based on the code from Sourceforge, but it's been so long, I don't remember what I actually improved.

Looking at it now, it doesn't look like SF code has been updated since I forked it in 2013. I know I've been away for a while, but is that really the most recent P-1 GPU code?[/QUOTE]Yes, your fork, and sourceforge's Nov 2013 version before it, seem to be the most current. I've run lots of cases on the sourceforge executables over the past year and tabulated the issues encountered. See post #3 of [URL]https://www.mersenneforum.org/showthread.php?t=23389[/URL]. Feel free to identify any that your fork addresses, and to tackle any others.

kriesel 2018-10-24 13:42

[QUOTE=aaronhaviland;498613]
Looking at it now, it doesn't look like SF code has been updated since I forked it in 2013. I know I've been away for a while, but is that really the most recent P-1 GPU code?[/QUOTE]
Jerry (flashjh) considered including my little code edits posted with [URL]https://www.mersenneforum.org/showpost.php?p=462600&postcount=503[/URL] but as far as I know, nothing came of it. See also [URL]https://www.mersenneforum.org/showpost.php?p=463662&postcount=511[/URL], [URL]https://www.mersenneforum.org/showpost.php?p=490466&postcount=568[/URL], [URL]https://www.mersenneforum.org/showpost.php?p=494224&postcount=575[/URL]

Possibly Cubox might help in some way. [URL]https://www.mersenneforum.org/showpost.php?p=481663&postcount=552[/URL]
I'd like to figure out how to do successful CUDAPm1 Windows builds and get some out there for CUDA levels above 5.5, as well as bug fixes and enhancements, and try out how your fork compares, but am currently occupied with other things. The latest SDK is CUDA 10, so there's a lot of catching up to do.

James Heinrich 2018-10-24 14:43

For what it's worth, [i]preda[/i] is doing interesting things with gpuowl, including some magical combination of PRP+P-1, which appears nearly ready for production.

ET_ 2018-10-24 14:58

[QUOTE=James Heinrich;498664]For what it's worth, [i]preda[/i] is doing interesting things with gpuowl, including some magical combination of PRP+P-1, which appears nearly ready for production.[/QUOTE]

The only problem is that such magics don't happen on CUDA :smile:

kriesel 2018-10-24 15:02

[QUOTE=James Heinrich;498664]For what it's worth, [I]preda[/I] is doing interesting things with gpuowl, including some magical combination of PRP+P-1, which appears nearly ready for production.[/QUOTE]Yes, that work is very interesting, and is described in his recent posts in [URL]https://www.mersenneforum.org/showthread.php?t=22204&page=70[/URL]. Unfortunately Preda has abandoned efforts toward CUDA or OpenCL on NVIDIA and sold his NVIDIA test GPU.
There appears to be no appreciable mersenne searching software development activity for NVIDIA, for either PRP or P-1, with either CUDA or OpenCL. To my knowledge there's no usable PRP NVIDIA code available at all.

kriesel 2018-10-24 18:01

[QUOTE=aaronhaviland;498613]
The code I have on github should be an improvement based on the code from Sourceforge, but it's been so long, I don't remember what I actually improved.
[/QUOTE]
I found reading the commit notes in your fork interesting. "fencepost error" may account for some of the anomalies I've seen in the Sourceforge-version-derived Windows executables.
Please review those notes and carry them forward!

I have a collection of reference material, specific to CUDAPm1 (Sourceforge versions), at [URL]https://www.mersenneforum.org/showthread.php?p=498673#post498673[/URL]
Post 7 is a summary/overview of testing I've done on CUDAPm1 v0.20, mostly the September 2013 cuda 5.5 version, some November 2013 cuda 5.0, on Windows. Posts 8 and 9 are new and contain attachments showing detail, separately per gpu model, for 8 models ranging from 1 to 8 GB of GPU RAM. Total test effort to date was, I think, >1 GPU-year.

For most of that I have been able to submit at least stage 1 results to primenet, and for many stage 2, although some runs failed before printing a stage 1 gcd result, factor or no factor found, and some runs failed at other points. Some were completed by moving to a different gpu. Others can't be completed that way either.

If anyone has a way of converting or moving a pre-gcd stage 1 run from CUDAPm1 to some other software that can perform the gcd check for a factor, please share, either here or by PM. (Or a CUDAPm1 Windows executable or source code that doesn't have that issue...)

These tests indicate that currently, 0 of 8 gpu models evaluated can complete stage 1 and 2 above exponent value ~433,000,000 (maybe as low as ~431M max for the GTX1060 3gb). Prime95 can go higher, but is also capped well below the mersenne.org limit of 10[SUP]9[/SUP] (~595M, except FMA3-capable hardware ~920M).

preda 2018-10-24 19:08

[QUOTE=kriesel;498677]
If anyone has a way of converting or moving a pre-gcd stage 1 run from CUDAPm1 to some other software that can perform the gcd check for a factor, please share, either here or by PM. (Or a CUDAPm1 Windows executable or source code that doesn't have that issue...)

These tests indicate that currently, 0 of 8 gpu models evaluated can complete stage 1 and 2 above exponent value ~433,000,000 (maybe as low as ~431M max for the GTX1060 3gb). Prime95 can go higher, but is also capped well below the mersenne.org limit of 10[SUP]9[/SUP] (~595M, except FMA3-capable hardware ~920M).[/QUOTE]

GpuOwl does the GCD on the CPU, using GMP, and it's pretty small and simple code, see e.g.:
[url]https://github.com/preda/gpuowl/blob/master/GCD.cpp[/url]

More work is probably in transforming the "balanced bits" from the GPU representation into "compact words" for the CPU. (i.e. importing the data GPU-to-CPU). After that, doing the GCD with GMP is easy.
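For illustration, the stage-1-then-GCD shape described here can be sketched with Python's arbitrary-precision integers standing in for GMP (function names are invented; this is neither gpuOwl nor CUDAPm1 code):

```python
# Toy sketch of P-1 stage 1 followed by the CPU-side GCD described
# above. Python's bignums stand in for GMP; illustrative only.
import math

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i in range(n + 1) if sieve[i]]

def pminus1_stage1(p, B1):
    """P-1 stage 1 on the Mersenne number N = 2^p - 1.

    Computes x = 3^E mod N with E the product of all prime powers
    <= B1, then returns gcd(x - 1, N): any prime factor q of N with
    q - 1 B1-smooth divides the gcd.
    """
    N = (1 << p) - 1
    x = 3
    for q in primes_up_to(B1):
        qk = q
        while qk * q <= B1:  # lift q to its largest power <= B1
            qk *= q
        x = pow(x, qk, N)    # modular exponentiation on the "GPU side"
    return math.gcd(x - 1, N)  # the cheap CPU-side gcd

# Cole's 1903 factor of M67 is 193707721, and 193707720 factors as
# 2^3 * 3^3 * 5 * 67 * 2677, so any B1 >= 2677 suffices to find it.
print(pminus1_stage1(67, 2700))
```

The expensive part is the modular exponentiation; the gcd at the end is comparatively cheap, which is why doing it on the CPU with GMP costs so little.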

kriesel 2018-10-24 19:40

[QUOTE=preda;498684]GpuOwl does the GCD on the CPU, using GMP, and it's pretty small and simple code, see e.g.:
[URL]https://github.com/preda/gpuowl/blob/master/GCD.cpp[/URL]

More work is probably in transforming the "balanced bits" from the GPU representation into "compact words" for the CPU. (i.e. importing the data GPU-to-CPU). After that, doing the GCD with GMP is easy.[/QUOTE]
Thanks. Having no idea what gpuOwL's PRP-1 limits are, not having run it at all yet, and myself regarding it as a somewhat different animal than standalone P-1 run capability, I omitted it from the limit description.
I've looked at the code of the various CUDA apps (CUDALucas, mfaktc, CUDAPm1) but have not yet completed a successful build of any CUDA app (unmodified code) on Windows, or spent much time trying.

CUDAPm1 looked to me to be using GMP also. But the available Windows CUDAPm1 executables are linked to an old GMP version (2013 or earlier), and after looking through GMP's revision history of the past few years, I think there might be some issues due to that, not present in gpuOwL, or future builds of CUDAPm1 with a current GMP version for that matter.
I think a fair description of the CUDAPm1 testing I've done is partial factorial black box. I run almost entirely unique exponents on different systems containing different gpu models, with some system cpu models matching but mostly not. Slightly different exponents will sometimes fail on the same gpu model but a different box can run them to completion. (Quadro 2000, ~84.8M exponent for example.) These differences could easily be the effect of some bug triggering on certain operands rather than sensitivity to memory sizes, cpu type, gpu type, OS, etc.

SELROC 2018-10-24 20:19

[QUOTE=preda;498684]GpuOwl does the GCD on the CPU, using GMP, and it's pretty small and simple code, see e.g.:
[URL]https://github.com/preda/gpuowl/blob/master/GCD.cpp[/URL]

More work is probably in transforming the "balanced bits" from the GPU representation into "compact words" for the CPU. (i.e. importing the data GPU-to-CPU). After that, doing the GCD with GMP is easy.[/QUOTE]


libgmp-dev is a separate package that has become a dependency for gpuOwl on Debian and I think also on other flavors of Linux.
If compilation fails, install libgmp-dev.

preda 2018-10-24 20:34

@kriesel, I feel the pain for your testing, such a situation would have driven me mad.

(I understand this thread is about CUDA P-1, but diverting to GpuOwl, it does a normal P-1 first-stage to any limit B1, and within the general exponent limits (PRP) of GpuOwl; the "different beast" starts with the second stage P-1. But there, GpuOwl can do any B2<=Exponent in B2 iterations during the PRP.

So to clarify about GpuOwl:
- first stage is up to any B1, and just as efficient as any P-1 first-stage.
- second-stage is "fancy", and done in parallel with the PRP, but any B2<=Exponent can be covered in the first B2 iterations of the PRP
)

kriesel 2018-10-24 20:51

[QUOTE=SELROC;498690]libgmp-dev is a separate package that has become a dependency for gpuOwl on Debian and I think also on other flavors of Linux.
If compilation fails, install libgmp-dev.[/QUOTE]
...and presumably a whole linux install on the Windows boxes? ;)
This, for Windows, is rather dated (2006); [url]https://cs.nyu.edu/~exact/core/gmp/index.html[/url]

kriesel 2018-10-24 23:17

[QUOTE=preda;498691]@kriesel, I feel the pain for your testing, such a situation would have driven me mad.

(I understand this thread is about CUDA P-1, but diverting to GpuOwl, it does a normal P-1 first-stage to any limit B1, and within the general exponent limits (PRP) of GpuOwl; the "different beast" starts with the second stage P-1. But there, GpuOwl can do any B2<=Exponent in B2 iterations during the PRP.

So to clarify about GpuOwl:
- first stage is up to any B1, and just as efficient as any P-1 first-stage.
- second-stage is "fancy", and done in parallel with the PRP, but any B2<=Exponent can be covered in the first B2 iterations of the PRP[/QUOTE]
Re testing and sanity, it can try one's patience, yes, but fortunately I have a lot of that, and it's renewable within limits.

Re gpuOwL B2, do I understand you correctly that its B2 is limited to no more than the exponent? (Seems reasonable.) If so I'll add that to the available software summary I maintain.

kriesel 2018-10-25 00:29

[QUOTE=SELROC;498690]libgmp-dev is a separate package that has become a dependency for gpuOwl on Debian and I think also on other flavors of Linux.
If compilation fails, install libgmp-dev.[/QUOTE]

All the online references to libgmp-dev I can find are referring to Debian, Ubuntu, etc.
I'm attempting gpuowl builds in Mingw64/msys2 atop Windows 7 X64. (Freshly updated tonight to current, and g++ is v8.2.0)
$ g++ --version
g++.exe (Rev3, Built by MSYS2 project) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

libgmp-10.dll is present in C:\msys64\mingw64\bin and is much older (added to the system in August 2018, file date is Jan 12 2017)

preda 2018-10-25 14:18

[QUOTE=kriesel;498700]
Re gpuOwL B2, do I understand you correctly that its B2 is limited to no more than the exponent? (Seems reasonable.) If so I'll add that to the available software summary I maintain.[/QUOTE]

Yes, B2 can be anything up to exponent.

The user may also enter a larger B2 value than the exponent (that way he asks for more primes to be tested), but the effective B2 in that case will be equal to the exponent.

Let's consider an example:
Exponent = 80'000'001
Testing with:
B1=1000000,B2=80000001;80000001
or, equivalent:
B1=1000000;80000001
(because by default, if not specified, B2==Exponent),
will test in second stage all the primes from 1M to 80000001, and report that B2 in the result.

Now the tricky case, where the entered B2 is larger than exponent:
B1=1000000,B2=160000000;80000001
In this situation, all the primes from 1M to Exponent (80000001) are tested in second stage,
and in addition to that, about 62% of the primes from Exponent to 160M are tested too.
But because "right after" the exponent (i.e. within the first few primes above it) there will be some prime > Exponent that is not tested, the reported B2 will still be == Exponent.

Covering 62% of the primes in the range [Exp, 2*Exp] is still a good thing, but unfortunately it can't be reported within the "B2" framework (which requires that absolutely all primes <= B2 be tested).

PS: not to mention that, in addition to this set of "explicit" primes to be tested, a large number of additional primes (let's say about 3 times as many) are tested too. But these can be very large primes, thus with reduced benefit compared to the "small" primes that are under B2.
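A toy sketch of that reporting rule (function names invented, not gpuOwl code): the reportable B2 is the largest bound below which every prime was covered in stage 2, so extra primes tested past the exponent cannot raise it.

```python
# Sketch of the reported-B2 rule described above: B2 can only be
# reported up to the largest bound below which *every* prime was
# tested. Illustrative only; gpuOwl's internals differ.

def primes_up_to(n):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i in range(n + 1) if sieve[i]]

def reported_b2(tested, limit):
    """Largest B2 <= limit such that every prime <= B2 is in `tested`."""
    b2 = 0
    for q in primes_up_to(limit):
        if q not in tested:
            break        # first gap caps the reportable bound
        b2 = q
    return b2

# Toy analogue of the 80M example: "exponent" 31, stage 2 covers all
# primes up to the exponent plus a scattering of larger ones.
covered = set(primes_up_to(31)) | {41, 53}
print(reported_b2(covered, 62))  # 31: the first untested prime (37) caps it
```

The primes 41 and 53 are covered but cannot be reported, exactly as with the ~62% of primes between the exponent and 160M in the example above.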

aaronhaviland 2018-10-27 04:10

[QUOTE=kriesel;498677]I found reading the commit notes in your fork interesting. "fencepost error" may account for some of the anomalies I've seen in the Sourceforge-version-derived Windows executables.
Please review those notes and carry them forward![/QUOTE]

This was one of the reasons why I implemented a self-test as part of the build process on linux.

[QUOTE]I have a collection of reference material, specific to CUDAPm1 (Sourceforge versions), at [URL]https://www.mersenneforum.org/showthread.php?p=498673#post498673[/URL]
Post 7 is a summary/overview of testing I've done on CUDAPm1 v0.20, mostly September 2013 cuda 5.5 version, some November 2013 cuda 5.0, on Windows. Posts 8 and 9 are new and contain attachments showing detail, separately per gpu model, 8 models, ranging from 1 to 8 gb gpu ram. Total test effort was I think >1 gpu-year to date. [/QUOTE]That's a lot of material to go through, and I apologise, but I'm glossing over it for now. I have revived my github code, and have updated the linux build to CUDA 9.1. Some older hardware (below compute capability 3.0) is no longer supported with the newer versions of CUDA, but it's unlikely there's many of them around these days.

[QUOTE]For most of that I have been able to submit at least stage 1 results to primenet, and for many stage 2, although some runs failed before printing a stage 1 gcd result, factor or no factor found, and some runs failed at other points. Some were completed by moving to a different gpu. Others can't be completed that way either.

If anyone has a way of converting or moving a pre-gcd stage 1 run from CUDAPm1 to some other software that can perform the gcd check for a factor, please share, either here or by PM. (Or a CUDAPm1 Windows executable or source code that doesn't have that issue...)
[/QUOTE]IIRC, there were bugs in the handoff from stage1 to stage2 that I resolved (or at least bludgeoned with a hammer)

[QUOTE]These tests indicate that currently, 0 of 8 gpu models evaluated can complete stage 1 and 2 above exponent value ~433,000,000 (maybe as low as ~431M max for the GTX1060 3gb). Prime95 can go higher, but is also capped well below the mersenne.org limit of 10[SUP]9[/SUP] (~595M, except FMA3-capable hardware ~920M).[/QUOTE]I'm launching a ~511M run tonight with a known factor on my 780. I'll be curious to see how long it runs before dying, or if it finds the factor. I've never tested anything over 70M before.

kriesel 2018-10-27 10:56

[QUOTE=aaronhaviland;498869]This was one of the reasons why I implemented a self-test as part of the build process on linux.

That's a lot of material to go through[/QUOTE]
Definitely. See also the bug and wish list at [URL]http://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]

[QUOTE]I have revived my github code, and have updated the linux build to CUDA 9.1. Some older hardware (below compute capability 3.0) is no longer supported with the newer versions of CUDA, but it's unlikely there's many of them around these days.
[/QUOTE]Understood. Gpus with 2.x or lower are probably a small fraction of the total active hardware population globally, but I have several Quadro 2000s (2.1), 2 Quadro 4000 (2.0), a Quadro 5000 (2.0), and a GTX480 (2.0) running, constituting the majority of my fleet.
[QUOTE]

IIRC, there were bugs in the handoff from stage1 to stage2 that I resolved (or at least bludgeoned with a hammer)[/QUOTE]I've been meaning to attempt a Windows build of your version for a while now.
[QUOTE]I'm launching a ~511M run tonight with a known factor on my 780. I'll be curious to see how long it runs before dying, or if it finds the factor. I've never tested anything over 70M before.[/QUOTE]That's likely to take some days to get through stage 1. Please post how it turns out. If it requires some new fixes to complete, or you make any additional fixes or improvements, please refresh your github repository.

aaronhaviland 2018-10-28 00:40

[QUOTE=kriesel;498880]That's likely to take some days to get through stage 1. Please post how it turns out. If it requires some new fixes to complete, or you make any additional fixes or improvements, please refresh your github repository.[/QUOTE]Since I knew the minimum B1/B2 values needed to find the known factor, stage 1 didn't take long at all; however, stage 2 was completely borked. (The whole run was about 9 hours.)
I'm going to run on a "normal" size exponent just to validate it still works at that level properly before I go any further.

I did push a few minor commits last night, but the only code change so far was a fix for an issue writing/saving the cufft fft/threads benchmark files. I plan to make those benchmarks happen automatically if the files don't already exist. (Using saved values for an extinct-ish card is stupid, and the time savings of running an optimal fft size is worth the cost of running the benchmark.)

Honestly, the first thing I really want to do, once I validate it works again, is a major refactor of the code, just because I feel like it's a huge jumble of blah every time I look through it. (No offense meant to the original authors; great work getting it that far.)

aaronhaviland 2018-10-28 16:44

[QUOTE=kriesel;498880]Definitely. See also the bug and wish list at [URL]http://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL][/QUOTE]

I'm curious what code changes you've made, vs what code changes I've made. I know I made some prior to my first github import, and I wasn't tracking them at that time.

kriesel 2018-11-05 20:15

CUDAPm1 v0.20 bug and wish list updated
 
See the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]

kriesel 2018-11-06 01:19

4 Attachment(s)
[QUOTE=aaronhaviland;498973]I'm curious what code changes you've made, vs what code changes I've made. I know I made some prior to my first github import, and I wasn't tracking them at that time.[/QUOTE]
Hi, sorry for the delay responding.
What's your build OS? Still Ubuntu, and which version? I'm aiming for Win7 x64.

My cudapm1 draft changes have not made it into executable form, or onto sourceforge or github yet. I invite you to fold them into your current efforts. I've been delayed in working on getting a proper build environment for CUDAPm1 on Windows. At this point I would begin by trying to compile simpler cuda code first, then an existing set of cudapm1 code, without my changes, to prove out a build environment, before merging my changes. That cudapm1 code could just as well be your latest version at that point.

My draft changes are of multiple types, none of which provide speed improvements or other core algorithm changes.

1) Misc minor edits for housekeeping (see the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=462600&postcount=503[/URL] and change note #8 in an attachment at [URL]https://www.mersenneforum.org/showpost.php?p=463662&postcount=511[/URL]).

2) Addition of output options and date/time stamps (see the test/demo program attached to this post)
See attached additions.7z
Change existing printf and fprintf calls to dprintf and dfprintf respectively, to incorporate logging control throughout the program. Extending ini file reading and command line parsing to accommodate it has not been written yet. Adding date/time stamps to iteration or transform output lines, and at transition times such as the start and end of gcd computations, has also not been written. Output=4 would be useful for benchmarking or testing.

3) Sanity checking of fft and threads benchmarking (modified, untested, in fact the code fragment draft is still a comment inline in old code) See attached modified cudapm1.cu (derived from the sourceforge v0.20 version)

4) Incomplete rewrite of readme.txt (copy in current state attached. End users, use with extreme caution or not at all.) See attached readme-cudapm1-rewrite.txt

5) Editing of cudapm1.ini (other than the fragment re logging below)
see attached cudapm1.ini

readme.txt fragment re logging via dprintf etc of additions.7z
[CODE]
Output control is available from a command line option -o, or ini file directive output
-o 0 prints stdout content to both console and log file. (dual)
-o 1 suppresses logging screen output to file, does output to screen (default; traditional)
-o 2 suppresses screen output, logs stdout to log file (log only)
-o 3 suppresses both logging to file and screen output to stdout (silent mode)
-o 4 prints stdout and stderr content to both console and log file. (dual stdout and stderr)
-o 5 stdout to console, stderr to console and log
-o 6 stdout to log file, stderr to console and log
-o 7 stdout suppressed, stderr to console and log
Output to stderr, addition of results to results file, consuming of worktodo file, and save to save files, thread files, or fft files occur regardless of this output flag.

stdout stdout2log stderr stderr2log
0 y y y n
1 y n y n
2 n y y n
3 n n y n
4 y y y y
5 y n y y
6 n y y y
7 n n y y
[/CODE]ini file fragment re logging via dprintf etc of additions.7z
[CODE]
# Output control is available from a command line option -o, or ini file directive output
# output=0 prints stdout content to both console and log file. (dual)
# output=1 suppresses logging screen output to file, does output to screen (default; traditional)
# output=2 suppresses screen output, logs stdout to log file (log only)
# output=3 suppresses both logging to file and screen output to stdout (silent mode)
# output=4 prints stdout and stderr content to both console and log file. (dual stdout and stderr)
# output=5 stdout to console, stderr to console and log
# output=6 stdout to log file, stderr to console and log
# output=7 stdout suppressed, stderr to console and log
# Output to stderr, addition of results to results file, consuming of worktodo file, and save to
# save files, thread files, or fft files occur regardless of this output flag.
#
# stdout stdout2log stderr stderr2log
# 0 y y y n
# 1 y n y n
# 2 n y y n
# 3 n n y n
# 4 y y y y
# 5 y n y y
# 6 n y y y
# 7 n n y y

output=1
[/CODE]Sample console output of dprintf etc test/demo program "Additions"
[CODE]Additions.c ver 8/31/2017

Opened for append testlogfile.txt
B The system time is: 16:41:33.655 UTC
at: Mon 2018-11-05 16:41:33.655 UTC
Starting at Local time 2018-11-05 10:41:33.656, UTC 2018-11-05 16:41:33.656.


flag 0 follows
Flag 0=0 expected should print stdout to both log file and screen.
Stderr should be unaffected by flag=0 called with 0

flag 1 follows
Flag 1=1 expected should print stdout to screen but not log file.
Stderr should be unaffected by flag=1 called with 1

flag 2 follows
Stderr should be unaffected by flag=2 called with 2

flag 3 follows
Stderr should be unaffected by flag=3 called with 3

flag 4 follows
Flag 4=4 expected should print stdout to both log file and screen.
Stderr should be duplicated by flag=4 called with 4

flag 5 follows
Flag 5=5 expected should print stdout to screen but not log file.
Stderr should be duplicated by flag=5 called with 5

flag 6 follows
Stderr should be duplicated by flag=6 called with 6

flag 7 follows
Stderr should be duplicated by flag=7 called with 7

flag 8 follows

Warning--output flag value=8 is outside expected bounds of 0-7 on entry to dprintf.
Flag 8=8 expected should print stdout to both log file and screen and warn about flag
.8

Warning--output flag value=8 outside expected bounds of 0-7 on entry to dfprintf.
Stderr should be duplicated by flag=8 called with 8

flag -47 follows

Warning--output flag value=-47 is outside expected bounds of 0-7 on entry to dprintf.

Flag -47 should print stdout to both log file and screen and warn about flag.-47

Warning--output flag value=-47 outside expected bounds of 0-7 on entry to dfprintf.
Stderr should be unaffected by flag=-47 called with -47
a=398.000000, b=1.000000 final b=inf

Exiting at Local time 2018-11-05 10:41:33.671, UTC 2018-11-05 16:41:33.671. Elapsed time of the run, 0.016 seconds


End program at: Mon 2018-11-05 16:41:33.671
[/CODE]Sample log file content of dprintf etc test/demo program "Additions"
[CODE]Starting at Local time 2018-11-05 10:41:33.656, UTC 2018-11-05 16:41:33.656.


flag 0 follows
Flag 0=0 expected should print stdout to both log file and screen.
Flag 2=2 expected should print stdout to log file but not screen.
Flag 4=4 expected should print stdout to both log file and screen.
Stderr should be duplicated by flag=4 called with 4
Stderr should be duplicated by flag=5 called with 5
Flag 6=6 expected should print stdout to log file but not screen.
Stderr should be duplicated by flag=6 called with 6
Stderr should be duplicated by flag=7 called with 7
Flag 8=8 expected should print stdout to both log file and screen and warn about flag.8
Stderr should be duplicated by flag=8 called with 8
Flag -47 should print stdout to both log file and screen and warn about flag.-47

Exiting at Local time 2018-11-05 10:41:33.671, UTC 2018-11-05 16:41:33.671. Elapsed time of the run, 0.016 seconds
[/CODE]cudalucas/cudapm1 option flag list, alphabetized[CODE]

-b proposed bios version confirmation of specific device
-c n checkpoint
-cufftbench create fft file
-d n device number (zero based)

-f n fftlength

-h help and exit
-i filename ini file
-info
-k keyboard input enabled

-m proposed model name confirmation of specific device
-memtest

-o proposed output control flag

-p proposed pci slot id string confirmation of specific device
-polite n

-r n run short or long selftest
-s <folder> save checkpoints
-threadbench create threads file
-threads

-u proposed userid and optional systemid-gpuid string to prepend to results lines
-v version and exit
-w proposed estimate work durations and schedule
-x n screen report interval
[/CODE]
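For reference, the screen/log dispatch described in the flag tables further above can be sketched in C roughly as follows. This is only an illustration of the documented flag semantics, not the actual dprintf from the patch; the function and struct names here are made up.

```c
#include <stdarg.h>
#include <stdio.h>

/* Illustrative decode of the -o / output= flag per the tables above:
   bit 0 suppresses stdout-to-log, bit 1 suppresses stdout-to-screen,
   bit 2 enables stderr-to-log. stderr always goes to the console. */
typedef struct {
    int stdout_screen;   /* print stdout content to console? */
    int stdout_log;      /* copy stdout content to log file? */
    int stderr_log;      /* copy stderr content to log file? */
} out_mode;

static out_mode decode_output_flag(int flag)
{
    out_mode m;
    if (flag < 0 || flag > 7)
        flag = 0;  /* out-of-range values warn and fall back to dual output */
    m.stdout_screen = !(flag & 2);
    m.stdout_log    = !(flag & 1);
    m.stderr_log    = (flag & 4) != 0;
    return m;
}

/* A dprintf-like wrapper built on the decode above (hypothetical). */
static void dprintf_sketch(int flag, FILE *logf, const char *fmt, ...)
{
    out_mode m = decode_output_flag(flag);
    va_list ap;
    if (m.stdout_screen) {
        va_start(ap, fmt);
        vfprintf(stdout, fmt, ap);
        va_end(ap);
    }
    if (m.stdout_log && logf) {
        va_start(ap, fmt);
        vfprintf(logf, fmt, ap);
        va_end(ap);
    }
}
```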

aaronhaviland 2018-11-12 02:40

[QUOTE=kriesel;499685]Hi, sorry for the delay responding.
What's your build OS, still Ubuntu; version #? I'm aiming for Win7 x64.
[/QUOTE]Ubuntu currently for this project, but I have others in Visual Studio. (I'm not exactly a fan of frontends, and prefer console compilations myself)
I do plan to build on both platforms in the future, but I prefer to code in *nix.

[QUOTE]1) Misc minor edits for housekeeping (see the attachment at [URL="https://www.mersenneforum.org/showpost.php?"]https://www.mersenneforum.org/showpost.php?p=462600&postcount=503[/URL] and see change note #8 in an attachment at [URL="https://www.mersenneforum.org/showpost.php?"]https://www.mersenneforum.org/showpost.php?p=463662&postcount=511[/URL])[/QUOTE] - Done, see commits b2d11b1 through d5c7a6f

[QUOTE]2) Addition of output options and date/time stamps (see the test/demo program attached to this post)[/QUOTE] - Passing on this one for now. Put in TODO. May re-visit later

[QUOTE]3) Sanity checking of fft and threads benchmarking[/QUOTE] - Trying to understand this. I'm guessing there are some combinations of cards/threads where the FFT just bails out and returns quickly, and this is an attempt to catch it?

[QUOTE]4) Incomplete rewrite of readme.txt[/QUOTE] - Tabled for now, put in TODO.

[QUOTE]5) Editing of cudapm1.ini (other than the fragment re logging below)[/QUOTE] - Done, See commit 0b4f2c2

kriesel 2018-11-12 06:27

[QUOTE=aaronhaviland;500127]
(3) - Trying to understand this. I'm guessing there are some combinations of cards/threads where the FFT just bails out and returns quickly, and this is an attempt to catch it?
[/QUOTE]
Yes. See for example [URL]https://www.mersenneforum.org/showpost.php?p=456324&postcount=2591[/URL], where 1024 squaring threads is bad and gives timings half of what other thread counts do, in CUDALucas. There are also cases where 32 threads is bad, on compute capability 2.0 I think. CUDAPm1 issue #16.

There are also cases where certain fft lengths give bad results. As I recall these were found for old CUDA levels. See also [URL]https://www.mersenneforum.org/showpost.php?p=463280&postcount=2608[/URL] for the fft benchmark analogous issue.

See also the bad-residues cases, at least some of which are related to the threads issues. The CUDALucas issues 2 to 5 in its bug and wish list are worth examining.

The too-early returns for some thread counts or fft lengths trash the thread or fft benchmarking respectively.

CUDALucas was modified to trap for a select few bad-residue cases: 0x02, 0x00, and 0xfffffffffffffffd. CUDALucas v2.06beta traps for its known bad residues. Since CUDAPm1 was derived from CUDALucas years before, it has some of the same issues as well as some of its own. CUDAPm1's list of bad residues is longer.
[CODE]%badresidues=(
'cllucas', '0x0000000000000002, 0xffffffff80000000',
'cudalucas', '0x0000000000000000, 0x0000000000000002, 0xfffffffffffffffd',
'cudapm1', '0x0000000000000000, 0x0000000000000001, 0xfff7fffbfffdfffe, 0xfff7fffbfffdffff, 0xfff7fffbfffffffe, 0xfff7fffbffffffff, 0xfff7fffffffdfffe, 0xfff7fffffffdffff, 0xfff7fffffffffffe, 0xfff7ffffffffffff, 0xfffffffbfffdfffe, 0xfffffffbfffdffff, 0xfffffffbfffffffe, 0xfffffffbffffffff, 0xfffffffffffdfffe, 0xfffffffffffdffff, 0xfffffffffffffffe, 0xffffffffffffffff',
'gpuowl', '0x0000000000000000',
'mfaktc', '',
'mfakto', ''
); #fff* added to cudapm1 list 7/19/18[/CODE]
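As an aside, a check of an interim residue against such a list could look like the following C sketch; the standalone function and its name are mine for illustration, not CUDAPm1's actual code.

```c
#include <stdint.h>
#include <stddef.h>

/* Known-bad CUDAPm1 interim residues, from the list above. */
static const uint64_t cudapm1_bad[] = {
    0x0000000000000000ULL, 0x0000000000000001ULL,
    0xfff7fffbfffdfffeULL, 0xfff7fffbfffdffffULL,
    0xfff7fffbfffffffeULL, 0xfff7fffbffffffffULL,
    0xfff7fffffffdfffeULL, 0xfff7fffffffdffffULL,
    0xfff7fffffffffffeULL, 0xfff7ffffffffffffULL,
    0xfffffffbfffdfffeULL, 0xfffffffbfffdffffULL,
    0xfffffffbfffffffeULL, 0xfffffffbffffffffULL,
    0xfffffffffffdfffeULL, 0xfffffffffffdffffULL,
    0xfffffffffffffffeULL, 0xffffffffffffffffULL,
};

/* Return 1 if res matches a known-bad residue, else 0. */
static int is_known_bad_residue(uint64_t res)
{
    size_t i;
    for (i = 0; i < sizeof cudapm1_bad / sizeof cudapm1_bad[0]; i++)
        if (res == cudapm1_bad[i])
            return 1;
    return 0;
}
```

Note that the sixteen fff* entries are exactly the values for which (res | 0x0008000400020001) equals all ones, so the loop could be replaced by that single mask test plus the 0 and 1 cases.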

tServo 2018-11-12 16:30

[QUOTE=kriesel;498686]

CUDAPm1 looked to me to be using GMP also. But the available Windows CUDAPm1 executables are linked to an old GMP version (2013 or earlier), and after looking through GMP's revision history of the past few years, I think there might be some issues due to that, not present in gpuOwL, or future builds of CUDAPm1 with a current GMP version for that matter.
.[/QUOTE]
kriesel,
I suggest you avoid GMP and use MPIR instead. It's a rewrite of GMP for Windows, designed to be compiled by Visual Studio and, I believe, yasm.
It's not trivial to install (each person must compile it for themselves), but it avoids all the GMP headaches. However, each version is then optimized for THAT machine.
One of its authors, Brian Gladman, posts here occasionally.
Be sure to use the 'generate GMP headers' option, which is specifically for porting code from GMP to Windows.
I have used MPIR extensively, but have never ported anything from GMP.
It is at mpir.org.

tServo 2018-11-12 16:39

[QUOTE=kriesel;498319]Unfortunately, the Oct deadline has passed without a sufficient number of neighbors signing up, so the schedule for fiber install for my neighborhood has been delayed by nominally 6 months, to next June. And the rate of signup (I'm tracking via their website) looks ominously slow for even that delayed schedule. If/when it happens, they're offering 300, 400, and 1000Mbps.[/QUOTE]

kriesel,
What company is promising all this?
The reason I ask is that here in Champaign-Urbana, we have had 2 different attempts
to provide everybody with fiber, exactly as you described.
The most recent also had a web page where you could see how many of your neighbors had signed up, blah blah blah.
They also kept delaying it, saying not enough had signed up for it, etc etc etc.

They finally crashed and burned; the whole thing looked like a scam.
Bad feelings all around.
Some government-started corporation was handed the contract to actually do it.
I don't know their status. I will look into it.
Meanwhile, evil Comcast still has my business.

kriesel 2018-11-12 18:07

[QUOTE=tServo;500158]kriesel,
What company is promising all this?
The reason I ask is that here in Champaign-Urbana, we have had 2 different attempts
to provide everybody with fiber, exactly as you described.
The most recent had also had a web page where you could see how many of your neighbors have signed up, blah blah blah.
They also kept delaying it saying not enough have signed up for it, etc etc etc.

They finally crashed and burned; the whole thing looking like a scam
Bad feelings all around.
Some government started corporation was handed the contract to actually do it.
I don't know their status. I will look into it.
Meanwhile, evil Comcast still has my business.[/QUOTE]
Was your experience part of this? [URL]http://www.wandtv.com/story/29018456/fiber-optic-internet-installation-underway-in-champaign-urbana[/URL]

Here, it's TDS Fiber, part of the same company as US Cellular. [URL]https://en.wikipedia.org/wiki/Telephone_and_Data_Systems[/URL]
Same MO: maps of various zones in the area with deadlines and rollout date projections, and the number of households still needed per zone. My zone has been stuck at the same number needed for months. A regression fit indicates > 2 years to reach the required number, and that extrapolation continues to slide later by the day.

The company they contracted with to do utility marking is USIC, the same company that, working for Verizon, mismarked a Sun Prairie, WI gas line by 25 feet (7.6m, the width of a house; in this case the sidewalk instead of the middle of the street). That led to explosions at multiple addresses, a multi-structure fire in the downtown, the death of a firefighter and injuries to several others, and the evacuation of several dozen structures including homes. Several structures on both sides of the street were classified destroyed, and the street was shut down for days. It was a 4" gas line that got hit by another contractor because of the badly misplaced marking. I guess I should be glad it was only the telco cable that got cut in my neighborhood when they were passing cable through here to hook up a more distant neighborhood, not a gas or sewer line. [URL]https://www.jsonline.com/story/news/politics/2018/07/11/sun-prairie-firefighter-killed-explosion-destroyed-his-own-bar/774919002/[/URL]
[URL]https://chicagoareafire.com/blog/tag/sun-prairie-fire-department/[/URL]

kriesel 2018-11-12 18:10

[QUOTE=tServo;500156]krisel,
I suggest you avoid GMP and use MPIR instead. It's a rewrite of GMP for windows, designed to be compiled by Visual Studio and, I believe, yasm.
It's not trivial to install ( each person must compile it for themselves ), but it avoids all the GMP headaches. However, each version is then optimized for THAT machine.
One of its authors, Brian Gladman, posts here occasionally.
Be sure to use the 'generate GMP headers' option, which is specifically for porting code from GMP to windows.
I have used MPIR extensively, but have never ported anything from GMP.
It is at mpir.org[/QUOTE]
What makes an MPIR install specific to a machine? Optimization for cpu type? Can it be statically linked in and copied to another similar machine without doing the build fresh on each box? It would get tedious installing the build environment on every system I'd want to run the end product on.

tServo 2018-11-12 20:11

[QUOTE=kriesel;500160]Was your experience part of this? [URL]http://www.wandtv.com/story/29018456/fiber-optic-internet-installation-underway-in-champaign-urbana[/URL]

[/QUOTE]

Yes, that's us. In fact, I know Jenny referenced in the article. She never got her fiber.

The nonprofit that was formed from the ashes of that failure looks like it's a failure, too.
In fact, its website (uc2b.net) looks just like the previous failures: i.e., it has a map, sign-up form, etc. But if you look at the reports, meeting notes, & blog entries, they are all 2 to 4 years old. In other words, it's dead.

The bottom line here is that these things fail at an alarming rate. Buyer beware!

The technology I've seen that might be a game changer is 5G cell.

tServo 2018-11-12 20:16

[QUOTE=kriesel;500161]What makes an MPIR install specific to a machine? Optimization for cpu type? Can it be statically linked in and copied to another similar machine without doing the build fresh on each box? It would get tedious installing the build environment on every system I'd want to run the end product on.[/QUOTE]

Yes, it's quite flexible. If you go to mpir.org and look at section 2.1 of the documentation, "Install Options", it gives that info. One can just use C and avoid assembler for the most generic binary, one can make a build for one's oldest architecture, cross-platform builds can be done, and a fat binary that contains several code paths can be built.

aaronhaviland 2018-11-13 02:40

I was actually looking at MPIR yesterday, myself. As I understand it, it's just a fork of GMP mostly because the GMP authors refuse to have anything to do with Windows.

One of the supported build methods for MPIR can make generic non-hardware specific libraries, which is what I'm planning to do.

Also, I believe that, since most of the heavy lifting is done in CUDA rather than on the host, optimised host code is less important, and we should have no problem linking in said generic build.

kriesel 2018-11-13 03:39

[QUOTE=aaronhaviland;500182]I was actually looking at MPIR yesterday, myself. As I understand it, it's just a fork of GMP mostly because the GMP authors refuse to have anything to do with Windows.

One of the supported build methods for MPIR can make generic non-hardware specific libraries, which is what I'm planning to do.

Also, I believe that, since most of the heavy lifting is done in CUDA rather than on the host, the need for optimised host code is less important, and we should have no problem linking in said generic build.[/QUOTE]
As I understand it, gmp is used for the gcd on one host cpu core, while the gpu idles. That is typically a small fraction of total run time, but not trivial. I've seen it take the better part of an hour on higher exponents, per gcd.

Any chance you'll make a Windows executable available?
I estimate I'll finish my current limits testing of CUDAPm1 v0.20 in a week or two.

preda 2018-11-13 06:54

[QUOTE=kriesel;500185]As I understand it, gmp is used for the gcd on one host cpu core, while the gpu idles. That is typically a small fraction of total run time, but not trivial. I've seen it take the better part of an hour on higher exponents, per gcd.

Any chance you'll make a Windows executable available?
I estimate I'll finish my current limits testing of CUDAPm1 v0.20 in a week or two.[/QUOTE]

Just to add my GCD timing on an i7-7820X, with GMP using one CPU core: for a 332M exponent the GCD takes 282s, while for an 89M exponent it takes 56s.

While "waiting" for the GCD, the GPU doesn't have to be idle. It could continue, "optimistically" assuming that the GCD found no factor (and stop if the GCD turns out positive).
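A rough sketch of that overlap idea follows, with a toy 64-bit Euclid standing in for GMP's multi-million-digit mpz_gcd and a plain pthread for the worker; all names here are illustrative, not CUDAPm1 code.

```c
#include <pthread.h>
#include <stdint.h>

/* Toy stand-in for the stage GCD: Euclid on 64-bit values instead of
   GMP's mpz_gcd on a multi-million-digit number. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

struct gcd_job {
    uint64_t a, b;       /* inputs */
    uint64_t result;     /* gcd, filled in by the worker thread */
};

static void *gcd_worker(void *arg)
{
    struct gcd_job *job = (struct gcd_job *)arg;
    job->result = gcd_u64(job->a, job->b);
    return NULL;
}

/* Launch the GCD on a host thread, optimistically keep iterating on the
   GPU in the meantime, then join and check whether a factor turned up. */
static uint64_t overlapped_gcd(uint64_t a, uint64_t b)
{
    pthread_t tid;
    struct gcd_job job = { a, b, 0 };
    pthread_create(&tid, NULL, gcd_worker, &job);
    /* ... the main thread would continue GPU squarings here ... */
    pthread_join(tid, NULL);
    return job.result;   /* > 1 means stop and discard the optimistic work */
}
```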

aaronhaviland 2018-11-13 11:17

[QUOTE=kriesel;500185]As I understand it, gmp is used for the gcd on one host cpu core, while the gpu idles. That is typically a small fraction of total run time, but not trivial. I've seen it take the better part of an hour on higher exponents, per gcd.

Any chance you'll make a Windows executable available?
I estimate I'll finish my current limits testing of CUDAPm1 v0.20 in a week or two.[/QUOTE]
I agree it's not trivial, but it's still minor in proportion to the time spent in GPU-land. I don't know if this could be moved to the GPU as well, but it's on my list. From a quick scan, it looks like the GMP code is too widely used on the host side, not just during the GCD, so it may not be that simple.

And yes, once I figure out the whole GMP/MPIR thing. (This is the whole reason I'm looking at MPIR in the first place).

aaronhaviland 2018-11-13 11:29

[QUOTE=preda;500193]While "waiting" for the GCD, the GPU doesn't have to be idle. It could continue, "optimistically" assuming that the GCD found no factor (and stop if the GCD turns out positive).[/QUOTE]
Seems like a good idea, but may be more difficult in practice. It's not a multi-threaded program currently. Still worth investigating if we need to stay in CPU-land too long.

kriesel 2018-11-14 19:18

[QUOTE=aaronhaviland;500202]Seems like a good idea, but may be more difficult in practice. It's not a multi-threaded program currently. Still worth investigating if we need to stay in CPU-land too long.[/QUOTE]
Not sure if you're aware, but Preda has implemented some multithreading in gpuowl. Given the odds of factoring success in a stage, it's about 98% likely to be useful.
Other approaches I've thought of are having a second app running tf alongside (TF has a very small gpu ram footprint), or using a foreground/background tasks approach in CUDAPm1. Foreground is the first worktodo entry, background is the next. That's definitely more complicated and has memory requirement implications. Task switch times need to be short to be useful. Another possibility might be to use multiple cpu cores when available to speed the gcd.

It's more than the gcds that are causing gpu stalls. I think the disk writes of checkpoint save files are also a cause, more noticeable with larger exponents.

I've seen in recent testing that sometimes CUDAPm1 significantly underutilized gpu memory in stage 2. Not sure what that's about, or if it's still present in your modified version.

aaronhaviland 2018-11-14 23:58

1 Attachment(s)
Success compiling with MPIR.

64-bit binary attached
Requires CUDA 10, and a GPU with Compute Capability >= 3.5. Unsure of other requirements; I'm not too familiar with Windows dependencies.

[CODE]Microsoft Windows [Version 10.0.17134.407]
C:\Users\Aaron\Documents\Visual Studio 2017\Projects\CUDAPm1\x64\Release>CUDAPm1.exe 7990427 -b1 986 -b2 124000
CUDAPm1 v0.21
Assuming exponent is trial factored to 63 bits
------- DEVICE 0 -------
name GeForce RTX 2070
Compatibility 7.5
clockRate (MHz) 1710
memClockRate (MHz) 7001
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 4194304
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1024
multiProcessorCount 36
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

No GeForceRTX2070_fft.txt file found. Using default fft lengths.
For optimal fft selection, please run
./CUDAPm1 -cufftbench 1 8192 r
for some small r, 0 < r < 6 e.g.
CUDA reports 6723M of 8192M GPU memory free.
No GeForceRTX2070_threads.txt file found. Running benchmark.
CUDA bench, testing various thread sizes for fft 448K, doing 15 passes.
fft size = 448K, square time = 0.0436 msec, threads 32
fft size = 448K, square time = 0.0449 msec, threads 64
fft size = 448K, square time = 0.0336 msec, threads 128
fft size = 448K, square time = 0.0335 msec, threads 256
fft size = 448K, square time = 0.0356 msec, threads 512
fft size = 448K, square time = 0.0438 msec, threads 1024

Best square time for fft = 448K, time: 0.0335, t = 256

fft size = 448K, ave time = 0.0408 msec, Norm1 threads 32, Norm2 threads 32
fft size = 448K, ave time = 0.0407 msec, Norm1 threads 32, Norm2 threads 64
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 32, Norm2 threads 128
fft size = 448K, ave time = 0.0412 msec, Norm1 threads 32, Norm2 threads 256
fft size = 448K, ave time = 0.0419 msec, Norm1 threads 32, Norm2 threads 512
fft size = 448K, ave time = 0.0433 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 448K, ave time = 0.0402 msec, Norm1 threads 64, Norm2 threads 32
fft size = 448K, ave time = 0.0402 msec, Norm1 threads 64, Norm2 threads 64
fft size = 448K, ave time = 0.0405 msec, Norm1 threads 64, Norm2 threads 128
fft size = 448K, ave time = 0.0406 msec, Norm1 threads 64, Norm2 threads 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 64, Norm2 threads 512
fft size = 448K, ave time = 0.0428 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 128, Norm2 threads 32
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 128, Norm2 threads 64
fft size = 448K, ave time = 0.0397 msec, Norm1 threads 128, Norm2 threads 128
fft size = 448K, ave time = 0.0400 msec, Norm1 threads 128, Norm2 threads 256
fft size = 448K, ave time = 0.0411 msec, Norm1 threads 128, Norm2 threads 512
fft size = 448K, ave time = 0.0423 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 448K, ave time = 0.0401 msec, Norm1 threads 256, Norm2 threads 32
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 256, Norm2 threads 64
fft size = 448K, ave time = 0.0395 msec, Norm1 threads 256, Norm2 threads 128
fft size = 448K, ave time = 0.0403 msec, Norm1 threads 256, Norm2 threads 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 256, Norm2 threads 512
fft size = 448K, ave time = 0.0423 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 448K, ave time = 0.0417 msec, Norm1 threads 512, Norm2 threads 32
fft size = 448K, ave time = 0.0416 msec, Norm1 threads 512, Norm2 threads 64
fft size = 448K, ave time = 0.0417 msec, Norm1 threads 512, Norm2 threads 128
fft size = 448K, ave time = 0.0424 msec, Norm1 threads 512, Norm2 threads 256
fft size = 448K, ave time = 0.0428 msec, Norm1 threads 512, Norm2 threads 512
fft size = 448K, ave time = 0.0425 msec, Norm1 threads 512, Norm2 threads 1024

Best time for fft = 448K, time: 0.0394, t1 = 128, t2 = 256, t3 = 64
Using threads: norm1 256, mult 128, norm2 128.
Using up to 4119M GPU memory.
Starting stage 1 P-1, M7990427, B1 = 986, B2 = 124000, fft length = 448K
Doing 1452 iterations
M7990427, 0x32318b15f9d83ab6, n = 448K, CUDAPm1 v0.21
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M7990427 Stage 1 found no factor (P-1, B1=986, B2=124000, e=0, n=448K CUDAPm1 v0.21)
Starting stage 2.
Using b1 = 986, b2 = 124000, d = 420, e = 4, nrp = 96
Zeros: 4430, Ones: 8530, Pairs: 2981
Processing 1 - 96 of 96 relative primes.
Initializing pass... done. transforms: 1987, err = 0.02539, (0.71 real, 0.3550 ms/tran, ETA NA)
Transforms: 9204 M7990427, 0x456fdf3be182449c, n = 448K, CUDAPm1 v0.21 err = 0.02734 (0:03 real, 0.2873 ms/tran, ETA 0:02)
Transforms: 8928 M7990427, 0x2acd8bf807caa816, n = 448K, CUDAPm1 v0.21 err = 0.02734 (0:02 real, 0.2912 ms/tran, ETA 0:00)

Stage 2 complete, 20119 transforms, estimated total time = 0:05
Starting stage 2 gcd.
M7990427 has a factor: 10509037975912491881 (P-1, B1=986, B2=124000, e=4, n=448K CUDAPm1 v0.21)


C:\Users\Aaron\Documents\Visual Studio 2017\Projects\CUDAPm1\x64\Release>[/CODE]

kriesel 2018-11-15 01:56

[QUOTE=aaronhaviland;500255]Success compiling with MPIR.

64-bit binary attached
Requires CUDA 10, and a GPU with Compute Capability >= 3.5. Unsure of other requirements, I'm not too familiar with Windows dependencies.

[CODE]Microsoft Windows [Version 10.0.17134.407]
...[/CODE][/QUOTE]
Congrats. Thanks for sharing the exe. Which commit is this, 1165353?

Any chance of a CUDA 8 build?
Or share the process of setting up a Windows build environment perhaps?

I may give your exe a spin on Win7, but would rather not have to upgrade all gpu systems' drivers to CUDA10 capability until I know it works on Win7 and the necessary driver version doesn't impact throughput. Speed and limit testing of v0.20 will be completed first. Plus many of my gpus are CC 2.x.

Is there any reason to believe this version will handle big exponents like the 511M you reported issues with earlier?

aaronhaviland 2018-11-15 02:12

[QUOTE=kriesel;500258]Congrats. Thanks for sharing the exe. Which commit is this, 1165353?

Any chance of a CUDA 8 build?
Or share the process of setting up a Windows build environment perhaps?

I may give your exe a spin on Win7, but would rather not have to upgrade all gpu systems' drivers to CUDA10 capability until I know it works on Win7 and the necessary driver version doesn't impact throughput. Speed and limit testing of v0.20 will be completed first. Plus many of my gpus are CC 2.x.

Is there any reason to believe this version will handle big exponents like the 511M you reported issues with earlier?[/QUOTE]
It's commit b456ecbffc908927ccb37d0240f66af6ef2e4bb


I can try to set up a Win7/CUDA 8 VM build environment, but I make no promises.


There's been no functional code change yet that would improve the ability to process higher exponents. So far it's mostly just been housekeeping.

VictordeHolland 2018-11-15 14:36

Thanks!
I'll try it on my GTX1080ti when I've got some time.

kriesel 2018-11-15 15:13

[QUOTE=kriesel;500245]
I've seen in recent testing that sometimes CUDAPm1 significantly underutilized gpu memory in stage 2. Not sure what that's about, or if it's still present in your modified version.[/QUOTE]The above was based on GPU-Z's indication of memory usage. I think now the issue is with GPU-Z.

[CODE]During a CUDAPm1 v0.20 run on a 300M exponent, stage 2,
nvidia-smi reports on a GTX1080Ti,
FB Memory Usage
Total : 11264 MiB
Used : 4967 MiB
Free : 6297 MiB
Utilization
Gpu : 99 %
Memory : 74 %
Encoder : 0 %
Decoder : 0 %

while GPU-Z 2.14.0 reports for the same gpu at the same time,
memory usage (dedicated) 750MB
memory usage (dynamic) 43MB

total is 793MB

4967-4096=871
871-793=78[/CODE]It's not clear whether GPU-Z's numbers are decimal MB or MiB.
But there seems to at least sometimes be a large discrepancy, more than 2^32 bytes,
between what nvidia-smi reports and what GPU-Z reports as memory used, for large-memory gpus. Or maybe what GPU-Z reports are small subsets of the total used. But as I recall it seemed to be a good indicator on a 4GB or smaller memory gpu.

HWMonitor indicates yet different usage figures and terms:
Memory 25%
Frame buffer 75%

aaronhaviland 2018-11-15 21:57

I seem to recall making some modifications to the memory allocations prior to my first git commit, but I cannot recall what they are.

We have to remember that it checks the available RAM before stage 1, as part of the bounds calculations:
[CODE]CUDA reports 7473M of 7949M GPU memory free.
Using threads: norm1 256, mult 128, norm2 64.
Using up to 7350M GPU memory.
Selected B1=660000, B2=14520000, 4.02% chance of finding a factor
Starting stage 1 P-1, M58039669, B1 = 660000, B2 = 14520000, fft length = 3200K
...

Starting stage 2.
Using b1 = 660000, b2 = 14520000, d = 2310, e = 12, nrp = 240
Zeros: 650369, Ones: 742591, Pairs: 145550
Processing 1 - 240 of 480 relative primes.[/CODE]But this memory is not actually allocated until much later, and the amount could have changed in that time. We have to be very careful not to exceed it, because therein lie fatal errors, and we do not have control over other applications that may also be using the same memory.

One reason I find the code uses less memory than what is available is that it (based on my understanding, at least):
[LIST=1][*]Determines the value of nrp based on the available memory and fft size (and for some reason restricts it to 4GiB on Windows. Possibly a 32-bit issue, or something from older CUDA versions?)[*]Determines how many passes it takes to process all relative primes[*]Balances each pass so they're the same size.[/LIST]E.g. for my above exponent:
[LIST=1][*]nrp initially = 287 (would use all of the available ram)[*]Requires ~1.7 passes for 480 relative primes[*]Round up to make that 2 passes. Now nrp=240 relative primes per pass instead of running two wildly different sized passes (287 in the first and 193 in the second)[*]actual ram usage is 240*x instead of 287*x[/LIST]I'm not sure of [I]all[/I] of the reasons for this, but the one I can definitely be thankful for is that it is much less likely to crash from insufficient memory.
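The rounding in steps 2-3 above can be written as a couple of ceiling divisions. This is a sketch of my reading of the logic, not the literal CUDAPm1 code, and the function name is made up.

```c
/* Given the total count of relative primes and the most that fit in GPU
   memory at once, compute the number of passes and the balanced per-pass
   nrp, per the reasoning above. */
static int balanced_nrp(int total_rp, int max_nrp)
{
    int passes = (total_rp + max_nrp - 1) / max_nrp; /* ceil(total/max)    */
    return (total_rp + passes - 1) / passes;         /* ceil(total/passes) */
}
```

For the example above, balanced_nrp(480, 287) gives 2 passes of 240 relative primes each, instead of a 287-prime pass followed by a 193-prime one.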

aaronhaviland 2018-11-15 22:13

I like this nvidia-smi view because it's a nice simple summary, and I still get to see how much memory each application is using:

[CODE]+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73 Driver Version: 410.73 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| 0% 56C P2 126W / 185W | 6891MiB / 7949MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1181 G /usr/lib/xorg/Xorg 191MiB |
| 0 2298 G cinnamon 94MiB |
| 0 9334 G /usr/lib/firefox/firefox 3MiB |
| 0 15779 C ./CUDAPm1 6587MiB |
| 0 16292 G /usr/lib/firefox/firefox 3MiB |
+-----------------------------------------------------------------------------+[/CODE]

science_man_88 2018-11-16 00:21

[QUOTE=aaronhaviland;500309]

One reason I find the code uses less memory than what is available is that it (based on my understanding, at least):
[LIST=1][*]Determines the value of nrp based on the available memory and fft size (and for some reason restricts it to 4GiB on Windows. Possibly a 32-bit issue, or something from older CUDA versions?)[/LIST][/QUOTE]

If addressed per byte, 4 GiB is all a 32-bit (4-byte address) computer can address.

kriesel 2018-11-16 00:46

In CUDAPm1 v0.20, with higher exponents and particularly on smaller-memory gpus, NRP is smaller, and I sometimes see several passes followed by a runt final pass. I'm even occasionally seeing small prime numbers as the value of NRP for most passes of a run.

aaronhaviland 2018-11-16 00:49

1 Attachment(s)
[QUOTE=science_man_88;500314]if addressed per byte, 4 GiB is all a 32 bit ( 4 byte) computer can address.[/QUOTE]
Yeah... that's why I'm speculating it might be a 32-bit specific issue.


Anyway, here's the CUDA 8.0, 64-bit, compute capability 2.0 binary I didn't promise (lol). Completely untested... I'm actually attaching it here so I can download it when I reboot into Windows.

kriesel 2018-11-16 00:59

[QUOTE=aaronhaviland;500317]Yeah... that's why I'm speculating it might be a 32-bit specific issue.

Anyway, here's the Cuda-8.0, 64-bit, compute capability 2.0 binary I didn't promise (lol). Completely untested... I'm actually attaching it here so I can download it when I reboot into windows.[/QUOTE]
Thanks! Is that the same commit as the other image, or today's (c1afcee...)?

aaronhaviland 2018-11-16 01:03

[QUOTE=kriesel;500319]Thanks! Is that the same commit as the other image, or today's (c1afcee...)?[/QUOTE]
Same commit.


c1afcee is effectively the same, just prior to the minor fixes I needed for VS to process the build.

kriesel 2018-11-16 05:38

[QUOTE=aaronhaviland;500309]I seem to recall making some modifications to the memory allocations prior to my first git commit, but I cannot recall what they are.

We have to remember that it checks the available RAM before stage 1, as part of the bounds calculations:
...
But this memory is not actually allocated until much later, and the amount could have changed in that time. [/QUOTE]Much later, indeed. Even on fast GPUs, a stage may take days for high exponents. It seems like recalculating available memory right before stage 2 setup could help.[QUOTE]We have to be very careful not to exceed it (available memory) because therein lies fatal errors, and we do not have control over other applications that may also be using the same memory.

One reason I find the code uses less memory than what is available is that it (based on my understanding, at least):
[LIST=1][*]Determines the value of nrp based on the available memory and fft size (and for some reason restricts it to 4GiB on Windows. Possibly a 32-bit issue, or something from older CUDA versions?)[/LIST][/QUOTE]It could be left over from old compiler-version limitations, but I think it more likely a consequence of using the same code base for 64-bit and 32-bit application builds. Up through CUDA 7.5, 32-bit builds were possible. I don't think 32-bit builds are really necessary any more; I'd be interested in other people's thoughts on that. In older CUDA versions there were some speed advantages to 32-bit builds of CUDALucas, but they were not dramatic and perhaps not highly reproducible in benchmarking.
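The nrp budgeting being discussed can be sketched roughly as follows. This is an illustrative paraphrase only: the 8-bytes-per-fft-element residue size, the fixed reserve, and the `pick_nrp` helper are assumptions for illustration, not CUDAPm1's actual formula.

```python
def pick_nrp(free_mem_bytes, fft_length, win32=False, reserve_bytes=100 * 2**20):
    """Illustrative: how many relative-prime residues fit in GPU memory.

    Each residue is assumed to occupy one fft-length buffer of doubles
    (8 bytes per element); a fixed reserve is held back for other buffers.
    """
    if win32:
        # Mirror the old Windows cap discussed above: never budget past 4 GiB.
        free_mem_bytes = min(free_mem_bytes, 4 * 2**30)
    per_residue = fft_length * 8
    usable = free_mem_bytes - reserve_bytes
    return max(usable // per_residue, 0)

# 6723 MiB free, 448K fft (as in the RTX 2070 log below):
print(pick_nrp(6723 * 2**20, 448 * 1024))
```

With the cap in place, an 8 GiB and an 11 GiB card would budget identically, which matches the old behavior where larger cards saw no benefit.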

VictordeHolland 2018-11-16 11:50

1 Attachment(s)
[QUOTE=aaronhaviland;500255]Success compiling with MPIR.

64-bit binary attached
Requires CUDA 10, and a GPU with Compute Capability >= 3.5. Unsure of other requirements, I'm not too familiar with Windows dependencies.

[CODE]Microsoft Windows [Version 10.0.17134.407]
C:\Users\Aaron\Documents\Visual Studio 2017\Projects\CUDAPm1\x64\Release>CUDAPm1.exe 7990427 -b1 986 -b2 124000
CUDAPm1 v0.21
Assuming exponent is trial factored to 63 bits
------- DEVICE 0 -------
name GeForce RTX 2070
Compatibility 7.5
clockRate (MHz) 1710
memClockRate (MHz) 7001
totalGlobalMem 8589934592
totalConstMem 65536
l2CacheSize 4194304
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1024
multiProcessorCount 36
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

No GeForceRTX2070_fft.txt file found. Using default fft lengths.
For optimal fft selection, please run
./CUDAPm1 -cufftbench 1 8192 r
for some small r, 0 < r < 6 e.g.
CUDA reports 6723M of 8192M GPU memory free.
No GeForceRTX2070_threads.txt file found. Running benchmark.
CUDA bench, testing various thread sizes for fft 448K, doing 15 passes.
fft size = 448K, square time = 0.0436 msec, threads 32
fft size = 448K, square time = 0.0449 msec, threads 64
fft size = 448K, square time = 0.0336 msec, threads 128
fft size = 448K, square time = 0.0335 msec, threads 256
fft size = 448K, square time = 0.0356 msec, threads 512
fft size = 448K, square time = 0.0438 msec, threads 1024

Best square time for fft = 448K, time: 0.0335, t = 256

fft size = 448K, ave time = 0.0408 msec, Norm1 threads 32, Norm2 threads 32
fft size = 448K, ave time = 0.0407 msec, Norm1 threads 32, Norm2 threads 64
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 32, Norm2 threads 128
fft size = 448K, ave time = 0.0412 msec, Norm1 threads 32, Norm2 threads 256
fft size = 448K, ave time = 0.0419 msec, Norm1 threads 32, Norm2 threads 512
fft size = 448K, ave time = 0.0433 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 448K, ave time = 0.0402 msec, Norm1 threads 64, Norm2 threads 32
fft size = 448K, ave time = 0.0402 msec, Norm1 threads 64, Norm2 threads 64
fft size = 448K, ave time = 0.0405 msec, Norm1 threads 64, Norm2 threads 128
fft size = 448K, ave time = 0.0406 msec, Norm1 threads 64, Norm2 threads 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 64, Norm2 threads 512
fft size = 448K, ave time = 0.0428 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 128, Norm2 threads 32
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 128, Norm2 threads 64
fft size = 448K, ave time = 0.0397 msec, Norm1 threads 128, Norm2 threads 128
fft size = 448K, ave time = 0.0400 msec, Norm1 threads 128, Norm2 threads 256
fft size = 448K, ave time = 0.0411 msec, Norm1 threads 128, Norm2 threads 512
fft size = 448K, ave time = 0.0423 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 448K, ave time = 0.0401 msec, Norm1 threads 256, Norm2 threads 32
fft size = 448K, ave time = 0.0394 msec, Norm1 threads 256, Norm2 threads 64
fft size = 448K, ave time = 0.0395 msec, Norm1 threads 256, Norm2 threads 128
fft size = 448K, ave time = 0.0403 msec, Norm1 threads 256, Norm2 threads 256
fft size = 448K, ave time = 0.0408 msec, Norm1 threads 256, Norm2 threads 512
fft size = 448K, ave time = 0.0423 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 448K, ave time = 0.0417 msec, Norm1 threads 512, Norm2 threads 32
fft size = 448K, ave time = 0.0416 msec, Norm1 threads 512, Norm2 threads 64
fft size = 448K, ave time = 0.0417 msec, Norm1 threads 512, Norm2 threads 128
fft size = 448K, ave time = 0.0424 msec, Norm1 threads 512, Norm2 threads 256
fft size = 448K, ave time = 0.0428 msec, Norm1 threads 512, Norm2 threads 512
fft size = 448K, ave time = 0.0425 msec, Norm1 threads 512, Norm2 threads 1024

Best time for fft = 448K, time: 0.0394, t1 = 128, t2 = 256, t3 = 64
Using threads: norm1 256, mult 128, norm2 128.
Using up to 4119M GPU memory.
Starting stage 1 P-1, M7990427, B1 = 986, B2 = 124000, fft length = 448K
Doing 1452 iterations
M7990427, 0x32318b15f9d83ab6, n = 448K, CUDAPm1 v0.21
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M7990427 Stage 1 found no factor (P-1, B1=986, B2=124000, e=0, n=448K CUDAPm1 v0.21)
Starting stage 2.
Using b1 = 986, b2 = 124000, d = 420, e = 4, nrp = 96
Zeros: 4430, Ones: 8530, Pairs: 2981
Processing 1 - 96 of 96 relative primes.
Initializing pass... done. transforms: 1987, err = 0.02539, (0.71 real, 0.3550 ms/tran, ETA NA)
Transforms: 9204 M7990427, 0x456fdf3be182449c, n = 448K, CUDAPm1 v0.21 err = 0.02734 (0:03 real, 0.2873 ms/tran, ETA 0:02)
Transforms: 8928 M7990427, 0x2acd8bf807caa816, n = 448K, CUDAPm1 v0.21 err = 0.02734 (0:02 real, 0.2912 ms/tran, ETA 0:00)

Stage 2 complete, 20119 transforms, estimated total time = 0:05
Starting stage 2 gcd.
M7990427 has a factor: 10509037975912491881 (P-1, B1=986, B2=124000, e=4, n=448K CUDAPm1 v0.21)


C:\Users\Aaron\Documents\Visual Studio 2017\Projects\CUDAPm1\x64\Release>[/CODE][/QUOTE]
Looks like it works here!
W10 (1803) x64

CUDA10.0.130 (driver version 411.70)


GTX1080Ti
[code]
C:\CUDAPm1-CUDA10>CUDAPm1-CUDA10.exe 7990427 -b1 986 -b2 124000
CUDAPm1 v0.21
Assuming exponent is trial factored to 63 bits
Warning: Couldn't find .ini file. Using defaults for non-specified options.
CUDA reports 9312M of 11264M GPU memory free.
No GeForceGTX1080Ti_threads.txt file found. Running benchmark.
CUDA bench, testing various thread sizes for fft 512K, doing 15 passes.
fft size = 512K, square time = 0.0346 msec, threads 32
fft size = 512K, square time = 0.0360 msec, threads 64
fft size = 512K, square time = 0.0362 msec, threads 128
fft size = 512K, square time = 0.0363 msec, threads 256
fft size = 512K, square time = 0.0372 msec, threads 512
fft size = 512K, square time = 0.0379 msec, threads 1024

Best square time for fft = 512K, time: 0.0346, t = 32

fft size = 512K, ave time = 0.0454 msec, Norm1 threads 32, Norm2 threads 32
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 32, Norm2 threads 64
fft size = 512K, ave time = 0.0450 msec, Norm1 threads 32, Norm2 threads 128
fft size = 512K, ave time = 0.0453 msec, Norm1 threads 32, Norm2 threads 256
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 32, Norm2 threads 512
fft size = 512K, ave time = 0.0460 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 512K, ave time = 0.0445 msec, Norm1 threads 64, Norm2 threads 32
fft size = 512K, ave time = 0.0445 msec, Norm1 threads 64, Norm2 threads 64
fft size = 512K, ave time = 0.0449 msec, Norm1 threads 64, Norm2 threads 128
fft size = 512K, ave time = 0.0451 msec, Norm1 threads 64, Norm2 threads 256
fft size = 512K, ave time = 0.0456 msec, Norm1 threads 64, Norm2 threads 512
fft size = 512K, ave time = 0.0465 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 128, Norm2 threads 32
fft size = 512K, ave time = 0.0452 msec, Norm1 threads 128, Norm2 threads 64
fft size = 512K, ave time = 0.0453 msec, Norm1 threads 128, Norm2 threads 128
fft size = 512K, ave time = 0.0453 msec, Norm1 threads 128, Norm2 threads 256
fft size = 512K, ave time = 0.0461 msec, Norm1 threads 128, Norm2 threads 512
fft size = 512K, ave time = 0.0475 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 512K, ave time = 0.0455 msec, Norm1 threads 256, Norm2 threads 32
fft size = 512K, ave time = 0.0455 msec, Norm1 threads 256, Norm2 threads 64
fft size = 512K, ave time = 0.0456 msec, Norm1 threads 256, Norm2 threads 128
fft size = 512K, ave time = 0.0456 msec, Norm1 threads 256, Norm2 threads 256
fft size = 512K, ave time = 0.0470 msec, Norm1 threads 256, Norm2 threads 512
fft size = 512K, ave time = 0.0477 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 512K, ave time = 0.0459 msec, Norm1 threads 512, Norm2 threads 32
fft size = 512K, ave time = 0.0462 msec, Norm1 threads 512, Norm2 threads 64
fft size = 512K, ave time = 0.0463 msec, Norm1 threads 512, Norm2 threads 128
fft size = 512K, ave time = 0.0464 msec, Norm1 threads 512, Norm2 threads 256
fft size = 512K, ave time = 0.0474 msec, Norm1 threads 512, Norm2 threads 512
fft size = 512K, ave time = 0.0475 msec, Norm1 threads 512, Norm2 threads 1024

Best time for fft = 512K, time: 0.0445, t1 = 64, t2 = 32, t3 = 32
Using threads: norm1 256, mult 128, norm2 128.
Using up to 4124M GPU memory.
Starting stage 1 P-1, M7990427, B1 = 986, B2 = 124000, fft length = 512K
Doing 1452 iterations
M7990427, 0x32318b15f9d83ab6, n = 512K, CUDAPm1 v0.21
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M7990427 Stage 1 found no factor (P-1, B1=986, B2=124000, e=0, n=512K CUDAPm1 v0.21)
Starting stage 2.
Using b1 = 986, b2 = 124000, d = 420, e = 4, nrp = 96
Zeros: 4430, Ones: 8530, Pairs: 2981
Processing 1 - 96 of 96 relative primes.
Initializing pass... done. transforms: 1987, err = 0.00134, (0.53 real, 0.2650 ms/tran, ETA NA)
Transforms: 18132 M7990427, 0x2acd8bf807caa816, n = 512K, CUDAPm1 v0.21 err = 0.00146 (0:05 real, 0.3128 ms/tran, ETA 0:00)

Stage 2 complete, 20119 transforms, estimated total time = 0:05
Starting stage 2 gcd.
M7990427 has a factor: 10509037975912491881 (P-1, B1=986, B2=124000, e=4, n=512K CUDAPm1 v0.21)
[/code]Anymore test cases that I should run?

kriesel 2018-11-16 14:46

[QUOTE=VictordeHolland;500342]Looks like it works here!
...Anymore test cases that I should run?[/QUOTE]
You could try some run-of-the-mill manual P-1 assignments.
Or get adventurous and try some larger ones. Note that run times can be quite long, and some might fail to complete. If you hit a case that fails, please share the details.

If you want some verification candidates, here's an excerpt from the draft rewrite of the CUDAPm1 readme file.
[CODE] Run CUDAPm1 on some exponents with known factors that should be found, and
see whether you find them. Easiest way is to select from the following list,
exponents at or near the size you plan to run, and put them in the worktodo
file. The bounds necessary to find factors vary by exponent. CUDAPm1's
automatic parameter selection will be enough to find most but not all.

Exponent Min B1 Min B2 fft length notes
4444091 7 2,557 256k
50001781 94,709 4,067,587 2688k
51558151 5,953 2,034,041 2880k
54447193 1,181 682,009 3072k
58610467 70,843 694,201 3200k
61012769 10,273 1,572,097 3360k
81229789 6,709 11,282,221 4704K
100000081 1,289 7,554,653 5600K
120002191 1,563 3,109,391 7168K
150000713 15,131 2,294,519 8640K
200000183 953 1,138,061 11200K
200001187 204,983 207,821 11200K
200003173 4,651 229,813 11200K
249500221 4 2.58951e+9 14336K big bounds, much memory & time
249500501 307 167,381 14336K
290001377 2,551 34,354,769 16384K takes days

PFactor=1,2,4444091,-1,70,2
PFactor=1,2,50001781,-1,74,2
PFactor=1,2,51558151,-1,74,2
PFactor=1,2,54447193,-1,74,2
PFactor=1,2,58610467,-1,74,2
PFactor=1,2,61012769,-1,74,2
PFactor=1,2,81229789,-1,75,2
PFactor=1,2,100000081,-1,76,2
Pfactor=1,2,120002191,-1,75,2
Pfactor=1,2,150000713,-1,75,2
Pfactor=1,2,200001187,-1,75,2
PFactor=1,2,249500501,-1,75,2
PFactor=1,2,290001377,-1,75,2

Exponent Factor (may be composite) Prime factors
4444091 1809798096458971047321927127 = 8888183 x 319974553 x 636358278473
50001781 4392938042637898431087689 = 3 x 182851 x 8008229
51558151 755277543419074012358186647
54447193 17261184235049628259201
58610467 69057033982979789260999
61012769 2018028590362685212673
81229789 355078783674010195200030259699844128700274440385857
= 488121804389130135740149369 x 727438890213848757119753
100000081 3441393510714285782119
120002191 100835659918276033441
150000713 1447762785107694357647
200000183 849003842550205126847
200001187 3050161780881530584679
200003173 14652109287435525414352647642348599
= 4320552944485007 x 3391257895852957657
249500221 5168661482381201657
249500501 3571511465549660434777661921959439
= 11607130072256471 x 307699788260867209
290001377 10645243382592701071676802590718709559
= 1436135993277492383 x 7412420155488583273
or 90944796249039267769901814723364335322839708522092302667497 = 170370076089478747961 x 371696926552024067119 x 1436135993277492383

Feel free to pick your own.
Evaluate them at their equivalent of
http://www.mersenne.ca/exponent/249500501[/CODE]
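The PFactor lines in that excerpt follow a fixed comma-separated format (k, b, n, c, how far trial factored in bits, tests saved). A throwaway script like this can generate them for a chosen subset of the verification exponents; the candidate list and the helper name here are hypothetical, just string formatting:

```python
# (exponent, TF bits) pairs taken from the verification list above.
CANDIDATES = [
    (4444091, 70),
    (50001781, 74),
    (100000081, 76),
]

def pfactor_line(exponent, tf_bits, tests_saved=2):
    # PFactor=k,b,n,c,how_far_factored,tests_saved for Mersenne numbers 2^n-1
    return f"PFactor=1,2,{exponent},-1,{tf_bits},{tests_saved}"

for exp, bits in CANDIDATES:
    print(pfactor_line(exp, bits))
```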

aaronhaviland 2018-11-16 22:00

[QUOTE=kriesel;500347]If you want some verification candidates, here's an excerpt from the draft rewrite of the CUDAPm1 readme file.[/QUOTE]
This is a great list. I want to include some more "quick" candidates as tests as part of the build process, beyond what I already have. (And I want to find out if Visual Studio can run tests post-compile... right now I just have Makefile rules for that on *nix)

VictordeHolland 2018-11-17 08:54

1 Attachment(s)
I ran the ones that take an hour at the most:
[code] 4,444,091 7 2,557
50,001,781 94,709 4,067,587
51,558,151 5,953 2,034,041
54,447,193 1,181 682,009
58,610,467 70,843 694,201
61,012,769 10,273 1,572,097
81,229,789 6,709 11,282,221
100,000,081 1,289 7,554,653
120,002,191 1,563 3,109,391
150,000,713 15,131 2,294,519
200,000,183 953 1,138,061
200,001,187 204,983 207,821
200,003,173 4,651 229,813


Pminus1=1,2,4444091,-1,7,2557
Pminus1=1,2,50001781,-1,94709,4067587
Pminus1=1,2,51558151,-1,5953,2034041
Pminus1=1,2,54447193,-1,1181,682009
Pminus1=1,2,58610467,-1,70843,694201
Pminus1=1,2,61012769,-1,10273,1572097
Pminus1=1,2,81229789,-1,6709,11282221
Pminus1=1,2,100000081,-1,1289,7554653
Pminus1=1,2,120002191,-1,1563,3109391
Pminus1=1,2,150000713,-1,15131,2294519
Pminus1=1,2,200000183,-1,953,1138061
Pminus1=1,2,200001187,-1,204983,207821
Pminus1=1,2,200003173,-1,4651,229813[/code]and they completed successfully:
[code]
M4444091 has a factor: 2843992382407199 (P-1, B1=7, B2=7, e=0, n=256K CUDAPm1 v0.21)
M50001781 has a factor: 4392938042637898431087689 (P-1, B1=94709, B2=4067587, e=12, n=2816K CUDAPm1 v0.21)
M51558151 has a factor: 755277543419074012358186647 (P-1, B1=5953, B2=2034041, e=12, n=2816K CUDAPm1 v0.21)
M54447193 has a factor: 17261184235049628259201 (P-1, B1=1181, B2=682009, e=12, n=3200K CUDAPm1 v0.21)
M58610467 has a factor: 69057033982979789260999 (P-1, B1=70843, B2=694201, e=12, n=3200K CUDAPm1 v0.21)
M61012769 has a factor: 2018028590362685212673 (P-1, B1=10273, B2=1572097, e=12, n=3456K CUDAPm1 v0.21)
M81229789 has a factor: 727438890213848757119753 (P-1, B1=6709, B2=11282221, e=12, n=4480K CUDAPm1 v0.21)
M100000081 has a factor: 3441393510714285782119 (P-1, B1=1289, B2=7554653, e=12, n=5760K CUDAPm1 v0.21)
M120002191 has a factor: 100835659918276033441 (P-1, B1=1563, B2=3109391, e=12, n=6912K CUDAPm1 v0.21)
M150000713 has a factor: 1447762785107694357647 (P-1, B1=15131, B2=2294519, e=12, n=8640K CUDAPm1 v0.21)
M200000183 has a factor: 849003842550205126847 (P-1, B1=953, B2=1138061, e=12, n=11200K CUDAPm1 v0.21)
M200001187 has a factor: 3050161780881530584679 (P-1, B1=204983, B2=207821, e=12, n=11200K CUDAPm1 v0.21)
M200003173 has a factor: 14652109287435525414352647642348599 (P-1, B1=4651, B2=229813, e=12, n=11200K CUDAPm1 v0.21)
[/code]

aaronhaviland 2018-11-18 00:58

[QUOTE=aaronhaviland;500367]I want to include some more "quick" candidates as tests as part of the build process, beyond what I already have. [/QUOTE]

Aaaand on that note, I've added some built-in self-tests into the code itself, instead of relying on the build process.
[CODE]-selftest Run a quick selftest (ETA: 0:16)
-selftest2 Run a longer selftest (ETA: 17:22)[/CODE]So far I have 5 "quick" self tests (< 10s each on my hardware), and 2 "slow" self tests (~ 10m each on my hardware).
Checkpoints, worktodo.txt, and results.txt I/O are completely disabled for these tests.

aaronhaviland 2018-11-18 20:34

[QUOTE=kriesel;500133]Yes. See for example [URL]https://www.mersenneforum.org/showpost.php?p=456324&postcount=2591[/URL] where 1024 squaring threads is bad, gives timings half what others do, in CUDALucas. There are also cases where 32 threads is bad. Compute capability 2.0 I think. CUDAPm1 issue #16.

There are also cases where certain fft lengths give bad results. As I recall these were found for old CUDA levels.[/QUOTE] Check for anomalous thread timings: Commit 36ceb29
Check for anomalous fft timings: Commit 538118a

[QUOTE]CUDALucas was modified to trap for a select few bad-residue cases; 0x02, 0x00, and 0xfffffffffffffffd. The CUDALucas v2.06beta traps for its known bad residues. Since CUDAPM1 was derived from CUDALucas, years before, it has some of the same issues as well as some of its own. CUDAPm1's list of bad residues is longer.[/QUOTE]Added check for this. Commit a2c7f50
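The anomalous-thread-timing check could look roughly like this. The 0.75-of-average threshold matches the warnings that v0.22 prints in its threadbench output, but the code below is an illustrative paraphrase, not the actual implementation; suspiciously fast timings usually mean the kernel silently failed rather than ran well.

```python
def filter_timings(timings, factor=0.75):
    """Drop suspiciously fast thread-combination timings.

    timings: dict mapping (t1, t2, t3) -> msec. Returns the surviving dict
    and the threshold used (factor times the average of all combinations).
    """
    avg = sum(timings.values()) / len(timings)
    threshold = factor * avg
    kept = {k: v for k, v in timings.items() if v >= threshold}
    return kept, threshold

# Three combinations, one implausibly fast (likely a failed kernel launch):
timings = {(32, 64, 32): 0.1932, (512, 64, 1024): 1.6869, (256, 64, 32): 1.7076}
kept, thr = filter_timings(timings)
best = min(kept, key=kept.get)
print(best, kept[best])
```

The "best" timing is then chosen only among the survivors, so a silently failing combination can no longer win the benchmark.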

aaronhaviland 2018-11-19 00:53

Releasing all the above as v0.22
(Binaries uploaded:[URL]https://github.com/ah42/cuda-p1/releases/tag/0.22[/URL])
[LIST][*]First proper release since forking[*](Originally based on code from [URL]https://sourceforge.net/projects/cudapm1/[/URL] (r52)[*]Compute Dickman's function live, instead of using incorrect precomputed values[*]Fix memory leaks in stage2[*]Fix fencepost error causing invalid results[*]Fix potential overflows[*]Use smaller data types when possible[*]Reduce kernel branching[*]Update build for CUDA 10.0 / Compute Capability 7.5[*]Split kernel code into individual files[*]Replace GMP with MPIR for easier cross-platform builds.[*]Automatically run threadbench if required[*]Add VS2017 and eclipse build files.[*]Implement internal self-test system[*]Allow full memory allocation on 64-bit windows builds[*]Contributions from kriesel:[LIST][*]Add test for known invalid residues[*]Comment & code formatting/cleanup[*]Add test for abnormally low threadbench timings[*]Add test for abnormally low fftbench timings[/LIST] [/LIST]

LaurV 2018-11-19 03:44

Now, that is a very good job, after so long time, sir! Hat off and bow. :bow:
We will give it a spin tonight when we reach home.

VictordeHolland 2018-11-19 11:48

Wow, great job!

kriesel 2018-11-19 19:50

[QUOTE=aaronhaviland;500475]Releasing all the above as v0.22
(Binaries uploaded:[URL]https://github.com/ah42/cuda-p1/releases/tag/0.22[/URL])
[LIST][*]First proper release since forking[*](Originally based on code from [URL]https://sourceforge.net/projects/cudapm1/[/URL] (r52)[*]Compute Dickman's function live, instead of using incorrect precomputed values[*]Fix memory leaks in stage2[*]Fix fencepost error causing invalid results[*]Fix potential overflows[*]Use smaller data types when possible[*]Reduce kernel branching[*]Update build for CUDA 10.0 / Compute Capability 7.5[*]Split kernel code into individual files[*]Replace GMP with MPIR for easier cross-platform builds.[*]Automatically run threadbench if required[*]Add VS2017 and eclipse build files.[*]Implement internal self-test system[*]Allow full memory allocation on 64-bit windows builds[*]Contributions from kriesel:[LIST][*]Add test for known invalid residues[*]Comment & code formatting/cleanup[*]Add test for abnormally low threadbench timings[*]Add test for abnormally low fftbench timings[/LIST] [/LIST][/QUOTE]
Outstanding!

I've updated my reference material to point to this (Aaron's post), and emailed James Heinrich with a link for updating his mirror.
What's next, Aaron? Logging extensions, date/time stamp addition, and removal of "CUDAPm1 v0.2x" from every iteration or transforms progress record?
What would other users like to see, assuming Aaron is open to suggestions?
I'll test this in my production running and for changes in limits, after finishing out some V0.20 limits testing that is still ongoing.

Stef42 2018-11-19 22:05

I have some issues getting Stage 2 going with the 0.22 version. It starts filling the GPU memory all the way to 9200 MB, then just quits (the CMD window closes). I'm using a GTX 1080 Ti with 11 GB memory. (Windows 10 Home x64, driver 411.70)

CMD output:
[QUOTE]No GeForceGTX1080Ti_fft.txt file found. Using default fft lengths.
For optimal fft selection, please run
./CUDAPm1 -cufftbench 1 8192 r
for some small r, 0 < r < 6 e.g.
CUDA reports 9312M of 11264M GPU memory free.
Using threads: norm1 512, mult 256, norm2 512.
No stage 2 checkpoint.
Using up to 9200M GPU memory.
Selected B1=905000, B2=19683750, 3.49% chance of finding a factor
Using B1 = 905000 from savefile.
Continuing stage 2 from a partial result of M89326001 fft length = 5120K
Starting stage 2.
Using b1 = 905000, b2 = 19683750, d = 840, e = 12, nrp = 192[/QUOTE]

James Heinrich 2018-11-20 05:25

[QUOTE=Stef42;500522]...then juist quits (CMD window closes)[/QUOTE]If you're running it by double-clicking the exe then any message it may give when it terminates would be unfortunately lost. If you open a command prompt first and then run the program, any final error message output (if any) would remain visible.

Stef42 2018-11-20 07:33

[QUOTE=James Heinrich;500538]If you're running it by double-clicking the exe then any message it may give when it terminates would be unfortunately lost. If you open a command prompt first and then run the program, any final error message output (if any) would remain visible.[/QUOTE]

Tried that, no message whatsoever. It just terminates.

kriesel 2018-11-20 08:16

[QUOTE=Stef42;500547]Tried that, no message what so ever. It just terminates.[/QUOTE]
That's not unusual for CUDAPm1 v0.20, even with console redirection to a file. As I recall, the original author owftheevil posted about certain error cases terminating with no message. In my notes: post 373, 2013-09-23, win64 CUDA 5.5 version attached, with discussion of fftbench parameters & threadbench.
"excessive stage 2 round-off errors simply halt the program without error messages."
"there could be some inefficient fft lengths that I haven't looked at yet, which will cause a test to terminate with an excessive round-off error."
[URL]https://www.mersenneforum.org/showpost.php?p=353933&postcount=373[/URL]
The memory filling to 9.2GB on a mere 90M exponent is news.
On a Quadro 2000 with v0.20, I had issues completing exponents at 85M on one unit and not another, also at 171M.

kriesel 2018-11-20 17:38

[QUOTE=Stef42;500522]I have some issues getting Stage 2 going with the 0.22 version. It starts filling the GPU memory all the way to 9200 mb, then juist quits (CMD window closes). I'm using a GTX 1080 Ti with 11GB memory. (Windows 10 Home x64, driver 411.70)

CMD output:[/QUOTE]
Interesting, and a possible new issue.

That exponent 89326001 has no P-1 assignment listed and is not available for assignment. [URL]https://www.mersenne.org/report_exponent/?exp_lo=89326001&exp_hi=&full=1[/URL]
I could try it here for confirmation and maybe isolation of what environment(s) it occurs in. What was the worktodo entry for it? I suspect it was something like
PFactor=1,2,89326001,-1,76,2

Stef42 2018-11-20 19:31

[QUOTE=kriesel;500584]Interesting, and a possible new issue.

That exponent 89326001 has no P-1 assignment listed and is not available for assignment. [URL]https://www.mersenne.org/report_exponent/?exp_lo=89326001&exp_hi=&full=1[/URL]
I could try it here for confirmation and maybe isolation of what environment(s) it occurs in. What was the worktodo entry for it? I suspect it was something like
PFactor=1,2,89326001,-1,76,2[/QUOTE]

I have reserved the exponent through GPU72.com.
Worktodo does indeed look like this:

[QUOTE]Pfactor=N/A,1,2,89326001,-1,76,2[/QUOTE]

A few assignments were completed from GPU72.com before this one. Funny thing was that similar exponents in the 89M range only used roughly 4300 MB memory.

kriesel 2018-11-20 23:35

[QUOTE=Stef42;500522]I have some issues getting Stage 2 going with the 0.22 version. It starts filling the GPU memory all the way to 9200 MB, then just quits (the CMD window closes). I'm using a GTX 1080 Ti with 11 GB memory. (Windows 10 Home x64, driver 411.70)
[/QUOTE]
FYI, it completed ok here on Win7 x64 CUDA5.5 build V0.20, driver 378.78 in about 2 hours on a GTX 1080 Ti. I'll try closer to your case later.

[CODE]CUDA reports 10988M of 11264M GPU memory free.
Index 55
Using threads: norm1 32, mult 32, norm2 32.
Using up to 4374M GPU memory.
Selected B1=770000, B2=18672500, 3.37% chance of finding a factor
Starting stage 1 P-1, M89326001, B1 = 770000, B2 = 18672500, fft length = 5184K
...
M89326001 Stage 2 found no factor (P-1, B1=770000, B2=18672500, e=4, n=5184K CUDAPm1 v0.20)

[/CODE]

aaronhaviland 2018-11-21 03:04

[QUOTE=Stef42;500522]I have some issues getting Stage 2 going with the 0.22 version. It starts filling the GPU memory all the way to 9200 mb, then juist quits (CMD window closes). I'm using a GTX 1080 Ti with 11GB memory. (Windows 10 Home x64, driver 411.70)[/QUOTE]
In prior Windows releases, this program would not make use of more than 4 GiB of video RAM. I lifted that restriction for this build, because I found no issues with it on my 8 GiB RTX 2070. The only other cards I had available were 3 GiB and 2 GiB, so I didn't bother trying them.

Noticing that you have a device with 11 GiB, I'm very curious to find out if there was another reason for this limitation that I hadn't been able to determine. Especially since you mention it "starts filling the GPU memory": it is apparently trying to malloc that memory, and failing.

If you could please do me a favour, fiddle with the UnusedMem value in the .ini file and see if you can determine a value that doesn't crash. I would start with a value of around 7168, as that would simulate the old 4 GiB limitation (11 GiB - 4 GiB = 7 GiB = 7168 MiB).
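Narrowing in on the crash threshold is a manual bisection on the UnusedMem setting. A tiny sketch of the bookkeeping, with all values in MiB and the starting bounds assumed (7168 simulating the old cap, 100 being the default reserve):

```python
def next_unusedmem(good_mib, bad_mib):
    """Midpoint for a manual bisection on the UnusedMem .ini value (MiB).

    good_mib: largest reserve known to avoid the crash (e.g. 7168, the old 4 GiB cap)
    bad_mib:  smallest reserve known to still crash (e.g. 100, the default)
    """
    return (good_mib + bad_mib) // 2

# Example: 11 GiB card, old behaviour equivalent to UnusedMem=7168
print(next_unusedmem(7168, 100))  # first value to try: 3634
```

After each run, replace `good_mib` or `bad_mib` with the tried value depending on whether it crashed, and repeat until the bounds meet.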

kriesel 2018-11-21 04:29

First V0.22 try
 
Interesting benchmarking, followed by a silent halt.

It was an attempt to continue a run that had a silent halt in v0.20; v0.22 did too.[CODE]CUDAPm1 v0.22
Warning: Couldn't find or parse ini file option UnusedMem; using default 100MiB.
------- DEVICE 0 -------
name GeForce GTX 1080 Ti
Compatibility 6.1
clockRate (MHz) 1620
memClockRate (MHz) 5505
totalGlobalMem 11811160064
totalConstMem 65536
l2CacheSize 2883584
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 28
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

No GeForceGTX1080Ti_fft.txt file found. Using default fft lengths.
For optimal fft selection, please run
./CUDAPm1 -cufftbench 1 8192 r
for some small r, 0 < r < 6 e.g.
CUDA reports 10988M of 11264M GPU memory free.
No GeForceGTX1080Ti_threads.txt file found. Running benchmark.
CUDA bench, testing various thread sizes for fft 23040K, doing 15 passes.
fft size = 23040K, square time = [B][COLOR=Red]0.0000[/COLOR][/B] msec, threads 32
fft size = 23040K, square time = [B][COLOR=red]0.0000[/COLOR][/B] msec, threads 64
fft size = 23040K, square time = 1.4538 msec, threads 128
fft size = 23040K, square time = 1.4513 msec, threads 256
fft size = 23040K, square time = 1.4494 msec, threads 512
fft size = 23040K, square time = 1.4492 msec, threads 1024

Best square time for fft = 23040K, time: 0.0000, t = 64

fft size = 23040K, ave time = 0.1932 msec, Norm1 threads 32, Norm2 threads 32
fft size = 23040K, ave time = 0.2154 msec, Norm1 threads 32, Norm2 threads 64
fft size = 23040K, ave time = 0.2240 msec, Norm1 threads 32, Norm2 threads 128
fft size = 23040K, ave time = 0.2248 msec, Norm1 threads 32, Norm2 threads 256
fft size = 23040K, ave time = 0.2358 msec, Norm1 threads 32, Norm2 threads 512
fft size = 23040K, ave time = 0.2438 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 23040K, ave time = 0.1219 msec, Norm1 threads 64, Norm2 threads 32
fft size = 23040K, ave time = 0.1329 msec, Norm1 threads 64, Norm2 threads 64
fft size = 23040K, ave time = 0.1421 msec, Norm1 threads 64, Norm2 threads 128
fft size = 23040K, ave time = 0.1421 msec, Norm1 threads 64, Norm2 threads 256
fft size = 23040K, ave time = 0.1437 msec, Norm1 threads 64, Norm2 threads 512
fft size = 23040K, ave time = 0.1453 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 23040K, ave time = 0.0589 msec, Norm1 threads 128, Norm2 threads 32
fft size = 23040K, ave time = 0.0648 msec, Norm1 threads 128, Norm2 threads 64
fft size = 23040K, ave time = 0.0693 msec, Norm1 threads 128, Norm2 threads 128
fft size = 23040K, ave time = 0.0687 msec, Norm1 threads 128, Norm2 threads 256
fft size = 23040K, ave time = 0.0689 msec, Norm1 threads 128, Norm2 threads 512
fft size = 23040K, ave time = 0.0684 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 23040K, ave time = 1.7076 msec, Norm1 threads 256, Norm2 threads 32
fft size = 23040K, ave time = 1.7102 msec, Norm1 threads 256, Norm2 threads 64
fft size = 23040K, ave time = 1.7152 msec, Norm1 threads 256, Norm2 threads 128
fft size = 23040K, ave time = 1.7102 msec, Norm1 threads 256, Norm2 threads 256
fft size = 23040K, ave time = 1.7119 msec, Norm1 threads 256, Norm2 threads 512
fft size = 23040K, ave time = 1.7096 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 23040K, ave time = 1.6909 msec, Norm1 threads 512, Norm2 threads 32
fft size = 23040K, ave time = 1.6939 msec, Norm1 threads 512, Norm2 threads 64
fft size = 23040K, ave time = 1.6924 msec, Norm1 threads 512, Norm2 threads 128
fft size = 23040K, ave time = 1.6930 msec, Norm1 threads 512, Norm2 threads 256
fft size = 23040K, ave time = 1.6909 msec, Norm1 threads 512, Norm2 threads 512
fft size = 23040K, ave time = 1.6869 msec, Norm1 threads 512, Norm2 threads 1024

Average time for fft= 23040K, all threads variations 0.7659 msec, threshold value for valid timings set to 0.7500 of this, 0.5744 msec
Warning, time for fft = 23040K, time: 0.1932 msec, t1 = 32, t2 = 64, t3 = 32 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.2154 msec, t1 = 32, t2 = 64, t3 = 64 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.2240 msec, t1 = 32, t2 = 64, t3 = 128 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.2248 msec, t1 = 32, t2 = 64, t3 = 256 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.2358 msec, t1 = 32, t2 = 64, t3 = 512 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.2438 msec, t1 = 32, t2 = 64, t3 = 1024 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.1219 msec, t1 = 64, t2 = 64, t3 = 32 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.1329 msec, t1 = 64, t2 = 64, t3 = 64 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.1421 msec, t1 = 64, t2 = 64, t3 = 128 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.1421 msec, t1 = 64, t2 = 64, t3 = 256 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.1437 msec, t1 = 64, t2 = 64, t3 = 512 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.1453 msec, t1 = 64, t2 = 64, t3 = 1024 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.0589 msec, t1 = 128, t2 = 64, t3 = 32 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.0648 msec, t1 = 128, t2 = 64, t3 = 64 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.0693 msec, t1 = 128, t2 = 64, t3 = 128 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.0687 msec, t1 = 128, t2 = 64, t3 = 256 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.0689 msec, t1 = 128, t2 = 64, t3 = 512 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Warning, time for fft = 23040K, time: 0.0684 msec, t1 = 128, t2 = 64, t3 = 1024 is below threshold 0.5744 msec (0.7500 of average 0.7659)
Timings below threshold were detected for 18 norm1 / mult / norm2 combinations for fft length 23040K and omitted from consideration for best.

Best time for fft = 23040K, time: 1.6869, t1 = 512, t2 = 64, t3 = 1024
Using threads: norm1 512, mult 128, norm2 128.
No stage 2 checkpoint.
Using up to 10800M GPU memory.
Selected B1=3965000, B2=100116250, 4.25% chance of finding a factor
Using B1 = 3310000 from savefile.
Continuing stage 2 from a partial result of M400001387 fft length = 23040K
Starting stage 2.
batch wrapper reports exit at Tue 11/20/2018 22:03:21.82 [/CODE]Corresponding benchmark numbers in v0.20 are
23040 411074273 16.5434
23040 32 32 32 17.7388
Why are these so different in v0.22?

kriesel 2018-11-21 07:00

[QUOTE=Stef42;500522]I have some issues getting Stage 2 going with the 0.22 version. It starts filling the GPU memory all the way to 9200 MB, then just quits (the CMD window closes). I'm using a GTX 1080 Ti with 11GB memory. (Windows 10 Home x64, driver 411.70)
[/QUOTE]
Win64 CUDAPm1 v0.22 CUDA 8.0 on Windows 7 Pro with driver 378.78: the program picked lower B1 and B2, occupied 10.7GB on a GTX 1080 Ti, and ran to completion.[CODE]batch wrapper reports (re)launch at Tue 11/20/2018 22:43:27.36 reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1080 Ti
Compatibility 6.1
clockRate (MHz) 1620
memClockRate (MHz) 5505
totalGlobalMem 11811160064
totalConstMem 65536
l2CacheSize 2883584
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 28
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 10988M of 11264M GPU memory free.
No entry for fft = 5184k found. Running benchmark.
CUDA bench, testing various thread sizes for fft 5184K, doing 15 passes.
fft size = 5184K, square time = 0.3257 msec, threads 32
fft size = 5184K, square time = 0.3291 msec, threads 64
fft size = 5184K, square time = 0.3289 msec, threads 128
fft size = 5184K, square time = 0.3288 msec, threads 256
fft size = 5184K, square time = 0.3293 msec, threads 512
fft size = 5184K, square time = 0.3300 msec, threads 1024

Best square time for fft = 5184K, time: 0.3257, t = 32

fft size = 5184K, ave time = 0.0443 msec, Norm1 threads 32, Norm2 threads 32
fft size = 5184K, ave time = 0.0534 msec, Norm1 threads 32, Norm2 threads 64
fft size = 5184K, ave time = 0.0524 msec, Norm1 threads 32, Norm2 threads 128
fft size = 5184K, ave time = 0.0525 msec, Norm1 threads 32, Norm2 threads 256
fft size = 5184K, ave time = 0.0522 msec, Norm1 threads 32, Norm2 threads 512
fft size = 5184K, ave time = 0.0526 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 5184K, ave time = 0.4067 msec, Norm1 threads 64, Norm2 threads 32
fft size = 5184K, ave time = 0.4113 msec, Norm1 threads 64, Norm2 threads 64
fft size = 5184K, ave time = 0.4102 msec, Norm1 threads 64, Norm2 threads 128
fft size = 5184K, ave time = 0.4093 msec, Norm1 threads 64, Norm2 threads 256
fft size = 5184K, ave time = 0.4090 msec, Norm1 threads 64, Norm2 threads 512
fft size = 5184K, ave time = 0.4074 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 5184K, ave time = 0.3929 msec, Norm1 threads 128, Norm2 threads 32
fft size = 5184K, ave time = 0.3937 msec, Norm1 threads 128, Norm2 threads 64
fft size = 5184K, ave time = 0.3940 msec, Norm1 threads 128, Norm2 threads 128
fft size = 5184K, ave time = 0.3950 msec, Norm1 threads 128, Norm2 threads 256
fft size = 5184K, ave time = 0.3950 msec, Norm1 threads 128, Norm2 threads 512
fft size = 5184K, ave time = 0.3946 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 5184K, ave time = 0.3882 msec, Norm1 threads 256, Norm2 threads 32
fft size = 5184K, ave time = 0.3883 msec, Norm1 threads 256, Norm2 threads 64
fft size = 5184K, ave time = 0.3884 msec, Norm1 threads 256, Norm2 threads 128
fft size = 5184K, ave time = 0.3877 msec, Norm1 threads 256, Norm2 threads 256
fft size = 5184K, ave time = 0.3869 msec, Norm1 threads 256, Norm2 threads 512
fft size = 5184K, ave time = 0.3877 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 5184K, ave time = 0.3860 msec, Norm1 threads 512, Norm2 threads 32
fft size = 5184K, ave time = 0.3860 msec, Norm1 threads 512, Norm2 threads 64
fft size = 5184K, ave time = 0.3861 msec, Norm1 threads 512, Norm2 threads 128
fft size = 5184K, ave time = 0.3856 msec, Norm1 threads 512, Norm2 threads 256
fft size = 5184K, ave time = 0.3845 msec, Norm1 threads 512, Norm2 threads 512
fft size = 5184K, ave time = 0.3866 msec, Norm1 threads 512, Norm2 threads 1024

Average time for fft= 5184K, all threads variations 0.3256 msec, threshold value for valid timings set to 0.7500 of this, 0.2442 msec
Warning, time for fft = 5184K, time: 0.0443 msec, t1 = 32, t2 = 32, t3 = 32 is below threshold 0.2442 msec (0.7500 of average 0.3256)
Warning, time for fft = 5184K, time: 0.0534 msec, t1 = 32, t2 = 32, t3 = 64 is below threshold 0.2442 msec (0.7500 of average 0.3256)
Warning, time for fft = 5184K, time: 0.0524 msec, t1 = 32, t2 = 32, t3 = 128 is below threshold 0.2442 msec (0.7500 of average 0.3256)
Warning, time for fft = 5184K, time: 0.0525 msec, t1 = 32, t2 = 32, t3 = 256 is below threshold 0.2442 msec (0.7500 of average 0.3256)
Warning, time for fft = 5184K, time: 0.0522 msec, t1 = 32, t2 = 32, t3 = 512 is below threshold 0.2442 msec (0.7500 of average 0.3256)
Warning, time for fft = 5184K, time: 0.0526 msec, t1 = 32, t2 = 32, t3 = 1024 is below threshold 0.2442 msec (0.7500 of average 0.3256)
Timings below threshold were detected for 6 norm1 / mult / norm2 combinations for fft length 5184K and omitted from consideration for best.

Best time for fft = 5184K, time: 0.3845, t1 = 512, t2 = 32, t3 = 512
Using threads: norm1 512, mult 128, norm2 128.
Using up to 10854M GPU memory.
Selected B1=630000, B2=10710000, 1.7% chance of finding a factor
Starting stage 1 P-1, M89326001, B1 = 630000, B2 = 10710000, fft length = 5184K
Doing 908960 iterations
Iteration 100000 M89326001, 0xe14f06f8949c9abe, n = 5184K, CUDAPm1 v0.22 err = 0.05005 (5:50 real, 3.5019 ms/iter, ETA 47:12)
Iteration 200000 M89326001, 0x2270467c553262ac, n = 5184K, CUDAPm1 v0.22 err = 0.04785 (5:52 real, 3.5179 ms/iter, ETA 41:34)
Iteration 300000 M89326001, 0x5a9e1dbc55f055ff, n = 5184K, CUDAPm1 v0.22 err = 0.04785 (5:56 real, 3.5598 ms/iter, ETA 36:07)
Iteration 400000 M89326001, 0x08db3e9c13c343d2, n = 5184K, CUDAPm1 v0.22 err = 0.05078 (5:57 real, 3.5742 ms/iter, ETA 30:19)
Iteration 500000 M89326001, 0x523ce55fab10ec94, n = 5184K, CUDAPm1 v0.22 err = 0.05078 (5:58 real, 3.5762 ms/iter, ETA 24:22)
Iteration 600000 M89326001, 0x54ded79cc40cfee8, n = 5184K, CUDAPm1 v0.22 err = 0.05273 (5:58 real, 3.5774 ms/iter, ETA 18:25)
Iteration 700000 M89326001, 0xc99c3d9fc3a34ec0, n = 5184K, CUDAPm1 v0.22 err = 0.04883 (5:57 real, 3.5727 ms/iter, ETA 12:26)
Iteration 800000 M89326001, 0x9d20b89d1a9a4877, n = 5184K, CUDAPm1 v0.22 err = 0.05273 (5:56 real, 3.5611 ms/iter, ETA 6:28)
Iteration 900000 M89326001, 0xefda9b1094553b12, n = 5184K, CUDAPm1 v0.22 err = 0.04883 (5:56 real, 3.5583 ms/iter, ETA 0:31)
M89326001, 0x05d2c8d87dcf4f23, n = 5184K, CUDAPm1 v0.22
Stage 1 complete, estimated total time = 53:52
Starting stage 1 gcd.
M89326001 Stage 1 found no factor (P-1, B1=630000, B2=10710000, e=0, n=5184K CUDAPm1 v0.22)
Starting stage 2.
Using b1 = 630000, b2 = 10710000, d = 2310, e = 12, nrp = 240
Zeros: 475228, Ones: 552452, Pairs: 105088
Processing 1 - 240 of 480 relative primes.
Initializing pass... done. transforms: 17421, err = 0.04785, (31.27 real, 1.7951 ms/tran, ETA NA)
Transforms: 205710 M89326001, 0x90102bd269087607, n = 5184K, CUDAPm1 v0.22 err = 0.04883 (6:20 real, 1.8476 ms/tran, ETA 31:27)
Transforms: 196446 M89326001, 0x266c3a943dd54799, n = 5184K, CUDAPm1 v0.22 err = 0.05273 (6:08 real, 1.8721 ms/tran, ETA 25:33)
Transforms: 201980 M89326001, 0x621dda916a4e4cbb, n = 5184K, CUDAPm1 v0.22 err = 0.04883 (6:18 real, 1.8750 ms/tran, ETA 19:21)

Processing 241 - 480 of 480 relative primes.
Initializing pass... done. transforms: 20111, err = 0.04785, (37.16 real, 1.8476 ms/tran, ETA 18:45)
Transforms: 205504 M89326001, 0x750bff764daa4a29, n = 5184K, CUDAPm1 v0.22 err = 0.05078 (6:25 real, 1.8733 ms/tran, ETA 12:23)
Transforms: 196422 M89326001, 0x5945c6a5e2e76c0e, n = 5184K, CUDAPm1 v0.22 err = 0.04883 (6:05 real, 1.8588 ms/tran, ETA 6:16)
Transforms: 201562 M89326001, 0x0e9d8ad7c2845c56, n = 5184K, CUDAPm1 v0.22 err = 0.04883 (6:14 real, 1.8586 ms/tran, ETA 0:00)

Stage 2 complete, 1245156 transforms, estimated total time = 38:39
Starting stage 2 gcd.
M89326001 Stage 2 found no factor (P-1, B1=630000, B2=10710000, e=12, n=5184K CUDAPm1 v0.22)

batch wrapper reports exit at Wed 11/21/2018 0:26:48.00
[/CODE]

kriesel 2018-11-21 12:57

V0.22 manual report worked
 
Just like V0.20.

aaronhaviland 2018-11-22 03:30

Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It might be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that.
I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, however, I am not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what's being chosen. I may need to enforce invalidation of previously run timings if we suspect there are issues with them.

kriesel 2018-11-22 07:10

Why don't these match?[CODE]Best time for fft = 23040K, time: 1.6869, t1 = 512, [COLOR=red][B]t2 = 64, t3 = 1024[/B][/COLOR]
Using threads: norm1 512, [COLOR=Red][B]mult 128, norm2 128[/B][/COLOR].[/CODE]fft file excerpt (from a quick v0.22 -cufftbench 1 32768 1)[CODE]16384 294471259 11.1004
18432 330441847 12.4128
18816 337176443 13.8883
20480 366326371 14.2631
20736 370806323 15.4177
21168 378363589 15.5279
23040 411074273 15.8456
23328 416101459 16.4934
23625 421284407 18.0096
24192 431175197 18.3017
25088 446794913 18.3473
32768 580225813 18.7128[/CODE]Given there's no averaging (tries=1), it looks OK to select 23040K for a 400M exponent[CODE]fft size = 21168K, ave time = 15.5279 msec, max-ave = 0.00000
fft size = 21384K, ave time = 19.1372 msec, max-ave = 0.00000
fft size = 21504K, ave time = 16.2552 msec, max-ave = 0.00000
fft size = 21560K, ave time = 20.1739 msec, max-ave = 0.00000
fft size = 21600K, ave time = 16.3524 msec, max-ave = 0.00000
fft size = 21609K, ave time = 16.1015 msec, max-ave = 0.00000
fft size = 21840K, ave time = 21.0564 msec, max-ave = 0.00000
fft size = 21870K, ave time = 17.1037 msec, max-ave = 0.00000
fft size = 21875K, ave time = 17.4419 msec, max-ave = 0.00000
fft size = 21952K, ave time = 16.6437 msec, max-ave = 0.00000
fft size = 22000K, ave time = 20.8197 msec, max-ave = 0.00000
fft size = 22050K, ave time = 18.0051 msec, max-ave = 0.00000
fft size = 22113K, ave time = 21.7934 msec, max-ave = 0.00000
fft size = 22176K, ave time = 19.7003 msec, max-ave = 0.00000
fft size = 22275K, ave time = 22.6468 msec, max-ave = 0.00000
fft size = 22295K, ave time = 23.8952 msec, max-ave = 0.00000
fft size = 22400K, ave time = 17.7254 msec, max-ave = 0.00000
fft size = 22464K, ave time = 19.3179 msec, max-ave = 0.00000
fft size = 22500K, ave time = 18.3411 msec, max-ave = 0.00000
fft size = 22528K, ave time = 19.3020 msec, max-ave = 0.00000
fft size = 22638K, ave time = 23.7112 msec, max-ave = 0.00000
fft size = 22680K, ave time = 17.3189 msec, max-ave = 0.00000
fft size = 22750K, ave time = 22.4622 msec, max-ave = 0.00000
fft size = 22932K, ave time = 20.8813 msec, max-ave = 0.00000
fft size = 23040K, ave time = 15.8456 msec, max-ave = 0.00000[/CODE]

VictordeHolland 2018-11-22 09:29

[QUOTE=aaronhaviland;500632]In prior windows releases, this program would not make use of more than 4GiB video ram. I released that restriction for this build, because I found no issues with it on my 8GiB RTX 2070. The only other cards I had available were 3GiB and 2GiB, so I didn't bother trying them.

Noticing that you have a device with 11GiB, I'm very curious to find out whether there was another reason for this limitation that I hadn't been able to determine. Especially since you mention it "starts filling the GPU memory": that is, it's trying to malloc, and failing.

If you could please do me a favour and fiddle with the UnusedMem value in the .ini file, and see if you can determine a value that doesn't crash. I would start with a value something like 7168, as that would simulate the old 4GiB limitation. (11GiB - 4GiB = 7GiB * 1024 = 7168)[/QUOTE]
I tried M89326001 and I had the same issue as Stef42 with my GTX1080ti (11GiB), stage 2 would quit without error message. I put:
[code]
UnusedMem=7168[/code]in the CUDAPm1.ini file and it seems to run stage 2 now. So you're on to something :).

Stef42 2018-11-23 20:25

[QUOTE=aaronhaviland;500704]Okay, obviously I need to add some more verbosity and safety checks around certain spots to diagnose these silent halts.

I still feel like there's something wrong with the malloc in Stef42's case. It might be that even though X GB of RAM is available, not all of it is available to a single malloc, and we currently don't have any code to deal with that.
I didn't think to check the square() kernel timings for invalid results, so I'll need to add a check for that as well. As a side note, however, I am not sure 23040K is actually the best FFT length for this exponent, so I find it interesting that it's what's being chosen. I may need to enforce invalidation of previously run timings if we suspect there are issues with them.[/QUOTE]

So far I have managed to finish it with:
[QUOTE]UnusedMem=2048
[/QUOTE]

aaronhaviland 2018-11-23 23:52

Okay, thanks to both of you. Obviously there's a limit to cudaMalloc that differs from what is actually "free" RAM, and it varies depending on the card and the system. At this point, I believe the most likely culprit is the difference between free memory and free contiguous memory; the latter is not a number we can query, but rather one we have to determine by trial and error (according to what I've been able to google, at least).

I'm going to try to modify the program so that it keeps trying to malloc in progressively smaller amounts until it finds a value that works. But that might take a couple of days to figure out, since changing the RAM size will force a recalculation of other things that I haven't quite worked out yet.
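That retry idea can be sketched as follows. This is purely illustrative Python, not CUDAPm1 code: `try_alloc` is a stand-in for cudaMalloc, and the contiguous limit is a made-up number for the sake of the example.

```python
# A sketch (Python, not CUDA) of the "keep shrinking until it fits" strategy.
# try_alloc stands in for cudaMalloc, which can fail above some unknown
# largest-contiguous-block size that is only discoverable by trial and error.

CONTIGUOUS_LIMIT_MB = 9314  # hypothetical: largest single allocation the card allows

def try_alloc(size_mb):
    """Stand-in for cudaMalloc: succeeds only up to the contiguous limit."""
    return size_mb <= CONTIGUOUS_LIMIT_MB

def largest_alloc(free_mb, step_mb=64):
    """Walk down from the driver-reported free memory until a malloc works."""
    size = free_mb
    while size > 0 and not try_alloc(size):
        size -= step_mb  # shrink and retry
    return size

print(largest_alloc(10988))  # prints 9260 with the limit assumed above
```

The step size trades startup time against wasted memory; real code would also have to redo the plan-size calculations each time the budget shrinks, which is the complication mentioned above.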

aaronhaviland 2018-11-25 04:15

1 Attachment(s)
Could either one of you please run this app on one of the offending cards and let me know the output? I threw it together really quickly (CUDA 10), but it reports what the driver says is free, and the maximum cudaMalloc size it can claim. This would confirm the suspicions.

It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options.

Stef42 2018-11-25 19:35

[QUOTE=aaronhaviland;500923]Could either one of you please run this app on one of the offending cards, and let me know the output? I threw it together really quick (cuda 10), but it reports what the driver says is free, and what the max cudaMalloc size it can claim. This would confirm the suspicions.

It defaults to device #0. Let me know if you need it to point to a different device. Since it was a quick build, I didn't include code for command-line options[/QUOTE]

Output

[QUOTE]C:\Users\steph\Downloads>cudamalloctest.exe
Cuda reported
Free VRAM: 9314MiB
Total VRAM: 11264MiB
Max cudaMalloc: 9314MiB[/QUOTE]

kriesel 2018-11-30 20:51

gcd impact
 
1 Attachment(s)
On a dual-X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stopping one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle during this time, but its VRAM stays committed.
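For context, the stage 1 gcd in question is gcd(3^E - 1 mod Mp, Mp), a CPU-side arbitrary-precision computation; at p~380M the operands are hundreds of millions of bits, hence the long single-core runtimes. A toy illustration on M11 = 2047 = 23 × 89 (illustrative only, not CUDAPm1's code):

```python
from math import gcd

p = 11
M = 2**p - 1                # M11 = 2047 = 23 * 89
# The factor q = 23 has q - 1 = 2 * 11 = 2 * p, so a stage 1 exponent
# containing the factors 2 and p already exposes it in the gcd.
E = 2 * p
residue = pow(3, E, M)      # stage 1 residue (the part done on the GPU in CUDAPm1)
print(gcd(residue - 1, M))  # prints 23
```

The modular exponentiation parallelizes well on a GPU, but the final gcd does not, which is why the GPU sits idle while one CPU core grinds through it.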

kriesel 2018-12-03 03:40

list updated
 
The CUDAPm1 bug and wish list has been somewhat updated.
Stef42's gpu ram issue has been added.
Various fixes have been verified and indicated.
[URL]https://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]

kriesel 2018-12-04 17:06

gcd time and fail (0.20); excessive roundoff in 0.22
 
[QUOTE=kriesel;501353]On a dual-X5650 Xeon HP 600, with prime95 workers using 2 cores each, when CUDAPm1 (0.20) uses a single core for gcd computations, it idles another core (stopping one of the 6 prime95 workers). Duration for p~380M is about 18 minutes per gcd. The impact will be higher with 3-core or larger workers. The related GPU is also idle during this time, but its VRAM stays committed.[/QUOTE]
On a dual Xeon E5520 Lenovo D20, when CUDAPm1 v0.20 uses a single core for stage 1 gcd computations, it idles the GPU for about 39 minutes with p~414M. A prime95 instance is not much affected in this case, since hyperthreading is enabled on this system; task manager shows prime95's 50% utilization unaffected. The gcd fails.

The same t file used to start the 0.20 run was attempted on v0.22, but it fails the roundoff error check within the next 100 iterations.
The history of this exponent: the run was started on a GTX 1060, which failed with an out-of-memory crash in stage 2 after picking a higher than expected NRP; a retry there also failed, wanting 4GB. A restart from a late stage 1 file on a GTX 1050 Ti failed the stage 1 gcd; a Quadro 5000 run from the late stage 1 file also failed the stage 1 gcd; a GTX 480 try completed stage 1 through "found no factor".

Running through a collection of c and t files (5 total) representing late stage 1 and just after the stage 1 gcd, neither v0.20 nor v0.22 can carry the computation forward to completion on the GTX 1080 Ti.
[CODE]batch wrapper reports (re)launch at Tue 12/04/2018 9:39:35.52 reset count 0 of max 3
CUDAPm1 v0.20
------- DEVICE 0 -------
name GeForce GTX 1080 Ti
Compatibility 6.1
clockRate (MHz) 1620
memClockRate (MHz) 5505
totalGlobalMem zu
totalConstMem zu
l2CacheSize 2883584
sharedMemPerBlock zu
regsPerBlock 65536
warpSize 32
memPitch zu
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 28
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment zu
deviceOverlap 1

CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 32, mult 32, norm2 32.
Using up to 5285M GPU memory.
Selected B1=3250000, B2=77187500, 3.64% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
M414000007, 0x4f7c556075b4f7f3, n = 23328K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 63:37:29batch wrapper reports (re)launch at Tue 12/04/2018 10:26:30.31 reset count 0 of max 3
CUDAPm1 v0.22
------- DEVICE 0 -------
name GeForce GTX 1080 Ti
Compatibility 6.1
clockRate (MHz) 1620
memClockRate (MHz) 5505
totalGlobalMem 11811160064
totalConstMem 65536
l2CacheSize 2883584
sharedMemPerBlock 49152
regsPerBlock 65536
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 2048
multiProcessorCount 28
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 2147483647,65535,65535
textureAlignment 512
deviceOverlap 1

CUDA reports 10988M of 11264M GPU memory free.
Using threads: norm1 512, mult 64, norm2 32.
Using up to 10752M GPU memory.
Selected B1=3990000, B2=93765000, 3.86% chance of finding a factor
Using B1 = 3215000 from savefile.
Continuing stage 1 from a partial result of M414000007 fft length = 23328K, iteration = 4625001
Iteration = 4625100, err = 0.5 >= 0.40, quitting.
Estimated time spent so far: 63:33:25

batch wrapper reports exit at Tue 12/04/2018 10:28:09.65 [/CODE]

storm5510 2018-12-13 14:23

[QUOTE=Stef42;500595]I have reserved the exponent through GPU72.com.
Worktodo does indeed look like this:

[CODE]Pfactor=N/A,1,2,89326001,-1,76,2 [/CODE]A few assignments were completed from GPU72.com before this one. Funny thing was that similar exponents in the 89M range only used roughly 4300MB memory.[/QUOTE]

My only "beef" with it is that it will not accept the long form where one can specify the bounds:

[CODE]Pminus1=1,2,<exponent>,-1,100000000,1000000000,65[/CODE]I never had any luck in trying to run it this way.

kriesel 2018-12-13 16:20

[QUOTE=storm5510;502604]My only "beef" with it is that it will not accept the long form where one can specify the bounds:

[CODE]Pminus1=1,2,<exponent>,-1,100000000,1000000000,65[/CODE]I never had any luck in trying to run it this way.[/QUOTE]
Yes, it would be nice if the alternate form was supported for worktodo entries, at least for k=1, b=2, c=-1 of
N=k b[SUP]p[/SUP]+c, for bounds B1 and B2, and prior trial factoring to F bits, and optional AID:

Pminus1=[AID,]k,b,p,c,B1,B2,F
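Parsing such an entry would be straightforward; a sketch (hypothetical code, not from CUDAPm1, following the field layout above):

```python
def parse_pminus1(line):
    """Parse 'Pminus1=[AID,]k,b,p,c,B1,B2,F' into a dict (sketch only)."""
    assert line.startswith("Pminus1=")
    fields = line[len("Pminus1="):].split(",")
    aid = None
    if len(fields) == 8:              # optional assignment ID present
        aid, fields = fields[0], fields[1:]
    k, b, p, c, b1, b2, f = (int(x) for x in fields)
    return {"aid": aid, "k": k, "b": b, "p": p, "c": c,
            "B1": b1, "B2": b2, "tf_bits": f}

entry = parse_pminus1("Pminus1=1,2,89326001,-1,630000,10710000,76")
print(entry["p"], entry["B1"])       # prints 89326001 630000
```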

Meanwhile, I think you can accomplish the rough equivalent from the command line, and therefore from a Windows batch file or linux shell script for a succession of assignments. From the CUDAPm1 readme:[CODE]Alternately, you can just pass in a single exponent as a command line
argument, and CUDAPm1 will then test 2^arg-1 and exit. More parameters can
be specified, such as bounds and fft length. For example (linux syntax):

./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k[/CODE]Thanks for the suggestion.
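That batch/script approach might look something like this (a dry-run sketch in linux syntax; the exponents and bounds are made-up examples, and removing the echo would actually launch CUDAPm1):

```shell
#!/bin/sh
# Queue a succession of P-1 runs from a script.
# Dry run: the echo only prints each command; remove it to really run.
run_pm1() {
    for exp in "$@"; do
        echo ./CUDAPm1 "$exp" -b1 600000 -b2 12000000
    done
}
run_pm1 61408363 61408459
```

A Windows batch file with a `for %%e in (...) do` loop would accomplish the same thing.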

storm5510 2018-12-13 23:22

1 Attachment(s)
[QUOTE=kriesel;502626]Yes, it would be nice if the alternate form was supported for worktodo entries, at least for k=1, b=2, c=-1 of
N=k b[SUP]p[/SUP]+c, for bounds B1 and B2, and prior trial factoring to F bits, and optional AID:

Pminus1=[AID,]k,b,p,c,B1,B2,F

Meanwhile, I think you can accomplish the rough equivalent from the command line, and therefore from a Windows batch file or linux shell script for a succession of assignments. [B]From the CUDAPm1 readme:[/B][CODE]Alternately, you can just pass in a single exponent as a command line
argument, and CUDAPm1 will then test 2^arg-1 and exit. More parameters can
be specified, such as bounds and fft length. For example (linux syntax):

./CUDAPm1 61408363 -b1 600000 -b2 12000000 -f 3360k[/CODE]Thanks for the suggestion.[/QUOTE]

Up until today, I had been running 0.21. I did not know 0.22 was available, and it took me a while to track down all its required pieces (DLLs). It does not flat-out reject the longer form, but simply stops and does not proceed, as illustrated in the attached image.

The readme that comes with 0.21, the one I have, is not for [I]CUDAPm1[/I]; it is for [I]CUDALucas[/I]. I need to find the correct one.

kriesel 2018-12-13 23:47

[QUOTE=storm5510;502687]
The readme that is with 0.21, the one I have, is not for [I]CUDAPm1[/I], it is for [I]CUDALucas[/I]. I need to find the correct one..[/QUOTE]
There's no readme for CUDAPm1 as complete as the one for CUDALucas. You may have a draft in progress that is intended for CUDAPm1 but still has some CUDALucas-oriented content and the text string "CUDALucas" in some locations (as do some of the CUDAPm1 error messages in the code).

storm5510 2018-12-15 02:26

[QUOTE=kriesel;502691]There's no readme for CUDAPm1 as complete as the one for CUDALucas. You may have a draft in progress that is intended for CUDAPm1 but still has some CUDALucas-oriented content and the text string "CUDALucas" in some locations (as do some of the CUDAPm1 error messages in the code).[/QUOTE]

I found a very abbreviated one on [I]GitHub[/I]. It's not much.

I need to correct myself on one item. I had been running v0.20.

[CODE]CudaPm1 [exponent] [-b1 x] [-b2 x] [-f xK][/CODE]I tried this command-line form and it works fine, as long as the parameters are acceptable to the program. There were instances where I only specified the B1 value; the program filled in the rest.

Regarding the image I posted: I did not wait long enough. I was not used to a long delay and stopped it. Smaller values did not take nearly as long.

kriesel 2018-12-30 18:51

CUDAPm1 v0.22 threadbench issues and fftbench behavior
 
2 Attachment(s)
Found some interesting behavior in the V0.22 thread and fft benchmarking compared to v0.20.

1) Some fft lengths produce, for some lower thread counts, a squaring time of zero. Since there's no guard against it, the first such value is chosen for testing norm1 and norm2 thread counts. This effect spreads to higher thread counts at larger fft lengths on the GTX 1050 Ti. It does not spread to higher thread counts on the Quadro 2000, where it only occurs at 1024 threads at low fft lengths.

2) CUDALucas has a mask field which can be used to exclude troublesome thread counts from benchmarking. CUDAPm1 does not.
CUDALucas format: -threadbench s e i m
CUDAPm1 format: -cufftbench s e i
With increasing fft length, beginning around 4096K, some fft lengths and norm1 thread counts produce much-too-short benchmark times compared to other thread counts. With increasing fft length, the issue spreads to larger thread counts, and is observed to spread to nearly all thread counts above 32768K.
The check added against a threshold of 75% of the average protects somewhat, until the issue spreads to all thread counts around 65536K.

3) The threadbench times are much shorter than the fftbench times for the same fft length. This is unlike V0.20 behavior where they are very close.

4) Testing on the Quadro 2000 indicates v0.22 cuts off (fails to complete benchmarking, crashing the application) at a lower fft length than v0.20 did (35000 max vs. 36864 max for v0.20 CUDA 5.5).

5) There are steps (discontinuities) in the v0.22 GTX 1050 Ti and GTX 1070 threadbench times. These appear to indicate that certain calls are failing somehow. These steps do not appear in plots of v0.20 fft or thread benchmarking results.

Benchmarking is done only occasionally, so checking for success on many or all CUDA calls during benchmarking would not reduce performance of actual P-1 factoring. It may help localize where in the code the benchmarking is having issues.

A single test did not indicate an issue with finding factors on the GTX 1050 Ti, but that was at 2688k fft length. A 4608k test is under way.

(unable to upload attachments at the moment)
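The 75%-of-average screen mentioned in item 2 amounts to logic like the following (a Python sketch of the idea, not CUDAPm1's actual code; the sample numbers are shaped like the 23040K log earlier in the thread):

```python
def best_valid_timing(timings, frac=0.75):
    """Pick the fastest timing, discarding entries below frac * average.

    timings maps (norm1, mult, norm2) -> msec. Note the limitation seen
    at the largest fft lengths: if every timing is bogus-low, the average
    drops with them, and a relative screen like this cannot catch it.
    """
    avg = sum(timings.values()) / len(timings)
    valid = {k: v for k, v in timings.items() if v >= frac * avg}
    return min(valid, key=valid.get)

# Near-zero entries (failed launches) are screened out; the slower but
# real timings compete for "best".
t = {(128, 64, 32): 0.0589, (256, 64, 32): 1.7076, (512, 64, 1024): 1.6869}
print(best_valid_timing(t))  # prints (512, 64, 1024)
```

An absolute plausibility floor (or checking the CUDA calls for errors, as suggested above) would be needed to catch the all-timings-invalid case.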

kriesel 2019-01-11 15:05

Caution in CUDAPm1 v0.22 at 8192K and above (bad threads file entries)
 
The fft and threads files (for the condor Quadro 2000) were generated by the CUDAPm1 v0.22 program through benchmarking.

fft file excerpt:
8192 149447533 74.5062
8400 153159473 75.8721
8640 157439981 76.5140
8820 160648739 76.8702


thread file excerpt:
8064 256 256 32 14.3103
8192 256 32 32 14.5360
8400 256 32 32 14.9088
8640 256 32 32 15.3292
8820 256 32 32 15.6550

equivalent threads file from cudapm1 v0.20 on same gpu and system:
8192 256 256 32 67.8178
8640 256 256 32 74.0336
8820 256 256 32 75.8909

current worktodo assignment:
PFactor=(aid redacted),1,2,157000033,-1,78,2


cmd console output stream excerpt:
C:\Users\Ken\My Documents\pm1-q2000>CUDAPm1-0.22-cuda8.exe -d 1 1>>cudapm1.txt
over specifications Grid = 69120
try increasing mult threads (32) or decreasing FFT length (8640K)
(program terminated)



Checking the log of what timings were run for the 8640k fft length threadbench:
it ran norm1 128, mult 64, norm2 32 and up (no mult 32 cases);
timings are ~75-100 msec (128, 64, 32 is fastest).
So where did the timing in the threads file come from?

No run was made for 8400k. Where did that timing and selection come from?

threadbench run log excerpt:
Best time for fft = 8192K, time: 79.1902, t1 = 128, t2 = 64, t3 = 32

Compare the above to the threads file contents (256, 32, 32), with an anomalously fast timing recorded.
256, 32, 32 is not among the cases that were benchmarked, yet it is recorded as fastest in the threads file.


It looks like, at 8192K and above, something goes wrong in the thread benchmarking; the resulting threads file entries are not to be trusted and may crash the program.

kriesel 2019-01-11 17:09

Caution in CUDAPm1 V0.22 at high fft lengths threadbench
 
Threadbench appears to fail at 65536K and above.

excerpt of v0.22 CUDAPm1 fft file on GTX1050Ti:
[CODE]57600 1007626787 160.8784
65536 1143276383 178.7965
69120 1204418959 195.1879
73728 1282931137 201.3655
75264 1309078039 224.0846
81920 1422251777 230.3756
82944 1439645131 239.3277
84672 1468986017 258.4334
86016 1491797777 262.2423
93312 1615502269 267.5937
96768 1674025489 276.5184
98304 1700021251 281.5833
100352 1734668777 297.2951
102400 1769301077 313.3768
104976 1812840839 318.9635
110592 1907684153 320.0148
114688 1976791967 325.5219
115200 1985426669 345.7786
116640 2009707367 369.2419
131072 2147483647 370.3066[/CODE]excerpt of v0.22 CUDAPm1 thread file on GTX1050Ti:
[CODE]57600 512 64 1024 21.4178
65536 64 64 1024 0.9758
69120 64 32 1024 1.0351
73728 64 32 1024 1.1016
75264 64 128 1024 1.1244
81920 64 32 1024 1.2204
82944 64 32 1024 1.2359
84672 64 32 1024 1.2639
86016 64 256 1024 1.2806
93312 64 32 1024 1.3890
96768 64 64 1024 1.4485
98304 64 32 1024 1.4681
100352 64 32 1024 1.4976
102400 64 128 1024 1.5272
104976 64 32 1024 1.5658
110592 64 128 128 1.6593
114688 64 64 1024 1.7143
115200 64 256 1024 1.7302
116640 64 32 1024 1.7510
131072 128 128 1024 0.9830[/CODE]The normal pattern would be for the thread timings to increase with fft length.
At 57600k, only norm1 512 appears to run correctly, and those timings pass the comparison-to-average threshold newly added in v0.22:
[CODE]fft size = 57600K, ave time = 1.2919 msec, Norm1 threads 32, Norm2 threads 32
fft size = 57600K, ave time = 1.3330 msec, Norm1 threads 32, Norm2 threads 64
fft size = 57600K, ave time = 1.3327 msec, Norm1 threads 32, Norm2 threads 128
fft size = 57600K, ave time = 1.3389 msec, Norm1 threads 32, Norm2 threads 256
fft size = 57600K, ave time = 1.3369 msec, Norm1 threads 32, Norm2 threads 512
fft size = 57600K, ave time = 1.3217 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 57600K, ave time = 0.8629 msec, Norm1 threads 64, Norm2 threads 32
fft size = 57600K, ave time = 0.8617 msec, Norm1 threads 64, Norm2 threads 64
fft size = 57600K, ave time = 0.8601 msec, Norm1 threads 64, Norm2 threads 128
fft size = 57600K, ave time = 0.8758 msec, Norm1 threads 64, Norm2 threads 256
fft size = 57600K, ave time = 0.8640 msec, Norm1 threads 64, Norm2 threads 512
fft size = 57600K, ave time = 0.8529 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 57600K, ave time = 0.4292 msec, Norm1 threads 128, Norm2 threads 32
fft size = 57600K, ave time = 0.4297 msec, Norm1 threads 128, Norm2 threads 64
fft size = 57600K, ave time = 0.4284 msec, Norm1 threads 128, Norm2 threads 128
fft size = 57600K, ave time = 0.4313 msec, Norm1 threads 128, Norm2 threads 256
fft size = 57600K, ave time = 0.4308 msec, Norm1 threads 128, Norm2 threads 512
fft size = 57600K, ave time = 0.4257 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 57600K, ave time = 0.2146 msec, Norm1 threads 256, Norm2 threads 32
fft size = 57600K, ave time = 0.2156 msec, Norm1 threads 256, Norm2 threads 64
fft size = 57600K, ave time = 0.2134 msec, Norm1 threads 256, Norm2 threads 128
fft size = 57600K, ave time = 0.2153 msec, Norm1 threads 256, Norm2 threads 256
fft size = 57600K, ave time = 0.2140 msec, Norm1 threads 256, Norm2 threads 512
fft size = 57600K, ave time = 0.2117 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 57600K, ave time = 21.4209 msec, Norm1 threads 512, Norm2 threads 32
fft size = 57600K, ave time = 21.4196 msec, Norm1 threads 512, Norm2 threads 64
fft size = 57600K, ave time = 21.4198 msec, Norm1 threads 512, Norm2 threads 128
fft size = 57600K, ave time = 21.4198 msec, Norm1 threads 512, Norm2 threads 256
fft size = 57600K, ave time = 21.4197 msec, Norm1 threads 512, Norm2 threads 512
fft size = 57600K, ave time = 21.4178 msec, Norm1 threads 512, Norm2 threads 1024

Average time for fft= 57600K, all threads variations 4.8503 msec, threshold value for valid timings set to 0.7500 of this, 3.6378 msec
...
Timings below threshold were detected for 24 norm1 / mult / norm2 combinations for fft length 57600K and omitted from consideration for best.

Best time for fft = 57600K, time: 21.4178, t1 = 512, t2 = 64, t3 = 1024
[/CODE]At 65536K, every thread combination tried produces an implausibly low timing, which defeats the screening by a threshold relative to the average timing:
[CODE]fft size = 65536K, ave time = 1.4678 msec, Norm1 threads 32, Norm2 threads 32
fft size = 65536K, ave time = 1.5144 msec, Norm1 threads 32, Norm2 threads 64
fft size = 65536K, ave time = 1.5140 msec, Norm1 threads 32, Norm2 threads 128
fft size = 65536K, ave time = 1.5219 msec, Norm1 threads 32, Norm2 threads 256
fft size = 65536K, ave time = 1.5192 msec, Norm1 threads 32, Norm2 threads 512
fft size = 65536K, ave time = 1.5035 msec, Norm1 threads 32, Norm2 threads 1024
fft size = 65536K, ave time = 0.9806 msec, Norm1 threads 64, Norm2 threads 32
fft size = 65536K, ave time = 0.9788 msec, Norm1 threads 64, Norm2 threads 64
fft size = 65536K, ave time = 0.9789 msec, Norm1 threads 64, Norm2 threads 128
fft size = 65536K, ave time = 0.9931 msec, Norm1 threads 64, Norm2 threads 256
fft size = 65536K, ave time = 0.9815 msec, Norm1 threads 64, Norm2 threads 512
fft size = 65536K, ave time = 0.9758 msec, Norm1 threads 64, Norm2 threads 1024
fft size = 65536K, ave time = 0.4885 msec, Norm1 threads 128, Norm2 threads 32
fft size = 65536K, ave time = 0.4872 msec, Norm1 threads 128, Norm2 threads 64
fft size = 65536K, ave time = 0.4867 msec, Norm1 threads 128, Norm2 threads 128
fft size = 65536K, ave time = 0.4913 msec, Norm1 threads 128, Norm2 threads 256
fft size = 65536K, ave time = 0.4916 msec, Norm1 threads 128, Norm2 threads 512
fft size = 65536K, ave time = 0.4892 msec, Norm1 threads 128, Norm2 threads 1024
fft size = 65536K, ave time = 0.2432 msec, Norm1 threads 256, Norm2 threads 32
fft size = 65536K, ave time = 0.2441 msec, Norm1 threads 256, Norm2 threads 64
fft size = 65536K, ave time = 0.2446 msec, Norm1 threads 256, Norm2 threads 128
fft size = 65536K, ave time = 0.2437 msec, Norm1 threads 256, Norm2 threads 256
fft size = 65536K, ave time = 0.2446 msec, Norm1 threads 256, Norm2 threads 512
fft size = 65536K, ave time = 0.2428 msec, Norm1 threads 256, Norm2 threads 1024
fft size = 65536K, ave time = 0.1202 msec, Norm1 threads 512, Norm2 threads 32
fft size = 65536K, ave time = 0.1200 msec, Norm1 threads 512, Norm2 threads 64
fft size = 65536K, ave time = 0.1205 msec, Norm1 threads 512, Norm2 threads 128
fft size = 65536K, ave time = 0.1206 msec, Norm1 threads 512, Norm2 threads 256
fft size = 65536K, ave time = 0.1208 msec, Norm1 threads 512, Norm2 threads 512
fft size = 65536K, ave time = 0.1193 msec, Norm1 threads 512, Norm2 threads 1024
[/CODE]Similar effects are seen on other GPU models with enough VRAM to attempt such large fft lengths: GTX1060, GTX1070, GTX1080, GTX1080Ti.
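The threshold screening quoted in the 57600K log above (average over all thread combinations, anything below 0.7500 of that average discarded as implausible, best picked from the survivors) can be sketched as follows. This is a minimal reconstruction from the log output, not CUDAPm1's actual code; the function name and toy data are illustrative:

```python
# Sketch of the v0.22-style threadbench screening: timings far below the
# average are assumed to come from kernels that did not really run, and
# are excluded before the fastest remaining combination is picked.

def pick_best(timings, threshold_factor=0.75):
    """timings: list of (norm1, norm2, msec) tuples."""
    ave = sum(t for _, _, t in timings) / len(timings)
    threshold = threshold_factor * ave  # "0.7500 of this" in the log
    valid = [entry for entry in timings if entry[2] >= threshold]
    return min(valid, key=lambda entry: entry[2])

# Toy reproduction of the 57600K case: 24 implausibly fast bogus timings
# plus 6 real (slow) norm1=512 timings around 21.42 msec.
combos = [(n1, n2) for n1 in (32, 64, 128, 256)
          for n2 in (32, 64, 128, 256, 512, 1024)]
bogus = [(n1, n2, 0.2 + 0.01 * i) for i, (n1, n2) in enumerate(combos)]
real = [(512, n2, 21.42) for n2 in (32, 64, 128, 256, 512, 1024)]
best = pick_best(bogus + real)  # the slow-but-real 512-thread timing wins
```

This also shows why the 65536K case defeats the check: when every timing is bogus, the average is computed from the bogus values themselves, so a threshold relative to that average cannot reject them.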

The effect does not occur in CUDAPm1 v0.20 threadbench on the same GPUs that show the issue in v0.22. (GTX1060 untested in v0.20; the 1050Ti, 1070, 1080, and 1080Ti are OK.)
Excerpt of the CUDAPm1 v0.20 GTX1080 threads file:[CODE]57600 1024 1024 256 61.8356
65536 1024 1024 1024 67.1432
73728 32 32 32 75.6006
75264 1024 512 512 86.2849
77760 1024 32 32 86.4164
81920 1024 32 32 86.4885
82944 1024 1024 512 87.8014
84672 1024 32 32 94.2258
86400 1024 32 32 97.4938
93312 1024 512 128 99.7058
98304 1024 32 32 106.1017
100352 1024 1024 32 109.3616
102400 256 32 32 114.7461
104976 1024 1024 256 116.6407
110592 512 128 64 119.7921
114688 1024 512 32 121.0991
115200 1024 1024 128 131.2371
124416 1024 32 64 135.2243
131072 1024 1024 64 136.6869[/CODE]Excerpt of the CUDAPm1 v0.22 GTX1080 threads file:[CODE]57600 512 32 32 8.8008
65536 128 128 1024 0.4071
69120 128 256 512 0.4197
73728 128 32 512 0.4434
75264 128 64 512 0.4547
81920 128 32 512 0.4966
82944 128 128 512 0.5034
84672 128 64 512 0.5126
86016 128 256 1024 0.5356
86400 128 64 512 0.5234
93312 128 64 512 0.5630
98304 128 128 512 0.5922
100352 128 32 512 0.6011
102400 128 128 512 0.6168
104976 128 64 512 0.6262
110592 128 32 512 0.6623
114688 128 64 512 0.6866
115200 128 32 512 0.6882
131072 128 128 128 0.7455[/CODE]
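One quick way to spot this kind of corruption is that per-iteration timings in a threads file should grow roughly monotonically with fft length, and the v0.22 files above fail that badly. A sketch of such a check, with the file format assumed from the excerpts above (fft length, three thread counts, msec per iteration):

```python
# Sanity check for a CUDAPm1 "threads" file: timings should grow with
# fft length, so a large drop flags suspect benchmark entries.
# Assumed line format: fft_length  norm1  mult  norm2  msec

def find_timing_anomalies(lines, drop_factor=0.5):
    """Return fft lengths whose timing falls below drop_factor times the
    timing recorded for the previous (smaller) fft length."""
    anomalies = []
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        fft, msec = int(fields[0]), float(fields[4])
        if prev is not None and msec < drop_factor * prev:
            anomalies.append(fft)
        prev = msec
    return anomalies

# First three lines of the v0.22 GTX1050Ti excerpt above:
sample = """\
57600 512 64 1024 21.4178
65536 64 64 1024 0.9758
69120 64 32 1024 1.0351"""
flagged = find_timing_anomalies(sample.splitlines())  # flags 65536
```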

GdS 2019-01-20 23:14

Explanation of the Brent-Suyama coefficient (-d2) needed
 
Hi :hello:

Could someone please explain the use of the Brent-Suyama coefficient, set using -d2 (a multiple of 30, 210, or 2310)?

I was experimenting with CUDAPm1 v0.22, testing the exponent M22155943, which has an already-known factor:

Command used to run it (the fft length and -d2 were set automatically):
[B]cdPm1.exe 22155943 -b1 75000 -b2 350100 -e2 12[/B]

complete output in the results file:
[B]M22155943 has a factor: 149927423231592284064887 (P-1, B1=75000, B2=350100, e=12, n=1296K CUDAPm1 v0.22)[/B]

If I set -d2 to some large value, e.g. 21000, which is valid, the program fails to find the factor. :ermm::surprised:

You might wonder why I am messing with -d2 at all.
I ended up experimenting with -d2 because stage 2 of some small exponents (in the 6M range) that I was testing further failed to even start, and no warning was issued. If I set -d2 to some large value, the program runs, but how can I tell whether a factor was skipped?
Has anyone encountered the same problem?
Lots of questions asked ... I appreciate any comments. :smile:
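A couple of the properties in play here can be checked directly. (Note that in the command line above the Brent-Suyama exponent is actually set with -e2, matching the e=12 in the results line; the multiples of 30, 210, or 2310 suggest -d2 is instead the stage-2 D value.) Any factor f of M_p satisfies 2^p ≡ 1 (mod f) and f = 2kp + 1, and P-1 finds f when k is smooth enough relative to B1/B2; the Brent-Suyama extension with exponent E helps because, roughly speaking, stage 2 then tests divisibility by (mD)^E - i^E rather than mD - i, which has many more algebraic factors. A sketch using the reported factor (the m·D and i values are hypothetical, not from a real run):

```python
# Check the factor reported above and illustrate why Brent-Suyama helps.
p = 22155943
f = 149927423231592284064887  # factor from the results line above

# Every factor of M_p = 2^p - 1 satisfies 2^p == 1 (mod f) and has the
# form 2*k*p + 1; P-1 finds f when k is smooth enough for the bounds.
assert pow(2, p, f) == 1
assert (f - 1) % (2 * p) == 0
k = (f - 1) // (2 * p)

# Brent-Suyama with exponent E: a stage-2 prime q > B2 can still be
# caught whenever q divides (m*D)**E - i**E.  Since a**e - b**e divides
# a**E - b**E whenever e | E, a highly composite E such as 12 subsumes
# the smaller exponents:
m_d, i = 2310, 7  # hypothetical values, for illustration only
assert ((m_d ** 12) - (i ** 12)) % ((m_d ** 2) - (i ** 2)) == 0
assert ((m_d ** 12) - (i ** 12)) % ((m_d ** 6) - (i ** 6)) == 0
```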



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.