mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

preda 2018-10-25 14:18

[QUOTE=kriesel;498700]
Re gpuOwL B2, do I understand you correctly that its B2 is limited to no more than the exponent? (Seems reasonable.) If so I'll add that to the available software summary I maintain.[/QUOTE]

Yes, B2 can be anything up to exponent.

The user may also enter a B2 value larger than the exponent (thereby asking for more primes to be tested), but the effective B2 in that case will equal the exponent.

Let's consider an example:
Exponent = 80'000'001
Testing with:
B1=1000000,B2=80000001;80000001
or, equivalent:
B1=1000000;80000001
(because by default, if not specified, B2==Exponent),
will test in second stage all the primes from 1M to 80000001, and report that B2 in the result.

Now the tricky case, where the entered B2 is larger than exponent:
B1=1000000,B2=160000000;80000001
In this situation, all the primes from 1M to Exponent (80000001) are tested in second stage,
and in addition to that, about 62% of the primes from Exponent to 160M are tested too.
But because some prime just above the exponent (i.e. within the first few primes after it) will not be tested, the reported B2 will still be == Exponent.

Covering 62% of the primes in the range [Exp, 2*Exp] is still a good thing, but unfortunately it can't be reported within the "B2" framework (which requires that absolutely all primes <= B2 be tested).

PS: not to mention that, in addition to this set of "explicit" primes to be tested, a large number of additional primes (let's say about 3 times as many) are tested too. But these can be very large primes, thus with reduced benefit compared to the "small" primes that are under B2.
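
The reporting rule described in this post can be sketched as follows. This is only an illustration of the stated rule (default B2 equals the exponent, and the reported B2 never exceeds the exponent); the function name `effective_b2` is hypothetical and is not taken from gpuOwL.

```python
def effective_b2(exponent, entered_b2=None):
    """Return the B2 that would be reported under the rule described above.

    Hypothetical helper for illustration only, not gpuOwL code.
    """
    # If B2 is not specified, it defaults to the exponent.
    if entered_b2 is None:
        return exponent
    # A larger entered B2 tests extra primes, but the reported bound
    # is still capped at the exponent.
    return min(entered_b2, exponent)


print(effective_b2(80_000_001))               # B2 omitted
print(effective_b2(80_000_001, 160_000_000))  # B2 larger than the exponent
```

Both calls give 80000001, matching the two examples in the post.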

aaronhaviland 2018-10-27 04:10

[QUOTE=kriesel;498677]I found reading the commit notes in your fork interesting. "fencepost error" may account for some of the anomalies I've seen in the Sourceforge-version-derived Windows executables.
Please review those notes and carry it forward![/QUOTE]

This was one of the reasons why I implemented a self-test as part of the build process on linux.

[QUOTE]I have a collection of reference material, specific to CUDAPm1 (Sourceforge versions), at [URL]https://www.mersenneforum.org/showthread.php?p=498673#post498673[/URL]
Post 7 is a summary/overview of testing I've done on CUDAPm1 v0.20, mostly September 2013 cuda 5.5 version, some November 2013 cuda 5.0, on Windows. Posts 8 and 9 are new and contain attachments showing detail, separately per gpu model, 8 models, ranging from 1 to 8 gb gpu ram. Total test effort was I think >1 gpu-year to date. [/QUOTE]That's a lot of material to go through, and I apologise, but I'm glossing over it for now. I have revived my github code, and have updated the linux build to CUDA 9.1. Some older hardware (below compute capability 3.0) is no longer supported with the newer versions of CUDA, but it's unlikely there's many of them around these days.

[QUOTE]For most of that I have been able to submit at least stage 1 results to primenet, and for many stage 2, although some runs failed before printing a stage 1 gcd result, factor or no factor found, and some runs failed at other points. Some were completed by moving to a different gpu. Others can't be completed that way either.

If anyone has a way of converting or moving a pre-gcd stage 1 run from CUDAPm1 to some other software that can perform the gcd check for a factor, please share, either here or by PM. (Or a CUDAPm1 Windows executable or source code that doesn't have that issue...)
[/QUOTE]IIRC, there were bugs in the handoff from stage1 to stage2 that I resolved (or at least bludgeoned with a hammer)

[QUOTE]These tests indicate that currently, 0 of 8 gpu models evaluated can complete stage 1 and 2 above exponent value ~433,000,000 (maybe as low as ~431M max for the GTX1060 3gb). Prime95 can go higher, but is also capped well below the mersenne.org limit of 10[SUP]9[/SUP] (~595M, except FMA3-capable hardware ~920M).[/QUOTE]I'm launching a ~511M run tonight with a known factor with my 780. I'll be curious to see how long it runs before dying, or if it finds the factor. I've never tested anything over 70M before

kriesel 2018-10-27 10:56

[QUOTE=aaronhaviland;498869]This was one of the reasons why I implemented a self-test as part of the build process on linux.

That's a lot of material to go through[/QUOTE]
Definitely. See also the bug and wish list at [URL]http://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]

[QUOTE]I have revived my github code, and have updated the linux build to CUDA 9.1. Some older hardware (below compute capability 3.0) is no longer supported with the newer versions of CUDA, but it's unlikely there's many of them around these days.
[/QUOTE]Understood. Gpus with 2.x or lower are probably a small fraction of the total active hardware population globally, but I have several Quadro 2000s (2.1), 2 Quadro 4000 (2.0), a Quadro 5000 (2.0), and a GTX480 (2.0) running, constituting the majority of my fleet.
[QUOTE]

IIRC, there were bugs in the handoff from stage1 to stage2 that I resolved (or at least bludgeoned with a hammer)[/QUOTE]I've been meaning to attempt a Windows build of your version for a while now.
[QUOTE]I'm launching a ~511M run tonight with a known factor with my 780. I'll be curious to see how long it runs before dying, or if it finds the factor. I've never tested anything over 70M before[/QUOTE]That's likely to take some days to get through stage 1. Please post how it turns out. If it requires some new fixes to complete, or you make any additional fixes or improvements, please refresh your github repository.

aaronhaviland 2018-10-28 00:40

[QUOTE=kriesel;498880]That's likely to take some days to get through stage 1. Please post how it turns out. If it requires some new fixes to complete, or you make any additional fixes or improvements, please refresh your github repository.[/QUOTE]Since I knew the minimum B1/B2 values needed to find the known factor, stage 1 didn't take long at all; stage 2, however, was completely borked. (The whole run took about 9 hours.)
I'm going to run on a "normal" size exponent just to validate it still works at that level properly before I go any further.

I did push a few minor commits last night, but the only code change so far was a fix for an issue writing/saving the cufft fft/threads benchmark files. I plan to make those benchmarks run automatically if the files don't already exist. (Using saved values for an extinct-ish card is stupid, and the time savings of running an optimal fft size are worth the cost of running the benchmark.)

Honestly, the first thing I really want to do, once I validate it works again, is a major refactor of the code, just because I feel like it's a huge jumble of blah every time I look through it. (No offense meant to the original authors. Great work getting it that far.)

aaronhaviland 2018-10-28 16:44

[QUOTE=kriesel;498880]Definitely. See also the bug and wish list at [URL]http://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL][/QUOTE]

I'm curious what code changes you've made, vs what code changes I've made. I know I made some prior to my first github import, and I wasn't tracking them at that time.

kriesel 2018-11-05 20:15

CUDAPm1 v0.20 bug and wish list updated
 
See the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=488534&postcount=3[/URL]

kriesel 2018-11-06 01:19

4 Attachment(s)
[QUOTE=aaronhaviland;498973]I'm curious what code changes you've made, vs what code changes I've made. I know I made some prior to my first github import, and I wasn't tracking them at that time.[/QUOTE]
Hi, sorry for the delay responding.
What's your build OS, still Ubuntu; version #? I'm aiming for Win7 x64.

My cudapm1 draft changes have not made it into executable form, or onto sourceforge or github yet. I invite you to fold them into your current efforts. I've been delayed in working on getting a proper build environment for CUDAPm1 on Windows. At this point I would begin by trying to compile simpler cuda code first, then an existing set of cudapm1 code, without my changes, to prove out a build environment, before merging my changes. That cudapm1 code could just as well be your latest version at that point.

My draft changes are of multiple types, none of which provide speed improvements or other core algorithm changes.

1) Misc minor edits for housekeeping (see the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=462600&postcount=503[/URL] and see change note #8 in an attachment at [URL]https://www.mersenneforum.org/showpost.php?p=463662&postcount=511[/URL]).

2) Addition of output options and date/time stamps (see the test/demo program attached to this post)
See attached additions.7z
Change existing printf and fprintf calls to dprintf and dfprintf respectively, to incorporate logging control throughout the program. The extension of ini file reading and command line parsing to accommodate it has not been written yet. Adding date/time stamps to iteration or transform output lines, and at transitions such as the start and end of gcd computations, has not been written either. Output=4 would be useful for benchmarking or testing.

3) Sanity checking of fft and threads benchmarking (modified but untested; in fact the draft code fragment is still an inline comment in the old code). See attached modified cudapm1.cu (derived from the sourceforge v0.20 version).

4) Incomplete rewrite of readme.txt (copy in current state attached. End users, use with extreme caution or not at all.) See attached readme-cudapm1-rewrite.txt

5) Editing of cudapm1.ini (other than the fragment re logging below)
see attached cudapm1.ini

readme.txt fragment re logging via dprintf etc of additions.7z
[CODE]
Output control is available from a command line option -o, or ini file directive output
-o 0 prints stdout content to both console and log file. (dual)
-o 1 suppresses logging screen output to file, does output to screen (default; traditional)
-o 2 suppresses screen output, logs stdout to log file (log only)
-o 3 suppresses both logging to file and screen output to stdout (silent mode)
-o 4 prints stdout and stderr content to both console and log file. (dual stdout and stderr)
-o 5 stdout to console, stderr to console and log
-o 6 stdout to log file, stderr to console and log
-o 7 stdout suppressed, stderr to console and log
Output to stderr, addition of results to results file, consuming of worktodo file, and save to save files, thread files, or fft files occur regardless of this output flag.

flag  stdout  stdout2log  stderr  stderr2log
0     y       y           y       n
1     y       n           y       n
2     n       y           y       n
3     n       n           y       n
4     y       y           y       y
5     y       n           y       y
6     n       y           y       y
7     n       n           y       y
[/CODE]ini file fragment re logging via dprintf etc of additions.7z
[CODE]
# Output control is available from a command line option -o, or ini file directive output
# output=0 prints stdout content to both console and log file. (dual)
# output=1 suppresses logging screen output to file, does output to screen (default; traditional)
# output=2 suppresses screen output, logs stdout to log file (log only)
# output=3 suppresses both logging to file and screen output to stdout (silent mode)
# output=4 prints stdout and stderr content to both console and log file. (dual stdout and stderr)
# output=5 stdout to console, stderr to console and log
# output=6 stdout to log file, stderr to console and log
# output=7 stdout suppressed, stderr to console and log
# Output to stderr, addition of results to results file, consuming of worktodo file, and save to
# save files, thread files, or fft files occur regardless of this output flag.
#
# flag  stdout  stdout2log  stderr  stderr2log
# 0     y       y           y       n
# 1     y       n           y       n
# 2     n       y           y       n
# 3     n       n           y       n
# 4     y       y           y       y
# 5     y       n           y       y
# 6     n       y           y       y
# 7     n       n           y       y

output=1
[/CODE]Sample console output of dprintf etc test/demo program "Additions"
[CODE]Additions.c ver 8/31/2017

Opened for append testlogfile.txt
B The system time is: 16:41:33.655 UTC
at: Mon 2018-11-05 16:41:33.655 UTC
Starting at Local time 2018-11-05 10:41:33.656, UTC 2018-11-05 16:41:33.656.


flag 0 follows
Flag 0=0 expected should print stdout to both log file and screen.
Stderr should be unaffected by flag=0 called with 0

flag 1 follows
Flag 1=1 expected should print stdout to screen but not log file.
Stderr should be unaffected by flag=1 called with 1

flag 2 follows
Stderr should be unaffected by flag=2 called with 2

flag 3 follows
Stderr should be unaffected by flag=3 called with 3

flag 4 follows
Flag 4=4 expected should print stdout to both log file and screen.
Stderr should be duplicated by flag=4 called with 4

flag 5 follows
Flag 5=5 expected should print stdout to screen but not log file.
Stderr should be duplicated by flag=5 called with 5

flag 6 follows
Stderr should be duplicated by flag=6 called with 6

flag 7 follows
Stderr should be duplicated by flag=7 called with 7

flag 8 follows

Warning--output flag value=8 is outside expected bounds of 0-7 on entry to dprintf.
Flag 8=8 expected should print stdout to both log file and screen and warn about flag
.8

Warning--output flag value=8 outside expected bounds of 0-7 on entry to dfprintf.
Stderr should be duplicated by flag=8 called with 8

flag -47 follows

Warning--output flag value=-47 is outside expected bounds of 0-7 on entry to dprintf.

Flag -47 should print stdout to both log file and screen and warn about flag.-47

Warning--output flag value=-47 outside expected bounds of 0-7 on entry to dfprintf.
Stderr should be unaffected by flag=-47 called with -47
a=398.000000, b=1.000000 final b=inf

Exiting at Local time 2018-11-05 10:41:33.671, UTC 2018-11-05 16:41:33.671. Elapsed time of the run, 0.016 seconds


End program at: Mon 2018-11-05 16:41:33.671
[/CODE]Sample log file content of dprintf etc test/demo program "Additions"
[CODE]Starting at Local time 2018-11-05 10:41:33.656, UTC 2018-11-05 16:41:33.656.


flag 0 follows
Flag 0=0 expected should print stdout to both log file and screen.
Flag 2=2 expected should print stdout to log file but not screen.
Flag 4=4 expected should print stdout to both log file and screen.
Stderr should be duplicated by flag=4 called with 4
Stderr should be duplicated by flag=5 called with 5
Flag 6=6 expected should print stdout to log file but not screen.
Stderr should be duplicated by flag=6 called with 6
Stderr should be duplicated by flag=7 called with 7
Flag 8=8 expected should print stdout to both log file and screen and warn about flag.8
Stderr should be duplicated by flag=8 called with 8
Flag -47 should print stdout to both log file and screen and warn about flag.-47

Exiting at Local time 2018-11-05 10:41:33.671, UTC 2018-11-05 16:41:33.671. Elapsed time of the run, 0.016 seconds
[/CODE]cudalucas/cudapm1 option flag list, alphabetized[CODE]

-b proposed bios version confirmation of specific device
-c n checkpoint
-cufftbench create fft file
-d n device number (zero based)

-f n fftlength

-h help and exit
-i filename ini file
-info
-k keyboard input enabled

-m proposed model name confirmation of specific device
-memtest

-o proposed output control flag

-p proposed pci slot id string confirmation of specific device
-polite n

-r n run short or long selftest
-s <folder> save checkpoints
-threadbench create threads file
-threads

-u proposed userid and optional systemid-gpuid string to prepend to results lines
-v version and exit
-w proposed estimate work durations and schedule
-x n screen report interval
[/CODE]
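
The output flag table in the readme and ini fragments above can be expressed as a small lookup. The sketch below only transcribes that table into a data structure; the names `Routing`, `ROUTING` and `routing_for` are hypothetical and do not come from additions.7z. Raising an error for out-of-range flags is a choice made for this sketch; the demo program's sample output shows that it prints a warning instead.

```python
from typing import NamedTuple


class Routing(NamedTuple):
    """Where each stream goes for a given output flag (hypothetical type)."""
    stdout_console: bool
    stdout_log: bool
    stderr_console: bool
    stderr_log: bool


Y, N = True, False

# Direct transcription of the table:
# columns are stdout, stdout2log, stderr, stderr2log.
ROUTING = {
    0: Routing(Y, Y, Y, N),
    1: Routing(Y, N, Y, N),  # default; traditional behaviour
    2: Routing(N, Y, Y, N),
    3: Routing(N, N, Y, N),
    4: Routing(Y, Y, Y, Y),
    5: Routing(Y, N, Y, Y),
    6: Routing(N, Y, Y, Y),
    7: Routing(N, N, Y, Y),
}


def routing_for(flag):
    """Look up the routing for an output flag in the range 0-7."""
    if flag not in ROUTING:
        # Assumption of this sketch: reject rather than warn and continue.
        raise ValueError(f"output flag {flag} is outside expected bounds of 0-7")
    return ROUTING[flag]


print(routing_for(2))
```

Read this way, the table has a regular structure: stderr always reaches the console, flags 4 and above additionally copy stderr to the log, and the two low bits select the stdout destinations.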

aaronhaviland 2018-11-12 02:40

[QUOTE=kriesel;499685]Hi, sorry for the delay responding.
What's your build OS, still Ubuntu; version #? I'm aiming for Win7 x64.
[/QUOTE]Ubuntu currently for this project, but I have others in Visual Studio. (I'm not exactly a fan of frontends, and prefer console compilations myself)
I do plan to build on both platforms in the future, but I prefer to code in *nix.

[QUOTE]1) Misc minor edits for housekeeping (see the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=462600&postcount=503[/URL] and see change note #8 in an attachment at [URL]https://www.mersenneforum.org/showpost.php?p=463662&postcount=511[/URL])[/QUOTE] - Done, see commits b2d11b1 through d5c7a6f

[QUOTE]2) Addition of output options and date/time stamps (see the test/demo program attached to this post)[/QUOTE] - Passing on this one for now. Put in TODO. May re-visit later

[QUOTE]3) Sanity checking of fft and threads benchmarking[/QUOTE] - Trying to understand this. I'm guessing there are some combinations of cards/threads where the FFT just bails out and returns quickly, and this is an attempt to catch it?

[QUOTE]4) Incomplete rewrite of readme.txt[/QUOTE] - Tabled for now, put in TODO.

[QUOTE]5) Editing of cudapm1.ini (other than the fragment re logging below)[/QUOTE] - Done, See commit 0b4f2c2

kriesel 2018-11-12 06:27

[QUOTE=aaronhaviland;500127]
(3) - Trying to understand this. I'm guessing there are some combinations of cards/threads where the FFT just bails out and returns quickly, and this is an attempt to catch it?
[/QUOTE]
Yes. See for example [URL]https://www.mersenneforum.org/showpost.php?p=456324&postcount=2591[/URL] where, in CUDALucas, 1024 squaring threads is bad and gives timings half those of the other thread counts. There are also cases where 32 threads is bad (compute capability 2.0, I think). CUDAPm1 issue #16.

There are also cases where certain fft lengths give bad results. As I recall these were found for old CUDA levels. See also [URL]https://www.mersenneforum.org/showpost.php?p=463280&postcount=2608[/URL] for the fft benchmark analogous issue.

See also the bad-residues cases, at least some of which are related to the threads issues. The CUDALucas issues 2 to 5 in its bug and wish list are worth examining.

The too-early returns for some thread counts or fft lengths trash the thread or fft benchmarking respectively.
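
One possible form of such a sanity check, sketched under assumptions: flag any benchmark timing that is implausibly far below the median of the others, so that a too-early return is not mistaken for the fastest setting. The function name, the 0.6 threshold and the sample timings are all invented for illustration; this is not the draft code from the modified cudapm1.cu.

```python
from statistics import median


def suspicious_timings(timings_ms, ratio=0.6):
    """Return the settings whose time is below ratio * median of all times.

    timings_ms maps a setting (thread count or fft length) to a time in ms.
    The 0.6 threshold is an arbitrary assumption for this sketch.
    """
    if not timings_ms:
        return []
    typical = median(timings_ms.values())
    return sorted(k for k, t in timings_ms.items() if t < ratio * typical)


# Invented sample data: 1024 threads "finishes" in about half the usual time.
sample = {32: 4.1, 64: 4.0, 128: 3.9, 256: 3.8, 512: 3.9, 1024: 1.9}
print(suspicious_timings(sample))
```

On the invented data above, only the 1024-thread entry is flagged; a benchmark could then exclude flagged settings before picking the fastest one.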

CUDALucas was modified to trap for a select few bad-residue cases: 0x02, 0x00, and 0xfffffffffffffffd. The CUDALucas v2.06beta traps for its known bad residues. Since CUDAPm1 was derived from CUDALucas, years before, it has some of the same issues as well as some of its own. CUDAPm1's list of bad residues is longer.
%badresidues=(
'cllucas', '0x0000000000000002, 0xffffffff80000000',
'cudalucas', '0x0000000000000000, 0x0000000000000002, 0xfffffffffffffffd',
'cudapm1', '0x0000000000000000, 0x0000000000000001, 0xfff7fffbfffdfffe, 0xfff7fffbfffdffff, 0xfff7fffbfffffffe, 0xfff7fffbffffffff, 0xfff7fffffffdfffe, 0xfff7fffffffdffff, 0xfff7fffffffffffe, 0xfff7ffffffffffff, 0xfffffffbfffdfffe, 0xfffffffbfffdffff, 0xfffffffbfffffffe, 0xfffffffbffffffff, 0xfffffffffffdfffe, 0xfffffffffffdffff, 0xfffffffffffffffe, 0xffffffffffffffff',
'gpuowl', '0x0000000000000000',
'mfaktc', '',
'mfakto', ''
); #fff* added to cudapm1 list 7/19/18
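
A check against this list could look like the sketch below. The residue values are copied from the list above; the Python names `BAD_RESIDUES` and `is_bad_residue` are hypothetical, and this is not how CUDALucas or CUDAPm1 implement their traps.

```python
# Known-bad 64-bit residues per program, copied from the list above.
BAD_RESIDUES = {
    "cllucas": {0x0000000000000002, 0xffffffff80000000},
    "cudalucas": {0x0000000000000000, 0x0000000000000002, 0xfffffffffffffffd},
    "cudapm1": {
        0x0000000000000000, 0x0000000000000001,
        0xfff7fffbfffdfffe, 0xfff7fffbfffdffff,
        0xfff7fffbfffffffe, 0xfff7fffbffffffff,
        0xfff7fffffffdfffe, 0xfff7fffffffdffff,
        0xfff7fffffffffffe, 0xfff7ffffffffffff,
        0xfffffffbfffdfffe, 0xfffffffbfffdffff,
        0xfffffffbfffffffe, 0xfffffffbffffffff,
        0xfffffffffffdfffe, 0xfffffffffffdffff,
        0xfffffffffffffffe, 0xffffffffffffffff,
    },
    "gpuowl": {0x0000000000000000},
    "mfaktc": set(),
    "mfakto": set(),
}


def is_bad_residue(program, residue_hex):
    """Return True if residue_hex (e.g. '0xfffffffffffffffd') is listed as bad."""
    known = BAD_RESIDUES.get(program.lower(), set())
    return int(residue_hex, 16) in known


print(is_bad_residue("cudapm1", "0xfff7fffbfffdfffe"))
print(is_bad_residue("cudalucas", "0x0000000000000001"))
```

Storing the residues as integers rather than strings makes the comparison independent of letter case and leading zeros in the reported residue.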

tServo 2018-11-12 16:30

[QUOTE=kriesel;498686]

CUDAPm1 looked to me to be using GMP also. But the available Windows CUDAPm1 executables are linked to an old GMP version (2013 or earlier), and after looking through GMP's revision history of the past few years, I think there might be some issues due to that, not present in gpuOwL, or future builds of CUDAPm1 with a current GMP version for that matter.
.[/QUOTE]
kriesel,
I suggest you avoid GMP and use MPIR instead. It's a fork of GMP with Windows support, designed to be compiled by Visual Studio and, I believe, yasm.
It's not trivial to install (each person must compile it for themselves), but it avoids all the GMP headaches. The upside is that each build is then optimized for THAT machine.
One of its authors, Brian Gladman, posts here occasionally.
Be sure to use the 'generate GMP headers' option, which is specifically for porting code from GMP to Windows.
I have used MPIR extensively, but have never ported anything from GMP.
It is at mpir.org

tServo 2018-11-12 16:39

[QUOTE=kriesel;498319]Unfortunately, the Oct deadline has passed without a sufficient number of neighbors signing up, so the schedule for fiber install for my neighborhood has been delayed by nominally 6 months, to next June. And the rate of signup (I'm tracking via their website) looks ominously slow for even that delayed schedule. If/when it happens, they're offering 300, 400, and 1000Mbps.[/QUOTE]

kriesel,
What company is promising all this?
The reason I ask is that here in Champaign-Urbana, we have had 2 different attempts
to provide everybody with fiber, exactly as you described.
The most recent also had a web page where you could see how many of your neighbors had signed up, blah blah blah.
They also kept delaying it, saying not enough people had signed up for it, etc etc etc.

They finally crashed and burned, the whole thing looking like a scam.
Bad feelings all around.
Some government-started corporation was handed the contract to actually do it.
I don't know their status. I will look into it.
Meanwhile, evil Comcast still has my business.


All times are UTC.