mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   The P-1 factoring CUDA program (https://www.mersenneforum.org/showthread.php?t=17835)

LaurV 2013-05-11 14:48

edit limit: for 1.5G of RAM (1536 MB version), the maximum block value I can use without crashing is 52 blocks for cards driving the video, and 57 blocks for "free" cards. The theoretical value for this amount of RAM would be 60 blocks, but I believe it needs some "spare". Anything over 52/57 will crash the program, and even at those limits everything becomes terribly slow. With 50/55 everything works fine, no errors. Also, with those values, the cards only get "normally hot".

so I use a batch file like the following, to start all cards at the same time:
[code]
rem Tests 60*25 = 1500 MB of memory on all GPUs (1.5G = 1536M each)
rem use max 52 for the card driving the video (everything becomes very slow!! use 50 better)
rem use max 57 for the other cards (everything becomes very slow!! use 55 or lower...)
start "CUDAmemtest_0" /LOW CUDAmemtest.exe 50 1000 0
start "CUDAmemtest_1" /LOW CUDAmemtest.exe 55 1000 1
....etc other cards not driving the video
[/code]
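A quick sanity check of that arithmetic (an illustrative sketch, not part of CUDAmemtest; the 25 MB/block figure and the 52/57 crash limits come from the post above):

```python
BLOCK_MB = 25  # the batch file tests blocks of 25 MB each

def safe_blocks(ram_mb, driving_video):
    """Empirically safe block count: theoretical fit, capped by the
    observed crash limits (52 display / 57 free), minus a margin of 2."""
    theoretical = ram_mb // BLOCK_MB          # 1536 // 25 = 61, ~60 usable
    crash_limit = 52 if driving_video else 57 # observed limits from the post
    return min(theoretical, crash_limit) - 2  # back off 2 for stability

print(safe_blocks(1536, driving_video=True))   # 50, card running the display
print(safe_blocks(1536, driving_video=False))  # 55, a "free" card
```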

c10ck3r 2013-05-11 18:12

So, I updated my driver, changed device number back to "0", and dropped thread count to 32. One of those (or some combination) fixed it apparently.

Karl M Johnson 2013-05-11 18:37

[QUOTE=c10ck3r;340072]So, I updated my driver, changed device number back to "0", and dropped thread count to 32. One of those (or some combination) fixed it apparently.[/QUOTE]
Updating the drivers :smile:
For good speeds, set the threads equal to 256.

c10ck3r 2013-05-11 18:41

[QUOTE=Karl M Johnson;340075]Updating the drivers :smile:
For good speeds, set the threads equal to 256.[/QUOTE]

It actually wouldn't run the exponent I manually reserved with 32 threads. Running @ 256 :)

Karl M Johnson 2013-05-12 10:12

May be worth mentioning, the -f flag is ignored when CPm1 is executed.

LaurV 2013-05-12 16:35

[QUOTE=Karl M Johnson;340131]May be worth mentioning, the -f flag is ignored when CPm1 is executed.[/QUOTE]
The fft selection still works by adding the size at the end of each line in worktodo, so you can still tune the fft for each exponent. I found that the rules are the same as for cudaLucas, and did my usual tuning. For example, 3456k gives me a speed gain of about 3.5-4.5% compared with the default selected by the program (3360k), and the roundoff error decreased from about 0.21 to 0.09, which is better.

I got 30 assignments from gpu72, did 10 of them, and for the other 20 I replaced the final ",2" on each line in worktodo with ",3,3456k" (i.e. besides specifying the fft, I made cudaPM1 believe it will save 3 LL tests, so its default calculation uses a larger B1). This increased the default B1 from about 605000 to 950000, and the stage 1 time from an hour and a half to two and a half hours (I don't know the stage 2 time yet, as B2 was also increased, from 16775000 to 28975000). This comes with a better chance of finding a factor, from about 4% to about 6% (I didn't do the calculation; I just know from past experience with P95 the effect of changing the last ",2" into ",3", or larger, to increase the chance of finding a factor).
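The worktodo edit described above is easy to script; here is a hedged sketch (the example line format is only illustrative, not an exact worktodo entry):

```python
def retune(line, tests_saved=3, fft="3456k"):
    """Replace a trailing ',2' (LL tests saved) with ',3' and append
    the hand-tuned fft size, as described in the post."""
    line = line.rstrip()
    if line.endswith(",2"):
        return line[:-2] + ",%d,%s" % (tests_saved, fft)
    return line  # leave already-edited or other lines untouched

# hypothetical worktodo line, for illustration only
print(retune("Pfactor=N/A,1,2,62621347,-1,74,2"))
# Pfactor=N/A,1,2,62621347,-1,74,3,3456k
```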

What is missing is checkpoints (I already had a restart and lost a few hours of work, due to a storm here - the rainy season is starting). And a save at the end of stage 1 is a must: in case some of us want to "extend" the B1 limit, the program should be able to reload that file and do a few more iterations. Of course, this would complicate things with the credit, as PrimeNet gives the credit in full, so a guy doing 10 iterations and reporting after every one would get the credit 10 times. :smile:

c10ck3r 2013-05-12 16:40

[QUOTE=LaurV;340151]What is missing is checkpoints (I already had a restart and lost a few hours of work, due to a storm here - the rainy season is starting). And a save at the end of stage 1 is a must: in case some of us want to "extend" the B1 limit, the program should be able to reload that file and do a few more iterations. [B]Of course, this would complicate things with the credit, as PrimeNet gives the credit in full, so a guy doing 10 iterations and reporting after every one would get the credit 10 times.[/B] :smile:[/QUOTE]
This is already an issue with P95, so no real change would occur, IIRC.

frmky 2013-05-12 19:22

[QUOTE=Karl M Johnson;340131]May be worth mentioning, the -f flag is ignored when CPm1 is executed.[/QUOTE]

Really? It should work; it does in Linux. I run with something like
CUDAPm1 -f 3456k

Karl M Johnson 2013-05-13 05:46

Oh, right, with the k notation it works.
However, I've noticed that when the FFT size is manually specified, the algorithm that limits stage 2 GPU memory to <4096 MB doesn't kick in.
[CODE]
R:\cudapm1_x64_20130505>CUDAPm1 -f 4096k
CUDAPm1 v0.10
Selected B1=630000, B2=17325000, 4.17% chance of finding a factor
CUDA reports 5618M of 6143M GPU memory free.
Using e=6, d=2310, nrp=120
Using approximately [COLOR=Red]4364M[/COLOR] GPU memory.
Starting stage 1 P-1, M62621347, B1 = 630000, B2 = 17325000, e = 6, fft length = 4096K
Doing 908959 iterations
[/CODE]
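Rough arithmetic behind that red figure (my assumption about the layout, not taken from CUDAPm1's source): nrp residues of fft_length doubles account for most of the 4364M.

```python
fft_len = 4096 * 1024            # 4096K-point fft, one double per point
nrp = 120                        # relative primes per pass, from the log above
per_residue_mb = fft_len * 8 / 2**20  # 8 bytes per double -> 32.0 MB each
residues_mb = nrp * per_residue_mb
print(residues_mb)               # 3840.0 MB for the residues alone;
                                 # the remaining ~520 MB would be scratch buffers
```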

frmky 2013-05-13 06:00

[QUOTE=Karl M Johnson;340209]Oh, right, with the k notation it works.
However, I've noticed that when the FFT size is manually specified, the algorithm that limits stage 2 GPU memory to <4096 MB doesn't kick in.
[CODE]
R:\cudapm1_x64_20130505>CUDAPm1 -f 4096k
CUDAPm1 v0.10
Selected B1=630000, B2=17325000, 4.17% chance of finding a factor
CUDA reports 5618M of 6143M GPU memory free.
Using e=6, d=2310, nrp=120
Using approximately [COLOR=Red]4364M[/COLOR] GPU memory.
Starting stage 1 P-1, M62621347, B1 = 630000, B2 = 17325000, e = 6, fft length = 4096K
Doing 908959 iterations
[/CODE][/QUOTE]

Try it with a low B1 to see if stage 2 runs ok. It's not limiting total memory use to 4096MB; it's limiting the size of a single allocation to 4096MB.

Karl M Johnson 2013-05-13 07:10

[URL="http://i.imgur.com/cKykkFl.png"]It does run ok[/URL], glad to see that!

However, there's this.
The vRAM is downclocked from the maximum to 5 GHz, and it still happens.
Technically, we should be able to run exponents with an unusually high FFT length, right?
[CODE]
R:\cudapm1_x64_20130505>CUDAPm1 -f 10240k
CUDAPm1 v0.10
Selected B1=630000, B2=17325000, 4.17% chance of finding a factor
CUDA reports 5297M of 6143M GPU memory free.
Using e=6, d=2310, nrp=48
Using approximately 5150M GPU memory.
Starting stage 1 P-1, M62621963, B1 = 630000, B2 = 17325000, e = 6, fft length = 10240K
Doing 908959 iterations
Iteration = 100 >= 1000 && err = 0.5 >= 0.35, fft length = 10240K, writing checkpoint file (because -t is enabled) and exiting.

Iteration = 100, err = 0.5 >= 0.43, quitting.
Estimated time spent so far: 0:02[/CODE]
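The two thresholds in that log suggest logic roughly like the following sketch (reconstructed from the messages above; the real CUDAPm1 code may well differ):

```python
CHECKPOINT_ERR = 0.35  # past this, write a checkpoint and stop
FATAL_ERR = 0.43       # past this, this fft length is unusable

def check_roundoff(err):
    """Classify a roundoff error the way the log above behaves."""
    if err >= FATAL_ERR:
        return "quit"        # err = 0.5 >= 0.43 in the log
    if err >= CHECKPOINT_ERR:
        return "checkpoint"  # save state before results go bad
    return "ok"

print(check_roundoff(0.5))    # quit
print(check_roundoff(0.38))   # checkpoint
print(check_roundoff(0.09))   # ok
```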

