mersenneforum.org  

Old 2013-05-04, 07:54   #199
Stef42
 
Feb 2012
the Netherlands

111010₂ Posts
Default

Quote:
Originally Posted by frmky View Post
Hmmm. Try the 64-bit version to see if it makes any difference. If it persists, we can try adding cudaDeviceSynchronize() as well, but that seemed to be hit-or-miss in the discussions.
I have done some tests, and it looks very good.
So far I have tested up to 30000000 for the B2 value.
I'll do some further testing, but there are no errors anymore.
Old 2013-05-04, 08:16   #200
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

19B₁₆ Posts
Default

With e=12 and using the 64-bit binary, the maximum working FFT length seems to be 1024K.
Once it jumps to 1120K, stage 2 doesn't run because the allocation fails with an out-of-memory error.

Code:
CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 1 -b2 1000 -e2 12 18900103
CUDAPm1 v0.10
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
CUDA reports 5746M of 6143M GPU memory free.
Using e=12, d=2310, nrp=480
Using approximately 4395M GPU memory.
B1 should be at least 2, increasing it.
B2 should be at least 750750, increasing it.
Starting stage 1 P-1, M18900103, B1 = 2, B2 = 750750, e = 12, fft length = 1120K
Doing 27 iterations
M18900103, 0xd9cdc4241fd69cb5, offset = 0, n = 1120K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M18900103 Stage 1 found no factor (P-1, B1=2, B2=750750, e=12, n=1120K CUDAPm1 v0.10)
Starting stage 2.
C:/Users/childers/Dropbox/NFS/cudapm1/build/cudapm1-code-21/cudapm1-code-21/trunk/CUDAPm1.cu(2640) : cudaSafeCall() Runtime API error 2: out of memory.
Code:
CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 1 -b2 1000 -e2 12 18800137
CUDAPm1 v0.10
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
CUDA reports 5754M of 6143M GPU memory free.
Using e=12, d=2310, nrp=480
Using approximately 4019M GPU memory.
B1 should be at least 2, increasing it.
B2 should be at least 750750, increasing it.
Starting stage 1 P-1, M18800137, B1 = 2, B2 = 750750, e = 12, fft length = 1024K
Doing 27 iterations
M18800137, 0x2c4be40be0856b5b, offset = 0, n = 1024K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 0:00
Starting stage 1 gcd.
M18800137 Stage 1 found no factor (P-1, B1=2, B2=750750, e=12, n=1024K CUDAPm1 v0.10)
Starting stage 2.
Zeros: 26762, Ones: 45718, Pairs: 14576
itime: 71.914939, transforms: 1, average: 71914.939000
ptime: 42.435831, transforms: 95060, average: 0.446411
ETA: 0:00
Stage 2 complete, estimated total time = 1:54
Accumulated Product: M18800137, 0xda62bc92cb243523, n = 1024K, CUDAPm1 v0.10
Starting stage 2 gcd.
M18800137 Stage 2 found no factor (P-1, B1=2, B2=750750, e=12, n=1024K CUDAPm1 v0.10)
I remember Oliver(TheJudger) saying this:
Quote:
Originally Posted by TheJudger View Post
just compile your code for 64bit and use "long long int" when printing the total amount of memory.
Code:
./deviceQuery | grep global
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
Oliver
But the program already reports that 5754M of 6143M GPU memory is free, so...what now?
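For reference, here is a minimal sketch of the truncation that Oliver's advice guards against, using the byte count from his deviceQuery output. The helper name as_32bit is made up for illustration, and nothing here is taken from CUDAPm1's source.

```python
# Illustration only: what a byte count above 4 GiB looks like after it has
# been squeezed through a 32-bit unsigned integer.
MIB = 1024 * 1024


def as_32bit(n_bytes):
    """Keep only the low 32 bits, as a 32-bit unsigned variable would."""
    return n_bytes & 0xFFFFFFFF


total_bytes = 5032706048  # from the deviceQuery output quoted above

full_mib = total_bytes // MIB                 # whole MiB, rounded down
truncated_mib = as_32bit(total_bytes) // MIB  # same value after truncation

print(f"64-bit value: {full_mib} MiB")
print(f"32-bit value: {truncated_mib} MiB")
```

A truncated report would come out far below 4096M, so a figure like 5754M of 6143M suggests the reporting itself is not being cut off this way.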
Old 2013-05-04, 09:06   #201
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

2²×23² Posts
Default

This is using the 64-bit binary? It looks suspicious that it fails crossing 4096M.

When running the second case, check to see if it really uses about 4019M. That value is just an estimate based on what we expect cuFFT to use. If it's accurate, then CUDA is lying when it says we can use 5746M, since we request well below that. If this is the 64-bit binary, then I may need to limit memory use to 4096M on Windows. This won't affect most users. :-)

In the meantime, you can increase the FFT size until nrp drops to 240, or decrease d to 210 using -d2 210.
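A rough sketch of what such a 4096M cap would mean for the two runs posted above. This is not CUDAPm1 code: the names WINDOWS_CAP_MB, usable_mb and fits are hypothetical, and the cap value is only the guess from this thread.

```python
# Hypothetical sketch of a 4096M cap on stage 2 memory; not CUDAPm1 code.
WINDOWS_CAP_MB = 4096  # assumed limit, based only on the failure seen in this thread


def usable_mb(free_mb, cap_mb=WINDOWS_CAP_MB):
    """Memory budget after applying the cap to what CUDA reports as free."""
    return min(free_mb, cap_mb)


def fits(estimate_mb, free_mb, cap_mb=WINDOWS_CAP_MB):
    """True if the estimated stage 2 usage stays within the capped budget."""
    return estimate_mb <= usable_mb(free_mb, cap_mb)


# (FFT length, estimated usage in M, free memory reported by CUDA in M),
# all taken from the two logs posted above.
runs = [("1120K", 4395, 5746), ("1024K", 4019, 5754)]
for fft, estimate, free in runs:
    print(fft, "fits" if fits(estimate, free) else "exceeds the cap")
```

With these numbers the 1120K run lands above the cap and the 1024K run below it, which is the same split as the observed failure and success.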
Old 2013-05-04, 09:25   #202
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

411₁₀ Posts
Default

Yes, it's the 64 bit binary.
I used MSI Afterburner to measure it: 273MB used before starting CUDAPm1, 401MB during stage 1, and 4345MB during stage 2. 4345 - 273 = 4072, which is "close" to the reported 4019MB.
This is what I call the MSI Afterburner delta method.
ProcessXP reported committed GPU memory as 4,116,400K.

CUDA may well be reporting free memory correctly, but for some reason it can't allocate a whole chunk of more than ~4096MB.
That's my guess.

Using -d2 dropped vRAM usage to 615MB.

Last fiddled with by Karl M Johnson on 2013-05-04 at 09:30
Old 2013-05-04, 09:37   #203
firejuggler
 
firejuggler's Avatar
 
Apr 2010
Over the rainbow

2×1,303 Posts
Default

Is there a switch to make the on-screen output less frequent? (Low exponent, without stage 2 done.) I get a new line every 2 seconds in stage 1 and it's a tad annoying.
Old 2013-05-04, 09:42   #204
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

3×137 Posts
Default

The -c flag?
Old 2013-05-04, 09:45   #205
firejuggler
 
firejuggler's Avatar
 
Apr 2010
Over the rainbow

2×1,303 Posts
Default

thanks
Old 2013-05-05, 19:05   #206
Aramis Wyler
 
Aramis Wyler's Avatar
 
"Bill Staffen"
Jan 2013
Pittsburgh, PA, USA

2³×53 Posts
Default

Running it on the 580 ftw with the default worktodo, it starts out well:
Code:
CUDAPm1 v0.10
Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports 2766M of 3072M GPU memory free.
Using e=6, d=2310, nrp=80
Using approximately 2529M GPU memory.
Starting stage 1 P-1, M61262347, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K
Doing 873133 iterations
Iteration 1000 M61262347, 0xf19a7f6041953a97, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:09 real, 9.1117 ms/iter, ETA 2:12:26)
Iteration 2000 M61262347, 0xaf1d15aad49fcee8, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.7928 ms/iter, ETA 1:24:06)
Iteration 3000 M61262347, 0xb702298e7a8c9a8e, n = 3360K, CUDAPm1 v0.10 err = 0.19922 (0:06 real, 5.9176 ms/iter, ETA 1:25:49)
Iteration 4000 M61262347, 0xc53d1695707d3dc0, n = 3360K, CUDAPm1 v0.10 err = 0.19141 (0:06 real, 5.8142 ms/iter, ETA 1:24:13)
I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones.
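As a side note, the ETA column in the log above appears to be just the remaining iterations times the current ms/iter. A small sketch under that assumption (the function name eta and the truncation of fractional seconds are guesses, not taken from the CUDAPm1 source):

```python
# Sketch: rebuild the ETA column from the iteration count and ms/iter.
def eta(total_iters, done_iters, ms_per_iter):
    """Remaining time as H:MM:SS, dropping fractional seconds (assumed)."""
    seconds = int((total_iters - done_iters) * ms_per_iter / 1000)
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


TOTAL = 873133  # from "Doing 873133 iterations"

# (iteration reached, ms/iter) pairs copied from the log lines above
for done, ms in [(1000, 9.1117), (2000, 5.7928), (3000, 5.9176), (4000, 5.8142)]:
    print(done, eta(TOTAL, done, ms))
```

The first timing includes start-up overhead, which would explain why the ETA drops from over two hours to about 1:24 once the ms/iter settles.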
Old 2013-05-05, 19:12   #207
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

9,767 Posts
Default

Quote:
Originally Posted by Aramis Wyler View Post
I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones.
I get the distinct impression that we'll be losing some more GPU TFing firepower shortly...

Not that I'm complaining! Nice work Karl, Fred et al!
Old 2013-05-05, 20:06   #208
James Heinrich
 
James Heinrich's Avatar
 
"James Heinrich"
May 2004
ex-Northern Ontario

110101011101₂ Posts
Default

Quote:
Originally Posted by James Heinrich View Post
Is the B2>=390390 a fixed limitation, or tied to the exponent, or FFT, or...?
Does anyone have insight as to the minimum limits for B1/B2?
Old 2013-05-05, 20:39   #209
frmky
 
frmky's Avatar
 
Jul 2003
So Cal

2²×23² Posts
Default

nrp must be a divisor of phi(d); a seg fault is likely otherwise.

With p = the smallest prime that does not divide d:
b1 < b2 / p / 53 will not pair some smaller primes, so it may give incorrect results.
b2 / p < d * (2 * e + 1) will give incorrect results.
b2 / p < b1 will produce a seg fault at the onset of stage 2.
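The rules above can be written out as a small checker. This is only a sketch: the function names are made up, nrp is assumed to be positive, and the divisions are treated as exact (each inequality is multiplied through by p), which may differ from how CUDAPm1 rounds internally.

```python
from math import gcd


def phi(n):
    """Euler's totient by direct count; fine for a small d such as 2310."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def smallest_prime_not_dividing(d):
    """Smallest prime p with d % p != 0, found by trial division."""
    p = 2
    while True:
        is_prime = all(p % q for q in range(2, int(p ** 0.5) + 1))
        if is_prime and d % p:
            return p
        p += 1


def check(b1, b2, d, e, nrp):
    """Return a list of problems; an empty list means all four rules pass."""
    problems = []
    if phi(d) % nrp:
        problems.append("nrp does not divide phi(d): seg fault likely")
    p = smallest_prime_not_dividing(d)
    if b1 * p * 53 < b2:  # same as b1 < b2 / p / 53
        problems.append("b1 < b2/p/53: some smaller primes will not be paired")
    if b2 < p * d * (2 * e + 1):  # same as b2 / p < d * (2 * e + 1)
        problems.append("b2/p < d*(2e+1): incorrect results")
    if b2 < b1 * p:  # same as b2 / p < b1
        problems.append("b2/p < b1: seg fault at the onset of stage 2")
    return problems


# Default run from post #206: every rule passes.
print(check(605000, 16637500, 2310, 6, 80))
# Karl's test run from post #200: only the pairing rule is flagged.
print(check(2, 750750, 2310, 12, 480))
```

For d=2310 we get p=13, so the second rule gives a minimum B2 of 13 * 2310 * 13 = 390390 at e=6 and 13 * 2310 * 25 = 750750 at e=12. That matches both the 390390 in James's question and the "B2 should be at least 750750" message in the earlier logs.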

Last fiddled with by frmky on 2013-05-05 at 21:00

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.