[QUOTE=frmky;339215]Hmmm. Try the 64-bit version to see if it makes any difference. If it persists, we can try adding cudaDeviceSynchronize() as well, but that seemed to be hit-or-miss in the discussions.[/QUOTE]
I have done some tests, and it looks very good. So far I have tested B2 values up to 30000000. I'll do some further testing, but there are no errors anymore. |
With e=12 and the 64-bit binary, the maximum working FFT length seems to be 1024K.
Once it jumps to 1120K, stage 2 doesn't run due to not enough vRAM available.
[CODE]
CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 1 -b2 1000 -e2 12 18900103
CUDAPm1 v0.10
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
CUDA reports 5746M of 6143M GPU memory free.
Using e=12, d=2310, nrp=480
Using approximately 4395M GPU memory.
B1 should be at least 2, increasing it.
B2 should be at least 750750, increasing it.
Starting stage 1 P-1, M18900103, B1 = 2, B2 = 750750, e = 12, fft length = 1120K
Doing 27 iterations
M18900103, 0xd9cdc4241fd69cb5, offset = 0, n = 1120K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M18900103 Stage 1 found no factor (P-1, B1=2, B2=750750, e=12, n=1120K CUDAPm1 v0.10)
Starting stage 2.
C:/Users/childers/Dropbox/NFS/cudapm1/build/cudapm1-code-21/cudapm1-code-21/trunk/CUDAPm1.cu(2640) : cudaSafeCall() Runtime API error 2: out of memory.
[/CODE]
[CODE]
CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 1 -b2 1000 -e2 12 18800137
CUDAPm1 v0.10
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
CUDA reports 5754M of 6143M GPU memory free.
Using e=12, d=2310, nrp=480
Using approximately 4019M GPU memory.
B1 should be at least 2, increasing it.
B2 should be at least 750750, increasing it.
Starting stage 1 P-1, M18800137, B1 = 2, B2 = 750750, e = 12, fft length = 1024K
Doing 27 iterations
M18800137, 0x2c4be40be0856b5b, offset = 0, n = 1024K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 0:00
Starting stage 1 gcd.
M18800137 Stage 1 found no factor (P-1, B1=2, B2=750750, e=12, n=1024K CUDAPm1 v0.10)
Starting stage 2.
Zeros: 26762, Ones: 45718, Pairs: 14576
itime: 71.914939, transforms: 1, average: 71914.939000
ptime: 42.435831, transforms: 95060, average: 0.446411
ETA: 0:00
Stage 2 complete, estimated total time = 1:54
Accumulated Product: M18800137, 0xda62bc92cb243523, n = 1024K, CUDAPm1 v0.10
Starting stage 2 gcd.
M18800137 Stage 2 found no factor (P-1, B1=2, B2=750750, e=12, n=1024K CUDAPm1 v0.10)
[/CODE]
I remember Oliver(TheJudger) saying this:
[QUOTE=TheJudger;335116]just compile your code for 64bit and use "long long int" when printing the total amount of memory. :razz:
[CODE]
./deviceQuery | grep global
Total amount of global memory: 4800 MBytes (5032706048 bytes)
Total amount of global memory: 4800 MBytes (5032706048 bytes)
[/CODE]
Oliver[/QUOTE]
But the program already reports that 5754M of 6143M GPU memory [B]is[/B] free, so...what now? |
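For anyone following along, here is a small sketch of the 32-bit pitfall that Oliver's quote hints at. It is only an illustration: nothing in the thread confirms that CUDAPm1 or the CUDA runtime truncates a size this way, and `wrap_u32` and `fits_in_u32` are made-up helper names. It shows what happens to the byte count from the quoted deviceQuery output if it is stored in an unsigned 32-bit integer, and that the failing 4395M estimate lies above the 4096M (2^32 byte) boundary while the working 4019M estimate lies below it.

```python
MIB = 1024 * 1024          # bytes per MiB
U32_LIMIT = 2 ** 32        # first byte count that no longer fits in 32 bits


def wrap_u32(n_bytes):
    """Value left over when a byte count is stored in an unsigned 32-bit int."""
    return n_bytes & 0xFFFFFFFF


def fits_in_u32(n_mib):
    """True if a size given in MiB can be expressed as a 32-bit byte count."""
    return n_mib * MIB < U32_LIMIT


if __name__ == "__main__":
    total = 5032706048  # bytes, from the deviceQuery output quoted above
    print("true size   :", total // MIB, "MiB")
    print("32-bit value:", wrap_u32(total) // MIB, "MiB")
    # Estimates printed by CUDAPm1 in the two runs above.
    print("4019M fits in 32 bits:", fits_in_u32(4019))
    print("4395M fits in 32 bits:", fits_in_u32(4395))
```

Whether the out-of-memory error really comes from a 32-bit size somewhere is still a guess at this point in the thread.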
This is using the 64-bit binary? It looks suspicious that it fails crossing 4096M.
When running the second case, check whether it really uses about 4019M. That value is just an estimate based on what we expect cuFFT to use. If that's accurate, then CUDA is lying when it says we can use 5746M, since it uses well below that. If this is the 64-bit binary, then I may need to limit memory use to 4096M on Windows. This won't affect most users. :-) In the meantime, you can increase the FFT size until nrp drops to 240, or decrease d to 210 using -d2 210. |
Yes, it's the 64-bit binary.
I used MSI Afterburner to measure it: 273MB used before CUDAPm1, 401MB during stage 1, and 4345MB during stage 2. 4345 - 273 = 4072, which is "close" to the reported 4019MB. This is what I call the MSI Afterburner delta method. ProcessXP reported committed GPU memory as 4,116,400K. CUDA may well be reporting free memory correctly, but for some reason it can't allocate a whole chunk of more than ~4096MB. That's my guess. Using -d2 dropped vRAM usage to 615MB :smile: |
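The arithmetic in the post above can be written out as a short sketch. The helper names are made up, and it assumes the "K" in the ProcessXP figure means units of 1024 bytes. Under that assumption, the committed-memory figure converts to about 4019.9M, which lines up with the 4019M estimate CUDAPm1 printed.

```python
def delta_mb(before_mb, during_mb):
    """The 'delta method': memory in use during the run minus the baseline."""
    return during_mb - before_mb


def kib_to_mib(kib):
    """Convert a figure reported in K (assumed to be KiB) to MiB."""
    return kib / 1024


if __name__ == "__main__":
    # Figures from the post: 273MB idle, 4345MB during stage 2.
    print("Afterburner delta:", delta_mb(273, 4345), "MB")
    # Committed GPU memory reported as 4,116,400K.
    print("ProcessXP figure : %.1f MiB" % kib_to_mib(4116400))
```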
Is there a switch to make the on-screen output less frequent? (Low exponent without stage 2 done.) I get a new line every 2 seconds in stage 1 and it's a tad annoying.
|
The -c flag?
|
thanks
|
Running it on the 580 ftw with the default worktodo, it starts out well:
[code]
CUDAPm1 v0.10
Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports 2766M of 3072M GPU memory free.
Using e=6, d=2310, nrp=80
Using approximately 2529M GPU memory.
Starting stage 1 P-1, M61262347, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K
Doing 873133 iterations
Iteration 1000 M61262347, 0xf19a7f6041953a97, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:09 real, 9.1117 ms/iter, ETA 2:12:26)
Iteration 2000 M61262347, 0xaf1d15aad49fcee8, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.7928 ms/iter, ETA 1:24:06)
Iteration 3000 M61262347, 0xb702298e7a8c9a8e, n = 3360K, CUDAPm1 v0.10 err = 0.19922 (0:06 real, 5.9176 ms/iter, ETA 1:25:49)
Iteration 4000 M61262347, 0xc53d1695707d3dc0, n = 3360K, CUDAPm1 v0.10 err = 0.19141 (0:06 real, 5.8142 ms/iter, ETA 1:24:13)
[/code]
I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones. |
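As a sanity check on the ETA figures in the log above, they appear consistent with "remaining iterations times the latest ms/iter". The sketch below is an outside reconstruction, not code from CUDAPm1: the function name is made up, and truncating to whole seconds is an assumption that happens to reproduce the four logged values.

```python
def eta(total_iters, done_iters, ms_per_iter):
    """Estimate remaining time as H:MM:SS from the latest per-iteration timing."""
    # Assumption: fractional seconds are truncated, not rounded.
    seconds = int((total_iters - done_iters) * ms_per_iter / 1000)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    total = 873133  # "Doing 873133 iterations"
    for done, ms in [(1000, 9.1117), (2000, 5.7928), (3000, 5.9176), (4000, 5.8142)]:
        print(f"Iteration {done}: ETA {eta(total, done, ms)}")
```

The first ETA is roughly an hour too long because the first 1000 iterations ran at 9.1 ms/iter, so the later readings are the ones to go by.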
[QUOTE=Aramis Wyler;339347]I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones.[/QUOTE]
I get the distinct impression that we'll be losing some more GPU TFing firepower shortly... Not that I'm complaining! Nice work Karl, Fred et al! :smile: |
[QUOTE=James Heinrich;339025]Is the B2>=390390 a fixed limitation, or tied to the exponent, or FFT, or...?[/QUOTE]Does anyone have insight as to the minimum limits for B1/B2?
|
nrp must be a divisor of phi(d); a seg fault is likely otherwise.
With p = smallest prime that does not divide d:
b1 < b2 / p / 53 will not pair some smaller primes, so it will possibly give incorrect results.
b2 / p < d * (2 * e + 1) will give incorrect results.
b2 / p < b1 will produce a seg fault at the onset of stage 2. |
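The rules above can be turned into a small checker. This is a sketch built only from the post, not taken from the CUDAPm1 source, and all function names are made up. One encouraging sign: for d=2310 the smallest prime not dividing d is 13, so the second rule gives a minimum B2 of 13 * 2310 * (2 * 12 + 1) = 750750 for e=12 and 13 * 2310 * (2 * 6 + 1) = 390390 for e=6, which match the "B2 should be at least 750750" message in the logs and the 390390 figure asked about earlier. That the program computes its limit exactly this way is still an assumption.

```python
from math import gcd


def smallest_prime_not_dividing(d):
    """Smallest prime p with d % p != 0 (trial division is fine at this size)."""
    p = 2
    while True:
        is_prime = all(p % q for q in range(2, int(p ** 0.5) + 1))
        if is_prime and d % p != 0:
            return p
        p += 1


def phi(n):
    """Euler's totient by direct counting; adequate for small d such as 2310."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def min_b2(d, e):
    """Smallest B2 that satisfies b2 / p >= d * (2 * e + 1)."""
    return smallest_prime_not_dividing(d) * d * (2 * e + 1)


def check_bounds(b1, b2, d, e, nrp):
    """Return a list of rule violations; an empty list means no rule is broken."""
    p = smallest_prime_not_dividing(d)
    problems = []
    if phi(d) % nrp != 0:
        problems.append("nrp does not divide phi(d): seg fault likely")
    # The inequalities are cross-multiplied to stay in exact integer arithmetic.
    if b1 * p * 53 < b2:
        problems.append("b1 < b2/p/53: some smaller primes will not be paired")
    if b2 < p * d * (2 * e + 1):
        problems.append("b2/p < d*(2e+1): incorrect results")
    if b2 < b1 * p:
        problems.append("b2/p < b1: seg fault at the onset of stage 2")
    return problems


if __name__ == "__main__":
    print(phi(2310), min_b2(2310, 12), min_b2(2310, 6))
    # Default bounds from the M61262347 run earlier in the thread.
    print(check_bounds(605000, 16637500, 2310, 6, 80))
    # The B1=2, B2=750750 test run trips the first b1/b2 rule.
    print(check_bounds(2, 750750, 2310, 12, 480))
```

Note that phi(2310) = 480, so the nrp values seen in the thread (480, 240, 80) all satisfy the first rule.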