[QUOTE=Karl M Johnson;339122]As a conclusion of this micro research, if you see approx. memory usage of >=3139MB, be sure that stage 2 will not work, even if you have a lot more than that.[/QUOTE]
Perhaps because it is a 32-bit binary? I'll try creating a 64-bit binary later today.
[QUOTE=kjaget;339141]Remove the #define sscanf sscanf_s line from parse.c. [/QUOTE]
Thanks!
New versions ...
Win32: [URL="https://www.dropbox.com/s/alz4xodjjend7bi/cudapm1_win32_20130503.zip"]https://www.dropbox.com/s/alz4xodjjend7bi/cudapm1_win32_20130503.zip[/URL]
x64: [URL="https://www.dropbox.com/s/gbs9pr3ily49ric/cudapm1_x64_20130503.zip"]https://www.dropbox.com/s/gbs9pr3ily49ric/cudapm1_x64_20130503.zip[/URL]
The x64 version should allow you to use more than 3GB (or 4GB, not sure which limit applies to GPU RAM) of memory if your card has that much. Also, the GCD at the end will likely be a bit faster, but it doesn't really take that long anyway. As usual, please let me know of problems.
[QUOTE=Stef42;339152]I'm getting a lot of cudaDeviceSynchronize() error 30...
Usually on high B2 values while only 400-500MB is used (low exponents). Why this might have happened: [url]http://stackoverflow.com/questions/12200994/cuda-runtime-api-error-30-repeated-kernel-calls[/url][/QUOTE] Hmmm. Try the 64-bit version to see if it makes any difference. If it persists, we can try adding cudaDeviceSynchronize() as well, but that seemed to be hit-or-miss in the discussions.
[QUOTE=frmky;339215]Hmmm. Try the 64-bit version to see if it makes any difference. If it persists, we can try adding cudaDeviceSynchronize() as well, but that seemed to be hit-or-miss in the discussions.[/QUOTE]
I have done some tests; it looks very good. So far I have tested up to 30000000 on the B2 value. I'll do some further testing, but no errors anymore.
With e=12 and using the 64-bit binary, the maximum working fft length seems to be 1024K.
Once it jumps to 1120K, stage 2 doesn't run due to not enough vRAM available.
[CODE]CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 1 -b2 1000 -e2 12 18900103
CUDAPm1 v0.10
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
CUDA reports 5746M of 6143M GPU memory free.
Using e=12, d=2310, nrp=480
Using approximately 4395M GPU memory.
B1 should be at least 2, increasing it.
B2 should be at least 750750, increasing it.
Starting stage 1
P-1, M18900103, B1 = 2, B2 = 750750, e = 12, fft length = 1120K
Doing 27 iterations
M18900103, 0xd9cdc4241fd69cb5, offset = 0, n = 1120K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 0:01
Starting stage 1 gcd.
M18900103 Stage 1 found no factor (P-1, B1=2, B2=750750, e=12, n=1120K CUDAPm1 v0.10)
Starting stage 2.
C:/Users/childers/Dropbox/NFS/cudapm1/build/cudapm1-code-21/cudapm1-code-21/trunk/CUDAPm1.cu(2640) : cudaSafeCall() Runtime API error 2: out of memory.[/CODE]
[CODE]CUDAPm1 -d 0 -threads 512 -c 10000 -t -polite 0 -b1 1 -b2 1000 -e2 12 18800137
CUDAPm1 v0.10
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
CUDA reports 5754M of 6143M GPU memory free.
Using e=12, d=2310, nrp=480
Using approximately 4019M GPU memory.
B1 should be at least 2, increasing it.
B2 should be at least 750750, increasing it.
Starting stage 1
P-1, M18800137, B1 = 2, B2 = 750750, e = 12, fft length = 1024K
Doing 27 iterations
M18800137, 0x2c4be40be0856b5b, offset = 0, n = 1024K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 0:00
Starting stage 1 gcd.
M18800137 Stage 1 found no factor (P-1, B1=2, B2=750750, e=12, n=1024K CUDAPm1 v0.10)
Starting stage 2.
Zeros: 26762, Ones: 45718, Pairs: 14576
itime: 71.914939, transforms: 1, average: 71914.939000
ptime: 42.435831, transforms: 95060, average: 0.446411
ETA: 0:00
Stage 2 complete, estimated total time = 1:54
Accumulated Product: M18800137, 0xda62bc92cb243523, n = 1024K, CUDAPm1 v0.10
Starting stage 2 gcd.
M18800137 Stage 2 found no factor (P-1, B1=2, B2=750750, e=12, n=1024K CUDAPm1 v0.10)[/CODE]
I remember Oliver (TheJudger) saying this:
[QUOTE=TheJudger;335116]just compile your code for 64bit and use "long long int" when printing the total amount of memory. :razz:
[CODE]./deviceQuery | grep global
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
  Total amount of global memory:                 4800 MBytes (5032706048 bytes)
[/CODE]Oliver[/QUOTE]
But the program already reports that 5754M of 6143M GPU memory [B]is[/B] free, so...what now?
This is using the 64-bit binary? It looks suspicious that it fails crossing 4096M.
When running the second case, check to see if it really uses about 4019M when running. That value is just an estimate based on what we expect cufft to use. If that's accurate, then CUDA is lying when it says we can use 5746M, since the allocation fails well below that. If this is the 64-bit binary, then I may need to limit memory use to 4096M on Windows. This won't affect most users. :-) In the meantime, you can increase the fft size until nrp drops to 240, or decrease d to 210 using -d2 210.
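For what it's worth, the two "Using approximately ..." figures quoted above fit a simple back-of-the-envelope model: roughly (nrp + 2e) double-precision buffers of one fft length each. This is only an empirical fit to the two runs in this thread, not CUDAPm1's actual formula; the function name and the 2e overhead term are my own guesses.

```python
# Rough model of CUDAPm1's stage-2 memory estimate (an empirical fit to the
# two runs quoted above -- NOT the program's actual formula): about
# (nrp + 2*e) buffers, each one fft length of doubles.

def stage2_mem_estimate(fft_len_k, nrp, e):
    """Estimated stage-2 GPU memory in bytes (model, not ground truth)."""
    fft_len = fft_len_k * 1024          # fft length in doubles
    buffers = nrp + 2 * e               # relative-prime residues plus scratch (guess)
    return buffers * fft_len * 8        # 8 bytes per double

for fft_k, nrp, e, reported_mb in [(1120, 480, 12, 4395), (1024, 480, 12, 4019)]:
    est_mb = stage2_mem_estimate(fft_k, nrp, e) / 2**20
    print(f"fft={fft_k}K nrp={nrp} e={e}: model {est_mb:.0f}M vs reported {reported_mb}M")
```

Both cases land within about half a percent of the reported figures; on this model, dropping nrp from 480 to 240 roughly halves stage-2 memory, which is consistent with the workaround suggested above.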
Yes, it's the 64-bit binary.
I used MSI Afterburner to measure it: 273MB used before CUDAPm1, 401MB at first stage, 4345MB at second stage. 4345-273 = 4072, which is "close" to the reported 4019MB. This is what I call the MSI Afterburner delta method. Process Explorer reported committed GPU memory as 4,116,400K. CUDA might well report free memory correctly, but for some reason it can't allocate a whole chunk of more than ~4096MB. That's my guess. Using -d2 dropped vRAM usage to 615MB:smile:
Is there a switch to make the onscreen output less frequent? (Low exponent without stage 2 done.) I get a new line every 2 seconds in stage 1 and it's a tad annoying.
The -c flag?
thanks
Running it on the 580 ftw with the default worktodo, it starts out well:
[code]CUDAPm1 v0.10
Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports 2766M of 3072M GPU memory free.
Using e=6, d=2310, nrp=80
Using approximately 2529M GPU memory.
Starting stage 1
P-1, M61262347, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K
Doing 873133 iterations
Iteration 1000 M61262347, 0xf19a7f6041953a97, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:09 real, 9.1117 ms/iter, ETA 2:12:26)
Iteration 2000 M61262347, 0xaf1d15aad49fcee8, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.7928 ms/iter, ETA 1:24:06)
Iteration 3000 M61262347, 0xb702298e7a8c9a8e, n = 3360K, CUDAPm1 v0.10 err = 0.19922 (0:06 real, 5.9176 ms/iter, ETA 1:25:49)
Iteration 4000 M61262347, 0xc53d1695707d3dc0, n = 3360K, CUDAPm1 v0.10 err = 0.19141 (0:06 real, 5.8142 ms/iter, ETA 1:24:13)
[/code]I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones.
[QUOTE=Aramis Wyler;339347]I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones.[/QUOTE]
I get the distinct impression that we'll be losing some more GPU TFing firepower shortly... Not that I'm complaining! Nice work Karl, Fred et al! :smile:
[QUOTE=James Heinrich;339025]Is the B2>=390390 a fixed limitation, or tied to the exponent, or FFT, or...?[/QUOTE]Does anyone have insight as to the minimum limits for B1/B2?
nrp must be a divisor of phi(d), a seg fault is likely otherwise.
with p = smallest prime that does not divide d:
b1 < b2 / p / 53 will not pair some smaller primes, so will possibly give incorrect results.
b2 / p < d * (2 * e + 1) will give incorrect results.
b2 / p < b1 will produce a seg fault at the onset of stage 2.
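For anyone wanting to sanity-check bounds before launching a long run, those rules can be expressed directly. This is a rough sketch of my own; the function names and warning texts are mine, not CUDAPm1's:

```python
def phi(n):
    """Euler's totient by trial division."""
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result

def smallest_prime_not_dividing(d):
    """Smallest prime p with d % p != 0."""
    p = 2
    while True:
        if d % p != 0 and all(p % q for q in range(2, int(p**0.5) + 1)):
            return p
        p += 1

def check_bounds(b1, b2, d=2310, e=6, nrp=None):
    """Apply the stage-2 parameter rules quoted above; returns a list of problems."""
    problems = []
    if nrp is not None and phi(d) % nrp != 0:
        problems.append("nrp does not divide phi(d) -- likely seg fault")
    p = smallest_prime_not_dividing(d)
    if b1 < b2 / p / 53:
        problems.append("b1 < b2/p/53 -- some smaller primes unpaired, possibly wrong results")
    if b2 / p < d * (2 * e + 1):
        problems.append("b2/p < d*(2e+1) -- incorrect results")
    if b2 / p < b1:
        problems.append("b2/p < b1 -- seg fault at the onset of stage 2")
    return problems

# The default assignment from later in the thread passes all checks:
print(check_bounds(605000, 16637500, d=2310, e=6, nrp=80))  # []
```

Note that for the default d = 2310 = 2*3*5*7*11, phi(d) = 480 and the smallest prime not dividing d is 13, which is where the nrp values of 480/240/80 in the logs come from.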
[QUOTE=frmky;339357]nrp must be a divisor of phi(d), a seg fault is likely otherwise.[/QUOTE]Thanks. However, I didn't really understand much, if any, of that (my fault, not yours) :no:
Could you rephrase in [i]really[/i] simple words such that even I could understand, please? :smile: Is B2=390390 a fixed lower limit across all exponent sizes, or what is it tied to?
I'm working on removing those restrictions, but at my usual tectonic pace.
Edit: b2 < 390390 with e = 6 and d = 2310 currently results in unsigned data wrapping negative when initializing the data for a pass.
Here's an update for Windows x64 that limits single allocations to 4 GB.
[URL="https://www.dropbox.com/s/rwfmc23bt6ed3l5/cudapm1_x64_20130505.zip"]https://www.dropbox.com/s/rwfmc23bt6ed3l5/cudapm1_x64_20130505.zip[/URL]
Where might I find the latest release for x64 Linux? Thanks...
[QUOTE=frmky;339390]Here's an update for Windows x64 that limits single allocations to 4 GB.
[URL]https://www.dropbox.com/s/rwfmc23bt6ed3l5/cudapm1_x64_20130505.zip[/URL][/QUOTE]
Works fine, Primenet accepted it.
[CODE]M62456171 Stage 2 found no factor (P-1, B1=625000, B2=17187500, e=6, n=3456K CUDAPm1 v0.10)[/CODE]
[QUOTE=Aramis Wyler;339347]Running it on the 580 ftw with the default worktodo, it starts out well:
[code]CUDAPm1 v0.10
Selected B1=605000, B2=16637500, 4.1% chance of finding a factor
CUDA reports 2766M of 3072M GPU memory free.
Using e=6, d=2310, nrp=80
Using approximately 2529M GPU memory.
Starting stage 1
P-1, M61262347, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K
Doing 873133 iterations
Iteration 1000 M61262347, 0xf19a7f6041953a97, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:09 real, 9.1117 ms/iter, ETA 2:12:26)
Iteration 2000 M61262347, 0xaf1d15aad49fcee8, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.7928 ms/iter, ETA 1:24:06)
Iteration 3000 M61262347, 0xb702298e7a8c9a8e, n = 3360K, CUDAPm1 v0.10 err = 0.19922 (0:06 real, 5.9176 ms/iter, ETA 1:25:49)
Iteration 4000 M61262347, 0xc53d1695707d3dc0, n = 3360K, CUDAPm1 v0.10 err = 0.19141 (0:06 real, 5.8142 ms/iter, ETA 1:24:13)
[/code]I am thrilled to report that CUDAPm1 doesn't make my video card screech like CUDALucas does. I'll report back in an hour or so when it gets to the end of its ETA. Those B1, B2, and e were the default ones.[/QUOTE]
I wasn't able to post this last night, but I'm posting it now. It didn't end well. The file path in the error doesn't exist on my computer, so I'm not sure if that matters.
[code]Iteration 872000 M61262347, 0xdf828a4cb19fc49d, n = 3360K, CUDAPm1 v0.10 err = 0.20313 (0:06 real, 6.1631 ms/iter, ETA 0:06)
Iteration 873000 M61262347, 0x92b46441f57f0dc1, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 6.2686 ms/iter, ETA 0:00)
M61262347, 0xfd7ab9d857ea4a36, offset = 0, n = 3360K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 1:30:47
Starting stage 1 gcd.
M61262347 Stage 1 found no factor (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.10)
Starting stage 2.
Zeros: 748288, Ones: 847712, Pairs: 172477
itime: 36.417320, transforms: 1, average: 36417.320000
ptime: 997.899725, transforms: 322686, average: 3.092479
ETA: 1:26:11
itime: 44.675752, transforms: 1, average: 44675.752000
ptime: 999.123688, transforms: 322970, average: 3.093550
ETA: 1:09:16
itime: 48.428101, transforms: 1, average: 48428.101000
C:/Users/childers/Dropbox/NFS/cudapm1/build/cudapm1-code-21/cudapm1-code-21/trunk/CUDAPm1.cu(2757) : cudaDeviceSynchronize() Runtime API error 30: unknown error.
[/code]
[QUOTE=NBtarheel_33;339409]Where might I find the latest release for x64 Linux? Thanks...[/QUOTE]I don't know if any Linux executables have been distributed.
The source (SVN repository) is here: [url]http://sourceforge.net/projects/cudapm1/?source=directory[/url]
Windows versions are archived here: [url]http://download.mersenne.ca/CUDAPm1/[/url] (if there are any source archives and/or non-Windows precompiled versions I can put them there too)
[QUOTE=James Heinrich;339422]I don't know if any Linux executables have been distributed.
The source (SVN repository) is here: [url]http://sourceforge.net/projects/cudapm1/?source=directory[/url] Windows versions are archived here: [url]http://download.mersenne.ca/CUDAPm1/[/url] (if there are any source archives and/or non-Windows precompiled versions I can put them there too)[/QUOTE] The "problem" with Linux is that the executable is dependent on the CUDA SDK and drivers. Luigi
Benchmark GTX 560
It works fine on my GTX 560 [URL]http://gpuz.techpowerup.com/13/05/06/4nm.png[/URL]
i7 2600k @ 3.4GHz (no OC), Windows 7 Home Premium x64 Edition, Service Pack 1
[CODE]M61262347 has a factor: 195362848474407049033033 (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.10)[/CODE]
[QUOTE=Aramis Wyler;339421]It didn't end well. The file path in the error doesn't exist on my computer, so I'm not sure if that matters.[/QUOTE]I got the Windows version of the same error, about 15 minutes (~20%) into Stage 2:[code]Starting stage 2.
Zeros: 420427, Ones: 504053, Pairs: 103490
itime: 9.174569, transforms: 1, average: 9174.569000
ptime: 163.941441, transforms: 65240, average: 2.512898
ETA: 1:06:21
...
itime: 12.294913, transforms: 1, average: 12294.913000
ptime: 163.801865, transforms: 65062, average: 2.517627
ETA: 52:33
itime: 12.556195, transforms: 1, average: 12556.195000
C:/Users/gchilders/Downloads/cudapm1-code-21/cudapm1-code-21/trunk/CUDAPm1.cu(2752) : cudaDeviceSynchronize() Runtime API error 30: unknown error.[/code]
Does the checkpointing feature not work yet?
It doesn't, so if you get an error, you have to start from the very beginning.
As for the error, I remember getting the same error (unknown error 30, which was explained by Oliver in an adjacent thread) with CUDALucas from time to time. What's curious is that after that error the core clock refused to go higher than 525 MHz, while the memory clock remained the same. It could be solved by a reboot, as the WDDM timeout is disabled. So, to prevent that pesky error from happening, I had to downclock the GPU's memory.
[QUOTE=Aramis Wyler;339421]It didn't end well.[/QUOTE]
Well, good news. I ran the program again on default settings and it failed during the third ptime section again. But I looked at the error and thought that since it was the sync function, maybe the problem was related to CPU usage. I turned off prime95 and ran the thing again, and sure enough it completed:
[code]Iteration 871000 M61262347, 0x268993cb3b899d21, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.8528 ms/iter, ETA 0:12)
Iteration 872000 M61262347, 0xdf828a4cb19fc49d, n = 3360K, CUDAPm1 v0.10 err = 0.20313 (0:06 real, 5.8514 ms/iter, ETA 0:06)
Iteration 873000 M61262347, 0x92b46441f57f0dc1, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.8502 ms/iter, ETA 0:00)
M61262347, 0xfd7ab9d857ea4a36, offset = 0, n = 3360K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 1:25:43
Starting stage 1 gcd.
M61262347 Stage 1 found no factor (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.10)
Starting stage 2.
Zeros: 748288, Ones: 847712, Pairs: 172477
itime: 34.363595, transforms: 1, average: 34363.595000
ptime: 945.240834, transforms: 322686, average: 2.929290
ETA: 1:21:38
itime: 42.020420, transforms: 1, average: 42020.420000
ptime: 948.034002, transforms: 322970, average: 2.935363
ETA: 1:05:39
itime: 45.361230, transforms: 1, average: 45361.230000
ptime: 942.894161, transforms: 321866, average: 2.929462
ETA: 49:17
itime: 46.910720, transforms: 1, average: 46910.720000
ptime: 942.954856, transforms: 322050, average: 2.927977
ETA: 32:53
itime: 48.828722, transforms: 1, average: 48828.722000
ptime: 942.975542, transforms: 322794, average: 2.921292
ETA: 16:27
itime: 49.518076, transforms: 1, average: 49518.076000
ptime: 941.506243, transforms: 322458, average: 2.919780
ETA: 0:00
Stage 2 complete, estimated total time = 1:38:50
Accumulated Product: M61262347, 0xb7550a14cb5172b6, n = 3360K, CUDAPm1 v0.10
Starting stage 2 gcd.
M61262347 Stage 2 found no factor (P-1, B1=605000, B2=16637500, e=6, n=3360K CUDAPm1 v0.10)[/code]
Though I don't know if it was supposed to find a factor. :)
I think there might be a problem somewhere with the calculations, because looking closer I see that after cudapm1 finished the default assignment, it went on and did a number that I had put in there from one of my old pm1 assignments:
[code]Selected B1=530000, B2=12985000, 3.11% chance of finding a factor
CUDA reports 2777M of 3072M GPU memory free.
Using e=6, d=2310, nrp=80
Using approximately 2529M GPU memory.
Starting stage 1
P-1, M61394569, B1 = 530000, B2 = 12985000, e = 6, fft length = 3360K
Doing 764962 iterations
Iteration 1000 M61394569, 0x8888b22cb0159fe4, n = 3360K, CUDAPm1 v0.10 err = 0.20703 (0:09 real, 9.1390 ms/iter, ETA 1:56:21)
Iteration 2000 M61394569, 0x22ce4679c47bde53, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.8427 ms/iter, ETA 1:14:17)
Iteration 3000 M61394569, 0x4199d13a32c43ec1, n = 3360K, CUDAPm1 v0.10 err = 0.19531 (0:06 real, 5.8484 ms/iter, ETA 1:14:16)
...
Iteration 762000 M61394569, 0x72f2c43f0662fa7d, n = 3360K, CUDAPm1 v0.10 err = 0.22949 (0:06 real, 5.8454 ms/iter, ETA 0:17)
Iteration 763000 M61394569, 0x5d768a7b9cc19fc1, n = 3360K, CUDAPm1 v0.10 err = 0.19727 (0:06 real, 5.8017 ms/iter, ETA 0:11)
Iteration 764000 M61394569, 0xa9c8c0938a1354e6, n = 3360K, CUDAPm1 v0.10 err = 0.20313 (0:05 real, 5.8006 ms/iter, ETA 0:05)
M61394569, 0xe6ed39c645d90fd3, offset = 0, n = 3360K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 1:14:26
Starting stage 1 gcd.
M61394569 Stage 1 found no factor (P-1, B1=530000, B2=12985000, e=6, n=3360K CUDAPm1 v0.10)
Starting stage 2.
Zeros: 576475, Ones: 669125, Pairs: 135475
itime: 34.168611, transforms: 1, average: 34168.611000
ptime: 742.552935, transforms: 254220, average: 2.920907
ETA: 1:04:43
itime: 41.946698, transforms: 1, average: 41946.698000
ptime: 743.830499, transforms: 254674, average: 2.920716
ETA: 52:04
itime: 45.455219, transforms: 1, average: 45455.219000
ptime: 740.867106, transforms: 253650, average: 2.920824
ETA: 39:08
itime: 46.824025, transforms: 1, average: 46824.025000
ptime: 741.681265, transforms: 253924, average: 2.920879
ETA: 26:08
itime: 48.740888, transforms: 1, average: 48740.888000
ptime: 743.663183, transforms: 254586, average: 2.921069
ETA: 13:05
itime: 49.611376, transforms: 1, average: 49611.376000
ptime: 742.008431, transforms: 254036, average: 2.920879
ETA: 0:00
Stage 2 complete, estimated total time = 1:18:41
Accumulated Product: M61394569, 0xc7cca920aa444fbe, n = 3360K, CUDAPm1 v0.10
Starting stage 2 gcd.
M61394569 Stage 2 found no factor (P-1, B1=530000, B2=12985000, e=6, n=3360K CUDAPm1 v0.10)[/code]
Problem there is that when I ran that with p95, it did find a factor:
[Tue Apr 30 13:09:28 2013] P-1 found a factor in stage #2, B1=580000, B2=12035000, E=12.
UID: staffen/Romeo, M61394569 has a factor: 189843460261039170580823, AID: cc392de5c69eef9aeaf12ea5c839f9e7
Now, I see that in the p95 run E was 12 (vs 6 for cudapm1), but the bounds were actually smaller than with cudapm1.
The first one should have found a factor. I'm testing the second exponent to check if we get the same residues. If so, there's definitely something wrong in the calculations.
Edit: On the first three iterations, the residues match but the errors are different.
[CODE]Processing 457 - 480 of 480 relative primes
itime: 18.458630, transforms: 6906, average: 2.672840
ptime: 148.896680, transforms: 52262, average: 2.849043
ETA: 0:00
Stage 2 complete, estimated total time = 55:19
Accumulated Product: M61394569, 0xe849edfe1bbc661b, n = 3360K, CUDAPm1 v0.10
Starting stage 2 gcd.
M61394569 Stage 2 found no factor (P-1, B1=530000, B2=6890000, e=6, n=3360K CUDAPm1 v0.10)
[/CODE]Ran it myself, no factor found either.
Stef42, if you still have the full output of that run, could you pm them to me?
I would love to, but I have closed the command prompt already.
Still, it only shows the last part of stage 1. Could you suggest a logging function/tool?
Never mind then. What I really wanted to do was compare your residues with mine. Any part towards the end of stage 1 would have been sufficient. Aramis Wyler's and mine disagree at iteration 45000.
[QUOTE=owftheevil;339558]Never mind then. What I realy wanted to do was compare your residues with mine. Any part towards the end of stage 1 would have been sufficient. Aramis Wyler's and mine disagree at iteration 45000.[/QUOTE]
I've run this exponent up to iteration 50,000 for you: [url]https://dl.dropboxusercontent.com/u/27359940/CUDAPm1%2050000.txt[/url] Funny thing is, I ran the program with 0% CPU usage. I'll check tonight whether CPU load (perhaps) makes a difference...
Hmm, don't bother. (I was gonna report the first few iterations, which seem useless now.)
As to the cudaDeviceSynchronize errors people are seeing, I'm almost convinced it is an Nvidia driver bug. On Linux, I'm getting something similar, only it's a timeout error (error 6) instead of an unidentified error. Here's what I know about it.
1. It occurs only with Nvidia drivers with version number >= 300.
2. It occurs only if the card CPm1 is running on is also driving the display.
3. cufftbench sees similar errors, which I have traced back to a cufftPlan1d call being unable to allocate resources.
It's as if the driver, going about its business managing the display, is interfering with cufft or some other kernel in CPm1. I need more testing on my card that is not driving the display, and I am also going to make a test version that does away with all the error checking and host synchronizing to see which call is actually failing.
[QUOTE=Stef42;339559]I've got this exponent until iteration 50.000 run for you.
[URL]https://dl.dropboxusercontent.com/u/27359940/CUDAPm1%2050000.txt[/URL][/QUOTE]
I'm sorry, but my memory was wrong. It was iteration 450000 where the residues began to diverge. Yours match both of ours up to 50000. The reason that's important is that I'm at work now and have no access to my results, but I do have access to his. Thanks for the input.
[QUOTE=owftheevil;339562]I'm sorry, but my memory was wrong. It was iteration 450000 where the residues began to diverge. Yours match both of ours up to 50000. The reason thats important is that I'm at work now and have no access to my results, but I do have access to his. Thanks for the input.[/QUOTE]
Then I'll see if I can get it up to 450000 in a few hours...
[QUOTE=owftheevil;339561]As to the cudaDevice Synchronize errors people are seeing, I'm almost convinced it is an Nvidia driver bug. On Linux, I'm getting something similar, only its a timeout error (error 6) instead of an unidentified error.[/QUOTE]
I thought the fact that it only happened (for me at least) when the CPU was occupied was important, considering Chasall's problems cropping up only when the CPU was busy as well. Also, if the purpose of the call is directly related to the CPU (I believe it was described as keeping the CPU from running a loop while waiting?), the inverse situation, where the CPU was already too busy to run a wait loop anyway, might be relevant. The 580 is my main display card though - I'll try running the program on the 480 and see what it does. It has 1.5GB of memory, which should be enough I hope. EDIT: The quote I was trying to remember: [QUOTE=owftheevil;339389]The different kernels run synchronously, the cutilSafeThreadSync call is so the cpu doesn't do busy waiting and eat up an entire cpu core.[/QUOTE] I had the wrong function call in mind.
My run found the factor:
[CODE]Iteration 764000 M61394569, 0xa524c6ae8ad4a231, n = 3360K, CUDAPm1 v0.10 err = 0.21777 (0:07 real, 6.9230 ms/iter, ETA 0:06)
M61394569, 0x30c664a860055a8f, n = 3360K, CUDAPm1 v0.10
Stage 1 complete, estimated total time = 1:28:32
.
.
.
Accumulated Product: M61394569, 0x80e4aa01c3bb4d17, n = 3360K, CUDAPm1 v0.10
Starting stage 2 gcd.
M61394569 has a factor: 189843460261039170580823 (P-1, B1=530000, B2=12985000, e=6, n=3360K CUDAPm1 v0.10)
[/CODE]@Aramis Wyler: I think the error coming at high CPU load and not at low CPU load is coincidence, although I'm not ruling anything out yet. That's the reason I want to do some runs with the explicit and implicit host synchronizations removed.
Here's a couple more that can be used for testing. I found these overnight.
[CODE]M61747963 has a factor: 13383883517343994527281 (P-1, B1=610000, B2=610000, e=6, n=3584K CUDAPm1 v0.10)
M61829329 has a factor: 894781313041001886421561 (P-1, B1=615000, B2=16912500, e=6, n=3584K CUDAPm1 v0.10)
[/CODE]The first can be found in stage 1 with B1 = 3750. The second can be found with B1 = 750, B2 = 2750000.
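For anyone curious why a small B1 can already find such factors: stage 1 of P-1 just computes base^E mod M_p, where E is the product of all prime powers up to B1 (times 2p, since any factor q of M_p satisfies q ≡ 1 mod 2p), then takes a GCD. A toy sketch on M29 -- not CUDAPm1's implementation, and the names are my own:

```python
from math import gcd, log

def pm1_stage1(p, B1, base=3):
    """Toy stage 1 of P-1 on the Mersenne number N = 2^p - 1.

    Any prime factor q of N satisfies q = 2*k*p + 1, so 2p is included in
    the exponent for free.  A factor is found when q - 1 is B1-smooth
    apart from that forced 2p part.
    """
    N = (1 << p) - 1
    E = 2 * p                            # factors of M_p are 1 mod 2p
    q = 2
    while q <= B1:                       # fold in every prime power <= B1
        if all(q % r for r in range(2, int(q**0.5) + 1)):   # q is prime
            E *= q ** int(log(B1) / log(q))                 # largest power <= B1
        q += 1
    return gcd(pow(base, E, N) - 1, N)

# M29 = 233 * 1103 * 2089.  233-1 = 2^3*29 and 2089-1 = 2^3*3^2*29 are
# 10-smooth apart from the forced factor 29, but 1103-1 = 2*19*29 is not,
# so B1 = 10 pulls out the composite cofactor 233*2089:
f = pm1_stage1(29, 10)
print(f, (1 << 29) - 1 == f * 1103)      # 486737 True
```

The same mechanism is behind the B1 = 3750 example above, just at a scale where the modular squarings go through the GPU's FFT multiply instead of Python's pow.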
[QUOTE=owftheevil;339595]@ Aramis Wyler: I think the error coming at high cpu load and not at low cpu load is coincidence, although I'm not ruling anything out yet. That's the reason I want to do some runs with the explicit and implicit host synchronizations removed.[/QUOTE]
Indeed. My card appears to be the worst possible case for debugging -- it seems to be [B][I][U]just[/U][/I][/B] unstable. I have spent days running tests in different situations -- heavy CPU load, no CPU load. Low ambient temperatures, high ambient temperatures. Low vRAM usage, high vRAM usage. And all the various combinations of the above. I can find no correlation to the errors. While it's still possible that there is a software bug somewhere in the stack, the evidence seems to suggest I have a bad card.
[QUOTE=frmky;339214]New versions ...
Win32: [URL]https://www.dropbox.com/s/alz4xodjjend7bi/cudapm1_win32_20130503.zip[/URL] As usual, please let me know of problems.[/QUOTE] So, what exactly do I have to do after downloading? The .exe doesn't seem to work as-is for me...
I ran the same worktodo as before on my secondary card (a 480) to avoid the conditions mentioned earlier (used for display, etc). It didn't find a factor for M61262347, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K. It crashed trying to do M61394569, B1 = 605000, B2 = 16637500, e = 6, fft length = 3360K, but even more unfortunate for me is that it used different bounds than my 580 did, so the residues are completely different. I'll have to run it again and specify the bounds. The error on crash:
Iteration 587000 M61394569, 0x689e7131d4d15b81, n = 3360K, CUDAPm1 v0.10 err = 0.20313 (0:07 real, 6.7823 ms/iter, ETA 32:20)
Iteration = 587400, err = 0.46094 >= 0.43, quitting. Estimated time spent so far: 1:06:31
C:/Users/childers/Dropbox/NFS/cudapm1/build/cudapm1-code-21/cudapm1-code-21/trunk/CUDAPm1.cu(1362) : cudaSafeCall() Runtime API error 17: invalid device pointer.
Though it looks to me like it didn't really crash: it quit because of a rounding error... and possibly then crashed. Full output is [URL="http://workforce.calu.edu/staffen/cudapm1-480.txt"]here[/URL].
Yeah, if it quits during stage 1, I still have it trying to free stage 2 device memory, which hasn't been allocated yet. Fix coming soon.
[QUOTE=c10ck3r;339620]So, what exactly do I have to do after downloading? The .exe doesn't seem to work as-is for me...[/QUOTE]
BUMP U M P !?!?! Thanks!
1. Go to [url=http://www.mersenne.org/manual_assignment/]this page[/url].
2. Select P-1 factoring.
3. Put whatever the server gives you into the worktodo.txt file, which is located in CUDAPm1's folder.
4. Create a batch file, say, run.bat, right click on it and select edit.
5. Paste the following there:
[CODE]CUDAPm1
pause[/CODE]
6. Save it, open CUDAPm1.ini, tweak your settings, save 'em.
7. Run the batch file. It will process one assignment at a time from the worktodo.txt file.
[QUOTE=c10ck3r;339620]So, what exactly do I have to do after downloading? The .exe doesn't seem to work as-is for me...[/QUOTE]
If you extract everything from the zip in the same directory, simply double-clicking the exe will run a test case that should find a factor. If that doesn't work, run it from a cmd prompt to see the error.
This version fixes the free without malloc bug when stage 2 isn't run. owftheevil is still looking into the occasional cufft hangs.
[URL="https://www.dropbox.com/s/xucdi2ht4rbe6go/cudapm1_x64_20130509.zip"]https://www.dropbox.com/s/xucdi2ht4rbe6go/cudapm1_x64_20130509.zip[/URL]
Before running CUDAPm1, it may be a good idea to test your GPU to verify that it can handle CUDALucas/CUDAPm1 calculations accurately. Here's a trial Windows x64 version of owftheevil's memtest:
[URL="https://www.dropbox.com/s/4lh34niqddm5tf8/CUDAmemtest_20130509.zip"]https://www.dropbox.com/s/4lh34niqddm5tf8/CUDAmemtest_20130509.zip[/URL]
I'm trying to make/compile the source on the big-iron Linux system, but it keeps complaining that gmp.h cannot be found. Any ideas?
Install gmp? (gmp.h comes with the development package, e.g. libgmp-dev.)
I get the message (in command prompt):
device_number >= device_count ... exiting (This is probably a driver problem)
with:
NVIDIA GeForce GTX 460, Memory 768 MB, Memory type 2, Driver version 6.14.13.0142
Is this owf's same problem?
I was now able to find the factor for M61394569
@owftheevil: here is the total output of the command line: [url]https://dl.dropboxusercontent.com/u/27359940/cudapm1%20complete.txt[/url] Funny thing is, I ran the program with 0% CPU usage. I'll be checking tonight whether CPU load (perhaps) makes a difference...
[QUOTE=c10ck3r;339968]I get the message (In command prompt):
device_number >= device_count ... exiting (This is probably a driver problem) with: NVIDIA GeForce GTX 460 Memory 768 MB Memory type 2 Driver version 6.14.13.0142 Is this owf's same problem?[/QUOTE] Change the "DeviceNumber=" parameter in CUDALucas.ini to 0 and/or 1 and see if anything happens. |
[QUOTE=Karl M Johnson;339970]Change the "DeviceNumber=" parameter in CUDALucas.ini to 0 and/or 1 and see if anything happens.[/QUOTE]
Nope :(
Sounds like a driver problem. Try downloading and reinstalling the latest cuda drivers from nVidia's website.
[QUOTE=frmky;340000]Sounds like a driver problem. Try downloading and reinstalling the latest cuda drivers from nVidia's website.[/QUOTE]
When doing the above I strongly recommend a careful removal of the existing drivers. Uninstall all the nVidia items in 'Programs and Features', doing the graphics driver last. If you want to be really thorough, download Driver Fusion here: [URL]http://treexy.com/thanks?file=/media/30691/driver_fusion_1.6.0.exe[/URL]

After uninstalling the drivers, boot into Safe Mode and run Driver Fusion. Select nVidia Display and nVidia PhysX and click 'Analyze'. When it completes, click Delete, OK, and Restart. You should then come up in normal mode with generic VGA drivers, probably at some horribly low resolution.

Disable your anti-virus temporarily. Run the driver setup as an Admin. Do a Custom install, and select only Graphics Driver, PhysX, and Clean Install. (An exception to this would be if you game with 3D glasses on. Then you have to add the two 3D drivers. Leave them out if you don't need them, as they consume memory to no good end.)

I actually go one step further and run a Registry Cleaner before I leave Safe Mode. I got the outline for this from some site, possibly Guru3D. The author insisted on some other steps, but even I found them excessive and probably not helpful. I certainly cleared up some odd problems by making sure all the old driver components were gone.
CUDAmemtest is running successfully on all my GTX 580's, almost finished, no errors so far (all water cooled, but GPU temperatures are much lower compared with CUDALucas or mfaktc).
I had a small problem caused by the fact that the cuxxx64_50_35.dll files are not included, but found them in the other zip (the one with the P-1 exe; not yet running, after the tests finish I will give it a try on a few expos).
Edit: for 1.5G of RAM (1536MB version) the maximum block value I can use without crashing is 52 blocks for cards driving the video, and 57 blocks for "free" cards. The theoretical value for this amount of RAM would be 60 blocks, but I believe it needs some "spare". Everything over 52/57 will crash the program, and at those limits everything becomes terribly slow. With 50/55 everything works fine, no errors. Also, with those values, the cards get "normal hot".
so I use a batch file like the following, to start all cards at the same time:
[code]rem Tests 60*25 = 1500 MB of memory on all the gpu's (1.5G=1536M each)
rem use max 52 for the card driving the video (everything becomes very slow!! use 50 better)
rem use max 57 for the other cards (everything becomes very slow!! use 55 or lower...)
start "CUDAmemtest_0" /LOW CUDAmemtest.exe 50 1000 0
start "CUDAmemtest_1" /LOW CUDAmemtest.exe 55 1000 1
....etc other cards not driving the video
[/code]
So, I updated my driver, changed device number back to "0", and dropped thread count to 32. One of those (or some combination) fixed it apparently.
|
[QUOTE=c10ck3r;340072]So, I updated my driver, changed device number back to "0", and dropped thread count to 32. One of those (or some combination) fixed it apparently.[/QUOTE]
Updating the drivers:smile: For good speed, set the threads equal to 256. |
[QUOTE=Karl M Johnson;340075]Updating the drivers:smile:
For good speeds, set the threads to be equal 256.[/QUOTE] It actually wouldn't run the exponent I manually reserved when using 32. Running @256 :) |
May be worth mentioning, the -f flag is ignored when CPm1 is executed.
|
[QUOTE=Karl M Johnson;340131]May be worth mentioning, the -f flag is ignored when CPm1 is executed.[/QUOTE]
The fft selection still works by adding the size at the end of each line in worktodo, so you can still tune the fft for each exponent. I found that the rules are the same as for cudaLucas, and did my usual tuning. For example, 3456k gives me a speed surplus of about 3.5-4.5% compared with the default one selected by the program (3360k), and the error also decreased from about 0.21 to 0.09, which is better.

I got 30 assignments from gpu72, did 10 of them, and for the other 20 I replaced the last ",2" on each line in worktodo with ",3,3456k" (i.e. besides specifying the fft, I made cudaPm1 believe it will save 3 LL tests, and therefore use a larger B1 in the default calculation). This increased the default B1 from about 605000 to 950000, and the time of stage 1 from an hour and a half to two and a half hours (I don't know the time of stage 2 yet, as B2 was also increased, from 16775000 to 28975000). This comes with a better chance of finding a factor, from about 4% to about 6% (I didn't do the calculation; I just know from past experience with P95 the effect of changing the last ",2" into ",3" or larger to increase the chance of finding a factor).

What is missing is checkpoints (I already had a restart and lost a few hours of work, due to a storm here - the rainy season is starting). And it is a must to have a save at the end of stage 1, in case some of us want to "extend" the B1 limit; the program should be able to reload that file and do a few more iterations. Of course, this would complicate things with the credit, as PrimeNet gives the credit in full, so a guy doing 10 iterations and reporting after every one would get the credit 10 times. :smile: |
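The worktodo edit described above (turning the trailing ",2" into ",3" and pinning an fft length) is easy to script. A rough sketch; the `retune()` helper and the sample line layout here are hypothetical, not taken from CUDAPm1's documentation:

```python
# Hypothetical sketch of the worktodo edit described above: replace the
# trailing ",2" (LL tests saved) with ",3" and append a forced fft length.
# The exact worktodo line layout is an assumption, not from the CUDAPm1 docs.
def retune(line, tests_saved=3, fft="3456k"):
    line = line.rstrip()
    if line.endswith(",2"):
        return line[:-2] + ",{},{}".format(tests_saved, fft)
    return line

# e.g. on a made-up Pfactor-style line:
print(retune("Pfactor=1,2,63137587,-1,76,2"))  # → Pfactor=1,2,63137587,-1,76,3,3456k
```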
[QUOTE=LaurV;340151] What is missing is checkpoints (I already had a restart and lost a few hours of work, due to a storm here - the rainy season is starting). And it is a must to have a save at the end of stage 1, in case some of us want to "extend" the B1 limit; the program should be able to reload that file and do a few more iterations. [B]Of course, this would complicate things with the credit, as PrimeNet gives the credit in full, so a guy doing 10 iterations and reporting after every one would get the credit 10 times.[/B] :smile:[/QUOTE]
This is already an issue with P95, so no real change would occur IIRC. |
[QUOTE=Karl M Johnson;340131]May be worth mentioning, the -f flag is ignored when CPm1 is executed.[/QUOTE]
Really? It should work. It does in Linux. I run it with something like CUDAPm1 -f 3456k |
Oh, right, with the K notation, it works.
However, I've noticed when the FFT size is manually specified, the algorithm that limits stage2 GPU memory to be <4096MB doesn't kick in. [CODE] R:\cudapm1_x64_20130505>CUDAPm1 -f 4096k CUDAPm1 v0.10 Selected B1=630000, B2=17325000, 4.17% chance of finding a factor CUDA reports 5618M of 6143M GPU memory free. Using e=6, d=2310, nrp=120 Using approximately [COLOR=Red]4364M[/COLOR] GPU memory. Starting stage 1 P-1, M62621347, B1 = 630000, B2 = 17325000, e = 6, fft length = 4096K Doing 908959 iterations [/CODE] |
[QUOTE=Karl M Johnson;340209]Oh, right, with the K notation, it works.
However, I've noticed when the FFT size is manually specified, the algorithm that limits stage2 GPU memory to be <4096MB doesn't kick in. [CODE] R:\cudapm1_x64_20130505>CUDAPm1 -f 4096k CUDAPm1 v0.10 Selected B1=630000, B2=17325000, 4.17% chance of finding a factor CUDA reports 5618M of 6143M GPU memory free. Using e=6, d=2310, nrp=120 Using approximately [COLOR=Red]4364M[/COLOR] GPU memory. Starting stage 1 P-1, M62621347, B1 = 630000, B2 = 17325000, e = 6, fft length = 4096K Doing 908959 iterations [/CODE][/QUOTE] Try it with a low B1 to see if stage 2 runs ok. It's not limiting total memory use to 4096MB. It's limiting the size of a single allocation to 4096MB. |
[URL="http://i.imgur.com/cKykkFl.png"]It does run ok[/URL], glad to see that!
However, there's this. vRAM downclocked to 5Ghz, still happens (the max). Technically, we should be able to run exponents with unusually high FFT length, right? [CODE] R:\cudapm1_x64_20130505>CUDAPm1 -f 10240k CUDAPm1 v0.10 Selected B1=630000, B2=17325000, 4.17% chance of finding a factor CUDA reports 5297M of 6143M GPU memory free. Using e=6, d=2310, nrp=48 Using approximately 5150M GPU memory. Starting stage 1 P-1, M62621963, B1 = 630000, B2 = 17325000, e = 6, fft length = 10240K Doing 908959 iterations Iteration = 100 >= 1000 && err = 0.5 >= 0.35, fft length = 10240K, writing checkpoint file (because -t is enabled) and exiting. Iteration = 100, err = 0.5 >= 0.43, quitting. Estimated time spent so far: 0:02[/CODE] |
No, too big an fft will cause errors too. I think it has to do with how far the carries get propagated.
|
Stage 1 save files are now implemented. It's not very polite in that it doesn't clean these up when it's done. Some of you will want to keep these for extending B1 later. I'm starting work on stage 2 save files and will figure out the cleanup when that's ready.
|
[QUOTE=owftheevil;340949]Stage 1 save files are now implemented. It's not very polite in that it doesn't clean these up when its done. Some of you will want to keep these for extending b1 later. I'm starting work on stage 2 save files and will figure out the cleanup when that's ready.[/QUOTE]
Do you have a Win 32-bit compiled version of this available? |
Not yet. frmky has been doing the windows builds. I don't know when he will have time to get to it.
|
Just wanted to mention that without frmky's help none of this would be available until later this summer or maybe even fall.
|
Windows binaries with latest changes, untested as usual.
Win32 [URL="https://www.dropbox.com/s/ecwuwbezul6t65m/cudapm1_win32_20130520.zip"]https://www.dropbox.com/s/ecwuwbezul6t65m/cudapm1_win32_20130520.zip[/URL] x64 [URL="https://www.dropbox.com/s/ik1g9eza96t767q/cudapm1_x64_20130520.zip"]https://www.dropbox.com/s/ik1g9eza96t767q/cudapm1_x64_20130520.zip[/URL] |
Thank you very much! :smile:
|
Latest and greatest 64 bit binary works here:smile:
Stopped and resumed a couple of times during stage 1, here are the end results on new whql forceware: [CODE]Accumulated product stage 1: M63137587, 0x1f2595c1236f31dc, n = 3456K, CUDAPm1 v0.10 Accumulated product stage 2: M63137587, 0x412ca727e7d21026, n = 3456K, CUDAPm1 v0.10[/CODE] |
Having trouble with CUDAPm1. When I use "-b1 3100000" on the command line it works, but it spends a long time in that CPU routine that computes the product. A Pari line like "n=3*10^6; lgn=log(n); z=prod(x=1,n,if(isprime(x),x^floor(lgn/log(x)),1)); ceil(log(z)/log(2))" returns in the same time, against all logic and reason (Pari should be much slower!).
But this is not the main problem. All values between 3200000 and 20M are parsed wrong: it says "B1 need to be at least 1" and does a test with B1=1 and B2=393xxx or so, which [B]does[/B] find a factor, if one exists for these values. I am not sure whether smaller values starting with 1 (like -b1 150000) are parsed wrong too. When I use a value of -b1 over 20M, it is parsed right (but never returns from the CPU multiplication routine, not even after half an hour). So, what are the restrictions for B1? Or are there any restrictions at all, and I am doing something completely silly? (I would like to run "CUDAPm1 160403 -b1 12000000 -b2 12000000", for example... The max value I can use is around 3M1, which is not enough; the former B1 was 10M. And that is ignoring the fact that it wants B2 to be 13 times higher than B1, which is total nonsense for these numbers.) Also, how can we "extend" a former B1? I tried the test cases CUDAPm1 58610467 -b1 70843 -b2 694201 and CUDAPm1 58610467 -b1 694201 -b2 694201; they both find the factor [edit: the first in stage 2, the second in stage 1, as is normal] if started from scratch (deleting the checkpoint file in between). But now assume I have done a run with the first, and when I run the second I want it to continue from where B1 left off. This is not possible, as the former B1 is recorded in the file: if I leave the file there, the program totally ignores my command line, says "found limits in the file", and only runs stage 2; if I delete the file, it obviously starts from scratch, duplicating most of the work. This is not what was intended when we talked about "extending B1". OTOH, resuming stage 1 works very nicely, and I believe it is only a matter of ignoring that former B1 stored in the file (I did not look into the sources, however; for the record, I use the Win7 64-bit binaries). Question: why are you doing that whole product in the beginning? 
You can do exponentiation for every prime; this would make it easy to "extend" the B1 limit, and you would not need to stress the CPU alone (the GPU is idle during this time, for [U]minutes[/U], depending on how big B1 is). [CODE] >CUDAPm1 630893 -b1 3100000 mkdir: cannot create directory `savefiles': File exists CUDA reports 1306M of 1535M GPU memory free. Using e=6, d=2310, nrp=480 Using approximately 155M GPU memory. B2 should be at least 390390, increasing it. B2 should be at least 40300000, increasing it. [COLOR=Red]<<<< here it stays about 2 minutes, the GPU is idle, the CPU is hard at work computing the product, then everything continues normally.[/COLOR] Starting stage 1 P-1, M630893, B1 = 3100000, B2 = 40300000, e = 6, fft length = 40K Doing 4471985 iterations Iteration 10000 M630893, 0x280b630169a8b5f7, n = 40K, CUDAPm1 v0.10 err = 0.00049 (0:17 real, 1.6675 ms/iter, ETA 2:04:00) Iteration 20000 M630893, 0xfb3b1f4975308539, n = 40K, CUDAPm1 v0.10 err = 0.00046 (0:01 real, 0.1044 ms/iter, ETA 7:44) Iteration 30000 M630893, 0xc90545f20507538b, n = 40K, CUDAPm1 v0.10 err = 0.00046 (0:01 real, 0.1039 ms/iter, ETA 7:41) Iteration 40000 M630893, 0x3ff1f732d6ebab86, n = 40K, CUDAPm1 v0.10 err = 0.00046 (0:01 real, 0.1041 ms/iter, ETA 7:41) [/CODE] |
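The Pari one-liner quoted above builds the standard stage 1 exponent: for every prime p <= B1, the largest power of p not exceeding B1. A self-contained sketch of that computation (an illustration of the math only, not CUDAPm1's actual code):

```python
def primes_upto(n):
    # simple sieve of Eratosthenes
    s = bytearray([1]) * (n + 1)
    s[0:2] = b"\x00\x00"
    for p in range(2, int(n ** 0.5) + 1):
        if s[p]:
            s[p * p :: p] = bytearray(len(s[p * p :: p]))
    return [p for p in range(2, n + 1) if s[p]]

def stage1_exponent(b1):
    # E = product over primes p <= b1 of the largest power of p
    # not exceeding b1 -- the same quantity the Pari line computes
    e = 1
    for p in primes_upto(b1):
        pk = p
        while pk * p <= b1:
            pk *= p
        e *= pk
    return e

print(stage1_exponent(10))  # → 2520 (= lcm(1..10))
```

Integer arithmetic is used throughout to avoid the floating-point `floor(log(b1)/log(p))` edge cases.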
LaurV, thanks for your input. I'll have time for a more complete response in about an hour, but for now I'll just say that most of what you are talking about hasn't been implemented yet, or hasn't been cleaned up yet. I was unaware of any problems parsing b1; I'll take a look as soon as I have time.
|
[QUOTE]Having trouble with CUDAPm1. When I use "-b1 3100000" on the command line it works, but it spends a long time in that CPU routine that computes the product. A Pari line like "n=3*10^6; lgn=log(n); z=prod(x=1,n,if(isprime(x),x^floor(lgn/log(x)),1)); ceil(log(z)/log(2))" returns in the same time, against all logic and reason (Pari should be much slower!). [/QUOTE]My lack of imagination strikes again. As in, who the heck would want to spend that much time doing P-1? Well, I know the answer to that question now. Currently, the computation of the products of powers of primes is rather inefficient. And now that I realize some people will want to use huge b1's, I should probably split large b1's into two parts: a reasonably long large exponent, and then piecewise smaller exponents to fill in the gap.
[QUOTE]But this is not the main problem. All values between 3200000 and 20M are parsed wrong: it says "B1 need to be at least 1" and does a test with B1=1 and B2=393xxx or so, which [B]does[/B] find a factor, if one exists for these values. I am not sure whether smaller values starting with 1 (like -b1 150000) are parsed wrong too. When I use a value of -b1 over 20M, it is parsed right (but never returns from the CPU multiplication routine, not even after half an hour). [/QUOTE]Like I said earlier, I was not aware of this problem. I'll look into it. [QUOTE]So, what are the restrictions for B1? Or are there any restrictions at all, and I am doing something completely silly? (I would like to run "CUDAPm1 160403 -b1 12000000 -b2 12000000", for example... The max value I can use is around 3M1, which is not enough; the former B1 was 10M. And that is ignoring the fact that it wants B2 to be 13 times higher than B1, which is total nonsense for these numbers.)[/QUOTE]Currently there are a few silly restrictions caused by my lack of boundary-case considerations in the initialization of stage 2. These are first on the list to be removed after stage 2 save files are working. Exactly what the restrictions are depends on many factors, so it's hard to say exactly how big b1 must be. If e is the B-S exponent, d is the primorial being used, and p is the smallest prime which does not divide d, then b2 / p <= b1 and b2 / p / d >= 2 * e + 1 are the primary restrictions. [QUOTE]Also, how can we "extend" a former B1?[/QUOTE]You can't yet. It's on the list of things to do. The code for splitting large b1's up will automatically provide most of this. [QUOTE]Question: why are you doing that whole product in the beginning? You can do exponentiation for every prime; this would make it easy to "extend" the B1 limit, and you would not need to stress the CPU alone (the GPU is idle during this time, for [U]minutes[/U], depending on how big B1 is).[/QUOTE]Speed. 
0's in the binary representation of the exponent require a squaring, 1's require an additional multiplication by the base. If the base is 3, this can be done with a modified normalization kernel with negligible increase in time, but with a huge integer base, it requires an additional fft multiplication. |
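The cost argument above can be sketched with a left-to-right binary powering loop: every bit costs a squaring, and each 1 bit after the leading one costs an extra multiplication by the base. A rough illustration (not CUDAPm1's kernel code, which does this with fft multiplications):

```python
def powmod_count(base, e, m):
    # Left-to-right binary exponentiation: every bit after the leading
    # one costs a squaring, and every 1 bit additionally costs a
    # multiplication by the base -- cheap when the base is 3, expensive
    # when the base is a huge accumulated product.
    b = base % m
    r = b
    squarings = multiplies = 0
    for bit in bin(e)[3:]:  # skip "0b" and the leading 1 bit
        r = r * r % m
        squarings += 1
        if bit == "1":
            r = r * b % m
            multiplies += 1
    return r, squarings, multiplies

# 427 = 0b110101011: 8 squarings and 5 extra multiplies
print(powmod_count(3, 427, 2**89 - 1)[1:])  # → (8, 5)
```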
Past the edit time limit. I realized I left out a factor of 53 in the bit about b1 and b2 restrictions.
b2 / d / 53 <= b1, otherwise some small primes > b1 do not get included. |
1 Attachment(s)
[QUOTE=owftheevil;341927]Speed. 0's in the binary representation of the exponent require a squaring, 1's require an additional multiplication by the base. If the base is 3, this can be done with a modified normalization kernel with negligible increase in time, but with a huge integer base, it requires an additional fft multiplication.[/QUOTE]
That is generally not true; I mean, on average you get the same number of squarings and shifts/additions whether you do the primes one by one or multiply them together first.
Example: 5*7*11=385, or in binary 101*111*1011=110000001. In the "one by one" case you do 2+2+3=7 squarings and 1+2+2=5 shift/adds; in the "multiply first" case you do 8 squarings and 2 shift/adds, certainly an advantage for "one by one" with a small base. 7*61=427 (I deliberately selected primes with many ones!) = 111*111101=110101011, (7+6) against (8+5). If there are many factors, taking them one by one you must do sum(size[i]-1) squarings, but when you multiply first you do size(product)-1, i.e. you waste a squaring for each additional factor. With 75 factors of 20 bits each, you do 75*19=1425 squarings taking them one by one (because you ignore the leading bit, which is always 1, in each exponentiation), but if you take the product first, you also have about 1425 squarings. Statistically, in 1425 bits you have to do about 713 multiplications (each bit can be 1 or 0), in either case. Even if not, computing them one by one has a lot of other advantages (easy to extend B1, less waiting, etc).

Also, for the "one huge prime variation" (the stage 2 part), you can always find a compromise which is "almost" as fast but uses much less memory (this does not work if you want the BrS extension, but if you don't want it, you can do stage 2 almost as fast using less memory, which will allow a much higher B2 and will also put the "cheap cards" in the game; right now only people with a lot of memory on the card can do "good" P-1 with the GPU).

You can have a look at a Pari implementation attached below; I think I posted it in the past, but I don't know where. Rename it to delete the ".txt", load it in Pari with \r, and use "mpm1(257,,10^6,10^7)" as a sample command. The stage 2 of that is VERY fast (well, for Pari!) and it does not consume memory (of course, no BrS extension, just the classical one). 
I am also working to implement something like this in CUDA, but I stopped due to my lack of math knowledge related to fft and my poor skills with CUDA, which I am still learning. The latest CUDA version works on my gtx580 with a kind of Karatsuba multiplication (divide and conquer, integer only) and it is quite slow. I can give parameters for the number of iterations after which a GCD is taken, which (depending on the expo) can speed things up. For example, if B2 is very big, it is better to take a couple of GCDs; one might get lucky and find the factor faster, etc. You can consider these things (and many others) for future versions of cudaPm1. |
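LaurV's operation counts are easy to check mechanically. This throwaway sketch compares squaring counts for exponentiating by the primes one at a time versus by their product first, using the examples from the post (an illustration only):

```python
def counts(primes):
    # one by one: each prime p costs bitlen(p) - 1 squarings
    one_by_one = sum(p.bit_length() - 1 for p in primes)
    # product first: bitlen(product) - 1 squarings
    prod = 1
    for p in primes:
        prod *= p
    product_first = prod.bit_length() - 1
    return one_by_one, product_first

print(counts([5, 7, 11]))  # → (7, 8), the 5*7*11 = 385 example
print(counts([7, 61]))     # → (7, 8), the 7*61 = 427 example
```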
[QUOTE]That is generally not true; I mean, on average you get the same number of squarings and shifts/additions whether you do the primes one by one or multiply them together first.
Example: 5*7*11=385, or in binary 101*111*1011=110000001. In the "one by one" case you do 2+2+3=7 squarings and 1+2+2=5 shift/adds; in the "multiply first" case you do 8 squarings and 2 shift/adds, certainly an advantage for "one by one" with a small base. 7*61=427 (I deliberately selected primes with many ones!) = 111*111101=110101011, (7+6) against (8+5). If there are many factors, taking them one by one you must do sum(size[i]-1) squarings, but when you multiply first you do size(product)-1, i.e. you waste a squaring for each additional factor. With 75 factors of 20 bits each, you do 75*19=1425 squarings taking them one by one (because you ignore the leading bit, which is always 1, in each exponentiation), but if you take the product first, you also have about 1425 squarings. Statistically, in 1425 bits you have to do about 713 multiplications (each bit can be 1 or 0), in either case. Even if not, computing them one by one has a lot of other advantages (easy to extend B1, less waiting, etc). [/QUOTE] You are correct, but I was not claiming that the total number of operations is reduced by lumping the prime powers together. Multiplying by 3 as opposed to multiplying by a huge integer is where the time gets saved. [QUOTE]Also, for the "one huge prime variation" (the stage 2 part), you can always find a compromise which is "almost" as fast but uses much less memory (this does not work if you want the BrS extension, but if you don't want it, you can do stage 2 almost as fast using less memory, which will allow a much higher B2 and will also put the "cheap cards" in the game; right now only people with a lot of memory on the card can do "good" P-1 with the GPU). [/QUOTE] I don't know what you mean by "one huge prime variation" and "cheap cards". I would be interested in hearing how this works. Again, thanks for your input. Carl |
On the bright side, look what I found in the logs when I arrived home just now:
[CODE]M631247 has a factor: 83947913480780207864691737 (P-1, B1=3100000, B2=3100000, e=6, n=40K CUDAPm1 v0.10)[/CODE]Unfortunately the other 30 of them had no factor, and the situation might have been different if I could have given a 12M limit for B1, for a 10% chance :razz: BTW, for all of them the B2 was automatically extended to 40300000, which is what the log shows (the screen redirected to a text file), and the test was done in a single "gulp" (1 to 480 primes), but the results.txt only shows 3M1 for both limits (?!); I don't know to which B2 limit the test was really done. For the factor found, the "k" is smoother than B1, so this proves nothing; there was no stage 2. On the dark side, PrimeNet says: [CODE]No factor lines found: 0 Mfaktc no factor lines found: 0 Mfakto no factor lines found: 0 CUDAPm1-factor lines found: 1 [COLOR=Red]Insufficient information for accurate CPU credit. For stats purposes, assuming factor was found using ECM with B1 = 50000.[/COLOR] CPU credit is 0.0170 GHz-days. CUDAPm1-nofactor lines found: 38 CPU credit is 0.1130 GHz-days. CPU credit is 0.1130 GHz-days. CPU credit is 0.1130 GHz-days. <etc>[/CODE] and robbed me of 0.1 GD of credit :razz: The "factor" line was not the first in the file, and anyhow, I thought this problem is only for TF/mfaktc factors seen as P-1, but not for P-1 seen as ECM :surprised:shock: |
[QUOTE=LaurV;341977]I thought this problem is only for TF/mfaktc factors seen as P-1, but not for P-1 seen as ECM :surprised:shock:[/QUOTE]I still need to grind through the manual-results code and get it parsing sensibly. Several result types (e.g. mfakt*, CUDAPm1) are perfectly clear about how the factor was found, but the extant PrimeNet code still "guesses" at the factoring method, even though it should already be known.
I've posted this before somewhere a while back, but this is the abbreviated logic for how PrimeNet currently guesses as to the factor method:[code]add_user_msg( $resp, "Insufficient information for accurate CPU credit."); if ( $k == 1 && $b == 2 && $n % 4096 == 0 && $c == 1 ) { add_user_msg( $resp, "For stats purposes, assuming factor was found using ECM with B1 = 44M."); } else if ( $k != 1 || $b != 2 || $c != -1 || $n < 16000000) { if ($bits > 55) { add_user_msg( $resp, "For stats purposes, assuming factor was found using ECM with B1 = 50000."); } else { add_user_msg( $resp, "For stats purposes, assuming factor was found by trial factoring using prime95."); } } else { if ($bits > $t_results[tflimit]) { add_user_msg( $resp, "For stats purposes, assuming factor was found using P-1 with B1 = 800000."); } else { add_user_msg( $resp, "For stats purposes, assuming factor was found by trial factoring using prime95."); } }[/code] Hopefully I'll get to rewriting the manual-results parsing logic within the next couple weeks or so. |
[QUOTE=LaurV;341977][STRIKE]I don't know if the test was really done to which B2 limit. For the factor found, the "k" is smoother than B1 so this proves nothing, there was no stage 2. [/STRIKE]
[/QUOTE] Ignore that! I was stupid. The results are stored properly in the results file, with the right extended B2, and the test is done to the extended B2 too (I just found a new factor). After a shower and some dinner I can read the results.txt properly :smile: |
[QUOTE=James Heinrich;341979]I still need to grind through the manual-results code and get it parsing sensibly.[/QUOTE]
Looking into my log file, I see that the "factor" lines were searched first, and the "no factor" lines only [B]after[/B] (see the former post, log included). Maybe that is the problem? Adding "no factor" lines in front would be no use in this case (as mfaktx would have no preliminary "no factor" line; MISFIT changes the line order, problem solved). Just an idea... |
The whole manual results parsing code needs to be reworked. Fortunately I already have nicely-working code on mersenne.ca that I can pretty much just copy over. I just need a little while to sit down and work through it.
|
[QUOTE=James Heinrich;341982]The whole manual results parsing code needs to be reworked. Fortunately I already have nicely-working code on mersenne.ca that I can pretty much just copy over. I just need a little while to sit down and work through it.[/QUOTE]
Thanks James :bow: Luigi |
From LaurV:
[QUOTE]But this is not the main problem. All values between 3200000 and 20M are parsed wrong: it says "B1 need to be at least 1" and does a test with B1=1 and B2=393xxx or so, which [B]does[/B] find a factor, if one exists for these values. [/QUOTE] Integer overflow. It will go away when the silly b1 and b2 restrictions are removed. |
LaurV,
I took a look at your Pari implementation of P-1. What you are doing is essentially the case e = 1 and d = 6. CUDAPm1 can currently deal with d = 6, but to get e = 1 I would have to include clauses in some of the stage 2 initialization functions, and provide for incrementing the main stage 2 loop counter by 1 instead of 2. Not hard to do. Any coder that doesn't suck could do it in an hour; I might be able to do it in a day. It would save 16 * (fft length) bytes of device memory, but take a performance hit directly related to the number of prime pairs 6*k - 5, 6*k + 5 and 6*k - 1, 6*k + 1 between b2 / 5 and b2, compared to using e = 2 and d = 6. By the way, very clean and easy-to-read code. I wish I could do that. |
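As a rough illustration of the pairing mentioned above (a toy under stated assumptions, not CUDAPm1's stage 2 code): with d = 6 the residues coprime to d are ±1 mod 6, so the two primes of a twin pair 6k-1, 6k+1 can be handled together, and the relative cost of e = 1 versus e = 2 depends on how many such pairs fall in the stage 2 interval:

```python
def is_prime(n):
    # trial division, fine for a toy count
    if n < 2:
        return False
    for f in range(2, int(n ** 0.5) + 1):
        if n % f == 0:
            return False
    return True

def twin_pairs(lo, hi):
    # count k with 6k-1 and 6k+1 both prime and (approximately)
    # lo < 6k-1, 6k+1 <= hi -- the pairs that can share work
    return sum(1 for k in range(lo // 6 + 1, (hi - 1) // 6 + 1)
               if is_prime(6 * k - 1) and is_prime(6 * k + 1))

print(twin_pairs(0, 100))  # → 7 pairs: (5,7), (11,13), ..., (71,73)
```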
I understand perfectly about the performance hit. It was just an idea, trying to convince you to "work out" stage 2 a bit, because the long run time for a huge B2 in cudaPm1 is bothersome; if it crashes in the middle -> waste... [edit: by the way, "cheap cards" referred to cheap video cards without much memory (which could join the game if the game didn't require a "pro" card), not to some algorithmic trick, and "one large prime variation" is another name for stage 2 of P-1. Sorry if my English gives you trouble, I was never a good speaker :smile:]
[QUOTE=owftheevil;342148] By the way, very clean and easy to read code. I wish I could do that.[/QUOTE] That is because I did not have to deal with any of the "low level" things; Pari's internal kitchen takes care of those. It is like reading the 50-line "main()" function of a 10-thousand-line C program, where main() only calls the other functions. There is nothing "unclear" or "untidy" about it. But thanks for the compliment anyhow :razz: |
[QUOTE]Sorry if my English makes trouble for you, I was never a good speaker[/QUOTE]
Your English is fine, I was just being stupid. It didn't take me long to realize that by "cheap cards" you meant "cheap cards". The other phrase became clear when I read your code. |
Anyone working on the worktodo parsing? :smile:
Luigi |
Not as far as I know.
I've got stage 2 error reporting and checkpoints working, just need to fix the ETA estimates when resuming stage 2. |
[QUOTE=owftheevil;344980]Not as far as I know.
I've got stage 2 error reporting and checkpoints working, just need to fix the ETA estimates when resuming stage 2.[/QUOTE] Thanks :smile: |