Well, I only work in a TF environment so maybe that's the difference, but my TF rate increases slightly when my screen goes blank.
My best guess is under Power Options > Edit Plan Settings (for your specific power plan) > Advanced Power Options > PCI Express you might find something. All I have is the one option set to disabled so check that this is what you have also. Otherwise, the fix is to set the display to always be on but turn it off yourself when you don't need it. |
I've noticed there hasn't been any new code checked in since late 2013. Has development ceased, or are people still working on it behind the scenes?
|
[QUOTE=TheMawn;401446]Well, I only work in a TF environment so maybe that's the difference, but my TF rate increases slightly when my screen goes blank.
My best guess is under Power Options > Edit Plan Settings (for your specific power plan) > Advanced Power Options > PCI Express you might find something. All I have is the one option set to disabled so check that this is what you have also. Otherwise, the fix is to set the display to always be on but turn it off yourself when you don't need it.[/QUOTE]
Already off. |
Are there any linux binaries around for P-1?
I tried compiling, but epic fails all around. I can only get the latest environment, and that needs the latest drivers, which don't seem to work for me. -- Craig |
You use Windows? Linux? I can help with Linux, but someone else will have to jump in if it's Windows.
|
[QUOTE=owftheevil;404479]You use Windows? Linux? I can help with Linux, but someone else will have to jump in if it's Windows.[/QUOTE]
I have a windows binary. I need a binary for linux. -- Craig |
[QUOTE=nucleon;404484]I need a binary for linux.[/QUOTE]
Try this one: [URL="https://www.dropbox.com/s/mr0z8e9pbifla4a/cudapm1-0.20.tar.gz?dl=0"]https://www.dropbox.com/s/mr0z8e9pbifla4a/cudapm1-0.20.tar.gz?dl=0[/URL] It is compiled with cuda 5.5 for 64-bit linux. |
[QUOTE=frmky;404632]Try this one: [URL="https://www.dropbox.com/s/mr0z8e9pbifla4a/cudapm1-0.20.tar.gz?dl=0"]https://www.dropbox.com/s/mr0z8e9pbifla4a/cudapm1-0.20.tar.gz?dl=0[/URL]
It is compiled with cuda 5.5 for 64-bit linux.[/QUOTE] Thank you heaps. Much appreciated. |
I'm running the code from the previous link on a g2.8xlarge instance on AWS. I'm getting approximately 25GHz-days/day P-1 per GPU.
I'll also note that I've found two factors. -- Craig |
[QUOTE=nucleon;404870]I'm running the code from the previous link on a g2.8xlarge instance on AWS. I'm getting approximately 25GHz-days/day P-1 per GPU.[/QUOTE]
Personally (at least for TF'ing) I find the cg1.4xlarge instances to be better value (in us-east-1d -- "Spot" often less than $0.14 an hour for two Titans). |
1 Attachment(s)
Last week hasn't been that great.
|
Does anyone have a v52 binary for Win64, possibly with rt 5.5 or so?
I took about 30 assignments in 666M which I am TF-ing to ~80 bits, and at the same time (on a different card, in parallel) I do P-1 for the survivors. Stage 1 goes well (FFT size 38416k), but it crashes when entering stage 2. I had that problem long ago (see pages 38-42 of this thread), and it was fixed at the time by playing with the number of threads. I haven't updated cudapm1 since then, and I know there were some fixes. It works well for 333M, but these exponents at 666M may be a bit too high... |
Can someone teach me
1. Where is the latest available win64 binary of the cudaPM1 program?
2. How can I convince it to run only stage 1 of the algorithm?
3. In case of 2, how can I [STRIKE]resume[/STRIKE] [U]extend[/U] B1? (i.e. resuming an already finished stage 1 with a larger B1, when I have the last checkpoint file saved at the end of the stage 1 run with the older/smaller B1)
Thanks in advance. |
Two questions:
1. There was talk of adding worktodo.txt parsing to the program. Anyone know whether it has been implemented?
2. I asked this several months ago but didn't get an answer: the code hasn't been updated since late 2013. Has development ceased, or is someone still working on it behind the scenes? |
worktodo; code development
[QUOTE=ixfd64;425040]Two questions:
1. There was talk of adding worktodo.txt parsing to the program. Anyone know whether it has been implemented?
2. I asked this several months ago but didn't get an answer: the code hasn't been updated since late 2013. Has development ceased, or is someone still working on it behind the scenes?[/QUOTE]
The worktodo was implemented. It certainly seems to be working on my test installation created days ago. Some of the names here are the same as on cudalucas etc. Last I heard, flashjh (Jerry) has been working on updating CUDALucas on Windows to reflect code developed in 2013 and address some other bugs and wishlist items.
I'm seeing frequent halts to the cudapm1 program, running it on a GTX480. That's a CC2.0 card, subject to the driver timeout issue for Nvidia driver level >~300, regardless of whether it's running CUDALucas, CUDApm1, or anything else. The cudapm1 error message is:
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(3581) : cudaDeviceSynchronize() Runtime API error 30: unknown error.
I'm also seeing runs of round-off error under 0.08, followed by termination with this message from cudapm1:
err = 0.5 >= 0.40, quitting.
Restarting it, it continues fine for about an hour, then has another round-off error and quits. |
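For readers wondering what the "err = 0.5 >= 0.40, quitting" message measures, here is a minimal sketch of a round-off check as it is commonly done in FFT-based multiplication: every result should land very close to an integer, and the error is the largest distance to the nearest integer. The function names are hypothetical and this is not taken from the CUDAPm1 source; only the 0.40 limit and the message wording come from the post above.

```python
# Hypothetical sketch of a round-off check; not taken from the CUDAPm1 source.
ERROR_LIMIT = 0.40  # limit quoted in the "err = 0.5 >= 0.40, quitting" message


def roundoff_error(values):
    """Return the largest distance between any value and its nearest integer."""
    return max(abs(v - round(v)) for v in values)


def check_iteration(values, limit=ERROR_LIMIT):
    """Report whether a run would continue or stop for this set of values."""
    err = roundoff_error(values)
    if err >= limit:
        return f"err = {err:g} >= {limit:.2f}, quitting."
    return f"err = {err:.5f}, continuing."


if __name__ == "__main__":
    print(check_iteration([1.0, 2.0625]))  # small error, run continues
    print(check_iteration([7.0, 2.5]))     # a value halfway between integers
```

An error of exactly 0.5 means a value sat halfway between two integers, so the rounded result cannot be trusted. That fits a hardware fault better than a slightly undersized FFT, which would normally show the error creeping upward first.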
[QUOTE=kriesel;458044]...The cudapm1 error message is:
C:/Users/filbert/Documents/Visual Studio 2010/Projects/CUDAPm1/CUDAPm1.cu(3581) : cudaDeviceSynchronize() Runtime API error 30: unknown error...[/QUOTE] Keep plugging away at it. I believe several around here would like to see you succeed with this. I would. :smile: |
[QUOTE=ixfd64;425040]Two questions:
1. There was talk of adding worktodo.txt parsing to the program. Anyone know whether it has been implemented?
2. I asked this several months ago but didn't get an answer: the code hasn't been updated since late 2013. Has development ceased, or is someone still working on it behind the scenes?[/QUOTE]
IIRC, the author of CUDAPm1 is user owftheevil, whom I have not seen on the forum in a while. I don't know if anyone else has done more work on the source code. Dubslow or flashjh (Jerry) might know more. |
compile and sourceforge update request
1 Attachment(s)
Hi,
I've taken a stab at some minor cosmetic fixes for CUDAPm1 and its ini file. Could someone (perhaps batalov, flashjh, jgchilders, owftheevil, frmky?) please recompile for Windows (at least for 64-bit & CUDA 5.5; other variations can wait), get an updated .exe to me to test, and post the updated ini file to sourceforge? (It's been a long time since I mucked with any C of any flavor, so I'd like to test it myself first, and I don't have a build environment.)
From reading the forum recently, I see that on Friday Nov. 22 2013 owftheevil indicated making some of his last source code changes to date. The date of the executable at [URL]https://sourceforge.net/projects/cudapm1/files/CUDAPm1-0.20/[/URL] is a few days earlier than that (Nov 18 2013, [URL]http://www.mersenneforum.org/showthread.php?t=17835&page=39[/URL] post #427), while the date of the .cu file at [URL]https://sourceforge.net/p/cudapm1/code/HEAD/tree/trunk/[/URL] is a few days later (Monday Nov 25 2013), so I'm unsure whether the last fix by owftheevil is in the exe there. (Was r52 fully synced, and does it contain the change described in #427 and any changes relating to #444, message date Nov 25 2013?)
Similarly, the question "is it current with the latest code" arises with James Heinrich's mirror [URL]http://download.mersenne.ca/CUDAPm1/[/URL] where the exe date is also Monday Nov 18 2013. So I suspect the available Windows executables currently correspond to r50, not r52.
Thanks, Ken |
Error 30
[QUOTE=storm5510;461081]Keep plugging away at it. I believe several around here would like to see you succeed with this. I would. :smile:[/QUOTE]
Sadly, there does not seem to be much hope of resolving that recurrent Error 30. It's an NVIDIA driver issue affecting compute capability 2.0 or lower GPUs in combination with driver releases above around 300, if I recall correctly. There was an effort to persuade NVIDIA to fix it, but supporting older cards for a niche set of users wasn't enough of a priority. |
Err=0.50>=0.40 (failing GPU)
[QUOTE=kriesel;458044]... I'm also seeing runs of round-off error under 0.08, followed by termination with this message from cudapm1:
err = 0.5 >= 0.40, quitting. Restarting it, it continues fine for about an hour, and has another round-off error & quits.[/QUOTE]
It appears that was a case of an older GPU declining in reliability. It got to the point where it would get stuck on a particular pass of stage 2 in CUDAPm1, regardless of how many restart attempts were made. The failures became more frequent over time, and the problem pass varied from exponent to exponent. The checkpoint files were fine, and another GPU of the same model could carry them to completion without error. Thorough memory testing of the declining GPU showed that while a test of 10 25MB blocks would run error-free, several of blocks 23-40 would show errors, even when the card was significantly underclocked. I recommend essentially full-range memory testing. This GPU is likely to be replaced. |
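To make the point about short memory tests concrete, here is a small sketch of which part of the card a block-based test covers. It assumes blocks are 25MB each, numbered from 1, and laid out contiguously from the start of memory. Those are my assumptions for illustration, not documented behavior of any particular memory tester, and the function names are made up.

```python
# Illustrative arithmetic only; block numbering and layout are assumptions.
BLOCK_MB = 25  # block size quoted in the post


def block_span_mb(first_block, last_block, block_mb=BLOCK_MB):
    """Return (start_mb, end_mb) covered by blocks first..last, numbered from 1."""
    return (first_block - 1) * block_mb, last_block * block_mb


def blocks_for_card(total_mb, block_mb=BLOCK_MB):
    """Return how many whole blocks fit in a card with total_mb of memory."""
    return total_mb // block_mb


if __name__ == "__main__":
    print(block_span_mb(1, 10))    # region reached by a 10-block test
    print(block_span_mb(23, 40))   # region where the errors showed up
    print(blocks_for_card(1536))   # whole blocks on a 1.5GB card
```

Under those assumptions a 10-block test only reaches the first 250MB, while the failing blocks sit between 550MB and 1000MB, so the short test never touches the bad region.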
I can compile, I'll see if I can get it today.
|
1 Attachment(s)
This is interesting. I wanted to try it and see what happens. :smile:
|
device number
[QUOTE=storm5510;463503]This is interesting. I wanted to try it and see what happens. :smile:[/QUOTE]
Device numbering in CUDAPm1 is zero-based, if I recall correctly. It is so in CUDALucas. The first GPU device is 0, the second is 1, and so on. It defaults to device 0 if no device is specified on the command line or in the ini file. I think that message happens in any of the following (and possibly other) cases:
- A device number higher than the last device number physically present and properly installed is specified. For example, specifying -d 2 on a system where two GPUs, devices 0 and 1, are present.
- A device timeout has occurred and Windows hasn't yet restarted the display device driver, so from the point of view of the OS and app, while the GPU is physically present it's not available for use.
- A device timeout has occurred and Windows has attempted to restart the display device driver, but a thermal issue or other issue prevented the GPU from restarting, so from the point of view of the OS and app, while the GPU is physically present it's not available for use until the issue is resolved at least temporarily and the driver restarted.
- The software was run on a system containing no qualifying device.
- The software was run on a system containing a qualifying device but no suitable driver yet successfully installed and active.
- The software version requires a CUDA level higher than the installed driver supports. |
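The first case in the list above (a device number past the end of a zero-based range) can be sketched as a simple check. This is only an illustration of the numbering rule; the function name and messages are hypothetical and not CUDAPm1's actual code or wording.

```python
# Hypothetical helper illustrating zero-based device numbering.
def resolve_device(requested, device_count):
    """Return the zero-based device index to use, or raise ValueError.

    requested is None when no device was given on the command line or
    in the ini file; in that case device 0 is used.
    """
    if device_count <= 0:
        raise ValueError("no qualifying device is available")
    index = 0 if requested is None else requested
    if not 0 <= index < device_count:
        raise ValueError(
            f"device {index} is not available; "
            f"valid device numbers are 0..{device_count - 1}"
        )
    return index


if __name__ == "__main__":
    print(resolve_device(None, 2))  # defaults to device 0
    print(resolve_device(1, 2))     # second GPU
    try:
        resolve_device(2, 2)        # like -d 2 with only devices 0 and 1
    except ValueError as exc:
        print(exc)
```

With a single GPU installed, a DeviceNumber of 1 in the ini file falls outside the valid range in the same way, since the only valid number is 0.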
The 'DeviceNumber' was set at 1 in the configuration file. I changed it to zero. The application became responsive. It doesn't want to go beyond a 1000 iteration average error test.
|
CUDAPm1 startup
[QUOTE=storm5510;463546]The 'DeviceNumber' was set at 1 in the configuration file. I changed it to zero. The application became responsive. It doesn't want to go beyond a 1000 iteration average error test.[/QUOTE]
It can take a while (minutes) for the next line of output to appear, depending on the screen output interval setting and the exponent or fft length. For example, on a GTX480, it's nearly four minutes per 50,000 iterations below:
Iteration 1000, average error = 0.19992 <= 0.25 (max error = 0.26172), continuing test.
Iteration 50000 M43158547, 0xdd951715b61e6699, n = 2304K, CUDAPm1 v0.20 err = 0.29688 (3:45 real, 4.4892 ms/iter, ETA 16:46)
Iteration 100000 M43158547, 0xadcc2bec0b8ae426, n = 2304K, CUDAPm1 v0.20 err = 0.29297 (3:42 real, 4.4537 ms/iter, ETA 12:56) |
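As a rough cross-check of the wait between output lines, the reporting interval times the per-iteration time gives the expected gap. This is a back-of-the-envelope sketch with a made-up function name, not how the program computes its own "real" time, so expect it to differ from the log by a second or two.

```python
# Back-of-the-envelope estimate of the time between progress lines.
def interval_time(iterations, ms_per_iter):
    """Return the estimated wall time for one reporting interval as m:ss."""
    seconds = round(iterations * ms_per_iter / 1000)
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}:{secs:02d}"


if __name__ == "__main__":
    # Figures from the GTX480 log lines above: 50,000 iterations per line.
    print(interval_time(50000, 4.4892))
    print(interval_time(50000, 4.4537))
```

Both estimates land within a couple of seconds of the 3:45 and 3:42 shown in the log, so a wait of several minutes with no output is normal at that interval.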
split error message
1 Attachment(s)
Jerry, please see item 8 in the attachment.
|
After doing some reading back through the pages here, I found the proper parameter for doing bench tests. The example was "-cufftbench 1 8192 r". It didn't want to respond to this. Then I saw where someone had used a value of "1" in place of the "r". It ran the tests after that.
A cosmetic request: In my humble opinion, the console output lines are way too long. If the program name and version number could be removed, that would help. I had to stretch the console window to the full width of my screen to keep it all on a single line each time. |
-r option in CUDAPm1 not implemented
Its presence in the CUDAPm1 help message output seems to be a holdover from its CUDALucas ancestry. Specifying -r on the command line does not result in any residue check tests running in CUDAPm1; it goes straight to continuation of work present in the worktodo file. If I read the source code correctly, the residue check function did not get implemented for CUDAPm1.
|
CUDAPm1 bug and feature wish list
1 Attachment(s)
The topic and attachment are not intended to be critical of the fine and free development done. My intent is to make its use easier and more productive, and maybe aid further development. These are things I've learned by using the program or very recently looking at the source code. Please feel free to PM me with any additions, corrections or suggestions.
|
The server did not understand the results below.
[CODE]M82595957 has a factor: 3960668801233058686019823786839 (P-1, B1=730000, B2=730000, e=0, n=4608K, aid=xxxxxxxxxxxxC10420CBB1142D2B6669 )[/CODE]
I shortened it to this:
[CODE]M82595957 has a factor: 3960668801233058686019823786839 (P-1, B1=730000, B2=730000, e=0, n=4608K)[/CODE]
The server still did not understand.
[U]Note[/U]: I replaced some of the AID numbers with an 'x' in the first statement. Ideas? |
[QUOTE=storm5510;465018]The server did not understand the results below.
[CODE]M82595957 has a factor: 3960668801233058686019823786839 (P-1, B1=730000, B2=730000, e=0, n=4608K, aid=xxxxxxxxxxxxC10420CBB1142D2B6669 )[/CODE]
I shortened it to this:
[CODE]M82595957 has a factor: 3960668801233058686019823786839 (P-1, B1=730000, B2=730000, e=0, n=4608K)[/CODE]
The server still did not understand.
[U]Note[/U]: I replaced some of the AID numbers with an 'x' in the first statement. Ideas?[/QUOTE]
It will understand this:
[CODE] M82595957 has a factor: 3960668801233058686019823786839 (P-1, B1=730000, B2=730000) [/CODE] |
It looks like a CudaPm1 result, but it's lacking the program identifier.
The manual results form is, on purpose, very particular about formatting. Do not edit the result lines before attempting to submit them. |
[QUOTE=James Heinrich;465022]It looks like a CudaPm1 result, but it's lacking the program identifier.
The manual results form is, on purpose, very particular about formatting. Do not edit the result lines before attempting to submit them.[/QUOTE] Guilty! I was playing with a small sorting program and didn't realize it was truncating them. I ran another and formatted this one like the second. Problem solved. :blush: |
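Independently of what the results form accepts, a reported factor can be checked locally before submitting: f divides 2^p - 1 exactly when 2^p mod f equals 1. The sketch below pulls the exponent and factor out of a "has a factor" line and runs that test. The regular expression and function names are mine, written against the lines quoted in this thread; this is not the server's parser.

```python
import re

# Pattern written against the result lines quoted in this thread;
# it is an assumption, not the format specification used by the server.
FACTOR_LINE = re.compile(r"M(\d+) has a factor: (\d+)")


def parse_factor_line(line):
    """Return (exponent, factor) from a 'has a factor' line, or None."""
    match = FACTOR_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def divides_mersenne(p, factor):
    """Return True if factor divides 2**p - 1."""
    return pow(2, p, factor) == 1


if __name__ == "__main__":
    # Small textbook case: 2**11 - 1 = 2047 = 23 * 89.
    print(divides_mersenne(11, 23), divides_mersenne(11, 89))
    parsed = parse_factor_line(
        "M82595957 has a factor: 3960668801233058686019823786839 "
        "(P-1, B1=730000, B2=730000)"
    )
    print(parsed, divides_mersenne(*parsed))
```

The three-argument pow keeps the arithmetic modular, so the check is quick even for exponents in the 80M range.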
cudapm1 bug and wish list update
1 Attachment(s)
Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.
|
short of memory in stage 2, repeating residual
Is this a known problem? It warns before starting stage 1 there may not be enough memory for stage 2 for an exponent near 300M (wanting about 3% more than the GPU has), goes ahead and completes stage 1 using ~670MB, reports a residual for stage 1, uses about 5/6 of the gpu's 1.5GB memory for stage 2, and despite the earlier memory warning, chugs along in stage 2, one relative prime at a time, reporting the final stage 1 residual with each. Iteration times appear to be normal.
CUDAPm1 v0.20
------- DEVICE 1 -------
name GeForce GTX 480
Compatibility 2.0
clockRate (MHz) 1401
memClockRate (MHz) 1848
totalGlobalMem 1610612736
totalConstMem 65536
l2CacheSize 786432
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 15
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
CUDA reports 1434M of 1536M GPU memory free.
Index 107
Using threads: norm1 256, mult 128, norm2 128.
Using up to 1584M GPU memory.
WARNING: There may not be enough GPU memory for stage 2!
Selected B1=2660000, B2=17955000, 5.04% chance of finding a factor
Starting stage 1 P-1, M299500177, B1 = 2660000, B2 = 17955000, fft length = 18432K
Doing 3837955 iterations
...
Iteration 3750000 M299500177, 0x16fc277b4c69b54a, n = 18432K, CUDAPm1 v0.20 err = 0.03320 (30:13 real, 36.2668 ms/iter, ETA 53:09)
Iteration 3800000 M299500177, 0xe97a5cb286fcf801, n = 18432K, CUDAPm1 v0.20 err = 0.03418 (30:13 real, 36.2698 ms/iter, ETA 22:56)
M299500177, 0x071ac99b54319724, n = 18432K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 38:41:27
Starting stage 1 gcd.
M299500177 Stage 1 found no factor (P-1, B1=2660000, B2=17955000, e=0, n=18432K CUDAPm1 v0.20)
Starting stage 2.
Using b1 = 2660000, b2 = 17955000, d = 2310, e = 2, nrp = 1
Zeros: 778308, Ones: 811452, Pairs: 143662
Processing 1 - 1 of 480 relative primes.
Inititalizing pass... done. transforms: 170, err = 0.03711, (3.02 real, 17.7483 ms/tran, ETA NA)
Transforms: 16700 M299500177, 0x071ac99b54319724, n = 18432K, CUDAPm1 v0.20 err = 0.03711 (5:16 real, 18.8891 ms/tran, ETA 42:23:55)
...
Processing 341 - 341 of 480 relative primes.
Inititalizing pass... done. transforms: 265, err = 0.02988, (5.16 real, 19.4604 ms/tran, ETA 12:27:36)
Transforms: 16664 M299500177, 0x071ac99b54319724, n = 18432K, CUDAPm1 v0.20 err = 0.03125 (5:15 real, 18.8980 ms/tran, ETA 12:22:20) |
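The "about 3% more than the GPU has" figure follows directly from two numbers in the log, "Using up to 1584M GPU memory" against the 1536M total. A tiny sketch of that arithmetic, with hypothetical function names and the 5/6 stage 2 usage taken from the description above rather than from any program output:

```python
# Arithmetic on figures quoted in the log; function names are illustrative.
def shortfall_percent(wanted_mb, total_mb):
    """Return how far the requested memory exceeds the card total, in percent."""
    return 100.0 * (wanted_mb - total_mb) / total_mb


def fraction_of_card_mb(total_mb, numerator, denominator):
    """Return numerator/denominator of the card memory, in MB."""
    return total_mb * numerator / denominator


if __name__ == "__main__":
    print(shortfall_percent(1584, 1536))      # requested vs. total, in percent
    print(fraction_of_card_mb(1536, 5, 6))    # observed stage 2 usage, in MB
```

So the request overshoots the card by 3.125%, while the observed stage 2 usage of about 5/6 of the card is roughly 1280MB, comfortably under the physical limit.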
[QUOTE=kriesel;465987]...Transforms: 16664 [B]M299500177[/B], 0x071ac99b54319724, n = 18432K, CUDAPm1 v0.20 err = 0.03125 (5:15 real, 18.8980 ms/tran, ETA 12:22:20)[/QUOTE]
Is there a particular reason for running an exponent this large? |
[QUOTE=storm5510;465999]Is there a particular reason for running an exponent this large?[/QUOTE]Especially when it already has a [url=http://www.mersenne.ca/exponent/299500177]known 52-bit factor[/url].
|
why
[QUOTE=James Heinrich;466000]Especially when it already has a [URL="http://www.mersenne.ca/exponent/299500177"]known 52-bit factor[/URL].[/QUOTE]
Thanks for asking. Yes, it's a bit off the beaten path. That's the point of this run. I started it as a joint test of my hardware and the software & its local configuration. Does it find the factor? (That's actually the technique the author of the software described using for qualifying an installation, but years ago at lower exponents. Sometimes things go wrong at different fft lengths. Test exponents are selected for having a factor that should be found. You're right that that's the opposite of searching to find new factors to screen out LL test candidates.)
Such testing can also shed some light on the following, even if it fails the find-the-known-factor test:
- What is actual run-time as a function of exponent, so what's reasonable or unreasonable to run on given hardware?
- Does anything break at high P for CUDAPm1?
- What are the gpu memory requirements or default usage versus exponent and stage?
- What is the save file size versus p and stage?
- If it's memory limited, does the software handle too little gpu memory gracefully?
- Are there unknown or forgotten bugs that could be smoked out and dealt with before the wave of PrimeNet assignments hit the fft lengths that reveal them?
Armies use scouts.
I had already run current-P-1-wavefront assignments, some double-check territory exponents assigned as LLDC that had only B1 done on them, and some current or recent wavefront LL tests assigned that had only B1 done IIRC, so I had about half the data for a handy chart or two already. Why not get a few points elsewhere on the log plot?
It's an extension of some of the stuff I've been posting over at [URL]http://www.mersenneforum.org/showthread.php?t=22450&page=3[/URL] as I puzzle things out as a long time GIMPS participant (1996?) but new to gpu use for it. If that sort of information is already available and assembled somewhere else, and it may well be, I'd love to know where. I've read a lot of threads and thread lists, and haven't found it yet. It's a big haystack.
Maybe the future gpu-newbies will find the Available Software thread and find it useful. I would have. I want to know the capabilities and limitations of the software, generally and in relation to the parameters of the models of gpu I have running (6) or on order (1). Understanding that will help deploy them in the most productive manner.
And finally, it's because it interests me, more than only doing one exponent after another in ascending order at the wavefront, on each gpu or cpu. I currently have a mix of mfaktc, cudapm1, cudalucas, and prime95 running on systems, which mostly are doing production work, cranking out a mix of ECM, LL, DC, TF, & P-1, but I enjoy looking into how things will be different later and what issues may turn up. |
updated benchmark, memory requirements, limits, etc on GTX480
1 Attachment(s)
Note that for comparison, a GTX1070 can do M9100xxxx in about 6.5 hours. The GTX480 is limited by both run-time and 1.5GB video memory size.
|
[QUOTE=kriesel;466048]...cranking out a mix of ECM, LL, DC, TF, & P-1, but I enjoy looking into how things will be different later and what issues may turn up.[/QUOTE]
[B]Off Topic[/B]: I ran multiple machines for a while. I found my utility bills rather shocking, pardon the pun. It was a 1/3 increase over each billing cycle. So, one machine is used sparingly. Only the newest one runs constantly. :smile: |
Appearance of exponent limit on Quadro 2000
Has anyone else seen something similar? A GTX480 had no equivalent problem on the same 84M exponents. This is CUDAPm1 v0.20 on Windows 64-bit Vista.
After a few successful stage 1 and stage 2 P-1 runs at ~83.5M, each following exponent >84M runs through stage 1, but not through stage 1 gcd or stage 2, crashing the program instead. The behavior is reproducible for exponents 84M+, including after program restarts, logouts, and system restarts.
M83496143 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=85D38BAC023FCFF8022AABA05F602C4C CUDAPm1 v0.20) reported 11/1/17
M83496227 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=A1656CF4111B3B15C4A71186811384FF CUDAPm1 v0.20) reported 11/2/17
M83496247 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=5F246BFB077E96AA450384EFEC8EC599 CUDAPm1 v0.20) reported 11/3/17
M83496293 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=725F9720C9179022C18CEA98F646F72E CUDAPm1 v0.20) reported 11/4/17
M50001781 has a factor: 4392938042637898431087689 (P-1, B1=430000, B2=5000000, e=2, n=2688K CUDAPm1 v0.20)
All 5 exponents attempted above 84M failed:
PFactor=A3B66EB4FAAE78E8F283D5C96AD37A__,1,2,84228073,-1,76,2
PFactor=DC8BDAFB8D89D04B3B35742B11D9CE__,1,2,84228097,-1,76,2
PFactor=C996CF4EA78E42F9610D9789BE1666__,1,2,84228103,-1,76,2
and two more
A typical event log entry follows. From entry to entry, the process id and application start time change but other event data values do not.
Log Name: Application
Source: Application Error
Date: 11/4/2017 7:23:36 PM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: eagle
Description:
Faulting application CUDAPm1_win64_20131118_CUDA_50.exe, version 0.0.0.0, time stamp 0x5285815f, faulting module CUDAPm1_win64_20131118_CUDA_50.exe, version 0.0.0.0, time stamp 0x5285815f, exception code 0xc0000005, fault offset 0x000000000000dd20, process id 0xd78, application start time 0x01d355cc5142bacb.
Event Xml: <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"> <System> <Provider Name="Application Error" /> <EventID Qualifiers="0">1000</EventID> <Level>2</Level> <Task>100</Task> <Keywords>0x80000000000000</Keywords> <TimeCreated SystemTime="2017-11-05T00:23:36.000Z" /> <EventRecordID>256</EventRecordID> <Channel>Application</Channel> <Computer>eagle</Computer> <Security /> </System> <EventData> <Data>CUDAPm1_win64_20131118_CUDA_50.exe</Data> <Data>0.0.0.0</Data> <Data>5285815f</Data> <Data>CUDAPm1_win64_20131118_CUDA_50.exe</Data> <Data>0.0.0.0</Data> <Data>5285815f</Data> <Data>c0000005</Data> <Data>000000000000dd20</Data> <Data>d78</Data> <Data>01d355cc5142bacb</Data> </EventData> </Event> Normal progression, 83M: (end of stage 1) Iteration 987000 M83496293, 0xf2fb4b229c8521b0, n = 4608K, CUDAPm1 v0.20 err = 0.16919 (0:37 real, 36.8380 ms/iter, ETA 0:39) Iteration 988000 M83496293, 0x9ad528e521e85730, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:37 real, 36.8401 ms/iter, ETA 0:03) M83496293, 0x232eab21eaf81e92, n = 4608K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 10:10:44 Starting stage 1 gcd. M83496293 Stage 1 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K CUDAPm1 v0.20) Starting stage 2. Using b1 = 685000, b2 = 12843750, d = 2310, e = 2, nrp = 13 Zeros: 573917, Ones: 658723, Pairs: 125889 Processing 1 - 13 of 480 relative primes. Inititalizing pass... done. transforms: 270, err = 0.16406, (5.09 real, 18.8644 ms/tran, ETA NA) Transforms: 2106 M83496293, 0x52b341a257507f69, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:41 real, 19.4671 ms/tran, ETA 9:14:05) Transforms: 2010 M83496293, 0x905f255bd35e844b, n = 4608K, CUDAPm1 v0.20 err = 0.17188 (0:39 real, 19.5838 ms/tran, ETA 9:15:02) Transforms: 2014 M83496293, 0x673b942ac1fc4ae2, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:40 real, 19.5771 ms/tran, ETA 9:14:52) ... Processing 469 - 480 of 480 relative primes. Inititalizing pass... done. 
transforms: 357, err = 0.17090, (6.88 real, 19.2605 ms/tran, ETA 14:07) Transforms: 2090 M83496293, 0x284e7914442300ef, n = 4608K, CUDAPm1 v0.20 err = 0.17090 (0:41 real, 19.4700 ms/tran, ETA 13:26) Transforms: 2058 M83496293, 0xb1c240cc360984b8, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:40 real, 19.5747 ms/tran, ETA 12:46) Transforms: 2012 M83496293, 0xfa21edbaa82e8d9d, n = 4608K, CUDAPm1 v0.20 err = 0.16992 (0:40 real, 19.5721 ms/tran, ETA 12:07) Transforms: 1958 M83496293, 0xfdc0e766f0aa5f44, n = 4608K, CUDAPm1 v0.20 err = 0.16992 (0:38 real, 19.5923 ms/tran, ETA 11:28) Transforms: 1980 M83496293, 0xf808c66bf88da80d, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:39 real, 19.5757 ms/tran, ETA 10:50) Transforms: 1998 M83496293, 0xed71c1b76d6c0757, n = 4608K, CUDAPm1 v0.20 err = 0.16602 (0:39 real, 19.5754 ms/tran, ETA 10:10) Transforms: 1910 M83496293, 0x9587bca9e6a92d95, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.5884 ms/tran, ETA 9:33) Transforms: 1902 M83496293, 0xdd50dacef6b94028, n = 4608K, CUDAPm1 v0.20 err = 0.17383 (0:38 real, 19.5907 ms/tran, ETA 8:56) Transforms: 1930 M83496293, 0x5c01c876ba23af0e, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:38 real, 19.6468 ms/tran, ETA 8:18) Transforms: 1924 M83496293, 0x4967e5714a906dd8, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.6022 ms/tran, ETA 7:40) Transforms: 1914 M83496293, 0xb5338d4f9734dcbf, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:38 real, 19.5649 ms/tran, ETA 7:03) Transforms: 1882 M83496293, 0xb3364da78f68767c, n = 4608K, CUDAPm1 v0.20 err = 0.17969 (0:37 real, 19.5884 ms/tran, ETA 6:26) Transforms: 1916 M83496293, 0x63c6b998ac49a7a0, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.5861 ms/tran, ETA 5:49) Transforms: 1844 M83496293, 0x9b385d7b61a51d47, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:36 real, 19.5965 ms/tran, ETA 5:13) Transforms: 1882 M83496293, 0xe0d8af2fcfffed20, n = 4608K, CUDAPm1 v0.20 err = 0.17188 (0:37 real, 19.5938 ms/tran, ETA 4:36) Transforms: 1896 M83496293, 
0x85a24d9c67bd9496, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.5903 ms/tran, ETA 3:59)
Transforms: 1986 M83496293, 0x71a887caf40e5bb7, n = 4608K, CUDAPm1 v0.20 err = 0.17627 (0:39 real, 19.5874 ms/tran, ETA 3:20)
Transforms: 1978 M83496293, 0x65c7d9d6c70197bf, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:39 real, 19.5815 ms/tran, ETA 2:41)
Transforms: 1986 M83496293, 0x8f7ecc43a94105ef, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:39 real, 19.5769 ms/tran, ETA 2:02)
Transforms: 1950 M83496293, 0xaac5ccee0aafbde0, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:38 real, 19.5877 ms/tran, ETA 1:24)
Transforms: 2036 M83496293, 0x34e6f17ecab893b1, n = 4608K, CUDAPm1 v0.20 err = 0.17188 (0:40 real, 19.5862 ms/tran, ETA 0:44)
Transforms: 2024 M83496293, 0x4b29a8a5677c72db, n = 4608K, CUDAPm1 v0.20 err = 0.17578 (0:40 real, 19.5816 ms/tran, ETA 0:04)
Stage 2 complete, 1710522 transforms, estimated total time = 9:18:00
Starting stage 2 gcd.
M83496293 Stage 2 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K CUDAPm1 v0.20)
(results.txt entry made, worktodo modified, next exponent started)
Abnormal 84M exponent: (end of stage 1 crashes before gcd, program restarted attempts to begin at stage 2 fail, stage 1 gcd message missing)
Iteration 994000 M84228073, 0xf6fe7d71235ae765, n = 4608K, CUDAPm1 v0.20 err = 0.21875 (0:37 real, 36.8486 ms/iter, ETA 0:55)
Iteration 995000 M84228073, 0xed35e0151d83c908, n = 4608K, CUDAPm1 v0.20 err = 0.22656 (0:36 real, 36.8537 ms/iter, ETA 0:19)
M84228073, 0xc840c55fb78fc6a2, n = 4608K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 10:15:26
batch wrapper reports cudapm1 exited at Sat 11/04/2017 12:12:38.23
batch wrapper reports CUDAPm1 (re)launch at Sat 11/04/2017 12:12:39.17
(from here repeats except batch wrapper date/time stamps change, until worktodo file is manually modified to remove the stuck exponent)
CUDAPm1 v0.20
Warning: Couldn't parse ini file option UnusedMem; using default.
------- DEVICE 0 -------
name Quadro 2000
Compatibility 2.1
clockRate (MHz) 1251
memClockRate (MHz) 1304
totalGlobalMem 1073741824
totalConstMem 65536
l2CacheSize 262144
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsPerMP 1536
multiProcessorCount 4
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
textureAlignment 512
deviceOverlap 1
No Quadro 2000 fft.txt file found. Using default fft lengths.
For optimal fft selection, please run
./CUDAPm1 -cufftbench 1 8192 r
for some small r, 0 < r < 6 e.g.
CUDA reports 952M of 1024M GPU memory free.
No Quadro 2000 threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDAPm1 -cufftbench 4608 4608 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 512, mult 128, norm2 128.
No stage 2 checkpoint.
Using up to 828M GPU memory.
Selected B1=690000, B2=12937500, 3.07% chance of finding a factor
Using B1 = 690000 from savefile.
Continuing stage 2 from a partial result of M84228073 fft length = 4608K
batch wrapper reports cudapm1 exited at Sat 11/04/2017 12:13:34.24
batch wrapper reports CUDAPm1 (re)launch at Sat 11/04/2017 12:13:36.14 |
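The log above shows the batch wrapper relaunching the program indefinitely on a stuck exponent. One way to avoid that is to cap the number of launches and hand control back to the operator. The sketch below is a hypothetical stand-in for such a wrapper, not the batch file actually used; it assumes the program returns exit code 0 on a clean finish and nonzero after a crash, which I have not verified for CUDAPm1, and the command name in the usage note is only an example.

```python
import subprocess
import time


# Hypothetical relaunch wrapper with a launch cap; exit-code meaning is assumed.
def relaunch(cmd, max_launches=3, delay_s=1.0, run=subprocess.call):
    """Run cmd until it exits with code 0 or max_launches is reached.

    Returns (launches_used, succeeded). The run parameter exists so the
    launcher can be swapped out, for example in a dry run.
    """
    for attempt in range(1, max_launches + 1):
        if run(cmd) == 0:
            return attempt, True
        if attempt < max_launches:
            time.sleep(delay_s)  # brief pause before the next launch
    return max_launches, False


if __name__ == "__main__":
    # Dry run with a fake launcher that fails twice, then succeeds.
    codes = iter([1, 1, 0])
    print(relaunch(["CUDAPm1"], max_launches=5, delay_s=0,
                   run=lambda _cmd: next(codes)))
```

When the cap is hit, the wrapper stops instead of looping, which is the point at which the stuck exponent would be removed from worktodo by hand, as described above.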
The 480 has 1.5x as much memory. I suspect that may be the issue.
|
1 Attachment(s)
[QUOTE=henryzz;471628]The 480 has 1.5x as much memory. I suspect that may be the issue.[/QUOTE]
Thanks for your reply. I considered that. I think it's too low an exponent for that to be the case. If you see I missed something, please explain.
Maybe the gcd uses much more memory than the rest of stage 1, but I was able to run up to about 290M on a GTX480 to completion, through stage 1 with gcd and stage 2 with gcd. Observed stage 1 memory usage is rather linear with exponent in CUDAPm1, with a regression fit of 54.5MB + 2.03 bytes times the exponent value p, so I'd expect p=~84.2M to require only about 225MB in stage 1.
Stage 2 memory usage is affected by both exponent and nrp selection; it picks nrp to fit within available memory up to an exponent where nrp=1, leaving at least about 200 MB of headroom on the GTX480 (presumably for the code to occupy). From these observations, and extrapolating downward in memory requirement from the two GTX480 runs with nrp=1 for p~250M and 290M, to 824MB required, I'd expect to be able to run up to p=~145M in a 1GB card. For p=83.5M, the Quadro 2000 supported nrp=13. From the nrp=13 point on the GTX480, at p=120M, it was able to run over double the exponent.
The program log from the 83.5M and 84.2M runs says for stage 2 on the Quadro 2000, "Using up to 828M GPU memory." The GTX480 says 1332MB for the same exponents. But I've found that is just an expression of the available memory, not the amount reported by GPU-Z as in use during a stage.
The Quadro 2000 passed a maximum-feasible-size 38-block memory test (38x25=950MB). |
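The stage 1 estimate quoted above can be reproduced directly from the regression fit. A minimal sketch, assuming MB here means 10^6 bytes (the post does not say which convention GPU-Z or the fit uses) and using a function name of my own:

```python
# Regression fit quoted in the post: 54.5MB + 2.03 bytes per unit of exponent.
# Treating 1MB as 10**6 bytes is an assumption made for this illustration.
def stage1_memory_mb(p, base_mb=54.5, bytes_per_exponent=2.03):
    """Return the estimated stage 1 GPU memory use in MB for exponent p."""
    return base_mb + bytes_per_exponent * p / 1e6


if __name__ == "__main__":
    print(stage1_memory_mb(84.2e6))  # exponent near the failing 84M runs
    print(38 * 25)                   # size in MB of the 38-block memory test
```

For p = 84.2M this comes out near 225MB, far below the 952M the Quadro 2000 reports free, which is why memory exhaustion in stage 1 itself looks unlikely.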
[QUOTE=kriesel;471677]...The GTX480 says 1332MB for the same exponents....[/QUOTE]
Are you overclocking your GTX 480? |
[QUOTE=storm5510;472646]Are you overclocking your GTX 480?[/QUOTE]
No. I have two, on the same machine. They came with different default clocks, 701 and 725. The 725 I downclock to 702. The 701 has been reliable; the 725/702 has repeatable memory errors in the middle of the address range, that at one time were reduced by downclocking but no longer are. So I use it only for trial factoring, which occupies memory not affected by the errors. I've become an advocate of testing as much gpu memory as possible, from what I've learned on that second GTX480. |
[QUOTE=kriesel;472689]No. I have two, on the same machine. They came with different default clocks, 701 and 725. The 725 I downclock to 702. The 701 has been reliable; the 725/702 has repeatable memory errors in the middle of the address range, that at one time were reduced by downclocking but no longer are. So I use it only for trial factoring, which occupies memory not affected by the errors. I've become an advocate of testing as much gpu memory as possible, from what I've learned on that second GTX480.[/QUOTE]
Interesting! I tried it on mine once. The gain was insignificant. The one I have defaults to 700. If I run a GPU process that causes it to reset itself, then that number drops to 450. It takes a cold-boot to get back to 700. |
[QUOTE=storm5510;472713]Interesting! I tried it on mine once. The gain was insignificant. The one I have defaults to 700. If I run a GPU process that causes it to reset itself, then that number drops to 450. It takes a cold-boot to get back to 700.[/QUOTE]
One of the two goes AWOL at varying intervals. I found that to get reliable P-1 or LL tests, I had to make the 702 card with the memory errors device zero. If it was device one, and P-1 or LL was set to run on device zero, then when the other card went AWOL, the bad-memory card dropped to device zero and caused problems with the P-1 or LL run. By AWOL I mean that the card is physically there, but GPU-Z finds only one device, a running GPU-Z already set to track device one stops displaying its sensor readings, the Windows event log shows a driver restart, and restarted cudapm1 and cudalucas don't find a device one. Clearing that up requires a shutdown/restart; from the command line, shutdown -r. |
multiple instances or dissimilar instances per gpu
Hi,
Has anyone experimented with running more than one instance of CUDAPm1 on a single GPU? Reason I ask is I'm used to seeing 100% GPU load in GPU-Z, with a single instance of CUDALucas or CUDAPm1 per GPU, but on a GTX1070 it varies 99-100%. Also I have found gains in running multiple Mfaktc instances, raising the GPU load from 98 to 100%, on a GTX480. In sharing a single GTX480 GPU between simultaneous single instances of CUDALucas and CUDAPm1, in a quick test, I'm calculating more combined throughput than either running alone, by several percent. Since I'm running numerous GPUs, if that holds up, it's the equivalent of adding another GPU. Any light you can shed on effects of multiple instances, such as confirming results, or negative results, on various GPU models, would be appreciated. |
[QUOTE=storm5510;472713]Interesting! I tried it on mine once. The gain was insignificant. The one I have defaults to 700. If I run a GPU process that causes it to reset itself, then that number drops to 450. It takes a cold-boot to get back to 700.[/QUOTE]
I have never seen a 450 clock rate on either of my GTX480's, or a lower clock after driver restart or program reset. I have seen them drop from 70x to 405, and then some seconds later down to 50.6, when there's little or no GPU processing load, and go back up with load. |
[QUOTE=kriesel;471608]Has anyone else seen something similar? A GTX480 had no equivalent problem on the same 84M exponents. This is CUDAPm1 v0.20 on Windows 64-bit Vista.
After a few successful stage 1 and stage 2 p-1 runs of ~83.5M, each following exponent >84M runs through stage 1, but not through stage 1 gcd or stage 2, crashing the program instead. Behavior is reproducible for exponents 84M+, including after program restarts, logouts, system restarts. M83496143 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=85D38BAC023FCFF8022AABA05F602C4C CUDAPm1 v0.20) reported 11/1/17 M83496227 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=A1656CF4111B3B15C4A71186811384FF CUDAPm1 v0.20) reported 11/2/17 M83496247 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=5F246BFB077E96AA450384EFEC8EC599 CUDAPm1 v0.20) reported 11/3/17 M83496293 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K, aid=725F9720C9179022C18CEA98F646F72E CUDAPm1 v0.20) reported 11/4/17 M50001781 has a factor: 4392938042637898431087689 (P-1, B1=430000, B2=5000000, e=2, n=2688K CUDAPm1 v0.20) All 5 exponents attempted above 84M failed: PFactor=A3B66EB4FAAE78E8F283D5C96AD37A__,1,2,84228073,-1,76,2 PFactor=DC8BDAFB8D89D04B3B35742B11D9CE__,1,2,84228097,-1,76,2 PFactor=C996CF4EA78E42F9610D9789BE1666__,1,2,84228103,-1,76,2 and two more A typical event log entry follows. From entry to entry, process id and application start time changes but other event data values do not. Log Name: Application Source: Application Error Date: 11/4/2017 7:23:36 PM Event ID: 1000 Task Category: (100) Level: Error Keywords: Classic User: N/A Computer: eagle Description: Faulting application CUDAPm1_win64_20131118_CUDA_50.exe, version 0.0.0.0, time stamp 0x5285815f, faulting module CUDAPm1_win64_20131118_CUDA_50.exe, version 0.0.0.0, time stamp 0x5285815f, exception code 0xc0000005, fault offset 0x000000000000dd20, process id 0xd78, application start time 0x01d355cc5142bacb. 
Event Xml: <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"> <System> <Provider Name="Application Error" /> <EventID Qualifiers="0">1000</EventID> <Level>2</Level> <Task>100</Task> <Keywords>0x80000000000000</Keywords> <TimeCreated SystemTime="2017-11-05T00:23:36.000Z" /> <EventRecordID>256</EventRecordID> <Channel>Application</Channel> <Computer>eagle</Computer> <Security /> </System> <EventData> <Data>CUDAPm1_win64_20131118_CUDA_50.exe</Data> <Data>0.0.0.0</Data> <Data>5285815f</Data> <Data>CUDAPm1_win64_20131118_CUDA_50.exe</Data> <Data>0.0.0.0</Data> <Data>5285815f</Data> <Data>c0000005</Data> <Data>000000000000dd20</Data> <Data>d78</Data> <Data>01d355cc5142bacb</Data> </EventData> </Event> Normal progression, 83M: (end of stage 1) Iteration 987000 M83496293, 0xf2fb4b229c8521b0, n = 4608K, CUDAPm1 v0.20 err = 0.16919 (0:37 real, 36.8380 ms/iter, ETA 0:39) Iteration 988000 M83496293, 0x9ad528e521e85730, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:37 real, 36.8401 ms/iter, ETA 0:03) M83496293, 0x232eab21eaf81e92, n = 4608K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 10:10:44 Starting stage 1 gcd. M83496293 Stage 1 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K CUDAPm1 v0.20) Starting stage 2. Using b1 = 685000, b2 = 12843750, d = 2310, e = 2, nrp = 13 Zeros: 573917, Ones: 658723, Pairs: 125889 Processing 1 - 13 of 480 relative primes. Inititalizing pass... done. transforms: 270, err = 0.16406, (5.09 real, 18.8644 ms/tran, ETA NA) Transforms: 2106 M83496293, 0x52b341a257507f69, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:41 real, 19.4671 ms/tran, ETA 9:14:05) Transforms: 2010 M83496293, 0x905f255bd35e844b, n = 4608K, CUDAPm1 v0.20 err = 0.17188 (0:39 real, 19.5838 ms/tran, ETA 9:15:02) Transforms: 2014 M83496293, 0x673b942ac1fc4ae2, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:40 real, 19.5771 ms/tran, ETA 9:14:52) ... Processing 469 - 480 of 480 relative primes. Inititalizing pass... done. 
transforms: 357, err = 0.17090, (6.88 real, 19.2605 ms/tran, ETA 14:07) Transforms: 2090 M83496293, 0x284e7914442300ef, n = 4608K, CUDAPm1 v0.20 err = 0.17090 (0:41 real, 19.4700 ms/tran, ETA 13:26) Transforms: 2058 M83496293, 0xb1c240cc360984b8, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:40 real, 19.5747 ms/tran, ETA 12:46) Transforms: 2012 M83496293, 0xfa21edbaa82e8d9d, n = 4608K, CUDAPm1 v0.20 err = 0.16992 (0:40 real, 19.5721 ms/tran, ETA 12:07) Transforms: 1958 M83496293, 0xfdc0e766f0aa5f44, n = 4608K, CUDAPm1 v0.20 err = 0.16992 (0:38 real, 19.5923 ms/tran, ETA 11:28) Transforms: 1980 M83496293, 0xf808c66bf88da80d, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:39 real, 19.5757 ms/tran, ETA 10:50) Transforms: 1998 M83496293, 0xed71c1b76d6c0757, n = 4608K, CUDAPm1 v0.20 err = 0.16602 (0:39 real, 19.5754 ms/tran, ETA 10:10) Transforms: 1910 M83496293, 0x9587bca9e6a92d95, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.5884 ms/tran, ETA 9:33) Transforms: 1902 M83496293, 0xdd50dacef6b94028, n = 4608K, CUDAPm1 v0.20 err = 0.17383 (0:38 real, 19.5907 ms/tran, ETA 8:56) Transforms: 1930 M83496293, 0x5c01c876ba23af0e, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:38 real, 19.6468 ms/tran, ETA 8:18) Transforms: 1924 M83496293, 0x4967e5714a906dd8, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.6022 ms/tran, ETA 7:40) Transforms: 1914 M83496293, 0xb5338d4f9734dcbf, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:38 real, 19.5649 ms/tran, ETA 7:03) Transforms: 1882 M83496293, 0xb3364da78f68767c, n = 4608K, CUDAPm1 v0.20 err = 0.17969 (0:37 real, 19.5884 ms/tran, ETA 6:26) Transforms: 1916 M83496293, 0x63c6b998ac49a7a0, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.5861 ms/tran, ETA 5:49) Transforms: 1844 M83496293, 0x9b385d7b61a51d47, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:36 real, 19.5965 ms/tran, ETA 5:13) Transforms: 1882 M83496293, 0xe0d8af2fcfffed20, n = 4608K, CUDAPm1 v0.20 err = 0.17188 (0:37 real, 19.5938 ms/tran, ETA 4:36) Transforms: 1896 M83496293, 
0x85a24d9c67bd9496, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:37 real, 19.5903 ms/tran, ETA 3:59) Transforms: 1986 M83496293, 0x71a887caf40e5bb7, n = 4608K, CUDAPm1 v0.20 err = 0.17627 (0:39 real, 19.5874 ms/tran, ETA 3:20) Transforms: 1978 M83496293, 0x65c7d9d6c70197bf, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:39 real, 19.5815 ms/tran, ETA 2:41) Transforms: 1986 M83496293, 0x8f7ecc43a94105ef, n = 4608K, CUDAPm1 v0.20 err = 0.16406 (0:39 real, 19.5769 ms/tran, ETA 2:02) Transforms: 1950 M83496293, 0xaac5ccee0aafbde0, n = 4608K, CUDAPm1 v0.20 err = 0.16797 (0:38 real, 19.5877 ms/tran, ETA 1:24) Transforms: 2036 M83496293, 0x34e6f17ecab893b1, n = 4608K, CUDAPm1 v0.20 err = 0.17188 (0:40 real, 19.5862 ms/tran, ETA 0:44) Transforms: 2024 M83496293, 0x4b29a8a5677c72db, n = 4608K, CUDAPm1 v0.20 err = 0.17578 (0:40 real, 19.5816 ms/tran, ETA 0:04) Stage 2 complete, 1710522 transforms, estimated total time = 9:18:00 Starting stage 2 gcd. M83496293 Stage 2 found no factor (P-1, B1=685000, B2=12843750, e=2, n=4608K CUDAPm1 v0.20) (results.txt entry made, worktodo modified, next exponent started) Abnormal 84M exponent: (end of stage 1 crashes before gcd, program restarted attempts to begin at stage 2 fail, stage 1 gcd message missing) Iteration 994000 M84228073, 0xf6fe7d71235ae765, n = 4608K, CUDAPm1 v0.20 err = 0.21875 (0:37 real, 36.8486 ms/iter, ETA 0:55) Iteration 995000 M84228073, 0xed35e0151d83c908, n = 4608K, CUDAPm1 v0.20 err = 0.22656 (0:36 real, 36.8537 ms/iter, ETA 0:19) M84228073, 0xc840c55fb78fc6a2, n = 4608K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 10:15:26batch wrapper reports cudapm1 exited at Sat 11/04/2017 12:12:38.23 batch wrapper reports CUDAPm1 (re)launch at Sat 11/04/2017 12:12:39.17 (from here repeats except batch wrapper date/time stamps change, until worktodo file is manually modified to remove the stuck exponent) CUDAPm1 v0.20 Warning: Couldn't parse ini file option UnusedMem; using default. 
------- DEVICE 0 ------- name Quadro 2000 Compatibility 2.1 clockRate (MHz) 1251 memClockRate (MHz) 1304 totalGlobalMem 1073741824 totalConstMem 65536 l2CacheSize 262144 sharedMemPerBlock 49152 regsPerBlock 32768 warpSize 32 memPitch 2147483647 maxThreadsPerBlock 1024 maxThreadsPerMP 1536 multiProcessorCount 4 maxThreadsDim[3] 1024,1024,64 maxGridSize[3] 65535,65535,65535 textureAlignment 512 deviceOverlap 1 No Quadro 2000 fft.txt file found. Using default fft lengths. For optimal fft selection, please run ./CUDAPm1 -cufftbench 1 8192 r for some small r, 0 < r < 6 e.g. CUDA reports 952M of 1024M GPU memory free. No Quadro 2000 threads.txt file found. Using default thread sizes. For optimal thread selection, please run ./CUDAPm1 -cufftbench 4608 4608 r for some small r, 0 < r < 6 e.g. Using threads: norm1 512, mult 128, norm2 128. No stage 2 checkpoint. Using up to 828M GPU memory. Selected B1=690000, B2=12937500, 3.07% chance of finding a factor Using B1 = 690000 from savefile. Continuing stage 2 from a partial result of M84228073 fft length = 4608K batch wrapper reports cudapm1 exited at Sat 11/04/2017 12:13:34.24 batch wrapper reports CUDAPm1 (re)launch at Sat 11/04/2017 12:13:36.14[/QUOTE] The plot thickens. I've successfully run higher exponents (~84.9m) on another Quadro 2000, with the same CUDAPm1 executable image, CUDA5.5 64-bit 20130923 V0.20 executable. BIOS versions on the GPUs differ in the right 6 characters; the problem occurred on the gpu with the lower BIOS version number 70 06 0F 00 0A, and not with 70 06 31 02 01. It was run with no fft file or threads file initially, 512, 128, 128 threads 4608k fft length, then retried to complete with fft and threads files and 256, 256, 32 threads, 4608k fft length and program still failed. The other GPU had fft and threads files created before beginning to run any P-1 attempts, which succeeded. I'm now attempting a new exponent ~84.9m on the unit that had trouble with 84.2m. 
If that fails I may run a thorough memory test on it. Other possibilities are card swap and retest, and BIOS update. Other ideas? |
[CODE]Transforms: 2024 M83496293, 0x4b29a8a5677c72db, n = 4608K, [COLOR=Red]CUDAPm1 v0.20[/COLOR] err = 0.17578 (0:40 real, 19.5816 ms/tran, ETA 0:04)[/CODE]I wish the part in [COLOR=Red]red[/COLOR] could be removed. It makes PowerShell, or Command Prompt, almost too wide to fit my screen.
|
[QUOTE=storm5510;473179][CODE]Transforms: 2024 M83496293, 0x4b29a8a5677c72db, n = 4608K, [COLOR=Red]CUDAPm1 v0.20[/COLOR] err = 0.17578 (0:40 real, 19.5816 ms/tran, ETA 0:04)[/CODE]I wish the part in [COLOR=Red]red[/COLOR] could be removed. It makes PowerShell, or Command Prompt, almost too wide to fit my screen.[/QUOTE]
Use smaller fonts? |
[QUOTE=Mark Rose;473185]Use smaller fonts?[/QUOTE]
That's an option. I generally run this in PowerShell. :smile: |
[QUOTE=kriesel;473144]The plot thickens. I've successfully run higher exponents (~84.9m) on another Quadro 2000, with the same CUDAPm1 executable image, CUDA5.5 64-bit 20130923 V0.20 executable. BIOS versions on the GPUs differ in the right 6 characters; the problem occurred on the gpu with the lower BIOS version number 70 06 0F 00 0A, and not with 70 06 31 02 01. It was run with no fft file or threads file initially, 512, 128, 128 threads 4608k fft length, then retried to complete with fft and threads files and 256, 256, 32 threads, 4608k fft length and program still failed. The other GPU had fft and threads files created before beginning to run any P-1 attempts, which succeeded. I'm now attempting a new exponent ~84.9m on the unit that had trouble with 84.2m. If that fails I may run a thorough memory test on it. Other possibilities are card swap and retest, and BIOS update. Other ideas?[/QUOTE]
Ok. Same GPU and system that reliably choked on exponents 84228073, 84228097, 84228103, 84228119, 84228229, just successfully ran to completion, M84861479, with same fft length etc. I'd expect the higher exponent to present more of a challenge, not less.
CUDA reports 830M of 1024M GPU memory free.
Index 64
Using threads: norm1 256, mult 256, norm2 32.
Using up to 720M GPU memory.
Selected B1=690000, B2=12420000, 3.05% chance of finding a factor
Starting stage 1 P-1, M84861479, B1 = 690000, B2 = 12420000, fft length = 4608K
Doing 995519 iterations
Iteration 5000 M84861479, 0x85dcbca418bb3656, n = 4608K, CUDAPm1 v0.20 err = 0.27344 (3:03 real, 36.6115 ms/iter, ETA 10:04:24)
...
Iteration 995000 M84861479, 0xb98ed42b48260d4a, n = 4608K, CUDAPm1 v0.20 err = 0.25000 (3:02 real, 36.5191 ms/iter, ETA 0:18)
M84861479, 0x4a2093b79c7bf108, n = 4608K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 10:06:31
Starting stage 1 gcd.
M84861479 Stage 1 found no factor (P-1, B1=690000, B2=12420000, e=0, n=4608K CUDAPm1 v0.20)
Starting stage 2.
Using b1 = 690000, b2 = 12420000, d = 2310, e = 2, nrp = 10
Zeros: 554802, Ones: 637038, Pairs: 121194
Processing 1 - 10 of 480 relative primes.
...
Stage 2 complete, 1766191 transforms, estimated total time = 9:31:47
Starting stage 2 gcd.
M84861479 Stage 2 found no factor (P-1, B1=690000, B2=12420000, e=2, n=4608K CUDAPm1 v0.20)
Weird, but I'll take it. A couple other things I had thought of to try were matching OS and system ram on another box & GPU and retrying there. System that ran the problem 84.2m exponents to completion had a newer Windows OS and twice the system ram. ... |
CUDAPm1 runtime scaling, etc.
1 Attachment(s)
The previously posted pdf has been extended to include the effect of GPU ram size on number of relative primes processed in a pass in stage 2, for the exponents near the current primenet manual assignment issue values, ~85,000,000. (See page 4 of the attached pdf.)
Note, nrp has been observed to fluctuate from run to run on the same hardware, and/or identical GPU model on another system, for very similar exponents (examples 1GB, nrp 10 & 13; 1.5GB, 24 & 27). This may be due to some stage 2 runs beginning and selecting an NRP value while another application (mfaktc or cudalucas) was also running on the GPU and occupying some GPU ram. Values tabulated were those first obtained in testing, without attention to GPU sharing. So these values could be considered a lower bound for what should be feasible when not running other gpu applications. NRP is very linear with available GPU ram, up to 4GB, followed by only slight increase to 8GB, in this exponent range. |
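To make the practical effect of nrp concrete, here is a small sketch of how many stage 2 passes a given nrp implies. It relies on the logs quoted earlier in the thread, which show d = 2310 and "480 relative primes", processed in groups such as "1 - 13" and "469 - 480". The assumption that passes are simply consecutive groups of nrp relative primes is mine, and both function names are hypothetical.

```python
from math import gcd


def relative_primes(d: int) -> int:
    """Count integers in [1, d) coprime to d (Euler's totient, by brute force)."""
    return sum(1 for k in range(1, d) if gcd(k, d) == 1)


def stage2_passes(d: int, nrp: int) -> int:
    """Passes needed if each pass handles up to nrp relative primes (assumed model)."""
    total = relative_primes(d)
    return -(-total // nrp)  # ceiling division


if __name__ == "__main__":
    d = 2310  # value of d shown in the stage 2 log lines
    print(f"relative primes for d = {d}: {relative_primes(d)}")
    for nrp in (10, 13, 24, 27):  # nrp values mentioned in the post above
        print(f"nrp = {nrp:>2}: {stage2_passes(d, nrp)} passes")
```

Under that model, a run-to-run drop in nrp caused by another application holding GPU ram means more passes over the same 480 relative primes, which is one reason to start stage 2 with the GPU otherwise idle.
|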
[QUOTE=chalsall;473220]If a tree falls in the forest and there is no one around to hear, does it make a sound?[/QUOTE]
Yes, by definition. [url]https://en.wikipedia.org/wiki/Sound[/url]:google: Slow day? Are you bored and looking to get banned or blocked again? It would be better if you instead try to contribute something that might be useful or interesting, at least to a newbie. (Or remain below 0 db.) |
[QUOTE=kriesel;473224]Yes, by definition. [url]https://en.wikipedia.org/wiki/Sound[/url][/QUOTE]
Also No, by definition, using the SAME link. It depends on whether you use the physics definition or the physiology definition of sound. |
[QUOTE=wblipp;473519]Also No, by definition, using the SAME link. It depends on whether you use the physics definition or the physiology definition of sound.[/QUOTE]
Well, I'm an engineer, so tend to look toward the physical natural mechanisms. The speed of sound is derivable as a shock wave asymptotically approaching a pressure wave ratio of 1. Would it make sense to say there was no explosion if no one was there to be hurt? Also, I grew up on a farm, near forests, and both farm and forest held far more animals than humans. "No one" refers to humans, but many other creatures also have ears or other sense organs for acoustic signals. There is a sound (acoustic signal), even if the only potential hearer is deaf. To say otherwise is like saying that because an illiterate person looked at your message and got nothing out of it, there was no writing; or a person who doesn't understand English heard you read it, so there was no speech. |
Updated bug and wishlist for cudapm1
1 Attachment(s)
Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.
|
Hi kriesel,
I am interested in running P-1 tests with my GPU, but finding information about cudap-1 is difficult. Is there a place with updated code? The only one I can find is the one from the original author. If I understood correctly, patches have been created, and I would prefer to have them applied. Since the program is based on CUDALucas, I will guess that the instructions are similar, but if you have any additional information for me, that would be great! If binaries for Windows exist, I am also happy to test. |
[QUOTE=Cubox;481516]Hi kriesel,
I am interested in running p-1 tests with my GPU, however finding information about cudap-1 is difficult. Is there a place with updated code? The only one I can find is the one from the original author. If I understood correctly, patches have been created, I would prefer to have them applied. The program being based on CUDALucas, I will guess that instructions are similar, but if you have any information for me in addition, that would be great! If binaries for Windows exists, I am also happy to test.[/QUOTE] If you're referring to the bug and wish list I made, the code edits mentioned in some parts have not been turned into updated executables yet or tested & debugged. Available software is described periodically at [URL]http://www.mersenneforum.org/showthread.php?t=22450&page=3[/URL] |
Windows binaries for CudaPM1 are available at [url]https://download.mersenne.ca/[/url] but they're 5 years old.
|
[QUOTE=Cubox;481516]Hi kriesel,
I am interested in running p-1 tests with my GPU, however finding information about cudap-1 is difficult. Is there a place with updated code? The only one I can find is the one from the original author. If I understood correctly, patches have been created, I would prefer to have them applied. The program being based on CUDALucas, I will guess that instructions are similar, but if you have any information for me in addition, that would be great! If binaries for Windows exists, I am also happy to test.[/QUOTE] What model is your gpu? |
[QUOTE=James Heinrich;481536]Windows binaries for CudaPM1 are available at [url]https://download.mersenne.ca/[/url] but they're 5 years old.[/QUOTE]
I saw those, and do not wish to use them. I would like to ensure the software I run is updated. This is why I am asking here about updates to this code.
[QUOTE=kriesel]What model is your gpu?[/QUOTE]
MSI GTX 1070 8G
I am running CUDALucas2.06beta at the moment, doing some double checking LLs. The card is stable-ish. Over the 53 DC I have done, only 3 (updated, was 4 before edit) were bad. (One was a stupid overclock I did). I am willing to compile my binaries and/or help with testing updated code if you have patches. |
[QUOTE=kriesel;481522]If you're referring to the bug and wish list I made, the code edits mentioned in some parts have not been turned into updated executables yet or tested & debugged.
Available software is described periodically at [URL]http://www.mersenneforum.org/showthread.php?t=22450&page=3[/URL][/QUOTE] The CUDAp-1 software mentioned in your list of mersenne hunting software pdf (very useful for newcomers!) states Jan 2016 as 'Approx date' for CUDAp-1. [URL]https://sourceforge.net/projects/cudapm1/files/[/URL] has last code update in 2013, last binaries are from 2013 as well. |
[QUOTE=Cubox;481580]I saw those, and do not wish to use them. I would like to ensure the software I run is updated. This is why I am asking here about updates to this code.
MSI GTX 1070 8G I am running CUDALucas2.06beta at the moment, doing some double checking LLs. The card is stable-ish. Over the 53 DC I have done, only 3 (updated, was 4 before edit) were bad. (One was a stupid overclock I did). I am willing to compile my binaries and/or help with testing updated code if you have patches.[/QUOTE] As far as I know, v0.20, approx Nov 2013, is the latest available executable for Windows. There was something dated June 2015 for linux. Thanks for volunteering to help change that.
What programming experience do you have? Are you familiar with posting code on sourceforge?
The first step is to get the development environment together, and demonstrate to yourself that you can compile and link gpu code and produce something functional. (That doesn't have to be CUDAPm1 initially; it could be CUDALucas or mfaktc, or any tiny demo CUDA app for quick turnaround.) I suggest aiming for CUDA6.5 or CUDA8.0, 64-bit Windows executables. (I've seen speed advantages with CUDA6.x over other versions, in CUDALucas with extensive benchmarking. Driver version didn't make any detectable difference. But it can vary from card to card.) The GTX1070 requires CUDA 8, as I recall. A lot of us have older cards that perform faster at lower CUDA levels. For tools, I think the NVIDIA CUDA SDK and MS VC Community Edition are what's needed. Perhaps Jerry (flashjh) could advise how to set up for multiple CUDA levels.
Then we can get into developing a v0.21 beta with some minor tweaks and bug fixes, and go from there.
Six percent bad runs (3/53) seems a bit high to me. |
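On the 3 bad results out of 53 double checks: a quick sketch of the arithmetic, plus a rough uncertainty range, since 53 runs is a small sample. The Wilson score interval used here is a standard statistical formula chosen for illustration; nothing in the thread says anyone computed it, and the function names are hypothetical.

```python
from math import sqrt


def error_rate(bad: int, total: int) -> float:
    """Observed fraction of bad runs."""
    return bad / total


def wilson_interval(bad: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p_hat = bad / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    bad, total = 3, 53  # figures from the post above
    low, high = wilson_interval(bad, total)
    print(f"observed rate: {error_rate(bad, total):.1%}")
    print(f"approx. 95% range: {low:.1%} to {high:.1%}")
```

The observed rate rounds to the "six percent" mentioned above, but the range is wide with so few runs, so a handful of additional double checks could move the estimate noticeably in either direction.
|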
[QUOTE=kriesel;481604]As far as I know, v0.20, approx Nov 2013, is the latest available executable for Windows. There was something dated June 2015 for linux. Thanks for volunteering to help change that.
What programming experience do you have? Are you familiar with posting code on sourceforge? First step is to get the development environment together, and demonstrate to yourself that you can compile and link gpu code and produce something functional. (That doesn't have to be CUDAPm1 initially; could be CUDALucas or mfaktc, or any tiny demo CUDA app for quick turnaround.) I suggest aiming for CUDA6.5 or CUDA8.0, 64-bit Windows executables. (I've seen speed advantages with CUDA6.x over other versions, in CUDALucas with extensive benchmarking. Driver version didn't make any detectable difference. But it can vary vs. card.) The GTX1070 requires CUDA 8, as I recall. A lot of us have older cards that perform faster at lower CUDA levels. I think NVIDIA CUDA SDK; MS VC Community Edition. Perhaps Jerry (flashjh) could advise how to set up for multiple CUDA levels. Then we can get into developing a v0.21 beta with some minor tweaks and bug fixes, and go from there. Six percent bad runs seems a bit high to me (3/53)[/QUOTE] I am good with C, kinda good with C++, used to work on Linux and OSX, not Windows. I know all about posting source on Github. I'll try to go compile the latest CUDALucas. I will keep you updated, however due to my free time being an unknown quantity, I might take a few days. |
[QUOTE=Cubox;481663]I will keep you updated, however due to my free time being an unknown quantity, I might take a few days.[/QUOTE]
No problem, I can relate. Some things have waited nearly 5 years, some longer; they can wait a few more days or weeks. |
cudapm1 images
[QUOTE=James Heinrich;481536]Windows binaries for CudaPM1 are available at [URL]https://download.mersenne.ca/[/URL] but they're 5 years old.[/QUOTE]
This looks rather comprehensive for Windows binaries, and apparently contains no linux executables. Clicking on the link at mersenne.ca, [url]http://www.mersenneforum.org/CUDAPm1/[/url], I get a 404 error. The June 23 2015 Linux build is on sourceforge but not on mersenne.ca. I wonder if that linux version is the only build with [r52] "reduced register use on square kernel", since that sourceforge entry is dated Nov 25 2013, slightly after the newest Windows build (Nov 18 2013). [URL]https://sourceforge.net/p/cudapm1/code/HEAD/tree/trunk/[/URL] The wiki page at [url]http://mersennewiki.org/index.php/CUDAPm1[/url] is not an article (yet?), so much as 3 links, to James' mirror, the SourceForge folder, and this discussion thread. |
[QUOTE=Cubox;481581]The CUDAp-1 software mentioned in your list of mersenne hunting software pdf (very useful for newcomers!) states Jan 2016 as 'Approx date' for CUDAp-1.
[URL]https://sourceforge.net/projects/cudapm1/files/[/URL] has last code update in 2013, last binaries are from 2013 as well.[/QUOTE] Sorry, Jan 2016 in the CUDAPm1 date cell was probably a late-night-edit-error. (clLucas not CUDAPm1 as I recall.) See post 503 in this thread for a hopefully more accurate reflection of the latest CUDAPm1 versions currently available. I'll fix the pdf soon. (Then, hopefully, you'll make it obsolete, by producing something newer...) |
CUDAPm1 bug and wish list update
1 Attachment(s)
Here is today's version of the list I am maintaining. As always, this is in appreciation of the authors' past contributions. Users may want to browse this for workarounds included in some of the descriptions, and for an awareness of some known pitfalls. Please respond with any comments, additions or suggestions you may have.
|
The current version seems to be working on the GTX1080 Ti with W10 x64 (I didn't do any extensive testing or performance optimization).
[code]
C:\CUDAPm1_v0.20>CUDAPm1_v0.20.exe 60593041, -b1 1000
CUDAPm1 v0.20
Warning: Couldn't parse ini file option Threads; using default: 256
Warning: Couldn't parse ini file option CheckRoundoffAllIterations; using default: off
Warning: Couldn't parse ini file option Polite; using default: 1
Warning: Couldn't parse ini file option DeviceNumber; using default: 0
Warning: Couldn't parse ini file option WorkFile; using default "worktodo.txt"
Warning: Couldn't parse ini file option ResultsFile; using default "results.txt"
Warning: Couldn't parse ini file option UnusedMem; using default.
CUDA reports 9310M of 11264M GPU memory free.
Index 50
No GeForce GTX 1080 Ti threads.txt file found. Using default thread sizes.
For optimal thread selection, please run
./CUDAPm1 -cufftbench 3584 3584 r
for some small r, 0 < r < 6 e.g.
Using threads: norm1 256, mult 128, norm2 128.
Using up to 4284M GPU memory.
Starting stage 1 P-1, M60593041, B1 = 1000, B2 = 13320000, fft length = 3584K
Doing 1475 iterations
Running careful round off test for 1000 iterations.
If average error > 0.25, the test will restart with a longer FFT.
Iteration 100, average error = 0.01770, max error = 0.02539
Iteration 200, average error = 0.02034, max error = 0.02734
Iteration 300, average error = 0.02122, max error = 0.02734
Iteration 400, average error = 0.02165, max error = 0.02637
Iteration 500, average error = 0.02194, max error = 0.02734
Iteration 600, average error = 0.02210, max error = 0.02686
Iteration 700, average error = 0.02226, max error = 0.02734
Iteration 800, average error = 0.02232, max error = 0.02637
Iteration 900, average error = 0.02238, max error = 0.02637
Iteration 1000, average error = 0.02240 <= 0.25 (max error = 0.02734), continuing test.
M60593041, 0x962b95049cafb7d9, n = 3584K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 0:03
Starting stage 1 gcd.
M60593041 has a factor: 2105528336291622770155712978260232660484461209 (P-1, B1=1000, B2=1000, e=0, n=3584K CUDAPm1 v0.20)
[/code]
fft bench:
[code]
Device              GeForce GTX 1080 Ti
Compatibility       6.1
clockRate (MHz)     1607
memClockRate (MHz)  5505

 fft    max exp  ms/iter
   1      22133   0.0355
   2      43633   0.0390
   4      85933   0.0478
  32     657719   0.0693
  44     898213   0.0791
  64    1296011   0.0839
  81    1631969   0.0987
  96    1927129   0.0989
 112    2240863   0.1025
 128    2553659   0.1204
 160    3176779   0.1251
 200    3951977   0.1446
 224    4415431   0.1553
 256    5031737   0.1925
 288    5646379   0.2212
 294    5761451   0.2562
 320    6259537   0.2708
 324    6336103   0.2832
 392    7634537   0.3099
 400    7786967   0.3304
 448    8700169   0.3338
 512    9914521   0.3805
 576   11125619   0.4453
 648   12484649   0.5054
 686   13200581   0.5413
 800   15343429   0.5486
 864   16543493   0.6236
1024   19535569   0.6952
1080   20580341   0.8218
1120   21325891   0.8564
1152   21921901   0.8756
1176   22368691   0.9074
1296   24599717   0.9129
1372   26010389   1.0312
1568   29640913   1.0384
1600   30232693   1.0678
1728   32597297   1.1680
1792   33778141   1.2742
2048   38492887   1.2833
2160   40551479   1.5437
2304   43194913   1.5569
2560   47885689   1.7060
2592   48471289   1.7171
2625   49075057   1.9772
2688   50227213   1.9787
2744   51250889   1.9848
2800   52274087   2.0086
3136   58404433   2.0353
3200   59570449   2.2746
3240   60298969   2.2818
3584   66556463   2.3477
4096   75846319   2.5299
4608   85111207   3.0311
4800   88579669   3.3866
5120   94353877   3.3908
5184   95507747   3.4069
5292   97454309   3.8099
5600  103000823   3.8417
5832  107174381   4.0325
6144  112781477   4.1750
6272  115080019   4.2456
6400  117377567   4.4651
6480  118813021   4.5797
6912  126558077   4.6116
7168  131142761   4.7072
7200  131715607   4.9283
8192  149447533   5.1292
[/code] |
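As an aid to reading the fft bench table above, here is a minimal sketch that picks the smallest FFT length whose "max exp" column covers a given exponent. Only a few rows from the 1080 Ti table are copied in. The function name is hypothetical, and CUDAPm1's real selection may also weigh the measured timings, so treat this as a way to read the table rather than a description of the program's logic.

```python
from bisect import bisect_left

# (fft length in K, max exponent) rows copied from the GTX 1080 Ti bench above.
# Only an excerpt; the list must stay sorted by max exponent.
BENCH = [
    (3136, 58_404_433),
    (3200, 59_570_449),
    (3240, 60_298_969),
    (3584, 66_556_463),
    (4096, 75_846_319),
    (4608, 85_111_207),
    (4800, 88_579_669),
]


def smallest_fft_k(p: int) -> int:
    """Smallest FFT length (in K) in the excerpt whose max exponent is >= p."""
    max_exps = [max_exp for _, max_exp in BENCH]
    i = bisect_left(max_exps, p)  # first row with max exponent >= p
    if i == len(BENCH):
        raise ValueError(f"exponent {p} exceeds the excerpted table")
    return BENCH[i][0]


if __name__ == "__main__":
    for p in (60_593_041, 84_228_073):
        print(f"M{p}: {smallest_fft_k(p)}K")
```

For the two exponents shown, the lookup agrees with the "fft length" values reported in the run logs in this thread (3584K for M60593041 and 4608K for M84228073). Exponents below the first excerpted row simply return that row, so the full table is needed for small exponents.
|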
gtx1070 for comparison
[QUOTE=VictordeHolland;482568]The current version seems to be working on the GTX1080 Ti with W10 x64 (didn't do any extensive tests or performance optimalisations)[/QUOTE]
Looks like the 1080 Ti is nearly the equal of a pair of GTX1070s. What's the largest exponent you can successfully run on the 1080 Ti with its 11GB VRAM? I've run 314M on the 1070 ok, but 628M had problems continuing from the stage 1 gcd or performing it. (I think the former based on GPU-Z indications) The GTX480's limit was about 290M for stage 2 due to 1.5GB memory size becoming inadequate at nrp=1.
[CODE]
Device GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004

fft max exp ms/iter
2 43633 0.0606
4 85933 0.0630
8 169409 0.0911
16 333803 0.0913
32 657719 0.0953
64 1296011 0.1109
80 1612249 0.1237
81 1631969 0.1408
96 1927129 0.1428
100 2005673 0.1436
112 2240863 0.1488
120 2397383 0.1716
128 2553659 0.1794
144 2865601 0.1882
160 3176779 0.2148
162 3215629 0.2467
168 3332107 0.2524
200 3951977 0.2622
216 4261051 0.2945
224 4415431 0.2989
225 4434721 0.3248
256 5031737 0.3341
288 5646379 0.3603
320 6259537 0.4237
324 6336103 0.4458
336 6565633 0.5069
392 7634537 0.5102
400 7786967 0.5271
432 8395997 0.5558
448 8700169 0.5791
512 9914521 0.6009
540 10444757 0.7232
576 11125619 0.7246
640 12333809 0.8014
648 12484649 0.8258
672 12936919 0.9232
686 13200581 0.9234
720 13840423 0.9244
800 15343429 0.9298
864 16543493 1.0297
1024 19535569 1.1486
1080 20580341 1.3637
1125 21419011 1.4440
1134 21586693 1.4747
1152 21921901 1.4855
1176 22368691 1.5284
1280 24302527 1.5325
1296 24599717 1.5563
1323 25101101 1.7481
1344 25490893 1.7790
1350 25602229 1.7805
1400 26529691 1.7827
1568 29640913 1.8353
1600 30232693 1.8536
1728 32597297 2.0343
1750 33003301 2.2177
1792 33778141 2.2198
2048 38492887 2.2744
2304 43194913 2.6746
2560 47885689 3.0174
2592 48471289 3.0979
2688 50227213 3.5028
2700 50446621 3.5501
2800 52274087 3.5831
2916 54392209 3.6662
3136 58404433 3.7083
3200 59570449 4.0342
3240 60298969 4.1233
3584 66556463 4.2461
3600 66847171 4.6064
4096 75846319 4.6173
4608 85111207 5.4760
4800 88579669 6.1239
5120 94353877 6.1506
5184 95507747 6.2963
5292 97454309 6.9197
5600 103000823 7.0910
5832 107174381 7.4497
6144 112781477 7.7539
6272 115080019 7.8423
6400 117377567 8.4223
6480 118813021 8.5396
6912 126558077 8.5851
7168 131142761 9.0281
7200 131715607 9.4287
8192 149447533 9.7002
8640 157439981 11.4261
9216 167703023 11.7002
9408 171120919 12.9847
9600 174537299 12.9942
9720 176671801 13.2919
10080 183071879 13.7479
10240 185914837 13.9074
10368 188188471 14.6202
11200 202952693 14.6974
11664 211176269 15.7289
12096 218826341 16.3628
12544 226753511 16.5236
12800 231280639 17.2002
12960 234109067 17.6919
13824 249369863 18.0687
14336 258403573 18.5125
14400 259532291 19.2037
15552 279831199 20.5104
16384 294471259 20.9802
18432 330441847 23.5745
18816 337176443 26.0162
20480 366326371 26.8871
20736 370806323 29.1363
21168 378363589 29.6717
21504 384239189 30.1835
21952 392070229 30.5201
23040 411074273 30.6741
23328 416101459 32.0017
25088 446794913 34.3478
25600 455715121 35.5808
27648 491358173 37.0692
28672 509158127 38.0063
28800 511382147 38.6743
32768 580225813 41.9480
32805 580866907 47.4597
33075 585544397 48.3338
36864 651102253 49.4871
39200 691446799 56.7610
41472 730636397 58.1385
42336 745527179 62.3263
44800 787958201 62.5338
46080 809980289 64.9344
49152 862780273 68.7844
50176 880364279 71.1277
51200 897940567 75.0087
51840 908921869 75.8619
55296 968171579 77.0567
57344 1003244573 78.9115
57600 1007626787 80.0893
65536 1143276383 87.6720
[/CODE]Obtained with, and followed by, something resembling the following (actually run in stages)
[CODE]set exe=cudaPm1_win64_20131118_CUDA_50.exe
set model=GeForce GTX 1070
set ntimes=2
set dev=0
:some gpus can't do the whole span, so are run in portions to obtain some fft results
%exe% -d %dev% -cufftbench 1 32768 1 >>cudapm1start.txt
rename "%model% fft.txt" "%model% fft save.txt"
if errorlevel 1 goto skip
%exe% -d %dev% -cufftbench 32768 65536 1 >>cudapm1start.txt
for %%a in ( 4096 5120 6144 ) do %exe% -d %dev% -cufftbench %%a %%a 1 >>cudapm1start.txt
for %%a in ( 4608 4800 5184 5292 5600 5832 6272 6400 6480 6912 7168 7200 8192 ) do %exe% -d %dev% -cufftbench %%a %%a 1 >>cudapm1start.txt
for %%a in ( 8640 9216 9408 9600 9720 10080 10240 10368 11200 11664 12096 12544 12800 12960 13824 14336 14400 15552 16384 ) do %exe% -d %dev% -cufftbench %%a %%a 1 >>cudapm1start.txt
for %%a in ( 18432 18816 20480 20736 21168 21504 21952 23040 23328 25088 25600 27648 28672 28800 32768 ) do %exe% -d %dev% -cufftbench %%a %%a 1 >>cudapm1start.txt
:>32m-64M
for %%a in ( 32805 33075 36864 39200 41472 42336 44800 46080 49152 50176 51200 51840 55296 57344 57600 65536 ) do %exe% -d %dev% -cufftbench %%a %%a %ntimes% >>cudapm1start.txt
[/CODE] |
highest exponents successfully run? Issues seen on high exponents?
What are the highest exponents you've successfully run in CUDAPm1 through stage 1 including gcd?
Through both stage 1 and stage 2 including gcds? What hardware was it run on? If a run failed on a high exponent, what issues were seen? |
Manually reported P-1 results are getting marked as expired assignments
FYI: more at [url]http://www.mersenneforum.org/showpost.php?p=486151&postcount=1499[/url]
|
Improved recovery from Windows TDRs on old gpus
See the detailed writeup at [URL]http://www.mersenneforum.org/showpost.php?p=488288&postcount=37[/URL]
|
[QUOTE=kriesel;488460]See the detailed writeup at [URL]http://www.mersenneforum.org/showpost.php?p=488288&postcount=37[/URL][/QUOTE]
I've run it on Windows 10 x64 v1709 and the versions that came before, with no issues on any of them. Now MS is pushing 1803 at everyone. I had a couple of unrelated applications that would no longer function after that update. I have not tried this on 1803 yet, but will. |
Here is a Windows 10 x64 v1803 Benchmark:
[QUOTE]Device GeForce GTX 1080
Compatibility 6.1
clockRate (MHz) 1835
memClockRate (MHz) 5005

fft max exp ms/iter
1 22133 0.0208
2 43633 0.0279
4 85933 0.0427
32 657719 0.0448
36 738083 0.0618
64 1296011 0.0674
72 1454273 0.0805
80 1612249 0.0871
96 1927129 0.1100
100 2005673 0.1170
108 2162543 0.1219
112 2240863 0.1229
128 2553659 0.1298
144 2865601 0.1413
160 3176779 0.1476
162 3215629 0.1908
200 3951977 0.2008
208 4106587 0.2418
216 4261051 0.2467
225 4434721 0.2572
256 5031737 0.2673
288 5646379 0.3193
320 6259537 0.3488
324 6336103 0.3624
392 7634537 0.4049
400 7786967 0.4346
432 8395997 0.4580
448 8700169 0.4709
512 9914521 0.5011
576 11125619 0.6026
648 12484649 0.6723
686 13200581 0.7345
800 15343429 0.7485
864 16543493 0.8335
1024 19535569 0.9290
1080 20580341 1.1098
1120 21325891 1.1732
1125 21419011 1.1940
1152 21921901 1.2013
1176 22368691 1.2195
1296 24599717 1.2244
1372 26010389 1.4071
1568 29640913 1.4139
1600 30232693 1.4501
1728 32597297 1.5729
1792 33778141 1.7635
2048 38492887 1.7638
2160 40551479 2.1364
2304 43194913 2.1590
2592 48471289 2.3442
2700 50446621 2.7283
2744 51250889 2.7442
3136 58404433 2.7904
3200 59570449 3.1828
3240 60298969 3.2006
3584 66556463 3.2837
4096 75846319 3.5004
4608 85111207 4.2431
5184 95507747 4.7036
5292 97454309 5.3005
5600 103000823 5.4040
5832 107174381 5.6629
6048 111056879 5.8718
6144 112781477 5.9137
6272 115080019 5.9963
6400 117377567 6.2128
6480 118813021 6.4584
6912 126558077 6.5339
7168 131142761 6.6882
7200 131715607 6.9528
8192 149447533 7.1364
9216 167703023 8.5473
9408 171120919 9.5003
9600 174537299 9.6540
9604 174608443 9.9670
9720 176671801 10.1752
9800 178094491 10.2449
10080 183071879 10.4060
10240 185914837 10.4187
10368 188188471 10.9104
11200 202952693 10.9833
11664 211176269 11.5556
12096 218826341 11.9058
12544 226753511 12.3132
12800 231280639 12.6450
12960 234109067 13.1032
13824 249369863 13.3780
14336 258403573 13.6714
14400 259532291 14.0879
16384 294471259 15.1884[/QUOTE] |
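The fft bench tables posted so far for the GTX 1080 Ti, GTX 1080 and GTX 1070 can be compared at the FFT lengths they share. Below is a minimal Python sketch of that comparison, using a handful of ms/iter values copied by hand from those tables. The names `BENCH` and `speed_ratios` are made up for illustration and are not part of CUDAPm1.

```python
# ms/iter at a few FFT lengths (in K), copied from the benchmark tables in this thread.
BENCH = {
    "GTX 1080 Ti": {1024: 0.6952, 2048: 1.2833, 4096: 2.5299, 8192: 5.1292},
    "GTX 1080":    {1024: 0.9290, 2048: 1.7638, 4096: 3.5004, 8192: 7.1364},
    "GTX 1070":    {1024: 1.1486, 2048: 2.2744, 4096: 4.6173, 8192: 9.7002},
}


def speed_ratios(fast, slow):
    """Return {fft_length: slow_ms / fast_ms} for the lengths both tables contain."""
    return {n: slow[n] / fast[n] for n in sorted(fast.keys() & slow.keys())}


if __name__ == "__main__":
    baseline = BENCH["GTX 1070"]
    for name in ("GTX 1080 Ti", "GTX 1080"):
        for n, r in speed_ratios(BENCH[name], baseline).items():
            print(f"{name:12s} {n:>5}K  {r:.2f}x the GTX 1070 iteration rate")
```

On these four lengths the 1080 Ti comes out somewhere between 1.6x and 1.9x the 1070, which fits the "nearly a pair of GTX1070s" remark. Only four sample lengths are used here, so treat it as a rough check, not a full comparison.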
Reference Material
I was offered "a blog area to consolidate all of your pdfs and guides and stuff" and accepted.
Feel free to have a look and suggest content. (G-rated only;)
General interest gpu related reference material [URL]http://www.mersenneforum.org/showthread.php?t=23371[/URL]
CUDAPm1 P-1 factoring with CUDA on gpus [URL]http://www.mersenneforum.org/showthread.php?t=23389[/URL]
Future updates to material previously posted in this thread will probably occur on the blog threads and not here. Having in-place update without a time limit makes it more manageable there. |
P-1 stage 2 residues not reproducing
CUDAPm1 gives 64-bit residues in stage 1 and stage 2. I thought they would reproduce. So look at this. First run, start to finish on a GTX1060, gave in part,
[CODE]Iteration 4050000 M425000083, 0x45bcabd2d9a7a6f7, n = 24192K, CUDAPm1 v0.20 err = 0.26563 (44:19 real, 53.1666 ms/iter, ETA 42:48) M425000083, 0x03b1ecbe222d57ae, n = 24192K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 60:43:07 Starting stage 1 gcd. M425000083 Stage 1 found no factor (P-1, B1=2840000, B2=34080000, e=0, n=24192K CUDAPm1 v0.20) Starting stage 2. Using b1 = 2840000, b2 = 34080000, d = 2310, e = 2, nrp = 4 Zeros: 1632727, Ones: 1613513, Pairs: 274647 Processing 1 - 4 of 480 relative primes. Inititalizing pass... done. transforms: 202, err = 0.25000, (5.29 real, 26.1963 ms/tran, ETA NA) Transforms: 53864 M425000083, 0x240cabc495e881a9, n = 24192K, CUDAPm1 v0.20 err = 0.25000 (25:05 real, 27.9513 ms/tran, ETA 50:04:27) Processing 5 - 8 of 480 relative primes. Inititalizing pass... done. transforms: 233, err = 0.21250, (6.80 real, 29.1806 ms/tran, ETA 50:06:05) Transforms: 54016 M425000083, 0x4113fb6410f7f0d9, n = 24192K, CUDAPm1 v0.20 err = 0.24219 (25:11 real, 27.9686 ms/tran, ETA 49:41:43) Processing 9 - 12 of 480 relative primes. Inititalizing pass... done. transforms: 235, err = 0.20703, (6.93 real, 29.4773 ms/tran, ETA 49:42:03) Transforms: 54058 M425000083, 0x1f056d902f5168a7, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (25:12 real, 27.9701 ms/tran, ETA 49:17:12) Processing 13 - 16 of 480 relative primes. Inititalizing pass... done. transforms: 245, err = 0.20703, (7.19 real, 29.3381 ms/tran, ETA 49:17:15) Transforms: 54030 M425000083, 0x8bec2d947e1fb288, n = 24192K, CUDAPm1 v0.20 err = 0.24219 (25:11 real, 27.9701 ms/tran, ETA 48:52:07) Processing 17 - 20 of 480 relative primes. Inititalizing pass... done. 
transforms: 251, err = 0.20703, (7.33 real, 29.1833 ms/tran, ETA 48:52:11) Transforms: 54092 M425000083, 0x896d2f455b59709a, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (25:13 real, 27.9710 ms/tran, ETA 48:26:51) [/CODE]After it completed, for testing purposes, I copied an early stage two interim save file from the 1060 savefile folder, renamed it to checkfile type name, and found I also needed to have a stage one file there too or it would start over from scratch. Put them in the work folder for a gtx1050Ti run and made a corresponding worktodo entry. On the gtx1050Ti I got this; residues don't match, in stage 2, for the same nrp groups; 9-12 on 1050ti doesn't match 9-12 on the 1060, etc. [CODE]on gtx1050ti, from an early stage 2 save file from a GTX1060: Starting stage 2. Using b1 = 2840000, b2 = 34080000, d = 2310, e = 2, nrp = 4 Zeros: 1632727, Ones: 1613513, Pairs: 274647 Processing 9 - 12 of 480 relative primes. Inititalizing pass... done. transforms: 235, err = 0.20313, (9.42 real, 40.0848 ms/tran, ETA 49:43:53) Transforms: 54058 M425000083, 0xed32e096fa463f09, n = 24192K, CUDAPm1 v0.20 err = 0.25439 (38:33 real, 42.7948 ms/tran, ETA 57:59:19) Processing 13 - 16 of 480 relative primes. Inititalizing pass... done. transforms: 245, err = 0.21624, (10.63 real, 43.3778 ms/tran, ETA 58:01:16) Transforms: 54030 M425000083, 0xb1a4e401a42c9b89, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (38:33 real, 42.8108 ms/tran, ETA 61:49:54) Processing 17 - 20 of 480 relative primes. Inititalizing pass... done. transforms: 251, err = 0.20313, (10.92 real, 43.5090 ms/tran, ETA 61:50:48) Transforms: 54092 M425000083, 0xf7400d0435b23338, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (38:35 real, 42.8107 ms/tran, ETA 63:52:09) [/CODE]Exponent, b1, b2, d, e, nrp, zeros, ones, pairs, all the same. Run all through stage 1, gcd, and 1-8 nrp of stage 2 in common. 9-12 nrp residues and roundoffs differ, between the gtx1060 and gtx1050Ti. Roundoffs are close and at acceptable levels. 
13-16 nrp residues and roundoffs differ also. 17-20 nrp residues and roundoffs differ also. Different roundoffs if differences are minor don't concern me. Differing residues do. The runs are both CUDAPm1 V0.20 64-bit CUDA 5.5 for Windows; different host systems, same OS version, same model system, different gpu model. Maybe I got the wrong stage one file, not quite finished, and that threw it off somehow? Ideas? |
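Comparing interim stage 2 residues between two runs by eye gets tedious. Here is a rough Python sketch that pulls the residue for each relative-prime group out of two console logs and lists the groups that differ. It assumes the log layout shown in the excerpts above (a "Processing a - b of N relative primes." line followed by a capital-T "Transforms:" line carrying the residue). The function names are hypothetical. A pass interrupted partway would also be picked up, so check those by hand.

```python
import re

# One stage 2 pass: the "Processing a - b of N relative primes." line, then the
# first capital-T "Transforms:" line, which carries the interim residue.
PASS_RE = re.compile(
    r"Processing (\d+) - (\d+) of \d+ relative primes\."
    r".*?Transforms: \d+\s+M\d+, (0x[0-9a-fA-F]{16})",
    re.DOTALL,
)


def stage2_residues(log_text):
    """Map (first_rp, last_rp) -> residue string for each pass found in a log."""
    return {(int(a), int(b)): res for a, b, res in PASS_RE.findall(log_text)}


def mismatches(log_a, log_b):
    """Return {(first_rp, last_rp): (residue_a, residue_b)} where the logs disagree."""
    ra, rb = stage2_residues(log_a), stage2_residues(log_b)
    return {k: (ra[k], rb[k]) for k in sorted(ra.keys() & rb.keys()) if ra[k] != rb[k]}
```

Fed the two excerpts above, this would report groups 9-12, 13-16 and 17-20 as mismatched, and would not report 1-4 or 5-8, since the second log does not contain them.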
[QUOTE=kriesel;490384]CUDAPm1 gives 64-bit residues in stage 1 and stage 2. I thought they would reproduce. So look at this. First run, start to finish on a GTX1060, gave in part,
[CODE]Iteration 4050000 M425000083, 0x45bcabd2d9a7a6f7, n = 24192K, CUDAPm1 v0.20 err = 0.26563 (44:19 real, 53.1666 ms/iter, ETA 42:48) M425000083, 0x03b1ecbe222d57ae, n = 24192K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 60:43:07 Starting stage 1 gcd. M425000083 Stage 1 found no factor (P-1, B1=2840000, B2=34080000, e=0, n=24192K CUDAPm1 v0.20) Starting stage 2. Using b1 = 2840000, b2 = 34080000, d = 2310, e = 2, nrp = 4 Zeros: 1632727, Ones: 1613513, Pairs: 274647 Processing 1 - 4 of 480 relative primes. Inititalizing pass... done. transforms: 202, err = 0.25000, (5.29 real, 26.1963 ms/tran, ETA NA) Transforms: 53864 M425000083, 0x240cabc495e881a9, n = 24192K, CUDAPm1 v0.20 err = 0.25000 (25:05 real, 27.9513 ms/tran, ETA 50:04:27) Processing 5 - 8 of 480 relative primes. Inititalizing pass... done. transforms: 233, err = 0.21250, (6.80 real, 29.1806 ms/tran, ETA 50:06:05) Transforms: 54016 M425000083, 0x4113fb6410f7f0d9, n = 24192K, CUDAPm1 v0.20 err = 0.24219 (25:11 real, 27.9686 ms/tran, ETA 49:41:43) Processing 9 - 12 of 480 relative primes. Inititalizing pass... done. transforms: 235, err = 0.20703, (6.93 real, 29.4773 ms/tran, ETA 49:42:03) Transforms: 54058 M425000083, 0x1f056d902f5168a7, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (25:12 real, 27.9701 ms/tran, ETA 49:17:12) Processing 13 - 16 of 480 relative primes. Inititalizing pass... done. transforms: 245, err = 0.20703, (7.19 real, 29.3381 ms/tran, ETA 49:17:15) Transforms: 54030 M425000083, 0x8bec2d947e1fb288, n = 24192K, CUDAPm1 v0.20 err = 0.24219 (25:11 real, 27.9701 ms/tran, ETA 48:52:07) Processing 17 - 20 of 480 relative primes. Inititalizing pass... done. 
transforms: 251, err = 0.20703, (7.33 real, 29.1833 ms/tran, ETA 48:52:11) Transforms: 54092 M425000083, 0x896d2f455b59709a, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (25:13 real, 27.9710 ms/tran, ETA 48:26:51) [/CODE]After it completed, for testing purposes, I copied an early stage two interim save file from the 1060 savefile folder, renamed it to checkfile type name, and found I also needed to have a stage one file there too or it would start over from scratch. Put them in the work folder for a gtx1050Ti run and made a corresponding worktodo entry. On the gtx1050Ti I got this; residues don't match, in stage 2, for the same nrp groups; 9-12 on 1050ti doesn't match 9-12 on the 1060, etc. [CODE]on gtx1050ti, from an early stage 2 save file from a GTX1060: Starting stage 2. Using b1 = 2840000, b2 = 34080000, d = 2310, e = 2, nrp = 4 Zeros: 1632727, Ones: 1613513, Pairs: 274647 Processing 9 - 12 of 480 relative primes. Inititalizing pass... done. transforms: 235, err = 0.20313, (9.42 real, 40.0848 ms/tran, ETA 49:43:53) Transforms: 54058 M425000083, 0xed32e096fa463f09, n = 24192K, CUDAPm1 v0.20 err = 0.25439 (38:33 real, 42.7948 ms/tran, ETA 57:59:19) Processing 13 - 16 of 480 relative primes. Inititalizing pass... done. transforms: 245, err = 0.21624, (10.63 real, 43.3778 ms/tran, ETA 58:01:16) Transforms: 54030 M425000083, 0xb1a4e401a42c9b89, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (38:33 real, 42.8108 ms/tran, ETA 61:49:54) Processing 17 - 20 of 480 relative primes. Inititalizing pass... done. transforms: 251, err = 0.20313, (10.92 real, 43.5090 ms/tran, ETA 61:50:48) Transforms: 54092 M425000083, 0xf7400d0435b23338, n = 24192K, CUDAPm1 v0.20 err = 0.23438 (38:35 real, 42.8107 ms/tran, ETA 63:52:09) [/CODE]Exponent, b1, b2, d, e, nrp, zeros, ones, pairs, all the same. Run all through stage 1, gcd, and 1-8 nrp of stage 2 in common. 9-12 nrp residues and roundoffs differ, between the gtx1060 and gtx1050Ti. Roundoffs are close and at acceptable levels. 
13-16 nrp residues and roundoffs differ also. 17-20 nrp residues and roundoffs differ also. Different roundoffs if differences are minor don't concern me. Differing residues do. The runs are both CUDAPm1 V0.20 64-bit CUDA 5.5 for Windows; different host systems, same OS version, same model system, different gpu model. Maybe I got the wrong stage one file, not quite finished, and that threw it off somehow? Ideas?[/QUOTE]
IIRC, CUDAPm1 used the available memory and the type of GPU to define the optimal magic numbers for P-1 (b1, b2, d, e, nrp). I didn't look at the code, but I guesstimate that the stage 2 residue to start from was not compatible with, or not correctly reshaped for, the GTX 1050Ti. |
[QUOTE=ET_;490434]IIRC, CUDAPm1 used the available memory and the type of GPU to define the optimal magic numbers for P-1 (b1, b2, d, e, nrp). I didn't look at the code, but I guesstimate that the stage2 residue to start from was not compatible, or not correctly reshaped for the GTX 1050Ti.[/QUOTE]
I had thought that it would be safe to go from a small-memory gpu to a larger-memory gpu: there is more than adequate room to run the bounds, d, e, nrp combination that fit on the more memory-restricted gpu. A gpu with more memory would, from a fresh start, probably select bigger bounds to take advantage of the extra memory, but I think that's an optimization, not a requirement.

The other way around, trying to run something started on a larger-memory gpu after transplanting it to a smaller-memory gpu, is likely to fail in stage 2 or perhaps even in stage 1, due to what you describe. The author said so in [url]http://www.mersenneforum.org/showpost.php?p=359086&postcount=421[/url]

I've also found cases where, start to finish on one gpu, the program selects stage 2 bounds that have no hope of running to successful completion, requiring gigabytes more memory than is available on the gpu on which those bounds were selected. But neither of those corresponds to the case I posted about here.

Looking at read_checkpoint_packed and other routines, in order to write a script to export CUDAPm1 savefiles to a neutral exchange format, I did not see anything other than these parameters (nothing explicit about how many ROPs or shaders or whatever the gpu had or must have, nor how much memory). The residue is a word stream, a pretty simple shape. I had the impression the entire save file is 4-byte unsigned integers (with times in seconds). That checked out with the total savefile size, to the byte as I recall.
[CODE]# fread (x_packed, 1, sizeof (unsigned) * (end + 25) , fPtr)
# x_packed[end] = q;
# x_packed[end + 1] = 0; // n
# x_packed[end + 2] = 1; // iteration number
# x_packed[end + 3] = 0; // stage
# x_packed[end + 4] = 0; // accumulated time
# x_packed[end + 5] = 0; // b1
# // 6-9 reserved for extending b1
# // 10-24 reserved for stage 2
# x_packed[end + 10] = b2;
# x_packed[end + 11] = d;
# x_packed[end + 12] = e;
# x_packed[end + 13] = nrp;
# x_packed[end + 14] = 0; // m = number of relative primes already finished
# x_packed[end + 15] = 0; // k = how far done with current crop of relative primes
# x_packed[end + 16] = 0; // t = where to find next relative prime in the bit array
# x_packed[end + 17] = 0; // extra initialization transforms from starting in the middle of a pass
# x_packed[end + 18] = itran_done;
# x_packed[end + 19] = ptran_done + num_tran;
# x_packed[end + 20] = itime;
# x_packed[end + 21] = ptime;
#22-24?[/CODE]The words 0 to end-1 are x_packed. The rest is scalars which the export program I created claims are as follows. Note, these might be from an earlier file than the one I used.
[CODE]Format Mersenne Neutral Exchange d0.4
FileOrigin "CUDAPm1export for Windows" "V0.1 2018-06-23" c425000083s2. 2018 Jun 23 20:52:21 UTC
Type P-1 stage 2
Exponent 425000083
Iteration 4098308
N 24772608
AccumulatedTime 216019
B1 2840000
Reserved6 0
Reserved7 0
Reserved8 0
Reserved9 0
B2 34080000
D 2310
E 2
NRP 4
M 8
K 1229
T 8
Midpasstransforms 0
Itran_done 435
PtrandonePlusNumtran 107880
Itime 12
Ptime 3016
Reserved22 0
Reserved23 0
Reserved24 0
DataFormat binary bytes
CRC32 0x07291d0b
DataBinaryByteCount 53125012
EndOfHeader[/CODE]I see nothing gpu-specific there; no rops or shaders counts, not even choices of thread counts for the 3 phases of the computation. |
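For anyone who wants to peek at the scalars without the export program, the layout described above (residue words, then 25 trailing 32-bit scalars) can be read with a few lines of Python. This is only a sketch under assumptions: the slot names come from the comment block quoted above, little-endian byte order is assumed because these runs were on x86 Windows hosts, and `read_trailer` is a made-up name, not a CUDAPm1 routine.

```python
import struct

# Trailer slot names, taken from the comment block quoted above. Offsets are
# relative to `end`, the number of packed residue words. Slots 6-9 and 22-24
# are reserved and are left out here.
TRAILER_FIELDS = {
    0: "q", 1: "n", 2: "iteration", 3: "stage", 4: "accumulated_time", 5: "b1",
    10: "b2", 11: "d", 12: "e", 13: "nrp", 14: "m", 15: "k", 16: "t",
    17: "midpass_transforms", 18: "itran_done", 19: "ptran_done_plus_num_tran",
    20: "itime", 21: "ptime",
}
TRAILER_WORDS = 25


def read_trailer(data, byteorder="<"):
    """Split a save-file image into (residue_words, trailer_dict).

    Assumes the whole file is 32-bit unsigned words and the last 25 are scalars.
    The little-endian default is an assumption, not something stated in the post.
    """
    if len(data) % 4 or len(data) < 4 * TRAILER_WORDS:
        raise ValueError("size is not a whole number of words, or too short")
    n_words = len(data) // 4
    words = struct.unpack(f"{byteorder}{n_words}I", data)
    end = n_words - TRAILER_WORDS
    trailer = {name: words[end + off] for off, name in TRAILER_FIELDS.items()}
    return words[:end], trailer
```

Typical use would be `residue, info = read_trailer(open(path, "rb").read())` with `path` pointing at a copy of a save file. As a size check, the DataBinaryByteCount of 53125012 above is 13281253 words, which equals 425000083 / 32 rounded up, so the residue appears to be packed 32 bits per word.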
prime95 P-1 bug since fixed. Is it present in CUDAPm1?
[URL]http://www.mersenneforum.org/showthread.php?t=22776[/URL] shows an issue with prime95 P-1 stage 1 computations, since fixed. Looking at old prime95 source code shows it was present at least back to prime95 v28.5 source, dated 2014, & perhaps earlier, though the code in v27.7, dated 2012, is different. This does not rule out it being present in prime95 P-1 at the time CUDAPm1 was developed, in 2013 (February to November). Since CUDAPm1 development relied on reference to prime95's code and followed it, and CUDAPm1 development and maintenance ended well before the issue was found and fixed in prime95, the issue might also be present in the currently available versions of CUDAPm1.
|
B2 reported may not match B2 used
In CUDAPm1 v0.20, if a run is continued on a gpu with more memory than it was started on, new bounds are calculated and then the program indicates it will continue with the bounds in the save file. After the run is completed, the result record contains the B2 found from the selection calculation, not the value from the save file that the program indicates was used. Example log excerpts follow.
Using threads: norm1 512, mult 256, norm2 128.
Stage 2 checkpoint found.
Using up to 3780M GPU memory.
Selected B1=3100000, B2=[B]62000000[/B], 3.18% chance of finding a factor
Using B1 = 2840000 from savefile.
Continuing stage 2 from a partial result of M425000083 fft length = 24192K
Starting stage 2.
[B]Using b1 = 2840000, b2 = 34080000[/B], d = 2310, e = 2, nrp = 4
Zeros: 1632727, Ones: 1613513, Pairs: 274647
Processing 9 - 12 of 480 relative primes.
Inititalizing pass... done. transforms: 235, err = 0.20313, (9.42 real, 40.0848 ms/tran, ETA 49:43:53)
Transforms: 54058 M425000083, 0xed32e096fa463f09, n = 24192K, CUDAPm1 v0.20 err = 0.25439 (38:33 real, 42.7948 ms/tran, ETA 57:59:19)
...
Processing 477 - 480 of 480 relative primes.
Inititalizing pass... done. transforms: 299, err = 0.21094, (12.89 real, 43.1271 ms/tran, ETA 39:01)
Transforms: 53916 M425000083, 0x7efe91810f60cfa3, n = 24192K, CUDAPm1 v0.20 err = 0.25000 (38:28 real, 42.8098 ms/tran, ETA 0:46)
Stage 2 complete, 6506485 transforms, estimated total time = 76:55:59
Starting stage 2 gcd.
M425000083 Stage 2 found no factor (P-1, B1=2840000, B2=[B]62000000[/B], e=2, n=24192K CUDAPm1 v0.20) |
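Until this is fixed, a log can be checked for the mismatch before a result is reported. The following Python sketch compares the bounds on the "Using b1 = ..., b2 = ..." line with those on the "Stage 2 found no factor" result line. It assumes the log wording shown in the excerpt above and only handles the no-factor result line. `check_bounds` is a hypothetical helper name.

```python
import re

USED_RE = re.compile(r"Using b1 = (\d+), b2 = (\d+)")
RESULT_RE = re.compile(r"Stage 2 found .*?\(P-1, B1=(\d+), B2=(\d+)")


def check_bounds(log_text):
    """Return ((b1_used, b2_used), (b1_reported, b2_reported)), or None if a line is missing.

    Forum bold markup ([B]...[/B]) is stripped first so pasted excerpts also parse.
    """
    text = re.sub(r"\[/?B\]", "", log_text)
    used = USED_RE.search(text)
    reported = RESULT_RE.search(text)
    if not (used and reported):
        return None
    return tuple(map(int, used.groups())), tuple(map(int, reported.groups()))


if __name__ == "__main__":
    sample = (
        "Using b1 = 2840000, b2 = 34080000, d = 2310, e = 2, nrp = 4\n"
        "M425000083 Stage 2 found no factor (P-1, B1=2840000, B2=62000000, e=2, "
        "n=24192K CUDAPm1 v0.20)\n"
    )
    used, reported = check_bounds(sample)
    if used != reported:
        print(f"bounds used {used} differ from bounds reported {reported}")
```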
new to me GPU, new CUDAPm1 behavior seen
Based on what 2 GB Quadro 4000 and 3GB GTX 1060 can run, I thought a 2.5GB Quadro 5000 (which is CC 2.0) would be able to run exponents up to 300M, perhaps higher, in CUDAPm1 v0.20 x64 CUDA 5.5 20130923 version also. It passed a memory test and correctly found the factor for M50001781.
But it failed to run stage 2 on
[CODE]M87771547, 0xf6c7342f2bab37fa, n = 5040K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 3:35:59
Starting stage 1 gcd.
M87771547 Stage 1 found no factor (P-1, B1=755000, B2=17365000, e=0, n=5040K CUDAPm1 v0.20)
Starting stage 2.
Using b1 = 755000, b2 = 17365000, d = 2310, e = 2, nrp = 48
Zeros: 785147, Ones: 880453, Pairs: 172236
Processing 1 - 48 of 480 relative primes.
Inititalizing pass... )
Quitting, estimated time spent = 0:03
[/CODE]With repeated restarts, it repeatably quit after a few seconds of stage 2 with no reason given. The same thing occurs on
[CODE]M200000491, 0x8ef21dc89a0b7d8c, n = 11250K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 16:20:44
Starting stage 1 gcd.
M200000491 Stage 1 found no factor (P-1, B1=1540000, B2=32340000, e=0, n=11250K CUDAPm1 v0.20)
Starting stage 2.
Using b1 = 1540000, b2 = 32340000, d = 2310, e = 2, nrp = 16
Zeros: 1515937, Ones: 1585823, Pairs: 290684
Processing 1 - 16 of 480 relative primes.
Inititalizing pass... )
Quitting, estimated time spent = 0:03[/CODE]Again, repeated restarts produce "Quitting" after a few seconds. I'm trying a few other exponents. But for now, CUDAPm1 on this model GPU appears incapable of running stage 2 P-1 at exponents of current or future interest (p>88M), for some unknown reason.

The test exponent 50001781 ran with threads norm1 512, mult 256, norm2 128 and fft length 2688K, neither of which appears in the fft file or threads file. The applicable threads file entries are:
(88M) 5040 64 64 32 11.5743
(200M) 11250 128 64 1024 26.5149

After a retry with 5040 128 64 32 in the threads file, per [URL="http://www.mersenneforum.org/showpost.php?p=359096&postcount=424"]http://www.mersenneforum.org/showpost.php?p=359096&postcount=424[/URL], the M88 run progresses. Any ideas what to do to get M200M running stage 2 successfully? Are there any CUDA55 or higher executables available with the 20131118 or later code fixes, for Windows? |
new behavior: 16 stage 2 residue values taking turns
anomalous Quadro 5000 m350000071 cudapm1 V0.20 20130923 CUDA 5.5 on Windows, interim stage 2 residues:
After a normal-looking stage 1, the 120 residues output in stage 2 at NRP=4 repeat over a very limited set of 16 values, listed below in ascending order, which look suspicious in their regularity. (I'm used to runs with pseudorandom-looking stage 1 and stage 2 residues.) This exponent/gpu combination had seemingly well-behaved stage 1 residues but peculiarities throughout stage 2.
[CODE]_____8___4___2___1 difference appearing in the respective bit positions
0xfff7fffbfffdfffe
0xfff7fffbfffdffff
0xfff7fffbfffffffe
0xfff7fffbffffffff
0xfff7fffffffdfffe
0xfff7fffffffdffff
0xfff7fffffffffffe
0xfff7ffffffffffff
0xfffffffbfffdfffe
0xfffffffbfffdffff
0xfffffffbfffffffe
0xfffffffbffffffff
0xfffffffffffdfffe
0xfffffffffffdffff
0xfffffffffffffffe
0xffffffffffffffff[/CODE]End of stage 1 and beginning of stage 2 looked normal. Stage 2 was using 1863MB of 2.5GB on the gpu. At stage 2 wrapup/gcd, it dropped to 746MB.
[CODE]Iteration 3650000 M350000071, 0xfa26579b34919a34, n = 20412K, CUDAPm1 v0.20 err = 0.12109 (20:01 real, 48.0195 ms/iter, ETA 22:37)
Iteration 3675000 M350000071, 0x3ca8420d52bd5a27, n = 20412K, CUDAPm1 v0.20 err = 0.11719 (20:01 real, 48.0155 ms/iter, ETA 2:37)
M350000071, 0x509e08b93355b407, n = 20412K, CUDAPm1 v0.20
Stage 1 complete, estimated total time = 49:05:07
Starting stage 1 gcd.
M350000071 Stage 1 found no factor (P-1, B1=2550000, B2=31875000, e=0, n=20412K CUDAPm1 v0.20)
Starting stage 2.
Using b1 = 2550000, b2 = 31875000, d = 2310, e = 2, nrp = 4
Zeros: 1527348, Ones: 1520172, Pairs: 260423
Processing 1 - 4 of 480 relative primes.
Inititalizing pass... done. transforms: 198, err = 0.11328, (4.77 real, 24.0679 ms/tran, ETA NA)
Transforms: 50660 M350000071, 0xfffffffbfffffffe, n = 20412K, CUDAPm1 v0.20 err = 0.11328 (21:53 real, 25.9248 ms/tran, ETA 43:39:45)
Processing 5 - 8 of 480 relative primes.
Inititalizing pass... done. transforms: 229, err = 0.10547, (5.98 real, 26.1210 ms/tran, ETA 43:42:27)
Transforms: 50812 M350000071, 0xfff7fffbfffdffff, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:57 real, 25.9243 ms/tran, ETA 43:19:29)
Processing 9 - 12 of 480 relative primes.
Inititalizing pass... done. transforms: 231, err = 0.10547, (5.99 real, 25.9324 ms/tran, ETA 43:20:31)
Transforms: 50810 M350000071, 0xfff7fffffffdffff, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:57 real, 25.9239 ms/tran, ETA 42:57:55)
Processing 13 - 16 of 480 relative primes.
Inititalizing pass... done. transforms: 241, err = 0.10547, (6.24 real, 25.8988 ms/tran, ETA 42:58:31)
Transforms: 50762 M350000071, 0xfff7fffbfffffffe, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:56 real, 25.9241 ms/tran, ETA 42:35:58)
Processing 17 - 20 of 480 relative primes.
Inititalizing pass... done. transforms: 247, err = 0.10547, (6.40 real, 25.9017 ms/tran, ETA 42:36:30)
Transforms: 50814 M350000071, 0xfffffffbfffffffe, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:57 real, 25.9239 ms/tran, ETA 42:14:22)
[/CODE]Etc. It concluded with a result line reporting no factor found. |
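The regularity in the 16 anomalous residues listed above can be stated precisely: each one is all-ones with some subset of four bits cleared, one bit in each 16-bit group (bit positions 0, 17, 34 and 51, which is what the 1, 2, 4, 8 header row points at). A short Python check of that reading follows. The bit positions are my interpretation of the list, and the names are illustrative only.

```python
from itertools import combinations

ALL_ONES = 0xFFFFFFFFFFFFFFFF
CLEARED_BITS = (0, 17, 34, 51)  # one candidate bit in each 16-bit group

# The 16 residue values as listed in the post above.
OBSERVED = {
    0xfff7fffbfffdfffe, 0xfff7fffbfffdffff, 0xfff7fffbfffffffe, 0xfff7fffbffffffff,
    0xfff7fffffffdfffe, 0xfff7fffffffdffff, 0xfff7fffffffffffe, 0xfff7ffffffffffff,
    0xfffffffbfffdfffe, 0xfffffffbfffdffff, 0xfffffffbfffffffe, 0xfffffffbffffffff,
    0xfffffffffffdfffe, 0xfffffffffffdffff, 0xfffffffffffffffe, 0xffffffffffffffff,
}


def subset_values(bits=CLEARED_BITS):
    """All 64-bit values obtained by clearing any subset of `bits` from all-ones."""
    values = set()
    for r in range(len(bits) + 1):
        for combo in combinations(bits, r):
            mask = sum(1 << b for b in combo)
            values.add(ALL_ONES & ~mask)
    return values


if __name__ == "__main__":
    print("observed residues are exactly the 2^4 subsets:", subset_values() == OBSERVED)
```

A residue stream that only ever takes these values is nothing like a pseudorandom 64-bit residue, which supports treating the whole stage 2 on that gpu as suspect.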
[QUOTE=kriesel;491191]
Same thing occurs on [CODE]M200000491, 0x8ef21dc89a0b7d8c, n = 11250K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 16:20:44 Starting stage 1 gcd. M200000491 Stage 1 found no factor (P-1, B1=1540000, B2=32340000, e=0, n=11250K CUDAPm1 v0.20) Starting stage 2. Using b1 = 1540000, b2 = 32340000, d = 2310, e = 2, nrp = 16 Zeros: 1515937, Ones: 1585823, Pairs: 290684 Processing 1 - 16 of 480 relative primes. Inititalizing pass... ) Quitting, estimated time spent = 0:03[/CODE] (200M) 11250 128 64 1024 26.5149 Any ideas what to do to get M200M running stage 2 successfully? [/QUOTE] Doubling norm1 for the 11250k fft length worked for the 200M exponent |
[QUOTE=kriesel;492089]anomalous Quadro 5000 m350000071 cudapm1 V0.20 20130923 CUDA 5.5 on Windows, interim stage 2 residues:
End of stage 1 and beginning of stage 2 looked normal. Stage 2 was using 1863MB of 2.5GB on the gpu. At stage 2 wrapup/gcd, it dropped to 746MB. [CODE] Iteration 3650000 M350000071, 0xfa26579b34919a34, n = 20412K, CUDAPm1 v0.20 err = 0.12109 (20:01 real, 48.0195 ms/iter, ETA 22:37) Iteration 3675000 M350000071, 0x3ca8420d52bd5a27, n = 20412K, CUDAPm1 v0.20 err = 0.11719 (20:01 real, 48.0155 ms/iter, ETA 2:37) M350000071, 0x509e08b93355b407, n = 20412K, CUDAPm1 v0.20 Stage 1 complete, estimated total time = 49:05:07 Starting stage 1 gcd. M350000071 Stage 1 found no factor (P-1, B1=2550000, B2=31875000, e=0, n=20412K CUDAPm1 v0.20) Starting stage 2. Using b1 = 2550000, b2 = 31875000, d = 2310, e = 2, nrp = 4 Zeros: 1527348, Ones: 1520172, Pairs: 260423 Processing 1 - 4 of 480 relative primes. Inititalizing pass... done. transforms: 198, err = 0.11328, (4.77 real, 24.0679 ms/tran, ETA NA) Transforms: 50660 M350000071, 0xfffffffbfffffffe, n = 20412K, CUDAPm1 v0.20 err = 0.11328 (21:53 real, 25.9248 ms/tran, ETA 43:39:45) Processing 5 - 8 of 480 relative primes. Inititalizing pass... done. transforms: 229, err = 0.10547, (5.98 real, 26.1210 ms/tran, ETA 43:42:27) Transforms: 50812 M350000071, 0xfff7fffbfffdffff, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:57 real, 25.9243 ms/tran, ETA 43:19:29) Processing 9 - 12 of 480 relative primes. Inititalizing pass... done. transforms: 231, err = 0.10547, (5.99 real, 25.9324 ms/tran, ETA 43:20:31) Transforms: 50810 M350000071, 0xfff7fffffffdffff, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:57 real, 25.9239 ms/tran, ETA 42:57:55) Processing 13 - 16 of 480 relative primes. Inititalizing pass... done. transforms: 241, err = 0.10547, (6.24 real, 25.8988 ms/tran, ETA 42:58:31) Transforms: 50762 M350000071, 0xfff7fffbfffffffe, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:56 real, 25.9241 ms/tran, ETA 42:35:58) Processing 17 - 20 of 480 relative primes. Inititalizing pass... done. 
transforms: 247, err = 0.10547, (6.40 real, 25.9017 ms/tran, ETA 42:36:30) Transforms: 50814 M350000071, 0xfffffffbfffffffe, n = 20412K, CUDAPm1 v0.20 err = 0.10547 (21:57 real, 25.9239 ms/tran, ETA 42:14:22) [/CODE]Etc. It concluded with a result line no factor found.[/QUOTE] As a test, I repeated part of stage 2 from very early, and got the following, on a GTX1050Ti [CODE]Continuing stage 2 from a partial result of M350000071 fft length = 20412K Starting stage 2. Using b1 = 2550000, b2 = 31875000, d = 2310, e = 2, nrp = 4 Zeros: 1527348, Ones: 1520172, Pairs: 260423 Processing 5 - 8 of 480 relative primes. Inititalizing pass... done. transforms: 229, err = 0.10156, (7.87 real, 34.3738 ms/tran, ETA 43:43:20) Transforms: 50812 M350000071, 0x45dfef64c039aeff, n = 20412K, CUDAPm1 v0.20 err = 0.11133 (31:04 real, 36.6751 ms/tran, ETA 52:17:38) Processing 9 - 12 of 480 relative primes. Inititalizing pass... done. transforms: 231, err = 0.10156, (8.50 real, 36.7856 ms/tran, ETA 52:20:27) SIGINT caught, writing checkpoint. Transforms: 3300 M350000071, 0x8eb67bcffa00c096, n = 20412K, CUDAPm1 v0.20 err = 0.10938 (2:02 real, 36.8244 ms/tran, ETA 52:35:45) Quitting, estimated time spent = 55:18 [/CODE]So I concluded probably every bit of stage 2 on the Quadro 5000 was wrong, and resumed it from directly after stage 1 gcd on the GTX1050Ti. That caused it to select different parameters because of the greater available gpu memory. [CODE]Using b1 = 2550000, b2 = 62188750, d = 4620, e = 2, nrp = 14 Zeros: 3004088, Ones: 2961352, Pairs: 536666 Processing 1 - 14 of 960 relative primes.[/CODE] |
comments in worktodo for CUDAPm1!
While looking for something else, I stumbled across this:
The source of parse.c indicates that #, \\, or / are comment characters marking the rest of a line as a comment. I've confirmed by test that # and \\ work; / did not work in my test, which placed each marker at the beginning of a record. The line numbers in the warning messages showed which markers did not work. |
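As a rough illustration of the comment behavior described in the post above, here is a minimal sketch of stripping comments from worktodo-style lines. It is not the logic in parse.c. The marker list holds only the two markers reported as working, the double backslash is taken literally as two characters, and the entry names are placeholders, not real worktodo syntax.

```python
# Assumed markers, from the post: '#' and a literal double backslash.
COMMENT_MARKERS = ("#", "\\\\")


def strip_comment(line, markers=COMMENT_MARKERS):
    """Drop everything from the earliest comment marker to the end of the line."""
    cut = len(line)
    for marker in markers:
        pos = line.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    return line[:cut].rstrip()


def active_entries(lines):
    """Return the non-empty entries left once comments are removed."""
    stripped = (strip_comment(line) for line in lines)
    return [entry for entry in stripped if entry]


# Placeholder lines; EXAMPLE_ENTRY_n stands in for a real work entry.
sample = [
    "# whole-line comment",
    r"\\ another whole-line comment",
    "EXAMPLE_ENTRY_1  # trailing comment",
    "EXAMPLE_ENTRY_2",
]
print(active_entries(sample))
```

The post's test only covered markers at the beginning of a record, so the trailing-comment case here rests on the "rest of a line" wording and has not been confirmed.
|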
How to build CUDAPm1 for current CUDA levels, development tools available?
Yes, there's a makefile, from 2013, for Linux. But my experience is that makefiles that work in Linux don't work in Windows, even in the msys2/mingw64 environment, either as is or with what look like merited edits. Also, on Windows NVIDIA's compiler nvcc wants Visual Studio, not g++. And a lot has changed in the 5 years since Windows executables were posted.
Presumably various paths would need to be updated for the different OS, CUDA toolkit version, C++ compiler version, etc. CUFLAGS would also need updating, to add the newly available compute levels and probably to drop old ones no longer supported by nvcc. The CUDAPm1 makefile also contains: L = -lcufft -lcudart -lm -lgmp. Presumably gmp is that of [URL]https://gmplib.org/[/URL], which I've installed. And is the m this? [URL]https://stackoverflow.com/questions/1033898/why-do-you-have-to-link-the-math-library-in-c[/URL] |
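To make the CUFLAGS point above concrete, here is a small sketch that builds nvcc -gencode options from a list of compute capabilities. The -gencode arch=compute_XX,code=sm_XX form is standard nvcc syntax. The capability list is a hypothetical choice, and whether the CUDAPm1 makefile puts these options in CUFLAGS this way is an assumption.

```python
def gencode_flags(capabilities):
    """Build one -gencode option per compute capability, e.g. '6.1' -> sm_61."""
    flags = []
    for cap in capabilities:
        major, minor = cap.split(".")
        arch = f"{major}{minor}"
        flags.append(f"-gencode arch=compute_{arch},code=sm_{arch}")
    return " ".join(flags)


# Hypothetical target list; the right set depends on the installed toolkit.
print("CUFLAGS +=", gencode_flags(["3.5", "5.0", "6.1"]))
```

Generating the flags from one list keeps adding a new compute level, or dropping one that the installed nvcc no longer accepts, to a one-line change.
|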
[QUOTE=kriesel;494224]Yes, there's a makefile, from 2013, and linux. But my experience is makefiles that work in linux don't work in Windows, even in the msys2/mingw64 environment, as is or with what look like merited edits. Also NVIDIA's compiler nvcc wants Visual Studio not g++. And a lot has changed in the 5 years since there were posted Windows executables.
Presumably various paths would need to be updated for the different OS, different CUDA toolkit version, different c++ compiler version etc. Also CUFLAGS would need to be updated for new available compute levels to be added, and probably for old ones no longer supported by nvcc to be dropped. And the CUDAPm1 makefile contains: L = -lcufft -lcudart -lm -lgmp Presumably gmp is that of [URL]https://gmplib.org/[/URL] which I've installed. And the m is this? [URL]https://stackoverflow.com/questions/1033898/why-do-you-have-to-link-the-math-library-in-c[/URL][/QUOTE] Is there a reason why Visual Studio is not an option? |
[QUOTE=henryzz;494292]Is there a reason why Visual Studio is not an option?[/QUOTE]
It may be. I'm reluctant to spend a lot. I'm preparing to make a first attempt to build CUDAPm1. It seemed a useful exercise to identify all the code that needs to be gathered for compile/link, the necessary tools, and some understanding (which I'm still working on). Visual Studio might be part of the process. I have (free) VS 2017 Community Edition installed. Nvcc v9.2 is compatible with VS 2017 but nvcc 8.0 and earlier are not. From what I've read, nvcc preprocesses the CUDA-specific stuff and then uses VS's cl.exe to compile and link. A specific version of nvcc is limited in which versions of VS it will work with, and in which compute capability levels it supports. I have a lot of old GPUs of 2.x compute capability. Nvcc 9.x (the only version compatible with VS 2017) doesn't support compute capability 2.x or lower. VS availability for free is limited to only the latest flavor (2017, or 15.x currently). A VS Pro license for ~$500 also gets only the latest flavor. Getting access to earlier versions, such as VS2012, which is compatible with many versions of the CUDA toolkit including as far back as v5.5, seems to require a Pro subscription at $1200 the first year and $800 annually thereafter. [URL]https://visualstudio.microsoft.com/vs/pricing/[/URL]. Or there are alternatives like used resold software on eBay. Or maybe I've misunderstood something while climbing this particular learning curve. If so, please share data/corrections. |
ah42 fork
Hi,
I stumbled on this a while back, noted it, forgot about it, and recently had another look. Has anyone compiled and run this? If so, how did it compare to the sourceforge version, which is what's mirrored at mersenne.ca? [URL]https://github.com/ah42/cuda-p1[/URL] |
[QUOTE=kriesel;494300]It may be. I'm reluctant to spend a lot. I'm preparing to make a first attempt to build CUDAPm1. It seemed a useful exercise to identify what all code needs to be gathered for compile/link. And the necessary tools. And some understanding (which I'm still working on).
Visual Studio might be part of the process. I have (free) VS 2017 Community Edition installed. Nvcc v9.2 is compatible with VS 2017 but nvcc 8.0 and earlier are not. From what I've read, nvcc preprocesses the CUDA specific stuff and then uses VS's cl.exe to compile and link. A specific version of nvcc is limited in what versions of VS it will work with, and in what compute capability levels are supported. I have a lot of old gpus, 2.x compute capability. Nvcc 9.x (only version compatible with VS 2017) doesn't support CUDA level 2.x or lower. VS availability for free is limited to only the latest flavor (2017, or 15.x currently). A VS Pro license for ~$500 also gets only the latest flavor. To get access to earlier versions, such as VS2012 that is compatible with many versions of CUDA toolkit, including as far back as v5.5, seems to require a Pro subscription $1200 first year, $800 annually thereafter. [URL]https://visualstudio.microsoft.com/vs/pricing/[/URL]. Or there are alternatives like used resold software on eBay. Or maybe I've misunderstood something while climbing this particular learning curve. If so, please share data/corrections.[/QUOTE] The old installers can be obtained through [url]https://visualstudio.microsoft.com/vs/older-downloads/[/url] There are a few hoops: as part of this you have to subscribe, for free, to the developer essentials package, which wasn't obvious to me initially. |
[QUOTE=henryzz;494325]The old installers can be got through [URL]https://visualstudio.microsoft.com/vs/older-downloads/[/URL]
There are a few hoops as part of this you have to subscribe to the developer essentials package for free which wasn't obvious initially for me.[/QUOTE] Thanks for the tip. The earliest version available there is VS 2013. (I'd hoped to be able to get back to VS2010.) After multiple failed download attempts via my crappy slow costly ISP (768k/128k DSL; a 14 hour download projected for the 4.8GB if things were working well; in practice 1.5GB max per attempt, over multiple days), the utility contractor working in my neighborhood to install fiber put an end to it by cutting the neighborhood's telco voice/DSL cable. Driving to another location got the 4.8GB ISO downloaded on the first try in under 3 hours. With such slow and unreliable internet, I tend to go for a full install image that can be put on a local file server: download once, and reuse locally. Crappy-slow-costly-ISP was contacted within 10 minutes of the start of the outage, took an hour of phone time to generate a trouble ticket, projected that repair would begin after a week of no service, and claimed they would process a bill credit. The service cut was on the first day of the billing cycle. I've already received a bill for a full month's service not received or receivable, beginning the day the cable was cut, and the bill did not include the promised credit for the outage. The DSL in this neighborhood runs from the nearest village, miles away, which prevents high speed, instead of running from the nearest hut, a half mile away, which could probably provide 25 Mbps. |
I've had ISP troubles like that in the past (it once took an ISP 6 weeks of no internet before they fixed whatever was broken), so I can sympathize. I'm happy to be on 250Mbps service now (4.8GB ISO should complete in under 3 mins). I hope your fiber install is completed soon.
|
[QUOTE=James Heinrich;494889]I've had ISP troubles like that in the past (it once took an ISP 6 weeks of no internet before they fixed whatever was broken), so I can sympathize. I'm happy to be on 250Mbps service now (4.8GB ISO should complete in under 3 mins). I hope your fiber install is completed soon.[/QUOTE]Ouch, 6 weeks, that would be trouble. What cut my cable was work to bring service being marketed as a choice of 300M/300M, or 1000M/400M fiber service.
I turned road warrior for a week and now have a spreadsheet documenting open hours and speed tests for the nearest free WiFi. It was beginning to get highly inconvenient as various Prime95 workers ran out of work and completed work was spooled up. There have been times when the ping times to the nearest university were routinely 1 to 4 seconds or longer (normal is under 70 msec). There have been times when the ISP's DNS server was quite hosed, I provided tech support to their "tech support" phone drone, it stayed broken for days or weeks, and I switched to using OpenDNS instead of theirs. Etc. My greedy cell provider is another story. On my ancient plan as is, any fraction of a MB that's not an SMS message is $3. Outbound SMS text is $0.25 each, which is $1.60 and up per KILOBYTE. Nope, can't just add a decent data option to the existing plan; I must switch to a base plan costing hundreds more per year to add data. (Similar story from 3 different locations I tried. One tried to roll a $200 wireless router purchase into it also.) A base plan without those charges costs more than double per month, as I recently discovered while trying to just add some tide-me-over slow wireless data for one laptop during the DSL outage. If they'd offered something reasonable they could have kept a loyal long-term customer who's already paid them several thousand dollars over the years. In my opinion there are only one or two cell providers charging reasonably and offering plans that fit a range of usage levels, while the others are all upselling and gouging. Time to go phone shopping online. |