[QUOTE=ixfd64;516561]Are P-1 save files incompatible between versions 29.4 and 29.8 on macOS?[/QUOTE]
Yes. A bug was fixed in version 29.4 build 7 which made P-1 save files in the first stage incompatible with prior versions. Thanks for the example save file. |
[QUOTE=ixfd64;516638]I noticed a minor inconsistency: Prime95 always appends two newlines to [C]results.bench.txt[/C] when it adds an entry. In the other results files, Prime95 only adds one newline at a time. I assume this is unintentional.[/QUOTE]
I think I found the cause. In the [C]commonb.c[/C] file, there is an extra line of code: [CODE]writeResultsBench ("\n");[/CODE] George, can you confirm whether this is intended? |
[QUOTE=ixfd64;519554]I think I found the cause. In the [C]commonb.c[/C] file, there is an extra line of code:
[CODE]writeResultsBench ("\n");[/CODE] George, can you confirm whether this is intended?[/QUOTE] I think the extra newline looks good. |
Thanks for the clarification. I agree it improves the readability of the benchmark results.
However, you may still want to look at the issue where the timestamp is printed when there are no further results: [url]https://mersenneforum.org/showpost.php?p=516838&postcount=226[/url] |
2 Attachment(s)
I recently set up Prime95 on my new MacBook Pro. This computer has a dual-core processor. I want to run two workers using one core each, but the option to adjust the number of cores per worker is greyed out. Prime95 complains there are not enough cores to run all workers when I try to save the settings. I've set [C]WorkerThreads=2[/C] and [C]CoresPerTest=1[/C] in my [C]local.txt[/C] file, but the issue persists.
|
Just a cosmetic "bug".
Windows version 29.8 build 3: I run light P-1 on candidates like this: [C]Pminus1=1,115000020,262144,1,5000,80000[/C]. The predicted time per test that Prime95 shows is about 7 days, but on one core of an i5-3570K it finishes in 360 seconds. |
Windows 64-bit 29.8 build 5 is available. Very minor bugs fixed as documented in post #2
|
1 Attachment(s)
Build 5 installed but no change (see screenshot).
Second, the link still points to build 3: [QUOTE]Download links: Windows 64-bit: [URL]ftp://mersenne.org/gimps/p95v298b3.win64.zip[/URL][/QUOTE]
Third, I found the bug mentioned before in this thread (on the first page). Worker 1 and worker 2 do P-1 on work units, but both work units have the same exponent (in this case 262144). When it is time to write the intermediate file, worker 1 writes the file and worker 2 tries to read it, but it is not the correct one, since Prime95 writes only one file. I assume Prime95 names the intermediate file by exponent, and that is the root of the problem. I suggest adding a prefix to the intermediate file name that identifies the worker (something like worker1_m0262144.bu and worker2_m0262144.bu).
[Mon Jul 01 02:36:21 2019]
115002638^262144+1 completed P-1, B1=5000, B2=80000, Wh4: 06E0462F
Error reading intermediate file: m0262144
Renaming m0262144 to m0262144.bad1
All intermediate files bad. Temporarily abandoning work unit.
Trying backup intermediate file: m0262144.bad1
Error reading intermediate file: m0262144.bad1 |
[QUOTE=pepi37;520423]...Suggest to add prefix to name of intermediate file that will mark worker name and problem is solved. (something like worker1_m0262144.bu and worker2_m0262144.bu)...[/QUOTE]
That seems like a good idea, but it's not practical if you sometimes move exponents between workers for some reason. I don't know if it's common, but I used to do it a lot when I had more machines... I sometimes moved an almost done exponent to a different machine entirely so I could focus one of the "big boys" on a larger exponent or whatever. Anyway, things like that would throw a wrench into naming temp files for the worker #. But otherwise, in rare cases like yours, there should be some way to avoid having the workers clobber each other's intermediate files. I just don't know how common that is since it's not something you'd do with LL or PRP. Only factoring work would make sense to do that, so perhaps something else in the filename that indicates things like TF level, bounds, etc. so they're distinct in that way. |
Considering that the work being done was on different numbers, the solution is obvious -- the save filename should be a function of the full number (k*b^n+c), rather than just the exponent.
Gone are the days when P95 was used only for mersenne numbers where such a scheme would have worked. |
Since so few people will work on two different numbers using the same exponent, and since it is such a specialised type of work, wouldn't it be easier if the user ran the program from different folders ?
Jacob (Because using a unique name based on the full number might be cumbersome when the numbers are millions of bits...) |
[QUOTE=S485122;520433](Because using a unique name based on the full number might be cumbersome when the numbers are millions of bits...)[/QUOTE]
Numbers might be millions of bits, but k,b,n,c are not. And their unique hash can be smaller still. Most file systems now support really long file names. |
[QUOTE=axn;520443]Numbers might be millions of bits, but k,b,n,c are not. And their unique hash can be smaller still.
Most file systems now support really long file names.[/QUOTE]I think a hash is a needless complication, and a potential clash problem. Just put k_b_n_c directly. |
[QUOTE=retina;520446]I think a hash is a needless complication, and a potential clash problem. Just put k_b_n_c directly.[/QUOTE]
Perfect! |
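A minimal sketch of the k_b_n_c naming idea discussed above. The helper `save_file_name`, the `m` prefix, and the `p`/`m` sign letters are hypothetical choices for illustration; this is not Prime95's actual naming code. The two bases are the ones that appear earlier in the thread (the worktodo example and the results line).

```python
def save_file_name(k: int, b: int, n: int, c: int, prefix: str = "m") -> str:
    """Build a save-file name from all four parameters of k*b^n+c.

    Hypothetical scheme: the sign of c is spelled out as a letter so the
    name contains no '+' or '-' characters.
    """
    sign = "p" if c >= 0 else "m"
    return f"{prefix}_{k}_{b}_{n}_{sign}{abs(c)}"


# Two different numbers that share the exponent 262144.
print(save_file_name(1, 115002638, 262144, 1))
print(save_file_name(1, 115000020, 262144, 1))
```

Because every parameter is part of the name, two workers testing different bases with the same exponent no longer share one file, and the name does not change if a work unit is moved to another worker.
|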
[QUOTE=Prime95;520421]Windows 64-bit 29.8 build 5 is available. Very minor bugs fixed as documented in post #2[/QUOTE]
Was the download link in Post 1 of this thread not updated? Prime95.exe inside the ZIP file has a date of April 22nd. Also - Did they ever look into the ETA issues? On my new 9900k estimated completion times are all over the map and wildly incorrect. |
[QUOTE=rainchill;520462]Was the download link in Post 1 of this thread not updated? Prime95.exe inside the ZIP file has a date of April 22nd.
Also - Did they ever look into the ETA issues? On my new 9900k estimated completion times are all over the map and wildly incorrect.[/QUOTE] I will update link in first post. The issue of estimated completion dates has not been investigated. |
[QUOTE=Prime95;520421]Windows 64-bit 29.8 build 5 is available[/QUOTE](Just as I'm finishing 29.8b3 rollout... ;)
|
user entered last entries in results.json.txt overwritten
In a non-primenet-connected V29.8b5 instance, in results.json.txt I enter [QUOTE]reported (some date)[/QUOTE]and CRLF on a line following what I manually report, save, and close.
The next result prime95 v29.8b5 enters overwrites such manual entries. This has happened multiple times. GPU apps don't do that, they merely append to results.txt. My recollection is earlier versions of prime95 did not overwrite such user entries in results.txt, merely appending to whatever's there. |
I don't know if this has been reported, but if not: the benchmark in 29.8b3 (Linux) and 29.8b5 (Windows) is totally broken when you try to benchmark a small FFT (like 192 or 224).
|
[QUOTE=pepi37;523043]I don't know if this has been reported, but if not: the benchmark in 29.8b3 (Linux) and 29.8b5 (Windows) is totally broken when you try to benchmark a small FFT (like 192 or 224).[/QUOTE]
Please elaborate. What dialog box inputs? What happens? What CPU? |
1 Attachment(s)
[QUOTE=Prime95;523044]Please elaborate. What dialog box inputs? What happens? What CPU?[/QUOTE]
Results are attached: Linux version of mprime, the latest one plus 29.7 and 29.6. Only on 29.6 did I get results:
[QUOTE]Prime95 64-bit version 29.6, RdtscTiming=1
Timings for 224K FFT length (3 cores, 1 worker): 0.73 ms. Throughput: 1366.25 iter/sec.
Timings for 224K FFT length (3 cores, 2 workers): 1.44, 0.98 ms. Throughput: 1715.73 iter/sec.
Timings for 224K FFT length (3 cores, 3 workers): 1.38, 1.42, 1.47 ms. Throughput: 2109.09 iter/sec.
[/QUOTE]
If I run the benchmark on 29.7 or 29.8, the output starts the same way, but after the line
[QUOTE]Prime95 64-bit version 29.7(8), RdtscTiming=1[/QUOTE]
the benchmark doesn't start; it just puts me back at the menu. The same happens on Windows. |
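The throughput figures in the output above look consistent with summing each worker's iteration rate (1000 divided by its time per iteration in ms). That formula is an assumption inferred from the printed numbers, not taken from the Prime95 source, and `throughput` is a hypothetical helper name. A minimal sketch:

```python
def throughput(timings_ms):
    """Sum of per-worker iteration rates, in iterations per second."""
    return sum(1000.0 / t for t in timings_ms)


# Timings copied from the 29.6 output above, paired with the reported throughput.
reported = [
    ([0.73], 1366.25),
    ([1.44, 0.98], 1715.73),
    ([1.38, 1.42, 1.47], 2109.09),
]
for timings, printed in reported:
    print(timings, round(throughput(timings), 2), "reported:", printed)
```

The recomputed values differ slightly from the reported ones, presumably because the printed timings are rounded to two decimals.
|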
Very interesting. Please set AffinityVerbosityBench=3 in prime.txt and try the failing test again.
|
[QUOTE=Prime95;523074]Very interesting. Please set AffinityVerbosityBench=3 in prime.txt and try the failing test again.[/QUOTE]
This solved the problem!
[QUOTE]Prime95 64-bit version 29.8, RdtscTiming=1
Timings for 448K FFT length (3 cores, 1 worker): 1.04 ms. Throughput: 958.49 iter/sec.
[Sun Aug 4 19:11:00 2019]
Timings for 448K FFT length (3 cores, 2 workers): 2.22, 1.36 ms. Throughput: 1184.68 iter/sec.
Timings for 448K FFT length (3 cores, 3 workers): 2.17, 2.22, 2.18 ms. Throughput: 1368.89 iter/sec.[/QUOTE]
Thanks! |
[QUOTE=pepi37;523081]This solved the problem![/QUOTE]
This wasn't supposed to solve the problem! It was only supposed to print out more information to help with debugging. P.S. Your new output was for the 448K FFT, not the 192K and 224K on which you were reporting problems. |
1 Attachment(s)
[QUOTE=Prime95;523104]This wasn't supposed to solve the problem! It was only supposed to print out more information to help with debugging.
P.S. Your new output was for the 448K FFT, not the 192K and 224K on which you were reporting problems.[/QUOTE] But it works in those cases too; if you need the output, I will provide it. Also, this is a benchmark from my i7-2700K (HT is off). Look at the first results: they are absurd (ten or more times slower, when they should be faster). |
[QUOTE=kriesel;522947]In a non-primenet-connected V29.8b5 instance, in results.json.txt I enter reported date and CRLF on a line following what I manually report, save, and close.
The next result prime95 v29.8b5 enters overwrites such manual entries. This has happened multiple times. GPU apps don't do that, they merely append to results.txt. My recollection is earlier versions of prime95 did not overwrite such user entries in results.txt, merely appending to whatever's there.[/QUOTE]Never mind, false alarm, pilot error (specifically, navigation). |
I'm doing some throughput benchmarking on one of my computers and I don't really understand how mprime decides how many times each FFT size gets tested. I have "Benchmark all FFT implementations to find best one for your machine" set to "Y", so is this a fixed, hardcoded number of implementations for each FFT size? If so, is it possible that we could get an overall time estimate before the benchmark starts?
So something like:
[code]
You have selected to test 12800 combinations of parameters for 10s each, this is estimated to take about 35h33m
Accept the answers above? (Y):
[/code]
My problem is that it's not clear how many different FFT sizes exist for a given range, *and* it's not clear how many implementations exist for a given size. So I have no idea how long to expect any benchmark to take. Could be minutes, could be hours, or days.
Or maybe some indication during the benchmark of how progress is going, percent complete? So for example I started my benchmark on 2560K. And every time it goes through testing the core count, worker count combinations, it just writes "Timing 2560K FFT" again with no indication of the difference from the last round. Could it instead print "Timing 2560K FFT, Implementation [2/20]" or whatever the total is, so I have some idea if/when it will get past this FFT size?
edit: A couple of other questions related to benchmark settings:
1) "Benchmark with round-off checking enabled": I don't really get what this option means. Is it the error checking? Would enabling it give more accurate timings (truer to real work usage)?
2) "Benchmark all-complex FFTs (for LLR,PFGW,PRP users)": Again, not sure. Does this mean I should select Y if I intend to do "PRP" work types directly through mprime/PrimeNet? |
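The estimate in the mock-up prompt above is plain arithmetic: 12800 combinations at 10 s each is 128000 s. A minimal sketch of that calculation follows; `format_estimate` is a hypothetical helper, not an mprime function, and it ignores any per-test setup time.

```python
def format_estimate(combinations: int, seconds_each: int) -> str:
    """Format combinations * seconds_each as hours and minutes."""
    total_seconds = combinations * seconds_each
    hours, remainder = divmod(total_seconds, 3600)
    minutes = remainder // 60  # leftover seconds are dropped
    return f"{hours}h{minutes:02d}m"


print(format_estimate(12800, 10))  # 128000 s -> 35h33m
```

Knowing the number of combinations up front is the hard part; once mprime has that count, the estimate itself is a one-liner.
|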
Also I just ran into a configuration which seems to cause a repeatable crash on this computer.
Dual socket Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
mprime 29.8b3 on Ubuntu Server 18.04
[code]
Your choice: 16

Benchmark type (0 = Throughput, 1 = FFT timings, 2 = Trial factoring) (0):

FFTs to benchmark
Minimum FFT size (in K) (2560):
Maximum FFT size (in K) (2560):
Benchmark with round-off checking enabled (N):
Benchmark all-complex FFTs (for LLR,PFGW,PRP users) (N):

CPU cores to benchmark
Number of CPU cores (comma separated list of ranges) (24): 12
Benchmark hyperthreading (N):

Throughput benchmark options
Benchmark all FFT implementations to find best one for your machine (N):
Number of workers (comma separated list of ranges) (1,2,12): 2
Time to run each benchmark (in seconds) (5):

Accept the answers above? (Y):
Main Menu

1. Test/Primenet
2. Test/Worker threads
3. Test/Status
4. Test/Stop
5. Test/Exit
6. Advanced/Test
7. Advanced/Time
8. Advanced/P-1
9. Advanced/ECM
10. Advanced/Manual Communication
11. Advanced/Unreserve Exponent
12. Advanced/Quit Gimps
13. Options/CPU
[Main thread Aug 12 10:54] Starting worker.
14. Options/Preferences
15. Options/Torture Test
16. Options/Benchmark
17. Help/About
18. Help/About PrimeNet Server
Your choice: [Worker #1 Aug 12 10:54] Worker starting
[Worker #1 Aug 12 10:54] Your timings will be written to the results.txt file.
[Worker #1 Aug 12 10:54] Compare your results to other computers at http://www.mersenne.org/report_benchmarks
[Worker #1 Aug 12 10:54] Setting affinity to run worker on CPU core #1
[Worker #1 Aug 12 10:54] Affinity set to cpuset 0x01000001
[Worker #1 Aug 12 10:54] Benchmarking multiple workers to measure the impact of memory bandwidth
[Worker #1 Aug 12 10:54] Timing 2560K FFT, 12 cores, 2 workers.
[Aug 12 10:54] Setting affinity to run worker on CPU core #1
[Worker #1 Aug 12 10:54] Setting affinity to run worker on CPU core #13
[Worker #1 Aug 12 10:54] Affinity set to cpuset 0x01000001
[Worker #1 Aug 12 10:54] Affinity set to cpuset 0x00000010,0x00001000
Floating point exception (core dumped)
[/code] |
I am testing a bunch more combinations of core count and worker counts and found some other situations causing errors or crashes.
First, I found that benchmarking all possible [1,24] core counts is fine if I limit to just 1 worker. Then I found a couple more weird combinations which crashed:
[code]
[Worker #1 Aug 12 12:16] Timing 2560K FFT, 16 cores, 10 workers.
Floating point exception (core dumped)
...
[Worker #1 Aug 12 12:20] Timing 2560K FFT, 16 cores, 12 workers.
Floating point exception (core dumped)
[/code]
Should these combinations (where workers does not divide cores) even be allowed/attempted?
I then tried a more comprehensive test of all multiples of 2 for core counts and worker counts, which ended up with a bunch of errors trying to set CPU affinities from #25 up to #524:
[code]
[Worker #1 Aug 12 12:24] Timing 2560K FFT, 2 cores, 2 workers.
[Aug 12 12:24] Error setting affinity to core #25. There are 24 cores.
[Worker #1 Aug 12 12:24] Error setting affinity to core #26. There are 24 cores.
[Worker #1 Aug 12 12:24] Error setting affinity to core #27. There are 24 cores.
... (Errors for all numbers core #25 through #524) ...
[Worker #1 Aug 12 12:24] Error setting affinity to core #523. There are 24 cores.
[Worker #1 Aug 12 12:24] Error setting affinity to core #524. There are 24 cores.
[Worker #1 Aug 12 12:24] Timing 2560K FFT, 4 cores, 2 workers.
[Aug 12 12:24] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:26] Timing 2560K FFT, 4 cores, 4 workers.
[Aug 12 12:26] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:26] Timing 2560K FFT, 6 cores, 2 workers.
[Aug 12 12:26] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:27] Timing 2560K FFT, 6 cores, 4 workers.
[Aug 12 12:27] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:27] Timing 2560K FFT, 6 cores, 6 workers.
[Aug 12 12:27] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:28] Timing 2560K FFT, 8 cores, 2 workers.
[Aug 12 12:28] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:28] Timing 2560K FFT, 8 cores, 4 workers.
[Aug 12 12:28] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:28] Timing 2560K FFT, 8 cores, 6 workers.
[Aug 12 12:28] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:29] Timing 2560K FFT, 8 cores, 8 workers.
[Aug 12 12:29] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:29] Timing 2560K FFT, 10 cores, 2 workers.
[Aug 12 12:29] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:29] Timing 2560K FFT, 10 cores, 4 workers.
[Aug 12 12:29] Error setting affinity to core #25. There are 24 cores.
... 500 errors again ...
[Worker #1 Aug 12 12:29] Timing 2560K FFT, 10 cores, 6 workers.
Floating point exception (core dumped)
[/code]
I tried to keep the output small and manageable for this, so I didn't have the "AffinityVerbosityBench=3" set as in my previous crash report. Please let me know if there's any specific configuration you'd like me to provide more detailed logs for. |
[QUOTE=hansl;523605]I am testing a bunch more combinations of core count and worker counts and found some other situations causing errors or crashes. [/QUOTE]
I've finally found the source of this bug! I'll work on a fix. For now, put "NumThreadingNodes=1" in local.txt and the problem should go away. |
[QUOTE=Prime95;523848]I've finally found the source of this bug! I'll work on a fix.
For now, put "NumThreadingNodes=1" in local.txt and the problem should go away.[/QUOTE] Great! I was able to test again all multiples of 2 cores and workers with this and none failed. One more question: I noticed that a higher worker count seems to add to the time of the individual benchmark stages, even though I am testing at supposedly 5 seconds each. I was just manually looking at the clock and counting seconds during a run, and it seemed that 24 cores/1 worker was around 5-6s but 24 cores/24 workers took about 36 seconds. Is that expected, and would the throughput numbers still be correct if it's expecting 5 seconds? |
[QUOTE=hansl;523876] I noticed that higher worker count seems to add to the time of the individual benchmark stages, even though I am testing at supposedly 5 seconds each. I was just manually looking at the clock and counting seconds during a run, and it seemed that 24 cores/1 worker was around 5-6s but 24 cores/24 workers took about 36 seconds.
Is that expected, and would the throughput numbers still be correct if it's expecting 5 seconds?[/QUOTE] The process is to launch 24 threads, init them all, then wait for all to complete initialization, then do the 5 seconds of counting iterations. I hope the increased run time is for doing 24 initializations vs. just one initialization. I'll add an option to print a message that all workers have finished initialization so that you can see if the wall clock time for the single worker and 24 worker cases are both about 5 seconds once initialization completes. |
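The launch, initialize, wait, then count sequence described above can be sketched with a barrier. This is only an illustration under stated assumptions: `run_benchmark` is a hypothetical function, `time.sleep` stands in for FFT setup, and a counter increment stands in for one iteration. It is not how mprime itself is written.

```python
import threading
import time


def run_benchmark(num_workers, init_seconds, bench_seconds):
    """Return (per-worker counts, initialization time, timed-window length)."""
    barrier = threading.Barrier(num_workers + 1)  # workers plus the coordinator
    stop = threading.Event()
    counts = [0] * num_workers

    def worker(index):
        time.sleep(init_seconds)      # stand-in for per-worker initialization
        barrier.wait()                # report "initialization complete"
        while not stop.is_set():
            counts[index] += 1        # stand-in for one benchmark iteration

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    wall_start = time.monotonic()
    for thread in threads:
        thread.start()
    barrier.wait()                    # returns once every worker has initialized
    timed_start = time.monotonic()
    time.sleep(bench_seconds)         # the fixed counting window
    stop.set()
    for thread in threads:
        thread.join()
    timed_end = time.monotonic()
    return counts, timed_start - wall_start, timed_end - timed_start


counts, init_time, timed_window = run_benchmark(4, 0.2, 0.5)
print(f"init {init_time:.2f}s, timed window {timed_window:.2f}s, counts {counts}")
```

The point of the sketch is that the counting window is measured only after the barrier releases, so a longer total wall-clock time with more workers does not by itself mean the timed window grew.
|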
In 29.8 build 6 add "BenchInitCompleteMessage=1" to prime.txt.
|
[QUOTE=Prime95;523883]In 29.8 build 6 add "BenchInitCompleteMessage=1" to prime.txt.[/QUOTE]
OK, thank you! Does this mean the build is out already? If so, can you link to the Linux 64-bit version? |
29.8 build 6 is now ready. See first post.
|
[QUOTE=Prime95;523898]29.8 build 6 is now ready. See first post.[/QUOTE]
Chrome will soon be dropping support for the FTP protocol, starting with version 82. It's currently at version 76. Can we change the links in the first post to: [c]https://www.mersenne.org/ftp_root/gimps/p95v298b6.linux64.tar.gz[/c] and so forth? That will have the added benefit of being more secure than downloading executables over an unencrypted connection. |
[QUOTE=Prime95;523898]29.8 build 6 is now ready. See first post.[/QUOTE]Trying to run the win64 version I get the following error message: "prime95.exe - Application error / The application was unable to start correctly (0xc000007b). Click OK to close the application." It seems the file libhwloc-15.dll is the culprit: using the version from 29.8b3 does not give the problem. (Could it be that the DLL is the 32-bit version? It is smaller than the version shipped with 29.8b3...)
Then a cosmetic correction is also needed : in the Windows 64 version, the File Version and the Product version are stuck at 28.8.1.0 and 29.8.0.0. Jacob |
Was running 29.6b3... now updated to 29.8b6, Linux 64-bit.
I changed the CPU on one machine, from Ryzen 3 2200G to Ryzen 5 3600. It seems to be running fine with the memory at 3600 MHz, even on a B350 motherboard (too early to tell for sure, but a few hours of torture tests were OK).
Just a few questions, though. Is there anything that I need to do because the hardware has now basically changed quite a bit? I already changed the number of cores used from 4 to 6.
And on Zen 2, what FFT is supposed to be used, FMA3 or AVX2? For some reason, the program selects FMA3 (FFT length 2688K). On the 2200G, I could force AVX2 through options in local.txt, but of course, it ran slower than FMA3, as expected.
On the 3600, it gives this error, when trying to continue from a savefile:
[CODE]
[Work thread Aug 19 14:37] Cannot initialize FFT code, errcode=1002
[Work thread Aug 19 14:37] Number sent to gwsetup is too large for the FFTs to handle.
[/CODE]
CPU info from results.bench.txt (I'm not including the cache lines, probably not relevant info anyway):
[CODE]
AMD Ryzen 5 3600 6-Core Processor
CPU speed: 4184.81 MHz, 6 hyperthreaded cores
CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, AVX2
L1 cache size: 6x32 KB, L2 cache size: 6x512 KB, L3 cache size: 2x16 MB
L1 cache line size: 64 bytes, L2 cache line size: 64 bytes
Machine topology as determined by hwloc library:
Machine#0 (total=16351064KB, DMIProductName="To Be Filled By O.E.M.", DMIProductVersion="To Be Filled By O.E.M.", DMIBoardVendor=ASRock, DMIBoardName="AB350M Pro4", DMIBoardVersion=, DMIBoardAssetTag=, DMIChassisVendor="To Be Filled By O.E.M.", DMIChassisType=3, DMIChassisVersion="To Be Filled By O.E.M.", DMIChassisAssetTag="To Be Filled By O.E.M.", DMIBIOSVendor="American Megatrends Inc.", DMIBIOSVersion=P6.00, DMIBIOSDate=08/02/2019, DMISysVendor="To Be Filled By O.E.M.", Backend=Linux, LinuxCgroup=/, OSName=Linux, OSRelease=4.19.0-2-amd64, OSVersion="#1 SMP Debian 4.19.16-1 (2019-01-17)", HostName=palvi2, Architecture=x86_64, hwlocVersion=2.0.4, ProcessName=mprime)
Package#0 (total=16351064KB, CPUVendor=AuthenticAMD, CPUFamilyNumber=23, CPUModelNumber=113, CPUModel="AMD Ryzen 5 3600 6-Core Processor ", CPUStepping=0)
[/CODE] |
[QUOTE=S485122;523925]Trying to run the win64 version I get the following error message: "prime95.exe - Application error / The application was unable to start correctly (0xc000007b). Click OK to close the application." It seems the file libhwloc-15.dll is the culprit: using the version from 29.8b3 does not give the problem. (Could it be that the DLL is the 32-bit version? It is smaller than the version shipped with 29.8b3...)
Then a cosmetic correction is also needed : in the Windows 64 version, the File Version and the Product version are stuck at 28.8.1.0 and 29.8.0.0. Jacob[/QUOTE] I can confirm the error. 8700K CPU on win10 x64 1903. Tested in a clean VM as well and received the same error. |
[QUOTE=Random;523953]I can confirm the error. 8700K CPU on win10 x64 1903. Tested in a clean VM as well and received the same error.[/QUOTE]
My bad, included the 32-bit hwloc-15.dll. I repaired and uploaded the win64 zip file |
[QUOTE=Prime95;523962]My bad, included the 32-bit hwloc-15.dll.
I repaired and uploaded the win64 zip file[/QUOTE] The app starts normally now, thanks a lot for all your work. |
[QUOTE=nomead;523929]
And on Zen 2, what FFT is supposed to be used, FMA3 or AVX2? For some reason, the program selects FMA3 (FFT length 2688K). On the 2200G, I could force AVX2 through options in local.txt, but of course, it ran slower than FMA3, as expected. [/quote]
All Ryzen CPUs should use FMA3 FFTs for maximum speed.
[quote] On the 3600, it gives this error, when trying to continue from a savefile:
[CODE]
[Work thread Aug 19 14:37] Cannot initialize FFT code, errcode=1002
[Work thread Aug 19 14:37] Number sent to gwsetup is too large for the FFTs to handle.
[/CODE]
[/QUOTE]
Hmm. Does the worktodo.txt entry specify a specific FFT size? If so, remove it. Otherwise, PM me the savefile and worktodo.txt entry that is failing. |
[QUOTE=GP2;523915]Chrome will soon be dropping support for the FTP protocol, starting with version 82. It's currently at version 76.
Can we change the links in the first post to: [c]https://www.mersenne.org/ftp_root/gimps/p95v298b6.linux64.tar.gz[/c] and so forth? That will have the added benefit of being more secure than downloading executables over an unencrypted connection.[/QUOTE] Somewhat related issue: the directory at [url]https://mersenne.org/ftp_root/gimps[/url] gives a 403 error. This makes it hard to download older versions of software. |
[QUOTE=Prime95;523882]The process is to launch 24 threads, init them all, then wait for all to complete initialization, then do the 5 seconds of counting iterations. I hope the increased run time is for doing 24 initializations vs. just one initialization. I'll add an option to print a message that all workers have finished initialization so that you can see if the wall clock time for the single worker and 24 worker cases are both about 5 seconds once initialization completes.[/QUOTE]
[QUOTE=Prime95;523883]In 29.8 build 6 add "BenchInitCompleteMessage=1" to prime.txt.[/QUOTE] So just to follow up: I now confirm that after initialization, the actual benchmarking seems to be constant 5s as configured. Thanks again. |
[QUOTE=ixfd64;523969]Somewhat related issue: the directory at [url]https://mersenne.org/ftp_root/gimps[/url] gives a 403 error. This makes it hard to download older versions of software.[/QUOTE]
Although that web page does just point to the same spot as the FTP files, I haven't enabled directory browsing for it (separate option for web pages). I think it'd be safe to do that since it's the same as the FTP site. I'll double-check... it's hard to overcome the knee-jerk reaction to NEVER enable directory browsing on a web page. :smile: When I was first testing out ftp vs http for the file downloads, HTTP was beating FTP, in terms of speed, by a pretty good amount. Which seemed counter intuitive to me at first. I set the download page accordingly, to prefer http over ftp when it was picking which mirror to use (right now there really aren't any up to date mirrors, so the only one you'll get is http at Primenet anyway). I think someone recently pointed out that although the rest of the site forces HTTPS, the downloads are still HTTP. It didn't make much sense to me to bother encrypting the zip files in transit. If someone's worried about a MITM when downloading, they can compare the hashes we put on the download page, which [I]is[/I] HTTPS. (I know that's all way more info than you asked about, but I like to throw a little "fun fact" stuff out there, just in case more questions come up) |
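For anyone who wants to do the hash comparison mentioned above, here is a minimal sketch. It assumes the published value is a SHA-256 hex digest, which the thread does not state, so swap in whichever algorithm the download page actually lists. The function names are hypothetical.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hash a binary file-like object in chunks and return the hex digest."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_published(stream, published_hex):
    """Compare against a published digest, ignoring case and whitespace."""
    return sha256_of_stream(stream) == published_hex.strip().lower()


# Demo on in-memory bytes; for a real download, pass open(path, "rb") instead.
demo = io.BytesIO(b"abc")
print(matches_published(
    demo, "BA7816BF8F01CFEA414140DE5DAE2223B00361A396177A9CB410FF61F20015AD"))
```

Reading in chunks keeps memory use flat regardless of archive size.
|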
[QUOTE=Madpoo;523984]Although that web page does just point to the same spot as the FTP files, I haven't enabled directory browsing for it (separate option for web pages).
I think it'd be safe to do that since it's the same as the FTP site. I'll double-check... it's hard to overcome the knee-jerk reaction to NEVER enable directory browsing on a web page.[/QUOTE] Done... file listing enabled: [URL="http://www.mersenne.org/ftp_root/gimps/"]http://www.mersenne.org/ftp_root/gimps/[/URL] |
[QUOTE=Prime95;523964]All Ryzen CPUs should use FMA3 FFTs for maximum speed.[/QUOTE]
Okay, good to know that it's supposed to do that even on Zen 2. [QUOTE=Prime95;523964]Hmm. Does the worktodo.txt entry specify a specific FFT size? If so, remove it. Otherwise, PM me the savefile and worktodo.txt entry that is failing.[/QUOTE] [C]DoubleCheck=(AID censored),49683839,74,1[/C] for example. And apparently it doesn't even require a savefile to get that error. It does that if I try to force it to use AVX2 or even AVX (by setting [C]CpuSupportsFMA3=0[/C] and then [C]CpuSupportsAVX=1[/C] in local.txt). Used to work on the Ryzen 3 2200G, but as mentioned, it was slower than FMA3. But anyway, if it's supposed to use FMA3 anyway, I'll be happy with that information. |
What sort of operations does mprime use libgmp for? From grepping the docs I only found a mention to Jacobi error checking, is that the only thing?
I ask because I recently built a dev version of libgmp which has some Zen architecture optimizations (znver1) and it seemed to speed up a particular PARI/GP script for me by maybe 2x. I can't tell what version is bundled with mprime, because apparently all versions of libgmp.so are just labeled 10.3.2 for some reason, even though current stable is 6.1.2. |
[QUOTE=hansl;524009]What sort of operations does mprime use libgmp for? From grepping the docs I only found a mention to Jacobi error checking, is that the only thing?
I ask because I recently built a dev version of libgmp which has some Zen architecture optimizations (znver1) and it seemed to speed up a particular PARI/GP script for me by maybe 2x. I can't tell what version is bundled with mprime, because apparently all versions of libgmp.so are just labeled 10.3.2 for some reason, even though current stable is 6.1.2.[/QUOTE] Jacobi check and GCDs (P-1 and ECM). Mprime is bundled with 6.1.2. |
[QUOTE=Prime95;523964]All Ryzen CPUs should use FMA3 FFTs for maximum speed.[/QUOTE]
Are you really sure? Zen 2 has double the AVX-256 bit speed compared to Zen 1. All data paths were widened for this purpose also. I was kinda hoping for an AVX-256 bit implementation for Zen 2. |
[QUOTE=Evil Genius;524033]Are you really sure? Zen 2 has double the AVX-256 bit speed compared to Zen 1. All data paths were widened for this purpose also. I was kinda hoping for an AVX-256 bit implementation for Zen 2.[/QUOTE]
FMA3 *is* AVX-256. For that matter, AVX is also AVX-256. You're getting AVX-256 and AVX-512 confused. SSE is 128 bit registers. AVX is 256 bit registers. AVX-512 is 512 bit registers. |
[QUOTE=Evil Genius;524033]Are you really sure? Zen 2 has double the AVX-256 bit speed compared to Zen 1. All data paths were widened for this purpose also. I was kinda hoping for an AVX-256 bit implementation for Zen 2.[/QUOTE]
Zen 2 doesn't downclock when doing AVX-256 either. |
[QUOTE=AG5BPilot;524035]FMA3 *is* AVX-256.[/QUOTE]
Then how do Piledriver cores support FMA3 without supporting AVX-256? |
[QUOTE=Mark Rose;524041]Then how do Piledriver cores support FMA3 without supporting AVX-256?[/QUOTE]
There's no such thing as "AVX-256". There's AVX, AVX2 (which isn't important for Prime95/gwnum but comes along with FMA3, which is important), and AVX-512. What you're calling AVX-256 is plain old original AVX, which has been supported by AMD for as long as Intel has supported it. AMD's implementation was crippled, however, so it hasn't been used here until Zen 2. With Zen 2, it's useful, finally -- but it's always been there. Zen2 supports FMA3. And it supports AVX. And, finally, they're as good as Intel's implementation. They don't support AVX-512, but that's a whole different discussion. Edit: Please see the Wikipedia page for the Piledriver architecture: [url]https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)[/url] . It clearly states that Piledriver supports AVX and FMA3. |
George, could you please have a look at this issue?
[url]https://mersenneforum.org/showpost.php?p=519786&postcount=296[/url] On macOS, Prime95 doesn't let me set one core per worker unless I edit the configuration files. This happens even for non-100 million digit work types. |
[QUOTE=AG5BPilot;524035]FMA3 *is* AVX-256.
For that matter, AVX is also AVX-256. You're getting AVX-256 and AVX-512 confused. SSE is 128 bit registers. AVX is 256 bit registers. AVX-512 is 512 bit registers.[/QUOTE] Sigh. No, I'm not confused. AVX also has a 128-bit compatibility mode, which is used on current Zen models. The older ones don't have 256-bit logic, so they execute each 256-bit operation as two consecutive 128-bit operations. The newer Zen models can execute AVX-256 bit code without penalty. |
I should elaborate to prevent confusion:
vxorpd %xmm0,%xmm0,%xmm0 -> AVX-128 bit
vxorpd %ymm0,%ymm0,%ymm0 -> AVX-256 bit
See the difference? There are also two FMA3s: a 128-bit one, and a 256-bit one. The '3' only implies the number of arguments:
vfmadd213pd %xmm2,%xmm1,%xmm1 -> FMA3-128 bit
vfmadd213pd %ymm2,%ymm1,%ymm1 -> FMA3-256 bit |
[QUOTE=Evil Genius;524074]I should elaborate to prevent confusion:
vxorpd %xmm0,%xmm0,%xmm0 -> AVX-128 bit
vxorpd %ymm0,%ymm0,%ymm0 -> AVX-256 bit
See the difference?[/QUOTE]
The difference, yes, but not your point. AVX added the 256-bit ymm# registers. The 128-bit xmm# registers are SSE registers (also usable by AVX instructions). Are you trying to say that Piledriver lacked the sixteen 256-bit ymm# registers? |
[QUOTE=ixfd64;524071]George, could you please have a look at this issue?
[url]https://mersenneforum.org/showpost.php?p=519786&postcount=296[/url] On macOS, Prime95 doesn't let me set one core per worker unless I edit the configuration files. This happens even for non-100 million digit work types.[/QUOTE] I cannot replicate. If I set NumCPUs=2 and CoresPerTest=1 the WorkerWindows dialog comes up properly greyed out. If I set NumCPUs=2 and CoresPerTest=2 the WorkerWindows dialog comes up and lets me edit the CPU counts. What do I need to do differently? |
[QUOTE=Prime95;524077]I cannot replicate. If I set NumCPUs=2 and CoresPerTest=1 the WorkerWindows dialog comes up properly greyed out. If I set NumCPUs=2 and CoresPerTest=2 the WorkerWindows dialog comes up and lets me edit the CPU counts.
What do I need to do differently?[/QUOTE] I don't have [c]NumCPUs[/c] set on this computer. I'm also not able to reproduce this on a Mac Pro. It seems this issue only affects certain computers. |
[QUOTE=AG5BPilot;524076]The difference, yes, but not your point. AVX added the 256-bit ymm# registers.
The 128-bit xmm# registers are SSE registers (also usable by AVX instructions). Are you trying to say that Piledriver lacked the 16 256 bit ymm# registers?[/QUOTE] No, I'm saying that although there's a 256-bit implementation on the outside, on the inside many AVX compatible processors implemented 256-bit operations as 2 consecutive 128-bit operations. Zen 1(+) was no exception, but this changed with Zen 2. Also of note is that if there's no native 256-bit implementation, the 128-bit implementation is faster. |
[QUOTE=Evil Genius;524079]Also of note is that if there's no native 256-bit implementation, the 128-bit implementation is faster.[/QUOTE]
Only in Bulldozer and its descendants. AMD made a real mess of their initial AVX implementation. All Ryzens (to my knowledge) are faster using the 256-bit implementation even if internally it is done 128-bits at a time. |
[QUOTE=ixfd64;524078]I don't have [c]NumCPUs[/c] set on this computer. I'm also not able to reproduce this on a Mac Pro. It seems this issue only affects certain computers.[/QUOTE]
I set NumCPUs=2 to emulate your dual-core machine. |
[QUOTE=AG5BPilot;524053]There's no such thing as "AVX-256". There's AVX, AVX2 (which isn't important for Prime95/gwnum but comes along with FMA3, which is important), and AVX-512.
What you're calling AVX-256 is plain old original AVX, which has been supported by AMD for as long as Intel has supported it. AMD's implementation was crippled, however, so it hasn't been used here until Zen 2. With Zen 2, it's useful, finally -- but it's always been there. Zen2 supports FMA3. And it supports AVX. And, finally, they're as good as Intel's implementation. They don't support AVX-512, but that's a whole different discussion. Edit: Please see the Wikipedia page for the Piledriver architecture: [url]https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)[/url] . It clearly states that Piledriver supports AVX and FMA3.[/QUOTE] On Excavator, AVX and FMA3 work just fine. The performance of the A10-9700 (2m4c4t) is similar to that of a hyperthreaded Kaby Lake i5-8250U (4c8t). |
[QUOTE=Prime95;524081]I set NumCPUs=2 to emulate your dual-core machine.[/QUOTE]
I added [C]NumCPUs=2[/C] to my [C]local.txt[/C] file as an experiment, and it didn't make a difference. I'm at a loss as to why Prime95 won't let me set less than two cores per worker on a dual-core machine when using two worker windows. Any other Mac users seen this issue? |
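For anyone who wants to sanity-check their own settings while this is being tracked down, here is a minimal Python sketch. It assumes only the plain `key=value` layout and the three setting names mentioned in this thread (`WorkerThreads`, `CoresPerTest`, `NumCPUs`). The helper names `parse_settings`, `cores_needed` and `fits` are made up for illustration, the default of 1 for a missing setting is an assumption, and this is not a description of how Prime95 itself validates the dialog.

```python
def parse_settings(text):
    """Parse simple key=value lines; blank lines and lines without '=' are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def cores_needed(settings):
    """Cores requested = workers * cores per worker.

    Assumption: a missing setting is treated as 1.
    """
    workers = int(settings.get("WorkerThreads", 1))
    per_worker = int(settings.get("CoresPerTest", 1))
    return workers * per_worker


def fits(settings, physical_cores):
    """Hypothetical check: does the requested layout fit on the machine?"""
    return cores_needed(settings) <= physical_cores


# Example using the values reported in this thread for a dual-core machine.
example = "WorkerThreads=2\nCoresPerTest=1\nNumCPUs=2\n"
settings = parse_settings(example)
print(cores_needed(settings), fits(settings, int(settings["NumCPUs"])))
```

By this arithmetic, two workers with one core each should fit on a dual-core machine, which is why the greyed-out dialog looks like a bug rather than a configuration mistake.
|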
[QUOTE=Mark Rose;524040]Zen 2 doesn't downclock when doing AVX-256 either.[/QUOTE]
They do, just not in the same way Intel does it. I assume we're familiar with Intel's AVX offset. If there is AVX code running, the clock may be reduced by some fixed amount. It is crude but does the job.
Zen 2 doesn't have an AVX offset concept, but when running Prime95-like code with FMA, it still generates a lot of heat. Based on observation of actual behaviour, running stock, you will hit the PPT limit, and the current limit is also close, so it does still clock down compared to running lesser loads. From memory, my 3600 runs all cores at around 3900 MHz with a 128k FFT per core, and with a lower stress like Cinebench R15 it is well above 4 GHz.
AMD took a more GPU-like approach on Zen 2: it adjusts its clock based on power, current, temperature... so they're not detecting AVX and dropping the clock, but AVX code still uses more power and hits the other limits earlier than lighter loads would, so the clock still drops. |
[QUOTE=Prime95;524080]All Ryzens (to my knowledge) are faster using the 256-bit implementation even if internally it is done 128-bits at a time.[/QUOTE]
So I have been running AVX-256 bit FFTs for years on my Ryzen 1700. You learn something new every day. What's your secret? Good instruction scheduling? My own FFT implementation runs slightly faster on the 1700 when I use AVX-128 bit instead of AVX-256 bit. |
[QUOTE=Evil Genius;524159]What's your secret? Good instruction scheduling? My own FFT implementation runs slightly faster on the 1700 when I use AVX-128 bit instead of AVX-256 bit.[/QUOTE]
I've written SSE2 (128-bit) and AVX (256-bit) versions of FFTs. I've never tried writing an AVX version that only uses half of the register width. I don't see why that would be beneficial. |
[QUOTE=Prime95;524164]I've written SSE2 (128-bit) and AVX (256-bit) versions of FFTs. I've never tried writing an AVX version that only uses half of the register width. I don't see why that would be beneficial.[/QUOTE]
1. three-register non-destructive mode!
2. implied SSE4_2 support
3. possible FMA support
4. faster on some processors (not the mainstream ones, however)
5. explicit zeroing of upper half of register (only important when switching between 128-bit and 256-bit)
Expect about a 5% speed increase compared to SSE2. |
I forgot an important one:
1.5 better scheduling on processors that chop 256-bit operations in half
To summarize:
* three-register non-destructive mode gives AVX-128 about a 5% speed advantage over SSE2
* better instruction scheduling gives AVX-128 about a 5% speed advantage over AVX-256 when the processor doesn't natively support 256 bit
YMMV |
[QUOTE=Evil Genius;524185]
1.5 better scheduling on processors that chop 256-bit operations in half
* better scheduling gives AVX-128 about a 5% speed advantage over AVX-256 when the processor doesn't natively support 256 bit
YMMV[/QUOTE]
I think my mileage will/should vary. No arguments about the improvements over SSE2. I contend that using the full 256 register should be significantly faster unless the chip developer completely screwed up.
a) Twice as much data in registers -- the fastest possible place to store data.
b) Half as many instructions to be read and decoded.
c) Guaranteed no data dependencies executing on data in the upper vs. lower 128 bits (makes it easier to schedule 128-bit uops).
IMO, when AMD screwed up their implementation of splitting 256-bit instructions into two 128-bit uops it was not my job to fix it. I do admit that when I worked on Bulldozer several years ago I did not think of timing AVX on 128-bit operands. |
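Points (a) and (b) above are simple arithmetic, which a short Python sketch can make concrete. It only counts how many double-precision values fit in a register of a given width and how many packed instructions are needed to touch a block of data once. The function names and the block size of 1024 doubles are illustrative assumptions, and the count says nothing about real throughput on chips that split 256-bit operations into two 128-bit uops.

```python
def doubles_per_register(register_bits, element_bits=64):
    """Number of 64-bit doubles that fit in one vector register."""
    return register_bits // element_bits


def packed_ops_needed(n_doubles, register_bits):
    """Packed instructions needed to touch n_doubles values once."""
    lanes = doubles_per_register(register_bits)
    return -(-n_doubles // lanes)  # ceiling division


# Illustrative block of 1024 doubles at each register width.
for name, bits in [("SSE2 / AVX-128", 128), ("AVX-256", 256), ("AVX-512", 512)]:
    print(name, doubles_per_register(bits), packed_ops_needed(1024, bits))
```

Going from 128-bit to 256-bit operands doubles the data per register and halves the instruction count; whether that translates into speed is exactly the hardware question being argued in this thread.
|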
[QUOTE=Prime95;524195]IMO, when AMD screwed up their implementation of splitting 256-bit instructions into two 128-bit uops it was not my job to fix it. I do admit that when I worked on Bulldozer several years ago I did not think of timing AVX on 128-bit operands.[/QUOTE]
They were not the only ones. They deemed compatibility more important than throughput at the time, which is the reason their cores are more efficient power-wise. But if you have time to spare, give it a try on a Zen 1. The results may surprise you. |
[QUOTE=Evil Genius;524197]
But if you have time to spare give it a try on a Zen 1. The results may surprise you.[/QUOTE] I'm not sure what you are referring to when you say "Zen 1". All the benchmark results I've seen from Ryzen chips indicate they are excellent performers. Like Intel chips, they are memory bandwidth limited (for large FFT sizes). |
[QUOTE=Prime95;524209]I'm not sure what you are referring to when you say "Zen 1".
All the benchmark results I've seen from Ryzen chips indicate they are excellent performers. Like Intel chips, they are memory bandwidth limited (for large FFT sizes).[/QUOTE] Zen 1 comprises the AMD Ryzen 1000 series, Zen 1+ the 2000 series, and Zen 2 the 3000 series. AMD Threadrippers will have double the memory channels however. |
[QUOTE=Evil Genius;524211]Zen 1 comprises the AMD Ryzen 1000 series, Zen 1+ the 2000 series, and Zen 2 the 3000 series.
AMD Threadrippers will have double the memory channels however.[/QUOTE] Nope. Ryzen 1000 series is Zen, 2000 series is Zen+, and the 3000 series is Zen 2. |
1 Attachment(s)
Very minor issue: on macOS, part of the text box is always cut off.
|
Can Prime95 be configured to show which FFT will be used when processing of a candidate starts?
For example, by writing that info to a log file or something like that. |
[QUOTE=pepi37;524526]Can Prime95 be configured in such way to show what FFT will be used when processing of candidate is started?
To put that info in some log file or something like that[/QUOTE] It does already appear in the main worker window for me, but not specifically in a log from what I can see: [code] Starting primality test of MXXXXXXXX using FMA3 FFT length 2560K, Pass1=640, Pass2=4K, clm=4, 4 threads [/code] |
[QUOTE=hansl;524529]It does already appear in the main worker window for me, but not specifically in a log from what I can see:
[code] Starting primality test of MXXXXXXXX using FMA3 FFT length 2560K, Pass1=640, Pass2=4K, clm=4, 4 threads [/code][/QUOTE] I would like to put a list of candidates in worktodo.txt and have Prime95 scan it and report which FFT will be used for each. |
Take a look at the link for Mac OS X in the top post:
Mac OS X: [URL]https://mersenne.org/gimps/ftp_root/p95v298b6.MacOSX.zip[/URL]
The web folders ftp_root and gimps are transposed, so it won't download. It takes you to the Home tab on the main website instead. |
[QUOTE=pepi37;524536]I would like to put list of candidates in worktodo.txt so Prime95 scan and write what FFT will be used.[/QUOTE]
With a little bit of experience, you'll know which FFT is used; it's not like LLR where the FFT choice has multiple inputs (k and n), so you'll learn the approximate cutoffs in fairly short time. If you post the architecture you run P95 on, I bet some experienced user can share them from (human) memory. |
[QUOTE=ssybesma;524543]Mac OS X: [URL]https://mersenne.org/gimps/ftp_root/p95v298b6.MacOSX.zip[/URL]
The web folders ftp_root and gimps are transposed and so it won't download. Takes you to the Home tab on the main website.[/QUOTE]On the website there are also some links that need correcting: it is still build 3 that is posted. Also, on the download page all links point to http instead of https.
Jacob |
[QUOTE=S485122;524563]On the website also there are some links that deserve corrections, it is still version 3 that is posted. Also on the site on the download page all links point to http instead of https.
Jacob[/QUOTE] Yep, if I see something I'll let you know. |
[QUOTE=VBCurtis;524554]With a little bit of experience, you'll know which FFT is used; it's not like LLR where the FFT choice has multiple inputs (k and n), so you'll learn the approximate cutoffs in fairly short time.
If you post the architecture you run P95 on, I bet some experienced user can share them from (human) memory.[/QUOTE] Yes, but since FMA and AVX2 appeared, the calculation is not easy anymore. |
[QUOTE=S485122;524563]On the website also there are some links that deserve corrections, it is still version 3 that is posted. Also on the site on the download page all links point to http instead of https.
Jacob[/QUOTE] The links have been updated, but the header still says version 29.8 build 3 is the latest. Also, only the mersenne.org HTTP mirror works. The others (mersenneforum.org and mersenne.ca) currently give a 404 error. |
[QUOTE=ixfd64;524699]Also, only the mersenne.org HTTP mirror works. The others (mersenneforum.org and mersenne.ca) currently give a 404 error.[/QUOTE]I forced an update of the mersenne.ca mirror so those download links should work. I'm not sure who can upload to mersenneforum.org
|
[QUOTE=ixfd64;524699]The others (mersenneforum.org and mersenne.ca) currently give a 404 error.[/QUOTE]The forum mirror works for us.
[URL]https://mersenneforum.org/gimps/[/URL] [QUOTE=James Heinrich;524702]I'm not sure who can upload to mersenneforum.org[/QUOTE]We update the mirror monthly. (It is current now.) :mike: |
@George: the front page and the red header on the downloads page still say build 3 is the latest version.
|
I guess this is a feature request. Could all the custom settings in the torture test window, and also the benchmark settings, please be retained between runs? It gets a bit tedious typing them in each time when trying different hardware configurations.
|
I think I have discovered an abnormal behavior in mprime 29.8 build 6.
While running ECM curves with GmpEcmHook=1, when it finishes a curve it updates results.txt but leaves the backup files in place. If the program is then interrupted before the backup files are updated again, then when restarted it resumes the previous curve data and completes it, and writes a duplicate entry for it in results.txt. Probably the backup files should be deleted when results.txt gets updated. |
hwloc 2.1 was released a few days ago: [url]https://www.open-mpi.org/software/hwloc/v2.1[/url]
George, does this fix any of the issues encountered using the previous version? |
[QUOTE=ixfd64;528009]
George, does this fix any of the issues encountered using the previous version?[/QUOTE] I don't think there were any issues with the previous version. There was one bug that I thought could be hwloc related, but it turned out to be a prime95 bug dealing with a CPU that had multiple L3 caches. |
If a user has 29.8 build 5, what are the circumstances under which build 6 is necessary or just a good idea instead?
|
[QUOTE=Jwb52z;528162]If a user has 29.8 build 5, what are the circumstances under which build 6 is necessary or just a good idea instead?[/QUOTE]
See second post in this thread:
[CODE]22) P-1 is frequently missing factors of numbers of the form 2*3^n+1. Fixed in 29.8 build 6.
23) In a throughput benchmark on machines with multiple L3 caches, some combinations of number-of-cores / number-of-workers would raise errors setting affinity or crash. Fixed in 29.8 build 6.
[/CODE]
So, basically nobody needs to upgrade from build 5 to build 6. |
Feature request: Exit when out of work.
@George: Could you please add this feature when you have a moment.
Basic use-case is in a fully scripted Colab / Kaggle environment. It would be nice to have a setting (probably in prime.txt) which causes mprime to exit when worktodo.txt is empty, rather than continuing to run, consuming 0% of the CPU, while it watches worktodo.txt for more work.
Some of my scripts watch worktodo.txt and SIGINT mprime when it becomes empty. But in some designs it would be much easier if we could just "`./mprime`;" (Perl; "!./mprime" in Python) and then loop around and fetch more work and feed it to a new run of mprime. |
[QUOTE=chalsall;529474]@George: Could you please add this feature when you have a moment.
Basic use-case is in a fully scripted Colab / Kaggle environment. It would be nice having a setting (probably in prime.txt) which causes mprime to exit when worktodo.txt is empty. Rather than it continuing to run, consuming 0% of the CPU, while it watches worktodo.txt for more work.[/QUOTE] Are you expecting mprime to first get work and populate worktodo.txt and run until it is empty? Or are you pre-populating worktodo.txt and running mprime until worktodo.txt is empty? Is this option only for one-worker mprime runs? |
[QUOTE=Prime95;529599]Are you expecting mprime to first get work and populate worktodo.txt and run until it is empty? Or are you pre-populating worktodo.txt and running mprime until worktodo.txt is empty?[/QUOTE]
The latter.
[QUOTE=Prime95;529599]Is this option only for one-worker mprime runs?[/QUOTE]
Not quite sure what you mean by this.
The workflow would be that a new compute instance (nominally a Colab or Kaggle run) hosts an ephemeral mprime run. Work in the worktodo.txt file is processed by mprime, and then mprime exits. It would then be up to the instance's "bootstrap" code to fetch more work, place it in worktodo.txt, and launch mprime again.
I hope that makes sense. Thanks for considering this. |
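Until such a setting exists, the watch-and-SIGINT workaround described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: any non-blank line in worktodo.txt is treated as pending work, the function names `has_work` and `run_until_empty` and the 60-second polling interval are made up for illustration, and mprime is assumed to shut down cleanly on SIGINT as the scripts mentioned above rely on.

```python
import signal
import subprocess
import time
from pathlib import Path


def has_work(worktodo):
    """True if the file exists and holds at least one non-blank line.

    Assumption: any non-blank line counts as pending work.
    """
    path = Path(worktodo)
    if not path.exists():
        return False
    return any(line.strip() for line in path.read_text().splitlines())


def run_until_empty(cmd, worktodo, poll_seconds=60):
    """Start the given command, poll worktodo, and send SIGINT once it is empty."""
    proc = subprocess.Popen(cmd)
    try:
        while proc.poll() is None:
            if not has_work(worktodo):
                proc.send_signal(signal.SIGINT)
                proc.wait()
                break
            time.sleep(poll_seconds)
    finally:
        # Make sure no child process is left behind if the loop is interrupted.
        if proc.poll() is None:
            proc.terminate()
            proc.wait()
    return proc.returncode


# Intended use from the bootstrap code (not executed here):
#   fetch work into worktodo.txt, then
#   run_until_empty(["./mprime"], "worktodo.txt")
#   and loop around to fetch more work.
```

A built-in exit-when-out-of-work option would make the polling loop unnecessary, since the bootstrap code could simply wait for mprime to return.
|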