mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2020-07-20, 21:01   #2366
preda

"Mihai Preda"
Apr 2015

1,223 Posts

Quote:
 Originally Posted by kriesel Results.txt contains zero bytes.
-verify in gpuowl is, ATM, more of a debugging tool: it lets you check the correctness of a proof, but that is not how primenet implements proof verification. That's why it does not produce a result (in results.txt).

In brief this is how the primenet proof verification will work:
1. the server will do a bit of pre-processing of the proof file (running all the verification steps *excluding* the bulk of the iterations, the part that took most of the time at the end of your verification), and produce a pair of residues A, B that must satisfy A^(2^n)==B.
2. the server will pick a "random" exponent and use it to, let's say, encrypt both A and B:
A'=A^random
B'=B^random
so that A'^(2^n)==B' still holds.
3. the new work-type "CERT" consists of the client downloading A' and n; the result is hash(A'^(2^n)), which is sent back to the server.
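
The blinding in steps 2 and 3 can be checked numerically. What follows is a minimal Python sketch, not PrimeNet's implementation: the modulus 2^127 - 1, the base 3, the 64-bit random exponent and the use of SHA-256 as the hash are all illustrative assumptions, and the function names are hypothetical.

```python
import hashlib
import secrets


def pow_2n(x, n, modulus):
    """Return x^(2^n) mod modulus, computed as n successive squarings."""
    for _ in range(n):
        x = x * x % modulus
    return x


def blind(a, b, r, modulus):
    """Server side: raise both residues to the same secret power r."""
    return pow(a, r, modulus), pow(b, r, modulus)


def hash_residue(residue, modulus):
    """Hash a residue. SHA-256 is a stand-in; the thread does not name the hash."""
    size = (modulus.bit_length() + 7) // 8
    return hashlib.sha256(residue.to_bytes(size, "little")).hexdigest()


def cert_result(a_blinded, n, modulus):
    """Client side: do the n squarings on A' and return only the hash."""
    return hash_residue(pow_2n(a_blinded, n, modulus), modulus)


if __name__ == "__main__":
    modulus = 2**127 - 1               # small Mersenne number, for illustration only
    n = 1000
    a = 3
    b = pow_2n(a, n, modulus)          # a pair that satisfies A^(2^n) == B
    r = secrets.randbits(64) | 1       # the server's secret "random"
    a_blinded, b_blinded = blind(a, b, r, modulus)
    # The server accepts the CERT when the client's hash matches hash(B').
    print(cert_result(a_blinded, n, modulus) == hash_residue(b_blinded, modulus))
```

The check works because (A^r)^(2^n) = (A^(2^n))^r = B^r modulo the same number, so the client never needs to see A, B or the random exponent.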

So:
- CERT is not yet implemented in GpuOwl (but shouldn't be hard)
- the proof file isn't needed
- OTOH a download of A' (one residue) from primenet is needed for CERT.

A question is whether the CERT worktype needs to be implemented at all in GpuOwl. As this work-type is rather tiny and limited in supply (i.e. limited by the number of PRPs completed), probably a small number of participants running mprime can exhaust all the CERTs satisfactorily.

Concerning your proof: keep it around a bit longer, then upload it to primenet as soon as the uploader becomes available. After the upload, it will be turned into a CERT that will most likely be run by somebody else. You could run it too: the fact that you're the original author and hold onto the full proof does not weaken the verification, because of the "random" trick above.

2020-07-21, 00:33   #2367
storm5510
Random Account

Aug 2009
U.S.A.

5·311 Posts

Quote:
 Originally Posted by preda Yes, the GCD is done on the CPU using GNU-MP. It's a convenient solution from the coding POV. The GCD is infrequent, and one GCD takes on the order of 1min on one core of the CPU, no big deal. Porting the fancy GCD algo to GPU would be a lot of work. Worth it if somebody was doing mainly GCDs, but that's not the case for gpuowl ATM.
I am running small P-1's for James Heinrich. The use of the CPU is considerable in what I would call "Stage 1." I can tell by the temperature. I have a widget which sits in the upper-right corner of the screen. "Stage 2" is the same, but with more GPU involvement. The CPU stays around 70°C. I do not see this as a problem.

Overall, I have been quite satisfied with it.

2020-07-21, 21:11   #2368
ewmayer
2ω=0

Sep 2002
República de California

2×13×443 Posts

Quote:
 Originally Posted by storm5510 I am running small P-1's for James Heinrich. The use of the CPU is considerable in what I would call "Stage 1." I can tell by the temperature. I have a widget which sits in the upper-right corner of the screen. "Stage 2" is the same, but with more GPU involvement. The CPU stays around 70°C. I do not see this as a problem.
I would think major CPU involvement would be similar for stages 1 and 2, occurring only at end of each stage, when the GCD is run.

2020-07-21, 22:33   #2369
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1169₁₆ Posts

Quote:
 Originally Posted by ewmayer I would think major CPU involvement would be similar for stages 1 and 2, occurring only at end of each stage, when the GCD is run.
Unless it's the cpu-core-saturated issue that has variously shown up in Windows and Linux. https://www.mersenneforum.org/showpo...7&postcount=11
https://www.mersenneforum.org/showpo...postcount=1829
https://www.mersenneforum.org/showpo...postcount=1730
https://www.mersenneforum.org/showpo...postcount=1587
etc.
Or maybe unusual options, such as -log 10000.

Last fiddled with by kriesel on 2020-07-21 at 22:39

2020-08-04, 17:25   #2370
ATH
Einyen

Dec 2003
Denmark

5600₈ Posts

I'm trying to optimize the speed of the newest version. From Readme.md:

Quote:
 -use NEW_FFT8,OLD_FFT5,NEW_FFT10
What are FFT5, FFT8 and FFT10? Is it using all 3, so that I have to choose OLD, NEW or NEWEST for each of them? Or is it using 1 of the 9 possible combinations?

Btw the FFT10 is not mentioned in the "gpuowl.cl" along with FFT5 and FFT8.

2020-08-04, 18:10   #2371
storm5510
Random Account

Aug 2009
U.S.A.

5·311 Posts

Could someone elaborate on exactly what CERT is?

I must not have gpuOwl configured properly. For each exponent tested, a checkpoint folder is created using the exponent as the folder name. When any particular exponent test is finished, the folder is left behind. I do not see a reason.

About "-use." I changed the default and have it set to use NEW_FFT8 only! An afterthought is did this help or is it a hindrance? Perhaps I need to restore the default and see if there is any difference.

2020-08-04, 18:14   #2372
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1110000000011₂ Posts

Neither Mihai nor I have an nVidia card. We tend to leave around some code that may or may not be useful in the case of nVidia.

The FFT5 and FFT10 would only be used in FFT lengths divisible by 5. The wavefront just passed into 5.5M and 6M territory. The FFT8 one won't be used for quite a while (pass1 or pass2 is 512). Gpuowl prefers to use 256 and 1024 for its passes (fewer registers used).

So, not much to be gained messing with any of the above. The best hope I think is playing with memory layouts. These are the IN and OUT settings, not all combinations work. From the code (numbers are Radeon VII timings):
Code:
// OUT_WG=256, OUT_SIZEX=4, OUT_SPACING=1 (old WorkingOut4) : 154 + 252 = 406 (but may be best on nVidia)
// OUT_WG=256, OUT_SIZEX=8, OUT_SPACING=1 (old WorkingOut3): 124 + 260 = 384
// OUT_WG=256, OUT_SIZEX=32, OUT_SPACING=1 (old WorkingOut5): 105 + 281 = 386
// OUT_WG=256, OUT_SIZEX=8, OUT_SPACING=2: 122 + 249 = 371
// OUT_WG=256, OUT_SIZEX=32, OUT_SPACING=4: 108 + 257 = 365 <- best
// IN_WG=256, IN_SIZEX=4, IN_SPACING=1 (old WorkingIn4) : 177 + 164 (but may be best on nVidia)
// IN_WG=256, IN_SIZEX=8, IN_SPACING=1 (old WorkingIn3): 129 + 166 = 295
// IN_WG=256, IN_SIZEX=32, IN_SPACING=1 (old WorkingIn5): 107 + 171 = 278 <- best
// IN_WG=256, IN_SIZEX=8, IN_SPACING=2: 139 + 166 = 305
// IN_WG=256, IN_SIZEX=32, IN_SPACING=4: 121 + 161 = 282
Use the -time command line argument.

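
The remark about FFT lengths divisible by 5 is easy to check by hand or with a few lines of Python. The sketch below assumes that "M" in an FFT size means 2^20 words (so 5M = 5,242,880); the function names are hypothetical and nothing here is taken from gpuowl's code.

```python
from fractions import Fraction


def fft_length(size_in_m):
    """Convert a size such as "5.5" (in units of M = 2**20) to an integer length."""
    length = Fraction(size_in_m) * 2**20
    if length.denominator != 1:
        raise ValueError(f"{size_in_m}M is not a whole number of words")
    return int(length)


def divisible_by_5(size_in_m):
    """True when the FFT length has a factor of 5, the case FFT5/FFT10 apply to."""
    return fft_length(size_in_m) % 5 == 0


if __name__ == "__main__":
    for size in ("5", "5.5", "6"):
        print(f"{size}M = {fft_length(size)} words, divisible by 5: {divisible_by_5(size)}")
```

Under that assumption 5M = 5·2^20 has a factor of 5, while 5.5M = 11·2^19 and 6M = 3·2^21 do not.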
2020-08-04, 21:35   #2373
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

4457₁₀ Posts

Quote:
 Originally Posted by storm5510 Could someone elaborate on exactly what CERT is? I must not have gpuOwl configured properly. For each exponent tested, a checkpoint folder is created using the exponent as the folder name. When any particular exponent test is finished, the folder is left behind. I do not see a reason. About "-use." I changed the default and have it set to use NEW_FFT8 only! An afterthought is did this help or is it a hindrance? Perhaps I need to restore the default and see if there is any difference.
Some of this will depend on what software version you are running. CERT is a PrimeNet work type, to verify a PRP proof file. It's not available as a manual assignment, so not applicable to gpuowl.

To generate proof files in future PRP test runs, put -proof in your command line or config.txt. It must be there from the start of an exponent's PRP test, on a Gpuowl version that supports it. From Gpuowl's help output,
Code:
-proof [<power>]   : enable PRP proof generation. Default <power> is 8. Use 8 - 10.
Then, after the primality test is finished, the proof file is generated. The proof file must be uploaded to the server, either through gpuowl's primenet.py or through the uploader program George provided, before a CERT verification can be run on it to validate the primality test. PrimeNet and prime95 automate that upload, as well as the download of a file to verify (the input to the CERT assignment) and the upload of the verification result (the output of the CERT assignment).
For example, https://www.mersenne.org/report_expo...7829899&full=1 was PRP tested by Mihai and then the PrimeNet-connected prime95 on my laptop called falcon got assigned and completed the CERT assignment. The prime95 cert does more than gpuowl's verify,
Code:
-verify <file>|<exponent> : verify PRP-proof contained in <file> or in the folder <exponent>/
since prime95 sends verification results back to the server which confirms it's valid.

2020-08-05, 00:30   #2374
storm5510
Random Account

Aug 2009
U.S.A.

5×311 Posts

Quote:
 Originally Posted by Prime95 The FFT5 and FFT10 would only be used in FFT lengths divisible by 5. The wavefront just passed into 5.5M and 6M territory.
I ran a wavefront P-1 to test the effects of inserting OLD_FFT5 ahead of NEW_FFT8 in my "-use" line. The result was considerable. In previous tests with NEW_FFT8, the runtime for a similar test was 150 minutes, give or take. The inclusion of OLD_FFT5 reduced this to 105 minutes. I should have left well-enough alone.

Quote:
 Originally Posted by kriesel CERT is a PrimeNet work type, to verify a PRP proof file. For example, https://www.mersenne.org/report_expo...7829899&full=1 was PRP tested by Mihai and then the PrimeNet-connected prime95 on my laptop called falcon got assigned and completed the CERT assignment. The prime95 cert does more than gpuowl's verify,
Thank you for the reply. I understand the process. A person runs the PRP test, another runs the DC test, and still another will be assigned the CERT verification test. I believe I may see where this is headed, the elimination of LL and LL-DC. Even though LL and PRP are not my cup-of-tea because of the time required, I still like to keep a grasp on the processes involved.

2020-08-05, 00:39   #2375
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

1110000000011₂ Posts

Quote:
 Originally Posted by storm5510 I understand the process. A person runs the PRP test, another runs the DC test, and still another will be assigned the CERT verification test.
Not quite. There is no DC test. You run the PRP test, upload the proof file, someone runs the CERT test. Done.

2020-08-05, 13:29   #2376
ATH
Einyen

Dec 2003
Denmark

2⁷·23 Posts

On Google Colab Pro Tesla P100-PCIE-16GB-0 the best I could get with the old v6.11-238-g62a3025 was 809 us/iteration on a 91.6M exponent (5M FFT):

-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=2

In the new version v6.11-366-gf887d6e the best I can get after a lot of testing on the same exponent is 840-841 us/iteration, which is close enough considering we now save a DC. The settings are almost the same, except for the settings that no longer exist:

-use CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=4

Any of the 4 combinations of these settings gives 840-841 us/iteration:

OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4
OUT_WG=16,OUT_SIZEX=8,OUT_SPACING=8

IN_WG=64,IN_SIZEX=8,IN_SPACING=4
IN_WG=128,IN_SIZEX=16,IN_SPACING=1

Last fiddled with by ATH on 2020-08-05 at 13:32
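
For anyone repeating this kind of tuning, the candidate "-use" strings can be generated rather than typed. The following Python sketch only builds the strings from the OUT and IN settings quoted in this post; it does not launch gpuowl or read timings, since that depends on your setup. The function names are hypothetical, and prefixing CARRY32 to every candidate is an assumption based on the full "-use" line above.

```python
from itertools import product

# The two OUT and two IN setting groups quoted in the post.
OUT_SETTINGS = [
    {"OUT_WG": 64, "OUT_SIZEX": 8, "OUT_SPACING": 4},
    {"OUT_WG": 16, "OUT_SIZEX": 8, "OUT_SPACING": 8},
]
IN_SETTINGS = [
    {"IN_WG": 64, "IN_SIZEX": 8, "IN_SPACING": 4},
    {"IN_WG": 128, "IN_SIZEX": 16, "IN_SPACING": 1},
]


def use_argument(*groups, extra=("CARRY32",)):
    """Join flag names and KEY=VALUE pairs into one comma-separated string."""
    parts = list(extra)
    for group in groups:
        parts.extend(f"{key}={value}" for key, value in group.items())
    return ",".join(parts)


def candidate_use_arguments():
    """Return one string per OUT/IN pairing (2 x 2 = 4 candidates)."""
    return [use_argument(out, inp) for out, inp in product(OUT_SETTINGS, IN_SETTINGS)]


if __name__ == "__main__":
    for argument in candidate_use_arguments():
        print("-use", argument)
```

Each printed line can then be timed separately, for example with the -time argument mentioned earlier in the thread. As noted above, not all IN and OUT combinations work, so a generated candidate may simply be rejected.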


All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.