mersenneforum.org > Factoring Projects > Operation Billion Digits
Old 2021-10-30, 14:15   #12
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)
(edits in quote are in bold)
Quote:
Originally Posted by kriesel View Post
Approximate example: if system A is twice as fast as system B, the two will begin work at the same time, and B2start = B1 (an oversimplification) = 17,000,000, B2range=1G-17M = 983M. B2range/3=~328M.
System A runs stage 2 for bounds 17M to 328×2+17 = 673M; system B runs stage 2 for bounds 673M to 1G.
...
I expect multiple OBD candidates to be TF completed to 92 bits within 6 months.
On an RTX2080, OBD TF from 89 to 90 bits is ~19 days; 90 to 91 was initially indicated at ~+35 days, so 91 to 92 is ~+70 days.
(All using mfaktc GPU kernel "barrett92_mul32_gs")
Starting from 88 bits done (level 22), that would be ~10 + 19 + 35 + 70 = 134 days, ~4.5 months.
There is now one candidate each at 91 bits TF completed, at 90 bits, and at ~halfway between 89 and 90.
The RTX2080 run indicates ~2143 GHD/day; a GTX1650 at such levels is ~753 GHD/day; a GTX1650 Super, ~954.
A GTX1650 is estimated to complete 88-92 bits in 12.8 months; a GTX1650 Super in 10.1 months.
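The quoted timing arithmetic can be sanity-checked in a few lines. A quick sketch, using the day counts and GHD/day figures from the quote above; the inverse-throughput scaling across GPUs is my assumption:

```python
# Rough check of the quoted TF timings, assuming run time at these bit
# levels scales inversely with a GPU's GHD/day throughput. Day counts
# per bit level and throughput figures are taken from the post above.

rtx2080_days = {89: 10, 90: 19, 91: 35, 92: 70}   # days per bit level, RTX 2080
total_rtx2080 = sum(rtx2080_days.values())         # 88 -> 92 bits total

ghd_per_day = {"RTX2080": 2143, "GTX1650": 753, "GTX1650 Super": 954}

def months_88_to_92(gpu):
    """Scale the RTX 2080 total by the throughput ratio; ~30.4 days/month."""
    days = total_rtx2080 * ghd_per_day["RTX2080"] / ghd_per_day[gpu]
    return days / 30.4

print(total_rtx2080)                               # 134 days, ~4.5 months
print(round(months_88_to_92("GTX1650"), 1))        # 12.5
print(round(months_88_to_92("GTX1650 Super"), 1))  # 9.9
```

These land near the 12.8- and 10.1-month figures above; the small differences presumably reflect rounding in the quoted throughputs.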

Mlucas stage 1 files for OBD are ~405 MB each, so for p & q pairs, ~810 MB per exponent; just two exponents' sets fit in Dropbox's 2 GB free space. Google Drive at 15 GB is roomier, allowing 18 pairs.
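The per-service pair counts above follow from simple division, at ~810 MB per p/q pair:

```python
# How many p/q stage 1 file pairs fit in each free tier, at ~810 MB/pair.
PAIR_MB = 810
dropbox_pairs = 2 * 1024 // PAIR_MB    # 2 GB free tier
gdrive_pairs = 15 * 1024 // PAIR_MB    # 15 GB free tier
print(dropbox_pairs, gdrive_pairs)     # 2 18
```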

It would be good to avoid a surplus of stage-1-completed exponents and a paucity of stage 2 attempts.
Operationally, it would be simplest if stage 1 and stage 2 occur on the same system, or at least on the same LAN (avoiding the cloud-storage shuffle and multiuser coordination on a single exponent).
There is some throughput advantage to doing stage 1 on 16 GiB systems and running only stage 2 on big-RAM systems.

WSL vs. native Linux performance should be investigated before committing to lengthy runs. (I'm also seeing substantial indications of a possibly WSL-related memory leak on a test system running Mlucas v20.1 P-1 on 665M: on the order of 100 MB/hour in the first day or two after a system restart. I haven't yet attempted a comparison against mprime on WSL on the same system.)

Last fiddled with by kriesel on 2021-10-30 at 14:27
Old 2021-11-04, 14:10   #13
clowns789 (Jun 2003, The Computer)

This raises the question of whether P-1 should be done after reaching 92 bits, or whether we should wait until one system can do both P-1 stages and PRP consecutively. Since gpuowl 7 requires this, and perhaps future CPU-based software will as well, I want to avoid redundancy, or at least any failure to optimize, should it be decided that P-1 and PRP should run on the same system, the same way Ken mentioned that P-1 stage 1 and stage 2 should be on the same system.
Old 2021-11-04, 16:21   #14
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by clowns789 View Post
This raises the question of whether P-1 should be done after reaching 92 bits, or whether we should wait until one system can do both P-1 stages and PRP consecutively. ...
A reasonable question. One might also ask whether P-1 should occur between 91 and 92 bits of TF. Since GPU TF from 91 to 92 bits is ~0.2 years for a ~1.1% chance of a factor, while P-1 is ~1 year or more on a multicore CPU for a ~4% chance of a factor, I think it's better to take TF to 92 bits first, now, while P-1 system qualification and P-1 software debugging proceed in parallel.

CUDAPm1 is currently moot for OBD P-1. In light testing it has demonstrated an inability to perform both stages above a ~430M exponent, or to resume from a save file above some higher exponent. Its source code limits it to exponents up to 2^31 − 1. It is slower than Gpuowl on the same hardware, has numerous bugs, is no longer being maintained, etc.

Gpuowl is currently moot for OBD P-1. No existing version of Gpuowl supports OBD P-1, so consecutive P-1/PRP as in gpuowl v6.x, and combined P-1 stage 1/PRP (sharing the base-3 powering) as in gpuowl v7.x, are both moot. Gpuowl development has ceased, at least for now, with the last GitHub commit occurring in late March 2021. Note that on the same hardware and exponent, on Windows, gpuowl v7.x combined P-1/PRP has been observed to be slower than gpuowl v6.11 standalone P-1 followed by PRP, for wavefront, 100M-digit, or ~gigabit exponents IIRC. Some versions of gpuowl nominally support OBD PRP, but they lack PRP proof generation, and the PRP time is ~5 years/exponent on a Radeon VII.

I think mprime/prime95 will remain moot for OBD P-1 and PRP also. Prime95 supports a 64M max FFT length (on AVX-512 only), about 1/3 the size needed for OBD P-1 or PRP. George may implement parallel P-1/PRP in prime95/mprime in the future, but is likely to support only up to ~64M FFT length (~1.17G exponent). He's not interested in coding for cases that would take decades to run to completion, and there are lots of things competing for his attention.

Mlucas recently added standalone P-1 capability. It might someday get combined P-1/PRP capability, but not until after it gets PRP proof generation capability. Mlucas now supports Mersenne exponents up to 2^33, but OBD PRP would take decades on affordable/available hardware in either case.

CUDALucas is moot for OBD PRP. It is LL-only, has no Jacobi-symbol-based error check, maxes out per the source code at 2^31 − 1, and in practice crashes above a ~1.4G exponent. It is not being maintained. Even if fixed, it would likely be slower than gpuowl on the same NVIDIA hardware and exponent. https://www.mersenneforum.org/showpo...6&postcount=14

Demonstrating feasibility of OBD standalone P-1 appeals to me. Usage of software informs future development, finds bugs, and independently confirms resolution of bugs.

Suggesting running both P-1 stages on the same system was motivated by an operational-efficiency consideration, not feasibility or primarily performance. Consider the following alternate scenarios, all of which presume Mlucas v20.x or later:

1) Same system, standalone P-1: a system with plenty of RAM runs stage 1 using little of the RAM, then stage 2 using most of it. No user intervention or file transfer between systems or between participants. Simplest.

2) Same LAN, mixed systems, parallelism at stage level:
~16 GiB systems run stage 1.
~64+ GiB systems run stage 2. Manually copy files across the LAN via a private server as needed, or have different systems' program instances run sequentially in the same private LAN server folders (which in my case would involve lots of fiddling with jobs.sh and mlucas.cfg at the stage 1/stage 2 switch, because of disparate processor instruction sets, core counts, etc.).
Less simple; somewhat better utilization of high-RAM systems; greater total throughput by letting 16 GiB systems help.

3) Same LAN, mixed systems, parallelism at stage level and more:
~16 GiB systems run stage 1. Perhaps two run the same first exponent and bounds, and the resulting p<exponent> files are then compared, to detect stage 1 errors and crudely assess reliability. If they don't match, run a third afterward.
~64+ GiB systems run stage 2. Manually copy stage 1 files across LAN via private server to the stage 2 systems as needed.
Split stage 2 to two or more large-ram systems with B2 subranges chosen to roughly equalize run times.
Even less simple; somewhat better utilization of high-ram systems; greater throughput by letting 16 GiB systems help; reduced latency for a given task.

4) Collaboration among participants via internet on a single exponent:
One participant (A) runs stage 1 solo, on a ~16 GiB system. Stage 1 is not amenable to splitting or parallelism at the systems level.
Participant A posts stage 1 result files p<exponent>, and optionally q<exponent>, on a cloud drive accessible to others, such as Dropbox or Google Drive with sharing enabled.
Multiple participants will have previously prequalified their hardware via run-time scaling tests and posted those results.
The other participants must basically trust that A did stage 1 accurately, and did not just generate random bits.
(Or a helpful or cautious participant could run stage 1 again to the same B1 and compare the resulting files.)
By mutual agreement, participant(s) A, B (, etc?) subdivide the stage 2 bounds of the single exponent, for roughly equal stage 2 run time on probably disparate hardware.
The other participants download A's stage 1 results files for their respective parallel stage 2 runs on different B2 subranges. They then work in parallel to complete the OBD P-1 stage 2 and report the results of each distinct B2 subrange.
This is probably the fastest way to get one OBD P-1 completed, and the most complicated. It might be the only feasible way for some participants to constructively contribute in under a year or two run time with the hardware they have.
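The B2 subdivision in scenarios 3 and 4 amounts to giving each system a contiguous slice of the stage 2 range proportional to its throughput. A minimal sketch, not project tooling; the function name and speed weights are hypothetical:

```python
def split_b2(b1, b2, speeds):
    """Split the stage 2 range (b1, b2] into contiguous sub-ranges,
    one per system, sized proportionally to each system's speed."""
    total, spans, lo = sum(speeds), [], b1
    for s in speeds:
        hi = lo + (b2 - b1) * s // total
        spans.append((lo, hi))
        lo = hi
    spans[-1] = (spans[-1][0], b2)   # absorb integer-division remainder
    return spans

# The earlier worked example: system A twice as fast as system B,
# B1 = 17M, B2 = 1G.
print(split_b2(17_000_000, 1_000_000_000, [2, 1]))
# -> [(17000000, 672333333), (672333333, 1000000000)]
```

This reproduces the ~673M split point from the earlier approximate example, without the intermediate rounding to 328M.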


The collaboration scenario is the same model Ernst has chosen for the F33 P-1 effort, which is ~an order of magnitude larger than one OBD P-1. Participation in that requires >120 GiB of RAM.
It may also be how one of the earliest OBD P-1 completions gets run.

Last fiddled with by kriesel on 2021-11-04 at 16:23
Old 2021-11-04, 21:02   #15
Viliam Furik ("Viliam Furík", Jul 2018, Martin, Slovakia)

Quote:
Originally Posted by kriesel View Post
Mlucas stage 1 files for OBD are ~405 MB each, so for p & q pairs, ~810 MB per exponent; just two exponents' sets fit in Dropbox's 2 GB free space. Google Drive at 15 GB is roomier, allowing 18 pairs.
I didn't pay enough attention to this thread to know why you need that storage, but I can offer at least 1 TB of OneDrive cloud storage in case you are interested.
Old 2021-11-29, 17:24   #16
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Qualification (#1 "roa")

Run-time scaling, completed twice from 10M to 500M or higher, plus short stage 1 OBD interim timings, yields multiple extrapolations to completing OBD P-1 (both stages, to GPU72 bounds) within one year of continuous running, if not loaded with too much other work simultaneously. About 10-11 months solo looks feasible on this dual Xeon E5-2697v2 system with 128 GiB of RAM, on Ubuntu, WSL1, Win10. The first two attachments at the link below describe some possibilities for yet faster run time. (Run-time scaling results for additional systems may be posted at the same link later.)
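For illustration, here is one hedged way such a run-time-scaling extrapolation could be done: fit a power law to per-squaring timings on a log-log scale, then extrapolate to the OBD exponent. The sample timings below are invented for the sketch, not measured values; real inputs would come from short Mlucas interim timings like those described above.

```python
import math

# Illustrative (invented) per-squaring timings at several exponents:
# (exponent, milliseconds per squaring).
samples = [(10e6, 2.1), (50e6, 12.0), (100e6, 26.0), (500e6, 150.0)]

# Least-squares power-law fit t = a * p^k, done on log-log values.
xs = [math.log(p) for p, _ in samples]
ys = [math.log(t) for _, t in samples]
n = len(samples)
k = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) \
    / (n * sum(x * x for x in xs) - sum(xs) ** 2)
a = math.exp((sum(ys) - k * sum(xs)) / n)

# Extrapolate to the smallest OBD exponent; stage 1 needs roughly
# 1.44 * B1 squarings (lg of the product of primes up to B1).
p_obd = 3_321_928_171
ms_per_squaring = a * p_obd ** k
stage1_days = ms_per_squaring * 1.44 * 17_000_000 / 1000 / 86400
print(f"fitted scaling exponent k ~ {k:.2f}")
print(f"extrapolated stage 1 time ~ {stage1_days:.0f} days")
```

With the invented timings the fitted k comes out slightly above 1, consistent with per-squaring cost growing roughly as p log p; the absolute day count is only as good as the sample timings fed in.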

See https://www.mersenneforum.org/showpo...5&postcount=17 for more background
Old 2021-11-29, 17:31   #17
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Misc. updates

There is a November 28 patch for Mlucas v20.1.1.
Mersenne.ca is now accepting P-1 results from Mlucas and Gpuowl for p > 10^9, via bulk factor file upload.
As far as I know, there is no reservation system for P-1 work for p > 10^9 at this time, other than posting in this thread.

Current TF status is shown at https://www.mersenne.ca/obd. Current indicated Level is 22.06.
3321928171 30% complete from 91 to 92, ETA to 92 2022-01-31 by johnny_jack
3321928307 83% complete from 90 to 91, ETA to 91 2021-12-04 & will continue immediately to 92 by kriesel
3321928319 31% complete from 90 to 91, ETA to 91 2021-12-26 & will continue immediately to 92 by kriesel
3321938373 01% complete from 89 to 90, ETA to 90 2022-01-21 by kriesel
There are 13 others in progress at bit levels 86-87 to 88-89 by kriesel as part of an effort to go expeditiously to OBD Level 24 or higher.
Faster GPUs are running the higher of the bit levels indicated above; slower GPUs and Colab sessions are running the lower of the bit levels mentioned above.
All exponents from 3321928171 to 3321929987 (currently 33 exponents) are being taken higher, with the expectation that some will have a factor discovered before reaching level 26 (26 surviving unfactored exponents TF'd to 92 bits) or level 28 (28 exponents TF'd to 92 bits and through P-1 stage 2 to good bounds without factors found). These P-1 survivors would be preparation for someone, someday, after considerable further hardware advances, using hardware that does not yet exist and won't for years, and software that does not yet exist either, going to level 29 (29 OBD PRP/GEC/proof-gen) and level 30 (30 OBD PRP/GEC/proof-gen and verified).

So at the moment:

OBD TF completed to 91, reserved to 92 bits: 1

OBD TF completed to 92 bits: 0
Systems with qualification(s) completed for OBD P-1: 1
Exponents reserved for P-1: 0
Exponents completed thru stage1 P-1: 0
Exponents completed thru stage2 P-1: 0
There's no expectation of being able to qualify less-than-supercomputer hardware for PRP, and PRP/GEC/proof software for OBD or larger does not exist, so, as expected, systems with qualifications completed for OBD PRP etc. = 0.

Last fiddled with by kriesel on 2021-11-29 at 18:31