mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Operation Billion Digits

Old 2019-06-04, 05:25   #1
lavalamp
 
 
Oct 2007
Manchester, UK

55D₁₆ Posts
P-1 on OBD candidates

Does anyone know of some utility that can handle P-1 on these monsters?

v29.8 of prime95 says it only accepts exponents up to 595,800,000, which I assume corresponds to its maximum FFT size.

A few years ago LaurV made a post about potentially implementing P-1 in CUDA which sounds encouraging, but I don't know if he or anyone else got much further.
https://www.mersenneforum.org/showpo...3&postcount=11

At least Prime95 can give optimal bounds for P-1. If I put in a candidate TF'd to 86 bits, it recommends B1=B2=44,680,000, with no stage 2, due to RAM limitations I believe. This doesn't sound completely unreasonable, and offers a 3.53% chance of a factor, which is slightly higher than the chance of a factor from continuing TF up to 89 bits (3/89 ~ 3.37%).

If I say the candidate has been TF'd to 91 bits instead (which seems to be vaguely where GPU TFing should probably stop), then Prime95 offers the bounds B1=B2=30,920,000 with a 2.07% chance of a factor. Seems a bit odd to me that the bounds are LOWER when TF has progressed more, but alright.
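As a back-of-envelope check of the TF side of that comparison (my own sketch, using the standard ~1/bit heuristic, not Prime95's actual code):

```python
# Sketch: the usual heuristic says the chance that a Mersenne number has a
# factor between 2^b and 2^(b+1) is roughly 1/b, so continuing TF from 86
# to 89 bits finds a factor with probability about 1/87 + 1/88 + 1/89,
# which the 3/89 shortcut above slightly underestimates.

def tf_factor_chance(bit_from: int, bit_to: int) -> float:
    """Heuristic chance of finding a factor by TF from bit_from to bit_to."""
    return sum(1.0 / b for b in range(bit_from + 1, bit_to + 1))

exact = tf_factor_chance(86, 89)   # 1/87 + 1/88 + 1/89
approx = 3 / 89                    # the shortcut used above
print(f"TF 86->89: {exact:.4f} (approx {approx:.4f})")
# Both come out near 3.4%, just under the 3.53% quoted for the P-1 attempt.
```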
Old 2019-06-04, 06:00   #2
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1101110111101₂ Posts

On an FMA3-capable system prime95 should be capable of going to 920M (50M fft since V29.2). I'm running 701M now on 29.7b1 x64.
https://www.mersenneforum.org/showpo...&postcount=218
CUDAPm1 has been around for years but doesn't reach that high due to various issues, although it nominally supports sufficiently large fft lengths on gpus with sufficient ram. https://www.mersenneforum.org/showthread.php?t=23389

Quote:
Originally Posted by lavalamp View Post
If I say the candidate has been TF'd to 91 bits instead (which seems to be vaguely where GPU TFing should probably stop), then Prime95 offers the bounds B1=B2=30,920,000 with a 2.07% chance of a factor. Seems a bit odd to me that the bounds are LOWER when TF has progressed more, but alright.
Total odds of finding a factor are the odds from TF plus the odds from P-1. It's not worth going as deep in P-1 if a great deal of effort has already been expended in TF. Prime95 and CUDAPm1 contain code that computes many different bounds combinations' chances of finding P-1 factors, and selects the bounds offering the optimal probable compute-time savings.
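A toy version of that trade-off, to make the logic concrete (my own sketch with made-up numbers, not Prime95's actual model, which estimates smoothness probabilities via Dickman's rho function):

```python
# Toy sketch of optimal P-1 bounds selection. All numbers here are
# invented stand-ins, NOT Prime95's real probability model.

def expected_savings(prob_factor, p1_cost, tests_saved_cost):
    """Expected compute saved by running P-1 before the primality tests.

    prob_factor      -- chance P-1 finds a factor at the chosen bounds
    p1_cost          -- cost of the P-1 attempt itself
    tests_saved_cost -- cost of the primality tests skipped on a factor
    """
    return prob_factor * tests_saved_cost - p1_cost

# Deeper prior TF lowers prob_factor at any given bounds (the small
# factors are already excluded), so the optimizer settles on smaller
# B1/B2 -- which is why the suggested bounds were LOWER after TF to 91.
candidates = [
    # (label, prob_factor, p1_cost) in arbitrary units
    ("shallow", 0.015, 5.0),
    ("medium",  0.020, 9.0),
    ("deep",    0.022, 14.0),
]
tests_cost = 1000.0  # cost of the primality tests, arbitrary units
best = max(candidates, key=lambda c: expected_savings(c[1], c[2], tests_cost))
print("best bounds choice:", best[0])  # deeper is not always better
```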

Last fiddled with by kriesel on 2019-06-04 at 06:26
Old 2019-06-04, 06:10   #3
lavalamp
 
 
Oct 2007
Manchester, UK

1,373 Posts

Quote:
Originally Posted by kriesel View Post
On an FMA3-capable system prime95 should be capable of going to 920M. I'm running 701M now.
That'd be why then, I'm not. Still rocking an i5 3570K, the generation before Haswell and FMA.

Even so, 920M is still not high enough for OBD, which needs 3320M+.
Old 2019-06-04, 06:46   #4
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3³×263 Posts

Quote:
Originally Posted by lavalamp View Post
That'd be why then, I'm not. Still rocking an i5 3570K, the generation before Haswell and FMA.

Even so, 920M is still not high enough for OBD, which needs 3320M+.
Right, 920M<gigabit<gigadigit. Highest stage 1 I've been able to get through CUDAPm1 is 735M-740M; stage 2, 432M. CUDAPm1 would need some alteration and debugging to get there. https://www.mersenneforum.org/showpo...65&postcount=7
Old 2019-06-04, 08:05   #5
lavalamp
 
 
Oct 2007
Manchester, UK

1,373 Posts

Perhaps in time the limits will be raised such that stage 1 for these numbers will be possible. Though I completely understand why enabling such functionality is not exactly top priority.

For the memory usage, is there any possibility that it could be lowered if the second stage was broken down into multiple chunks, similar to stage 2 of ECM?
Old 2019-06-11, 19:24   #6
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3³×263 Posts

Quote:
Originally Posted by lavalamp View Post
Perhaps in time the limits will be raised such that stage 1 for these numbers will be possible. Though I completely understand why enabling such functionality is not exactly top priority.

For the memory usage, is there any possibility that it could be lowered if the second stage was broken down into multiple chunks, similar to stage 2 of ECM?
Stage 1 takes on the order of 2 bytes/bit in CUDAPm1, so an 8GB or larger gpu should be ok in stage 1.

Existing software already does stage 2 in chunks, when possible and necessary, by varying the number of relative primes (nrp) to fit within available memory. In my experience, one of the walls CUDAPm1 hits on smaller-gpu-ram cards is the exponent at which even nrp=1 requires more ram than the gpu has: p ~177-178M for 1GB on the Quadro 2000, for example; ~290M on the 1.5GB GTX480; ~340M on the 2GB Quadro 4000; etc. Fitting through the 1 and 1.5GB data points, and extrapolating widely, suggests ~15GB could be enough, and a few high end gpus currently offer 16GB.

On prime95, I suppose one could resort to virtual memory on a solid state disk if lacking enough ram. A recent prime95 run was 701M with 8192MB allowed (half of actual system ram), giving nrp=25 as I recall.

CUDAPm1 is limited by gpu ram a bit, but by other issues (bugs in this alpha software) to a much greater extent. It already nominally supports fft lengths up to 256M, which would be more than sufficient for gigadigit primes if the variable for p were unsigned 32-bit; currently it is signed, capping the exponent at 2³¹−1. In practice, CUDAPm1 is limited to much lower p.

For prime95, the current limitation is the maximum fft length available; as I recall, the largest implemented is 50M, but gigadigit processing requires around 192M. Gpuowl may be a possibility, although the very few tests I've tried have failed.
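For what it's worth, that ~15GB figure can be reproduced with a straight-line fit through the two data points mentioned (my own sketch; real memory use is stepwise, not linear):

```python
# Sketch reproducing the extrapolation above: fit a line through the
# nrp=1 memory-wall data points for the 1GB and 1.5GB cards, then
# evaluate at a gigadigit-range exponent. Purely illustrative.

def fit_line(p1, m1, p2, m2):
    """Return (slope, intercept) of the line through (p1, m1) and (p2, m2)."""
    slope = (m2 - m1) / (p2 - p1)
    return slope, m1 - slope * p1

# (exponent in millions, gpu ram in GB at which even nrp=1 stops fitting)
slope, intercept = fit_line(177.5, 1.0, 290.0, 1.5)
obd_exponent = 3321.93  # gigadigit-range exponent, in millions
ram_needed = slope * obd_exponent + intercept
print(f"extrapolated nrp=1 ram at OBD size: ~{ram_needed:.0f} GB")  # ~15 GB
```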

Of course, they are all going to be impacted by the roughly p^2.2 run-time scaling as well. The 701M P-1 run on my i7-8750H (all cores, one worker) took 32.3 days at nrp~25, while a recent 430M P-1 on a 3GB GTX1060 took ~5 days at nrp=5 (both stages, no factor found). Those would scale to ~989 days and ~449 days respectively, per P-1 on a gigadigit candidate. Note that run time also lengthens as nrp is forced toward 1 by memory size limitations.
(CUDAPm1 reference info: https://www.mersenneforum.org/showthread.php?t=23389
re prime95 see https://www.mersenneforum.org/showthread.php?t=23900)
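The scaled run times above follow directly from the p^2.2 rule (my own arithmetic check, not output from any of the programs mentioned):

```python
# Sketch of the ~p^2.2 run-time scaling; exponents in millions.

def scale_runtime(days: float, p_from: float, p_to: float,
                  power: float = 2.2) -> float:
    """Scale a measured P-1 run time from exponent p_from to p_to."""
    return days * (p_to / p_from) ** power

obd = 3321.93  # gigadigit-range exponent, in millions

# 701M CPU run, 32.3 days -> roughly the ~989 days quoted above
print(f"{scale_runtime(32.3, 701.0, obd):.0f} days")
# 430M GPU run, ~5 days -> roughly the ~449 days quoted above
print(f"{scale_runtime(5.0, 430.0, obd):.0f} days")
```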

Last fiddled with by kriesel on 2019-06-11 at 19:26
Old 2021-10-22, 01:05   #7
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1101110111101₂ Posts
OBD P-1 coordination proposal

Mlucas v20.x added P-1 factoring, and has sufficiently large fft lengths to tackle (slowly; it's a big job) gigadigit Mersenne P-1. The P-1 feature is maturing, with beta testing ongoing, and occasionally bugs found, reported, and fixed. So OBD P-1 is now becoming feasible.

I propose the following:
  1. OBD candidates are taken to 92 bits TF done, before P-1 is attempted. (Due to run time, RTX206x or faster GPUs and mfaktx are recommended.)
  2. Whoever does the 91-92 bit TF gets first choice at the P-1 for the same exponent.
  3. P-1 is only attempted on OBD candidates with no known factors, except for software testing, in which re-finding a known factor supports confidence in the P-1 software.
  4. If the TF-to-92 user decides not to attempt P-1 on the exponent, announcing that in this thread would be helpful.
  5. In the absence of a reservation by the last-TF user, the exponent becomes fair game for anyone with sufficient hardware resources to reserve it for P-1, either one month after TF completion to 92 bits, or sooner if the TF-to-92 user posts its release.
  6. P-1 is estimated to be ~21,663 GHzDays per attempt of both stages without factor found early, and require ~16GiB of ECC ram. Those who haven't the hardware resources to complete that within a year ought not make P-1 reservations or attempts. (~59.4 GHzD/day minimum at ~192M fft length needed for completion within one year running continuously nonstop. Roughly a Xeon Phi 7210 or dual-12-core Xeon e5-2697v2)
  7. To qualify for reserving OBD exponents for P-1, post in this thread results of run-time scaling derived from full P-1 runs on multiple widely-spaced exponents, including above 100Mdigit (ideally also near 1Gbit), demonstrating scaling consistent with completing an OBD P-1 within a year on the hardware planned to be used, along with a description of the software and hardware. At least one run-time scaling example will be posted at the Mlucas v20.1 P-1 run time scaling post. Post title "Qualification" is suggested.
  8. Reservation is by posting a statement in this thread, of intent to P-1 factor a specific exponent. Post title "Reservation " followed by the exponent is suggested.
  9. Use P-1 bounds corresponding to the GPU72 row for the exponent at mersenne.ca. https://www.mersenne.ca/exponent/3321928307 shows B1=17,000,000, B2=1,000,000,000.
  10. Reporting P-1 progress in this thread roughly monthly to quarterly is suggested. (4-12 times per year.)
  11. Reservations expire after any of the following occur:
    1. One year passes after last progress report that showed progress on that reservation.
    2. Two years pass after initial reservation.
    3. A factor is found and reported here.
    4. The user posts surrender of a reservation here.
  12. If a new factor is found, it should be reported here. It should also get added to the mersenne.ca database. That may be accomplished by PM to James Heinrich. There might be other ways at some point.
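A quick arithmetic check of the throughput floor in item 6 (my own sketch):

```python
# Item 6's minimum sustained throughput is just the job size over one year.
total_ghzdays = 21_663   # estimated cost of one full OBD P-1 attempt
days_per_year = 365
print(f"{total_ghzdays / days_per_year:.1f} GHzD/day")  # ~59.4 GHzD/day
```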
Constructive civil comments are invited.
Old 2021-10-27, 18:10   #8
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3³·263 Posts

Clarifying P-1 stage memory requirements for Mlucas and OBD candidates:
Quote:
Originally Posted by kriesel View Post
P-1 is estimated to be ~21,663 GHzDays per attempt of both stages without factor found early, and require ~16GiB for stage 1; for stage 2, ~64 GiB of ECC ram, and 128 would be better. Those who haven't the hardware resources to complete that within a year ought not make P-1 reservations or attempts. (~59.4 GHzD/day minimum at ~192M fft length needed for completion within one year running continuously nonstop. Roughly a Xeon Phi 7210 or dual-12-core Xeon e5-2697v2)
See also https://www.mersenneforum.org/showpo...8&postcount=19

Mlucas will support performing subdivisions of stage 2 on multiple systems. Each system so employed will need adequate ram for running stage 2. Subdivision of stage 2 bounds reduces total calendar time for a run, but does not reduce memory requirement per system.

Last fiddled with by kriesel on 2021-10-27 at 18:23
Old 2021-10-29, 13:47   #9
clowns789
 
 
Jun 2003
The Computer

401 Posts

Thanks to James, I was able to get mersenne.ca/obd to support reservations to 92 bits. Thus, the 3321928171 exponent is now available to that bit depth.

Quote:
Originally Posted by kriesel View Post
[*]Reservations expire after any of the following occur:
  1. One year passes after last progress report that showed progress on that reservation.
  2. Two years pass after initial reservation.
  3. A factor is found and reported here.
  4. The user posts surrender of a reservation here.
In this case, we should emphasize the one year to completion suggestion. For example, if someone is 90% of the way done after two years, they may be forced to surrender it, although they would still want to finish it even if someone else is allowed to check it out. In my case, I might have to forego using my dual hex server since this scenario might pop up (although I still need to benchmark and maybe upgrade from CentOS 7 for better results).

With the advent of DDR5, and presumably mainstream ECC RAM, within 1-2 years we should have an exponent factored to 92 bits, and boxes with DDR5 RAM and a processor like a 12th Gen i9 or the upcoming Zen 4, negating the need for multiple processors.
Old 2021-10-29, 14:58   #10
kriesel
 
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3³·263 Posts

I think we can make some adjustments as situations arise.
If someone runs stage 1 and realizes they won't be able to complete stage 2 in time, or at all, they could:
  • offer the stage 1 file via dropbox or similar to someone else to run stage 2, or
  • split the stage 2 bounds among multiple systems they manage with adequate ram each, or
  • seek help from someone else to do part of the stage 2 bounds.
It beats throwing away and repeating a lot of work, if the recipient trusts the gifted interim files.
The point of writing a post about how to coordinate is to hopefully promote efficiency.

One could run stage 1 on a 16 GiB system, copy the files, and split stage 2 to 2 or more systems containing at least ~64 GiB ram each, apportioning the stage 2 bounds on the split so the systems working in tandem would finish at about the same time. Apportion stage 2 delta bounds according to iters/sec throughput for the relevant fft length. Selftest in Mlucas produces the reciprocal, time/iteration, expressed in msec/iter, in mlucas.cfg.

Approximate example: if system A is twice as fast as system B, the two will begin work at the same time, and B2start = B1 (an oversimplification) = 17,000,000; B2range = 1G − 17M = 983M; B2range/3 ≈ 327M.
System A runs stage 2 for bounds 17M to 327×2+17 = 671M; system B runs stage 2 for bounds 671M to 1G.
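The same apportioning can be sketched for any number of systems (my own illustration of the throughput-proportional split described above, not code from Mlucas):

```python
# Sketch: apportion a stage 2 bounds range across systems in proportion
# to their throughput (iters/sec), so they finish at about the same time.

def split_bounds(b2_start, b2_end, throughputs):
    """Return per-system (start, end) sub-ranges proportional to throughput."""
    total = sum(throughputs)
    spans, cursor = [], float(b2_start)
    for t in throughputs:
        end = cursor + (b2_end - b2_start) * t / total
        spans.append((round(cursor), round(end)))
        cursor = end
    spans[-1] = (spans[-1][0], b2_end)  # absorb rounding in the last span
    return spans

# The example above, in millions: B1 = 17M, B2 = 1000M, A twice as fast as B.
print(split_bounds(17, 1000, [2, 1]))
# A gets roughly 17..672, B roughly 672..1000 (the example above rounds
# B2range/3 down to 327M, hence its 671M split point).
```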
This would be good practice for participating in Ernst's planned significantly parallel performance of F33 P-1 stage 2, which will require ~208 GiB ram per system.

I expect multiple OBD candidates to be TF completed to 92 bits within 6 months.
On an RTX2080, OBD TF 89-90 bits is ~19 days, so to 91 ~+38 days, to 92 ~+76 days.
(All using mfaktc GPU kernel "barrett92_mul32_gs")
Starting from 88 done (level 22) that would be ~10 +19 +38 +76 = 143 days.
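The per-level doubling behind those numbers (each extra bit level roughly doubles the candidate factors to test) can be sketched as follows; my own arithmetic check, not mfaktc output:

```python
# Sketch of the doubling rule for TF run times: each additional bit
# level roughly doubles the work, so times form a geometric series.

def tf_days(days_first_level: float, levels: int) -> float:
    """Cumulative days for successive bit levels, doubling each level."""
    return sum(days_first_level * 2**i for i in range(levels))

# RTX2080 example above: 89->90 is ~19 days, so with ~10 days left on
# 88->89 the total to reach 92 is about 10 + 19 + 38 + 76 days.
print(10 + tf_days(19, 3))  # -> 143
```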

While the faster GPUs climb the mountain to 92, lesser GPUs can work on lower remaining bit levels on additional exponents.
On a GTX 1650, exp=3321929519 bit_min=87 bit_max=88 (9435.16 GHz-days) is ~12.7 days, so 88-89 takes ~25 days, 89-90 ~51 days, etc.

I consider the used dual-Xeon or Xeon Phi workstations or servers a good combination of features, performance and cost for PRP or P-1.

Last fiddled with by kriesel on 2021-10-29 at 15:33
Old 2021-10-30, 03:45   #11
clowns789
 
 
Jun 2003
The Computer

401 Posts

That sounds good. I think Stage 1 and Stage 2 should be checked out separately, but the user who completed Stage 1 will have preference to complete Stage 2, just like the user who completed the TF to 92 bits would have preference for Stage 1. Then the times could be reduced, i.e. Stage 1 would be forfeited after one year or six months with no progress report. Stage 2 would probably be more lenient, especially if multiple users are working on the same exponent simultaneously.

This would be good for many users, since in my case I could run it on my server, but it would probably be too slow to do both stages solo. If I split Stage 2 with Ken's dual 12 core machine, for example, it would make things go fast and still utilize the 128 GB of ECC RAM I have available.