30.8 & optimal P-1 for wavefront exponents
I tried several different bounds here:
[url]https://www.mersenne.ca/prob.php?exponent=108654281&factorbits=77&b1=450000&b2=22000000[/url]

My result: Assuming that v30.8 is 2.5x faster for wavefront exponents, and that 1.1 tests are saved if a factor is found, the "PrimeNet" bounds on mersenne.ca are almost optimal. However, we can reduce these bounds a bit, since not everyone has large enough memory (>30 GB) to do P-1 at peak speed. The b1=450000&b2=22000000 in the link above should not be far from optimal.
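For a sanity check, here is a minimal sketch of the criterion I mean by "optimal" (toy numbers, not mersenne.ca's actual model; the 3.5% factor probability and 15 GHz-day P-1 cost are just assumptions):

[CODE]
# Toy sanity check of what "optimal bounds" means (placeholder numbers, not
# mersenne.ca's model): bounds are worthwhile while the expected primality-test
# effort saved exceeds the P-1 effort spent.

prp_cost    = 450.0   # GHz-days for one PRP test near 108M (approximate)
pm1_cost    = 15.0    # GHz-days for P-1 at the bounds being considered (assumed)
factor_prob = 0.035   # assumed chance those bounds find a factor
tests_saved = 1.1     # tests avoided per factor found

net_saving = factor_prob * tests_saved * prp_cost - pm1_cost
print(f"expected net saving: {net_saving:.1f} GHz-days")
# Bounds are near-optimal when nudging B1/B2 up or down no longer improves net_saving.
[/CODE]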
[QUOTE=Zhangrc;595171]Assuming that v30.8 is 2.5x faster for wavefront exponents...[/QUOTE]
That sounds like a big assumption considering Mr. Woltman's previous comments that wavefront P-1 "will not benefit much" (see post #18 in [url]https://www.mersenneforum.org/showthread.php?p=593861#post593861[/url]). Is 2.5x purely a spitball / an extrapolation or did you get that number through empirical testing? If the latter, how much RAM do you have allocated?

[QUOTE=Zhangrc;595171]However, we can reduce these bounds a bit, since not everyone has large enough memory (>30 GB) to do P-1 at peak speed.[/QUOTE]
The P-1 cost optimizer in recent stable versions is already set up to take RAM allocation into account. From my limited experience, this manifests as B1 decreasing when more RAM is allocated, because a bigger B2 can then be run, giving the same chance of finding a factor in less total (stage 1 + stage 2) run time. I imagine something more aggressive but along the same lines will easily be worked into shipping versions of 30.8, once Mr. Woltman gets the "important" code more buttoned up and can focus on the new cost optimizer.

Incidentally, in the extreme low-end case (not enough RAM allocated for stage 2 to start at all), B1 seems to be selected such that stage 1 alone takes almost as long as both stages would take together for a user with enough RAM to run them. I've seen e.g. curtisc turn in pre-PRP P-1 results with B2 = B1 = 1.2M. Unfortunately, this still tends to produce factor chances of < 2%.
[QUOTE=techn1ciaN;595213]That sounds like a big assumption considering Mr. Woltman's previous comments that wavefront P-1 "will not benefit much". How much RAM do you have allocated?[/QUOTE]
Zhangrc acknowledged he is using a more-than-customary 30 GB.

[QUOTE=Zhangrc;595171]Assuming that v30.8 is 2.5x faster for wavefront exponents, and that 1.1 tests are saved if a factor is found, the "PrimeNet" bounds on mersenne.ca are almost optimal.[/QUOTE]
My takeaway here is that prime95 should reduce its default setting to 1.0 or 1.1 tests saved. Yes, this will find fewer factors now, but in the future when your 1TB RAM machine is commonplace we will want to redo the P-1 to take advantage of all that RAM. In other words, whether we set tests_saved to 1.0 or 2.0, we will be doing more P-1 in the future. Why double the amount of P-1 effort today that will almost certainly be redone in the future?

My second takeaway is that once 30.8 is fully ready, GIMPS would benefit greatly from owners of machines with lots of RAM switching to P-1.
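To put rough numbers on the first takeaway (a minimal sketch with assumed values; prime95's real cost model is more involved):

[CODE]
# Illustrative only (not prime95's actual cost model): at the chosen bounds,
# the marginal GHz-day of P-1 buys roughly tests_saved * prp_cost * d(prob)
# of expected saving.  Cutting tests_saved from 2.0 to ~1.0 roughly halves
# that payoff, so the optimizer stops at smaller bounds.

prp_cost = 450.0                     # GHz-days, wavefront exponent (approximate)
marginal_prob_per_ghz_day = 0.0012   # assumed d(prob)/d(cost) near the old optimum
for tests_saved in (2.0, 1.1, 1.0):
    payoff = tests_saved * prp_cost * marginal_prob_per_ghz_day
    print(tests_saved, round(payoff, 2))  # >1.0: last GHz-day still pays; <=1.0: stop sooner
[/CODE]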
[QUOTE=Prime95;595224]My takeaway here is that prime95 should reduce its default setting to 1.0 or 1.1 tests saved.[/QUOTE]
Would this also require lowering the P-1 effort threshold at which PrimeNet retires the P-1 task? I don't think so, because I've seen some exponents get pretty horrendous standalone P-1 (sometimes even with B2 = B1) and still be released directly to primality testing. But the last thing you would want is to change assignment lines to have [c]tests_saved=1[/c] and inadvertently start generating a bunch of useless results from people whose work was just barely over the threshold with [c]tests_saved=2[/c].

My vote between 1.0 and 1.1 is the former, perhaps just because it's a whole number and might cause less confusion for people who run Prime95 casually and don't always have a firm grasp on what the work window is printing (this was me for my first few years of GIMPS membership).

If [URL="https://www.mersenneforum.org/showpost.php?p=592407&postcount=3"]Kriesel's analysis[/URL] is correct (I have no reason to believe it isn't), the empirically optimal number, assuming more granularity than tenths, would be ~1.0477. At that point, it's pretty much a coin flip whether to round up or down. I'll contend that some of the factors pushing the raw number up from 1.000 are to some degree transitory, so 1.0 should be the better choice for the long term. (The last few people bootlegging FTC LL or doing unproofed PRP will eventually either upgrade or stop testing, for one example. Increasing storage drive sizes should eventually bring up the average proof power, for another.)
[QUOTE=Prime95;595224]My takeaway here is that prime95 should reduce its default setting to 1.0 or 1.1 tests saved. Yes, this will find fewer factors now, but in the future when your 1TB RAM machine is commonplace we will want to redo the P-1 to take advantage of all that RAM.
In other words, whether we set tests_saved to 1.0 or 2.0, we will be doing more P-1 in the future. Why double the amount of P-1 effort today that will almost certainly be redone in the future?[/QUOTE]
I don't have the expertise to say whether you are right or wrong; I'm just trying to understand the reasoning.

When GPUs started TFing many times faster than PCs, the consensus was: let's TF a few bits deeper and save many more expensive LL/DC tests. Granted, 1 PRP replaces 2 tests (LL & DC). Why aren't we using the same reasoning here? A P-1 run that used to take 5 hours now takes 1 (or so). So after the full rollout of 30.8, even if the number of P-1'ers doesn't change, they'll be doing 5 times as many P-1 runs in the same time. Wouldn't they get way ahead of the PRP wavefront? And if so, aren't we better off staying just ahead, doing deeper P-1, and saving more PRPs? Granted, deeper P-1 in the future with 1TB machines will get more factors, but aren't factors more beneficial before the PRP is done?

OK, now that I've spent 10 minutes one-finger typing on my mobile, it just occurred to me that the average PC today won't have enough RAM to do P-1 much faster at the PRP wavefront even with 30.8. Oh well, someone can slap me now.
[QUOTE=techn1ciaN;595213]Is 2.5x purely a spitball / an extrapolation or did you get that number through empirical testing? [/QUOTE]
Purely an assumption, because the 2.5 is written in undoc.txt. I allocate 12 GB of memory; I can't use more because I have only 16 GB. That's usually enough for wavefront exponents, but with 30.8 it's always beneficial to allocate more RAM.
[QUOTE=petrw1;595234]When GPUs started TFing many times faster than PCs, the consensus was: let's TF a few bits deeper and save many more expensive LL/DC tests.
Granted, 1 PRP replaces 2 tests (LL & DC). Why aren't we using the same reasoning here?[/QUOTE]
You seem to be operating on the assumption that new Prime95 versions keep the same P-1 cost calculator even when P-1 gets faster. The reason we have a [c]tests_saved[/c] parameter in the first place is so that individual Prime95 instances can dynamically calculate their own optimal B1 and B2 values. This calculation combines the completed TF depth and the primality-test effort that a P-1 factor would save with the speed of the P-1 implementation available (which includes how much RAM has been allocated).

Take some work line [c]Pfactor=N/A,1,2,[exponent],-1,[TF depth],1[/c]. Loading this into Prime95 30.7 might produce bounds that take five hours to run. Suppose 30.8 could run the same bounds twice as fast on the same machine (for the sake of the example; it probably can't for wavefront exponents in actuality). Then a 30.8 installation wouldn't calculate those bounds for that work line at all; it would calculate something appropriately larger, without anything in the line itself needing to be changed.

In simpler terms, larger P-1 bounds are always built into any boost to P-1 throughput (assuming Mr. Woltman doesn't make a serious mistake when revising the cost calculator, which there's no reason to believe he would). Even granting your initial assumption that 30.8's P-1 is drastically faster at the PRP wavefront, setting [c]tests_saved=5[/c] (for example) because of it still wouldn't accomplish anything besides wasting a load of cycles. (A toy sketch of this bound recalculation is at the end of this post.)

[QUOTE=Zhangrc;595235]...the 2.5 is written in undoc.txt.[/QUOTE]
Could you pinpoint exactly where? I downloaded the latest 30.8 tarball and Ctrl+F'ed "2.5" in its undoc.txt with no hits. Do you happen to be talking about the default [c]Pm1CostFudge[/c] value? If so, that's just (approximately) the factor by which the new stage 2 cost calculator tends to undershoot; it doesn't indicate anything about the speed of the new P-1 in the abstract.
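Here is the toy sketch mentioned above. It is emphatically not prime95's real optimizer; the probability and cost functions are crude placeholders I made up. It only shows the mechanism: a faster stage 2 pushes the chosen B2 up on its own, with no change to the work line.

[CODE]
import math

# Toy sketch (NOT prime95's actual optimizer or formulas) of why a faster
# stage 2 pushes the chosen bounds upward without the worktodo line changing:
# the optimizer just maximizes expected saving minus P-1 cost.

def prob_factor(b1, b2):
    # placeholder probability model: slow (logarithmic) growth with the bounds
    return 0.01 * math.log10(b1) + 0.005 * math.log10(b2)

def pm1_cost(b1, b2, stage2_speedup=1.0):
    # placeholder costs in made-up units: stage 1 scales with B1, stage 2 with B2
    return 4.8 * b1 / 1e6 + 0.04 * (b2 / 1e6) / stage2_speedup

def best_bounds(tests_saved, prp_cost, stage2_speedup):
    candidates = [(b1, b2)
                  for b1 in (300_000, 450_000, 700_000, 1_000_000)
                  for b2 in (10_000_000, 22_000_000, 50_000_000, 100_000_000)]
    def net(bounds):
        b1, b2 = bounds
        return (prob_factor(b1, b2) * tests_saved * prp_cost
                - pm1_cost(b1, b2, stage2_speedup))
    return max(candidates, key=net)

print(best_bounds(tests_saved=1.1, prp_cost=450, stage2_speedup=1.0))  # -> (450000, 22000000)
print(best_bounds(tests_saved=1.1, prp_cost=450, stage2_speedup=2.5))  # -> same B1, bigger B2
[/CODE]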
[QUOTE=techn1ciaN;595241]it doesn't indicate anything about the speed of the new P-1 in the abstract.[/QUOTE]
It couldn't. There is no single constant that can express the speedup; it is very much dependent on the amount of RAM.
[QUOTE=techn1ciaN;595241]You seem to be operating on the assumption that new Prime95 versions keep the same P-1 cost calculator even when P-1 gets faster. The reason we have a [c]tests_saved[/c] parameter in the first place is so that individual Prime95 instances can dynamically calculate their own optimal B1 and B2 values. This calculation combines the completed TF depth and the primality-test effort that a P-1 factor would save with the speed of the P-1 implementation available (which includes how much RAM has been allocated).[/QUOTE]
That's not my intent. I understand the cost calculator needs to keep up with P-1 improvements. At the risk of oversimplifying, let me try with actual numbers.

[URL="https://www.mersenne.ca/prob.php?exponent=108000043&factorbits=77"]prob.php[/URL] tells me that the suggested P-1 bounds take about 15 GHz-days in the 108M range. A PRP test in that same range takes about 450 GHz-days. That is 30 to 1. Interestingly (with a little rounding), the success rate is also about 1/30. So 450 GHz-days of P-1 should cover about 30 such runs and save on average 1 PRP test (quick arithmetic check at the end of this post).

--- I hope I didn't mess this up. I guess it assumes 450 GHz-days of each take approximately the same clock time. It may not. ---

So if, at a point in time, the available P-1 GHz-days at the leading edge is 1/30 of the PRP GHz-days, then P-1 should just keep up with PRP. However, if, either due to personal choice or due to the increased speed of 30.8, we find P-1 getting too far ahead of PRP, then would it make sense for P-1 to choose bigger B1/B2 and save more PRP tests instead? On the contrary, if P-1 falls behind, it would choose lower B1/B2.

Or is this simply what you mean by:
[QUOTE]dynamically calculate their own optimal B1 and B2 values[/QUOTE]
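The quick check mentioned above, with the same rounded figures:

[CODE]
# Quick check of the break-even arithmetic above (rounded figures from prob.php):
pm1_cost = 15.0     # GHz-days for the suggested P-1 bounds near 108M
prp_cost = 450.0    # GHz-days for one PRP test in the same range
success  = 1 / 30   # approximate chance those bounds find a factor

# ~1.0: a GHz-day of P-1 saves about a GHz-day of expected PRP work
print(round(success * prp_cost / pm1_cost, 3))
[/CODE]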
[QUOTE=petrw1;595249]However, if, either due to personal choice or due to the increased speed of 30.8, we find P-1 getting too far ahead of PRP, then would it make sense for P-1 to choose bigger B1/B2 and save more PRP tests instead?
On the contrary, if P-1 falls behind, it would choose lower B1/B2.[/QUOTE]
We shouldn't optimize based on the available compute power for each work type.

First things first: when P-1 stage 2 becomes faster, the software's calculation of optimal P-1 bounds changes, and it changes in a way that increases the bounds. So the amount of time the software spends on P-1 wouldn't necessarily drop drastically; paradoxically, it might even increase (whether it does or not is a different matter, but in principle it could). So we need more data to understand the impact of 30.8 on wavefront P-1.

Second: the optimal cross-over point of "TF vs PRP" or "P-1 vs PRP" is based on the relative time it takes to run both types of computation on the _same_ processor. We do 3-4 bits of extra TF on GPUs not because GPUs are faster than CPUs, but because GPUs do better at TF relative to PRP. A GPU might be 100x faster than a CPU at TF but only 10x faster at PRP, so the GPU's cross-over point for TF vs PRP will be a few bits higher than a CPU's (rough sketch at the end of this post). If a GPU were 100x faster at TF but also 100x faster at PRP, then we wouldn't do extra TF bits on GPUs, no matter how much GPU power we had.

Similarly, the optimal P-1 bounds are / should be independent of how many dedicated P-1 crunchers there are. We assume that, if there were no P-1 work available, they would switch over to PRP (not a 100% accurate assumption, but the only feasible way to model this). If we get a surplus of dedicated P-1 crunchers who refuse to do anything else, c'est la vie. I guess they have the option to manually change the "tests saved" value and do whatever they wish, but the project shouldn't waste resources by using a sub-optimal parameter. After all, the original point of P-1 was to speed up the clearing of exponents.
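The rough sketch mentioned above (my own simplified model, not GPU72's actual rules): assume TF level b finds a factor with probability ~1/b and costs about twice the previous level. The absolute numbers are made up; what drives the difference is TF being ~10x cheaper relative to PRP on the GPU.

[CODE]
# Simplified cross-over model (illustrative only): TF level b finds a factor
# with probability ~1/b and costs ~2x level b-1.

def deepest_worthwhile_bit(cost_of_first_bit, first_bit, prp_cost, tests_saved=2.0):
    # tests_saved=2.0 is the historical LL+DC figure; the exact value barely
    # changes the CPU-vs-GPU difference in bits.
    bit, cost = first_bit, cost_of_first_bit
    while cost < (1.0 / bit) * tests_saved * prp_cost:
        bit, cost = bit + 1, cost * 2
    return bit - 1    # deepest level that still passed the test

# Express TF cost in "PRP-equivalent" units on the same device.  A GPU that is
# 100x faster at TF but only 10x faster at PRP sees TF as 10x cheaper relative
# to PRP than a CPU does -- hence the deeper cross-over.
print(deepest_worthwhile_bit(cost_of_first_bit=1.0, first_bit=76, prp_cost=450))  # CPU-relative: 79
print(deepest_worthwhile_bit(cost_of_first_bit=0.1, first_bit=76, prp_cost=450))  # GPU-relative: 82
[/CODE]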
[QUOTE=petrw1;595249]However, if, either due to personal choice or due to the increased speed of 30.8, we find P-1 getting too far ahead of PRP, then would it make sense for P-1 to choose bigger B1/B2 and save more PRP tests instead?
On the contrary, if P-1 falls behind, it would choose lower B1/B2.[/QUOTE]
I don't see how tying P-1 bounds to how much P-1 is being done is supposed to improve GIMPS throughput overall. For similarly sized exponents, a given PC can complete X PRP tests in some amount of time, or it can find Y P-1 factors in the same amount of time. Y is obviously dependent upon the P-1 bounds used. If you let it do its thing, Prime95 optimizes to have A) Y > AX (where A is the [c]tests_saved[/c] value passed), then B) the highest Y value possible. You seem to suggest that ignoring this optimization and accepting a lower value of Y (or even accepting Y < X) becomes a good idea if P-1 gets far ahead of the PRP wavefront, but in that case more benefit would be had from some P-1 users simply switching to primality testing. Since large B1 and B2 values quickly run into diminishing returns with respect to the cycles needed (yes, even with 30.8; "large" is just higher for B2), P-1 past Prime95's optimized bounds [I]will not[/I] "save more PRP tests" than just, well, running the full PRPs (toy numbers at the end of this post).

You brought up GPU TF earlier, so we can apply your logic there by analogy. GPUs are very efficient at TF, but they can run primality tests as well, so there is still an optimization puzzle: GIMPS/GPU72 must select a TF threshold such that, in the time it would take a given GPU to complete one primality test, the same GPU will find more than one factor (on average). For most consumer GPUs, this seems to be ((Prime95 TF threshold) + 4). With that threshold, GPU72 is currently very far ahead of even high-category PRP (I believe they're currently pushing around 120M or even higher). Does it then make sense that GPU72 should go to ((Prime95 threshold) + 5) at the PRP wavefront even though that wouldn't be optimal*, just because the threshold that [I]is[/I] optimal is easily being handled? No; anyone doing GPU TF who wants the PRP wavefront to advance more quickly should simply switch to GPU PRP.

* Some recent Nvidia models have such crippled FP64 throughput that this extra level actually [I]can[/I] be optimal. I have such a card. However, I don't believe enough TFers own these to recommend the extra level universally.
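The toy numbers mentioned above (entirely made up, but shaped like reality: factor probability grows roughly with the log of the bounds, and tests_saved is taken as 1 for simplicity):

[CODE]
# Made-up numbers, shaped like reality: doubling the P-1 effort adds only a
# sliver of factor probability.  Net expected PRP work saved (GHz-days),
# after subtracting the P-1 cost itself:

prp_cost = 450.0
for pm1_cost, factor_prob in [(15.0, 0.040), (30.0, 0.045)]:
    print(pm1_cost, round(factor_prob * prp_cost - pm1_cost, 2))
# 15 GHz-days:  +3.0   (worth doing)
# 30 GHz-days:  -9.75  (worse than just running the PRP)
[/CODE]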