P95, P-1 and HighMemWorkers
I was curious about how P95 works with regard to doing P-1 when you have HighMemWorkers=1. If I start 2 cores on P-1, obviously the first one to finish stage 1 starts stage 2, while the 2nd one (I'm guessing) finishes stage 1 and then goes on to do another stage 1. The question I have is what happens when the stage 2 finishes? I figure that either 1) the second core stops the current P-1 in progress and starts the stage 2 it skipped, or 2) it completes the stage 1 in progress and then goes back to run the skipped stage 2, followed by the stage 2 for the exponent just completed. Can someone shed light on this? Thanks.
[QUOTE=bcp19;281722]I was curious about how P95 works in regards to doing P-1 when you have HighMemWorkers=1. If I start 2 cores on P-1, obviously the first one to finish stage 1 starts stage 2 while the 2nd one (I'm guessing) finishes stage 1 and then goes on to do another stage 1. The question I have is what happens when the stage 2 finishes? I figure that either 1) the second core stops the current P-1 in progress and starts the Stage 2 it skipped, 2) it completes the stage 1 in progress and then goes back to run the skipped stage 2 followed by the stage 2 for the exp just completed. Can someone shed light on this? Thanks.[/QUOTE]
Absolutely.....I've got several multi-core PCs doing all P-1.
A. Both cores start stage 1.
B. Assume core X finishes stage 1 first and starts stage 2.
C. Later core Y finishes stage 1; Max HMW exceeded; it proceeds to stage 1 on the second exponent.
D. Core Y will likely finish stage 1 on the second exponent and proceed to stage 1 on the third exponent before Core X finishes stage 2 on the first exponent.
E. Core X finishes stage 2; Core Y IMMEDIATELY stops the stage 1 in progress and goes back to the first exponent to do stage 2.
F. Core X finishes stage 1 on its second exponent and goes to stage 1 on the third exponent.
G. Core Y finishes stage 2 on the first exponent and continues stage 1 where it left off on the second exponent. Etc., etc.
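The priority rule described here (a skipped stage 2 is resumed as soon as a high-memory slot opens, preempting any stage 1 started in the meantime) can be sketched as a tiny decision function. This is a hypothetical toy model, not Prime95's actual scheduler code:

```python
# Toy model of the HighMemWorkers rule described above (hypothetical sketch;
# Prime95's real scheduler is more involved). A worker that just finished a
# stage prefers pending stage 2 work whenever a high-memory slot is free;
# otherwise it moves on to stage 1 of the next exponent.

MAX_HIGH_MEM_WORKERS = 1  # HighMemWorkers=1, as in the question

def pick_next_task(pending_stage2, next_stage1_exponent, stage2_slots_in_use):
    """Return the next task for a worker that just finished a stage.

    pending_stage2: exponents whose stage 1 is done but stage 2 is not,
                    oldest first.
    """
    if pending_stage2 and stage2_slots_in_use < MAX_HIGH_MEM_WORKERS:
        # A high-memory slot is free: go back to the oldest skipped stage 2.
        return ("stage 2", pending_stage2.pop(0))
    # Otherwise keep doing low-memory stage 1 work on the next exponent.
    return ("stage 1", next_stage1_exponent)
```

With one slot free, the oldest skipped stage 2 wins; with the slot occupied, the worker keeps advancing through stage 1's, which is exactly the A-G dance above.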
@[URL="http://www.mersenneforum.org/member.php?u=2879"]petrw1[/URL]: I really appreciate the detailed explanation of the "Dance of the P-1 Workers". Thanks!
[QUOTE=petrw1;281735]Absolutely.....I've got several multi core PCs doing all P-1.[/QUOTE]Thank you for this scenario.
[quote]< snip > D. Core Y will likely finish stage 1 on second exponent and proceed to stage 1 on third exponent before Core X finishes stage 2 on first exponent. < snip > G. Core Y finishes stage 2 on first exponent and continues stage 1 where it left off on the [strike]second[/strike][/quote] [I](you meant)[/I] third [quote]exponent etc. etc[/quote]
[QUOTE=cheesehead;281738]Thank you for this scenario.
[I](you meant)[/I] third[/QUOTE] Oops, correct, because it has already done stage 1 on the second.
Stupid question: Does the "switching off" of the "sequential work" in prime.txt have any influence on the behavior of P-1 that you described?
[QUOTE=petrw1;281735]Absolutely.....I've got several multi core PCs doing all P-1.
A. Both cores start stage 1.
B. Assume core X finishes stage 1 first and starts stage 2.
C. Later core Y finishes stage 1; Max HMW exceeded; it proceeds to stage 1 on the second exponent.
D. Core Y will likely finish stage 1 on second exponent and proceed to stage 1 on third exponent before Core X finishes stage 2 on first exponent.
E. Core X finishes stage 2; Core Y IMMEDIATELY stops stage 1 in progress and goes back to first exponent to do stage 2.
F. Core X finishes stage 1 on its second exponent and goes to stage 1 on third exponent.
G. Core Y finishes stage 2 on first exponent and [B]starts stage 2 on exp 2[/B]
[B]H. Core X finishes stage 1 on exp 3, starts stage 1 on exp 4[/B]
[B]I. Core Y finishes stage 2 on 2nd exponent and [/B]continues stage 1 where it left off on the [B]third[/B] exponent
[B]J. Core X stops work on exp 4 and does stage 2 on exp 2 and 3[/B]
etc. etc[/QUOTE]
If I am understanding you correctly, the bold I put in above is what you meant to say, since the time to run stage 2 is generally about 1.5x the time to do stage 1? Meaning over time you will end up with a large stack of stage 2 waiting to be run, which is probably why I see people talk about tossing in a DC every now and then to let it catch up?
[QUOTE=bcp19;281778]If I am understanding you correctly, the bold I put in above is what you meant to say since the time to run stage 2 is generally about 1.5x the time to do stage 1? Meaning over time you will end up with a large stack of stage 2 waiting to be run, which is probably why I see people talk about tossing in a DC every now and then to let it catch up?[/QUOTE]
Yes, and good point...1.5 is probably a good average; I've seen or heard of stage 2 being anywhere from about equal to stage 1 to as high as twice. But the bottom line is, as you noted, that as soon as you run more than 1 core of P-1 on a PC and have HighMemWorkers less than the number of P-1 cores for even some of the day, you will start to see pending stage 2's pile up. And this happens because even the best PCs drop in throughput if all cores are working on stage 2 P-1; on top of that, most PCs are used for real work with Prime95/mprime in the background, and running a few memory-intensive P-1 workers will affect this real work to a small or even a large degree. So we try to find the right balance of HighMemWorkers to maximize throughput, minimize impacts, and at the same time not fall too far behind in stage 2. A few strategies are:
1. As you suggested, occasionally have a P-1 worker do other work until stage 2 catches up on others.
2. Allow more HighMemWorkers overnight, or whenever your PC is less likely to be used for real work.
3. Some have done stage 1 on family PC #1 and then migrated the work to a secondary PC for stage 2.
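For strategy 2, Prime95's local.txt can express day/night settings directly. A sketch of what this might look like; the values are illustrative and the exact option names and time syntax should be checked against the readme.txt/undoc.txt for your Prime95 version:

```ini
; local.txt sketch (illustrative values; verify syntax for your version)
; Cap the number of workers allowed to do memory-hungry stage 2 at once:
MaxHighMemWorkers=2
; Give P-1 stage 2 less RAM while the PC is in use, more overnight:
Memory=1200 during 7:30-23:30 else 2400
```

As the later posts in this thread note, Prime95 does not necessarily react the instant the day/night boundary is crossed, so treat these as hints rather than hard real-time limits.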
[QUOTE=LaurV;281741]Stupid question: Does the "switching off" of the "sequential work" in prime.txt has any influence on the behavior of P-1 that you described?[/QUOTE]
There must be something which does indeed affect the description given above. In my experience, none of the workers ever stops what it is doing. So every time a worker finishes its job, it goes through its worktodo list and picks up the first available job which satisfies the P95 work policies.
Yes, the newer versions (from 26.x upward??) have SequentialWorkToDo=1 by default in prime.txt (even if you put nothing, it is treated as 1). I had the feeling in the past that this switch also influences the P-1 behavior, but I won't swear to it; maybe it was only my impression. Anyhow, I reverted to SequentialWorkToDo=0, because I am happy doing the assignments that take the shortest time first, then struggling with the others. The disadvantage is that sometimes LL tests are delayed forever, and I have to take care not to add other types of work to worktodo.txt until they are finished.
My understanding is that SequentialWorkToDo=0 causes p95 to prioritise initial factoring (both trial and P-1) of LL and DC assignments over other types of work.
A few more observations
I have a couple of PCs running P-1 on all 4 cores. The better-architected one (i5-750) handles 3 HighMemWorkers quite well. The older one (Q9550), not so well; I generally limit it to 2 HMWs, occasionally 3 to try to limit the backlog, but it slows noticeably. Because stage 2 is about 50% longer than stage 1, with 3 out of 4 HMWs I have yet to see a worker have to skip ahead to stage 1 on more than 1 extra exponent before a HMW becomes available.
Observation 1: With only 2 out of 4 HMW it's an entirely different story. Stage 2 does fall behind; but what I find curious ([B]and curious means an opportunity for future versions to tweak the algorithm[/B]) is that the falling behind is NOT evenly distributed. I find that workers 3 and 4 fall behind more than workers 1 and 2. In fact, right now on my Q9550 worker 4 is doing stage 1 on the 6th exponent in worktodo.txt while the other 3 workers are all caught up.

Another related observation [B](curiosity)[/B]: When I tried varying the HMW based on time of day (i.e. 2 during the day and 3 at night), every day when "DAY" time was met one of the workers had to be stopped. It seemed to be fairly random which HMW was stopped, but when "NIGHT" time was met it favored restarting the lower worker numbers first; so again workers 3 and 4 fell behind.

Third observation: I no longer use the Memory= parm in local.txt for each worker, because I don't want to leave RAM unused when there are fewer HMW than expected. In this situation another "dance of the HighMemWorkers" happens. More like a salsa:
1. Worker 1 completes stage 1 and starts stage 2. It takes all 2400MB and processes about 100 relative primes.
2. Worker 2 completes stage 1 and starts stage 2. Prime95 takes half of the RAM from worker 1; each worker gets 1200MB and processes 50 primes...so far so good.
3. Worker 3 completes stage 1 but cannot start stage 2 because HMW=2, so it goes on to stage 1 on exponent 2.
4. Worker 1 gets to the last 30 relative primes and releases RAM; worker 2 grabs it (but not immediately - see fifth below) and now processes 70 relative primes.
5. Worker 1 completes stage 2 and moves on to stage 1 of exponent 2.
6. Worker 3 immediately restarts and goes back to do stage 2 of exponent 1, but only has enough RAM available to do 30 primes. Prime95 does NOT stop worker 2 to redistribute the RAM.
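The RAM "salsa" above can be sketched with a couple of toy functions: available stage 2 RAM is split among the workers currently in stage 2, and the relative-primes batch size scales with the RAM a worker holds. The linear scaling and the 2400MB/~100-primes figures are assumptions lifted from the post, not Prime95's actual allocation logic:

```python
# Hypothetical sketch of the RAM redistribution described above.
# Assumptions (from the post, not from Prime95 internals): 2400MB total
# stage 2 RAM, ~100 relative primes per pass at the full 2400MB, and a
# roughly linear batch-size/RAM relationship.

TOTAL_RAM_MB = 2400
PRIMES_PER_MB = 100 / 2400  # toy calibration: ~100 relative primes at 2400MB

def ram_per_worker(active_stage2_workers):
    """Even split of the stage 2 RAM among active high-memory workers."""
    if active_stage2_workers == 0:
        return 0
    return TOTAL_RAM_MB // active_stage2_workers

def batch_size(ram_mb):
    """Relative primes a worker can process per pass at a given allocation."""
    return int(ram_mb * PRIMES_PER_MB)
```

So one stage 2 worker gets 2400MB and ~100 primes per pass; when a second joins, each drops to 1200MB and ~50 primes, matching steps 1 and 2 of the salsa. The post's point is that Prime95 only rebalances at certain moments, so real allocations drift away from this even split.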
Note how quickly the RAM allocation imbalance can happen....When I have had 3 HMW I have noticed 1 worker drop to as low as 8 relative primes while another may be processing 50 or more.

And why might we care? The term "knee-of-the-curve" comes to mind here. I'm not sure where the knee is, but I notice it exists. What I am trying to say is that with more RAM a worker can process more "relative primes". It is very apparent that the more that are processed at one time, the less overall time stage 2 takes; it is NOT linear. For example, right now my two HMW are as follows:
1. Processing 56 relative primes (out of 480): 136 minutes. 480/56*136/60 = 19.43 hours (with several events that will impact the time).
2. Processing 20 relative primes (out of 480): 64 minutes. 480/20*64/60 = 25.6 hours.

Fourth observation: When I have had 2 HMW during the day and 3 at night, the minute "DAY" time was met a worker was stopped. However, the RAM freed up by this worker was NOT immediately grabbed by one of the remaining 2 HMW, even though it is known that doing so would speed up that worker. (See fifth below.)

Fifth observation: When "NIGHT" time was met it did NOT immediately start the 3rd HMW up again; I recall it was not until one of the HMWs completed a batch of relative primes: then it was stopped and half of its RAM given to the 3rd HMW in waiting.
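The ETA arithmetic in the example above can be checked with a few lines of Python (this just reproduces the post's back-of-the-envelope estimate, passes-needed times time-per-pass; it is not how Prime95 itself estimates anything):

```python
# Back-of-the-envelope stage 2 ETA, as computed in the post:
# (total relative primes / primes per pass) * minutes per pass, in hours.

def stage2_eta_hours(total_primes, primes_per_pass, minutes_per_pass):
    """Estimated total stage 2 time in hours at the current batch size."""
    passes = total_primes / primes_per_pass
    return passes * minutes_per_pass / 60

# Worker 1: 56 of 480 primes per 136-minute pass -> ~19.4 hours
# Worker 2: 20 of 480 primes per 64-minute pass  -> 25.6 hours
```

The interesting part is the gap: the RAM-starved worker needs ~6 extra hours for the same exponent, which is the "knee-of-the-curve" effect being described.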
Keep in mind that moving the RAM around is not as easy as it seems. Stage 2 initialization alone, for just one HMW, takes 8 minutes for me, so readjusting the RAM every hour or so might cause more delay than it's worth. In practice I don't think it would, but it's something to keep in mind.
[QUOTE=petrw1;282200]Observation 1: With only 2 out of 4 HMW its an entirely different story. Stage 2 does fall behind; but what I find curious ([B]and curious means an opportunity for future versions to tweak the algorithm[/B]) is that the falling behind is NOT evenly distributed. I find that workers 3 and 4 fall behind more than workers 1 and 2. In fact right now on my Q9550 worker 4 is doing stage 1 on the 6th exponent in worktodo.txt while the other 3 workers are all caught up.
[/QUOTE] Follow-up...once Worker 4 finally does get a Stage 2 turn (after going 9 deep in Stage 1's), it keeps it until the Stage 2 is all caught up. And, as expected, Worker 3 started to get lost in Stage 1 land. That is, unless the workers are restarted for some reason; then Workers 1 and 2 take the Stage 2 work back.
The extreme on the low end....
Just by the luck of the draw one P-1 Stage 2 worker got to the point of processing the last 1 of 480 primes....and it took 22 minutes...over a week at this rate
This same core completed the penultimate batch of 25 primes in about 75 minutes...24 hours at this rate.
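Both extrapolations check out with simple rate arithmetic (observed minutes per relative prime, scaled to the full 480-prime batch):

```python
# Rate extrapolation for the two observations above: scale the observed
# per-prime rate up to all 480 relative primes in the stage 2 batch.

def full_run_hours(primes_done, minutes_taken, total_primes=480):
    """Hours to do all relative primes at the observed per-prime rate."""
    return (minutes_taken / primes_done) * total_primes / 60

# 1 prime in 22 minutes  -> 176 hours, i.e. over a week at this rate
# 25 primes in 75 minutes -> 24 hours
```

A ~7x swing in effective throughput from RAM starvation alone is the "extreme on the low end" the post title refers to.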
I have 2x spare CPU cores in my farm now.
Converting them to P-1. Not going to worry about the P-1 dance. Set up 2 workers with affinity on the spare cores, and set 1.5GB of RAM for each worker, 3.5GB total. I thought it might be wise to allow a bit of headroom. -- Craig