Thread: Intel Xeon PHI?
2021-07-20, 15:19   #207
kriesel

Quote:
Originally Posted by paulunderwood
I don't get to do stage 2 P-1.
The benchmarks you show (presumably on Linux) reflect significantly different tradeoffs than what I've observed and documented on Windows 10 on a Xeon Phi 7250.

RAM does not need to be permanently restricted to available-amount/number-of-workers, or allocated equally among workers. Let prime95 / mprime dynamically allocate what is useful, where useful, leaving the rest for a possible second (improbable third, etc.) coincident stage 2 to use. When stage 2 completes on a given worker, the additional RAM is returned to the available pool for another worker to use when needed. Assuming a Xeon Phi system contains only the 16GB MCDRAM, allow somewhere between 2 and 12GB for whichever worker needs it at the moment for P-1 stage 2. Even if all workers start out synchronized at t0 with almost equal exponents, each needing P-1 before PRP iteration zero, some will be allocated enough RAM to run stage 2 and some won't, so overall P-1 run times will differ; the workers will gradually desynchronize, reducing collisions between their P-1 memory needs. Consider several cases.

Case 1: Most workers get PRP assignments with P-1 already performed. One worker is assigned to do P-1 continuously. That one worker should be allowed to have prime95/mprime allocate much of the available RAM to it.
In local.txt
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]

[Worker #2]

[Worker #3]

[Worker #4]

...
Case 2: Several workers run PRP, and some of those assignments will need P-1 at the beginning. Most of the time only one worker will be in P-1 stage 2 at a time, since P-1 overall takes ~1/30 as long as a PRP test, and about half of that 1/30 is stage 2. On the occasions where one P-1 stage 2 is about to launch while another is already using a lot of memory, prime95 restricts the newly launching stage 2 to what remains of the allowed global stage 2 memory. I would chance 12GB as a global limit for prime95 with 4-8 workers, somewhat less at higher worker counts, and see what mprime does, with memory limit settings similar to case 1. (Check free RAM with all workers running PRP and no P-1 for your chosen worker count, and allow up to ~90% of that for stage 2. Chancing a LITTLE paging/swapping of other processes isn't the end of the world. A 1-worker instance running a 57M LL DC uses only ~47MB of RAM; a wavefront PRP appears to need only ~100-200MB of RAM per worker.)
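A sketch of local.txt for this case, assuming 8 workers and the same 12 GiB global allowance as case 1; the worker sections deliberately carry no Memory lines, so prime95/mprime can hand most of the pool to whichever worker happens to reach stage 2 first:
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]

[Worker #2]

...

[Worker #8]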

Case 3: A more conservative stance for several workers running PRP with occasionally needed P-1. Set both global and per-worker limits, so that even with high worker counts, at least 2 workers running P-1 stage 2 simultaneously are each guaranteed significant memory for stage 2.
In local.txt
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]
Memory=6144 during 7:30-23:30 else 6144

[Worker #2]
Memory=6144 during 7:30-23:30 else 6144

[Worker #3]
Memory=6144 during 7:30-23:30 else 6144

[Worker #4]
Memory=6144 during 7:30-23:30 else 6144

...
Case 4: (Not recommended) Restrict all workers to the prime95 default 0.3GB RAM limit for P-1/ECM stage 2, or to available-RAM/number-of-workers, which might be as low as 12/64 = 0.1875 GiB. Stage 2 might then not run on any worker, so on average P-1 will be less effective than in the preceding cases. I consider this a de-optimization. The difference in benchmarked PRP performance between 64 and 32 workers on a Xeon Phi 7210 is roughly comparable to the loss of P-1 factoring probability from skipping stage 2. Prime95's readme.txt indicates a 0.2GB minimum for stage 2 P-1 on a 100M exponent, scaling proportionally with the exponent. Latencies are also very long at high worker counts.
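For illustration only, a sketch of the restrictive split described above, assuming 64 workers and a 12288 MiB pool, so each worker is capped at 12288/64 = 192 MiB; that is right around the readme's ~0.2 GB minimum for a 100M stage 2 and below the proportionally larger minimum for current wavefront exponents:
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]
Memory=192 during 7:30-23:30 else 192

[Worker #2]
Memory=192 during 7:30-23:30 else 192

...

[Worker #64]
Memory=192 during 7:30-23:30 else 192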

Even a quite large exponent's P-1 stage 2 does not use all of the 12GiB allowed, leaving ~1 GiB for another worker to run a P-1 stage 2 simultaneously.
Extrapolating the readme's guidance for desirable memory, setting per-worker caps of ~1.2GiB could allow up to 10 simultaneous ~110M stage 2 runs!
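A sketch of such a local.txt, assuming the 12 GiB global pool and per-worker caps of 1228 MiB (~1.2 GiB), so ten workers in stage 2 at once still fit within the 12288 MiB allowance:
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]
Memory=1228 during 7:30-23:30 else 1228

[Worker #2]
Memory=1228 during 7:30-23:30 else 1228

...

[Worker #10]
Memory=1228 during 7:30-23:30 else 1228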
With 3GB set on a single-worker 6GB RAM system that is also configured to support dual GPUs' Gpuowl system RAM needs and Win10, P-1 gives ~4.2% factoring probability:

Code:
[Sat Jul 17 23:53:29 2021]
UID: Kriesel/test, M105169069 completed P-1, B1=789000, B2=28852000, ...
[Sun Jul 18 16:23:34 2021]
UID: Kriesel/test, M105175207 completed P-1, B1=789000, B2=28854000, ...
[Mon Jul 19 09:27:09 2021]
UID: Kriesel/test, M105186817 completed P-1, B1=789000, B2=28857000, ...
[Tue Jul 20 02:04:44 2021]
UID: Kriesel/test, M105190549 completed P-1, B1=789000, B2=28857000, ...
Compare that to ~1.66% for B1=910000 with no stage 2.
Attached Thumbnails: heterogenous ram allocation.png