mersenneforum.org Intel Xeon PHI?

2021-07-19, 01:07   #199
paulunderwood

Sep 2002
Database er0rr

111101010111₂ Posts

I have a many-core Phi. My question is about how well mprime or Mlucas runs on it for current wave-front tests -- without doing benchmark tests. Would I have to run 64 instances to get maximum throughput? Is mprime better than Mlucas on it? I'd rather run one instance, but if this would be foolish then I will have to rethink the situation.

Last fiddled with by paulunderwood on 2021-07-19 at 01:12
2021-07-19, 01:35   #200
ewmayer
2ω=0

Sep 2002
República de California

10110110011000₂ Posts

Quote:
 Originally Posted by paulunderwood I have a many-core Phi. My question is about how well mprime or mlucas runs on it for current wave-front tests -- without doing benchmark tests. Would I have to run 64 instances to get maximum throughput? Is mprime better than mlucas on it? I'd rather run one instance but if this would be foolish then I will have to rethink the situation.
I'd have to dig out the data I generated when first testing my KNL, but IIRC I got close to 100% parallel scaling up to 4 threads for current-wavefront FFTs, rapid scaling deterioration above that threadcount, and no gain from using the HT. Thus, were I not using cores 0:63 for Fermat-number work (at large FFTs all those cores can more effectively be brought to bear on a single dataset), I'd run 17 4-thread instances, -cpu [0:3],[4:7],...,[64:67]. Owners of systems with regular RAM on top of the HBM should run each instance under numactl to force the HBM to be used.

If you have an Mlucas avx-512 build on your KNL/Phi, here's the simplest way to see:

'./Mlucas -s m -cpu 0 && mv mlucas.cfg mlucas.cfg.1t' to get best FFT-config for 1-thread.

'./Mlucas -s m -cpu 0:3 && mv mlucas.cfg mlucas.cfg.4t' to get best FFT-config for 4-thread.

Then, if 4-thread scaling is >= 90% good, take the FFT radix set in the mlucas.cfg.4t file for some FFT length of interest - say 6M - and use it to fire up 17 more-or-less-simultaneous instances as follows:

./Mlucas -fftlen 6144 -radset [radix set here] -iters 1000 -cpu 0:3 &
[ibid, -cpu 4:7]
...
[ibid, -cpu 64:67]

Divide the resulting average ms/iter for the 17 test-runs by 17 to get ms/iter on a total-throughput basis.
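For anyone who wants to script that, here is a hedged dry-run sketch: the loop below just prints the 17 launch commands (pipe the output to sh only when ready to actually run them). It assumes a flat-mode KNL where the MCDRAM shows up as NUMA node 1 (verify with `numactl -H` on your box), and the radix set shown is a placeholder to be replaced with the one from your mlucas.cfg.4t.

```shell
#!/bin/sh
# Dry run: generate the 17 4-thread Mlucas timing-test commands.
# Assumptions: MCDRAM is NUMA node 1 (flat-mode KNL); RADSET is hypothetical --
# substitute the radix set recorded in your mlucas.cfg.4t.
RADSET="168,168,32,16"
gen_cmds() {
  i=0
  while [ "$i" -lt 68 ]; do
    echo "numactl --membind=1 ./Mlucas -fftlen 6144 -radset $RADSET -iters 1000 -cpu $i:$((i+3)) &"
    i=$((i + 4))
  done
}
gen_cmds   # prints the commands; pipe to 'sh' to launch for real
```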

Then run mprime in timing-test mode and compare its best total throughput.

Last fiddled with by ewmayer on 2021-07-19 at 01:36

2021-07-19, 14:41   #201
paulunderwood

Sep 2002
Database er0rr

F57₁₆ Posts

I followed your instructions up to a point and decided 4 cores per worker was optimal. I then fired up mprime and it automatically chose 16 workers. Now the box is screaming on this hot July day.
2021-07-19, 19:53   #202
ewmayer
2ω=0

Sep 2002
República de California

2³·1,459 Posts

Quote:
 Originally Posted by paulunderwood I followed your instructions up to a point and decided 4 cores per worker was optimal. I then fired up mprime and it automatically chose 16 workers. Now the box is screaming on this hot July day.
Ah, you must have a 64-core rig, not 68. What does 'sensors | grep Package' show? Here's output for mine:

Package id 0: +68.0°C (high = +80.0°C, crit = +90.0°C) ALARM (CRIT)

This system is sitting in a corner on the floor under an open window, warm day, ambient close to 80F/27C. The side panel nearest one wall forming the corner has been removed, leaving about a 1-inch gap for air exchange between case innards and ambient. (On top of the 2 top case fans removing heat from the water-cooling unit radiator, naturally). Note I get the scary-looking 'ALARM (CRIT)' warning whenever the inside temp ticks above 70C at any time since-last-boot, and it persists even when the temp drops back down below the triggering threshold. IOW, the guard dog won't stop barking once it starts.
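Since the latched ALARM flag is uninformative, one workaround is to parse the live temperature out of the `sensors` line and compare it yourself. A small sketch, using the sample line above as input (on a live box, substitute `sensors | grep Package`); the 70C threshold is my own choice here, not a sensors default:

```shell
#!/bin/sh
# Extract the current package temperature from a 'sensors' Package line and
# flag it only while it is actually above a threshold, ignoring the latched ALARM.
line='Package id 0:  +68.0°C  (high = +80.0°C, crit = +90.0°C)  ALARM (CRIT)'
temp=$(printf '%s\n' "$line" | sed 's/.*: *+\([0-9.]*\).*/\1/')
# temp now holds the numeric reading, e.g. "68.0"
if awk -v t="$temp" 'BEGIN { exit !(t > 70) }'; then
  echo "HOT: ${temp}C"
else
  echo "OK: ${temp}C"
fi
```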

Last fiddled with by ewmayer on 2021-07-19 at 19:53

2021-07-19, 20:57   #203
paulunderwood

Sep 2002
Database er0rr

3·7·11·17 Posts

Quote:
 Originally Posted by ewmayer Ah, you must have a 64-core rig, not 68. What does 'sensors | grep Package' show? Here's output for mine: Package id 0: +68.0°C (high = +80.0°C, crit = +90.0°C) ALARM (CRIT) This system is sitting in a corner on the floor under an open window, warm day, ambient close to 80F/27C. The side panel nearest one wall forming the corner has been removed, leaving about a 1-inch gap for air exchange between case innards and ambient. (On top of the 2 top case fans removing heat from the water-cooling unit radiator, naturally). Note I get the scary-looking 'ALARM (CRIT)' warning whenever the inside temp ticks above 70C at any time since-last-boot, and it persists even when the temp drops back down below the triggering threshold. IOW, the guard dog won't stop barking once it starts.

Package id 0: +72.0°C (high = +80.0°C, crit = +90.0°C)

I too have the case side off, with that open side about 1 inch from the wall.

I'm impressed by the 32-day turnaround for the PRP tests of 16 112M candidates. That's about half the speed of an AMD Radeon VII, but with ~50% more electricity consumption -- 300W. Nothing beats an R VII.

I had to severely limit disk space to 2GB per worker. Also, mprime skipped stage 2 P-1, presumably because of the lack of RAM.

2021-07-19, 23:40   #204
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2²·7·211 Posts

Quote:
 Originally Posted by paulunderwood Nothing beats an R VII. I had to severely limit disk space to 2GB per worker. Also mprime skipped stage 2 P-1, presumably because of the lack to RAM.
There are reports of occasional sightings in Google Colab of NVIDIA A100.

Xeon Phi does pretty well with 4 workers, one designated for P-1 on 12 GB of MCDRAM.

2021-07-20, 04:03   #205
paulunderwood

Sep 2002
Database er0rr

7527₈ Posts

Quote:
 Originally Posted by kriesel Xeon Phi does pretty well with 4 workers, one designated for P-1 on 12 GB of MCDRAM.
It's a tricky balancing act: more throughput vs. finding a stage 2 factor. Here is the (edited) benchmark I ran:

Code:
Prime95 64-bit version 30.3, RdtscTiming=1
Timings for 6048K FFT length (64 cores, 1 worker):   Throughput: 218.85 iter/sec.
Timings for 6048K FFT length (64 cores, 2 workers):  Throughput: 432.09 iter/sec.
Timings for 6048K FFT length (64 cores, 4 workers):  Throughput: 602.49 iter/sec.
Timings for 6048K FFT length (64 cores, 8 workers):  Throughput: 654.40 iter/sec.
Timings for 6048K FFT length (64 cores, 16 workers): Throughput: 666.12 iter/sec.
Timings for 6048K FFT length (64 cores, 32 workers): Throughput: 693.20 iter/sec.
Timings for 6048K FFT length (64 cores, 64 workers): Throughput: 717.88 iter/sec.
I get ~10% more throughput by running 16 workers compared to 4 workers, but I don't get to do stage 2 P-1. Also, I am not prepared to wait ~4 months to turn around 64 candidates! Plus the current HDD in the Phi is not very big.
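As a sanity check on that table, a small awk sketch computing each row's speedup over the single-worker rate (numbers copied from the benchmark above):

```shell
#!/bin/sh
# Scaling of the 6048K FFT benchmark: throughput at N workers
# relative to the 1-worker figure of 218.85 iter/sec.
scaling() {
  awk 'BEGIN {
    split("1 2 4 8 16 32 64", w, " ")
    split("218.85 432.09 602.49 654.40 666.12 693.20 717.88", t, " ")
    for (i = 1; i <= 7; i++)
      printf "%2d workers: %6.2f iter/sec, %.2fx the 1-worker rate\n", w[i], t[i], t[i] / t[1]
  }'
}
scaling
```

The 64-worker total is only ~3.3x the 1-worker rate, which is why the marginal gain above 8-16 workers is so small.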

Edit: I plan to install a 500GB disk and run 64 workers. With mprime's nice runtime auto-tuning I should get 64 candidates done in ~96 days. Presumably, I just have to alter some settings and I can continue with what I have done so far as well as take on new work.

Last fiddled with by paulunderwood on 2021-07-20 at 06:21

2021-07-20, 12:24   #206
axn

Jun 2003

5,179 Posts

Quote:
 Originally Posted by paulunderwood I get ~10% more throughput by running 16 workers compared to 4 workers
8 workers and 16 workers have practically the same throughput, so between those two, 8 should be preferred.

Similarly, 32 and 64 workers are practically the same, so 32 should be preferred over 64.

2021-07-20, 15:19   #207
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2²·7·211 Posts

Quote:
 Originally Posted by paulunderwood I don't get to do stage 2 P-1.
The benchmarks you show (presumably on Linux) show significantly different tradeoffs from what I've observed and documented on Windows 10 on a Xeon Phi 7250.

RAM does not need to be permanently restricted to available-amount/number-of-workers, nor allocated equally between workers. Let prime95/mprime dynamically allocate what's useful where it's useful, leaving the rest for a possible second (improbable third, etc.) coincident stage 2 to use. When stage 2 completes on a given worker, the additional RAM is returned to the available pool for another worker to use when needed. Assuming a Xeon Phi system contains only the 16GB MCDRAM, allow somewhere between 2 and 12GB for whichever worker needs it at the moment for P-1 stage 2. Even if you start out synchronized -- almost equal exponents at t0, all requiring P-1 before PRP iteration zero -- some workers will get allocated enough RAM to run stage 2 and some won't, and overall P-1 run times will differ, so the workers will gradually desynchronize, reducing collisions between their P-1 memory needs. Consider several cases.

Case 1: most workers get PRP with P-1 already performed. One worker is assigned to do P-1 continuously. That one worker should get much of the available RAM allowed to be allocated to it by prime95/mprime.
In local.txt
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]

[Worker #2]

[Worker #3]

[Worker #4]

...
Case 2: With several workers running PRP, some assignments will need P-1 at the beginning. Most of the time only one worker will be in P-1 stage 2 at a time, since P-1 overall takes ~1/30 as long as a PRP, and about half of that 1/30 is stage 2. On the occasions where one P-1 stage 2 is about to launch while another is already using a lot of memory, prime95 will restrict the newly launching stage 2 to what is left of the allowed global stage 2 memory. I would chance 12GB as a global limit for prime95 with 4-8 workers, somewhat less for higher worker counts, and see what mprime does, using similar memory limit settings as for case 1. (Check free RAM with all workers running PRP and no P-1, for your chosen worker count. Allow up to ~90% of that in stage 2. Chancing a LITTLE paging/swapping of other processes isn't the end of the world. A 1-worker instance running a 57M LL DC uses only ~47MB of RAM; a wavefront PRP appears to need only ~100-200MB of RAM per worker.)
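For concreteness, a case-2 local.txt sketch (assuming the 12GB global figure above): only the global Memory line is set, with no per-worker Memory lines, so mprime hands whatever is free to whichever worker reaches stage 2 first.

```ini
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]

[Worker #2]

[Worker #3]

[Worker #4]

...
```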

Case 3: A more conservative stance for several workers running PRP and occasional needed P-1. Set both global and per-worker limits, so that with high worker counts there is a guarantee that at least 2 workers running P-1 stage 2 simultaneously each get significant memory for stage 2.
In local.txt
Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]
Memory=6144 during 7:30-23:30 else 6144

[Worker #2]
Memory=6144 during 7:30-23:30 else 6144

[Worker #3]
Memory=6144 during 7:30-23:30 else 6144

[Worker #4]
Memory=6144 during 7:30-23:30 else 6144

...
Case 4: (Not recommended) Restrict all workers to prime95's default 0.3GB RAM limit for P-1/ECM stage 2, or to available-RAM/number-of-workers, which might be as low as 12/64 = 0.1875 GiB. Stage 2 might not run on any worker, so on average P-1 will be less effective than in the preceding cases. This I consider a de-optimization. The difference in benchmark PRP performance between 64 and 32 workers on a Xeon Phi 7210 is roughly comparable to the loss of P-1 factor probability from skipping stage 2. Prime95's readme.txt indicates a 0.2GB minimum for a 100M stage 2 P-1, proportional to exponent. Latencies are also very long with high worker counts.
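That readme arithmetic can be sketched numerically. The snippet below divides a 12 GiB stage-2 pool evenly among various worker counts and compares each share against the ~0.2GB-per-100M minimum scaled to a 110M exponent (my reading of the readme guidance above; the 12 GiB pool is the figure used in the earlier cases):

```shell
#!/bin/sh
# Per-worker share of a 12 GiB stage-2 pool at various worker counts,
# vs a ~0.2GB-per-100M minimum scaled to a 110M exponent (~0.22GB).
shares() {
  awk 'BEGIN {
    need = 0.2 * 110 / 100
    split("4 8 16 32 64", w, " ")
    for (i = 1; i <= 5; i++) {
      share = 12 / w[i]
      printf "%2d workers: %.4f GiB each -> %s\n", w[i], share, (share >= need ? "stage 2 ok" : "stage 2 may not run")
    }
  }'
}
shares
```

Only at 64 workers does the equal share (0.1875 GiB) fall below that minimum, matching the skipped stage 2 paulunderwood reported.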

Even a quite large exponent's P-1 stage 2 does not use all of the 12GiB allowed, leaving ~1 GiB for another worker to run stage 2 simultaneously. Extrapolating the readme's guidance, setting per-worker caps of ~1.2GiB could allow up to 10 simultaneous ~110M stage 2 runs! With 3GB set on a single-worker 6GB RAM system (one configured to also support dual GPUs' Gpuowl system RAM needs and Win10), P-1 gives ~4.2% factoring probability:

Code:
[Sat Jul 17 23:53:29 2021]
UID: Kriesel/test, M105169069 completed P-1, B1=789000, B2=28852000, ...
[Sun Jul 18 16:23:34 2021]
UID: Kriesel/test, M105175207 completed P-1, B1=789000, B2=28854000, ...
[Mon Jul 19 09:27:09 2021]
UID: Kriesel/test, M105186817 completed P-1, B1=789000, B2=28857000, ...
[Tue Jul 20 02:04:44 2021]
UID: Kriesel/test, M105190549 completed P-1, B1=789000, B2=28857000, ...
Compared to ~1.66% for B1=910000 only.

2021-07-20, 18:44   #208
paulunderwood

Sep 2002
Database er0rr

3×7×11×17 Posts

I tried 64 workers instead of 16, and tests went from 26 days to 170 days, plus I got 48 unwanted double checks. I scrubbed all that work. Now I am running 4 workers, each with 16 cores and 3072MB of RAM for stage 2 P-1 work.

Edit: Doh, those timings were heavily skewed by an old @reboot cronjob left on the "new" disk I just fitted. Anyway, I am going to stick with 4 workers.

Last fiddled with by paulunderwood on 2021-07-20 at 19:05
2021-07-21, 08:21   #209
paulunderwood

Sep 2002
Database er0rr

3×7×11×17 Posts

4 workers takes about 10 days, whereas 16 would have taken 25-26 days. I'll try 32 workers when these 4 have finished. If it's less than 50 days (after tuning) I think I can forego P-1 stage 2 and still expect maximum throughput -- this should fit inside the 16GB MCDRAM (and I will not have 64 threads running a legacy cronjob).

Last fiddled with by paulunderwood on 2021-07-21 at 08:22
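Taking those turnaround figures at face value (25-26 days read as 25.5, and assuming each worker carries one wavefront candidate), the implied whole-box throughput works out as:

```shell
#!/bin/sh
# Back-of-envelope: candidates completed per day under each configuration.
rates() {
  awk 'BEGIN {
    printf " 4 workers: %.2f candidates/day\n", 4 / 10
    printf "16 workers: %.2f candidates/day\n", 16 / 25.5
  }'
}
rates
```

So fewer workers trade some raw throughput for much shorter per-candidate turnaround plus headroom for P-1 stage 2 RAM.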

