#199
Sep 2002
Database er0rr
I have a many-core Phi. My question is about how well mprime or mlucas runs on it for current wave-front tests -- without doing benchmark tests. Would I have to run 64 instances to get maximum throughput? Is mprime better than mlucas on it? I'd rather run one instance but if this would be foolish then I will have to rethink the situation.
Last fiddled with by paulunderwood on 2021-07-19 at 01:12
#200
∂2ω=0
Sep 2002
República de California
If you have an Mlucas avx-512 build on your KNL/Phi, here's the simplest way to see:

'./Mlucas -s m -cpu 0 && mv mlucas.cfg mlucas.cfg.1t' to get the best FFT config for 1 thread.
'./Mlucas -s m -cpu 0:3 && mv mlucas.cfg mlucas.cfg.4t' to get the best FFT config for 4 threads.

Then, if the 4-thread scaling is >= 90% good, take the FFT radix set in the mlucas.cfg.4t file for some FFT length of interest - say 6M - and use it to fire up 17 more-or-less-simultaneous instances as follows:

./Mlucas -fftlen 6144 -radset [radix set here] -iters 1000 -cpu 0:3 &
[ibid, -cpu 4:7]
...
[ibid, -cpu 64:67]

Divide the resulting average ms/iter for the 17 test runs by 17 to get ms/iter on a total-throughput basis. Then run mprime in timing-test mode and compare its best total throughput.

Last fiddled with by ewmayer on 2021-07-19 at 01:36
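Here is a minimal shell sketch (not from the post above) of one way to script those 17 simultaneous timing runs. It assumes a 68-core KNL, an avx-512 ./Mlucas binary in the parent directory, and that RADSET holds the radix set you copied from mlucas.cfg.4t for the 6144K FFT; each instance gets its own directory so the mlucas.cfg files and logs don't collide.

Code:
#!/bin/bash
# Sketch: one 4-threaded Mlucas timing instance per 4-core group (cores 0-67).
RADSET="..."   # placeholder: copy the 6144K radix set from your mlucas.cfg.4t
for first in $(seq 0 4 64); do
    last=$((first + 3))
    dir=run_${first}_${last}
    mkdir -p "$dir"
    ( cd "$dir" && ../Mlucas -fftlen 6144 -radset "$RADSET" -iters 1000 \
        -cpu ${first}:${last} > timing.log 2>&1 ) &
done
wait
# Read the ms/iter figure from each instance's output, average the 17 values,
# and divide by 17 to get the total-throughput ms/iter to compare with mprime.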
#201
Sep 2002
Database er0rr
I followed your instructions up to a point and decided 4 cores per worker was optimal. I then fired up mprime and it automatically chose 16 workers. Now the box is screaming on this hot July day.
#202
∂2ω=0
Sep 2002
República de California
Package id 0: +68.0°C (high = +80.0°C, crit = +90.0°C) ALARM (CRIT)

This system is sitting in a corner on the floor under an open window on a warm day, ambient close to 80F/27C. The side panel nearest one wall forming the corner has been removed, leaving about a 1-inch gap for air exchange between the case innards and ambient. (On top of the 2 top case fans removing heat from the water-cooling unit's radiator, naturally.)

Note I get the scary-looking 'ALARM (CRIT)' warning whenever the inside temp has ticked above 70C at any time since last boot, and it persists even when the temp drops back down below the triggering threshold. IOW, the guard dog won't stop barking once it starts.

Last fiddled with by ewmayer on 2021-07-19 at 19:53
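If you want to see the actual temperature excursions rather than the latched alarm flag, a simple logging loop works; this is just a sketch, assuming lm-sensors is installed and reports the "Package id 0" line quoted above:

Code:
#!/bin/bash
# Log the CPU package temperature once a minute to phi_temps.log.
while true; do
    printf '%s  %s\n' "$(date '+%F %T')" "$(sensors | grep 'Package id 0')"
    sleep 60
done >> phi_temps.log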
#203
Sep 2002
Database er0rr
Package id 0: +72.0°C (high = +80.0°C, crit = +90.0°C)

I too have the case side off, with that open side about 1 inch from the wall.

I'm impressed by the 32-day turnaround for the PRP tests of 16 112M candidates. About half the speed of an AMD Radeon VII, but half again as much electricity consumption -- 300W. Nothing beats an R VII.

I had to severely limit disk space to 2GB per worker. Also, mprime skipped stage 2 P-1, presumably because of the lack of RAM.
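For a rough sense of what "half the speed at half again the power" means for efficiency, here is a back-of-envelope sketch (my numbers, not from the post: it assumes the Radeon VII draws about 200W, which is what "half again as much" versus 300W implies, and normalizes the R VII's PRP throughput to 1.0):

Code:
# Relative PRP throughput per watt, Radeon VII vs the Phi system.
awk 'BEGIN {
    rvii = 1.0 / 200;   # normalized throughput 1.0 at an assumed ~200 W
    phi  = 0.5 / 300;   # half the throughput at the reported 300 W
    printf "Phi: ~%.0f%% of the Radeon VII throughput per watt\n", 100 * phi / rvii
}'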
#204
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
Xeon Phi does pretty well with 4 workers, one designated for P-1 on 12 GB of MCDRAM. |
#205
Sep 2002
Database er0rr
Code:
Prime95 64-bit version 30.3, RdtscTiming=1
Timings for 6048K FFT length (64 cores, 1 worker):   Throughput: 218.85 iter/sec.
Timings for 6048K FFT length (64 cores, 2 workers):  Throughput: 432.09 iter/sec.
Timings for 6048K FFT length (64 cores, 4 workers):  Throughput: 602.49 iter/sec.
Timings for 6048K FFT length (64 cores, 8 workers):  Throughput: 654.40 iter/sec.
Timings for 6048K FFT length (64 cores, 16 workers): Throughput: 666.12 iter/sec.
Timings for 6048K FFT length (64 cores, 32 workers): Throughput: 693.20 iter/sec.
Timings for 6048K FFT length (64 cores, 64 workers): Throughput: 717.88 iter/sec.

Edit: I plan to install a 500GB disk and run 64 workers. With mprime's nice runtime auto-tuning I should get 64 candidates done in ~96 days. Presumably, I just have to alter some settings and I can continue with what I have done so far as well as new work.

Last fiddled with by paulunderwood on 2021-07-20 at 06:21
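As a rough cross-check of those figures (my arithmetic, not part of the post): aggregate throughput barely grows past 16 workers, so adding workers mainly stretches out each individual test. A PRP test of an exponent near 112M needs roughly 112e6 squarings, and each worker runs at about total-throughput/workers iterations per second, which gives approximate per-test times like these:

Code:
#!/bin/bash
# Per-worker rate and approximate days per ~112M PRP test, using the quoted
# 6048K throughput numbers.
for cfg in "16 666.12" "32 693.20" "64 717.88"; do
    set -- $cfg
    awk -v w="$1" -v t="$2" 'BEGIN {
        per  = t / w;                 # iter/sec available to each worker
        days = 112e6 / per / 86400;   # ~112e6 iterations per test
        printf "%2d workers: %5.1f iter/s each, ~%3.0f days per test\n", w, per, days
    }'
done

On that basis, 16 workers line up well with the ~32-day turnaround reported earlier in the thread, while 64 workers would stretch each test to roughly four months.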
#206
Jun 2003
Similarly, 32 and 64 workers are also practically the same, so 32 should be preferred over 64.
#207
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
The benchmarks you show (presumably on Linux) show significantly different tradeoffs from what I've observed and documented on Windows 10 on a Xeon Phi 7250.
RAM does not need to be permanently restricted to available-amount/number-of-workers, or allocated equally between workers. Let prime95/mprime dynamically allocate what's useful, where it's useful, leaving the rest for a possible second (improbable third, etc.) coincident stage 2 to use. When stage 2 completes on a given worker, the additional RAM is returned to the available pool for some other worker to use when needed. Assuming a Xeon Phi system contains only the 16GB MCDRAM, allow up to somewhere between 2 and 12GB for whatever worker needs it at the moment for P-1 stage 2.

Even if you start out synchronized, with almost equal exponents at t0 all needing P-1 before PRP iteration zero, some workers will get allocated enough RAM to run stage 2 and some won't, so the overall P-1 run times will differ. Gradually the workers desynchronize, reducing collisions between their P-1 memory needs.

Consider several cases.

Case 1: most workers get PRP assignments with P-1 already performed, and one worker is assigned to do P-1 continuously. That one worker should be allowed much of the available RAM. In local.txt:

Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]

[Worker #2]

[Worker #3]

[Worker #4]
...

Case 3: A more conservative stance for several workers running PRP and occasionally needed P-1. Set both global and per-worker limits, so that with high worker counts there is a guarantee that at least 2 workers running P-1 stage 2 simultaneously each get significant memory for stage 2. In local.txt:

Code:
Memory=12288 during 7:30-23:30 else 12288

[Worker #1]
Memory=6144 during 7:30-23:30 else 6144

[Worker #2]
Memory=6144 during 7:30-23:30 else 6144

[Worker #3]
Memory=6144 during 7:30-23:30 else 6144

[Worker #4]
Memory=6144 during 7:30-23:30 else 6144
...

Even a quite large exponent's P-1 stage 2 does not use all of the 12GiB allowed, leaving ~1 GiB for another worker to run P-1 stage 2 simultaneously. Extrapolating the readme's guidance for "desirable" memory, setting per-worker caps of ~1.2GiB could allow up to 10 simultaneous ~110M stage 2's! With 3GB set on a single-worker 6GB-RAM system (one that's also configured to support dual GPUs' Gpuowl system-RAM needs and Win10), I get ~4.2% P-1 factoring probability:

Code:
[Sat Jul 17 23:53:29 2021] UID: Kriesel/test, M105169069 completed P-1, B1=789000, B2=28852000, ...
[Sun Jul 18 16:23:34 2021] UID: Kriesel/test, M105175207 completed P-1, B1=789000, B2=28854000, ...
[Mon Jul 19 09:27:09 2021] UID: Kriesel/test, M105186817 completed P-1, B1=789000, B2=28857000, ...
[Tue Jul 20 02:04:44 2021] UID: Kriesel/test, M105190549 completed P-1, B1=789000, B2=28857000, ...
#208
Sep 2002
Database er0rr
I tried 64 workers instead of 16 and tests went from 26 days to 170 days, plus I got 48 unwanted double checks. I scrubbed all that work.
Now I am running 4 workers, each with 16 cores and 3072MB of RAM for stage 2 P-1 work.

Edit: Doh, those timings were heavily skewed by an old @reboot cronjob left on the "new" disk I just fitted. Anyway, I am going to stick with 4 workers.
Last fiddled with by paulunderwood on 2021-07-20 at 19:05
#209
Sep 2002
Database er0rr
With 4 workers a test takes about 10 days, whereas 16 would have taken 25-26 days. I'll try 32 workers when these 4 have finished. If a test then takes less than 50 days (after tuning) I think I can forgo P-1 stage 2 and still expect maximum throughput -- this should fit inside the 16GB MCDRAM (and I will not have 64 threads running a legacy cronjob).
Last fiddled with by paulunderwood on 2021-07-21 at 08:22
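For reference, the tests/day arithmetic behind that 50-day break-even (a sketch of my own, not the poster's; the 50-day figure is the stated threshold rather than a measurement):

Code:
# Aggregate throughput implied by the wall-clock times above.
awk 'BEGIN {
    printf " 4 workers / 10   days = %.2f tests/day\n",  4 / 10;
    printf "16 workers / 25.5 days = %.2f tests/day\n", 16 / 25.5;
    printf "32 workers / 50   days = %.2f tests/day\n", 32 / 50;
}'

So 32 workers finishing in anything under ~50 days would match or beat the 16-worker rate.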