Thread: Intel Xeon PHI?
2021-07-19, 01:35   #200
ewmayer
Sep 2002
República de California

2D8C₁₆ Posts

Originally Posted by paulunderwood
I have a many-core Phi. My question is about how well mprime or mlucas runs on it for current wave-front tests -- without doing benchmark tests. Would I have to run 64 instances to get maximum throughput? Is mprime better than mlucas on it? I'd rather run one instance but if this would be foolish then I will have to rethink the situation.
I'd have to dig out the data I generated when first testing my KNL, but IIRC I got close to 100% parallel (||) scaling up to 4 threads for current-wavefront FFTs, rapid deterioration in ||-scaling above that thread count, and no gain from using the HT. Thus, were I not using cores 0:63 for Fermat-number work (at large FFTs all those cores can be brought to bear more effectively on a single dataset), I'd run 17 4-thread instances, -cpu [0:3],[4:7],...,[64:67], all under numactl. Owners of systems with regular RAM installed on top of the HBM should use numactl to force the HBM to be used.

If you have an Mlucas avx-512 build on your KNL/Phi, here's the simplest way to see:

'./Mlucas -s m -cpu 0 && mv mlucas.cfg mlucas.cfg.1t' to get best FFT-config for 1-thread.

'./Mlucas -s m -cpu 0:3 && mv mlucas.cfg mlucas.cfg.4t' to get best FFT-config for 4-thread.
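
To put a number on how well the 4-thread run scales against the 1-thread run, one common definition is efficiency = t1 / (N * tN), where t1 and tN are the ms/iter figures for the same FFT length at 1 and N threads. A minimal sketch, assuming you read those two timings off the two cfg files by hand; the function name `scaling_pct` is made up for illustration and is not part of Mlucas:

```shell
#!/bin/sh
# scaling_pct T1 TN N
#   T1 = ms/iter at 1 thread, TN = ms/iter at N threads (same FFT length).
#   Prints the parallel-scaling efficiency as a percentage, 100 * T1 / (N * TN).
#   Hypothetical helper; the timings are typed in by hand, not parsed from the cfg files.
scaling_pct() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * t1 / (n * tn) }'
}

# Example call with placeholder timings (not measured data):
scaling_pct 36 10 4
```

A result of 100.0 would mean the 4-thread run is exactly 4 times faster than the 1-thread run.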

Then, if the 4-thread scaling is 90% or better, take the FFT radix set listed in the mlucas.cfg.4t file for some FFT length of interest, say 6M, and use it to fire up 17 more-or-less-simultaneous instances as follows:

./Mlucas -s m -fftlen 6144 -radset [radix set here] -iters 1000 -cpu 0:3 &
[ibid, -cpu 4:7]
...
[ibid, -cpu 64:67]
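
Typing out 17 near-identical command lines is error-prone, so the core ranges can be generated instead. The following is only a sketch: it is a dry run that prints the command lines rather than launching anything, the flags are copied from the command above, `cpu_ranges` is a made-up helper name, and the radix set is left as a placeholder to be filled in from mlucas.cfg.4t.

```shell
#!/bin/sh
# cpu_ranges N_INST N_THR
#   Prints one lo:hi core range per instance, e.g. 0:3, 4:7, ... for 4 threads each.
#   Hypothetical helper, not part of Mlucas.
cpu_ranges() {
    n_inst=$1
    n_thr=$2
    i=0
    while [ "$i" -lt "$n_inst" ]; do
        lo=$((i * n_thr))
        hi=$((lo + n_thr - 1))
        echo "$lo:$hi"
        i=$((i + 1))
    done
}

# Placeholder: substitute the radix set from mlucas.cfg.4t for the chosen FFT length.
RADSET='[radix set here]'

# Dry run: print the 17 command lines instead of executing them.
for r in $(cpu_ranges 17 4); do
    echo "./Mlucas -s m -fftlen 6144 -radset $RADSET -iters 1000 -cpu $r &"
done
```

Once the printed lines look right, they can be pasted into a shell (or the `echo` dropped) to start the instances at more or less the same time.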

Divide the resulting average ms/iter for the 17 test-runs by 17 to get ms/iter on a total-throughput basis.
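
That arithmetic is (mean of the per-instance ms/iter) / (number of instances). A small sketch, assuming you have collected the ms/iter figure reported by each instance by hand; `throughput_ms` is a hypothetical helper name and the numbers in the example call are placeholders, not measurements:

```shell
#!/bin/sh
# throughput_ms T1 T2 ... TN
#   Takes the ms/iter of each of N simultaneous instances and prints
#   (average ms/iter) / N, i.e. ms/iter on a total-throughput basis.
#   Hypothetical helper, not part of Mlucas or mprime.
throughput_ms() {
    printf '%s\n' "$@" |
        awk '{ sum += $1; n++ } END { if (n) printf "%.3f\n", (sum / n) / n }'
}

# Example call with three placeholder timings (a real run would pass all 17):
throughput_ms 30 32 34
```

The resulting figure is what should be compared against the best total throughput from the other program.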

Then run mprime in its timing-test mode and compare its best total throughput.

Last fiddled with by ewmayer on 2021-07-19 at 01:36