#12
∂²ω=0
Sep 2002
República de California
3²×1,303 Posts
Suggest you save that 1-core/1-thread cfg-file as mlucas.cfg.1c1t so subsequent self-tests don't overwrite it. The /proc/cpuinfo file alas says nothing about which quartets of entries map to the same one of the 18 physical cores, and I still haven't found any docs which explain the logical-core numbering convention for your SMT4 setup. So let's see if there is any appreciable difference between the 4-thread timings given by the following:

Code:
./Mlucas -s m -cpu 0:3 >& test2.log
./Mlucas -s m -cpu 0:71:18 >& test3.log

The first uses the AMD core-numbering convention, in which logical cores 0-3 all map to the same physical core; the second uses the Intel convention, where for an 18-physical-core, 4-way-SMT CPU, logical cores 0, 18, 36 and 54 all map to the same physical core.

I plan to add support for the hwloc topology-extracting freeware library next year; I need to see if there's a simple way to build/install that in standalone mode, so one can just use it as-is to get the topology for one's platform.

It would be best to suspend/restart your ongoing DCs and whatnot via 'kill -[STOP|CONT] pid' to run the above. Based on your 1c1t runtime, each should take ~10 minutes.

Thanks,
-Ernst
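[Editor's note: a minimal sketch of the suspend-test-resume sequence described above, for anyone wanting to copy-paste. The pgrep-based PID lookup is an assumption on my part (the post only says 'kill -[STOP|CONT] pid'); adjust if your worker processes aren't named Mlucas.]

Code:
# Pause any running Mlucas jobs (assumed here to be named "Mlucas")
for pid in $(pgrep -x Mlucas); do kill -STOP "$pid"; done

# Run the two 4-thread self-tests back to back
./Mlucas -s m -cpu 0:3     >& test2.log
./Mlucas -s m -cpu 0:71:18 >& test3.log

# Resume the paused jobs
for pid in $(pgrep -x Mlucas); do kill -CONT "$pid"; done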
#13
"Simon Josefsson"
Jan 2020
Stockholm
3·11 Posts
I put it here:
https://gist.github.com/jas4711/100d...omment-3957631
Code:
root@vello:~# ppc64_cpu --cores-present
Number of cores present = 18
root@vello:~# ppc64_cpu --threads-per-core
Threads per core: 4
root@vello:~# ppc64_cpu --info
Core   0:    0*    1*    2*    3*
Core   1:    4*    5*    6*    7*
Core   2:    8*    9*   10*   11*
Core   3:   12*   13*   14*   15*
Core   4:   16*   17*   18*   19*
Core   5:   20*   21*   22*   23*
Core   6:   24*   25*   26*   27*
Core   7:   28*   29*   30*   31*
Core   8:   32*   33*   34*   35*
Core   9:   36*   37*   38*   39*
Core  10:   40*   41*   42*   43*
Core  11:   44*   45*   46*   47*
Core  12:   48*   49*   50*   51*
Core  13:   52*   53*   54*   55*
Core  14:   56*   57*   58*   59*
Core  15:   60*   61*   62*   63*
Core  16:   64*   65*   66*   67*
Core  17:   68*   69*   70*   71*
root@vello:~#

FWIW, it passed another LL DC, so I'm pretty confident everything works, even if it probably can be optimized more.

https://www.mersenne.org/report_expo...2113993&full=1
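[Editor's note: so on this box the numbering is the "AMD-style" block layout, i.e. core N owns logical CPUs 4N through 4N+3. On a Linux system without the ppc64_cpu tool, roughly the same mapping can be read from sysfs; a sketch, assuming the kernel exposes thread_siblings_list (newer kernels may expose core_cpus_list instead):]

Code:
# For each logical CPU, print the SMT siblings that share its physical core.
# On the POWER9 box above, cpu0..cpu3 should all report "0-3", cpu4..cpu7 "4-7", etc.
for f in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
    printf '%s -> %s\n' "$f" "$(cat "$f")"
done | sort -V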
#14
"Simon Josefsson"
Jan 2020
Stockholm
3×11 Posts
These didn't take the ~10 minutes you guessed; I'm not sure what is wrong, but here are the results:

https://gist.github.com/jas4711/100d...68592ae56e9a12

Scroll down or search for "test2.log" and "test3.log" respectively. As you can see, the first run took 4.5 hours and the second run took 1h40m. I put the 'grep passed' output in a comment at the bottom.

The optimal setting I've found by experimenting is still -cpu 0:63:2.
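[Editor's note: for readers unfamiliar with Mlucas's lo:hi[:stride] syntax for -cpu, the ranges discussed in this thread expand as follows. The seq commands just illustrate the arithmetic; the physical-core comments assume the core-to-thread mapping shown in the ppc64_cpu output above.]

Code:
seq -s ' ' 0 1 3      # -cpu 0:3     -> 0 1 2 3     : the 4 SMT threads of physical core 0
seq -s ' ' 0 18 71    # -cpu 0:71:18 -> 0 18 36 54  : one thread on each of 4 different physical cores
seq -s ' ' 0 2 63     # -cpu 0:63:2  -> 0 2 4 .. 62 : 2 threads on each of physical cores 0-15
seq -s ' ' 0 4 71     # -cpu 0:71:4  -> 0 4 8 .. 68 : 1 thread on each of the 18 physical cores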
#15
∂²ω=0
Sep 2002
República de California
10110111001111₂ Posts
Thanks for the data - my comments:

o Leading radix 32 appears to have been miscompiled - suggest trying an incremental rebuild at a lower opt level: in /obj, 'gcc -c -O2 -g -DUSE_THREADS -march=native ../src/radix32_*cy*c && gcc -o Mlucas *o -lm -lrt -lpthread -lgmp', then './Mlucas -fft 2M -iters 100 -radset 32,32,32,32' to quick-test. If that still yields incorrect residues, try -O0. But I didn't see any cases in your test logs where having said leading radix working properly would have yielded the best timing at a given FFT length.

o I also forgot to ask for the mlucas.cfg files resulting from the self-tests with the different -cpu arguments, but those numbers are easily enough extracted from the test-log data. Best timings (msec/iter) and the corresponding radix sets for various FFT lengths and -cpu args are tabulated below; asterisks mark timing anomalies, i.e. cases where the timing for a given FFT length is slower than that for the next-larger one:

Code:
FFT     1c1t                    0:3                     0:71:18
----    --------------------    --------------------    --------------------
2048     75.47  1024,32,32       58.82  64,32,32,16      20.40  1024,32,32
2304     91.20  36,32,32,32      66.96  36,32,32,32      26.82  36,32,32,32
2560    108.77  40,32,32,32      82.46  40,32,32,32      30.54  40,32,32,32
2816    114.72  44,32,32,32      86.36  44,32,32,32      32.17  176,32,16,16
3072    121.73  48,32,32,32      92.61  48,32,32,32      33.26  48,32,32,32
3328    137.32  52,32,32,32     102.31  208,32,16,16     38.02  52,32,32,32
3584    144.97  56,32,32,32     109.29  224,32,16,16     38.08  56,32,32,32
3840    157.37  60,32,32,32     118.77* 60,32,32,32      43.04  60,32,32,32
4096    163.65  128,32,32,16    117.51  128,32,32,16     43.85  64,32,32,32
4608    183.77  144,32,32,16    135.94  144,32,32,16     48.24  144,32,32,16
5120    234.53* 160,32,32,16    170.85* 160,32,32,16     61.20* 160,32,32,16
5632    228.96  176,32,32,16    169.98  176,32,32,16     59.85  176,32,32,16
6144    252.86  192,32,32,16    188.42  192,32,32,16     65.20  192,32,32,16
6656    272.21  208,32,32,16    199.11  208,32,32,16     70.45  208,32,32,16
7168    291.60  224,32,32,16    214.91  224,32,32,16     75.87  224,32,32,16
7680    352.16  240,32,32,16    248.33  240,32,32,16     91.04  240,32,32,16

The only real way to be sure is to use the above data to guide experiment. I suggest saving the 3 cfg-files corresponding to the above best-timing columns as mlucas.cfg.1c1t, mlucas.cfg.1c4t and mlucas.cfg.4c4t, respectively. Then, under the dir containing the Mlucas binary and these cfg-files, create 18 run directories, say run0-runh, using a hexadecimal-style numbering. Put a separate worktodo.ini file into each rundir. Then try two different 18-instance configurations, each of which fills up all 72 logical cores, but in 2 different ways (a script sketch of both setups follows this post):

[1] In each of run0-runh, 'ln -sf ../mlucas.cfg.1c4t mlucas.cfg', then in dir run0 invoke the binary with -cpu 0:3, in run1 with -cpu 4:7, and so on through runh with -cpu 68:71. Let those 18 jobs get through at least several savefile writes, or just let them run for around 24 hours, before doing 'killall -9 Mlucas'. Copy down the average time between checkpoints: 'tail -1 run*/p*stat' alas won't work, because 'tail' has a stupid limitation that disallows a -[line count] argument for wildcarded invocations, but if you have multitail installed, that will work nicely.

[2] In each of run0-runh, 'ln -sf ../mlucas.cfg.4c4t mlucas.cfg', then in dir run0 invoke the binary with -cpu 0:71:4, in run1 with -cpu 1:71:4, and so on through runh with -cpu 17:71:4. Let those 18 jobs get through at least several savefile writes, or just let them run for around 24 hours. Compare the average time between checkpoints against that for setup [1]; if it's slower, again 'killall -9 Mlucas' and revert to setup [1].

Whichever of [1] and [2] is faster, that setup can be encoded in a bash script to make run resumption after machine downtime or whatnot simple. It'll be interesting to see those total-throughput numbers, and to compare them to the iters/sec you get for -cpu 0:63:2, which puts 2 threads on each of physical cores 0-15.

Last fiddled with by ewmayer on 2021-11-24 at 18:52
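[Editor's note: to make "the setup can be encoded in a bash script" concrete, here is a minimal sketch covering both layouts. Caveats: the rundir labels, the SETUP switch and the nohup logging are my own conventions, not from the post; and for setup [2] the post's literal arguments (-cpu 0:71:4 through 17:71:4) would give overlapping CPU sets across the 18 instances, so the sketch uses a stride of 18 (-cpu k:71:18), which partitions the 72 logical CPUs into 18 disjoint 4-thread sets matching the 4c4t cfg file. Treat that as my reading of the intent, not as the post's prescription.]

Code:
#!/bin/bash
# Sketch of the two 18-instance layouts from post #15. Assumes the Mlucas binary,
# mlucas.cfg.1c4t and mlucas.cfg.4c4t live in the current directory and that each
# rundir already contains its own worktodo.ini.
SETUP=1                      # 1 = each instance on the 4 SMT threads of one physical core
                             # 2 = each instance spread one-thread-per-core over 4 physical cores
labels=( {0..9} {a..h} )     # 18 rundir suffixes: run0 .. run9, runa .. runh

for k in "${!labels[@]}"; do
    dir="run${labels[$k]}"
    mkdir -p "$dir"
    if [ "$SETUP" -eq 1 ]; then
        ln -sf ../mlucas.cfg.1c4t "$dir/mlucas.cfg"
        cpus="$((4*k)):$((4*k+3))"    # run0 -> 0:3, run1 -> 4:7, ..., runh -> 68:71
    else
        ln -sf ../mlucas.cfg.4c4t "$dir/mlucas.cfg"
        cpus="$k:71:18"               # run0 -> 0,18,36,54; run1 -> 1,19,37,55; ...
    fi
    ( cd "$dir" && nohup ../Mlucas -cpu "$cpus" >> nohup.out 2>&1 & )
done

# Rough progress check: last checkpoint line of each instance's .stat file
# (a plain-bash alternative to the multitail suggestion above)
for f in run*/p*.stat; do
    echo "$f: $(tail -n 1 "$f")"
done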